1000 friends of Pavel Durov: how to download VKontakte data


Today, the social network VKontakte is considered the most popular in Russia and the CIS countries. Every day, millions of users visit the vk.com website to read news, find out something interesting, listen to music, watch movies and, of course, chat with friends. After all, what are social networks for in the first place? Of course, for communication!

Today, the maximum number of VKontakte friends for one user is 10 thousand people, and the average number of people on such a list for a socially active user is, according to statistics, 200-300.

Looking through these people, many of us think about how VKontakte friends are sorted, how their list is formed and what influences this sequence. Some people go even further in their thinking and want to know how to change the order of VKontakte friends and whether it can be done at all. In this article we will try to answer these and many other questions about the VK friends list, revealing to you several useful secrets of the “white-blue” social network.

Introduction

Almost everyone involved in quantitative research is familiar with the following problem: there is an idea, but there is no data to implement it. Although there are now many sites with data (for example, Kaggle.com (https://www.kaggle.com/), where you can find a corpus of wine reviews, a classifier of ancient Japanese characters, or metadata on the collections of the Metropolitan Museum of Art), it is still quite difficult to find something ready-made that meets the individual needs of a particular study. In such cases, there is only one thing left to do: download the data yourself. VKontakte is a valuable source of data that can help in linguistic, sociological and other research. Do you want to explore a corpus of original poetry by teenagers? An almost finished corpus is already waiting for you in a group dedicated to poetry! Would you like to understand how the six handshakes theory works in reality? Download random users and build a graph! In fact, it is not difficult at all, and today we will show you how to download the treasured data.

What will you need?

VKontakte account
Environment for working with Python* (almost no programming knowledge required)
Desire and a little time

* For example, any IDE or editor: PyCharm, Sublime Text or any other one you like. In this guide, we will use Jupyter Notebook, which is convenient for analytics: you can write and edit code at each stage without running the entire program, build visualizations, and do a bunch of other interesting things. More information about the installation and capabilities of Jupyter can be found here.

Please note that the code in this tutorial is written to help you understand what is going on and how things work, so it is not meant to be polished.

Stage 1. Obtaining a key to the VK API

The first step on the path to data is to gain access to the VKontakte API (Application Programming Interface), through which we will interact with VKontakte from here on. The API contains many methods that let you quickly retrieve information from the VK database, for example from user pages, groups, etc. (or rather, only the information that is not hidden). To use them, you need to log in to VK and create a Standalone application on the API page. To do this, in the My Applications section click Create an application. Next, select the Standalone application type, name it and connect it. As soon as you connect it, you will find yourself in the Information section, where the description, title, etc. are located. There is no need to touch anything here, and you can go straight to the Settings section on the left. There we look for the Service Access Key, which we need to copy because we will need it later.
We leave the API page open for now, we will return to it a little later.

How to get into possible friends on VKontakte. Where is the VKontakte “possible friends” function located?

The VKontakte social network unites people, and various algorithms are constantly being introduced to make it easier for users to find friends, relatives and acquaintances whom they may have met several times but did not have time, or did not want, to add as friends. One way to quickly add new people to your friends list is to use the VKontakte “Possible Friends” tool. In this article, we will consider how this algorithm works and how to use it.

Part 1. Users

Step 1

Now that we have an API access key, we can safely move on to the most interesting part: writing code. First, install the libraries** and load the necessary modules. In the tutorial we will use the requests library, a package that lets you send HTTP requests to a server and receive the responses in various formats***.

import requests

** If you don’t know how to install libraries in Python, don’t be alarmed: the Internet is full of guides on how to do this. For example, here.

*** In fact, this is only one of the ways to mine VKontakte data through Python. There are also a number of libraries that let you receive responses from the VKontakte API in other ways. They are very easy to google, so you can learn alternative techniques if you wish. If anyone is already familiar with other methods, it would be great to hear about them in the comments.

Step 2

Now let's figure out how to actually make requests to the VKontakte API. The developers have already told us all of this: we form an HTTP request for the information we need and send it to the VK servers. VK developers even offer a request template:

https://api.vk.com/method/METHOD_NAME?PARAMETERS&access_token=ACCESS_TOKEN&v=V

The parts that interest us are:

METHOD_NAME - a required parameter: the method we want to apply. It is chosen depending on what information we want to get from the database. A complete list of VKontakte API methods is available here. The method is separated from the subsequent parts of the request by the ? character.
PARAMETERS - optional parameters; each method has its own set. Each method returns some basic information by default, which can be expanded using these parameters. If there are several parameters, they are separated by the & symbol.
ACCESS_TOKEN - remember the service access key from the developers’ page that we saved earlier? This is it. The key is required when making a request.
V - the VK API version, without which a request cannot be formed either. At the time of writing, the API version is 5.92.
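The request structure described above can also be assembled in code. Below is a minimal sketch, not an official VK client: the helper name build_vk_url and the placeholder token are our own assumptions, and only the URL layout follows the description.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.vk.com/method"


def build_vk_url(method_name, access_token, api_version, **parameters):
    """Assemble METHOD_NAME?PARAMETERS&access_token=...&v=... (hypothetical helper)."""
    query = dict(parameters)              # method-specific parameters come first
    query["access_token"] = access_token  # the service access key
    query["v"] = api_version              # the API version is mandatory
    # urlencode joins the pairs with & and escapes unsafe characters
    return f"{API_ROOT}/{method_name}?{urlencode(query)}"


url = build_vk_url("users.get", "your token", "5.92", user_ids=1)
print(url)
```

The resulting string can be passed to requests.get in the same way as the hand-written URLs later in the tutorial.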

Step 3

Let's consider the process of generating and sending a request. First, let's decide what exactly we want to get. Let's say something related to user data. Let's look at the list of available methods and what they can offer us. To do this, select the Users section.

A list of methods for working with user data has appeared.

Step 4

So, let's try to pull out extended information about a user. Open the users.get method page and carefully read the parameters. Let's say we want to get information about the user with ID 1 (user_ids=1). First, let’s write the access key and the VK API version into separate variables as strings so that we can use them later****.

access_token = 'your token'
api_version = '5.89'

**** Please note that since we are using Jupyter Notebook, we do not need to create a separate file on the computer for the keys, although this is an option. We will simply set the keys as variables in a separate notebook cell.

Now let's create a request to the API. To refer to a method, you must write the group it belongs to before its name (since method names overlap between groups). We write the request as a formatted string, where the placeholders in {} are replaced by the values of the variables named there*****.

res_users = requests.get(f'https://api.vk.com/method/users.get?user_ids=1&access_token={access_token}&v={api_version}')

And let's see the result:

res_users.json()

By default (without specified parameters), we simply received the first name, last name and page status.
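Before using the result, it helps to check whether the API answered with data or with an error. The sketch below assumes the usual shape of VK answers (a "response" key on success, an "error" key with error_code and error_msg on failure); the helper name extract_users and the sample payload are illustrative, not a live result.

```python
def extract_users(payload):
    """Return the list of user dicts, or raise if the API reported an error."""
    if "error" in payload:
        err = payload["error"]
        raise RuntimeError(f"VK API error {err.get('error_code')}: {err.get('error_msg')}")
    return payload["response"]


# Illustrative payload shaped like a users.get answer (not a live result)
sample = {"response": [{"id": 1, "first_name": "Pavel", "last_name": "Durov"}]}

users = extract_users(sample)
names = [f"{u['first_name']} {u['last_name']}" for u in users]
print(names)
```

With a real request, the argument would be res_users.json().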

***** If you want to get information about more than one person, you can write their IDs separated by commas. If there are many users, you can store their IDs in a separate variable (for example, loaded from a file) and put that variable in {} as well. It is important that the IDs are separated by commas. Learn more about string formatting in Python.
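As a small sketch of this idea, a list of IDs can be joined into one comma-separated string and placed into the request. The IDs below are arbitrary examples, and the token is a placeholder.

```python
access_token = "your token"
api_version = "5.89"

user_ids = [1, 2, 3]  # example IDs; in practice they could be read from a file

# users.get expects the IDs as one comma-separated string
ids_param = ",".join(str(uid) for uid in user_ids)

url = (
    "https://api.vk.com/method/users.get"
    f"?user_ids={ids_param}&access_token={access_token}&v={api_version}"
)
print(url)
```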

Step 5.

Now let's add specifics. Let's say we also want to get the date of birth, country and city. These are requested through the fields parameter, as the documentation tells us.

res_users = requests.get(f'https://api.vk.com/method/users.get?user_ids=1&fields=bdate,city,country&access_token={access_token}&v={api_version}')
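The extra fields arrive inside each user dictionary. The sketch below assumes that city and country come back as nested objects with a title key and that any of the fields may be missing when the user hides them; the sample user is made up for illustration.

```python
def describe_user(user):
    """Flatten the optional fields of one user dict (hypothetical helper)."""
    return {
        "name": f"{user['first_name']} {user['last_name']}",
        "bdate": user.get("bdate"),                      # may be absent or partial
        "city": user.get("city", {}).get("title"),       # nested object, assumed shape
        "country": user.get("country", {}).get("title"),
    }


# Made-up user shaped like an item of a users.get answer
sample_user = {
    "id": 123,
    "first_name": "Ivan",
    "last_name": "Ivanov",
    "bdate": "1.1",
    "city": {"id": 1, "title": "Moscow"},
}

info = describe_user(sample_user)
print(info)
```

Using .get instead of direct indexing keeps the loop from failing on users who do not expose a field.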

Step 6.

Let's go further: download 300 followers of Pavel Durov and add them to a separate list. To do this, we need a method from the same Users section, users.getFollowers. First, let's write the request as a string in which we will later fill in the parts located in {}.

url = 'https://api.vk.com/method/users.getFollowers?user_id=1&fields=city,country&count=100&offset={offset}&access_token={access_token}&v={api_version}'

Here we have new parameters: count and offset. Count shows how many followers we download in one request, and offset shows how many entries we skip each time we send a new request. That is, if the offset parameter did not exist, we would download only the first 100 followers each time; with it we select each next hundred in turn (iteratively). So, first we create an empty friends list, into which we will write the downloaded users. Then we determine which offsets to use and in what increments. The range function is responsible for this: range(0, 201, 100) gives three offsets, 0, 100 and 200, which covers the first 300 followers. In the loop we download 100 followers per iteration and append each user's last name to the list.

friends = []
for i in range(0, 201, 100):
    # substitute the token, API version and current offset into the template
    url_formatted = url.format(access_token=access_token, api_version=api_version, offset=i)
    print(i)  # show progress: 0, 100, 200
    res_friends = requests.get(url_formatted)
    for friend in res_friends.json()["response"]["items"]:
        friends.append(friend["last_name"])

And voila! The list of subscribers is ready!
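The same count/offset loop can be written as a reusable function. This is a sketch under our own assumptions: collect_items and fake_fetch are hypothetical names, and the fake function stands in for a real request so that the example runs without a network connection or a token.

```python
import time


def collect_items(fetch_page, total, page_size=100, pause=0.0):
    """Gather up to `total` items by calling fetch_page(offset, count) repeatedly."""
    items = []
    for offset in range(0, total, page_size):
        payload = fetch_page(offset, page_size)
        page = payload["response"]["items"]
        items.extend(page)
        if len(page) < page_size:   # a short page means there is nothing left
            break
        time.sleep(pause)           # optional pause between requests
    return items


def fake_fetch(offset, count):
    """Stand-in for requests.get(...).json(): serves 250 made-up followers."""
    data = [{"last_name": f"User{i}"} for i in range(250)]
    return {"response": {"count": len(data), "items": data[offset:offset + count]}}


print(list(range(0, 201, 100)))   # the offsets used in the loop above
followers = collect_items(fake_fetch, total=300)
print(len(followers))
```

In real use, fetch_page would format the url template with the given offset and return requests.get(url_formatted).json().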

Parsing groups

Great! Everything worked out (at least it should have).

Now let’s try to get the necessary data on VKontakte groups - the number of participants for each group and a list of these same participants in the form of a list of IDs.

It is important to know that the VKontakte API returns a maximum of 1000 group members per request; you cannot ask for more in one call. For a rough analysis of groups this is enough. If you need more, you will have to request further pages with the offset parameter or parse group pages directly.

The function below takes as input a list of VKontakte group names, and at the output gives the data we need for these groups.

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
import time

import vk_auth
import vkontakte


def get_groups_users(groups_list):
    groups_out = {}
    (token, user_id) = vk_auth.auth("your_login", "your_password", "2951857", "groups")
    vk = vkontakte.API(token=token)
    for group in groups_list:
        # here we specify count=10, which gives us 10 users from the group;
        # this is done for clarity. A maximum of 1000 users can be pulled out
        groups_out[group] = vk.get("groups.getMembers", group_id=group, sort="id_asc", offset=100, count=10)
        time.sleep(1)  # pause between requests
    return groups_out


if __name__ == "__main__":
    group_list = ["oldlentach", "obrazovach", "superdiscoteka"]
    print get_groups_users(group_list)

>>> {"oldlentach": {u"count": 740868, u"users": [...]}, "obrazovach": {u"count": 217978, u"users": [...]}, "superdiscoteka": {u"count": 150538, u"users": [...]}}

The structure of the output data is as follows: the key is the name of the group, and the value is a dictionary with two keys: u'count', the number of members in the group, and u'users', a list of IDs of the members of this group (maximum 1000, as we remember).

The name of the group is taken from its VKontakte address. For example, the group Obrazovach is located at https://vk.com/obrazovach, so we take the last part of the address, i.e. "obrazovach", as the group name.
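Taking the group name from the address can be automated, and the same request can be built in the style used earlier with requests. The sketch below is an assumption-laden illustration: the helper names are ours, and the parameters group_id, count and offset are the ones passed to groups.getMembers in the function above.

```python
from urllib.parse import urlencode, urlparse


def group_name_from_url(address):
    """Return the last part of a vk.com address, e.g. 'obrazovach'."""
    return urlparse(address).path.strip("/")


def members_url(group, access_token, api_version, count=1000, offset=0):
    """Build a groups.getMembers request URL (hypothetical helper)."""
    query = urlencode({
        "group_id": group,
        "count": count,
        "offset": offset,
        "access_token": access_token,
        "v": api_version,
    })
    return f"https://api.vk.com/method/groups.getMembers?{query}"


name = group_name_from_url("https://vk.com/obrazovach")
print(name)
print(members_url(name, "your token", "5.89", count=10))
```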

Part 2. Communities

Now let’s assume that for our research purposes we want to obtain a corpus of texts. To do this, we can, for example, download the wall of a community or user. The principle is the same as when working with users: choose a method, create a request, and enjoy. The list of methods for working with walls is in the corresponding Wall section of the documentation. To download posts from a wall, we need the wall.get method. Let's try to download posts from the Vyshkinsky haiku community (hsehokku). First, let's try to download the first 100 publications.

res_wall = requests.get(f'https://api.vk.com/method/wall.get?domain=hsehokku&count=100&access_token={access_token}&v={api_version}')

In fact, nothing much has changed except the method and its parameter: domain contains the short address of the community from which we are downloading data. Let's look at the result for the first post: we received not only the text of the publication itself, but also its meta-information, which can also be extracted from the list and used if desired.
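To illustrate pulling meta-information out of a post, here is a sketch that assumes each post dictionary carries an id, a date as a Unix timestamp, a text, and a likes object with a count. Check these key names against a real response before relying on them; the sample post is made up.

```python
from datetime import datetime, timezone


def summarize_post(post):
    """Pick a few fields out of one wall post (assumed key names)."""
    published = datetime.fromtimestamp(post["date"], tz=timezone.utc)
    return {
        "id": post.get("id"),
        "published": published.isoformat(),
        "likes": post.get("likes", {}).get("count", 0),
        "text": post.get("text", ""),
    }


# Made-up post shaped like an item of a wall.get answer
sample_post = {"id": 42, "date": 0, "text": "line one\nline two", "likes": {"count": 3}}

summary = summarize_post(sample_post)
print(summary)
```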

But for educational purposes we will focus only on the texts of publications. Now let's collect the first 400 posts from the wall and write them into a file on the computer so that we can use it later.

Step 1

We write a query string to which we add the offset parameter, which sets how many posts to skip from the start of the wall.

url = 'https://api.vk.com/method/wall.get?domain=hsehokku&count=100&offset={offset}&access_token={access_token}&v={api_version}'

Step 2

We create an empty texts list into which we will write the texts. Then we write a loop that downloads 100 posts in each of 4 iterations and appends the text of each post to the list.

texts = []
for i in range(0, 301, 100):
    # substitute the token, API version and current offset into the template
    url_formatted = url.format(access_token=access_token, api_version=api_version, offset=i)
    print(i)  # show progress: 0, 100, 200, 300
    res_wall = requests.get(url_formatted)
    for post in res_wall.json()["response"]["items"]:
        texts.append(post["text"])

Let's look at how it looks in the list. There are extra characters in the text, namely \n. This is just a separator: it shows that the text at that place in the original begins on a new line.

Step 3

We create a texts.txt file in the root folder on the computer, in which we will write the final result.
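One possible way to write the collected texts to the file is sketched below. The choice to store one post per line, replacing the \n separators inside a post with spaces, is our assumption; the two sample texts stand in for the downloaded list.

```python
def save_texts(texts, path):
    """Write one post per line, flattening line breaks inside each post."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(text.replace("\n", " ") + "\n")


# Stand-in for the texts list collected from the wall
texts = ["first line\nsecond line", "another post"]

save_texts(texts, "texts.txt")

with open("texts.txt", encoding="utf-8") as f:
    print(f.read())
```

If the original line breaks matter for the analysis (for haiku they probably do), another separator between posts, such as an empty line, may be a better fit.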