Code and dataset from our ICWSM 2018 dataset paper, in which we present a first look at what data can be collected from WhatsApp public groups. This repository contains details of the data collection and an anonymised version of the dataset. For details, please refer to our paper.
- A standalone Android phone. A rooted Android phone will make your life much easier.
- A mobile network connection (SIM card), required to set up WhatsApp. This is a one-time requirement: you can re-use the SIM card after you activate WhatsApp.
- Internet connection - to keep receiving messages on the phone.
Step 1: Collect a list of public WhatsApp groups for the problem you want to study.
For instance, if you want to study how people have been using WhatsApp for job search, look for potential groups here.
We also provide a list of ~3,000 groups (`whatsapp_group_links.txt`) that we found on Google by searching for URLs containing "chat.whatsapp.com".
These groups cover a diverse set of topics, which we mention in the paper, including sports, politics, entertainment and job search. You can use the script `getGroupTitles.py` to get the group titles for a set of groups.
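If you gather candidate links from search results or web pages yourself, you will likely need to pull the invite links out of raw text and remove duplicates. The following is a minimal sketch of that step, not code from this repository. It assumes an invite link has the form `https://chat.whatsapp.com/<code>` (optionally with an `invite/` path segment) and that the code is alphanumeric; the function names are hypothetical.

```python
import re

# Assumption: an invite link looks like https://chat.whatsapp.com/<code>,
# optionally with an "invite/" path segment, and the code is alphanumeric.
INVITE_RE = re.compile(
    r"https?://chat\.whatsapp\.com/(?:invite/)?([A-Za-z0-9]+)"
)


def extract_invite_links(text):
    """Return unique, normalised invite links in order of first appearance."""
    seen = set()
    links = []
    for match in INVITE_RE.finditer(text):
        code = match.group(1)
        if code not in seen:
            seen.add(code)
            links.append("https://chat.whatsapp.com/" + code)
    return links


def load_invite_links(path):
    """Read a text file (e.g. whatsapp_group_links.txt) and extract its links."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return extract_invite_links(handle.read())
```

Deduplicating on the invite code rather than on the full URL means that `http`/`https` variants of the same link are counted once.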
Step 2: Set up a stand-alone Android device with WhatsApp installed and the phone number validated.
Then, once you have the list of groups that you want to join, use the script `joinWhatsappGroups.py`.
The script uses Selenium, a Python library for browser automation, and simulates the process of joining the groups one by one through the web interface of WhatsApp (web.whatsapp.com). Note that, to use web.whatsapp.com, the phone must be connected to the internet.
A one-time user input is required to scan the QR code needed to log into WhatsApp on the browser.
- Note 1: If you have a small number of groups (say, 10), you can skip this step and join them manually.
- Note 2: We tried this script with only 178 groups. I am not sure whether the script can handle a much larger set, such as 1,000 groups.
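To give a feel for the automation, here is a rough sketch of the outer loop. It is not the code in `joinWhatsappGroups.py`; the function names, the pause length and the error handling are all assumptions. Confirming the join inside the WhatsApp Web page is deliberately left out, since that depends on page elements that are not described here and that change over time.

```python
import time


def join_groups(driver, links, pause_seconds=10.0, sleep=time.sleep):
    """Open each invite link in turn and return the links that raised an error.

    `driver` is any object with a Selenium-style get(url) method. Confirming
    the join in the WhatsApp Web page is left to the caller.
    """
    failed = []
    for link in links:
        try:
            driver.get(link)
        except Exception:  # keep going if one link is dead or times out
            failed.append(link)
        sleep(pause_seconds)  # pace the requests; the default is a guess
    return failed


def open_whatsapp_web():
    """Start a browser on web.whatsapp.com and wait for the QR code scan."""
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.Chrome()
    driver.get("https://web.whatsapp.com")
    input("Scan the QR code with the phone, then press Enter: ")
    return driver
```

A run would look like `failed = join_groups(open_whatsapp_web(), links)`. Returning the failed links makes it easy to retry them later or to join those groups manually.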
Step 3: Once you join the groups, the data is collected on the Android device, where it is stored in an encrypted database. The next challenge is to decrypt the database and extract the messages. This is much easier if your phone is rooted. The process for unrooted Android phones is here and the code here (we haven't tested these, so we can't vouch for them). We describe the process for rooted phones below:
- The encrypted database is stored in a path that looks like `Device Storage/WhatsApp/Databases`, in a file named `msgstore.db.crypt12`.
- We recommend using a stand-alone phone that is used only for this data collection, since rooting your phone is not a great idea.
- After you have rooted your phone (there are plenty of tutorials available online), you can look for the decryption key on your phone.
- WhatsApp saves this key in your system storage. To open the system folder you can use an app such as ES File Explorer (look for "ES File Explorer File Manager" on Google Play).
- Navigate to the location `/data/data/com.whatsapp/files/key` in the file explorer app.
- Copy the `key` file to a convenient location. This is the decryption key.
- Then use the code in the `WhatsApp-Crypt12-Decrypter` folder to decrypt the database. Copy the `key` file to this folder. The WhatsApp-Crypt12-Decrypter code was obtained (and slightly modified) from here.
- Run `decrypt12.py`, which takes the key file and the encrypted database as input: `python decrypt12.py key msgstore.db.crypt12 msgstore.db`. The decrypted database is stored in the file named `msgstore.db`.
- If you have made it this far, congrats! Almost done! The file `msgstore.db` is a simple sqlite3 database which can be manipulated programmatically. You can browse its contents using a database browsing tool like DB Browser for SQLite (on Linux, Mac and Windows). You can export the contents of the database to a TSV file using the script `saveDataAsTSV.py`.
- The above process was partly gathered from here.
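Since the decrypted file is a regular SQLite database, it can also be inspected with Python's built-in `sqlite3` module. The sketch below is not the code in `saveDataAsTSV.py`; it is a generic helper, with hypothetical function names, that lists the tables in the database and dumps one of them to a tab-separated file. It makes no assumption about the WhatsApp schema, so you need to look up the table that holds the messages yourself.

```python
import csv
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite database."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


def export_table_to_tsv(db_path, table, out_path):
    """Write every row of `table` to a tab-separated file; return the row count."""
    if table not in list_tables(db_path):
        raise ValueError("no such table: " + table)
    conn = sqlite3.connect(db_path)
    try:
        # The name was checked against sqlite_master above; double any quote
        # characters so that it is a valid quoted identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        cursor = conn.execute("SELECT * FROM " + quoted)
        header = [column[0] for column in cursor.description]
        count = 0
        with open(out_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle, delimiter="\t")
            writer.writerow(header)
            for row in cursor:
                writer.writerow(row)
                count += 1
    finally:
        conn.close()
    return count
```

For example, `list_tables("msgstore.db")` shows which tables exist, and `export_table_to_tsv("msgstore.db", "messages", "messages.tsv")` would export one of them. The table name `messages` here is an assumption about the database layout and may differ between WhatsApp versions.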
Due to the sensitive nature of the data (containing phone numbers), we decided to release an anonymised version of the data.
The file `anonymised_data_to_share.tsv` contains the anonymised data we collected for around 5 months from around 178 groups.
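In the released data, group IDs and phone numbers are replaced by unique random integers. One way such a mapping could be built is sketched below. This is an illustration only and not the script we used; the class name, the use of `secrets` and the upper bound are all assumptions.

```python
import secrets


class Anonymiser:
    """Map each distinct identifier to a stable, unique random integer."""

    def __init__(self, upper_bound=10**9):
        self._upper_bound = upper_bound
        self._mapping = {}
        self._used = set()

    def anonymise(self, identifier):
        """Return the integer for `identifier`, drawing a new one if needed."""
        if identifier not in self._mapping:
            if len(self._used) >= self._upper_bound:
                raise ValueError("no unused integers left below upper_bound")
            value = secrets.randbelow(self._upper_bound)
            while value in self._used:  # redraw on the rare collision
                value = secrets.randbelow(self._upper_bound)
            self._used.add(value)
            self._mapping[identifier] = value
        return self._mapping[identifier]
```

Using one `Anonymiser` instance per field keeps the mapping consistent across all messages, so the same group or phone always maps to the same integer, while the original identifier cannot be read back from the released value.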
We are not sharing the original message content due to potential sensitive information contained in them. However, if you are a researcher/academic who wants access to the actual message content (for research purposes only), please send an email to Kiran Garimella (kiran.garimella@aalto.fi).
The publicly released version of the dataset is a tab-separated file containing over 300,000 messages, with the following fields:
- Group ID - Unique identifier of a group, anonymised by replacing the original group ID with a unique random integer.
- Phone number - Anonymised phone number, replaced by a unique random integer and the 2-letter country code (ISO 3166-1 alpha-2) of the phone.
- message length - Number of characters in the message.
- message num words - Number of words in the message.
- message lang - Language of the message, automatically inferred using this tool. Note that the detection might not be completely accurate, especially since the messages contain a lot of emojis and are typically short.
- timestamp - The Unix timestamp at which the message was sent.
- media url - URL contained in the message.
- media mime type - MIME type of the media (image, video or audio).
- media size - Size of the media in bytes.
- media name - For shared URLs, WhatsApp fetches the title of the page. This field contains that title.
- media caption - Short snippet of metadata about the URL.
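As a starting point for working with the file, the sketch below reads the rows and counts messages per group and per inferred language. It assumes that the columns appear in the order listed above and that the file has no header row; check both against the actual file before relying on the output. The field and function names are hypothetical.

```python
import csv
from collections import Counter

# Assumption: columns appear in the order the fields are listed above and the
# file has no header row. Adjust FIELDS (or skip a header) if that is wrong.
FIELDS = [
    "group_id", "phone_number", "message_length", "message_num_words",
    "message_lang", "timestamp", "media_url", "media_mime_type",
    "media_size", "media_name", "media_caption",
]


def read_messages(handle):
    """Yield one dict per non-empty row of an open tab-separated file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Pad short rows so that missing trailing fields become empty strings.
        padded = row + [""] * (len(FIELDS) - len(row))
        yield dict(zip(FIELDS, padded))


def summarise(messages):
    """Count messages per group and per inferred language."""
    per_group = Counter()
    per_lang = Counter()
    for message in messages:
        per_group[message["group_id"]] += 1
        per_lang[message["message_lang"]] += 1
    return per_group, per_lang
```

To run it on the released file, open `anonymised_data_to_share.tsv` with `open(path, newline="", encoding="utf-8")` and pass the handle to `read_messages`. All values are kept as strings; convert the length, size and timestamp fields to integers as needed.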
We now have tools to collect multimedia data too (images and videos). (NOTE: These scripts were put together from multiple sources. I am not the author of any of them and do not claim ownership.)
Here's a rough pipeline. You might have to tweak the Python script to get it working.
Follow these steps:
- Download the SQLite JDBC jar file.
- Add the jar to the classpath: `export CLASSPATH=/mnt/kiran/whatsapp/sqlite-jdbc-3.23.1.jar:$CLASSPATH` (replace the path with the location of your SQLite JDBC jar).
- Compile the Java file: `javac Search.java`
- Run the compiled class to obtain the media keys: `java Search <database_filename> | sort -u > url_mediakey.txt` (where `<database_filename>` is, e.g., `msgstore.db`).
- `downloadEncImage.py` will then download the encrypted images.
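For orientation, the download step could look roughly like the sketch below. It is not the code in `downloadEncImage.py`. It assumes that each line of `url_mediakey.txt` holds a URL and a media key separated by whitespace, which may not match the actual output of `Search.java`, and the function names are hypothetical. The files it saves are still encrypted.

```python
import urllib.request


def parse_url_mediakey_lines(lines):
    """Return (url, media_key) pairs from lines like '<url> <media key>'.

    Assumption: exactly two whitespace-separated fields per line.
    """
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        pairs.append((parts[0], parts[1]))
    return pairs


def download_encrypted(url, out_path, timeout=30):
    """Save the (still encrypted) media file behind `url` to `out_path`."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = response.read()
    with open(out_path, "wb") as handle:
        handle.write(payload)
    return len(payload)
```

Keeping the media key next to each URL matters because the key is what a later decryption step would need for that particular file.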
- For any questions regarding the data collection or the dataset itself, please contact Kiran Garimella (EPFL, kiran.garimella@aalto.fi) or Gareth Tyson (Queen Mary University, gareth.tyson@qmul.ac.uk).
- Please cite our ICWSM 2018 paper if you use the dataset.
- The data collection process and the (anonymised) dataset provided here are for research purposes only.
- We do not release the original message content publicly. If you are interested in the content for research purposes, please contact us.
- As we note in the paper, any member of a group (public or not) is allowed, under WhatsApp's legal restrictions, to export group message data. Since we were members of the groups when collecting the data, we do not break the WhatsApp privacy policy rules.
- Please understand what you are doing before following any of the steps described above. Use the tools mentioned above at your own risk. We do not take any responsibility.