The Twitathon project

This code connects to the Twitter API and downloads tweets associated with a predefined list of hashtags and users.


Table of contents

  1. Handling requirements
  2. Defining the list of hashtags and users
  3. Launching the code
  4. Cron
  5. Data

Handling requirements

Create a new virtual environment (optional):

virtualenv -p python3 .venv

Enter the environment:

source .venv/bin/activate

Install requirements:

pip install -r requirements.txt

Defining the list of hashtags and users

The list of hashtags and users for which we want to retrieve tweets is written in a file called data/entities/entities_to_retrieve.txt. This file is automatically generated following the steps described below.

1. Connect to Drive and download entities

The first step is launched with the script src/automatically_download_hashtags_users_sheet.py, which connects, through the Google Drive API, to an internal Google Spreadsheet (called Hashtags and users) containing the most up-to-date entities to retrieve, and downloads them into two separate local files:

  • data/entities/original_hashtags.csv, from sheet Hashtags
  • data/entities/original_users.csv, from sheet Users

To connect to the Drive API, we followed the steps in Google's quickstart guide for the API. Enabling the API produces the file credentials.json, which is required to launch the script. This JSON file is saved in the config folder but is listed in .gitignore.

The first time the script is launched (with the JSON credentials in place), the user must manually grant permission to access the spreadsheet. After that, a pickle object containing the permission token is generated automatically and saved in the config folder as token.pickle.
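For reference, here is a condensed sketch of what such a download script typically looks like. The helper names are ours, not the repo's; the credentials/token handling follows the standard Google quickstart pattern, and the Google client packages (google-api-python-client, google-auth-oauthlib) are an assumed dependency:

```python
import csv
import io
import os
import pickle

# Read-only access to the spreadsheet is enough for downloading.
SCOPES = ["https://www.googleapis.com/auth/spreadsheets.readonly"]

def get_credentials():
    """Reuse config/token.pickle when present; otherwise run the
    interactive consent flow against config/credentials.json."""
    if os.path.exists("config/token.pickle"):
        with open("config/token.pickle", "rb") as f:
            return pickle.load(f)
    # Imported lazily so the pure helpers below also work without the
    # Google packages installed.
    from google_auth_oauthlib.flow import InstalledAppFlow
    flow = InstalledAppFlow.from_client_secrets_file(
        "config/credentials.json", SCOPES)
    creds = flow.run_local_server(port=0)
    with open("config/token.pickle", "wb") as f:
        pickle.dump(creds, f)
    return creds

def rows_to_csv(rows):
    """Serialize the Sheets API 'values' payload (a list of rows) as CSV text."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

def download_sheet(spreadsheet_id, sheet_name, out_path):
    """Fetch one sheet (e.g. Hashtags or Users) and write it to a local CSV."""
    from googleapiclient.discovery import build
    service = build("sheets", "v4", credentials=get_credentials())
    result = service.spreadsheets().values().get(
        spreadsheetId=spreadsheet_id, range=sheet_name).execute()
    with open(out_path, "w", newline="") as f:
        f.write(rows_to_csv(result.get("values", [])))
```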

Note that connecting to a specific spreadsheet requires its identifier. For security reasons, this identifier is stored in config.yaml, under the keys drive / spreadsheet_id. The original id can be found in the spreadsheet URL. For example:

https://docs.google.com/spreadsheets/d/<spreadsheet_id>/edit
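The id can also be pulled out of such a URL programmatically. A small sketch (the function name and example id are hypothetical, not part of the repo):

```python
import re

def extract_spreadsheet_id(url):
    """Return the path segment between '/spreadsheets/d/' and the next '/'."""
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", url)
    if match is None:
        raise ValueError(f"not a spreadsheet URL: {url}")
    return match.group(1)

# Hypothetical id, for illustration only
print(extract_spreadsheet_id(
    "https://docs.google.com/spreadsheets/d/1AbC-dEf_123/edit"))  # → 1AbC-dEf_123
```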

2. Generate file of entities

Once the automatic retrieval from Drive is finished, the script src/update_entities_to_retrieve_txt.py merges the two files into data/entities/entities_to_retrieve.txt. The txt file is a plain list of entities (both hashtags and users, each with its corresponding sign, # or @).
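As a hedged sketch of that merge step (assuming, which the repo does not confirm, that the CSVs hold one bare entity name per row and that the sign is prepended here):

```python
import csv
from pathlib import Path

def update_entities_to_retrieve(hashtags_csv, users_csv, output_txt):
    """Merge the two CSVs into one plain-text list of entities.

    Each CSV is assumed to hold one entity name per row; the
    corresponding sign (# for hashtags, @ for users) is prepended.
    """
    entities = []
    for path, sign in ((hashtags_csv, "#"), (users_csv, "@")):
        with open(path, newline="") as f:
            for row in csv.reader(f):
                if row and row[0].strip():
                    # lstrip tolerates rows that already carry a sign
                    entities.append(sign + row[0].strip().lstrip("#@"))
    Path(output_txt).write_text("\n".join(entities) + "\n")
```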

Launching the code

To start the tweet retrieval, run the following command from the root of the repo:

python src/retrieve_tweets.py retrieve_tweets_from_file --file='data/entities/entities_to_retrieve.txt' --number_of_tweets=1000

The --number_of_tweets argument controls how many tweets are requested and can be adjusted as needed.
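A hedged sketch of what the retrieve_tweets_from_file entry point might do. The query-building rules (hashtags searched verbatim, users turned into from: queries) and the injectable search callable are our assumptions, not the repo's actual code:

```python
def build_query(entity):
    """Build a Twitter search query for one line of the entities file.

    Assumption: hashtags are searched verbatim, while user entities
    become 'from:' queries.
    """
    entity = entity.strip()
    if entity.startswith("#"):
        return entity
    if entity.startswith("@"):
        return "from:" + entity[1:]
    raise ValueError(f"entity must start with # or @: {entity}")

def retrieve_tweets_from_file(file, number_of_tweets=1000, search=None):
    """Read the entities file and fetch tweets for each entry.

    `search` is injected (e.g. a wrapper over the Twitter client) so the
    file-reading and query logic can be exercised without hitting the API.
    """
    with open(file) as f:
        entities = [line.strip() for line in f if line.strip()]
    return {e: search(build_query(e), number_of_tweets) for e in entities}
```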

Cron

The command to schedule in the crontab is the following (it invokes the virtual environment's Python interpreter directly, so the environment does not need to be activated first):

cd twitathon && /home/pi/twitathon/.venv/bin/python src/retrieve_tweets.py retrieve_tweets_from_file --file='data/entities/entities_to_retrieve.txt' --number_of_tweets=1000
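In a crontab the command also needs a schedule prefix. A possible entry (the hourly schedule and the absolute cd path are our choices, not prescribed by the project):

# min hour dom mon dow  command — run the retrieval at minute 0 of every hour
0 * * * * cd /home/pi/twitathon && /home/pi/twitathon/.venv/bin/python src/retrieve_tweets.py retrieve_tweets_from_file --file='data/entities/entities_to_retrieve.txt' --number_of_tweets=1000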

Data

Data is stored in the data folder, following the sub-folder structure described below.

  • raw: data retrieved from Twitter.
  • dataset: files containing only the tweet id and the processed message.
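The reduction from raw tweets to dataset rows can be sketched as follows. The field names and the exact text processing are assumptions for illustration; the repo's real processing is not shown here:

```python
import re

def to_dataset_row(raw_tweet):
    """Reduce a raw tweet (a dict as returned by the API) to the two
    fields kept in data/dataset: the id and a processed message.

    As an example transformation we strip URLs and lowercase the text;
    the project's actual processing may differ.
    """
    text = raw_tweet.get("full_text") or raw_tweet.get("text", "")
    processed = re.sub(r"https?://\S+", "", text).lower().strip()
    return {"id": raw_tweet["id"], "message": processed}
```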