This code is used to connect to the Twitter API and download tweets associated with a predefined list of hashtags and users.
Create a new virtual environment (optional):

```shell
virtualenv -p python3 .venv
```

Enter the environment:

```shell
source .venv/bin/activate
```

Install the requirements:

```shell
pip install -r requirements.txt
```
The list of hashtags and users for which we want to retrieve tweets is stored in `data/entities/entities_to_retrieve.txt`. This file is automatically generated following the steps described below.
The first step is launched with the script `src/automatically_download_hashtags_users_sheet.py`, which uses the Google API to connect to an internal Google Spreadsheet (called "Hashtags and users") containing the most up-to-date entities to retrieve, and downloads them to two local files:

- `data/entities/original_hashtags.csv`, from sheet "Hashtags"
- `data/entities/original_users.csv`, from sheet "Users"
To connect to the Drive API, we followed the steps defined here.
The process of enabling the API results in the file `credentials.json`, which is required to launch the script. The JSON file is saved in the `config` folder but included in `.gitignore`.
The first time the script is launched (requiring the JSON credentials), the user has to manually grant permission to access the spreadsheet. After that, a pickle object with the permission token is automatically generated and saved in the `config` folder under the name `token.pickle`.
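The credentials/token flow described above can be sketched as follows. This is a hedged sketch, not the repo's actual code: the function names, the CSV serialisation, and the read-only scope are assumptions, and the Google client libraries are imported lazily so the module loads without them.

```python
# Sketch of the one-time OAuth consent flow and sheet download.
# Assumptions: function names, scope, and CSV serialisation are illustrative.
import os
import pickle

SCOPES = ["https://www.googleapis.com/auth/spreadsheets.readonly"]


def get_credentials(config_dir="config"):
    """Reuse config/token.pickle if present; otherwise run the one-time OAuth flow."""
    # Imported here so the module loads even without the Google client libraries.
    from google_auth_oauthlib.flow import InstalledAppFlow

    token_path = os.path.join(config_dir, "token.pickle")
    if os.path.exists(token_path):
        with open(token_path, "rb") as f:
            return pickle.load(f)
    flow = InstalledAppFlow.from_client_secrets_file(
        os.path.join(config_dir, "credentials.json"), SCOPES
    )
    creds = flow.run_local_server(port=0)  # opens a browser for manual consent
    with open(token_path, "wb") as f:
        pickle.dump(creds, f)
    return creds


def rows_to_csv_lines(rows):
    """Serialise the 'values' payload returned by the Sheets API into CSV lines."""
    return [",".join(row) for row in rows]


def download_sheet(spreadsheet_id, sheet_name, out_csv):
    """Fetch one sheet and write it to a local CSV file."""
    from googleapiclient.discovery import build

    service = build("sheets", "v4", credentials=get_credentials())
    result = service.spreadsheets().values().get(
        spreadsheetId=spreadsheet_id, range=sheet_name
    ).execute()
    with open(out_csv, "w") as f:
        f.write("\n".join(rows_to_csv_lines(result.get("values", []))) + "\n")
```

On later runs the cached `token.pickle` is loaded directly, so the manual consent step happens only once.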
Note that to connect to a specific spreadsheet, we need its identifier. For security reasons, this identifier is saved in `config.yaml` under the keys `drive` / `spreadsheet_id`.
The original id is found in the spreadsheet URL. For example:

```
https://docs.google.com/spreadsheets/d/<spreadsheet_id>/edit
```
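The relevant fragment of `config.yaml` might look like this (the value shown is a placeholder, not a real identifier):

```yaml
drive:
  spreadsheet_id: "<spreadsheet_id>"  # copied from the spreadsheet URL
```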
Once the automatic retrieval from Drive is finished, the script `src/update_entities_to_retrieve_txt.py` is in charge of joining the two files into `data/entities/entities_to_retrieve.txt`. The txt file is a raw list of entities (both hashtags and users, with the corresponding # or @ sign).
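The merge step could look like the sketch below. The column layout of the CSVs and the function name are assumptions; the only behaviour taken from the text is that hashtags get a `#` sign, users get a `@` sign, and the result is one entity per line.

```python
# Hedged sketch of the merge step; the CSV layout (one entity in the first
# column) and default paths are assumptions, not the repo's actual code.
import csv


def update_entities_file(
    hashtags_csv="data/entities/original_hashtags.csv",
    users_csv="data/entities/original_users.csv",
    out_txt="data/entities/entities_to_retrieve.txt",
):
    """Join the two downloaded CSVs into a raw list of #hashtags and @users."""
    entities = []
    for path, prefix in ((hashtags_csv, "#"), (users_csv, "@")):
        with open(path, newline="") as f:
            for row in csv.reader(f):
                if row and row[0].strip():
                    value = row[0].strip()
                    # Keep the sign if the sheet already includes it.
                    entities.append(value if value.startswith(prefix) else prefix + value)
    with open(out_txt, "w") as f:
        f.write("\n".join(entities) + "\n")
```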
To initiate the tweet retrieval, run the following instruction from the root of the repo:

```shell
python src/retrieve_tweets.py retrieve_tweets_from_file --file='data/entities_to_retrieve.txt' --number_of_tweets=1000
```

Note that the number of tweets can be changed.
The instruction to include in the cron job (including entering the virtual environment) is:

```shell
cd twitathon && /home/pi/twitathon/.venv/bin/python src/retrieve_tweets.py retrieve_tweets_from_file --file='data/entities_to_retrieve.txt' --number_of_tweets=1000
```
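A possible crontab entry wrapping that instruction (the hourly schedule is an assumption; adjust it to the desired retrieval frequency):

```
# Run every hour, on the hour (hypothetical schedule).
0 * * * * cd twitathon && /home/pi/twitathon/.venv/bin/python src/retrieve_tweets.py retrieve_tweets_from_file --file='data/entities_to_retrieve.txt' --number_of_tweets=1000
```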
Data is stored within the `data` folder following the sub-folder structure described below:

- `raw`: Data retrieved from Twitter.
- `dataset`: Files containing only id and processed message.