So you've collected tens of millions of tweets about a given topic and stored them in Google BigQuery. Now it's time to analyze them.
This project builds upon the research of Tauhid Zaman, Nicolas Guenon Des Mesnards, et al., as described in the paper "Detecting Bots and Assessing Their Impact in Social Networks".
NOTE: we used the code in this repo to support the collection and analysis of tweets about the First Trump Impeachment, but this codebase has since been superseded by the Tweet Analysis (2021) repo for subsequent projects.
Version 2020:
- Tweet Collection v1
- Friend Collection v1
- PG Pipeline (Local Database Migrations)
- Retweet Graphs v1
- Retweet Graphs v2
- API v0
- API v1
- Toxicity Classification
- Tweet Recollection
- News Sources
- Botometer Sampling
Version 2021:
- Tweet Collection v2
- Continued at: Tweet Analysis 2021
Dependencies:
- Git
- Python 3.8
- PostgreSQL (optional)
Clone this repo onto your local machine and navigate there from the command-line:
cd tweet-analysis-py/
Create and activate a virtual environment, using Anaconda, for example, if you like that kind of thing:
conda create -n tweet-analyzer-env-38 python=3.8
conda activate tweet-analyzer-env-38
Install package dependencies:
pip install -r requirements.txt
If you want to collect tweets or user friends, obtain credentials which provide read access to the Twitter API. Set the TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET, TWITTER_ACCESS_TOKEN, and TWITTER_ACCESS_TOKEN_SECRET environment variables accordingly (see environment variable setup below).
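For reference, here is a minimal sketch of authenticating with these credentials, assuming the tweepy package (the sanity-check call is just for illustration):

import os
import tweepy

# build an OAuth handler from the Twitter credentials in the environment
auth = tweepy.OAuthHandler(
    os.getenv("TWITTER_CONSUMER_KEY"),
    os.getenv("TWITTER_CONSUMER_SECRET")
)
auth.set_access_token(
    os.getenv("TWITTER_ACCESS_TOKEN"),
    os.getenv("TWITTER_ACCESS_TOKEN_SECRET")
)

api = tweepy.API(auth)
print(api.verify_credentials().screen_name)  # sanity check: the authenticated user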
The massive volume of tweets is stored in a Google BigQuery database, so we'll need BigQuery credentials to access the data. From the Google Cloud console, enable the BigQuery API, then generate and download the corresponding service account credentials. Move them into the root directory of this repo as "credentials.json", and set the GOOGLE_APPLICATION_CREDENTIALS environment variable accordingly (see environment variable setup below).
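A minimal sketch of querying BigQuery with these credentials, assuming the google-cloud-bigquery package (the "tweets" table and its column are hypothetical):

import os
from google.cloud import bigquery

# the client reads GOOGLE_APPLICATION_CREDENTIALS from the environment
client = bigquery.Client(project=os.getenv("BIGQUERY_PROJECT_NAME"))

# count rows in a hypothetical "tweets" table within the configured dataset
dataset = os.getenv("BIGQUERY_DATASET_NAME", "impeachment_development")
sql = f"SELECT count(distinct status_id) as tweet_count FROM `{dataset}.tweets`"
for row in client.query(sql).result():
    print(row.tweet_count)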
Many Twitter user network graph objects will be generated, and they can be so large that constructing them on a laptop is not feasible due to memory constraints, so there may be a need to run various graph construction scripts on a larger remote server. File storage on a Heroku server is ephemeral, so we'll save the files to a Google Cloud Storage bucket so they persist. Create a new bucket or gain access to an existing one, and set the GCS_BUCKET_NAME environment variable accordingly (see environment variable setup below).
FYI: the bucket will also contain some temporary tables used by BigQuery while performing batch jobs, so we're namespacing the storage of graph data under "storage/data", which mirrors the local "data" directory.
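A minimal sketch of uploading a graph file to the bucket under that namespace, assuming the google-cloud-storage package (the file name is hypothetical):

import os
from google.cloud import storage

# the client also reads GOOGLE_APPLICATION_CREDENTIALS from the environment
client = storage.Client()
bucket = client.bucket(os.getenv("GCS_BUCKET_NAME"))

# upload a hypothetical local graph file, mirroring the local "data" directory
blob = bucket.blob("storage/data/example_graph.gpickle")
blob.upload_from_filename("data/example_graph.gpickle")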
Some of the app's scripts take a long time to run. To have them send an email notification when they finish, first obtain a SendGrid API key, then set it as an environment variable (see environment variable setup below).
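A minimal sketch of such a notification email, assuming the sendgrid package (the subject and body text are illustrative):

import os
from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import Mail

# compose a hypothetical "job done" notification to yourself
message = Mail(
    from_email=os.getenv("MY_EMAIL_ADDRESS"),
    to_emails=os.getenv("MY_EMAIL_ADDRESS"),
    subject="Graph construction complete",
    html_content="<p>The long-running script has finished.</p>"
)

client = SendGridAPIClient(os.getenv("SENDGRID_API_KEY"))
response = client.send(message)
print(response.status_code)  # expect 202 on success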
To optionally download some of the data from BigQuery into a local database, first create a local PostgreSQL database called something like "impeachment_analysis", then set the DATABASE_URL environment variable accordingly (see environment variable setup below).
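A minimal sketch of connecting to that local database, assuming SQLAlchemy:

import os
from sqlalchemy import create_engine, text

# connect to the local PostgreSQL database specified by DATABASE_URL
engine = create_engine(os.getenv("DATABASE_URL"))

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # sanity check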
Create a new file in the root directory of this repo called ".env", and set your environment variables there, as necessary:
# example .env file
#
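# TWITTER API (credentials referenced above; values are placeholders)
#
# TWITTER_CONSUMER_KEY="__________"
# TWITTER_CONSUMER_SECRET="__________"
# TWITTER_ACCESS_TOKEN="__________"
# TWITTER_ACCESS_TOKEN_SECRET="__________"
#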
# GOOGLE APIs
#
GOOGLE_APPLICATION_CREDENTIALS="/path/to/tweet-analysis-py/credentials.json"
BIGQUERY_PROJECT_NAME="tweet-collector-py"
BIGQUERY_DATASET_NAME="impeachment_development"
GCS_BUCKET_NAME="impeachment-analysis-2020"
#
# LOCAL PG DATABASE
#
# DATABASE_URL="postgresql://USERNAME:PASSWORD@localhost/impeachment_analysis"
#
# EMAIL
#
# SENDGRID_API_KEY="__________"
# MY_EMAIL_ADDRESS="hello@example.com"
#
# NLP
#
# BASILICA_API_KEY="______________"
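A minimal sketch of how the app might load these variables at runtime, assuming the python-dotenv package:

import os
from dotenv import load_dotenv

load_dotenv()  # reads the ".env" file into the process environment

print(os.getenv("BIGQUERY_PROJECT_NAME"))  # sanity check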
Testing the Google BigQuery connection:
python -m app.bq_service
Testing the Google Cloud Storage connection, saving some mock files in the specified bucket:
python -m app.gcs_service
Run tests:
APP_ENV="test" pytest
On the CI server, to skip web requests:
CI="true" APP_ENV="test" pytest