
A repository for the Data Engineering Zoomcamp


Twitter INEC Analysis ELT Pipeline

Motivation

Recently, the outcome of the 2023 presidential election in Nigeria has sparked heated debates among Nigerians and foreigners. Many Nigerians have taken to social media, particularly Twitter, to express their frustration with the Independent National Electoral Commission (INEC) over the results. Understanding the sentiment about INEC on social media is therefore important. Hence, the aim of this project is to build an automated data pipeline that collects tweets mentioning INEC and analyzes their content daily.

The analysis can help inform the government about its citizens' perception of INEC and suggest ways to salvage the situation. In order to conduct a quality analysis, I used the snscrape tool to scrape older tweets, since the free tier of the Twitter API cannot retrieve them.

Architectural Diagram

[Architecture diagram image]

Architectural Diagram Explanation

Tweets containing "INEC" or "inec" were scraped from Twitter, and the data was then loaded into a pandas dataframe. To optimize storage and query performance, the data was stored as a Parquet file on Google Cloud Storage. Parquet, known for its columnar storage format and efficient compression algorithms, was chosen to minimize storage costs and maximize query speed.
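
A rough sketch of this ingestion step is shown below. The search query, row limit, bucket path, and column names are placeholders rather than the exact ones used in the repo, and the snippet assumes the snscrape, pandas, pyarrow, and gcsfs packages are installed; tweet attribute names also vary slightly between snscrape versions.

import pandas as pd
import snscrape.modules.twitter as sntwitter

# Scrape tweets mentioning INEC (query and limit are illustrative).
rows = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper("INEC OR inec").get_items()):
    if i >= 1000:
        break
    rows.append({
        "date_of_tweet": tweet.date,
        "twitter_text": tweet.rawContent,  # `tweet.content` on older snscrape versions
        "username": tweet.user.username,
        "user_is_verified": tweet.user.verified,
        "retweet_count": tweet.retweetCount,
        "like_count": tweet.likeCount,
        "followers_count": tweet.user.followersCount,
        "following_count": tweet.user.friendsCount,
    })

df = pd.DataFrame(rows)

# Store the raw data as Parquet on Google Cloud Storage (bucket path is a placeholder).
df.to_parquet("gs://your_gcp_bucket_name/raw/inec_tweets.parquet", index=False)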

Apache Spark (on Google Dataproc) was used to load the data and perform minor transformations such as removing URLs, hashtags, and emojis. Once the transformations were completed, the data was moved into a Google BigQuery table. dbt was used for the heavy transformation and analysis, and the transformed data was written back to BigQuery.
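
The Spark cleanup can be approximated with regexp_replace, as in the sketch below. The Cloud Storage path, column name, regex patterns, and BigQuery table name are illustrative rather than the exact ones in the repo; writing to BigQuery assumes the Spark BigQuery connector that Dataproc provides.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("inec_tweet_cleanup").getOrCreate()

# Load the raw Parquet file from Cloud Storage (placeholder path).
df = spark.read.parquet("gs://your_gcp_bucket_name/raw/inec_tweets.parquet")

# Remove URLs, hashtags, and emojis (crudely, any non-ASCII characters) from the tweet text.
clean = (
    df.withColumn("twitter_text", F.regexp_replace("twitter_text", r"https?://\S+", ""))
      .withColumn("twitter_text", F.regexp_replace("twitter_text", r"#\w+", ""))
      .withColumn("twitter_text", F.regexp_replace("twitter_text", r"[^\x00-\x7F]+", ""))
)

# Write the cleaned data to BigQuery via the Spark BigQuery connector.
(clean.write.format("bigquery")
      .option("table", "your_project.your_dataset.raw_tweet_data")
      .option("temporaryGcsBucket", "your_gcp_bucket_name")
      .mode("append")
      .save())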

Google Looker Studio was used for analysis and visualization. The steps from ingesting the data up to dbt were orchestrated by Prefect at 12 am daily and ran on Google Cloud Platform. The data refresh rate on Looker Studio was 12 hours.
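
A minimal sketch of that orchestration, assuming Prefect 2.x, is shown below; the task bodies and names are placeholders for the actual functions in twitter_flow.py, and the cron expression encodes the 12 am daily schedule.

from prefect import flow, task

@task
def scrape_tweets_to_gcs():
    ...  # scrape tweets mentioning INEC and write Parquet to Cloud Storage

@task
def clean_with_spark():
    ...  # run the Dataproc/Spark cleanup job and load the BigQuery table

@task
def run_dbt_models():
    ...  # trigger the dbt transformations

@flow(name="twitter_inec_elt")
def twitter_inec_elt():
    scrape_tweets_to_gcs()
    clean_with_spark()
    run_dbt_models()

if __name__ == "__main__":
    # Serve the flow on a daily midnight schedule (requires Prefect 2.10+).
    twitter_inec_elt.serve(name="daily-midnight", cron="0 0 * * *")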

Data Visualization Motivation

Given that the project aimed to capture the perception of Twitter users about INEC and to see how that perception evolves over time, the final visualization was composed of the information below:

  • date the tweet was made
  • tweet content
  • username of the user that made the tweet
  • verified status of the user that made the tweet
  • number of retweets on a tweet
  • number of likes on a tweet
  • influencer status of the user. This is obtained by dividing the number of accounts the tweet author follows by the number of accounts following the author; a value less than one means the user is an influencer (see the sketch after this list).
  • sentiment of each tweet
  • whether the tweet was made before the presidential election (held on February 25th, 2023)
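
The two derived fields can be computed roughly as below. The following_count and followers_count column names and the label strings are assumptions for illustration; in the pipeline itself these fields are produced by the dbt models.

import numpy as np
import pandas as pd

# Date of the 2023 Nigerian presidential election.
ELECTION_DATE = pd.Timestamp("2023-02-25")

def add_derived_fields(df: pd.DataFrame) -> pd.DataFrame:
    # Influencer status: accounts the author follows divided by accounts following the author;
    # a ratio below one marks the author as an influencer.
    ratio = df["following_count"] / df["followers_count"]
    df["user_is_influencer"] = np.where(ratio < 1, "influencer", "not influencer")

    # Whether the tweet was made before the election (assumes timezone-naive timestamps).
    df["before_presidential_election"] = np.where(
        pd.to_datetime(df["date_of_tweet"]) < ELECTION_DATE, "before election", "after election"
    )
    return df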

Hence, the final table used for visualizing the data is shown below.

Field Name                     Description                                                    Type
date_of_tweet                  The date the tweet was made                                    TIMESTAMP
twitter_text                   Content of the tweet                                           STRING
username                       Username of the tweet author                                   STRING
user_is_verified               Verified status of the user                                    BOOLEAN
number_of_retweet_count        Number of retweets on the tweet                                INTEGER
user_is_influencer             Influencer status of the user                                  STRING
tweet_sentiment                Sentiment of the tweet                                         STRING
before_presidential_election   Whether the tweet was made before the presidential election   STRING
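
For reference, the same schema could be declared with the google-cloud-bigquery Python client as in the sketch below; in the pipeline itself the core_tweet_data table is materialized by dbt, so this is purely illustrative.

from google.cloud import bigquery

# Schema of the final core_tweet_data table used for visualization.
CORE_TWEET_DATA_SCHEMA = [
    bigquery.SchemaField("date_of_tweet", "TIMESTAMP", description="The date the tweet was made"),
    bigquery.SchemaField("twitter_text", "STRING", description="Content of the tweet"),
    bigquery.SchemaField("username", "STRING", description="Username of the tweet author"),
    bigquery.SchemaField("user_is_verified", "BOOLEAN", description="Verified status of the user"),
    bigquery.SchemaField("number_of_retweet_count", "INTEGER", description="Number of retweets on the tweet"),
    bigquery.SchemaField("user_is_influencer", "STRING", description="Influencer status of the user"),
    bigquery.SchemaField("tweet_sentiment", "STRING", description="Sentiment of the tweet"),
    bigquery.SchemaField("before_presidential_election", "STRING", description="Whether the tweet was made before the election"),
]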

Visualization

How to run the Project on your personal system

To run the project locally, you will need to install the following tools:

  • Terraform, for automating the provisioning of resources on Google Cloud
  • A Google Cloud account
  • Data Build Tool (dbt), for transformation
  • Python

Steps to replicate it locally

  • Ensure you have Terraform installed on your system.
  • Clone the repository to your system and cd into the folder.
  • Go to Google Cloud and create a service account. Ensure the service account has the Storage Admin, Storage Object Admin, BigQuery Admin, Dataproc Administrator, and Dataproc Service Agent roles. Also, grant the Storage Admin and Storage Object Admin roles to the Compute Engine default service account named ********-compute@developer.gserviceaccount.com (this is required for Dataproc).
  • Generate the service account key as JSON; you can rename it to service_account.json.
  • In your terminal, run export GOOGLE_APPLICATION_CREDENTIALS="path_to_your_service_account.json".
  • cd into the terraform folder.
  • Run terraform apply in your terminal, press Enter, and type yes when prompted. The necessary infrastructure will be provisioned on Google Cloud.
  • Change directory into twitter_inec_project and run make install in your terminal to create the virtual environment and install the necessary dependencies. In the Makefile, change $(shell which python3) to $(shell which python) if Python is installed as python on your system. Alternatively, you can run pip install -r requirements.txt to install the packages on your system.
  • Change directory into dbt_idowu_project_zoomcamp, then move into twitter_analysis_tweepy. In the profiles.yml file, change the project id to your project ID, the BigQuery dataset to your preferred name, and the location to your preferred location.
  • Change directory into models/raw. Modify the schema.yml file by changing the database name to your Google Cloud project ID and the schema to the dataset name you specified earlier.
  • Change directory back to twitter_inec_project. Create a file auth_keys.py for storing the Twitter API authentication; you can get the authentication keys from the Twitter Developer portal. The file should look like this:
import tweepy

# Twitter API credentials from the Twitter Developer portal.
auth = tweepy.OAuthHandler("*********", "*****")
auth.set_access_token("***********", "*********")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Google Cloud configuration used by the pipeline.
DATA_PROC_CLUSTER_NAME = 'yourdataprocname'
GCP_BUCKET_NAME = 'your_gcp_bucket_name (the one created by terraform)'
DATAPROC_CLUSTER_REGION = 'the_region_that_your_dataproc_cluster_is_in'
DATASET_LOCATION = 'your preferred bigquery dataset location'
PROJECT_NAME = 'your google account project name'
DATASET_NAME = 'your preferred bigquery dataset name'
TABLE_NAME = 'your preferred bigquery table name'
  • The next step is to start the Prefect server with prefect server start and set the API URL. Move back into the twitter_inec_project directory and run python twitter_flow.py (or python3 twitter_flow.py). This will automatically run all the flows and ingest the data into BigQuery.

  • To visualize the data, go to Looker Studio. Duplicate the visual and connect the core_tweet_data table as the data source. The visuals will be populated automatically.

  • For the top 10 common words in the tweets chart, click on the Add data icon, select BigQuery Custom SQL, and enter the following in the editor:

WITH word_selected AS (SELECT
  lower(word) word 
FROM (
  SELECT
    SPLIT(twitter_text, ' ') as words
  FROM
    `yourprojectname.yourdatasetname.core_tweet_data`
)
CROSS JOIN
  UNNEST(words) as word
WHERE
  LOWER(word) NOT IN ('the','to','of','and','where','com','in','is','a','inec','for','that','not','be','on','as','this','has','are','you','by','will','was','with','have','they','i','from','it','he','&','his','what','we','all','their','at','no','just','an','our','do','can','but','-','your','his','should','stop', 'the', 'to', 'and', 'a', 'in', 'it', 'is', 'I', 'that', 'had', 'on', 'for', 'were', 'was','i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't") and word != ''
)

SELECT word, COUNT(*) as count
FROM word_selected
GROUP BY word
ORDER BY
  count DESC
LIMIT
  20;
  • Once you are done, you will see a visual similar to the one shown below.

Next steps

  • Containerize the workflow and run on Kubernetes
  • Implement a CI/CD framework for testing and continuously deploying the code