Recently, the outcome of the 2023 presidential election in Nigeria has sparked heated debates among Nigerians and foreigners. Many Nigerians have taken to social media, particularly Twitter, to express their frustrations towards the Independent National Electoral Commission (INEC) for the results. Understanding the sentiment about INEC on social media is very important. Hence, the aim of the project is to build an automated data pipeline that will collect tweets that mention INEC and analyze the content of the tweet daily.
The analysis can help inform the Government about the perception of her citizen towards INEC, and seek ways to salvage the situation. In order to conduct a quality analysis, I used the SNSCRAPE tool to scrape older tweets, since the free version of the twitter API can't produce this.
Tweets containing "INEC" or "inec" were scraped from Twitter, and the data was then loaded into a pandas dataframe. To optimize storage and query performance, the data was stored as a Parquet file on Google Cloud Storage. Parquet, known for its columnar storage format and efficient compression algorithms, was chosen to minimize storage costs and maximize query speed.
Apache Spark (on Google DataProc) was used to load and perform minor transformations such as removal of URLs, hashtags, and emojis. Once the transformations were completed, the data was moved into a Google BigQuery table. DBT was used for heavy transformation and analysis, and the transformed data was written back to BigQuery.
Google LookerStudio was used for analysis and visualization. The steps from ingesting the data up to DBT were orchestrated by Prefect at 12 am daily and ran on Google Cloud Platform. The data refresh rate on Looker Studio was 12 hours.
Given that the project aimed to get the perception of twitter users about INEC and also see how the perception grows over time. My final visualization was composed of the information below,
- date that the tweet was made
- tweet content
- username of user that made the tweet
- verified status of the user that made the tweet
- number of the retweets on a tweet
- number of likes on a tweet
- influencer status of the user. This is gotten by dividing the number of users the tweet author is following by those that are following the author. Values less than one means that the user is an influencer.
- sentiment of each tweet
- if the tweet was made before the presidential election (which was on February 25th 2023)
Hence the final table that was used for visualizing the data is shown below
Field Name | Description | Type |
---|---|---|
date_of_tweet | The date the tweet was made | TIMESTAMP |
twitter_text | Content of the Tweet | STRING |
username | username of the tweet author | STRING |
user_is_verified | verified status of the user | BOOLEAN |
number_of_retweet_count | the number of count on the tweet | INTEGER |
user_is_influencer | influencer status of the user | STRING |
tweet_sentiment | sentiment of the tweet | STRING |
before_presidential_election | if the tweet was made before the presidential election | STRING |
To run the project locally, on your system, you will need to install tools such as
- Terraform. This is for automating the creation of applications on Google cloud
- Google Cloud Account
- Data Build Tools (DBT) for transformation.
- Python
- Once you have terraform installed on your system
- Clone the repository to your system and cd into the folder
- Go to Google Cloud and create a service account. Ensure the service account has storage admin, storage object admin, google bigquery admin, dataproc adminstrator, dataproc service agent role. Also, ensure you grant the compute storage admin, compute object admin to a service accont with name ********-compute@developer.gserviceaccount.com (this is required for dataproc).
- Generate the service account key as Json, you can rename it to service_account.json.
- In your terminal type this
export GOOGLE_APPLICATION_CREDENTIALS="path_to_your_service_account.json"
- cd into the terraform folder
- type
terraform apply
in your terminal, click on enter and type yes. The neccessary infrastructure has been provisioned on Google Cloud. - change directory into twitter inec project, and type
make instal
on your terminal to create the virtual environment and download the neccessary tools that is needed. Ensure you change theshell which python3
toshell which python
if you have python installed. Alternatively, you can just typepip install requirements.txt
, to install the packages on your system. - Change the directory into dbt_idowu_project_zoomcamp and, move into twitter_analysis_tweepy. Go to the
profiles.yml
file, change the project id to your id, google_bigquery dataset to your preferred name and the location to your preferred location. - change the directory into models/raw. Modify the
schema.yml
file by changing the database name to your googleaccountprojectid, and schema to what you specified earlier. - Change the directory back to the twitter_inec_project. Create a file
auth_keys.py
for storing the twitter api authentication, you can get the authetication key from twitter developer. The file should look like this,
import tweepy
auth = tweepy.OAuthHandler("*********","*****")
auth.set_access_token("***********","*********")
api = tweepy.API(auth,wait_on_rate_limit=True)
DATA_PROC_CLUSTER_NAME='yourdataprocname'
GCP_BUCKET_NAME='your_gcp_bucket_name(the one created by terraform)'
DATAPROC_CLUSTER_REGION='the_region_that_your_dataproc_cluster_is_on'
DATASET_LOCATION = 'your preferred bigquery dataset location'
PROJECT_NAME='your google account project name'
DATASET_NAME='your preferred bigquery dataset name'
TABLE_NAME='your prefered bigquery table name'
-
The next step is to start the prefect server with
prefect server start
, set the api url. Move back into the twitter_inec_project directory and runpython twitter_flow.py or python3 twitter_flow.py
. This will automatically run all the flows and ingest the data into bigquery. -
To visualize the data, go to lookerstudio. Duplicate the visual, and connect the
core_tweet_data
table as a source to the data. The visuals will automatically be populated. -
For the
top 10 common words in the tweet chart
, you need to click onAdd data
icon, select Bigquery Custom SQL and type this in the editor
WITH word_selected AS (SELECT
lower(word) word
FROM (
SELECT
SPLIT(twitter_text, ' ') as words
FROM
`yourprojectname.yourdatasetname.core_tweet_data`
)
CROSS JOIN
UNNEST(words) as word
WHERE
LOWER(word) NOT IN ('the','to','of','and','where','com','in','is','a','inec','for','that','not','be','on','as','this','has','are','you','by','will','was','with','have','they','i','from','it','he','&','his','what','we','all','their','at','no','just','an','our','do','can','but','-','your','his','should','stop', 'the', 'to', 'and', 'a', 'in', 'it', 'is', 'I', 'that', 'had', 'on', 'for', 'were', 'was','i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't") and word != ''
)
SELECT word, COUNT(*) as count
FROM word_selected
GROUP BY word
ORDER BY
count DESC
LIMIT
20;
- Containerize the workflow and run on Kubernetes
- Implement a CI/CD framework for testing and continuously deploying the code