This project builds a scalable flow to capture real-time Twitter data using a stack of:
- Google Kubernetes Engine
- Google PubSub
- Google BigQuery (Streaming)
- Google Cloud Build & Container Registry
- Python libraries (Tweepy & Google Cloud SDK)
Originally forked from https://github.com/GoogleCloudPlatform/kubernetes-bigquery-python. This fork adds enhancements to the original Google repo, which has not been updated in some time:
- Updated `bigquery-setup/schema.json` to the latest available schema found in the Twitter API docs
- Moved keyword terms to `twitter-stream.yaml` under the field `TWKEYWORDS` to allow adjustment without a rebuild (see the sketch after this list)
- Updated the script to use the latest Google Cloud SDK Python library, avoiding oauth2 library issues, and to support Python 3.x
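If the generated `twitter-stream.yaml` exposes `TWKEYWORDS` as a container environment variable (an assumption based on the description above), the tracked keywords could likely be changed in place without rebuilding the image:

```
# Hypothetical: assumes TWKEYWORDS is a container env var on the twitter-stream deployment.
kubectl set env deployment/twitter-stream TWKEYWORDS="bigquery,kubernetes"
# kubectl performs a rolling restart so the new keywords take effect.
```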
This repo uses a guided bash script for easier setup.
Prior to running, ensure that your local installation of the Google Cloud SDK is properly configured for your project (https://cloud.google.com/sdk/install). This includes the following (example commands are sketched after this list):
- Installing the SDK
- Authenticating to your account
- Adding your project
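If you have not configured the SDK before, a minimal sketch of the standard gcloud commands (the project ID is a placeholder):

```
gcloud init                                   # interactive first-time setup
gcloud auth login                             # authenticate with your Google account
gcloud config set project your-project-12345  # select your project (placeholder ID)
```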
Additionally, you will need a valid Google Cloud project that is set up with payment information or has an active free trial.
Head over to https://developers.twitter.com and create a new application to obtain the following keys & tokens:
- API key
- API secret key
- Access token
- Access token secret
Put each of the 4 items on its own line, in this order, in a file called `twitter.key`. This file is parsed in sequence by the setup script to populate the required values.
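A sketch of what `twitter.key` should look like (all four values are placeholders):

```
YOUR_API_KEY
YOUR_API_SECRET_KEY
YOUR_ACCESS_TOKEN
YOUR_ACCESS_TOKEN_SECRET
```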
The bash script walks you through creating each of the necessary resources on Google Cloud Platform. Each step can be skipped if you already have that resource.
- Launch the script using `sh make-environment.sh` and enter your GCP project ID. This updates the active project in your gcloud SDK configuration.

```
sh make-environment.sh
GCP Project ID: jbencina-144002
Updated property [core/project].
```
- Next, you are prompted to create a container image. If you do not already have one, press Y and enter the tag name. This builds the image and uploads it to Google Container Registry. (A rough manual equivalent is sketched after the output below.)

```
Build new Google Container Image? (Y/n): y
Image tag: v1
# Lots of output as build progresses. You should see the final line containing something like gcr.io/yourproject-12345/pubsub_bq:v1 SUCCESS
```
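If you prefer to build the image yourself, a rough manual equivalent (assumes the Dockerfile sits at the repo root and reuses the `pubsub_bq` image name from the sample output; the script's exact commands may differ):

```
# Build and push the image with Cloud Build (project ID and tag are placeholders)
gcloud builds submit --tag gcr.io/your-project-12345/pubsub_bq:v1 .
```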
- Next, create a BigQuery dataset & table. This automatically uses the schema specified under `bigquery-setup/schema.json`. (A manual equivalent is sketched after the output below.)

```
Create Google BigQuery table? (Y/n): Y
BQ Dataset Name: testds
BQ Table Name: testtable
Dataset 'yourproject-12345:testds' successfully created.
Table 'yourproject-12345:testds.testtable' successfully created.
```
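A rough manual equivalent using the `bq` CLI, in case you want to create the dataset and table outside the script (names reuse the sample above):

```
bq mk --dataset testds                                     # create the dataset
bq mk --table testds.testtable bigquery-setup/schema.json  # create the table from the schema file
```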
- Next, create the PubSub topic. If the topic is new, you will see an error because the script first tries to delete any existing topic with the same name; this is harmless. (A manual equivalent is sketched after the output below.)

```
Create Google PubSub Topic? (Y/n): Y
PubSub Topic Name: mytopic
ERROR: Failed to delete topic [projects/jbencina-144002/topics/mytopic]: Resource not found (resource=mytopic).
ERROR: (gcloud.pubsub.topics.delete) Failed to delete the following: [mytopic].
Created topic [projects/jbencina-144002/topics/mytopic].
```
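A rough manual equivalent, which also shows why the error above is harmless when the topic does not exist yet:

```
gcloud pubsub topics delete mytopic   # fails with "Resource not found" if the topic is new
gcloud pubsub topics create mytopic
```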
- Next, create the Kubernetes cluster by supplying a name. This uses small instances to help minimize cost. Two nodes are generally fine unless you see a performance impact. (A rough manual equivalent is sketched after the output below.)

```
Create Google Kubernetes Cluster? (Y/n): Y
Cluster Name: test
Number of nodes (2 recommended): 2
# Lots of output. Should show success after the cluster takes 2-3 minutes to start up
```
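A rough manual equivalent (the machine type and zone here are assumptions, not necessarily what the script uses):

```
gcloud container clusters create test --num-nodes=2 --machine-type=e2-small --zone=us-central1-a
gcloud container clusters get-credentials test --zone=us-central1-a   # point kubectl at the new cluster
```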
- Lastly, the script will create the `bigquery-controller.yaml` and `twitter-stream.yaml` files for you. If you skipped any of the previous steps, you will be prompted for the missing input; otherwise, your prior entries are reused.

```
Number of PubSub -> BQ Nodes (2 recommended): 1
Keywords to track (Comma separated): test,topic
```
You are now ready to push the flow to Kubernetes. Simply run `sh deploy-environment.sh` to deploy the workloads to the cluster. You can check the current status on the GCP Cloud Console or by running `kubectl get pods -o wide` from the CLI. If this takes more than a few minutes, you may have to try deleting the workloads and redeploying, or possibly recreating the cluster (see the kubectl sketch after the output below).

```
sh push-environment.sh
deployment.apps/bigquery-controller created
deployment.apps/twitter-stream created

NAME                      READY  STATUS             RESTARTS  AGE  IP      NODE                       NOMINATED NODE  READINESS GATES
bigquery-controller-xxx   0/1    ContainerCreating  0         1s   <none>  gke-test-default-pool-xxx  <none>          <none>
twitter-stream-xxx        0/1    ContainerCreating  0         1s   <none>  gke-test-default-pool-xxx  <none>          <none>
```
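A few kubectl commands that may help while the pods come up (assumes the generated YAML files are deployed from the repo root):

```
kubectl get pods -o wide                 # watch pod status
kubectl logs deployment/twitter-stream   # inspect the streaming container's logs
# If pods stay stuck, delete and redeploy the workloads:
kubectl delete -f bigquery-controller.yaml -f twitter-stream.yaml
kubectl apply -f bigquery-controller.yaml -f twitter-stream.yaml
```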
Assuming all workloads are running, you should start seeing new data loading into BigQuery:

```
bq query 'select COUNT(*) FROM dataset.tablename'
+-------+
|  f0_  |
+-------+
| 16400 |
+-------+
```
There are a few important caveats to the data collected:
- There are likely duplicate entries in BigQuery because multiple workers may grab the same message. Deduplicate on the Tweet ID (see the example query after this list). If you are only using 1 pubsub-to-bigquery instance, this should not be a problem.
- Most Tweets are retweets. Filter for unique Tweets with `retweeted_status.id IS NULL`.
- Tweets over 140 characters are truncated with `...`. These are typically retweets, but some original Tweets may also exceed the limit. Use `extended_tweet.full_text` to capture the full Tweet.
- The data has many instances of Unicode.
- Despite using the streaming API, there is still a rate limit imposed. You can view the GKE logs to see errors like `ERROR:root:Tweet Parse: Missing ID - b'{"limit":{"track":480788,"timestamp_ms":"1604114035737"}}\r\n'`, which indicates 480,788 Tweets were not sent back from the API. See more info at https://developer.twitter.com/en/docs/twitter-api/v1/tweets/filter-realtime/guides/streaming-message-types
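As a hedged example of the deduplication mentioned above, a standard-SQL sketch that keeps one row per Tweet ID (`dataset.tablename` is a placeholder):

```
# Keep a single row per Tweet ID; requires standard SQL for the backtick table reference
bq query --use_legacy_sql=false '
SELECT id, created_at, text
FROM (
  SELECT id, created_at, text,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at) AS rn
  FROM `dataset.tablename`
)
WHERE rn = 1
LIMIT 100'
```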
Follow the Twitter API guide for the most detailed field explanations: https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/tweet-object
This is a sample query focusing on the core elements of the data:

```
SELECT
  created_at,
  id,
  user.name,
  retweeted_status.id IS NOT NULL AS is_retweet,
  retweeted_status.text AS retweeted_text,
  COALESCE(extended_tweet.full_text, text) AS text
FROM `dataset.tablename`
LIMIT 1000;
```
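If you want to reuse this cleaned-up shape of the data, one hedged option is to save it as a BigQuery view (the dataset, table, and view names below are placeholders):

```
# Create a standard-SQL view over the raw table for downstream analysis
bq mk --use_legacy_sql=false \
  --view 'SELECT created_at, id, user.name,
                 retweeted_status.id IS NOT NULL AS is_retweet,
                 COALESCE(extended_tweet.full_text, text) AS text
          FROM `dataset.tablename`' \
  testds.tweets_clean
```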