This project collects real-time Twitter data on the COVID-19 topic, processes the tweets by mapping their attributes to the desired columns, and then stores the data for further analysis.
Python packages required:
- boto3 (pip install boto3)
- tweepy (pip install tweepy)
Cloud services used in this project:
- Kinesis Firehose
- AWS S3
- AWS Glue
- IAM (for role creation)
- CloudWatch
- Use tweetercred.py to store your Twitter developer account credentials.
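A minimal sketch of what tweetercred.py might contain; the variable names and placeholder values are assumptions, so substitute the keys from your own Twitter developer account:

```python
# tweetercred.py -- central place for the Twitter developer credentials,
# so the other scripts can import them instead of hard-coding keys.
# The names and placeholder values below are assumptions; replace them
# with the keys from your Twitter developer account.
API_KEY = "your-api-key"
API_SECRET = "your-api-secret"
ACCESS_TOKEN = "your-access-token"
ACCESS_TOKEN_SECRET = "your-access-token-secret"
```

Keeping the credentials in one module also makes it easy to exclude them from version control.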
- Use buildFirehose_AWS.py, which uses the boto3 API to create a data delivery stream with Kinesis Firehose.
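A sketch of the kind of call buildFirehose_AWS.py might make, assuming an existing S3 bucket and an IAM role that Firehose can assume; the function and parameter names here are illustrative, not taken from the script:

```python
def s3_destination_config(role_arn, bucket_arn, size_mb=5, interval_s=60):
    # Buffering hints control how much data Firehose batches
    # before it writes an object to S3.
    return {
        "RoleARN": role_arn,
        "BucketARN": bucket_arn,
        "BufferingHints": {"SizeInMBs": size_mb, "IntervalInSeconds": interval_s},
    }


def create_delivery_stream(stream_name, role_arn, bucket_arn):
    import boto3  # imported here so the helper above works without boto3 installed

    client = boto3.client("firehose")
    return client.create_delivery_stream(
        DeliveryStreamName=stream_name,
        DeliveryStreamType="DirectPut",  # records are pushed with PutRecord
        S3DestinationConfiguration=s3_destination_config(role_arn, bucket_arn),
    )
```

DirectPut means the producer script writes records straight to the stream, which fits the ingestion script below.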
- Use ingesttwitterdata.py to ingest data into the data delivery stream.
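One way ingesttwitterdata.py could map tweet attributes to columns and push each record to the stream; the chosen columns are an assumption about the desired schema, not taken from the script:

```python
import json


def tweet_to_record(tweet):
    # Map the raw tweet attributes to the columns we want to keep.
    row = {
        "id": tweet.get("id_str"),
        "created_at": tweet.get("created_at"),
        "user": (tweet.get("user") or {}).get("screen_name"),
        "text": tweet.get("text"),
    }
    # Firehose concatenates records as-is, so add a newline separator.
    return json.dumps(row) + "\n"


def send_to_firehose(stream_name, tweet):
    import boto3  # imported here so tweet_to_record works without boto3

    client = boto3.client("firehose")
    client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": tweet_to_record(tweet).encode("utf-8")},
    )
```

The newline separator matters: without it, the JSON objects in each S3 object run together and are harder for Glue to parse.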
- Delete the data delivery stream after the data has been successfully delivered to the S3 bucket.
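The stream can be deleted with boto3 as well, which stops Firehose charges once ingestion is done. A minimal sketch; the helper name is illustrative:

```python
def delete_request(stream_name):
    # Parameters for DeleteDeliveryStream; AllowForceDelete removes the
    # stream even if a grant on its resources is still pending.
    return {"DeliveryStreamName": stream_name, "AllowForceDelete": True}


def delete_delivery_stream(stream_name):
    import boto3  # lazy import keeps delete_request usable on its own

    boto3.client("firehose").delete_delivery_stream(**delete_request(stream_name))
```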
- To set the credentials for your AWS account, run the AWS CLI configuration command on the command prompt after installing the AWS CLI.
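The AWS CLI's standard credential-setup command is aws configure; the values below are placeholders:

```shell
# Interactive: prompts for access key, secret key, default region, and
# output format, then writes them to ~/.aws/credentials and ~/.aws/config.
aws configure

# Non-interactive equivalents (placeholder values):
aws configure set aws_access_key_id YOUR_ACCESS_KEY_ID
aws configure set aws_secret_access_key YOUR_SECRET_ACCESS_KEY
aws configure set default.region us-east-1
```

boto3 reads the same credential files, so the scripts above need no extra configuration once this is done.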
The data pipeline has two parts:
- Collecting Data
- Processing Data