Data-Pipeline-AWS

This project demonstrates building a data pipeline: it collects data with the Twitter API and creates a data delivery stream using Kinesis Firehose to ingest the data into Amazon S3.

Primary Language: Python

DATA-PIPELINE USING AWS SERVICES

Objective

This project collects real-time Twitter data on the COVID-19 topic, processes the tweets by mapping their attributes to the desired columns, and then stores the data for further analysis.

Dependencies

  • boto3
  pip install boto3
  • tweepy
  pip install tweepy

Cloud services used in this project:

  • Kinesis Firehose
  • AWS S3
  • AWS Glue
  • IAM (For role creation)
  • CloudWatch

How to use?

  • Use tweetercred.py to store your Twitter developer account credentials; a sketch of this module follows the list.

  • Use buildFirehose_AWS.py, which uses the boto3 API to create a data delivery stream with Kinesis Firehose; see the second sketch below.

  • Use ingesttwitterdata.py to ingest data into the delivery stream; see the third sketch below.

  • Delete the delivery stream after the data has been successfully ingested into the S3 bucket; see the fourth sketch below.

  • To set the credentials for your AWS account, run the command below on the command prompt after installing the AWS CLI:

  aws configure
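
The first sketch shows what tweetercred.py might look like; the variable names are illustrative assumptions, not the repository's actual identifiers:

  # tweetercred.py -- keeps the Twitter developer credentials in one place.
  # These names are placeholders; fill in the keys from your developer account.
  consumer_key = "YOUR_CONSUMER_KEY"
  consumer_secret = "YOUR_CONSUMER_SECRET"
  access_token = "YOUR_ACCESS_TOKEN"
  access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"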
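
The second sketch is one way buildFirehose_AWS.py could create the delivery stream with boto3; the stream name, bucket ARN, and IAM role ARN are placeholders, and the repository's actual configuration may differ:

  # Create a Kinesis Firehose delivery stream that delivers into S3.
  import boto3

  firehose = boto3.client("firehose")

  firehose.create_delivery_stream(
      DeliveryStreamName="twitter-covid19-stream",  # placeholder name
      DeliveryStreamType="DirectPut",               # records are pushed via PutRecord
      S3DestinationConfiguration={
          # Placeholder ARNs -- substitute your own bucket and IAM role.
          "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3-role",
          "BucketARN": "arn:aws:s3:::your-bucket-name",
      },
  )

Firehose buffers the incoming records and writes them to the bucket in batches.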
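
The third sketch shows how ingesttwitterdata.py could stream COVID-19 tweets into the delivery stream; it assumes tweepy 3.x (StreamListener was removed in tweepy 4), and the class name, stream name, and tracked keyword are illustrative:

  # Stream tweets on the COVID-19 topic and push each one to Firehose.
  import boto3
  import tweepy
  import tweetercred  # the credentials module described above

  firehose = boto3.client("firehose")

  class FirehoseListener(tweepy.StreamListener):
      def on_data(self, raw_data):
          # Forward the raw tweet JSON to the delivery stream; a trailing
          # newline keeps records separable once Firehose batches them in S3.
          firehose.put_record(
              DeliveryStreamName="twitter-covid19-stream",
              Record={"Data": (raw_data.strip() + "\n").encode("utf-8")},
          )
          return True

  auth = tweepy.OAuthHandler(tweetercred.consumer_key, tweetercred.consumer_secret)
  auth.set_access_token(tweetercred.access_token, tweetercred.access_token_secret)
  stream = tweepy.Stream(auth, FirehoseListener())
  stream.filter(track=["COVID-19"])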
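
Finally, once the data has landed in S3, the stream can be removed with a single boto3 call (the stream name is again a placeholder):

  # Tear down the delivery stream after ingestion is complete.
  import boto3

  boto3.client("firehose").delete_delivery_stream(
      DeliveryStreamName="twitter-covid19-stream"
  )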

Architecture

The data pipeline has two parts:

  • Collecting Data
  • Processing Data (a sketch follows)
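
As a concrete illustration of the processing step, here is a minimal sketch of mapping a raw tweet's attributes to flat columns; the column names are assumptions, not the project's actual schema (the repository lists AWS Glue for this stage):

  # Map a raw tweet JSON object to the desired tabular columns.
  import json

  def tweet_to_row(raw_data):
      tweet = json.loads(raw_data)
      # Illustrative column mapping -- adjust to the schema you need.
      return {
          "id": tweet.get("id_str"),
          "created_at": tweet.get("created_at"),
          "user": tweet.get("user", {}).get("screen_name"),
          "text": tweet.get("text"),
      }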