The aim of this project is to fetch real-time data from the Binance API, then process and store it on AWS infrastructure. The result is an automated, scalable data pipeline that handles the collection, processing, and storage of high-volume data using AWS services.
This project utilizes the following AWS services or technologies:
- VPC
- EC2 Instance
- Lambda Function
- IAM Role
- S3 Bucket
- SQS
- DynamoDB
- NoSQL Workbench
The system architecture is designed to demonstrate real-time data handling on AWS. The following services and tools are used:
- VPC: To create an isolated network environment.
- EC2 Instance: For running the application and managing data buffers.
- Lambda Function: To process data and interact with other AWS services.
- IAM Role: To securely manage access to AWS services.
- S3 Bucket: To store the .tsv files that contain the data to be processed.
- SQS: To queue the data before it is inserted into DynamoDB.
- DynamoDB: For storing the processed data in a NoSQL database.
- NoSQL Workbench: To query and analyze the data stored in DynamoDB.
- Python: The programming language used for writing Lambda functions.
- AWS services: VPC, EC2, Lambda, IAM, S3, SQS, and DynamoDB, used together with NoSQL Workbench for comprehensive cloud-based operations and data management.
Data Storage on EC2:
- Data is fetched from the Binance API every minute, written to .tsv files, and stored on an EC2 instance.
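
A minimal sketch of this fetch-and-store step is shown below. It assumes the public Binance `/api/v3/klines` endpoint; the bucket name, trading pair, and file naming are hypothetical, and the actual script running on the EC2 instance may differ.

```python
# Minimal sketch (not the project's exact script): fetch 1-minute klines from
# Binance, write them to a TSV file, and upload the file to S3.
import csv
import time

import boto3
import requests

BUCKET = "binance-data-pipeline"   # hypothetical bucket name
SYMBOL = "BTCUSDT"                 # hypothetical trading pair


def fetch_and_upload():
    # Fetch the most recent 1-minute candlesticks from the public Binance REST API.
    resp = requests.get(
        "https://api.binance.com/api/v3/klines",
        params={"symbol": SYMBOL, "interval": "1m", "limit": 60},
        timeout=10,
    )
    resp.raise_for_status()
    klines = resp.json()

    # Write the rows to a tab-separated file on the EC2 instance.
    filename = f"{SYMBOL}_{int(time.time())}.tsv"
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["open_time", "open", "high", "low", "close", "volume"])
        for k in klines:
            writer.writerow(k[:6])

    # Upload the TSV into the data_1_min folder of the S3 bucket.
    boto3.client("s3").upload_file(filename, BUCKET, f"data_1_min/{filename}")


if __name__ == "__main__":
    fetch_and_upload()
```

On the EC2 instance, a cron entry running such a script every minute would drive the periodic fetch described above.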
Triggering Lambda Function from S3:
- When a .tsv file is uploaded to the S3 bucket from the EC2 instance, the first Lambda function is triggered.
- This Lambda function reads the first 5 rows of the .tsv file and sends each row as a message to the SQS queue.
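
A minimal sketch of this first Lambda function is shown below. It assumes the SQS queue URL is supplied through a `QUEUE_URL` environment variable (a hypothetical name) and that the function's IAM role allows `s3:GetObject` and `sqs:SendMessage`.

```python
# Minimal sketch of the first Lambda function: read the uploaded .tsv from S3
# and forward its first 5 rows to SQS. QUEUE_URL is an assumed environment variable.
import os

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")


def lambda_handler(event, context):
    # The S3 trigger supplies the bucket and key of the newly uploaded .tsv file.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Read the object and keep only the first 5 rows.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = body.splitlines()[:5]

    # Send each row as a separate message to the SQS queue.
    for row in rows:
        sqs.send_message(QueueUrl=os.environ["QUEUE_URL"], MessageBody=row)

    return {"rows_sent": len(rows)}
```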
Triggering Lambda Function from SQS:
- The second Lambda function is triggered when a new message is added to the SQS queue.
- This Lambda function reads the messages from the SQS queue and writes the data to the DynamoDB table `ProcessedData`.
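
A minimal sketch of the second Lambda function is shown below. The `ProcessedData` table name comes from the project; the column order and the choice of `open_time` as the partition key are assumptions.

```python
# Minimal sketch of the second Lambda function: write each SQS message into the
# ProcessedData table. The column order and partition key are assumptions.
import boto3

table = boto3.resource("dynamodb").Table("ProcessedData")


def lambda_handler(event, context):
    # The SQS trigger delivers one or more messages, each carrying one TSV row.
    for record in event["Records"]:
        fields = record["body"].split("\t")
        # Assumed column order: open_time, open, high, low, close, volume.
        table.put_item(Item={
            "open_time": fields[0],   # assumed partition key
            "open": fields[1],
            "high": fields[2],
            "low": fields[3],
            "close": fields[4],
            "volume": fields[5],
        })

    return {"messages_processed": len(event["Records"])}
```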
Data Verification with NoSQL Workbench:
- After all processes are completed, NoSQL Workbench is used to connect to the DynamoDB table and verify if the data has been correctly inserted.
- A counter mechanism between S3 and the first Lambda function ensures that the process only runs for three TSV files; after the third TSV file is processed, the first Lambda function stops (see the sketch after this list).
- A throttle mechanism between SQS and the second Lambda function ensures controlled data processing by gradually feeding data to the Lambda function from the SQS queue.
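
The project does not spell out where the counter lives; one possible sketch keeps it as an atomic counter item in a hypothetical `PipelineState` DynamoDB table and lets the first Lambda function exit early once three files have been handled.

```python
# Minimal sketch of the counter check inside the first Lambda function. It keeps
# an atomic counter in a hypothetical PipelineState DynamoDB table; the project
# does not specify where the counter is actually stored.
import boto3

state = boto3.resource("dynamodb").Table("PipelineState")


def should_process() -> bool:
    # Atomically increment the counter and read back the new value.
    resp = state.update_item(
        Key={"id": "tsv_counter"},
        UpdateExpression="ADD files_processed :one",
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",
    )
    count = int(resp["Attributes"]["files_processed"])
    # Process only the first three TSV files; skip everything after that.
    return count <= 3
```

The throttle between SQS and the second Lambda function is typically achieved through configuration rather than code, for example by reducing the batch size on the SQS event source mapping and setting a low reserved concurrency on the function.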
- Create a VPC.
- Create an IAM Role (EC2 to S3).
- Launch an EC2 instance.
- Create an S3 Bucket (with a folder named `data_1_min`).
- Test the first part by verifying if .tsv files are uploaded to the S3 Bucket.
- Set up SQS.
- Create the first Lambda Function.
- Add code to the first Lambda Function.
- Add an S3 Bucket trigger to the first Lambda Function.
- Attach the IAM Role to the first Lambda Function.
- Test the first Lambda Function by verifying that messages are sent to the SQS queue.
- Create the second Lambda Function.
- Add code to the second Lambda Function.
- Attach the IAM Role to the second Lambda Function.
- Add an SQS trigger to the second Lambda Function.
- Create the `ProcessedData` table in DynamoDB (see the sketch after this list).
- Use NoSQL Workbench to connect to the DynamoDB table and verify if the data is correctly inserted.
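
For the DynamoDB step above, a minimal sketch using boto3 is shown below; the single string partition key `open_time` and on-demand billing mode are assumptions.

```python
# Minimal sketch for creating the ProcessedData table with boto3. The single
# string partition key (open_time) and on-demand billing mode are assumptions.
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="ProcessedData",
    AttributeDefinitions=[{"AttributeName": "open_time", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "open_time", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)

# Wait until the table is active before writing to it.
dynamodb.get_waiter("table_exists").wait(TableName="ProcessedData")
```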
This project is licensed under the MIT License - see the LICENSE file for details.
This README file provides a comprehensive guide to setting up and running the Real-Time Automated Binance Data Processing Pipeline using AWS services.