The goal of this project is to create a simple data pipeline in Google Cloud Platform that generates random log data with the Python Faker package, sends it to Cloud Pub/Sub, transforms it in Cloud Dataflow, and stores the data in a BigQuery table.
- Pub/Sub is a messaging service that uses a Publisher-Subscriber model allowing us to ingest data in real-time.
- Dataflow is a service that simplifies creating data pipelines and automatically handles things like scaling the infrastructure, which means we can concentrate on writing the code for our pipeline.
- Cloud Storage is GCP's object storage service.
- BigQuery is a cloud data warehouse. If you are familiar with other SQL style databases then BigQuery should be pretty straightforward.
- Apache Beam allows us to create a pipeline for streaming or batch processing that integrates with GCP. It is particularly useful for parallel processing and is suited to Extract, Transform, and Load (ETL) type tasks, so if we need to move data from one place to another while performing transformations or calculations, Beam is a good choice.
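To make the Beam programming model concrete, here is a minimal, self-contained example (not part of this project's code) that builds an in-memory PCollection, applies a transform, and runs it on the local DirectRunner:

```python
import apache_beam as beam

# Runs on the local DirectRunner by default; no GCP resources are needed.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["gcp", "beam", "dataflow"])
        | "Uppercase" >> beam.Map(str.upper)
        | "Print" >> beam.Map(print)
    )
```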
- Real-time log data is published to the Pub/Sub topic via a Python script.
- Then a Dataflow job reads the data from the topic and applies some transformations to it.
- After transforming the data, Beam will then connect to BigQuery and append the data to our table.
- To carry out analysis we can connect to BigQuery using a variety of tools such as Tableau and Python.
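The streaming pipeline presumably has roughly the following shape; the topic name, parsed fields, and table schema below are placeholders rather than the project's actual code:

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_log(message):
    # Assumes each Pub/Sub message is a JSON-encoded log record; the real
    # parsing logic depends on what publish_logs.py produces.
    return json.loads(message.decode("utf-8"))


# PipelineOptions() picks up --runner, --project, --streaming, etc. from the
# command line (see the run command further below).
with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(
            topic="projects/<YOUR-PROJECT>/topics/<YOUR-TOPIC>")
        | "Parse" >> beam.Map(parse_log)
        | "Write to BigQuery" >> beam.io.WriteToBigQuery(
            "<YOUR-PROJECT>:log_data_dataset.log_data",
            # Placeholder schema; the real table columns may differ.
            schema="ip:STRING,user_agent:STRING,timestamp:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```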
The batch pipeline is very similar to the streaming one; the difference is that the data is uploaded to Cloud Storage and Dataflow reads and transforms it from there.
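A batch version mostly swaps the source: it reads the uploaded files from Cloud Storage with ReadFromText instead of subscribing to a topic (again a sketch with placeholder paths and schema):

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        # Placeholder path to the log files uploaded to the bucket.
        | "Read from GCS" >> beam.io.ReadFromText("gs://<YOUR-BUCKET>/logs/*.json")
        | "Parse" >> beam.Map(json.loads)
        | "Write to BigQuery" >> beam.io.WriteToBigQuery(
            "<YOUR-PROJECT>:log_data_dataset.log_data",
            schema="ip:STRING,user_agent:STRING,timestamp:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```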
- Upload the code base to Cloud Storage and run the following commands in the Cloud console:
# Copy file from cloud storage
gsutil cp gs://<YOUR-BUCKET>/* .
sudo pip install apache-beam[gcp] oauth2client==3.0.0
sudo pip install -U pip
sudo pip install Faker==1.0.2
# Environment variables
BUCKET=gs://<YOUR-BUCKET>  # keep the gs:// prefix so it can be used for --temp_location and --staging_location
PROJECT=<YOUR-PROJECT>
- Run the script that generates random log data and pushes it to the Cloud Pub/Sub topic (you should see log messages in the terminal):
python publish_logs.py
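For context, publish_logs.py can be imagined roughly as follows; the topic name, field names, and publish rate are assumptions rather than the actual script:

```python
import json
import time
from faker import Faker
from google.cloud import pubsub_v1

PROJECT = "<YOUR-PROJECT>"
TOPIC = "<YOUR-TOPIC>"  # placeholder: the project's actual topic name may differ

fake = Faker()
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, TOPIC)

while True:
    # Build a fake log record with Faker and publish it as a JSON-encoded message.
    record = {
        "ip": fake.ipv4(),
        "user_agent": fake.user_agent(),
        "timestamp": fake.iso8601(),
    }
    publisher.publish(topic_path, data=json.dumps(record).encode("utf-8"))
    print(f"Published: {record}")
    time.sleep(1)
```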
Topic metrics can be viewed on the topic's page in the Cloud Console.
- Run the Dataflow pipeline with the following command:
python main_pipeline_stream.py --runner DataflowRunner --project $PROJECT --temp_location $BUCKET/tmp --staging_location $BUCKET/staging --streaming
- Query the data in BigQuery:
SELECT * FROM `capstone-project-355520.log_data_dataset.log_data`
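As noted earlier, the table can also be queried from Python for analysis, for example with the google-cloud-bigquery client (a sketch mirroring the query above):

```python
from google.cloud import bigquery

client = bigquery.Client(project="capstone-project-355520")

# Run the same query as above and iterate over the resulting rows.
query = """
    SELECT *
    FROM `capstone-project-355520.log_data_dataset.log_data`
    LIMIT 100
"""
for row in client.query(query).result():
    print(dict(row))
```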