Code to create a Dataflow pipeline that reads file data from Cloud Storage, processes and transforms it, and writes the transformed data to Memorystore, Google's managed in-memory data store built on Redis. The pipeline is written in Java using the Apache Beam SDK.
Dataflow is a fully-managed service for executing pipelines within the Google Cloud Platform ecosystem. It is dedicated to transforming and enriching data in stream (real-time) and batch (historical) modes. Its serverless approach lets users focus on programming instead of managing server clusters, and it integrates with Stackdriver, which lets you monitor and troubleshoot pipelines as they are running. It also acts as a convenient integration point where TensorFlow machine learning models can be added to data processing pipelines.
Memorystore for Redis provides a fully-managed service that is powered by the Redis in-memory data store to build application caches that provide sub-millisecond data access. With Memorystore for Redis, you can easily achieve your latency and throughput targets by scaling up your Redis instances with minimal impact to your application's availability.
mvn compile exec:java \
-Dexec.mainClass=com.viveknaskar.DataFlowPipelineForMemStore \
-Dexec.args="--project=your-project-id \
--jobName=dataflow-memstore-job \
--inputFile=gs://cloud-dataflow-input-bucket/*.txt \
--redisHost=127.0.0.1 \
--stagingLocation=gs://dataflow-pipeline-batch-bucket/staging/ \
--dataflowJobFile=gs://dataflow-pipeline-batch-bucket/templates/dataflow-custom-redis-template \
--gcpTempLocation=gs://dataflow-pipeline-batch-bucket/tmp/ \
--runner=DataflowRunner"
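The transformation applied by the pipeline is not shown in this README. As a rough sketch only, assume each input line is a pipe-delimited record of the form guid|firstname|lastname|dob|postalcode. This layout is an assumption, inferred from the keys used in the verification commands. Each line could then be turned into field:value keys that all point at the record's guid. The class and method names are hypothetical, and the sketch uses plain Java rather than a Beam DoFn so that it runs on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original pipeline code.
final class RecordKeySketch {

    // Assumed layout of one input line: guid|firstname|lastname|dob|postalcode
    private static final String[] FIELDS = {"firstname", "lastname", "dob", "postalcode"};

    /** Maps each "field:value" key to the guid of the record it came from. */
    static Map<String, String> toKeyValues(String line) {
        // Limit -1 keeps trailing empty fields so the length check stays accurate.
        String[] parts = line.split("\\|", -1);
        if (parts.length != FIELDS.length + 1) {
            throw new IllegalArgumentException("Expected 5 pipe-delimited fields: " + line);
        }
        String guid = parts[0].trim();
        Map<String, String> out = new LinkedHashMap<>();
        for (int i = 0; i < FIELDS.length; i++) {
            out.put(FIELDS[i] + ":" + parts[i + 1].trim(), guid);
        }
        return out;
    }

    public static void main(String[] args) {
        // Print the Redis SADD commands that would correspond to one sample record.
        toKeyValues("guid-1|john|doe|1990-01-01|12345")
            .forEach((key, guid) -> System.out.println("SADD " + key + " " + guid));
    }
}
```

In a real pipeline, logic of this kind would sit inside a transform between the file read and the Redis write. Storing each key as a Redis set is what makes the smembers and sinter checks in the verification steps possible.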
To check whether the processed data is stored in the Redis instance after the Dataflow pipeline has run successfully, first connect to the Redis instance from a Compute Engine VM instance located within the same project, region and network as the Redis instance.
- Create a VM instance and SSH to it
- Install telnet from apt-get in the VM instance
sudo apt-get install telnet
- From the VM instance, connect to the IP address of the Redis instance
telnet instance-ip-address 6379
- Once you are connected to Redis, check the keys inserted
keys *
- Check whether the data was inserted by using the set intersection command to get the guid
sinter firstname:<firstname> lastname:<lastname> dob:<dob> postalcode:<post-code>
- Check an individual entry with the command below to get the guid
smembers firstname:<firstname>
- Clear the Redis data store (this deletes all keys)
flushall
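To illustrate why the sinter command above returns a guid, here is a minimal in-memory sketch of Redis set semantics in plain Java. It does not talk to Redis. The class name and sample values are hypothetical, and it only mimics SADD, SMEMBERS and SINTER for keys of the form field:value.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the Redis set commands used above.
final class SinterSketch {
    private final Map<String, Set<String>> store = new HashMap<>();

    /** Like SADD: adds a member to the set stored at key. */
    void sadd(String key, String member) {
        store.computeIfAbsent(key, k -> new HashSet<>()).add(member);
    }

    /** Like SMEMBERS: a missing key behaves as an empty set. Sorted here for stable output. */
    Set<String> smembers(String key) {
        return new TreeSet<>(store.getOrDefault(key, Collections.emptySet()));
    }

    /** Like SINTER: members present in every listed set. */
    Set<String> sinter(String... keys) {
        if (keys.length == 0) {
            // Real Redis rejects SINTER without keys; the sketch just returns an empty set.
            return new TreeSet<>();
        }
        Set<String> result = smembers(keys[0]);
        for (int i = 1; i < keys.length; i++) {
            result.retainAll(store.getOrDefault(keys[i], Collections.emptySet()));
        }
        return result;
    }

    public static void main(String[] args) {
        SinterSketch redis = new SinterSketch();
        // Two sample records that share a first name.
        redis.sadd("firstname:john", "guid-1");
        redis.sadd("lastname:doe", "guid-1");
        redis.sadd("firstname:john", "guid-2");
        redis.sadd("lastname:smith", "guid-2");

        System.out.println(redis.smembers("firstname:john"));               // prints [guid-1, guid-2]
        System.out.println(redis.sinter("firstname:john", "lastname:doe")); // prints [guid-1]
    }
}
```

Because every attribute key holds the set of guids that share that value, intersecting several keys narrows the result to the records that match all of the given attributes.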
https://redis.io/topics/data-types-intro