/cassandra-etl-pipeline

Losing customers is not an option. Today a huge number of devices are gathering and sending data. The benefit of using a NoSQL database is that developers don't need to maintain or adjust entities and migrations as existing products change. Companies and products move in an agile environment where requirements are constantly changing; NoSQL lets us adapt to those requirements quickly.

Primary language: Jupyter Notebook

Creating an ETL process with Apache Cassandra

Handling big data and its storage nowadays is not just feasible, it's a must.

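import pandas     # tabular handling of the collected data
import cassandra  # Python driver for Apache Cassandra (installed as cassandra-driver)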


The Business Case 💼

This application works through the following case: we have deployed a music app, and it is collecting data that is stored in a local text file. From that data we know which songs each user listens to and, at a high level, which membership they are on.
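To make the extract step concrete, here is a minimal sketch of reading such a text file into clean records. The file layout is an assumption: the README does not give the real format, so the column names `user_id`, `song`, and `level` and the sample rows below are hypothetical.

```python
import csv
import io

# Hypothetical sample of the app's local text file. The real column names
# and layout are not given in this README.
SAMPLE_LOG = """user_id,song,level
10,Yellow,free
10,Clocks,free
26,Fix You,paid
"""


def extract_events(text_file):
    """Yield one cleaned dict per listening event found in the file."""
    for row in csv.DictReader(text_file):
        # Skip rows without a song title (assumed not to be listening events).
        if not row["song"]:
            continue
        yield {
            "user_id": int(row["user_id"]),
            "song": row["song"].strip(),
            "level": row["level"].strip().lower(),
        }


# io.StringIO stands in for open("path/to/file") so the sketch runs as-is.
events = list(extract_events(io.StringIO(SAMPLE_LOG)))
```

The same function accepts a real file handle, so the in-memory sample can be swapped for the actual file once its format is known.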

// "Some of the largest production deployments include Apple's, with over 75,000 nodes storing over 10 PB of data, Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over 100 nodes, 250 TB)." https://cassandra.apache.org/

My job was to extract, transform, and load this data into a system where business teams could bring their requirements and get answers from the data.
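The load step could look roughly like the sketch below. It is not the notebook's actual code: the table name `songs_by_user`, its columns, and the `RecordingSession` stand-in are all hypothetical, chosen so the example runs without a Cassandra cluster. The only driver behavior assumed is that a session exposes `execute(query, parameters)` and that non-prepared statements use `%s` placeholders.

```python
# Illustrative table, partitioned by user so that "which songs does this
# user listen to?" is a single-partition read. Names are assumptions.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS songs_by_user (
    user_id int,
    song text,
    level text,
    PRIMARY KEY (user_id, song)
)
"""

INSERT_ROW = (
    "INSERT INTO songs_by_user (user_id, song, level) VALUES (%s, %s, %s)"
)


def load_events(session, events):
    """Create the table if needed, insert every event, return the row count."""
    session.execute(CREATE_TABLE)
    count = 0
    for event in events:
        params = (event["user_id"], event["song"], event["level"])
        session.execute(INSERT_ROW, params)
        count += 1
    return count


class RecordingSession:
    """Stand-in for a driver session: records calls instead of running them."""

    def __init__(self):
        self.calls = []

    def execute(self, query, parameters=None):
        self.calls.append((query, parameters))


session = RecordingSession()
loaded = load_events(
    session, [{"user_id": 10, "song": "Yellow", "level": "free"}]
)
```

Against a real cluster, the session would instead come from the driver, for example `Cluster(["127.0.0.1"]).connect("some_keyspace")` with `Cluster` imported from `cassandra.cluster`; the address and keyspace name here are placeholders. Note that with this primary key, a repeated listen of the same song by the same user overwrites the earlier row, which is fine for "which songs" questions but not for counting plays.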

🔗 Jupyter Notebook ETL Process

GitHub

tech: Apache Cassandra, Python, Pandas