PySpark-Real-Time-Project

End to End PySpark Real Time Project Implementation.

The project uses all the latest technologies: Spark, Python, PyCharm, HDFS, YARN, Google Cloud, AWS, Azure, Hive, and PostgreSQL.

This project aims to help you understand the business model and project flow of a USA healthcare project.

Below is the project flow and an explanation of its architecture:

(Image: project flow / data pipeline architecture diagram)

Explanation of the above data pipeline:

1- Data Ingestion: Bring raw data into the system (HDFS, Spark DataFrame) from different sources: relational databases, different file formats such as CSV, Avro, Parquet, ORC, SAS applications, etc.

2- Data Pre-Processing: Includes the operations that cleanse the raw data.

3- Transformation: The core of any data pipeline project. It refers to the operations that change the data, which may include data standardization, sorting, deduplication, validation, adding new columns, and dropping existing columns. The ultimate goal of this step is to make the data ready for analysis.

4- Storage: In this step, we persist the final transformed data to a relational database or a cloud storage service such as an S3 bucket, Azure Blob Storage, etc. (A minimal sketch of all four stages follows this list.)
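For orientation, here is a minimal sketch of the four stages chained together on a single DataFrame. The file paths and column names (`member_id`, `state`) are placeholders for illustration, not the project's actual configuration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pipeline_overview").getOrCreate()

# 1 - Data Ingestion: read raw vendor data from HDFS (hypothetical path and format)
raw_df = spark.read.csv("hdfs:///data/raw/vendor_data.csv", header=True, inferSchema=True)

# 2 - Data Pre-Processing: basic cleansing
clean_df = raw_df.dropDuplicates().na.drop(subset=["member_id"])

# 3 - Transformation: standardize values and add a derived audit column
final_df = (clean_df
            .withColumn("state", F.upper(F.trim(F.col("state"))))
            .withColumn("load_date", F.current_date()))

# 4 - Storage: persist the final data (hypothetical location)
final_df.write.mode("overwrite").parquet("hdfs:///data/curated/vendor_data/")

spark.stop()
```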

(Image: run_pipeline.py and the scripts it triggers)

  • run_pipeline.py: The pipeline script and the execution starting point. It performs a series of operations: data ingestion, data pre-processing, data transformation, data storage, and data transfer. All we need to do is execute run_pipeline.py, and that will trigger the other processes. See the picture above.
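The sketch below shows one way such a driver could be organised. The stage functions here are stubs for illustration (their real bodies are covered in the sections that follow), so treat it as an assumed structure rather than the repository's actual run_pipeline.py.

```python
# run_pipeline.py - hypothetical driver skeleton; stage bodies are stubs for illustration
import logging
import sys

from pyspark.sql import SparkSession, functions as F

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s - %(message)s")
logger = logging.getLogger("run_pipeline")


def ingest(spark):
    # Real ingestion would read vendor files from HDFS; a tiny in-memory frame stands in here.
    return spark.createDataFrame([(1, " ny "), (2, "ca")], ["member_id", "state"])


def preprocess(df):
    return df.dropDuplicates().na.drop(subset=["member_id"])


def transform(df):
    return df.withColumn("state", F.upper(F.trim(F.col("state"))))


def persist(df):
    df.show()  # real persistence (Hive, PostgreSQL, S3, ...) is covered later


def main():
    spark = SparkSession.builder.appName("healthcare_pipeline").getOrCreate()
    try:
        logger.info("Starting ingestion")
        df = ingest(spark)
        logger.info("Starting pre-processing")
        df = preprocess(df)
        logger.info("Starting transformation")
        df = transform(df)
        logger.info("Starting storage / transfer")
        persist(df)
        logger.info("Pipeline finished successfully")
    except Exception:
        logger.exception("Pipeline failed")
        sys.exit(1)
    finally:
        spark.stop()


if __name__ == "__main__":
    main()
```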

(Image: data ingestion flow)

Let's analyse the data ingestion stage. In this step, we will load the vendor data into a PySpark DataFrame.

The vendor data could be in different file formats like CSV, Parquet, or ORC. The first task consists of bringing the data to the local server; from there we copy it to HDFS, and from HDFS we load it into a PySpark DataFrame.

This is the flow of data ingestion, and we're going to implement all of these steps in a script, as sketched below.
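A sketch of that flow, assuming the vendor file has already arrived on the local server; the paths and the `hdfs dfs -put` call are illustrative and depend on your cluster setup.

```python
import subprocess

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data_ingestion").getOrCreate()

local_path = "/home/user/vendor/vendor_data.csv"   # hypothetical local landing path
hdfs_dir = "hdfs:///user/pipeline/raw/"            # hypothetical HDFS staging directory

# Step 1: copy the vendor file from the local server into HDFS (-f overwrites an existing copy)
subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir], check=True)

# Step 2: load the HDFS file into a PySpark DataFrame; the reader depends on the file format
df = spark.read.csv(hdfs_dir + "vendor_data.csv", header=True, inferSchema=True)
# For Parquet or ORC vendor files, spark.read.parquet(...) or spark.read.orc(...) would be used instead.

df.printSchema()
```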

(Image: data pre-processing flow)

The next step in the pipeline is data pre-processing, where we perform data cleansing; a pipeline can have one or more pre-processing operations.
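For example, a cleansing step might look like the sketch below; the column names are hypothetical.

```python
from pyspark.sql import functions as F


def preprocess(df):
    """Apply basic cleansing: trim strings, drop records without a key, remove duplicates."""
    return (df
            .withColumn("first_name", F.trim(F.col("first_name")))
            .na.drop(subset=["member_id"])       # discard rows missing the business key
            .na.fill({"city": "unknown"})        # fill missing optional fields
            .dropDuplicates(["member_id"]))      # keep one record per member
```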

After this, we'll apply a series of transformations to reshape the raw data into a final layout that makes it ready for analysis.
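A few representative transformations are sketched below (standardization, typed dates, a derived audit column, dropping a column, sorting); again, the column names are placeholders.

```python
from pyspark.sql import functions as F


def transform(df):
    """Reshape the cleansed data into the final layout used for analysis."""
    return (df
            .withColumn("state", F.upper(F.col("state")))                           # standardize values
            .withColumn("date_of_birth", F.to_date("date_of_birth", "yyyy-MM-dd"))  # typed dates
            .withColumn("load_date", F.current_date())                              # add an audit column
            .drop("raw_source_row")                                                 # drop an unneeded column
            .orderBy("member_id"))                                                  # sort the final output
```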

(Image: data transformation flow)

In the next step, we persist the data into a Hive table and a PostgreSQL database.
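Both writes can be done directly from the DataFrame API, roughly as below; the table names, JDBC URL, and credentials are placeholders, and the PostgreSQL JDBC driver must be on the Spark classpath.

```python
def persist(final_df):
    """Persist the transformed DataFrame to Hive and PostgreSQL (placeholder names and credentials)."""
    # Hive table (requires a SparkSession created with .enableHiveSupport())
    final_df.write.mode("overwrite").saveAsTable("healthcare_db.vendor_final")

    # PostgreSQL over JDBC
    final_df.write.jdbc(
        url="jdbc:postgresql://dbhost:5432/healthcare",   # hypothetical connection URL
        table="public.vendor_final",
        mode="append",
        properties={"user": "etl_user",
                    "password": "********",
                    "driver": "org.postgresql.Driver"},
    )
```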

(Image: persisting data to Hive and PostgreSQL)

We will also transfer the final files to the client's S3 bucket, to Azure Blob Storage, and to a Linux path.
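With the right Hadoop connectors and credentials configured (hadoop-aws for S3, hadoop-azure for Blob Storage), the same DataFrame writer can target those destinations; the bucket, container, and paths below are placeholders.

```python
import subprocess


def transfer(final_df):
    """Ship the final output to the client's S3 bucket, Azure Blob Storage, and a Linux path."""
    # S3 (requires the hadoop-aws connector and AWS credentials configured on the cluster)
    final_df.write.mode("overwrite").parquet("s3a://client-bucket/healthcare/vendor_final/")

    # Azure Blob Storage (requires the hadoop-azure connector and a storage-account key)
    final_df.write.mode("overwrite").parquet(
        "wasbs://container@storageaccount.blob.core.windows.net/healthcare/vendor_final/")

    # Local Linux path on the edge node: copy the curated files out of HDFS
    subprocess.run(["hdfs", "dfs", "-get",
                    "hdfs:///data/curated/vendor_data/", "/home/user/outbound/"],
                   check=True)
```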

(Image: transferring files to S3, Azure Blob and a Linux path)

Throughout this process, we will make sure that all the scripts have a good exception handling mechanism and a good logging mechanism.
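One common way to give every script consistent logging and error handling is a small shared helper like the sketch below; the module name, log format, and decorator are assumptions, not the project's actual utilities.

```python
# logging_utils.py - hypothetical shared helper for logging and exception handling
import functools
import logging
import sys


def get_logger(name, log_file="pipeline.log"):
    """Return a logger that writes to both the console and a log file."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        logger.setLevel(logging.INFO)
        fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s - %(message)s")
        for handler in (logging.StreamHandler(sys.stdout), logging.FileHandler(log_file)):
            handler.setFormatter(fmt)
            logger.addHandler(handler)
    return logger


def logged_step(func):
    """Log the start/end of a pipeline step and re-raise any failure after logging it."""
    logger = get_logger(func.__module__)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("Started %s", func.__name__)
        try:
            result = func(*args, **kwargs)
        except Exception:
            logger.exception("%s failed", func.__name__)
            raise
        logger.info("Finished %s", func.__name__)
        return result

    return wrapper
```

Decorating a stage function such as transform with @logged_step then gives every script the same start/finish messages and a logged stack trace on failure.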

Below is an overview of some transformations:

(Images: examples of transformation logic)