Parallel Data Processing on an HPC Cluster (On-Going Project)

This project implements a Python script that performs data processing on an HPC cluster. The script reads a large dataset, applies data transformations, and writes the processed data back to disk. The goal is to showcase my ability to use HPC resources for efficient data processing.

Steps Implemented in this project:

Prepared the Dataset:

Read the data in chunks to handle large files efficiently.
Performed a simple data transformation, such as calculating income per capita.
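
The two steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual script: the column names `income` and `population`, the default chunk size, and the function names are all assumptions made for the example.

```python
import csv
from itertools import islice


def read_in_chunks(path, chunk_size=10_000):
    """Yield lists of rows (as dicts), each holding at most chunk_size rows."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk


def add_income_per_capita(rows):
    """Add an income_per_capita field to every row in a chunk."""
    for row in rows:
        # Hypothetical column names; adjust to the real dataset.
        population = float(row["population"])
        # Leave the value empty when population is zero to avoid dividing by zero.
        row["income_per_capita"] = (
            float(row["income"]) / population if population else None
        )
    return rows


def process_file(in_path, out_path, chunk_size=10_000):
    """Transform in_path chunk by chunk and write the result to out_path."""
    writer = None
    with open(out_path, "w", newline="") as out:
        for chunk in read_in_chunks(in_path, chunk_size):
            rows = add_income_per_capita(chunk)
            if writer is None:
                # Create the writer once the output columns are known.
                writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
                writer.writeheader()
            writer.writerows(rows)
    return out_path
```

Because only one chunk is held in memory at a time, the peak memory use depends on `chunk_size` rather than on the size of the input file.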

Parallel Processing:

The script uses the 'multiprocessing' library to process multiple files in parallel.
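
One way this could look is a `multiprocessing.Pool` that hands each input file to a separate worker process. The sketch below is an assumption about the structure, not the project's actual code: the worker `process_file`, the `_processed.csv` output naming, and the `income` and `population` columns are hypothetical.

```python
import csv
from multiprocessing import Pool
from pathlib import Path


def process_file(path):
    """Worker: add income_per_capita to one CSV file and return the output path."""
    src = Path(path)
    # Hypothetical naming scheme: data.csv -> data_processed.csv
    dst = src.with_name(src.stem + "_processed.csv")
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        fieldnames = list(reader.fieldnames) + ["income_per_capita"]
        writer = csv.DictWriter(fout, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            population = float(row["population"])
            row["income_per_capita"] = (
                float(row["income"]) / population if population else None
            )
            writer.writerow(row)
    return str(dst)


def process_files_in_parallel(paths, workers=4):
    """Process each file in its own worker process; results keep input order."""
    with Pool(processes=workers) as pool:
        return pool.map(process_file, paths)
```

When running this as a script, the call to `process_files_in_parallel` should sit under an `if __name__ == "__main__":` guard so that worker processes do not re-run it on import.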

Job Submission - The SLURM job script submits the data processing job to the HPC cluster, allocating resources as specified.
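
To keep the Python side consistent with what the job script requests, the worker count can be read from the environment instead of being hard-coded. SLURM sets `SLURM_CPUS_PER_TASK` when the job requests `--cpus-per-task`; the helper name `worker_count` and the fallback value below are assumptions for illustration.

```python
import os


def worker_count(default=1):
    """Return the CPUs SLURM allocated to this task, or default outside SLURM."""
    value = os.environ.get("SLURM_CPUS_PER_TASK")
    try:
        count = int(value) if value else default
    except ValueError:
        # Fall back if the variable holds something unexpected.
        count = default
    return max(count, 1)
```

The returned value can then be passed as the number of processes for the pool, so the script uses the cores the job was actually given and still runs on a laptop where the variable is unset.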

Future Implementation

Set Up the HPC Environment

Obtain access to the HPC cluster with the necessary permissions
Load the required modules (e.g., Python, SLURM)

Test and Validate

Run jobs on the HPC cluster, monitor performance, and validate the results
Apply any necessary optimizations to improve performance

saikiranAnnam/Parallel-DP-HPC
