- Python environment used: Anaconda
- Install the required libraries with pip:
  - Pandas
  - PySpark
    - Installation guide for PySpark here
- In the same folder as supplier_car.json, run:
  - python main.py
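
Before running the pipeline, a quick check along these lines can confirm that the setup is complete. This is a minimal sketch and not part of the repository: the function name `check_setup` is hypothetical, and it assumes only that the libraries are importable as `pandas` and `pyspark` and that the input file is named supplier_car.json.

```python
import importlib.util
from pathlib import Path


def check_setup(folder=".", modules=("pandas", "pyspark"),
                input_file="supplier_car.json"):
    """Return a list of problems; an empty list means the setup looks ready.

    Hypothetical helper: the names and defaults here are assumptions.
    """
    problems = []
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing library: {name} (try: pip install {name})")
    # main.py is expected to be run from the folder that holds the input file
    if not (Path(folder) / input_file).is_file():
        problems.append(f"{input_file} not found in {Path(folder).resolve()}")
    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print(problem)
```

If the script prints nothing, both libraries were found and supplier_car.json is in the current folder.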
- Data Files :
  - supplier_car.json : initial supplier data
  - Target Data.xlsx : target data
  - pre-processing.csv : pre-processed version of the supplier data
  - normalisation.csv : normalised version of the pre-processed data
  - extraction.csv : extracted version of the normalised data
  - integration.csv : integrated version of the extracted data (this is the final form of the data)
- Python files :
  - main.py : the pipeline script to execute
  - Classes:
    - preprocessor.py : class with functions for pre-processing
    - normalizer.py : class with functions for normalisation
    - extractor.py : class with functions for extraction
    - integrator.py : class with functions for integration
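
The file descriptions suggest a four-stage flow in which each stage takes the previous stage's result and writes its own CSV. The sketch below illustrates only that chaining and is not the code in main.py. The `Stage` class, its `run` method, `write_csv`, `run_pipeline`, and the placeholder transformations are all hypothetical stand-ins for the real classes, and the sketch uses the standard `csv` module instead of Pandas or PySpark so that it runs without extra dependencies.

```python
import csv
from pathlib import Path


def write_csv(path, rows):
    """Write a list of dicts to CSV, taking the header from the first row.

    Assumes every row has the same keys as the first one.
    """
    fieldnames = list(rows[0].keys()) if rows else []
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


class Stage:
    """Hypothetical stand-in for the preprocessor/normalizer/extractor/integrator classes."""

    def __init__(self, output_name, transform):
        self.output_name = output_name  # CSV file this stage writes
        self.transform = transform      # function: list of row dicts -> list of row dicts

    def run(self, rows, folder):
        result = self.transform(rows)
        write_csv(Path(folder) / self.output_name, result)
        return result


def run_pipeline(rows, stages, folder="."):
    """Feed each stage's output into the next stage and return the final rows."""
    for stage in stages:
        rows = stage.run(rows, folder)
    return rows


# Placeholder transformations only; the real logic lives in the repository's classes.
STAGES = [
    Stage("pre-processing.csv", lambda rows: [r for r in rows if r]),
    Stage("normalisation.csv",
          lambda rows: [{k: str(v).strip() for k, v in r.items()} for r in rows]),
    Stage("extraction.csv", lambda rows: rows),
    Stage("integration.csv", lambda rows: rows),
]
```

Calling `run_pipeline(rows, STAGES)` with a list of row dictionaries would write the four intermediate files in order and return the rows that end up in integration.csv. How the rows are loaded from supplier_car.json is left out on purpose, since that depends on the layout of the file.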
- Presentation files :
  - data_flow_presentation.pdf : data flow presentation and explanations. The answer to part 5 can be found in the last 2 slides.