Data wrangling, routing, transformation, and system mediation logic.
- After the data has been collected, it has to be loaded into a remote database.
- The tables in each xlsx file, and in every workbook sheet it contains, are converted to JSON.
- Once the JSON structure is ready, the data is turned into SQL so that it can be loaded, together with its diagram (schema), by itself (i.e. automatically); a rough illustration of this flow is sketched after this list.
- Extract, transform, and load the data.
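Purely as a rough illustration of that flow (read an xlsx table, convert it to JSON records, then emit SQL), the sketch below uses pandas; the file name, sheet name, table name, and the `sheet_to_sql` helper are hypothetical and not part of this repository.

```python
import json
import pandas as pd

def sheet_to_sql(xlsx_path, sheet_name, table_name):
    """Hypothetical helper: one xlsx sheet -> JSON records -> SQL INSERTs."""
    # Read the sheet into a dataframe (collection step).
    df = pd.read_excel(xlsx_path, sheet_name=sheet_name)
    # Serialise the table as a list of JSON records, one per row.
    records = json.loads(df.to_json(orient="records"))
    # Emit one INSERT statement per JSON record.
    statements = []
    for row in records:
        cols = ", ".join(row.keys())
        vals = ", ".join("NULL" if v is None else repr(v) for v in row.values())
        statements.append(f"INSERT INTO {table_name} ({cols}) VALUES ({vals});")
    return statements

# Hypothetical file, sheet, and table names:
# sheet_to_sql("existencias-colmenas.xlsx", "Hoja1", "existencias_colmenas")
```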
In a perfect world, data is clean, structured, and all in one place. That’s every analyst’s dream! But much like most things in life, data is messy and unpredictable.
"... it has been stated that up to 80% of data analysis is spent on the process of cleaning and preparing data. However, being a prerequisite to the rest of the data analysis workflow (visualization, modeling, reporting), it’s essential that you become fluent and efficient in data wrangling techniques." — Page v, Data Wrangling with R, 2016
Several files in INE's site are messy, are not aligned for easy reading, and lack a structured header suitable for storage. In this section we explore, seeking the right method, different ways to read, transform, and load all the data contained in those Excel files; presented below are an example of an INE Excel file, the required Python libraries, and the data pipeline, which comprises three paths. The most optimal path, in terms of execution time and memory footprint, would be chosen; a small sketch of how such a comparison can be profiled follows.
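One hedged way to compare candidate readers on time and memory is a small profiling helper like the one below. The `profile` function is illustrative and not part of this repository, the file path is one of those listed further down, and pandas and xlrd are just two of the candidate readers (tracemalloc only tracks Python-level allocations).

```python
import time
import tracemalloc
import pandas as pd
import xlrd

def profile(reader, path):
    """Return (seconds, MB of peak Python allocations) for one read call."""
    tracemalloc.start()
    start = time.perf_counter()
    reader(path)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak / 1e6

path = "/content/data_income/existencias-colmenas.xlsx"  # one of the files listed below
print("pandas.read_excel :", profile(pd.read_excel, path))
print("xlrd.open_workbook:", profile(xlrd.open_workbook, path))
```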
Example data frame: messy data and unreadable fields used as headers for the database.
Within INE's data bank, several documents are stored in different formats. This library covers the process of cleaning and unifying raw, messy, and complex data sets so they are easy to access and analyse for Biosoft approaches. Its usage is shown below.
This path was built using xlrd, either in a Jupyter notebook or with an OOP approach in plain Python files.
Requirements
polars==0.15.8 numpy==1.21.6 xlrd==1.2.0
Data pipeline: the main classes, or how to instantiate the wrangling processes. Excel files are read by the FileReader class, which takes the folder of files to be processed (an illustrative stand-in for this class is sketched after the listing below).

files = FileReader('/content/data_income')

files.collect_files()

files.show_files()
1 /content/data_income/numero-upa-practicas-mejoramiento-suelo.xlsx
2 /content/data_income/numero-productores-tramos-edad.xlsx
3 /content/data_income/existencias-colmenas.xlsx
...
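The actual FileReader class lives in this repository; purely as an illustration of the behaviour used above (collecting and listing the .xlsx files in a folder), a minimal stand-in might look like the sketch below. It is a guess, not the repository's implementation.

```python
import os

class FileReader:
    """Minimal stand-in: gathers the Excel files found in a given folder."""

    def __init__(self, folder):
        self.folder = folder
        self.list_files = []

    def collect_files(self):
        # Keep only Excel workbooks, with their full paths.
        self.list_files = [
            os.path.join(self.folder, name)
            for name in sorted(os.listdir(self.folder))
            if name.endswith((".xlsx", ".xls"))
        ]

    def show_files(self):
        # Numbered listing, matching the output shown above.
        for i, path in enumerate(self.list_files, start=1):
            print(i, path)
```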
For each file, the sheet content, structure, and design are taken into account in order to specify the work zone as a dataframe (one way such a range can be detected is sketched further below).
for url_file in files.list_files:
    array_xl = Array_xl(url_file, '/content/data_outcome_ine_cl')
    array_xl.data_wrangling()
    array_xl.data_normalization()
    array_xl.default_dataframe_csv()
    # or: array_xl.custom_dataframe_csv(1)
Outcome CSV dataset
Dataframe after defining the range of action
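How Array_xl delimits that work zone is defined inside this repository; purely as an indication of one way a non-empty block of rows can be detected with xlrd (the engine pinned above), a hedged sketch follows. The `detect_work_zone` function is hypothetical, not the repository's API.

```python
import xlrd

def detect_work_zone(xlsx_path, sheet_index=0):
    """Illustrative only: find the block of non-empty rows in a sheet."""
    book = xlrd.open_workbook(xlsx_path)
    sheet = book.sheet_by_index(sheet_index)
    # Rows that contain at least one non-blank cell.
    non_empty = [
        r for r in range(sheet.nrows)
        if any(str(c.value).strip() for c in sheet.row(r))
    ]
    if not non_empty:
        return None
    first, last = non_empty[0], non_empty[-1]
    # Return the rows of the detected zone as plain Python lists.
    return [sheet.row_values(r) for r in range(first, last + 1)]
```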
This path was built using PySpark in a Jupyter notebook.
Requirements
pandas==1.3.5 numpy==1.21.6 pyspark==3.3.1 findspark==2.0.1
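With these pins, the Spark session behind this path is typically bootstrapped through findspark before any file is touched; the application name below is an arbitrary placeholder.

```python
import findspark
findspark.init()  # locate the local Spark installation before importing pyspark

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ine-data-wrangling")  # placeholder name
    .getOrCreate()
)
```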
As in the previous path, the data pipeline's main classes are instantiated to run the wrangling processes: the Excel files are read by the FileReader class, which takes the folder of files to be processed.

folder = FileReader('/content/data_income')

folder.collect_files()

folder.show_files()
1 /content/data_income/numero-upa-practicas-mejoramiento-suelo.xlsx
2 /content/data_income/numero-productores-tramos-edad.xlsx
3 /content/data_income/existencias-colmenas.xlsx
...
For each file, the sheet content, structure, and design are taken into account in order to specify the work zone as a dataframe (a PySpark-flavoured cleaning sketch is given after this block).
for xl_file in folder.list_files:
    first = spark_dataframe(path_origin=xl_file, path_export="./pyspark/")
    first.remove_nan()
    first.remove_comments()
    first.remove_empty_column()
    first.Establishing_header()
    first.export_to_csv()
Outcome CSV dataset
Preliminary dataframe after extracting the data
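The spark_dataframe class above encapsulates these steps inside the repository; as a rough, hedged indication of what remove_nan, remove_empty_column, Establishing_header, and export_to_csv tend to look like in plain PySpark, the sketch below cleans a small hypothetical frame. None of it is the repository's actual implementation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ine-cleaning-sketch").getOrCreate()

# Hypothetical raw frame standing in for one Excel sheet.
raw = spark.createDataFrame(
    [("Region", "UPA", None), ("Arica", "120", None), (None, None, None)],
    schema="_c0 string, _c1 string, _c2 string",
)

# remove_nan-style step: drop rows where every field is null.
cleaned = raw.dropna(how="all")

# remove_empty_column-style step: drop columns with no non-null values.
counts = cleaned.select([F.count(F.col(c)).alias(c) for c in cleaned.columns]).first()
cleaned = cleaned.select([c for c in cleaned.columns if counts[c] > 0])

# Establishing_header-style step: promote the first row to column names
# (simplified; assumes no duplicate rows and a deterministic first row).
header = [str(v) for v in cleaned.first()]
cleaned = cleaned.subtract(cleaned.limit(1)).toDF(*header)

# export_to_csv-style step.
cleaned.write.mode("overwrite").option("header", True).csv("./pyspark/sketch")
```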
- Getting Started with the Polars DataFrame Library
- Handling dataframe
- Read_excel
- Polars DataFrame library: from Numpy and from Pandas
- How to Convert NumPy Array to Pandas DataFrame
- Data wrangling with Pandas
- Pandas DataFrame Tutorial with Examples
- [Chapter 4] Data Wrangling with Python by Jacqueline Kazil, Katharine Jarmul
- Scraping Excel Data with Python