Data Wrangling Project — Udacity Data Analyst Nanodegree
This project is part of Udacity's Data Analyst Nanodegree. The rest of the Nanodegree projects are listed below, and I also wrote a short blog post about the course experience.
ℹ️ This project was developed in 2017 during the Nanodegree and is no longer maintained. If you'd like to see what I'm currently working on, please visit my now page.
The project was built with:
- Python 2.7
- NumPy
- SQL Database
The entire project is documented and explained in the OpenStreetMap.md file; I encourage you to start there.
Here's the file structure:
app.py: calls all the functions and executes the program. To create the .csv files and import the data into the database in the data folder, just run python app.py and the script will take care of the rest. app.py can also run the audit.py functions, but those calls are commented out by default since they don't modify the data itself.
audit.py: this is the first look at the data. It programmatically checks data validity, accuracy and other measures, and prints the results in the terminal. It does not modify the data; it only reports the issues it encounters. The script consists of two similar functions:
- audit_nodes(): checks node elements.
- audit_ways(): checks way elements.
Running both at the same time could lead to parsing errors, so it is recommended to leave one of them commented out in app.py, run the other, and swap them once the first run has finished.
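To give an idea of what a read-only audit of node elements can look like, here is a minimal sketch. It is not the code in audit.py: the function name audit_nodes_sketch, the coordinate check and the sample data are all assumptions made for illustration, and it is written for Python 3 rather than the Python 2.7 the project used.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up OSM fragment so the sketch runs without a real .osm file.
SAMPLE_OSM = b"""<?xml version="1.0" encoding="UTF-8"?>
<osm>
  <node id="1" lat="40.4168" lon="-3.7038" />
  <node id="2" lat="140.0" lon="-3.7000" />
  <node id="3" lon="-3.7100" />
  <way id="10"><nd ref="1" /><nd ref="2" /></way>
</osm>
"""


def audit_nodes_sketch(osm_file):
    """Report node elements with missing or out-of-range coordinates.

    Hypothetical stand-in for audit_nodes(): it only collects issues
    and never modifies the data.
    """
    issues = []
    # iterparse streams the file, so large extracts are not loaded at once.
    for _, elem in ET.iterparse(osm_file, events=("end",)):
        if elem.tag != "node":
            continue
        node_id = elem.get("id")
        lat, lon = elem.get("lat"), elem.get("lon")
        if lat is None or lon is None:
            issues.append((node_id, "missing coordinate"))
        elif not (-90.0 <= float(lat) <= 90.0
                  and -180.0 <= float(lon) <= 180.0):
            issues.append((node_id, "coordinate out of range"))
        elem.clear()  # drop the element's contents once it has been checked
    return issues


issues = audit_nodes_sketch(io.BytesIO(SAMPLE_OSM))
for node_id, problem in issues:
    print("node %s: %s" % (node_id, problem))
```

A way audit would follow the same pattern with elem.tag == "way" and its own checks.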
to_csv.py: reads the data from the .osm file and exports it to .csv files. During the process, it ensures the export complies with the structure dictated by schema.py. For data validity, it focuses on semantics rather than format but, unlike audit.py, to_csv.py also fixes (through fix.py) the data problems described in Part II of the OpenStreetMap.md document.
to_sql.py: once the data has been stored in .csv files, to_sql.py creates the osm.db database and the necessary tables, matching the structure described in schema.py.
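The CSV-to-database step can be sketched roughly as follows. This assumes the database is SQLite (the README only says "SQL Database") and uses a made-up three-column nodes table; the real table layout is the one defined in schema.py, and load_nodes is a hypothetical helper, not a function from to_sql.py.

```python
import csv
import io
import sqlite3

# Made-up CSV content standing in for a nodes .csv file.
NODES_CSV = u"id,lat,lon\n1,40.4168,-3.7038\n2,40.4170,-3.7000\n"


def load_nodes(conn, csv_file):
    """Create a hypothetical nodes table and fill it from a CSV file object."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nodes ("
        "id INTEGER PRIMARY KEY, lat REAL, lon REAL)"
    )
    reader = csv.DictReader(csv_file)
    # Convert the CSV strings to the column types before inserting.
    rows = [(int(r["id"]), float(r["lat"]), float(r["lon"])) for r in reader]
    conn.executemany(
        "INSERT INTO nodes (id, lat, lon) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)


# An in-memory database keeps the sketch self-contained;
# the project writes to osm.db instead.
conn = sqlite3.connect(":memory:")
inserted = load_nodes(conn, io.StringIO(NODES_CSV))
count = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
print("inserted %d rows, table holds %d" % (inserted, count))
```

Using parameterized inserts (the ? placeholders) avoids quoting problems with free-text values such as street names.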
fix.py: contains all the data wrangling functions used by to_csv.py.
compress.py: takes an .osm file as input and outputs a k-reduced version of it. k is a parameter that can be changed in the code.
schema.py: the schema describing how the data is exported from the .osm file to the database.
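As an illustration of the k-reduction done by compress.py, one plausible reading is "keep every k-th top-level element". The sketch below implements that reading only; the function name k_reduce, the default K and the sample data are assumptions, and the actual script may sample differently.

```python
import io
import xml.etree.ElementTree as ET

K = 2  # keep every K-th top-level element; hypothetical default

# Tiny made-up OSM fragment so the sketch runs without a real .osm file.
SAMPLE_OSM = b"""<osm>
  <node id="1" /><node id="2" /><node id="3" /><node id="4" />
  <way id="10"><nd ref="1" /></way>
</osm>
"""


def k_reduce(osm_file, k=K, tags=("node", "way", "relation")):
    """Return a smaller OSM document holding every k-th top-level element."""
    kept = []
    index = 0
    for _, elem in ET.iterparse(osm_file, events=("end",)):
        # Child elements such as nd and tag are not in `tags`,
        # so only top-level elements are counted.
        if elem.tag in tags:
            if index % k == 0:
                kept.append(ET.tostring(elem, encoding="unicode").strip())
            index += 1
    body = "\n".join("  " + element for element in kept)
    return "<osm>\n" + body + "\n</osm>"


reduced = k_reduce(io.BytesIO(SAMPLE_OSM))
print(reduced)
```

A larger k gives a smaller sample, which is handy for testing the rest of the pipeline quickly before running it on the full file.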