Project repository for the Semester 1 module, using Python, PostgreSQL, and MongoDB to process XML, CSV, and JSON files.
This project presents a critical analysis of traffic-related data in New York City, aiming to understand the problems caused by traffic and to suggest possible measures to curb them.
Four related datasets were analyzed to build an understanding of the traffic situation in New York City. The analysis is built on the following components:
- Jupyter notebook
- MongoDB database
- PostgreSQL database
Python was the primary language used for analyzing the datasets. Alongside Python, the following libraries and packages were used:
- Pymongo - for extracting data from the MongoDB database
- SQLAlchemy - for establishing a connection with the PostgreSQL database
- Psycopg2 - for storing data in, and extracting data from, the PostgreSQL database
- Pandas - for manipulating dataframes
- Plotly - for visualizing the analytics and results
Steps to run the code
Prerequisites: IDE - Jupyter Notebook; Python libraries - Pandas, NumPy, SQLAlchemy, Psycopg2, Datetime, Plotly, Matplotlib, Pymongo, XML; databases - MongoDB and PostgreSQL.
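Assuming a standard Python 3 environment, the third-party prerequisites can be installed with pip (package names below are the usual PyPI names; `xml` and `datetime` ship with the standard library and need no install):

```shell
# Install the third-party libraries used by the notebooks.
# psycopg2-binary is the pre-built distribution of Psycopg2.
pip install pandas numpy sqlalchemy psycopg2-binary plotly matplotlib pymongo
```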
a. For the New_York_Taxi.ipynb file
1. Download the dataset New_York_Taxi_Trips.
2. Create a database and a collection in MongoDB, and import the data using the MongoDB GUI. (Note: the New York Taxi dataset has no commas separating its records, so importing it through Python raises an error about the file content. With about 1 million records, manually adding separating commas to the JSON file was not feasible, so the MongoDB GUI was used to load the data into the database.)
3. Create a database in PostgreSQL.
4. Run the notebook cell by cell. (Since the dataset is large, running cell by cell is recommended; execution may take some time.)
5. In the 'Storing the Processed and Structured Data in PostgreSQL' section of the notebook, configure the following line to match your system:
   `connection = create_engine('postgresql://postgres:admin@localhost:5432/dap')`
   Replace `postgres` with the username of your PostgreSQL database, `admin` with its password, and `dap` with the name of the database created in Step 3.
6. After completing the configuration in Step 5, continue running the notebook cell by cell.
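The import error mentioned in Step 2 is typical of newline-delimited JSON (one object per line, with no commas between objects). If you do want to load such a file from Python instead of the MongoDB GUI, a minimal sketch is below (the field names are hypothetical, not taken from the actual taxi dataset):

```python
import json

# Each line of the file is one standalone JSON object -- parsing the whole
# file with json.load() fails because the objects are not comma-separated.
raw = '{"trip_id": 1, "fare": 9.5}\n{"trip_id": 2, "fare": 12.0}\n'

# Parse line by line instead of all at once.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]
```

The resulting list of dicts could then be inserted into the MongoDB collection with pymongo's `insert_many()`.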
b. For the XMLTrafficdata.ipynb file
1. Download the dataset New York City Traffic.
2. Create a database in PostgreSQL.
3. Run the notebook cell by cell. (Since the dataset is large, running cell by cell is recommended; execution may take some time.)
4. Configure the following line to match your system:
   `connection = create_engine('postgresql://postgres:admin@localhost:5432/dap')`
   Replace `postgres` with the username of your PostgreSQL database, `admin` with its password, and `dap` with the name of the database created in Step 2.
5. After completing the configuration in Step 4, continue running the notebook cell by cell.
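For XMLTrafficdata.ipynb, the XML records are parsed in Python before being stored in PostgreSQL. A minimal sketch of flattening XML into rows with the standard library (the element and attribute names here are hypothetical, not taken from the actual NYC traffic dataset):

```python
import xml.etree.ElementTree as ET

# A stand-in for the downloaded traffic XML file.
xml_text = """
<traffic>
  <record borough="Manhattan" speed="24.5"/>
  <record borough="Queens" speed="31.0"/>
</traffic>
"""

root = ET.fromstring(xml_text)
# Flatten each <record> element into a plain dict.
rows = [
    {"borough": rec.get("borough"), "speed": float(rec.get("speed"))}
    for rec in root.iter("record")
]
```

A list of dicts like `rows` can be turned into a pandas DataFrame and written to PostgreSQL with `DataFrame.to_sql()` once the connection is configured.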
c. For the Accidents_and_Salary_data_Final_EDA_v2.ipynb file
1. Download the datasets NYPD Accidents and Public Sector Salary.
2. Create a database in PostgreSQL.
3. Run the notebook cell by cell. (Since the dataset is large, running cell by cell is recommended; execution may take some time.)
4. Configure the following line to match your system:
   `connection = create_engine('postgresql://postgres:admin@localhost:5432/dap')`
   Replace `postgres` with the username of your PostgreSQL database, `admin` with its password, and `dap` with the name of the database created in Step 2.
5. After completing the configuration in Step 4, continue running the notebook cell by cell.
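All three notebooks use the same connection-string pattern. A sketch of building it from your own settings (the values below are the notebook defaults, not real credentials):

```python
# Fill these in for your local PostgreSQL setup.
username = "postgres"          # PostgreSQL username
password = "admin"             # PostgreSQL password
host, port = "localhost", 5432 # where the PostgreSQL server is listening
database = "dap"               # the database created in the earlier step

# Same URL format the notebooks pass to SQLAlchemy's create_engine().
connection_url = f"postgresql://{username}:{password}@{host}:{port}/{database}"
```

In each notebook, this URL is the single argument to `create_engine()`; everything else in the notebooks can then run unchanged.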