/geospatial-analysis-newyork

Geospatial data analysis: ETL, EDA and baseline modeling


NY taxi_GeospatialAnalysis

This repository contains the work I submitted to a spatial data science company as part of an application for a Junior Data Scientist role (3-9 Nov 2020).


Tasks:

The first task is ETL: NYC taxi data and census block group geometries were loaded into PostGIS.
The second task is to explore the data and build a baseline model that predicts the number of taxi pickups from the ACS dataset.

The two tasks were handled in separate Jupyter notebooks:

  1. ETL
  2. Data exploration and baseline modeling
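
As a rough illustration of the second task, the sketch below counts pickups per census block group with a spatial join and fits a simple linear baseline. It is not the code from the notebooks: the toy geometries, the column names (`geoid`, `total_pop`, `n_pickups`) and the choice of linear regression are assumptions made for the example, and the `predicate` argument assumes geopandas 0.10 or newer.

```python
import geopandas as gpd
from shapely.geometry import Point, box
from sklearn.linear_model import LinearRegression

# Toy block groups: two adjacent unit squares standing in for real geometries.
# "geoid" and "total_pop" are hypothetical column names, not the ACS schema.
block_groups = gpd.GeoDataFrame(
    {"geoid": ["bg_a", "bg_b"], "total_pop": [1200, 300]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# Toy pickup locations: three inside bg_a, one inside bg_b.
pickups = gpd.GeoDataFrame(
    geometry=[Point(0.2, 0.2), Point(0.5, 0.5), Point(0.8, 0.1), Point(1.5, 0.5)],
    crs="EPSG:4326",
)

# Assign each pickup to the block group polygon that contains it.
joined = gpd.sjoin(pickups, block_groups, how="inner", predicate="within")

# Count pickups per block group and attach the counts to the feature table.
counts = joined.groupby("geoid").size().rename("n_pickups").reset_index()
table = block_groups.drop(columns="geometry").merge(counts, on="geoid", how="left")
table["n_pickups"] = table["n_pickups"].fillna(0).astype(int)

# Baseline: ordinary least squares on a single demographic feature.
model = LinearRegression().fit(table[["total_pop"]], table["n_pickups"])
```

With real data, `pickups` would come from the taxi records and `block_groups` from the census geometries joined to the ACS table; the left merge keeps block groups with zero pickups in the training table.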

A summary report is available as a PDF.


Data

All data files are saved under the "./data" directory.

  1. NYC taxi data (Jan, Apr, July 2015) .zip
  2. ACS demographic and socio-economic data by census block group .csv
  3. NYC census block group geometries .json

Working environment

Two Docker containers, one for the PostGIS database and one for Jupyter Notebook, were created using Docker Compose.

docker-compose up

Libraries used in this project

Libraries for geospatial data processing (e.g. geopandas, GeoAlchemy2) were used in addition to the general pydata stack (pandas, numpy, sklearn, matplotlib, seaborn).
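
To show how these libraries can fit together in the ETL step, here is a minimal sketch that reads the block group geometries and writes them to PostGIS. The helper names, the table name, and the connection defaults (host, port, database) are hypothetical and are not taken from the notebooks or the compose file; `GeoDataFrame.to_postgis` relies on GeoAlchemy2 and a PostgreSQL driver such as psycopg2 being installed.

```python
import geopandas as gpd
from sqlalchemy import create_engine
from sqlalchemy.engine import URL


def build_postgis_url(user, password, host="localhost", port=5432, database="gis"):
    """Build a SQLAlchemy URL for a PostGIS database (all defaults are assumptions)."""
    return URL.create(
        "postgresql+psycopg2",
        username=user,
        password=password,
        host=host,
        port=port,
        database=database,
    )


def load_block_groups(geojson_path, url, table_name="block_groups"):
    """Read a GeoJSON file and write it to a PostGIS table; return the row count."""
    gdf = gpd.read_file(geojson_path)
    engine = create_engine(url)
    # if_exists="replace" drops and recreates the table on every run.
    gdf.to_postgis(table_name, engine, if_exists="replace", index=False)
    return len(gdf)
```

A call might look like `load_block_groups("./data/<geometries file>.json", build_postgis_url("user", "password"))`, with the file name and credentials replaced by the real ones.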


To-do list

[] Expand exploratory data analysis

[] Improve modeling