This report wrangles OpenStreetMap data for New York City. The map area covers Manhattan, New York, NY, USA. It is my submission for the Udacity Data Analyst Nanodegree project "Wrangle OpenStreetMap Data".
First, I audit the dataset to find problems that need to be fixed. Next, I use SQL queries to obtain an overview of the dataset. Last, I suggest some ideas for further improving and analyzing the dataset.
New York (Manhattan), New York, United States
I obtained a custom map extract covering the Manhattan borough of New York City through Mapzen. I chose this area because I live in Jersey City and often visit NYC on weekends. I would like to see whether I can find some interesting facts about the city I love by investigating its OpenStreetMap data.
File Size
File Name | File Size (MB) |
---|---|
NYC.osm | 477.565 |
manhattan.db | 280.372 |
nodes.csv | 169.271 |
nodes_tags.csv | 10.277 |
ways.csv | 21.063 |
ways_tags.csv | 64.331 |
ways_nodes.csv | 54.527 |
Number of Unique Users: 2163
Number of Nodes: 1884748
Number of Ways: 320048
Number of Subway Stations: 378
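Counts like the ones above can be obtained with a few SQL queries. The following is a minimal sketch, not the exact queries used for this report: it assumes `nodes` and `ways` tables that each carry a `uid` column, and it runs against a tiny in-memory stand-in rather than `manhattan.db`.

```python
import sqlite3

def overview_counts(conn):
    """Return node, way, and unique-user counts from an OSM SQLite database."""
    cur = conn.cursor()
    n_nodes = cur.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
    n_ways = cur.execute("SELECT COUNT(*) FROM ways").fetchone()[0]
    # A user may have edited both nodes and ways, so combine both
    # tables before counting distinct user ids.
    n_users = cur.execute(
        "SELECT COUNT(DISTINCT uid) FROM "
        "(SELECT uid FROM nodes UNION ALL SELECT uid FROM ways)"
    ).fetchone()[0]
    return {"nodes": n_nodes, "ways": n_ways, "unique_users": n_users}

# Tiny in-memory stand-in for the real database (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, uid INTEGER);
    CREATE TABLE ways (id INTEGER PRIMARY KEY, uid INTEGER);
    INSERT INTO nodes VALUES (1, 10), (2, 10), (3, 11);
    INSERT INTO ways VALUES (100, 11), (101, 12);
""")
counts = overview_counts(conn)
print(counts)
```

To run it on the real data, replace the in-memory connection with `sqlite3.connect("manhattan.db")`, provided the tables and columns match the assumed names.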
Top 10 cuisines
Index | Cuisine Type | Num |
---|---|---|
0 | italian | 134 |
1 | pizza | 88 |
2 | american | 85 |
3 | mexican | 79 |
4 | chinese | 62 |
5 | japanese | 46 |
6 | french | 44 |
7 | indian | 43 |
8 | thai | 43 |
9 | burger | 42 |
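A ranking like the one above could be produced by a query along the following lines. This is a sketch under assumptions: it supposes a `nodes_tags` table with `id`, `key`, `value`, and `type` columns, counts only nodes tagged `amenity=restaurant`, and uses made-up sample rows. The query actually used for the table may differ.

```python
import sqlite3

# Count cuisine tags, restricted to nodes that are also tagged as restaurants.
TOP_CUISINES_SQL = """
SELECT t.value AS cuisine, COUNT(*) AS num
FROM nodes_tags AS t
JOIN (SELECT DISTINCT id FROM nodes_tags
      WHERE key = 'amenity' AND value = 'restaurant') AS r
  ON t.id = r.id
WHERE t.key = 'cuisine'
GROUP BY t.value
ORDER BY num DESC, cuisine ASC
LIMIT ?
"""

def top_cuisines(conn, limit=10):
    """Return (cuisine, count) rows, most common first."""
    return conn.execute(TOP_CUISINES_SQL, (limit,)).fetchall()

# Made-up sample rows; node 4 is a cafe and must not be counted.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT);
    INSERT INTO nodes_tags VALUES
        (1, 'amenity', 'restaurant', 'regular'),
        (1, 'cuisine', 'italian', 'regular'),
        (2, 'amenity', 'restaurant', 'regular'),
        (2, 'cuisine', 'italian', 'regular'),
        (3, 'amenity', 'restaurant', 'regular'),
        (3, 'cuisine', 'pizza', 'regular'),
        (4, 'amenity', 'cafe', 'regular'),
        (4, 'cuisine', 'coffee_shop', 'regular');
""")
sample_top = top_cuisines(conn)
print(sample_top)
```

The subquery first selects restaurant node ids, so cuisine tags attached to other amenities are excluded from the count.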
Top 10 cafes
Index | Cafe Name | Num |
---|---|---|
0 | Starbucks | 114 |
1 | Dunkin' Donuts | 33 |
2 | Starbucks Coffee | 24 |
3 | Le Pain Quotidien | 13 |
4 | Cafe Grumpy | 4 |
5 | Dunkin Donuts | 4 |
6 | Piccolo Cafe | 4 |
7 | Pinkberry | 4 |
8 | Pret a Manger | 4 |
9 | The Coffee Bean & Tea Leaf | 4 |
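The table lists "Starbucks" and "Starbucks Coffee", as well as "Dunkin' Donuts" and "Dunkin Donuts", as separate names. As a rough illustration of how such variants might be merged before counting, here is a small sketch. The `NAME_VARIANTS` mapping is hypothetical and was not part of the cleaning done for this report; the counts are copied from the table.

```python
from collections import Counter

# Counts copied from the "Top 10 cafes" table.
cafe_counts = {
    "Starbucks": 114,
    "Dunkin' Donuts": 33,
    "Starbucks Coffee": 24,
    "Le Pain Quotidien": 13,
    "Cafe Grumpy": 4,
    "Dunkin Donuts": 4,
    "Piccolo Cafe": 4,
    "Pinkberry": 4,
    "Pret a Manger": 4,
    "The Coffee Bean & Tea Leaf": 4,
}

# Hypothetical mapping from a variant spelling to a canonical name.
NAME_VARIANTS = {
    "Starbucks Coffee": "Starbucks",
    "Dunkin Donuts": "Dunkin' Donuts",
}

def merge_variants(counts, variants):
    """Sum counts of names that map to the same canonical name."""
    merged = Counter()
    for name, num in counts.items():
        merged[variants.get(name, name)] += num
    return merged

merged = merge_variants(cafe_counts, NAME_VARIANTS)
print(merged.most_common(3))
```

With this mapping, Starbucks would total 138 and Dunkin' Donuts 37, which widens the gap between the two chains and the rest of the list.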
User Ratings
One crucial piece of information missing from the dataset is ratings of places. Adding a node tag that holds user ratings would help users answer questions such as "What are some of the best restaurants in town?" or "Which doctor in my neighborhood should I go to?".
I can think of two ways to gather this rating information:
- User contribution. This is easy to implement, but the number of actively contributing users for our OpenStreetMap data is low, so the ratings would not have a sample size large enough to be representative.
- Aggregation from other web sources. This approach could gather rating information faster and more accurately than the first. The problem is obtaining permission from those sources to use their rating data, not to mention that they are probably direct or indirect competitors of OpenStreetMap.
This analysis of the OpenStreetMap Manhattan extract helped me dig into the problems and inconsistencies of OpenStreetMap data. After cleaning the zip codes, address types, and phone numbers in this dataset, I imported it into a SQL database for further exploration. I obtained some statistics and answered some questions using SQL queries, but I also found questions that could not be answered without incorporating user ratings into the dataset.
I really liked this project. If all Udacians contribute their cleaned data and other ideas for improving the OpenStreetMap dataset, I believe it will make OpenStreetMap cleaner and more popular.