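2020/02/02 Using K-means clustering (machine learning) to analyze how Covid-19 relates to neighborhoods, local culture, and transportation infrastructure, and to decide which neighborhoods need to be locked down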

Easier-to-read version on Medium: https://medium.com/@a626854993/covid-19-confirmed-cases-neighborhoods-data-analysis-of-toronto-by-using-k-means-clustering-in-c17dd8931f12

Covid-19 Confirmed Cases & Neighborhoods Data Analysis of Toronto by using K-means Clustering in Machine Learning

Lockdown or not? That’s the question.

Introduction

Description & Discussion of the Background

We are employees of the government. This project aims to help the Toronto government decide which neighborhoods should be locked down because of Covid-19.

Many people travel into and out of Toronto, whether or not they live there. People infected with the 2019 novel coronavirus may have mild to severe respiratory illness, with symptoms of fever, cough, and shortness of breath. An infected person who lives in or travels through Toronto has a very high probability of infecting others, and this process repeats again and again. The government therefore wants to take action to stop it.

Such action will also remind people to be more aware of their own safety under these special conditions, and tell them to stay home in order to reduce the spread of the novel coronavirus.

Data Description

  1. I used a crawler to scrape the postal codes and neighborhood information from Wikipedia’s Canadian postal code page into a DataFrame.

  2. I read the CSV file of Covid-19 confirmed case counts by neighborhood from the Toronto Open Data Portal into a DataFrame.

  3. I used a CSV file of geospatial data to merge the latitude and longitude into the corresponding Postal Code entry.

  4. I used the API provided through a Foursquare developer account (Foursquare is a well-known location data provider) to request the venues within a given radius, in meters, of each neighborhood.

  5. I used Google Maps to look up the coordinates of the neighborhood centers.

  6. Because the Toronto Open Data Portal does not provide the neighborhood boundaries as a JSON file, I used the GeoJSON file provided by ag2816 on GitHub so that Folium could draw the choropleth map.

Methodology

First, I cleaned up the data scraped from Wikipedia: I dropped the rows with “nan” and “Not assigned” values, then merged the coordinates into the result.
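
The cleanup and merge can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual code: the column names (Postal Code, Borough, Neighbourhood, Latitude, Longitude) and the sample rows are assumptions made for the example.

```python
import pandas as pd

# Illustrative rows standing in for the scraped Wikipedia table
# (column names are assumptions).
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A", "M5A"],
    "Borough": ["Not assigned", "North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village", None],
})

# Illustrative rows standing in for the coordinates CSV
# (column names are assumptions).
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Latitude": [43.7533, 43.7259, 43.6543],
    "Longitude": [-79.3297, -79.3156, -79.3606],
})


def clean_and_merge(table, coordinates):
    # Keep only rows where neither field is the literal "Not assigned".
    assigned = (table["Borough"] != "Not assigned") & (
        table["Neighbourhood"] != "Not assigned"
    )
    # Drop rows that still have missing (NaN/None) values.
    cleaned = table[assigned].dropna(subset=["Borough", "Neighbourhood"])
    # Attach latitude and longitude by postal code.
    merged = cleaned.merge(coordinates, on="Postal Code", how="inner")
    return merged.reset_index(drop=True)


neighborhoods = clean_and_merge(raw, coords)
print(neighborhoods)
```

An inner merge is used here so that only postal codes with known coordinates survive; a left merge would keep the others with empty coordinates instead.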

Next, I trimmed Toronto Open Data’s Covid-19 confirmed cases CSV file, discarding the data that were less important for this project.
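
One way to reduce a case-level file to a count per neighborhood is sketched below. It assumes one row per case and hypothetical column names (Neighbourhood Name, Age Group, Classification); the real file may be laid out differently, and the inline rows are made up for the example.

```python
import io

import pandas as pd

# Tiny inline stand-in for the case-level CSV; the column names
# and rows are assumptions, not real data.
sample_csv = io.StringIO(
    "Neighbourhood Name,Age Group,Classification\n"
    "Parkwoods,20 to 29 Years,CONFIRMED\n"
    "Parkwoods,50 to 59 Years,CONFIRMED\n"
    "Victoria Village,30 to 39 Years,PROBABLE\n"
    "Victoria Village,40 to 49 Years,CONFIRMED\n"
)

cases = pd.read_csv(sample_csv)

# Keep confirmed cases only, then count them per neighborhood.
confirmed = cases[cases["Classification"] == "CONFIRMED"]
case_counts = (
    confirmed.groupby("Neighbourhood Name")
    .size()
    .reset_index(name="Cases")
)
print(case_counts)
```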

Here is the result:

DataFrame: Covid-19 cases included

I created the map using the Folium library and the coordinate values.

Neighborhoods in City Of Toronto

Next, I asked the Foursquare server to return the venues within a 450 m radius of each neighborhood.
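
The request for one neighborhood might be built as in the sketch below. It assumes Foursquare's legacy v2 "explore" endpoint and its usual query parameters; the function name, the credentials, and the version date are placeholders, so check them against the current Foursquare documentation before use.

```python
from urllib.parse import urlencode

# Assumed endpoint: Foursquare's legacy v2 "explore" route.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      radius=450, limit=100, version="20200601"):
    """Return the request URL for venues within `radius` meters of a point."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,          # API version date (YYYYMMDD), placeholder value
        "ll": f"{lat},{lng}",  # latitude,longitude of the neighborhood
        "radius": radius,      # search radius in meters
        "limit": limit,        # maximum number of venues returned
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


url = build_explore_url(43.7533, -79.3297, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

The resulting URL can then be fetched with an HTTP client such as `requests.get(url).json()`, once per neighborhood.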

Venues returned

Clustering

Now for the best part of this project: let’s cluster the data we just collected and cleaned, using k-means.

Why clustering? Clustering is a family of unsupervised machine learning methods. Unsupervised machine learning can be very powerful in its own right, and clustering is by far its most common form.

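We want the optimal k for our model, so I used the elbow method and plotted the result to find the best k. The elbow method fits k-means for a range of k values and looks for the point where adding more clusters stops reducing the within-cluster error by much.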
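
The elbow method can be sketched with scikit-learn as follows. The feature matrix here is synthetic (three well-separated blobs) and only stands in for the per-neighborhood venue features; the range of k is also an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic feature matrix standing in for the real neighborhood features.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(30, 2))
    for center in ([0, 0], [4, 4], [0, 4])
])

# Fit k-means for each k and record the inertia
# (within-cluster sum of squared distances).
k_values = list(range(1, 9))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(features).inertia_
    for k in k_values
]

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 2))
```

Plotting `inertias` against `k_values` (for example with matplotlib) gives the elbow curve. For this synthetic data the bend is at k = 3 by construction; on real data the bend is usually less sharp and choosing k involves some judgment.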

Next, we can show the most common venues of every cluster in a single plot, so we don’t need to examine the results one by one.

Here it is! Because I set the number of clusters to 8, the plot contains a lot of data, which is just enough for our analysis.
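
A common way to get from venue rows to clusters and their most common venues is sketched below: one-hot encode the venue categories, average them per neighborhood, and run k-means on the result. Whether the project used exactly this encoding is an assumption, and the column names and rows are made up for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Illustrative venue rows (assumed column names and values).
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C", "C", "D", "D"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Restaurant",
                       "Coffee Shop", "Restaurant",
                       "Park", "Trail", "Park", "Park"],
})

# One-hot encode the categories, then average per neighborhood to get
# the relative frequency of each category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot["Neighborhood"] = venues["Neighborhood"]
grouped = onehot.groupby("Neighborhood").mean()

# The project used k = 8; k = 2 is used here only because the sample is tiny.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(grouped)

# Cluster label and most frequent venue category for each neighborhood.
summary = pd.DataFrame({
    "Cluster": kmeans.labels_,
    "Most Common Venue": grouped.idxmax(axis=1),
}, index=grouped.index)
print(summary)
```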

Clearly, about 85% of the venues fall in cluster 2.

Notice the sky-blue bar: it contains Trail, Restaurant, and Coffee Shop.

We predicted that the neighborhoods in cluster 2 would have a higher number of confirmed cases, because almost all of the crowded places are there.

Result

As the final step, we analyze a map that combines two data sets: the clustered data and the Covid-19 confirmed cases in each neighborhood.

Map for clustered data

The plot above shows the clustered data, pinned in a different color for each cluster.

Let’s take a look at the combined map we just talked about.

The choropleth map generated from the Covid-19 cases

Cluster 2 is colored blue. According to the choropleth map, the higher case numbers occurred where most of the cluster 2 dots are.

Discussion & Conclusion

Our prediction is on the right track, because the virus spreads faster in places full of people, such as coffee shops and restaurants. Furthermore, most of the neighborhoods in cluster 2 are crossed by railways or tracks.

Notice the top-left corner of the map: the crimson-colored region there has the highest confirmed case count, about 884 cases.

Geographically, it is close to the international airport and the highways, which makes it a major traffic artery, so many people who were infected abroad may have brought the virus back with them when returning home.

In conclusion, I suggest that the neighborhoods crossed by or adjacent to airports, railroads, and even highways should be locked down, even though this would have a huge economic impact.

References

  1. https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

  2. https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/e5bf35bc-e681-43da-b2ce-0242d00922ad?format=csv

  3. http://cocl.us/Geospatial_data

  4. https://github.com/ag2816/Visualizations/blob/master/data/Toronto2.geojson