In collaboration with Diego Mamanche Castellanos
The City of Toronto consists of 140 neighbourhoods. While each neighbourhood has its own strengths and vitality, some neighbourhoods share similar socioeconomic traits. In this study, we identify neighbourhoods with disadvantaged socioeconomic status through hierarchical clustering. Three features (low-income rate, lack of education, and racial diversity) are used to cluster the neighbourhoods into two groups. There are 34 neighbourhoods in group 1, while the other 106 are assigned to group 2. The 34 neighbourhoods in group 1 have a higher percentage of residents who have low-income status, hold no educational certification, and are visible minorities; we therefore regard the neighbourhoods in group 1 as more disadvantaged. The clusters are plotted as a choropleth map, which allows us to examine the spatial relationship between disadvantaged neighbourhoods. We find that the disadvantaged neighbourhoods not only share similar socioeconomic characteristics but are also geographically close to each other.
To address the research question, we use Neighbourhood Profiles, a comprehensive dataset that the City of Toronto offers through the Open Toronto Data Portal. The dataset contains several categories divided by topic, which in turn are broken down into individual characteristics, for a total of 2383 rows. The neighbourhoods are displayed as columns. For this analysis, the variables "no certificate, diploma, or degree 2", "18 to 64 years %", and "total visible minority population" feed into the model. They are renamed to "no certificate," "low income," and "visible minority," respectively.
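Because the source table stores characteristics as rows and neighbourhoods as columns, it has to be flipped before clustering so that each neighbourhood becomes one record with three features. The following is a minimal sketch of that reshaping in plain Python. The neighbourhood names, the `Characteristic` column name, the numeric values, and the helper `to_feature_table` are all hypothetical placeholders, and the mapping of "18 to 64 years %" to "low income" is an assumption based on the features named in the study.

```python
# Toy stand-in for the Neighbourhood Profiles table: one row per
# characteristic, one column per neighbourhood. All values are made up.
rows = [
    {"Characteristic": "no certificate, diploma, or degree 2",
     "Neighbourhood A": 12.0, "Neighbourhood B": 25.0},
    {"Characteristic": "18 to 64 years %",
     "Neighbourhood A": 9.5, "Neighbourhood B": 21.0},
    {"Characteristic": "total visible minority population",
     "Neighbourhood A": 30.0, "Neighbourhood B": 70.0},
    {"Characteristic": "some unrelated characteristic",
     "Neighbourhood A": 1.0, "Neighbourhood B": 2.0},
]

# Assumed mapping from source row labels to the short feature names.
RENAME = {
    "no certificate, diploma, or degree 2": "no certificate",
    "18 to 64 years %": "low income",
    "total visible minority population": "visible minority",
}


def to_feature_table(rows, rename, label_key="Characteristic"):
    """Keep only the rows listed in `rename` and transpose them, so the
    result maps each neighbourhood to a {feature name: value} dict."""
    features = {}
    for row in rows:
        label = row[label_key]
        if label not in rename:
            continue  # skip characteristics the model does not use
        for neighbourhood, value in row.items():
            if neighbourhood == label_key:
                continue
            features.setdefault(neighbourhood, {})[rename[label]] = float(value)
    return features


features = to_feature_table(rows, RENAME)
for name, values in features.items():
    print(name, values)
```

With the real dataset, the same select, rename, and transpose steps would apply to all 140 neighbourhood columns; a dataframe library would do this more concisely, but the logic is the same.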
Identifying the most disadvantaged neighbourhoods in the City of Toronto is a clustering problem rather than a supervised classification task, since no neighbourhood carries a pre-assigned label. The goal is to group the neighbourhoods into clusters based on how similar they are. To achieve this, we use hierarchical clustering, an unsupervised machine learning technique that groups objects with similar characteristics into nested subsets.
We evaluated several approaches. K-means, density-based clustering, and hierarchical clustering were all tested on the dataset, and the last proved the most suitable for this investigation. Hierarchical clustering lets us view every possible number of clusters (k) at once through the dendrogram, a tree representation of those clusters. Moreover, the hierarchical method does not require k to be specified in advance. Leaving k undetermined is important because the purpose of this study is to find the most disadvantaged neighbourhoods without presupposing how many groups there should be. Lastly, the elbow curve provides an impartial way to choose the best number of clusters (k).
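To make the idea concrete, here is a small from-scratch sketch of agglomerative (bottom-up) hierarchical clustering. It is not the study's actual code: the linkage rule (average linkage), the five toy points, and the helper names `agglomerate`, `cut`, and `largest_gap_k` are assumptions for illustration. The merge heights it records are what a dendrogram draws, and the "largest jump in merge height" rule is only a rough stand-in for reading the elbow curve.

```python
from itertools import combinations
from math import dist


def agglomerate(points):
    """Average-linkage agglomerative clustering.

    Returns the merges in the order performed, each as
    (height, members_of_a, members_of_b)."""
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the two clusters with the smallest average pairwise distance.
        for a, b in combinations(clusters, 2):
            total = sum(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
            d = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((d, sorted(clusters[a]), sorted(clusters[b])))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


def cut(merges, n_points, k):
    """Replay merges until k clusters remain; return one label per point."""
    labels = list(range(n_points))
    for _, a, b in merges[: n_points - k]:
        target = labels[a[0]]
        for i in b:
            labels[i] = target
    return labels


def largest_gap_k(merges, n_points):
    """Pick k by stopping just before the biggest jump in merge height."""
    heights = [h for h, _, _ in merges]
    gaps = [heights[i + 1] - heights[i] for i in range(len(heights) - 1)]
    i = gaps.index(max(gaps))
    return n_points - (i + 1)


# Made-up (low income %, no certificate %, visible minority %) values.
points = [(10, 8, 20), (12, 9, 25), (11, 7, 22), (30, 25, 70), (28, 27, 65)]

merges = agglomerate(points)
k = largest_gap_k(merges, len(points))
labels = cut(merges, len(points), k)
print("suggested k:", k)
print("labels:", labels)
```

Note that k is never passed to `agglomerate`: the full merge history is built once, and any number of clusters can be read off afterwards with `cut`. In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would typically replace the hand-written loop.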