Kuala Lumpur Neighborhood Analysis Using a Machine Learning Clustering Algorithm

In this project we will compare different neighborhoods of Kuala Lumpur based on a real estate dataset (property prices, types, sizes) and the venues around each neighborhood.

Introduction

Kuala Lumpur, the capital of Malaysia, is home to over 7 million people and is split into more than 40 districts. Each neighborhood has its own features and character. Some are popular for their parks and tourist attractions, while others are dense urban neighborhoods full of office buildings and skyscrapers. Some neighborhoods are populated mostly by local Chinese residents, and others have a large expat community. There are industrial neighborhoods with housing for low-income foreign workers, and areas such as Damansara Heights, home to celebrities, politicians and the wealthy in general.

In this project we will compare different neighborhoods of Kuala Lumpur based on property prices, property types and the venues around each neighborhood, using machine learning clustering algorithms.

Business Problem

There are many reasons people relocate within city boundaries, e.g. a new job offer, children moving to a different school, or leaving a bad neighborhood. This project helps such people compare the neighborhoods of Kuala Lumpur in terms of property types, prices, sizes and, most importantly, the venues around each neighborhood. If you're looking for a better neighborhood with more parks, coffee shops or restaurants but with similar real estate prices, this is the place for you.

Dataset

We will use the dataset created by Jan S, available on Kaggle. It contains tens of thousands of property listings scraped from iproperty.com, covering every neighborhood of Kuala Lumpur. For each listed property it includes the number of rooms, bathrooms and parking slots, the furnishing condition, the property type and size, as well as the price per room and price per area.

For venues around each neighborhood we will gather data from Foursquare.com.

Methodology

We will divide the project into six steps:

Step 1 is data wrangling. The real estate dataset has many missing values: some properties are missing the number of rooms, others have no furnishing or size data. Rather than filling in missing values, we simply remove the rows with NaN values, which still leaves a large dataset of over 31000 property listings.
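As a minimal sketch of this step, assuming the listings are loaded into a pandas DataFrame: the column names and values below are made up for illustration and are not the actual schema of the Kaggle file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy listings; column names and values are illustrative only.
listings = pd.DataFrame(
    {
        "Location": ["Area A", "Area B", "Area C", "Area D"],
        "Rooms": [3, np.nan, 2, 4],
        "Furnishing": ["Fully Furnished", "Unfurnished", None, "Partly Furnished"],
        "Size_sqft": [1500.0, 900.0, 1100.0, np.nan],
    }
)

# Drop every row that has at least one missing value instead of imputing.
clean = listings.dropna().reset_index(drop=True)

print(f"kept {len(clean)} of {len(listings)} rows")
```

Dropping rows is the simplest option and is reasonable here because plenty of listings remain afterwards.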

Step 2 is getting the latitude and longitude coordinates of each neighborhood. We will use the geocoders from the Geopy library to do so.
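The sketch below shows one way this lookup could be organised. The text does not say which Geopy geocoder is used, so Nominatim is an assumption here, as are the helper names, the user agent string and the ", Kuala Lumpur, Malaysia" query suffix.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coords = Tuple[float, float]


def geocode_neighborhoods(
    names: Iterable[str],
    lookup: Callable[[str], Optional[Coords]],
    suffix: str = ", Kuala Lumpur, Malaysia",  # assumed query suffix
) -> Dict[str, Optional[Coords]]:
    """Map each neighborhood name to (latitude, longitude), or None if not found."""
    return {name: lookup(name + suffix) for name in names}


def nominatim_lookup(
    user_agent: str = "kl-neighborhoods-example",  # hypothetical user agent
) -> Callable[[str], Optional[Coords]]:
    """Build a lookup backed by geopy's Nominatim geocoder (needs network access)."""
    # Imported lazily so the rest of the sketch runs without geopy installed.
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    # Space requests at least one second apart to stay polite to the service.
    geocode = RateLimiter(Nominatim(user_agent=user_agent).geocode, min_delay_seconds=1)

    def lookup(query: str) -> Optional[Coords]:
        location = geocode(query)  # None when the query has no match
        return (location.latitude, location.longitude) if location else None

    return lookup
```

With network access, a call such as geocode_neighborhoods(["Bukit Bintang", "Segambut"], nominatim_lookup()) would return a dictionary of coordinates. Neighborhoods that come back as None need to be handled by hand.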

Step 3 is data analysis and clustering based on the real estate data. We will use one-hot encoding to convert categorical data, such as property type and furnishing condition, into numerical form. Next we will run the K-means clustering algorithm. After a few experiments we decided to set K=5.
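A rough sketch of this step with pandas and scikit-learn follows. The toy data, the column names and the choice to average listings per neighborhood before clustering are assumptions made for illustration. Only the one-hot encoding and K=5 come from the description above.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy listings; names and numbers are illustrative only.
listings = pd.DataFrame(
    {
        "Location": ["Area A", "Area A", "Area B", "Area B",
                     "Area C", "Area D", "Area E", "Area F"],
        "Property_Type": ["Condominium", "Serviced Residence", "Bungalow", "Bungalow",
                          "Condominium", "Terrace", "Terrace", "Condominium"],
        "Furnishing": ["Fully", "Partly", "Fully", "Fully",
                       "Unfurnished", "Partly", "Unfurnished", "Fully"],
        "Rooms": [3, 4, 6, 6, 2, 3, 2, 4],
        "Size_sqft": [1500, 2000, 9000, 9500, 900, 1800, 1000, 3200],
        "Price_per_room_k": [450, 600, 1500, 1600, 150, 300, 120, 700],
    }
)

# One-hot encode the categorical columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(listings, columns=["Property_Type", "Furnishing"], dtype=float)

# Average per neighborhood: each dummy column becomes the share of that category.
profiles = encoded.groupby("Location").mean()

# K = 5 as in the text; n_init and random_state are fixed only for reproducibility.
# Features are left unscaled here; whether the original scales them is not stated.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
profiles["Cluster"] = kmeans.fit_predict(profiles)

print(profiles["Cluster"])
```

Each row of profiles is one neighborhood, so the resulting labels can be joined with the coordinates from Step 2 and drawn on a map.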

Step 4 is collecting data about each neighborhood using the Foursquare API.

We will collect up to 100 venues around each neighborhood based on its latitude and longitude.

Next we will select the top 10 venues for each neighborhood and apply one-hot encoding.
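The encoding part of this step can be sketched in plain Python. The sketch assumes the venues have already been fetched and reduced to (neighborhood, category) pairs, and the Foursquare request itself is not shown. It encodes first and ranks afterwards, which is an assumption about the order of operations. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def venue_profiles(
    venues: Iterable[Tuple[str, str]],
) -> Tuple[List[str], Dict[str, List[float]]]:
    """One-hot encode (neighborhood, category) pairs and average per neighborhood.

    Returns the sorted category list and, for each neighborhood, the share of
    its venues that fall into each category.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    categories = sorted({c for counter in counts.values() for c in counter})
    profiles = {
        name: [counter[c] / sum(counter.values()) for c in categories]
        for name, counter in counts.items()
    }
    return categories, profiles


def top_categories(
    categories: Sequence[str], profile: Sequence[float], n: int = 10
) -> List[str]:
    """Return up to n categories with the highest share, ties broken by name."""
    ranked = sorted(zip(categories, profile), key=lambda pair: (-pair[1], pair[0]))
    return [category for category, share in ranked[:n] if share > 0]
```

For example, top_categories(categories, profiles["Segambut"]) would list the most common venue categories for Segambut, provided that name appears in the collected pairs. The per-neighborhood vectors in profiles are the numeric input for the clustering in the next step.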

Step 5 is clustering based on neighborhood venues. Here we will run the K-means clustering algorithm on the venues dataset, using the same number of clusters as for the real estate clustering.

Step 6 is comparing the two clustering results to see how the neighborhoods relate in terms of both the real estate dataset and the venues around them.
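One simple way to compare two clusterings is a cross-tabulation of their labels, sketched below. The function name and the dictionary inputs are assumptions. Note that cluster numbers from two separate K-means runs are arbitrary, so the table counts co-occurrence rather than assuming that cluster 0 in one result corresponds to cluster 0 in the other.

```python
from collections import Counter
from typing import Dict, Tuple


def cross_tabulate(
    property_cluster: Dict[str, int],
    venue_cluster: Dict[str, int],
) -> Dict[Tuple[int, int], int]:
    """Count neighborhoods for each (property cluster, venue cluster) pair."""
    # Only neighborhoods present in both results can be compared.
    shared = property_cluster.keys() & venue_cluster.keys()
    return dict(Counter((property_cluster[n], venue_cluster[n]) for n in shared))


# Hypothetical labels, for illustration only.
property_labels = {"Area A": 0, "Area B": 0, "Area C": 2, "Area D": 2}
venue_labels = {"Area A": 1, "Area B": 1, "Area C": 2, "Area D": 1}
table = cross_tabulate(property_labels, venue_labels)
print(table)
```

Large counts concentrated in a few cells suggest that the two clusterings agree. Counts spread evenly across cells suggest that they capture different aspects of a neighborhood.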

Results

Real estate based clustering

Let's analyse the clustering results based on the real estate dataset:

We have 5 clusters. Let's go through them one by one.

Cluster 0 (red circles) is characterized by expensive properties with 3-4 bedrooms and sizes from 1500 to 4000 sqft. Most of the properties in this cluster are serviced residences and condominiums. Prices are rather high, with the price per room ranging from RM400K all the way up to RM700K. The neighborhoods in this cluster are mostly located close to downtown KL, such as KLCC, Bukit Bintang, KL Sentral and Seputeh.

Cluster 1 (purple circle) has only a single member: Country Heights Damansara. This is a luxury neighborhood of luxury villas (over 80% of the properties here are bungalows/villas). It is home to celebrities, politicians and the wealthy in general. Properties in this area are very large, with 6 rooms, 6 bathrooms, 5 parking spots and over 9000 sqft on average, and prices are sky-high (over RM1.5 million per room).

Cluster 2 (light blue circles) is the biggest cluster and includes the neighborhoods with properties for low to middle income families. These residential neighborhoods are scattered all around the city. The properties in this cluster range from 900 to 2000 sqft in size (2-3 rooms), and the price per room varies from RM100K to RM400K.

Cluster 3 (light green circles) includes 2 expensive neighborhoods with large properties (over 5000 sqft) and high prices (over RM900K per room).

Cluster 4 (yellow circles) includes three neighborhoods in the upper income category. Most of the properties here are condominiums and serviced residences, with an average size of over 3000 sqft. The price per room in this cluster is around RM700K on average.

Venues based clustering

Next we cluster the neighborhoods based on the venues dataset obtained from Foursquare.com.

First, let's look at the top 10 venues for different neighborhoods:

Let's see the clustering results.

Cluster 0 (red circle) consists of a single neighborhood (Segambut). The distinctive feature of this area is the lack of restaurants among its top 10 venues.

Cluster 1 (purple circles) includes tourist hotspots and crowded areas. These are very diverse neighborhoods with a lot of cafes, restaurants, coffee shops, department stores, shopping malls, entertainment centers etc.

Cluster 2 (blue circles) is mostly residential areas characterised by a lot of eateries, food trucks, food courts and coffee shops.

Clusters 3 and 4 (green and yellow circles, respectively) are neighborhoods with art galleries, yoga studios, pools and hiking trails.

Real-estate vs Venues

Next, let's overlay the two maps and see how the clustering based on the real estate dataset matches the clustering based on the venues dataset:

In this figure, outer circles indicate the clustering results based on property, and inner circles are for clustering based on venues.

We can see that neighborhoods with similar real estate are more or less similar in terms of venues as well. However, there are options for anyone who is planning to relocate from one neighborhood to another. If, for whatever reason, you're planning to move to a similar neighborhood (in terms of both property and venues), consider the circles with matching inner and outer colors.

Someone who wants a change of atmosphere can target the circles with a different inner color and the same outer one.

Someone who wants to upgrade their living conditions but likes their neighborhood can choose the circles with the same inner color but a different outer one.
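The three cases above can be expressed as a small lookup, sketched here under one reading of the rules: colors are compared against those of the reader's current neighborhood. The function name, the result keys and the toy labels are all hypothetical.

```python
from typing import Dict, List


def relocation_candidates(
    home: str,
    property_cluster: Dict[str, int],
    venue_cluster: Dict[str, int],
) -> Dict[str, List[str]]:
    """Group other neighborhoods by how their clusters relate to those of `home`."""
    result: Dict[str, List[str]] = {
        "similar": [],            # same property cluster, same venue cluster
        "new_atmosphere": [],     # same property cluster, different venue cluster
        "new_property_tier": [],  # different property cluster, same venue cluster
    }
    for name in sorted(property_cluster):
        if name == home or name not in venue_cluster:
            continue
        same_property = property_cluster[name] == property_cluster[home]
        same_venues = venue_cluster[name] == venue_cluster[home]
        if same_property and same_venues:
            result["similar"].append(name)
        elif same_property:
            result["new_atmosphere"].append(name)
        elif same_venues:
            result["new_property_tier"].append(name)
    return result


# Hypothetical labels, for illustration only.
property_labels = {"Home": 2, "Area A": 2, "Area B": 2, "Area C": 0, "Area D": 4}
venue_labels = {"Home": 1, "Area A": 1, "Area B": 3, "Area C": 1, "Area D": 2}
print(relocation_candidates("Home", property_labels, venue_labels))
```

Neighborhoods that differ from the current one in both clusterings, such as "Area D" in the toy data, are left out of all three groups.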

Conclusion

We can conclude that Kuala Lumpur is a diverse metropolis with diverse neighborhoods, and relocating within the city can be confusing without proper information about each neighborhood. In this project we tried to analyse the neighborhoods based on real estate data and venues. We believe our results can help people better navigate the real estate market and find the neighborhood that suits them best.

We can further improve the clustering results by including more datasets, such as crime rates, nearby schools, hospitals etc.

anvarnarz/ML-clustering-kl-properties-vs-venues