- TP065783: Khaled Awad
- TP064361: Abdelrahman Mourad
- TP066168: Mohamed Khairy
Welcome to the House Rent Data Analysis and Prediction project! This project delves into analyzing a comprehensive dataset of rental housing costs. With 4,746 observations across 12 columns, our goal is to identify patterns and relationships in the data, which include factors like rent, area type, city, size, and furnishing status. We also aim to use predictive analysis to provide insights into the rental market.
- 🔧 Installing Packages
- 📦 Loading Libraries
- 📂 Data Loading and Pre-processing
- 📊 Analysis and Visualizations
- ✨ Additional Features
- 📌 Conclusion
- 📖 References
To perform the analysis and visualizations, you need to install the following R packages:
install.packages("dplyr")
install.packages("ggplot2")
install.packages("corrplot")
install.packages("plotly")
install.packages("tidyr")
install.packages("tidyverse")
install.packages("caTools")
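If some of these packages are already installed, the list can be narrowed to the missing ones before installing. The following is a minimal sketch under that assumption; `required`, `installed` and `to_install` are illustrative names, and the install call is left commented out so that nothing is downloaded unintentionally.

```r
# Packages listed in this README
required <- c("dplyr", "ggplot2", "corrplot", "plotly",
              "tidyr", "tidyverse", "caTools")

# Names of packages already available in the current library paths
installed <- rownames(installed.packages())

# Only the packages that still need installing
to_install <- setdiff(required, installed)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Note that `tidyverse` already brings in `dplyr`, `ggplot2` and `tidyr`, so installing them separately is harmless but redundant.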
Load the necessary libraries to utilize various functions for data manipulation and visualization:
library(dplyr)
library(ggplot2)
library(corrplot)
library(plotly)
library(tidyr)
library(tidyverse)
library(caTools)
We load the house rent dataset into R for our analysis:
data <- read.csv("path_to_dataset/House_Rent_Dataset.csv")
head(data)
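The column names used throughout this README (`Area.Type`, `Point.of.Contact`, and so on) contain dots. A likely reason, assuming the CSV header uses spaces, is that `read.csv()` converts headers into syntactic R names by default. The sketch below demonstrates this on a tiny inline CSV with made-up values; `csv_text` and `sample_data` are illustrative names, not part of the project.

```r
# Tiny inline CSV standing in for the real file (values are made up)
csv_text <- "BHK,Rent,Area Type,Point of Contact
2,10000,Super Area,Contact Owner
3,25000,Carpet Area,Contact Agent"

sample_data <- read.csv(text = csv_text)

# Spaces in the header become dots because check.names = TRUE by default
print(names(sample_data))

# Quick structural checks that are also worth running on the real data
print(dim(sample_data))
str(sample_data)
```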
Ensuring data quality is crucial, so we start by checking for missing values in the dataset:
colSums(is.na(data))
In this dataset, there are no missing values.
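One caveat with `is.na()` is that empty strings in text columns are not counted as missing. The sketch below runs both checks on a small made-up data frame; `toy`, `na_counts` and `blank_counts` are illustrative names.

```r
# Made-up data: one NA in a numeric column, one blank in a text column
toy <- data.frame(
  Rent = c(10000, NA, 15000),
  City = c("Delhi", "", "Mumbai"),
  stringsAsFactors = FALSE
)

# Same check as above: NA values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Additional check: empty strings per column, ignoring NA entries
blank_counts <- sapply(toy, function(col) sum(!is.na(col) & col == ""))
print(blank_counts)
```

If both vectors are all zeros on the real dataset, the "no missing values" conclusion also covers blank text fields.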
We look for inconsistencies in categorical columns to ensure data integrity:
unique(data$Area.Type)
unique(data$City)
unique(data$Furnishing.Status)
unique(data$Tenant.Preferred)
unique(data$Point.of.Contact)
No garbage values were found in this dataset.
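Reading the output of `unique()` by eye works for short lists, but near-duplicates that differ only in case or whitespace are easy to miss. One possible check, sketched here on made-up labels, is to compare the number of distinct values before and after normalising them; `city`, `normalised`, `n_raw` and `n_norm` are illustrative names.

```r
# Made-up labels containing a case/whitespace variant of "Mumbai"
city <- c("Mumbai", "mumbai ", "Delhi", "Delhi", "Chennai")

# Normalise: strip surrounding whitespace and lower-case everything
normalised <- tolower(trimws(city))

n_raw  <- length(unique(city))        # distinct labels as stored
n_norm <- length(unique(normalised))  # distinct labels after normalising

# A drop in the count signals inconsistent spellings of the same category
print(c(raw = n_raw, normalised = n_norm))
```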
To understand the basic properties of the data, we generate summary statistics:
summary(data)
- Rent: Average rent is 34,993, with a maximum value of 3,500,000.
- Size: Average size is 967 sq ft, with a maximum size of 8,000 sq ft.
- Bathroom: Average number of bathrooms is 1.9, with a maximum of 10.
Outliers are identified using the interquartile range method to maintain the integrity of our analysis:
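# Flag values lying more than 1.5 * IQR below Q1 or above Q3
outliers <- function(x) {
  Q1 <- quantile(x, probs = .25)
  Q3 <- quantile(x, probs = .75)
  iqr <- Q3 - Q1
  upper_limit <- Q3 + (iqr * 1.5)
  lower_limit <- Q1 - (iqr * 1.5)
  x > upper_limit | x < lower_limit
}
# Drop flagged rows one column at a time; each pass recomputes the limits
# on the rows that survived the previous column
remove_outliers <- function(df, cols = names(df)) {
  for (col in cols) {
    df <- df[!outliers(df[[col]]), ]
  }
  df
}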
data <- remove_outliers(data, c('Rent', 'Size', 'Bathroom'))
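To make the interquartile range rule concrete, here is a small worked example on made-up rents. The function `is_iqr_outlier` is a hypothetical stand-in that applies the same 1.5 × IQR limits as above; it is a sketch for illustration, not the project's code.

```r
# Made-up rents with one obviously extreme value
rent <- c(10, 12, 11, 13, 12, 100)

# Same rule as above: flag values beyond 1.5 * IQR from the quartiles
is_iqr_outlier <- function(x) {
  q1 <- unname(quantile(x, probs = 0.25))
  q3 <- unname(quantile(x, probs = 0.75))
  iqr <- q3 - q1
  x > q3 + 1.5 * iqr | x < q1 - 1.5 * iqr
}

# With R's default quantile method: Q1 = 11.25, Q3 = 12.75, IQR = 1.5,
# so the limits are 9 and 15 and only the value 100 falls outside them
flags <- is_iqr_outlier(rent)
print(flags)
print(rent[!flags])
```

Because the removal loop filters column by column, the limits for `Size` and `Bathroom` are computed on rows that already passed the `Rent` filter, so the order of columns can affect which rows are dropped.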
We explored properties rented directly through contact with the owner:
data[which(data$Point.of.Contact == "Contact Owner"),]
Most properties rented through direct contact with the owner are suitable for singles and families.
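A statement like this can be backed by tabulating `Tenant.Preferred` within the filtered rows. The sketch below uses a small made-up data frame; the category labels are assumptions about how the dataset encodes tenant preference, and `toy`, `owner_rows` and `pref_share` are illustrative names.

```r
# Made-up listings; labels are assumed, not taken from the real file
toy <- data.frame(
  Point.of.Contact = c("Contact Owner", "Contact Owner",
                       "Contact Agent", "Contact Owner"),
  Tenant.Preferred = c("Bachelors/Family", "Family",
                       "Bachelors", "Bachelors/Family"),
  stringsAsFactors = FALSE
)

# Keep only listings handled directly by the owner
owner_rows <- toy[toy$Point.of.Contact == "Contact Owner", ]

# Share of each tenant preference among owner listings
pref_share <- prop.table(table(owner_rows$Tenant.Preferred))
print(round(pref_share, 2))
```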
Determine the average and maximum rent:
mean(data$Rent)
max(data$Rent)
Examine how rent varies across different area types using a boxplot:
ggplot(data = data, mapping = aes(x = Area.Type, y = Rent)) +
geom_boxplot(col="orange") +
labs(title = "Distribution of Rent By Area Type")
- Carpet Area: Highest average rent.
- Super Area: Moderately priced.
- Built Area: Lowest average rent.
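A boxplot draws the median of each group rather than the mean, so claims about average rent are best confirmed numerically. The following sketch, on made-up values, computes both per area type; `toy`, `avg_rent` and `med_rent` are illustrative names.

```r
# Made-up rents; the last Super Area value is deliberately extreme
toy <- data.frame(
  Area.Type = c("Carpet Area", "Carpet Area",
                "Super Area", "Super Area", "Super Area"),
  Rent = c(20000, 40000, 10000, 12000, 50000)
)

# Mean and median rent per area type
avg_rent <- tapply(toy$Rent, toy$Area.Type, mean)
med_rent <- tapply(toy$Rent, toy$Area.Type, median)

# One row per statistic, one column per area type
print(rbind(mean = avg_rent, median = med_rent))
```

In this toy data the mean and median for Super Area differ widely because of a single expensive listing, which is the kind of gap worth checking for on the real data.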
Determine average house sizes and rents by point of contact:
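temp <- data %>% group_by(Point.of.Contact) %>% summarise(Avg_Rent = mean(Rent), Avg_Size = mean(Size))
ggplot(temp, aes(x = "", y = Avg_Rent, fill = Point.of.Contact)) +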
geom_col() +
geom_text(aes(label = round(Avg_Rent, 2)), position = position_stack(vjust = 0.5)) +
coord_polar(theta = "y") +
labs(title = "Average Rent By Point of Contact")
- Agent Contact: Highest average rent and size.
- Builder Contact: Lowest rent and size.
Identifying the most and least preferred cities based on rental properties:
City_Count <- data %>% group_by(City) %>% summarise(count = length(BHK)) %>% arrange(desc(count))
ggplot(City_Count, mapping = aes(x= City, y= count, fill = count)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "City", y = "Count", title = "Houses Counted by Cities")
- Most Preferred: Chennai
- Least Preferred: Mumbai
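The per-city tally can also be written with `dplyr::count()`, which groups, counts and sorts in one call. This is an equivalent sketch on made-up rows, not the project's code; `toy` and `City_Count_alt` are illustrative names.

```r
library(dplyr)

# Made-up listings
toy <- data.frame(
  City = c("Chennai", "Mumbai", "Chennai", "Delhi", "Chennai", "Delhi"),
  BHK  = c(2, 1, 3, 2, 2, 1)
)

# Number of listings per city, largest first
City_Count_alt <- toy %>% count(City, name = "count", sort = TRUE)
print(City_Count_alt)
```

Sorting the table does not change the bar order in `ggplot2`, which places character categories alphabetically; mapping `x = reorder(City, -count)` is one way to draw the bars from most to fewest listings.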
Calculating the rent per size for each property:
data$Rent_per_size <- data$Rent / data$Size
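As a quick worked example, a listing with a rent of 15,000 and a size of 750 sq ft costs 20 per sq ft. The sketch below applies the same division to made-up rows and adds a guard against a zero size, which is an assumption added for safety rather than something the dataset is known to contain; `toy` is an illustrative name.

```r
# Made-up listings; the last row has a size of zero on purpose
toy <- data.frame(
  Rent = c(15000, 30000, 9000),
  Size = c(750, 1200, 0)
)

# Rent per square foot, with NA where the size is not positive
toy$Rent_per_size <- ifelse(toy$Size > 0, toy$Rent / toy$Size, NA_real_)
print(toy)
```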
Exploring how house size impacts the rent:
ggplot(data, aes(x=Size, y=Rent)) +
geom_point() + geom_smooth() +
labs(title = "Relationship Between Size & Rent")
- Positive relationship: Larger size generally corresponds to higher rent.
Identify the average house sizes for each city:
temp <- data %>% group_by(City) %>% summarise(Avg_Size = mean(Size))
ggplot(data = temp, mapping = aes(x = City, y = Avg_Size, fill = City)) +
geom_bar(stat="identity", position = "dodge") +
labs(title = "Average House Sizes By City")
- Largest Average Size: Hyderabad
- Smallest Average Size: Delhi
Finding the most and least preferred furnishing status:
Furnished_Status <- data %>% group_by(Furnishing.Status) %>% summarise(count = length(BHK))
ggplot(Furnished_Status, mapping = aes(x= Furnishing.Status, y= count, fill = count)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Furnished Status", y = "Count", title = "Count By Facilities")
- Most Preferred: Semi-Furnished
Examining how rent prices vary by city:
ggplot(data = data, mapping = aes(x = City, y = Rent)) +
geom_boxplot(col="black") +
labs(title = "Distribution of Rent By City")
- Highest Rent: Mumbai
- Lowest Rent: Kolkata
Identify the most common number of bathrooms in rental properties:
Bathroom_Count <- data %>% group_by(Bathroom) %>% summarise(count = length(BHK)) %>% top_n(5)
ggplot(Bathroom_Count, mapping = aes(x= Bathroom, y= count, fill = count)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Bathrooms", y = "Count", title = "Count Of Bathrooms in 1 House")
- Most Common: 2 Bathrooms
Analyze the distribution of house sizes:
Size_Count <- data %>% group_by(Size) %>% summarise(count = length(Size)) %>% top_n(8)
ggplot(Size_Count, mapping = aes(x= Size, y= count, fill = count)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Count By Size")
- Most Popular Size: 600 sq ft
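When `top_n()` is called without a weighting column it ranks by the last column of the table, and it keeps ties, so it can return more rows than requested. The dplyr documentation marks it as superseded by `slice_max()`, which makes both choices explicit. The sketch below shows the difference on made-up counts; `toy_counts`, `top2_ties` and `top2_strict` are illustrative names.

```r
library(dplyr)

# Made-up size counts with a tie for second place
toy_counts <- data.frame(
  Size  = c(500, 600, 800, 1000, 1200),
  count = c(40, 95, 60, 60, 20)
)

# Default behaviour keeps ties, so asking for 2 rows returns 3 here
top2_ties <- toy_counts %>% slice_max(order_by = count, n = 2)

# with_ties = FALSE returns exactly 2 rows
top2_strict <- toy_counts %>% slice_max(order_by = count, n = 2, with_ties = FALSE)

print(top2_ties)
print(top2_strict)
```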
Identifying the city with the highest total BHK (bedroom, hall, kitchen) count across its listings:
Total_Amount_BHK_Per_city <- data %>% group_by(City) %>% summarise(Total_BHK = sum(BHK))
ggplot(Total_Amount_BHK_Per_city, mapping = aes(x= City, y= Total_BHK, fill = Total_BHK)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Total Amount Of BHK Per City")
- Highest Total BHK: Chennai
Finding which city has the highest total rent:
Total_Amount_Rent_Per_city <- data %>% group_by(City) %>% summarise(Total_Rent = sum(Rent))
ggplot(Total_Amount_Rent_Per_city, mapping = aes(x= City, y= Total_Rent, fill = Total_Rent)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Total Amount Of Rent Per City")
- Highest Total Rent: Mumbai and Chennai are close.
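Total rent depends on how many listings a city has as well as on how expensive they are, so two cities can have similar totals for different reasons. The sketch below, on made-up rows, puts the total, the listing count and the average side by side; `toy` and `by_city` are illustrative names.

```r
library(dplyr)

# Made-up listings: few expensive ones versus many cheaper ones
toy <- data.frame(
  City = c("Mumbai", "Mumbai", "Chennai", "Chennai", "Chennai", "Chennai"),
  Rent = c(50000, 70000, 20000, 30000, 30000, 40000)
)

# Total rent, number of listings and average rent per city
by_city <- toy %>%
  group_by(City) %>%
  summarise(Total_Rent = sum(Rent), Listings = n(), Avg_Rent = mean(Rent))

print(by_city)
```

In this toy data both cities reach the same total, but one does so with half as many listings at twice the average rent.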
Generating a correlogram to visualize the relationships between different variables:
Correlogram_Matrix <- cor(data[,c(2,3,4,11)])
corrplot(Correlogram_Matrix, addCoef.col = TRUE)
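Selecting columns by position breaks silently if the column order changes. Assuming positions 2, 3, 4 and 11 correspond to `BHK`, `Rent`, `Size` and `Bathroom`, selecting by name is a more robust alternative. The sketch below builds the correlation matrix that way on made-up values; `toy`, `num_cols` and `cor_matrix` are illustrative names.

```r
# Made-up listings with one non-numeric column mixed in
toy <- data.frame(
  BHK      = c(1, 2, 2, 3, 3),
  Rent     = c(8000, 12000, 15000, 22000, 30000),
  City     = c("Delhi", "Delhi", "Mumbai", "Mumbai", "Chennai"),
  Size     = c(400, 700, 800, 1100, 1500),
  Bathroom = c(1, 1, 2, 2, 3)
)

# Select the numeric columns by name rather than by position
num_cols <- c("BHK", "Rent", "Size", "Bathroom")
cor_matrix <- cor(toy[, num_cols])

# Pearson correlations, rounded for display
print(round(cor_matrix, 2))
```

The resulting matrix can be passed to `corrplot()` in the same way as `Correlogram_Matrix`.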
Create a scatter plot with a regression line for rent by house size:
plot(data$Size, data$Rent, main = "Scatterplot of rent vs size", xlab = "House size", ylab = "House rent")
abline(lm(Rent ~ Size, data = data), col = "blue", lwd = 2)
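Beyond drawing the line, the fitted model can be inspected and used for a rough prediction. The following is a minimal sketch on made-up values, not a result from the real dataset; `toy`, `fit`, `slope`, `r2` and `predicted` are illustrative names.

```r
# Made-up sizes (sq ft) and rents
toy <- data.frame(
  Size = c(500, 800, 1000, 1500),
  Rent = c(6500, 8500, 11500, 15500)
)

# Simple linear regression of rent on size
fit <- lm(Rent ~ Size, data = toy)

# Slope: estimated change in rent per additional square foot
slope <- coef(fit)[["Size"]]

# Share of the variance in rent explained by size
r2 <- summary(fit)$r.squared

# Predicted rent for a hypothetical 1,200 sq ft listing
predicted <- predict(fit, newdata = data.frame(Size = 1200))

print(c(slope = slope, r_squared = r2))
print(predicted)
```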
Visualizing the distribution of rent with a violin plot:
ggplot(data, aes(x = "", y = Rent)) + geom_violin(trim = FALSE) + labs(x = NULL)
This analysis offers a comprehensive exploration of rental housing data, revealing key insights into factors affecting rent prices, area preferences, and housing features. By understanding these patterns, both renters and property managers can make more informed decisions in the rental market.