Primary language: Jupyter Notebook. License: Apache License 2.0.

Laptop_cost_prediction

EDA

Distribution Plot of our dependent feature Price.

plot1

As we can see, the distribution is roughly Gaussian but slightly skewed, with most prices bunched on the left and a longer tail toward higher prices.

Plotting the countplots of categorical Variables

Companies Selling Laptops

company_plot2

HP has around 275 laptops out of 1303, whereas Dell and Lenovo each exceed HP with approximately 290 laptops out of 1303. The brands with the fewest laptops are Google, LG, Fujitsu, Huawei, etc.

Laptop types

typename_plot2

Here we see which kinds of models are sold: mostly Notebooks, followed by Ultrabooks and Gaming laptops. The least-sold types are Netbooks and Workstations.

RAM

ram_plot2

Operating System

opsys_plot2

Company VS Price

company_vs_price

Next we look at the price for each laptop brand, which gives insight into how the price of a laptop varies across companies. According to the dataset, the boxplot for HP shows a maximum (avg) selling price of about 65K - 70K (it can be more, but the dataset does not contain that data) and an average selling price of about 50K - 52K. For Apple, the maximum (avg) selling price goes to approx. 1 Lakh (it can be more, but the dataset does not contain that data) and the average selling price is about 65K - 70K. So there is a lot of variation in price, and the plot shows both the typical price range of each company's laptops and the maximum price a company assigns to its laptops.

Laptop Type VS Price

plot4

In the above plot, we see that Notebooks do not have much price variation, maybe because notebooks are used for general purposes and are priced to be affordable for most people.

Screen Size VS Price

plot5

In the plot, we observe that many people buy laptops with a screen size of around 13 - 14 inches, while laptops with a 17-inch screen are sold for up to 3 Lakhs or more. Many people also buy laptops with a screen size of 15.6 inches, and the prices scatter widely at 15.6 inches, so we can say that the data is reasonable.


The Screen Resolution column contains many different values, as shown below. Touchscreen, Normal, and IPS Panel are the 3 parts on the basis of which we can segregate them.

Full HD 1920x1080                                507
1366x768                                         281
IPS Panel Full HD 1920x1080                      230
IPS Panel Full HD / Touchscreen 1920x1080         53
Full HD / Touchscreen 1920x1080                   47
1600x900                                          23
Touchscreen 1366x768                              16
Quad HD+ / Touchscreen 3200x1800                  15
IPS Panel 4K Ultra HD 3840x2160                   12
IPS Panel 4K Ultra HD / Touchscreen 3840x2160     11
4K Ultra HD / Touchscreen 3840x2160               10
Touchscreen 2560x1440                              7
4K Ultra HD 3840x2160                              7
IPS Panel 1366x768                                 7
IPS Panel Quad HD+ / Touchscreen 3200x1800         6
Touchscreen 2256x1504                              6
IPS Panel Retina Display 2560x1600                 6
IPS Panel Retina Display 2304x1440                 6
IPS Panel Touchscreen 2560x1440                    5
IPS Panel 2560x1440                                4
IPS Panel Retina Display 2880x1800                 4
1440x900                                           4
IPS Panel Touchscreen 1920x1200                    4
2560x1440                                          3
1920x1080                                          3
IPS Panel Quad HD+ 2560x1440                       3
IPS Panel Touchscreen 1366x768                     3
Touchscreen 2400x1600                              3
Quad HD+ 3200x1800                                 3
IPS Panel Full HD 2160x1440                        2
IPS Panel Quad HD+ 3200x1800                       2
IPS Panel Touchscreen / 4K Ultra HD 3840x2160      2
Touchscreen / Full HD 1920x1080                    1
Touchscreen / Quad HD+ 3200x1800                   1
Touchscreen / 4K Ultra HD 3840x2160                1
IPS Panel Full HD 1920x1200                        1
IPS Panel Full HD 2560x1440                        1
IPS Panel Retina Display 2736x1824                 1
IPS Panel Touchscreen 2400x1600                    1
IPS Panel Full HD 1366x768                         1

So now we will create a new column, touchscreen. If the value is 1, that laptop has a touchscreen.
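As a rough illustration of this step, here is a minimal sketch of a keyword check on the resolution text. The helper name `has_feature` and the sample strings are assumptions made for this example, not the notebook's actual code.

```python
def has_feature(screen_resolution: str, keyword: str) -> int:
    """Return 1 if the keyword appears in the resolution text, else 0."""
    return 1 if keyword.lower() in screen_resolution.lower() else 0


# Sample values taken from the Screen Resolution counts listed above.
samples = [
    "IPS Panel Full HD / Touchscreen 1920x1080",
    "Full HD 1920x1080",
    "Touchscreen 2560x1440",
]

# One flag per laptop: 1 means the laptop has a touchscreen.
touchscreen = [has_feature(s, "Touchscreen") for s in samples]
print(touchscreen)
```

With pandas, the same function could be passed to `Series.apply` on the resolution column (the exact column name in the dataset is not shown here, so treat it as an assumption).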

resolution_code

data_image

Count of TouchScreen Laptops

plot 6

Using the countplot, we see that almost 190 laptops have a touchscreen and the rest do not.

TouchScreen VS Price

plot7

Touchscreen laptops are priced higher: the upper end is around 80K or more, and the average price is about 70K.


So now we will create a new column for the IPS panel as well. If the value is 1, that laptop has an IPS panel.

ips_image

ips_data_image

IPS Panel count

plot8

Using the countplot, we see that around 350 - 400 laptops have an IPS panel and the rest do not.

Panel VS Price

plot9

The price for IPS panel laptops goes up to around 80K, and the average price is about 65K.


Correlation

plot10

corr1_image

From the correlation plot we observe that as X_res and Y_res increase, the price of the laptop also increases, so X_res and Y_res are positively correlated with Price and carry useful information. That is why I split the Resolution column into the X_res and Y_res columns. However, the correlation plot also shows that X_res and Y_res are highly collinear with each other, so we combine them with Inches, which has a weaker correlation, into a new column named PPI (pixels per inch). The formula for calculating PPI is shown below.

ppi
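For readers without the image, PPI is commonly defined as the diagonal resolution in pixels divided by the diagonal screen size in inches, i.e. sqrt(X_res^2 + Y_res^2) / Inches. The sketch below assumes that this standard definition is the one used here; the function names are hypothetical.

```python
import math
import re
from typing import Tuple


def parse_resolution(screen_resolution: str) -> Tuple[int, int]:
    """Extract (X_res, Y_res) from text such as 'Full HD 1920x1080'."""
    match = re.search(r"(\d+)x(\d+)", screen_resolution)
    if match is None:
        raise ValueError(f"no resolution found in {screen_resolution!r}")
    return int(match.group(1)), int(match.group(2))


def pixels_per_inch(x_res: int, y_res: int, inches: float) -> float:
    """Diagonal pixel count divided by the diagonal size in inches."""
    return math.sqrt(x_res ** 2 + y_res ** 2) / inches


x_res, y_res = parse_resolution("IPS Panel Full HD 1920x1080")
ppi = pixels_per_inch(x_res, y_res, 15.6)
print(round(ppi, 2))  # about 141.21 for a 15.6-inch Full HD screen
```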

corr2_image

From the correlation data we observe that PPI has a good correlation with Price, so we will use it. Since PPI combines 3 features and carries the collective information of those 3 columns, we will drop Inches, X_res, and Y_res.


Now we will work on the CPU column, as it also contains a lot of text data that we need to process carefully to get good insights from it.

Most common processors are made by Intel, so we will group their processors into categories such as i3, i5, i7, and other. Here, "other" means Intel processors that do not have i3, i5, or i7 in their name; they are a different kind, which is why I put them in their own category. The remaining category is AMD, which is a different category altogether.

If we look at the CPU column, the first 3 words of every row give the type of the CPU, so we extract them and use them as shown.
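The grouping described above could be sketched roughly as follows. The category labels and the function name `cpu_category` are assumptions for illustration; the notebook may use different names.

```python
def cpu_category(cpu: str) -> str:
    """Map a raw CPU description to one of a few coarse categories."""
    words = cpu.split()
    first_three = " ".join(words[:3])  # e.g. "Intel Core i5"
    if first_three in ("Intel Core i3", "Intel Core i5", "Intel Core i7"):
        return first_three
    if words and words[0] == "Intel":
        # Intel processors without i3/i5/i7 in the name.
        return "Other Intel Processor"
    # Everything else is treated as the AMD group, as described in the text.
    return "AMD Processor"


print(cpu_category("Intel Core i5 7200U 2.5GHz"))
print(cpu_category("Intel Celeron Dual Core N3350 1.1GHz"))
print(cpu_category("AMD A9-Series 9420 3GHz"))
```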

plot11

We see that there are around 430 - 440 i5 laptops and more than 500 i7 laptops.

plot12

Analysis on RAM

plot13

We can see that more than 600 laptops use 8GB RAM, the next largest group uses 4GB RAM, and a few use 16GB RAM.

plot14

RAM has a good positive relation with price.


Memory Column

We will separate the type of memory from its size, similar to what was done in the previous part.

This part has to be done in steps. The memory is not given as a single value; one entry can combine several storage types, such as 128GB SSD + 1TB HDD. To bring every entry into the same form, we need to make some modifications, which are done as shown below.
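One possible way to bring such entries into the same form is to split on `+` and record a size in GB per storage type. This is only a sketch: the helper name `parse_memory` and the choice of 1 TB = 1000 GB are assumptions, not necessarily what the notebook does.

```python
import re

STORAGE_TYPES = ("SSD", "HDD", "Flash Storage", "Hybrid")


def parse_memory(memory: str) -> dict:
    """Return the size in GB per storage type, e.g. {'SSD': 128, 'HDD': 1000, ...}."""
    sizes = {name: 0 for name in STORAGE_TYPES}
    for part in memory.split("+"):
        match = re.search(r"(\d+(?:\.\d+)?)\s*(GB|TB)", part)
        if match is None:
            continue  # skip parts without a recognizable size
        factor = 1000 if match.group(2) == "TB" else 1  # assumed conversion
        size_gb = int(float(match.group(1)) * factor)
        for name in STORAGE_TYPES:
            if name in part:
                sizes[name] += size_gb
    return sizes


print(parse_memory("128GB SSD + 1TB HDD"))
```

Each key of the returned dictionary can then become its own numeric column, which is what makes a correlation with Price possible.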

corr_image3

Based on the correlation, we observe that the correlations of Hybrid and Flash Storage with Price are almost negligible, so we can simply drop them, whereas HDD and SSD have a good correlation. HDD has a negative relation with Price, which makes sense: as the price of a laptop increases, it is more likely to use an SSD instead of an HDD, and vice versa.


Analysis on GPU

Here, as we have little data about the laptops, it is better to focus on the GPU brands instead of the model details that follow them.

plot15

plot16

Removing the row with the "ARM" GPU brand
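A minimal sketch of the two GPU steps (keeping only the brand, then dropping the "ARM" row) might look like this. The sample GPU strings and the name `gpu_brand` are illustrative assumptions.

```python
def gpu_brand(gpu: str) -> str:
    """Keep only the first word of the GPU description, i.e. the brand."""
    return gpu.split()[0]


# Illustrative GPU descriptions, not rows copied from the dataset.
gpus = [
    "Intel HD Graphics 620",
    "Nvidia GeForce GTX 1050",
    "AMD Radeon 530",
    "ARM Mali T860 MP4",
]

brands = [gpu_brand(g) for g in gpus]
# Drop the single rare brand so it does not become its own category.
kept = [b for b in brands if b != "ARM"]
print(kept)
```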

plot17


Operating System Analysis

plot18

Grouping all Windows versions into Windows and all Mac versions into Mac OS
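The grouping could be sketched as below. The function name `os_category` is hypothetical, and leaving every other operating system unchanged is an assumption, since the text only describes the Windows and Mac groups.

```python
def os_category(op_sys: str) -> str:
    """Collapse OS versions into coarse groups."""
    name = op_sys.lower()
    if name.startswith("windows"):
        return "Windows"
    if name.startswith("mac"):
        return "Mac OS"
    return op_sys  # assumption: other systems are left as they are


for raw in ["Windows 10", "Windows 7", "macOS", "Mac OS X", "Linux"]:
    print(raw, "->", os_category(raw))
```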

plot19

plot20


Laptop Weight analysis

plot21

plot22


Price Analysis

plot23

As we can see, the plot is skewed: most prices are bunched on the left, with a longer tail toward higher prices.

plot24

So we apply log to the price, and it becomes approximately normally distributed (Gaussian).
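To illustrate why the log helps, the sketch below computes a simple skewness measure before and after the transform. The prices are made-up values chosen only to mimic a long tail toward high prices; they are not taken from the dataset.

```python
import math


def skewness(values):
    """Population skewness: mean of the cubed z-scores."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n


# Hypothetical prices with a long tail toward expensive laptops.
prices = [20000.0, 45000.0, 60000.0, 150000.0, 300000.0]
log_prices = [math.log(p) for p in prices]

print("skew before log:", round(skewness(prices), 3))
print("skew after log: ", round(skewness(log_prices), 3))

# math.exp undoes the transform when a prediction must be shown as a price.
recovered = [math.exp(v) for v in log_prices]
```

Because the model is then trained on the log of the price, any prediction has to be passed through the exponential before it is displayed as a price.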

--------------------------------------Models-----------------------------------

We will be using Pipelines to build the models

Regression Models

Linear Regression

Step 1 converts the categorical values to numerical values. Step 2 is the model object.
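A two-step pipeline of this shape can be sketched with scikit-learn as follows. The tiny data frame, its column names, and the target values are invented for illustration, and one-hot encoding is assumed as the categorical-to-numerical conversion; the notebook's actual columns and settings may differ.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data: one categorical column, one numeric column.
X = pd.DataFrame({
    "Company": ["HP", "Dell", "Apple", "HP", "Dell", "Apple"],
    "Ram": [4, 8, 8, 8, 16, 16],
})
y = [10.2, 10.8, 11.4, 10.6, 11.2, 11.8]  # hypothetical log prices

# Step 1: one-hot encode the categorical column, pass the rest through.
step1 = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["Company"])],
    remainder="passthrough",
)
# Step 2: the model object.
step2 = LinearRegression()

pipe = Pipeline([("step1", step1), ("step2", step2)])
pipe.fit(X, y)
predictions = pipe.predict(X)
print(predictions)
```

Swapping `step2` for another regressor keeps step 1 unchanged, which is the main convenience of the pipeline layout.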

linear_reg

R2 score 80.73%

MAE 0.2101783

Ridge Regression

ridge_reg

R2 score 81.27%

MAE 0.2092680

Lasso Regression

lasso_reg

R2 score 80.71%

MAE 0.2111435

Tree-Based and Ensemble Models

Decision Tree

decision_tree

R2 score 84.33%

MAE 0.1830225

Random Forest

random_forest

R2 score 90.82%

MAE 0.1587025

Checking how the Random Forest model's predictions compare with the actual values.
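The comparison can also be expressed numerically. The sketch below computes R2 and MAE by hand on made-up values, assuming the reported metrics are measured on the log-price scale (an assumption, since the log was applied to the price before modelling).

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r2_score(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical log prices, not values from the notebook.
actual_log = [10.5, 11.0, 11.5, 12.0]
predicted_log = [10.6, 10.9, 11.7, 11.9]

mae_log = mean_absolute_error(actual_log, predicted_log)
r2 = r2_score(actual_log, predicted_log)
print("MAE (log scale):", round(mae_log, 3))
print("R2:", round(r2, 3))

# On the log scale, an error of m corresponds to a factor of exp(m) in price.
print("typical price ratio:", round(math.exp(mae_log), 3))
```

Under that assumption, an MAE of about 0.16 would mean predictions are typically off by a factor of roughly exp(0.16), or about 17% of the price.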

actual_pred


Conclusion:

The Random Forest model gives the best score of all the models above, so we will use the Random Forest model for the web app.

Note: I also worked on hyperparameter tuning of the Random Forest, but after tuning the model gives the same score.