Course project of MGE303 DataMining, UNIST
Implementation: link
Report: link
This term project for the MGE303 DataMining class (Spring 2020) explores house prices in the King County region. It is a regression problem: we want to know the price of a house in the King County area. Our goal is to use the given data to predict, as accurately as possible, the price of a house that meets a given set of conditions (Problem Definition).
- Data Load
- Data Handle
- Encoding
- Histogram
- The relationship between PRICE and AREA
  Hypothesis 1: Among the area variables, AR_LIV, AR_RFTP, and AVG_LIV_NEAR are likely to have a major impact on PRICE.
- The relationship between PRICE and the number of rooms and floors (internal facilities)
  Hypothesis 2: Looking at each case, the higher NUM_BD and BTH_STAT_2 are, the higher PRICE tends to be. However, PRICE does not seem to have much to do with NUM_FLR.
- The relationship between Price and Climate change
  Hypothesis 3: Price is not related to PRECIPITATION, Summer_HIGH, or Winter_Low.
- The relationship between Price and Near_RIV
  Hypothesis 4: If there is water near the house, the price is high. Therefore, NEAR_RIV affects PRICE.
- The relationship between Price and Crime
  Hypothesis 5: Price is not related to violent_crime or Property_crime.
- The relationship between Price and house position (latitude, longitude)
  Hypothesis 6: Price is not related to Longitude or Latitude.
- The relationship between Price and Appearance and Facilities
  Hypothesis 7: As noted earlier, outlook and price seem to be related, while condition is so concentrated on one particular value that a relationship is difficult to find. The price difference depending on is_renovated is small, at less than 500,000, so price and IS_RENOVATED seem unrelated. PRICE also does not seem to change with YR_PAST.
Conclusion of hypotheses:
- Related to PRICE: AR_LIV, AR_RFTP, AVG_AR_LIV_NEAR, NUM_BD, BTH_STAT_2, NEAR_RIV, OUTLOOK
- Not related to PRICE: AR_PL, AR_BASE, AVG_AR_PL_NEAR, NUM_FLR, PRECEPITATION, SUMMER_HIGH, WINTER_LOW, VIOLENT_CRIME, PROPERTY_CRIME, LAT, LONG, CONDITION, IS_RENOVATED, YR_PAST
Construct correlation matrix
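A minimal sketch of how such a correlation matrix could be built with pandas. The small DataFrame below is made-up illustration data, not the King County dataset; only the column names PRICE, AR_LIV, NUM_BD, and NUM_FLR are taken from the hypotheses above.

```python
import pandas as pd

# Made-up toy rows for illustration only (not the real dataset).
df = pd.DataFrame({
    "PRICE":   [300000, 450000, 520000, 610000, 800000],
    "AR_LIV":  [1000, 1500, 1800, 2100, 3000],
    "NUM_BD":  [2, 3, 3, 4, 5],
    "NUM_FLR": [1, 2, 1, 2, 1],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()

# Correlation of each feature with PRICE, strongest first.
price_corr = corr["PRICE"].drop("PRICE").sort_values(ascending=False)
print(price_corr)
```

Features with a correlation close to zero are candidates for the "not related" group, although a correlation matrix only captures linear relationships.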
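- Multiple Linear Regression
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import explained_variance_score, mean_squared_error
    from sklearn.model_selection import cross_val_score

    mlr = LinearRegression()
    mlr.fit(X_train, y_train)
    mlr_score = mlr.score(X_test, y_test)  # R^2 on the test set
    pred_mlr = mlr.predict(X_test)
    # scikit-learn metrics take (y_true, y_pred), in that order
    expl_mlr = explained_variance_score(y_test, pred_mlr)
    mse_mlr = mean_squared_error(y_test, pred_mlr)
    rmse_mlr = np.sqrt(mse_mlr)
    cv_mlr = cross_val_score(mlr, X_train, y_train, cv=10).mean()  # mean 10-fold CV score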
- Decision Tree
    from sklearn.tree import DecisionTreeRegressor

    tr_regressor = DecisionTreeRegressor(random_state=0)
    tr_regressor.fit(X_train, y_train)
    decision_score = tr_regressor.score(X_test, y_test)  # R^2 on the test set
    pred_tr = tr_regressor.predict(X_test)
    # scikit-learn metrics take (y_true, y_pred), in that order
    expl_tr = explained_variance_score(y_test, pred_tr)
    mse_tr = mean_squared_error(y_test, pred_tr)
    rmse_tr = np.sqrt(mse_tr)
    cv_tr = cross_val_score(tr_regressor, X_train, y_train, cv=10).mean()  # mean 10-fold CV score
- Random Forest Regression Model
    from sklearn.ensemble import RandomForestRegressor

    rf_regressor = RandomForestRegressor(n_estimators=28, random_state=0)
    rf_regressor.fit(X_train, y_train)
    rf_score = rf_regressor.score(X_test, y_test)  # R^2 on the test set
    rf_pred = rf_regressor.predict(X_test)
    # scikit-learn metrics take (y_true, y_pred), in that order
    expl_rf = explained_variance_score(y_test, rf_pred)
    mse_rf = mean_squared_error(y_test, rf_pred)
    rmse_rf = np.sqrt(mse_rf)
    cv_rf = cross_val_score(rf_regressor, X_train, y_train, cv=10).mean()  # mean 10-fold CV score
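The model snippets above assume that X_train, X_test, y_train, and y_test already exist. A minimal sketch of one way to produce them is shown here, with synthetic data standing in for the real dataset. The 80/20 split ratio, the random_state on the split, and the synthetic features are assumptions for illustration, not the settings used in the project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the feature matrix and the PRICE target (assumption).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Hold out 20% of the rows for testing (the ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=28, random_state=0)
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)  # R^2 on the held-out test set
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
cv = cross_val_score(model, X_train, y_train, cv=10).mean()  # mean 10-fold R^2
print(r2, rmse, cv)
```

Running cross-validation on the training portion only, as the snippets above do, keeps the test set untouched for the final comparison.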
Calculate Score
print("Multiple Linear Regression Model Score is ",round(mlr.score(X_test,y_test)*100))
print("Decision tree Regression Model Score is ",round(tr_regressor.score(X_test,y_test)*100))
print("Random Forest Regression Model Score is ",round(rf_regressor.score(X_test,y_test)*100))
>> Multiple Linear Regression Model Score is 70.0
>> Decision tree Regression Model Score is 76.0
>> Random Forest Regression Model Score is 88.0
import pandas as pd

models_score = pd.DataFrame({
    'Model': ['Multiple Linear Regression', 'Decision Tree', 'Random forest Regression'],
    'Score': [mlr_score, decision_score, rf_score],
    'Explained Variance Score': [expl_mlr, expl_tr, expl_rf],
    'RMSE': [rmse_mlr, rmse_tr, rmse_rf],
    '10-CV': [cv_mlr, cv_tr, cv_rf],
})
models_score.sort_values(by='10-CV', ascending=False)
Of the three models, random forest has the smallest RMSE and the largest 10-CV and Score values, so we consider it the most suitable model.
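For reference, the Score and RMSE columns compared above can be written out explicitly. In scikit-learn, the score method of a regressor returns the coefficient of determination R^2. The notation is introduced here for illustration: y_i is the true price of test sample i, \hat{y}_i is its predicted price, \bar{y} is the mean true price, and n is the number of test samples.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
```

Both quantities depend on the same sum of squared errors, so on a single test set a lower RMSE always goes with a higher R^2. The 10-CV column is the mean R^2 over 10 cross-validation folds of the training data.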