# TianChi-Industrial-steam-volume-prediction

Industrial steam volume prediction competition on TianChi


## competition introduction

home page

## submit log

note: all submission files have been uploaded

| time | method | offline MSE | offline R2 | online MSE | status |
| --- | --- | --- | --- | --- | --- |
| 10.21 | lightGBM, all features | 0.1033 | - | 0.1496 | little overfitting |
| 10.22 | xgboost, all features | 0.0914 | - | 0.2566 | strong overfitting |
| 10.23 | lightGBM: 1. drop abnormal features 'V9','V23','V25','V30','V31','V33','V34'; 2. hyperparameter optimization | 0.1035 | 0.8961 | 0.1341 | weak overfitting |
| 10.23 | lightGBM: 1. drop abnormal features 'V9','V23','V25','V30','V31','V33','V34'; 2. hyperparameter optimization; 3. drop abnormal samples from the train set according to sns.boxplot | 0.0917 | 0.8139 | 0.2548 | middle overfitting |
| 10.24 | lightGBM: 1. drop abnormal features 'V9','V23','V25','V30','V31','V33','V34'; 2. drop bilinear features 'V0','V6','V15','V10','V8','V27'; 3. hyperparameter optimization | 0.1181 | 0.8815 | 0.1502 | weak overfitting |
| 10.25 | lightGBM: 1. construct 780 polynomial features; 2. standardization | 0.1068 | 0.8928 | 0.1549 | weak overfitting |
| 10.25 | lightGBM: 1. construct 780 polynomial features; 2. PCA down to 100 components | 0.1871 | 0.8123 | 0.5202 | strong overfitting |
| 10.26 | lightGBM: select 20 features with f-regression | 0.1040 | 0.8957 | 0.1417 | weak overfitting |
| 10.26 | lightGBM: 1. construct 780 polynomial features; 2. select 100 features with f-regression; 3. PCA down to 30 components | 0.1365 | 0.8630 | 0.4164 | strong overfitting |
| 10.27 | lightGBM: 1. construct 780 polynomial features; 2. select 100 features by mutual information; 3. PCA down to 30 components | 0.1392 | 0.8603 | 0.8113 | strong overfitting |

## selecting features using f-regression

| num | lgb train MSE | lgb train R2 | lgb valid MSE | lgb valid R2 | lgb online MSE | linear reg train MSE | linear reg train R2 | linear reg valid MSE | linear reg valid R2 | linear reg online MSE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 38 | 0.0126 | 0.9867 | 0.1010 | 0.8986 | - | 0.1079 | 0.8875 | 0.1076 | 0.8920 | - |
| 37 | 0.0134 | 0.9859 | 0.1013 | 0.8984 | - | 0.1094 | 0.8860 | 0.1062 | 0.8934 | - |
| 36 | 0.0156 | 0.9836 | 0.1015 | 0.8981 | - | 0.1094 | 0.8860 | 0.1061 | 0.8935 | - |
| 35 | 0.0150 | 0.9843 | 0.1005 | 0.8991 | - | 0.1094 | 0.8859 | 0.1061 | 0.8935 | - |
| 34 | 0.0138 | 0.9855 | 0.1016 | 0.8980 | - | 0.1095 | 0.8859 | 0.1067 | 0.8930 | - |
| 33 | 0.0097 | 0.9898 | 0.1012 | 0.8984 | - | 0.1097 | 0.8857 | 0.1065 | 0.8931 | - |
| 32 | 0.0113 | 0.9881 | 0.1019 | 0.8977 | - | 0.1097 | 0.8857 | 0.1069 | 0.8927 | - |
| 31 | 0.0124 | 0.9870 | 0.1020 | 0.8976 | - | 0.1098 | 0.8856 | 0.1069 | 0.8927 | - |
| 30 | 0.0204 | 0.9786 | 0.1027 | 0.8969 | - | 0.1099 | 0.8855 | 0.1064 | 0.8932 | - |
| 29 | 0.0291 | 0.9696 | 0.1053 | 0.8943 | - | 0.1115 | 0.8838 | 0.1074 | 0.8923 | - |
| 28 | 0.0303 | 0.9683 | 0.1056 | 0.8941 | - | 0.1132 | 0.8820 | 0.1105 | 0.8891 | - |
| 27 | 0.0255 | 0.9733 | 0.1034 | 0.8962 | - | 0.1139 | 0.8814 | 0.1108 | 0.8888 | - |
| 26 | 0.0251 | 0.9738 | 0.1047 | 0.8949 | - | 0.1139 | 0.8813 | 0.1107 | 0.8889 | - |
| 25 | 0.0261 | 0.9728 | 0.1042 | 0.8954 | - | 0.1155 | 0.8796 | 0.1136 | 0.8860 | - |
| 24 | 0.0240 | 0.9749 | 0.1037 | 0.8959 | - | - | - | - | - | - |
| 23 | 0.0335 | 0.9650 | 0.1073 | 0.8923 | - | - | - | - | - | - |
| 22 | 0.0305 | 0.9681 | 0.1056 | 0.8940 | - | - | - | - | - | - |
| 21 | 0.0224 | 0.9766 | 0.1051 | 0.8945 | - | - | - | - | - | - |
| 20 | 0.0315 | 0.9671 | 0.1073 | 0.8923 | - | - | - | - | - | - |
| 19 | 0.0368 | 0.9616 | 0.1071 | 0.8926 | - | - | - | - | - | - |
| 18 | 0.0407 | 0.9575 | 0.1075 | 0.8921 | - | - | - | - | - | - |
| 17 | 0.0398 | 0.9584 | 0.1072 | 0.8925 | - | 0.1182 | 0.8768 | 0.1177 | 0.8818 | - |
| 16 | 0.0429 | 0.9553 | 0.1082 | 0.8914 | - | 0.1427 | - | - | - | - |
| 15 | 0.0556 | 0.9420 | 0.1102 | 0.8895 | - | - | - | - | - | - |
| 14 | 0.0629 | 0.9344 | 0.1131 | 0.8865 | - | - | - | - | - | - |
| 13 | 0.0656 | 0.9316 | 0.1138 | 0.8858 | - | - | - | - | - | - |
| 12 | 0.0727 | 0.9242 | 0.1176 | 0.8820 | - | - | - | - | - | - |
| 11 | 0.0786 | 0.9181 | 0.1196 | 0.8800 | - | - | - | - | - | - |
| 10 | 0.0911 | 0.9051 | 0.1325 | 0.8671 | - | - | - | - | - | - |
| 9 | 0.0945 | 0.9015 | 0.1348 | 0.8648 | - | - | - | - | - | - |
| 8 | 0.1001 | 0.8957 | 0.1380 | 0.8615 | - | - | - | - | - | - |
| 7 | 0.1018 | 0.8940 | 0.1415 | 0.8580 | - | - | - | - | - | - |
| 6 | 0.1017 | 0.8940 | 0.1451 | 0.8545 | - | - | - | - | - | - |
| 5 | 0.1219 | 0.8730 | 0.1645 | 0.8350 | - | - | - | - | - | - |
| 4 | 0.1285 | 0.8661 | 0.1669 | 0.8325 | - | - | - | - | - | - |
| 3 | 0.1388 | 0.8554 | 0.1748 | 0.8247 | - | - | - | - | - | - |
| 2 | 0.1638 | 0.8293 | 0.1893 | 0.8101 | - | - | - | - | - | - |
| 1 | 0.2040 | 0.7875 | 0.2345 | 0.7648 | - | - | - | - | - | - |
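The feature-count sweep above can be reproduced with scikit-learn's `SelectKBest` and the `f_regression` score function. The synthetic data below is only a stand-in for the competition set (which has 38 'V' features):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in: 38 features, like the competition data.
X, y = make_regression(n_samples=200, n_features=38, n_informative=10,
                       noise=0.1, random_state=0)

# Keep the k features with the highest univariate F-statistic against the target.
selector = SelectKBest(score_func=f_regression, k=25)
X_sel = selector.fit_transform(X, y)

print(X_sel.shape)                         # (200, 25)
kept = selector.get_support(indices=True)  # column indices of retained features
```

To produce the table, wrap this in a loop over `k` from 38 down to 1, fitting the lgb and linear regression models on `X_sel` at each step.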

## feature engineering and feature selection

| model | drop data | one hot | add min | log | exp | sqrt | square | poly | drop fea | select KBest | PCA | train MSE | valid MSE | valid R2 | test MSE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lgb | - | - | - | - | - | - | - | - | - | - | - | 0.0182 | 0.1030 | 0.8968 | - |
| lgb | True,3 | - | - | - | - | - | - | - | - | - | - | 0.0119 | 0.0999 | 0.828 | - |
| lgb | - | - | - | - | - | - | - | - | True | - | - | 0.0192 | 0.1058 | 0.8939 | - |
| lgb | - | - | - | - | - | - | - | - | - | 35 | - | 0.0189 | 0.1037 | 0.8960 | - |
| lgb | True,3 | - | - | - | - | - | - | - | True | - | - | 0.0131 | 0.0993 | 0.8292 | - |
| lgb | - | - | - | - | - | - | - | - | True | 28 | - | 0.0207 | 0.1061 | 0.8936 | - |
| lgb | True,3 | - | - | - | - | - | - | - | - | 30 | - | 0.0140 | 0.1007 | 0.8269 | - |
| lgb | - | True | - | - | - | - | - | - | - | - | - | 0.0179 | 0.1035 | 0.8962 | 0.1270 |
| lgb | - | True,drop | - | - | - | - | - | - | - | - | - | 0.0112 | 0.1051 | 0.8947 | 0.1284 |
| lgb | - | - | True | - | - | - | - | - | - | - | - | 0.0174 | 0.1041 | 0.8957 | 14.1 |
| lgb | - | True | True | - | - | - | - | - | - | - | - | 0.0178 | 0.1024 | 0.8973 | 0.1264 |
| lgb | - | True | True | - | - | - | - | - | - | - | - | 0.0013 | 0.1008 | 0.8989 | - |
| lgb | - | True | True | - | - | - | - | - | True | - | - | 0.0189 | 0.1046 | 0.8951 | 0.1276 |
| lgb | True | True | True | - | - | - | - | - | - | - | - | 0.0180 | 0.1035 | 0.8962 | 0.1285 |
| lgb | True | True | True | - | - | - | - | - | True | - | - | 0.0195 | 0.1052 | 0.8945 | - |
| lgb | - | True | True | - | - | - | - | - | True | - | - | 0.0189 | 0.1046 | 0.8951 | - |
| lgb | - | True | True | - | - | - | - | - | True | 50 | - | 0.0208 | 0.1056 | 0.8940 | - |
| lgb | - | True | True | - | - | - | - | - | True | 25 | - | 0.0241 | 0.1075 | 0.8922 | - |
| lgb | - | - | - | True | - | - | - | - | - | - | - | 0.0167 | 0.1046 | 0.8952 | 14.2 |
| lgb | - | - | - | - | True | - | - | - | - | - | - | 0.0170 | 0.1049 | 0.8948 | 14.2 |
| lgb | - | - | - | - | - | True | - | - | - | - | - | 0.0169 | 0.1038 | 0.8959 | 14.2 |
| lgb | - | - | - | - | - | - | True | - | - | - | - | 0.0146 | 0.1068 | 0.8929 | 14.2 |
| lgb | - | - | - | True | True | True | True | - | - | - | - | 0.0170 | 0.1049 | 0.8948 | 14.2 |
| lgb | - | True | True | True | True | True | True | - | True | - | - | 0.0148 | 0.1084 | 0.8913 | - |
| lgb | - | True | True | True | True | True | True | - | True | - | - | 0.0098 | 0.1025 | 0.8972 | - |
| lgb | - | True | True | True | True | True | True | - | True | 50 | - | 0.0540 | 0.1139 | 0.8858 | - |
| lgb | - | True | True | True | True | True | True | - | True | 30 | - | 0.1066 | 0.1371 | 0.8626 | - |
| lgb | - | True | True | True | True | True | True | - | True | 16 | - | 0.1268 | 0.1638 | 0.8358 | - |
| lgb | - | True | True | True | True | True | True | True | True | 100 | - | 0.0641 | 0.1332 | 0.8665 | - |
| lgb | - | True | True | True | True | True | True | True | True | 30 | - | 0.1066 | 0.1457 | 0.8539 | - |
| lgb | - | True | True | True | True | True | True | True | True | - | 30 | 0.0309 | 0.1486 | 0.8510 | - |
| lgb | - | True | True | True | True | True | True | True | True | 400 | 20 | 0.0546 | 0.1195 | 0.8802 | - |
| lgb | - | True | True | True | True | True | True | True | True | 100 | 30 | 0.0264 | 0.1221 | 0.8775 | - |
| lgb | - | True | True | True | True | True | True | True | True | 100 | 20 | 0.0352 | 0.1271 | 0.8725 | - |
| lgb | True,3 | True | True | True | True | True | True | - | True | 30 | - | 0.1082 | 0.1384 | - | - |
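The heavier pipelines in the last rows of the table (polynomial expansion, then univariate selection, then PCA) could be sketched as below. The 38-feature input matches the competition data; note that a degree-2 expansion of 38 features yields 779 terms (780 with the bias column), which matches the "780 polynomial features" in the submit log. The exact stage order and parameters are assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the 38-feature competition data.
X, y = make_regression(n_samples=300, n_features=38, noise=0.1, random_state=0)

# poly -> KBest -> PCA, as in the "select KBest 100, PCA 30" rows above.
pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),  # 38 -> 779 features
    ("scale", StandardScaler()),
    ("kbest", SelectKBest(f_regression, k=100)),
    ("pca", PCA(n_components=30)),
])
X_out = pipe.fit_transform(X, y)
print(X_out.shape)  # (300, 30)
```

Fitting lgb on `X_out` then reproduces the corresponding table row's setup.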

summary:

  • not too many features are needed; about 25 seems to be an acceptable value.
  • abnormally distributed features, such as 'V9', must be dropped.
  • dropping abnormal samples from the train set leads to heavy overfitting, so how to handle these samples is critical.
  • bilinear features can't simply be dropped: that weakens overfitting but also hurts performance; PCA may be a better approach.
  • constructing many features and then applying PCA to all of them is a terrible idea; applying PCA only to the bilinear features may work better.
  • selecting the K best features with f-regression looks effective; combining it with PCA may work even better.
  • the R2 metric seems to reflect, to some extent, the model's ability to resist overfitting.
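One way to act on the PCA points above: rather than dropping the collinear "bilinear" features ('V0','V6','V15','V10','V8','V27' in the 10.24 entry) or running PCA over the whole feature matrix, compress only the correlated block and leave the other columns untouched. The data below is synthetic and the block split is purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Fake data: 6 strongly correlated columns (standing in for the bilinear
# features) plus 4 independent ones.
base = rng.normal(size=(200, 1))
X_corr = base + 0.05 * rng.normal(size=(200, 6))
X_rest = rng.normal(size=(200, 4))

# Replace the correlated block with its first principal components
# and keep the remaining features as-is.
pca = PCA(n_components=2)
X_new = np.hstack([pca.fit_transform(X_corr), X_rest])
print(X_new.shape)  # (200, 6)
```

Since the compressed block keeps most of the shared variance, this should weaken the collinearity without discarding signal the way a plain drop does.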