3. Take a CNN-style approach: reshape the data to (samples, 4, features, 1).
4. Use a multi-input model.
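(1) Data Exploration
This step is essentially EDA (Exploratory Data Analysis): exploring the data to draw the conclusions that later processing and modeling depend on. Load the data with pandas and make a few simple visualizations to understand it.
The plotting functions in matplotlib and seaborn are usually sufficient.
For a numerical variable, a box plot shows its distribution at a glance.
For coordinate-type data, a scatter plot shows the overall trend and whether outliers are present.
For classification problems, plot the data with a different color for each label; this helps a great deal when constructing features.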
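As a rough illustration of what the box plot and outlier checks above look for, here is a minimal sketch that computes box-plot statistics with NumPy. The function name `box_plot_stats`, the sample values, and the conventional 1.5 x IQR whisker rule are assumptions for illustration, not taken from the original data.

```python
import numpy as np

def box_plot_stats(values, whisker=1.5):
    """Return quartiles, whisker fences, and the points a box plot would mark as outliers."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = values[(values < lower) | (values > upper)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower,
        "upper_fence": upper,
        "outliers": outliers,
    }

# Hypothetical numerical column with one obviously extreme value.
sample = [10, 12, 11, 13, 12, 11, 95]
stats = box_plot_stats(sample)
```

Points outside the fences are the ones a box plot draws individually, so they are the first candidates to inspect before modeling.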
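Feature engineering still needs to be enriched, mainly along three lines: date-time features, lag features, and window features. Only window features are considered here. Unsupervised methods such as clustering could also be tried for learning features. The best score topped out at 1.85, most likely because the feature engineering was not thorough enough: the temporal features were never processed.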
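The three feature families named above can be sketched with pandas as follows. The column names `date` and `value`, the helper `add_time_features`, and the toy monthly series are hypothetical; the original notes do not show their actual features.

```python
import pandas as pd

def add_time_features(df, value_col="value", date_col="date", lag=1, window=3):
    """Add date-time, lag, and rolling-window features to a time-ordered frame."""
    out = df.sort_values(date_col).reset_index(drop=True)
    # Date-time feature: calendar month of each observation.
    out["month"] = out[date_col].dt.month
    # Lag feature: the value observed `lag` steps earlier.
    out[f"lag_{lag}"] = out[value_col].shift(lag)
    # Window feature: mean of the previous `window` values. Shifting first
    # keeps the current value out of its own feature.
    out[f"roll_mean_{window}"] = out[value_col].shift(1).rolling(window).mean()
    return out

# Hypothetical monthly series, for illustration only.
demo = pd.DataFrame({
    "date": pd.date_range("2017-08-01", periods=5, freq="MS"),
    "value": [10.0, 12.0, 14.0, 16.0, 18.0],
})
features = add_time_features(demo)
```

The first rows of the lag and window columns are NaN because no earlier history exists; they need to be dropped or imputed before training.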
Stacking: use RF/ET/GBM/XGB as the base models, with XGB as the second layer.
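A two-layer stack of this kind might look like the sketch below. It makes several assumptions: the task is treated as regression, the data is synthetic, and scikit-learn's `GradientBoostingRegressor` stands in for XGB (both as a base model and as the second layer) so that the example needs only scikit-learn.

```python
import numpy as np
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X[:, 0] * 2.0 + X[:, 1] + rng.normal(scale=0.1, size=120)

# First layer: tree-based base models.
base_models = [
    ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
    ("et", ExtraTreesRegressor(n_estimators=20, random_state=0)),
    ("gbm", GradientBoostingRegressor(n_estimators=20, random_state=0)),
]

# Second layer: trained on out-of-fold predictions from the base models.
# GradientBoostingRegressor is a stand-in for XGB here.
stack = StackingRegressor(
    estimators=base_models,
    final_estimator=GradientBoostingRegressor(n_estimators=20, random_state=0),
    cv=3,
)
stack.fit(X[:100], y[:100])
preds = stack.predict(X[100:])
```

Training the second layer on out-of-fold predictions (the `cv` argument) keeps it from simply memorizing base-model outputs on data those models have already seen.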
Use a seq2seq model with attention (LSTM as the kernel):
input: data for months 8/9/10 -> output: data for month 11
input: data for months 9/10/11 -> output: data for month 12
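The sliding input/output pairs described above can be built as in this minimal sketch. The helper name `make_seq2seq_pairs` and the toy array are hypothetical; only the 3-months-in, 1-month-out layout comes from the notes.

```python
import numpy as np

def make_seq2seq_pairs(monthly, input_len=3, output_len=1):
    """Slide over the month axis to build (encoder input, decoder target) pairs.

    monthly: array of shape (n_months, n_features), ordered by month.
    """
    monthly = np.asarray(monthly)
    inputs, targets = [], []
    last_start = len(monthly) - input_len - output_len
    for start in range(last_start + 1):
        split = start + input_len
        inputs.append(monthly[start:split])
        targets.append(monthly[split:split + output_len])
    return np.stack(inputs), np.stack(targets)

# Hypothetical data: months 8..12, two features per month.
months = np.array([8, 9, 10, 11, 12])
data = np.stack([months, months * 10], axis=1)
enc_in, dec_out = make_seq2seq_pairs(data)
```

Here `enc_in` has shape (pairs, 3, features) and `dec_out` has shape (pairs, 1, features), which matches the sequence-in, sequence-out form an LSTM encoder-decoder takes.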
1) Define the problem at hand and the data you will be training on; collect this data or
annotate it with labels if need be.
2) Choose how you will measure success on your problem. Which metrics will you be
monitoring on your validation data?
3) Determine your evaluation protocol: hold-out validation? K-fold validation? Which
portion of the data should you use for validation?
4) Develop a first model that does better than a basic baseline: a model that has
"statistical power".
5) Develop a model that overfits.
6) Regularize your model and tune its hyperparameters, based on performance on the
validation data.
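For step 3, a K-fold protocol can be sketched as below. The helper `k_fold_indices` and the sizes are hypothetical, and the folds are shuffled, which is an assumption; with time-ordered data a chronological hold-out may be the safer choice.

```python
import numpy as np

def k_fold_indices(n_samples, k=4, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(10, k=4))
```

Averaging the validation metric over the `k` splits gives a steadier estimate than a single hold-out set when data is scarce.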
Overfitting happens easily during training: the training loss keeps decreasing while the validation loss increases. Ways to address it:
1) Add dropout.
2) Try different architectures, add or remove layers.
3) Add L1 / L2 regularization.
4) Try different hyperparameters (such as the number of units per layer, the learning rate of
the optimizer) to find the optimal configuration.
5) Optionally iterate on feature engineering: add new features, remove features that do not
seem to be informative.
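To make items 1 and 3 concrete, here is a minimal NumPy sketch of what dropout and an L2 penalty compute. The function names, rates, and arrays are hypothetical; in practice a deep learning framework provides both as built-in layers and regularizers.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest
    so the expected activation is unchanged at training time."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

def l2_penalty(weights, lam):
    """L2 regularization term added to the loss: lam * sum(w^2)."""
    return lam * float(np.sum(np.square(weights)))

rng = np.random.default_rng(0)
hidden = np.ones((4, 1000))
dropped = dropout(hidden, rate=0.5, rng=rng)
penalty = l2_penalty(np.array([1.0, -2.0, 3.0]), lam=0.01)
```

Dropout is applied only during training; at inference the activations pass through unchanged, which is why the surviving units are rescaled here.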
Fold the data for months 8, 9, and 10 into shape (samples, 4, features, 1), then treat it like an image and apply Conv2D(3,3), MaxPooling...
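The folding step can be sketched with NumPy as follows. The sizes are hypothetical, and the notes do not say what the 4 rows represent or how the flat rows are ordered, so the row-major layout used here is an assumption.

```python
import numpy as np

samples, features = 6, 5  # hypothetical sizes
flat = np.arange(samples * 4 * features, dtype=np.float32).reshape(samples, 4 * features)

# Fold each row into a 4 x features grid and add a trailing channel axis,
# giving the channels-last (samples, height, width, channels) layout that
# Keras Conv2D uses by default.
folded = flat.reshape(samples, 4, features, 1)
```

With a single channel, each sample then looks like a small grayscale image of height 4 and width `features`.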
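As shown in the figure below, a CNN extracts features from the month data while a Dense NN extracts features from the sum data.
This gives a best score of 1.845 on leaderboard A and 1.826 on leaderboard B.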