- KDD Cup data mining competition. The main task is to predict air quality (AQ) in Beijing and London over the next 48 hours.
- Seq2seq and XGBoost models were used, ranking 31st on the final leaderboard.
- The spatial distribution of the sites.
- Correlation analysis of data from different sites.
- Clustering of different kinds of stations in Beijing.
The data are preprocessed and then split into training, validation (val), and aggregation (aggr) datasets.
- Data preprocessing
Steps of data preprocessing:
- Remove duplicated data. Some hourly records are duplicated; the duplicates are removed.
- Missing value processing. If hourly data are missing for all stations for 5 or more hours in a row, all (X,y) pairs that contain these missing data in X or y are dropped. If data are missing for all stations for fewer than 5 hours in a row, the data before and after the gap are used to fill it by linear interpolation. In some cases, data for specific stations are NaN; data from the nearest station are then used to fill them.
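The gap rule in the missing-value step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `fill_short_gaps`, the `max_gap` parameter, and the choice to leave long gaps as NaN (so that the affected (X,y) pairs can be dropped later) are assumptions.

```python
import numpy as np

def fill_short_gaps(values, max_gap=5):
    """Linearly interpolate NaN runs shorter than max_gap; leave longer runs as NaN."""
    values = np.asarray(values, dtype=float)
    filled = values.copy()
    isnan = np.isnan(values)
    n = len(values)
    i = 0
    while i < n:
        if not isnan[i]:
            i += 1
            continue
        # Find the end of the current NaN run, which covers indices [i, j).
        j = i
        while j < n and isnan[j]:
            j += 1
        # Interpolation needs a valid value on both sides of the gap.
        has_neighbours = i > 0 and j < n
        if (j - i) < max_gap and has_neighbours:
            left, right = values[i - 1], values[j]
            steps = np.arange(1, j - i + 1) / (j - i + 1)
            filled[i:j] = left + steps * (right - left)
        i = j
    return filled

# A 2-hour gap is filled; a 5-hour gap is left as NaN.
series = [10.0, np.nan, np.nan, 16.0,
          np.nan, np.nan, np.nan, np.nan, np.nan, 40.0]
print(fill_short_gaps(series))
```

Gaps at the very start or end of a series have no value on one side, so this sketch leaves them unfilled as well.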
- Split the data
All data points that remain valid after preprocessing are split into 3 parts: training set, validation set, and aggregation set.
The training set is used to train the single models; it usually covers data from 20170101-20180328.
The validation set is used to select the best single models from the checkpoints of all single models. The best single models are then aggregated on the validation set and finally evaluated on the aggregation set. The aggregation model is used for the final prediction.
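A time-based split along these lines could look like the sketch below. Only the training end date (20180328) comes from the text. The boundary between the validation and aggregation sets (`VAL_END`), their chronological order, and the `(timestamp, X, y)` sample layout are hypothetical.

```python
from datetime import datetime

TRAIN_END = datetime(2018, 3, 28, 23)  # from the text: training data end on 20180328
VAL_END = datetime(2018, 4, 14, 23)    # hypothetical boundary, not given in the text

def split_by_time(samples):
    """Split (timestamp, X, y) tuples into train, validation, and aggregation lists."""
    train, val, aggr = [], [], []
    for sample in samples:
        ts = sample[0]
        if ts <= TRAIN_END:
            train.append(sample)
        elif ts <= VAL_END:
            val.append(sample)
        else:
            aggr.append(sample)
    return train, val, aggr
```

Splitting by time rather than at random keeps later data out of the training set, which matches how the models are used for forecasting.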
2.3 Oversampling
- Why oversampling?
Symmetric mean absolute percentage error (SMAPE) is the evaluation metric in this competition. In SMAPE, relative error matters rather than absolute error, because each error term is divided by the mean of the actual and predicted values.
However, the loss functions applied in the different models (L1 loss, L2 loss, and Huber loss) all aim to reduce absolute error rather than relative error. Models trained on the original data with these 3 loss functions are therefore optimized to fit data points with large values rather than data points with small values, which leads to a larger SMAPE on both the validation set and the test set.
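To make the difference between relative and absolute error concrete, here is a small sketch using the common SMAPE definition, the mean of |F - A| / ((|A| + |F|) / 2) over all points. The function name `smape` and the convention of counting a term as 0 when both values are 0 are assumptions. The competition's exact handling of edge cases may differ.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in the range [0, 2]."""
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2
        # Assumed convention: a term counts as 0 when both values are 0.
        total += 0.0 if denom == 0 else abs(f - a) / denom
    return total / len(actual)

# The same absolute error of 5 costs far more on a small value.
low = smape([10.0], [15.0])      # 5 / 12.5
high = smape([200.0], [205.0])   # 5 / 202.5
print(low, high)
```

An L1 or L2 loss treats both errors above as equal, while SMAPE penalizes the first one roughly 16 times more.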
- Oversampling strategies
Data from 20170101-20180328 are used as the training set. The oversampling steps are as follows:
- The PM2.5 mean of y is calculated for every (X,y) pair, and all data points in the training set are sorted by it in ascending order.
- The smallest oversample_part fraction of all data points is picked, repeated repeats times, and appended to the original training dataset. So (1+repeats*oversample_part) times the original amount of training data is finally used to generate training batches (X,y), which may help shift the optimization target of those loss functions toward SMAPE.
oversample_part and repeats are hyperparameters whose suitable values can be found by random search or grid search. Oversampling leads to a 0.02~0.04 improvement in SMAPE on the validation set.
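The two oversampling steps can be sketched as follows. The function name, the array shapes, and the default values `oversample_part=0.2` and `repeats=2` are assumptions chosen for illustration; they are not the tuned values from the project.

```python
import numpy as np

def oversample_low_targets(X, y, oversample_part=0.2, repeats=2):
    """Append `repeats` copies of the samples with the smallest mean target.

    X: array of shape (n, ...); y: array of shape (n, horizon) of PM2.5 targets.
    """
    X, y = np.asarray(X), np.asarray(y)
    order = np.argsort(y.mean(axis=1))        # ascending by PM2.5 mean of y
    n_low = int(len(y) * oversample_part)     # size of the smallest fraction
    low_idx = order[:n_low]
    extra = np.tile(low_idx, repeats)         # each low sample repeated `repeats` times
    all_idx = np.concatenate([np.arange(len(y)), extra])
    return X[all_idx], y[all_idx]

X = np.arange(10).reshape(10, 1)
y = (np.arange(10, dtype=float) * 10).reshape(10, 1)
X_os, y_os = oversample_low_targets(X, y)
print(len(y_os))  # 10 * (1 + 2 * 0.2) samples
```

The output would normally be shuffled before batching, so that the repeated samples are spread across batches.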
A seq2seq model is a machine learning model that uses an encoder and a decoder to learn sequential patterns from data. Seq2seq models are used in many machine learning applications, especially NLP applications such as machine translation. In this project, seq2seq is applied to generate time series forecasts at two granularities: the Day model and the Hour model. The basic graph of the seq2seq model is as follows.
- Day model
Air quality appears to follow a strong daily cycle, as shown in the 3rd part of bj_aq_data_exploration and below. So the basic seq2seq model is the Day model: we predict only the mean value of each AQ parameter for the next 2 days, and then overlay each parameter's trend over 24 hours to generate the final prediction.
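The overlay step described above can be illustrated with the sketch below. The text does not say how the 24-hour trend is combined with the predicted daily mean, so the multiplicative profile used here is an assumption, and the function names `hourly_profile` and `overlay` are hypothetical.

```python
import numpy as np

def hourly_profile(history):
    """history: array of shape (days, 24). Returns a 24-value profile with mean 1."""
    history = np.asarray(history, dtype=float)
    daily_mean = history.mean(axis=1, keepdims=True)
    # Assumes no day has a mean of 0, which would divide by zero.
    return (history / daily_mean).mean(axis=0)

def overlay(daily_means, profile):
    """Expand predicted daily means into hourly values of shape (days * 24,)."""
    return np.outer(daily_means, profile).ravel()

rng = np.random.default_rng(0)
history = rng.uniform(10, 100, size=(30, 24))  # synthetic data for illustration only
profile = hourly_profile(history)
hourly_forecast = overlay([50.0, 80.0], profile)  # 2 predicted daily means
print(hourly_forecast.shape)
```

Because the profile averages to 1, each day's 24 hourly values keep the predicted daily mean. An additive profile (hourly value minus daily mean) would be an equally plausible reading of the text.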
(Figure panels: PM2.5, PM10, O3, NO2.)
The computation graph of the Day model is as follows.
- Hour model, predicting 2 days together
- Hour model, predicting 1 day at a time