- x_train: (8437, 114)
- x_test: (1428, 114)
- y_train: (8437, 3)
- y_test: (1428, 3)

x_train and x_test columns:

name | type |
---|---|
key | int64 |
date | object |
x1 - x112 | object |

y_train columns:

name | type |
---|---|
key | int64 |
date | object |
y | object |

y_test columns:

name | type |
---|---|
key | int64 |
date | object |
y | float64 (EMPTY template for predictions) |

All columns x1-x112 contain continuous data, so repeated values look suspicious. The table below counts the zeros, NaNs, and repeated values in each column.
column | zero | nan | repeating |
---|---|---|---|
x1 | 6104 | 637 | 6741 |
x2 | 6082 | 644 | 6726 |
x3 | 6114 | 640 | 6754 |
x4 | 6118 | 642 | 6760 |
x5 | 271 | 5561 | 5832 |
x6 | 262 | 5531 | 5793 |
x7 | 28 | 657 | 685 |
x8 | 27 | 657 | 684 |
x9 | 48 | 637 | 685 |
x10 | 23 | 650 | 971 |
x11 | 266 | 5569 | 5835 |
x12 | 1465 | 635 | 2100 |
x13 | 72 | 637 | 709 |
x14 | 2413 | 635 | 3048 |
x15 | 27 | 660 | 687 |
x16 | 44 | 642 | 686 |
x17 | 198 | 6392 | 6686 |
x18 | 278 | 5554 | 5832 |
x19 | 1482 | 640 | 2122 |
x20 | 68 | 642 | 710 |
x21 | 2451 | 641 | 3092 |
x22 | 2457 | 633 | 3340 |
x23 | 51 | 637 | 688 |
x24 | 6121 | 637 | 6758 |
x25 | 269 | 5519 | 5788 |
x26 | 1479 | 635 | 2114 |
x27 | 71 | 637 | 708 |
x28 | 2427 | 635 | 3062 |
x29 | 27 | 652 | 679 |
x30 | 47 | 633 | 680 |
x31 | 6124 | 633 | 6757 |
x32 | 272 | 5553 | 5825 |
x33 | 1482 | 632 | 2114 |
x34 | 77 | 633 | 710 |
x35 | 2457 | 633 | 3090 |
x36 | 27 | 659 | 686 |
x37 | 47 | 640 | 687 |
x38 | 6102 | 640 | 6742 |
x39 | 268 | 5541 | 5809 |
x40 | 1489 | 638 | 2127 |
x41 | 73 | 640 | 713 |
x42 | 2450 | 638 | 3088 |
x43 | 24 | 657 | 681 |
x44 | 43 | 640 | 683 |
x45 | 6137 | 634 | 6771 |
x46 | 270 | 5553 | 5823 |
x47 | 1471 | 636 | 2107 |
x48 | 70 | 640 | 710 |
x49 | 2428 | 638 | 3066 |
x50 | 27 | 648 | 675 |
x51 | 45 | 631 | 676 |
x52 | 6118 | 631 | 6749 |
x53 | 262 | 5568 | 5830 |
x54 | 1455 | 633 | 2088 |
x55 | 72 | 631 | 703 |
x56 | 2420 | 633 | 3053 |
x57 | 24 | 653 | 677 |
x58 | 47 | 633 | 680 |
x59 | 6113 | 633 | 6746 |
x60 | 271 | 5550 | 5821 |
x61 | 1447 | 629 | 2076 |
x62 | 69 | 633 | 702 |
x63 | 2456 | 632 | 3088 |
x64 | 25 | 645 | 670 |
x65 | 47 | 627 | 674 |
x66 | 6084 | 627 | 6711 |
x67 | 267 | 5569 | 5836 |
x68 | 1450 | 628 | 2078 |
x69 | 73 | 627 | 700 |
x70 | 2434 | 627 | 3061 |
x71 | 23 | 645 | 668 |
x72 | 46 | 626 | 672 |
x73 | 6108 | 626 | 6734 |
x74 | 261 | 5568 | 5829 |
x75 | 1454 | 624 | 2078 |
x76 | 72 | 626 | 698 |
x77 | 2435 | 625 | 3060 |
x78 | 26 | 638 | 664 |
x79 | 42 | 623 | 665 |
x80 | 6131 | 623 | 6754 |
x81 | 255 | 5575 | 5830 |
x82 | 1480 | 622 | 2102 |
x83 | 68 | 623 | 691 |
x84 | 2427 | 623 | 3050 |
x85 | 24 | 661 | 685 |
x86 | 44 | 642 | 686 |
x87 | 6105 | 642 | 6747 |
x88 | 66 | 634 | 1028 |
x89 | 1449 | 640 | 2089 |
x90 | 70 | 642 | 712 |
x91 | 2406 | 641 | 3047 |
x92 | 24 | 654 | 678 |
x93 | 45 | 637 | 682 |
x94 | 6113 | 637 | 6750 |
x95 | 264 | 5566 | 5830 |
x96 | 1453 | 635 | 2088 |
x97 | 68 | 637 | 705 |
x98 | 2430 | 636 | 3066 |
x99 | 27 | 664 | 691 |
x100 | 49 | 644 | 693 |
x101 | 42 | 634 | 961 |
x102 | 267 | 5599 | 5866 |
x103 | 1450 | 640 | 2090 |
x104 | 70 | 644 | 714 |
x105 | 2453 | 641 | 3094 |
x106 | 23 | 641 | 664 |
x107 | 46 | 620 | 666 |
x108 | 6137 | 620 | 6757 |
x109 | 1463 | 632 | 2712 |
x110 | 1469 | 617 | 2086 |
x111 | 73 | 620 | 693 |
x112 | 2454 | 619 | 3073 |
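
A minimal sketch of how counts like those in the table above could be produced with pandas. The helper name `summarize_columns` and the toy frame are hypothetical, and the definition of "repeating" (the number of entries whose value, NaN included, occurs more than once in the column) is an assumption, since the exact rule is not stated here. Because the feature columns have dtype `object`, the sketch first coerces them to numbers; with `errors="coerce"`, any non-numeric entries are also counted as NaN.

```python
import numpy as np
import pandas as pd


def summarize_columns(df):
    """Count zeros, NaNs, and repeated entries per column (hypothetical helper)."""
    rows = {}
    for col in df.columns:
        # Object columns are coerced to floats; unparseable entries become NaN.
        s = pd.to_numeric(df[col], errors="coerce")
        counts = s.value_counts(dropna=False)
        rows[col] = {
            "zero": int((s == 0).sum()),
            "nan": int(s.isna().sum()),
            # Assumed definition: entries whose value occurs more than once.
            "repeating": int(counts[counts > 1].sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy data standing in for the feature columns.
toy = pd.DataFrame(
    {
        "x1": [0.0, 0.0, np.nan, 1.5, 2.5],
        "x2": [0.3, 0.3, np.nan, np.nan, 0.0],
    }
)
summary = summarize_columns(toy)
print(summary)
```

On the real data this would be called on the feature columns only, for example `summarize_columns(x_train_df.drop(columns=["key", "date"]))`.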
Every row and every feature column contains at least one NaN value.
# Drop every row that contains a NaN: no rows remain.
wo_nan0 = x_train_df.dropna(axis=0)
wo_nan0.shape == (0, 114)
# Drop every column that contains a NaN: only 'key' and 'date' remain.
wo_nan1 = x_train_df.dropna(axis=1)
wo_nan1.shape == (8437, 2)
# Count how often each y value occurs, then list the values that occur more than once.
yrv = y_train_df["y"].value_counts()
yrv[yrv > 1].index
y value | count |
---|---|
0.0000 | 130 |
10000.0000 | 71 |
2444.4444 | 9 |
2666.6667 | 6 |
3111.1111 | 6 |
2222.2222 | 6 |
1777.7778 | 5 |
3777.7778 | 5 |
2888.8889 | 4 |
586.9793 | 4 |
4000.0000 | 4 |
1555.5556 | 4 |
3555.5556 | 4 |
1383.9462 | 3 |
3021.6735 | 3 |
2000.0000 | 3 |
1333.3333 | 3 |
3333.3333 | 3 |
5000.0000 | 2 |
1107.1570 | 2 |
444.4444 | 2 |
460.4279 | 2 |
1146.6983 | 2 |
593.1198 | 2 |
1344.4049 | 2 |
2807.4338 | 2 |
1700.2768 | 2 |
2570.1858 | 2 |
1067.6157 | 2 |
2685.8536 | 2 |
2214.3140 | 2 |
222.2222 | 2 |
3084.2230 | 2 |
1426.5290 | 2 |
1739.8181 | 2 |
2259.8222 | 2 |
1660.7355 | 2 |
1726.2846 | 2 |
303.7405 | 2 |
4448.9080 | 2 |
3433.7827 | 2 |
4726.4841 | 2 |
5322.1656 | 2 |
888.8889 | 2 |
3451.1435 | 2 |
2635.9911 | 2 |
2308.0562 | 2 |
4666.6667 | 2 |
666.5863 | 2 |
2098.1688 | 2 |
1916.4038 | 2 |
1720.3714 | 2 |
355.8719 | 2 |
1463.0289 | 2 |
237.2479 | 2 |
1557.2125 | 2 |
790.8264 | 2 |
672.2025 | 2 |
4557.5097 | 2 |
293.1369 | 2 |
2727.8311 | 2 |
5266.6638 | 1 |
... | 1 |
y_train dubious values: [10000.0, 2444.444444, 2666.666667, 3111.111111, 2222.222222, 1777.777778, 3777.777778, 2888.888889, 4000.0, 1555.555556, 3555.555556, 2000.0, 1333.333333, 3333.333333, 5000.0, 444.4444444]
The plot of the y_train distribution shows an anomaly at the values 0 and 10000, which are also the bounds of the interval containing all y_train values. This makes me think that values outside the interval were clipped to its bounds.
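
To illustrate the clipping hypothesis, here is a small sketch on synthetic data. The normal distribution and its parameters are arbitrary assumptions chosen only so that some values fall outside [0, 10000]; they are not estimates from y_train. The point is that clipping moves every out-of-range value onto a bound, which produces spikes exactly at the interval borders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "raw" target with some mass outside [0, 10000] (assumed, not real data).
raw = rng.normal(loc=3000.0, scale=3000.0, size=10_000)

lower, upper = 0.0, 10_000.0
clipped = np.clip(raw, lower, upper)

# Every value that was outside the interval now sits exactly on a bound.
n_lower = int((clipped == lower).sum())
n_upper = int((clipped == upper).sum())
print("values at lower bound:", n_lower)
print("values at upper bound:", n_upper)
```

If the real target was produced this way, the counts at 0 and 10000 would equal the number of raw values that fell below and above the interval.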
distribution | sumsquare_error | aic | bic | kl_div |
---|---|---|---|---|
beta | 1.370405e-08 | 1909.951166 | -220331.161274 | 0.007121 |
nakagami | 1.448949e-08 | 1912.762703 | -219887.114378 | 0.007551 |
johnsonsb | 1.590840e-08 | 1909.449062 | -219118.668745 | 0.008402 |
burr12 | 1.798824e-08 | 1916.217251 | -218119.852460 | 0.009306 |
rayleigh | 2.699707e-08 | 1918.448148 | -214837.403326 | 0.013449 |
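
A rough sketch of how a `sumsquare_error` score of this kind can be computed with SciPy. It assumes the score is the sum of squared differences between the fitted density and a normalized histogram of the data; the helper name, the number of bins, and the synthetic Rayleigh sample are all assumptions made for illustration, so the numbers will not match the table above.

```python
import numpy as np
from scipy import stats


def sumsquare_error(data, dist_name, bins=100):
    """Fit a SciPy distribution by name and score it against the histogram."""
    dist = getattr(stats, dist_name)
    params = dist.fit(data)  # maximum-likelihood fit of shape, loc, scale
    density, edges = np.histogram(data, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    pdf = dist.pdf(centers, *params)
    return float(np.sum((pdf - density) ** 2))


rng = np.random.default_rng(0)
sample = rng.rayleigh(scale=2000.0, size=2000)  # synthetic stand-in for y_train

errors = {name: sumsquare_error(sample, name) for name in ("rayleigh", "nakagami")}
best = min(errors, key=errors.get)
print(errors, best)
```

A lower score means the fitted density follows the histogram more closely, which is how the rows of the table are ranked.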
The correlation matrix showed 16 blocks of closely correlated columns.
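
One possible way to extract such blocks programmatically is to link every pair of columns whose absolute correlation exceeds a threshold and then collect the connected groups. The sketch below does this with a small union-find; the helper name `correlated_blocks`, the threshold of 0.9, and the toy columns are assumptions, not the procedure used for the count above.

```python
import numpy as np
import pandas as pd


def correlated_blocks(df, threshold=0.9):
    """Group columns linked by |correlation| >= threshold (hypothetical helper)."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    parent = {c: c for c in cols}

    def find(c):
        # Follow parent links to the group representative, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for c in cols:
        groups.setdefault(find(c), []).append(c)
    # Keep only groups with at least two columns.
    return [g for g in groups.values() if len(g) > 1]


rng = np.random.default_rng(0)
base_a, base_b = rng.normal(size=500), rng.normal(size=500)
toy = pd.DataFrame(
    {
        "a1": base_a,
        "a2": base_a + 0.05 * rng.normal(size=500),
        "b1": base_b,
        "b2": base_b + 0.05 * rng.normal(size=500),
        "c": rng.normal(size=500),
    }
)
blocks = correlated_blocks(toy)
print(blocks)
```

`DataFrame.corr` ignores NaN pairwise, so the helper can be applied before imputation; the number of blocks found depends on the chosen threshold.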
Initially I tried several imputation strategies: SimpleImputer (strategies mean, median, most_frequent) and KNNImputer (n_neighbors=3, 5, 7). With the mean strategy, the PCA algorithm failed to converge. All the others converged, so I chose SimpleImputer with the median strategy.
PCA then indicated that n_components=41 is needed to keep 95% of the explained variance.
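import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
# Assumed setup (not shown in the original); xs is assumed to hold the feature columns x1-x112.
imputer = SimpleImputer(strategy="median")
pca = PCA()
imputer.fit(xs)
x_train_filled = imputer.transform(xs)
pca.fit(x_train_filled)
# Smallest number of components whose cumulative explained variance exceeds 95%.
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum > 0.95) + 1
d
>>> d=41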
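
The comparison of imputers can be sketched end to end on synthetic data. Everything in the block below is an assumption made for illustration (the toy matrix, the 10% missing rate, the chosen subset of imputers), so the resulting component counts say nothing about the real dataset. It also shows that passing a float to `PCA(n_components=...)` lets scikit-learn pick the number of components for a given variance share directly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)

# Toy data: 8 features driven by 3 latent factors, with about 10% of entries missing.
latent = rng.normal(size=(300, 3))
toy = latent @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(300, 8))
toy[rng.random(toy.shape) < 0.1] = np.nan

imputers = {
    "median": SimpleImputer(strategy="median"),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    "knn5": KNNImputer(n_neighbors=5),
}

components = {}
for name, imputer in imputers.items():
    filled = imputer.fit_transform(toy)
    cumsum = np.cumsum(PCA().fit(filled).explained_variance_ratio_)
    # Smallest number of components whose cumulative variance exceeds 95%.
    components[name] = int(np.argmax(cumsum > 0.95) + 1)
print(components)

# Equivalent shortcut: a float n_components selects the count automatically.
filled_median = imputers["median"].fit_transform(toy)
pca95 = PCA(n_components=0.95).fit(filled_median)
print(pca95.n_components_)
```

Comparing the counts across imputers gives a quick check of how sensitive the chosen dimensionality is to the imputation strategy.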