- Author: Yuan Gao
- Academic Standing: Freshman
- Result: Winner
- In this contest I was given a huge CSV file (the raw file can be downloaded here) with almost 10 million rows of data and many missing values. I cleaned up these columns; the cleaned CSV can be downloaded here.
- Language: Python 3.6 and Jupyter Notebook
- Libraries used: Pandas, Scikit-learn, NumPy, and Matplotlib
- Supervised learning: Linear Regression; Support Vector Regression could also be used
- Others: IPython kernel memory management, object-oriented programming
- Null values in `elet`, `lat`, and `lon` are filled with data from Google Maps, which ensures their accuracy.
- Null values in `temp`, `stp`, `dewp`, and `hmdy` are filled with the monthly mean value. For instance, if `temp` is missing at `2001-11-01`, I fill it with the mean of all non-missing November values (see the first sketch after this list).
- Null values in `gbrd`, and part of `temp`, are filled using linear regression, since about 50% of those values are missing; the score of the regression model for `gbrd` is almost 60%, which is acceptable for imputing missing values (sketched after this list).
- Null values in `wdsp`, `wdct`, and `gust` are filled using the previous five values, with a minimum threshold to deal with consecutive NaNs (sketched after this list).
- Null values in `tmax`, `tmin`, `smax`, `smin`, `dmax`, and `dmin` are filled by adding an offset to the corresponding base column. For instance, I fill `tmax` by first computing the mean difference between non-missing `tmax` and `temp`, then adding that number to `temp` in rows missing `tmax` (sketched after this list).
- Dropped values: a group in `wsnm` called `JACAREPAGUA` in Rio de Janeiro contains mostly (more than 99%) NaN values, which provide no useful information, so instead of filling those rows I dropped them (sketched after this list).
Go to nbviewer for better rendering of this project!