w207_OutBrain_Cluster-F

Outbrain click prediction competition on Kaggle.

To run from locally and explore machine learning approaches:

1) Download csv's from kaggle and put into folder ../input  

2) Run Data_EDA_Loading_and_Features.ipynb  

	- Note this only includes a sample of some of the documentation we got

	- Page_views.csv is 100GB and >2 billion rows, so we used page_views_sample instead

3) Then you can run Machine_Learning_Final.ipynb which will use the training and dev data set up by the Data_EDA_Features.ipynb.

	- The GridSearches and Random Forest models are SLOW

To load full data for Kaggle Submission:

1) The full version of the data loading code is in Archive/Final_Code_FULL.py 

2)This script can be run on an AMI instance that should be large size and connected to a volume that is at least 300 GB. 

	-It will still take a while to run and should be run in the background through `nohup python Final_Code_FULL.py &` to make sure sleep doesn't kill the process

sarahkelley/w207_OutBrain_Cluster-F

w207_OutBrain_Cluster-F