This repository contains code for processing the data and building the model for the Truly Native? Kaggle competition organized by Dato.
Main objective: predict whether the content in an HTML file was sponsored or not on StumbleUpon.
Follow these steps to set up the project and run it on your local machine.
- Python 3.7+
- Run `pip install -r requirements.txt` to install the dependencies
- Download the `*.zip` files that contain the raw HTML from Kaggle and place them under the `data` folder
.
├── data # Data files
│ ├── raw # Raw zip files downloaded from Kaggle
│ ├── csv # Transformed csv files
│ └── html_targets.csv # Targets csv file downloaded from Kaggle
├── models # Models, EDA, hyperparameter tuning Jupyter notebooks
│ ├── eda.ipynb # Exploratory analysis on the processed dataset
│ ├── hp_tuning.ipynb # Hyperparameter tuning for selected models
│ └── models_eval.ipynb # Model evaluation with the best parameters
├── app.py # Streamlit app
├── process_raw_html.py # Extract features from zip files
└── ...
- Run `python3 process_raw_html.py` to extract features from the zip files
- Run `hp_tuning.ipynb` for hyperparameter tuning with randomized search
- Run `models_eval.ipynb` to evaluate the final model and save it to a pickle file
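
To give a feel for the feature-extraction step, here is a minimal sketch of reading HTML files from a zip archive and deriving a few simple features using only the standard library. It is not the code in `process_raw_html.py`: the `TagCounter` class, the `extract_features` function, and the chosen features (tag counts and document length) are hypothetical and used only for illustration.

```python
import io
import zipfile
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts a few tag types in one HTML document (illustrative features only)."""

    def __init__(self):
        super().__init__()
        self.counts = {"a": 0, "img": 0, "script": 0}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in self.counts:
            self.counts[tag] += 1


def extract_features(zip_source):
    """Return one feature dict per file in the archive.

    zip_source may be a path or a binary file-like object.
    """
    rows = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            html = archive.read(name).decode("utf-8", errors="ignore")
            parser = TagCounter()
            parser.feed(html)
            parser.close()
            rows.append({"file": name, "length": len(html), **parser.counts})
    return rows


if __name__ == "__main__":
    # Build a tiny in-memory archive so the sketch runs without the Kaggle data.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("example.html", "<html><a href='#'>link</a><img src='x.png'></html>")
    buffer.seek(0)
    print(extract_features(buffer))
```

The resulting list of dicts could be written to a CSV file with `csv.DictWriter` or loaded into a DataFrame, which is roughly the shape the notebooks would need as input.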