Kaggle Competition: Truly Native?

This repository contains the code for processing the data and building the model for the Truly Native? Kaggle competition organized by Dato.

Main objective: predict whether the content of an HTML file from StumbleUpon is sponsored.

Project setup

Follow these steps to set up and run the project on your local machine.

Prerequisites

Python 3.7+
pip install -r requirements.txt
Download the *.zip files containing the raw HTML from Kaggle and place them under the data/raw folder

Expected directory structure

.
├── data                   # Data files
│   ├── raw                # Raw zip files downloaded from Kaggle
│   ├── csv                # Transformed csv files
│   └── html_targets.csv   # Targets csv file downloaded from Kaggle
├── models                 # Jupyter notebooks: EDA, hyperparameter tuning, model evaluation
│   ├── eda.ipynb          # Exploratory analysis on the processed dataset
│   ├── hp_tuning.ipynb    # Hyperparameter tuning for selected models
│   └── models_eval.ipynb  # Model evaluation with the best parameters
├── app.py                 # Streamlit app
├── process_raw_html.py    # Extract features from zip files
└── ...
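
Before running anything, it can help to confirm that the input files sit where the tree above expects them. The following is a minimal sketch, not part of the repository: the helper name missing_paths and the choice of which paths to check are assumptions made for illustration, and the paths themselves are copied from the tree.

```python
from pathlib import Path

# Input paths copied from the directory tree above, relative to the repo root.
# Which of them must exist before the first run is an assumption.
EXPECTED_PATHS = [
    "data/raw",
    "data/html_targets.csv",
    "process_raw_html.py",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under root (hypothetical helper)."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Directory layout looks complete.")
```

Run it from the repository root; any path it reports should be created or downloaded before moving on.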

Running the project

Run python3 process_raw_html.py to extract features from zip files
Run hp_tuning.ipynb for hyperparameter tuning with randomized search
Run models_eval.ipynb to evaluate the final model and save it to a pickle file
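
To give a feel for the first step, here is a rough sketch of pulling simple per-file features out of a zip archive of HTML pages using only the standard library. It is not the code in process_raw_html.py: the class TagCounter, the functions html_features and features_from_zip, and the four features chosen (tag, link, and script counts plus text length) are all assumptions made for illustration.

```python
import zipfile
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags, links, scripts, and non-markup characters."""

    def __init__(self):
        super().__init__()
        self.n_tags = 0
        self.n_links = 0
        self.n_scripts = 0
        self.text_len = 0

    def handle_starttag(self, tag, attrs):
        self.n_tags += 1
        if tag == "a":
            self.n_links += 1
        elif tag == "script":
            self.n_scripts += 1

    def handle_data(self, data):
        # Includes inline script and style text, which the parser reports as data.
        self.text_len += len(data.strip())


def html_features(html):
    """Return a dict of illustrative features for one HTML string."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return {
        "n_tags": parser.n_tags,
        "n_links": parser.n_links,
        "n_scripts": parser.n_scripts,
        "text_len": parser.text_len,
    }


def features_from_zip(zip_source):
    """Return one feature row per file in the archive (path or file-like object)."""
    rows = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            html = archive.read(name).decode("utf-8", errors="replace")
            row = {"file": name}
            row.update(html_features(html))
            rows.append(row)
    return rows
```

The resulting rows could be written out with csv.DictWriter and joined with the labels in html_targets.csv by file name, assuming that file is keyed by file name.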
