The dataset comes from TalkingData, a Chinese data service platform, and was released for the Ad Tracking Fraud Detection contest.
The goal is to identify which mobile app ad clicks are legitimate, based on how likely a click is to result in an actual app download. The remaining clicks are suspected fraud.
The dataset is highly imbalanced: of almost 190 million rows, only about half a million correspond to actual app downloads.
- Feature engineering
- Subsampling
- Class rebalancing
- PySpark
- SQL Library
- ML Classification Library
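
As a rough illustration of the subsampling and class rebalancing items above, the sketch below downsamples the majority class with PySpark's `DataFrame.sampleBy`. It is not taken from this project's code: the helper names `downsample_fractions` and `rebalance`, the label column name `is_attributed`, and the 1:1 target ratio are all assumptions made for the example.

```python
def downsample_fractions(class_counts, ratio=1.0):
    """Compute per-class sampling fractions for majority downsampling.

    class_counts maps each label to its row count. The smallest class is
    kept whole; every other class is sampled down to roughly
    ratio * (size of the smallest class) rows.
    """
    minority = min(class_counts.values())
    return {
        label: min(1.0, ratio * minority / count)
        for label, count in class_counts.items()
    }


def rebalance(df, label_col="is_attributed", ratio=1.0, seed=42):
    """Return a stratified subsample of a PySpark DataFrame.

    Assumes df has a label column (hypothetically named "is_attributed",
    1 for a download and 0 otherwise).
    """
    # Count rows per class; this triggers one pass over the data.
    counts = {
        row[label_col]: row["count"]
        for row in df.groupBy(label_col).count().collect()
    }
    fractions = downsample_fractions(counts, ratio)
    # sampleBy draws each row independently with its class's fraction,
    # so the resulting class sizes are approximate, not exact.
    return df.sampleBy(label_col, fractions=fractions, seed=seed)
```

With counts on the order of those described above, nearly all download rows would be kept while only a small fraction of the non-download rows is sampled. A larger `ratio` keeps more of the majority class if a fully balanced set discards too much data.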