pyspark-adtracking
About the dataset
The dataset is from TalkingData, a Chinese data service platform, for the Ad Tracking Fraud Detection contest.
The goal is to identify which mobile-app ad clicks are legitimate, based on how likely a click is to result in an actual app download. Clicks that are unlikely to lead to a download are suspected fraud.
The dataset is highly imbalanced: of almost 190 million rows, only about half a million correspond to actual app downloads.
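
To put the imbalance in numbers, here is a back-of-the-envelope sketch in Python. It uses the rounded figures quoted above (190 million rows, half a million downloads), not exact counts from the data, and the variable names are illustrative.

```python
# Rounded figures quoted in this README, not exact dataset counts.
total_clicks = 190_000_000
downloads = 500_000

non_downloads = total_clicks - downloads

# Share of clicks that ended in a download (the positive class).
positive_rate = downloads / total_clicks

# How many non-download clicks there are for each download.
negatives_per_positive = non_downloads / downloads

print(f"positive rate: {positive_rate:.4%}")
print(f"non-download clicks per download: {negatives_per_positive:.0f}")
```

With these rounded inputs the positive class is roughly 0.26% of all rows, so a model that always predicts "no download" would already be right about 99.7% of the time. Plain accuracy is therefore a poor guide on this data, which is part of the motivation for rebalancing.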
Techniques used
- Feature engineering
- Subsampling
- Class rebalancing
- PySpark
- Spark SQL library
- Spark ML classification library
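
As an illustration of the subsampling and class rebalancing items above, here is a minimal plain-Python sketch of majority-class downsampling on toy data. It is not the code used in this repository: the function `downsample_majority`, the `ratio` parameter, and the label column name `is_attributed` are assumptions made for the example.

```python
import random


def downsample_majority(rows, label_key="is_attributed", ratio=1.0, seed=42):
    """Keep every positive row and a random subset of the negatives.

    `ratio` is the desired number of negatives per positive.
    All names here are hypothetical, chosen for illustration only.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]

    # Never ask for more negatives than exist.
    n_keep = min(len(negatives), int(len(positives) * ratio))
    kept_negatives = rng.sample(negatives, n_keep)

    balanced = positives + kept_negatives
    rng.shuffle(balanced)  # avoid all positives sitting at the front
    return balanced


# Toy data: 1,000 clicks, 10 of which led to a download.
clicks = [
    {"click_id": i, "is_attributed": 1 if i % 100 == 0 else 0}
    for i in range(1000)
]

balanced = downsample_majority(clicks, ratio=1.0)
n_pos = sum(r["is_attributed"] for r in balanced)
n_neg = len(balanced) - n_pos
print(n_pos, n_neg)
```

On a Spark DataFrame the same idea is commonly expressed with `DataFrame.sampleBy`, which samples each label value at its own fraction. Downsampling discards most of the majority class, so it trades some information for a smaller, more balanced training set.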