/customer-purchase-price-prediction

👕 Customer consumption prediction analysis project based on transaction logs

Primary language: Jupyter Notebook. License: MIT.

Competition: predicting future purchase amounts from log data

  • Source code for the tabular data classification competition I took part in during P stage 2 of the Boostcamp AI Tech course (1st cohort).
  • Competition period: 2021.04. (2 weeks)

Competition description

  • A project that uses online transaction log data to predict and analyze customers' future spending.
  • A binary classification problem: using data for 5,914 customers from December 2009 to November 2011, predict the probability that each customer's total purchase amount in December 2011 exceeds 300.
  • The label is 1 if the December 2011 total purchase amount exceeds 300 and 0 otherwise. (Predictions are made per customer.)
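The labelling rule above can be sketched with pandas as follows. This is a minimal illustration, not the competition's own code: the helper name `make_label` and the toy values are assumptions, and the column names `customer_id`, `order_date`, and `total` are taken from the dataset's column list.

```python
import pandas as pd

THRESHOLD = 300  # purchase-amount cutoff stated in the task


def make_label(df: pd.DataFrame, year_month: str, threshold: float = THRESHOLD) -> pd.DataFrame:
    """Return one row per customer with a 0/1 label for the given month.

    Hypothetical helper: a customer gets label 1 when their summed `total`
    in `year_month` (e.g. "2011-12") is strictly greater than `threshold`.
    Customers with no orders in that month get a total of 0 and label 0.
    """
    df = df.copy()
    df["year_month"] = pd.to_datetime(df["order_date"]).dt.strftime("%Y-%m")

    # Sum the purchase amount per customer inside the target month only.
    monthly = (
        df.loc[df["year_month"] == year_month]
        .groupby("customer_id")["total"]
        .sum()
    )

    # Reindex over every customer seen in the log so absent customers get 0.
    customers = pd.Index(sorted(df["customer_id"].unique()), name="customer_id")
    label = monthly.reindex(customers, fill_value=0.0).rename("total").reset_index()
    label["label"] = (label["total"] > threshold).astype(int)
    return label


# Toy transaction log (made-up values, for illustration only).
log = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 3, 3],
        "order_date": ["2011-12-01", "2011-12-05", "2011-12-02", "2011-11-30", "2011-12-03"],
        "total": [200.0, 150.0, 300.0, 999.0, -20.0],
    }
)
labels = make_label(log, "2011-12")
print(labels)
```

Customer 2 spends exactly 300 and is labelled 0 here, because the task says the amount must exceed 300.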

Results

  • ROC-AUC: 0.8601
  • Rank: 18th (18/96)

๋ฐ์ดํ„ฐ ์„ค๋ช…

  • 2009๋…„ 12์›”๋ถ€ํ„ฐ 2011๋…„ 11์›”๊นŒ์ง€์˜ ์˜จ๋ผ์ธ ์ƒ์ ์˜ ๊ฑฐ๋ž˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์ฃผ์–ด์ง
  • 2011๋…„ 11์›” ๊นŒ์ง€ ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•˜์—ฌ 2011๋…„ 12์›”์˜ ๊ณ ๊ฐ ๊ตฌ๋งค์•ก 300์ดˆ๊ณผ ์—ฌ๋ถ€๋ฅผ ์˜ˆ์ธกํ•ด์•ผ ํ•จ
  • Unique Customer_id : 5914๋ช…
  • Customer ๋‹น ๋กœ๊ทธ ์ˆ˜ : 1๊ฐœ ~ 12714๊ฐœ

๋ฐ์ดํ„ฐ ์ปฌ๋Ÿผ ์„ค๋ช…

  • order_id : ์ฃผ๋ฌธ ๋ฒˆํ˜ธ. ๋ฐ์ดํ„ฐ์—์„œ ๊ฐ™์€ ์ฃผ๋ฌธ๋ฒˆํ˜ธ๋Š” ๋™์ผ ์ฃผ๋ฌธ์„ ๋‚˜ํƒ€๋ƒ„
  • product_id : ์ƒํ’ˆ ๋ฒˆํ˜ธ
  • description : ์ƒํ’ˆ ์„ค๋ช…
  • quantity : ์ƒํ’ˆ ์ฃผ๋ฌธ ์ˆ˜๋Ÿ‰
  • order_date : ์ฃผ๋ฌธ ์ผ์ž
  • price : ์ƒํ’ˆ ๊ฐ€๊ฒฉ
  • customer_id : ๊ณ ๊ฐ ๋ฒˆํ˜ธ
  • country : ๊ณ ๊ฐ ๊ฑฐ์ฃผ ๊ตญ๊ฐ€
  • total : ์ด ๊ตฌ๋งค์•ก(quantity X price)

Evaluation method

  • AUC(Area Under Curve)
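As a reminder of what the metric measures, ROC-AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below computes it directly from that definition. The function name `roc_auc` and the toy scores are illustrative only. In practice `sklearn.metrics.roc_auc_score` computes the same quantity.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as 0.5.
    This O(P * N) loop is meant for illustration, not for large data.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```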

Architecture used

  • ML algorithm used: LightGBM
  • Hyperparameters
    model_params = {
        'objective': 'binary',  # binary classification
        'boosting_type': 'gbdt',
        'metric': 'auc',  # evaluation metric
        'feature_fraction': 0.8,  # fraction of features sampled per tree
        'bagging_fraction': 0.8,  # fraction of rows sampled
        'bagging_freq': 1,  # re-sample rows at every iteration
        'n_estimators': 10000,  # maximum number of trees
        'early_stopping_rounds': 100,  # stop when the validation metric stalls for 100 rounds
        'seed': SEED,  # random-seed constant (its value is not given here)
        'verbose': -1,  # suppress log output
        'n_jobs': -1,  # use all available CPU cores
    }

Feature engineering & Other methods

Each feature is listed below with its description and the intention behind it.

1. cumsum
  • Description: Computes the cumulative sum of each existing feature.
  • Intention: At any given point in time, future spending is unknown, so aggregation functions such as mean or sum should be influenced only by values before that point. Features that use only earlier values therefore seemed appropriate, so I added them.
2. order_ts
  • Description: Total sum of the most recent purchase (last) and total sum of the first purchase (first).
  • Intention: I expected the most recent purchase total to influence the target month.
3. order_ts_plus
  • Description: Total sum of the positive values only in the most recent purchase (last) and in the first purchase (first).
  • Intention: Added to check what effect including negative values has.
4. mode
  • Description: Uses the most frequent value (mode) of each feature as a new feature.
5. cycle_1224
  • Description: Uses the mean of the total amounts each user purchased 1 year (12 months) ago and 2 years (24 months) ago. No aggregation function is applied.
  • Intention: This shows how likely a customer is to purchase 300 or more in a given month each year, so it seemed appropriate as a feature.
6. trend
  • Description: Applies aggregation functions per customer to the data going back N months and uses the results as features.
    • Applied only to price, quantity, and total.
    • Windows (months): [1, 2, 3, 5, 7, 12, 20, 23]
    • These aggregations are not applied together with the existing ones. They are computed separately and appended to the dataframe at the end.
  • Intention: Used to capture the long-term trend, i.e. whether the curve is rising, falling, or flat. It differs from the existing features in that it looks only at the most recent N months instead of summing all earlier data.
7. seasonality
  • Description: Splits the data into intervals and applies aggregation functions so that the model can learn periodicity.
    • Data is grouped as (1-3 months ago), (6-8 months ago), (12-14 months ago), (18-20 months ago), and so on, and aggregated per customer.
    • Periods: [1, 6, 12, 18]
  • Intention: December, the month to be predicted, shows fairly large fluctuations, so I considered it important for the model to learn this periodicity as well.
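The cumsum, trend, and seasonality ideas above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's implementation: the function names, the use of `sum` as the only aggregation, the 3-month width of each seasonal band (inferred from the ranges listed), the generated column names, and the toy data are all hypothetical.

```python
import pandas as pd

TARGET = pd.Timestamp("2011-12-01")           # month to be predicted
COLS = ["price", "quantity", "total"]         # columns the trend feature uses
TREND_WINDOWS = [1, 2, 3, 5, 7, 12, 20, 23]   # from the feature list
SEASON_STARTS = [1, 6, 12, 18]                # from the feature list
SEASON_WIDTH = 3                              # assumed: (1-3), (6-8), ...


def add_months_before(df, target=TARGET):
    """Attach `months_before`: how many months each order precedes the target month."""
    out = df.copy()
    out["order_date"] = pd.to_datetime(out["order_date"])
    out["months_before"] = (target.year - out["order_date"].dt.year) * 12 + (
        target.month - out["order_date"].dt.month
    )
    return out


def cumsum_features(df, cols=COLS):
    """Running per-customer sums in date order, so each row sees only the past."""
    out = df.sort_values(["customer_id", "order_date"]).copy()
    for col in cols:
        out[f"{col}_cumsum"] = out.groupby("customer_id")[col].cumsum()
    return out


def window_sums(df, lo, hi, suffix, cols=COLS):
    """Per-customer sums over orders placed between `lo` and `hi` months before the target."""
    band = df[df["months_before"].between(lo, hi)]
    return band.groupby("customer_id")[cols].sum().add_suffix(suffix)


def trend_features(df, customers, windows=TREND_WINDOWS):
    """One block of sums per window covering the most recent `w` months."""
    parts = [window_sums(df, 1, w, f"_sum_last{w}m") for w in windows]
    return pd.concat(parts, axis=1).reindex(customers).fillna(0.0)


def seasonality_features(df, customers, starts=SEASON_STARTS, width=SEASON_WIDTH):
    """One block of sums per seasonal band, e.g. 6 to 8 months before the target."""
    parts = [window_sums(df, s, s + width - 1, f"_sum_{s}to{s + width - 1}m") for s in starts]
    return pd.concat(parts, axis=1).reindex(customers).fillna(0.0)


# Toy log (made-up values, for illustration only).
log = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2],
        "order_date": ["2011-11-15", "2011-06-10", "2010-12-01", "2011-10-01"],
        "quantity": [2, 1, 4, 3],
        "price": [50.0, 50.0, 50.0, 10.0],
    }
)
log["total"] = log["quantity"] * log["price"]
log = add_months_before(log)

customers = pd.Index(sorted(log["customer_id"].unique()), name="customer_id")
cum = cumsum_features(log)
trend = trend_features(log, customers)
season = seasonality_features(log, customers)
print(trend[["total_sum_last1m", "total_sum_last12m"]])
print(season.filter(like="total"))
```

Computing each window separately and concatenating at the end mirrors the note that these aggregations are appended to the dataframe after the fact.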
  • Also applied: Quantile Transform
    • Not a feature; a preprocessing step
    • Used for data scaling
    • Rescales the variables to the 0 to 1 range, which has the advantage of faster processing

Feature importance


A record of the competition process and the architecture used is posted on Notion as a wrap-up report.