/gfk

Data

Main info

Variable           Description
period_end_date    date of provided record
translated_when    date of modelling
if_data_corrected  information whether record was updated
prod_gr_id         Id of product group
country_id_n       Id of a country
delivery_type_id   type of record delivery to the system
freq_id            type of regularity of delivered records
retailer_id        Id of record provider
brand_id           Id of a Brand
predict_automatch  result of model prediction 0/1
class_acctual      actual class of predicted value 0/1

Size

11 columns

19697 rows

Description

Variable           NaN count  NaN ratio   Role         Type         Note
period_end_date    57         0.00289384  explanatory  date         18 days (from 30 Aug till 1 Dec)
translated_when    0          0           explanatory  date + time  154 days (from 1 Sep till 1 Feb)
if_data_corrected  0          0           explanatory  categorical  value counts: 0: 17085, 1: 2612
prod_gr_id         0          0           explanatory  categorical  value counts: 413: 4486, 426: 11844, 427: 3367
country_id_n       1292       0.0655937   explanatory  categorical  35 entries
delivery_type_id   1335       0.0677768   explanatory  categorical  915 entries
freq_id            0          0           explanatory  categorical  value counts: 1: 7763, 2: 11934
retailer_id        0          0           explanatory  categorical  52 entries
brand_id           0          0           explanatory  categorical  199 entries
predict_automatch  329        0.0167031   output
class_acctual      0          0           output
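
The NaN count and NaN ratio columns above can be reproduced with a few lines of pandas. This is a minimal sketch rather than the notebook's actual code: the helper name `nan_summary` and the toy frame are assumptions made for illustration, and only the column names come from the table.

```python
import pandas as pd


def nan_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the per-column NaN count and NaN ratio (hypothetical helper)."""
    nan_count = df.isna().sum()
    return pd.DataFrame({
        "nan_count": nan_count,
        "nan_ratio": nan_count / len(df),
    })


# Toy data for illustration only; these are not values from the real dataset.
toy = pd.DataFrame({
    "country_id_n": [106.0, None, 109.0, 113.0],
    "freq_id": [1, 2, 2, 1],
})
summary = nan_summary(toy)
print(summary)
```

Run against the full frame, the same helper should give one row per variable, with the ratio equal to the count divided by the number of rows.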

Material

Suggestions

  • rename class_acctual to class_actual

  • drop the rows where translated_when falls after 1 Dec 2020

df = df[df['translated_when'].dt.normalize() <= '2020-12-01']  # keeps every record from 1 Dec itself
  • countries with IDs 106 and 109 are very poorly correlated

  • cross-tabulation showed:

    • a split between if_data_corrected and period_end_date, except on 1 Nov 2020

    • a split between country_id_n and prod_gr_id, except for countries 106, 108, 113, 116 and 176

    • a split between retailer_id and freq_id
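
The suggestions above could be applied roughly as follows. This is a hedged sketch: the toy frame, the helper name `apply_suggestions`, and the choice to keep records dated exactly 1 Dec 2020 are assumptions, while the column names and the cut-off date come from the text. `pd.crosstab` is the standard pandas call for this kind of cross-tabulation.

```python
import pandas as pd


def apply_suggestions(df: pd.DataFrame) -> pd.DataFrame:
    """Rename the misspelled column and drop rows translated after 1 Dec 2020."""
    df = df.rename(columns={"class_acctual": "class_actual"})
    # normalize() sets the time to midnight, so any time on 1 Dec is kept.
    keep = df["translated_when"].dt.normalize() <= pd.Timestamp("2020-12-01")
    return df[keep]


# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "translated_when": pd.to_datetime(
        ["2020-09-01 08:00", "2020-12-01 17:30", "2021-02-01 09:15"]
    ),
    "retailer_id": [1, 2, 2],
    "freq_id": [1, 2, 2],
    "class_acctual": [0, 1, 1],
})
cleaned = apply_suggestions(toy)

# Cross-tabulate two categorical columns, as in the retailer_id vs freq_id check.
table = pd.crosstab(cleaned["retailer_id"], cleaned["freq_id"])
print(table)
```

A cell count of zero in such a table means that combination of values never occurs together, which is one way to read the "split" noted in the list.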