Product pricing gets even harder at scale, considering just how many products are sold online. Clothing has strong seasonal pricing trends and is heavily influenced by brand names, while electronics have fluctuating prices based on product specs.
Mercari, Japan’s biggest community-powered shopping app, knows this problem deeply. They’d like to offer pricing suggestions to sellers, but this is tough because their sellers can put just about anything, or any bundle of things, on Mercari's marketplace.
This model automatically suggests product prices given user-supplied text descriptions of products, including details like product category name, brand name, and item condition.
The files consist of a list of product listings and are tab-delimited (train.tsv and test.tsv). A minimal loading sketch follows the field list below.
- train_id or test_id - the id of the listing
- name - the title of the listing. Note that we have cleaned the data to remove text that looks like prices (e.g. $20) to avoid leakage. These removed prices are represented as [rm]
- item_condition_id - the condition of the items provided by the seller
- category_name - category of the listing
- brand_name - the brand of the item, where provided
- price - the price that the item was sold for, in USD. This is the target variable that you will predict; this column doesn't exist in test.tsv.
- shipping - 1 if shipping fee is paid by seller and 0 by buyer
- item_description - the full description of the item. Note that we have cleaned the data to remove text that looks like prices (e.g. $20) to avoid leakage. These removed prices are represented as [rm]
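
A minimal loading sketch with pandas, assuming the files sit in the project root (the paths are assumptions):

```python
import pandas as pd

# The listings are tab-delimited, so sep='\t' is required.
train = pd.read_csv("train.tsv", sep="\t")
test = pd.read_csv("test.tsv", sep="\t")

# train contains the target column 'price'; test does not.
print(train.shape, test.shape)
print(train.columns.tolist())
```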
Though a separate EDA notebook with a detailed description is included in this project, the interactive graphs do not render properly in GitHub's IPython notebook viewer. Please refer to the following link for the complete analysis: https://www.kaggle.com/nehaytamore/my-eda-of-mercari
-
For categorical variables:
- Shipping : Shipping information is present for all products. The EDA showed that products whose shipping fee is paid by the seller tend to fall in a slightly higher price range than the others. For modeling, shipping is represented as a binary variable.
- Item condition id : The condition id is a categorical variable taking values 1 to 5. From a price perspective it is tricky to deduce which id corresponds to the most used or unused condition, since the variation in prices is driven largely by the product being sold. The condition id could be one-hot encoded, but instead it is embedded into a 5-dimensional space so that the network can learn additional structure, if any; the model also performs slightly better when embeddings are used for the item condition id (see the model sketch after the text-features list below).
- Brand name (feature imputation) : Brand name is the only feature with almost half of its values missing. The missing values were filled by a simple but effective string-matching pass over the item name and description, which improved RMSLE by about 0.02 (a sketch of the idea appears after this list).
- Category : Splitting the category into its sub-levels adds some value to the model, but the time vs. RMSLE trade-off did not justify including it in this model.
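
The exact string-matching algorithm for brand imputation is in the notebook; the sketch below is only one possible interpretation, assuming missing brands are recovered by scanning the item name and description for tokens that match an already-known brand (single-token brands only):

```python
import pandas as pd

def impute_brand(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing brand_name values by string matching against known brands."""
    known_brands = set(df["brand_name"].dropna().unique())

    def find_brand(row):
        if pd.notna(row["brand_name"]):
            return row["brand_name"]
        text = str(row["name"]) + " " + str(row["item_description"])
        # Naive scan: return the first token that matches a known brand.
        for token in text.split():
            if token in known_brands:
                return token
        return "missing"

    df["brand_name"] = df.apply(find_brand, axis=1)
    return df
```

Multi-word brands would need a slightly smarter match, e.g. checking whether each known brand appears as a substring of the name.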
-
Text features : The text features were embedded into appropriately sized spaces and GRUs were used to extract information from them (see the model sketch after this list):
- name : The length of the name has some correlation with the price of the product.
- item description : Using the item description length does not harm performance either (future scope : using YAKE to extract keywords from the description could be helpful here).
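
A minimal Keras sketch of the architecture described above: GRUs over the tokenized name and item description, a 5-dimensional embedding for the item condition id, and shipping as a raw binary input. The vocabulary size, sequence lengths, GRU units, and dense-layer width are placeholders, not the values used in the project; the brand and category inputs are omitted for brevity.

```python
from tensorflow.keras import layers, Model

# Placeholder sizes -- the real vocabulary size, sequence lengths and
# embedding dimensions used in the project may differ.
NAME_LEN, DESC_LEN = 10, 75
VOCAB_SIZE, TEXT_EMB_DIM = 50000, 32

# Text inputs: tokenized name and item_description sequences.
name_in = layers.Input(shape=(NAME_LEN,), name="name")
desc_in = layers.Input(shape=(DESC_LEN,), name="item_description")
name_emb = layers.Embedding(VOCAB_SIZE, TEXT_EMB_DIM)(name_in)
desc_emb = layers.Embedding(VOCAB_SIZE, TEXT_EMB_DIM)(desc_in)

# GRUs extract a fixed-size representation from each text feature.
name_gru = layers.GRU(8)(name_emb)
desc_gru = layers.GRU(16)(desc_emb)

# item_condition_id (values 1-5) embedded into a 5-dimensional space.
cond_in = layers.Input(shape=(1,), name="item_condition_id")
cond_emb = layers.Flatten()(layers.Embedding(6, 5)(cond_in))

# shipping is used as a plain binary input.
ship_in = layers.Input(shape=(1,), name="shipping")

x = layers.concatenate([name_gru, desc_gru, cond_emb, ship_in])
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(1)(x)  # predicts the transformed (scaled log) price

model = Model([name_in, desc_in, cond_in, ship_in], out)
model.compile(loss="mse", optimizer="adam")
```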
-
Numerical variable :
- price : A major improvement in model performance was observed after transforming the price variable. The distribution of the target is skewed, so taking the log and then applying a min-max scaler improved the score by about 20% (see the sketch below).
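
A sketch of the target transformation, assuming log1p (to tolerate zero prices) followed by scikit-learn's MinMaxScaler; the project may use a plain log instead:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The price distribution is right-skewed: compress it with log1p,
# then scale into [0, 1] so the regression target is well-conditioned.
scaler = MinMaxScaler()
y_log = np.log1p(train["price"].values).reshape(-1, 1)
y_scaled = scaler.fit_transform(y_log)

# Invert both steps to turn model predictions back into dollar prices.
def inverse_price(pred_scaled):
    return np.expm1(scaler.inverse_transform(pred_scaled.reshape(-1, 1)))
```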
-
There were some duplicate products, but removing them did not improve the RMSLE score, so deduplication was left out of the baseline model.