Tomato Price Prediction

tl;dr

File System

tomato_price_prediction/ #Home Directory
  |-images/
  |-static/
    |-style.css #CSS file for the web app
  |-templates/
    |-home.html #HTML code for the home page
    |-predict.html #HTML code for the predict page
  |-Scrapper.ipynb #Python Notebook for web scraping code
  |-api.py #Flask API
  |-app.py #Flask Web App
  |-code.ipynb #Python notebook with EDA and model development code
  |-prediction_model.py #functions used in api.py

Note: Click here to access the pre-trained ML model.

Technologies Used

  • Python
  • Pandas
  • Plotly
  • Machine Learning
  • Flask
  • HTML, CSS
  • Selenium
  • Beautiful Soup

Data

Data used in this application was scraped from the Agricultural Marketing website of the Government of India using Selenium and Beautiful Soup.
The data consists of 35,544 entries of tomato prices in Karnataka from Jan-01-2015 to Feb-01-2021, covering different districts and the markets within those districts.
The first five entries in the data set are:

   District Name  Market Name   Commodity  Variety  Grade  Min Price (Rs./Quintal)  Max Price (Rs./Quintal)  Modal Price (Rs./Quintal)  Price Date
0  Davangere      Davangere     Tomato     Tomato   FAQ    400                      600                      500                        2015-01-01
1  Davangere      Honnali       Tomato     Tomato   FAQ    800                      1000                     900                        2015-01-01
2  Kolar          Srinivasapur  Tomato     Tomato   FAQ    465                      1335                     935                        2015-01-01
3  Bangalore      Channapatana  Tomato     Tomato   FAQ    1000                     1400                     1200                       2015-01-01
4  Shimoga        Shimoga       Tomato     Tomato   FAQ    400                      600                      500                        2015-01-01

Data Analysis

Looking at the rolling average with a 30-day window, we can observe that tomato prices in Karnataka follow a seasonal trend:
  • There are two major spikes in prices each year. The first is a sharp rise around June-July. It is followed by a second, lower spike in December.
  • The lowest prices are observed in the year 2018.
  • The highest peaks are observed in the year 2016 and 2017.
Another observable trend is that the average modal price of tomatoes per quintal in Bangalore is higher than in the rest of the state.
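
The 30-day rolling average mentioned above can be sketched with Pandas roughly as follows. This is a minimal illustration, not the code from code.ipynb: the helper name rolling_modal_price, the choice of a time-based "30D" window (rather than 30 rows), and the averaging of all markets per day are assumptions, and the sample values are illustrative only.

```python
import pandas as pd

# Hypothetical sample in the same shape as the scraped data (illustrative values).
df = pd.DataFrame({
    "Price Date": pd.to_datetime(
        ["2015-01-01", "2015-01-01", "2015-01-02", "2015-01-03"]
    ),
    "Modal Price (Rs./Quintal)": [500, 900, 700, 800],
})


def rolling_modal_price(frame, window="30D"):
    """Average the modal price per day, then smooth it with a rolling window."""
    daily = (
        frame.groupby("Price Date")["Modal Price (Rs./Quintal)"]
        .mean()
        .sort_index()
    )
    # A time-based window needs a sorted DatetimeIndex, which groupby provides here.
    return daily.rolling(window).mean()


smoothed = rolling_modal_price(df)
print(smoothed)
```

Plotting the resulting series (for example with Plotly) is what makes the mid-year and December spikes visible.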

Model

A Random Forest Regression model was used as the prediction model. The presence of categorical variables suits the base estimator (decision trees), and Random Forest, being a bagging algorithm, is robust to variation in the variable values.

Pipeline(steps=[('column_transformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('cat',
                                                  OneHotEncoder(drop='first',
                                                                sparse=False),
                                                  ['District Name',
                                                   'Market Name', 'Variety',
                                                   'Grade']),
                                                 ('scale', MinMaxScaler(),
                                                  ['year', 'month',
                                                   'day of the month',
                                                   'day of the week'])])),
                ('rfr', RandomForestRegressor(n_estimators=300))])
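
A pipeline with this structure could be assembled roughly as shown below. This is a sketch under several assumptions: the four date columns are derived from Price Date (the helper add_date_features is hypothetical), the modal price is taken as the target (the README does not say which price is predicted), and the five sample rows come from the table in the Data section. The printed pipeline passes sparse=False to OneHotEncoder; newer scikit-learn releases rename that argument to sparse_output, so the sketch leaves it at its default to stay version-independent.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

CATEGORICAL = ["District Name", "Market Name", "Variety", "Grade"]
DATE_FEATURES = ["year", "month", "day of the month", "day of the week"]


def add_date_features(frame):
    """Hypothetical helper: derive the four date columns from 'Price Date'."""
    out = frame.copy()
    dates = pd.to_datetime(out["Price Date"])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["day of the month"] = dates.dt.day
    out["day of the week"] = dates.dt.dayofweek  # Monday = 0
    return out


def build_pipeline(n_estimators=300):
    """One-hot encode categoricals, min-max scale date parts, fit a random forest."""
    transformer = ColumnTransformer(
        remainder="passthrough",
        transformers=[
            ("cat", OneHotEncoder(drop="first"), CATEGORICAL),
            ("scale", MinMaxScaler(), DATE_FEATURES),
        ],
    )
    return Pipeline(
        steps=[
            ("column_transformer", transformer),
            ("rfr", RandomForestRegressor(n_estimators=n_estimators)),
        ]
    )


# First five rows of the data set, as listed in the Data section.
rows = pd.DataFrame({
    "District Name": ["Davangere", "Davangere", "Kolar", "Bangalore", "Shimoga"],
    "Market Name": ["Davangere", "Honnali", "Srinivasapur", "Channapatana", "Shimoga"],
    "Variety": ["Tomato"] * 5,
    "Grade": ["FAQ"] * 5,
    "Price Date": ["2015-01-01"] * 5,
    "Modal Price (Rs./Quintal)": [500, 900, 935, 1200, 500],
})

features = add_date_features(rows)[CATEGORICAL + DATE_FEATURES]
target = rows["Modal Price (Rs./Quintal)"]  # assumed target column

# Fewer trees than the 300 above, only to keep the toy example fast.
model = build_pipeline(n_estimators=10).fit(features, target)
predictions = model.predict(features)
```

Five rows are far too few to train a useful model; the fit here only demonstrates that the pieces connect.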

Model Performance

The evaluation metric used to check model performance was Mean Absolute Error (MAE).
On the test data, the model's Mean Absolute Error was 175.86.
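
For reference, Mean Absolute Error is the average of the absolute differences between actual and predicted values. The sketch below computes it in plain Python; the function name mae and the sample numbers are illustrative and are not taken from the project's test set. scikit-learn's sklearn.metrics.mean_absolute_error computes the same quantity.

```python
def mae(actual, predicted):
    """Mean of |actual - predicted| over paired observations."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative values only, not real model output.
actual_prices = [500, 900, 935]
predicted_prices = [600, 850, 1000]
error = mae(actual_prices, predicted_prices)
print(round(error, 2))
```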

Screenshots