- Scraped Tomato prices in Karnataka from Jan-01-2015 to Feb-01-2021 from the Agricultural Marketing website of the Government of India
- Trained a Random Forest Regression model
- Developed a Flask API
- Developed a web app using Flask
tomato_price_prediction/ # Home directory
|-images/
|-static/
|  |-style.css # CSS file for the web app
|-templates/
|  |-home.html # HTML code for the home page
|  |-predict.html # HTML code for the predict page
|-Scrapper.ipynb # Python notebook with the web scraping code
|-api.py # Flask API
|-app.py # Flask web app
|-code.ipynb # Python notebook with EDA and model development code
|-prediction_model.py # Functions used in api.py
Note: Click here to access the pre-trained ML model.
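As a rough illustration of how `api.py` and `prediction_model.py` might fit together, here is a minimal Flask sketch. The `/predict` route, the `predict_price` helper, and the `predicted_price` response key are all assumptions made for this example and are not taken from the repository; the helper returns a constant so the sketch runs without the pre-trained model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_price(features: dict) -> float:
    """Hypothetical stand-in for a helper in prediction_model.py.

    A real implementation would load the pre-trained pipeline and call
    its predict method; a constant keeps this sketch self-contained.
    """
    return 500.0


@app.route("/predict", methods=["POST"])  # route name is an assumption
def predict():
    # Expect a JSON object such as {"District Name": "Kolar", ...}
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        return jsonify({"error": "expected a JSON object"}), 400
    return jsonify({"predicted_price": predict_price(payload)})
```

The endpoint can be exercised without starting a server through `app.test_client()`.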
- Python
- Pandas
- Plotly
- Machine Learning
- Flask
- HTML, CSS
- Selenium
- Beautiful Soup
Data used in this application was scraped from the Agricultural Marketing website of the Government of India using Selenium and Beautiful Soup.
The data consists of 35544 entries of Tomato prices in Karnataka from Jan-01-2015 to Feb-01-2021, covering different districts and the markets within those districts.
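The parsing half of such a scraper can be sketched as follows. This is only an illustration: the HTML fragment, the table `id`, and the `parse_price_table` function are hypothetical, and the real page's markup will differ. In a Selenium-driven run, the HTML string would typically come from the driver's `page_source` after the page has loaded.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment; the real website's markup will differ.
SAMPLE_HTML = """
<table id="prices">
  <tr><th>District Name</th><th>Market Name</th><th>Modal Price (Rs./Quintal)</th></tr>
  <tr><td>Davangere</td><td>Davangere</td><td>500</td></tr>
  <tr><td>Kolar</td><td>Srinivasapur</td><td>935</td></tr>
</table>
"""


def parse_price_table(html: str) -> list:
    """Turn the first HTML table into a list of {header: cell} dicts."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == len(header):  # skip malformed or spacer rows
            records.append(dict(zip(header, cells)))
    return records


records = parse_price_table(SAMPLE_HTML)
```

Cell values come back as strings, so prices and dates still need to be converted before analysis.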
First five entries in the data set are:
| | District Name | Market Name | Commodity | Variety | Grade | Min Price (Rs./Quintal) | Max Price (Rs./Quintal) | Modal Price (Rs./Quintal) | Price Date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Davangere | Davangere | Tomato | Tomato | FAQ | 400 | 600 | 500 | 2015-01-01 |
| 1 | Davangere | Honnali | Tomato | Tomato | FAQ | 800 | 1000 | 900 | 2015-01-01 |
| 2 | Kolar | Srinivasapur | Tomato | Tomato | FAQ | 465 | 1335 | 935 | 2015-01-01 |
| 3 | Bangalore | Channapatana | Tomato | Tomato | FAQ | 1000 | 1400 | 1200 | 2015-01-01 |
| 4 | Shimoga | Shimoga | Tomato | Tomato | FAQ | 400 | 600 | 500 | 2015-01-01 |
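The model pipeline uses `year`, `month`, `day of the month`, and `day of the week` as inputs, which presumably are derived from `Price Date`. A minimal pandas sketch of that derivation is shown here; the exact encoding used in the notebook (for example, whether Monday is 0) is an assumption.

```python
import pandas as pd

# Three of the sample rows above, reduced to the columns needed here.
df = pd.DataFrame({
    "Modal Price (Rs./Quintal)": [500, 900, 935],
    "Price Date": ["2015-01-01", "2015-01-01", "2015-01-01"],
})

dates = pd.to_datetime(df["Price Date"])
df["year"] = dates.dt.year
df["month"] = dates.dt.month
df["day of the month"] = dates.dt.day
df["day of the week"] = dates.dt.dayofweek  # pandas convention: Monday=0 ... Sunday=6
```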
- Prices show two major spikes each year: a sharp rise around June-July, followed by a second, smaller spike in December.
- The lowest prices are observed in the year 2018.
- The highest peaks are observed in the years 2016 and 2017.
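Observations like these can be checked by averaging prices per calendar month and per year. The sketch below shows the grouping only; the prices in it are made-up illustrative values, not figures from the scraped data set.

```python
import pandas as pd

# Made-up illustrative values, NOT taken from the real data set.
df = pd.DataFrame({
    "Price Date": pd.to_datetime(
        ["2015-01-01", "2015-06-15", "2015-07-01", "2015-12-10", "2016-07-05"]
    ),
    "Modal Price (Rs./Quintal)": [500, 1800, 2200, 1200, 3000],
})

price = "Modal Price (Rs./Quintal)"
# Mean price per calendar month (1-12), pooled across years, shows seasonality.
monthly_mean = df.groupby(df["Price Date"].dt.month)[price].mean()
# Mean price per year shows which years were cheap or expensive overall.
yearly_mean = df.groupby(df["Price Date"].dt.year)[price].mean()
```

On the full data set, `monthly_mean.idxmax()` and `yearly_mean.idxmin()` would point to the peak month and the cheapest year respectively.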
A Random Forest regression model was used as the prediction model. Most of the input variables are categorical, which suits the decision-tree base estimators, and because Random Forest is a bagging algorithm, it is robust to widely varying variable values.
Pipeline(steps=[('column_transformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('cat',
                                                  OneHotEncoder(drop='first',
                                                                sparse=False),
                                                  ['District Name',
                                                   'Market Name', 'Variety',
                                                   'Grade']),
                                                 ('scale', MinMaxScaler(),
                                                  ['year', 'month',
                                                   'day of the month',
                                                   'day of the week'])])),
                ('rfr', RandomForestRegressor(n_estimators=300))])
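A pipeline with the structure printed above could be built roughly as follows. This is a sketch, not the notebook's code: the toy training frame reuses the five sample rows, the choice of modal price as the target is an assumption, and `random_state` is added only to make the example repeatable. Because the `sparse` argument of `OneHotEncoder` was renamed `sparse_output` in scikit-learn 1.2, the sketch omits it and uses `sparse_threshold=0` on the `ColumnTransformer` to get dense output on any version.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

categorical = ["District Name", "Market Name", "Variety", "Grade"]
date_features = ["year", "month", "day of the month", "day of the week"]

preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(drop="first"), categorical),
        ("scale", MinMaxScaler(), date_features),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
model = Pipeline(steps=[
    ("column_transformer", preprocess),
    ("rfr", RandomForestRegressor(n_estimators=300, random_state=0)),
])

# Toy training frame built from the five sample rows; far too small for real use.
X = pd.DataFrame({
    "District Name": ["Davangere", "Davangere", "Kolar", "Bangalore", "Shimoga"],
    "Market Name": ["Davangere", "Honnali", "Srinivasapur", "Channapatana", "Shimoga"],
    "Variety": ["Tomato"] * 5,
    "Grade": ["FAQ"] * 5,
    "year": [2015] * 5,
    "month": [1] * 5,
    "day of the month": [1] * 5,
    "day of the week": [3] * 5,
})
y = pd.Series([500, 900, 935, 1200, 500])  # modal prices; target choice is assumed

model.fit(X, y)
predictions = model.predict(X)
```

With the default `handle_unknown="error"`, predicting for a district or market that was absent from the training data raises an error, so a deployed model needs every category to appear in the training set.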
The evaluation metric used to check model performance was Mean Absolute Error (MAE).
On the test data, the model achieved an MAE of 175.86.
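For reference, MAE is the average of the absolute differences between actual and predicted values, so it is expressed in the same units as the target. A small worked sketch follows; the predicted values are invented for illustration and are not model output.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


actual = [500, 900, 935]
predicted = [450, 1000, 935]  # invented predictions, for illustration only
mae = mean_absolute_error(actual, predicted)  # (50 + 100 + 0) / 3 = 50.0
```

scikit-learn provides the same calculation as `sklearn.metrics.mean_absolute_error`.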