Uber_Data_Analysis: A Jupyter Notebook repository from nandinee33

UBER DATA ANALYSIS

Objective:

The objective is to first explore hidden or previously unknown information by applying exploratory data analytics on the dataset and to know the effect of each field on price with every other field of the dataset. Then we apply different machine learning models to complete the analysis. After this, the results of applied machine learning models were compared and analyzed on the basis of accuracy, and then the best performing model was suggested for further predictions of the label ‘Price’.

DataSet:

For this project we are taken the dataset from kaggle in which there are 57 columns and approx 6 lakh rows and the file is in csv format. you can view or download the dataset through this link: https://www.kaggle.com/brllrb/uber-and-lyft-dataset-boston-ma

Application of this Project:

We use machine learning algorithms to predict the price of Uber, so that it is easy for the company to do analysis on price based on certain features.

Conclusion:

Before working on features first we need to know about the data insights which we get to know by EDA. Apart from that, we visualize the data by drawing various plots, due to which we understand that we don’t have any data for taxi’s price, also the price variations of other cabs and different types of weather. Other value count plots show the type and amount of data the dataset has. After this, we convert all categorical values into continuous data type and fill price Nan by the median of other values. Then the most important part of feature selection came which was done with the help of recursive feature elimination. With the help of RFE, the top 25 features were selected. Among those 25 features still, there are some features which we think are not that important to predict the price so we drop them and left with 8 important columns. We apply four different models on our remaining dataset among which Decision Tree, Random Forest, and Gradient Boosting Regressor prove best with 96%+ accuracy on training for our model. This means the predictive power of all these three algorithms in this dataset with the chosen features is very high but in the end, we go with random forest because it does not prone to overfitting and design a function with the help of the same model to predict the price.

nandinee33/Uber_Data_Analysis