Briefly describe your approach to this problem and the steps you took
In this problem we are given data that was probably collected during car servicing. The features cover various car parameters such as odometer reading, inspection time, battery health, etc. Here are the steps I took for this problem:
Import the necessary libraries, which are listed in requirements.txt.
Import the data, visualize it, and review the summary statistics.
Remove the data that has more than 20% missing values.
Convert categorical features to one-hot encoded form, calculate used time from the registration year and the inspection date year, etc.
Select the independent and dependent variables as X and y.
Split the data into train and test sets, and fit models like linear regression, random forest, k-means, etc.
Visualize the confusion matrix for each label.
Perform outlier analysis using KNN.
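The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on toy data, not the actual notebook code: the column names (`odometer_reading`, `registration_year`, `inspection_date`, `fuel_type`, `engine_comment`, `rating`), the values, and the helper `preprocess` are all assumptions, and the 20% threshold is assumed to apply per column.

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "odometer_reading": [42000, 87000, 15000, 120000, 64000],
    "registration_year": [2015, 2012, 2019, 2008, 2014],
    "inspection_date": ["2021-03-14", "2021-06-02", "2021-01-20",
                        "2021-09-11", "2021-04-30"],
    "fuel_type": ["Petrol", "Hybrid", "Electric", "Petrol + CNG", "Petrol + LPG"],
    "engine_comment": [None, "noisy", None, None, None],  # 80% missing
    "rating": [4, 3, 5, 2, 4],
})


def preprocess(frame, max_missing=0.20):
    """Drop sparse columns, derive used time, and one-hot encode fuel type."""
    # Share of missing values per column; keep columns at or below the threshold.
    missing_share = frame.isna().mean()
    kept = frame.loc[:, missing_share <= max_missing].copy()

    # "Used time" in years: inspection year minus registration year.
    inspection_year = pd.to_datetime(kept["inspection_date"]).dt.year
    kept["used_time"] = inspection_year - kept["registration_year"]
    kept = kept.drop(columns=["inspection_date", "registration_year"])

    # One-hot encode the categorical column (fuel_type_Petrol, fuel_type_Hybrid, ...).
    return pd.get_dummies(kept, columns=["fuel_type"])


processed = preprocess(df)

# Independent variables X and dependent variable y.
X = processed.drop(columns=["rating"])
y = processed["rating"]
```

Here `engine_comment` is dropped because 80% of its values are missing, while `X` keeps the numeric columns plus one indicator column per fuel type.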
              Precision  Recall  F1
weighted avg  0.44       0.45    0.45
As we see from the weighted avg score, our model did quite well, given that we use only half of the features provided.
b. I use simple linear regression as a baseline, which gives a weighted F1 score of 0.32, so we can say that our model works well by comparison.
c. Confusion matrix and weighted F1 score, because the F1 score accounts for both precision and recall.
d. The first issue, which is common in classification problems, is the very low number of samples for classes 1 and 3.
Also, data is missing for more than half of the features.
e. ... important? Why? What visualizations help you understand the data?
Features like odometer reading and "used time" are among the most important, which also matches real life: these features are responsible for the good or bad health of a car.
Features like 'fuel_type_Electric', 'fuel_type_Hybrid', 'fuel_type_Petrol', 'fuel_type_Petrol + CNG', and 'fuel_type_Petrol + LPG' are also important.
If we had more data on the engine transmission comments, we could run some form of sentiment analysis that assigns a score to each comment, which we could then use as a feature for rating prediction.
b. As I describe in part a, we have 4 features related to comments, so we can produce a sentiment score for each text; for missing comments we can assign a score of 0, treating them as neutral sentiment for this task.
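A minimal sketch of that idea is shown below. It uses a tiny hand-written word list purely for illustration; the lexicon, the function name `sentiment_score`, and the sample comments are all assumptions, and a real solution would more likely use a trained sentiment model. The part that matters here is that missing or empty comments get a neutral score of 0.

```python
# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"good", "smooth", "clean", "fine"}
NEGATIVE = {"noisy", "leak", "worn", "faulty", "bad"}


def sentiment_score(comment):
    """Return a score in [-1, 1]; missing or empty comments score 0 (neutral)."""
    if comment is None or not str(comment).strip():
        return 0.0
    # Lowercase each word and strip surrounding punctuation.
    words = [w.strip(".,;:!?").lower() for w in str(comment).split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0  # no sentiment-bearing words found
    return (pos - neg) / (pos + neg)


# Made-up comments, including a missing and an empty one.
comments = ["Engine runs smooth, gearbox fine", None, "Oil leak and noisy clutch", ""]
scores = [sentiment_score(c) for c in comments]
```

Applying the same function to each of the 4 comment columns would give 4 numeric features that can be passed to the rating model alongside the others.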