/DS_Project

Fourth year data science project

Primary language: Jupyter Notebook

Data Science Course Project

Team members:

Name
Robert Mounir
Bishoy Atef
Mohamed Ahmed
Nour eldin Mostafa

Outline

  • Idea
  • Problem
  • Dataset
  • Data preparation
  • Data description/exploration
  • Building models & Interpreting the results
  • Testing
  • Team contribution

Idea

AutoPricing: Navigating the American Car Market with Data

As a new car brand, we are planning to introduce our latest car model in the market. However, we are currently in the process of determining the optimal pricing strategy for maximizing profitability.

Problem

When deciding on pricing for our new car brand, we must take into account the product features and quality that customers value and are willing to pay a premium for. It's crucial to consider these factors as they can impact our pricing strategy significantly. As a new brand, we should be particularly mindful of how we price our models in relation to our competitors while also keeping in mind the value that our products offer to potential customers.

Dataset

The automobile industry produces an ever wider range of vehicles, described by attributes such as make, year, category, fuel type, engine volume, mileage, number of cylinders, color, airbags, and more, while consumers look for a vehicle that meets their specific needs and stays within their budget. To address this pricing challenge, we use a training dataset from Kaggle with 19,237 records covering these car features.

Link: https://www.kaggle.com/datasets/sidharth178/car-prices-dataset?select=train.csv

Data preparation

  1. Pre-processing:
  • For the Levy column, I had to remove the “-” entries found in the CSV, as they indicate a missing or wrong value.
  • For Engine volume, I had to remove the word “Turbo” found in the column so that it can be stored as a float column, and add a column named “turbo” with the values True and False.
  • Remove the spaces found in Engine volume.
  1. Organisation & Cleaning:
  • Drop the columns Doors, Color, and Wheel, as we don’t use them.
  1. Visualisation:
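
The pre-processing and cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `preprocess` is hypothetical, the column spellings ("Levy", "Engine volume", "Doors", "Color", "Wheel") are assumed to match the CSV, and turning the “-” entries into missing values (rather than dropping the rows) is an assumption.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the cleaning steps described above."""
    df = df.copy()

    # Levy: "-" (and anything else non-numeric) becomes NaN.
    # Assumption: the placeholder is treated as a missing value.
    df["Levy"] = pd.to_numeric(df["Levy"], errors="coerce")

    # Engine volume: record the turbo flag, then strip the word and spaces.
    engine = df["Engine volume"].astype(str)
    df["turbo"] = engine.str.contains("Turbo", regex=False)
    df["Engine volume"] = (
        engine.str.replace("Turbo", "", regex=False).str.strip().astype(float)
    )

    # Drop the columns that are not used in the analysis.
    return df.drop(columns=["Doors", "Color", "Wheel"])
```

Keeping the turbo information in its own boolean column means it stays available as a feature even after the engine volume is converted to a plain number.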

Data description/exploration

  1. What is the distribution of car prices in the dataset?

  1. What is the most common car model in the dataset?

We can see that Prius is the most common car model, with a count of 1,083.

  1. What is the relationship between car price and its other features such as Fuel type, cylinders count, turbo engine?

  1. Can we predict the price of a car based on mileage?
  • The graph illustrates that we can do this for low- and mid-range cars, but for more expensive cars the relationship does not hold.

  1. What is the relationship between car prices and the year of manufacture?
  • Car prices increase almost exponentially with the year of production

  1. Can we predict the engine volume of a car based on its features?
  • We conclude that we can't reliably predict the engine volume from the other features alone: each engine volume appears with many different feature combinations, so it is hard to pin down the correct engine volume using these features.

  1. Can we predict the levy of a car based on its features?

  1. Does the relationship between car prices and the fuel type differ between cars with different production years, and if so, how?

  1. Is there a relation between the number of airbags in a car and the production year of the car?

  1. What is the distribution of manufacturers in the dataset?
  • HYUNDAI and TOYOTA are the most dominant manufacturers, with about 3,700 and 3,600 cars respectively.

  1. What is the relationship between the car category and engine volume?

Building models & Interpreting the results

  1. The correlation between car prices and the number of cylinders in the engine
  • The correlation between Cylinders and Price is: 0.15600840175540825
  • The correlation is positive, but not very strong. This means that the number of cylinders in the engine has a weak impact on the price of the car.

  1. The correlation between car prices and the mileage
  • The correlation between Mileage and Price is: -0.03485695421138315
  • The correlation is negative but very weak (close to zero). This means that mileage has almost no linear relationship with the price of the car in this dataset.

  1. The correlation between car prices and the year of manufacture
  • The correlation between Prod. year and Price is: 0.3286631210183748
  • The correlation is positive and moderate, the strongest of the three. This means that the year of manufacture has a noticeable, though not dominant, impact on the price of the car.
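
As a rough illustration of how the three correlations above can be computed and read, here is a minimal sketch. The function names are hypothetical, the columns are assumed to be numeric already, and the strength thresholds (0.1, 0.3, 0.5) are only a common rule of thumb, not something taken from the notebook.

```python
import pandas as pd


def price_correlations(df: pd.DataFrame, columns, target: str = "Price") -> dict:
    """Pearson correlation of each numeric column with the target column."""
    return {col: df[col].corr(df[target]) for col in columns}


def describe_strength(r: float) -> str:
    """Label |r| with a rule-of-thumb strength (assumed thresholds)."""
    size = abs(r)
    if size < 0.1:
        return "negligible"
    if size < 0.3:
        return "weak"
    if size < 0.5:
        return "moderate"
    return "strong"
```

With these thresholds, the reported values read as weak for Cylinders (0.156), negligible for Mileage (-0.035), and moderate for Prod. year (0.329).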

  1. Difference in car prices between cars with different numbers of cylinders in the engine
No. Cylinders   16     14     12     10     8      7      6      5      4      3      2      1
Mean Price      17519  15681  40896  31561  31782  12058  19276  15630  17930  9957   11302  17317
Median Price    11604  15681  27754  38103  25089  11259  13000  14113  14167  8467   10243  9722
  • Bar plot between Number of cylinders and the mean price

  • Conclusion: Car prices differ noticeably between cars with different numbers of cylinders in the engine. For example, 12-cylinder cars have the highest mean price (40,896) and 3-cylinder cars the lowest (9,957).
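
A per-group summary like the mean and median prices above can be produced with a single group-by. The sketch below is illustrative only; the function name is hypothetical and the column names "Cylinders" and "Price" are assumed.

```python
import pandas as pd


def price_by_group(df: pd.DataFrame, group_col: str = "Cylinders",
                   target: str = "Price") -> pd.DataFrame:
    """Mean and median of the target per group, largest group key first."""
    summary = df.groupby(group_col)[target].agg(["mean", "median"])
    return summary.sort_index(ascending=False)
```

Looking at the mean and the median side by side is useful here, because a few very expensive cars can pull the mean far away from the typical price in a group.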
  1. Prediction of the price of a car based on its features
  • We used RandomForestRegressor with n_estimators=100 after converting prices to log scale.
  • Evaluation:
    • MSE: 0.20
    • RMSE: 0.44
    • Accuracy: 0.7235767238565514
    • prediction is possible but not very accurate.
  1. Prediction of the engine volume of a car based on its features
  • We used RandomForestRegressor with n_estimators=100 to predict the engine volume.
  • Evaluation:
    • MSE: 0.08
    • RMSE: 0.28
    • Accuracy: 0.8940905680450747
    • Prediction is possible with good accuracy.
  1. Prediction of the levy of a car based on its features
  • We used RandomForestRegressor with n_estimators=100 to predict the levy.
  • Evaluation:
    • MSE: 0.03
    • RMSE: 0.17
    • Accuracy: 0.7772434533874942
    • The prediction is possible but not very accurate.
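
The three prediction experiments above share the same shape, which can be sketched as below. Several details are assumptions rather than facts from the notebook: the 80/20 train/test split, the fixed random seed, the use of `np.log1p` for the log transform, and reading the reported "Accuracy" as the R² score. The function name `fit_and_evaluate` is hypothetical, and the feature matrix is assumed to be numeric already.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_and_evaluate(X, y, log_target: bool = False, seed: int = 0) -> dict:
    """Fit a 100-tree random forest and report MSE, RMSE and R^2 on held-out data."""
    # Optionally model the log of the target (as done for prices).
    y = np.log1p(y) if log_target else np.asarray(y, dtype=float)

    # Assumed split: 80% train, 20% test.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )

    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)

    mse = mean_squared_error(y_test, pred)
    return {"mse": mse, "rmse": float(np.sqrt(mse)), "r2": r2_score(y_test, pred)}
```

Note that when the target is log-transformed, MSE and RMSE are measured on the log scale, so they are not directly comparable with errors measured in the original units.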
  1. The relationship between car prices and the mileage of the car: differences between cars with different fuel types, and how the relation holds
  • The scatter plot shows the relationship between car prices and the mileage of the car for different fuel types.
  • Conclusion: Yes, the relationship between car prices and the mileage of the car differs between cars with different fuel types.
  • For example, the relationship between car prices and mileage is negative for diesel cars but positive for electric cars. We attribute this to diesel cars being more expensive than electric cars, so the more mileage a diesel car has, the lower its price will be.
  • Diesel and petrol cars have a similar (negative) relationship between car prices and mileage, because their engines are more prone to damage at higher mileage.

  1. The relationship between car prices and the manufacturer of the car: differences between cars with different production years, and how the relation holds
  • The plot shows the relationship between car prices and the manufacturer of the car for different production years.
  • Conclusion: No, the relationship between car prices and the manufacturer of the car does not differ between cars with different production years.
  • The relationship between car prices and the manufacturer of the car is positive for all production years. This is because the more expensive cars are produced by the more expensive manufacturers (BMW, Mercedes-Benz, Toyota, Volkswagen, Audi), and over the years the price of cars increases because of inflation and improvements in the technology and quality of the cars.

  1. The relationship between car prices and the fuel type: differences between cars with different production years, and how the relation holds
  • The plot shows the relationship between car prices and the fuel type of the car for different production years.

  • Conclusion: The relationship between car prices and the fuel type differs between cars with different production years.
  • Before 2000, the relationship between car prices and the fuel type is positive for all fuel types. This is because before 2000 the only fuel types available were diesel and petrol, and diesel cars were more expensive than petrol cars.
  • After 2000, the relationship is negative for diesel cars and positive for petrol cars. This is because electric cars were introduced after 2000, and they were more expensive than diesel cars.

  1. The Correlation between the number of airbags in a car and the production year of the car
  • Correlation between the number of airbags in a car and the production year of the car: 0.23696931505274932
  • The plots show the relationship between the number of airbags in a car and the production year of the car.
  • Conclusion: Yes, there is a relation between the number of airbags in a car and the production year of the car.
  • The correlation between the number of airbags in a car and the production year of the car is positive but fairly weak. We attribute it to the number of airbags increasing with improvements in the technology and quality of the cars.
  • It also reflects the growing emphasis on car safety, which is important to customers.

  1. The relationship between the car category and engine volume
  • The plot shows the relationship between the car category and the engine volume.
  • Conclusion: The plot shows that engine volume is highest for cars in the "Jeep" and "limousine" categories, and lowest for cars in categories such as "sedan" and "hatchback".
  • The reason is that cars in the "Jeep" and "limousine" categories are bigger than cars in categories such as "sedan" and "hatchback", so they need bigger engines to move and more power to accelerate.

Testing

  1. Test whether the average price of Mercedes-Benz cars in the dataset is significantly lower than a given value.
  • Null hypothesis: The average price of Mercedes cars is equal to or greater than $20,000.
  • Alternative hypothesis: The average price of Mercedes cars is less than $20,000.
  • Significance level: Let's use α = 0.05 (or 5%).

Fail to reject the null hypothesis. There is not enough evidence to conclude that the average price of Mercedes cars is less than $20,000.
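
A one-sample, one-tailed test of this kind can be sketched with SciPy as below. The function name is hypothetical, and the input is assumed to be the numeric prices of the Mercedes-Benz cars only.

```python
from scipy import stats


def one_sample_less(prices, popmean: float = 20_000, alpha: float = 0.05):
    """One-tailed t-test: H0 mean >= popmean, H1 mean < popmean."""
    t_stat, p_value = stats.ttest_1samp(prices, popmean, alternative="less")
    # Reject H0 only when the p-value falls below the significance level.
    return float(t_stat), float(p_value), bool(p_value < alpha)
```

When the returned flag is False, the result is "fail to reject": the data do not show that the mean is below the threshold, which is weaker than showing that the mean is at or above it.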

  1. Test whether the average price of cars with leather interiors is higher than the average price of cars without leather interiors.
  • Null hypothesis: The average price of cars with leather interiors is greater than or equal to the average price of cars without leather interiors (claim).
  • Alternative hypothesis: The average price of cars with leather interiors is less than the average price of cars without leather interiors.
  • Statistical test: One-tailed t-test for independent samples.

Fail to reject the null hypothesis. There is not enough evidence against the claim that the average price of cars with leather interiors is at least as high as the average price of cars without leather interiors.

  1. Test whether the average price of cars with diesel engines is significantly different from the average price of cars with petrol engines.
  • Null hypothesis: The average price of cars with diesel engines is equal to the average price of cars with petrol engines.
  • Alternative hypothesis: The average price of cars with diesel engines is significantly different from the average price of cars with petrol engines.
  • Statistical test: Two-tailed t-test for independent samples.

Reject the null hypothesis. The average price of diesel cars is significantly different from the average price of petrol cars.
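
Both independent-samples tests above (one-tailed for leather interiors, two-tailed for diesel versus petrol) can be sketched with one helper. The function name is hypothetical, and using Welch's variant (`equal_var=False`) is an assumption; the text only says "t-test for independent samples".

```python
from scipy import stats


def compare_means(a, b, alternative: str = "two-sided", alpha: float = 0.05):
    """Independent-samples t-test between price samples a and b.

    alternative="two-sided": H1 is mean(a) != mean(b)
    alternative="less":      H1 is mean(a) <  mean(b)
    alternative="greater":   H1 is mean(a) >  mean(b)
    """
    # Assumption: Welch's t-test, which does not require equal variances.
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False, alternative=alternative)
    return float(t_stat), float(p_value), bool(p_value < alpha)
```

For the leather-interior test, `a` would hold the prices of cars with leather interiors, `b` the prices of cars without, and `alternative="less"` matches the stated alternative hypothesis.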

  1. Test whether there is an association between the manufacturer and the car category.
  • Null hypothesis: there is no association between the manufacturer and the car category.
  • Alternative Hypothesis: there is an association between the manufacturer and the car category.
  • Statistical test: Chi-Square Test of Independence.

Reject the null hypothesis. There is an association between the manufacturer and the car category.
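
A test of association between two categorical columns can be sketched as a contingency table followed by a chi-square test. The function name is hypothetical and the column names "Manufacturer" and "Category" are assumed.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def association_test(df: pd.DataFrame, row: str = "Manufacturer",
                     col: str = "Category", alpha: float = 0.05):
    """Chi-square test on the row x col contingency table."""
    table = pd.crosstab(df[row], df[col])  # observed counts
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return float(chi2), float(p_value), int(dof), bool(p_value < alpha)
```

With many manufacturers and categories, some cells of the table will have very small expected counts, so the p-value is best read as approximate.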