Car Evaluation Dataset


This dataset comes from the open-source UCI Machine Learning Repository, where it is available for download.

License :

License can be seen here

Business Problem :

To classify a car as unacceptable, acceptable, good or very good based on its price, characteristics and maintenance cost

Data Description :

 1. Title: Car Evaluation Database

 2. Sources:
   (a) Creator: Marko Bohanec
   (b) Donors: Marko Bohanec   (marko.bohanec@ijs.si)
           Blaz Zupan      (blaz.zupan@ijs.si)
   (c) Date: June, 1997

3. Past Usage:

The hierarchical decision model, from which this dataset is
derived, was first presented in 

M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for
multi-attribute decision making. In 8th Intl Workshop on Expert
Systems and their Applications, Avignon, France. pages 59-78, 1988.

Within machine-learning, this dataset was used for the evaluation
of HINT (Hierarchy INduction Tool), which was proved to be able to
completely reconstruct the original hierarchical model. This,
together with a comparison with C4.5, is presented in

B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by
function decomposition. ICML-97, Nashville, TN. 1997 (to appear)

4. Relevant Information Paragraph:

Car Evaluation Database was derived from a simple hierarchical
decision model originally developed for the demonstration of DEX
(M. Bohanec, V. Rajkovic: Expert system for decision
making. Sistemica 1(1), pp. 145-157, 1990.). The model evaluates
cars according to the following concept structure:

CAR                      car acceptability
. PRICE                  overall price
. . buying               buying price
. . maint                price of the maintenance
. TECH                   technical characteristics
. . COMFORT              comfort
. . . doors              number of doors
. . . persons            capacity in terms of persons to carry
. . . lug_boot           the size of luggage boot
. . safety               estimated safety of the car


5. Number of Instances: 1728
(instances completely cover the attribute space)

6. Number of Attributes: 6

7. Attribute Values:

buying       v-high, high, med, low
maint        v-high, high, med, low
doors        2, 3, 4, 5-more
persons      2, 4, more
lug_boot     small, med, big
safety       low, med, high
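
The claim that the instances completely cover the attribute space can be checked with a little arithmetic: the value counts listed above multiply to 4 × 4 × 4 × 3 × 3 × 3 = 1728. A minimal sketch, assuming the value lists are exactly as printed in this description (the variable names are illustrative only):

```python
from itertools import product

# Attribute values as listed in the dataset description above.
ATTRIBUTE_VALUES = {
    "buying": ["v-high", "high", "med", "low"],
    "maint": ["v-high", "high", "med", "low"],
    "doors": ["2", "3", "4", "5-more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Enumerate every combination of one value per attribute.
combinations = list(product(*ATTRIBUTE_VALUES.values()))

# 4 * 4 * 4 * 3 * 3 * 3 combinations, matching the stated number of instances.
print(len(combinations))
```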

8. Missing Attribute Values: none

9. Class Distribution (number of instances per class)

class      N          N[%]
-----------------------------
unacc     1210     (70.023 %) 
acc        384     (22.222 %) 
good        69     ( 3.993 %) 
v-good      65     ( 3.762 %) 

For readability and convenience, the column names are changed to:
 Price                    overall price
 Maintenance Cost         price of the maintenance
 Number of Doors          number of doors
 Capacity                 capacity in terms of persons to carry
 Size of Luggage boot     the size of luggage boot
 safety                   estimated safety of the car

 Decision                 class
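
The renaming could be done with pandas roughly as follows. This is a sketch, not the notebook's actual code: it assumes the raw file is a header-less CSV with the six attributes followed by the class, and the two rows below are made-up placeholders rather than real records.

```python
import io

import pandas as pd

# Original attribute names from the dataset description, plus the class column.
ORIGINAL_COLUMNS = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]

# Mapping from original names to the names used in this project.
RENAMED_COLUMNS = {
    "buying": "Price",
    "maint": "Maintenance Cost",
    "doors": "Number of Doors",
    "persons": "Capacity",
    "lug_boot": "Size of Luggage boot",
    "safety": "safety",
    "class": "Decision",
}

# Placeholder rows in the assumed header-less CSV layout (not real data).
sample_csv = io.StringIO(
    "v-high,v-high,2,2,small,low,unacc\n"
    "low,low,4,more,big,high,v-good\n"
)

# Read everything as strings so values such as "2" stay categorical.
df = pd.read_csv(sample_csv, header=None, names=ORIGINAL_COLUMNS, dtype=str)
df = df.rename(columns=RENAMED_COLUMNS)
print(df.columns.tolist())
```

In practice, `sample_csv` would be replaced by the path to the downloaded data file.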

Libraries Used :

  1. Sklearn
  2. Matplotlib
  3. Pandas
  4. Numpy
  5. Seaborn

Exploratory Data Analysis:

  1. Univariate Analysis : Pie charts visualise how the values of each attribute are distributed
  2. Bi-Variate Analysis : Stacked bar plots, box plots and violin plots compare each attribute against Decision; the notebook explains them in more detail
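
As an illustration of the two kinds of plots, here is a minimal sketch using pandas and Matplotlib. The data frame holds a handful of made-up rows, not the real dataset, and the choice of the `safety` column is arbitrary; the notebook's own plots may be built differently.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Small placeholder frame (made-up rows) using the renamed columns.
df = pd.DataFrame({
    "safety": ["low", "med", "high", "high", "low", "med"],
    "Decision": ["unacc", "acc", "v-good", "good", "unacc", "unacc"],
})

# Univariate: share of each value within one attribute.
safety_share = df["safety"].value_counts(normalize=True)

# Bivariate: number of rows per Decision for each safety level.
decision_by_safety = pd.crosstab(df["safety"], df["Decision"])

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))
safety_share.plot.pie(ax=ax_pie, autopct="%1.1f%%")
ax_pie.set_ylabel("")
decision_by_safety.plot.bar(stacked=True, ax=ax_bar)
ax_bar.set_ylabel("count")
fig.tight_layout()
plt.close(fig)  # call fig.savefig(...) before closing to write the figure to disk
```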

Data Processing :

Categorical attributes are converted to numerical attributes, which certain visualisations and the machine learning algorithms require
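
One way to do this conversion is an ordinal mapping, sketched below. The orderings (lowest to highest) are an assumption read off the attribute value lists; the notebook may use a different scheme such as label or one-hot encoding, and `encode_ordinal` is a hypothetical helper name.

```python
import pandas as pd

# Assumed orderings, lowest to highest, for each renamed column.
ORDERINGS = {
    "Price": ["low", "med", "high", "v-high"],
    "Maintenance Cost": ["low", "med", "high", "v-high"],
    "Number of Doors": ["2", "3", "4", "5-more"],
    "Capacity": ["2", "4", "more"],
    "Size of Luggage boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
    "Decision": ["unacc", "acc", "good", "v-good"],
}


def encode_ordinal(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with each known column replaced by its integer rank."""
    encoded = df.copy()
    for column, order in ORDERINGS.items():
        if column in encoded.columns:
            mapping = {value: rank for rank, value in enumerate(order)}
            encoded[column] = encoded[column].map(mapping)
    return encoded


# Made-up placeholder rows, not real data.
raw = pd.DataFrame({
    "Price": ["v-high", "low"],
    "safety": ["low", "high"],
    "Decision": ["unacc", "v-good"],
})
encoded = encode_ordinal(raw)
print(encoded)
```

An ordinal mapping keeps the natural order of values such as low < med < high, which distance-based models like KNN can make use of.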

Model Building :

Two classification algorithms are used for model building:

  • KNN Classifier
  • Random Forest Classifier

Both are explained and explored further in the notebook. Accuracy and F1 score are both evaluated for each model
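
The following sketch shows how both classifiers could be fitted and scored with scikit-learn. It is not the notebook's code: `evaluate_models` is a hypothetical helper, the split ratio, `n_neighbors`, `n_estimators` and the macro-averaged F1 are assumptions, and the demo data is synthetic rather than the car dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def evaluate_models(X, y, random_state=0):
    """Fit KNN and Random Forest on a stratified split and return their scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state, stratify=y
    )
    models = {
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "Random Forest": RandomForestClassifier(
            n_estimators=100, random_state=random_state
        ),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        scores[name] = {
            "accuracy": accuracy_score(y_test, predictions),
            # Macro averaging weights every class equally, which matters
            # when the classes are imbalanced.
            "f1_macro": f1_score(y_test, predictions, average="macro"),
        }
    return scores


# Synthetic placeholder data: six integer-coded features and a made-up label.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 4, size=(200, 6))
y_demo = (X_demo[:, 0] + X_demo[:, 5] > 3).astype(int)

demo_scores = evaluate_models(X_demo, y_demo)
print(demo_scores)
```

Reporting F1 next to accuracy is useful here because about 70% of the instances belong to the unacc class, so accuracy alone can look good even when the rare classes are predicted poorly.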

Hyperparameter Tuning :

Hyperparameters are tuned with the help of graphs and GridSearch, both to illustrate the two methods and to properly assess the best model
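
A grid search over the KNN neighbour count could look like the sketch below, using scikit-learn's `GridSearchCV`. The candidate values, the 5-fold cross-validation, the `f1_macro` scoring and the helper name `tune_knn` are all assumptions for illustration, and the demo data is synthetic.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


def tune_knn(X, y, neighbor_values=(1, 3, 5, 7, 9)):
    """Cross-validate KNN for each candidate n_neighbors and return the search."""
    search = GridSearchCV(
        KNeighborsClassifier(),
        param_grid={"n_neighbors": list(neighbor_values)},
        scoring="f1_macro",
        cv=5,
    )
    search.fit(X, y)
    return search


# Synthetic placeholder data: six integer-coded features and a made-up label.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 4, size=(200, 6))
y_demo = (X_demo[:, 0] + X_demo[:, 5] > 3).astype(int)

search = tune_knn(X_demo, y_demo)
print(search.best_params_, search.best_score_)

# One mean cross-validated score per candidate; plotting these against
# n_neighbors gives the graphical view of the same search.
mean_scores = search.cv_results_["mean_test_score"]
```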

Conclusion :

All models are compared and the best-performing one is selected