Data_mining_census_shop

Decision tree.py

1.1 Drop the attribute fnlwgt and print information about the adult dataset.

1.2 Ignore the instances with missing values, then convert all attributes into nominal data.

1.3 Split the nominal dataset into X_training data & y_training data and X_test data & y_test data with a 7:3 ratio, and fit a decision tree to the training data. Test the model's performance by its error rate.

1.4 Compare two approaches for handling missing values: (i) creating a new value “missing” for each attribute and using it for every missing value in the dataset; (ii) replacing all missing values of each attribute with that attribute's most popular value.
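The two missing-value approaches in 1.4 and the error rate from 1.3 can be sketched without any library, as below. This is only an illustration, not the code in Decision tree.py: the function names, the toy rows, and the use of `None` as the missing-value marker are all assumptions made for the example.

```python
from collections import Counter

MISSING = None  # assumption: missing cells are represented as None


def fill_with_missing_label(rows, label="missing"):
    """Approach (i): replace every missing cell with a new category."""
    return [[label if v is MISSING else v for v in row] for row in rows]


def fill_with_most_popular(rows):
    """Approach (ii): replace missing cells with the column's most frequent value."""
    modes = []
    for col in zip(*rows):  # iterate over columns
        counts = Counter(v for v in col if v is not MISSING)
        modes.append(counts.most_common(1)[0][0] if counts else MISSING)
    return [
        [modes[j] if v is MISSING else v for j, v in enumerate(row)]
        for row in rows
    ]


def error_rate(y_true, y_pred):
    """Fraction of misclassified instances, i.e. 1 - accuracy."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Toy rows (hypothetical values) with two missing cells.
rows = [
    ["Private", "Bachelors"],
    [None, "Bachelors"],
    ["Private", None],
    ["State-gov", "HS-grad"],
]
filled_i = fill_with_missing_label(rows)
filled_ii = fill_with_most_popular(rows)
```

Each filled dataset would then be encoded, split 7:3, and fitted with a decision tree, so that the two error rates can be compared directly.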

Clustering.py

2.1 Import the dataset, dropping the attributes CHANNEL and REGION, and print the mean and range of each attribute.

2.2 Cluster the dataset using k-means with k = 3, and construct scatterplots for the 15 pairs of attributes.

2.3 Cluster the dataset using k-means with k = 3, 5 and 10. Then compare the resulting clusterings by calculating their BC (between-cluster sum of squared distances), WC (within-cluster sum of squared distances) and the ratio WC/BC.
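The WC, BC and WC/BC measures from 2.3 can be sketched as follows, given points and their cluster labels. This is a minimal library-free illustration, not the code in Clustering.py. It assumes BC is the sum of squared distances between every pair of cluster centroids, which is one common reading of "between-cluster sum of squared distances"; the function names and toy points are hypothetical.

```python
from itertools import combinations


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroids_of(points, labels):
    """Mean point of each cluster, keyed by cluster label."""
    sums, counts = {}, {}
    for p, k in zip(points, labels):
        if k not in sums:
            sums[k] = [0.0] * len(p)
            counts[k] = 0
        sums[k] = [s + x for s, x in zip(sums[k], p)]
        counts[k] += 1
    return {k: [s / counts[k] for s in sums[k]] for k in sums}


def within_cluster(points, labels, centroids):
    """WC: squared distance of every point to its own cluster centroid."""
    return sum(squared_distance(p, centroids[k]) for p, k in zip(points, labels))


def between_cluster(centroids):
    """BC (assumed definition): squared distance over all centroid pairs."""
    return sum(
        squared_distance(a, b) for a, b in combinations(centroids.values(), 2)
    )


# Toy data: two well-separated clusters of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]

centroids = centroids_of(points, labels)
wc = within_cluster(points, labels, centroids)
bc = between_cluster(centroids)
ratio = wc / bc
```

The labels could come from any k-means implementation, for example the `labels_` attribute of a fitted scikit-learn `KMeans` model. A lower WC/BC ratio indicates tighter, better separated clusters, though WC on its own tends to shrink as k grows.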