The provided code is a Python script that performs data preprocessing, analysis, visualization, and machine learning tasks on a dataset.
- Imports the necessary libraries: pandas, numpy, seaborn, and matplotlib.pyplot.
- Reads the dataset from a CSV file using pandas and displays the first few rows of the dataset.
- Prints information about the dataset, including the column names and data types.
- Defines several functions to fix and transform specific columns in the dataset.
- Applies the defined functions to clean and transform the data in the respective columns.
- Fills missing values with the mode of each column.
- Removes any duplicate rows from the dataset.
- Performs various data analysis tasks and prints the results:
  - Average income of the employees in the dataset.
  - Percentage of employees earning more than $50,000.
  - Percentage of employees who have purchased a bike.
  - Most common occupation in the dataset.
  - Number of employees with no children.
  - Average number of cars owned by employees.
  - Number of employees living in the Pacific region.
  - Average age in the dataset.
  - Percentage of employees who have a commute distance of 5-10 miles and own a home.
  - Most common commute distance in the dataset.
  - Most common gender in the dataset.
  - Average income of male employees.
  - Average income of female employees.
  - Percentage of male employees in the dataset.
  - Percentage of female employees in the dataset.
  - Number of employees with 2 or more cars and an income below $50,000.
  - Percentage of employees who own a home and have purchased a bike.
  - Highest income in the dataset.
  - Number of employees with a partial college education.
  - Number of employees over 50 years old.
  - Percentage of male employees over 50 years old.
  - Number of employees with a skilled manual occupation.
  - ID(s) of rows with the highest income.
  - Number of employees with a graduate degree.
  - Average income in Europe.
  - Average income in the Pacific region.
  - Most common marital status in the dataset.
  - Average income for single individuals.
  - Average income for married individuals.
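The cleaning and summary steps listed above can be sketched roughly as follows. This is a minimal illustration, not the script itself: the toy rows and the column names (`ID`, `Gender`, `Income`, `Children`, `Purchased Bike`) are assumptions made for the example, since the real column names are not shown here.

```python
import pandas as pd

# Hypothetical toy data; the real script reads its rows from a CSV file.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 4],
    "Gender": ["Male", "Female", None, "Male", "Male"],
    "Income": [40000, 80000, 60000, 30000, 30000],
    "Children": [0, 2, 0, 1, 1],
    "Purchased Bike": ["Yes", "No", "Yes", "No", "No"],
})

# Fill missing values with the mode (most frequent value) of each column.
df = df.fillna(df.mode().iloc[0])

# Remove exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# A few of the summary statistics from the list above.
avg_income = df["Income"].mean()
pct_over_50k = (df["Income"] > 50000).mean() * 100
pct_bike = (df["Purchased Bike"] == "Yes").mean() * 100
no_children = int((df["Children"] == 0).sum())
avg_income_by_gender = df.groupby("Gender")["Income"].mean()
top_income_ids = df.loc[df["Income"] == df["Income"].max(), "ID"].tolist()

print(f"Average income: {avg_income:.2f}")
print(f"Earning more than $50,000: {pct_over_50k:.1f}%")
print(f"IDs with the highest income: {top_income_ids}")
```

Taking the mean of a boolean mask gives the fraction of rows that match, which is why multiplying by 100 yields a percentage without a separate count and division.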
- Generates several visualizations using seaborn and matplotlib:
  - Bar chart showing the average income by the number of cars owned.
  - Box plot for the income distribution.
  - Box plot for the age distribution.
  - Heatmap of the count of education levels by occupation.
  - Count plot of occupations.
  - Heatmap of the correlation matrix.
  - Scatter plot of age versus income.
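A few of these plots could be produced along the following lines. This is a sketch under assumptions: the toy data and the column names (`Age`, `Income`, `Cars`, `Education`, `Occupation`) are hypothetical, and the original script may lay out its figures differently.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for the cleaned dataset.
df = pd.DataFrame({
    "Age": [25, 40, 55, 33, 61, 47],
    "Income": [30000, 60000, 90000, 40000, 80000, 70000],
    "Cars": [0, 1, 2, 1, 2, 1],
    "Education": ["Bachelors", "Graduate Degree", "Bachelors",
                  "Partial College", "Graduate Degree", "Bachelors"],
    "Occupation": ["Clerical", "Professional", "Management",
                   "Skilled Manual", "Management", "Professional"],
})

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Bar chart: barplot aggregates with the mean by default,
# so this shows average income per number of cars.
sns.barplot(data=df, x="Cars", y="Income", ax=axes[0, 0])

# Heatmap of counts: cross-tabulate education level against occupation.
edu_by_occ = pd.crosstab(df["Education"], df["Occupation"])
sns.heatmap(edu_by_occ, annot=True, fmt="d", ax=axes[0, 1])

# Heatmap of the correlation matrix over the numeric columns only.
corr = df[["Age", "Income", "Cars"]].corr()
sns.heatmap(corr, annot=True, ax=axes[1, 0])

# Scatter plot of age versus income.
sns.scatterplot(data=df, x="Age", y="Income", ax=axes[1, 1])

fig.tight_layout()
```

In an interactive session, dropping the `matplotlib.use("Agg")` line and calling `plt.show()` at the end displays the figure.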
- Saves the modified dataset to a new CSV file called 'Data_after.csv'.
- Performs machine learning tasks using logistic regression:
  - Converts categorical variables to numerical using one-hot encoding.
  - Splits the dataset into training and testing sets.
  - Initializes and trains a logistic regression model.
  - Scales the features using standardization.
  - Predicts the income category for the test data.
  - Evaluates the accuracy of the model.
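The machine learning steps can be sketched with scikit-learn as shown below. Several things here are assumptions: the toy data, the column names, and the `High Income` target (income above $50,000) are hypothetical, because the exact definition of the income category is not given. The sketch also fits the scaler on the training split before training the model, which is the conventional order and may differ from the order in the script.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the cleaned dataset.
df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38, 29, 60, 44, 35, 55, 41, 27],
    "Cars": [0, 1, 2, 2, 1, 0, 3, 2, 1, 2, 1, 0],
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female",
               "Male", "Female", "Male", "Female", "Male", "Female"],
    "Income": [30000, 40000, 80000, 90000, 60000, 20000,
               120000, 70000, 40000, 100000, 50000, 30000],
})

# Assumed target: 1 if income is above $50,000, else 0.
df["High Income"] = (df["Income"] > 50000).astype(int)

# One-hot encode the categorical columns; the target's source column is dropped.
X = pd.get_dummies(df.drop(columns=["Income", "High Income"]),
                   drop_first=True, dtype=int)
y = df["High Income"]

# Split into training and testing sets, keeping the class balance in both.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Standardize using statistics from the training split only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model, predict on the test split, and evaluate accuracy.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```

Fitting the scaler only on the training split keeps information from the test rows out of the model, so the reported accuracy is a fairer estimate.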