- In this lab, you will use the `learningSet.csv` file, which you already cloned in today's activities. The full process for the week is shown in the PDF file.
Complete the following steps on the categorical columns in the dataset:
- Check for null values in all the columns.
- Exclude the following variables by looking at the definitions. Create a new empty list called `drop_list`. We will append column names to this list and then drop all the columns in it later:
  - `OSOURCE` - symbol definitions not provided, too many categories
  - `ZIP` - we are including state already
- Identify columns that have over 50% missing values.
- Remove those columns from the dataframe.
- Perform all of the cleaning processes from the Lesson.
- Reduce the number of categories in the column `GENDER`. The column should only contain "M" for males, "F" for females, and "other" for everything else.
  - Note that there are a few null values in the column. We will first replace those null values using the code below:
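  ```python
  # Inspect the current category counts (nulls are not shown by default)
  print(categorical['GENDER'].value_counts())
  # Replace the null values with 'F'
  categorical['GENDER'] = categorical['GENDER'].fillna('F')
  ```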
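
The steps above can be sketched roughly as follows. This is a minimal illustration, not the reference solution: it assumes the categorical columns live in a DataFrame named `categorical` (the name used in the snippet above), and it uses a small toy frame with made-up values in place of `learningSet.csv`. The `MOSTLY_EMPTY` column is hypothetical and exists only to trigger the 50% rule.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the categorical columns of learningSet.csv.
# All values are invented; MOSTLY_EMPTY is a hypothetical column.
categorical = pd.DataFrame({
    "OSOURCE": ["GRI", "BOA", "AMH", "BRY"],
    "ZIP": ["61081", "91326", "27017", "95953"],
    "STATE": ["IL", "CA", "NC", "CA"],
    "GENDER": ["F", "M", np.nan, "U"],
    "MOSTLY_EMPTY": [np.nan, np.nan, np.nan, "X"],
})

# Step 1: share of null values per column (0.0 to 1.0).
null_share = categorical.isna().mean()
print(null_share)

# Step 2: start an empty drop list and add the columns excluded by definition.
drop_list = []
drop_list.extend(["OSOURCE", "ZIP"])

# Step 3: add every column with more than 50% missing values.
drop_list.extend(null_share[null_share > 0.5].index.tolist())

# Step 4: drop all collected columns in one call.
categorical = categorical.drop(columns=drop_list)

# Step 5: fill GENDER nulls with 'F' (as in the lab), then keep
# 'M' and 'F' and map every other value to 'other'.
categorical["GENDER"] = categorical["GENDER"].fillna("F")
categorical["GENDER"] = categorical["GENDER"].where(
    categorical["GENDER"].isin(["M", "F"]), "other"
)
print(categorical["GENDER"].value_counts())
```

Using `isna().mean()` gives the fraction of missing values directly, so the 50% threshold becomes a simple comparison against `0.5`. On the real data, replace the toy frame with the categorical columns selected from the loaded file.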