In this lab, you'll continue to practice techniques for dealing with missing data. You'll also observe how each technique changes the distribution of your data.
In this lab you will:
- Evaluate and execute the best strategy for dealing with missing, duplicate, and erroneous values for a given dataset
- Determine how the distribution of data is affected by imputing values
To start, load the dataset 'titanic.csv' using pandas.
# Your code here
Use the .info() method to quickly preview which features have missing data.
# Your code here
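As a rough sketch of the loading and preview steps, the snippet below reads a tiny inline CSV in place of 'titanic.csv' so that it runs without the file. The rows and the `csv_text` and `missing_per_column` names are made up for illustration; with the real dataset you would pass the file name to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'titanic.csv': a few made-up rows so the example runs anywhere.
csv_text = """PassengerId,Age,Fare
1,22,7.25
2,,71.28
3,26,7.92
4,35,53.10
5,,8.05
"""

# With the real file this would be: df = pd.read_csv('titanic.csv')
df = pd.read_csv(io.StringIO(csv_text))

# .info() prints the non-null count per column; any column whose count
# is below the number of rows has missing data.
df.info()

# isna().sum() reports the number of missing entries per column directly.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Here `.info()` gives the quick overview, while `isna().sum()` is a convenient follow-up when you want the counts as numbers you can work with.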
Let's look at the 'Age' feature. Calculate the mean, median, and standard deviation of this feature. Then plot a histogram of the distribution.
# Your code here
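One possible way to compute these statistics is sketched below on a small made-up Series standing in for the 'Age' column (the values are not from the Titanic data). Bin counts from `np.histogram` are used as a text-only stand-in for the plot, with arbitrarily chosen bin edges.

```python
import numpy as np
import pandas as pd

# Made-up ages standing in for df['Age']; np.nan marks missing entries.
age = pd.Series([22, 38, np.nan, 35, 54, 2, np.nan, 27, 14, 58], name="Age")

# pandas skips NaN by default (skipna=True) in these reductions.
age_mean = age.mean()
age_median = age.median()
age_std = age.std()  # sample standard deviation (ddof=1)
print(f"mean={age_mean:.2f}, median={age_median:.2f}, std={age_std:.2f}")

# Bin counts as a text-only stand-in for a histogram; with matplotlib
# installed, age.hist() would draw the plot instead.
counts, edges = np.histogram(age.dropna(), bins=[0, 20, 40, 60])
print(counts)
```

Because the reductions skip missing values, these numbers describe only the observed ages, which is the baseline the imputation methods are compared against.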
Fill the missing 'Age' values using the mean age. (Don't overwrite the original data, as we will be comparing to other methods for dealing with the missing values.) Then recalculate the mean, median, and std and replot the histogram.
# Your code here
Note that the standard deviation dropped, the median rose slightly, and the distribution has more mass near the center. The mean itself is unchanged, since every filled value equals the mean.
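The effect of mean imputation can be checked on a small example. This is a minimal sketch on a made-up Series standing in for the 'Age' column; the name `age_mean_filled` is hypothetical, and the exact numbers will differ on the real data.

```python
import numpy as np
import pandas as pd

# Made-up ages standing in for df['Age']; np.nan marks missing entries.
age = pd.Series([22, 38, np.nan, 35, 54, 2, np.nan, 27, 14, 58], name="Age")

# fillna returns a new Series, so the original keeps its NaN values.
age_mean_filled = age.fillna(age.mean())

print("mean:  ", age.mean(), "->", age_mean_filled.mean())      # unchanged
print("median:", age.median(), "->", age_mean_filled.median())  # pulled toward the mean
print("std:   ", age.std(), "->", age_mean_filled.std())        # smaller
```

The standard deviation shrinks because the filled values add nothing to the sum of squared deviations from the mean, while the number of observations in the denominator grows.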
Fill the missing 'Age' values, this time using the median age. (Again, don't overwrite the original data, as we will be comparing to other methods for dealing with the missing values.) Then recalculate the mean, median, and std and replot the histogram.
# Your code here
Imputing the median has a similar effect to imputing the mean. The variance is reduced, while the mean is slightly lowered because the median age is below the mean age. You can once again see that there is a larger mass of data near the center of the distribution.
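A similar check for median imputation is sketched below, again on a made-up Series rather than the real 'Age' column; `age_median_filled` is a hypothetical name. In this toy data the median sits just below the mean, which mirrors the situation described above.

```python
import numpy as np
import pandas as pd

# Made-up ages standing in for df['Age']; np.nan marks missing entries.
age = pd.Series([22, 38, np.nan, 35, 54, 2, np.nan, 27, 14, 58], name="Age")

# Fill a copy with the median of the observed values; the original is untouched.
age_median_filled = age.fillna(age.median())

print("mean:  ", age.mean(), "->", age_median_filled.mean())      # shifts toward the median
print("median:", age.median(), "->", age_median_filled.median())  # unchanged
print("std:   ", age.std(), "->", age_median_filled.std())        # smaller
```

If the median were above the mean, the same procedure would raise the mean instead; the direction of the shift depends on the skew of the observed values.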
Finally, let's observe the impact on the distribution if we were to simply drop all of the rows that are missing an age value. Then, calculate the mean, median and standard deviation of the ages along with a histogram, as before.
# Your code here
Dropping missing values leaves the distribution of the observed ages and the associated measures of centrality unchanged, but at the cost of throwing away the other data in those rows.
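To illustrate, the sketch below drops the missing entries from a made-up Series standing in for the 'Age' column and compares the statistics before and after. On a full DataFrame, `df.dropna(subset=['Age'])` would remove the corresponding rows; `age_dropped` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Made-up ages standing in for df['Age']; np.nan marks missing entries.
age = pd.Series([22, 38, np.nan, 35, 54, 2, np.nan, 27, 14, 58], name="Age")

# dropna returns a new Series containing only the observed values.
age_dropped = age.dropna()

print("rows:  ", len(age), "->", len(age_dropped))
print("mean:  ", age.mean(), "->", age_dropped.mean())
print("median:", age.median(), "->", age_dropped.median())
print("std:   ", age.std(), "->", age_dropped.std())
```

The statistics match because pandas already skips missing values when computing them on the original Series; the only visible change is the smaller row count.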
In this lab, you briefly practiced some common techniques for dealing with missing data. Moreover, you observed the impact that these methods had on the distribution of the feature itself. When you begin to tune models on your data, these considerations will be an essential part of developing robust and accurate models.