Machine Learning Data Prep: Handling outliers using Python
Summary
This presentation covers data preparation for machine learning. It begins
with a general introduction to machine learning, then briefly explains
the importance of data preparation, using outlier handling in Python as
a specific example.
The presentation was delivered twice to Orlando-area organizations.
First, to the Orlando Machine Learning and Data Science Meetup Group (OMLDS)
on October 26, 2021, during an online Lunch and Learn.
Second, as part of the machine learning track of SQL Saturday Orlando 2022
on October 8, 2022.
Data preparation for machine learning is essential. Outliers are a fundamental concept to understand in machine learning and statistics. This presentation uses EPA fuel efficiency data as a real-world example of how outliers can lead us to identify new and interesting groups or subsets within data.
Small snippets of Python code are highlighted in the presentation slides. The full code is in the notebook within this repository. For those looking to build Python plotting skills, the code shows a method that overlays a scatter plot on a Seaborn regplot to highlight outliers in a different color, something regplot does not do on its own.
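The overlay idea can be sketched roughly as follows. This is not the notebook's code: the data is synthetic rather than the EPA dataset, the column names `weight` and `mpg` are hypothetical, and flagging outliers by a 1.5 x IQR rule on the residuals of a straight-line fit is an assumption made only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (NOT the EPA data); column names are hypothetical.
rng = np.random.default_rng(0)
weight = rng.uniform(2000, 5000, 200)
mpg = 60 - 0.009 * weight + rng.normal(0, 2, 200)
mpg[:5] += 25  # inject a few unusually efficient vehicles
df = pd.DataFrame({"weight": weight, "mpg": mpg})

# Assumed outlier rule: 1.5 * IQR fences on residuals from a linear fit.
slope, intercept = np.polyfit(df["weight"], df["mpg"], 1)
resid = df["mpg"] - (slope * df["weight"] + intercept)
q1, q3 = resid.quantile(0.25), resid.quantile(0.75)
iqr = q3 - q1
df["is_outlier"] = (resid < q1 - 1.5 * iqr) | (resid > q3 + 1.5 * iqr)

# Base layer: regression plot of all points.
fig, ax = plt.subplots()
sns.regplot(data=df, x="weight", y="mpg", ax=ax, scatter_kws={"alpha": 0.4})

# Overlay: redraw only the flagged points on the same axes in another color.
sns.scatterplot(
    data=df[df["is_outlier"]],
    x="weight",
    y="mpg",
    color="red",
    label="outlier",
    ax=ax,
)
```

Passing the same `ax` to both calls is what makes the second plot land on top of the first. In a notebook, drop the `matplotlib.use("Agg")` line and the figure will render inline.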