/data_preparation_journey

The repository for the work-in-progress book _The Data Preparation Journey_


The Data Preparation Journey: Finding Your Way With R

It is routinely noted that the Pareto principle applies to data science: 80% of one's time is spent on data collection and preparation, and the remaining 20% on the "fun stuff" like modelling, data visualization, and communication.

There is no shortage of material (textbooks, journal articles, blog posts, online courses, podcasts, etc.) about the 20%. That's not to say that there is no material for the other 80%, but it is scattered across technique-specific articles and domain-specific books, along with Stack Overflow questions and miscellaneous blog posts. This book serves as a travel guide: an introduction and wayfinder through some of those scattered resources for readers seeking to understand the core elements of data preparation. Like a lighthouse, it is meant both to guide you in the right direction and to keep you from running aground.

The book will introduce the principles of data preparation, framed in a systematic approach that follows a typical data science or statistical workflow. With that context, readers will then work through practical solutions to problems in data using R, the statistical and data science programming language. These solutions will include examples that use complex real-world data.