data_cleaning

A repository of SQL data cleaning projects.

Introduction

This repo collects small projects for practicing data cleaning with SQL, Excel or any other method. It was inspired by a LinkedIn post by Sushanta Khara.

Project List:

Problem Statement

In data analysis, the analyst must ensure that the data is 'clean' before doing any analysis. 'Dirty' data can lead to unreliable, inaccurate and/or misleading results. Garbage in = garbage out.

These are some steps that can be taken to properly prepare your dataset for analysis:

  • Check for duplicate entries and remove them.
  • Remove extra spaces and/or other invalid characters.
  • Separate or combine values as needed.
  • Ensure that certain values (age, dates...) are within a valid range.
  • Check for outliers.
  • Correct misspellings and incorrectly entered data.
  • Add new, relevant rows or columns to the new dataset.
  • Check for null or empty values.

Using the criteria above, create a new SQL table with the properly formatted data.
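
As a rough illustration of several of the steps above, here is a minimal sketch that builds a cleaned table from a raw one. It runs the SQL through Python's built-in sqlite3 module only so that it is self-contained; the SQL dialect is SQLite's and may need adjusting for other databases. The tables `raw_members` and `clean_members`, their columns, the sample rows and the 0 to 120 age range are all hypothetical and are not taken from any dataset in this repo.

```python
import sqlite3

# In-memory database; every table, column and value here is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_members (full_name TEXT, email TEXT, age INTEGER);
INSERT INTO raw_members VALUES
    ('  Ann Lee ', 'ANN@EXAMPLE.COM', 34),
    ('Ann Lee',    'ann@example.com', 34),   -- duplicate once trimmed and lowercased
    ('Bob Ray',    '',                212),  -- empty email, out-of-range age
    ('Cy Dunn',    'cy@example.com',  NULL);

CREATE TABLE clean_members AS
SELECT DISTINCT                                       -- remove duplicate rows
    SUBSTR(full_name, 1, INSTR(full_name, ' ') - 1) AS first_name,  -- split one value in two
    SUBSTR(full_name, INSTR(full_name, ' ') + 1)    AS last_name,
    email,
    CASE WHEN age BETWEEN 0 AND 120 THEN age END    AS age          -- out-of-range becomes NULL
FROM (
    SELECT TRIM(full_name)                AS full_name,  -- strip extra spaces
           NULLIF(LOWER(TRIM(email)), '') AS email,      -- normalize case, empty string to NULL
           age
    FROM raw_members
) AS trimmed;
""")

rows = conn.execute(
    "SELECT first_name, last_name, email, age FROM clean_members ORDER BY first_name"
).fetchall()
for row in rows:
    print(row)
```

Replacing impossible ages with NULL rather than deleting the row is only one possible choice; depending on the project, dropping or flagging such rows may be more appropriate.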

Datasets used

This repository contains different projects/datasets to give the user many opportunities to practice:

  • Basic select statements (select, where, group by, having).
  • Aggregate functions (count, sum, min, max, avg).
  • Joins (inner, outer, left, right).
  • CTEs, temp tables and views.
  • String and date manipulation functions.
  • Window functions (rank, lead, lag, row_number, ntile...).
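
To give a feel for a few of the topics above, here is a small sketch that combines a CTE, two window functions (`ROW_NUMBER` and `LAG`) and a `GROUP BY ... HAVING` aggregate. It again uses SQLite through Python's built-in sqlite3 module purely for convenience (window functions need SQLite 3.25 or newer). The `orders` table and its rows are hypothetical and are not part of this repo's datasets.

```python
import sqlite3

# In-memory database with an invented orders table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL);
INSERT INTO orders VALUES
    ('ann', '2024-01-05', 20.0),
    ('ann', '2024-02-10', 35.0),
    ('bob', '2024-01-20', 15.0),
    ('bob', '2024-03-01', 50.0),
    ('bob', '2024-03-15', 10.0);
""")

# CTE + window functions: latest order per customer, with the previous order's amount.
latest_query = """
WITH ranked AS (
    SELECT customer, order_date, amount,
           ROW_NUMBER() OVER (PARTITION BY customer ORDER BY order_date DESC) AS rn,
           LAG(amount)  OVER (PARTITION BY customer ORDER BY order_date)      AS prev_amount
    FROM orders
)
SELECT customer, order_date, amount, prev_amount
FROM ranked
WHERE rn = 1
ORDER BY customer;
"""
latest = conn.execute(latest_query).fetchall()

# Aggregates with GROUP BY / HAVING: customers with more than two orders.
totals = conn.execute("""
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING COUNT(*) > 2;
""").fetchall()

print(latest)
print(totals)
```

The same `ROW_NUMBER` pattern is also a common way to remove duplicates: partition by the columns that define a duplicate and keep only the rows where `rn = 1`.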