data_cleaning_project_sber

Data cleaning of a dataset about apartments in Moscow

1. Project description
2. Solving case
3. Data summary
4. Stages of the project work
5. Results

Project description

The data come from a competition on the Kaggle platform initiated by Sber and held in 2017. Sber's goal was a model that would predict housing prices in Moscow based on the parameters of each property.

⬆️To contents

Solving case

Prepare the data for model building, which includes:

I. basic analysis of the data structure
II. detection of missing data
III. processing of missing data
IV. outliers detection and cleaning
V. search and removal of duplicates

Quality metric
The output should be clean data: duplicates and outliers removed, missing values handled, and the data converted into a form suitable for analysis.

What we practice
Processing and cleaning data to prepare it for further work.

Data summary

The data is a set of 30471 records of real properties in Moscow and the Moscow region, with attributes that presumably influence the price. Each record has 61 attributes: the total area of the property, living area, district, price, ecology, information about kindergartens, schools, and hospitals, and distances to the center, to subway and train stations, to stores, museums, etc.

⬆️To contents

Stages of the project work

  1. basic analysis of the data structure

Analyzing the data structure and determining the necessary transformations.
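A minimal sketch of this stage, assuming pandas; the toy frame and its column names are illustrative stand-ins for the real 61-attribute dataset:

```python
import pandas as pd

# Toy frame standing in for the Sber dataset (column names are assumptions).
df = pd.DataFrame({
    "full_sq": [43, 34, 43],                                  # total area, sq. m
    "sub_area": ["Bibirevo", "Nagatinskij Zaton", "Bibirevo"],  # district
    "price_doc": [5850000, 6000000, 5850000],                 # price, RUB
})

df.info()                         # dtypes and non-null counts per column
print(df.describe())              # summary statistics for numeric columns
print(df.dtypes.value_counts())   # how many columns of each dtype
```

`info()` and `describe()` together reveal which columns are numeric vs. categorical and where transformations will be needed.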

  2. detection of missing data

Detection and analysis of missing values; identifying entries/features that need to be removed and those that need to be processed.
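This step can be sketched as follows, assuming pandas; the columns and the 0.5 drop threshold are assumptions for illustration, not values from the project:

```python
import pandas as pd
import numpy as np

# Toy frame with gaps; column names are illustrative.
df = pd.DataFrame({
    "life_sq": [20.0, np.nan, 25.0, np.nan],
    "floor":   [3, 7, np.nan, 2],
    "ecology": ["good", None, "poor", "good"],
})

# Share of missing values per column, worst first.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# A common rule of thumb: features missing in more than half of the rows are
# flagged for removal, the rest for imputation (the 0.5 cut-off is an assumption).
to_drop = missing_share[missing_share > 0.5].index.tolist()
to_fill = missing_share[(missing_share > 0) & (missing_share <= 0.5)].index.tolist()
print(to_drop, to_fill)
```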

  3. processing of missing data

Analysis of missing data and filling numerical features with median values and categorical features with modes
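The median/mode imputation described above can be sketched like this, assuming pandas; the toy columns are illustrative:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "full_sq": [40.0, np.nan, 60.0, 50.0],    # numeric feature
    "ecology": ["good", "poor", None, "good"],  # categorical feature
})

# Numeric columns: fill gaps with the median (robust to outliers).
for col in df.select_dtypes(include="number"):
    df[col] = df[col].fillna(df[col].median())

# Categorical columns: fill gaps with the mode (most frequent value).
for col in df.select_dtypes(exclude="number"):
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```

Medians are preferred over means here because price-related features are heavily skewed, so a few extreme values would pull the mean away from typical records.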

  4. outliers detection and cleaning

Finding and identifying outliers using domain logic, the three-sigma rule, and Tukey's method.
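Both rules can be sketched in a few lines, assuming pandas; the price series below is an invented toy example:

```python
import pandas as pd

# Toy prices in millions of RUB with one obvious outlier (illustrative data).
prices = pd.Series([5.0, 5.5, 6.0, 5.8, 6.2, 5.9, 50.0])

# Three-sigma rule: values farther than 3 standard deviations from the mean.
# Note: on a tiny sample the outlier itself inflates the std, so this rule
# catches nothing here, which is why Tukey's method is also applied.
mean, std = prices.mean(), prices.std()
sigma_outliers = prices[(prices - mean).abs() > 3 * std]

# Tukey's method: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
tukey_outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]

print(sigma_outliers)
print(tukey_outliers)
```

Tukey's fences are based on quartiles, so they stay stable even when the outlier distorts the mean and standard deviation.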

  5. search and removal of duplicates

Searching for and deleting duplicate records.
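A minimal sketch of duplicate removal with pandas (the two columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "full_sq": [43, 34, 43],
    "price_doc": [5850000, 6000000, 5850000],
})

# Count fully identical rows, then drop them, keeping the first occurrence.
n_dups = df.duplicated().sum()
df = df.drop_duplicates().reset_index(drop=True)
print(n_dups, df.shape)
```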

⬆️To contents

Results

As a result of the work, we obtained a dataset cleaned of duplicates, missing data, and outliers.

⬆️To contents