data_cleaning_project_sber

Data cleaning of a dataset about apartments in Moscow

1. Project description
2. Solving case
3. Data summary
4. Stages of the project work
5. Results

Project description

The data come from a competition on the Kaggle platform initiated by Sber and held in 2017. Sber's goal was a model that would predict housing prices in Moscow based on the parameters of each property.

⬆️To contents

Solving case

Prepare the data for model building, which includes:

I. basic analysis of the data structure
II. detection of missing data
III. processing of missing data
IV. outliers detection and cleaning
V. search and removal of duplicates

Quality metric
The output should be clean data: duplicates and outliers removed, missing values handled, and the data converted into a form suitable for analysis.

What we practice
Processing and cleaning data to prepare it for further work.

Data summary

The data is a set of 30471 records of real properties in Moscow and the Moscow region, with attributes that presumably influence the price. Each record has 61 attributes: the total area of the property, living area, district, price, ecology, information about kindergartens, schools, and hospitals, and distances to the center, to subway and train stations, to stores, museums, etc.

⬆️To contents

Stages of the project work

  1. basic analysis of the data structure

Analyzing the data structure and determining the necessary transformations.
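A minimal sketch of this stage, assuming pandas; the toy frame and its column names are illustrative stand-ins for the real 61-attribute dataset:

```python
import pandas as pd

# Toy frame standing in for the Sber dataset (column names are assumptions).
df = pd.DataFrame({
    "full_sq": [43, 34, 43],                                  # total area, sq. m
    "sub_area": ["Bibirevo", "Nagatinskij Zaton", "Bibirevo"],  # district
    "price_doc": [5850000, 6000000, 5850000],                 # price, RUB
})

df.info()                         # dtypes and non-null counts per column
print(df.describe())              # summary statistics for numeric columns
print(df.dtypes.value_counts())   # how many columns of each dtype
```

`info()` and `describe()` together reveal which columns are numeric vs. categorical and where transformations will be needed.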

  2. detection of missing data

Detection and analysis of missing values; identifying entries/features that need to be removed and those that need to be processed.
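This step can be sketched as follows, assuming pandas; the columns and the 0.5 drop threshold are assumptions for illustration, not values from the project:

```python
import pandas as pd
import numpy as np

# Toy frame with gaps; column names are illustrative.
df = pd.DataFrame({
    "life_sq": [20.0, np.nan, 25.0, np.nan],
    "floor":   [3, 7, np.nan, 2],
    "ecology": ["good", None, "poor", "good"],
})

# Share of missing values per column, worst first.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# A common rule of thumb: features missing in more than half of the rows are
# flagged for removal, the rest for imputation (the 0.5 cut-off is an assumption).
to_drop = missing_share[missing_share > 0.5].index.tolist()
to_fill = missing_share[(missing_share > 0) & (missing_share <= 0.5)].index.tolist()
print(to_drop, to_fill)
```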

  3. processing of missing data

Analysis of missing data and filling numerical features with median values and categorical features with modes
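The median/mode imputation described above can be sketched like this, assuming pandas; the toy columns are illustrative:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "full_sq": [40.0, np.nan, 60.0, 50.0],    # numeric feature
    "ecology": ["good", "poor", None, "good"],  # categorical feature
})

# Numeric columns: fill gaps with the median (robust to outliers).
for col in df.select_dtypes(include="number"):
    df[col] = df[col].fillna(df[col].median())

# Categorical columns: fill gaps with the mode (most frequent value).
for col in df.select_dtypes(exclude="number"):
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```

Medians are preferred over means here because price-related features are heavily skewed, so a few extreme values would pull the mean away from typical records.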

  4. outliers detection and cleaning

Finding and identifying outliers using domain logic, the three-sigma rule, and Tukey's method.
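Both rules can be sketched in a few lines, assuming pandas; the price series below is an invented toy example:

```python
import pandas as pd

# Toy prices in millions of RUB with one obvious outlier (illustrative data).
prices = pd.Series([5.0, 5.5, 6.0, 5.8, 6.2, 5.9, 50.0])

# Three-sigma rule: values farther than 3 standard deviations from the mean.
# Note: on a tiny sample the outlier itself inflates the std, so this rule
# catches nothing here, which is why Tukey's method is also applied.
mean, std = prices.mean(), prices.std()
sigma_outliers = prices[(prices - mean).abs() > 3 * std]

# Tukey's method: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
tukey_outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]

print(sigma_outliers)
print(tukey_outliers)
```

Tukey's fences are based on quartiles, so they stay stable even when the outlier distorts the mean and standard deviation.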

  5. search and removal of duplicates

Searching for and deleting duplicate records.
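A minimal sketch of duplicate removal with pandas (the two columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "full_sq": [43, 34, 43],
    "price_doc": [5850000, 6000000, 5850000],
})

# Count fully identical rows, then drop them, keeping the first occurrence.
n_dups = df.duplicated().sum()
df = df.drop_duplicates().reset_index(drop=True)
print(n_dups, df.shape)
```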

⬆️To contents

Results

As a result of the work, we obtained a dataset cleaned of duplicates, missing data, and outliers.

⬆️To contents