This repository contains the study projects I completed as part of the OTUS Data Science course.
You'll find a brief overview of the projects below.
Please note that some of the commentary in the code and project reports is in Russian, as it was sometimes easier and faster for me to express my ideas in my native language — I had to balance my regular job against course deadlines! However, I tried to write in English as much as possible.
This is a small, mostly research-based project. The topic is Identification of Psychiatric Disorders From EEG Using Machine Learning Methods (original title in Russian: Идентификация психических расстройств методами машинного обучения на основе данных ЭЭГ).
It is probably more convenient to view it on Kaggle:
- Exploratory Data Analysis and multiclass classification
- Binary classification and feature importance
A presentation (in Russian) about the project and its results
Basic programming tasks in Python
Basic OOP tasks in Python + tests with pytest (`test_homework_02` folder)
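To give a flavour of this kind of homework, here is a minimal sketch of a class with a pytest-style test. The `Stack` class and the test are illustrative only, not the actual homework code:

```python
# A tiny example in the spirit of the OOP homework: a class plus a
# pytest-style test (names here are illustrative, not the real code).

class Stack:
    """Minimal LIFO stack."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

    def __len__(self):
        return len(self._items)


# pytest discovers functions named test_*; plain asserts are enough.
def test_push_pop():
    s = Stack()
    s.push(1)
    s.push(2)
    assert s.pop() == 2
    assert len(s) == 1
```

Running `pytest` in the folder picks up any `test_*` function automatically.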
Basics of numpy, pandas and data visualisation with seaborn and matplotlib
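A toy example of the kind of operation practised here (the DataFrame and column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# A small DataFrame and a groupby aggregation — a typical pandas exercise.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0],
})
means = df.groupby("group")["value"].mean()
# means["a"] == 2.0, means["b"] == 3.0
```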
Binary classification:
- EDA
- KNN
- Logistic Regression
- Cross-validation
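The steps above can be sketched roughly as follows; synthetic data and default hyperparameters stand in for the actual notebook code:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the course dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Scaling matters for KNN (distance-based) and helps logistic regression.
models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# 5-fold cross-validated accuracy for each model.
cv_means = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
```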
Regression with the Linear Regression algorithm:
- EDA
- Pre-processing: missing data, categorical feature encoding, outliers, scaling, feature engineering
- L1 and L2 regularisation
- Cross-validation
- Comparison of models with different regularisation, scaling, and feature sets
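A minimal sketch of such a comparison, assuming synthetic data in place of the real dataset (OLS vs L2-regularised Ridge vs L1-regularised Lasso, all with scaling and cross-validation):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the course dataset.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

models = {
    "ols": LinearRegression(),
    "ridge (L2)": Ridge(alpha=1.0),
    "lasso (L1)": Lasso(alpha=1.0),
}

# 5-fold cross-validated R^2 for each model, with feature scaling.
r2 = {
    name: cross_val_score(
        make_pipeline(StandardScaler(), model), X, y, cv=5, scoring="r2"
    ).mean()
    for name, model in models.items()
}
```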
Binary classification with gradient boosting algorithms:
- Data Cleaning
- EDA
- Pre-processing: missing data, categorical feature encoding
- Cross-validation
- Comparison of different gradient boosting implementations: sklearn, XGBoost, CatBoost, LightGBM
- Feature importance
Basic linear regression in PyTorch
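The core of such a project fits in a few lines; this sketch fits y = 2x + 1 with a single `nn.Linear` layer and SGD (toy data, not the notebook's):

```python
import torch

# Toy data: y = 2x + 1 plus a little Gaussian noise.
torch.manual_seed(0)
x = torch.linspace(-1, 1, 100).unsqueeze(1)
y = 2 * x + 1 + 0.05 * torch.randn_like(x)

model = torch.nn.Linear(1, 1)        # one weight, one bias
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

# Full-batch gradient descent.
for _ in range(200):
    optimiser.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimiser.step()

final_loss = loss.item()
# model.weight should end up near 2 and model.bias near 1.
```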
Clustering with K-means, agglomerative hierarchical clustering, and DBSCAN
- EDA
- Pre-processing
- Dimensionality reduction (PCA, t-SNE, UMAP)
- Choosing the number of clusters with the elbow method and silhouette score
- Results interpretation
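Choosing k with the silhouette score can be sketched like this, using synthetic blobs where the right answer (k = 3) is known in advance:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated blobs; the silhouette score should peak at k = 3.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```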
Overfitting MNIST in PyTorch to understand what can lead a fully connected network (FC-ANN) to overfit. Practised monitoring training progress with WANDB.
In my project, I was only able to overfit by training for many epochs on small subsets of the data. Experiments with larger FC-ANNs led to poor model performance overall:
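The effect itself can be reproduced without MNIST or PyTorch; as a lightweight stand-in, this sketch overfits an oversized sklearn `MLPClassifier` on a tiny dataset with deliberately noisy labels (all parameters illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Tiny noisy dataset: ~20% of labels are flipped, so a model reaching
# near-100% train accuracy must have memorised the noise.
X, y = make_classification(n_samples=200, n_features=20, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

# Oversized net, no regularisation (alpha=0), many iterations.
mlp = MLPClassifier(hidden_layer_sizes=(256, 128), alpha=0.0,
                    max_iter=3000, random_state=0)
mlp.fit(X_tr, y_tr)

train_acc = mlp.score(X_tr, y_tr)  # close to 1.0
test_acc = mlp.score(X_te, y_te)   # noticeably lower: overfitting
```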
Anomaly detection with various methods:
- Basic methods: std and IQR based
- Distance-based
- Density-based (DBSCAN)
- One-class SVM
- Isolation forest
- Results visualisations and comparisons
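Two of the methods above, sketched on synthetic 2-D data with one planted outlier (the data and thresholds are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 normal points around the origin plus one obvious outlier.
normal = rng.normal(0, 1, size=(200, 2))
data = np.vstack([normal, [[8.0, 8.0]]])

# Basic IQR rule on the first feature.
q1, q3 = np.percentile(data[:, 0], [25, 75])
iqr = q3 - q1
iqr_outliers = (data[:, 0] < q1 - 1.5 * iqr) | (data[:, 0] > q3 + 1.5 * iqr)

# Isolation Forest on both features; fit_predict returns -1 for anomalies.
iso_labels = IsolationForest(random_state=0).fit_predict(data)
```

Both methods should flag the planted point at (8, 8).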
Be warned: it is mostly in Russian.
Infection spreading simulations. It has little to do with data science, but it was an interesting coding exercise and good practice with the basics of the NetworkX library.
Check it out on Kaggle; the notebook in this repo was simply downloaded from there.
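The basic idea can be sketched as a toy susceptible-infected (SI) process on a NetworkX graph; the graph model and probabilities below are illustrative, not the notebook's parameters:

```python
import random
import networkx as nx

random.seed(0)
# Small-world contact network: 100 nodes, each linked to 4 neighbours,
# with 10% of edges rewired.
G = nx.watts_strogatz_graph(n=100, k=4, p=0.1, seed=0)

infected = {0}        # patient zero
p_transmit = 0.3      # per-contact infection probability per step

for _ in range(20):   # simulation steps
    newly = set()
    for node in infected:
        for neighbour in G.neighbors(node):
            if neighbour not in infected and random.random() < p_transmit:
                newly.add(neighbour)
    infected |= newly
```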
Practice with autoencoders in PyTorch.
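A minimal dense autoencoder sketch (random data and dimensions chosen for illustration): compress 16-dimensional vectors through a 4-dimensional bottleneck and train on reconstruction error.

```python
import torch

torch.manual_seed(0)
data = torch.randn(256, 16)  # stand-in for real inputs

model = torch.nn.Sequential(
    torch.nn.Linear(16, 4),   # encoder: 16-d -> 4-d bottleneck
    torch.nn.ReLU(),
    torch.nn.Linear(4, 16),   # decoder: 4-d -> 16-d reconstruction
)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()

first_loss = None
for _ in range(200):
    optimiser.zero_grad()
    loss = loss_fn(model(data), data)  # reconstruction error
    if first_loss is None:
        first_loss = loss.item()
    loss.backward()
    optimiser.step()

final_loss = loss.item()  # should be well below the initial loss
```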
To be continued... You can expect more study projects with Computer Vision tasks and the final project of this course!