This repository contains the study projects I completed as part of the OTUS Data Science course.
You'll find a brief overview of the projects below.
Please note that some of the commentary in the code and project reports is in Russian, as it was sometimes easier and faster for me to express my ideas in my native language — I had to balance my regular job against course deadlines! However, I tried to write in English as much as possible.
This is a small, mostly research-based project. The topic is Identification of Psychiatric Disorders From EEG Using Machine Learning Methods (original title in Russian: Идентификация психических расстройств методами машинного обучения на основе данных ЭЭГ).
It is probably more convenient to view it on Kaggle:
- Exploratory Data Analysis and multiclass classification
- Binary classification and feature importance
A presentation (in Russian) about the project and its results
Basic programming tasks in Python
Basic OOP tasks in Python + tests with pytest (`test_homework_02` folder)
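To give a flavour of this kind of homework, here is a minimal sketch of a class with a pytest-style test. The `Stack` class and the test are illustrative only, not the actual homework code:

```python
# A tiny example in the spirit of the OOP homework: a class plus a
# pytest-style test (names here are illustrative, not the real code).

class Stack:
    """Minimal LIFO stack."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

    def __len__(self):
        return len(self._items)


# pytest discovers functions named test_*; plain asserts are enough.
def test_push_pop():
    s = Stack()
    s.push(1)
    s.push(2)
    assert s.pop() == 2
    assert len(s) == 1
```

Running `pytest` in the folder picks up any `test_*` function automatically.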
Basics of numpy, pandas and data visualisation with seaborn and matplotlib
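A toy example of the kind of operation practised here (the DataFrame and column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# A small DataFrame and a groupby aggregation — a typical pandas exercise.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0],
})
means = df.groupby("group")["value"].mean()
# means["a"] == 2.0, means["b"] == 3.0
```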
Binary classification:
- EDA
- KNN
- Logistic Regression
- Cross-validation
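The steps above can be sketched roughly as follows; synthetic data and default hyperparameters stand in for the actual notebook code:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the course dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Scaling matters for KNN (distance-based) and helps logistic regression.
models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# 5-fold cross-validated accuracy for each model.
cv_means = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
```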
Regression with the Linear Regression algorithm:
- EDA
- Pre-processing: missing data, categorical feature encoding, outliers, scaling, feature engineering
- L1 and L2 regularisation
- Cross-validation
- Comparison of models with different regularisation, scaling, and feature sets
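A minimal sketch of such a comparison, assuming synthetic data in place of the real dataset (OLS vs L2-regularised Ridge vs L1-regularised Lasso, all with scaling and cross-validation):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the course dataset.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

models = {
    "ols": LinearRegression(),
    "ridge (L2)": Ridge(alpha=1.0),
    "lasso (L1)": Lasso(alpha=1.0),
}

# 5-fold cross-validated R^2 for each model, with feature scaling.
r2 = {
    name: cross_val_score(
        make_pipeline(StandardScaler(), model), X, y, cv=5, scoring="r2"
    ).mean()
    for name, model in models.items()
}
```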
Binary classification with gradient boosting algorithms:
- Data Cleaning
- EDA
- Pre-processing: missing data, categorical feature encoding
- Cross-validation
- Comparison of different gradient boosting implementations: sklearn, XGBoost, CatBoost, LightGBM
- Feature importance
Basic linear regression in PyTorch
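The core of such a project fits in a few lines; this sketch fits y = 2x + 1 with a single `nn.Linear` layer and SGD (toy data, not the notebook's):

```python
import torch

# Toy data: y = 2x + 1 plus a little Gaussian noise.
torch.manual_seed(0)
x = torch.linspace(-1, 1, 100).unsqueeze(1)
y = 2 * x + 1 + 0.05 * torch.randn_like(x)

model = torch.nn.Linear(1, 1)        # one weight, one bias
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

# Full-batch gradient descent.
for _ in range(200):
    optimiser.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimiser.step()

final_loss = loss.item()
# model.weight should end up near 2 and model.bias near 1.
```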
Clustering with K-means, agglomerative hierarchical clustering, and DBSCAN
- EDA
- Pre-processing
- Dimensionality reduction (PCA, t-SNE, UMAP)
- Choosing the number of clusters with the elbow method and silhouette score
- Results interpretation
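Choosing k with the silhouette score can be sketched like this, using synthetic blobs where the right answer (k = 3) is known in advance:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated blobs; the silhouette score should peak at k = 3.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```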
Overfitting MNIST in PyTorch to understand what can lead a fully connected network (FC-ANN) to overfit. Practised monitoring training progress with WANDB.
In my project, I was only able to overfit by training for many epochs on small subsets of the data. Experiments with larger FC-ANNs led to poor model performance overall:
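The effect itself can be reproduced without MNIST or PyTorch; as a lightweight stand-in, this sketch overfits an oversized sklearn `MLPClassifier` on a tiny dataset with deliberately noisy labels (all parameters illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Tiny noisy dataset: ~20% of labels are flipped, so a model reaching
# near-100% train accuracy must have memorised the noise.
X, y = make_classification(n_samples=200, n_features=20, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

# Oversized net, no regularisation (alpha=0), many iterations.
mlp = MLPClassifier(hidden_layer_sizes=(256, 128), alpha=0.0,
                    max_iter=3000, random_state=0)
mlp.fit(X_tr, y_tr)

train_acc = mlp.score(X_tr, y_tr)  # close to 1.0
test_acc = mlp.score(X_te, y_te)   # noticeably lower: overfitting
```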
Anomaly detection with various methods:
- Basic methods: std and IQR based
- Distance-based
- Density-based (DBSCAN)
- One-class SVM
- Isolation forest
- Results visualisations and comparisons
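Two of the methods above, sketched on synthetic 2-D data with one planted outlier (the data and thresholds are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 normal points around the origin plus one obvious outlier.
normal = rng.normal(0, 1, size=(200, 2))
data = np.vstack([normal, [[8.0, 8.0]]])

# Basic IQR rule on the first feature.
q1, q3 = np.percentile(data[:, 0], [25, 75])
iqr = q3 - q1
iqr_outliers = (data[:, 0] < q1 - 1.5 * iqr) | (data[:, 0] > q3 + 1.5 * iqr)

# Isolation Forest on both features; fit_predict returns -1 for anomalies.
iso_labels = IsolationForest(random_state=0).fit_predict(data)
```

Both methods should flag the planted point at (8, 8).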
Be warned: it is mostly in Russian.
Infection spreading simulations. It has little to do with data science, but it was an interesting coding exercise and good practice with the basics of the NetworkX library.
Check it out on Kaggle; the notebook in this repo was simply downloaded from there.
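The basic idea can be sketched as a toy susceptible-infected (SI) process on a NetworkX graph; the graph model and probabilities below are illustrative, not the notebook's parameters:

```python
import random
import networkx as nx

random.seed(0)
# Small-world contact network: 100 nodes, each linked to 4 neighbours,
# with 10% of edges rewired.
G = nx.watts_strogatz_graph(n=100, k=4, p=0.1, seed=0)

infected = {0}        # patient zero
p_transmit = 0.3      # per-contact infection probability per step

for _ in range(20):   # simulation steps
    newly = set()
    for node in infected:
        for neighbour in G.neighbors(node):
            if neighbour not in infected and random.random() < p_transmit:
                newly.add(neighbour)
    infected |= newly
```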
Practice with autoencoders in PyTorch.
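A minimal dense autoencoder sketch (random data and dimensions chosen for illustration): compress 16-dimensional vectors through a 4-dimensional bottleneck and train on reconstruction error.

```python
import torch

torch.manual_seed(0)
data = torch.randn(256, 16)  # stand-in for real inputs

model = torch.nn.Sequential(
    torch.nn.Linear(16, 4),   # encoder: 16-d -> 4-d bottleneck
    torch.nn.ReLU(),
    torch.nn.Linear(4, 16),   # decoder: 4-d -> 16-d reconstruction
)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()

first_loss = None
for _ in range(200):
    optimiser.zero_grad()
    loss = loss_fn(model(data), data)  # reconstruction error
    if first_loss is None:
        first_loss = loss.item()
    loss.backward()
    optimiser.step()

final_loss = loss.item()  # should be well below the initial loss
```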
To be continued... You can expect more study projects with Computer Vision tasks and the final project of this course!