mini_gcp


Save time during Data Science interviews!
View Demo · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Design
  5. Roadmap
  6. License
  7. Contact
  8. Acknowledgments

About The Project

This is a framework (and base template) for data science technical interview tasks, inspired by the structure of GCP pipelines and by how repetitive interview tasks tend to be.

Intro

I'm sure that almost anyone who has gone through data science interviews (and especially their technical tasks) has been surprised by the lack of creativity in the questions. Unless someone put in the time to build a custom, company-specific assignment, the technical tasks are very trivial: they require rewriting the same functions and the same reasoning, and, most importantly, they evaluate the candidate from almost the same point of view.
This project was made during my interview process with Klarna to save time in subsequent interviews with similar tasks. The goal is a generic framework that contains most of the functions / recipes used in a typical basic task and that is modular, extendable, reusable and customizable.
Almost any data science project follows the standard pipeline: EDA -> Cleaning -> Feature Engineering -> Training -> Evaluation (cross-validation) -> HP tuning -> Model Selection -> Training -> Deployment. Most cloud service providers have automated and modularized this pipeline and its orchestration precisely because it repeats so often. I took inspiration from them and built a local framework for common data science tasks.
The main advantages here are:

  • No need to rewrite the code 12931 times
  • Missing pieces, additional feature engineering or cleaning recipes can be added to the framework
  • The framework is modular, which demonstrates the applicant's software engineering skills (so often missing in this field)
  • Potentially gives managers and senior data scientists an incentive to put more time and effort into technical tasks.

(back to top)


Getting Started

To get a local copy up and running, follow these example steps.

Prerequisites

  • Docker: install Docker
  • with conda or venv: create a new environment
conda create -n py3
conda activate py3
  • to run on GPU, install the CUDA Toolkit and follow the instructions on configuring the environment to run torch with CUDA (a quick check is sketched below)
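A minimal sanity check for the GPU setup (a sketch, assuming torch is already installed via requirements.txt):

    import torch

    print(torch.__version__)                  # installed torch version
    print(torch.cuda.is_available())          # True if torch can see a CUDA device
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))  # name of the first visible GPU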

Installation

  • with Docker
    docker build . -t name:tag
    docker run -dit --name NAME name:tag
  • alternatively
  1. Clone the repo
    git clone https://github.com/STASYA00/mini_gcp.git
  2. Install pip packages
    pip install -r requirements.txt

(back to top)

Usage

  1. Replace FILENAME, the categorical columns and other parameters with your own values
  2. Start with EDA.ipynb to understand which of the recipes and models fit your problem best; do some additional EDA if required
  3. Add the necessary recipes as children of Recept and models as children of Model (see the sketch after this list)
  4. Combine all the ingredients in a child class of Experiment; refer to BaseExperiment for an example
  5. Run the code:
python main.py
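A purely illustrative sketch of steps 3-4 (the import paths, hook names and column below are assumptions, not the repo's actual API; check Recept, Model and BaseExperiment in the source for the real signatures):

    # Hypothetical sketch; the real Recept / Experiment interfaces may differ.
    import pandas as pd
    from recepts import Recept                # assumed import path
    from experiment import BaseExperiment     # assumed import path

    class FillMissingAge(Recept):
        """Cleaning recipe: replace missing ages with the median (placeholder column)."""
        def apply(self, df: pd.DataFrame) -> pd.DataFrame:   # assumed hook name
            df["age"] = df["age"].fillna(df["age"].median())
            return df

    class MyExperiment(BaseExperiment):
        """Combines the recipes (and, analogously, the models) used in one run."""
        def recepts(self):                                    # assumed hook name
            return [FillMissingAge()]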

The following steps will be automated in future updates:
  6. Check your models' performance in log.csv
  7. Choose the best performing model and recipe collection
  8. Use this selection for the final model
  9. Deploy the model
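For step 6, a quick way to inspect the log (a minimal sketch; the column name score is an assumption about the actual log.csv layout):

    import pandas as pd

    log = pd.read_csv("log.csv")
    # Sort runs by score so the best performing model / recipe combination is on top.
    print(log.sort_values("score", ascending=False).head())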

(back to top)

Design

The framework was designed to be reusable for the most common cases, extendable (as we know, there is no one-size-fits-all solution) and modular. It represents a typical data science pipeline: EDA -> Cleaning -> Feature Engineering -> Training -> Evaluation (cross-validation) -> HP tuning -> Model Selection -> Training -> Deployment. The main extendable parts are cleaning + feature engineering, represented as Recept modules, and training + evaluation + HP tuning, represented as Model modules.
A UML diagram of the framework will be added soon to provide more clarity.
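For reference, the standard flow that the framework modularizes, expressed directly with scikit-learn (a minimal sketch; the dataset, column names and model choice are placeholders, not part of the repo):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split

    df = pd.read_csv("data.csv")                        # placeholder dataset
    X, y = df.drop(columns=["target"]), df["target"]    # placeholder target column

    # Cleaning + feature engineering (the part the framework wraps in Recept modules).
    prep = ColumnTransformer([
        ("num", StandardScaler(), ["age"]),                            # placeholder numeric columns
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),  # placeholder categorical columns
    ])

    # Training + evaluation + HP tuning (the part the framework wraps in Model modules).
    pipe = Pipeline([("prep", prep), ("model", LogisticRegression(max_iter=1000))])
    search = GridSearchCV(pipe, {"model__C": [0.1, 1.0, 10.0]}, cv=5)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    search.fit(X_tr, y_tr)
    print(search.best_params_, search.score(X_te, y_te))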

Roadmap

Challenges

Future work

  • Add visualization notebook
  • Add explanation notebook
  • Add hyperparameter tuning module
  • Add logging of recipes as a separate table
  • Add different deployment methods
  • Add choosing the best model by result
  • Add rebuilding a model and recipe from a log record
  • Add an experiment that trains on the full dataset
  • Add a prediction module (for the test set)

See the open issues for a full list of proposed features (and known issues).

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

Stasja - @stasya00 - e-mail

(back to top)

Acknowledgments

(back to top)