mini_gcp


Save time during Data Science interviews!
View Demo · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Design
  5. Roadmap
  6. License
  7. Contact
  8. Acknowledgments

About The Project

This is a framework (and base template) for data science technical interview tasks, inspired by the structure of GCP pipelines and by how repetitive interview tasks tend to be.

Intro

I'm sure that almost anyone who has gone through data science interviews (and especially their technical tasks) has been surprised by the lack of creativity in the questions. Unless someone put in the time to build a custom, company-specific assignment, the technical tasks are very trivial: they require rewriting the same functions and the same reasoning, and, most importantly, they evaluate the candidate from almost the same point of view.
This project was made during my interview process with Klarna to save time in subsequent interviews with similar tasks. The goal is a generic framework that contains most of the functions / recipes used in a typical basic task and that is modular, extendable, reusable and customizable.
Almost any data science project follows the standard pipeline: EDA -> Cleaning -> Feature Engineering -> Training -> Evaluation (cross-validation) -> HP tuning -> Model Selection -> Training -> Deployment. Most cloud service providers have automated and modularized this pipeline and its orchestration precisely because it repeats so often. I took inspiration from them and built a local framework for common data science tasks.
The main advantages here are:

  • No need to rewrite the code 12931 times
  • Missing pieces, additional feature engineering or cleaning recipes can be added to the framework
  • The framework is modular, which demonstrates the applicant's software engineering skills (so often missing in this field)
  • Potentially gives managers and senior data scientists an incentive to put more time and effort into technical tasks.

(back to top)


Getting Started

To get a local copy up and running, follow these example steps.

Prerequisites

  • Docker: install Docker
  • with conda or venv: create a new environment
conda create -n py3
conda activate py3
  • to run on GPU, install the CUDA Toolkit and follow the instructions on configuring the environment to run torch with CUDA (a quick check is sketched below)
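A minimal sanity check for the GPU setup (a sketch, assuming torch is already installed via requirements.txt):

    import torch

    print(torch.__version__)                  # installed torch version
    print(torch.cuda.is_available())          # True if torch can see a CUDA device
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))  # name of the first visible GPU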

Installation

  • with Docker
    docker build . -t name:tag
    docker run -dit --name NAME name:tag
  • alternatively
  1. Clone the repo
    git clone https://github.com/STASYA00/mini_gcp.git
  2. Install pip packages
    pip install -r requirements.txt

(back to top)

Usage

  1. Replace FILENAME, the categorical columns and other parameters with your own values
  2. Start with EDA.ipynb to understand which of the recipes and models fit your problem best; do some additional EDA if required
  3. Add the necessary recipes as children of Recept and models as children of Model (see the sketch after this list)
  4. Combine all the ingredients in a child class of Experiment; refer to BaseExperiment for an example
  5. Run the code:
python main.py
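A purely illustrative sketch of steps 3-4 (the import paths, hook names and column below are assumptions, not the repo's actual API; check Recept, Model and BaseExperiment in the source for the real signatures):

    # Hypothetical sketch; the real Recept / Experiment interfaces may differ.
    import pandas as pd
    from recepts import Recept                # assumed import path
    from experiment import BaseExperiment     # assumed import path

    class FillMissingAge(Recept):
        """Cleaning recipe: replace missing ages with the median (placeholder column)."""
        def apply(self, df: pd.DataFrame) -> pd.DataFrame:   # assumed hook name
            df["age"] = df["age"].fillna(df["age"].median())
            return df

    class MyExperiment(BaseExperiment):
        """Combines the recipes (and, analogously, the models) used in one run."""
        def recepts(self):                                    # assumed hook name
            return [FillMissingAge()]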

The following steps will be automated in future updates:
  6. Check your models' performance in log.csv
  7. Choose the best performing model and recipe collection
  8. Use this selection for the final model
  9. Deploy the model
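For step 6, a quick way to inspect the log (a minimal sketch; the column name score is an assumption about the actual log.csv layout):

    import pandas as pd

    log = pd.read_csv("log.csv")
    # Sort runs by score so the best performing model / recipe combination is on top.
    print(log.sort_values("score", ascending=False).head())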

(back to top)

Design

The framework was designed to be reusable for the most common cases, extendable (as we know, there is no one-size-fits-all solution) and modular. It represents a typical data science pipeline: EDA -> Cleaning -> Feature Engineering -> Training -> Evaluation (cross-validation) -> HP tuning -> Model Selection -> Training -> Deployment. The main extendable parts are cleaning + feature engineering, represented as Recept modules, and training + evaluation + HP tuning, represented as Model modules.
A UML diagram of the framework will be added soon to provide more clarity.
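For reference, the standard flow that the framework modularizes, expressed directly with scikit-learn (a minimal sketch; the dataset, column names and model choice are placeholders, not part of the repo):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split

    df = pd.read_csv("data.csv")                        # placeholder dataset
    X, y = df.drop(columns=["target"]), df["target"]    # placeholder target column

    # Cleaning + feature engineering (the part the framework wraps in Recept modules).
    prep = ColumnTransformer([
        ("num", StandardScaler(), ["age"]),                            # placeholder numeric columns
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),  # placeholder categorical columns
    ])

    # Training + evaluation + HP tuning (the part the framework wraps in Model modules).
    pipe = Pipeline([("prep", prep), ("model", LogisticRegression(max_iter=1000))])
    search = GridSearchCV(pipe, {"model__C": [0.1, 1.0, 10.0]}, cv=5)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    search.fit(X_tr, y_tr)
    print(search.best_params_, search.score(X_te, y_te))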

Roadmap

Challenges

Future work

  • Add visualization notebook
  • Add explanation notebook
  • Add hyperparameter tuning module
  • Add logging of recipes as a separate table
  • Add different deployment methods
  • Add choosing the best model by result
  • Add rebuilding a model and recipe from a log record
  • Add an experiment that trains on the full dataset
  • Add a prediction module (for the test set)

See the open issues for a full list of proposed features (and known issues).

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

Stasja - @stasya00 - e-mail

(back to top)

Acknowledgments

(back to top)