ARBML/klaam

[Proposal] Codebase refactoring

ma7dev opened this issue ยท 6 comments

To organize the code and introduce testing and continuous integration, it would be beneficial to refactor the entire codebase.

TL;DR

  • Re-organizing the codebase to follow best practices and to introduce testing and continuous integration.
  • Separating logic to import the package as a separate module, scripts to localize scripts that were used for train/inference of the logic, notebooks to localize demos and simple scripts that were written as notebooks, and tests to test the logic
  • Adding GitHub Actions to test build, logic of the package, auto-generate docs, and to publish the package to pypi
  • Moving from pip and requirements.txt setup to conda for environment management and poetry for packages management. This will ease the development as the project scales.

Codebase refactoring

Mapping
file/dir action placement
FastSpeed2/* moved kaalm/external/FastSpeed2/*
dialect_speech_corpus moved klaam/speech_corpus/dialect.py
egy_speech_corpus moved klaam/speech_corpus/egy.py
mor_speech_corpus moved klaam/speech_corpus/mor.py
samples moved samples
.gitignore moved .gitignore
LICENSE moved LICENSE
README.md moved README.md
audio_utils.py moved klaam/utils/audio.py
demo.ipynb moved notebooks/demo.ipynb
demo_with_mic.ipynb moved notebooks/demo_with_mix.ipynb
inference.ipynb moved notebooks/inference.ipynb
klaam.py moved klaam/run.py
klaam_logo.PNG moved misc/klaam_logo.png
models.py moved klaam/models/wav2vec.py
processors.py moved klaam/processors/custom_wave2vec.py
requirements.txt removed  
run.sh moved scripts/run.sh
run_classifier.py moved scripts/run_classifier.py
run_common_voice.py moved scripts/run_common_voice.py
run_mgb3.py moved scripts/run_mgb3.py
run_mgb5.py moved scripts/run_mgb5.py
sample_run.sh moved scripts/sample_run.sh
utils.py moved klaam/utils/utils.py
  added docs
  added tests
  added .github
  added output
  added environment.yml
  added install.sh
  added mypi.ini
  added pyproject.toml
  added pytest.ini
  added ckpts
Tree Structure
root level 1 level2 description
.github     github stuff (e.g. github issue templates, github actions workflows, etc.)
  workflows    
    build.yml to test building of the package
    publish.yml to publish the package to pypi
    tests.yml to run tests
    docs.yml to generate documentation
klaam     the logic for the package
  utils    
    audio.py  
    utils.py  
  models    
    wav2vec.py  
  processors    
    wave2vec.py  
  external    
    FastSpeed2/*  
  speech_corpus    
    dialect.py  
    egy.py  
    mor.py  
  run.py    
notebooks      
  demo.ipynb    
  demo_with_mix.ipynb    
  inference.ipynb    
       
scripts     set of scripts to be used to train/evaluate or anything external from the logic of the package
  run.sh    
  run_classifier.py    
  run_common_voice.py    
  run_mgb3.py    
  run_mgb5.py    
  sample_run.sh    
tests     set of tests to test logics within klaam
  test_*.py    
  conftest.py    
misc      
  klaam_logo.png    
samples      
  demo.wav    
ckpts ...   checkpoints of pre-trained models that were downloaded
docs ...   documentation files
output ...    
environment.yml     conda environment definition
install.sh     installing script to setup conda environment and install dependecies using poetry
mypy.ini     pylint configuration
pyproject.toml     package definition and list of dependecies to be installed
pytest.ini     pytest configuration
LICENSE      
README.md      
.gitignore      

Environment/dependencies packages

  • conda is used to manage the environment and install essential libraries that are big/core to the package, e.g. TensorFlow, PyTorch, cudatools, etc.
  • poetry is used to manage dependencies and setup the package
  • pytest is used to enable unit/integration testing of the codebase

Commands

  • poetry add PACKAGE - to add a package (this will append to pyproject.toml)
    • If the package installation failed and couldn't find another way to add the package, then install it using conda and add to enviroment.yml manually. (leave a comment next to the line)
    • Check on the web for the right channels when install packages using conda
  • poetry install - to install the package (package_name)
  • pytest tests - to run all tests manually
  • pytest tests/TEST_PATH - to run a specific test file (check pytest documentation for more information)

Edit - added the following sections: env/dep packages and commands

Thanks, @sudomaze for your efforts :). @MagedSaeed any remarks on how to move forward with this?

Thank much @sudomaze for this awesome effort.

I can categorize this enhancement proposal into the following: restructuring the repo, improving the CI/CD experience, and more alignment with open source projects standards, and best practices. Let me break down and discuss each of them in further detail.


  • For restructuring the project. The proposed structure is really great. Thanks much for it. There are a couple of comments regarding the current status of klaam that I think we should address before moving forward.
    • I think FastSpeechs package is added as a kind of reference and inspiration. tbh, I not sure why it is added from the first place. If we are not using it any more, and I think this is the case, I think it is better to link the original repo instead. @zaidalyafeai, Please advise.
    • The development of Klaam started early as an unofficial fork from transformers research projects. I am not sure if, at that time, it was possible to derive and reuse classes directly from transformers instead of maintianing the code locally. If possible, we can subclass their classes instead. The affected modules will be processors and maybe models. Also, alot of code in run_*.py will not be needed if we are going to use their Argument Parsers, and maybe other stuff as well.

I just raised these comments as we are thinking of restructuring things. It should be also fine if we keep them unchanged.


  • For improving the CI/CD experience , this includes adding GitHub Actions integration and maintaining PyPI package releases and test suits. I remember we had a discussion earlier on this, not only to package Klaam but to have a general packaging pipeline for ARBML products in general. However, we pruned further effort since arbml at that time was more research-oriented rather than software products oriented. I think this needs further discussion as this decision will not affect klaam but maybe all other repo in ARBML. Even if we are going with that direction, I think also we need to maintain an APIs service to serve interaction with the trained models as packing these models would be impractical via PyPI or GitHub releases. For this, I think the best option to go with is a space in huggingface spaces for each repo/product . If anyone would like to use the model in his bussiness or customize it, he then should manage the integration to his workflow the way he wish. If we are going with huggingface spaces, I think we can still automate stuff and use GitHub action to deploy to the space, cannot we @sudomaze ?

  • For tests, it is always a good idea to add tests to open source projects. It always adds confidence. However, if we are going to resuse huggingface transformers, the logic we are adding would be drastically minimal. We may end up writing more tests than logic :). Also, I personally do not have much experimence in writing tests for models training and inference. Just for the sake of learning, I can start with transformers test and I would appreciate if you can provide more resources.


  • For more alignment with open source projects standards, and best practices, I really like the idea. This will also make the getting started experience for new-comers more easy and more exciting if they do not need much configuration to get started. It is also a good time for me to start working with and adapting poetry. There are also other stuff and utilities to be added to make the team work experience even more seamless. We can also add the following.

    • We can unify the formatting style using black since it is gaining more popularity over other formaters.
    • We can add isort to further minimize PRs conflicts.
    • We can add automatic checks/make this automatic using pre-commit.

    I think this should suffice for a good start and we can add more down the road.

Thanks, @MagedSaeed, for the feedback!

Restructuring the project

  • I am planning to keep the current structure in the first stage of refactoring so we can enable unit testing and ci in the project. We can tackle other aspects of the projects by removing dead/unnecessary code as needed in later issues/PRs.

Improving the CI/CD experience

  • We (@sudogroup) are planning to refactor all of ARBML repos to follow the best practice standards. I think having a good software development environment would encourage other people to contribute and enhance the code quality.
  • We can utilize external APIs to download the latest models/datasets for ease of accessibility. We can have klaam to be both a package on PyPi for anyone to utilize in their codebase or they download the models that you trained and the datasets that used through huggingface. This will combine the two worlds. I haven't pushed anything to huggingface before, however, looking to this, it seems that it is possible to create a GitHub Action to push to huggingface once we have a release or any other action trigger.
  • I think huggingface spaces is just a demo ground. Utilizing the previous point, having our models and datasets on huggingface and Klaam as a package to be installable through PyPi, will allow us (and others) to easily import Klaam and run demos on huggingface spaces.
  • For testing:
    • we can do a lot of testing and I can explain what type of testing I have been doing in my ML/DL research. Here are some two cents from utilizing testing:
      1. Fixing bugs that ensure the system is almost bug-free
      2. Improving speed by comparing old methods vs. newer ones that are faster. Because at the end of the day, it is just matrix operations
    • We don't need to write extensive tests for everything, we just need to make sure that things that we are doing right now, our logic, are tested to reduce potential bugs that could happen or ensure that our expectations of the operations that we have developed matches what is actually been developed. Also, ensures that new code doesn't break old ones. It is just a nice thing to have,
    • In later issues, I can add simple tests that do explicit things so everyone can understand how to utilize unit tests to validate your code.

More alignment with open source projects standards and best practices

  • We can use black and isort to make sure that things are formatted the right way. As for pre-commit, I haven't used it before but I do use act to run GitHub Actions locally. We can decide which frameworks we would like to add/remove in later issues.
  • As of right now, having the described structure as our baseline would be a reasonable change to the current codebase.

That sounds really great. We can start with klaam as you proposed, see what challenges we get, then apply to the rest. What do you think?

That sounds like a good plan! I will start working on the refactoring klaam and make a PR for that.

#21 summary:

  • Restructured the repo per this issue
  • Added pre-commit to ensure the quality of commits to follow the best practice
  • Added a CI to run tests when pushing to main and dev/* or making a PR to merge to main or dev/*
  • Added another CI to automatically publish to PyPi when pushing to stable/*

Some things to be resolved in other PRs are:

  • The removal of FastSpeech2
  • Make sure that the codebase works locally at its full capacity as there are some hard-coded paths in the codebase.
  • Utilizing huggingface API to upload/download models/datasets
  • Add meaningful unit/integration tests
  • Test the library on both local and on Google Colab
  • Generate documentation as webpages using sphinx and create a CI for it
  • Add a CI to test on different machines (Windows, MacOS, etc.) and on different Python versions (3.8.x, 3.9.x, 3.10.x, etc.)