[Proposal] Codebase refactoring
ma7dev opened this issue · 6 comments
To organize the code and introduce testing and continuous integration, it would be beneficial to refactor the entire codebase.
TL;DR

- Re-organize the codebase to follow best practices and to introduce testing and continuous integration.
- Separate the code into `logic` (so the package can be imported as a separate module), `scripts` (scripts used for training/inference of the logic), `notebooks` (demos and simple scripts written as notebooks), and `tests` (tests for the logic).
- Add GitHub Actions to test the `build`, test the `logic` of the package, auto-generate `docs`, and `publish` the package to `pypi`.
- Move from the `pip` and `requirements.txt` setup to `conda` for environment management and `poetry` for package management. This will ease development as the project scales.
Codebase refactoring
Mapping

file/dir | action | placement |
---|---|---|
FastSpeech2/* | moved | klaam/external/FastSpeech2/* |
dialect_speech_corpus | moved | klaam/speech_corpus/dialect.py |
egy_speech_corpus | moved | klaam/speech_corpus/egy.py |
mor_speech_corpus | moved | klaam/speech_corpus/mor.py |
samples | moved | samples |
.gitignore | moved | .gitignore |
LICENSE | moved | LICENSE |
README.md | moved | README.md |
audio_utils.py | moved | klaam/utils/audio.py |
demo.ipynb | moved | notebooks/demo.ipynb |
demo_with_mic.ipynb | moved | notebooks/demo_with_mic.ipynb |
inference.ipynb | moved | notebooks/inference.ipynb |
klaam.py | moved | klaam/run.py |
klaam_logo.PNG | moved | misc/klaam_logo.png |
models.py | moved | klaam/models/wav2vec.py |
processors.py | moved | klaam/processors/custom_wave2vec.py |
requirements.txt | removed | |
run.sh | moved | scripts/run.sh |
run_classifier.py | moved | scripts/run_classifier.py |
run_common_voice.py | moved | scripts/run_common_voice.py |
run_mgb3.py | moved | scripts/run_mgb3.py |
run_mgb5.py | moved | scripts/run_mgb5.py |
sample_run.sh | moved | scripts/sample_run.sh |
utils.py | moved | klaam/utils/utils.py |
docs | added | |
tests | added | |
.github | added | |
output | added | |
environment.yml | added | |
install.sh | added | |
mypy.ini | added | |
pyproject.toml | added | |
pytest.ini | added | |
ckpts | added | |
Tree Structure

root | level 1 | level 2 | description |
---|---|---|---|
.github | | | github stuff (e.g. github issue templates, github actions workflows, etc.) |
| workflows | | |
| | build.yml | to test building of the package |
| | publish.yml | to publish the package to pypi |
| | tests.yml | to run tests |
| | docs.yml | to generate documentation |
klaam | | | the logic for the package |
| utils | | |
| | audio.py | |
| | utils.py | |
| models | | |
| | wav2vec.py | |
| processors | | |
| | wave2vec.py | |
| external | | |
| | FastSpeech2/* | |
| speech_corpus | | |
| | dialect.py | |
| | egy.py | |
| | mor.py | |
| run.py | | |
notebooks | | | |
| demo.ipynb | | |
| demo_with_mic.ipynb | | |
| inference.ipynb | | |
scripts | | | set of scripts used to train/evaluate or for anything external to the logic of the package |
| run.sh | | |
| run_classifier.py | | |
| run_common_voice.py | | |
| run_mgb3.py | | |
| run_mgb5.py | | |
| sample_run.sh | | |
tests | | | set of tests for the logic within klaam |
| test_*.py | | |
| conftest.py | | |
misc | | | |
| klaam_logo.png | | |
samples | | | |
| demo.wav | | |
ckpts | ... | | checkpoints of pre-trained models that were downloaded |
docs | ... | | documentation files |
output | ... | | |
environment.yml | | | conda environment definition |
install.sh | | | installation script to set up the conda environment and install dependencies using poetry |
mypy.ini | | | mypy (type checking) configuration |
pyproject.toml | | | package definition and list of dependencies to be installed |
pytest.ini | | | pytest configuration |
LICENSE | | | |
README.md | | | |
.gitignore | | | |
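As a rough illustration of the workflow files listed under `.github/workflows` above, a minimal `tests.yml` could look like the following sketch. The branch names, action versions, and Python version here are assumptions, not decisions:

```yaml
# .github/workflows/tests.yml - minimal sketch; versions and branches are assumptions
name: tests

on:
  push:
    branches: [main, "dev/**"]
  pull_request:
    branches: [main, "dev/**"]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      - name: Install dependencies
        run: |
          pip install poetry
          poetry install
      - name: Run tests
        run: poetry run pytest tests
```

`build.yml`, `publish.yml`, and `docs.yml` would follow the same shape, swapping the final steps for a build, a `pypi` publish, and a docs build respectively.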
Environment/dependencies packages

- `conda` is used to manage the environment and install the essential libraries that are big/core to the package, e.g. TensorFlow, PyTorch, cudatoolkit, etc.
- `poetry` is used to manage dependencies and set up the package
- `pytest` is used to enable unit/integration testing of the codebase
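To make the conda/poetry split concrete, an `environment.yml` along these lines would hold only the big/core libraries, leaving everything else to `poetry`. The names and versions below are placeholders, not a tested configuration:

```yaml
# environment.yml - minimal sketch; channel names and versions are assumptions
name: klaam
channels:
  - pytorch
  - conda-forge
dependencies:
  - python=3.8
  - pytorch          # big/core library, installed via conda
  - cudatoolkit=11.3 # GPU toolkit, also conda-managed
  - poetry           # poetry then manages the remaining dependencies via pyproject.toml
```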
Commands

- `poetry add PACKAGE` - to add a package (this will append it to `pyproject.toml`)
  - If the package installation fails and there is no other way to add the package, install it using `conda` and add it to `environment.yml` manually (leave a comment next to the line)
  - Check on the web for the right channels when installing packages using `conda`
- `poetry install` - to install the package (`package_name`)
- `pytest tests` - to run all tests manually
- `pytest tests/TEST_PATH` - to run a specific test file (check the pytest documentation for more information)
Edit - added the following sections: env/dep packages and commands
Thanks, @sudomaze for your efforts :). @MagedSaeed any remarks on how to move forward with this?
Thanks much, @sudomaze, for this awesome effort.
I can categorize this enhancement proposal into the following: restructuring the repo, improving the CI/CD experience, and more alignment with open source project standards and best practices. Let me break down and discuss each of them in further detail.
- For restructuring the project: the proposed structure is really great, thanks much for it. There are a couple of comments regarding the current status of klaam that I think we should address before moving forward.
  - I think the FastSpeech2 package was added as a kind of reference and inspiration. tbh, I am not sure why it was added in the first place. If we are not using it any more, and I think this is the case, it would be better to link the original repo instead. @zaidalyafeai, please advise.
  - The development of Klaam started early as an unofficial fork of the transformers research projects. I am not sure if, at that time, it was possible to derive and reuse classes directly from transformers instead of maintaining the code locally. If possible, we can subclass their classes instead. The affected modules would be processors and maybe models. Also, a lot of code in `run_*.py` will not be needed if we use their argument parsers, and maybe other stuff as well.

  I just raised these comments as we are thinking of restructuring things. It should also be fine if we keep them unchanged.
- For improving the CI/CD experience: this includes adding GitHub Actions integration and maintaining PyPI package releases and test suites. I remember we had a discussion earlier on this, not only to package Klaam but to have a general packaging pipeline for ARBML products. However, we pruned further effort since `arbml` at that time was more research-oriented than software-product-oriented. I think this needs further discussion, as this decision will affect not only klaam but maybe all other repos in ARBML. Even if we go in that direction, I think we also need to maintain an API service to serve interaction with the trained models, as packaging these models would be impractical via PyPI or GitHub releases. For this, I think the best option is a space in huggingface spaces for each repo/product. If anyone would like to use the model in their business or customize it, they should then manage the integration into their workflow the way they wish. If we go with huggingface spaces, I think we can still automate stuff and use GitHub Actions to deploy to the space, can't we @sudomaze?
- For tests: it is always a good idea to add tests to open source projects; it always adds confidence. However, if we are going to reuse huggingface transformers, the logic we add would be drastically minimal. We may end up writing more tests than logic :). Also, I personally do not have much experience in writing tests for model training and inference. Just for the sake of learning, I can start with the transformers tests, and I would appreciate it if you could provide more resources.
- For more alignment with open source project standards and best practices: I really like the idea. This will also make the getting-started experience easier and more exciting for newcomers if they do not need much configuration to get started. It is also a good time for me to start working with and adopting `poetry`. There are also other utilities to be added to make the teamwork experience even more seamless. We can also add the following:
  - We can unify the formatting style using black, since it is gaining more popularity over other formatters.
  - We can add isort to further minimize PR conflicts.
  - We can make these checks automatic using pre-commit.
I think this should suffice for a good start and we can add more down the road.
Thanks, @MagedSaeed, for the feedback!
Restructuring the project
- I am planning to keep the current structure in the first stage of refactoring so we can enable unit testing and CI in the project. We can tackle other aspects of the project, such as removing dead/unnecessary code as needed, in later issues/PRs.
Improving the CI/CD experience

- We (@sudogroup) are planning to refactor all `ARBML` repos to follow best-practice standards. I think having a good software development environment would encourage other people to contribute and would enhance the code quality.
- We can utilize external APIs to download the latest models/datasets for ease of accessibility. We can have `klaam` be both a package on `PyPI` for anyone to utilize in their codebase and a way to download the models you trained and the datasets you used, through `huggingface`. This will combine the two worlds. I haven't pushed anything to `huggingface` before; however, looking into this, it seems possible to create a GitHub Action that pushes to `huggingface` on a release or any other action trigger.
- I think `huggingface spaces` is just a demo ground. Building on the previous point, having our models and datasets on `huggingface`, and `klaam` installable as a package through `PyPI`, will allow us (and others) to easily import `klaam` and run demos on `huggingface spaces`.
- For testing:
  - We can do a lot of testing, and I can explain the types of testing I have been doing in my ML/DL research. Here are my two cents from utilizing testing:
    - Fixing bugs, which helps ensure the system is almost bug-free
    - Improving speed by comparing old methods against newer, faster ones, because at the end of the day it is just matrix operations
  - We don't need to write extensive tests for everything; we just need to make sure that the things we are doing right now, our logic, are tested, to reduce potential bugs and to ensure that our expectations of the operations we have developed match what has actually been developed. It also ensures that new code doesn't break old code. It is just a nice thing to have.
  - In later issues, I can add simple tests that do explicit things so everyone can understand how to utilize unit tests to validate their code.
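To make the speed-comparison point concrete, here is a hedged sketch of that kind of unit test. `frame_energy_loop` and `frame_energy_vectorized` are hypothetical helpers (not part of klaam) that compute per-frame signal energy two ways, and the test asserts that the fast vectorized version matches the straightforward loop:

```python
import numpy as np

def frame_energy_loop(signal, frame_len):
    # Naive per-frame energy using a plain Python loop (the "old method").
    n = len(signal) // frame_len
    return [
        float(sum(x * x for x in signal[i * frame_len:(i + 1) * frame_len]))
        for i in range(n)
    ]

def frame_energy_vectorized(signal, frame_len):
    # Same computation as a single NumPy matrix operation (the "new method").
    n = len(signal) // frame_len
    frames = np.asarray(signal[: n * frame_len]).reshape(n, frame_len)
    return (frames ** 2).sum(axis=1).tolist()

def test_vectorized_matches_loop():
    # The fast implementation must agree with the reference implementation.
    rng = np.random.default_rng(0)
    sig = rng.standard_normal(1600).tolist()
    assert np.allclose(
        frame_energy_loop(sig, 400),
        frame_energy_vectorized(sig, 400),
    )
```

Dropped into `tests/` (the filename is up to us), `pytest tests` would pick this up automatically.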
More alignment with open source project standards and best practices

- We can use `black` and `isort` to make sure that things are formatted the right way. As for `pre-commit`, I haven't used it before, but I do use `act` to run GitHub Actions locally. We can decide which frameworks to add/remove in later issues.
- As of right now, having the described structure as our baseline would be a reasonable change to the current codebase.
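For reference, wiring up `black` and `isort` through `pre-commit` can be as small as the following sketch (the pinned `rev` values are placeholders to be updated when we adopt it):

```yaml
# .pre-commit-config.yaml - minimal sketch; rev values are placeholders
repos:
  - repo: https://github.com/psf/black
    rev: 22.3.0
    hooks:
      - id: black
  - repo: https://github.com/pycqa/isort
    rev: 5.10.1
    hooks:
      - id: isort
```

Running `pre-commit install` once would then apply both formatters on every commit.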
That sounds really great. We can start with `klaam` as you proposed, see what challenges we face, then apply it to the rest. What do you think?
That sounds like a good plan! I will start working on refactoring `klaam` and make a PR for that.
#21 summary:

- Restructured the repo per this issue
- Added `pre-commit` to ensure that the quality of commits follows best practice
- Added a CI to run tests when pushing to `main` and `dev/*`, or when making a PR to merge into `main` or `dev/*`
- Added another CI to automatically publish to `PyPI` when pushing to `stable/*`

Some things to be resolved in other PRs are:

- The removal of `FastSpeech2`
- Making sure that the codebase works locally at its full capacity, as there are some hard-coded paths in the codebase
- Utilizing the `huggingface` API to upload/download models/datasets
- Adding meaningful unit/integration tests
- Testing the library both locally and on `Google Colab`
- Generating documentation as webpages using `sphinx`, and creating a CI for it
- Adding a CI to test on different machines (Windows, macOS, etc.) and on different Python versions (3.8.x, 3.9.x, 3.10.x, etc.)