[Proposal] Codebase refactoring
ma7dev opened this issue · 6 comments
To organize the code and introduce testing and continuous integration, it would be beneficial to refactor the entire codebase.
TL;DR

- Re-organize the codebase to follow best practices and to introduce testing and continuous integration.
- Separate the code into `logic` (so the package can be imported as a separate module), `scripts` (scripts used for training/inference of the logic), `notebooks` (demos and simple scripts written as notebooks), and `tests` (tests for the logic).
- Add GitHub Actions to test the `build`, test the `logic` of the package, auto-generate `docs`, and `publish` the package to `pypi`.
- Move from the `pip` and `requirements.txt` setup to `conda` for environment management and `poetry` for package management. This will ease development as the project scales.
Codebase refactoring
Mapping

file/dir | action | placement |
---|---|---|
FastSpeech2/* | moved | klaam/external/FastSpeech2/* |
dialect_speech_corpus | moved | klaam/speech_corpus/dialect.py |
egy_speech_corpus | moved | klaam/speech_corpus/egy.py |
mor_speech_corpus | moved | klaam/speech_corpus/mor.py |
samples | moved | samples |
.gitignore | moved | .gitignore |
LICENSE | moved | LICENSE |
README.md | moved | README.md |
audio_utils.py | moved | klaam/utils/audio.py |
demo.ipynb | moved | notebooks/demo.ipynb |
demo_with_mic.ipynb | moved | notebooks/demo_with_mic.ipynb |
inference.ipynb | moved | notebooks/inference.ipynb |
klaam.py | moved | klaam/run.py |
klaam_logo.PNG | moved | misc/klaam_logo.png |
models.py | moved | klaam/models/wav2vec.py |
processors.py | moved | klaam/processors/custom_wave2vec.py |
requirements.txt | removed | |
run.sh | moved | scripts/run.sh |
run_classifier.py | moved | scripts/run_classifier.py |
run_common_voice.py | moved | scripts/run_common_voice.py |
run_mgb3.py | moved | scripts/run_mgb3.py |
run_mgb5.py | moved | scripts/run_mgb5.py |
sample_run.sh | moved | scripts/sample_run.sh |
utils.py | moved | klaam/utils/utils.py |
docs | added | |
tests | added | |
.github | added | |
output | added | |
environment.yml | added | |
install.sh | added | |
mypy.ini | added | |
pyproject.toml | added | |
pytest.ini | added | |
ckpts | added | |
Tree Structure

root | level 1 | level 2 | description |
---|---|---|---|
.github | | | github stuff (e.g. github issue templates, github actions workflows, etc.) |
| workflows | | |
| | build.yml | to test building of the package |
| | publish.yml | to publish the package to pypi |
| | tests.yml | to run tests |
| | docs.yml | to generate documentation |
klaam | | | the logic for the package |
| utils | | |
| | audio.py | |
| | utils.py | |
| models | | |
| | wav2vec.py | |
| processors | | |
| | wave2vec.py | |
| external | | |
| | FastSpeech2/* | |
| speech_corpus | | |
| | dialect.py | |
| | egy.py | |
| | mor.py | |
| run.py | | |
notebooks | | | |
| demo.ipynb | | |
| demo_with_mic.ipynb | | |
| inference.ipynb | | |
scripts | | | set of scripts used to train/evaluate or for anything external to the logic of the package |
| run.sh | | |
| run_classifier.py | | |
| run_common_voice.py | | |
| run_mgb3.py | | |
| run_mgb5.py | | |
| sample_run.sh | | |
tests | | | set of tests for the logic within klaam |
| test_*.py | | |
| conftest.py | | |
misc | | | |
| klaam_logo.png | | |
samples | | | |
| demo.wav | | |
ckpts | ... | | checkpoints of pre-trained models that were downloaded |
docs | ... | | documentation files |
output | ... | | |
environment.yml | | | conda environment definition |
install.sh | | | installation script to set up the conda environment and install dependencies using poetry |
mypy.ini | | | mypy (type checking) configuration |
pyproject.toml | | | package definition and list of dependencies to be installed |
pytest.ini | | | pytest configuration |
LICENSE | | | |
README.md | | | |
.gitignore | | | |
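As a rough illustration of the workflow files listed under `.github/workflows` above, a minimal `tests.yml` could look like the following sketch. The branch names, action versions, and Python version here are assumptions, not decisions:

```yaml
# .github/workflows/tests.yml - minimal sketch; versions and branches are assumptions
name: tests

on:
  push:
    branches: [main, "dev/**"]
  pull_request:
    branches: [main, "dev/**"]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      - name: Install dependencies
        run: |
          pip install poetry
          poetry install
      - name: Run tests
        run: poetry run pytest tests
```

`build.yml`, `publish.yml`, and `docs.yml` would follow the same shape, swapping the final steps for a build, a `pypi` publish, and a docs build respectively.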
Environment/dependencies packages

- `conda` is used to manage the environment and install the essential libraries that are big/core to the package, e.g. TensorFlow, PyTorch, cudatoolkit, etc.
- `poetry` is used to manage dependencies and set up the package
- `pytest` is used to enable unit/integration testing of the codebase
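To make the conda/poetry split concrete, an `environment.yml` along these lines would hold only the big/core libraries, leaving everything else to `poetry`. The names and versions below are placeholders, not a tested configuration:

```yaml
# environment.yml - minimal sketch; channel names and versions are assumptions
name: klaam
channels:
  - pytorch
  - conda-forge
dependencies:
  - python=3.8
  - pytorch          # big/core library, installed via conda
  - cudatoolkit=11.3 # GPU toolkit, also conda-managed
  - poetry           # poetry then manages the remaining dependencies via pyproject.toml
```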
Commands

- `poetry add PACKAGE` - to add a package (this will append it to `pyproject.toml`)
  - If the package installation fails and there is no other way to add the package, install it using `conda` and add it to `environment.yml` manually (leave a comment next to the line)
  - Check on the web for the right channels when installing packages using `conda`
- `poetry install` - to install the package (`package_name`)
- `pytest tests` - to run all tests manually
- `pytest tests/TEST_PATH` - to run a specific test file (check the pytest documentation for more information)
Edit - added the following sections: env/dep packages and commands
Thanks, @sudomaze for your efforts :). @MagedSaeed any remarks on how to move forward with this?
Thanks much, @sudomaze, for this awesome effort.
I can categorize this enhancement proposal into the following: restructuring the repo, improving the CI/CD experience, and more alignment with open source project standards and best practices. Let me break down and discuss each of them in further detail.
- For restructuring the project: the proposed structure is really great, thanks much for it. There are a couple of comments regarding the current status of klaam that I think we should address before moving forward.
  - I think the FastSpeech2 package was added as a kind of reference and inspiration. tbh, I am not sure why it was added in the first place. If we are not using it any more, and I think this is the case, it would be better to link the original repo instead. @zaidalyafeai, please advise.
  - The development of Klaam started early as an unofficial fork of the transformers research projects. I am not sure if, at that time, it was possible to derive and reuse classes directly from transformers instead of maintaining the code locally. If possible, we can subclass their classes instead. The affected modules would be processors and maybe models. Also, a lot of code in `run_*.py` will not be needed if we use their argument parsers, and maybe other stuff as well.

  I just raised these comments as we are thinking of restructuring things. It should also be fine if we keep them unchanged.
- For improving the CI/CD experience: this includes adding GitHub Actions integration and maintaining PyPI package releases and test suites. I remember we had a discussion earlier on this, not only to package Klaam but to have a general packaging pipeline for ARBML products. However, we pruned further effort since `arbml` at that time was more research-oriented than software-product-oriented. I think this needs further discussion, as this decision will affect not only klaam but maybe all other repos in ARBML. Even if we go in that direction, I think we also need to maintain an API service to serve interaction with the trained models, as packaging these models would be impractical via PyPI or GitHub releases. For this, I think the best option is a space in huggingface spaces for each repo/product. If anyone would like to use the model in their business or customize it, they should then manage the integration into their workflow the way they wish. If we go with huggingface spaces, I think we can still automate stuff and use GitHub Actions to deploy to the space, can't we @sudomaze?
- For tests: it is always a good idea to add tests to open source projects; it always adds confidence. However, if we are going to reuse huggingface transformers, the logic we add would be drastically minimal. We may end up writing more tests than logic :). Also, I personally do not have much experience in writing tests for model training and inference. Just for the sake of learning, I can start with the transformers tests, and I would appreciate it if you could provide more resources.
- For more alignment with open source project standards and best practices: I really like the idea. This will also make the getting-started experience easier and more exciting for newcomers if they do not need much configuration to get started. It is also a good time for me to start working with and adopting `poetry`. There are also other utilities to be added to make the teamwork experience even more seamless. We can also add the following:
  - We can unify the formatting style using black, since it is gaining more popularity over other formatters.
  - We can add isort to further minimize PR conflicts.
  - We can make these checks automatic using pre-commit.
I think this should suffice for a good start and we can add more down the road.
Thanks, @MagedSaeed, for the feedback!
Restructuring the project
- I am planning to keep the current structure in the first stage of refactoring so we can enable unit testing and CI in the project. We can tackle other aspects of the project, such as removing dead/unnecessary code as needed, in later issues/PRs.
Improving the CI/CD experience

- We (@sudogroup) are planning to refactor all `ARBML` repos to follow best-practice standards. I think having a good software development environment would encourage other people to contribute and would enhance the code quality.
- We can utilize external APIs to download the latest models/datasets for ease of accessibility. We can have `klaam` be both a package on `PyPI` for anyone to utilize in their codebase and a way to download the models you trained and the datasets you used, through `huggingface`. This will combine the two worlds. I haven't pushed anything to `huggingface` before; however, looking into this, it seems possible to create a GitHub Action that pushes to `huggingface` on a release or any other action trigger.
- I think `huggingface spaces` is just a demo ground. Building on the previous point, having our models and datasets on `huggingface`, and `klaam` installable as a package through `PyPI`, will allow us (and others) to easily import `klaam` and run demos on `huggingface spaces`.
- For testing:
  - We can do a lot of testing, and I can explain the types of testing I have been doing in my ML/DL research. Here are my two cents from utilizing testing:
    - Fixing bugs, which helps ensure the system is almost bug-free
    - Improving speed by comparing old methods against newer, faster ones, because at the end of the day it is just matrix operations
  - We don't need to write extensive tests for everything; we just need to make sure that the things we are doing right now, our logic, are tested, to reduce potential bugs and to ensure that our expectations of the operations we have developed match what has actually been developed. It also ensures that new code doesn't break old code. It is just a nice thing to have.
  - In later issues, I can add simple tests that do explicit things so everyone can understand how to utilize unit tests to validate their code.
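To make the speed-comparison point concrete, here is a hedged sketch of that kind of unit test. `frame_energy_loop` and `frame_energy_vectorized` are hypothetical helpers (not part of klaam) that compute per-frame signal energy two ways, and the test asserts that the fast vectorized version matches the straightforward loop:

```python
import numpy as np

def frame_energy_loop(signal, frame_len):
    # Naive per-frame energy using a plain Python loop (the "old method").
    n = len(signal) // frame_len
    return [
        float(sum(x * x for x in signal[i * frame_len:(i + 1) * frame_len]))
        for i in range(n)
    ]

def frame_energy_vectorized(signal, frame_len):
    # Same computation as a single NumPy matrix operation (the "new method").
    n = len(signal) // frame_len
    frames = np.asarray(signal[: n * frame_len]).reshape(n, frame_len)
    return (frames ** 2).sum(axis=1).tolist()

def test_vectorized_matches_loop():
    # The fast implementation must agree with the reference implementation.
    rng = np.random.default_rng(0)
    sig = rng.standard_normal(1600).tolist()
    assert np.allclose(
        frame_energy_loop(sig, 400),
        frame_energy_vectorized(sig, 400),
    )
```

Dropped into `tests/` (the filename is up to us), `pytest tests` would pick this up automatically.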
More alignment with open source project standards and best practices

- We can use `black` and `isort` to make sure that things are formatted the right way. As for `pre-commit`, I haven't used it before, but I do use `act` to run GitHub Actions locally. We can decide which frameworks to add/remove in later issues.
- As of right now, having the described structure as our baseline would be a reasonable change to the current codebase.
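For reference, wiring up `black` and `isort` through `pre-commit` can be as small as the following sketch (the pinned `rev` values are placeholders to be updated when we adopt it):

```yaml
# .pre-commit-config.yaml - minimal sketch; rev values are placeholders
repos:
  - repo: https://github.com/psf/black
    rev: 22.3.0
    hooks:
      - id: black
  - repo: https://github.com/pycqa/isort
    rev: 5.10.1
    hooks:
      - id: isort
```

Running `pre-commit install` once would then apply both formatters on every commit.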
That sounds really great. We can start with `klaam` as you proposed, see what challenges we face, then apply it to the rest. What do you think?
That sounds like a good plan! I will start working on refactoring `klaam` and make a PR for that.
#21 summary:

- Restructured the repo per this issue
- Added `pre-commit` to ensure that the quality of commits follows best practice
- Added a CI to run tests when pushing to `main` and `dev/*`, or when making a PR to merge into `main` or `dev/*`
- Added another CI to automatically publish to `PyPI` when pushing to `stable/*`

Some things to be resolved in other PRs are:

- The removal of `FastSpeech2`
- Making sure that the codebase works locally at its full capacity, as there are some hard-coded paths in the codebase
- Utilizing the `huggingface` API to upload/download models/datasets
- Adding meaningful unit/integration tests
- Testing the library both locally and on `Google Colab`
- Generating documentation as webpages using `sphinx`, and creating a CI for it
- Adding a CI to test on different machines (Windows, macOS, etc.) and on different Python versions (3.8.x, 3.9.x, 3.10.x, etc.)