eKorpkit provides a flexible interface for NLP and ML research pipelines such as extraction, transformation, tokenization, training, and visualization. Its powerful config composition is backed by Hydra.
This project is still under development. The API is subject to change. Until the first stable release, the version number will be 0.x.x. Please use it at your own risk. If you have any questions or suggestions, please feel free to contact me.
In particular, some core configuration interface components will be carved out and moved to a separate package, which will be named hyfi (Hydra Fast Interface). Image generation and visualization will also be moved to a separate package, which will be named ekaros (from Íkaros [Icarus] in Greek mythology).
- You can compose your configuration dynamically, so it is easy to assemble the right configuration for each research project.
- You can override everything from the command line, which makes experimentation fast, and removes the need to maintain multiple similar configuration files.
- With the help of the eKonf class, it is also easy to compose configurations in a Jupyter notebook environment.
- eKorpkit lets you focus on the problem at hand instead of spending time on boilerplate code such as command line flags, loading configuration files, and logging.
- A workflow is a configurable automated process that will run one or more jobs.
- You can divide your research into several unit jobs (tasks), then combine those jobs into one workflow.
- You can have multiple workflows, each of which can perform a different set of tasks.
- With eKorpkit, you can easily share your datasets and models.
- Sharing configs along with datasets and models makes your research reproducible.
- You can share individual unit jobs or an entire workflow.
- eKorpkit has a pluggable architecture, enabling it to combine with your own implementation.
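
The idea of composing a configuration from smaller pieces and then overriding individual values can be sketched in plain Python. This is only an illustration of the concept, assuming nested dictionaries as configs; it does not use the eKorpkit, eKonf, or Hydra APIs, and the helper names (`deep_merge`, `apply_overrides`) and config keys are hypothetical.

```python
import copy


def deep_merge(base, extra):
    """Return a new dict in which values from `extra` override those in `base`."""
    merged = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections recursively instead of replacing them.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def apply_overrides(config, overrides):
    """Apply 'a.b=value' style overrides, similar in spirit to command line overrides."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = raw_value  # values stay strings in this sketch
    return result


# Hypothetical config groups; the keys and values are illustrative only.
tokenizer_cfg = {"tokenizer": {"name": "whitespace", "lowercase": True}}
trainer_cfg = {"trainer": {"epochs": 3, "batch_size": 32}}

config = deep_merge(tokenizer_cfg, trainer_cfg)
config = apply_overrides(config, ["trainer.epochs=10", "tokenizer.name=custom"])
print(config)
```

Keeping each concern in its own small config and merging them at run time is what removes the need for many near-identical configuration files; a real setup would also parse override values into their proper types rather than leaving them as strings.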
Tutorials for the ekorpkit package can be found at https://ekorpkit.entelecheia.ai.
Install the latest version of ekorpkit:
pip install ekorpkit
To install all extra dependencies (the quotes keep shells such as zsh from interpreting the brackets):
pip install "ekorpkit[all]"
The eKorpkit Corpus is a large, diverse, bilingual (ko/en) language modelling dataset.
@software{lee_2022_6497226,
author = {Young Joon Lee},
title = {eKorpkit: eKonomic Research Python Toolkit},
month = apr,
year = 2022,
publisher = {Zenodo},
doi = {10.5281/zenodo.6497226},
url = {https://doi.org/10.5281/zenodo.6497226}
}
@software{lee_2022_ekorpkit,
author = {Young Joon Lee},
title = {eKorpkit: eKonomic Research Python Toolkit},
month = apr,
year = 2022,
publisher = {GitHub},
url = {https://github.com/entelecheia/ekorpkit}
}
See the CHANGELOG for more information.
Contributions are welcome! Please see the contributing guidelines for more information.
- This project is released under the MIT License.
- Each corpus adheres to its own license policy. Please check the license of the corpus before using it!