PangyoCorpora

This repository is dedicated to training multilingual large language models (LLMs) that can understand and generate text in multiple languages.

Overview

Pangyo Corpora is a resource for researchers and developers building models with multilingual capabilities. This repository provides a diverse, curated dataset designed specifically for training large language models (LLMs) to handle code-switching and multi-language contexts effectively.

Dataset Description

The dataset consists of a wide range of code-switching examples, allowing LLMs to grasp the nuances of switching between languages within a single conversation or text. Pangyo Corpora covers a multitude of languages and domains, making it an ideal resource for developing LLMs with multi-language proficiency.
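
As a rough illustration, the sketch below shows how such a corpus might be loaded and inspected before fine-tuning. It assumes a line-delimited JSON file named pangyo_corpus.jsonl in which each record has a "text" field and a "languages" field; these names and the file layout are hypothetical and may not match the actual files in this repository.

import json
from collections import Counter

def load_code_switching_corpus(path):
    # Read a JSONL file where each line is one code-switching example.
    # Assumed (hypothetical) record layout: {"text": "...", "languages": ["ko", "en"]}
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    return examples

if __name__ == "__main__":
    corpus = load_code_switching_corpus("pangyo_corpus.jsonl")  # hypothetical file name
    print(f"{len(corpus)} examples loaded")
    # Count language-pair frequencies to get a quick sense of corpus balance.
    pair_counts = Counter(tuple(sorted(ex.get("languages", []))) for ex in corpus)
    for pair, count in pair_counts.most_common():
        print(pair, count)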

Features

  • Multilingual Diversity: Pangyo Corpora includes examples from various languages and language pairs, enabling LLMs to understand and generate content in multiple languages.

Contribute

  • If you have additional code-switching examples or improvements to share, we welcome contributions to enhance the dataset for the community.

Citation

If you use Pangyo Corpora in your research or projects, please consider citing this repository to acknowledge the source of your dataset:

@misc{pangyo-corpora,
  title = {Pangyo Corpora - Knowledge Transfer in Multilingual LLMs Based on Code-Switching Corpora},
  author = {Seonghyun Kim and Kanghee Lee and Minsu Jeong and Jungwoo Lee},
  year = {2023},
  publisher = {Korean Institute of Information Scientists and Engineers},
  booktitle = {The 35th Annual Conference on Human \& Cognitive Language Technology},
  howpublished = {\url{https://sites.google.com/view/hclt2023}},
}

License

Pangyo Corpora is made available under the MIT License.

Contact

For questions, feedback, or collaboration opportunities, please contact us at mrbananahuman.kim@gmail.com.