low-resource-languages
There are 118 repositories under the low-resource-languages topic.
RichardLitt/low-resource-languages
Resources for conservation, development, and documentation of low resource (human) languages.
csebuetnlp/xl-sum
This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
csebuetnlp/banglanmt
This repository contains the code and data of the paper titled "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation" published in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 16 - November 20, 2020.
Andrews2017/africanlp-public-datasets
A repository for publicly/freely available Natural Language Processing (NLP) datasets for African languages.
cisnlp/GlotLID
Language Identification with Support for More Than 2000 Labels -- EMNLP 2023
jcblaisecruz02/Filipino-Text-Benchmarks
Open-source benchmark datasets and pretrained transformer models in the Filipino language.
Rumeysakeskin/Turkish-Text-to-Speech
Speech synthesis (TTS) in low-resource languages by training from scratch with FastPitch and fine-tuning with HiFi-GAN
ljvmiranda921/calamanCy
NLP pipelines for Tagalog using spaCy
kbatsuren/CogNet
CogNet: a large-scale, high-quality cognate database for 338 languages, 1.07M words, and 8.1 million cognates
alexandra-chron/relm_unmt
Python source code for EMNLP 2020 paper "Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT".
cdli-gh/Semi-Supervised-NMT-for-Sumerian-English
Exploring the Limits of Low-Resource Neural Machine Translation
csikasote/BembaSpeech
This is an ASR corpus for the Bemba language. It contains read speech from diverse publicly available Bemba sources: literature books, radio/TV show transcripts, YouTube video transcripts, and online sources. The corpus has 14,438 utterances totaling over 24 hours of speech.
hausanlp/NaijaSenti
This is a repository for NaijaSenti, a Lacuna-funded project for the development of a sentiment corpus for four Nigerian languages: Igbo, Hausa, Yoruba and Pidgin.
Kartikaggarwal98/Indian_ParallelCorpus
Curated list of publicly available parallel corpora for Indian languages
emotion-analysis-project/SemEval2025-Task11
SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection
charlesliucn/LanMIT
📖 LanMIT: A Toolkit for Improving Language Models in Low-resourced Speech Recognition based on Kaldi.
EveryVoiceTTS/EveryVoice
The EveryVoice TTS Toolkit - Text To Speech for your language
RichardLitt/thesis
My thesis on "Open Source Code and Low Resource Languages" for an MSc in Language Science and Technology at Saarland University
CoEDL/vad-sli-asr
A pipeline to isolate and transcribe one language in mixed-language speech
luciusssss/mc2_corpus
[ACL'24] MC^2: A Multilingual Corpus of Minority Languages in China (Tibetan, Uyghur, Kazakh, and Mongolian)
dmatekenya/Chichewa-Speech2Text
Automated Speech Recognition for Chichewa.
jhdeov/interlingual-MFA
Workflow for forced alignment between languages
alecokas/BiLatticeRNN-Confidence
Confidence Estimation for Black Box Automatic Speech Recognition Systems Using Lattice Recurrent Neural Networks https://arxiv.org/abs/1910.11933 or https://ieeexplore.ieee.org/document/9053264
luciusssss/ZhuangBench
[ACL'24 Findings] Teaching Large Language Models an Unseen Language on the Fly
surafelml/Afro-NMT
Low-Resource Neural Machine Translation: A Benchmark for Five African Languages
jcblaisecruz02/Tagalog-fake-news
Fake news detection in Filipino via Multitask Transfer Learning
khuangaf/CONCRETE
Official implementation of "CONCRETE: Improving Cross-lingual Fact Checking with Cross-lingual Retrieval" (COLING'22)
unza-speech-lab/zambezi-voice
Repository for multilingual speech data resources for native languages of Zambia.
BatsResearch/LexC-Gen
Generate synthetic labeled data for extremely low-resource languages using bilingual lexicons.
IgnatiusEzeani/IGBONLP
This is a repository for the IGBONLP Project.
clefourrier/CopperMT
[ACL 2021, Findings] Cognate Prediction Per Machine Translation
fajri91/minangNLP
Minangkabau NLP corpus. PACLIC 2020
harmanpreet93/low-resource-machine-translation
Low-resource machine translation using Transformers and iterative back-translation
ruoyuxie/noisy_parallel_data_alignment
Enhanced awesome-align for low-resource languages and noise simulation: https://arxiv.org/abs/2301.09685
tafseer-nayeem/BengaliReadability
[AAAI 2021] - Simple or Complex? Learning to Predict Readability of Bengali Texts.