low-resource-languages

There are 118 repositories under the low-resource-languages topic.

RichardLitt/low-resource-languages
Resources for conservation, development, and documentation of low-resource (human) languages.
Language: TeX 392 35 10156
csebuetnlp/xl-sum
This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
Language: Python 256 6 1541
csebuetnlp/banglanmt
This repository contains the code and data of the paper titled "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation" published in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 16 - November 20, 2020.
Language: Python 147 10 1147
Andrews2017/africanlp-public-datasets
A repository for publicly/freely available Natural Language Processing (NLP) datasets for African languages.
93 7 120
cisnlp/GlotLID
Language Identification with Support for More Than 2000 Labels -- EMNLP 2023
Language: Python 92 5 47
jcblaisecruz02/Filipino-Text-Benchmarks
Open-source benchmark datasets and pretrained transformer models in the Filipino language.
Language: Python 58 3 48
Rumeysakeskin/Turkish-Text-to-Speech
Speech synthesis (TTS) in low-resource languages by training from scratch with FastPitch and fine-tuning with HiFi-GAN
Language: Python 47 5 45
ljvmiranda921/calamanCy
NLP pipelines for Tagalog using spaCy
Language: Python 45 4 193
kbatsuren/CogNet
CogNet: a large-scale, high-quality cognate database for 338 languages, 1.07M words, and 8.1 million cognates
43 8 210
alexandra-chron/relm_unmt
Python source code for EMNLP 2020 paper "Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT".
Language: Python 35 2 13
cdli-gh/Semi-Supervised-NMT-for-Sumerian-English
Exploring the Limits of Low-Resource Neural Machine Translation
Language: Jupyter Notebook 33 3 410
csikasote/BembaSpeech
This is an ASR corpus for the Bemba language. It contains read speech from diverse publicly available Bemba sources: literature books, radio/TV show transcripts, YouTube video transcripts, and online sources. The corpus has 14,438 utterances totaling over 24 hours of speech.
32 4 13
hausanlp/NaijaSenti
This is the repository for NaijaSenti, a Lacuna-funded project to develop a sentiment corpus for four Nigerian languages: Igbo, Hausa, Yoruba, and Pidgin.
31 1 020
Kartikaggarwal98/Indian_ParallelCorpus
Curated list of publicly available parallel corpora for Indian languages
30 4 14
emotion-analysis-project/SemEval2025-Task11
SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection
Language: Jupyter Notebook 26 6 43
charlesliucn/LanMIT
📖 LanMIT: A Toolkit for Improving Language Models in Low-resourced Speech Recognition based on Kaldi.
Language: C++ 22 3 01
EveryVoiceTTS/EveryVoice
The EveryVoice TTS Toolkit - Text To Speech for your language
Language: Python 21 6 2862
RichardLitt/thesis
My thesis on "Open Source Code and Low Resource Languages" for an MSc in Language Science and Technology at Saarland University
Language: TeX 20 6 414
CoEDL/vad-sli-asr
A pipeline to isolate and transcribe one language in mixed-language speech
Language: Python 18 3 03
luciusssss/mc2_corpus
[ACL'24] MC^2: A Multilingual Corpus of Minority Languages in China (Tibetan, Uyghur, Kazakh, and Mongolian)
Language: Python 18 3 01
Aditi138/EntityTargetedActiveLearning
Language: Python 17 4 13
dmatekenya/Chichewa-Speech2Text
Automated Speech Recognition for Chichewa.
Language: Jupyter Notebook 17 3 06
jhdeov/interlingual-MFA
Workflow for forced alignment between languages
Language: Python 17 2 01
alecokas/BiLatticeRNN-Confidence
Confidence Estimation for Black Box Automatic Speech Recognition Systems Using Lattice Recurrent Neural Networks https://arxiv.org/abs/1910.11933 or https://ieeexplore.ieee.org/document/9053264
Language: Python 16 4 04
luciusssss/ZhuangBench
[ACL'24 Findings] Teaching Large Language Models an Unseen Language on the Fly
Language: Python 16 3 10
surafelml/Afro-NMT
Low-Resource Neural Machine Translation: A Benchmark for Five African Languages
Language: Shell 15 3 13
jcblaisecruz02/Tagalog-fake-news
Fake news detection in Filipino via Multitask Transfer Learning
14 3 12
khuangaf/CONCRETE
Official implementation of "CONCRETE: Improving Cross-lingual Fact Checking with Cross-lingual Retrieval" (COLING'22)
Language: Python 14 2 60
unza-speech-lab/zambezi-voice
Repository for multilingual speech data resources for native languages of Zambia.
14 2 04
BatsResearch/LexC-Gen
Generate synthetic labeled data for extremely low-resource languages using bilingual lexicons.
Language: Python 13 3 04
IgnatiusEzeani/IGBONLP
This is a repository for the IGBONLP Project.
Language: Modula-3 12 4 29
clefourrier/CopperMT
[ACL 2021, Findings] Cognate Prediction Per Machine Translation
Language: JavaScript 10 3 10
fajri91/minangNLP
Minangkabau NLP corpus. PACLIC 2020
Language: Python 10 3 02
harmanpreet93/low-resource-machine-translation
Low resource machine translation using Transformers and Iterative Back translation
Language: Python 10 0 01
ruoyuxie/noisy_parallel_data_alignment
Enhanced awesome-align for low-resource languages and noise simulation: https://arxiv.org/abs/2301.09685
Language: Python 9 2 11
tafseer-nayeem/BengaliReadability
[AAAI 2021] - Simple or Complex? Learning to Predict Readability of Bengali Texts.
Language: Python 9 2 05
