nlp-datasets
There are 159 repositories under the nlp-datasets topic.
mihail911/nlp-library
Curated collection of papers for the NLP practitioner 📖👩‍🔬
guhhhhaa/4675-scifi
Chinese NLP corpus of Chinese science fiction: about 4675 Chinese science fiction novels. Chinese science fiction NLP corpus, text corpus, and text database.
hellohaptik/multi-task-NLP
multi_task_NLP is a utility toolkit enabling NLP developers to easily train and infer a single model for multiple tasks.
dkulagin/kartaslov
Open linguistic datasets: the Russian sentiment lexicon KartaSlovSent (КартаСловСент), a semantics dataset, an association graph, and a dataset of spelling errors and typos.
quincyliang/nlp-public-dataset
Chinese and English NER datasets, English-Chinese machine translation datasets, and a Chinese word segmentation dataset.
irfnrdh/Awesome-Indonesia-NLP
Resources for NLP and Bahasa Indonesia
grammarly/ua-gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
StonyBrookNLP/appworld
🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents", ACL'24 Best Resource Paper.
liutiedong/goat
a Fine-tuned LLaMA that is Good at Arithmetic Tasks
cjiang2/VDCNN
Implementation of Very Deep Convolutional Neural Network for Text Classification
INK-USC/TriggerNER
TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition (ACL 2020)
INK-USC/CommonGen
A Constrained Text Generation Challenge Towards Generative Commonsense Reasoning
guhhhhaa/wula-scifi
Chinese NLP corpus of Chinese science fiction: archive of the Ark Plan of the Ula Science Fiction Website (乌拉科幻小说网). Chinese science fiction NLP corpus, text corpus, and text database.
secsilm/zi-dataset
Chinese character dataset with information about each character, such as stroke count, radical, pinyin, and English definitions/synonyms.
xtea/chinese_medical_words
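Manually curated corpus of medical-industry vocabulary and terminology. Can be used to train NLP models of all kinds, such as speech recognition and dialogue systems.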
Niger-Volta-LTI/yoruba-text
Yorùbá language training text for NLP, ASR and TTS tasks
kelvin-jiang/FreebaseQA
The release of the FreebaseQA data set (NAACL 2019).
Pzoom522/HistSumm
Code and data for "Summarising Historical Text in Modern Languages" (EACL 2021)
fido-ai/ua-datasets
A collection of datasets for Ukrainian language
gcunhase/AMICorpusXML
Extracts Transcript and Summary (Abstractive and Extractive) from the AMI Meeting Corpus
AndyTheFactory/romanian-nlp-datasets
A list of Romanian NLP Datasets
selimfirat/bilkent-turkish-writings-dataset
A compiled dataset of Turkish writings that promote creativity, content, composition, grammar, spelling and punctuation.
afrisenti-semeval/afrisent-semeval-2023
AfriSenti-SemEval Shared Task 12: Sentiment Analysis for African Languages: https://afrisenti-semeval.github.io/
matt-seb-ho/WikiWhy
WikiWhy is a new benchmark for evaluating LLMs' ability to explain cause-effect relationships. It is a QA dataset containing 9000+ "why" question-answer-rationale triplets.
gkiril/benchie
Comprehensive evaluation framework for Open Information Extraction.
bothub-it/bothub
Bothub is an open platform for predicting, training and sharing NLP datasets in multiple languages
uma-pi1/OPIEC
Reading the data from OPIEC - an Open Information Extraction corpus
gpt-tester/ChatGPT-test-dataset-01
a small test dataset for use with OpenAI's ChatGPT
ElizaLo/Question-Answering-based-on-SQuAD
Question Answering System using BiDAF Model on SQuAD v2.0
cybermatt/russian-names
Library for generating Russian names
INK-USC/XCSR
Code Repo for the ACL21 paper "Common Sense Beyond English: Evaluating and Improving Multilingual LMs for Commonsense Reasoning"
OSINTAI/Arabic-Dictionaries
Arabic Dictionaries
jamesohortle/loanwords_gairaigo
English loanwords in Japanese
utahnlp/infotabs-code
Implementation of the semi-structured inference model in our ACL 2020 paper, INFOTABS: Inference on Tables as Semi-structured Data.
maxent-ai/Datasets
Datasets with text data for use in NLP, text analysis, information extraction, and ML research.
INK-USC/RiddleSense
RiddleSense: Reasoning about Riddle Questions Featuring Linguistic Creativity and Commonsense Knowledge