nlp-datasets

There are 159 repositories under the nlp-datasets topic.

mihail911/nlp-library
Curated collection of papers for the NLP practitioner 📖👩‍🔬
1.1k 69 189
guhhhhaa/4675-scifi
Chinese NLP corpus of Chinese science fiction: about 4675 Chinese science fiction novels. Chinese science fiction NLP corpus, text corpus, text database, and language data.
422 7 065
hellohaptik/multi-task-NLP
multi_task_NLP is a utility toolkit enabling NLP developers to easily train and infer a single model for multiple tasks.
Language: Python
372 19 1154
dkulagin/kartaslov
Open linguistic datasets: the KartaSlovSent sentiment lexicon of the Russian language, a semantics dataset, an association graph, and a dataset of spelling errors and typos.
370 33 151
quincyliang/nlp-public-dataset
Chinese and English NER datasets, English-Chinese machine translation datasets, and a Chinese word segmentation dataset.
Language: Python
363 7 276
irfnrdh/Awesome-Indonesia-NLP
Resource NLP & Bahasa
269 7 066
grammarly/ua-gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Language: Macaulay2
263 12 622
StonyBrookNLP/appworld
🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents", ACL'24 Best Resource Paper.
Language: Python
246 11 2717
liutiedong/goat
a Fine-tuned LLaMA that is Good at Arithmetic Tasks
Language: Jupyter Notebook
177 3 517
cjiang2/VDCNN
Implementation of Very Deep Convolutional Neural Network for Text Classification
Language: Python
172 6 1741
INK-USC/TriggerNER
TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition (ACL 2020)
Language: Python
172 10 1618
INK-USC/CommonGen
A Constrained Text Generation Challenge Towards Generative Commonsense Reasoning
Language: Python
141 7 023
guhhhhaa/wula-scifi
Chinese NLP corpus of Chinese science fiction: archive of the Ark Plan of the Ula Science Fiction Website. Chinese science fiction NLP corpus, text corpus, text database, and language data.
122 2 026
secsilm/zi-dataset
Chinese character dataset with information about each character, such as stroke count, radical, pinyin, and English definitions/synonyms.
122 4 217
xtea/chinese_medical_words
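Manually curated corpus of medical-industry vocabulary and terminology. It can be used to train NLP models for speech recognition, dialogue systems, and similar tasks.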
121 2 040
Niger-Volta-LTI/yoruba-text
Yorùbá language training text for NLP, ASR and TTS tasks
Language: Python
80 8 1330
kelvin-jiang/FreebaseQA
The release of the FreebaseQA data set (NAACL 2019).
72 3 61
Pzoom522/HistSumm
Code and data for "Summarising Historical Text in Modern Languages" (EACL 2021)
Language: Jupyter Notebook
72 5 29
fido-ai/ua-datasets
A collection of datasets for Ukrainian language
Language: Python
56 2 12
gcunhase/AMICorpusXML
Extracts Transcript and Summary (Abstractive and Extractive) from the AMI Meeting Corpus
Language: Python
52 4 630
AndyTheFactory/romanian-nlp-datasets
A list of Romanian NLP Datasets
51 1 010
selimfirat/bilkent-turkish-writings-dataset
Compilation of Turkish writings dataset that promotes creativity, content, composition, grammar, spelling and punctuation.
Language: Python
51 4 12
afrisenti-semeval/afrisent-semeval-2023
AfriSenti-SemEval Shared Task 12: Sentiment Analysis for African languages: https://afrisenti-semeval.github.io/
Language: Jupyter Notebook
49 3 141
matt-seb-ho/WikiWhy
WikiWhy is a new benchmark for evaluating LLMs' ability to explain cause-effect relationships. It is a QA dataset containing 9000+ "why" question-answer-rationale triplets.
Language: Python
47 3 21
gkiril/benchie
Comprehensive evaluation framework for Open Information Extraction.
Language: Python
39 3 410
bothub-it/bothub
Bothub is an open platform for predicting, training and sharing NLP datasets in multiple languages
Language: Makefile
38 11 945
uma-pi1/OPIEC
Reading the data from OPIEC - an Open Information Extraction corpus
Language: Java
38 4 26
gpt-tester/ChatGPT-test-dataset-01
a small test dataset for use with OpenAI's ChatGPT
33 4 011
ElizaLo/Question-Answering-based-on-SQuAD
Question Answering System using BiDAF Model on SQuAD v2.0
Language: Python
25 1 027
cybermatt/russian-names
Library for generating Russian names
Language: Python
24 2 12
INK-USC/XCSR
Code Repo for the ACL21 paper "Common Sense Beyond English: Evaluating and Improving Multilingual LMs for Commonsense Reasoning"
Language: Python
22 7 12
OSINTAI/Arabic-Dictionaries
Arabic Dictionaries
21 4 00
jamesohortle/loanwords_gairaigo
English loanwords in Japanese
Language: Python
18 2 02
utahnlp/infotabs-code
Implementation of the semi-structured inference model in our ACL 2020 paper, INFOTABS: Inference on Tables as Semi-structured Data.
Language: Python
18 3 08
maxent-ai/Datasets
Datasets with text data for use in NLP, text analysis, information extraction, and ML research.
Language: Jupyter Notebook
16 4 03
INK-USC/RiddleSense
RiddleSense: Reasoning about Riddle Questions Featuring Linguistic Creativity and Commonsense Knowledge
Language: Python
14 4 01
