nlp-datasets

There are 159 repositories under the nlp-datasets topic.

  • mihail911/nlp-library

    curated collection of papers for the nlp practitioner 📖👩‍🔬

  • guhhhhaa/4675-scifi

    Chinese NLP corpus of Chinese science fiction: about 4,675 Chinese science fiction novels. Chinese science fiction text corpus and text database.

  • hellohaptik/multi-task-NLP

    multi_task_NLP is a utility toolkit enabling NLP developers to easily train and infer a single model for multiple tasks.

    Language: Python
  • dkulagin/kartaslov

    Open linguistic datasets: the KartaSlovSent (КартаСловСент) sentiment lexicon of the Russian language, a semantics dataset, an association graph, and a dataset of spelling errors and typos.

  • quincyliang/nlp-public-dataset

    Chinese and English NER datasets, English-Chinese machine translation datasets, and a Chinese word segmentation dataset.

    Language: Python
  • irfnrdh/Awesome-Indonesia-NLP

    Resource NLP & Bahasa

  • grammarly/ua-gec

    UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

    Language: Macaulay2
  • StonyBrookNLP/appworld

    🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents", ACL'24 Best Resource Paper.

    Language: Python
  • liutiedong/goat

    a Fine-tuned LLaMA that is Good at Arithmetic Tasks

    Language: Jupyter Notebook
  • cjiang2/VDCNN

    Implementation of Very Deep Convolutional Neural Network for Text Classification

    Language: Python
  • INK-USC/TriggerNER

    TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition (ACL 2020)

    Language: Python
  • INK-USC/CommonGen

    A Constrained Text Generation Challenge Towards Generative Commonsense Reasoning

    Language: Python
  • guhhhhaa/wula-scifi

    Chinese NLP corpus of Chinese science fiction: an archive of the Ark Plan of the Ula Science Fiction Website (乌拉科幻小说网). Chinese science fiction text corpus and text database.

  • secsilm/zi-dataset

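    Chinese character (hanzi) dataset with information about each character, such as stroke count, radical, pinyin, and English definitions/synonyms.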

  • xtea/chinese_medical_words

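    Manually curated corpus of medical-industry vocabulary and terminology. Can be used to train NLP models of various kinds, such as speech recognition and dialogue systems.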

  • Niger-Volta-LTI/yoruba-text

    Yorùbá language training text for NLP, ASR and TTS tasks

    Language: Python
  • kelvin-jiang/FreebaseQA

    The release of the FreebaseQA data set (NAACL 2019).

  • Pzoom522/HistSumm

    Code and data for "Summarising Historical Text in Modern Languages" (EACL 2021)

    Language: Jupyter Notebook
  • fido-ai/ua-datasets

    A collection of datasets for Ukrainian language

    Language: Python
  • gcunhase/AMICorpusXML

    Extracts Transcript and Summary (Abstractive and Extractive) from the AMI Meeting Corpus

    Language: Python
  • AndyTheFactory/romanian-nlp-datasets

    A list of Romanian NLP Datasets

  • selimfirat/bilkent-turkish-writings-dataset

    Compilation of Turkish writings dataset that promotes creativity, content, composition, grammar, spelling and punctuation.

    Language: Python
  • afrisenti-semeval/afrisent-semeval-2023

    AfriSenti-SemEval Shared Task 12: Sentiment Analysis for African languages: https://afrisenti-semeval.github.io/

    Language: Jupyter Notebook
  • matt-seb-ho/WikiWhy

    WikiWhy is a new benchmark for evaluating LLMs' ability to explain cause-effect relationships. It is a QA dataset containing 9000+ "why" question-answer-rationale triplets.

    Language: Python
  • gkiril/benchie

    Comprehensive evaluation framework for Open Information Extraction.

    Language: Python
  • bothub-it/bothub

    Bothub is an open platform for predicting, training and sharing NLP datasets in multiple languages

    Language: Makefile
  • uma-pi1/OPIEC

    Reading the data from OPIEC - an Open Information Extraction corpus

    Language: Java
  • gpt-tester/ChatGPT-test-dataset-01

    a small test dataset for use with OpenAI's ChatGPT

  • ElizaLo/Question-Answering-based-on-SQuAD

    Question Answering System using BiDAF Model on SQuAD v2.0

    Language: Python
  • cybermatt/russian-names

    Library for generating Russian names

    Language: Python
  • INK-USC/XCSR

    Code Repo for the ACL21 paper "Common Sense Beyond English: Evaluating and Improving Multilingual LMs for Commonsense Reasoning"

    Language: Python
  • Arabic-Dictionaries
  • jamesohortle/loanwords_gairaigo

    English loanwords in Japanese

    Language: Python
  • utahnlp/infotabs-code

    Implementation of the semi-structured inference model in our ACL 2020 paper, INFOTABS: Inference on Tables as Semi-structured Data.

    Language: Python
  • maxent-ai/Datasets

    datasets with text data for use in NLP, Text analysis, information extraction, ML research.

    Language: Jupyter Notebook
  • INK-USC/RiddleSense

    RiddleSense: Reasoning about Riddle Questions Featuring Linguistic Creativity and Commonsense Knowledge

    Language: Python