llm-datasets

There are 12 repositories under the llm-datasets topic.

  • neo4j-labs/text2cypher

    Collection of text2cypher datasets, evaluations, and fine-tuning instructions

    Language: Jupyter Notebook
  • dsdanielpark/open-llm-datasets

    Repository for organizing datasets and papers used in Open LLM.

  • discus-labs/discus

    A data-centric AI package for ML/AI. Get the best high-quality data for the best results. Discord: https://discord.gg/t6ADqBKrdZ

    Language: Python
  • asimsinan/LLM-Research

    A collection of LLM-related papers, theses, tools, datasets, courses, open source models, and benchmarks

    Language: Python
  • altunenes/rustysozluk

    Efficiently fetch entries from eksisozluk.com and perform sentiment analysis on them (Turkish only) using Rust

    Language: Rust
  • DefinetlyNotAI/LLM_Data

    Python source code from a number of well-known repositories, collected in this repo as plain localdocs for training code AI

    Language: Python
  • arian-askari/SOLID

    Synthetically generating intent-aware information-seeking dialogues. Useful for tasks such as training and evaluating user intent predictors, with the option of training and evaluating on real human dialogues. The backbone LLM of SOLID is Zephyr-7b-beta.

    Language: Python
  • tiddly-gittly/TiddlyWiki-LLM-dataset

    WikiText syntax dataset generation pipeline and open dataset for auto UI generation in TiddlyWiki. (WIP)

    Language: TypeScript
  • redblock-ai/parrot-python

    PARROT (Performance Assessment of Reasoning and Responses On Trivia) is a novel benchmarking framework designed to evaluate Large Language Models (LLMs) on real-world, complex, and ambiguous QA tasks.

    Language: Python
  • aloobun/basedUX

    Minimal dataset consisting of 363 Human & Assistant dialogs

  • aloobun/ccpem-modified

    A modified dataset consisting of English dialogs between a user and an assistant discussing movie preferences in natural language.

  • jsurrea/LLM-Latino

    Collection of ETL scripts used to create a dataset of text in Spanish to train Large Language Models.

    Language: Python