datongzi666

datongzi666's Stars

RUC-NLPIR/FlashRAG
⚡FlashRAG: A Python Toolkit for Efficient RAG Research
Language: Python · 1.4k stars · 112 forks
satellitecomponent/Neurite
Fractal Graph-of-Thought. Rhizomatic Mind-Mapping for Ai-Agents, Web-Links, Notes, and Code.
Language: JavaScript · 1.2k stars · 104 forks
pathwaycom/llm-app
Ready-to-run cloud templates for RAG, AI pipelines, and enterprise search with live data. 🐳Docker-friendly.⚡Always in sync with Sharepoint, Google Drive, S3, Kafka, PostgreSQL, real-time data APIs, and more.
4.7k stars · 236 forks
AnswerDotAI/RAGatouille
Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.
Language: Python · 3.1k stars · 210 forks
neuml/txtai
💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
Language: Python · 9.5k stars · 611 forks
deepset-ai/haystack
AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
Language: Python · 17.9k stars · 1.9k forks
SpursGoZmy/Tabular-LLM
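This project aims to collect open-source datasets for table intelligence tasks (such as table question answering and table-to-text generation), convert the raw data into instruction-tuning format, and fine-tune LLMs on it to strengthen their understanding of tabular data, ultimately building large language models dedicated to table intelligence tasks.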
48238
google-research/deduplicate-text-datasets
Language: Rust · 1.1k stars · 112 forks
liyucheng09/Contamination_Detector
Lightweight tool to identify Data Contamination in LLMs evaluation
Language:Python431
ChenghaoMou/text-dedup
All-in-one text de-duplication
Language:Python62371
RUC-GSAI/Yulan-GARDEN
Official Repository for SIGIR2024 Demo Paper "An Integrated Data Processing Framework for Pretraining Foundation Models"
Language:Python579
allenai/wimbd
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
Language:Python19320
huggingface/datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Language: Python · 2.1k stars · 151 forks
p-lambda/dsir
DSIR large-scale data selection framework for language model training
Language:Python23219
alipay/financial_evaluation_dataset
Language:Python17216
jeinlee1991/chinese-llm-benchmark
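Chinese LLM capability evaluation leaderboard: currently covers 128 large models, including commercial models such as chatgpt, gpt-4o, Google gemini, Baidu ERNIE Bot, Alibaba Tongyi Qianwen, Baichuan, iFlytek Spark, SenseTime senseChat, and minimax, as well as open-source models such as qwen2.5, llama3.1, glm4, internLM2.5, openbuddy, and AquilaChat. It provides not only a capability score leaderboard but also the raw outputs of all models!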
2.9k stars · 134 forks
FreedomIntelligence/CMB
CMB, A Comprehensive Medical Benchmark in Chinese
Language:Python13512
isen-zhang/ACLUE
Official github repo for ACLUE, an evaluation benchmark focused on ancient Chinese language comprehension
Language:Python23
luban-agi/Awesome-Domain-LLM
Collects and organizes open-source models, datasets, and evaluation benchmarks for vertical domains.
2.3k stars · 179 forks
winninghealth/WiNGPT2
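WiNGPT is a GPT-based large model for the medical vertical domain. It aims to integrate professional medical knowledge, healthcare information, and data in order to provide the healthcare industry with intelligent services such as medical question answering, diagnostic support, and medical knowledge, improving diagnosis and treatment efficiency and the quality of medical services.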
Language:Python33118
allenai/dolma
Data and tools for generating and inspecting OLMo pre-training data.
Language: Python · 1k stars · 108 forks
togethercomputer/RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Language: Python · 4.6k stars · 350 forks
modelscope/data-juicer
Making data higher-quality, juicier, and more digestible for any large models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Language: Python · 3k stars · 181 forks
michael-wzhu/ChatMed
ChatMed: a Chinese medical large language model that is good at answering patients' and users' everyday medical questions online!
Language:Python52168
X-D-Lab/Sunsimiao
🌿 Sunsimiao (孙思邈) Chinese medical large language model: a safe, reliable, and widely accessible Chinese medical LLM
Language:Python39823
MediaBrain-SJTU/MING
明医 (MING): a Chinese large language model for medical consultation
Language:Python874109
triton-lang/triton
Development repository for the Triton language and compiler
Language: C++ · 13.5k stars · 1.7k forks
huggingface/peft
🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
Language: Python · 16.6k stars · 1.6k forks
hpcaitech/ColossalAI
Making large AI models cheaper, faster and more accessible
Language: Python · 38.8k stars · 4.3k forks
Lightning-AI/pytorch-lightning
Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
Language: Python · 28.5k stars · 3.4k forks