datongzi666's Stars
RUC-NLPIR/FlashRAG
⚡FlashRAG: A Python Toolkit for Efficient RAG Research
satellitecomponent/Neurite
Fractal Graph-of-Thought. Rhizomatic Mind-Mapping for AI Agents, Web Links, Notes, and Code.
pathwaycom/llm-app
Ready-to-run cloud templates for RAG, AI pipelines, and enterprise search with live data. 🐳 Docker-friendly. ⚡ Always in sync with SharePoint, Google Drive, S3, Kafka, PostgreSQL, real-time data APIs, and more.
AnswerDotAI/RAGatouille
Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.
neuml/txtai
💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
deepset-ai/haystack
AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
SpursGoZmy/Tabular-LLM
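This project aims to collect open-source datasets for table intelligence tasks (such as table question answering and table-to-text generation), convert the raw data into instruction-tuning format, and fine-tune LLMs on it to strengthen their understanding of tabular data, ultimately building large language models specialized for table intelligence tasks.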
google-research/deduplicate-text-datasets
liyucheng09/Contamination_Detector
Lightweight tool to identify Data Contamination in LLMs evaluation
ChenghaoMou/text-dedup
All-in-one text de-duplication
RUC-GSAI/Yulan-GARDEN
Official Repository for SIGIR2024 Demo Paper "An Integrated Data Processing Framework for Pretraining Foundation Models"
allenai/wimbd
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
huggingface/datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
p-lambda/dsir
DSIR large-scale data selection framework for language model training
alipay/financial_evaluation_dataset
jeinlee1991/chinese-llm-benchmark
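Chinese LLM capability leaderboard: currently covers 128 large models, including commercial models such as chatgpt, gpt-4o, Google gemini, Baidu ERNIE Bot (文心一言), Alibaba Tongyi Qianwen (通义千问), Baichuan (百川), iFlytek Spark (讯飞星火), SenseTime senseChat, and minimax, as well as open-source models such as qwen2.5, llama3.1, glm4, InternLM2.5 (书生), openbuddy, and AquilaChat. It provides not only a capability score ranking but also the raw outputs of every model!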
FreedomIntelligence/CMB
CMB, A Comprehensive Medical Benchmark in Chinese
isen-zhang/ACLUE
Official GitHub repo for ACLUE, an evaluation benchmark focused on ancient Chinese language comprehension
luban-agi/Awesome-Domain-LLM
Collects and organizes open-source models, datasets, and evaluation benchmarks for vertical domains.
winninghealth/WiNGPT2
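WiNGPT is a GPT-based large language model for the medical vertical domain. It aims to integrate professional medical knowledge, healthcare information, and data to provide the healthcare industry with intelligent services such as medical question answering, diagnostic support, and medical knowledge, improving the efficiency of diagnosis and treatment and the quality of medical services.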
allenai/dolma
Data and tools for generating and inspecting OLMo pre-training data.
togethercomputer/RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
modelscope/data-juicer
Making data higher-quality, juicier, and more digestible for any large models! 🍎 🍋 🌽 ➡️ ➡️ 🍸 🍹 🍷
michael-wzhu/ChatMed
ChatMed: a Chinese medical large language model that is good at answering patients' and users' everyday medical questions online!
X-D-Lab/Sunsimiao
🌿 Sunsimiao (孙思邈) Chinese medical large language model: providing a safe, reliable, and inclusive Chinese medical LLM
MediaBrain-SJTU/MING
明医 (MING): a Chinese medical consultation large language model
triton-lang/triton
Development repository for the Triton language and compiler
huggingface/peft
🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
hpcaitech/ColossalAI
Making large AI models cheaper, faster and more accessible
Lightning-AI/pytorch-lightning
Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.