Pinned Repositories
30dayMakeCppServer
Build your own C++ server in 30 days, with tutorials and source code
4675-scifi
Chinese NLP corpus of Chinese science fiction: a text corpus and database of about 4,675 Chinese science fiction novels
500LineorLess_CN
Chinese translation project for 500 Lines or Less.
a-PyTorch-Tutorial-to-Sequence-Labeling
Empower Sequence Labeling with Task-Aware Neural Language Model | a PyTorch Tutorial to Sequence Labeling
crf
DeepSpeedExamples
Example models using DeepSpeed
lac
Baidu's open-source Chinese lexical analysis tool: word segmentation, part-of-speech tagging, named entity recognition
wktagger
POS tagger for English
gokunwu's Repositories
gokunwu/Adala
Adala: Autonomous DAta (Labeling) Agent framework
gokunwu/augmentoolkit
Convert Compute And Books Into Instruct-Tuning Datasets
gokunwu/awesome-foundation-and-multimodal-models
👁️ + 💬 + 🎧 = 🤖 Curated list of top foundation and multimodal models! [Paper + Code]
gokunwu/charset_mnbvc
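This project aims to quickly detect and convert the encodings of large numbers of text files, to support data cleaning for the MNBVC corpus project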
gokunwu/crawlee
Crawlee—A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.
gokunwu/data_management_LLM
Collection of training data management explorations for large language models
gokunwu/DataCheck_MNBVC
gokunwu/datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
gokunwu/deduplication_mnbvc
Text deduplication
gokunwu/Douyin_TikTok_Download_API
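🚀 "Douyin_TikTok_Download_API" is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili, supporting API calls and online batch parsing and downloading.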
gokunwu/Exam-Question-Bank-Dataset-zh_mnbvc
General-purpose exam question bank dataset: multiple choice, fill-in-the-blank, short answer
gokunwu/FlagData
gokunwu/github_downloader_mnbvc
GitHub repository downloader
gokunwu/gpt-crawler
Crawl a site to generate knowledge files to create your own custom GPT from a URL
gokunwu/khoj
Your AI second brain. Get answers to your questions, whether they be online or in your own notes. Use foundation models or private, local LLMs. Self-host locally or use our cloud instance. Access from Obsidian, Emacs, Desktop app, Web or Whatsapp.
gokunwu/MatPlotAgent
gokunwu/MegaParse
File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.
gokunwu/mmdp_mnbvc
gokunwu/MNBVC
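MNBVC (Massive Never-ending BT Vast Chinese corpus): an ultra-large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文) data. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.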
gokunwu/nuggets
gokunwu/OLMo-Eval
Evaluation suite for LLMs
gokunwu/Open-Sora
Open-Sora: Democratizing Efficient Video Production for All
gokunwu/OpenRefine
OpenRefine is a free, open source power tool for working with messy data and improving it
gokunwu/parallel_corpus_mnbvc
parallel corpus dataset from the mnbvc project
gokunwu/pdf_meta_data_mnbvc
gokunwu/Prompt-Engineering-Guide
🐙 Guides, papers, lectures, notebooks and resources for prompt engineering
gokunwu/Qwen-Agent
Agent framework and applications built upon Qwen, featuring Code Interpreter and Chrome browser extension.
gokunwu/sensitive-word
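👮♂️ The sensitive word tool for Java. (Sensitive words / prohibited words / illegal words / profanity. A high-performance Java sensitive-word filtering framework based on the DFA algorithm. Please do not post content involving politics, advertising, marketing, circumvention of internet censorship, or violations of national laws and regulations. A high-performance sensitive-word detection and filtering component, with Traditional/Simplified Chinese conversion, full-width/half-width conversion, Chinese-to-pinyin conversion, fuzzy search, and other features.)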
gokunwu/Telechat
gokunwu/unstructured
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.