gokunwu

I Do NLP & ML

Beijing

Pinned Repositories

30dayMakeCppServer
Build your own C++ server in 30 days, including tutorials and source code
Language:C++00
4675-scifi
Chinese NLP corpus of Chinese science fiction: about 4,675 Chinese science fiction novels. A natural language processing corpus, text corpus, and text database of Chinese science fiction novels.
00
500LineorLess_CN
Chinese translation project for 500 Lines or Less.
Language:HTML0 1 00
a-PyTorch-Tutorial-to-Sequence-Labeling
Empower Sequence Labeling with Task-Aware Neural Language Model | a PyTorch Tutorial to Sequence Labeling
Language:Python0 1 00
crf
Language:C++1 1 01
DeepSpeedExamples
Example models using DeepSpeed
Language:Python1 0 00
lac
Baidu's open-source Chinese lexical analysis tool: word segmentation, part-of-speech tagging, named entity recognition
Language:C++1 0 00
wktagger
POS tagger for English
Language:C++1 1 00

gokunwu's Repositories

gokunwu/Adala
Adala: Autonomous DAta (Labeling) Agent framework
Language:Python0 0
gokunwu/augmentoolkit
Convert Compute And Books Into Instruct-Tuning Datasets
Language:Python0 0
gokunwu/awesome-foundation-and-multimodal-models
👁️ + 💬 + 🎧 = 🤖 Curated list of top foundation and multimodal models! [Paper + Code]
Language:Python0 0
gokunwu/charset_mnbvc
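This project aims to quickly detect and convert the encodings of large numbers of text files, to support data cleaning for the MNBVC corpus project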
Language:Python0 0
gokunwu/crawlee
Crawlee—A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.
Language:TypeScript0 0
gokunwu/data_management_LLM
Collection of training data management explorations for large language models
0 0
gokunwu/DataCheck_MNBVC
gokunwu/datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Language:Python0 0
gokunwu/deduplication_mnbvc
Text deduplication
Language:Python0 0
gokunwu/Douyin_TikTok_Download_API
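🚀 "Douyin_TikTok_Download_API" is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili, supporting API calls as well as online batch parsing and downloading.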
gokunwu/Exam-Question-Bank-Dataset-zh_mnbvc
General exam question bank dataset: multiple-choice, fill-in-the-blank, and short-answer questions
gokunwu/FlagData
Language:Python0 0
gokunwu/github_downloader_mnbvc
GitHub repository downloader
Language:Python0 0
gokunwu/gpt-crawler
Crawl a site to generate knowledge files to create your own custom GPT from a URL
gokunwu/khoj
Your AI second brain. Get answers to your questions, whether they be online or in your own notes. Use foundation models or private, local LLMs. Self-host locally or use our cloud instance. Access from Obsidian, Emacs, Desktop app, Web or Whatsapp.
gokunwu/MatPlotAgent
Language:Python0 0
gokunwu/MegaParse
File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.
gokunwu/mmdp_mnbvc
Language:Python0 0
gokunwu/MNBVC
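MNBVC (Massive Never-ending BT Vast Chinese corpus): a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, and even data written in "Martian script" (火星文). It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wikis, classical poetry, song lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.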
gokunwu/nuggets
Language:Jupyter Notebook0 0
gokunwu/OLMo-Eval
Evaluation suite for LLMs
Language:Python0 0
gokunwu/Open-Sora
Open-Sora: Democratizing Efficient Video Production for All
Language:Python0 0
gokunwu/OpenRefine
OpenRefine is a free, open source power tool for working with messy data and improving it
gokunwu/parallel_corpus_mnbvc
parallel corpus dataset from the mnbvc project
gokunwu/pdf_meta_data_mnbvc
Language:Jupyter Notebook0 0
gokunwu/Prompt-Engineering-Guide
🐙 Guides, papers, lecture, notebooks and resources for prompt engineering
gokunwu/Qwen-Agent
Agent framework and applications built upon Qwen, featuring Code Interpreter and Chrome browser extension.
gokunwu/sensitive-word
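👮‍♂️ The sensitive word tool for Java. (Sensitive words / prohibited words / illegal words / profanity. A high-performance Java sensitive-word filtering framework based on the DFA algorithm. Please do not post content involving politics, advertising, marketing, circumvention of internet restrictions, or violations of national laws and regulations. A high-performance sensitive-word detection and filtering component, with Traditional/Simplified Chinese conversion, full-width/half-width conversion, Chinese character to pinyin conversion, fuzzy search, and other features.)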
Language:Java0 0
gokunwu/Telechat
gokunwu/unstructured
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.