yoursock's Stars
BIT-ENGD/baidu_baike
fake-useragent/fake-useragent
Up-to-date simple useragent faker with real world database
mlfoundations/dclm
DataComp for Language Models
TeamHG-Memex/autopager
Detect and classify pagination links
pyppeteer/pyppeteer
Headless chrome/chromium automation library (unofficial port of puppeteer)
taishi-i/awesome-japanese-nlp-resources
A curated list of resources dedicated to Python libraries, LLMs, dictionaries, and corpora of NLP for Japanese
RubyMetric/chsrc
chsrc: a universal, cross-platform tool for switching package sources (mirrors). Change Source for every software on every platform from the command line.
BaiduSpider/BaiduSpider
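BaiduSpider, a crawler for scraping Baidu search results. It currently supports Baidu web search, Baidu image search, Baidu Zhidao search, Baidu video search, Baidu news search, Baidu Wenku search, Baidu Jingyan search, and Baidu Baike search.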
helloworld-Co/html2md
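A lightweight yet powerful one-click HTML-to-Markdown tool open-sourced by the helloworld developer community. It converts articles from multiple platforms in one click and saves them locally.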
jina-ai/jina
☁️ Build multimodal AI applications with cloud-native stack
onuratakan/gpt-computer-assistant
GPT-4o for Windows, macOS and Linux
lm-sys/FastChat
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
modelscope/data-juicer
A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
VikParuchuri/marker
Convert PDF to markdown quickly with high accuracy
deepset-ai/haystack
🔍 LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
datawhalechina/joyful-pandas
A pandas tutorial in Chinese
datawhalechina/self-llm
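"A Practical Guide to Open-Source Large Models": quickly deploy open-source large models in a Linux environment; a deployment tutorial better suited to ** beginners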
g1879/DrissionPage
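A Python-based web automation tool. It can both control a browser and send and receive data packets, combining the convenience of browser automation with the efficiency of requests. It is powerful, with many user-friendly designs and convenience features built in, and its syntax is concise and elegant, requiring little code.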
martinsbalodis/web-scraper-chrome-extension
Web data extraction tool implemented as chrome extension
alirezamika/autoscraper
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Stirling-Tools/Stirling-PDF
#1 Locally hosted web application that allows you to perform various operations on PDF files
hankcs/HanLP
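Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and keyphrase extraction, automatic summarization, text classification and clustering, pinyin and Simplified/Traditional Chinese conversion, natural language processing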
tencentmusic/supersonic
SuperSonic is the next-generation BI platform that integrates Chat BI (powered by LLM) and Headless BI (powered by semantic layer) paradigms.
ltd0102/ghs
hiyouga/LLaMA-Factory
Unify Efficient Fine-Tuning of 100+ LLMs
kermitt2/grobid
A machine learning software for extracting information from scholarly documents
cleanlab/cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
WZBSocialScienceCenter/pdftabextract
A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
pymupdf/PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
pdf2htmlEX/pdf2htmlEX
Convert PDF to HTML without losing text or format.