Pinned Repositories
awesome-data-deduplication
An awesome list of data deduplication use cases, papers, tools, and methods.
chenghaomou.github.io
Personal Blog
deduplicate-text-datasets
A modified version of Google's tool for pure text files
embeddings
zero-vocab or low-vocab embeddings
karafuru
Traditional Chinese colors in your terminal
pytorch-pQRNN
Implementation of pQRNN in PyTorch
simhash
Simhash in C++
text-dedup
All-in-one text de-duplication
touchbar-lyric
Show synced lyrics in the Touch Bar with BetterTouchTool and NetEase APIs
transformer-pointer-generator
Transformer with pointer generator for machine translation
ChenghaoMou's Repositories
ChenghaoMou/text-dedup
All-in-one text de-duplication
ChenghaoMou/touchbar-lyric
Show synced lyrics in the Touch Bar with BetterTouchTool and NetEase APIs
ChenghaoMou/embeddings
zero-vocab or low-vocab embeddings
ChenghaoMou/awesome-data-deduplication
An awesome list of data deduplication use cases, papers, tools, and methods.
ChenghaoMou/chenghaomou.github.io
Personal Blog
ChenghaoMou/deduplicate-text-datasets
A modified version of Google's tool for pure text files
ChenghaoMou/simhash
Simhash in C++
ChenghaoMou/karafuru
Traditional Chinese colors in your terminal
ChenghaoMou/idefics2-contract-qa
ChenghaoMou/mini-vae
Minimal GMM VAE model for NLP
ChenghaoMou/whisper_streaming
Whisper realtime streaming for long speech-to-text transcription and translation
ChenghaoMou/awesome-nlp
A curated list of resources dedicated to Natural Language Processing (NLP)
ChenghaoMou/bigcode-analysis
Repository for analysis notebooks and experiments of the BigCode project.
ChenghaoMou/bigcode-dataset
ChenghaoMou/blog
Public repo for HF blog posts
ChenghaoMou/chenghaomou
ChenghaoMou/closedapi
Tired of seeing not-so-open APIs behind paywalls.
ChenghaoMou/data_tooling
Tools for managing datasets for governance and training.
ChenghaoMou/edgar-crawler
SEC EDGAR Exhibit Downloader
ChenghaoMou/GLiNER
Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024
ChenghaoMou/paper2speech
Convert a research paper to audio
ChenghaoMou/pipecat
Open Source framework for voice and multimodal conversational AI
ChenghaoMou/presidio
Context aware, pluggable and customizable data protection and de-identification SDK for text and images
ChenghaoMou/pytorch-dice-loss
Dice loss for data-imbalanced NLP tasks
ChenghaoMou/quartz
🌱 a fast, batteries-included static-site generator that transforms Markdown content into fully functional websites
ChenghaoMou/rmc
Convert to/from v6 .rm files from the reMarkable tablet
ChenghaoMou/rmrf
Personal reMarkable reformatter built on top of rmscene and rmc
ChenghaoMou/rmscene
Read v6 .rm files from the reMarkable tablet
ChenghaoMou/speech-trident
Awesome speech/audio LLMs, representation learning, and codec models
ChenghaoMou/table-transformer-doclaynet
Table Transformer Fine-tuned with DocLayNet Dataset