toolbox
Curated libraries for a faster workflow
Phase: Data
Data Annotation
- Image: makesense.ai
- Text: doccano, prodigy, dataturks, brat
- Audio: audio-annotator
- Label in notebooks: superintendent
Data Collection
- Curations: nlp-datasets
- Words: curse-words, badwords, LDNOOBW, english-words (A text file containing over 466k English words), 10K most common words, common-misspellings
- Text Corpus: project gutenberg, oscar (big multilingual corpus), nlp-datasets, 1 trillion n-grams, The Big Bad NLP Database, litbank, BookCorpus
- Sentiment: SST2, Amazon Reviews, Yelp Reviews, Movie Reviews, Food Reviews, Twitter Airline, GOP Debate, Sentiment Lexicons for 81 languages, SentiWordNet, Opinion Lexicon, Wordstat words, Emoticon Sentiment
- Emotion: NRC-Emotion-Lexicon-Wordlevel, ISEAR (17K)
- Summarization Data: curation-corpus
- Conversational data: conversational-datasets, cornell-movie-dialog-corpus
- Image: 1 million fake faces, flickr-faces, CIFAR-10, The Street View House Numbers (SVHN), STL-10, imagenette, objectnet, Yahoo Flickr Creative Commons 100 Million (YFCC100m), USPS, Animal Faces-HQ dataset (AFHQ)
- Paraphrasing: PPDB
- One Shot Learning: omniglot, mini-imagenet
- Audio: audioset (youtube audio with labels)
- Dataset search engine: datasetlist, UCI Machine Learning Datasets, Google Dataset Search, fastai-datasets, Data For Everyone
- Graphs: Social Networks (Github, Facebook, Reddit)
- Handwriting: iam-handwriting
Importing Data
- Prebuilt: OpenML, nlp
- Audio: pydub
- Video: pytube (download youtube videos), moviepy
- Image: py-image-dataset-generator (automatically fetch images from the web for a given search query)
- News: news-please, news-catcher
- Email: talon
- PDF: camelot, tabula-py, Parsr, pdftotext, pdfplumber
- Excel: openpyxl
- Remote file: smart_open
- Crawling: pyppeteer (chrome automation), MechanicalSoup, libextract
- Google sheets: gspread
- Google drive: gdown, pydrive
- Python API for datasets: pydataset
- Google maps location data: geo-heatmap
- Text to Speech: gtts
- Databases: blaze (pandas and numpy interface to databases)
- Twitter: twint (scrape twitter)
- App Store: google-play-scraper
Data Augmentation
- Text: nlpaug, noisemix, textattack
- Image: imgaug, albumentations, augmentor, solt
- Audio: audiomentations, muda
- OCR data: TextRecognitionDataGenerator
- Tabular data: deltapy
- Automatic augmentation: deepaugment (image)
Phase: Exploration
Data Preparation
- Dataframe: cudf (pandas with GPU)
- Missing values: missingno
- Split images into train/validation/test: split-folders
- Class Imbalance: imblearn
- Categorical encoding: category_encoders
- Numerical data: numerizer (convert natural language numerics into ints and floats)
- Data Validation: pandera (validation for pandas), pandas-profiling
- Data Cleaning: pyjanitor (janitor ported to python)
- Parsing: pyparsing, parse
- Natural date parser: dateparser
- Unicode: text-unidecode
- Emoji: emoji
- Weak Supervision: snorkel
- Graph Sampling: little ball of fur
Data Exploration
- View Jupyter notebooks through CLI: nbdime
- Parametrize notebooks: papermill
- Access notebooks programmatically: nbformat
- Convert notebooks to other formats: nbconvert
- Extra utilities not present in frameworks: mlxtend
- Maps in notebooks: ipyleaflet
- Data Exploration: bamboolib (a GUI for pandas)
Phase: Feature Engineering
Feature Generation
- Automatic feature engineering: featuretools, autopandas, tsfresh (automatic feature engineering for time series)
- Custom distance metric learning: metric-learn, pytorch-metric-learning
- Time series: python-holidays, skits, catch22
- DAG based dataset generation: DFFML
Dimensionality reduction
Phase: Modeling
Model Selection
- Pretrained models: modeldepot, pytorch-hub, papers-with-code, pretrained-models.pytorch, huggingface-models
- Automated Machine Learning (AutoML): auto-sklearn, tpot, mljar-supervised
- Curations: bert-related-papers
- Autogenerate ML code: automl-gs, mindsdb, autocat (auto-generate text classification models in spacy), ludwig
- ML from command line (or Python or HTTP): DFFML
- Find SOTA models: sotawhat
- Gradient Boosting: catboost, lightgbm (GPU-capable), thundergbm (GPU-capable), ngboost
- Hidden Markov Models: hmmlearn
- Genetic Programming: gplearn
- Active Learning: modal
- Support Vector Machines: thundersvm (GPU-capable)
- Rule based classifier: sklearn-expertsys
- Probabilistic modeling: pomegranate
- Graph Embedding and Community Detection: karateclub, python-louvain
- Anomaly detection: adtk
- Spiking Neural Network: norse
- Fuzzy Learning: fylearn, scikit-fuzzy
- Noisy Label Learning: cleanlab
- Few Shot Learning: keras-fewshotlearning
- Deep Clustering: deep-clustering-toolbox
- Graph Neural Networks: spektral (GNN for Keras)
- Contrastive Learning: contrastive-learner
NLP
- Libraries: spacy, nltk, corenlp, deeppavlov, kashgari, camphr (spacy plugin for transformers, elmo, udify), transformers, simpletransformers, ernie, stanza
- Scientific Domain: scispacy (spacy for medical documents)
- Clinical Domain: clinicalbert-mimicnotes, clinicalbert-discharge-summary
- Preprocessing: textacy
- Text Extraction: textract (Image, Audio, PDF)
- Text Generation: gpt2client, textgenrnn, gpt-2-simple, aitextgen
- Machine Translation: MarianMT
- Summarization: textrank, pytldr, bert-extractive-summarizer, sumy, fast-pagerank, sumeval
- Spelling Correction: JamSpell, pyhunspell, pyspellchecker, cython_hunspell, hunspell-dictionaries, autocorrect (can add more languages), symspellpy
- Contraction Mapping: contractions
- Keyword extraction: rake, pke, phrasemachine
- Stopwords: stopwords-iso (stopwords for all languages)
- Multiple Choice Question Answering: mcQA
- Sequence to sequence models: headliner
- Transfer learning: finetune
- Translation: googletrans, word2word, translate-python
- Embeddings: pymagnitude (manage vector embeddings easily), chakin (download pre-trained word vectors), sentence-transformers, InferSent, bert-as-service, sent2vec, sense2vec, zeugma (pretrained-word embeddings as scikit-learn transformers), BM25Transformer, glove-python, fse
- Cross-lingual Language Models: muse, laserembeddings, xlm
- Multilingual support: polyglot, inltk (indic languages), indic_nlp
- NLU: snips-nlu
- Semantic parsing: quepy
- Inflections: inflect
- Contractions: pycontractions
- Coreference Resolution: neuralcoref
- Readability: homer
- Grammar Checking: language-check
- Topic Modeling: guidedlda, enstop, top2vec, contextualized-topic-models, corex_topic, lda2vec
- Clustering: spherecluster (kmeans with cosine distance), kneed (automatically find number of clusters from elbow curve), kmodes
- Metrics: seqeval (NER, POS tagging)
- String match: jellyfish (perform string and phonetic comparison), flashtext (superfast extract and replace keywords), pythonverbalexpressions (verbally describe regex), commonregex (readymade regex for email/phone etc), phrase-seeker, textsearch
- Sentiment: vaderSentiment (rule based)
- Aspect Based Sentiment Analysis: absa
- Emotion Classification: distilroberta-finetuned
- Text distances: textdistance, editdistance, word-mover-distance, wmd-relax (word mover distance for spacy)
- PII removal: scrubadub
- Profanity detection: profanity-check
- Visualization: stylecloud (wordclouds), scattertext
- Fuzzy Search: fuzzywuzzy
- Named Entity Recognition (NER): spaCy, Stanford NER, sklearn-crfsuite, med7 (spacy NER for medical records)
- Fill blanks: fitbert
- Dictionary: vocabulary
- Nearest neighbor: faiss
- Sentence Segmentation: nnsplit
- Knowledge Distillation: textbrewer
- Sentence Coherence: lm-scorer
- Record Linking: fuzzymatcher
- Markov chains: markovify
- Knowledge Graphs: stanford-openie
- Chinese Word Segmentation: jieba
Audio
- Library: speech_recognition, pyannote, librosa
- Diarization: resemblyzer
- Source Separation: spleeter, nussl, open-unmix-pytorch, asteroid
RecSys
- Factorization machines (FM) and field-aware factorization machines (FFM): xlearn, DeepCTR
- Collaborative Filtering: implicit
- Scikit-learn like API: surprise
- Recommendation System in Pytorch: CaseRecommender
- Apriori algorithm: apyori
Computer Vision
- Image processing: scikit-image, imutils
- Segmentation Models in Keras: segmentation_models
- Face recognition: face_recognition, face-alignment (find facial landmarks)
- GANs: mimicry
- Face swapping: faceit, faceit-live, avatarify
- Video summarization: videodigest
- Semantic search over videos: scoper
- OCR: keras-ocr, pytesseract
- Object detection: luminoth, detectron2
- Image hashing: ImageHash
Timeseries
- Predict Time Series: prophet, atspy (automated time-series models), tslearn, pyts, seglearn, cesium, stumpy
- Scikit-learn like API: sktime
- ARIMA models: pmdarima
Framework extensions
- Pytorch: Keras like summary for pytorch, skorch (wrap pytorch in scikit-learn compatible API), catalyst
- Einstein notation: einops, kornia, torchcontrib (recent paper ideas)
- Scikit-learn: scikit-lego, iterstrat (cross-validation for multi-label data), iterative-stratification, tscv (time series cross-validation)
- Keras: keras-radam, larq (binarized neural networks), ktrain (fastai like interface for keras), tavolo (useful techniques from kaggle as utilities), tensorboardcolab (make tensorboard work in colab), tf-sha-rnn
- Tensorflow: tensorflow-addons
Phase: Validation
Model Training Monitoring
- Learning curve: lrcurve (plot realtime learning curve in Keras), livelossplot
- Notifications: knockknock (get notified by slack/email), jupyter-notify (notify when task is completed in jupyter)
- Progress bar: fastprogress, tqdm
- GPU Usage: gpumonitor, jupyterlab-nvdashboard (see gpu usage in jupyterlab)
Interpretability
- Visualize keras models: keras-vis
- Interpret models: eli5, lime, shap, alibi, tf-explain, treeinterpreter, pybreakdown, xai, lofo-importance, interpretML
- Interpret BERT: exbert, bertviz (see attention)
- Interpret word2vec: word2viz, whatlies
Phase: Optimization
Hyperparameter Optimization
- Keras: keras-tuner
- Scikit-learn: sklearn-deap (evolutionary algorithm for hyperparameter search), hyperopt-sklearn
- General: hyperopt, optuna, evol, talos
- Parameter optimization: ParameterImportance
Visualization
- Visualization libraries: pygal, plotly, plotnine
- Interactive charts: bokeh
- Visualization for scikit-learn: yellowbrick, scikit-plot
- XKCD like charts: chart.xkcd
- Convert matplotlib charts to D3 charts: mpld3
- Generate graphs using markdown: mermaid
- Visualize topic models: pyldavis
- High dimensional visualization: umap
- Visualize architectures: netron, nn-svg
- Activation maps for keras: keract
- Create interactive charts online: flourish-studio
- Color Schemes: open-color, mplcyberpunk (cyberpunk style for matplotlib)
- Bar chart race animation: bar_chart_race
Phase: Production
Model Serialization
- Transpiling: sklearn-porter (transpile sklearn model to C, Java, JavaScript and others), m2cgen
- Pickling extended: cloudpickle, jsonpickle
Scalability
- Parallelize Pandas: pandarallel, swifter, modin
- Parallelize numpy operations: numba
- Distributed training: horovod
Benchmark
API
- Configuration Management: config, python-decouple
- Data Validation: schema, jsonschema, cerberus, pydantic, marshmallow, validators
- Enable CORS in Flask: flask-cors
- Caching: cachetools, cachew (cache to local sqlite)
- Authentication: pyjwt (JWT)
- Task Queue: rq, schedule, huey
- Database: flask-sqlalchemy, tinydb, flask-pymongo
- Logging: loguru
Dashboard
- Generate frontend with python: streamlit
Adversarial testing
- Generate images to fool model: foolbox
- Generate phrases to fool NLP models: triggers
- General: cleverhans
Python libraries
- Datetime compatible API for Bikram Sambat: nepali-date
- Decorators: retrying (retry some function)
- Bloom filter: python-bloomfilter
- Run python libraries in sandbox: pipx
- Pretty print tables in CLI: tabulate
- Leaflet maps from python: folium
- Debugging: PySnooper
- Date and Time: pendulum
- Create interactive prompts: prompt-toolkit
- Concurrent database: pickleshare
- Async: tomorrow
- Testing: crosshair (find failure cases for functions)
- CLI tools: gitjk (undo what you just did in git)
- Virtual webcam: pyfakewebcam
- CLI Formatting: rich
- Control and monitor mouse and keyboard input: pynput
- Shell commands as functions: sh