toolbox

Curated libraries for a faster workflow

Phase: Data

Data Annotation

Image: makesense.ai
Text: doccano, prodigy, dataturks, brat
Audio: audio-annotator
Label in notebooks: superintendent

Data Collection

Words: curse-words, badwords, LDNOOBW, english-words (A text file containing over 466k English words), 10K most common words
Text Corpus: project gutenberg, oscar (big multilingual corpus), nlp-datasets, 1 trillion n-grams, The Big Bad NLP Database, litbank
Sentiment: SST2, Amazon Reviews, Yelp Reviews, Movie Reviews, Food Reviews, Twitter Airline, GOP Debate, Sentiment Lexicons for 81 languages, SentiWordNet, Opinion Lexicon, Wordstat words, Emoticon Sentiment
Summarization Data: curation-corpus
Conversational data: conversational-datasets, cornell-movie-dialog-corpus
Image: 1 million fake faces, flickr-faces, CIFAR-10, The Street View House Numbers (SVHN), STL-10, imagenette, objectnet, Yahoo Flickr Creative Commons 100 Million (YFCC100m), USPS, Animal Faces-HQ dataset (AFHQ)
Paraphrasing: PPDB
One Shot Learning: omniglot, mini-imagenet
Audio: audioset (youtube audio with labels)
Dataset search engine: datasetlist, UCI Machine Learning Datasets, Google Dataset Search, fastai-datasets, Data For Everyone
Graphs: Social Networks (Github, Facebook, Reddit)

Importing Data

Audio: pydub
Video: pytube (download youtube videos), moviepy
Image: py-image-dataset-generator (auto fetch images from web for certain search)
News: news-please
Email: talon
PDF: camelot, tabula-py, Parsr, pdftotext, pdfplumber
Excel: openpyxl
Remote file: smart_open
Crawling: pyppeteer (chrome automation), MechanicalSoup, libextract
Google sheets: gspread
Google drive: gdown, pydrive
Python API for datasets: pydataset
Google maps location data: geo-heatmap
Text to Speech: gtts
Databases: blaze (pandas and numpy interface to databases)
Twitter: twint (scrape twitter)
Prebuilt: OpenML, nlp

Data Augmentation

Text: nlpaug, noisemix, textattack
Image: imgaug, albumentations, augmentor, solt
Audio: audiomentations, muda
OCR data: TextRecognitionDataGenerator
Tabular data: deltapy
Automatic augmentation: deepaugment (image)

Phase: Exploration

Data Preparation

Dataframe: cudf (pandas with GPU)
Missing values: missingno
Split images into train/validation/test: split-folders
Class Imbalance: imblearn
Categorical encoding: category_encoders
Numerical data: numerizer (convert natural language numerics into ints and floats)
Data Validation: pandera (validation for pandas), pandas-profiling
Data Cleaning: pyjanitor (janitor ported to python)
Parsing: pyparsing, parse
Natural date parser: dateparser
Unicode: text-unidecode
Emoji: emoji
Weak Supervision: snorkel
Graph Sampling: little ball of fur

Data Exploration

View Jupyter notebooks through CLI: nbdime
Parametrize notebooks: papermill
Access notebooks programmatically: nbformat
Convert notebooks to other formats: nbconvert
Extra utilities not present in frameworks: mlxtend
Maps in notebooks: ipyleaflet
Data Exploration: bamboolib (a GUI for pandas)

Phase: Feature Engineering

Feature Generation

Automatic feature engineering: featuretools, autopandas, tsfresh (automatic feature engineering for time series)
Custom distance metric learning: metric-learn, pytorch-metric-learning
Time series: python-holidays, skits, catch22
DAG based dataset generation: DFFML

Dimensionality reduction

Dimensionality reduction: fbpca, fitsne

Phase: Modeling

Model Selection

Pretrained models: modeldepot, pytorch-hub, papers-with-code, pretrained-models.pytorch, huggingface-models
Automated Machine Learning (AutoML): auto-sklearn, tpot, mljar-supervised
Curations: bert-related-papers
Autogenerate ML code: automl-gs, mindsdb, autocat (auto-generate text classification models in spacy), ludwig
ML from command line (or Python or HTTP): DFFML
Find SOTA models: sotawhat
Gradient Boosting: catboost, lightgbm (GPU-capable), thundergbm (GPU-capable), ngboost
Hidden Markov Models: hmmlearn
Genetic Programming: gplearn
Active Learning: modal
Support Vector Machines: thundersvm (GPU-capable)
Rule based classifier: sklearn-expertsys
Probabilistic modeling: pomegranate
Graph Embedding and Community Detection: karateclub, python-louvain
Anomaly detection: adtk
Spiking Neural Network: norse
Fuzzy Learning: fylearn, scikit-fuzzy
Noisy Label Learning: cleanlab
Few Shot Learning: keras-fewshotlearning
Deep Clustering: deep-clustering-toolbox
Graph Neural Networks: spektral (GNN for Keras)
Contrastive Learning: contrastive-learner

NLP

Libraries: spacy, nltk, corenlp, deeppavlov, kashgari, camphr (spacy plugin for transformers, elmo, udify), transformers, simpletransformers, ernie, stanza, scispacy (spacy for medical documents)
Preprocessing: textacy
Text Extraction: textract (Image, Audio, PDF)
Text Generation: gpt2client, textgenrnn, gpt-2-simple, aitextgen
Summarization: textrank, pytldr, bert-extractive-summarizer, sumy, fast-pagerank, sumeval
Spelling Correction: JamSpell, pyhunspell, pyspellchecker, cython_hunspell, hunspell-dictionaries, autocorrect (can add more languages), symspellpy
Contraction Mapping: contractions
Keyword extraction: rake, pke, phrasemachine
Stopwords: stopwords-iso (stopwords for all languages)
Multiple Choice Question Answering: mcQA
Sequence to sequence models: headliner
Transfer learning: finetune
Translation: googletrans, word2word, translate-python
Embeddings: pymagnitude (manage vector embeddings easily), chakin (download pre-trained word vectors), sentence-transformers, InferSent, bert-as-service, sent2vec, sense2vec, zeugma (pretrained word embeddings as scikit-learn transformers), BM25Transformer, glove-python, fse
Cross-lingual embeddings: muse, laserembeddings
Multilingual support: polyglot, inltk (indic languages), indic_nlp
NLU: snips-nlu
Semantic parsing: quepy
Inflections: inflect
Contractions: pycontractions
Coreference Resolution: neuralcoref
Readability: homer
Grammar Checking: language-check
Topic Modeling: guidedlda, enstop, top2vec, contextualized-topic-models, corex_topic
Clustering: spherecluster (kmeans with cosine distance), kneed (automatically find number of clusters from elbow curve), kmodes
Metrics: seqeval (NER, POS tagging)
String match: jellyfish (perform string and phonetic comparison), flashtext (superfast extract and replace keywords), pythonverbalexpressions (verbally describe regex), commonregex (readymade regex for email/phone etc), phrase-seeker, textsearch
Sentiment: vaderSentiment (rule based)
Aspect Based Sentiment Analysis: absa
Text distances: textdistance, editdistance, word-mover-distance, wmd-relax (word mover distance for spacy)
PII removal: scrubadub
Profanity detection: profanity-check
Visualization: stylecloud (wordclouds), scattertext
Fuzzy Search: fuzzywuzzy
Named Entity Recognition (NER): spaCy, Stanford NER, sklearn-crfsuite, med7 (spacy NER for medical records)
Fill blanks: fitbert
Dictionary: vocabulary
Nearest neighbor: faiss
Sentence Segmentation: nnsplit
Knowledge Distillation: textbrewer
Sentence Coherence: lm-scorer
Record Linking: fuzzymatcher
Markov chains: markovify
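
For readers unfamiliar with the idea behind the Markov chains entry, here is a minimal standard-library sketch of a word-level chain. It is an illustration only and does not use or mirror the markovify API; the names `build_chain` and `generate` are hypothetical.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words observed directly after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, max_words=10, seed=None):
    """Walk the chain from `start`, sampling one observed successor per step."""
    rng = random.Random(seed)
    output = [start]
    # Stop at the length limit or when the last word has no known successor.
    while len(output) < max_words and output[-1] in chain:
        output.append(rng.choice(chain[output[-1]]))
    return " ".join(output)


corpus = "the cat sat on the mat and the cat ran"
chain = build_chain(corpus)
sentence = generate(chain, "the", max_words=6, seed=0)
print(sentence)
```

Repeated successors are kept in the list on purpose, so more frequent transitions are sampled more often.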

Audio

Library: speech_recognition, pyannotate, librosa
Diarization: resemblyzer
Source Separation: spleeter, nussl, open-unmix-pytorch, asteroid

RecSys

Factorization machines (FM), and field-aware factorization machines (FFM): xlearn, DeepCTR
Collaborative Filtering: implicit
Scikit-learn like API: surprise
Recommendation System in Pytorch: CaseRecommender
Apriori algorithm: apyori
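
As a rough illustration of what the Apriori entry refers to, the sketch below mines frequent itemsets with the standard library only. It is not the apyori API; the function name `frequent_itemsets` and the sample baskets are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain the whole itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    result = {}
    k = 1
    while current:
        for s in current:
            result[s] = support(s)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result


baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
itemsets = frequent_itemsets(baskets, min_support=0.5)
for itemset, value in sorted(itemsets.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), value)
```

The prune step is what distinguishes Apriori from brute-force counting: a candidate is discarded without scanning the data if any of its subsets is already known to be infrequent.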

Computer Vision

Image processing: scikit-image, imutils
Segmentation Models in Keras: segmentation_models
Face recognition: face_recognition, face-alignment (find facial landmarks)
GANS: mimicry
Face swapping: faceit, faceit-live, avatarify
Video summarization: videodigest
Semantic search over videos: scoper
OCR: keras-ocr, pytesseract
Object detection: luminoth, detectron2
Image hashing: ImageHash

Timeseries

Predict Time Series: prophet, atspy (automated time-series models), tslearn, pyts, seglearn, cesium, stumpy
Scikit-learn like API: sktime
ARIMA models: pmdarima

Framework extensions

Pytorch: Keras like summary for pytorch, skorch (wrap pytorch in scikit-learn compatible API), catalyst
Einstein notation: einops, kornia, torchcontrib (recent paper ideas)
Scikit-learn: scikit-lego, iterstrat (cross-validation for multi-label data), iterative-stratification, tscv (time series cross-validation)
Keras: keras-radam, larq (binarized neural networks), ktrain (fastai like interface for keras), tavolo (useful techniques from kaggle as utilities), tensorboardcolab (make tensorboard work in colab), tf-sha-rnn
Tensorflow: tensorflow-addons

Phase: Validation

Model Training Monitoring

Learning curve: lrcurve (plot realtime learning curve in Keras), livelossplot
Notifications: knockknock (get notified by slack/email), jupyter-notify (notify when task is completed in jupyter)
Progress bar: fastprogress, tqdm
GPU Usage: gpumonitor, jupyterlab-nvdashboard (see gpu usage in jupyterlab)

Interpretability

Visualize keras models: keras-vis
Interpret models: eli5, lime, shap, alibi, tf-explain, treeinterpreter, pybreakdown, xai, lofo-importance, interpretML
Interpret BERT: exbert
Interpret word2vec: word2viz, whatlies

Phase: Optimization

Hyperparameter Optimization

Keras: keras-tuner
Scikit-learn: sklearn-deap (evolutionary algorithm for hyperparameter search), hyperopt-sklearn
General: hyperopt, optuna, evol, talos
Parameter optimization: ParameterImportance

Visualization

Visualization libraries: pygal, plotly, plotnine
Interactive charts: bokeh
Visualization for scikit-learn: yellowbrick, scikit-plot
XKCD like charts: chart.xkcd
Convert matplotlib charts to D3 charts: mpld3
Generate graphs using markdown: mermaid
Visualize topics models: pyldavis
High dimensional visualization: umap
Visualize architectures: netron, nn-svg
Activation maps for keras: keract
Create interactive charts online: flourish-studio
Color Schemes: open-color, mplcyberpunk (cyberpunk style for matplotlib)
Bar chart race animation: bar_chart_race

Phase: Production

Model Serialization

Transpiling: sklearn-porter (transpile sklearn model to C, Java, JavaScript and others), m2cgen
Pickling extended: cloudpickle, jsonpickle

Scalability

Parallelize Pandas: pandarallel, swifter, modin
Parallelize numpy operations: numba
Distributed training: horovod

Benchmark

Profile pytorch layers: torchprof
Load testing: k6
Monitor GPU usage: nvtop

API

Configuration Management: config, python-decouple
Data Validation: schema, jsonschema, cerberus, pydantic, marshmallow, validators
Enable CORS in Flask: flask-cors
Caching: cachetools, cachew (cache to local sqlite)
Authentication: pyjwt (JWT)
Task Queue: rq, schedule, huey
Database: flask-sqlalchemy, tinydb, flask-pymongo
Logging: loguru

Dashboard

Generate frontend with python: streamlit

Adversarial testing

Generate images to fool model: foolbox
Generate phrases to fool NLP models: triggers
General: cleverhans

Python libraries

Datetime compatible API for Bikram Sambat: nepali-date
Decorators: retrying (retry some function)
Bloom filter: python-bloomfilter
Run python libraries in sandbox: pipx
Pretty print tables in CLI: tabulate
Leaflet maps from python: folium
Debugging: PySnooper
Date and Time: pendulum
Create interactive prompts: prompt-toolkit
Concurrent database: pickleshare
Async: tomorrow
Testing: crosshair (find failure cases for functions)
CLI tools: gitjk (undo what you just did in git)
Virtual webcam: pyfakewebcam
CLI Formatting: rich
Control mouse and output device: pynput
Shell commands as functions: sh
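
To show the pattern that the Decorators entry above points at, here is a minimal standard-library sketch of a retry decorator. It is an assumption-laden illustration, not the retrying package's API; the name `retry` and its parameters `max_attempts`, `delay_seconds`, and `exceptions` are hypothetical.

```python
import functools
import time


def retry(max_attempts=3, delay_seconds=0.0, exceptions=(Exception,)):
    """Re-run the wrapped function until it succeeds or attempts run out."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    # Re-raise the last error once the attempt budget is spent.
                    if attempt == max_attempts:
                        raise
                    time.sleep(delay_seconds)
        return wrapper
    return decorator


calls = {"count": 0}


@retry(max_attempts=3)
def flaky():
    """Fail on the first two calls, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("transient failure")
    return "ok"


result = flaky()
print(result, calls["count"])
```

A production-grade library would typically add backoff strategies and logging; this sketch keeps a fixed delay for brevity.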

Workflow

ripgrep
