A curated list of Python libraries used for data science.

MIT License

Awesome Python Data Science

Contents

Machine Learning Frameworks

  • scikit-learn - Machine learning.
  • CatBoost - Gradient boosting library with categorical features support.
  • LightGBM - Fast, distributed, high performance gradient boosting.
  • XGBoost - Scalable, portable, and distributed gradient boosting.
  • PyMC - Probabilistic Programming.
  • statsmodels - Statistical modeling and econometrics.
  • SymPy - A computer algebra system.
  • NetworkX - Creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
  • dask-ml - Distributed and parallel machine learning.
  • imbalanced-learn - Undersampling and oversampling for imbalanced datasets.
  • lightning - Large-scale linear models.
  • scikit-optimize - Sequential model-based optimization with a scipy.optimize interface.
  • BayesianOptimization - Global optimization with gaussian processes.
  • gplearn - Genetic Programming.
  • python-glmnet - glmnet package for fitting generalized linear models.
  • hmmlearn - Hidden Markov Models.
  • vecstack - Stacking (machine learning technique).
  • modAL - Modular Active Learning framework.
  • deap - Evolutionary computation framework.
  • pyro - Deep universal probabilistic programming with PyTorch.
  • civisml-extensions - scikit-learn-compatible estimators from Civis Analytics.
  • hyperopt-sklearn - Hyper-parameter optimization for sklearn.
  • scikit-survival - Survival analysis built on top of scikit-learn.
  • dstoolbox - Tools that make working with scikit-learn and pandas easier.
  • modin - Unify the way you interact with your data.
  • pyomo - Python Optimization MOdels.
  • BAMBI - BAyesian Model-Building Interface.
  • combo - A Python Toolbox for Machine Learning Model Combination.
  • fastai - The fast.ai deep learning library, lessons, and tutorials.
  • pycaret - Low-code machine learning library in Python.
  • river - River is a Python library for online machine learning.

Scientific

  • NumPy - A fundamental package for scientific computing with Python.
  • SciPy - A Python-based ecosystem of open-source software for mathematics, science, and engineering.
  • Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools.
  • Numba - NumPy aware dynamic Python compiler using LLVM.
  • blaze - NumPy and Pandas for databases.
  • astropy - Astronomy and astrophysics.
  • Biopython - Biological computation.
  • PyDy - Multibody Dynamics.
  • nilearn - NeuroImaging.
  • patsy - Describing statistical models using symbolic formulas.
  • numexpr - Fast numerical array expression evaluator.
  • dask - Parallel computing with task scheduling.
  • or-tools - Google's Operations Research tools. Classical CS algorithms.
  • cvxpy - Python-embedded modeling language for convex optimization problems.

Outlier Detection

  • PyOD - Versatile Python library for detecting anomalies in multivariate data.
  • DeepOD - Deep learning-based outlier/anomaly detection.

Deep Learning Frameworks

  • Tensorflow - DL Framework.
  • PyTorch - DL Framework.
  • Keras - High-level neural networks API.
  • tensorlayer - A Deep Learning and Reinforcement Learning Library for Researchers and Engineers.
  • mxnet - Apache MXNet: A flexible and efficient library for deep learning.

Deep Learning Tools

  • TorchDrift - TorchDrift is a data and concept drift library for PyTorch.
  • Edward - Probabilistic programming language in TensorFlow.
  • pomegranate - Probabilistic modelling.
  • skorch - Scikit-learn compatible neural network library that wraps PyTorch.
  • DLTK - Deep Learning Toolkit for Medical Image Analysis.
  • sonnet - TensorFlow-based neural network library.
  • rasa_core - Dialogue engine.
  • luminoth - Computer Vision.
  • allennlp - NLP Research library.
  • spotlight - Pytorch Recommender framework.
  • tensorforce - TensorFlow library for applied reinforcement learning.
  • tensorboard-pytorch - Tensorboard for pytorch.
  • keras-vis - Neural network visualization toolkit for keras.
  • hyperas - Keras + Hyperopt.
  • spaCy - Natural Language processing.
  • tensorboard_logger - Log TensorBoard events without touching TensorFlow.
  • foolbox - Python toolbox to create adversarial examples that fool neural networks.
  • pytorch/vision - Datasets, Transforms and Models specific to Computer Vision.
  • gluon-nlp - NLP made easy.
  • pytorch/ignite - High-level library to help with training neural networks in PyTorch.
  • Netron - Visualizer for deep learning and machine learning models.
  • gpytorch - A highly efficient and modular implementation of Gaussian Processes in PyTorch.
  • tensorly - Tensor Learning in Python.
  • einops - Deep learning operations reinvented.
  • hiddenlayer - Neural network graphs and training metrics for PyTorch, Tensorflow, and Keras.
  • segmentation_models.pytorch - Segmentation models with pretrained backbones.
  • pytorch-lightning - The lightweight PyTorch wrapper.
  • lightly - Lightly is a computer vision framework for self-supervised learning.

Deep Learning Projects

  • fairseq - Sequence-to-Sequence Toolkit.
  • tensorflow-wavenet - DeepMind's WaveNet.
  • DeepRecommender - Recommender systems.
  • DrQA - Reading Wikipedia to Answer Open-Domain Questions.
  • vqa.pytorch - Visual Question Answering in Pytorch.
  • Half-Life Regression - Model for spaced repetition practice.
  • learning-to-learn - Learning to Learn in Tensorflow.
  • capsule-networks - A PyTorch implementation of the NIPS 2017 paper "Dynamic Routing Between Capsules".
  • Mask_RCNN - Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow.
  • lightnet - Bringing pjreddie's DarkNet out of the shadows.
  • pytorch-openai-transformer-lm - OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI.
  • maskrcnn-benchmark - Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch.
  • LovaszSoftmax - Lovász-Softmax loss.
  • ludwig - Ludwig is a toolbox built on top of TensorFlow that allows you to train and test deep learning models without the need to write code.

Visualization

  • Great Tables - Absolutely Delightful Table-making in Python.
  • PyGWalker - Turns pandas and polars dataframes into a Tableau-like user interface for visual exploration.
  • diagrams - Diagrams lets you draw the cloud system architecture in Python code.
  • matplotlib - 2D plotting.
  • seaborn - Visualization library.
  • bokeh - Interactive web plotting.
  • plotly - Collaborative web plotting.
  • dash - Interactive Web plotting.
  • altair - Declarative statistical visualization.
  • folium - Leaflet.js Maps.
  • geoplot - High-level geospatial data visualization.
  • datashader - Graphics pipeline system.
  • mplleaflet - Converts Matplotlib plots into interactive Leaflet web maps.
  • matplotlib-venn - Area-weighted venn-diagrams.
  • pyLDAvis - Interactive topic model visualization.
  • cufflinks - Productivity Tools for Plotly + Pandas.
  • scatterText - Visualizations of how language differs among document types.
  • plotnine - ggplot for python.
  • mizani - scales package.
  • bqplot - Plotting library for IPython/Jupyter Notebooks.
  • PtitPrince - Raincloud plots.
  • joypy - Ridgeline plots.
  • dtreeviz - Decision tree visualization and model interpretation.
  • ipyvolume - 3d plotting for Python in the Jupyter notebook based on IPython widgets using WebGL.

AutoML

  • Nevergrad - Gradient-free optimization.
  • featuretools - Automated feature engineering.
  • auto-sklearn - Automated machine learning.
  • tpot - Automated machine learning.
  • auto_ml - Automated machine learning.
  • MLBox - Automated Machine Learning python library.
  • devol - Automated deep neural network design via genetic programming.
  • skll - SciKit-Learn Laboratory (SKLL) makes it easy to run machine learning experiments.
  • autokeras - Automated machine learning in Keras.
  • SMAC3 - Sequential Model-based Algorithm Configuration.

Exploration

  • mlxtend - A library of extension and helper modules for Python's data analysis and machine learning libraries.
  • yellowbrick - Visual analysis and diagnostic tools.
  • pandas-profiling - Profiling reports for pandas DataFrame objects.
  • Skater - Model Agnostic Interpretation.
  • Dora - Exploratory data analysis.
  • sklearn-evaluation - scikit-learn model evaluation.
  • fitter - Simple class to identify the distribution from which a data sample is generated.
  • missingno - Missing data visualization.
  • hypertools - Gaining geometric insights into high-dimensional data.
  • scikit-plot - Plotting functionality to scikit-learn objects.
  • elih - Explain Machine Learning.
  • kmeans_smote - Oversampling for imbalanced learning based on k-means and SMOTE.
  • pyUpSet - UpSet suite of visualisation methods.
  • lime - Explaining the predictions of any machine learning classifier.
  • pandas-summary - An extension to pandas dataframes describe function.
  • SauceCat/PDPbox - Partial dependence plot toolbox.
  • shap - A unified approach to explain the output of any machine learning model.
  • eli5 - Debug machine learning classifiers and explain their predictions.
  • rfpimp - Permutation and drop-column importance for scikit-learn random forests.
  • pypeln - Concurrent data pipelines made easy.
  • pycm - Multi-class confusion matrix library in Python.
  • great_expectations - Always know what to expect from your data.
  • alibi - Algorithms for monitoring and explaining machine learning models.
  • InterpretML - Fit interpretable models. Explain blackbox machine learning.
  • cleanlab - Finding label errors in datasets and learning with noisy labels.
  • dtale - Flask/React client for visualizing pandas data structures.
  • dabl - Data Analysis Baseline Library.
  • XAI - An eXplainability toolbox for machine learning.
  • explainerdashboard - This package makes it convenient to quickly deploy a dashboard web app that explains the workings of a (scikit-learn compatible) machine learning model.
  • alibi-detect - Open source Python library focused on outlier, adversarial and drift detection. The package aims to cover both online and offline detectors for tabular data, text, images and time series.

Feature Extraction

General Feature Extraction

  • sklearn-pandas - Pandas integration with sklearn.
  • pdpipe - Easy pipelines for pandas DataFrames.
  • engarde - Defensive data analysis.
  • datacleaner - Tool that automatically cleans data sets and readies them for analysis.
  • categorical-encoding - sklearn compatible categorical variable encoders.
  • fancyimpute - Multivariate imputation and matrix completion algorithms.
  • raccoon - DataFrame with fast insert and appends.
  • kmodes - k-modes and k-prototypes clustering algorithm.
  • annoy - Approximate Nearest Neighbors.
  • scikit-feature - Filter methods for feature selection.
  • mifs - Parallelized Mutual Information based Feature Selection module.
  • skggm - Scikit-learn compatible estimation of general graphical models.
  • dirty_cat - Encoding methods for dirty categorical variables.
  • Impyute - Data imputations library to preprocess datasets with missing data.
  • eif - Extended Isolation Forest for Anomaly Detection.
  • featexp - Feature exploration for supervised learning.
  • feature_engine - Feature engineering package with sklearn like functionality.
  • stumpy - STUMPY is a powerful and scalable Python library that can be used for a variety of time series data mining tasks.
  • n2 - Lightweight approximate Nearest Neighbor library which runs faster even with large datasets.
  • compressio - Compressio provides lossless in-memory compression of pandas DataFrames and Series.

Time Series

  • Merlion - A Machine Learning Library for Time Series.
  • Darts - A Python library for easy manipulation and forecasting of time series.
  • Greykite - A flexible, intuitive, and fast forecasting library.
  • Causality - Causal analysis.
  • traces - Unevenly-spaced time series analysis.
  • PyFlux - Time series library for Python.
  • prophet - Tool for producing high quality forecasts.
  • tsfresh - Automatic extraction of relevant features from time series.
  • tslearn - Machine learning toolkit dedicated to time-series data.
  • pyts - A Python package for time series transformation and classification.
  • sktime - A scikit-learn compatible Python toolbox for learning with time series data.
  • stumpy - Matrix profiles.
  • luminaire - ML driven solutions for monitoring time series data.
  • NeuralProphet - A Neural Network based Time-Series model, inspired by Facebook Prophet and AR-Net, built on PyTorch.

Audio

  • python_speech_features - Speech features.
  • speechpy - A Library for Speech Processing and Recognition.
  • magenta - Music and Art Generation with Machine Intelligence.
  • librosa - Audio and music analysis.
  • pydub - Manipulate audio with a simple and easy high level interface.
  • pytorch/audio - Simple audio I/O for PyTorch.

Images and Video

Geolocation

Text/NLP

  • wordfreq - Library for looking up the frequencies of words in many languages, based on many sources of data.
  • BlingFire - A lightning fast Finite State machine and REgular expression manipulation library.
  • BERT-pytorch - Google AI 2018 BERT pytorch implementation.
  • pytorch-pretrained-BERT - PyTorch version of Google AI's BERT model with script to load Google's pre-trained models.
  • gensim - Topic Modeling.
  • pattern - Web mining module.
  • probablepeople - Parsing unstructured western names into name components.
  • Expynent - Regular expression patterns.
  • mimesis - Generate synthetic data.
  • pyenchant - Spell checking.
  • parserator - Domain-specific probabilistic parsers.
  • scrubadub - Clean personally identifiable information from dirty dirty text.
  • usaddress - Parsing unstructured address strings into address components.
  • python-phonenumbers - Python port of Google's libphonenumber.
  • jellyfish - Approximate and phonetic matching of strings.
  • preprocessing - Simple interface for the CMU Pronouncing Dictionary.
  • langid - Stand-alone language identification system.
  • fuzzywuzzy - Fuzzy String Matching.
  • Fuzzy - Soundex, NYSIIS, Double Metaphone.
  • snowball - Snowball compiler and stemming algorithms.
  • leven - Levenshtein edit distance.
  • flashtext - Extract Keywords from sentence or Replace keywords in sentences.
  • polyglot - Multilingual text NLP processing toolkit.
  • sentencepiece - Unsupervised text tokenizer for Neural Network-based text generation.
  • pyfasttext - Binding for fastText.
  • python-wordsegment - English word segmentation.
  • pyahocorasick - Exact or approximate multi-pattern string search.
  • Wordbatch - Parallel text feature extraction for machine learning.
  • langdetect - Port of Google's language-detection library.
  • translation - Uses web services for text translation.
  • nltk - Natural Language Toolkit.
  • unidecode - ASCII transliterations of Unicode text.
  • pytorch/text - Data loaders and abstractions for text and NLP.
  • textdistance - Compute distance between sequences.
  • sent2vec - General purpose unsupervised sentence representations.
  • pyhunspell - Python bindings for the Hunspell spellchecker engine.
  • facebook/fastText - Library for fast text representation and classification.
  • textblob - Simple, Pythonic text processing: sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
  • facebook/InferSent - Sentence embeddings (InferSent) and training code for NLI.
  • nmslib - Non-Metric Space Library.
  • ftfy - Fixes mojibake and other glitches in Unicode text, after the fact.
  • fletcher - Pandas ExtensionDType/Array backed by Apache Arrow.
  • textacy - NLP, before and after spaCy.
  • hmtl - Hierarchical Multi-Task Learning - A State-of-the-Art neural network model for several NLP tasks based on PyTorch and AllenNLP.
  • pytext - A natural language modeling framework based on PyTorch.
  • flair - A very simple framework for state-of-the-art Natural Language Processing.
  • LASER - Language-Agnostic SEntence Representations.
  • transformer-xl - Attentive Language Models Beyond a Fixed-Length Context.
  • textstat - Calculate readability statistics of a text object - paragraphs, sentences, articles.
  • nlpaug - Augmenting nlp for your machine learning projects.
  • sumy - Automatic summarization of text documents and HTML.
  • textract - Extract text from any document.
  • newspaper - News extraction, article extraction and content curation.

Ranking/Recommender

  • recommenders - Examples and best practices for building recommendation systems.
  • Surprise - Analyzing recommender systems.
  • trueskill - TrueSkill rating system.
  • LightFM - Hybrid recommendation algorithm.
  • implicit - Collaborative Filtering for Implicit Datasets.

Trading

  • Clairvoyant - Identify and monitor social/historical cues.
  • zipline - Algorithmic Trading Library.
  • qstrader - Advanced Trading Infrastructure.

Misc

  • mmh3 - MurmurHash3, a set of fast and robust hash functions.
  • fbpca - Fast Randomized PCA/SVD.
  • annoy - Approximate Nearest Neighbors.
  • pipeline - Standard Runtime For Every Real-Time Machine Learning.
  • crayon - A language-agnostic interface to TensorBoard.
  • faiss - A library for efficient similarity search and clustering of dense vectors.
  • pyod - Comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data.

Deployment

  • evidently - Evidently helps evaluate machine learning models during validation and monitor them in production.
  • onnx - Open Neural Network Exchange.
  • lore - Lore makes machine learning approachable for Software Engineers and maintainable for Machine Learning Researchers.
  • kubeflow - Machine Learning Toolkit for Kubernetes.
  • airflow - ETL.
  • mlflow - Open source platform for the complete machine learning lifecycle.
  • sklearn-porter - Transpile trained scikit-learn estimators.
  • sklearn-compiledtrees - Compiled Decision Trees for scikit-learn.

Profiling

  • mem_usage_ui - Measuring and graphing memory usage of local processes.
  • viztracer - VizTracer is a low-overhead logging/debugging/profiling tool that can trace and visualize your python code execution.
  • py-spy - Sampling profiler for Python programs.
  • memory_profiler - Monitor memory usage of a Python program.
  • line_profiler - Line-by-line profiling.
  • filprofiler - Fil is a memory profiler designed for data processing applications.
  • scalene - High-performance CPU and memory profiler for Python.
  • python-flamegraph - Statistical profiler which outputs in format suitable for FlameGraph.

Python Tools

  • Typer - Build CLIs with type hints.
  • hydra - Framework for elegantly configuring complex applications.
  • neurtu - A Python package for parametric benchmarks.
  • pyprojroot - Finding project directories in Python.
  • datasette - An open source multi-tool for exploring and publishing data.
  • delorean - Time Travel Made Easy.
  • pip-tools - Keeps dependencies up to date.
  • devpi - PyPI server and packaging/testing/release tool.
  • Jupyter Notebook - Notebooks are awesome.
  • click - CLI package.
  • sacredboard - Dashboard for sacred.
  • sacred - Reproduce computational experiments.
  • magic-wormhole - Get things from one computer to another, safely.

Data Gathering

  • gain - Web crawling framework based on asyncio.
  • MechanicalSoup - A Python library for automating interaction with websites.
  • camelot - Camelot: PDF Table Extraction for Humans.
  • Pandarallel - Parallel pandas.
  • great_expectations - Framework that helps teams save time and promote analytic integrity with a new twist on automated testing: pipeline tests.
  • parse - Parse strings using a specification based on the Python format() syntax.
  • CleverCSV - A Python package for handling messy CSV files.