/mydatatoolkit

A toolkit for data scientists to get work done faster, easier, and in a smarter way.

Interesting Open Source Projects for Data Scientists
Author: shaurya

  1. Scheduler for Automation
    a. Airflow (Documentation)

  2. NLP Support
    a. Hugging Face
    b. Simple Representations: An easy-to-use library for extracting text representations, built on the Transformers library.
    c. Simple Transformers: (Website) Transformers for Classification, NER, QA, Language Modelling, Language Generation, T5, Multimodal, and Conversational AI
    d. Facebook FastText: Library for fast text representation and classification.
    e. Bert-as-service: Maps a variable-length sentence to a fixed-length vector using a BERT model
    f. spaCy
    g. TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
    h. eXtreme classification: Multi-label classification with label sets as large as 10M labels.
    i. TextHero: A Python package for working with text data efficiently. It gives NLP developers a tool to quickly understand any text-based dataset and provides a solid pipeline to clean and represent text data, from zero to hero.
    j. Gensim: A library for topic modeling, document indexing, and similarity retrieval with large corpora. All algorithms in Gensim are memory-independent with respect to the corpus size, so it can process input larger than RAM.
    k. Adapterhub
    l. KTrain: Zero-shot text classification
    m. DeText: A Deep Neural Text Understanding Framework for Ranking and Classification
    n. Spark-NLP
    o. Date Parser: Support for almost every existing date format: absolute dates, relative dates (“two weeks ago” or “tomorrow”), timestamps, etc.
    p. Contractions: Expands contractions, e.g. you’re -> you are
    q. PolyFuzz: Fuzzy string matching, grouping, and evaluation. Supports character-level n-gram TF-IDF, BERT embeddings, edit distance, etc.
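
The contraction-expansion idea from the list above can be sketched with the standard library alone. This is a minimal illustration, not the API of any listed package: the `CONTRACTIONS` table and the `expand_contractions` function are hypothetical names, and the mapping is deliberately tiny and ignores ambiguous cases (for example, "it's" can also mean "it has").

```python
import re

# Hypothetical, deliberately small mapping; real libraries ship far larger tables.
CONTRACTIONS = {
    "you're": "you are",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "I am",
}

# One alternation pattern; \b anchors avoid matching inside longer words.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace known contractions, normalising curly apostrophes first."""
    text = text.replace("\u2019", "'")

    def _swap(match):
        original = match.group(0)
        expanded = CONTRACTIONS[original.lower()]
        # Keep a leading capital so sentence starts stay capitalised.
        if original[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(_swap, text)
```

Calling `expand_contractions("You're late")` walks the text once and swaps each match through the lookup table; a dedicated library is still the better choice for real corpora.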

  3. Managing Data Science Projects
    a. DVC: Data Version Control for data science models and large datasets
    b. Metaflow open-sourced by Netflix (Github Link) (Documentation) (Medium Article)
    c. CML: Continuous Machine Learning (CML) is CI/CD for Machine Learning Projects
    d. s3cmd (https://s3tools.org/s3cmd): Command-line tool for uploading, retrieving, and managing data in Amazon S3

  4. Model Inferencing / Exposing End Point
    a. BentoML: Turn a trained ML model into a production API endpoint with a few lines of code
    b. FastAPI: A Python web framework for building APIs, widely used in data science for model inference and often described as a faster alternative to Flask.
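
To make "exposing an endpoint" concrete, here is a minimal sketch of what a prediction endpoint does, written as a plain WSGI callable using only the standard library. It is not BentoML or FastAPI code; those tools handle this plumbing (plus validation, docs, and packaging) for you. The `predict` function is a hypothetical stand-in for a trained model, and the `{"features": [...]}` request shape is an assumption made for illustration.

```python
import json


def predict(features):
    """Hypothetical stand-in for a trained model: returns the mean of the inputs."""
    return sum(features) / len(features)


def app(environ, start_response):
    """Minimal WSGI app: accepts a JSON body {"features": [...]} and returns a prediction."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    raw = environ["wsgi.input"].read(length) or b"{}"
    payload = json.loads(raw)
    features = payload.get("features")
    headers = [("Content-Type", "application/json")]
    if not features:
        # Reject requests that carry no usable input.
        start_response("400 Bad Request", headers)
        return [json.dumps({"error": "features must be a non-empty list"}).encode()]
    body = json.dumps({"prediction": predict(features)}).encode()
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the callable can be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the standard library; a production deployment would use one of the frameworks listed above instead.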

  5. Useful Data Sources
    a. 10,000 Most Common Words based on Google’s Trillion Word Corpus: (Data Source Link) - A good dictionary source for filtering out the most common English words. It offers lists of long, medium, and short words, with or without profanity.
    b. English Profanity Dictionary: English profanity words identified by Google.

  6. Training Deep Learning Models
    a. Ludwig by Uber (Github Link) (Documentation) is a toolbox built on top of TensorFlow that allows us to train and test deep learning models without the need to write code.

  7. Notebooks other than Jupyter
    a. Google Colab (Direct Link) is a Jupyter notebook environment hosted by Google that provides free GPU and TPU runtimes for training models.
    b. Polynote open-sourced by Netflix (Github Link) (Link). Polynote is an experimental polyglot notebook environment. Currently, it supports Scala and Python (with or without Spark), SQL, and Vega.

  8. Extract Text from PDFs
    a. Camelot (Documentation)
    b. Tabula (Github Link)
    c. Tika-Python (Github Link) (Documentation)
    d. TaBERT (Documentation)
    e. Keras-RetinaNet

  9. Miscellaneous
    a. Vowpal Wabbit: A fast, flexible, online, and active learning solution for solving complex interactive machine learning problems.

  10. Finance Dictionary
    a. Finance Dictionary Text
    b. RBI IFSC PARSER

  11. Generate Fake Data
    a. Python Faker Library: Generates dummy data such as names, addresses, locations, bank details, IP addresses, etc.
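
The idea behind a fake-data generator can be sketched with the standard library. This is not the Faker API; the seed lists and the `fake_record` and `fake_records` names are hypothetical, and the account number is just twelve random digits rather than a valid banking format. Seeding the generator makes the output reproducible, which is useful in tests.

```python
import ipaddress
import random

# Hypothetical seed lists; a real generator draws on far larger, locale-aware data.
FIRST_NAMES = ["Asha", "Ravi", "Maria", "John", "Mei"]
LAST_NAMES = ["Sharma", "Patel", "Garcia", "Smith", "Chen"]
CITIES = ["Mumbai", "Delhi", "Madrid", "London", "Shanghai"]


def fake_record(rng: random.Random) -> dict:
    """Build one dummy record from the supplied random generator."""
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "city": rng.choice(CITIES),
        # Any 32-bit integer maps to a syntactically valid IPv4 address.
        "ip_address": str(ipaddress.IPv4Address(rng.getrandbits(32))),
        # Twelve zero-padded random digits; not a real account number format.
        "account_number": f"{rng.randrange(10**12):012d}",
    }


def fake_records(n: int, seed: int = 0) -> list:
    """Return n reproducible dummy records for the given seed."""
    rng = random.Random(seed)
    return [fake_record(rng) for _ in range(n)]
```

Calling `fake_records(3, seed=42)` twice returns the same three records, so fixtures built this way stay stable across runs.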

  12. Ranking Algorithms | Decision Making Algorithms
    a. Learning to Rank with XGBoost/LightGBM
    b. Scikit-Criteria: Multi-criteria decision-making algorithms (MCDA)
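
As a rough illustration of multi-criteria decision making, the sketch below implements the weighted sum model, one of the simplest MCDA methods, in plain Python. It is not the Scikit-Criteria API; the `weighted_sum_rank` function, its arguments, and the laptop data are assumptions made for this example. Each criterion is min-max normalised, cost criteria are flipped so that higher is always better, and the weighted scores are summed.

```python
def weighted_sum_rank(alternatives, weights, maximize):
    """Rank alternatives with the weighted sum model.

    alternatives: dict mapping a name to a list of criterion values
    weights:      one weight per criterion (typically summing to 1)
    maximize:     one bool per criterion, True if higher values are better
    """
    names = list(alternatives)
    scores = dict.fromkeys(names, 0.0)
    for j, weight in enumerate(weights):
        column = [alternatives[name][j] for name in names]
        low, high = min(column), max(column)
        span = high - low
        for name, value in zip(names, column):
            # Min-max normalise to [0, 1]; a constant column contributes equally.
            norm = 0.5 if span == 0 else (value - low) / span
            # Flip cost criteria so that a higher score is always better.
            if not maximize[j]:
                norm = 1.0 - norm
            scores[name] += weight * norm
    # Best alternative first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: [price (lower is better), battery hours (higher is better)]
laptops = {"A": [1000, 10], "B": [800, 6], "C": [1200, 12]}
ranking = weighted_sum_rank(laptops, weights=[0.6, 0.4], maximize=[False, True])
```

With price weighted more heavily than battery life, the cheapest laptop ranks first here; changing the weights changes the order, which is the core trade-off that MCDA methods make explicit.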

  13. Speech / Voice Analytics
    a. Kaldi
    b. PyAudioAnalysis
    c. DeepSpeech: An open-source embedded speech-to-text engine that can run in real time on devices ranging from a Raspberry Pi 4 to high-power GPU servers.
    d. Librosa: Python library for audio and music analysis