Practical Machine Learning and Natural Language Processing with examples.
- Interesting applications of ML, NLP, and Computer Vision
- Practical demonstration notebooks
- Reproducible experiments
- Illustrated best practices:
  - Code extracted from notebooks for:
    - Automatic formatting with Black
    - Type checking via MyPy annotations
    - Linting via Pylint
    - Doctests whenever possible
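As a minimal sketch of how these practices look together, here is a small helper with type annotations and a doctest. The function `word_frequencies` is hypothetical and written only for illustration; it is not taken from this repo's code.

```python
from collections import Counter
from typing import Dict, List


def word_frequencies(tokens: List[str]) -> Dict[str, int]:
    """Count how often each token occurs, in first-seen order.

    >>> word_frequencies(["arma", "virumque", "arma"])
    {'arma': 2, 'virumque': 1}
    """
    # Counter keeps insertion order, so the doctest output is stable.
    return dict(Counter(tokens))


if __name__ == "__main__":
    import doctest

    # Runs every docstring example in this module as a test.
    doctest.testmod()
```

A file like this can be formatted with `black`, checked with `mypy`, and linted with `pylint`, and its docstring example runs under `python -m doctest`.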
Download this repo with git, making sure the submodules are fetched as well, e.g.:
git pull --recurse-submodules
Submodules pull in some of the data, along with external data processing utilities that we'll use to preprocess some of it.
mkdir p3
`which python3` -m venv ./p3
source setPythonHashSeed.sh
source p3/bin/activate
pip install -r requirements.txt
pip install -r requirements-dev.txt
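The script name `setPythonHashSeed.sh` suggests that it sets the `PYTHONHASHSEED` environment variable; that is an assumption, since the script's contents are not shown here. Under that assumption, a small check along these lines could confirm that string hashing is reproducible before running an experiment. The helper `hash_seed_is_fixed` is hypothetical and not part of this repo.

```python
import os
from typing import Mapping, Optional


def hash_seed_is_fixed(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when PYTHONHASHSEED is set to a fixed integer.

    >>> hash_seed_is_fixed({"PYTHONHASHSEED": "0"})
    True
    >>> hash_seed_is_fixed({"PYTHONHASHSEED": "random"})
    False
    >>> hash_seed_is_fixed({})
    False
    """
    # Fall back to the real process environment when no mapping is given.
    env = os.environ if env is None else env
    # An unset variable behaves like "random": hashes differ between runs.
    value = env.get("PYTHONHASHSEED", "random")
    return value.isdigit()
```

The variable is read when the interpreter starts, so it has to be set in the shell before Python is launched; setting it from inside a running notebook has no effect on that process.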
Many notebooks use data that must be installed first; do so by running the install script:
install_corpora.sh
- installs Python ssl certificates
- installs CLTK data for Latin and Greek
- installs NLTK data
./runUnitTests.sh
jupyter notebook
- Labeling occupation data with Wikipedia and GoogleNews
- Correcting GoogleNews labels with Cleanlab
- Training to label with BERT and Cleanlab
- Assessing Corpus Quality
- Making a Frequency Distribution
- Making a Word Trie Probability Model
- Word and Sentence Probability using BERT
- Comparing Collocation Extraction Methodologies
- Making a Frequency Distribution of Transliterated Greek
- Boosting Training Data
- The Problem of Loanwords, and a Solution
- Feature Engineering with the Loanwords matrix
- Detecting Loanwords with Keras
- English Wikipedia Corpus Cleaning
- English Wikipedia Corpus Processing
- Latin Corpus Processing
- Downsample or not
- Generating an English Wikipedia word vector
- Generating a Latin word vector
- The Case for Using an Embedding Encoder
- Sentence Embeddings - A simple but effective baseline - using Seneca
- Object detection as a multivariable regression using a custom Convnet
- Assessing the Noisy Circle detector
- A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification by Ye Zhang, Byron Wallace
- Word2vec applied to Recommendation: Hyperparameters Matter by Hugo Caselles-Dupré, Florian Lesaint, Jimena Royo-Letelier
- Exploiting Similarities among Languages for Machine Translation by Tomas Mikolov, Quoc V. Le, Ilya Sutskever
- Distributed Representations of Words and Phrases and their Compositionality by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean
- Deep Learning with Python by Francois Chollet
- Mining Massive Datasets
- Chris McCormick MinHash Tutorial with Python Code
- Convolutional Neural Networks for Text Classification by David S. Batista
- Convolutional Neural Networks for Sentence Classification by Yoon Kim
- The Unreasonable Effectiveness of Transformer Language Models in Grammatical Error Correction by Dimitris Alikaniotis, Vipul Raheja
- BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model by Alex Wang, Kyunghyun Cho
- An Exploration of Word Embedding Initialization in Deep-Learning Tasks by Tom Kocmi, Ondřej Bojar