Collecting my thoughts surrounding NLP (specifically related to the coupon project).
- Important repos/packages/containers
- Helpful tutorials/blog posts/videos/slide decks
- Important concepts
- Questions I would like to have answered
- Publications worth reading
- Hugging Face, obviously
- OSCAR - massive multilingual text dump (web-crawled corpus)
- Sentence-level transformers
- Hugging Face Tutorial - a helpful tutorial to get started, though it's lacking key parts, such as how to build the config files, how to integrate the new classes into the all-important run_language_modeling.py file, etc. This GitHub issue points out many of the problems and links to this far more helpful tutorial.
- Comprehensive Hugging Face slide deck - this slide deck accompanies this video, which talks broadly about transfer learning and Hugging Face's use of it.
- Repo A and Repo B, intended to help keep track of current advances in NLP. While these are repos, they're not software, which is why I'm listing them here.
- Discussion of Hugging Face papers
- DistilBERT blog post
- Understanding emojis
- Training models on GPUs
- Tokenizer summary
- Jay Alammar blog - incredibly helpful blog visualizing many key concepts of NLP and MLMs.
- Approachable introduction to BERT
The problem I'm primarily facing is one of Named Entity Recognition, where I need to determine the part of speech of noisy coupon descriptions. Likely because these are not complete sentences, basic NER functionality has not been successful. For this reason, I'm likely going to have to fine-tune a model myself using run_ner.py. Using pipelines may also work; a rough sketch of the pipeline approach is below.
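A minimal sketch of the pipeline approach, assuming the transformers library; the coupon string is made up, and the pipeline's default model (trained on clean, complete sentences) would likely need to be swapped for one fine-tuned on coupon-style text:

```python
from transformers import pipeline

# Off-the-shelf NER pipeline; its default model expects clean, complete
# sentences, so results on noisy coupon text may be poor.
ner = pipeline("ner")

# Hypothetical noisy coupon description
for entity in ner("SAVE $1.50 on any TWO Kellogg's cereals 10.5oz or larger"):
    print(entity["word"], entity["entity"], round(entity["score"], 3))
```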
An alternative to this would be to use sentence-level embeddings instead, which may be better at classifying/clustering different coupon concepts, e.g. batch-binning coupons using the top ~100 words and using those bins to fine-tune the model. A rough clustering sketch follows.
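That rough sketch, assuming the sentence-transformers and scikit-learn packages; the model name, sample coupons, and cluster count are all placeholders:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Hypothetical coupon descriptions
coupons = [
    "save $1.00 on any two boxes of cereal",
    "$2 off frozen pizza 20oz or larger",
    "buy one get one free shampoo",
    "free toothpaste with purchase of mouthwash",
]

# Placeholder pretrained model; any sentence-embedding model would do here
model = SentenceTransformer("bert-base-nli-mean-tokens")
embeddings = model.encode(coupons)

# Cluster the embeddings into coupon "concepts"; the cluster count is a guess
labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
print(list(zip(coupons, labels)))
```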
Starting with the tokenizer to see if we can successfully split the comments up into digestible chunks; a quick check is sketched below.
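A quick check along those lines, assuming the transformers library; the pretrained tokenizer and coupon string are placeholders:

```python
from transformers import AutoTokenizer

# Placeholder pretrained tokenizer; a custom-trained tokenizer may handle
# coupon-specific vocabulary (brand names, sizes, "BOGO") better.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Inspect how a noisy description gets split into subword chunks
print(tokenizer.tokenize("BOGO 50% off Tide PODS 42ct"))
```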
Interestingly, this paper found that BERT was undertrained, and that training it further (resulting in RoBERTa) got great results. Consider using RoBERTa instead of BERT.
- Sentence-BERT - A method of generating sentence embeddings in which sentences with similar meaning will be close together in high-dimensional space.
- Simply using CLS tags is remarkably effective. Specifically, this paper explains that the token representations can be used for token-level tasks such as question answering and sequence tagging, whereas the CLS tag can be used for things like classification (see the sketch after this list).
- bert-as-service - A far more popular (stars-wise) method of generating sentence-level embeddings than Sentence-BERT.
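To make the CLS point above concrete, here is a minimal sketch of pulling the [CLS] representation out of a plain BERT model with transformers; the model name and input string are placeholders, and this is just one way to get a sentence-level vector:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("$2 off any two boxes of cereal", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# First element is the last hidden state, shape (batch, seq_len, hidden_size);
# position 0 along the sequence axis is the [CLS] token, which can feed a
# classification head or serve as a crude sentence embedding.
last_hidden_state = outputs[0]
cls_embedding = last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```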
- When following this tutorial, why does this work:
RobertaTokenizer.from_pretrained("./dir_with_tokenizer_model")
but this doesn't:
DistilBertTokenizer.from_pretrained("./dir_with_tokenizer_model")
Using DistilBertTokenizer errors out, saying that the tokenizer is not among the models on Hugging Face and that it requires a vocab.txt file, whereas the work in the tutorial only generates the vocab.json and merges.txt files (helpfully explained here). What's the difference between this tokenizer initialization and this one?
- Looking at the ByteLevelBPETokenizer class, it becomes apparent that all of the items explicitly stated here are simply handled under the hood when initializing ByteLevelBPETokenizer with a couple of extra flags (a rough sketch is at the end of this list).
- Black sheep paper - Paper discussing reporting bias, in which we don't talk about the obvious, thus preventing models from learning basic facts (e.g. that sheep are generally white). Work has been done to address this (e.g. the ERNIE publication). Multi-modal models (incorporating images and other media) can also help correct this.
- Byte Pair Encoding Tokenizer
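Following up on the ByteLevelBPETokenizer note above, here's that rough sketch, assuming the tokenizers and transformers packages; the corpus file, output directory, vocabulary size, and flags are all placeholders:

```python
from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaTokenizer

# The "extra flags" are handled directly by the class (e.g. lowercasing)
tokenizer = ByteLevelBPETokenizer(lowercase=True)

# "coupon_descriptions.txt" is a hypothetical file, one description per line
tokenizer.train(
    files=["coupon_descriptions.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt -- the two files that
# RobertaTokenizer.from_pretrained() knows how to load
tokenizer.save_model("./dir_with_tokenizer_model")
roberta_tok = RobertaTokenizer.from_pretrained("./dir_with_tokenizer_model")
```

Since DistilBERT's tokenizer is WordPiece-based and expects a vocab.txt instead, this is presumably why the DistilBertTokenizer load above fails.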