Collecting my thoughts surrounding NLP (specifically related to the coupon project).
- Important repos/packages/containers
- Helpful tutorials/blog posts/videos/slide decks
- Important concepts
- Questions I would like to have answered
- Publications worth reading
- Hugging Face, obviously
- OSCAR - massive multilingual text dump (web-crawled corpus)
- Sentence-level transformers
- Hugging Face Tutorial - a helpful tutorial to get started, though it's lacking key parts, such as how to build the config files, how to integrate the new classes into the all-important run_language_modeling.py file, etc. This GitHub issue points out many of the problems and links to this far more helpful tutorial.
- Comprehensive Hugging Face slide deck - this slide deck accompanies this video, which talks broadly about transfer learning and Hugging Face's use of it.
- Repo A and Repo B, intended to help keep track of current advances in NLP. While these are repos, they're not software, which is why I'm listing them here.
- Discussion of Hugging Face papers
- DistilBERT blog post
- Understanding emojis
- Training models on GPUs
- Tokenizer summary
- Jay Alammar blog - incredibly helpful blog visualizing many key concepts of NLP and MLMs.
- Approachable introduction to BERT
The problem I'm primarily facing is one of Named Entity Recognition, where I need to determine the part of speech of noisy coupon descriptions. Likely because these are not complete sentences, basic NER functionality has not been successful. For this reason, I'm likely going to have to fine-tune a model myself using run_ner.py. Using pipelines may also work; a rough sketch of the pipeline approach is below.
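A minimal sketch of the pipeline approach, assuming the transformers library; the coupon string is made up, and the pipeline's default model (trained on clean, complete sentences) would likely need to be swapped for one fine-tuned on coupon-style text:

```python
from transformers import pipeline

# Off-the-shelf NER pipeline; its default model expects clean, complete
# sentences, so results on noisy coupon text may be poor.
ner = pipeline("ner")

# Hypothetical noisy coupon description
for entity in ner("SAVE $1.50 on any TWO Kellogg's cereals 10.5oz or larger"):
    print(entity["word"], entity["entity"], round(entity["score"], 3))
```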
An alternative to this would be to use sentence-level embeddings instead, which may be better at classifying/clustering different coupon concepts, e.g. batch-binning coupons using the top ~100 words and using those bins to fine-tune the model. A rough clustering sketch follows.
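That rough sketch, assuming the sentence-transformers and scikit-learn packages; the model name, sample coupons, and cluster count are all placeholders:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Hypothetical coupon descriptions
coupons = [
    "save $1.00 on any two boxes of cereal",
    "$2 off frozen pizza 20oz or larger",
    "buy one get one free shampoo",
    "free toothpaste with purchase of mouthwash",
]

# Placeholder pretrained model; any sentence-embedding model would do here
model = SentenceTransformer("bert-base-nli-mean-tokens")
embeddings = model.encode(coupons)

# Cluster the embeddings into coupon "concepts"; the cluster count is a guess
labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
print(list(zip(coupons, labels)))
```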
Starting with the tokenizer to see if we can successfully split the comments up into digestible chunks; a quick check is sketched below.
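A quick check along those lines, assuming the transformers library; the pretrained tokenizer and coupon string are placeholders:

```python
from transformers import AutoTokenizer

# Placeholder pretrained tokenizer; a custom-trained tokenizer may handle
# coupon-specific vocabulary (brand names, sizes, "BOGO") better.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Inspect how a noisy description gets split into subword chunks
print(tokenizer.tokenize("BOGO 50% off Tide PODS 42ct"))
```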
Interestingly, this paper found that BERT was undertrained, and that training it further (resulting in RoBERTa) got great results. Consider using RoBERTa instead of BERT.
- Sentence-BERT - A method of generating sentence embeddings in which sentences with similar meaning will be close together in high-dimensional space.
- Simply using CLS tags is remarkably effective. Specifically, this paper explains that the token representations can be used for token-level tasks such as question answering and sequence tagging, whereas the CLS tag can be used for things like classification (see the sketch after this list).
- bert-as-service - A far more popular (stars-wise) method of generating sentence-level embeddings than Sentence-BERT.
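To make the CLS point above concrete, here is a minimal sketch of pulling the [CLS] representation out of a plain BERT model with transformers; the model name and input string are placeholders, and this is just one way to get a sentence-level vector:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("$2 off any two boxes of cereal", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# First element is the last hidden state, shape (batch, seq_len, hidden_size);
# position 0 along the sequence axis is the [CLS] token, which can feed a
# classification head or serve as a crude sentence embedding.
last_hidden_state = outputs[0]
cls_embedding = last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```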
- When following this tutorial, why does this work:
RobertaTokenizer.from_pretrained("./dir_with_tokenizer_model")
but this doesn't:
DistilBertTokenizer.from_pretrained("./dir_with_tokenizer_model")
Using DistilBertTokenizer errors out, saying that the tokenizer is not among the models on Hugging Face and that it requires a vocab.txt file, whereas the work in the tutorial only generates the vocab.json and merges.txt files (helpfully explained here). What's the difference between this tokenizer initialization and this one?
- Looking at the ByteLevelBPETokenizer class, it becomes apparent that all of the items explicitly stated here are simply handled under the hood when initializing ByteLevelBPETokenizer with a couple of extra flags (a rough sketch is at the end of this list).
- Black sheep paper - Paper discussing reporting bias, in which we don't talk about the obvious, thus preventing models from learning basic facts (e.g. that sheep are generally white). Work has been done to address this (e.g. the ERNIE publication). Multi-modal models (incorporating images and other media) can also help correct this.
- Byte Pair Encoding Tokenizer
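Following up on the ByteLevelBPETokenizer note above, here's that rough sketch, assuming the tokenizers and transformers packages; the corpus file, output directory, vocabulary size, and flags are all placeholders:

```python
from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaTokenizer

# The "extra flags" are handled directly by the class (e.g. lowercasing)
tokenizer = ByteLevelBPETokenizer(lowercase=True)

# "coupon_descriptions.txt" is a hypothetical file, one description per line
tokenizer.train(
    files=["coupon_descriptions.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt -- the two files that
# RobertaTokenizer.from_pretrained() knows how to load
tokenizer.save_model("./dir_with_tokenizer_model")
roberta_tok = RobertaTokenizer.from_pretrained("./dir_with_tokenizer_model")
```

Since DistilBERT's tokenizer is WordPiece-based and expects a vocab.txt instead, this is presumably why the DistilBertTokenizer load above fails.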