/NLPWork

Collecting my thoughts surrounding NLP (specifically related to the coupon project).

Deciding which NLP problem I'm facing:

The problem I'm primarily facing is one of Named Entity Recognition (token-level tagging), where I need to label the entities in noisy coupon descriptions. Likely because these are not complete sentences, the off-the-shelf NER functionality has not been successful. For this reason, I'm likely going to have to fine-tune a model myself using run_ner.py. Using pipelines may also work.
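
To sanity-check the starting point, here's a minimal sketch of running the stock HuggingFace NER pipeline on a made-up coupon string (not project data):

    # Sketch: run the default pretrained NER pipeline on a noisy, incomplete
    # coupon description and inspect what it picks up before any fine-tuning.
    from transformers import pipeline

    ner = pipeline("ner")
    print(ner("20% off Tide PODS 42ct, max $5, expires 12/31"))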

An alternative would be to use sentence-level embeddings, which may be better at classifying/clustering different coupon concepts, e.g. batch-binning coupons by their top ~100 words and using those bins to fine-tune the model.
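
A rough sketch of the embedding-and-clustering idea, assuming the sentence-transformers and scikit-learn packages; the model name and coupon strings are placeholders:

    # Sketch: embed coupon descriptions with a pretrained Sentence-BERT model,
    # then cluster them so similar coupon concepts land in the same bin.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    coupons = [
        "20% off any large pizza",
        "buy one get one free on frozen pizza",
        "$5 off laundry detergent",
        "save $3 on Tide PODS",
    ]

    model = SentenceTransformer("bert-base-nli-mean-tokens")  # placeholder model
    embeddings = model.encode(coupons)  # one vector per coupon description

    kmeans = KMeans(n_clusters=2, random_state=0).fit(embeddings)
    for label, text in zip(kmeans.labels_, coupons):
        print(label, text)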

Deciding on the tokenizer:

Starting with the tokenizer to see if we can successfully split up the comments into digestible chunks.
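
As a quick first check, something like this sketch (using a stock pretrained tokenizer rather than one trained on coupon text) shows how a noisy coupon string gets split into subword pieces:

    # Sketch: inspect how a generic WordPiece tokenizer chops up coupon text;
    # if product names and amounts shatter badly, a custom tokenizer may help.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    print(tokenizer.tokenize("BOGO 50% off Tide PODS 42ct, exp 12/31"))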

Deciding on the down-stream model to use:

Interestingly, this paper found that BERT was undertrained, and that training it more thoroughly (resulting in RoBERTa) produced markedly better results. Consider using RoBERTa instead of BERT.
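
Swapping RoBERTa in for BERT with the transformers library is mostly a matter of changing the checkpoint name, e.g.:

    # Sketch: load RoBERTa instead of BERT; the rest of the pipeline stays the same.
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModel.from_pretrained("roberta-base")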

Sentence-level embedding:

  • Sentence-BERT - A method of generating sentence embeddings in which sentences with similar meaning will be close together in high-dimensional space.
  • Simply using the [CLS] token is remarkably effective. Specifically, this paper describes how the per-token representations can be used for token-level tasks such as question answering and sequence tagging, whereas the [CLS] representation can be used for things like classification (see the sketch after this list).
  • bert-as-service - A far more popular (stars-wise) method of generating sentence-level embeddings than Sentence-BERT.
  • When following this tutorial, why does this work:
    RobertaTokenizer.from_pretrained("./dir_with_tokenizer_model")
    but this doesn't:
    DistilBertTokenizer.from_pretrained("./dir_with_tokenizer_model")
    Using DistilBertTokenizer errors out saying that the tokenizer is not part of the models on HuggingFace and that it requires a vocab.txt file. DistilBERT uses a WordPiece tokenizer, which expects vocab.txt, but the work in the tutorial trains a byte-level BPE tokenizer and only generates the vocab.json and merges.txt files (helpfully explained here). See the sketch after this list.
  • What's the difference between this tokenizer initialization and this one?
    - Looking at the ByteLevelBPETokenizer class, it becomes apparent that all of the items explicitly stated here are simply handled under the hood here, when ByteLevelBPETokenizer is initialized with a couple of extra flags (see the sketch after this list).
  • Black sheep paper - Paper discussing reporting bias: we tend not to state the obvious, which prevents models from learning basic facts (e.g. that sheep are generally white). Work (e.g. the ERNIE publication) has been done to address this. Multi-modal models (incorporating images and other media) can also help correct it.
  • Byte Pair Encoding Tokenizer - A subword tokenization scheme that starts from individual characters/bytes and repeatedly merges the most frequent adjacent pair to build up its vocabulary.
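
Sketch of the [CLS]-as-sentence-embedding idea from the list above (model name and input string are just examples):

    # Sketch: take the final-layer hidden state at position 0 (the [CLS] token)
    # as a fixed-size vector for classification or clustering.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("$2 off any two boxes of cereal", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, hidden_size)
    print(cls_embedding.shape)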
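
Sketch of the vocab.json/merges.txt point above: train a byte-level BPE tokenizer, which writes vocab.json + merges.txt, then load it back with RobertaTokenizer (the paths, training file, and parameters here are placeholders):

    # Sketch: RobertaTokenizer reads vocab.json + merges.txt, exactly what the
    # byte-level BPE trainer produces; DistilBertTokenizer wants vocab.txt instead.
    import os
    from tokenizers import ByteLevelBPETokenizer
    from transformers import RobertaTokenizer

    os.makedirs("./dir_with_tokenizer_model", exist_ok=True)

    bpe = ByteLevelBPETokenizer()
    bpe.train(files=["coupons.txt"], vocab_size=30_000, min_frequency=2)
    bpe.save_model("./dir_with_tokenizer_model")  # writes vocab.json + merges.txt

    tokenizer = RobertaTokenizer.from_pretrained("./dir_with_tokenizer_model")
    print(tokenizer.tokenize("BOGO 50% off"))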
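
And a sketch of the flags-vs-explicit-components comparison: the ByteLevelBPETokenizer convenience class is roughly equivalent to wiring up a bare Tokenizer by hand (the exact flags shown are illustrative, not the tutorials' settings):

    # Convenience wrapper: one constructor plus a couple of flags.
    from tokenizers import Tokenizer, ByteLevelBPETokenizer
    from tokenizers import decoders, normalizers, pre_tokenizers
    from tokenizers.models import BPE

    convenient = ByteLevelBPETokenizer(lowercase=True)

    # Roughly what that does under the hood: attach the same components manually.
    manual = Tokenizer(BPE())
    manual.normalizer = normalizers.Lowercase()
    manual.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    manual.decoder = decoders.ByteLevel()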