/nlp

a collection of notebooks regarding NLP

Primary LanguageJupyter Notebook

NLP

Sentiment Analysis

  • Text_Classification_Pytorch.ipynb: AG News data from pytorch, classified by news type. Use a simple MLP with the raw tokens. Result is quite good > 90% accuracy, probably because this is an easy task
  • Sentiment_Analysis_Using_Universal_Sentence_Embedder: Twitter sentiments, labels look rather arbitrary . Using Universal Sentence Embedder and an MLP. Poor result currently 40% accuracy.
  • sentiment-analysis-marketing/: using Amazon reviews data, and nltk python library. Rule based method

Parsing text using Dependency parsing and Entity tagging libraries

  • Resume parsing using spaCy, done in this repo

TODO:

  • project: Hackernews Progressive Web App with a search bar using fuzzy search and semantic search
  • pick one dataset, maybe Twitter dataset and try different methods with it
  • Try ULMFit
  • play with this gensim-based movie genre plot project (comes with slides)
  • play with the Annotated Transformers

Today I learned

doc = nlp( "In 1990, more than 60% of people in East Asia were in extreme poverty. " "Now less than 4% are." ) When two strings are placed together in python inside a function call, they are concatenated together. This is a convenience method to make code easy to read

Foundations & Resources

Transformers

  • Transformer_walthrough.ipynb: code walk through of Transformers model.
  • Building block of Transformers: Attention, Layer Normalization, Residual Connection, Collapsing then Expanding Linear connections
  • When creating embedding with torch.Embedding() they has N(0, 1) distribution. Embeddings are scaled by the factor of sqrt(emb_size) before adding with positional embedding. This helps to reduce the impact of positional embedding in proportion to the word embedding. Inside Attention layers, we did a step of rescaling the embedding by 1/sqrt(emb_size) which get the mean and standard deviation to N(0, 1) back again

spaCy

  • a library for NER, dependency parsing, and pattern matching.
  • This is good book with project example for spaCy. PROGRESS: done!

gensim

  • an optimized library for loading text into different file formats, word2vec and document similarity