Collection of useful natural language processing code snippets. Use of NLTK library and spaCy.
NLP basics
- opening files
- getting vocabulary
- word frequency dictionaries
Preprocessing text
Part 1
- sentence segmentation
- word tokenization
- twitter corpus
- average sentence length in corpora
Part 2
- case normalisation
- stemming
- punctuation and stopword removal
Jupyter Notebooks
-
Sentiment Classification Report on investigation of NLP methods for distinguishing positive and negative reviews written about movies. Comparison of a word list classifier approach to a Naive Bayes classifier.
-
POS tagging This involves deciding the correct part-of speech tag (e.g., noun, verb, adjective etc) for each word in a sentence. Since the correct tag for each word depends not only on the current word but on the tags of those words around it, it is generally viewed as a sequence labelling problem. In other words, for a given sequence of words, we are asking what is the most likely sequence of tags?
-
Distributional semantics In a distributional model of meaning, words are represented in terms of their co-occurrences. Notebook contrasts close proximity co-occurrence (where words co-occur, say, next to each other) with more distant proximity (where words co-occur, say, within a window of 10 words).
-
Information retrieval and Question answering Involves finding relevant documents to a query expressed in terms of keywords and extracting keywords from a query
-
Named entity recognition Notebook introduces the spaCy library (https://spacy.io/) which provides a number of very fast, state-of-the-art accuracy tools for carrying out NLP tasks including part-of-speech tagging, dependency parsing and named entity recognition.
Notebook questions provided by the University of Sussex, Eng & Inf. Dept