Natural Language Processing Fundamentals in Python

Course Description In this course, you'll learn Natural Language Processing (NLP) basics, such as how to identify and separate words, how to extract topics in a text, and how to build your own fake news classifier. You'll also learn how to use basic libraries such as NLTK, alongside libraries which utilize deep learning to solve common NLP problems. This course will give you the foundation to process and parse text as you move forward in your Python learning.

What is natural language processing?

You will learn the basics of NLP

Topic identification, text classification

NLP applications include: Chatbots, translation, sentiment analysis etc.

Regular expressions & word tokenization

This chapter will introduce some basic NLP concepts, such as word tokenization and regular expressions to help parse text. You'll also learn how to handle non-English text and more difficult tokenization you might find as you explore the wide world of NLP.

Introduction to regular expressions

re library

What exactly are regular expressions?

Strings with special syntax

Allow us to match patterns in other strings

Applications of regular expressions:

Find all that web links in a document

Parse email addresses, remove or replace unwanted characters

Common regex patterns: (?)

Practicing regular expressions: re.split() and re.findall()

Now you'll get a chance to write some regular expressions to match digits, strings and non-alphanumeric characters. Take a look at my_string first by printing it in the IPython Shell, to determine how you might best match the different steps.

Note: It's important to prefix your regex patterns with r to ensure that your patterns are interpreted in the way you want them to. Else, you may encounter problems to do with escape sequences in strings. For example, "\n" in Python is used to indicate a new line, but if you use the r prefix, it will be interpreted as the raw string "\n" - that is, the character "" followed by the character "n" - and not as a new line.

Introduction to tokenization

nltk library (Word tokenization with NLTK)

What is tokenization

Turning a string or document into tokens (Smaller chunks)

One step is preparing a text for NLP

You can create your own rules using regular expressions, For example: Breaking out words or sentences, separating punctuation, separating all hashtags in a tweet

Why?

Easier to map part of speech, matching common words, Removing unwanted tokens, Determine meaning from simple text

Nltk tokenizers (?)

regex ranges and groups (?)

Regex with NLTK tokenization

Non-ascii tokenization

Charting word length with NLTK

Charting practice

Simple topic identification

This chapter will introduce you to topic identification, which you can apply to any text you encounter in the wild. Using basic NLP models, you will identify topics from texts based on term frequencies. You'll experiment and compare two simple methods - bag-of-words and Tf-idf using NLTK and a new library - Gensim.

Word counts with bag-of-words

Bag-of-words picker

Building a Counter with bag-of-words

Simple text preprocessing

Text preprocessing steps

Text preprocessing practice

Introduction to gensim

What are word vectors?

Creating and querying a corpus with gensim

Gensim bag-of-words

Tf-idf with gensim

What is tf-idf?

Tf-idf with Wikipedia

Named-entity recognition

This chapter will introduce a slightly more advanced topic - Named-entity recognition. You'll learn how to identify the who, what and where of your texts using pre-trained models on English and non-English text. You'll also learn how to use some new libraries - polyglot and spaCy - to add to your NLP toolbox.

Named Entity Recognition

NER with NLTK

Charting practice

Stanford library with NLTK

Introduction to SpaCy

Comparing NLTK with spaCy NER

spaCy NER Categories

Multilingual NER with polyglot

French NER with polyglot I

French NER with polyglot II

Spanish NER with polyglot

Building a "fake news" classifier

Here, you'll apply the basics of what you've learned along with some supervised machine learning to build a "fake news" detector. You'll begin by learning the basics of supervised machine learning, and then move forward by choosing a few important features and testing ideas to identify and classify "fake news" articles.

Classifying fake news using supervised learning with NLP

Which possible features?

Training and testing

Building word count vectors with scikit-learn

CountVectorizer for text classification

TfidfVectorizer for text classification

Inspecting the vectors

Training and testing a classification model with scikit-learn

Text classification models

Training and testing the "fake news" model with CountVectorizer

Training and testing the "fake news" model with TfidfVectorizer

Simple NLP, complex problems

Improving the model

Improving your model

Inspecting your model

xenakas/nlp-practice