The mystery lies in the use of language to express human life.
– Eudora Welty
Instructor: Brian Spiering
Website: github.com/brianspiering/nlp-course
This course covers the fundamental concepts and algorithms in Natural Language Processing (NLP). The goal of the course is to understand text using computational statistics.
This course will start with basic text processing techniques (such as regular expressions) and then cover advanced techniques (text classification and topic modeling). The emphasis will be on contemporary best practices in industry, including Deep Learning and text embeddings. Along the way we will touch upon text mining, information retrieval, and computational linguistics.
This course is a "buffet" format: a sample of many things, but you will not get "full" on any one topic. People get a PhD in each of these individual topics.
Remember - A little bit of knowledge and a lot of "how to" goes a long way in Data Science.
- Working knowledge of probability (e.g., calculating conditional probabilities and applying Bayes' theorem)
- Basic statistics (e.g., the difference between pmf and pdf)
- One course in machine learning
- Intermediate Python (e.g., the ability to create classes). Based on previous classes, the more Python a student knows, the more NLP he/she learns during the course.
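
As a rough self-check on the probability and Python prerequisites, here is a minimal sketch that applies Bayes' theorem to a binary hypothesis. The function name `posterior` and the spam-filter numbers are made up for illustration; they are not course material or real data.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(H | E) for a binary hypothesis H using Bayes' theorem.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    # Total probability of the evidence: P(E) = P(E|H)P(H) + P(E|~H)P(~H)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 20% of emails are spam, a given word appears in
# 60% of spam emails and in 5% of non-spam emails.
p_spam_given_word = posterior(prior=0.2, likelihood=0.6, false_positive_rate=0.05)
print(round(p_spam_given_word, 2))  # 0.75
```

If you can follow this function and reproduce the result by hand, you are likely comfortable with the level of probability assumed here.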
By the end of the course, you should be able to:
- Apply fundamental NLP concepts and algorithms to solve real-world problems
- Write efficient code to process and model text data
- Classify and cluster text data
- Create and use vector representations of words and documents
- Build an end-to-end system to model meaning in text
- Welcome
- NLP Overview
- Regular Expressions
- Segmenting, Tokenizing, & Stemming
- Language Modeling
- Text Embeddings: Words
- Text Embeddings: Documents et al.
- Word Tagging: POS (part of speech) and NER (named entity recognition)
- Text Classification / Sentiment Analysis with Naive Bayes
- Text Classification with Deep Learning
- Information Retrieval / Search Engineering
- Topic Modeling with Latent Dirichlet allocation (LDA)
- Theory. We are only going to cover applied parts of NLP, aka tips n' tricks for getting stuff done.
- Grammar. Grammar kinda sucks but it is a very powerful method for understanding language.
- Non-English languages. I ❤️ other languages, and they are very important to understanding NLP. There is just not enough time!
- Machine Translation. Again, this is very important, and incredible breakthroughs have been made. There is not enough time to adequately cover it.
- Natural Language Understanding (NLU). Finding "meaning" in text. We'll spend most of our time focused on lower levels of processing.
- Natural Language Generation (NLG). We're only going to briefly touch on how to programmatically create text during the Language Modeling section.
- Speech Recognition. For this class, we'll assume audio waves have been digitized into text. In the last couple of years, speech-based language processing has been revolutionized, and it is well worth looking into.
Item | Weight |
---|---|
Participation | 30% |
Labs | 30% |
Final Project | 40% |
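
To see how the weights combine, here is a minimal sketch. It assumes each component is scored on a 0-100 scale and combined as a simple weighted average, which the syllabus does not spell out; the example scores are hypothetical.

```python
# Weights taken from the table above.
WEIGHTS = {"Participation": 0.30, "Labs": 0.30, "Final Project": 0.40}


def weighted_score(scores):
    """Combine per-component scores (assumed 0-100) into one weighted average."""
    return sum(WEIGHTS[item] * scores[item] for item in WEIGHTS)


# Hypothetical component scores for one student.
example_scores = {"Participation": 90, "Labs": 80, "Final Project": 85}
total = weighted_score(example_scores)
print(round(total, 1))  # 85.0
```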
Course grades range from “A” to “F.” The MSDS program considers a grade of "A" to represent exceptional work with respect to both the instructor's expectations and peer student achievements. A grade of "B" represents the expected outcome, what is called "competence" in a business setting. A "C" grade represents achievements lower than the instructor's expectations for competence in the subject. A grade of "F" represents little or no work in the course.
You must show up to each session prepared. Each person is important to the dynamic of the class, and therefore students are required to participate in class activities. Expect to be "cold called". I call on students at random not to put you on the spot but to keep you engaged in the material at all times.
Attendance is mandatory. It is the responsibility of the student to attend all classes. If you have to miss class, due to sickness or other circumstances, please notify your instructor by Slack in advance. Supporting documents (e.g., doctor’s notes) should accompany absences due to sickness.
The labs will be hands-on activities. They will require a combination of coding and writing. The coding sections will be implementing algorithms from scratch or applying common libraries (e.g., scikit-learn, nltk, and keras). The writing sections will focus on communication to technical and nontechnical audiences.
Details in Final Project Folder.
Course Structure
This course will be partly "flipped": basic lectures will be videos watched before class. In-class lectures will cover complex topics in an active-learning style. You'll be writing a lot of code and completing many projects during class time.
Textbooks
There are no required textbooks for this course. Preparation materials (e.g., videos, articles, and blog posts) will be assigned for each session.