/applied-NLP-smm694

Teaching materials for a B-school, post-grad module on NLP

SMM694 ― Applied NLP

Instructor

Name: Dr. Simone Santoni, Lecturer in Strategy

Contacts: 020 7040 0057 ― simone.santoni.1@city.ac.uk

Webinar: Thursday ― 9:00 - 11:00 (via Zoom)

Office hours: Thursday ― 8:00 - 9:00 & 11:00 - 12:00 (via Zoom, book your slot via Doodle)

Module Overview

The increasing availability of textual data, along with advances in machine learning (ML) and deep learning (DL), makes NLP a must-have skill for business and financial analysts. 'Applied Natural Language Processing ― SMM694' provides post-graduate students enrolled in B-school programs with cutting-edge analytical frameworks to manipulate text corpora efficiently and to extract valuable insights from (apparently) unstructured natural language such as social media posts, product reviews, or corporate communications. Ultimately, the goal of the module is to help students appreciate how NLP can contribute to the organizational decision-making process.

Materials & Readings

You do not need to purchase any (expensive) book for this module; it is, however, essential to go through the following:

Discretionary readings students may want to refer to:

Prerequisites

Below are the prerequisites for SMM694:

  • all class assignments will be in Python (using Gensim, NumPy, NLTK, PyTorch, scikit-learn, SciPy, spaCy, and Stanza);

  • students should also be comfortable with:

    • derivatives;

    • matrix/vector notation and operations;

    • basic probability and statistics;

    • foundations of machine learning.

Learning Objectives and Assessment

At the end of the module, students should be able:

  • to clean, prepare, and transform text corpora;

  • to design and operate a variety of NLP pipelines;

  • to select the most appropriate NLP framework/tools to address specific business problems;

  • to translate NLP outcomes into unique input to the organizational decision-making process.

As per the module specification, students will be assessed on the basis of coursework submissions, all of which are the outcome of group-level efforts (yes, you understand correctly: for this module there is no final examination, and you are not supposed to deliver any assignment on your own). Specifically, there are two types of coursework, namely two 'hackathons' (HK) and a 'final course project' (FCP), which contribute to the final mark (FM) as follows:

FM = 0.15 × 2 × HK + 0.70 × FCP

The two HKs, which take place in weeks 4 and 6, fall into the category of 'purpose-driven' hackathons and will last five days each. The topics are top secret at the moment 😂 During each hackathon, students may want to interact with me to discuss problems and tentative solutions; I expect interactions to be emergent and informal, very much in the hackathon spirit. At the end of each hackathon, each group will present its solution to the class.

For the FCP, to be launched in week 5, students are expected to prepare and analyze a real-world dataset containing the speeches of individuals who are collectively recognized as leaders. FCP submissions will be evaluated on a rolling basis and are due by July 16 (5:00 PM London Time).

Both HK and FCP submissions will be evaluated against four criteria: i) appropriate use of notions and frameworks discussed in class; ii) effectiveness of the proposed answer or solution; iii) originality/creativity of the proposed answer or solution; iv) organization and clarity of submitted materials. All criteria carry equal weight in the mark.

🤔 💭 Problem sets will be launched weekly. Students may want to tackle these problem sets on their own or in groups and present their solutions to the class. A few students per session will be selected on the basis of the novelty and effectiveness of the proposed solution. One bonus point (delta FM = +1) will be assigned.
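
As a worked illustration of the final-mark formula above, the sketch below computes FM in Python. It assumes that HK stands for the average of the two hackathon marks (so each hackathon carries a weight of 0.15), that all marks share the same scale, and that the problem-set bonus is simply added to FM. The function and parameter names are hypothetical and not part of the module specification.

```python
def final_mark(hk1, hk2, fcp, bonus=0.0):
    """Final mark: FM = 0.15 * 2 * HK + 0.70 * FCP (+ optional bonus).

    Assumption: HK is the average of the two hackathon marks, so
    0.15 * 2 * HK equals 0.15 * hk1 + 0.15 * hk2.
    """
    hk = (hk1 + hk2) / 2  # average hackathon mark (assumed meaning of HK)
    return 0.15 * 2 * hk + 0.70 * fcp + bonus


# Hypothetical marks, for illustration only
print(round(final_mark(hk1=60, hk2=70, fcp=65), 2))           # 65.0
print(round(final_mark(hk1=60, hk2=70, fcp=65, bonus=1), 2))  # 66.0
```

Since the FCP carries 70% of the weight, a one-point change in the FCP mark moves FM by 0.70, whereas a one-point change in a single hackathon mark moves it by 0.15.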

Organization of the Module

The table below illustrates the schedule of the module. Note: depending on the progress of the class throughout the term, the topics included in the table may be subject to minor changes.

Each block of the program has both theory and applications. I will cover the theory part in a series of Coursera-like video recordings, which will focus mainly on the Jupyter slideshow. I will release the video recordings on a weekly basis, i.e., every Monday at 11:30 PM London Time.

Every Thursday at 9:00 London time, there will be an interactive Zoom webinar of one hour and a half. The first part of the webinar is a Q&A session in which I will address students' questions about the topics covered in the video recordings (yup, you have circa 3 days to digest the video recordings + related readings). Note: students are invited to share their clarification questions via email the day before the webinar (by 8:00 PM London time). In the second part of the webinar, I will walk the class through some real-time applications.

MS Teams is the main communication channel; the GitHub repo of the module, which is constantly updated, contains all the relevant scripts along with companion materials.

Week (date) Agenda
1 (20-05) Introduction to SMM694
― organization of the module
Overview of NLP
― conceptual and methodological roots
― scope of application
― established tools
― hot topics
A Python environment for NLP
― NLP pipelines (spaCy)
― general-purpose NLP packages (Gensim, NLTK)
― topic modeling (Tomotopy)
― NLP with deep learning (PyTorch)
― technical and scientific computation (NumPy)
― ML (scikit-learn)
2 (27-05) Representing words and meanings
― words and meanings in linguistics
― words and meanings in machines
― from WordNet to word2vec (via word vectors)
Webinar
― Q&A session
― problem set discussion
― using WordNet with NLTK
― loading a pre-trained model of language (spaCy)
― processing text through NLP pipelines (spaCy)
― leveraging word vectors (NumPy)
3 (03-06) Vector semantics and embeddings
― word2vec
― visualizing embeddings
― semantic properties of embeddings
― bias and embeddings
― evaluating vector models
― doc2vec
Webinar
― Q&A session
― problem set discussion
― training word embeddings (Gensim)
― training document embeddings (Gensim)
― passing embeddings through ML pipelines (scikit-learn)
― network analysis of embeddings (NetworkX)
4 (10-06) Topic modeling
― statistical estimation
― scope of application
― statistical validity
― face validity
― fit considerations
Webinar
― Q&A session
― problem set discussion
― cross-sectional LDA (Gensim)
― sequential LDA (Gensim)
― visualizing topic modeling outcomes (Gensim / pyLDAvis)
― expanding on topic modeling outcomes (scikit-learn)
5 (17-06) Sentiment, affect, and connotation
― Naive Bayes and sentiment classification
― available sentiment and affect lexicons
― human-labeled affect lexicons
― semi-supervised induction of affect lexicons
― supervised learning of word sentiment
Webinar
― Q&A session
― problem set discussion
― 'simple' sentiment analysis (PyTorch)
― convolutional sentiment analysis (PyTorch)
― multi-class sentiment analysis (PyTorch)
― aspect-based sentiment analysis (PyTorch)
6 (24-06) Information extraction
― Named Entity Recognition
― relation extraction
― extracting times
― extracting events and their time
Webinar
― Q&A session
― problem set discussion
― training a Named Entity Recognizer (spaCy)
― visualizing Named Entity Recognizer results (spaCy)
― training an entity linking model (spaCy)

Software Requirements

For this module, you are expected to run Python 3.7 on your machine. How do you get Python to work on your machine? There are several ways to do that. A fast, smooth option is to install Anaconda, an open-source distribution of Python that includes: i) 250+ popular data science packages; ii) the conda package manager, which makes it quick and easy to install, run, and upgrade complex data science and machine learning environments.

Here is the workflow:

  1. use your preferred browser to open the link pointing to the Anaconda repository;

  2. select the installer that suits your machine (32- or 64-bit) and operating system (Win, Mac OS, Linux). Mac users may want to download the graphical installer rather than the command-line installer (which students may feel less comfortable with);

  3. retrieve the installer (perhaps in your download folder);

  4. run the installer;

  5. log out of your current session (it does not matter whether you use Win, Mac OS, or Linux);

  6. log in to a new session;

  7. run 'Anaconda Navigator', a convenient place to launch the IPython shell or other user interfaces for interacting with IPython.
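
Once the installation is complete, a quick way to confirm which interpreter you are running is to check the version from within Python (for example, in the IPython shell). The snippet below is a minimal sketch: the helper name `python_version_ok` is hypothetical, and treating 3.7 as a minimum rather than an exact requirement is an assumption.

```python
import sys

REQUIRED = (3, 7)  # Python version named in the module requirements


def python_version_ok(version_info=sys.version_info, required=REQUIRED):
    """Return True if (major, minor) is at least `required` (assumed rule)."""
    return tuple(version_info[:2]) >= required


print("Running Python", sys.version.split()[0])
print("Meets the module requirement:", python_version_ok())
```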

The following Python libraries will be used in the module:

  • Gensim

  • Jellyfish

  • NetworkX

  • NumPy

  • NLTK

  • pyLDAvis

  • PyTorch

  • scikit-learn

  • spaCy

  • Tomotopy
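
To see which of the libraries listed above are already available in your environment, something along the following lines may help. It is a sketch only: the import names in the mapping (for example, `torch` for PyTorch and `sklearn` for scikit-learn) are the names these packages are usually imported under, and the helper `missing_libraries` is hypothetical. The check looks each module up without importing it.

```python
from importlib.util import find_spec

# Display name -> import name (import names assumed from each package's usual convention)
LIBRARIES = {
    "Gensim": "gensim",
    "Jellyfish": "jellyfish",
    "NetworkX": "networkx",
    "NumPy": "numpy",
    "NLTK": "nltk",
    "pyLDAvis": "pyLDAvis",
    "PyTorch": "torch",
    "scikit-learn": "sklearn",
    "spaCy": "spacy",
    "Tomotopy": "tomotopy",
}


def missing_libraries(libraries=LIBRARIES):
    """Return the display names of libraries that cannot be located."""
    return [name for name, module in libraries.items() if find_spec(module) is None]


missing = missing_libraries()
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All listed libraries are available.")
```

Any library reported as missing can then be installed through conda or pip, depending on how your environment is set up.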

Additional software may be required as learning opportunities emerge.