This repository hosts the course materials used for a 3-day seminar "Machine Learning and NLP: Advances and Applications" as part of Independent Study Period 2020 at New College of Florida.
Note that the seminar was held in January 2020, so the content may be slightly outdated (as of February 2022). Please also refer to the Fall 2021 full-semester course "CIS6930 Topics in Computing for Data Science", which covers a much wider (and somewhat newer) range of Deep Learning topics.
This 3-day course provides students with an opportunity to learn Machine Learning and Natural Language Processing (NLP) from basics to applications. The course covers some state-of-the-art NLP techniques, including Deep Learning. Each day consists of a lecture and a hands-on session to help students learn how to apply those techniques to real-world applications. During the hands-on sessions, students will be given programming assignments in Python. Three days are too short to fully understand the concepts covered by the course and to learn to apply those techniques to actual problems. Students are therefore strongly encouraged to complete the reading assignments before the lectures so that they are ready for the course assignments, and to bring a lot of questions to the course. :)
Students successfully completing the course will
- demonstrate the ability to apply machine learning and natural language processing techniques to various types of problems.
- demonstrate the ability to build their own machine learning models using Python libraries.
- demonstrate the ability to read and understand research papers in ML and NLP.
Wed 1/22 Day 1: Machine Learning basics [Slides]
- Machine learning examples
- Problem formulation
- Evaluation and hyper-parameter tuning
- Data Processing basics with pandas
- Machine Learning with scikit-learn
- Hands-on material: [ipynb]
Thu 1/23 Day 2: NLP basics [Slides]
- Unsupervised learning and visualization
- Topic models
- NLP basics with SpaCy and NLTK
- Understanding NLP pipeline for feature extraction
- Machine learning for NLP tasks (text classification, sequential tagging)
- Hands-on material: [ipynb]
- Follow-up
- Commonsense Reasoning (Winograd Schema Challenge)
Fri 1/24 Day 3: Advanced techniques and applications [Slides]
- Basic Deep Learning techniques
- Word embeddings
- Advanced Deep Learning techniques for NLP
- Problem formulation and applications to (non-)NLP tasks
- Pre-training models: ELMo and BERT
- Hands-on material: [ipynb]
- Follow-up
The following online tutorials are recommended for students who are not familiar with the Python libraries used in the course. Each day will have a hands-on session that requires those libraries. Please do not expect to have enough time to learn how to use them during the lecture.
- Pandas tutorials:
- scikit-learn tutorials:
- "An introduction to machine learning with scikit-learn"
- The other tutorials are also recommended
- gensim:
- spaCy:
- PyTorch:
The following list is a good starting point.
- Awesome - Most Cited Deep Learning Papers
The course will cover the following papers as examples of (non-NLP) applications (probably on Day 3). Students who would like to learn how to apply Deep Learning techniques to their own problems are encouraged to read them.
- [1] A. Asai, S. Evensen, B. Golshan, A. Halevy, V. Li, A. Lopatenko, D. Stepanov, Y. Suhara, W.-C. Tan, Y. Xu, "HappyDB: A Corpus of 100,000 Crowdsourced Happy Moments," Proc. LREC 2018, 2018. [Paper] [Dataset]
- [2] S. Evensen, Y. Suhara, A. Halevy, V. Li, W.-C. Tan, S. Mumick, "Happiness Entailment: Automating Suggestions for Well-Being," Proc. ACII 2019, 2019. [Paper]
- [3] Y. Suhara, Y. Xu, A. Pentland, "DeepMood: Forecasting Depressed Mood Based on Self-Reported Histories via Recurrent Neural Networks," Proc. WWW '17, 2017. [Paper]
- [4] N. Bhutani, Y. Suhara, W.-C. Tan, A. Halevy, H. V. Jagadish, "Open Information Extraction from Question-Answer Pairs," Proc. NAACL-HLT 2019, 2019. [Paper]
The course requires students to write code:
- Students are expected to have a personal computer at their disposal. Students should have a Python interpreter and the listed libraries installed on their machines.
The hands-on sessions require the following Python libraries. Please install them on your computer before the course. See also the reading assignment section for the recommended tutorials.
- pandas
- scikit-learn
- gensim
- spacy
- nltk
- torch (PyTorch)
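
As a quick way to confirm the setup before the course, the following sketch checks whether each library listed above can be located by the Python interpreter. It is an illustrative helper and not part of the course materials: the `REQUIRED` mapping and the `find_missing` function are hypothetical names, and the sketch assumes the usual import names (for example, scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# Maps each library as listed above to the module name used in `import`.
# The mapping itself is an assumption made for this sketch.
REQUIRED = {
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "gensim": "gensim",
    "spacy": "spacy",
    "nltk": "nltk",
    "torch (PyTorch)": "torch",
}


def find_missing(required):
    """Return the library names whose import module cannot be located."""
    return [name for name, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries were found.")
```

Any library reported as missing can typically be installed with `pip`, for example `pip install scikit-learn`.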