/ik-nlp-tutorials

Lab tutorials for the MSc NLP course at the University of Groningen 🐮

Primary language: Jupyter Notebook. License: GNU General Public License v3.0 (GPL-3.0).

Welcome to the IK-NLP Course! 🎉

These lab sessions are designed to help you follow along with the contents presented during the lectures, and introduce you to the skills and tools needed to complete the final projects.

What to expect?

The lab sessions will be a mix of tutorials and exercises. The tutorials will present modern frameworks and tools to implement advanced NLP analyses and pipelines. The exercises are designed to teach you the skills needed for final projects. Here is a brief overview of the schedule:

Week | Lab Tutorial | Lab Exercise
1 | Intro, Setup work environment and team creation; Start Intro to 🤗 Transformers | -
2 | Intro to 🤗 Transformers and Datasets | 🤗 Pipelines & Sentence Transformers for semantic search and QA
3 | Linguistic analysis with spaCy and Stanza | Training a BPE tokenizer and a lexicon-based transduction model
4 | Intro to the Hábrók cluster; Text tagging and dependency parsing with spaCy and Transformers | Combining Textual and Non-textual Features in NLP Models
5 | Fine-tuning and Inference with 🤗 Transformers | Analyzing language generation models with Inseq 🐛
6 | Natural Language Generation with 🤗 Transformers | -
7 | Final Project Progress Report | -

Some notes:

  • The core contents are covered in the first few weeks of the course to kickstart your work. Exercise sessions are dropped from week 6 onwards to allow you to focus on the final project.

  • Participation in the lab sessions is highly encouraged, as they cover fundamental notions for the assignment portfolios and the final projects. Instructors will be available to answer questions and provide guidance.

Tools and Frameworks

The lab sessions make use of the Jupyter environment. You can use the following links to get started:

Alternatively, it is possible to use the notebooks via the Google Colab web environment simply by clicking on the Open In Colab button at the beginning of each notebook. If you’re running on Windows, we recommend following along using a Colab notebook. If you’re using a Linux distribution or macOS, you can use either approach described here. For an intro to the Colab environment, refer to:

Since the lab sessions will introduce you to OSS libraries such as spaCy, Stanza, Scikit-learn, 🤗 Transformers and 🤗 Datasets, most of the first few sessions' contents are adapted from official tutorials and docs. Here is a non-exhaustive list of the most relevant sources for additional reference:

The file requirements.txt in this repository lists all the packages required to run the lab sessions. You can create a Python virtual environment (Python>=3.6) and install them using the following commands:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Make sure the virtual environment is activated before running Jupyter. If you are using Colab, simply run the cell at the beginning of each notebook to install the required packages. Refer to Using a Python Virtual Environment for more details on how to create and activate a virtual environment. Alternatively, you can use Poetry to manage the dependencies.
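As a quick sanity check before launching Jupyter, a minimal sketch like the one below can confirm that the interpreter is running inside a virtual environment. It is not part of the course materials; the helper name in_virtualenv is hypothetical, and the check assumes an environment created with python -m venv as shown above.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # In an environment created with `python -m venv`, sys.prefix points at the
    # environment directory, while sys.base_prefix points at the base install.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment detected; run `source venv/bin/activate` first.")
```

Running the script with the same python executable you use to start Jupyter shows whether the packages from requirements.txt would be picked up from the virtual environment.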

For any troubleshooting, please consult the FAQ before asking for help. You are encouraged to contribute to it by adding your solutions!

About us

Arianna Bisazza is an Associate Professor in Computational Linguistics and Natural Language Processing at the Computational Linguistics Group of the University of Groningen. She is passionate about the study of human languages, how they differ from each other, and how they can be modeled by computational tools. Her primary interest is in the development of language technologies supporting a large variety of languages around the world. She is also interested in the new knowledge that computational models can reveal about the nature of language.
Gabriele Sarti is a PhD student in the Computational Linguistics Group of the University of Groningen. He is part of the Dutch consortium InDeep, working on interpretability for language generation and neural machine translation. Previously, he was a research scientist at Aindo and a research intern at Amazon Translate NYC. His research interests involve interpretability for NLP, human-AI interaction and the use of behavioral information like eye-tracking patterns to improve language understanding systems.
Jirui Qi is a PhD student in the Computational Linguistics Group of the University of Groningen. He is part of the Dutch consortium LESSEN, and his research mainly focuses on low-resource conversational generation, the generalization of factual knowledge across languages, and prompt-based learning for classification.
Leonidas Zotos is a PhD student in the Computational Linguistics Group of the University of Groningen. He works at the intersection of language modelling and human learning, with a focus on multifaceted event understanding. His current focus is on multiple-choice assessment methods and how these tests can be better designed to improve long-term retention.

You see something wrong or missing?

Please open an issue here on GitHub! This is the second year we are using these contents for the course and although most of them come from battle-tested online tutorials, we are always looking for feedback and suggestions.

We thank our past students Georg Groenendaal, Robin van der Noord, Ayça Avcı and Remco Leijenaar for their contributions in spotting errors in the course materials.

Teaching Assistants Alumni

2023

Ludwig Sickert was an MSc candidate in AI at the University of Groningen. He attended the IK-NLP course in 2022 and worked on interpreting formality in machine translation systems for his master thesis under the supervision of Gabriele and Arianna. He served as TA for the 2023 edition of the course.

2022

Anjali Nair was an MSc candidate in AI at the University of Groningen. She served as teaching assistant for the 2022 edition of the course.