/ik-nlp-tutorials

Lab tutorials for the MSc NLP course at the University of Groningen 🐮

Primary language: Jupyter Notebook · License: GNU General Public License v3.0 (GPL-3.0)

Welcome to the IK-NLP Course! 🎉

These lab sessions are designed to help you follow along with the contents presented during the lectures, and introduce you to the skills and tools needed to complete the final projects.

What to expect?

The lab sessions will be a mix of tutorials and exercises. The tutorials will present modern frameworks and tools to implement advanced NLP analyses and pipelines. The exercises are designed to teach you the skills needed for final projects. Here is a brief overview of the schedule:

| Week | Lab Tutorial | Lab Exercise |
| --- | --- | --- |
| 1 | · Intro, Setup work environment and team creation · Start Intro to 🤗 Transformers | - |
| 2 | Intro to 🤗 Transformers and Datasets | 🤗 Pipelines & Sentence Transformers for semantic search and QA |
| 3 | Linguistic analysis with spaCy and Stanza | Training a BPE tokenizer and a lexicon-based transduction model |
| 4 | · Intro to the Peregrine cluster · Text tagging and dependency parsing with spaCy and Transformers | Combining Textual and Non-textual Features in NLP Models |
| 5 | Natural Language Generation with 🤗 Transformers (TBD) | Exploring MT model saliency on the DivEMT corpus (TBD) |
| 6 | Fine-tuning and Efficient Modeling with 🤗 Transformers (TBD) | - |
| 7 | Final Project Progress Report | - |

Some notes:

  • The core contents are covered in the first few weeks of the course to kickstart your work. Exercise sessions are dropped from week 6 onwards to allow you to focus on the final project.

  • The current notebooks for W4 and W5 are outdated and will be updated according to the schedule above.

  • Participation in the lab sessions is highly encouraged, as they cover fundamental notions for the assignment portfolios and the final projects. Instructors will be available to answer questions and provide guidance.

Tools and Frameworks

The lab sessions make use of the Jupyter environment. You can use the following links to get started:

Alternatively, it is possible to use the notebooks via the Google Colab web environment simply by clicking on the Open In Colab button at the beginning of each notebook. If you’re running on Windows, we recommend following along using a Colab notebook. If you’re using a Linux distribution or macOS, you can use either approach described here. For an intro to the Colab environment, refer to:

Since the lab sessions will introduce you to OSS libraries such as spaCy, Stanza, Scikit-learn, 🤗 Transformers and 🤗 Datasets, most of the first few sessions' contents are adapted from official tutorials and docs. Here is a non-exhaustive list of the most relevant sources for additional reference:

The file requirements.txt in this repository lists all the packages required to run the lab sessions. You can create a Python virtual environment (Python>=3.6) and install them using the following commands:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
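Once the commands above have finished, it can be useful to confirm that the packages were actually installed into the active environment. The snippet below is a minimal sketch, not part of the course materials: the helper name `missing_packages` is hypothetical, it assumes simple `name==version` style lines in requirements.txt (pip options, URLs and extras are not handled), and it relies on `importlib.metadata`, which needs Python 3.8 or newer.

```python
import re
from importlib import metadata
from pathlib import Path


def missing_packages(requirements_path):
    """Return the package names listed in a requirements file that are not installed.

    Hypothetical helper: only handles plain `name` / `name==version` lines.
    """
    missing = []
    for raw_line in Path(requirements_path).read_text().splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blank lines and pip options (-r, -e, ...)
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match is None:
            continue
        name = match.group(0)
        try:
            metadata.version(name)  # raises if the distribution is not installed
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Calling `missing_packages("requirements.txt")` from the repository root returns a list of names; an empty list suggests every listed package is present in the current environment. Version constraints are not checked.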

Make sure the virtual environment is activated before running Jupyter. If you are using Colab, simply run the cell at the beginning of each notebook to install the required packages. Refer to Using a Python Virtual Environment for more details on how to create and activate a virtual environment. Alternatively, you can use Poetry to manage the dependencies.
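To check from inside a notebook or a Python session whether the virtual environment is really active, a sketch along the following lines may help. The function names `in_virtualenv` and `meets_min_version` are illustrative assumptions, not part of the course code; the check relies on the fact that environments created with `python -m venv` set `sys.prefix` to the environment directory while `sys.base_prefix` keeps pointing at the base installation.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix differs from sys.base_prefix; outside they are equal.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def meets_min_version(minimum=(3, 6)):
    """Return True when the running Python is at least the given (major, minor) version."""
    return sys.version_info >= minimum


print("Virtual environment active:", in_virtualenv())
print("Python version OK:", meets_min_version())
```

On Colab the first check is expected to report `False`, since packages there are installed into the hosted runtime rather than into a virtual environment; that is not a problem for following the labs.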

For any troubleshooting, please consult the FAQ before asking for help. You are encouraged to contribute to it by adding your solutions!

About us

Arianna Bisazza is an Assistant Professor in Computational Linguistics and Natural Language Processing at the Computational Linguistics Group of the University of Groningen. She is passionate about the study of human languages, how they differ from each other, and how they can be modeled by computational tools. Her primary interest is in the development of language technologies supporting a large variety of languages around the world. She is also interested in the new knowledge that computational models can reveal about the nature of language.
Gabriele Sarti is a PhD student at the Computational Linguistics Group of the University of Groningen. He is part of the Dutch consortium InDeep, working on interpretability for language generation and neural machine translation. Previously, he was a research scientist at Aindo and a research intern at Amazon Translate NYC. His research interests involve interpretability for NLP, human-AI interaction and the usage of behavioral information like eye-tracking patterns to improve language understanding systems.
Ludwig Sickert is an MSc candidate in AI at the University of Groningen and a senior consultant in Cloud and AI technologies at IBM Netherlands. He attended the IK-NLP course in 2022 and is now working on interpreting formality in machine translation systems for his master's thesis under the supervision of Gabriele and Arianna. He will serve as TA for the 2023 edition of the course.

You see something wrong or missing?

Please open an issue here on GitHub! This is the second year we are using these contents for the course, and although most of them come from battle-tested online tutorials, we are always looking for feedback and suggestions.

Alumni

2022

Anjali Nair is an MSc candidate in AI at the University of Groningen. She served as teaching assistant for the 2022 edition of the Natural Language Processing course.

We thank our past students Georg Groenendaal, Robin van der Noord and Ayça Avcı for their contributions in spotting errors in the course materials.