"Computer Tools for Linguistic Research" in Higher School of Economics (Nizhny Novgorod branch).
- Demidovskij Alexander Vladimirovich - lecturer
- Uraev Dmitry Yurievich - assistant
The idea is to automatically obtain a dataset with a defined structure and appropriate content, and then to perform morphological analysis on it using various NLP libraries. See: Dataset requirements.
- Scrapper
  - Short summary: Your code can automatically parse a media website of your choice and save the texts and their metadata in a proper format.
  - Deadline: March 15th, 2021
  - Format: each student works in their own PR
  - Dataset volume: 5-7 articles
  - Design document: ./docs/scrapper.md
  - Additional resources:
    - List of media websites to select from: link
- Pipeline
  - Short summary: Your code can automatically process raw texts from the previous step, performing part-of-speech tagging and basic morphological analysis.
  - Deadline: April 5th, 2021
  - Format: each student works in their own PR
  - Dataset volume: 5-7 articles
  - Design document: ./docs/pipeline.md
- Own Research
  - Short summary: Your code can create a bigger processed dataset of a requested volume and format, which you then use for your linguistic research.
  - Deadline: TBD (approx. May 30th, 2021)
  - Format: students work in groups - one PR per group
  - Dataset volume: 100 articles
Module | Description | Component | Required for a mark of at least |
---|---|---|---|
requests | module for downloading web pages | scrapper | 4 |
BeautifulSoup | module for finding information on web pages | scrapper | 4 |
lxml | module for parsing HTML as a structure | scrapper | 6 |
pymystem3 | module for morphological analysis | pipeline | 6 |
pymorphy2 | module for morphological analysis | pipeline | 8 |
pandas | module for table data analysis | pipeline | 10 |
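
As a rough illustration of how the scrapper modules in the table fit together, the sketch below extracts a title and body text from an HTML page with BeautifulSoup. The sample HTML, the `article-body` class name and the `extract_article` helper are assumptions made for this example; every media website has its own markup, so the selectors have to be adapted. The download step (for instance `requests.get(url, timeout=10).text`) is left out so that the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical article page. A real scrapper would download the HTML first,
# for example with requests.get(url, timeout=10).text
SAMPLE_HTML = """
<html><body>
  <h1>Sample headline</h1>
  <div class="article-body">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""


def extract_article(html: str) -> dict:
    """Return the title and the body text found in an article page."""
    # "html.parser" ships with Python; "lxml" can be passed instead if installed.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1").get_text(strip=True)
    # The class name below is an assumption about the page layout.
    body = soup.find("div", class_="article-body")
    paragraphs = [p.get_text(strip=True) for p in body.find_all("p")]
    return {"title": title, "text": "\n".join(paragraphs)}


article = extract_article(SAMPLE_HTML)
print(article["title"])
print(article["text"])
```

On a real website, `soup.find` returns `None` when an element is missing, so production code would need to check for that before calling `get_text`.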
The software solution is built on top of three components:
- scrapper.py - a module for finding articles from the given media, extracting text and dumping it to the filesystem. Students need to implement it.
- pipeline.py - a module for processing text: part-of-speech tagging and basic morphological analysis. Students need to implement it.
- article.py - a module for article abstraction that encapsulates low-level manipulations with the article
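
To make the idea of an article abstraction more concrete, here is a minimal sketch of a class that hides file handling from the scrapper and the pipeline. It is not the course's `article.py`: the class name `ArticleSketch`, its fields and the `<id>_raw.txt` / `<id>_meta.json` file names are assumptions chosen for illustration only.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ArticleSketch:
    """Hypothetical stand-in for an article abstraction (not the course's article.py)."""

    article_id: int
    url: str
    title: str
    text: str

    def save(self, directory: Path) -> None:
        """Write the raw text and the metadata as two separate files."""
        directory.mkdir(parents=True, exist_ok=True)
        # File naming is an assumption made for this sketch.
        raw_path = directory / f"{self.article_id}_raw.txt"
        meta_path = directory / f"{self.article_id}_meta.json"
        raw_path.write_text(self.text, encoding="utf-8")
        meta = {"id": self.article_id, "url": self.url, "title": self.title}
        meta_path.write_text(
            json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8"
        )

    @classmethod
    def load(cls, directory: Path, article_id: int) -> "ArticleSketch":
        """Rebuild an article from the two files written by save()."""
        meta_path = directory / f"{article_id}_meta.json"
        raw_path = directory / f"{article_id}_raw.txt"
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        text = raw_path.read_text(encoding="utf-8")
        return cls(meta["id"], meta["url"], meta["title"], text)


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    original = ArticleSketch(1, "https://example.com/news/1", "Sample headline", "Article text.")
    original.save(Path(tmp))
    restored = ArticleSketch.load(Path(tmp), 1)

print(restored == original)
```

Keeping the file layout inside one class means the scrapper only calls `save` and the pipeline only calls `load`, so a change in the storage format touches a single place.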
Order of handing over:
- the lab work is accepted for oral presentation.
- the student explains how the program works and shows it in action.
- the student completes a mini-task from a mentor that requires some slight code modifications.
- the student receives a mark:
  - that corresponds to the expected one, if all the steps above are completed and the mentor is satisfied with the answer
  - one point higher than the expected one, if all the steps above are completed and the mentor is very satisfied with the answer
  - one point lower than the expected one, if the lab is handed over one week later than the deadline and the criteria from 4.1 are satisfied
  - two points lower than the expected one, if the lab is handed over more than one week later than the deadline and the criteria from 4.1 are satisfied
NOTE: a student might improve their mark for the lab, if they complete tasks of the next level after handing over the lab.
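
The marking rules above can be read as a small function. The sketch below is one possible interpretation, not an official grading script: `final_mark` and its parameters are hypothetical names, "one week later" is assumed to mean 1 to 7 days after the deadline, the bonus point is assumed to apply only to on-time submissions, and no upper or lower bound on the result is applied because the rules do not mention one.

```python
# Allowed expected marks, as listed for target_score.txt.
ALLOWED_TARGETS = (4, 6, 8, 10)


def final_mark(expected: int, days_late: int = 0, very_satisfied: bool = False) -> int:
    """Return the mark for a lab under the assumptions stated above."""
    if expected not in ALLOWED_TARGETS:
        raise ValueError(f"expected mark must be one of {ALLOWED_TARGETS}")
    if days_late < 0:
        raise ValueError("days_late cannot be negative")
    if days_late > 7:
        # More than one week after the deadline.
        return expected - 2
    if days_late > 0:
        # Up to one week after the deadline (assumed reading of the rule).
        return expected - 1
    # Handed over on time: the mentor may add one point.
    return expected + 1 if very_satisfied else expected


print(final_mark(8))
print(final_mark(8, very_satisfied=True))
print(final_mark(8, days_late=3))
print(final_mark(8, days_late=10))
```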
A lab work is accepted for oral presentation if all the criteria below are satisfied:
- there is a Pull Request (PR) with a correctly formatted name: `Laboratory work #<NUMBER>, <SURNAME> <NAME> - <UNIVERSITY GROUP NAME>`. Example: `Laboratory work #1, Kuznetsova Valeriya - 19FPL1`.
- the PR has a filled-in file `target_score.txt` with an expected mark. Acceptable values: 4, 6, 8, 10.
- the PR has green status.
- the PR has a label `done`, set by the mentor.
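
Before opening a PR, the name format and the expected mark can be checked locally. The following sketch is only an illustration: `parse_pr_title` and `parse_target_score` are hypothetical helpers, and the regular expression assumes that the surname, the name and the group are each a single token without spaces.

```python
import re

# Pattern for "Laboratory work #<NUMBER>, <SURNAME> <NAME> - <UNIVERSITY GROUP NAME>".
# Assumption: surname, name and group contain no spaces.
PR_TITLE = re.compile(
    r"^Laboratory work #(?P<number>\d+), "
    r"(?P<surname>\S+) (?P<name>\S+) - (?P<group>\S+)$"
)

ACCEPTABLE_SCORES = (4, 6, 8, 10)


def parse_pr_title(title: str):
    """Return the parts of a correctly formatted PR name, or None."""
    match = PR_TITLE.match(title)
    return match.groupdict() if match else None


def parse_target_score(content: str) -> int:
    """Read the expected mark from the content of target_score.txt."""
    value = int(content.strip())
    if value not in ACCEPTABLE_SCORES:
        raise ValueError(f"expected mark must be one of {ACCEPTABLE_SCORES}")
    return value


print(parse_pr_title("Laboratory work #1, Kuznetsova Valeriya - 19FPL1"))
print(parse_pr_title("Lab 1 Kuznetsova"))
print(parse_target_score("8\n"))
```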
- Academic performance: link
- Media websites list: link
- Python programming course from previous semester: link
- Scraping tutorials: YouTube series (in Russian)