Dataset Collector Lab for 2nd-year students of Fundamental and Computational Linguistics (2020/2021)

About the course

"Computer Tools for Linguistic Research" at the Higher School of Economics (Nizhny Novgorod branch).

Lecturers

Demidovskij Alexander Vladimirovich - lecturer
Uraev Dmitry Yurievich - assistant

Motivation

The idea is to automatically obtain a dataset that has a certain structure and appropriate content, and then to perform morphological analysis on it using various NLP libraries. See the dataset requirements.

Project Timeline

Scrapper
1. Short summary: Your code can automatically parse a media website of your choice and save the texts and their metadata in a proper format.
2. Deadline: March 15th, 2021
3. Format: each student works in their own PR
4. Dataset volume: 5-7 articles
5. Design document: ./docs/scrapper.md
6. Additional resources:
  1. List of media websites to select from: link
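The "save texts and metadata" part of this step can be sketched as follows. This is only a minimal illustration, not the layout required by the design document: the function name `save_article`, the file names `<id>_raw.txt` and `<id>_meta.json`, and the metadata keys are all assumptions made for this example.

```python
import json
from pathlib import Path
from typing import Tuple


def save_article(article_id: int, text: str, meta: dict, base_dir: Path) -> Tuple[Path, Path]:
    """Write one article as a raw-text file plus a JSON metadata file."""
    base_dir.mkdir(parents=True, exist_ok=True)

    # File naming is an assumption of this sketch, not a course requirement.
    text_path = base_dir / f"{article_id}_raw.txt"
    meta_path = base_dir / f"{article_id}_meta.json"

    text_path.write_text(text, encoding="utf-8")
    # ensure_ascii=False keeps Cyrillic characters readable in the JSON file.
    meta_path.write_text(
        json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return text_path, meta_path
```

Keeping the text and the metadata in separate files means the later processing step can read the raw text without parsing JSON.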
Pipeline
1. Short summary: Your code can automatically process raw texts from the previous step and perform part-of-speech tagging and basic morphological analysis.
2. Deadline: April 5th, 2021
3. Format: each student works in their own PR
4. Dataset volume: 5-7 articles
5. Design document: ./docs/pipeline.md
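The overall shape of such a processing step (tokenize, attach a lemma and a part-of-speech tag, then summarize) might look like the sketch below. A toy lookup table stands in for a real morphological analyzer such as pymystem3 or pymorphy2, so the lexicon entries, the tag names, and the function names are illustrative assumptions only.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# Toy lexicon: token -> (lemma, part of speech).
# It is a stand-in for a real analyzer; entries and tags are assumptions.
TOY_LEXICON: Dict[str, Tuple[str, str]] = {
    "кошки": ("кошка", "NOUN"),
    "спят": ("спать", "VERB"),
    "тихо": ("тихо", "ADV"),
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tag(tokens: List[str]) -> List[Tuple[str, str, str]]:
    """Return (token, lemma, pos) triples; unknown tokens get the tag UNKN."""
    return [(tok, *TOY_LEXICON.get(tok, (tok, "UNKN"))) for tok in tokens]


def pos_frequencies(tagged: List[Tuple[str, str, str]]) -> Counter:
    """Count how often each part-of-speech tag occurs."""
    return Counter(pos for _, _, pos in tagged)
```

In a real pipeline the lookup in `tag` would be replaced by a call to the chosen analyzer, while tokenization and counting could stay as they are.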
Own Research
1. Short summary: Your code can create a bigger processed dataset of a requested volume and format that you use for your linguistic research.
2. Deadline: TBD (approx. May 30th, 2021)
3. Format: students work in groups - one PR per group
4. Dataset volume: 100 articles

Technical solution

Module	Description	Component	Required for a mark of at least
requests	module for downloading web pages	scrapper	4
BeautifulSoup	module for finding information on web pages	scrapper	4
lxml	module for parsing HTML as a structure	scrapper	6
pymystem3	module for morphological analysis	pipeline	6
pymorphy2	module for morphological analysis	pipeline	8
pandas	module for table data analysis	pipeline	10

The software solution is built on three components:

scrapper.py - a module for finding articles on the given media website, extracting their text and dumping it to the filesystem. Students need to implement it.
pipeline.py - a module for processing text: part-of-speech tagging and basic morphological analysis. Students need to implement it.
article.py - a module that provides an article abstraction to encapsulate low-level manipulations of an article.

Handing over your work

Order of handing over:

1. The lab work is accepted for oral presentation.
2. The student explains how the program works and shows it in action.
3. The student completes a mini-task from a mentor that requires slight code modifications.
4. The student receives a mark:
  1. that corresponds to the expected one, if all the steps above are completed and the mentor is satisfied with the answer
  2. one point higher than the expected one, if all the steps above are completed and the mentor is very satisfied with the answer
  3. one point lower than the expected one, if the lab is handed over up to one week after the deadline and the criteria from 4.1 are satisfied
  4. two points lower than the expected one, if the lab is handed over more than one week after the deadline and the criteria from 4.1 are satisfied
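The marking rules above can be read as a small function. This is only one interpretation: it assumes that "one week later" means 1 to 7 days after the deadline, that the bonus point applies only to work handed over on time, and that no clamping is applied to the result, since the rules do not say otherwise. The name `final_mark` and its parameters are hypothetical.

```python
ALLOWED_EXPECTED_MARKS = (4, 6, 8, 10)


def final_mark(expected: int, days_late: int = 0, very_satisfied: bool = False) -> int:
    """Compute the mark from the expected mark and lateness (in days)."""
    if expected not in ALLOWED_EXPECTED_MARKS:
        raise ValueError(f"expected mark must be one of {ALLOWED_EXPECTED_MARKS}")
    if days_late < 0:
        raise ValueError("days_late cannot be negative")

    if days_late == 0:
        # On time: the expected mark, plus one if the mentor is very satisfied.
        return expected + 1 if very_satisfied else expected
    if days_late <= 7:
        # Up to one week late (assumed interpretation): one point lower.
        return expected - 1
    # More than one week late: two points lower.
    return expected - 2
```

For example, `final_mark(8, days_late=5)` gives 7 under these assumptions.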

NOTE: a student may improve their mark for the lab if they complete tasks of the next level after handing over the lab.

A lab work is accepted for oral presentation if all the criteria below are satisfied:

there is a Pull Request (PR) with a correctly formatted name: Laboratory work #<NUMBER>, <SURNAME> <NAME> - <UNIVERSITY GROUP NAME>. Example: Laboratory work #1, Kuznetsova Valeriya - 19FPL1.
the PR has a completed target_score.txt file with the expected mark. Acceptable values: 4, 6, 8, 10.
the PR has a green status.
the PR has the label done, set by a mentor.
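Before opening a PR, the first two criteria can be checked locally with a few lines of code. The sketch below is not an official checker: it assumes that the surname, name, and group name contain no spaces and that target_score.txt holds nothing but the number. The names `is_valid_pr_title` and `read_target_score` are hypothetical.

```python
import re

# Pattern derived from the format:
# Laboratory work #<NUMBER>, <SURNAME> <NAME> - <UNIVERSITY GROUP NAME>
PR_TITLE = re.compile(
    r"Laboratory work #(?P<number>\d+), "
    r"(?P<surname>\S+) (?P<name>\S+) - (?P<group>\S+)"
)
ALLOWED_SCORES = {4, 6, 8, 10}


def is_valid_pr_title(title: str) -> bool:
    """Return True if the whole title matches the required format."""
    return PR_TITLE.fullmatch(title) is not None


def read_target_score(content: str) -> int:
    """Parse the content of target_score.txt and validate the value."""
    score = int(content.strip())
    if score not in ALLOWED_SCORES:
        raise ValueError(f"target score must be one of {sorted(ALLOWED_SCORES)}")
    return score
```

A typical use would be `read_target_score(Path("target_score.txt").read_text())` with `pathlib.Path`, where the file location is again an assumption.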

Resources

Academic performance: link
Media websites list: link
Python programming course from previous semester: link
Scraping tutorials: YouTube series (in Russian)
