This is the repository for a BA thesis written at the School of Linguistics, NRU HSE (Moscow, Russia), during the 2018/19 academic year.
UPD: all notebooks are now available on Binder! Click the badge to launch them:
I have already done a project on the Russian Drama Corpus and its stage directions, which can be found here: Stage Directions in Russian Drama. That project dealt mostly with the quantitative side of the research and with exploring corpus trends; in this one, I want to focus on computational linguistics and machine learning, that is, on extracting linguistic features and running several different models.
The Russian Drama Corpus (RusDraCor for short) can be found at dracor-org/rusdracor; it is also available in a more user-friendly format on the DraCor website.
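Since the corpus is TEI-encoded, stage directions sit in `<stage>` elements, optionally carrying a `type` attribute. Below is a minimal sketch of pulling them out of one play with the Python standard library. It is not the code from the notebooks: the function name `extract_stage_directions` and the tiny `SAMPLE_TEI` fragment are made up for illustration, and real corpus files will be much larger and more deeply nested.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; TEI documents declare it on the root element.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature play, invented for this example only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="scene">
      <stage type="entrance">Enter Ivan.</stage>
      <sp who="#ivan"><speaker>Ivan</speaker>
        <p>Good evening. <stage>Sits down.</stage></p>
      </sp>
      <stage type="exit">Exit Ivan.</stage>
    </div>
  </body></text>
</TEI>"""


def extract_stage_directions(tei_xml):
    """Return (text, type) pairs for every <stage> element, in document order.

    `type` is None when the direction carries no type attribute.
    """
    root = ET.fromstring(tei_xml)
    directions = []
    for stage in root.iterfind(".//tei:stage", TEI_NS):
        # Join nested text nodes and collapse runs of whitespace.
        text = " ".join("".join(stage.itertext()).split())
        directions.append((text, stage.get("type")))
    return directions


for text, stage_type in extract_stage_directions(SAMPLE_TEI):
    print(stage_type, "|", text)
```

Directions without a `type` come back with `None`, which makes it easy to separate already-annotated directions from the ones a classifier would still have to label.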
Content | Notebook | Additional |
---|---|---|
retrieving and downloading data | api-data-preprocessing.ipynb | dracor_api.py, file_work.py |
annotation description | annotation_guide.md | |
morphology, NER, stopwords, etc. | linguistic-features.ipynb | |
semantics hypothesis + test on 2018 data | semantic-rules.ipynb | |
working with the final dataset | dataset-separation.ipynb | |
model fitting: entrance and exit | fitting-semantic-types.ipynb | data_preparation.py, model_fitting.py, separate semantic class to come |
model fitting: other types | fitting-nonsemantic-types.ipynb | data_preparation.py, model_fitting.py |
The thesis text is also available here in the repo: pdf
TEI 2019 (Sep 20, 2019): Using Machine Learning for the Automated Classification of Stage Directions in TEI-Encoded Drama Corpora
Thesis defence (Jun 17, 2019): Short Text Classification: a Case of Stage Directions in Russian Drama
Module | Event | Date |
---|---|---|
3rd | Project Proposal presentation | March 26, 2019 |
4th | Written paper deadline | June 4, 2019 |
4th | Final thesis presentation | June 18, 2019 |