language-analytics-assignment1

First assignment for language analytics course.

The assignment is about extracting part-of-speech (POS) tag and named entity recognition (NER) data from the Uppsala Student English Corpus (USE) using the spaCy NLP framework. The corpus can be downloaded from the official website.
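
At its core, the extraction boils down to something like the following (a minimal sketch, not the actual script; function and variable names are illustrative):

    import spacy
    from collections import Counter

    nlp = spacy.load("en_core_web_sm")

    def analyse(text):
        doc = nlp(text)
        # count UPOS tags across all tokens
        pos_counts = Counter(token.pos_ for token in doc)
        # collect unique named entities, then count them per category
        unique_entities = {(ent.text, ent.label_) for ent in doc.ents}
        entity_counts = Counter(label for _, label in unique_entities)
        return pos_counts, entity_counts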

Setup

The corpus needs to be placed in the data/ folder, with the USEcorpus folder containing one subfolder per subcorpus. The file hierarchy should follow this structure:

- data
  - USEcorpus
    - a1
      - 1011.a1.txt
      ...
      - 5031.a1.txt
    ...
    - c1
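
Given that layout, a script can walk the corpus roughly like this (a hypothetical sketch; the file encoding is an assumption):

    from pathlib import Path

    corpus_dir = Path("data/USEcorpus")
    for subcorpus in sorted(corpus_dir.iterdir()):  # a1, a2, ..., c1
        if not subcorpus.is_dir():
            continue
        for file_path in sorted(subcorpus.glob("*.txt")):
            # the USE files may not be UTF-8; latin-1 is an assumption here
            text = file_path.read_text(encoding="latin-1")
            # ... pass text to the analysis function ...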

Install the requirements for the script:

pip install -r requirements.txt
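
If the spaCy model is not already pulled in via requirements.txt, download it separately:

python3 -m spacy download en_core_web_sm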

Usage

Run the script:

python3 src/run_analysis.py

This will produce one .csv file per subcorpus in the output/ folder:

- output
  - a1.csv
  ...
  - c1.csv

Each row of a table contains the results for one file in the subcorpus: the relative frequencies of UPOS tags per 10,000 words and the number of unique named entities per entity category.
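
Concretely, each relative-frequency value is computed as follows (a one-line sketch; variable names are illustrative):

    rel_freq = tag_count / total_tokens * 10000  # occurrences per 10,000 words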

Additionally, the script produces a .csv file with the CO2 emissions of the subtasks in the code (in the emissions/ folder). This is needed for Assignment 5 and is not directly relevant to this assignment.

Note: The emissions/emissions.csv file should be ignored, because codecarbon cannot track process-level and task-level emissions at the same time.
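
The task-level tracking referred to above looks roughly like this (a hedged sketch; the task name is illustrative):

    from codecarbon import EmissionsTracker

    tracker = EmissionsTracker(output_dir="emissions")
    tracker.start_task("process_subcorpus")
    # ... run one subtask, e.g. process a subcorpus ...
    _ = tracker.stop_task()  # records this task's emissions
    _ = tracker.stop()       # flushes the output files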

Potential Limitations

The code in this repository uses the en_core_web_sm spaCy model. Results are likely to be slightly inaccurate, as this small model is not the most accurate of the English spaCy models; a transformer-based pipeline (e.g. en_core_web_trf) would likely outperform it at POS tagging and named entity recognition. Efficiency could also be improved by disabling unnecessary components in the pipeline, such as the parser or the lemmatizer, as sketched below.
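
A minimal sketch of the latter, keeping only the components that POS tagging and NER need:

    import spacy

    # the parser and lemmatizer are not needed for UPOS counts or NER,
    # so disabling them saves time and memory
    nlp = spacy.load("en_core_web_sm", disable=["parser", "lemmatizer"])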