This is the repository for the first assignment in the course Language Analytics, part of the bachelor's elective Cultural Data Science at Aarhus University. The code was written by me independently.

This assignment concerns using spaCy to extract linguistic information from a corpus of texts. The corpus is an interesting one: the Uppsala Student English Corpus (USE). All of the data is included in the folder called `in`, but you can access more documentation via this link.
For this exercise, you should write some code which does the following:

- Loop over each text file in the folder called `in`
- Extract the following information:
  - Relative frequency of nouns, verbs, adjectives, and adverbs per 10,000 words
  - Total number of unique PER, LOC, and ORG entities
- For each sub-folder (a1, a2, a3, ...) save a table which shows the following information:
Filename | RelFreq NOUN | RelFreq VERB | RelFreq ADJ | RelFreq ADV | Unique PER | Unique LOC | Unique ORG |
---|---|---|---|---|---|---|---|
file1.txt | --- | --- | --- | --- | --- | --- | --- |
file2.txt | --- | --- | --- | --- | --- | --- | --- |
etc | --- | --- | --- | --- | --- | --- | --- |
This assignment is designed to test that you can:

- Work with multiple input data arranged hierarchically in folders;
- Use spaCy to extract linguistic information from text data;
- Save those results in a clear way which can be shared or used for future analysis.
- The data is arranged in various subfolders related to their content (see the README for more info). You'll need to think a little bit about how to do this. You should be able to do it using a combination of things we've already looked at, such as `os.listdir()`, `os.path.join()`, and for loops.
- The text files contain some extra information, such as the document ID and other metadata, that occurs between angle brackets `<>`. Make sure to remove these as part of your preprocessing steps!
- There are 14 subfolders (a1, a2, a3, etc.), so when completed the folder `out` should have 14 CSV files.
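For illustration, the folder traversal and per-subfolder CSV writing could be sketched like this (the paths, the `latin-1` encoding, and the placeholder feature function are assumptions; the real script would call the spaCy step instead):

```python
import os
import re
import pandas as pd

def extract_features(text: str) -> dict:
    """Placeholder for the spaCy step; here it just counts words."""
    return {"Tokens": len(text.split())}

def process_corpus(in_dir: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    for subfolder in sorted(os.listdir(in_dir)):  # a1, a2, ..., c1
        sub_path = os.path.join(in_dir, subfolder)
        if not os.path.isdir(sub_path):
            continue
        rows = []
        for filename in sorted(os.listdir(sub_path)):
            with open(os.path.join(sub_path, filename), encoding="latin-1") as f:
                # Strip the metadata between angle brackets before analysis
                text = re.sub(r"<[^>]*>", "", f.read())
            rows.append({"Filename": filename, **extract_features(text)})
        # One CSV per subfolder, named after the subfolder
        out_path = os.path.join(out_dir, f"{subfolder}_table.csv")
        pd.DataFrame(rows).to_csv(out_path, index=False)
```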
Your code should include functions that you have written wherever possible. Try to break your code down into smaller, self-contained parts, rather than having it as one long set of instructions.

For this assignment, you are welcome to submit your code either as a Jupyter Notebook or as a `.py` script. If you do not know how to write `.py` scripts, don't worry - we're working towards that!

Lastly, you are welcome to edit this README file to contain whatever information you like. Remember - documentation is important!
The script `extract_features.py`, located in the folder `src`, extracts linguistic features with spaCy. The data is loaded and preprocessed: the package `fnmatch` is used to match the text files, and the document ID and other metadata contained in angle brackets (`<>`) are removed from the texts with regex pattern matching and substitution.
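The substitution itself can be illustrated on a made-up header line that follows the corpus's angle-bracket convention:

```python
import re

sample = "<doc.id=0100.a1> <title> My essay </title> The essay text starts here."
# Remove each <...> span; [^>]* cannot cross a closing bracket,
# so adjacent tags are removed separately
cleaned = re.sub(r"<[^>]*>", "", sample)
print(cleaned.split())
# → ['My', 'essay', 'The', 'essay', 'text', 'starts', 'here.']
```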
The script loops over the text files located in the subfolders of the data folder `in/USEcorpus`. The linguistic features are saved as data tables stored in CSV files: one for each subfolder, named after the subfolder.
The script is written in Python 3.10.7. Make sure this is the Python version you have installed on your device.

From the command line, clone this GitHub repository to your current location (which can be changed with `cd path/to/desired/location`) by running the command `git clone https://github.com/NiGitaMyrGit/Lang_assignment1.git`.
From the command line, located in the main directory, run the command `bash setup.sh`. This will install all required packages from the `requirements.txt` file.
The Uppsala Student English Corpus (USE)

The data is located in the folder `in` as a zip file. In order to unzip the data, from the command line, located in the folder `in`, run the command `unzip USEcorpus.zip`. This will unpack a folder called `USEcorpus` with 14 subfolders named a1-a5, b1-b8, and c1. The zip file was retrieved from this link. More information can be found in the `readme.md` located in the `in` folder.
From the command line, located in the main directory, run the command `python3 src/extract_features.py`.
The script produces a table for each subfolder, showing the relative frequency per 10,000 words of nouns (RelFreq NOUN), verbs (RelFreq VERB), adjectives (RelFreq ADJ), and adverbs (RelFreq ADV). Furthermore, it shows the total number of unique persons (Unique PER), locations (Unique LOC), and organisations (Unique ORG). An example from the first table, named `a1_table.csv`:
Subdirectory | Filename | RelFreq NOUN | RelFreq VERB | RelFreq ADJ | RelFreq ADV | Unique PER | Unique LOC | Unique ORG |
---|---|---|---|---|---|---|---|---|
a1 | 0100.a1 | 1455.06 | 1283.88 | 798.86 | 556.35 | 0 | 0 | 0 |
a1 | 2049.a1 | 1496.55 | 1144.83 | 531.03 | 717.24 | 0 | 1 | 3 |
The relative frequency per 10,000 words of nouns in the file `0100.a1.txt` is 1455.06, which means that approximately 14.6% of the words are nouns. The same can be read off from the relative frequencies of verbs, adjectives, and adverbs. The number of unique person, location, and organisation entities is 0, although "the US" is actually mentioned in the text. A likely explanation is that spaCy's English models tag countries and cities with the label GPE (geopolitical entity) rather than LOC, so they are missed when filtering on LOC.
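The conversion from a relative frequency per 10,000 words to a percentage is just a division by 100, e.g. for the noun frequency of `0100.a1` from the table above:

```python
rel_freq = 1455.06  # nouns per 10,000 words in 0100.a1
print(f"{rel_freq / 100:.1f}%")  # → 14.6%
```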
Looking at the line for the file `2049.a1.txt`, some unique locations and organisations are accounted for. In the text file a person, "Inga", is mentioned, but she is not counted as a person; neither is "James Bond", who is definitely a well-known person. When going through all the output tables, it seems that no person entities are counted at all, which is rather odd: since the unique entity counts for organisations and persons are produced in the same way, there should, in theory, be no difference. A likely explanation is that spaCy's English models label persons PERSON rather than PER, so filtering on the label PER returns no matches. The unique organisations are probably Sky Channel, Sky One, and Fun Factory, which makes sense.