This repository holds the code for the 5th and final project for Language Analytics (S2023). It contains a local Python package called TextSpace, which allows for interactive exploration of a corpus of text documents. Furthermore, it contains a demo of the package using a corpus of Danish song lyrics, as well as a script for initialising a dash app for interactive exploration of the lyrics in embedding space projected onto 3 principal components.
To demonstrate the functionality of the package, a collection of lyrics from Danish songs by 10 artists was scraped from Genius.com. Up to 5 songs from each artist were scraped, but only the Danish songs were saved.
Artist | Number of songs |
---|---|
Anne Linnet | 5 |
Kim Larsen | 5 |
Guldimund | 5 |
Lis Sørensen | 4 |
Medina | 3 |
Nephew | 3 |
Sanne Salomonsen | 5 |
The Minds of 99 | 4 |
Zar Paulo | 5 |
Total | 39 |
The code was developed and tested on a MacBook Pro running macOS (Ventura v3.3., Python v3.10.7). Furthermore, the pipeline was tested on uCloud (Ubuntu v22.10m, Coder python v1.77.3, Python v3.10.7). There, all code runs as expected except for the dash app, which does not connect correctly to the host.
To reproduce the results of the example using the Danish songs, follow the steps below. All terminal commands should be run from the root directory of the repository.
- Clone the repository
- Acquire a Genius API key and paste it into the `TOKEN.txt` file
- Create a virtual environment and install the requirements: `bash setup.sh`
- Run the `run.sh` script to:
  - Scrape Danish song lyrics from Genius
  - Preprocess the lyrics and prepare a dataframe with the appropriate columns for TextSpace
  - Use TextSpace to create 3D visualisations of the lyrics using BoW, GPT-2, latent Dirichlet allocation, and emotion embeddings

  Note: As the repository already holds all the files created by running the script, you can skip this command if you just want to run the dash app: `bash run.sh`
- To run the dash app, run the following commands: `source env/bin/activate`, then `python examples/src/dash_app.py`
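One of the embedding types named in the steps above is bag-of-words (BoW). The snippet below is a minimal, standard-library-only sketch of what a BoW representation looks like. It is not how TextSpace computes its embeddings; the function names `tokenize` and `bag_of_words` and the two example sentences are made up for illustration.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens.

    In Python 3, \\w is Unicode-aware, so Danish letters (æ, ø, å) are kept.
    """
    return re.findall(r"\w+", text.lower())


def bag_of_words(documents: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return a sorted vocabulary and one word-count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the set of all words seen in any document.
    vocabulary = sorted(set().union(*counts))
    # Each document becomes a vector of counts, one entry per vocabulary word.
    vectors = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, vectors


# Hypothetical example sentences, not taken from the scraped lyrics.
vocabulary, vectors = bag_of_words(["Jeg elsker dig", "Jeg savner dig, dig, dig"])
print(vocabulary)  # ['dig', 'elsker', 'jeg', 'savner']
print(vectors)     # [[1, 1, 1, 0], [3, 0, 1, 1]]
```

Each document ends up as a vector with one dimension per vocabulary word. Vectors like these are typically high-dimensional, which is why they are projected onto 3 principal components before plotting.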
If you want to display the embeddings of your own corpus, you can do so by simply providing a dataframe with the following columns:
- `title`
- `author`
- `text_full`

The `text_full` column should contain the full text of the document, while the `title` and `author` columns should contain the title and author of the document respectively. The `title` and `author` columns are used for labelling the documents in the visualisation. The `text_full` column is used for calculating the embeddings. All columns should contain strings.
Place the dataframe in the `data` folder and name it `plotly_data.csv`. You can then run either `run.sh` or the `dash_app.py` script to visualise the embeddings of your own corpus.
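As a rough illustration of the expected input, the sketch below writes a CSV with the three required string columns using only the standard library. The helper name `write_corpus_csv` is hypothetical and not part of TextSpace, and the sketch assumes the scripts read the file by column name; saving a pandas dataframe with the same columns should work equally well.

```python
import csv
from pathlib import Path

# Column names taken from the description above.
REQUIRED_COLUMNS = ("title", "author", "text_full")


def write_corpus_csv(rows, path):
    """Validate the documents and write them to a CSV file.

    `rows` is an iterable of dicts, one per document. Every dict must
    contain the three required columns, and every value must be a string.
    """
    rows = list(rows)
    for row in rows:
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        if not all(isinstance(row[col], str) for col in REQUIRED_COLUMNS):
            raise TypeError("all column values must be strings")

    path = Path(path)
    # Create the target folder (e.g. data/) if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=REQUIRED_COLUMNS)
        writer.writeheader()
        # Only the required columns are written; extra keys are dropped.
        writer.writerows({col: row[col] for col in REQUIRED_COLUMNS} for row in rows)
```

Calling `write_corpus_csv(documents, "data/plotly_data.csv")` from the repository root, with `documents` being your own list of dicts, would overwrite the example file, so keep a copy if you want to restore the lyrics demo.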
├── data
│ ├── plotly_data.csv
│ ├── lyrics
│ │ ├── Anne_Linnet-Barndommens_Gade.txt
│ │ └── ...
├── env <- Not included in repo
├── examples
│ ├── src
│ │ ├── dash_app.py
│ │ ├── preprocess_lyrics.py
│ │ ├── scrape_songs.py
│ │ └── text_space.py
│ └── ...
├── TextSpace <- Local python package
│ ├── __init__.py
│ ├── dash_application.py
│ ├── data.py
│ └── plot3D.py
├── .gitignore
├── README.md
├── requirements.txt
├── run.sh
├── setup.sh
└── TOKEN.txt <- Remember to add your own Genius API key
To view the results of the example using the Danish songs, follow these links:
These can also be found in the `examples` folder of the repository. Additionally, the dash app provides an interactive way of exploring the corpus. It allows you to switch seamlessly between the types of embeddings and to display the full text of a song by clicking on it.