/data-science-x-am

Here Sletten and I will make some nice analyses of the Gutenberg dataset. Get ready!


Exam Project: Data Science

MSc Cognitive Science 2022

Thea Rolskov Sloth & Astrid Sletten Rybner

Project information

This repository contains code for reproducing our analysis of gender representation in English literature from the nineteenth and twentieth centuries.

Repository structure

An overview of the scripts used for the analysis can be found below.

Folder            Description
data_exploration  data exploration
meta              metadata files for analyzed texts
src               main analysis scripts
output            output folder for the different scripts
timeseries        scripts for timeseries analysis

Usage

To reproduce the analysis, first clone this repository and install the requirements:

git clone https://github.com/thearol/data-science-x-am
cd data-science-x-am
pip install -r requirements.txt

You then need to download the full Project Gutenberg corpus via The Standardized Project Gutenberg Corpus.

To analyze the same texts used in this analysis, the full corpus can then be filtered down using the metadata files in the meta folder. The txt files of the books should then be placed in a folder named data.
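The filtering step could be sketched roughly as follows. This is only an illustration, not the procedure we used: it assumes that a metadata file is a CSV with an `id` column and that each text file is named `<id>_text.txt` (for example `PG1342_text.txt`). The function names, the `id_column` parameter and the file layout are hypothetical and should be adapted to the actual files in the meta folder.

```python
import csv
import shutil
from pathlib import Path


def read_book_ids(meta_csv, id_column="id"):
    """Return the set of book ids listed in one metadata CSV file.

    Assumption: the file has a header row containing `id_column`.
    """
    with open(meta_csv, newline="", encoding="utf-8") as handle:
        return {
            row[id_column].strip()
            for row in csv.DictReader(handle)
            if row.get(id_column)
        }


def copy_selected_texts(corpus_dir, meta_csv, target_dir="data", id_column="id"):
    """Copy the text files whose id appears in the metadata file into target_dir."""
    wanted = read_book_ids(meta_csv, id_column)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        # Assumed naming scheme: "<id>_text.txt", e.g. "PG1342_text.txt".
        book_id = path.stem.split("_")[0]
        if book_id in wanted:
            shutil.copy(path, target / path.name)
            copied.append(path.name)
    return copied
```

Calling `copy_selected_texts("path/to/corpus/text", "meta/some_metadata.csv")` would then fill the data folder with the selected books; both paths here are placeholders.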

Subsequently, the two main analyses can be run with:

python src/gender-counts.py
python src/bodydescriptions.py

The first script extracts a count of all male and female pronouns from each book. The second script locates all body parts and their owners, as well as any adjectives describing each body part. The output files are saved to the output folder.
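To give a feel for what the pronoun count involves, here is a minimal sketch. It is not the code in src/gender-counts.py: the pronoun lists, the simple regex tokenization and the function name `count_gendered_pronouns` are assumptions made for illustration only.

```python
import re
from collections import Counter

# Assumed pronoun lists; the actual script may use different ones.
MALE_PRONOUNS = {"he", "him", "his", "himself"}
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}


def count_gendered_pronouns(text):
    """Count male and female pronouns in one text, ignoring case."""
    # Split into lowercase alphabetic tokens so that "the" is never
    # mistaken for "he" and punctuation is dropped.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {
        "male": sum(counts[p] for p in MALE_PRONOUNS),
        "female": sum(counts[p] for p in FEMALE_PRONOUNS),
    }
```

Applied to each txt file in the data folder, a function like this yields one pair of counts per book, which is the kind of table the first script writes to the output folder.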

Contact details

If you have any questions regarding the project itself or the code implementation, feel free to contact us via e-mail: Thea Rolskov Sloth & Astrid Sletten Rybner

Acknowledgements

We would like to give special thanks to the following: