Primary language: Jupyter Notebook. License: MIT.

A3-hcds-hcc-bias

The goal of this project is to explore the concept of bias through data on Wikipedia articles. The project focuses on articles about political figures from a variety of countries. The analysis compares countries by how well their politicians are covered on Wikipedia and by the quality of the articles about them.

Data sources

The project uses one API and two existing datasets as data sources.

1. The ORES API (documentation, endpoint)

The ORES API is a service that provides information about the quality of revisions of Wikipedia articles.
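As a rough illustration of how article quality predictions might be requested and read, the sketch below builds a request URL for a batch of revision IDs and pulls the predicted class out of a response. The endpoint pattern, the `articlequality` model name, the `enwiki` context and the response layout are assumptions based on the public ORES v3 scores API, not taken from this project's notebook; the helper names and the revision IDs are hypothetical. No network request is made.

```python
from urllib.parse import urlencode

# Assumed endpoint pattern of the ORES v3 scores API; verify against the
# linked documentation before relying on it.
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/{context}/"


def build_ores_url(rev_ids, context="enwiki", model="articlequality"):
    """Build a scores URL for a batch of revision IDs (hypothetical helper)."""
    query = urlencode({
        "models": model,
        "revids": "|".join(str(rev_id) for rev_id in rev_ids),
    })
    return ORES_ENDPOINT.format(context=context) + "?" + query


def extract_predictions(response, context="enwiki", model="articlequality"):
    """Map revision ID -> predicted quality class, or None if no score exists."""
    predictions = {}
    for rev_id, models in response[context]["scores"].items():
        score = models.get(model, {}).get("score")
        predictions[int(rev_id)] = score["prediction"] if score else None
    return predictions


# Hand-written stand-in for a parsed JSON response (assumed layout, made-up IDs).
sample = {
    "enwiki": {
        "scores": {
            "123": {"articlequality": {"score": {"prediction": "Stub"}}},
            "456": {"articlequality": {"error": {"type": "RevisionNotFound"}}},
        }
    }
}

print(build_ores_url([123, 456]))
print(extract_predictions(sample))
```

Revisions for which the service returns an error instead of a score are mapped to `None` here, so they can be filtered out before any quality statistics are computed.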

2. A dataset of Wikipedia articles (documentation, download)

This dataset contains data on most English-language Wikipedia articles within the category "Category:Politicians by nationality". It was published by Os Keyes and is licensed under CC-BY 4.0.

3. A dataset of country populations (documentation, download)

This dataset contains country population figures as of the end of 2019. Note: the downloaded file was edited before use; the edited file is available at src/_data/export_2019.csv.

Licensing

No licensing information was found for the ORES API or the country population dataset, so please make sure you use these data sources appropriately. All resulting datasets follow the same license as the Wikipedia articles dataset (CC-BY 4.0).

Results

The results are six CSV-formatted data files in the folder results.

Content

  1. country_coverage_data_top_10.csv: The countries with the greatest coverage of politicians on Wikipedia compared to their population
  2. country_coverage_data_bottom_10.csv: The countries with the least coverage of politicians on Wikipedia compared to their population
  3. country_relative_quality_data_top_10.csv: The countries with the highest proportion of high quality articles about politicians
  4. country_relative_quality_data_bottom_10.csv: The countries with the lowest proportion of high quality articles about politicians
  5. region_coverage_data.csv: The ranking of geographic regions by coverage of politicians
  6. region_relative_quality_data.csv: The ranking of geographic regions by proportion of high quality articles
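The two measures behind these files can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes that coverage is the number of politician articles as a percentage of the population, and that "high quality" means a predicted class of `FA` or `GA`. The function names and the toy input are hypothetical.

```python
from collections import defaultdict

# Assumption: these predicted classes count as "high quality".
HIGH_QUALITY = {"FA", "GA"}


def country_metrics(articles, populations):
    """Compute per-country coverage and relative quality.

    articles: iterable of (country, predicted_class) pairs.
    populations: dict mapping country -> population.
    Returns dict country -> (coverage_percent, relative_quality_percent).
    """
    total = defaultdict(int)
    high = defaultdict(int)
    for country, predicted_class in articles:
        total[country] += 1
        if predicted_class in HIGH_QUALITY:
            high[country] += 1

    metrics = {}
    for country, count in total.items():
        population = populations.get(country)
        if not population:
            continue  # skip countries without usable population data
        coverage = 100.0 * count / population
        relative_quality = 100.0 * high[country] / count
        metrics[country] = (coverage, relative_quality)
    return metrics


def top_n(metrics, index, n=10, reverse=True):
    """Rank countries by one metric (index 0: coverage, 1: relative quality)."""
    ranked = sorted(metrics.items(), key=lambda item: item[1][index], reverse=reverse)
    return [(country, values[index]) for country, values in ranked[:n]]


# Toy input with placeholder names; not real data.
articles = [("Country A", "FA"), ("Country A", "Stub"), ("Country B", "Stub")]
populations = {"Country A": 1000, "Country B": 200}

metrics = country_metrics(articles, populations)
print(top_n(metrics, index=0, n=1))
```

Calling `top_n` with `reverse=False` gives the bottom of the ranking, and the same aggregation can be applied per region by passing region names instead of country names.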

Formats

Files 1 & 2

| column name | column description |
| --- | --- |
| country | Country name |
| coverage | Coverage of politicians on Wikipedia relative to the country's population |

Files 3 & 4

| column name | column description |
| --- | --- |
| country | Country name |
| relative_quality | Percentage of all articles that are high quality |

File 5

| column name | column description |
| --- | --- |
| region | Region name |
| coverage | Coverage of politicians on Wikipedia in the region |

File 6

| column name | column description |
| --- | --- |
| region | Region name |
| relative_quality | Percentage of all articles that are high quality |
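A result file in one of these formats could be read with the standard library as sketched below. The sketch assumes comma-separated files with a header row that matches the column names listed above; the helper name and the sample rows are hypothetical placeholders, not actual results.

```python
import csv
import io


def read_result_rows(handle, key_column, value_column):
    """Read (name, value) pairs from an open CSV file with a header row."""
    reader = csv.DictReader(handle)
    return [(row[key_column], float(row[value_column])) for row in reader]


# In-memory stand-in for a results file (placeholder values, not real data).
sample_csv = "country,coverage\nCountry A,0.5\nCountry B,0.25\n"

rows = read_result_rows(io.StringIO(sample_csv), "country", "coverage")
for name, value in rows:
    print(name, value)
```

For a real file, the same function can be given a handle opened with `open(path, newline="")`, for example on a file in the results folder, with `region` and `relative_quality` as column names for file 6.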

Getting started

Prerequisites

In order to use this project (especially the Jupyter notebook), please ensure that you have Python 3.6.1 or later, a working installation of Poetry, and git installed.
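These prerequisites can be checked quickly before setup. The snippet below is an optional sketch and not part of the repository; it assumes the executables are named `poetry` and `git` and are expected on the `PATH`, and the function name is hypothetical.

```python
import shutil
import sys

# Minimum Python version stated in the prerequisites.
MIN_PYTHON = (3, 6, 1)


def check_prerequisites(tools=("poetry", "git")):
    """Return a list of human-readable problems; an empty list means all is fine."""
    problems = []
    if sys.version_info[:3] < MIN_PYTHON:
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append("Python 3.6.1 or later is required, found " + found)
    for tool in tools:
        # shutil.which returns None when the executable is not on the PATH.
        if shutil.which(tool) is None:
            problems.append(tool + " was not found on the PATH")
    return problems


for problem in check_prerequisites():
    print(problem)
```

If nothing is printed, the interpreter version is sufficient and both tools were found.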

Setup

  1. Clone this repository (via HTTPS as shown below, or via SSH) and change into the repo root.

    git clone https://github.com/marisanest/A2-hcds-hcc.git
    cd A2-hcds-hcc

  2. Install the dependencies in the repo root.

    poetry install

  3. Create a subshell within the virtual environment by running:

    poetry shell

  4. Open the project with Jupyter in your browser.

    jupyter notebook