A3-hcds-hcc-bias

The goal of this project is to explore the concept of bias through data on Wikipedia articles. The project focuses on articles on political figures from a variety of countries. The analysis shows how the coverage of politicians on Wikipedia and the quality of the articles about them vary between countries.

Data sources

One API and two existing datasets are used as data sources.

1. The ORES API (documentation, endpoint)

The ORES API is a service that provides information about the quality of revisions of Wikipedia articles.
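
As a rough illustration of how quality predictions might be read from the service, the sketch below builds a request URL and extracts the predicted class per revision from a response. The URL template, the `enwiki` context, the `wp10` model name, and the response layout are assumptions based on the ORES v3 scores interface, not details taken from this project, and both helper names are hypothetical.

```python
from typing import Dict, Iterable, Optional

# Assumed URL template for the ORES v3 scores endpoint (not taken from this repo).
ORES_URL = "https://ores.wikimedia.org/v3/scores/{context}/?models={model}&revids={revids}"


def build_ores_url(rev_ids: Iterable[int], context: str = "enwiki", model: str = "wp10") -> str:
    """Build a scores request URL for a batch of revision IDs."""
    joined = "|".join(str(rev_id) for rev_id in rev_ids)
    return ORES_URL.format(context=context, model=model, revids=joined)


def extract_predictions(response: dict, context: str = "enwiki", model: str = "wp10") -> Dict[int, Optional[str]]:
    """Map each revision ID to its predicted class, or None if no score was returned."""
    scores = response.get(context, {}).get("scores", {})
    predictions = {}
    for rev_id, models in scores.items():
        score = models.get(model, {}).get("score")
        predictions[int(rev_id)] = score["prediction"] if score else None
    return predictions


# Hand-written sample in the assumed response shape; no network call is made here.
sample = {
    "enwiki": {
        "scores": {
            "1001": {"wp10": {"score": {"prediction": "Stub"}}},
            "1002": {"wp10": {"error": {"type": "RevisionNotFound"}}},
        }
    }
}
print(build_ores_url([1001, 1002]))
print(extract_predictions(sample))
```

Keeping URL construction and response parsing separate from the HTTP call makes both steps easy to check without network access.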

2. A dataset of Wikipedia articles (documentation, download)

This dataset contains data on most English-language Wikipedia articles within the category "Category:Politicians by nationality". It was published by Os Keyes and is licensed under CC-BY 4.0.

3. A dataset of country populations (documentation, download).

This dataset includes information about the population of countries at the end of the year 2019. Note: the downloaded file was edited before use. The resulting file can be found at src/_data/export_2019.csv.

Licensing

No licensing information was found for the ORES API or the country population dataset, so please make sure you use these data sources properly. All resulting datasets follow the same licensing policy as the Wikipedia articles dataset (CC-BY 4.0).

Results

The results are six CSV-formatted data files in the folder results.

Content

country_coverage_data_top_10.csv: The countries with the greatest coverage of politicians on Wikipedia compared to their population
country_coverage_data_bottom_10.csv: The countries with the least coverage of politicians on Wikipedia compared to their population
country_relative_quality_data_top_10.csv: The countries with the highest proportion of high-quality articles about politicians
country_relative_quality_data_bottom_10.csv: The countries with the lowest proportion of high-quality articles about politicians
region_coverage_data.csv: The ranking of geographic regions by coverage of politicians
region_relative_quality_data.csv: The ranking of geographic regions by proportion of high-quality articles
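
The two measures behind these files can be sketched as follows. This is only an illustration: it assumes that coverage is the number of politician articles expressed as a percentage of the population, and that "high quality" means a predicted class of FA or GA. Neither definition is spelled out in this README, and the function names are hypothetical.

```python
from typing import Iterable

# Assumption: these predicted classes count as "high quality".
HIGH_QUALITY_CLASSES = frozenset({"FA", "GA"})


def coverage(article_count: int, population: int) -> float:
    """Articles as a percentage of the population (assumed definition)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 100.0 * article_count / population


def relative_quality(predictions: Iterable[str]) -> float:
    """Percentage of articles whose predicted class is high quality."""
    predictions = list(predictions)
    if not predictions:
        raise ValueError("at least one article is required")
    high = sum(1 for p in predictions if p in HIGH_QUALITY_CLASSES)
    return 100.0 * high / len(predictions)


# Made-up numbers, for illustration only.
print(coverage(50, 1000000))
print(relative_quality(["FA", "Stub", "GA", "Start"]))
```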

Formats

Files 1 & 2

column name	column description
country	Country name
coverage	Coverage of politicians on Wikipedia compared to the country's population

Files 3 & 4

column name	column description
country	Country name
relative_quality	Percentage of high-quality articles among all articles about politicians

File 5

column name	column description
region	Region name
coverage	Coverage of politicians on Wikipedia compared to the region's population

File 6

column name	column description
region	Region name
relative_quality	Percentage of high-quality articles among all articles about politicians
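
A result file can be loaded with the standard library alone. The sketch below assumes a comma-separated file with a header row that uses the column names listed above. The sample rows are made up for illustration and the helper name is hypothetical.

```python
import csv
import io
from typing import List, Tuple


def read_ranking(handle, key_column: str, value_column: str) -> List[Tuple[str, float]]:
    """Read (name, value) pairs from a CSV file object, highest value first."""
    rows = [(row[key_column], float(row[value_column])) for row in csv.DictReader(handle)]
    return sorted(rows, key=lambda pair: pair[1], reverse=True)


# Made-up rows standing in for a file such as results/region_coverage_data.csv.
sample = io.StringIO("region,coverage\nRegion A,0.002\nRegion B,0.010\n")
for name, value in read_ranking(sample, "region", "coverage"):
    print(name, value)
```

For a real file, the in-memory sample would be replaced by something like `open("results/region_coverage_data.csv", newline="")`.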

Getting started

Prerequisites

To use this project (especially the Jupyter notebook), please ensure that you have Python 3.6.1 or later, a working installation of Poetry, and git installed.
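
The Python requirement can be checked from the interpreter itself. A minimal sketch, with a hypothetical helper name:

```python
import sys

MIN_PYTHON = (3, 6, 1)  # minimum version stated in the prerequisites


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:3]) >= MIN_PYTHON


print(python_is_supported())
```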

Setup

Clone this repository (or use SSH) and change into the repo root.

git clone https://github.com/marisanest/A2-hcds-hcc.git
cd A2-hcds-hcc
Install the dependencies in the repo root.

poetry install
Create a subshell within the virtual environment by running:

poetry shell
Open the project with Jupyter in your browser.

jupyter notebook

marisanest/A3-hcds-hcc-bias