The goal of this project is to explore the concept of bias through data on Wikipedia articles. The project focuses on articles about political figures from a variety of countries. The analysis shows how well politicians are covered on Wikipedia and how the quality of articles about politicians differs between countries.
The project uses one API and two existing datasets as data sources.
1. The ORES API (documentation, endpoint)
The ORES API is a service that provides information about the quality of revisions of Wikipedia articles.
2. A dataset of Wikipedia articles (documentation, download)
This dataset contains data on most English-language Wikipedia articles within the category "Category:Politicians by nationality". It was published by Os Keyes and is licensed under CC-BY 4.0.
3. A dataset of country populations (documentation, download)
This dataset includes information about the population of countries at the end of the year 2019. Note: the downloaded file was edited before use. The resulting file can be found at `src/_data/export_2019.csv`.
No licensing information was found for the ORES API or the country population dataset, so please make sure you are using these data sources properly. All resulting datasets follow the same licensing policy as the Wikipedia articles dataset (CC-BY 4.0).
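As a rough illustration of how article quality predictions might be requested from ORES, here is a minimal sketch. The base URL, the `enwiki` context, the `articlequality` model name, and the response layout are assumptions based on the public ORES v3 interface and are not confirmed by this project; the helper names `build_ores_url` and `extract_predictions` are hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL of the ORES v3 scoring endpoint (not taken from this project).
ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def build_ores_url(rev_ids, context="enwiki", model="articlequality"):
    """Build a scoring URL for a batch of revision IDs (hypothetical helper)."""
    query = urlencode({
        "models": model,
        "revids": "|".join(str(rev_id) for rev_id in rev_ids),
    })
    return f"{ORES_BASE}/{context}/?{query}"


def extract_predictions(response, context="enwiki", model="articlequality"):
    """Map each revision ID to its predicted quality class.

    Assumes the response is the decoded JSON body with the layout
    {context: {"scores": {rev_id: {model: {"score": {"prediction": ...}}}}}}.
    Revisions without a score (for example, errors) map to None.
    """
    predictions = {}
    scores = response.get(context, {}).get("scores", {})
    for rev_id, models in scores.items():
        score = models.get(model, {}).get("score")
        predictions[int(rev_id)] = score["prediction"] if score else None
    return predictions
```

The URL can be fetched with any HTTP client, and the decoded JSON body passed to `extract_predictions`. Keeping the request building and the parsing separate makes the parsing step easy to check without network access.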
As a result, you can find six CSV-formatted data files in the folder `results`.
1. `country_coverage_data_top_10.csv`: The countries with the greatest coverage of politicians on Wikipedia compared to their population
2. `country_coverage_data_bottom_10.csv`: The countries with the least coverage of politicians on Wikipedia compared to their population
3. `country_relative_quality_data_top_10.csv`: The countries with the highest proportion of high-quality articles about politicians
4. `country_relative_quality_data_bottom_10.csv`: The countries with the lowest proportion of high-quality articles about politicians
5. `region_coverage_data.csv`: The ranking of geographic regions by coverage of politicians
6. `region_relative_quality_data.csv`: The ranking of geographic regions by proportion of high-quality articles
Files 1 & 2
column name | column description |
---|---|
country | Country name |
coverage | Coverage of politicians on Wikipedia compared to the country's population |
Files 3 & 4
column name | column description |
---|---|
country | Country name |
relative_quality | Percentage of all articles that are high quality |
File 5
column name | column description |
---|---|
region | Region name |
coverage | Coverage of politicians on Wikipedia in the region |
File 6
column name | column description |
---|---|
region | Region name |
relative_quality | Percentage of all articles that are high quality |
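To make the two columns concrete, the following is a minimal sketch of how `coverage` and `relative_quality` could be computed per country. It is not the notebook's actual implementation: treating the classes `FA` and `GA` as "high quality", expressing coverage as a percentage of the population, and the function name `country_metrics` are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: these predicted classes count as "high quality".
HIGH_QUALITY = {"FA", "GA"}


def country_metrics(articles, populations):
    """Compute coverage and relative quality per country (illustrative only).

    articles:    iterable of (country, predicted_quality_class) pairs
    populations: dict mapping country name to population
    """
    totals = Counter()
    high = Counter()
    for country, quality in articles:
        totals[country] += 1
        if quality in HIGH_QUALITY:
            high[country] += 1

    metrics = {}
    for country, n_articles in totals.items():
        population = populations.get(country)
        if not population:
            # Skip countries with missing or zero population data.
            continue
        metrics[country] = {
            # Assumed definition: articles as a percentage of the population.
            "coverage": 100.0 * n_articles / population,
            # High-quality articles as a percentage of all articles.
            "relative_quality": 100.0 * high[country] / n_articles,
        }
    return metrics
```

Sorting the resulting dictionary by either value and taking the first or last ten entries would give rankings in the spirit of the top-10 and bottom-10 files.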
To use this project (especially the Jupyter notebook), please ensure that you have Python 3.6.1 or later, a working installation of Poetry, and git installed.
1. Clone this repository (or use SSH) and move into the repo root.

       git clone https://github.com/marisanest/A2-hcds-hcc.git
       cd A2-hcds-hcc

2. Install the dependencies in the repo root.

       poetry install

3. Create a subshell within the virtual environment by running:

       poetry shell

4. Open the project with Jupyter in your browser.

       jupyter notebook