The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. We combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.
There are several steps involved in this project:
- Collect the data
- Clean the data
  - Remove the empty rows and irrelevant columns (see the sketch after this list)
- Collect predictions
  - Use the ORES model to predict the quality of each article in the dataset
- Combine the two datasets
- Analyze and transform the data
- Present the results
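As a rough illustration of the cleaning step, the sketch below drops incomplete rows from the raw politicians file and keeps only the columns used downstream. The file paths come from the directory listing below, but the column names (`page`, `country`, `rev_id`) are assumptions; the notebook defines the actual cleaning rules.

```python
import pandas as pd

# Cleaning sketch; the exact rows and columns removed are decided in the notebook.
raw = pd.read_csv("data/raw/page_data.csv")

# Drop empty/incomplete rows.
cleaned = raw.dropna(how="any")

# Keep only the relevant columns (assumed names: page, country, rev_id).
cleaned = cleaned[["page", "country", "rev_id"]]

cleaned.to_csv("data/processed/politicians_country.csv", index=False)
```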
The analysis consists of a series of tables that show:
- the countries with the greatest and least coverage of politicians on Wikipedia relative to their population.
- the countries with the highest and lowest proportion of high-quality articles about politicians.
- a ranking of geographic regions by articles-per-person and by proportion of high-quality articles.
The directory and file structure of the repository is as follows:

    .
    ├── LICENSE
    ├── README.md
    ├── data
    │   ├── errors
    │   │   ├── missing_prediction_revids.csv       # rev_ids with missing predictions
    │   │   └── wp_wpds_countries-no_match.csv      # countries that could not be merged
    │   ├── processed
    │   │   ├── politicians_country.csv             # cleaned politicians-by-country data
    │   │   ├── world_population_country_level.csv  # cleaned world population data (country level)
    │   │   ├── world_population_region_level.csv   # cleaned world population data (region level)
    │   │   └── wp_wpds_politicians_by_country.csv  # merged cleaned data
    │   └── raw
    │       ├── WPDS_2020_data.csv                  # world population data
    │       └── page_data.csv                       # politicians-by-country data
    └── src
        └── hcds-a2-bias.ipynb                      # source code
To make the predictions, we use the ORES scoring interface. Specifically, we use the following API (a minimal request sketch follows the list):

- Scores Context
  - This route provides access to all `{models}` within a `{context}`. This path is useful for either exploring information about the `{models}` available within a `{context}`, or scoring one or more `{revids}` using one or more `{models}` at the same time.
  - Specifically, we obtain the `prediction` field from the API response. This prediction can be one of the following categories:
    - FA - Featured article
    - GA - Good article
    - B - B-class article
    - C - C-class article
    - Start - Start-class article
    - Stub - Stub-class article
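The sketch below is a minimal illustration of such a batch request against the public ORES endpoint, assuming the `enwiki` context and the `articlequality` model; the notebook's actual batching, retry, and error-logging logic may differ.

```python
import requests

ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/{context}"


def get_predictions(rev_ids, context="enwiki", model="articlequality"):
    """Return {rev_id: prediction} for a batch of revision IDs."""
    params = {
        "models": model,
        "revids": "|".join(str(r) for r in rev_ids),
    }
    response = requests.get(ORES_ENDPOINT.format(context=context), params=params)
    response.raise_for_status()
    scores = response.json()[context]["scores"]

    predictions = {}
    for rev_id in rev_ids:
        result = scores[str(rev_id)][model]
        # Unscorable revisions come back with an "error" key instead of "score";
        # those are the rev_ids logged to data/errors/missing_prediction_revids.csv.
        if "score" in result:
            predictions[rev_id] = result["score"]["prediction"]
    return predictions
```

Calling `get_predictions` with a list of revision IDs returns a dictionary mapping each scorable rev_id to one of the quality categories listed above.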
Once the dataset is cleaned and the predictions are obtained, we merge the two datasets and output a CSV file with the following schema (a sketch of the merge step follows the table):
| Column Name | Description |
|---|---|
| country | The country the article belongs to |
| article_name | The name/title of the article |
| revision_id | The unique identifier of the article revision that was scored |
| article_quality_est. | The ORES prediction of the article's quality |
| population | The population of the country |
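A minimal sketch of the merge step, assuming both processed files share a `country` column (column names and renaming details live in the notebook); rows that appear in only one dataset are logged to the error file listed in the directory tree above:

```python
import pandas as pd

politicians = pd.read_csv("data/processed/politicians_country.csv")
population = pd.read_csv("data/processed/world_population_country_level.csv")

# Outer merge on country so unmatched rows can be identified and logged.
merged = politicians.merge(population, on="country", how="outer", indicator=True)

# Rows found in only one dataset go to the error file.
no_match = merged[merged["_merge"] != "both"]
no_match.to_csv("data/errors/wp_wpds_countries-no_match.csv", index=False)

# Matched rows form the final dataset.
merged[merged["_merge"] == "both"].drop(columns="_merge").to_csv(
    "data/processed/wp_wpds_politicians_by_country.csv", index=False
)
```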
For the analysis, we attempt to answer the following six questions (a short analysis sketch follows the list):
- Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
- Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
- Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
- Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
- Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
- Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality
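A rough sketch of how the country-level rankings can be computed from the merged file (the region-level rankings are analogous, using `world_population_region_level.csv`); column names follow the schema above:

```python
import pandas as pd

# Load the merged dataset (schema described above).
df = pd.read_csv("data/processed/wp_wpds_politicians_by_country.csv")

per_country = df.groupby("country").agg(
    articles=("article_name", "count"),
    high_quality=("article_quality_est.", lambda s: s.isin(["FA", "GA"]).sum()),
    population=("population", "first"),
)

# Coverage: politician articles per person; quality: share of FA/GA articles.
per_country["coverage"] = per_country["articles"] / per_country["population"]
per_country["high_quality_share"] = per_country["high_quality"] / per_country["articles"]

print(per_country.nlargest(10, "coverage"))             # Q1: top 10 by coverage
print(per_country.nsmallest(10, "coverage"))            # Q2: bottom 10 by coverage
print(per_country.nlargest(10, "high_quality_share"))   # Q3: top 10 by relative quality
print(per_country.nsmallest(10, "high_quality_share"))  # Q4: bottom 10 by relative quality
```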
Anaconda was used to manage and install project dependencies. Additionally, a `requirements.txt` file is provided to install all dependencies used in the notebook.
The repo can be cloned using the following command:

`git clone https://github.com/sharma-apoorv/data-512-a2.git`

The dependencies can be installed into an environment using the following command:

`conda create --name <envname> --file requirements.txt`

The environment is activated as follows:

`conda activate <envname>`

Lastly, the Jupyter notebook server can be started using the following command:

`jupyter-notebook`

Executing the above command will open a link in the browser. Navigate to the notebook and click "Run All" to execute all of the cells.
Distributed under the MIT License. See `LICENSE.md` for more information.

The Politicians by Country dataset from the English-language Wikipedia is subject to a Creative Commons license.