The goal of this project is to acquire, process, analyze, and publish a data set of the monthly traffic on Wikipedia.
The data source is the Wikimedia Foundation REST API. The terms and conditions of the Wikimedia Foundation REST API can be found here: Terms and Conditions. The content accessed via this API is licensed under the CC-BY-SA 3.0 and GFDL licenses, and all data produced throughout this project therefore follows the same licensing policy.
To get a comprehensive data set, two different APIs need to be called:
- The Legacy Pagecounts API (documentation, endpoint) provides access to desktop and mobile traffic data from December 2007 through July 2016.
- The Pageviews API (documentation, endpoint) provides access to desktop, mobile web, and mobile app traffic data from July 2015 through last month.
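As a minimal sketch of how requests to the two APIs could be addressed, the helpers below build monthly aggregate URLs. The endpoint templates, the access values (such as `desktop-site` or `mobile-web`), the `user` agent filter, and the `YYYYMMDDHH` timestamp format are assumptions taken from the public Wikimedia REST API documentation, not from this repository; the function names are hypothetical.

```python
# Endpoint templates assumed from the public Wikimedia REST API documentation;
# this README does not spell them out.
BASE = "https://wikimedia.org/api/rest_v1/metrics"
LEGACY_TEMPLATE = (
    BASE + "/legacy/pagecounts/aggregate/"
    "{project}/{access}/{granularity}/{start}/{end}"
)
PAGEVIEWS_TEMPLATE = (
    BASE + "/pageviews/aggregate/"
    "{project}/{access}/{agent}/{granularity}/{start}/{end}"
)


def legacy_url(access, start, end, project="en.wikipedia.org"):
    """Build a monthly Legacy Pagecounts URL (hypothetical helper).

    access: assumed values are 'all-sites', 'desktop-site', 'mobile-site'.
    start/end: timestamps in the assumed YYYYMMDDHH format.
    """
    return LEGACY_TEMPLATE.format(
        project=project, access=access, granularity="monthly",
        start=start, end=end,
    )


def pageviews_url(access, start, end, agent="user", project="en.wikipedia.org"):
    """Build a monthly Pageviews URL (hypothetical helper).

    access: assumed values are 'desktop', 'mobile-web', 'mobile-app'.
    agent: 'user' is assumed to be the filter that excludes spiders/crawlers.
    """
    return PAGEVIEWS_TEMPLATE.format(
        project=project, access=access, agent=agent, granularity="monthly",
        start=start, end=end,
    )


if __name__ == "__main__":
    print(legacy_url("desktop-site", "2007120100", "2016080100"))
    print(pageviews_url("mobile-web", "2015070100", "2020110100"))
```

The helpers only build strings, so they can be checked without network access. When sending real requests, the Wikimedia API guidelines ask clients to set a descriptive `User-Agent` header.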
The resulting CSV-formatted data file `en-wikipedia_traffic_200712-202010.csv` can be found in the folder `clean_data`. It contains the following fields:
name | description |
---|---|
year | Year with the format YYYY |
month | Month with the format MM |
pagecount_all_views | Desktop and mobile views in the given period, fetched via the Pagecounts API |
pagecount_desktop_views | Desktop views in the given period, fetched via the Pagecounts API |
pagecount_mobile_views | Mobile views in the given period, fetched via the Pagecounts API |
pageview_all_views | Desktop and mobile views in the given period, fetched via the Pageviews API |
pageview_desktop_views | Desktop views in the given period, fetched via the Pageviews API |
pageview_mobile_views | Mobile views in the given period, fetched via the Pageviews API |
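The following sketch illustrates one way rows with these fields could be assembled and serialized. It is not the notebook's actual code: the function names are hypothetical, the inputs are assumed to be dictionaries mapping a `YYYYMM` key to a view count, months without data from a source are assumed to be filled with 0, and the mobile web and mobile app counts of the Pageviews API are assumed to have been summed into one mobile value beforehand.

```python
import csv
import io

# Column order as listed in the table above.
FIELDS = [
    "year", "month",
    "pagecount_all_views", "pagecount_desktop_views", "pagecount_mobile_views",
    "pageview_all_views", "pageview_desktop_views", "pageview_mobile_views",
]


def build_rows(pagecount_desktop, pagecount_mobile,
               pageview_desktop, pageview_mobile):
    """Merge per-month counts (keyed by 'YYYYMM') into one row per month.

    Assumption: a month missing from a source is recorded as 0.
    """
    months = sorted(
        set(pagecount_desktop) | set(pagecount_mobile)
        | set(pageview_desktop) | set(pageview_mobile)
    )
    rows = []
    for ym in months:
        pc_desktop = pagecount_desktop.get(ym, 0)
        pc_mobile = pagecount_mobile.get(ym, 0)
        pv_desktop = pageview_desktop.get(ym, 0)
        pv_mobile = pageview_mobile.get(ym, 0)
        rows.append({
            "year": ym[:4],    # YYYY
            "month": ym[4:6],  # MM
            "pagecount_all_views": pc_desktop + pc_mobile,
            "pagecount_desktop_views": pc_desktop,
            "pagecount_mobile_views": pc_mobile,
            "pageview_all_views": pv_desktop + pv_mobile,
            "pageview_desktop_views": pv_desktop,
            "pageview_mobile_views": pv_mobile,
        })
    return rows


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Building the rows in memory first keeps the merge logic separate from file handling, so the same rows can be written to disk or inspected directly.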
Because two different data sources are used, their values are not directly comparable. For example, the Pageviews API excludes spiders/crawlers, while data from the Pagecounts API does not. As a result, the two sources may report different values for the same period. The two sources also overlap: for the period from July 2015 through July 2016, both provide data on the monthly traffic on Wikipedia.
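To make the effect of the overlap concrete, the sketch below computes, for each month in which both sources report data, the ratio of Pageviews to Pagecounts totals. The function name is hypothetical, and it assumes rows shaped like the CSV fields above, with a value of 0 meaning that a source has no data for that month.

```python
def overlap_ratios(rows):
    """Return {'YYYY-MM': pageview_all / pagecount_all} for overlapping months.

    Assumption: a count of 0 means the source has no data for that month,
    so only months where both totals are positive are included.
    """
    ratios = {}
    for row in rows:
        pagecount = int(row["pagecount_all_views"])
        pageview = int(row["pageview_all_views"])
        if pagecount > 0 and pageview > 0:
            key = "{}-{}".format(row["year"], row["month"])
            ratios[key] = pageview / pagecount
    return ratios
```

The rows can come from `csv.DictReader` applied to the data file, since the function converts the string values to integers itself. A ratio below 1 for a month would be consistent with the Pageviews figure excluding crawler traffic that the Pagecounts figure still contains.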
To use this project (especially the Jupyter notebook), please ensure that you have Python 3.6.1 or later, a working installation of Poetry, and git installed.
- Clone this repository (or use SSH) and change into the repo root.

      git clone https://github.com/marisanest/A2-hcds-hcc.git
      cd A2-hcds-hcc

- Install the dependencies in the repo root.

      poetry install

- Create a subshell within the virtual environment by running:

      poetry shell

- Open the project with Jupyter in your browser.

      jupyter notebook