###WMUtils
####An internal utilities library for the Wikimedia Foundation's Research and Data team
####Description
Every organisation has its own idiosyncratic data storage methods and engineering solutions, and the Wikimedia Foundation is no exception. To help deal with this when doing research, we have created WMUtils, a library of utility functions for handling the WMF's various data formats, stores and needs.
####Domains
#####Log reading and database connections
Request logs are stored unsanitised in HDFS for 30 days, and in a sanitised and sampled form on stat1002. hive_query and sampled_logs, respectively, allow you to access this data and read it into R. Once you have it, you can use log_strptime to parse the timestamp format, parse_uuids to extract the UUIDs used by the Wikipedia mobile applications, or log_sieve to filter the requests down to those that are considered "pageviews". For general Hive manipulation, hive_range makes a best-guess attempt at the smallest number of date-based partitions a query must run over to cover an expected range of timestamps.
The rest of our data lives in large MariaDB databases, which can be read from using mysql_query (or global_query to query many of them en masse), written to with mysql_write, and checked or amended with mysql_exists or mysql_delete, respectively.
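
The idea of covering a timestamp range with as few date-based partitions as possible can be illustrated in base R. This is only a sketch, not how hive_range is implemented: partitions_for_range is a hypothetical helper, and it assumes partitions are keyed by year and month.

```r
# Hypothetical helper (not part of WMUtils): list the year/month partitions
# that a range of dates touches. Assumes year/month partitioning.
partitions_for_range <- function(start, end) {
  start <- as.Date(start)
  end <- as.Date(end)
  stopifnot(start <= end)
  # Step from the first day of the starting month, one month at a time
  first_of_month <- as.Date(format(start, "%Y-%m-01"))
  months <- seq(first_of_month, end, by = "month")
  data.frame(year = as.integer(format(months, "%Y")),
             month = as.integer(format(months, "%m")))
}

# A range running from late November 2014 to early January 2015
parts <- partitions_for_range("2014-11-20", "2015-01-05")
print(parts)
```

Each row of the result could then be turned into a partition predicate in a query's WHERE clause, so that only the months actually needed are scanned.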
#####MediaWiki idiosyncrasies
mw_strptime replicates log_strptime, but for MediaWiki-specific timestamps, while to_mw allows you to convert POSIX timestamps back into acceptable MediaWiki ones. For namespace matching, namespace_match takes numeric namespace values and turns them into the appropriate localised strings, or takes localised strings and turns them into universally-accepted numeric values.
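
The timestamp round trip can be sketched with base R alone. The example below does not call mw_strptime or to_mw, whose exact signatures are not shown here. It relies on MediaWiki's database timestamp format, a 14-digit YYYYMMDDHHMMSS string in UTC, and the variable names are purely illustrative.

```r
# MediaWiki database timestamps: 14-digit strings, YYYYMMDDHHMMSS, in UTC
mw_ts <- c("20140101123000", "20150615080910")

# MediaWiki -> POSIX (roughly the direction mw_strptime covers)
parsed <- as.POSIXct(strptime(mw_ts, format = "%Y%m%d%H%M%S", tz = "UTC"))

# POSIX -> MediaWiki (roughly the direction to_mw covers)
back <- format(parsed, "%Y%m%d%H%M%S", tz = "UTC")

print(parsed)
print(back)
```

Pinning the timezone to UTC in both directions avoids shifts caused by the local timezone of the machine running the code.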
#####Geolocation
Through the MaxMind C API, we can take IP addresses and geolocate them. geo_country locates an address to country level and geo_city to city level, while geo_tz and geo_netspeed retrieve a tzdata-compatible timezone and a connection type, respectively.
#####User-agent parsing
With the assistance of tobie's ua-parser library (specifically the C++ port), we can take user agents and use ua_parse to parse them, retrieving the device, operating system, browser, and browser major/minor versions. This includes spider identification.
Once the agent is parsed, device_classifier takes the device reported by ua-parser and makes a best guess at classifying it as a phone, a tablet or other.
#####Session analysis
A variety of functions implemented in C++ allow for session identification and analysis. intertimes takes a set of timestamps and turns them into a series of intertime values, which can then be passed to session_length to retrieve the length of the session(s), session_pages to retrieve the number of pages within those sessions, and session_count to get the number of sessions.
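
The session workflow can be approximated in base R as follows. This is a sketch of the general technique rather than the package's C++ implementation: it assumes that intertime values are the gaps between consecutive timestamps, and that a gap longer than one hour starts a new session. The threshold and all variable names are illustrative assumptions.

```r
# Event times for one user, in seconds, already sorted
timestamps <- c(0, 30, 90, 4000, 4100, 9000)

# Gaps between consecutive events (assumed meaning of "intertime values")
gaps <- diff(timestamps)

# Assumption: a gap of more than an hour starts a new session
threshold <- 3600
session_id <- cumsum(c(TRUE, gaps > threshold))

# Number of sessions, pages per session, and length of each session
n_sessions <- max(session_id)
pages_per_session <- as.integer(table(session_id))
session_lengths <- as.numeric(
  tapply(timestamps, session_id, function(x) max(x) - min(x))
)

print(n_sessions)
print(pages_per_session)
print(session_lengths)
```

Here a session's length is measured from its first to its last event, so a single-event session has length zero. The package's functions may treat such cases differently.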
####Dependencies
- R (doy)
- The Python libraries mentioned above
- data.table
- lubridate
- Rcpp
- jsonlite
- parallel