Image recommendation for unillustrated Wikipedia articles
Connect via ssh to stat1005 (the remote machine that will host your notebooks)
ssh stat1005.eqiad.wmnet
First, clone the repository
git clone https://github.com/clarakosi/ImageMatching.git
Set up and activate the virtual environment
cd ImageMatching
virtualenv -p python3 venv
source venv/bin/activate
Install the dependencies
export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080
python3 setup.py install
To run the script, pass in the snapshot (required), the language (defaults to all wikis), and the output directory (defaults to Output):
python3 algorunner.py 2020-12-28 hywiki Output
The output .ipynb and .tsv files can be found in your output directory
ls Output
hywiki_2020-12-28.ipynb hywiki_2020-12-28_wd_image_candidates.tsv
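To quickly sanity-check the TSV output, it can be loaded with pandas. This is only a convenience check, assuming pandas is installed in the virtualenv; it is not part of the pipeline:
# Inspect the model output TSV (assumes pandas is installed).
import pandas as pd

candidates = pd.read_csv("Output/hywiki_2020-12-28_wd_image_candidates.tsv", sep="\t")
print(candidates.shape)
print(candidates.head())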
etl
contains PySpark utilities to transform the algorithm's raw output into a production dataset that will be consumed by a service.
raw2parquet.py
is a job that loads a TSV file (the model output), converts it to Parquet, and stores it in HDFS (or locally) using the wiki_db=wiki/snapshot=YYYY-MM partitioning scheme.
spark2-submit --properties-file conf/spark.properties --files etl/schema.py etl/raw2parquet.py \
--wiki <wiki name> \
--snapshot <YYYY-MM> \
--source <raw data> \
--destination <production data>
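Conceptually, the job boils down to the PySpark steps sketched below. This is an illustration only: the real schema and read options live in etl/schema.py and etl/raw2parquet.py, and the paths and wiki/snapshot values are placeholders.
# Sketch of the TSV-to-Parquet step (illustrative; see etl/raw2parquet.py for the real job).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.read.csv("path/to/raw.tsv", sep="\t", header=True)  # model output
(raw
 .withColumn("wiki_db", F.lit("hywiki"))      # partition columns
 .withColumn("snapshot", F.lit("2020-12"))
 .write
 .partitionBy("wiki_db", "snapshot")          # wiki_db=<wiki>/snapshot=<YYYY-MM> layout
 .mode("overwrite")
 .parquet("path/to/destination"))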
transform.py
parses the raw model output, transforms it into production data, and stores it in HDFS (or locally) using the wiki=wiki/snapshot=YYYY-MM partitioning scheme.
spark2-submit --files etl/schema.py etl/transform.py \
--snapshot <YYYY-MM> \
--source <raw data> \
--destination <production data>
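Because the output is partitioned by wiki and snapshot, downstream consumers can read it back with partition pruning. A minimal sketch, with placeholder path and values:
# Read the production dataset back, pruning by partition (placeholder path and values).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

prod = (spark.read.parquet("path/to/production_data")
        .where("wiki = 'hywiki' AND snapshot = '2021-01'"))
prod.show(5)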
conf/spark.properties
provides default settings for running the ETL as a regular-sized Spark job on WMF's Analytics cluster.
spark2-submit --properties-file conf/spark.properties --files etl/schema.py etl/transform.py \
--wiki <wiki name> \
--snapshot <YYYY-MM> \
--source <raw data> \
--destination <production data>
On WMF's cluster, the Hadoop Resource Manager (and Spark History Server) is available at https://yarn.wikimedia.org/cluster.
Additional instrumentation can be enabled by passing a metrics.properties file to the notebook or ETL jobs. A template metrics file, which outputs to the driver and executor stdout, can be found at conf/metrics.properties.template.
The easiest way to do this is by setting PYSPARK_SUBMIT_ARGS. For example:
export PYSPARK_SUBMIT_ARGS="--files ./conf/metrics.properties --conf spark.metrics.conf=metrics.properties pyspark-shell"
python3 algorunner.py 2020-12-28 hywiki Output
This will submit the algorunner job with additional instrumentation.
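The same submit arguments can also be set from Python before the Spark session is created (for example at the top of a notebook), instead of exporting the variable in the shell:
# Equivalent to the shell export above; must run before the Spark session is created.
import os

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--files ./conf/metrics.properties "
    "--conf spark.metrics.conf=metrics.properties pyspark-shell"
)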
For more information refer to https://spark.apache.org/docs/latest/monitoring.html.
To get the dataset metrics, run the dataset_metrics_runner.py script. The script expects the snapshot (required) and output directory (defaults to Output):
cd dataset_metrics/
python3 dataset_metrics_runner.py 2021-01 Output
The following scripts export the datasets currently used by client teams.
ddl/export_prod_data.hql
generates the canonical dataset for the image-suggestions-api service.
ddl/export_prod_data-android.hql
generates an Android-specific variant.
A template is provided at ddl/imagerec.sqlite.template to ingest data into SQLite for testing and validation purposes. It's parametrized by a SNAPSHOT variable; an SQLite script (DDL and .import statements) can be generated in Bash with:
export SNAPSHOT=2021-02-22
eval "cat <<EOF
$(cat imagerec.sqlite.template)
EOF
" 2> /dev/null