Image recommendation for unillustrated Wikipedia articles
Connect via ssh to stat1005 (the remote machine that will host your notebooks)
ssh stat1005.eqiad.wmnet
First, clone the repository
git clone https://github.com/clarakosi/ImageMatching.git
Set up and activate the virtual environment
cd ImageMatching
virtualenv -p python3 venv
source venv/bin/activate
Install the dependencies
export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080
python3 setup.py install
To run the script, pass in the snapshot (required), the language (defaults to all wikis), and the output directory (defaults to Output):
python3 algorunner.py 2020-12-28 hywiki Output
The output .ipynb and .tsv files can be found in your output directory
ls Output
hywiki_2020-12-28.ipynb hywiki_2020-12-28_wd_image_candidates.tsv
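To quickly sanity-check the TSV output, it can be loaded with pandas. This is only a convenience check, assuming pandas is installed in the virtualenv; it is not part of the pipeline:
# Inspect the model output TSV (assumes pandas is installed).
import pandas as pd

candidates = pd.read_csv("Output/hywiki_2020-12-28_wd_image_candidates.tsv", sep="\t")
print(candidates.shape)
print(candidates.head())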
etl
contains PySpark utilities to transform the algorithm's raw output into a production dataset that will be consumed by a service.
raw2parquet.py
is a job that loads a TSV file (the model output), converts it to Parquet, and stores it in HDFS (or locally) using the wiki_db=wiki/snapshot=YYYY-MM partitioning scheme.
spark2-submit --properties-file conf/spark.properties --files etl/schema.py etl/raw2parquet.py \
--wiki <wiki name> \
--snapshot <YYYY-MM> \
--source <raw data> \
--destination <production data>
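Conceptually, the job boils down to the PySpark steps sketched below. This is an illustration only: the real schema and read options live in etl/schema.py and etl/raw2parquet.py, and the paths and wiki/snapshot values are placeholders.
# Sketch of the TSV-to-Parquet step (illustrative; see etl/raw2parquet.py for the real job).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.read.csv("path/to/raw.tsv", sep="\t", header=True)  # model output
(raw
 .withColumn("wiki_db", F.lit("hywiki"))      # partition columns
 .withColumn("snapshot", F.lit("2020-12"))
 .write
 .partitionBy("wiki_db", "snapshot")          # wiki_db=<wiki>/snapshot=<YYYY-MM> layout
 .mode("overwrite")
 .parquet("path/to/destination"))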
transform.py
parses the raw model output, transforms it into production data, and stores it in HDFS (or locally) using the wiki=wiki/snapshot=YYYY-MM partitioning scheme.
spark2-submit --files etl/schema.py etl/transform.py \
--snapshot <YYYY-MM> \
--source <raw data> \
--destination <production data>
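Because the output is partitioned by wiki and snapshot, downstream consumers can read it back with partition pruning. A minimal sketch, with placeholder path and values:
# Read the production dataset back, pruning by partition (placeholder path and values).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

prod = (spark.read.parquet("path/to/production_data")
        .where("wiki = 'hywiki' AND snapshot = '2021-01'"))
prod.show(5)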
conf/spark.properties
provides default settings for running the ETL as a regular-sized Spark job on WMF's Analytics cluster.
spark2-submit --properties-file conf/spark.properties --files etl/schema.py etl/transform.py \
--wiki <wiki name> \
--snapshot <YYYY-MM> \
--source <raw data> \
--destination <production data>
On WMF's cluster, the Hadoop Resource Manager (and Spark History Server) is available at https://yarn.wikimedia.org/cluster.
Additional instrumentation can be enabled by passing a metrics.properties file to the notebook or ETL jobs. A template metrics file, which outputs to the driver and executor stdout, can be found at conf/metrics.properties.template.
The easiest way to do this is by setting PYSPARK_SUBMIT_ARGS. For example:
export PYSPARK_SUBMIT_ARGS="--files ./conf/metrics.properties --conf spark.metrics.conf=metrics.properties pyspark-shell"
python3 algorunner.py 2020-12-28 hywiki Output
This will submit the algorunner job with additional instrumentation.
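The same submit arguments can also be set from Python before the Spark session is created (for example at the top of a notebook), instead of exporting the variable in the shell:
# Equivalent to the shell export above; must run before the Spark session is created.
import os

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--files ./conf/metrics.properties "
    "--conf spark.metrics.conf=metrics.properties pyspark-shell"
)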
For more information refer to https://spark.apache.org/docs/latest/monitoring.html.
To get the dataset metrics, run the dataset_metrics_runner.py script. The script expects the snapshot (required) and output directory (defaults to Output):
cd dataset_metrics/
python3 dataset_metrics_runner.py 2021-01 Output
The following scripts export the datasets currently used by client teams.
ddl/export_prod_data.hql
generates the canonical dataset for the image-suggestions-api service.
ddl/export_prod_data-android.hql
generates an Android-specific variant.
A template is provided at ddl/imagerec.sqlite.template to ingest data into SQLite for testing and validation purposes. It's parametrized by a SNAPSHOT variable; an SQLite script (DDL and .import statements) can be generated in Bash with:
export SNAPSHOT=2021-02-22
eval "cat <<EOF
$(cat imagerec.sqlite.template)
EOF
" 2> /dev/null