ml-feature-tweaking

This repository contains the source code associated with the method proposed by Tolomei et al. in their KDD 2017 research paper entitled "Interpretable Predictions of Tree-based Ensembles via Actionable Feature Tweaking"

Tweaking Features of Ensembles of Machine-Learned Trees

This repository contains the source code associated with the method proposed by Tolomei et al. in their KDD 2017 research paper entitled "Interpretable Predictions of Tree-based Ensembles via Actionable Feature Tweaking". More information is available on the KDD 2017 website or arXiv.org.

NOTE: This work was developed by the authors of the paper while working at Yahoo Labs, London, UK. Although the proposed method is general and applicable to several different domains, the authors validate it on an online advertising use case. In particular, they demonstrate that the approach can generate actionable recommendations for improving the quality of the ads served by Yahoo Gemini.
For confidentiality reasons, all business-related details have been removed from this repository. It can still be used by other researchers working on related topics, such as ML model interpretability or adversarial ML.

This repo consists of three scripts, which are meant to be run in the following order:

  1. dump_paths.py
  2. tweak_features.py
  3. compute_tweaking_costs.py

1. dump_paths.py

This script implements the first stage of the pipeline. It can be invoked as follows:

> ./dump_paths.py ${PATH_TO_SERIALIZED_MODEL} ${PATH_TO_OUTPUT_FILE}

where:

  • ${PATH_TO_SERIALIZED_MODEL} is the path to the (binary) file containing a serialized, trained binary classifier (i.e., a scikit-learn tree-based ensemble estimator).
  • ${PATH_TO_OUTPUT_FILE} is the path where the output file will be stored. This file will contain a plain-text representation of all the positive paths, namely all the paths extracted from all the trees in the ensemble whose leaves are labeled as positive.

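
To make the notion of a positive path concrete, here is a minimal sketch that enumerates root-to-leaf paths ending in a positive leaf. It is not the code of dump_paths.py: the toy tree, the parallel-array layout (which only loosely mirrors how scikit-learn exposes a fitted tree), and every name in it are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# One boolean test on a path: (feature_id, op, value).
Condition = Tuple[int, str, float]

# Hypothetical toy tree stored as parallel arrays; -1 means "no child" (a leaf).
#   node 0: feature 0 <= 0.5 ? -> node 1 (leaf, label 0) : node 2
#   node 2: feature 1 <= 2.0 ? -> node 3 (leaf, label 1) : node 4 (leaf, label 0)
CHILDREN_LEFT = [1, -1, 3, -1, -1]
CHILDREN_RIGHT = [2, -1, 4, -1, -1]
FEATURE = [0, -1, 1, -1, -1]
THRESHOLD = [0.5, 0.0, 2.0, 0.0, 0.0]
LEAF_LABEL = [None, 0, None, 1, 0]


def positive_paths(node: int = 0,
                   prefix: Optional[List[Condition]] = None) -> List[List[Condition]]:
    """Return every root-to-leaf path whose leaf is labeled positive (1)."""
    prefix = prefix or []
    if CHILDREN_LEFT[node] == -1:  # reached a leaf
        return [prefix] if LEAF_LABEL[node] == 1 else []
    feature, threshold = FEATURE[node], THRESHOLD[node]
    # The left branch is taken when the test "feature <= threshold" holds.
    left = positive_paths(CHILDREN_LEFT[node], prefix + [(feature, "<=", threshold)])
    right = positive_paths(CHILDREN_RIGHT[node], prefix + [(feature, ">", threshold)])
    return left + right


PATHS = positive_paths()
```

For an ensemble, the same traversal would be repeated for each tree, tagging every path with the id of the tree it came from.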
Each line of the output file is a positive path, which in turn is a sequence of boolean tests with the following format:

[tree_id, [(feature_id, op, value), ..., (feature_id, op, value)]]

where:

  • tree_id is the unique id of the tree within the ensemble.
  • feature_id is the unique id of the feature subject of the test.
  • op is the operator of the test: either '<=' or '>'.
  • value is the value against which the feature is tested.
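
The following sketch shows one way to read such a line back into Python and check whether an instance satisfies the path. It assumes that each line is a valid, bracket-balanced Python literal with quoted operators, which the description above suggests but does not guarantee. The function names and the example line are hypothetical.

```python
import ast
from typing import List, Sequence, Tuple

Condition = Tuple[int, str, float]


def parse_positive_path(line: str) -> Tuple[int, List[Condition]]:
    """Parse one line of the paths file into (tree_id, list of conditions)."""
    # literal_eval only accepts Python literals, so it is safe on untrusted text.
    tree_id, conditions = ast.literal_eval(line.strip())
    return int(tree_id), [(int(f), str(op), float(v)) for f, op, v in conditions]


def satisfies(instance: Sequence[float], conditions: List[Condition]) -> bool:
    """Return True if the instance passes every boolean test on the path."""
    for feature_id, op, value in conditions:
        if op == "<=" and not instance[feature_id] <= value:
            return False
        if op == ">" and not instance[feature_id] > value:
            return False
    return True


# Hypothetical example line, made up for illustration.
EXAMPLE_LINE = "[3, [(0, '>', 0.5), (1, '<=', 2.0)]]"
```

Calling parse_positive_path(EXAMPLE_LINE) yields the tree id and a list of (feature_id, op, value) tuples that can be passed directly to satisfies.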

2. tweak_features.py

The second stage of the pipeline is actually the core of the entire process. The script can be run as follows:

> ./tweak_features.py ${PATH_TO_DATASET} ${PATH_TO_SERIALIZED_MODEL} ${PATH_TO_POSITIVE_PATHS_FILE} \
${PATH_TO_OUTPUT_DIRECTORY} [--epsilon=x]

where:

  • ${PATH_TO_DATASET} is the path to the dataset file used to train the binary classifier. This is assumed to be either a .tsv or a .csv file, where each line is an instance and each field is a feature. The very last field is expected to be the target label (named 'class').
  • ${PATH_TO_SERIALIZED_MODEL} is the same as above.
  • ${PATH_TO_POSITIVE_PATHS_FILE} is the path to the output file generated by dump_paths.py at stage 1.
  • ${PATH_TO_OUTPUT_DIRECTORY} is the path to the directory where the output file will be stored. This file will be called transformations_${EPSILON}.tsv, where ${EPSILON} is the value of the optional epsilon argument (by default, epsilon=0.1).
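
As a rough illustration of what a tweak along a positive path could look like, the sketch below moves each feature that fails a test just past the tested value, by a margin of epsilon. This is an assumption about the role of epsilon, not the code of tweak_features.py. The function name and the example values are hypothetical, and the sketch does not handle paths that test the same feature several times with thresholds closer together than epsilon.

```python
from typing import List, Sequence, Tuple

Condition = Tuple[int, str, float]


def tweak_along_path(instance: Sequence[float],
                     conditions: List[Condition],
                     epsilon: float = 0.1) -> List[float]:
    """Return a copy of the instance adjusted to satisfy every test on the path.

    Features that already satisfy their test are left untouched; the others are
    moved to the tested value minus or plus epsilon (assumed behavior).
    """
    tweaked = list(instance)
    for feature_id, op, value in conditions:
        if op == "<=" and tweaked[feature_id] > value:
            tweaked[feature_id] = value - epsilon
        elif op == ">" and tweaked[feature_id] <= value:
            tweaked[feature_id] = value + epsilon
    return tweaked


# Hypothetical instance and path: feature 0 fails '> 0.5', feature 1 fails '<= 2.0'.
ORIGINAL = [0.2, 3.0, 1.0]
PATH = [(0, ">", 0.5), (1, "<=", 2.0)]
TWEAKED = tweak_along_path(ORIGINAL, PATH, epsilon=0.1)
```

A larger epsilon pushes the tweaked instance further from the decision boundary but also further from the original instance, which is what the cost functions of the next stage quantify.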

3. compute_tweaking_costs.py

Once the set of candidate feature transformations (i.e., tweakings) has been calculated, we can measure the actual costs of those transformations by running the following script:

> ./compute_tweaking_costs.py ${PATH_TO_DATASET} \
${PATH_TO_TRANSFORMATIONS} \
${PATH_TO_OUTPUT_DIRECTORY} \
--costfuncs=unmatched_component_rate,euclidean_distance,cosine_distance,jaccard_distance,pearson_correlation_distance

where:

  • ${PATH_TO_DATASET} is the path to the dataset file used to train the binary classifier, as above.
  • ${PATH_TO_TRANSFORMATIONS} is the path to the file containing the candidate transformations obtained at stage 2.
  • ${PATH_TO_OUTPUT_DIRECTORY} is the path to the directory where the output files will be stored.
  • costfuncs is an optional argument containing a comma-separated list of the functions used to compute the cost of each transformation (default costfuncs=euclidean_distance).
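
To give an idea of what such cost functions measure, here is a minimal sketch of three of them, comparing an original instance with its tweaked version. The function names are borrowed from the costfuncs list above, but the definitions are assumptions based on the usual meaning of those names and may differ from the ones in compute_tweaking_costs.py.

```python
import math
from typing import Sequence


def euclidean_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Straight-line distance between the original and the tweaked instance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """One minus the cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def unmatched_component_rate(x: Sequence[float], y: Sequence[float]) -> float:
    """Fraction of features whose value differs (assumed definition)."""
    return sum(1 for a, b in zip(x, y) if a != b) / len(x)
```

Under these definitions, a transformation that changes few features by small amounts scores low on every function, while one that changes many features scores high on unmatched_component_rate even if each individual change is small.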

This step produces two .tsv files inside ${PATH_TO_OUTPUT_DIRECTORY}, containing the costs and the signs of each transformation.

Additional steps can be performed using those files as input, depending on the final task goal.

Citation

If you use this implementation in your work, please add a reference/citation to the paper. You can use the following BibTeX entry:

@inproceedings{DBLP:conf/kdd/TolomeiSHL17,
  author    = {Gabriele Tolomei and
               Fabrizio Silvestri and
               Andrew Haines and
               Mounia Lalmas},
  title     = {Interpretable Predictions of Tree-based Ensembles via Actionable Feature
               Tweaking},
  booktitle = {Proceedings of the 23rd {ACM} {SIGKDD} International Conference on
               Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13
               - 17, 2017},
  pages     = {465--474},
  year      = {2017},
  crossref  = {DBLP:conf/kdd/2017},
  url       = {http://doi.acm.org/10.1145/3097983.3098039},
  doi       = {10.1145/3097983.3098039},
  timestamp = {Tue, 15 Aug 2017 16:11:01 +0200},
  biburl    = {http://dblp.org/rec/bib/conf/kdd/TolomeiSHL17},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}