SAR-PU

Beyond the Selected Completely At Random Assumption for Learning from Positive and Unlabeled Data

Primary language: Jupyter Notebook. License: MIT.

This repository contains the code used in the paper "Beyond the Selected Completely At Random Assumption for Learning from Positive and Unlabeled Data" (https://arxiv.org/abs/1809.03207).

Install

Create a virtual environment with Python 3 and activate it:

$ virtualenv -p python3 env_sarpu
$ source env_sarpu/bin/activate

Install the required packages, the SAR-PU code, and the other local libraries. The KM library is downloaded from its original source and turned into a Python package that is compatible with Python 3.

(env_sarpu) $ pip install -r requirements.txt
(env_sarpu) $ pip install -e sarpu
(env_sarpu) $ python make_km_lib.py
(env_sarpu) $ pip install -e lib/tice
(env_sarpu) $ pip install -e lib/km

Install a jupyter kernel with the environment:

(env_sarpu) $ ipython kernel install --user --name=env_sarpu

Data

The data directory contains the original data, the preprocessed data and the PU labellings. For each dataset, there is a data directory <dataset>.

Data Directory Structure

Each data directory has the following structure:

<dataset>
├── original
│   └── <original_downloaded_dataset>
└── processed
    ├── <dataset>._.class.csv
    ├── <dataset>._.data.csv
    ├── labelings
    │    └── <dataset>._.train_test._.<partition_id>.csv
    └── partitions
         ├── <dataset>._.propmodel.<modeltype>._.train_test._.<propensity_attributes>.e.csv          
         └── <dataset>._.propmodel.<modeltype>._.train_test._.<propensity_attributes>._.lab.<sampling_id>.csv

<dataset>/original contains the original data.

<dataset>/processed contains the processed/reformatted data, the data partitions (train/test) and the PU labellings.

A PU dataset is the combination of 4 files, where each line is an example.

  1. <dataset>._.data.csv contains the attribute values, separated by spaces.
  2. <dataset>._.class.csv contains the true class values (0/1).
  3. <dataset>._.train_test._.<partition_id>.csv contains the partitions. 1 for train, 2 for test.
  4. <dataset>._.propmodel.<modeltype>._.train_test._.<propensity_attributes>._.lab.<sampling_id>.csv contains the PU labels. The positive examples were sampled according to the propensity scores in <dataset>._.propmodel.<modeltype>._.train_test._.<propensity_attributes>.e.csv, which were assigned by the propensity model of type <modeltype> for the attributes <propensity_attributes>.
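
As an illustration of how the four files fit together, here is a minimal sketch that reads them with the standard library only. It is not part of the sarpu package. The function names are hypothetical, and the sketch assumes that the files have no header row, that the class, partition and label files hold one value per line, and that line i of every file describes example i.

```python
from pathlib import Path


def read_column(path):
    """Read a file with one integer-valued entry per line, skipping blank lines."""
    lines = Path(path).read_text().splitlines()
    return [int(float(line)) for line in lines if line.strip()]


def load_pu_dataset(data_path, class_path, partition_path, label_path):
    """Combine the four files of a PU dataset into train and test splits.

    Assumption: no header rows, and line i of each file refers to example i.
    """
    rows = [
        [float(value) for value in line.split()]  # attributes are space-separated
        for line in Path(data_path).read_text().splitlines()
        if line.strip()
    ]
    classes = read_column(class_path)        # true class, 0 or 1
    partitions = read_column(partition_path)  # 1 = train, 2 = test
    labels = read_column(label_path)          # PU label, 1 = labeled positive
    if not (len(rows) == len(classes) == len(partitions) == len(labels)):
        raise ValueError("the four files must have the same number of lines")

    splits = {"train": [], "test": []}
    for x, y, part, s in zip(rows, classes, partitions, labels):
        split = "train" if part == 1 else "test"
        splits[split].append({"x": x, "y": y, "s": s})
    return splits
```

Each example is returned as a dictionary with the attribute vector x, the true class y and the PU label s, so a PU learner would train on x and s only and keep y for evaluation.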

Download and preprocess

For each dataset there is a notebook to download and preprocess it:

notebooks/data_preprocessing/<Dataset>.ipynb

Currently, the available datasets are:

  • 20ng
  • Adult
  • BreastCancer
  • Covtype
  • Diabetes
  • ImageSegmentation
  • Mushroom
  • Splice

All the notebooks can be run from the terminal with a shell script, using the provided jupyter kernel:

$  ./generateData.sh env_sarpu

Extended Data

To be able to do some controlled experiments, we extended the datasets with artificially generated attributes.

Extended versions of the datasets are generated by the notebook

notebooks/data_preprocessing/Extended Data.ipynb

Experiments

The notebook notebooks/Experiments shows how to run experiments.

From the command line, experiments can be run as follows:

Label the PU data according to the provided labeling mechanism

(env_sarpu) $ python -m sarpu label $data_dir $data_name $labeling_model_type $propensity_attributes $nb_assignments
  • data_dir: The base directory for the data. This is probably "Data"
  • data_name: The dataset to use
  • labeling_model_type: unique labeling mechanism name, usually "simple_0.2_0.8"
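  • propensity_attributes: the attributes used by the propensity model, separated by dots, where a leading minus marks a negative sign. For example, 1.-3.5 stands for attributes [1,3,5] with signs [1,-1,1].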
  • nb_assignments: how many labellings to produce
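
The propensity_attributes notation can be decoded as in the following sketch. The function name is hypothetical and this is not the parser used by sarpu; it only reproduces the example given above.

```python
def parse_propensity_attributes(spec):
    """Split a spec such as "1.-3.5" into attribute indices and signs."""
    attributes, signs = [], []
    for token in spec.split("."):
        attributes.append(abs(int(token)))
        # Check the text rather than the number so that "-0" keeps its sign.
        signs.append(-1 if token.startswith("-") else 1)
    return attributes, signs


print(parse_propensity_attributes("1.-3.5"))  # ([1, 3, 5], [1, -1, 1])
```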

Output

The labellings are saved in the data directory under <dataset>/processed/labelings/
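
To make concrete what a single labelling is, the sketch below draws PU labels from true classes and propensity scores: a positive example is labeled with probability equal to its propensity score, and a negative example is never labeled. This is an illustration under those assumptions, not the code behind the label command, and the function name is hypothetical.

```python
import random


def sample_pu_labels(true_classes, propensity_scores, seed=None):
    """Draw one PU labelling: positives are labeled with probability e."""
    rng = random.Random(seed)
    labels = []
    for y, e in zip(true_classes, propensity_scores):
        # rng.random() lies in [0, 1), so e = 1 always labels and e = 0 never does.
        labeled = y == 1 and rng.random() < e
        labels.append(1 if labeled else 0)
    return labels
```

Calling the function several times with different seeds gives several labellings of the same data, which is what nb_assignments asks for.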

Train and evaluate a model using the provided PU method.

(env_sarpu) $ python -m sarpu train_eval $data_dir $results_dir $data_name $labeling_model_type $propensity_attributes $labeling $partition $settings $pu_method
  • data_dir: The base directory for the data. This is probably "Data"
  • results_dir: The base directory for the results. This is probably "Results"
  • data_name: The dataset to use
  • labeling_model_type: unique labeling mechanism name, usually "simple_0.2_0.8"
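  • propensity_attributes: the attributes used by the propensity model, separated by dots, where a leading minus marks a negative sign. For example, 1.-3.5 stands for attributes [1,3,5] with signs [1,-1,1].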
  • labeling: which labeling to use (id)
  • partition: which train/test partition to use (id)
  • settings: which settings to use, i.e. the type of model for classification, the type of model for the propensity scores, and the attributes to use for classification. The three values are separated by "._.". The models can be "lr" (logistic regression). The classification attributes are either "all" or something like "1.3-5.9-11", which indicates that attributes [1,3,4,5,9,10,11] should be used for classification.
  • pu_method: which pu_method to use
    • supervised: standard supervised learning with access to the true labels
    • negative: standard supervised learning given the PU labels
    • sar-e: propensity score weighting given the correct propensity scores.
    • scar-c: propensity score weighting given the label frequency as the propensity score for all examples.
    • sar-em: The EM-based SAR-PU method
    • scar-km2: propensity score weighting with an estimated label frequency as the propensity score for all examples. km2 is used to estimate the label frequency.
    • scar-tice: propensity score weighting with an estimated label frequency as the propensity score for all examples. tice is used to estimate the label frequency.
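
The settings string described above can be unpacked as in this sketch. Both function names are hypothetical, and the example value "lr._.lr._.1.3-5.9-11" is assembled from the description rather than copied from the repository.

```python
def parse_attribute_ranges(spec):
    """Expand "1.3-5.9-11" into [1, 3, 4, 5, 9, 10, 11]; pass "all" through."""
    if spec == "all":
        return "all"
    attributes = []
    for part in spec.split("."):
        if "-" in part:
            first, last = part.split("-")
            attributes.extend(range(int(first), int(last) + 1))  # inclusive range
        else:
            attributes.append(int(part))
    return attributes


def parse_settings(settings):
    """Split settings into (classification model, propensity model, attributes)."""
    classifier, propensity_model, attributes = settings.split("._.")
    return classifier, propensity_model, parse_attribute_ranges(attributes)


print(parse_settings("lr._.lr._.1.3-5.9-11"))
# ('lr', 'lr', [1, 3, 4, 5, 9, 10, 11])
```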

Output

The experiment results are saved in the folder Results/<data_name>/<labeling_model>._.<propensity_attributes>.__.<settings>.__.<labeling>.__.<partition>/<pu_method>/. This folder contains the following files:

  • e.model: the propensity score model
  • f.model: the classification model
  • info.csv: info that was output during the training, such as training time, number of iterations and intermediate results
  • results.csv: evaluation metrics for the propensity score model and the classification model, calculated on the train and test data.
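
The folder name pattern above can be assembled as in the following sketch, which is handy when collecting results.csv files across runs. The function name and the argument values in the example are hypothetical; only the pattern itself comes from this README.

```python
from pathlib import PurePosixPath


def results_folder(results_dir, data_name, labeling_model, propensity_attributes,
                   settings, labeling, partition, pu_method):
    """Build the output folder of one experiment from its parameters."""
    experiment = ".__.".join([
        f"{labeling_model}._.{propensity_attributes}",
        settings,
        str(labeling),
        str(partition),
    ])
    return PurePosixPath(results_dir) / data_name / experiment / pu_method


folder = results_folder("Results", "mushroom", "simple_0.2_0.8", "1.-3.5",
                        "lr._.lr._.all", 0, 1, "sar-em")
print(folder / "results.csv")
```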

Directories and Files

Data: contains the data, both the original data and the SAR-PU versions. The data can be downloaded and generated using the notebooks in notebooks/data_preprocessing

lib: external libraries that are used in the code

  • km: the class prior estimator from Ramaswamy
  • tice: class prior estimator from our AAAI paper

notebooks: notebooks for the data preprocessing and the experiments

  • data_preprocessing: downloads data and generates sar pu versions
  • Experiments: fast way to test something. Specify the SAR mechanism, dataset, and settings. Then compare the different methods and analyse the behaviour of the SAR mechanism.

Results: The raw results generated by our experiments. Unless specifically asked otherwise (by setting a flag), this folder is checked for results before running experiments to save time.

sarpu: The library with the sarpu code
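
The reuse of earlier results mentioned for the Results folder can be pictured with the sketch below. It assumes that the presence of results.csv in an experiment's output folder marks a finished run; the function name and the rerun parameter are hypothetical and are not the flag used by sarpu.

```python
from pathlib import Path


def needs_run(output_folder, rerun=False):
    """Return True when an experiment has no saved results or a rerun is forced."""
    return rerun or not (Path(output_folder) / "results.csv").is_file()
```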