/pyDVL_DataOOB

A fork of pyDVL.



A library for data valuation.



pyDVL collects algorithms for Data Valuation and Influence Function computation.

Data Valuation is the task of estimating the intrinsic value of a data point with respect to the training set, the model, and a scoring function. The implemented methods are drawn from the data valuation literature; see the documentation for the full list of references.
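To make the idea concrete (this is a sketch of the concept, not pyDVL's API), the exact Shapley value of a data point is its marginal contribution to the utility, averaged over all orderings of the training set. Here `utility` is a hypothetical stand-in for the expensive train-and-score step:

```python
from itertools import permutations

def shapley_values(points, utility):
    """Exact Shapley values: average marginal contribution of each
    point over all orderings of the training set."""
    values = {p: 0.0 for p in points}
    perms = list(permutations(points))
    for perm in perms:
        seen = set()
        for p in perm:
            # Marginal contribution of p given the points placed before it
            values[p] += utility(seen | {p}) - utility(seen)
            seen.add(p)
    return {p: v / len(perms) for p, v in values.items()}

# Toy additive utility: each point contributes a fixed amount.
worth = {"a": 2.0, "b": 1.0, "c": 0.0}
u = lambda subset: sum(worth[p] for p in subset)
print(shapley_values(["a", "b", "c"], u))  # for an additive utility, value == worth
```

The exact computation is factorial in the number of points, which is why pyDVL implements Monte Carlo approximations such as Truncated Monte Carlo Shapley.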

Influence Functions compute the effect that single points have on an estimator / model. Here too, the implemented methods come from the literature; see the documentation for references.
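For intuition, the effect of a single point can be measured by refitting the estimator without it (leave-one-out retraining). Influence functions approximate this effect without retraining, but the quantity being approximated is the same. A minimal sketch with a mean estimator:

```python
def fit_mean(xs):
    """A trivially simple 'model': the sample mean."""
    return sum(xs) / len(xs)

def loo_influence(xs, test_x):
    """Leave-one-out effect of each training point on the squared test error.

    Negative values mean removing the point *reduces* the error,
    i.e. the point was harmful for this test point.
    """
    full_error = (fit_mean(xs) - test_x) ** 2
    effects = []
    for i in range(len(xs)):
        rest = xs[:i] + xs[i + 1:]
        effects.append((fit_mean(rest) - test_x) ** 2 - full_error)
    return effects

data = [1.0, 2.0, 3.0, 10.0]  # 10.0 is an outlier
print(loo_influence(data, test_x=2.0))  # the outlier's effect is strongly negative
```

Retraining once per point is exact but prohibitively expensive for real models, which is why the library approximates these effects via the Hessian, as in the example below.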

Installation

To install the latest release use:

$ pip install pyDVL

You can also install the latest development version from TestPyPI:

$ pip install pyDVL --index-url https://test.pypi.org/simple/

For more instructions and information refer to Installing pyDVL in the documentation.

Usage

Influence Functions

For influence computation, follow these steps:

  1. Wrap your model and loss in a TorchTwiceDifferentiable object
  2. Compute influence factors by providing training data and inversion method

Using the conjugate gradient algorithm, this would look like:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from pydvl.influence import TorchTwiceDifferentiable, compute_influences, InversionMethod

# A toy model: (5, 5, 5) input -> Conv2d -> (3, 3, 3) -> Flatten -> 27 -> Linear -> 3
nn_architecture = nn.Sequential(
    nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
    nn.Flatten(),
    nn.Linear(27, 3),
)
loss = nn.MSELoss()
model = TorchTwiceDifferentiable(nn_architecture, loss)

input_dim = (5, 5, 5)
output_dim = 3

# Random toy data: 10 training samples and 5 test samples
train_data_loader = DataLoader(
    TensorDataset(torch.rand((10, *input_dim)), torch.rand((10, output_dim))),
    batch_size=2,
)
test_data_loader = DataLoader(
    TensorDataset(torch.rand((5, *input_dim)), torch.rand((5, output_dim))),
    batch_size=1,
)

influences = compute_influences(
    model,
    training_data=train_data_loader,
    test_data=test_data_loader,
    progress=True,
    inversion_method=InversionMethod.Cg,
    hessian_regularization=1e-1,
    maxiter=200,
)

Shapley Values

The steps required to compute values for your samples are:

  1. Create a Dataset object with your train and test splits.
  2. Create an instance of a SupervisedModel (essentially any scikit-learn compatible predictor).
  3. Create a Utility object to wrap the Dataset, the model and a scoring function.
  4. Use one of the methods defined in the library to compute the values.

This is how it looks for Truncated Monte Carlo Shapley, an efficient method for computing Data Shapley values:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from pydvl.value import *

data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7)
model = LogisticRegression()
u = Utility(model, data, Scorer("accuracy", default=0.0))
values = compute_shapley_values(
    u,
    mode=ShapleyMode.TruncatedMontecarlo,
    done=MaxUpdates(100) | AbsoluteStandardError(threshold=0.01),
    truncation=RelativeTruncation(u, rtol=0.01),
)

For more instructions and information refer to Getting Started in the documentation. We provide several examples with details on the algorithms and their applications.

Caching

pyDVL offers the possibility to cache certain results and speed up computation. It uses Memcached for that.

You can run Memcached either locally or using Docker:

docker container run --rm -p 11211:11211 --name pydvl-cache -d memcached:latest

You can read more in the documentation.
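The benefit comes from memoizing repeated utility evaluations: Monte Carlo valuation methods evaluate the utility on the same subsets many times across samples. The idea can be sketched with the standard library (pyDVL's actual cache is shared across processes via Memcached, unlike this in-process version):

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def utility(subset):
    """Stand-in for an expensive train-and-score step.

    The argument must be hashable, hence frozenset.
    """
    global calls
    calls += 1
    return sum(subset)

# Repeated subsets hit the cache instead of re-running the expensive step.
for _ in range(1000):
    utility(frozenset({1, 2, 3}))
print(calls)  # 1: the expensive evaluation ran only once
```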

Contributing

Please open new issues for bugs, feature requests and extensions. You can read about the structure of the project, the toolchain and workflow in the guide for contributions.

License

pyDVL is distributed under LGPL-3.0. A complete copy of the license is included in the repository.

All contributions will be distributed under this license.