astartes

Train:Test Algorithmic Sampling for Molecules, Images, and Arbitrary Arrays

Primary language: Python. License: MIT.

Online Documentation

Click here to read the documentation

Background

Rational Splitting Algorithms

While much machine learning is done with a random split of the data into training, validation, and test sets, an alternative is the use of so-called "rational" splitting algorithms. These approaches use a similarity-based algorithm to divide data into sets. They include the Kennard-Stone, minimal test set dissimilarity, and sphere exclusion algorithms discussed by Tropsha et al., as well as the DUPLEX, OptiSim, and D-optimal algorithms discussed in Applied Chemoinformatics: Achievements and Future Opportunities. Clustering-based splitting techniques, such as DBSCAN, have also been introduced.
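To make the idea of a similarity-based split concrete, here is a minimal pure-Python sketch of the classic Kennard-Stone procedure: seed the training set with the two most distant points, then repeatedly add the point farthest from everything already selected. The function name kennard_stone_split and its signature are assumptions made for this illustration; this is not the implementation shipped in the package.

```python
import math


def kennard_stone_split(points, train_size=0.75):
    """Illustrative Kennard-Stone split (hypothetical helper, not the package API).

    Returns (train_indices, test_indices) for a list of equal-length
    numeric tuples, using Euclidean distance.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    n_train = max(2, min(n, round(train_size * n)))

    # Full pairwise distance table; fine for small illustrative inputs.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Seed the training set with the two most distant points.
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    i0, j0 = max(pairs, key=lambda ij: dist[ij[0]][ij[1]])
    selected = [i0, j0]
    remaining = set(range(n)) - {i0, j0}

    # min_dist[k] is the distance from candidate k to its nearest selected point.
    min_dist = {k: min(dist[k][i0], dist[k][j0]) for k in remaining}

    while len(selected) < n_train:
        # Pick the candidate farthest from the selected set (lowest index on ties).
        nxt = max(remaining, key=lambda k: (min_dist[k], -k))
        selected.append(nxt)
        remaining.remove(nxt)
        del min_dist[nxt]
        for k in remaining:
            min_dist[k] = min(min_dist[k], dist[k][nxt])

    return selected, sorted(remaining)


train_idx, test_idx = kennard_stone_split(
    [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)], train_size=0.6
)
print(train_idx, test_idx)
```

Because the training set is built from the extremes inward, the test set tends to fall inside the region already covered by the training data, which is the usual motivation for this family of methods.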

Sampling Algorithms

  • Random
  • Kennard-Stone (KS)
  • Minimal Test Set Dissimilarity
  • Sphere Exclusion
  • DUPLEX
  • OptiSim
  • D-Optimal
  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  • K-Means Split
  • SPXY
  • RBM
  • Time Split
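As a small illustration of one entry in the list above, a time-based split can be sketched in a few lines: order the samples chronologically, train on the earliest ones, and test on the most recent. The helper time_split below is hypothetical and shows only the general idea; the package's own Time Split sampler may behave differently.

```python
from datetime import date


def time_split(samples, timestamps, train_size=0.75):
    """Illustrative chronological split (hypothetical helper, not the package API)."""
    if len(samples) != len(timestamps):
        raise ValueError("samples and timestamps must have the same length")
    # Indices ordered from oldest to newest.
    order = sorted(range(len(samples)), key=lambda i: timestamps[i])
    n_train = round(train_size * len(samples))
    train = [samples[i] for i in order[:n_train]]
    test = [samples[i] for i in order[n_train:]]
    return train, test


train, test = time_split(
    ["a", "b", "c", "d"],
    [date(2021, 5, 1), date(2019, 1, 1), date(2022, 3, 1), date(2020, 7, 1)],
)
print(train, test)
```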

Extending Functionality

A new sampler should extend the abstract base class defined in sampler.py.

It can be as simple as a passthrough to another train_test_split, or it can be an original implementation that splits X and y into two lists.
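The sketch below shows one way such an extension could look. The contents of sampler.py are not reproduced here, so the base class AbstractSampler, its method names, and the random_state key in hopts are all assumptions made for illustration rather than the package's actual interface.

```python
import random
from abc import ABC, abstractmethod


class AbstractSampler(ABC):
    """Hypothetical stand-in for the abstract base class in sampler.py."""

    def __init__(self, X, y=None, hopts=None):
        self.X = list(X)
        self.y = None if y is None else list(y)
        self.hopts = dict(hopts or {})

    @abstractmethod
    def sample_indices(self, n_train):
        """Return the indices chosen for training; the rest become the test set."""

    def split(self, train_size=0.75):
        n_train = round(train_size * len(self.X))
        train_idx = list(self.sample_indices(n_train))
        chosen = set(train_idx)
        test_idx = [i for i in range(len(self.X)) if i not in chosen]
        X_train = [self.X[i] for i in train_idx]
        X_test = [self.X[i] for i in test_idx]
        if self.y is None:
            return X_train, X_test
        y_train = [self.y[i] for i in train_idx]
        y_test = [self.y[i] for i in test_idx]
        return X_train, X_test, y_train, y_test


class RandomSampler(AbstractSampler):
    """Simplest possible subclass: a seeded random shuffle."""

    def sample_indices(self, n_train):
        # "random_state" is an assumed key, used here only for reproducibility.
        rng = random.Random(self.hopts.get("random_state"))
        idx = list(range(len(self.X)))
        rng.shuffle(idx)
        return idx[:n_train]


sampler = RandomSampler(range(8), y=[i * i for i in range(8)], hopts={"random_state": 0})
X_train, X_test, y_train, y_test = sampler.split(train_size=0.75)
print(len(X_train), len(X_test))
```

Under this design a subclass only decides which indices go to training; the shared split method keeps X and y aligned, so every sampler returns its results in the same shape.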

A new interface should follow this format:

import numpy as np

from extended_train_test_split import train_test_split


def train_test_split_INTERFACE(
    INTERFACE_input,
    INTERFACE_ARGS,
    y: np.ndarray = None,
    test_size: float = 0.25,
    train_size: float = 0.75,
    splitter: str = 'random',
    hopts: dict = None,
    INTERFACE_hopts: dict = None,
):
    # default to None rather than {} so that calls never share a mutable dict
    hopts = hopts or {}
    INTERFACE_hopts = INTERFACE_hopts or {}

    # turn the INTERFACE_input into an input X
    # based on INTERFACE_ARGS, where INTERFACE_hopts
    # specifies additional behavior
    X = []

    # call train_test_split with this input
    return train_test_split(
        X,
        y=y,
        test_size=test_size,
        train_size=train_size,
        splitter=splitter,
        hopts=hopts,
    )
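As a filled-in illustration of the template above, the sketch below defines a hypothetical interface for small grayscale images stored as nested lists. Because extended_train_test_split is not importable in isolation, a simplified local train_test_split stands in for it; that stand-in, the name train_test_split_images, and the scale and random_state options are all assumptions made for this example.

```python
import random


def train_test_split(X, y=None, test_size=0.25, train_size=0.75,
                     splitter="random", hopts=None):
    """Simplified stand-in for the package splitter; handles only 'random'.

    test_size is accepted for signature parity but ignored in this sketch.
    """
    if splitter != "random":
        raise NotImplementedError(splitter)
    hopts = hopts or {}
    idx = list(range(len(X)))
    random.Random(hopts.get("random_state")).shuffle(idx)
    n_train = round(train_size * len(X))
    train_idx, test_idx = idx[:n_train], idx[n_train:]
    out = [[X[i] for i in train_idx], [X[i] for i in test_idx]]
    if y is not None:
        out += [[y[i] for i in train_idx], [y[i] for i in test_idx]]
    return tuple(out)


def train_test_split_images(images, y=None, test_size=0.25, train_size=0.75,
                            splitter="random", hopts=None, images_hopts=None):
    """Hypothetical interface: flatten each 2D image into one feature vector."""
    images_hopts = images_hopts or {}
    scale = images_hopts.get("scale", 1.0)  # assumed option: rescale pixel values
    X = [[pixel * scale for row in image for pixel in row] for image in images]
    return train_test_split(
        X,
        y=y,
        test_size=test_size,
        train_size=train_size,
        splitter=splitter,
        hopts=hopts,
    )


images = [[[i, i + 1], [i + 2, i + 3]] for i in range(4)]
X_train, X_test, y_train, y_test = train_test_split_images(
    images, y=[0, 1, 2, 3], hopts={"random_state": 1}
)
print(len(X_train), len(X_test))
```

The interface only converts its domain-specific input into X and then delegates, so any splitter available to train_test_split is available to the interface as well.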

JOSS Branch

paper.md is stored in a separate branch aptly named joss-paper. To push changes from the main branch into the joss-paper branch, run the Update JOSS Branch workflow.