RALPS stands for Regularized Adversarial Learning Preserving Similarity. It is a method for removing batch effects from omics data, originally developed to harmonize multi-batch metabolomics measurements acquired by mass spectrometry (MS). RALPS exploits reference samples (e.g., pooled study samples, the NIST SRM 1950, or any other suitable reference material) present in every batch to assess interbatch differences. RALPS reconstructs the original data while (i) removing interbatch differences on the basis of replicate measurements across batches and (ii) avoiding an expansion of the overall variance.
In practice, each batch should first be normalized individually to suppress intrabatch effects, e.g., temporal trends associated with drifts in LC-MS. RALPS is then used in a second step to harmonize the batches.
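For illustration only, a very simple intrabatch normalization is per-sample median scaling within each batch; real LC-MS workflows often use more elaborate drift corrections (e.g., QC-based LOESS). Below is a minimal sketch, assuming a features-by-samples intensity matrix and a batch info table indexed by sample id (the helper name is hypothetical and not part of RALPS):

```python
import pandas as pd

def median_scale_per_batch(data: pd.DataFrame, info: pd.DataFrame) -> pd.DataFrame:
    """Scale every sample so its median intensity matches the overall
    median of its batch (hypothetical helper, not part of RALPS).

    data: features x samples intensity matrix
    info: batch info table indexed by sample id, with a 'batch' column
    """
    normalized = data.copy()
    for batch, sample_ids in info.groupby("batch").groups.items():
        cols = [s for s in sample_ids if s in normalized.columns]
        # Overall median intensity of this batch.
        batch_median = normalized[cols].stack().median()
        # Rescale each sample to that median.
        for col in cols:
            normalized[col] = normalized[col] * (batch_median / normalized[col].median())
    return normalized
```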
RALPS is particularly flexible with respect to the experimental design. Reference samples can be identical across all batches, but they can also vary between pairs of batches. In principle, it is also possible to include some samples from the previous batch in the next one and use these replicate measurements for training RALPS.
RALPS preserves spectral properties and is robust against missing values.
RALPS includes a heuristic to automatically identify the best set of parameters.
Principles and performance are described in the accompanying paper:
Dmitrenko A, Reid M and Zamboni N. Regularized adversarial learning for normalization of multi-batch untargeted metabolomics data. Bioinformatics (2023). DOI
RALPS depends on the following Python packages:

```
hdbscan==0.8.27
matplotlib==3.4.1
numpy==1.20.0
pandas==1.2.4
scikit-learn==0.24.2
scipy==1.6.3
seaborn==0.11.1
torch==1.8.1
umap-learn==0.5.1
```
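Assuming these pinned dependencies are kept in a requirements.txt file (filename assumed here), they can be installed with:

```
pip install -r requirements.txt
```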
RALPS has been tested on CPU and GPU under macOS and Windows.
Training time required to normalize a dataset with ~3000 samples and ~150 metabolites was 5.82 minutes per run on average (30 epochs).
Run the following command from the `src` directory to normalize data with RALPS:

```
python ralps.py -n path/to/config.csv
```
The config file should contain paths to the data and batch information files, as well as several other parameters. All required fields and parameters are described below.
Find example input files here.
Parameter | Comment | Default value |
---|---|---|
data_path | path to a csv data file | - |
info_path | path to a csv batch info file | - |
out_path | path to a new folder to save results to | - |
latent_dim | dimension of the bottleneck linear layer | -1 (automatically derived from PCA) |
variance_ratio | percent of explained variance to derive latent_dim | 0.9,0.95,0.99 |
n_replicates | mean number of replicates in the data | 3 |
grid_size | size of the randomized grid search (# of runs) | 1 |
d_lr | classifier learning rate | 0.00005-0.005 |
g_lr | autoencoder learning rate | 0.00005-0.005 |
d_lambda | classifier loss coefficient | 0.-10. |
g_lambda | autoencoder regularization term coefficient | 0.-10. |
v_lambda | variation loss coefficient | 0.-10. |
train_ratio | train-test split ratio | 0.9 |
batch_size | data loader batch size | 32,64,128 |
epochs | # of epochs to train | 30 |
skip_epochs | # of epochs to skip for model selection | 3 |
keep_checkpoints | save all model checkpoints after training | False (keep only best model) |
device | device to train on (Torch) | cpu |
plots_extension | save plots with this extension | png |
min_relevant_intensity | missing values before normalization are replaced with this value; values below it after normalization are masked with zeros | 1000 |
allowed_vc_increase | fraction of sample's VC increase allowed (not contributing to the variation loss) | 0.05 |
For most parameters, comma-separated values (e.g., `batch_size`) or dash-separated intervals (e.g., `d_lr`) can be provided. For those, values are sampled uniformly in the randomized grid search from the defined options or intervals. Otherwise, the exact values provided are used. Default parameter values can be selected by setting `-1`.
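For illustration, a config file could look like the following (the two-column parameter,value layout and the placeholder paths are assumptions; consult the example files linked above for the exact format):

```
parameter,value
data_path,/path/to/data.csv
info_path,/path/to/batch_info.csv
out_path,/path/to/results
latent_dim,-1
variance_ratio,0.95
n_replicates,3
grid_size,10
d_lr,0.0001-0.001
g_lr,0.0001-0.001
d_lambda,0.-5.
g_lambda,0.-5.
v_lambda,0.-5.
train_ratio,0.9
batch_size,64
epochs,30
skip_epochs,3
keep_checkpoints,False
device,cpu
plots_extension,png
min_relevant_intensity,1000
allowed_vc_increase,0.05
```

With `grid_size` set to 10, RALPS would perform ten runs of the randomized grid search, sampling the learning rates and lambda coefficients uniformly from the given intervals.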
The data file is a csv table of intensities with features in rows and samples in columns:

 | sample_id_1 | ... | sample_id_M |
---|---|---|---|
feature_1 | count | ... | count |
... | | | |
feature_N | count | ... | count |
The batch info file is a csv table describing each sample:

 | batch | group | benchmark |
---|---|---|---|
sample_id_1 | 1 | reg_1 | 0 |
sample_id_2 | 1 | reg_1 | 0 |
sample_id_3 | 2 | 0 | 0 |
... | | | |
sample_id_M-1 | k | 0 | bench_M |
sample_id_M | k | 0 | bench_M |
- The batch column indicates samples' batch labels.
- The group column indicates groups of identical samples (replicates), used for regularization. If several samples share the same label (e.g., `reg_1`), they are treated as replicates of the same material. During training, samples of the same group are encouraged to appear in the same cluster. Use `0` or an empty value to provide no information about the similarity of samples.
- The benchmark column indicates groups of identical samples used as benchmarks in model evaluation. They are not used for regularization during training, unless they also appear in the group column.
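To make the expected shapes concrete, the following sketch writes a tiny data file and a matching batch info file with pandas (all sample and feature names, intensities, and file names are made up):

```python
import pandas as pd

# Intensity matrix: features in rows, samples in columns (made-up values).
data = pd.DataFrame(
    {
        "s1": [1200.0, 3400.0],
        "s2": [1100.0, 3600.0],
        "s3": [900.0, 2800.0],
        "s4": [950.0, 3000.0],
    },
    index=["feature_1", "feature_2"],
)

# Batch info: one row per sample. 'group' marks replicates used for
# regularization; 'benchmark' marks replicates held out for evaluation.
info = pd.DataFrame(
    {
        "batch": [1, 1, 2, 2],
        "group": ["reg_1", 0, "reg_1", 0],
        "benchmark": [0, "bench_1", 0, "bench_1"],
    },
    index=["s1", "s2", "s3", "s4"],
)

data.to_csv("data.csv")
info.to_csv("batch_info.csv")
```

Here, `reg_1` replicates span batches 1 and 2 and drive the regularization, while the `bench_1` samples are reserved for evaluation because they do not appear in the group column.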
If you choose to keep checkpoints in the config file, the autoencoder model at each training epoch is saved in the `checkpoints` directory. You can select a few checkpoints based on the training history to obtain alternative normalization solutions and the corresponding evaluation plots. To do that, remove the unnecessary checkpoints and run the following command from the `src` directory:

```
python ralps.py -e path/to/directory/with/checkpoints/
```
Important: This works only with default RALPS output (directories and filenames should not be changed).
If you wish to remove outliers from the normalized data, as proposed in the paper, run the following command from the `src` directory:

```
python ralps.py -r path/to/normalized/data.csv
```
Important: This works only with default RALPS output (directories and filenames should not be changed).
If you wish to reconfigure RALPS (e.g., to use a different clustering algorithm as default, or to change default parameter values), you can do so by editing `src/constants.py`.