Help with data structures in a dictionary produced by SyntheticCancerDataset

Question

Help with data structures in a dictionary produced by SyntheticCancerDataset

Closed this issue 2 months ago · 3 comments

Good day @Valentyn1997 ,

I am priviliged to explore your excellent paper and its implementation for my thesis work!

My current aim is to transform Tumor Growth dataset into a tabular format so that I can use it in the training of another model. However, I struggle to comprehend data structures that are produced by an instance of SyntheticCancerDataset.

For example, when I run a simple snippet like this:

import pandas as pd
import numpy as np
from src.data.cancer_sim.dataset import SyntheticCancerDatasetCollection
from src.data.cancer_sim.dataset import SyntheticCancerDataset

# Define the parameters
chemo_coeff = 0.5
radio_coeff = 0.5
num_patients = 10
seed = 5
window_size = 15
seq_length = 10
subset_name = 'train'
mode = 'factual'
projection_horizon = 10
lag = 0
cf_seq_mode = 'sliding_treatment'
treatment_mode = 'multiclass'

# Create an instance of the class
df = SyntheticCancerDataset(
    chemo_coeff,
    radio_coeff,
    num_patients,
    window_size,
    seq_length,
    subset_name,
    mode,
    projection_horizon,
    seed,
    lag,
    cf_seq_mode,
    treatment_mode
)

scaling_params = df.get_scaling_params()
df.process_data(scaling_params)

# Get the data for the first patient
first_patient_data = df[0]
print(first_patient_data)

I get a dictionary with multiple arrays of a different length:

cancer_volume: 10
chemo_dosage: 10
radio_dosage: 10
chemo_application: 10
radio_application: 10
chemo_probabilities: 10
radio_probabilities: 10
sequence_lengths: Not an iterable
death_flags: 10
recovery_flags: 10
patient_types: Not an iterable
prev_treatments: 9
current_treatments: 9
current_covariates: 9
outputs: 9
active_entries: 9
unscaled_outputs: 9
prev_outputs: 9
static_features: 1

Could you help me understand why some arrays have 10 items, whereas other only 9? Similarly, could you give me pointers how to transform this simple dictionary with data for one patient to a tabular format? I am mainly interested in one-hot encoded covariates for historical radio/chemo application and historical tumour volume.

Thank you very much in advance!

Answer 1 · 2024-06-03T11:14:22.000Z

Hello, I also had this doubt, about the difference between the lenghts of the series.
Another point, that is in default parameters some series are generated with more dimensions, for example the prev_treatments and current_treatments have 59 x 4 dimensions , different from others which are single series.

Answer 2 · 2024-08-13T15:17:14.000Z

Dear @linaske,

Sorry for a very late reply!

Could you help me understand why some arrays have 10 items, whereas other only 9?

The fields radio_probabilities, chemo_probabilities, chemo_application, radio_application, radio_dosage, chemo_dosage, cancer_volume have lengths of 10 as the original projection_horizon. They represent original raw data.

Then, the others, prev_outputs, unscaled_outputs, active_entries, outputs, current_covariates, current_treatments,prev_treatments, represent pre-processed data. Here, we need to split the original treatment sequences, chemo_application and radio_application, into current_treatments and prev_treatments, respectively, so that the prev_treatments are one time-step left-shifted. Hence, the length of the pre-processed sequences is one time-step shorter.

could you give me pointers how to transform this simple dictionary with data for one patient to a tabular format?

To transform the data to the tabular format, one needs to flatten all the pre-processed sequences (prev_outputs, unscaled_outputs, active_entries, outputs, current_covariates, current_treatments,prev_treatments) and append the static features. Importantly, some patients have shorter sequences so that the last entries are filled in with zeros.

I hope, I have answered your question!

Best, Valentyn

Answer 3 · 2024-08-13T15:19:24.000Z

Dear @angeruzzi,

Regarding your question:

some series are generated with more dimensions, for example the prev_treatments and current_treatments have 59 x 4 dimensions

Both prev_treatments and current_treatments contain one-hot encoded combinations of two binary treatments (chemo_application, radio_application).

Best,
Valentyn