Help with data structures in a dictionary produced by SyntheticCancerDataset
Good day @Valentyn1997 ,
I am privileged to explore your excellent paper and its implementation for my thesis work!
My current aim is to transform the Tumor Growth dataset into a tabular format so that I can use it to train another model. However, I struggle to understand the data structures produced by an instance of SyntheticCancerDataset.
For example, when I run a simple snippet like this:
import pandas as pd
import numpy as np
from src.data.cancer_sim.dataset import SyntheticCancerDatasetCollection
from src.data.cancer_sim.dataset import SyntheticCancerDataset
# Define the parameters
chemo_coeff = 0.5
radio_coeff = 0.5
num_patients = 10
seed = 5
window_size = 15
seq_length = 10
subset_name = 'train'
mode = 'factual'
projection_horizon = 10
lag = 0
cf_seq_mode = 'sliding_treatment'
treatment_mode = 'multiclass'
# Create an instance of the class
df = SyntheticCancerDataset(
    chemo_coeff,
    radio_coeff,
    num_patients,
    window_size,
    seq_length,
    subset_name,
    mode,
    projection_horizon,
    seed,
    lag,
    cf_seq_mode,
    treatment_mode,
)
scaling_params = df.get_scaling_params()
df.process_data(scaling_params)
# Get the data for the first patient
first_patient_data = df[0]
print(first_patient_data)
I get a dictionary with multiple arrays of different lengths:
- cancer_volume: 10
- chemo_dosage: 10
- radio_dosage: 10
- chemo_application: 10
- radio_application: 10
- chemo_probabilities: 10
- radio_probabilities: 10
- sequence_lengths: Not an iterable
- death_flags: 10
- recovery_flags: 10
- patient_types: Not an iterable
- prev_treatments: 9
- current_treatments: 9
- current_covariates: 9
- outputs: 9
- active_entries: 9
- unscaled_outputs: 9
- prev_outputs: 9
- static_features: 1
Could you help me understand why some arrays have 10 items, whereas others have only 9? Similarly, could you give me pointers on how to transform this simple dictionary with data for one patient into a tabular format? I am mainly interested in one-hot encoded covariates for historical radio/chemo application and historical tumour volume.
Thank you very much in advance!
Hello, I also had this question about the difference between the lengths of the series.
Another point: with the default parameters, some series are generated with more dimensions. For example, prev_treatments and current_treatments have 59 x 4 dimensions, unlike the others, which are single series.
Dear @linaske,
Sorry for a very late reply!
Could you help me understand why some arrays have 10 items, whereas other only 9?
The fields radio_probabilities, chemo_probabilities, chemo_application, radio_application, radio_dosage, chemo_dosage, and cancer_volume have a length of 10, matching the original projection_horizon. They represent the original raw data.
The others (prev_outputs, unscaled_outputs, active_entries, outputs, current_covariates, current_treatments, and prev_treatments) represent pre-processed data. Here, we need to split the original treatment sequences, chemo_application and radio_application, into current_treatments and prev_treatments, so that the prev_treatments are one time-step left-shifted. Hence, the pre-processed sequences are one time-step shorter.
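The effect of the shift on the sequence length can be illustrated with a small NumPy sketch. This is not the repository's pre-processing code: the toy arrays and the exact slicing are assumptions, and the treatments are kept as two binary columns rather than one-hot encoded for simplicity. It only shows why pairing every step with its predecessor leaves 9 entries out of 10.

```python
import numpy as np

# Toy raw treatment sequences for one patient (length 10, as in the question).
chemo_application = np.array([0, 1, 1, 0, 0, 1, 0, 0, 1, 1])
radio_application = np.array([0, 0, 1, 1, 0, 0, 0, 1, 0, 1])

# Stack the two raw sequences into a (10, 2) array.
treatments = np.stack([chemo_application, radio_application], axis=-1)

# Assumed slicing: every "current" step needs a "previous" step,
# so one time step is consumed by the shift.
prev_treatments = treatments[:-1]    # steps 0..8
current_treatments = treatments[1:]  # steps 1..9

print(treatments.shape)          # (10, 2)
print(prev_treatments.shape)     # (9, 2)
print(current_treatments.shape)  # (9, 2)
```

In this sketch, row t of prev_treatments is the treatment given one step before row t of current_treatments.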
could you give me pointers how to transform this simple dictionary with data for one patient to a tabular format?
To transform the data into a tabular format, one needs to flatten all the pre-processed sequences (prev_outputs, unscaled_outputs, active_entries, outputs, current_covariates, current_treatments, and prev_treatments) and append the static features. Importantly, some patients have shorter sequences, so their last entries are filled in with zeros.
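A minimal sketch of this flattening with pandas is given below. The helper patient_to_frame is hypothetical and not part of the repository, and the array shapes in toy_patient (for example (9, 4) for the treatments and (9, 1) for the outputs) are assumptions made for illustration. It also assumes that active_entries equals 1 for real time steps and 0 for padding; check your own arrays before relying on it.

```python
import numpy as np
import pandas as pd

# Pre-processed sequence fields named in the reply above.
SEQUENCE_KEYS = [
    "prev_treatments", "current_treatments", "current_covariates",
    "prev_outputs", "outputs", "unscaled_outputs", "active_entries",
]


def patient_to_frame(patient, patient_id=0, drop_padding=True):
    """Hypothetical helper: one row per time step, one column per feature."""
    columns = {}
    for key in SEQUENCE_KEYS:
        arr = np.asarray(patient[key])
        arr = arr.reshape(arr.shape[0], -1)  # (T,) or (T, d) -> (T, d)
        for j in range(arr.shape[1]):
            name = key if arr.shape[1] == 1 else f"{key}_{j}"
            columns[name] = arr[:, j]
    frame = pd.DataFrame(columns)
    frame.insert(0, "time_step", np.arange(len(frame)))
    frame.insert(0, "patient_id", patient_id)
    # Static features are constant over time, so repeat them on every row.
    static = np.asarray(patient["static_features"]).reshape(-1)
    for j, value in enumerate(static):
        frame[f"static_{j}"] = value
    if drop_padding:
        # Assumption: padded time steps are marked with active_entries == 0.
        frame = frame[frame["active_entries"] == 1].reset_index(drop=True)
    return frame


# Toy stand-in for one dataset item; the shapes are assumed, not verified.
T = 9
rng = np.random.default_rng(0)
toy_patient = {
    "prev_treatments": np.eye(4)[rng.integers(0, 4, T)],
    "current_treatments": np.eye(4)[rng.integers(0, 4, T)],
    "current_covariates": rng.random((T, 1)),
    "prev_outputs": rng.random((T, 1)),
    "outputs": rng.random((T, 1)),
    "unscaled_outputs": rng.random((T, 1)),
    "active_entries": np.concatenate([np.ones((7, 1)), np.zeros((2, 1))]),
    "static_features": np.array([1.0]),
}

table = patient_to_frame(toy_patient)
print(table.head())
```

To build a table for the whole dataset, the per-patient frames could be combined with pd.concat, passing a different patient_id for each item.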
I hope I have answered your question!
Best, Valentyn
Dear @angeruzzi,
Regarding your question:
some series are generated with more dimensions, for example the prev_treatments and current_treatments have 59 x 4 dimensions
Both prev_treatments and current_treatments contain one-hot encoded combinations of two binary treatments (chemo_application, radio_application).
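As a rough illustration of where the 4 columns come from: two binary treatments give four possible combinations, and each combination gets its own one-hot column. The sketch below uses toy arrays, and the mapping from combination to column index is an assumption; the ordering used in the repository may differ.

```python
import numpy as np

# Toy binary treatment indicators over five time steps.
chemo_application = np.array([0, 1, 0, 1, 1])
radio_application = np.array([0, 0, 1, 1, 0])

# Assumed class index: 0 = none, 1 = chemo only, 2 = radio only, 3 = both.
combined = chemo_application + 2 * radio_application

# Look up one row of the 4x4 identity matrix per time step.
one_hot = np.eye(4, dtype=int)[combined]

print(combined)       # [0 1 2 3 1]
print(one_hot.shape)  # (5, 4)
```

Each row of one_hot contains exactly one 1, so a sequence of T time steps becomes a T x 4 array.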
Best,
Valentyn