This notebook demonstrates how to load, preprocess, and analyze radiomics and clinical data using machine learning techniques. The goal is to classify patient outcomes using radiomics features (derived from imaging data and extracted using QUANTIMAGE-V2) and clinical features. The process includes data preprocessing, feature selection, model training, and evaluation. The data originate from 3D PET/CT images of the HEad and neCK TumOR segmentation and outcome prediction (HECKTOR) challenge.
The notebook imports essential libraries for:
- Data handling: numpy, pandas.
- Modeling and evaluation: scikit-learn (for feature selection, train/test split, and model evaluation).
- Statistical analysis: scipy, statsmodels.
- File handling: google.colab (for uploading files in Colab).
- The following functions are provided to load and filter the data:
  - load_features(folder_path, file_start): Loads CSV files from the specified folder and concatenates them.
  - filter_patients(df1, df2): Keeps only the patients present in both the feature and outcome datasets.
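
The loading and filtering step above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes that file_start is a filename prefix, that files end in .csv, and that patients are identified by a column named PatientID (the id_col parameter is a hypothetical addition to the documented signature).

```python
import glob
import os

import pandas as pd


def load_features(folder_path, file_start):
    """Load every CSV in folder_path whose name starts with file_start and stack them."""
    # Assumption: file_start is a filename prefix and files use the .csv extension.
    pattern = os.path.join(folder_path, f"{file_start}*.csv")
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(f"No CSV files matching {pattern}")
    return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)


def filter_patients(df1, df2, id_col="PatientID"):
    """Keep only rows whose patient ID appears in both DataFrames."""
    # Assumption: both DataFrames share an ID column (hypothetical name "PatientID").
    common = set(df1[id_col]) & set(df2[id_col])
    return df1[df1[id_col].isin(common)], df2[df2[id_col].isin(common)]
```

Intersecting the ID sets before any merge makes it explicit which patients are dropped, which helps when the feature and outcome files were exported separately.
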
- Data Cleaning and Transformation:
  - preprocess_data(df): Prepares and pivots the feature DataFrame to combine components like Modality and ROI.
  - feature_preprocessing(df): Applies scaling, imputation, and one-hot encoding for categorical variables.
- Correlation Filtering:
  - drop_correlated_features(X_train, X_test, threshold): Drops highly correlated features to avoid multicollinearity.
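
One common way to implement such a filter is shown below as a sketch. It assumes Pearson correlation computed on the training set only, and a default threshold of 0.9 that is purely illustrative; neither detail is confirmed by the notebook description.

```python
import numpy as np
import pandas as pd


def drop_correlated_features(X_train, X_test, threshold=0.9):
    """Drop one feature from each pair whose absolute correlation exceeds threshold."""
    # Correlations are estimated on the training set only, so the test set
    # does not influence which features are kept.
    corr = X_train.corr().abs()
    # Keep the strict upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    # Apply the same column selection to both sets.
    return X_train.drop(columns=to_drop), X_test.drop(columns=to_drop)
```

With this scheme, the later column of each correlated pair is the one removed, so the result depends on column order.
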
- Manual selection of both radiomics and clinical features can be performed using select_feature() and the list clinical_features, respectively. SelectKBest can be used to select the most important features based, for example, on mutual information between the features and the target variable.
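
The SelectKBest option could look like the sketch below. It runs on synthetic data from make_classification rather than the HECKTOR features, and the choice of k=3 and the feature names are arbitrary assumptions for illustration.

```python
from functools import partial

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the radiomics feature matrix (assumption).
X_arr, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(10)])

# mutual_info_classif is stochastic; fixing random_state makes the scores reproducible.
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func=score_func, k=3)
X_selected = selector.fit_transform(X, y)

# Recover the names of the retained columns from the boolean support mask.
selected_names = X.columns[selector.get_support()].tolist()
print(selected_names)
```

In a train/test workflow, the selector would be fitted on the training set and then applied to the test set with selector.transform.
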
- Model Selection:
  - A RandomForestClassifier is used for this analysis, but the notebook is flexible enough to accommodate other models such as LogisticRegression.
- Evaluation:
  - Evaluation metrics include:
    - Accuracy
    - ROC AUC
    - Classification report
    - Confusion matrix
  - Bootstrap Analysis: Bootstrap resampling is used to compute confidence intervals for the ROC AUC score, which quantifies the uncertainty of the estimate.
  - evaluate_model(model, X_train, y_train, X_test, y_test, run_bootstrap): Evaluates the model and outputs the metrics listed above.
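
The evaluation metrics and the bootstrap confidence interval can be sketched as follows. The helper bootstrap_auc_ci, its parameters, the percentile method, and the synthetic data are assumptions made for illustration; the notebook's evaluate_model may be organized differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split


def bootstrap_auc_ci(y_true, y_score, n_bootstraps=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the ROC AUC (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    scores = []
    for _ in range(n_bootstraps):
        # Resample the evaluated cases with replacement.
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUC is undefined when a resample contains a single class
        scores.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper


# Synthetic stand-in for the radiomics data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

auc = roc_auc_score(y_test, y_prob)
ci_low, ci_high = bootstrap_auc_ci(y_test, y_prob)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(f"ROC AUC: {auc:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```

Because the code relies only on fit, predict, and predict_proba, swapping RandomForestClassifier for LogisticRegression requires changing only the line that constructs the model.
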
In summary, the main functions are:
- load_features(folder_path, file_start): Loads and concatenates CSV files containing patient features.
- preprocess_data(df): Reshapes the feature DataFrame to combine components and prepare it for modeling.
- feature_preprocessing(df): Performs scaling, imputation, and one-hot encoding of features.
- drop_correlated_features(X_train, X_test, threshold): Removes highly correlated features to improve model performance.
- evaluate_model(model, X_train, y_train, X_test, y_test, run_bootstrap): Evaluates the model on training and test sets, with an option to perform bootstrap analysis.
The notebook expects the following CSV files to be uploaded:
- Feature data: CSV files containing radiomics features for each patient, each ROI (GTVp and GTVn), and each modality (PET and CT). Available in the data/ folder.
- Outcome data: A CSV file containing patient outcomes for classification (e.g., hecktor2022_HPV_outcomesBalanced.csv, available in data/).
- Patient split: A CSV file defining the training and test set split (e.g., patient_split.csv, available in data/).
- Clinical data: A CSV file containing clinical information (e.g., hecktor2022_clinicalFeatures.csv, not publicly available).