Subtype-aware batch correction retains biological signal of integrated
breast cancer datasets—Supplementary Methods

1—Dataset Acquisition

Raw CEL files from ten breast cancer gene expression dataset where downloaded from GEO and normalized with fRMA, RMA and MAS5. Demographics of each dataset are shown in Table 1.

Table 1—Demographics of datsets used in this study.

GSE	platform	nSamples	fracER+	fracHER2+
GSE5327	HG-U133A	58	0.00	NA
GSE25065	HG-U133A	198	0.62	0.01
GSE2034	HG-U133A	286	0.73	NA
GSE17705	HG-U133A	298	1.00	NA
GSE16446	HG-U133Plus2	120	0.00	0.33
GSE17907	HG-U133Plus2	33	0.47	1.00
GSE21653	HG-U133Plus2	266	0.57	0.12
GSE5460	HG-U133Plus2	127	0.58	0.24
GSE2109	HG-U133Plus2	353	0.65	0.27
GSE23177	HG-U133Plus2	116	1.00	0.00

2—Microarray Dataset Integration

Microarray dataset integration needs to account for technical, non-biological, variation across multiple batches of independently acquired datasets. The goal of conventional batch correction (BC) is to remove batch effects while retaining biological variation conveyed by each dataset. Several methods exist to address this task. One popular approach is ComBat, which proposes “parametric and nonparametric empirical Bayes frameworks for adjusting data for batch effects that is robust to outliers in small sample sizes and performs comparable to existing methods for large samples”. However, most BC integration strategies do not account for imbalanced subtype composition within datasets.

This manuscript introduces a novel procedure for integrating expression profile datasets of known dissimilar composition, called subtype-aware batch correction (SABC). Molecular subtypes are initially assigned to each sample in each dataset with a publicly available single sample predictor (SSP). Then, batch effects are resolved between datasets on a per-subtype basis. This two-layered approach allows for the biological specifics captured by the single sample predictor of choice to be accounted for during the batch correction step, and thus carried over into the integrated dataset. By ignoring distinct subtype compositions between datasets, conventional BC incorrectly apprehends biological variation as technical batch effect, and consequently distorts true biological signal in the integrated dataset.

Breast cancer is widely understood to be subdivided into five intrinsic or molecular subtypes, which can be assigned by gene expression profile single sample predictors. Breast cancer hence provides an ideal case stud y for the evaluation of SABC. We used conventional BC and SABC to integrate four breast cancer datasets hybridized onto the Affymetrix HG-U133a chip and six breast cancer datasets hybridized onto the HG-U133 Plus2 chip (Table 1).

To evaluate the performance of each method, we compared the distributions of expression values for the 205225_at and the 216836_s_at probesets in each chip, respectively targeting for the ESR1 and ERBB2 gene transcripts. Estrogen receptor (ESR1) and Erb-B2 Receptor Tyrosine Kinase 2 receptor (ERBB2) status are strong predictors of breast cancer prognosis and are traditionally assessed by immunohistochemestry (IHC). Post dataset integration, the expression values of both genes should remain in line with the biological signal conveyed by the independent assessment given by IHC for both proteins. In addition, we compared the agreement between single sample predictor class assignments for individual samples prior and post integration. The batch correction procedure should not interfere with the molecular subtype identity of each sample, and for that reason higher agreement rates should be indicative of higher transcriptional fidelity of the integrated dataset.

To guide the integration process with SABC, we used two SSPs implemented in the Genefu Bioconductor package. The first, sorlie2003, is based on 534 diagnostic genes and is a five-subtype classifier; the second, desmedts2008, is based on three genes (ESR1, ERBB2 and AURKA), and is a three-subtype classifier. To assess classifier agreement prior and post integration, and in order to circumvent the redundancy caused by using the same single sample predictor to integrate and to validate the integration process, we used the Genefu implementation of the PAM50 single sample predictor, based on 50 genes and yielding a five-subtype classifier.

##    user  system elapsed 
##  22.646   8.356  31.081

2.1—Distortion of Molecular ESR1 and ERBB2 Measurements

We compared the distributions of expression values prior and post dataset integration for the 205225_at probeset in 840 breast tumours, hybridised onto the HG-U133a chip, from four datasets with distinct fractions of ER+ samples (Table 1 and Fig.1). Regardless of the normalisation method, BC integration significantly distorts ESR1 expression measurements in samples from datasets with extreme fractions of ER+ samples (GSE5327 and GSE17705), to the point where the two distributions no longer can tell the difference between ER– and ER+ samples (Fig.2, second column of panels). In both cases, SABC integration, whether driven by a 3-subtype SSP (SABC3) or a 5-subtype SSP (SABC5), succeeds in retaining the biological signal conveyed by the IHC status in datasets with extreme compositions (Fig.2, third and fourth columns of panels). Further comparison of ESR1 transcript abundance in these two datasets prior and post integration reveals that expression values depart significantly from original measurements when datasets are integrated with BC, yet are preserved from extreme distortion by SABC integration (Fig.3).

Although less pronounced, a similar trend is observed when comparing the expression values prior and post- dataset integration for the 216836_s_at probeset, in 1015 breast tumours hybridised onto the HG-U133Plus2 chip, from six datasets with distinct proportions of HER2+ samples (Table 1 and Fig.4). Regardless of the normalisation protocol, The ERBB2 probeset distributions clearly reflect the IHC HER2 receptor status in the two datasets in our analysis with extreme HER2 compositions (GSE23177, all HER2–; and GSE17907, all HER2+). When the five datasets are integrated with BC, this biological signal is erased, yet preserved (albeit to a lesser extent than in the original datasets), when integrated with SABC3 and SABC5 (Fig.4).

The different degree to which subtype-aware batch correction successfully retains biological signal from these two molecular correlates of breast cancer biology, in datasets with extreme compositions, could be explained by how well the classifiers used to drive integration capture the underlying biology of the ER and HER2 receptors. Because the clinical HER2 breast cancer phenotype is the result of a gene amplification, it is possible that gene expression classifiers are less apt to model binary gene expression distributions (HER2) rather than continuous ones (ER). In addition, the top level split highlighted by most molecular characterisations of breast tumours is lead by ER status, consigning most SSP derived from molecular data to be particularly sensitive to biological signal conveyed by this marker.

2.2—Single Sample Predictor Agreement

Single sample predictor assignments for the Genefu implementation of the PAM50 breast cancer classifier were computed for each of the 1015 samples in GSE2109, GSE21653, GSE5460, GSE16446, GSE23177 and GSE17907, after FRMA normalisation (Table 1). We then integrated the six datasets with BC, SABC3 and SABC5, and computed each sample’s PAM50 subtype post-integration. Comparisons of subtype assignments prior- and post-integration can be seen on Fig.5.

Table 2—Interrater agreement of PAM50 subtype assessment prior- and post-dataset integration (see text for details).

GSE	nSamples	fracER+	fracHER2+	Cohen’s Kappa: BC	Cohen’s Kappa: SABC3	Cohen’s Kappa: SABC5	Cohen’s Kappa: HRMN5	Cohen’s Kappa: BCCOV5
all	1015	NA	NA	0.72	0.78	0.80	0.72	0.83
GSE2109	353	0.65	0.27	0.95	0.85	0.93	0.92	0.96
GSE21653	266	0.57	0.12	0.80	0.90	0.85	0.80	0.85
GSE5460	127	0.58	0.24	0.72	0.75	0.76	0.74	0.72
GSE16446	120	0.00	0.33	0.21	0.63	0.50	0.20	0.70
GSE23177	116	1.00	0.00	0.30	0.25	0.40	0.32	0.43
GSE17907	33	0.47	1.00	0.58	0.79	0.61	0.65	0.69

With the exception of the largest dataset, GSE2109, PAM50 interrater agreement was always higher with SABC integration than with conventional integration. Incidentally, the datasets that showed lesser subtype agreement post BC integration are the ones with most extreme ER and HER2 compositions (GSE16446, GSE23177 and GSE17907). For these datasets, SABC integration was able to increase subtype assignment agreement (with the exception of SABC3 for GSE23177).

Because sample size can further confound the effect of dataset composition towards retention of biological signal post-integration, we compared the PAM50 interrate agreement prior and post-integration for datasets GSE2109, GSE21653, GSE5460, GSE16446 and GSE23177, sampling in each dataset, 116 samples with the same IHC ER+ proportions as the original datasets. Comparisons of subtype assignements prior and post-integration in this experiment can be seen in Fig.6 and Table 3.

2.3—Comparison with other Methods

SABC is not the first method to account for external biological variables or clinical covariates during dataset integration. ComBat offers the possibility of using clinical variables as covariates during dataset integration. Harman, a more recent integration procedure using a PCA and a constrained optimisation technique, also calls for factor coding for an “experimental grouping variable” to drive integration. We ran ComBat and Harman to integrate the 1015 breast tumours hybridised onto the the HG-U133Plus2 chip, using the the output of the 5-class sorlie2003 SSP as covariate and experimental grouping variable respectively.