Accessing probe intensities for QC metrics
Closed this issue · 0 comments
Greetings,
I am playing around with this repo in an attempt to implement GCT score calculation (Explained here: https://academic.oup.com/nar/article/45/4/e22/2290930).
I've added some code to manifest.py that allows me to add a column to the manifest dataframe that contains the extension base for a probe.
Next is to get the mean green intensity of probes expected to extend into a C and divide that by the mean green intensity of probes expected to extend into a T. I've included my function for doing so below but I am not getting close to producing a ratio near 1 (the expected result).
I was hoping to get some advice on where I might be going wrong here. I've had some trouble tracking probe intensity information throughout the analysis so it may be possible that I am using the wrong dataframe for this analysis. I am afraid my current approach uses unnormalized / dye bias corrected intensities which would prevent me from calculating a meaningful ratio.
I appreciate any advice you could share with me to help me implement GCT score calculation. From my testing, it's a useful way to quantify levels of bisulfite conversion and I'd be happy to add the feature and open a pull request.
Thanks,
Mauro Chavez
def _calculate_gct_score(data_container):
"""
SeSSme implementation in R:
# Get the type I probes expected to extend into C and T
extC <- sesameDataGet(paste0(sdfPlatform(sdf), '.probeInfo'))$typeI.extC
extT <- sesameDataGet(paste0(sdfPlatform(sdf), '.probeInfo'))$typeI.extT
## prbs <- rownames(oobG(sset))
df <- InfIR(sdf)
extC <- intersect(df$Probe_ID, extC)
extT <- intersect(df$Probe_ID, extT)
dC <- df[match(extC, df$Probe_ID),]
dT <- df[match(extT, df$Probe_ID),]
mean(c(dC$MG, dC$UG), na.rm=TRUE) / mean(c(dT$MG, dT$UG), na.rm=TRUE)
"""
if "Next_Base" not in data_container.man.columns:
LOGGER.warning(f"No Next_Base col found in manifest: NO GCT Score calculated")
return None
print(data_container.man.columns)
# Successful conversion means that all these probes should light up RED
ext_c_probes = data_container.man[(data_container.man["Next_Base"] == "C") & (data_container.man["Infinium_Design_Type"] == "I")]
ext_c_probes = ext_c_probes.merge(
data_container.green_idat.probe_means,
left_on="AddressA_ID", right_index=True,
).merge(
data_container.green_idat.probe_means,
left_on="AddressB_ID", right_index=True,
suffixes=('_AddressA_grn', '_AddressB_grn')
).merge(
data_container.red_idat.probe_means,
left_on="AddressA_ID", right_index=True,
).merge(
data_container.red_idat.probe_means,
left_on="AddressB_ID", right_index=True,
suffixes=('_AddressA_red', '_AddressB_red')
)
print("Exp C probes")
print(ext_c_probes[["mean_value_AddressA_grn", "mean_value_AddressB_grn", "mean_value_AddressA_red", "mean_value_AddressB_red"]].mean())
# Use these probes to normalize against how often a green signal comes from an extension into a T
ext_t_probes = data_container.man[(data_container.man["Next_Base"] == "T") & (data_container.man["Infinium_Design_Type"] == "I")]
ext_t_probes = ext_t_probes.merge(
data_container.green_idat.probe_means,
left_on="AddressA_ID", right_index=True,
).merge(
data_container.green_idat.probe_means,
left_on="AddressB_ID", right_index=True,
suffixes=('_AddressA_grn', '_AddressB_grn')
).merge(
data_container.red_idat.probe_means,
left_on="AddressA_ID", right_index=True,
).merge(
data_container.red_idat.probe_means,
left_on="AddressB_ID", right_index=True,
suffixes=('_AddressA_red', '_AddressB_red')
)
print("Exp T probes")
print(ext_t_probes[["mean_value_AddressA_grn", "mean_value_AddressB_grn", "mean_value_AddressA_red", "mean_value_AddressB_red"]].mean())
return (ext_c_probes["mean_value_AddressA_grn"].mean() + ext_c_probes["mean_value_AddressB_grn"].mean()) / (ext_t_probes["mean_value_AddressA_grn"].mean() + ext_t_probes["mean_value_AddressB_grn"].mean())