meyer-lab/DDMC

Worse predictions of genotypes when excluding NATs

Closed this issue · 12 comments

@aarmey Including NATs when predicting mutational status improves the ROC substantially in all cases. This is not surprising since, as we can see in figure M3, signaling differs greatly between normal and tumor samples. Thus, when NATs are included, I'm not sure whether we can attribute these signaling changes to mutational status or whether they simply reflect differences between tumor and normal cells. One option would be to remove the logistic regression for all genotypes except STK11 and keep the hypothesis testing.

TP53 status with NATs:
image

TP53 without NATs:
image

Rest with NATs:
image

Rest without NATs (in this case it looks like at least it works with STK11):
image

Why does the number of points change in (A) when you add NATs? Are you treating them like additional patients?

Yes, we have half of the data when removing NATs because each patient has both a tumor and a NAT sample. When using all the data, I just look at whether each sample has the corresponding mutation, regardless of which patient it belongs to.

But this is an incorrect view of the problem. You don't have twice as many patients; you have the same number, but know more about them. You'll have to reshape the data so that each patient is still one row, but the columns include the abundance of each cluster in tumor and NAT separately.

So the dimensions of the data would be #patients x #clusters*2, so that for each patient/row we have two columns per cluster (tumor and NAT signal)? E.g., patient1: cluster1_NAT | cluster1_T | cluster2_NAT | cluster2_T | ...?

Exactly.
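For reference, a minimal sketch of the reshape discussed above, assuming the cluster centers sit in a long-format pandas table with one row per sample. The column names (`patient`, `type`, `cluster1`, `cluster2`), the values, and the helper `to_patient_rows` are all hypothetical and not taken from the DDMC code.

```python
import pandas as pd

# Hypothetical long-format table: one row per sample (tumor "T" or "NAT").
long = pd.DataFrame({
    "patient": ["P1", "P1", "P2", "P2"],
    "type": ["T", "NAT", "T", "NAT"],
    "cluster1": [0.9, -0.2, 0.4, -0.1],
    "cluster2": [-0.5, 0.3, -0.7, 0.2],
})


def to_patient_rows(df):
    """Return one row per patient with <cluster>_<type> columns."""
    # pivot yields MultiIndex columns of (cluster, sample type) pairs.
    wide = df.pivot(index="patient", columns="type")
    # Flatten the pairs into names such as "cluster1_NAT" and "cluster1_T".
    wide.columns = [f"{cluster}_{stype}" for cluster, stype in wide.columns]
    return wide.sort_index(axis=1)


wide = to_patient_rows(long)
print(wide)
```

The result has shape #patients x #clusters*2, so the number of rows stays equal to the number of patients while each patient gains the NAT features.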

@aarmey I just realized that I still got the regression set up wrong for figure M4 (mutational status) and figure M5 (immune infiltration). When I closed this, I reshaped the data so that each patient is one row, but the columns include cluster center abundances of tumor and NAT separately, as discussed above. However, I didn't realize that the same patient can have different phenotypes in their tumor and NAT samples. For instance, a patient can be STK11 mutant in their tumor sample and STK11 WT in their NAT sample (see patients C3N.00572 and C3N.02423 below). With this in mind, should I separate this analysis by sample type, i.e. use tumor signaling to predict tumor mutational status and NAT signaling to predict NAT mutational status?

Screen Shot 2021-02-16 at 12 43 09 PM

Initially, I thought about this problem as "given this signaling status, predict whether this sample is WT or mutant", regardless of whether two samples come from the same patient. However, I now realize that treating two samples from the same patient as completely independent, even if they are tumor and NAT, is an invalid assumption.

Did they maybe just label all of the NAT samples as non-mutated? If so, you can just use the mutational status of the tumor.

Oh you're right... I thought they sequenced both samples per patient. Just checked and all NAT samples are labeled with 0s. It's already set up so that I'm using only the tumors' mutational status so we're good for figures 4 and S4.
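A rough sketch of that check, assuming a per-sample table of mutation labels. The column names, the values, and both helper functions are made up for illustration and do not reflect the actual data files.

```python
import pandas as pd

# Hypothetical per-sample labels; every NAT row is 0, as observed above.
samples = pd.DataFrame({
    "patient": ["P1", "P1", "P2", "P2", "P3", "P3"],
    "type": ["T", "NAT", "T", "NAT", "T", "NAT"],
    "STK11": [1, 0, 0, 0, 1, 0],
})


def nat_labels_all_zero(df, gene):
    """True if every NAT sample is labeled 0 (non-mutated) for this gene."""
    return bool((df.loc[df["type"] == "NAT", gene] == 0).all())


def tumor_labels(df, gene):
    """One label per patient, taken from the tumor sample only."""
    tumors = df[df["type"] == "T"]
    return tumors.set_index("patient")[gene]


y = tumor_labels(samples, "STK11")
print(nat_labels_all_zero(samples, "STK11"))
print(y)
```

Indexing the labels by patient lets them line up with a one-row-per-patient feature matrix.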

In the case of figure 5, however, we do have an infiltration score per cell type for both tumor and NAT samples. Should I just use the signaling and infiltration scores of the tumors?

Screen Shot 2021-02-16 at 4 35 26 PM

Hmm... I think so.

Figure 5 looks practically the same: the multi lasso model previously gave the NAT centers only slight weights compared to the tumor centers, so removing them barely makes a difference.