Worse predictions of genotypes when excluding NATs

Question

Closed this issue 3 years ago · 12 comments

@aarmey Including NATs when predicting mutational status improves the ROC substantially in all cases. This is not surprising since, as we can see in figure M3, signaling differs markedly between normal and tumor samples. Thus, when including NATs, I'm not sure whether we can attribute these signaling changes to mutational status or whether they merely reflect differences between tumor and normal cells. One option would be to remove the logistic regression for all genotypes except STK11 and keep the hypothesis testing.

TP53 status with NATs:

TP53 without NATs:

Rest with NATs:

Rest without NATs (in this case it looks like at least it works with STK11):

aarmey commented 4 years ago

Exactly.

Answer 1 · 2020-12-21T21:10:16.000Z

Why does the number of points change in (A) when you add NATs? Are you treating them like additional patients?

Answer 2 · 2020-12-21T21:20:57.000Z

Yes, we have half of the data when removing NATs because for each patient we have a tumor and a NAT sample. When using all the data I just look at whether that sample has the corresponding mutation, regardless of which patient a sample belongs to.

Answer 3 · 2020-12-21T21:33:16.000Z

But this is an incorrect view of the problem. You don't have twice as many patients; you have the same number, but know more about them. You'll have to reshape the data so that each patient is still one row, but the columns include the abundance of each cluster for tumor and NAT separately.

Answer 4 · 2020-12-21T22:07:53.000Z

So the dimensions of the data would be #patients x (#clusters * 2), so that for each patient/row we have two columns per cluster (tumor and NAT signal)? E.g., patient1: cluster1_NAT | cluster1_T | cluster2_NAT | cluster2_T | ...?
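The reshaping proposed here can be sketched with pandas. This is only an illustration under stated assumptions: the table `long_df`, its column names (`patient`, `sample_type`, `cluster1`, `cluster2`), and the helper `to_patient_rows` are hypothetical and not taken from the actual analysis code.

```python
import pandas as pd

# Hypothetical long-format table: one row per sample (tumor "T" or "NAT").
long_df = pd.DataFrame({
    "patient": ["P1", "P1", "P2", "P2"],
    "sample_type": ["T", "NAT", "T", "NAT"],
    "cluster1": [0.8, 0.1, 0.5, 0.2],
    "cluster2": [0.3, 0.9, 0.4, 0.7],
})


def to_patient_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per patient with a tumor and a NAT column per cluster."""
    # pivot() yields MultiIndex columns of the form (cluster, sample_type).
    wide = df.pivot(index="patient", columns="sample_type")
    # Flatten to names such as "cluster1_NAT" and "cluster1_T".
    wide.columns = [f"{cluster}_{stype}" for cluster, stype in wide.columns]
    return wide


wide_df = to_patient_rows(long_df)
print(wide_df)
```

The result has shape #patients x (#clusters * 2). Note that `pivot` raises an error if a patient has more than one sample of the same type, which is a useful sanity check for this layout.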

Answer 5 · 2021-02-16T11:47:03.000Z

@aarmey I just realized that I still have the regression set up wrong for figure M4 (mutational status) and figure M5 (immune infiltration). When I closed this, I reshaped the data so that each patient is one row, with columns that include the cluster center abundances of tumor and NAT separately, as discussed above. However, I didn't realize that the same patient can have different phenotypes in their tumor and NAT samples. For instance, a patient can be STK11 mutant in their tumor sample and STK11 WT in their NAT sample (see patients C3N.00572 and C3N.02423 below). With this in mind, should I separate this analysis by sample type, i.e. using tumor signaling to predict tumor mutational status and NAT signaling to predict NAT mutational status?

Answer 6 · 2021-02-16T11:51:09.000Z

Initially, I was thinking about this problem as "given this signaling status, predict whether this sample is WT or mutant", regardless of whether two samples come from the same patient. I am now aware, however, that treating two samples from the same patient as completely independent, even if one is tumor and the other NAT, is an invalid assumption.

Answer 7 · 2021-02-16T15:31:07.000Z

Did they maybe just label all of the NAT samples as non-mutated? Assuming it's this, you can just use the mutational status of the tumor.

Answer 8 · 2021-02-16T15:39:24.000Z

Oh you're right... I thought they sequenced both samples per patient. Just checked and all NAT samples are labeled with 0s. It's already set up so that I'm using only the tumors' mutational status so we're good for figures 4 and S4.
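A check like the one described here could look roughly as follows. This is a sketch, not the code used in the analysis: the table `samples`, its column names, and the helpers `nat_labels_all_zero` and `tumor_labels` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical per-sample mutation table; NAT rows carry placeholder 0s.
samples = pd.DataFrame({
    "patient": ["P1", "P1", "P2", "P2"],
    "sample_type": ["T", "NAT", "T", "NAT"],
    "STK11": [1, 0, 0, 0],
})


def nat_labels_all_zero(df: pd.DataFrame, gene: str) -> bool:
    """True if every NAT sample is labeled 0 (non-mutated) for this gene."""
    nat = df.loc[df["sample_type"] == "NAT", gene]
    return bool((nat == 0).all())


def tumor_labels(df: pd.DataFrame, gene: str) -> pd.Series:
    """Per-patient mutational status taken from the tumor samples only."""
    tumors = df[df["sample_type"] == "T"]
    return tumors.set_index("patient")[gene]


print(nat_labels_all_zero(samples, "STK11"))
print(tumor_labels(samples, "STK11"))
```

If the first check holds for every gene, the NAT labels carry no information, and the tumor-derived series is the natural regression target for each patient.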

Answer 9 · 2021-02-16T15:39:29.000Z

In the case of figure 5 however, we do have an infiltration score per cell type for both tumor and NAT samples. Should I just use the signaling and infiltration scores of tumors?

Answer 10 · 2021-02-16T15:40:29.000Z

Hmm... I think so.

Answer 11 · 2021-02-16T16:13:35.000Z

Figure 5 looks practically the same: the multi lasso model previously gave the NAT centers only small weights compared to the tumor centers, so removing them barely makes a difference.