PCAWG-11/Heterogeneity

Questions About Details / Data for Selected Figures

Closed this issue · 4 comments

Hi,

I think this is more of a "discussion" than an "issue," but I don't think "discussions" are enabled for this repository.

I have a couple of questions where I would very much appreciate it if you could help:

1) I think Fig. 29 in the Supplemental Information represents something interesting and important.

Is there any information available about the fraction of subclonal variants within repeats that are ambiguous with Illumina reads and/or the fraction of subclonal variants within genes with a certain fraction of non-unique coding sequence?

My understanding is that these “subclonal” variants have diluted read support and may in fact be clonal (in the “Subclonal driver genes and their unique sequence” section of this Cancer Cell paper, which also references what would become Tarabichi et al 2021). So, I thought some sense of the frequency where this happens could be useful.

2) I think that I can see how a comparison between copy number callers can have complications and I believe Fig. 5 in the Supplemental Information indicates that there are noticeable differences in the original copy number calls (and was described in the publications as such).

Is anything in this repository (or elsewhere) available in order to attempt a visualization similar to Supplemental Figure S1 for purity estimates and/or Fig. 29 in the Supplemental Information for the subclonal architecture (but for copy number calls with different methods)?

I would guess a percent altered genome for individual copy number calls (per sample) could be used for something like Supplemental Figure S1 and/or some index of overlap/consistency for copy number calls (across calls for pairs of samples) could be used for something like Fig. 29 in the Supplemental Information.

I also noticed there is PCAWG data hosted on the ICGC Data Portal (with a consensus_cnv subfolder):

https://dcc.icgc.org/releases/PCAWG

If smaller comparison/summary files are not already created, am I correctly understanding that I might be able to create something along the lines that I was asking from the consensus.20170119.somatic.cna.annotated.tar.gz files for individual samples?

Thank you to everybody for putting together the paper and the associated materials!

Sincerely,
Charles

For the first question, I have had an e-mail exchange that includes a couple of the authors that I think has been helpful. I hope that an update can be posted within the next month or so.

For the second question, I am certainly still open and interested in additional feedback.

However, if I understand the file contents correctly, then I could create the following plots for consensus copy number segments with either 2 stars or 3 stars:

Percent segment calls with other methods (either 2 stars or 3 stars):

percent_consensus_segments_called-per_sample

Pearson correlation with other methods (either 2 stars or 3 stars):

consensus_segments-other_called_correlations-per_sample

Method Median Correlation
ABSOLUTE 0.887
ACEseq 0.917
Battenberg 0.864
cloneHD 0.862
Sclust 0.887
JaBbA 0.896

Absolute Difference (either 2 stars or 3 stars):

consensus_segments-other_called_absolute_differences-per_sample

I then looked for a minimum segment size (of 1 Mb), and I required that the sample had at least 50 sufficiently large segments (filtering only 47 samples):

Percent segment calls with other methods (either 2 stars or 3 stars + min 1 Mb segment):

percent_consensus_segments_called-per_sample-MIN_1Mb_SEGMENT_LENGTH

Pearson correlation with other methods (either 2 stars or 3 stars + min 1 Mb segment):**

consensus_segments-other_called_correlations-per_sample-MIN_1Mb_SEGMENT_LENGTH

Method Median Correlation
ABSOLUTE 0.935
ACEseq 0.977
Battenberg 0.953
cloneHD 0.956
Sclust 0.941
JaBbA 0.969

Absolute Difference (either 2 stars or 3 stars + min 1 Mb segment):

consensus_segments-other_called_absolute_differences-per_sample-MIN_1Mb_SEGMENT_LENGTH

The code was almost identical and I have noted the lines of code that were added for the second set of 3 plots:

input.folder = "../ICGC_PCAWG/consensus.20170119.somatic.cna.annotated/"
output.call.plot = "percent_consensus_segments_called-per_sample-MIN_1Mb_SEGMENT_LENGTH.png"
output.cor.plot = "consensus_segments-other_called_correlations-per_sample-MIN_1Mb_SEGMENT_LENGTH.png"
output.diff.plot = "consensus_segments-other_called_absolute_differences-per_sample-MIN_1Mb_SEGMENT_LENGTH.png"

samples = list.files(input.folder)
samples = samples[grep(".consensus.20170119.somatic.cna.annotated.txt",samples)]
samples = gsub(".consensus.20170119.somatic.cna.annotated.txt","",samples)

sample.mapping = c()
sample.other_method = c()
sample.percent_called.at_least_2stars = c()
sample.correlation.major.at_least_2stars = c()
sample.median_abs_diff.major.at_least_2stars = c()
sample.max_abs_diff.major.at_least_2stars = c()

method.names = c("absolute_broad_major_cn","aceseq_major_cn","battenberg_nMaj1_A",
				"clonehd_major_cn","sclust_nMaj1_A","jabba_major_cn")

######## ADDITIONAL COUNTER for Minimum Segment Length########
skip_count = 0
for (i in 1:length(samples)){
	print(paste(i," out of ",length(samples),sep=""))
	temp.file = paste(input.folder,samples[i],".consensus.20170119.somatic.cna.annotated.txt",sep="")
	temp.input.table = read.table(temp.file, head=T, sep="\t")
	temp.input.table = temp.input.table[temp.input.table$star >= 2,]
	######## START EXTRA CODE for Minimum Segment Length ########
	temp.segment_length = temp.input.table$end - temp.input.table$start
	temp.input.table = temp.input.table[!is.na(temp.segment_length) & (temp.segment_length > 1000000),]
	######## END EXTRA CODE for Minimum Segment Length ########
	
	######## ADDITIONAL IF-ELSE STATEMENT for Minimum Segment Length########
	if(nrow(temp.input.table) > 50){
		######## START EARLIER CODE########
		for (j in 1:length(method.names)){
			temp.consensus = temp.input.table$major_cn
			sample.mapping = c(sample.mapping, samples[i])
			sample.other_method = c(sample.other_method, method.names[j])
			temp.cn = temp.input.table [,method.names[j]]
			
			temp.percent_called = 100 * length(temp.cn[!is.na(temp.cn)])/nrow(temp.input.table)
			sample.percent_called.at_least_2stars = c(sample.percent_called.at_least_2stars, temp.percent_called)
			
			temp.cor = cor(temp.consensus, temp.cn, use="pairwise.complete.obs")
			sample.correlation.major.at_least_2stars = c(sample.correlation.major.at_least_2stars, temp.cor)
			
			diff.arr = abs(temp.consensus-temp.cn)
			sample.median_abs_diff.major.at_least_2stars = c(sample.median_abs_diff.major.at_least_2stars, median(diff.arr, na.rm=T))
			sample.max_abs_diff.major.at_least_2stars = c(sample.max_abs_diff.major.at_least_2stars, max(diff.arr, na.rm=T))
			######## END EARLIER CODE########
		}#end for (j in 1:length(method.names))
	}else{
		print("...Skip sample with less than 50 large segments")
		skip_count = skip_count + 1
	}#end else	
}#end for (i in 1:length(samples))

sample.other_method = factor(sample.other_method, levels=method.names)

jitter_x = jitter(as.numeric(sample.other_method))

png(output.call.plot, height=400, width=400)
par(mar=c(13,5,3,2))
plot(sample.other_method, sample.percent_called.at_least_2stars,
	las=2, col="gray", ylim=c(0,100), outline=F,
	ylab="Percent Consensus Segment Called",
	main = "Individual Method (Per Sample)", font.main=1, cex.main=0.9)
points(jitter_x, sample.percent_called.at_least_2stars,
		pch=16, cex=0.5)
dev.off()

png(output.cor.plot, height=400, width=400)
par(mar=c(13,5,3,2))
plot(sample.other_method, sample.correlation.major.at_least_2stars,
	las=2, col="gray", ylim=c(-1,1), outline=F,
	ylab="Pearson Correlation",
	main = "Consensus versus Individual Method (Per Sample)", font.main=1, cex.main=0.9)
points(jitter_x, sample.correlation.major.at_least_2stars,
		pch=16, cex=0.5)
dev.off()

print(tapply(sample.correlation.major.at_least_2stars, sample.other_method, median, na.rm=T))

print(paste(min(sample.median_abs_diff.major.at_least_2stars, na.rm=T)," - ",max(sample.median_abs_diff.major.at_least_2stars, na.rm=T),sep=""))
plot_min = min(sample.median_abs_diff.major.at_least_2stars, na.rm=T) - 0.1
plot_max = max(sample.median_abs_diff.major.at_least_2stars, na.rm=T) + 0.1
png(output.diff.plot, height=400, width=800)
par(mar=c(13,5,3,2), mfcol=c(1,2))
plot(sample.other_method, sample.median_abs_diff.major.at_least_2stars,
	las=2, col="gray", outline=F, ylim=c(plot_min, plot_max),
	ylab="Absolute Copy Difference (Per Sample)",
	main = "Median Difference (Consensus versus Individual Method)", font.main=1, cex.main=0.9)
points(jitter_x, sample.median_abs_diff.major.at_least_2stars,
		pch=16, cex=0.5)

print(max(sample.max_abs_diff.major.at_least_2stars))
plot(sample.other_method, sample.max_abs_diff.major.at_least_2stars,
	las=2, col="gray", outline=F,
	ylim=c(0,max(sample.max_abs_diff.major.at_least_2stars)+5),
	ylab="Absolute Copy Difference (Per Sample)",
	main = "Maximum Difference (Consensus versus Individual Method)", font.main=1, cex.main=0.9)
points(jitter_x, sample.max_abs_diff.major.at_least_2stars,
		pch=16, cex=0.5)
dev.off()

######## Print Statement for Minimum Segment Length########
print(paste(skip_count,"skipped samples"))

I hope that others find this interesting and useful!

Sincerely,
Charles

For 1), I have had a number of helpful e-mail conversations with Maxime Tarabichi.

There is controlled access data available in the following locations:

https://dcc.icgc.org/releases/PCAWG/subclonal_reconstruction

https://dcc.icgc.org/releases/PCAWG/thesaurus_snv

Given that I am not a PI and I don't work in a traditional lab, I don't think I can apply for controlled access.

However, if I could, then my understanding is that Maxime agrees that what I am describing is worth checking. So, I hope that somebody else can complete a successful controlled access application to check this topic.

I think this is what might have been what was helpful in justifying the check for Maxime:

[If] I look at the decompressed version of the figure_6.snv.input.txt.gz from figure_6a in the GitHub repository, it looks like the median percentage of subclonal variants per sample is 13%. I realize that different fractions are reported for different cancer types in the paper. Nevertheless, if the absolute number of subclonal variants is much less than clonal variants and non-unique variants are often falsely called “subclonal”, then I think the fraction of non-unique “subcolonal” variants could be more than 10%. For example, as a considerable simplification, I will assume that there are 1000 variants and 13% are called “subclonal” variants (130 “subclonal” and 870 “clonal”). Now, with that simplification, if ~10% of clonal variants were misannotated due to the presence in non-unique regions, then that means the original/true clonal count would be ~966 and the number of misclassified “subclonal” variants would be ~96. That would mean ~96/130 “subclonal” variants were misclassified (which would be ~74% of the “subclonal” variants). However, that is essentially where/why I am trying to find the associated statistics.

[Additionally], I think [Maxime is indicating] that all possible true variants were probably not called, and that would certainly affect the calculation above.

In other words, the fact that there are a lot more "clonal" consensus variant calls than "subclonal" consensus variant calls is why this might be worth checking.

It is importantly missing the "cloncal" and "subclonal" annotations, but Maxime pointed out a different set of open-access somatic variant calls. If the files in the controlled access deposits don't contain germline genetic information, then I hope that there might be a chance those could become open-access at a later point in time (and I would then update this content). Otherwise, I hope that somebody else can apply for controlled access to this thread.

If I understand correctly, then I am not sure if additional code and/or intermediate files for Supplemental Figure 29 might help?

I can find a number of helpful files in this location:

https://github.com/PCAWG-11/Heterogeneity/tree/master/code_figures_paper

However, I could not find the similar files for Supplemental Figure 29.

Hi Charles, indeed, the effect of mapping on subclonal assignment in pcawg is an interesting aspect to explore. At the moment this would indeed require to request access to the controlled data to get the subclonal assignments.

Thank you very much - I hope that helps with encouraging somebody to request controlled access!

When I typically have conversations like this on GitHub, I would close after some response from the developer.

As always, I would be interested to hear additional feedback. In this particular case, I am especially interested given that I can't summarize the existing data (as I understand it), and I think a relatively small set of summary statistics and the code to reproduce that is basically that I am asking about.

So, I will close the "issue" given that I very much appreciate receiving feedback. However, I hope this can be looked into more!