derived dataset: mutation load
jingchunzhu opened this issue · 14 comments
Build derived datasets for TCGA
For each sample, number of mutations -- this measures tumor mutation load, mutations include all types of mutations.
@jingchunzhu Hi,
I am a junior student in software engineering in Shanghai.This issue has some attractions for me.Although I do not know much about biology, I am willing to try my best to understand the relevant knowledge.
Does this issue want to measure TMB (Tumor Mutation Burden) for each sample(both of tumor samples and normal samples)?
If the goal is to make precise calculations, should the coder remove the effects of “somatic mutations“ before calculating?
@MaleicAcid The goal is to first calculate the number of somatic mutations per sample.
We assume there is no somatic mutations in normal samples.
mutation load = somatic non-synonymous mutations per megabase of coding sequence
@jingchunzhu Although this is a bit tough for me now, I'm trying to understand.
I searched a lot of information, it seems that I need to first download the MAT file. I originally wanted to download them from TCGA's official website, but I could not access their download links properly. Finally, I found the relevant data on Xena. One of the files I downloaded is:
https://tcga.xenahubs.net/download/TCGA.LAML.sampleMap/mutation_wustl.gz
Could you please briefly explain the main process of calculating the number of somatic mutations for a sample?(It is best to take a specific sample as an example.)
I found many of the existing softwares for calling somatic mutations(mutect, muse, strella, varscan etc.).Do I need to know them?
Looking forward to more guidance.
files like this https://tcga.xenahubs.net/download/TCGA.LAML.sampleMap/mutation_wustl.gz is already the somatic mutation call results. You don't need to run the calling programs.
Each row of the file is a somatic mutation, has format similar to the following, which you can read from the header row.
Sample chromosome start end reference_base variant_base gene effect
You need to count how many mutations each sample has. To be more precise, you count how many non-synonymous mutations each sample has. The effect column will tell you if it is a non-synonymous mutation. Then you divide the count by the length of coding sequence in the human genome multiply is by 1 million to get the value
mutation load = somatic non-synonymous mutations per megabase of coding sequence
@jingchunzhu Thanks for your patient explanation very much!
My understanding of the project has become clearer. But I'm still not quite sure about some of the details of this project.
If it is convenient, I want to communicate in your native language.
@MaleicAcid English please.
@jingchunzhu Please forgive my poor biological knowledge.
- I'm still confused about how to determine the value:"the length of coding sequence in the human genome".
- In order to judge the type of mutation,I try to find all possible values for the effect column on tcga's wiki page.
And they are:
Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR 1 , Intron, RNA, Targeted_Region
Please tell me which ones belong to non-synonymous mutations.
In addition, I saw gdc website has a download tool called gdc-client. Combined with the manifest file, gdc-client can download data in batch. Does Xena need such a downloader?
@MaleicAcid To determine the length of coding sequence in the human genome, start by googling it.
@MaleicAcid non-synonymous mutations is any type of mutation with a score 4,3,2 in code here: https://github.com/ucscXena/ucsc-xena-client/blob/master/js/models/mutationVector.js#L67 .
@jingchunzhu I searched for a lot of information about genes, but I'm not sure if the results I found are correct. Please give me further guidance.
Search engines tells me the gene information can be found on the specialized gene website (ncbi, geneBank, uniprot, etc.).I try to search the information about human gene on these website.
One of the sites I visit is called genome.ucsc.edu.In the species options column I choose human.Then I type the name of the gene: NONO and clicke on the first link in the results page.Finally, I jump to this page.
Above this page there is a line of such words "17,586 bp".I guess that's probably the length of the NONO gene.I go on googling and learned that 1bp is equivalent to 2 bases.So it means that the NONO gene has 2 * 17586 bases.
I found the other gene lengths for the TCGA-AB-3011-03 sample as above. You can check the TCGA-AB-3011-03 sample in the file named mutation_wustl.
The following gives a simple piece of code.
#!/usr/bin/python
MB = 1000000
TCGA_AB_3011_03 = {
"tml": 0, # need to be calculated
"mutation_count": 6,
"gene_list": { # search from http://genome.ucsc.edu/, the unit is bp
"NONO": 17586,
"OR1C1": 945,
"IDH1": 18907,
"GTF3A": 11145,
"WNK4": 16259,
"TRPM4": 54096,
"NPM1": 23768
}
}
length = 0
for value in TCGA_AB_3011_03["gene_list"].values():
length += value
length = length*2 # 1bp is equivalent to 2 bases
TCGA_AB_3011_03["tml"] = TCGA_AB_3011_03["mutation_count"]/length*MB
print(TCGA_AB_3011_03["tml"]) # output: 21.02224153154037
I think if the length of the gene can only be searched from other sites, I should write a web crawler to get the data.
Last but not least, what type of data is expected to be delivered? Whether the database data or data files like mutation_wustl.
@MaleicAcid googling something like "human genome coding sequence length", and find publications or books excerpts. you can results such as in https://books.google.com/books?id=dSwWBAAAQBAJ&pg=PA266&lpg=PA266&dq=table+9.5+human+genome+and+human+gene+statistics&source=bl&ots=7AQm7z1ig4&sig=H8pVFYpGhnd6WrO2iv4Ei5FEbP8&hl=en&sa=X&ved=0ahUKEwje3qbc48LZAhVW5GMKHb3_DrkQ6AEIbzAN#v=onepage&q=table%209.5%20human%20genome%20and%20human%20gene%20statistics&f=false and http://kirschner.med.harvard.edu/files/bionumbers/Human%20genome%20and%20human%20gene%20statistics.pdf
@MaleicAcid find another source for the human genome coding length. Typically find minimum of two credible sources.
@jingchunzhu Taking gene CYLC1 as an example,it‘s length is 25552 bp on the ucsc genome browser, but ncbi says the value is 25575 bp. When different sources of data inconsistencies, which should be taken as standard?
Look forward to your reply.
@MaleicAcid there are a few issues. In your example, it is the whole gene's length, not coding sequence length. The second issue is that gene annotations do change due to who did the annotation, and genome build. It is not surprising to see difference. For human genome coding genes, it is relatively stable, meaning the difference between annotations will not be large. The last issue is that we are looking for total length of human coding sequences. You could look for a second credible source to give you that answer. You could also add up each gene coding sequence length. I thought it is easier for you to go with the first approach.