mnsmar/clipseqtools

How to extract the reads that aligned to the intronic regions from the preprocess results

Closed this issue · 7 comments

Hi, there,

Thank you for developing this tool. I am using this tool to analyze some AGO1 iCLIPseq libraries and I am trying to extract the reads that are aligned to the intronic regions.
It was mentioned in the paper that 'aligned reads are tagged based on whether they are part of genes, transcripts..., and repeat elements (annotate_with_genic_elements)'. I used clipseqtools-preprocess all to preprocess the data. So are the genomic region tags available for the aligned reads from the clipseqtools-preprocess all results? If so, where can I find them so as to extract the aligned intronic reads?

Thanks,
Zhijie

Hi. Yes the information should be available in the database. The clipseqtools filters to use in the analysis modules should be `--filter transcript="def" --filter exon="undef". I'm not exactly sure what you mean by "extract" but clipseqtools does not have a module to extract data in a e.g. SAM file. However it should be straightforward to access the database and extract reads from there. For example you can use the following command to get a BED-like file (notice that you will have to convert the last column to +/-) with the reads that align in introns.

sqlite3 -separator $'\t' your_sqlite_file.db "SELECT rname, start, stop, qname, 0, strand FROM genome WHERE transcript IS NOT NULL AND exon IS NULL;"

Thank you for the help. That works!
Another quick question, what is the column 'copy number' in the sqlite_file.db ? Does the copy number have anything to do when you computing the genomic_distribution by counting the reads taht mapped on those regions?

Thanks,
Zhijie

Great. In CLIP-Seq it is often the case that we have PCR artifacts due to the amplification. By default CLIPSeqTools collapses the reads that have exactly the same sequence and align at the same coordinates (chr, start, stop, strand) to avoid this problem. However this is not always desirable so it retains the copy_number in case it is needed. So basically the copy_number is the number of reads that were collapsed into one read.

I see. So when I tried to filter the sqlite_file.db by transcript IS NOT NULL AND exon IS NULL, I found that this will include the reads where its cds=1 as shown in the screenshot below. Shouldn't cds be a part of the exon so that when exon is NULL, cds is NULL as well? This also happens to utr5 and utr3 where they have a value of 1 when exon is NULL. When I filter the intronic reads, should I also include cds IS NULL AND utr5 IS NULL AND utr3 IS NULL as a part of my command to get the intronic reads?

image

Thanks,
Zhijie

Nope. The cds, utr5, utr3 correspond to the genomic coordinates of the coding region, 5'UTR and 3'UTR respectively. So a read that maps in an intron within the start and stop codon (these are the boundaries of the cds) will have the cds NOT NULL and the exon IS NULL. This is because it maps within the CDS boundaries but it is not an exon. The command that I specified before is the correct one. Now notice that the command I gave earlier returns all reads in introns (even for non coding transcripts). If you are only interested in coding transcripts then just replace transcript IS NOT NULL with coding_transcript IS NOT NULL.

I see. Thank you for the help!

You are very welcome. By the way you may want to take a look at htsdb. This is a set of tools that are compatible with the CLIPSeqTools but several times faster and written in Go.