How to extract the reads that aligned to the intronic regions from the preprocess results
Closed this issue · 7 comments
Hi, there,
Thank you for developing this tool. I am using this tool to analyze some AGO1 iCLIPseq libraries and I am trying to extract the reads that are aligned to the intronic regions.
It was mentioned in the paper that 'aligned reads are tagged based on whether they are part of genes, transcripts..., and repeat elements (annotate_with_genic_elements)'. I used clipseqtools-preprocess all
to preprocess the data. So are the genomic region tags available for the aligned reads from the clipseqtools-preprocess all
results? If so, where can I find them so as to extract the aligned intronic reads?
Thanks,
Zhijie
Hi. Yes the information should be available in the database. The clipseqtools filters to use in the analysis modules should be `--filter transcript="def" --filter exon="undef". I'm not exactly sure what you mean by "extract" but clipseqtools does not have a module to extract data in a e.g. SAM file. However it should be straightforward to access the database and extract reads from there. For example you can use the following command to get a BED-like file (notice that you will have to convert the last column to +/-) with the reads that align in introns.
sqlite3 -separator $'\t' your_sqlite_file.db "SELECT rname, start, stop, qname, 0, strand FROM genome WHERE transcript IS NOT NULL AND exon IS NULL;"
Thank you for the help. That works!
Another quick question, what is the column 'copy number' in the sqlite_file.db
? Does the copy number have anything to do when you computing the genomic_distribution by counting the reads taht mapped on those regions?
Thanks,
Zhijie
Great. In CLIP-Seq it is often the case that we have PCR artifacts due to the amplification. By default CLIPSeqTools collapses the reads that have exactly the same sequence and align at the same coordinates (chr, start, stop, strand) to avoid this problem. However this is not always desirable so it retains the copy_number
in case it is needed. So basically the copy_number
is the number of reads that were collapsed into one read.
I see. So when I tried to filter the sqlite_file.db
by transcript IS NOT NULL AND exon IS NULL
, I found that this will include the reads where its cds=1
as shown in the screenshot below. Shouldn't cds be a part of the exon so that when exon is NULL
, cds is NULL
as well? This also happens to utr5
and utr3
where they have a value of 1 when exon is NULL. When I filter the intronic reads, should I also include cds IS NULL AND utr5 IS NULL AND utr3 IS NULL
as a part of my command to get the intronic reads?
Thanks,
Zhijie
Nope. The cds
, utr5
, utr3
correspond to the genomic coordinates of the coding region, 5'UTR and 3'UTR respectively. So a read that maps in an intron within the start and stop codon (these are the boundaries of the cds
) will have the cds NOT NULL
and the exon IS NULL
. This is because it maps within the CDS boundaries but it is not an exon. The command that I specified before is the correct one. Now notice that the command I gave earlier returns all reads in introns (even for non coding transcripts). If you are only interested in coding transcripts then just replace transcript IS NOT NULL
with coding_transcript IS NOT NULL
.
I see. Thank you for the help!