igvteam/igv-webapp

Extracting insertion/deletion data from aligned region of interest

Opened this issue · 9 comments

Hello IGV team,

I am currently analyzing amplicons from an insect genomic DNA sample edited with CRISPR/Cas9. I am aware IGV offers insertion and deletion data by selecting a specific base pair of your aligned read(s). Is it possible to access and export this data for a highlighted region of interest from IGV?

Side note: I am aware this data is contained within .bam files, however extracting the information in a clean way is not so simple and can get a bit messy (i.e. with mpileup offered by bcftools). I would also like to add I do not intend to use this data in a clinical or therapeutic context, but instead am using it to analyze gene editing in insects - so although I'm aware this web app is not particularly intended for this purpose, the ability to this data from IGV would be extremely useful and sufficient for my application.

Thanks,

Joe

Extracting the data with samtools is pretty trivial if you know the region

samtools view chr1:100-200

What information are you looking for, and in what format?

With samtools view, I can see the alignment results (mismatches, insertions, deletions) and count the sum of any one of these (or all) over a given range (i.e. chr1:100-200), however I would lose base pair resolution of this information (i.e. if my region is chr1:100-200, I will not be able to see indel counts for chr1:100, chr1:101, chr1:102, and so on). By manually clicking a position of my aligned read on IGV web app, I can obtain this information, but it is not scalable for many alignment files - is there a faster, higher-throughput way to do this via IGV or some package offered by IGV?

Output format of this information is not particular important , but something simple such as columns for reference position, insertions, and deletions would be ideal.

Thanks for the quick response!

I would find it really surprising if there isn't a simple tool to do this, but I assume you have looked. This isn't the sort of thing we design IGV for, which is a visual tool, but leave this open we will consider it for the future.

BTW, on a tangent here, but where would you expect an insertion between positions 10 and 11 to be counted?

Great question - I am exploring variant calling programs to count these. Out of curiosity, how does IGV do this?

Yes there is no simple tool to do this. When you map variants with Minimap2 --cs -xasm5 /-xasm20 ; or unimap, --cs you get a CS tag which is a newer type of MD type which is basically telling you what sequence is ins or del relative to the reference ( its what you see when you click the insertion in IGV, if I understand it correctly..).

When you select regions in IGV you get a bed file. And when we call variants against this bed file using either sniffles2 --regions regions_igv.bed --reference reference.fasta ; or using Pilon --targets regions.bed

Well it doesnt work ..

It doesnt work as you explained, similarly to with the mpileup example. In bcftools mpileup it doesnt work because the max ins length is limited to 500 based but when you increase it bugs.

You have the case where your region of interests spans multiple reads in your bam but only 1 region/gap in your reference.

You have the case where your region of interest on IGV spans the flanks of a gap in your query.bam and so samtools view -L cannot gap two reads of your query.bam with NNNN

You have the case were your region of interest spans multiple gaps in your reference.fasta but only 1 read in your query.bam, this makes took like ragtag or quartet crash.

There was an option on IGV to copy the consensus sequence of an alignment when you right click on the alignment. It doesnt even work and would extract the full consensus and doesn't even work when we paste it in notes..

Unfortunately no one ever solved this problem. I am trying to write a script and test it. Without much success.

@jsromanowski I'm not exactly sure what you are asking. Are you asking about the insertion/deletion details of a single aligned read, or a pileup at a specific location of multiple aligned reads? For a single read the indel information is in the alignment record, specifically the "cigar" string. For a pileup at a location IGV simply counts the indels at that location for the alignments intersecting the location.

@Isoris I'm not following your discussion at all. However there is no concept of a consensus sequence for a single alignment, by definition a consensus sequence is the consensus of multiple alignments. However this is a tangent discussion.

Hi @jrobinso,

Thank you for clarifying. Let me rephrase my question for better understanding.

I'm asking whether IGV can be used to extract and manipulate data when there is a missing sequence in the reference (REF) relative to the alternate allele (ALT) or query (QRY). Specifically:

  1. For single alignments: Could IGV allow the user to select specific reads (e.g., by clicking on them), highlight them in a different color, and then export these reads? Instead of creating a new consensus FASTA, the selected reads could replace the REF sequence over the specified region, gapping any missing positions with 'NNN' if necessary to maintain alignment.

  2. For pileups: Could IGV provide an option to save the pileup as a genome graph, or export the alignment and variant information for further downstream analysis?

@Isoris No, thats not in IGV's scope.