0 vs 1-based coordinate system
Closed this issue · 12 comments
Hi,
I am doing the following:
- Making 0-based bedgraph files --> input.bg (0-based)
- Converting to bigwig with UCSC tools bedGraphToBigWig --> input.bw (0-based ?)
- Doing a trim and fill-in for sites not specified in the genome:
wiggletools trim genome.bed fillIn genome.bed input.bw > input.trimfillin.wig (0-based?)
- Then converting this wiggle file back to bigwig: with UCSC wigToBigWig --> input.trimfillin.bw (0-based?)
- Then outputting with: wiggletools write_bg - compress lt 1.0 input.trimfillin.bw > output.bg (0-based?)
My question is from the beginning to the end, does this process--in particular the wiggletools steps-- maintain the correct 0 vs 1 based coordinate system? i.e. output.bed should be 0-based.
Based on this page: https://www.biostars.org/p/384606/
it says: "BigWig files created from bedGraph format use "0-start, half-open" coordinates"
So I'm not clear what would actually happen.
Thanks
Just to add, I tested this manually with a dummy file, and it seems to work. output.bg ends up as 0-based. But I'm actually not sure how that happens.
Because somehow all the tools in between would need to know that it is 0-based and preserve that info:
bedGraphToBigWig
wiggletools trim
wiggletools fillIn
wigToBigWig
wiggletools write_bg - compress lt
And this is despite wiggle officially being 1-based, with intermediate wiggle files.
I'm curious how this works?
Hello @gevro ,
I'm relieved to hear that it works because to be honest it is a headache that plagues these file formats.
In short, it is file format dependent:
- Bed, BedGraph, Bam, BigBed, BigWig: 0 based
- Wig, VCF, SAM: 1 based
Even better, a wig file (1 based), can contain BedGraph blocks (0 based).
Each tool has its own internal convention (wiggletools is 1 based under the hood) and has to be careful when reading or writing data.
Hope this helps,
Daniel
Hi, I may have found an issue.
- A big wig file simple output from wiggle tools appears to be 0-based:
wiggletools test.bw | head
chr1 0 13 0.000000
chr1 13 24 1.000000
fixedStep chrom=chr1 start=25 step=1
0.000000
chr1 25 39 1.000000
chr1 39 41 0.000000
fixedStep chrom=chr1 start=42 step=1
1.000000
0.000000
chr1 43 53 1.000000
- Likewise when outputting it with write_bg it appears 0-based:
wiggletools write_bg - test.bw | head
chr1 0 13 0.000000
chr1 13 24 1.000000
chr1 24 25 0.000000
chr1 25 39 1.000000
chr1 39 41 0.000000
chr1 41 42 1.000000
chr1 42 43 0.000000
chr1 43 53 1.000000
chr1 53 54 0.000000
chr1 54 55 1.000000
- It also appears that the wiggle tools 'seek' function takes 1-based coordinates. Because giving it 0-based coordinates doesn't return anything:
wiggletools seek chr1 0 10 test.bw | head
# No output
wiggletools seek chr1 1 10 test.bw | head
chr1 0 9 0.000000
However, the issue is that when using seek with the 1-based coordinates, it is not outputting the correct interval.
For example wiggletools seek chr1 1 10 test.bw should have returned the 0-based interval chr1 0 10, which is the correct interval corresponding to the 1-based interval given to seek of chr 1 10.
It seems that wiggletools is not outputting the correct 0-based interval when seek is given a 1-based interval.
Note that the bigwig above was made from a bedgraph file using the bedGraphToBigWig tool.
Also, another important point is that the issue where seek with chr1 0 10 for a .bw file doesn't return any output doesn't seem to occur for a .wig file. But nevertheless, the issue still occurs for .wig files too in that seek seems to miss the last base position.
Also note, the workaround would be simple: just adding +1 to the end coordinate given to the 'seek' function. But I do believe this is nevertheless a bug. This is wiggletools version v1.2.11
Hello @gevro ,
The subtlety is that within a Wig file, you can have two types of blocks with different coordinate systems:
- BedGraph lines: i.e. all 4 column lines of the form (chrom, start, end, value): 0-based coordinates
- All other lines: 1-based coordinates
As mentioned above, Wiggletools works around the latter system (1 based, closed), so when you enter seek chr1 1 10
in the CLI, that corresponds to chr1 0 9
in the 0-based coordinate system, hence the discrepancy you noticed in the very last example.
As to why the command seek chr1 0 10
returned nothing in the second to last example, it would seem that this has to do with the LibBigWig library which uses 0 based unsigned integers (uint32_t
to be precise). The 1-based 0 value you entered got converted to a 0-based -1, which itself was transformed into a very large number by the variable casting. I guess this error could be averted by ensuring that the start value never goes negative. I'll shortly put in a PR for that bugfix.
Best regards,
Daniel
PS: Now corrected, however to enjoy these changes, you would need to rebuild WiggleTools from the master branch
Thanks! However, I'm still not clear, why does wiggletools seek with 'chr1 1 10' return the 0-based interval of 'chr1 0 9'?
Because 1-based interval of 'chr1 1 10' corresponds to 0-based interval of 'chr1 0 10', rather than 'chr1 0 9'.
Hi, Just re-posting to draw attention to the last question in the thread above, in case it was lost in the shuffle. Thank you!
Hello @gevro ,
Indeed, there seems to be an edge error there. I'll look at it next week.
Cheers,
Daniel
Hello, Thanks again for trying to figure it out - were you able to track down the bug?
Thanks! Is it updated on bioconda or should I install from github source?