Ensembl/WiggleTools

0 vs 1-based coordinate system

Closed this issue · 12 comments

gevro commented

Hi,

I am doing the following:

  1. Making 0-based bedgraph files --> input.bg (0-based)
  2. Converting to bigwig with UCSC tools bedGraphToBigWig --> input.bw (0-based ?)
  3. Doing a trim and fill-in for sites not specified in the genome:
wiggletools trim genome.bed fillIn genome.bed input.bw > input.trimfillin.wig (0-based?)
  1. Then converting this wiggle file back to bigwig: with UCSC wigToBigWig --> input.trimfillin.bw (0-based?)
  2. Then outputting with: wiggletools write_bg - compress lt 1.0 input.trimfillin.bw > output.bg (0-based?)

My question is from the beginning to the end, does this process--in particular the wiggletools steps-- maintain the correct 0 vs 1 based coordinate system? i.e. output.bed should be 0-based.

Based on this page: https://www.biostars.org/p/384606/
it says: "BigWig files created from bedGraph format use "0-start, half-open" coordinates"

So I'm not clear what would actually happen.
Thanks

gevro commented

Just to add, I tested this manually with a dummy file, and it seems to work. output.bg ends up as 0-based. But I'm actually not sure how that happens.

Because somehow all the tools in between would need to know that it is 0-based and preserve that info:
bedGraphToBigWig
wiggletools trim
wiggletools fillIn
wigToBigWig
wiggletools write_bg - compress lt

And this is despite wiggle officially being 1-based, with intermediate wiggle files.

I'm curious how this works?

Hello @gevro ,

I'm relieved to hear that it works because to be honest it is a headache that plagues these file formats.

In short, it is file format dependent:

  • Bed, BedGraph, Bam, BigBed, BigWig: 0 based
  • Wig, VCF, SAM: 1 based

Even better, a wig file (1 based), can contain BedGraph blocks (0 based).

Each tool has its own internal convention (wiggletools is 1 based under the hood) and has to be careful when reading or writing data.

Hope this helps,

Daniel

gevro commented

Hi, I may have found an issue.

  1. A big wig file simple output from wiggle tools appears to be 0-based:
wiggletools test.bw | head
chr1	0	13	0.000000
chr1	13	24	1.000000
fixedStep chrom=chr1 start=25 step=1
0.000000
chr1	25	39	1.000000
chr1	39	41	0.000000
fixedStep chrom=chr1 start=42 step=1
1.000000
0.000000
chr1	43	53	1.000000
  1. Likewise when outputting it with write_bg it appears 0-based:
wiggletools write_bg - test.bw | head
chr1	0	13	0.000000
chr1	13	24	1.000000
chr1	24	25	0.000000
chr1	25	39	1.000000
chr1	39	41	0.000000
chr1	41	42	1.000000
chr1	42	43	0.000000
chr1	43	53	1.000000
chr1	53	54	0.000000
chr1	54	55	1.000000
  1. It also appears that the wiggle tools 'seek' function takes 1-based coordinates. Because giving it 0-based coordinates doesn't return anything:
wiggletools seek chr1 0 10 test.bw | head
# No output

wiggletools seek chr1 1 10 test.bw | head
chr1	0	9	0.000000

However, the issue is that when using seek with the 1-based coordinates, it is not outputting the correct interval.
For example wiggletools seek chr1 1 10 test.bw should have returned the 0-based interval chr1 0 10, which is the correct interval corresponding to the 1-based interval given to seek of chr 1 10.

It seems that wiggletools is not outputting the correct 0-based interval when seek is given a 1-based interval.

Note that the bigwig above was made from a bedgraph file using the bedGraphToBigWig tool.

Also, another important point is that the issue where seek with chr1 0 10 for a .bw file doesn't return any output doesn't seem to occur for a .wig file. But nevertheless, the issue still occurs for .wig files too in that seek seems to miss the last base position.

Also note, the workaround would be simple: just adding +1 to the end coordinate given to the 'seek' function. But I do believe this is nevertheless a bug. This is wiggletools version v1.2.11

Hello @gevro ,

The subtlety is that within a Wig file, you can have two types of blocks with different coordinate systems:

  • BedGraph lines: i.e. all 4 column lines of the form (chrom, start, end, value): 0-based coordinates
  • All other lines: 1-based coordinates

As mentioned above, Wiggletools works around the latter system (1 based, closed), so when you enter seek chr1 1 10 in the CLI, that corresponds to chr1 0 9 in the 0-based coordinate system, hence the discrepancy you noticed in the very last example.

As to why the command seek chr1 0 10 returned nothing in the second to last example, it would seem that this has to do with the LibBigWig library which uses 0 based unsigned integers (uint32_t to be precise). The 1-based 0 value you entered got converted to a 0-based -1, which itself was transformed into a very large number by the variable casting. I guess this error could be averted by ensuring that the start value never goes negative. I'll shortly put in a PR for that bugfix.

Best regards,

Daniel

PS: Now corrected, however to enjoy these changes, you would need to rebuild WiggleTools from the master branch

gevro commented

Thanks! However, I'm still not clear, why does wiggletools seek with 'chr1 1 10' return the 0-based interval of 'chr1 0 9'?
Because 1-based interval of 'chr1 1 10' corresponds to 0-based interval of 'chr1 0 10', rather than 'chr1 0 9'.

gevro commented

Hi, Just re-posting to draw attention to the last question in the thread above, in case it was lost in the shuffle. Thank you!

Hello @gevro ,

Indeed, there seems to be an edge error there. I'll look at it next week.

Cheers,

Daniel

gevro commented

Hello, Thanks again for trying to figure it out - were you able to track down the bug?

Hello @gevro ,

It should now be fixed!

Thanks for reporting this inconsistency,

Daniel

gevro commented

Thanks! Is it updated on bioconda or should I install from github source?

Hello @gevro ,

Currently it is only available on the master branch, as I do not know how to trigger a bioconda version update.

Regards,

Daniel