Potential 32 bit overflow on parse of data with 2.2B lines

wasade opened this issue a year ago · 0 comments

We are using bedtools 2.31.0 from bioconda on a 64-bit Linux system to identify contiguous regions of sequences observed by short reads.

Our sorted input, piped to bedtools merge -i stdin -c 5 -o count, contains approximately 2.2B records and is 450GB in size. The run failed with the following error message:

Error: line number 730706434 of file stdin has 2 fields, but 5 were expected.

A direct examination of the input data showed that all lines contain exactly 4 tabs (i.e., 5 fields). The following check printed no offending lines:

>>> for idx, line in enumerate(open('foo', 'rb')):
...     if line.count(b'\t') != 4:
...         print(idx, line)
... 
>>>
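To look at the specific line named in the error, something like the sketch below could be used. It assumes the line number reported by bedtools is 1-based, which we have not confirmed; inspect_line is a hypothetical helper name, not part of any library.

```python
from itertools import islice


def inspect_line(path, lineno):
    """Return (field_count, raw_bytes) for the 1-based line `lineno`.

    Returns None if the file has fewer than `lineno` lines.
    Assumption: the line number in the bedtools error is 1-based.
    """
    with open(path, "rb") as handle:
        # islice skips the first lineno - 1 lines while streaming,
        # so the file is never loaded into memory.
        for raw in islice(handle, lineno - 1, lineno):
            fields = raw.rstrip(b"\r\n").split(b"\t")
            return len(fields), raw
    return None
```

For example, inspect_line('foo', 730706434) would return the number of tab-separated fields on that line together with its raw bytes, which makes stray characters easy to spot.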

In examining the code surrounding the emitted error message, we observed numerous uses of int as a datatype without a width specified, including on variables that appear to be used as counters.

A web search suggested that int is typically 32 bits wide even on 64-bit machines. To test this, we installed GCC 13 from bioconda, which we believe is the same version used for the bedtools build, and ran a brief test:

$ uname -a
Linux barnacle2 3.10.0-1160.31.1.el7.x86_64 #1 SMP Thu Jun 10 13:32:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (conda-forge gcc 13.1.0-0) 13.1.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ cat width_test.cpp 
#include <iostream>

int main(int argc, char **argv) {
  int foo;
  std::cout << sizeof(foo) << std::endl;
} 
$ g++ width_test.cpp -o width_test
$ ./width_test 
4

Based on this observation, and the error being inconsistent with the data, we suspect there may be an overflow occurring on parse.
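As a rough illustration of why a 4-byte int is a concern at this scale, the sketch below compares the approximate figures from this report against the largest signed 32-bit value and shows what a wrapped counter would look like. It uses ctypes only to emulate two's-complement wraparound; it says nothing about which variable in bedtools, if any, actually overflows. The 450GB figure is assumed to be decimal gigabytes.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2147483647, largest value a signed 32-bit int can hold


def as_int32(value):
    """Reinterpret an integer as a signed 32-bit int (two's-complement wrap)."""
    # ctypes integer types do no overflow checking; the value is truncated.
    return ctypes.c_int32(value).value


records = 2_200_000_000   # approximate record count from this report
size_bytes = 450 * 10**9  # approximate input size, assuming decimal GB

# Both quantities exceed what a signed 32-bit counter can represent.
print(records > INT32_MAX, as_int32(records))      # True -2094967296
print(size_bytes > INT32_MAX, as_int32(size_bytes))
```

Either a record counter or a byte offset stored in a plain int would therefore wrap well before the end of our input.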

Sharing the input data is not going to be easy. That said, would it be possible for the bedtools team to check whether builds are using 32-bit ints for counters? If it would help, I could propose a small program that generates synthetic data to test. If confirmed, I would be interested in contributing to part of a refactor (just not the whole codebase, as it is large and I'm unfamiliar with it).
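A possible starting point for the synthetic data generator mentioned above is sketched here. The column layout (chrom, start, end, name, score), the names synthetic_records and write_records, and the default parameters are all assumptions chosen for illustration; they are not taken from our real data. Several records share each start position so that coordinates stay small even for billions of lines, which should keep any failure attributable to record count or input size rather than to large coordinates.

```python
import sys


def synthetic_records(n, chrom="chr1", read_len=150, reads_per_start=16):
    """Yield n tab-separated, 5-column BED-like lines, sorted by start.

    Hypothetical layout: chrom, start, end, name, score.
    """
    for i in range(n):
        # Integer division keeps starts non-decreasing and small:
        # 2.2B records with 16 reads per start tops out near 137.5M.
        start = i // reads_per_start
        yield f"{chrom}\t{start}\t{start + read_len}\tread{i}\t1\n"


def write_records(n, out=sys.stdout):
    """Stream n synthetic records to `out` without holding them in memory."""
    for line in synthetic_records(n):
        out.write(line)


# Small demonstration; raise the count to ~2.2 billion to approach our input.
write_records(3)
```

With the count raised, the output could be piped to bedtools merge -i stdin -c 5 -o count in the same way as our real input.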