ben-strasser/fast-cpp-csv-parser

Loss of precision on float reading

Opened this issue · 5 comments

Hi,

Thank you for this library !

I observed a surprising anomaly when reading some float numbers (as double).

If we take this csv, for example, and read the first column as a double :

Col1,Col2
9.855536508,0

The first value will be read as 9.8555365080000054
It should be read as 9.8555365080000001

The second one is the value you get with other parsers. It is also the value you get if you do
double x = 9.855536508;

For my application, this has an impact that leads to a bigger bug.

Do not hesitate to ask me for further tests if you wish.

Hi, thanks for the bug report.

Currently, the float parsing is handled using

 x = 0;
while('0' <= *col && *col <= '9'){
        int y = *col - '0';
        x *= 10;
        x += y;
        ++col;
}

if(*col == '.'|| *col == ','){
        ++col;
        T pos = 1;
        while('0' <= *col && *col <= '9'){
                pos /= 10;
                int y = *col - '0';
                ++col;
                x += y*pos;
        }
}

Where x is a float, double or long double. This code contains quite a few floating point computations that can introduce small errors.

I think this can be improved by parsing the floating first as integer plus some exponent and then doing a cast and a single division. This means that "1234.789" would be parsed as integer "1234789" and exponent "-3". One would then do something along float(1234789)*pow(10,-3).

Would parsing 9.855536508 by computing double(9855536508ll)*pow(10,-9) get you the result you need?

For float, this should work when parsing the digits first as int64. To correctly parse double and long double I think that more than a 64 bit integer are necessary. It is unclear to me what the best course of action is here. Maybe parsing it using multiple int64 is a good idea. For example, one could parse 456468484554529.8454545546684682555365 as double(456468484554529845ll)*pow(10,-3) + double(4545546684682555365ll)*pow(10,-22)

Important: pow is slow. A faster computation of powers of 10 is necessary. pow is just for exposition.

Hi,

Thank you both for these explanations.

I confirm that computing double(9855536508ll)*pow(10,-9) returns the value I expect.
Compared to the value returned by the parser, it is closer to the theoretical value by a factor of 50.

I guess there is a trade-off between precision and speed of parsing.

Just encountered the exact same issue.

I've used std::string to read and then use sscanf with %le for double

But is there a better way by providing a custom "decoder"?

std::from_chars() might be what you really want going forward?