Stats queries return integers as numbers, floats as strings
Opened this issue · 10 comments
Example:
{
"aggregations": {
"school.tuition_revenue_per_fte": {
"count": 7362,
"min": 0,
"max": 207213,
"avg": "0.9954935343656614E4",
"sum": 73288234,
"sum_of_squares": 1249896256472,
"variance": "0.7067598825743325E8",
"std_deviation": "0.8406901228005076E4",
"std_deviation_bounds": {
"upper": "0.26768737799666764E5",
"lower": "-0.6858867112353537E4"
}
}
}
}
... whereas what we get when querying ES directly is:
{
"aggregations": {
"school.tuition_revenue_per_fte": {
"count": 7362,
"min": 0,
"max": 207213,
"avg": 9954.935343656614,
"sum": 7.3288234E7,
"sum_of_squares": 1.249896256472E12,
"variance": 7.067598825743325E7,
"std_deviation": 8406.901228005076,
"std_deviation_bounds": {
"upper": 26768.737799666764,
"lower": -6858.867112353537
}
}
}
}
I initially thought this was something to do with Ruby mis-parsing scientific notation, but now I suspect it's something in our type-detection output code.
I just discovered the same thing. For quick reference, here's a sample query:
@siruguri would you have time to take a look at what's going on here?
I have to re-install elastic search because I got a new laptop, and remind
myself how to load up some dummy data... I can take a whirl at that this
afternoon, and hopefully at least point out what the deal is... not sure if
I can get around to fixing/testing it in the next few days...
On Wed, Feb 17, 2016 at 2:19 PM, Sarah Allen notifications@github.com
wrote:
@siruguri https://github.com/siruguri would you have time to take a
look at what's going on here?—
Reply to this email directly or view it on GitHub
#278 (comment)
.
in controllers.rb, line 75, the .to_json
calls .to_s
on each item in
the hash data
. Floats are BigDecimals; so to_s makes them strings which I
think is the behavior of BigDecimal
, whereas the integers are Fixnums
(maybe?).
So the solution would be either to intercept the to_s
if possible in
JSON
, or post-process the output.
I am not too sure when I can get to this, maybe next week? I know I also
have a failing test request on another issue.
On Wed, Feb 17, 2016 at 2:27 PM, Sameer Siruguri siruguri@gmail.com wrote:
I have to re-install elastic search because I got a new laptop, and remind
myself how to load up some dummy data... I can take a whirl at that this
afternoon, and hopefully at least point out what the deal is... not sure if
I can get around to fixing/testing it in the next few days...On Wed, Feb 17, 2016 at 2:19 PM, Sarah Allen notifications@github.com
wrote:@siruguri https://github.com/siruguri would you have time to take a
look at what's going on here?—
Reply to this email directly or view it on GitHub
#278 (comment)
.
So the writer of BigDecimal
actually wanted the default implementation to be that floating points are rendered as strings by the to_json
method, to reduce the possibility of (small) errors during de/re-serialization of data. As far as I could figure, it seems like monkey-patching BigDecimal
is the only solution, with the caveat that it will produce this corner-case error.
Should I do that? Should we instead let the numbers be strings so that the client system can decide how to re-cast to their version of the floating-point type?
Sorry to be so slow, @siruguri! Ugh, this is a nasty one, and I'm really torn. Providing floats as strings feels like giving a worse user/developer experience in return for a level of accuracy that - in all the existing use cases, anyway - really isn't needed. On the other hand, Open-data-maker is meant to be domain-agnostic, and I can't speak for the needs of future data sets and their users. (On the other other hand, I don't know what ES uses for its floats internally; the accuracy may already have been compromised before the stats were emitted.)
My inclination is to monkey-patch in favour of a better user experience. What does everyone else think?
I'm open to either solution though I think I'd rather un-string the value
in the open-data-maker code than monkey-patch BigDecimal. I'll work on this
today.
On Fri, Mar 4, 2016 at 12:18 PM, Yoz Grahame notifications@github.com
wrote:
Sorry to be so slow, @siruguri https://github.com/siruguri! Ugh, this
is a nasty one, and I'm really torn. Providing floats as strings feels like
giving a worse user/developer experience in return for a level of accuracy
that - in all the existing use cases, anyway - really isn't needed. On the
other hand, Open-data-maker is meant to be domain-agnostic, and I can't
speak for the needs of future data sets and their users. (On the other
other hand, I don't know what ES uses for its floats internally; the
accuracy may already have been compromised before the stats were emitted.)My inclination is to monkey-patch in favour of a better user experience.
What does everyone else think?—
Reply to this email directly or view it on GitHub
#278 (comment)
.
In fact, now that I think about it,
Providing floats as strings feels like giving a worse user/developer experience in return for a level of accuracy that - in all the existing use cases, anyway - really isn't needed.
sounds equiv to, "let's cast the BigDecimal
object via .to_f
" Should I do that instead? Thoughts? Then .to_json
will continue to produce non-stringified numbers.