18F/open-data-maker

Stats queries return integers as numbers, floats as strings

Opened this issue · 10 comments

Example:

{
  "aggregations": {
    "school.tuition_revenue_per_fte": {
      "count": 7362,
      "min": 0,
      "max": 207213,
      "avg": "0.9954935343656614E4",
      "sum": 73288234,
      "sum_of_squares": 1249896256472,
      "variance": "0.7067598825743325E8",
      "std_deviation": "0.8406901228005076E4",
      "std_deviation_bounds": {
        "upper": "0.26768737799666764E5",
        "lower": "-0.6858867112353537E4"
      }
    }
  }
}

... whereas what we get when querying ES directly is:

{
  "aggregations": {
    "school.tuition_revenue_per_fte": {
      "count": 7362,
      "min": 0,
      "max": 207213,
      "avg": 9954.935343656614,
      "sum": 7.3288234E7,
      "sum_of_squares": 1.249896256472E12,
      "variance": 7.067598825743325E7,
      "std_deviation": 8406.901228005076,
      "std_deviation_bounds": {
        "upper": 26768.737799666764,
        "lower": -6858.867112353537
      }
    }
  }
}

I initially thought this was something to do with Ruby mis-parsing scientific notation, but now I suspect it's something in our type-detection output code.

@siruguri would you have time to take a look at what's going on here?

I have to re-install elastic search because I got a new laptop, and remind
myself how to load up some dummy data... I can take a whirl at that this
afternoon, and hopefully at least point out what the deal is... not sure if
I can get around to fixing/testing it in the next few days...

On Wed, Feb 17, 2016 at 2:19 PM, Sarah Allen notifications@github.com
wrote:

@siruguri https://github.com/siruguri would you have time to take a
look at what's going on here?


Reply to this email directly or view it on GitHub
#278 (comment)
.

in controllers.rb, line 75, the .to_json calls .to_s on each item in
the hash data. Floats are BigDecimals; so to_s makes them strings which I
think is the behavior of BigDecimal, whereas the integers are Fixnums
(maybe?).

So the solution would be either to intercept the to_s if possible in
JSON, or post-process the output.

I am not too sure when I can get to this, maybe next week? I know I also
have a failing test request on another issue.

On Wed, Feb 17, 2016 at 2:27 PM, Sameer Siruguri siruguri@gmail.com wrote:

I have to re-install elastic search because I got a new laptop, and remind
myself how to load up some dummy data... I can take a whirl at that this
afternoon, and hopefully at least point out what the deal is... not sure if
I can get around to fixing/testing it in the next few days...

On Wed, Feb 17, 2016 at 2:19 PM, Sarah Allen notifications@github.com
wrote:

@siruguri https://github.com/siruguri would you have time to take a
look at what's going on here?


Reply to this email directly or view it on GitHub
#278 (comment)
.

So the writer of BigDecimal actually wanted the default implementation to be that floating points are rendered as strings by the to_json method, to reduce the possibility of (small) errors during de/re-serialization of data. As far as I could figure, it seems like monkey-patching BigDecimal is the only solution, with the caveat that it will produce this corner-case error.

Should I do that? Should we instead let the numbers be strings so that the client system can decide how to re-cast to their version of the floating-point type?

Sorry to be so slow, @siruguri! Ugh, this is a nasty one, and I'm really torn. Providing floats as strings feels like giving a worse user/developer experience in return for a level of accuracy that - in all the existing use cases, anyway - really isn't needed. On the other hand, Open-data-maker is meant to be domain-agnostic, and I can't speak for the needs of future data sets and their users. (On the other other hand, I don't know what ES uses for its floats internally; the accuracy may already have been compromised before the stats were emitted.)

My inclination is to monkey-patch in favour of a better user experience. What does everyone else think?

I'm open to either solution though I think I'd rather un-string the value
in the open-data-maker code than monkey-patch BigDecimal. I'll work on this
today.

On Fri, Mar 4, 2016 at 12:18 PM, Yoz Grahame notifications@github.com
wrote:

Sorry to be so slow, @siruguri https://github.com/siruguri! Ugh, this
is a nasty one, and I'm really torn. Providing floats as strings feels like
giving a worse user/developer experience in return for a level of accuracy
that - in all the existing use cases, anyway - really isn't needed. On the
other hand, Open-data-maker is meant to be domain-agnostic, and I can't
speak for the needs of future data sets and their users. (On the other
other hand, I don't know what ES uses for its floats internally; the
accuracy may already have been compromised before the stats were emitted.)

My inclination is to monkey-patch in favour of a better user experience.
What does everyone else think?


Reply to this email directly or view it on GitHub
#278 (comment)
.

In fact, now that I think about it,

Providing floats as strings feels like giving a worse user/developer experience in return for a level of accuracy that - in all the existing use cases, anyway - really isn't needed.

sounds equiv to, "let's cast the BigDecimal object via .to_f" Should I do that instead? Thoughts? Then .to_json will continue to produce non-stringified numbers.

@siruguri Yes, that sounds like the right option to me. Thank you!