MISP/misp-rfc

Fields containing only numbers should be JSON integers

rhaist opened this issue · 2 comments

To allow proper marshaling of the proposed misp-core-format JSON protocol, it would be essential for attributes that contain only numbers to be actual JSON integers.

For example, for Events:

  • id
  • threat_level_id
  • analysis
  • org_id
  • orgc_id
  • attribute_count
  • distribution
  • sharing_group_id
  • ....

This is probably also true for the different timestamps, which currently contain Unix timestamps formatted as strings.

The same should be applied throughout the other parts of the RFC.

REF: #2
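To illustrate the marshaling point, here is a minimal Python sketch. The `EventHeader` class, the helper names, and the sample values are all hypothetical and only cover three of the fields listed above; they are not taken from the RFC.

```python
import json
from dataclasses import dataclass


@dataclass
class EventHeader:
    # Hypothetical typed view of three of the Event fields listed above.
    id: int
    threat_level_id: int
    attribute_count: int


INT_FIELDS = ("id", "threat_level_id", "attribute_count")


def from_string_encoded(raw: dict) -> EventHeader:
    # String-encoded numbers: every field needs an explicit str -> int step.
    return EventHeader(**{name: int(raw[name]) for name in INT_FIELDS})


def from_integer_encoded(raw: dict) -> EventHeader:
    # Real JSON integers: the parsed values can be used as they are.
    return EventHeader(**{name: raw[name] for name in INT_FIELDS})


# Made-up sample documents, one per encoding.
string_doc = json.loads('{"id": "42", "threat_level_id": "2", "attribute_count": "7"}')
integer_doc = json.loads('{"id": 42, "threat_level_id": 2, "attribute_count": 7}')

# Both encodings carry the same information; only the conversion effort differs.
assert from_string_encoded(string_doc) == from_integer_encoded(integer_doc)
```

In a statically typed language, the string encoding typically means either a custom decoder per field or an intermediate all-strings structure.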

I expect serializing ID-like field values as JSON numbers to cause more trouble than it cures, at least in cases where the values might exceed 9007199254740991 (2^53 - 1), or say 15+ decimal digits, at some future time or in some large installations. Some host languages (e.g. JavaScript) do not handle such large numbers without further ado, so precision loss, staged parsing, or even interoperability / round-trip problems MAY occur.
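The precision concern can be reproduced with a short Python sketch. Python's own integers are arbitrary precision, so `parse_int=float` is used here only to mimic a consumer that stores every JSON number as an IEEE 754 double, as JavaScript does. The ID value is made up.

```python
import json

MAX_SAFE_INTEGER = 2**53 - 1  # 9007199254740991

# A made-up ID just past the range a double can represent exactly.
big_id = MAX_SAFE_INTEGER + 2  # 9007199254740993

as_number = json.dumps({"id": big_id})       # ID serialized as a JSON number
as_string = json.dumps({"id": str(big_id)})  # ID serialized as a JSON string

# parse_int=float mimics a consumer that keeps all JSON numbers as doubles.
lossy = json.loads(as_number, parse_int=float)
preserved = json.loads(as_string, parse_int=float)

print(int(lossy["id"]))  # 9007199254740992 (off by one)
print(preserved["id"])   # 9007199254740993 (string survives unchanged)
```

The string form round-trips regardless of how the consumer represents numbers, which is the robustness argument made above.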

On the other hand, the attribute_count value (as a count) would be a good JSON number candidate, I think.

The generally robust approach of using only JSON objects, arrays, strings and booleans has some value in itself, hasn't it :-? Serializing a potentially unbounded identification number as a string has the benefit that, whatever consumer or producer processing is involved, there is a good chance the ID is preserved across write and read (whatever optimizations happen inside the local "processing" node). Performance MAY be an issue, which suggests leaning towards numbers where possible, but then the structures are built from many nested objects and only a few arrays, so speed-optimized segmented parsing does not seem to be the primary design goal ;-)

Note that a migration from epoch strings towards ISO 8601 or RFC 3339 (a subset of ISO 8601) would also move away from JSON numbers; see #4.

So, maybe a detailed discussion could be helpful on these "value type serialization" topics. I would consider one exchange on the general modeling level (primary and secondary goals), and then - possibly guided by that outcome - other discussions focused on groups of attributes.

As of this date, it seems that MISP software by default (that is, when backed by MySQL) uses 4 bytes for IDs (i.e. half the width of the 8-byte numbers behind the limitation noted above). PostgreSQL installs use bigserial, i.e. 8-byte IDs, which is the same width as the 64-bit doubles used by "some host languages like JavaScript", although a double represents integers exactly only up to 2^53 - 1.

Looking at these defaults, it seems quite unlikely that you would need anything larger than 8-byte ints in the foreseeable future.
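For reference, the ranges under discussion can be compared directly. This is plain arithmetic in Python; whether a given installation uses signed or unsigned 4-byte IDs is an assumption left open here, so both are shown.

```python
MAX_SAFE_INTEGER = 2**53 - 1  # largest integer a 64-bit double represents exactly

INT32_MAX = 2**31 - 1   # signed 4-byte integer
UINT32_MAX = 2**32 - 1  # unsigned 4-byte integer
INT64_MAX = 2**63 - 1   # signed 8-byte integer (upper bound of bigserial)

# Any 4-byte ID fits comfortably inside the exactly representable range.
print(UINT32_MAX <= MAX_SAFE_INTEGER)  # True

# An 8-byte ID column can, in principle, hold values beyond that range.
print(INT64_MAX <= MAX_SAFE_INTEGER)   # False
```

So 4-byte IDs are always safe as JSON numbers, while 8-byte IDs are safe only as long as the actual values stay below 2^53.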

Using numbers instead of strings means you can define a more meaningful JSON schema, which is helpful for validation, and it simplifies the serialization/deserialization process, where you otherwise have to translate these numbers back and forth. I'm not concerned about speed here, but about the hassle of implementing this.
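A rough sketch of the schema argument follows, using a hand-rolled check that understands only the `type` and `minimum` keywords. It is not a real JSON Schema validator, and the `minimum` of 0 for a count field is an assumption made for illustration.

```python
# Two hypothetical schema fragments for a count-like field.
STRING_SCHEMA = {"type": "string"}
INTEGER_SCHEMA = {"type": "integer", "minimum": 0}  # minimum is an assumed bound


def conforms(value, schema) -> bool:
    """Check a value against the tiny schema subset used here (type, minimum)."""
    if schema["type"] == "string":
        return isinstance(value, str)
    if schema["type"] == "integer":
        # bool is a subclass of int in Python, but it is not a JSON integer.
        if isinstance(value, bool) or not isinstance(value, int):
            return False
        return value >= schema.get("minimum", value)
    raise ValueError(f"unsupported type: {schema['type']}")


# A plain string schema accepts values that are not numbers at all.
print(conforms("not-a-number", STRING_SCHEMA))  # True
print(conforms("-5", STRING_SCHEMA))            # True

# An integer schema can reject them and also express a range.
print(conforms(-5, INTEGER_SCHEMA))             # False
print(conforms(7, INTEGER_SCHEMA))              # True
```

A string schema could be tightened with a pattern, but range constraints such as a minimum are only directly expressible on numeric types.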

This is only somewhat related, but if it had been described as a number in the JSON schema, I would never have been confused by #9 when implementing the MISP format in my own app.