security-force-monitor/research-handbook

Consider changing the data types to something less internal

Opened this issue · 4 comments

[Screenshot: 2023-09-26 at 15:23:23]

In this column, the data types are named after their internal clojure.spec validation functions. As these names are relatively arbitrary, I would suggest replacing them with something that 1) is more widely used, 2) is further abstracted, and 3) reveals less about the inner implementation.

One such solution would be to use TypeScript type notation, as TypeScript is becoming a very widely used type system.

The changes would look like this:

| inner implementation | typed reference | as multiple |
| --- | --- | --- |
| uuid-string | uuid | uuid[] |
| strings<->uuids | uuid[] | uuid[] |
| single-string | string | string[] |
| cell-list | string[] | string[] |
| YN<->bool | boolean | boolean[] |
| string-date<->timestamp | timestamp | timestamp[] |
| status | "3" \| "2" \| "1" | Array<"3" \| "2" \| "1"> |

Optional values get a trailing question mark: string?, string[]?, etc.
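To make the proposed notation concrete, here is a minimal sketch of how the "typed reference" names could be written as real TypeScript. The aliases `Uuid` and `Timestamp` and the `ExampleRow` shape with its field names are hypothetical, invented for illustration; they are not taken from the handbook or the data model. Note that in actual TypeScript the optional marker goes on the property name (`aliases?: string[]`) rather than on the type, so `string[]?` is documentation shorthand, not valid syntax.

```typescript
// Hypothetical aliases mirroring the "typed reference" column.
type Uuid = string;      // a string formatted as a UUID
type Timestamp = string; // assumption: dates are carried as strings
type Status = "3" | "2" | "1";

// Hypothetical row shape, only to show how single, multiple
// and optional values read in this notation.
interface ExampleRow {
  id: Uuid;              // uuid
  relatedIds: Uuid[];    // uuid[]
  name: string;          // string
  aliases?: string[];    // string[]? in the shorthand above
  isActive: boolean;     // boolean
  firstSeen?: Timestamp; // timestamp?
  status: Status;        // "3" | "2" | "1"
}

const row: ExampleRow = {
  id: "123e4567-e89b-12d3-a456-426614174000",
  relatedIds: [],
  name: "Example unit",
  isActive: true,
  status: "3",
};
```

Optional fields (`aliases`, `firstSeen`) can simply be left out, as in `row` above.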

Then, we can talk about how a boolean is represented as "Y" | "N" in the spreadsheet, or as a boolean in ingested data.
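The two representations of a boolean could be related by a pair of small functions. This is a sketch only: the names `ynToBool` and `boolToYn` are hypothetical, and rejecting anything other than an exact "Y" or "N" is an assumption about how strict the validation should be, not a description of the existing clojure.spec validator.

```typescript
// Spreadsheet representation of a boolean.
type YN = "Y" | "N";

// Sheet -> ingested data. Assumption: only exact "Y" or "N" is accepted.
function ynToBool(cell: string): boolean {
  if (cell === "Y") return true;
  if (cell === "N") return false;
  throw new Error(`expected "Y" or "N", got "${cell}"`);
}

// Ingested data -> sheet.
function boolToYn(value: boolean): YN {
  return value ? "Y" : "N";
}
```

For example, `ynToBool("Y")` yields `true`, `boolToYn(false)` yields `"N"`, and any other cell value raises an error instead of being silently coerced.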

Yeah, agree. Or even have a human-readable version, like I report in the detailed section: uuid-strings -> String, in UUID format.

I think there's a good case for explaining what different data types are (strings, integers, bools), the related validators and what we use them for. For some readers, this may be the first time anyone's done that for them!

There's a bigger issue here: data types in the sheet vs. data types in the intermediate system vs. data types in the JSON exports.

Clojure/Script has a uuid data type, for example, whereas the same value is a string in the sheet and a string in the export. Similarly, dates have multiple representations along the chain.

They do have predictable transformations, though.
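As a rough illustration of what "predictable transformations" could mean along the sheet -> intermediate -> export chain, here is a sketch in TypeScript. Everything in it is an assumption made for the example: the function names, the branded `Uuid` type standing in for a native uuid type, and especially the `YYYY-MM-DD` sheet date format, which the thread does not specify.

```typescript
// Branded string standing in for an intermediate-system uuid type.
type Uuid = string & { readonly __brand: "Uuid" };

const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Sheet string -> intermediate uuid. Throws if the cell is not UUID-shaped.
function parseUuid(cell: string): Uuid {
  if (!UUID_RE.test(cell)) throw new Error(`not a UUID: "${cell}"`);
  return cell.toLowerCase() as Uuid;
}

// Sheet string -> intermediate Date.
// Assumption: the sheet uses YYYY-MM-DD. Only the shape is checked here;
// an out-of-range day such as 2023-02-31 is not rejected.
function parseSheetDate(cell: string): Date {
  const m = /^(\d{4})-(\d{2})-(\d{2})$/.exec(cell);
  if (!m) throw new Error(`not a YYYY-MM-DD date: "${cell}"`);
  return new Date(Date.UTC(Number(m[1]), Number(m[2]) - 1, Number(m[3])));
}

// Intermediate Date -> export string, back in YYYY-MM-DD form.
function exportDate(d: Date): string {
  return d.toISOString().slice(0, 10);
}
```

Under these assumptions a well-formed date survives the round trip unchanged, so `exportDate(parseSheetDate("2023-09-26"))` gives back `"2023-09-26"`, and a uuid is a plain string again at export time because the brand exists only at the type level.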

I think there are three audiences, with overlapping interests in the docs:
A) developers, who want to know how type handling should be managed;
B) researchers, who are likely to use the model in practice and will want to know what they need to do for data validation, integrity and quality; and
C) downstream users of products derived from our data, who need to assess the meaning of a field and the rules used to create it, as an aid to understanding.

Stuff written for B and C is generally useful to A, but stuff written for A is less useful to B and C!