Consistancy issues in JSON schema

Question

Consistancy issues in JSON schema

xXyepXx opened this issue 10 months ago · 14 comments

Describe the bug
The definition of the schema in schema.json specifies that some fields may be one of several types.

For instance, the field implies can either be a string or an array (of strings):

"implies": {
  "oneOf": [
    {
      "type": "array",
      "items": {
        "$ref": "#/definitions/non-empty-non-blank-string"
      }
    },
    {
      "$ref": "#/definitions/non-empty-non-blank-string"
    }
  ]
},

This makes it really hard to parse the JSON object in languages like Go where you need to define the types statically:

type Techno struct {
    Categories  []int    `json:"cats"`
    Implies     []string `json:"implies"`  // This may just be a string!
}

Trying to deserialize a JSON object into an instance of this struct will return an error if the JSON input is using a string type instead of an array of strings:

json: cannot unmarshal string into Go struct field Techno.implies of type []string

Expected behavior
Have a single type for each field.

Answer 1 · 2024-01-29T15:23:53.000Z

This is an issue we have encountered aswell
the original project is not ours, we just started maintaining adding technologies and fixing json objects, always trying not to break old implementations of wappalyzer.
we could deal with this by just replacing all "[]string OR string" by just "[]string" as webappanalyzer implementations parse string or []string anyways.
Im going to leave this issue open and link commits related to this.
is not as good practice and makes the code less readable, but you can json unmarshal to map[string]interface{} and assert the interface{} (obj.([]string)) or just by using "reflect".

Answer 2 · 2024-02-08T19:28:34.000Z

I was the one that added the schema to the original project, it was added 'after the fact'. Then was modified further after that. Since it was added after the data files existed (not before), it had to conform to the data present after years of different people contributing with no guidance or format restrictions.

I agree if it was more strongly typed that would really help.

On the off chance that it might help someone here's the parser we use within ZAPs technology detection add-on: https://github.com/zaproxy/zap-extensions/blob/main/addOns/wappalyzer/src/main/java/org/zaproxy/zap/extension/wappalyzer/WappalyzerJsonParser.java
(That code has also evolved over time and been worked on by maybe a dozen people so if it seems horrid I won't be surprised 😉)

Answer 3 · 2024-07-16T07:35:09.000Z

I made some contribution in order to unify and simplify the schema.
I hope it fits your needs. If you see other fields that could be improved, just raise your idea

Answer 4 · 2024-07-16T09:51:33.000Z

Thanks. Seems fine on my end. Nice work!

Answer 5 · 2024-07-24T13:01:32.000Z

unsure about what to do with the dom selectors, there are too many variations

string
"dom": "form[name='formLogin'][action='login.aspx' i][id='formLogin']"

string array:
"dom": ["form[name='formLogin'][action='login.aspx' i][id='formLogin']", ...]

object
"dom": {
"img[src*='provenexpert']": {
"attributes": {
"src": "images\.provenexpert\.\w+"
}
}
}

I'd say transform string and string array to the object structure, like "dom": {"form[name='formLogin'][action='login.aspx' i][id='formLogin']": ""}, seems more flexible.

However this change might break some implementations so i thought on creating a new one called "selector" and deprecating the "dom" one.

Answer 6 · 2024-07-24T13:46:23.000Z

Well assuming people are already using the objects then moving strings and arrays into objects "shouldn't" break anything ... Though I guess there might be some just using the strings and string array elements 🤷‍♂️

Answer 7 · 2024-07-24T17:40:07.000Z

Yeah the thing is i'm not sure what is the people using right now, my guess would be json object aswell, I checked some libs on github and people are not even using the query selectors

Answer 8 · 2024-07-24T17:57:25.000Z

I can say that ZAP uses both but I think our implementation south continue to work.

The easy, consistent, and quick'ish thing I guess is to make all the single strings into single element arrays as was done elsewhere. Then atleast it's down to two types vs. three.

Answer 9 · 2024-07-24T18:04:55.000Z

I guess, still not consistent you can't properly map it on a strongly typed language -- but I'm not sure if you could anyways, since it seems to have unknown depth, so...

Answer 10 · 2024-07-25T14:03:34.000Z

there appears to be a limited amout of keys inside the selector

attributes: with an object inside ex: "dom": {"attributes": {"href": "pattern", "src": "pattern"}}
properties: with an object inside ex: "dom": {"properties": {"_reactRootContainer": "pattern"}}
text: with a string inside ex: "dom": {"text": "pattern"}
exists: with an empty string inside ex: "dom": {"exists": ""}

any of the "pattern" or empty string "" can contain the \;confidence and version thing inside

Of course many if them seem to be very broken so im going to work on them now.

But still you can't map it because even tho attributes, text or exists seem to be mandatory, the key that contains it is dynamic, as well as the keys inside the attributes obj...

a better way of mapping those objects would be like

{
  "dom": [
    {
      "selector": "link[href*='/wp-content/plugins/sfwd-lms/']",
      "attributes": [
        {
          "name": "href",
          "value": "whatever"
        },
        {
          "name": "src",
          "value": ""
        }
      ],
      "text": "Powered By ...(\\d+)\\;version:\\1"
    }
  ]
}

so its easier to scale aswell like this

{
  "dom": [
    {
      "selector": "link[href*='/wp-content/plugins/123/']",
      "attributes": [
        {
          "name": "href",
          "value": "whatever"
        },
      ]
    },
    {
      "selector": "link[href*='/wp-content/plugins/456/']",
      "text": "Powered By ...(\\d+)\\;version:\\1"
    }
  ]
}

also I'm not sure what the exists is for, since if you dont want it, just dont put it.

Answer 11 · 2024-07-25T14:08:52.000Z

anyways, knowing this structure at least we can properly validate everything, not sure how many people use this objects as they are supposed to since there is no documentation of it.

Also many of the objects are broken no idea how this has been working in prod, I've seen at least 5 or 6 randomly structured objects.

Answer 12 · 2024-07-25T14:22:16.000Z

Remember they aren't regex they're element selectors.

Our project is using them without issue.

Answer 13 · 2024-07-25T14:30:00.000Z

nono, the selectors are correct, i already validated them all and its included in the github workflow already, but i meant that some objects are not configured following the structure required, for example:

im not sure about the {"exists": ""} thing

Answer 14 · 2024-07-25T14:37:47.000Z

Oh okay, now I see what you mean