Consistancy issues in JSON schema
xXyepXx opened this issue ยท 14 comments
Describe the bug
The definition of the schema in schema.json
specifies that some fields may be one of several types.
For instance, the field implies
can either be a string or an array (of strings):
"implies": {
"oneOf": [
{
"type": "array",
"items": {
"$ref": "#/definitions/non-empty-non-blank-string"
}
},
{
"$ref": "#/definitions/non-empty-non-blank-string"
}
]
},
This makes it really hard to parse the JSON object in languages like Go where you need to define the types statically:
type Techno struct {
Categories []int `json:"cats"`
Implies []string `json:"implies"` // This may just be a string!
}
Trying to deserialize a JSON object into an instance of this struct will return an error if the JSON input is using a string type instead of an array of strings:
json: cannot unmarshal string into Go struct field Techno.implies of type []string
Expected behavior
Have a single type for each field.
This is an issue we have encountered aswell
the original project is not ours, we just started maintaining adding technologies and fixing json objects, always trying not to break old implementations of wappalyzer.
we could deal with this by just replacing all "[]string OR string" by just "[]string" as webappanalyzer implementations parse string or []string anyways.
Im going to leave this issue open and link commits related to this.
is not as good practice and makes the code less readable, but you can json unmarshal to map[string]interface{} and assert the interface{} (obj.([]string)) or just by using "reflect".
I was the one that added the schema to the original project, it was added 'after the fact'. Then was modified further after that. Since it was added after the data files existed (not before), it had to conform to the data present after years of different people contributing with no guidance or format restrictions.
I agree if it was more strongly typed that would really help.
On the off chance that it might help someone here's the parser we use within ZAPs technology detection add-on: https://github.com/zaproxy/zap-extensions/blob/main/addOns/wappalyzer/src/main/java/org/zaproxy/zap/extension/wappalyzer/WappalyzerJsonParser.java
(That code has also evolved over time and been worked on by maybe a dozen people so if it seems horrid I won't be surprised ๐)
I made some contribution in order to unify and simplify the schema.
I hope it fits your needs. If you see other fields that could be improved, just raise your idea
Thanks. Seems fine on my end. Nice work!
unsure about what to do with the dom selectors, there are too many variations
string
"dom": "form[name='formLogin'][action='login.aspx' i][id='formLogin']"
string array:
"dom": ["form[name='formLogin'][action='login.aspx' i][id='formLogin']", ...]
object
"dom": {
"img[src*='provenexpert']": {
"attributes": {
"src": "images\.provenexpert\.\w+"
}
}
}
I'd say transform string and string array to the object structure, like "dom": {"form[name='formLogin'][action='login.aspx' i][id='formLogin']": ""}, seems more flexible.
However this change might break some implementations so i thought on creating a new one called "selector" and deprecating the "dom" one.
Well assuming people are already using the objects then moving strings and arrays into objects "shouldn't" break anything ... Though I guess there might be some just using the strings and string array elements ๐คทโโ๏ธ
Yeah the thing is i'm not sure what is the people using right now, my guess would be json object aswell, I checked some libs on github and people are not even using the query selectors
I can say that ZAP uses both but I think our implementation south continue to work.
The easy, consistent, and quick'ish thing I guess is to make all the single strings into single element arrays as was done elsewhere. Then atleast it's down to two types vs. three.
I guess, still not consistent you can't properly map it on a strongly typed language -- but I'm not sure if you could anyways, since it seems to have unknown depth, so...
there appears to be a limited amout of keys inside the selector
- attributes: with an object inside ex: "dom": {"attributes": {"href": "
pattern
", "src": "pattern
"}} - properties: with an object inside ex: "dom": {"properties": {"_reactRootContainer": "
pattern
"}} - text: with a string inside ex: "dom": {"text": "
pattern
"} - exists: with an empty string inside ex: "dom": {"exists": ""}
any of the "pattern
" or empty string "" can contain the \;confidence and version thing inside
Of course many if them seem to be very broken so im going to work on them now.
But still you can't map it because even tho attributes, text or exists seem to be mandatory, the key that contains it is dynamic, as well as the keys inside the attributes obj...
a better way of mapping those objects would be like
{
"dom": [
{
"selector": "link[href*='/wp-content/plugins/sfwd-lms/']",
"attributes": [
{
"name": "href",
"value": "whatever"
},
{
"name": "src",
"value": ""
}
],
"text": "Powered By ...(\\d+)\\;version:\\1"
}
]
}
so its easier to scale aswell like this
{
"dom": [
{
"selector": "link[href*='/wp-content/plugins/123/']",
"attributes": [
{
"name": "href",
"value": "whatever"
},
]
},
{
"selector": "link[href*='/wp-content/plugins/456/']",
"text": "Powered By ...(\\d+)\\;version:\\1"
}
]
}
also I'm not sure what the exists is for, since if you dont want it, just dont put it.
anyways, knowing this structure at least we can properly validate everything, not sure how many people use this objects as they are supposed to since there is no documentation of it.
Also many of the objects are broken no idea how this has been working in prod, I've seen at least 5 or 6 randomly structured objects.
Remember they aren't regex they're element selectors.
Our project is using them without issue.
Oh okay, now I see what you mean