santhosh-tekuri/jsonschema

Performance drop when schema has "uniqueItems" set to true

Closed this issue · 7 comments

Hi,

I noticed a major performance drop when the schema has uniqueItems set as true for an array and the json content being a 4mb file has a massive array.

The benchmark test took only about 0.07260 ns/op when set to false and a whopping 107824033000 ns/op (~107 sec) when set to true.

Another library from xeipuuv/gojsonschema seems to not have any major time difference when uniqueItems is set to true. So perhaps there is room for improvement here.

The schema used is as follows:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "array",
  "items": {
    "type": "object",
    "required": ["name", "price", "description"],
    "properties": {
      "name": {
        "type": "string"
      },
      "price": {
        "type": "number"
      },
      "description": {
        "type": "string"
      }
    }
  },
  "uniqueItems": true
}

fixed with caf0c40

@NitinRamesh25 can you verify with this fix

@NitinRamesh25 it seems you have json array with large number of items.

the above fix improves performance only in cases where large array contains duplicate items.

@santhosh-tekuri I have verified with the new change and it works fast when duplicate items are close by in the array.
In case where the first and last element of the large array are duplicates, then it still takes a lot of time.
I would prefer if it runs fast even when no duplicate items are found.

Wouldn't this be faster if we marshal the element to []byte, convert it to a string and store it as the key of a map ?
Since the array elements would be keys of the map, lookups for duplicate data would be faster.

marshalling array item is the approach taken by the library you have mentioned. I have following concerns with that approach:

  • lots of memory allocations
  • handling numbers (especially larger numbers) is error prone

the best approach is use Hashmap. I have used this approach in boon (it is port of this library in Rust)

unfortunately Golang map does not allow slices as keys, because slices has no equality defined.

I plan to use somewhat map like approach for this;

fixed with d0f75c8

@NitinRamesh25 now we use map to check uniqueness, when array is large enough (>20)

I tested locally by temporarily setting size threshold to zero, and all tests passed;

please verify at your end

Much thanks. It's faster now.

@NitinRamesh25 just published v5.3.1 with this fix