miner/herbert

hg/sample perf issues with larger schemas

coopernurse opened this issue · 4 comments

Hi there,

Thanks for writing herbert. The test.check integration is really nice.

While using it today I noticed performance problems in the sample function: run time appears to grow exponentially with the number of fields. With a 20-field schema, sample never completes (at least on my MacBook). Here are some examples:

;; 8 fields
(def schema8 '[{:heatingSystem? str,
  :propertyCity? str,
  :mlsListingId? str,
  :listingKey {:externalAppId int, :listingId str},
  :yearBuilt? str,
  :propertyCountry? str,
  :imageUrls? (seq (* str)),
  :userId str}])

;; 15 fields
(def schema15 '[{:heatingSystem? str,
  :propertyCity? str,
  :mlsListingId? str,
  :listingKey {:externalAppId int, :listingId str},
  :yearBuilt? str,
  :propertyCountry? str,
  :listDate? str,
  :county? str,
  :attic? str,
  :secondAgentName? str,
  :diningRoom? str,
  :baths? str,
  :secondAgentPhone1? str,
  :imageUrls? (seq (* str)),
  :userId str}])

;; 20 fields
(def schema20 '[{:heatingSystem? str,
  :monthlyHOAFees? str,
  :remarks? str,
  :roofType? str,
  :listingSource? str,
  :propertyCity? str,
  :bedrooms? str,
  :mlsListingId? str,
  :listingKey {:externalAppId int, :listingId str},
  :yearBuilt? str,
  :propertyCountry? str,
  :listDate? str,
  :county? str,
  :attic? str,
  :secondAgentName? str,
  :diningRoom? str,
  :baths? str,
  :secondAgentPhone1? str,
  :imageUrls? (seq (* str)),
  :userId str}])

Then when run in the REPL:

user=> (time (def x (hg/sample schema8)))
"Elapsed time: 4.203 msecs"

user=> (time (def x (hg/sample schema15)))
"Elapsed time: 543.325 msecs"

;; never finishes - crushes CPU
user=> (time (def x (hg/sample schema20)))

I'll have to investigate. It seems that the optional keys are causing the slowdown.

It appears to be a combinatorial explosion in mk-literal-hash-map. I'll have to rewrite that.
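
To make the scale of the explosion concrete, here is a minimal sketch that counts how many distinct key sets a map schema permits. It assumes the generator enumerates every present/absent combination of the top-level optional keys; that is a guess at the cause, not a reading of the mk-literal-hash-map source. The helpers optional-key? and key-set-count are hypothetical names and are not part of herbert.

```clojure
(require '[clojure.string :as string])

(defn optional-key?
  "True when k is a keyword whose name ends in ?, the optional-key
  convention used by the schemas in this issue."
  [k]
  (and (keyword? k) (string/ends-with? (name k) "?")))

(defn key-set-count
  "Number of distinct top-level key sets a map schema allows, assuming
  each optional key is independently present or absent: 2^n for n
  optional keys. Nested maps are not inspected."
  [map-schema]
  (let [n (count (filter optional-key? (keys map-schema)))]
    (bit-shift-left 1 n)))

;; Two optional keys and one required key give four possible key sets.
(key-set-count '{:a? str, :b? str, :c int})
```

Under that assumption, schema8 (6 optional keys) allows 64 key sets, schema15 (13 optional keys) allows 8,192, and schema20 (18 optional keys) allows 262,144. The 128x growth from schema8 to schema15 is close to the measured slowdown from 4.203 to 543.325 msecs, roughly 129x.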

Please try version 0.6.5, just released on Clojars.
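
The thread does not say how 0.6.5 changed mk-literal-hash-map. Purely as an illustration, one way to avoid enumerating combinations is to decide each optional key independently, which takes a single pass over the keys. The sketch below uses plain Clojure and java.util.Random rather than test.check generators, and choose-keys is a hypothetical helper, not herbert's implementation.

```clojure
(require '[clojure.string :as string])

(defn optional-key?
  "True when k is a keyword whose name ends in ?."
  [k]
  (and (keyword? k) (string/ends-with? (name k) "?")))

(defn choose-keys
  "Pick one key set in a single pass over ks: required keys are always
  kept, and each optional key is kept or dropped by its own coin flip.
  Work is linear in the number of keys instead of exponential."
  [ks ^java.util.Random rng]
  (filterv (fn [k]
             (or (not (optional-key? k))
                 (.nextBoolean rng)))
           ks))

;; A seeded Random keeps the choice reproducible between runs.
(choose-keys [:heatingSystem? :yearBuilt? :userId] (java.util.Random. 42))
```

The required key :userId is always in the result; which optional keys appear depends on the seed.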

Thank you sir! Looking much better.