Example of poor performance with many updates to a distinct accumulator

Opened this issue 6 years ago · 0 comments

@wdullaer reported a case that performs poorly: it ran for over an hour before the process was terminated. The case can be found at https://gist.github.com/wdullaer/cecf88b3266ba0ac90b4f060eefe5208 which, for convenience, I will copy below:

(ns clara-perf.core
  (:require [clara.rules :refer :all]
           [clara.rules.accumulators :as acc]
           [clojure.set :refer [subset?]]))

(defrecord AllowedProcessor
    [personId processorId])

(defrecord Processor
    [processorId purpose attributes])

(defrecord Consent
    [personId purpose attribute])

(defrule mark-processing-allowed
  "Inserts an AllowedProcessor fact if the person has given sufficient consent"
  [?consents <- (acc/distinct :attribute)
   :from [Consent (= ?purpose purpose)
                  (= ?personId personId)]]
  [Processor (= ?purpose purpose)
             (= ?processorId processorId)
             (= ?attrs attributes)]
  [:test (subset? (set ?attrs) (set ?consents))]
  =>
  (insert! (->AllowedProcessor ?personId ?processorId)))

(defn logstream
  "Utility function which passes the input to the output while logging a message"
  [input logmessage]
  (println logmessage)
  input)

(def test-purposes [:marketing :health :legal :mobility])
(def test-attributes [:name :email :address :phoneNumber :age :gender :profilePicture :location :homepage :hartrate])

(defn get-random-consent
  "Generate a random consent message with attributes uniformly drawn from a list of values"
  [numPersons purposes attributes]
  (->Consent (str (rand-int numPersons))
             (get purposes (rand-int (count purposes)))
             (get attributes (rand-int (count attributes)))))

(defn generate-consents
  "Generate a seq of random consents"
  [n purposes attributes]
  (repeatedly n #(get-random-consent (quot n 10) purposes attributes)))

(defn -main
  "Run the performance test"
  [& args]
  (-> (mk-session 'clara-perf.core)
      (logstream "Insert Processors")
      (insert (->Processor :mailinglist :marketing [:email])
              (->Processor :recommender :marketing [:location :name])
              (->Processor :sleepmonitor :health [:age :hartrate :name]))
      (logstream "Generating consent 1")
      (insert-all (generate-consents 1000000 test-purposes test-attributes))
      (logstream "Processing consent 1")
      (fire-rules)
      (logstream "Generating consent 2")
      (insert-all (generate-consents 1000000 test-purposes test-attributes))
      (logstream "Processing consent 2")
      (fire-rules)
      (logstream "Done")))
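
To make it easier to check what the rule should produce on small inputs, here is a plain-Clojure sketch of my reading of `mark-processing-allowed`, with no Clara involved. It is not how Clara evaluates the rule. The function name `allowed-processors` is hypothetical, and it takes plain maps with the same keys as the records above rather than the records themselves.

```clojure
(require '[clojure.set :refer [subset?]])

(defn allowed-processors
  "Hypothetical reference implementation: returns a set of
  {:personId .. :processorId ..} maps for every processor whose required
  attributes are all covered by the person's consents for that purpose."
  [processors consents]
  (let [;; Group consents by [personId purpose] and collect the distinct
        ;; attributes, mirroring (acc/distinct :attribute) per binding group.
        granted (reduce (fn [m {:keys [personId purpose attribute]}]
                          (update m [personId purpose] (fnil conj #{}) attribute))
                        {}
                        consents)]
    (set (for [[[person-id purpose] attrs] granted
               p processors
               ;; Join on purpose, then apply the :test condition.
               :when (and (= purpose (:purpose p))
                          (subset? (set (:attributes p)) attrs))]
           {:personId person-id :processorId (:processorId p)}))))

(comment
  ;; Duplicate consents collapse into one attribute, as with acc/distinct.
  (allowed-processors
   [{:processorId :mailinglist :purpose :marketing :attributes [:email]}]
   [{:personId "1" :purpose :marketing :attribute :email}
    {:personId "1" :purpose :marketing :attribute :email}]))
```

Comparing this against the `AllowedProcessor` facts in a session built from a few dozen consents would be one way to confirm that any performance fix keeps the rule's results unchanged.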

@wdullaer also provided a VisualVM snapshot that can be downloaded from https://www.dropbox.com/s/no6h9pwa03hzt7w/clara-perf.nps?dl=0

I've taken a screenshot of the snapshot with the parts that I think are most informative expanded:

In contrast, @wdullaer reported that when the same benchmark was run with a single insertion call containing the same total number of facts, the runtime was far lower.
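
For comparison, a single-insertion variant might look like the sketch below. I don't know exactly how @wdullaer ran it, so this is an assumption. The function name `run-single-batch` is hypothetical, and the code assumes it is evaluated in the `clara-perf.core` namespace above so that the records, the rule, and `generate-consents` are in scope. Note that `generate-consents` derives the number of persons from `n` (`(quot n 10)`), so calling it once with 2000000 would double the range of person IDs; concatenating two batches of 1000000 keeps the data distribution the same as in `-main`.

```clojure
;; Assumes the clara-perf.core namespace shown above is loaded and current.
(defn run-single-batch
  "Hypothetical variant of -main: the same 2,000,000 consents in total,
  but inserted with one insert-all call and one fire-rules call."
  []
  (let [consents (concat (generate-consents 1000000 test-purposes test-attributes)
                         (generate-consents 1000000 test-purposes test-attributes))]
    (-> (mk-session 'clara-perf.core)
        (insert (->Processor :mailinglist :marketing [:email])
                (->Processor :recommender :marketing [:location :name])
                (->Processor :sleepmonitor :health [:age :hartrate :name]))
        (insert-all consents)
        (fire-rules))))

(comment
  ;; Wall-clock timing at the REPL; compare against (time (-main)).
  (time (run-single-batch)))
```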

I've logged this issue so that we can investigate and have somewhere to discuss our findings.