replikativ/datahike

Datoms disappearing upon reconnection


I have a (boolean) attribute that exists on several entities. I can set and query that attribute, and as long as I keep using a cached connection, things appear consistent. But as soon as I reconnect (d/connect), the attribute reverts to an earlier value for some but not all of the entities. If I continue to transact data with the new connection and then reconnect again, some of the updates will again be lost, though on a different set of entities.

I am using version 0.3.6.

This may not be relevant, but I'll include it for completeness. I've been comparing the :eavt indices between the two connections. For each transaction that sets the attribute, on the cached connection I see two Datoms: one for the updated value of the attribute, and one for the transaction itself, setting :db/txInstant. On the fresh connection, the index has only one Datom from that transaction, the one that sets :db/txInstant. That is, on the fresh connection the transaction exists, but isn't updating the entity. I don't know the hitchhiker tree format well enough to research whether the transaction isn't being fully persisted, or if it is persisted but fails to be applied when re-connecting.

Any guidance would be appreciated. As is, my database is unusable because it keeps losing data.

Hi @mainej, could you share your database schema? I'll have a look at the issue then.

@kordano thanks for taking a look at this.

This is hard to replicate, but I've come up with something that seems to do it reliably. Note that in the following code, the sequence of operations is important. The problem doesn't happen if steps are re-ordered or combined.

(ns debug
  (:require [datahike.api :as d]
            [datahike.db :as db]
            [clojure.data :as data]))

(defn ident-eid [db ident]
  (d/q '[:find ?e .
         :in $ ?ident
         :where [?e :db/ident ?ident]]
       db ident))

(defn random-uuid []
  (java.util.UUID/randomUUID))

(def ascii-ish
  (map char (concat (range 48 58) (range 65 91) (range 97 123))))
(defn random-char []
  (rand-nth ascii-ish))
(defn random-string [length]
  (apply str (repeatedly length random-char)))

(defn ent-ids [db]
  (d/q '[:find [?e ...]
         :where
         [?e :ent/id]]
       db))

(comment
  (let [cfg {:store {:backend :file :path "/tmp/replicate-bug"}}]
    (when-not (d/database-exists? cfg)
      (d/create-database cfg))
    (let [conn (d/connect cfg)]
      ;; initialize data, only once
      (when-not (ident-eid (d/db conn) :ent/id)
        (d/transact conn {:tx-data [{:db/ident       :ent/id
                                     :db/valueType   :db.type/uuid
                                     :db/cardinality :db.cardinality/one
                                     :db/unique      :db.unique/identity
                                     :db/doc         "The entity ID."}
                                    {:db/ident       :ent/active?
                                     :db/valueType   :db.type/boolean
                                     :db/cardinality :db.cardinality/one
                                     :db/doc         "Whether the entity is active."}
                                    {:db/ident       :meta/space-taker
                                     :db/valueType   :db.type/string
                                     :db/cardinality :db.cardinality/one
                                     :db/doc         "Takes up some space in the db"}]})
        ;; make a few entities
        (d/transact conn {:tx-data (map (fn [_]
                                          {:ent/id (random-uuid)})
                                        (range 8))})

        ;; take up some space in the db
        ;; NOTE: problem only happens if there is some extra data in the db
        (d/transact conn {:tx-data (map (fn [_]
                                          {:meta/space-taker (random-string 250)})
                                        (range 1000))})

        ;; activate all entities
        ;; NOTE: problem only happens if `:ent/active?` is set to true as an
        ;; update, not if it is set when the entity is created.
        (d/transact conn {:tx-data (map (fn [eid]
                                          {:db/id      eid
                                           :ent/active? true})
                                        (ent-ids (d/db conn)))}))
      ;; later on, deactivate some entities
      (d/transact conn {:tx-data (->> (ent-ids (d/db conn))
                                      sort
                                      (take 5)
                                      (map (fn [eid]
                                             [:db/add eid :ent/active? false])))})

      ;; Some of the entities should be deactivated. This should be true whether
      ;; using a cached or fresh connection
      (let [cached-db (d/db conn)
            fresh-db  (d/db (d/connect cfg))]
        ;; in my testing, this assertion fails:
        ;; Assert failed: active should be same in cached 3 and fresh 8
        ;; (= (count (active cached-db)) (count (active fresh-db)))
        (let [active (fn [db]
                       (d/q '[:find [?e ...]
                              :where
                              [?e :ent/active? true]]
                            db))]
          (assert (= (count (active cached-db))
                     (count (active fresh-db)))
                  (str "active should be same in cached " (count (active cached-db))
                       " and fresh " (count (active fresh-db)))))
        ;; If the first assertion is commented out, this assertion fails,
        ;; showing that the entities that were deactivated are still active on
        ;; the fresh connection.
        (let [idx                        (fn [db]
                                           (db/-datoms db :eavt []))
              [only-cached only-fresh _] (data/diff (set (idx cached-db))
                                                    (set (idx fresh-db)))]
          (assert (and (nil? only-cached)
                       (nil? only-fresh))
                  [only-cached only-fresh]))))))

Thanks for the detailed info. I'll get right to it.

I could reproduce your problem @mainej and tried to figure out when and why this started to happen. It seems that the error was introduced after version 0.3.2, maybe through our upsert optimizations. So, if you are using Datahike 0.3.2, you should be fine. In the meantime I'm working on solving this issue with the current version.

Thanks for researching @kordano. I can't revert to 0.3.2, as I was experiencing #122. The fix for that wasn't merged until 0.3.6.

Good to know that it wasn't present in 0.3.2 though. If I get a chance, I might git bisect through the datahike history to see exactly when the problem appears.

Bisecting was a good idea @mainej. I did it, and the bug first appears in Datahike commit 01125e3, where we moved to the following hitchhiker-tree branch.

{io.replikativ/hitchhiker-tree {:git/url "https://github.com/replikativ/hitchhiker-tree.git"
                                :sha     "4d332a2a00d460ec59ab8cc5468eb7272e99fe3e"}}

The bug still needs investigation though.

whilo commented

The bump in the hitchhiker-tree also corresponds to a bump back to konserve version 0.5.1 from replikativ/konserve@f904359. I am not sure whether this is the reason, but it might be worth trying whether the bug goes away when manually bumping the konserve dependency in your project to 0.6.0-alpha3.
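
As a minimal sketch of that suggestion, a top-level entry in deps.edn overrides the konserve version that Datahike pulls in transitively. The io.replikativ/konserve coordinate and the :mvn/version form are assumptions about how the project declares its dependencies; only the version numbers come from this thread, and whether the bump actually removes the bug is untested here.

```clojure
;; deps.edn (sketch; coordinates are assumed, versions are from this thread)
{:deps {io.replikativ/datahike {:mvn/version "0.3.6"}
        ;; A top-level dependency takes precedence over a transitive one,
        ;; so this pins konserve regardless of the version that datahike
        ;; or hitchhiker-tree request.
        io.replikativ/konserve {:mvn/version "0.6.0-alpha3"}}}
```

Running clojure -Stree afterwards prints the resolved dependency tree, which should show which konserve version was actually selected.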

Closing because this issue is fixed in PR #320.