simonw/sqlite-history

Initial design for a change tracking system for SQLite tables

simonw opened this issue · 40 comments

simonw commented

Goal:

  • On demand, create a snapshot backup of a table
  • Ability to restore those snapshots
  • Snapshots use minimal space, ideally by avoiding storing multiple copies of the same data
simonw commented

I'm leaning towards a design that looks something like this:

  • content - the table itself. id, title, body etc. MAYBE has an extra _hash column, see later.
  • content_shadow - the shadow table where the copies of old rows are kept. id, _id, title, body and maybe _hash
  • snapshot - id, created, note (note is optional)
  • snapshot_shadow - snapshot_id, shadow_id

Here's the bit that saves space: while every snapshot has a row in snapshot_shadow (2 integers) for every row saved in that snapshot, some of those rows can point to the same shadow_id.

Advanced optimization: could I figure out a way for a snapshot that is just the previous snapshot plus one extra row to store only a single new record in snapshot_shadow, rather than a record for every single row in the table?
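
A minimal sketch of what that schema might look like (column types and the optional _hash column are guesses at this stage):

-- The tracked table itself (optional _hash column omitted)
CREATE TABLE content (
  id INTEGER PRIMARY KEY,
  title TEXT,
  body TEXT
);

-- Copies of old row values; _id points back at content.id
CREATE TABLE content_shadow (
  id INTEGER PRIMARY KEY,
  _id INTEGER,
  title TEXT,
  body TEXT
);

-- One row per snapshot
CREATE TABLE snapshot (
  id INTEGER PRIMARY KEY,
  created TEXT,
  note TEXT
);

-- m2m: which shadow rows make up each snapshot
CREATE TABLE snapshot_shadow (
  snapshot_id INTEGER REFERENCES snapshot(id),
  shadow_id INTEGER REFERENCES content_shadow(id)
);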

simonw commented

For that last problem: any snapshot can be compared to any other snapshot in terms of just the difference in shadow rows - which rows are in the new one and not the old one and vice-versa.

So maybe there's a mechanism where snapshots are stored in terms of row differences from the previous snapshot - going all the way back to the first ever snapshot for that table which stores ALL rows.

Downside: sounds like it would make restores more expensive - you might have to restore the first snapshot, apply the difference from the second snapshot, etc etc etc up to the most recent one you wanted to restore.

simonw commented

That _hash column from earlier: the idea there was to store e.g. the first two characters of the md5 of the concatenated content columns, in the hope that this would allow a quick check of whether a new row is likely to be identical to an old one.

If the hash matches you still need to check the individual columns, but if it doesn't you can skip that comparison entirely.

Not yet sure if this would actually win anything though. Probably best to build v1 of the system without it.

simonw commented

Python API for this:

snapshot_id = snapshot(db, table_name, note="optional note")

# To restore
snapshot_restore(db, table_name, snapshot_id)

# To list
snapshots = snapshot_list(db, table_name)

Would be cool if snapshot(...) created the tables if they did not exist yet.

simonw commented

It should probably be possible to take a snapshot of multiple tables at once, maybe in a way that respects foreign keys? That's a feature for later though: right now I just want to be able to track a single table, in order to build AI-driven table refactoring features.

simonw commented

Here's a completely different option: treat the entire database as an append-only log and record the SQL queries that created it - every insert and create table and alter and so on.

Now any previous state can be recreated from scratch by re-applying the SQL up to a specified point.

Problem: if you run a restore, does that throw away history beyond that point, or does it double your history size?

simonw commented

I can't help but feel that I am reinventing transactions and savepoints.

simonw commented

What if every row did have a _version column that was incremented on updates?

Could I skip hashes but instead do something clever to detect which rows had been updated since the last snapshot?

Or... could I ensure shadow rows were unique on (_id, _version) so that I didn't have to worry about dupes there? Then it's just the snapshot m2m table that I have to think about.

simonw commented

One more potential design. What if I forget about snapshots and instead store a timestamp for every shadow row? Could that be enough for point-in-time recovery, such that I could just record the time before a big operation and be able to roll back to that moment?

simonw commented

For that I think the shadow design is:

_content_shadow table: _id, _version, _timestamp, title, body, _deleted

Compound primary key on _id, _version

_deleted is a boolean for representing when a row was deleted.

Need to think carefully about what happens if a deleted primary key is reused for a new record. Maybe prevent that in an insert trigger?
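
A rough sketch of that shadow table, assuming the same title and body columns as the earlier examples:

CREATE TABLE _content_shadow (
  _id INTEGER,         -- primary key of the row in content
  _version INTEGER,    -- increments with every change to that row
  _timestamp INTEGER,  -- when this version was captured
  title TEXT,
  body TEXT,
  _deleted INTEGER,    -- boolean: 1 if this version records a deletion
  PRIMARY KEY (_id, _version)
);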

simonw commented

Maybe the shadow table should always be enough information to entirely reconstruct the content table.

simonw commented

Under this system all writes are doubled: they get written to both content and shadow at both insert and update time.

Lots of inefficient row copies if I'm not careful. Could I perhaps store a bitmask of which columns changed since the previous version?

simonw commented

Prompt to try:

SQLite trigger: given these tables:

content: id, title, body, _version
shadow: _id, _version, title, body, _mask

On insert into content, set _version to 1 and write a copy of the new row into shadow with _mask set to an integer with both bits set to 1

On update to content, increment _version and insert a new row into shadow with _mask set to a bitmap representing which of the title and body columns have changed from their previous values.
simonw commented

I think my ideal solution solves the following problems:

  • Efficient storage of all previous row changes, without multiple copies of row values that didn't change. I think the _mask may be the way to do that
  • Ability to create on-demand snapshots of the table. Is this just a label against a timestamp or is there more to it than that?
  • If snapshots use m2m, a mechanism to avoid storing 1m new integer pairs when only a single row has changed would be ideal
  • Stretch goal: ability to snapshot multiple tables at once
  • Need to consider alter table - might require a snapshot where the old table is kept (after a rename) and a new, altered table created with its own unique history
simonw commented

Decision needed: do I want to solve on-demand snapshots, or continuous version tracking, or both?

Which could use the least space?

simonw commented

On-demand snapshotting feels intuitively like it should be less resource intensive - especially for tables that might get modified a lot.

simonw commented

Would be nice if I could keep the _version column out of the main table too. Maybe there's a _content_versions table which is a one-to-one against content purely to track row versions.

It could track _modified timestamps too! Maybe even have a _deleted timestamp to record if and when a row was deleted.
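
Something like this, perhaps (the table and column names here are just a guess at this stage):

CREATE TABLE _content_versions (
  _id INTEGER PRIMARY KEY,  -- matches the rowid of the row in content
  _version INTEGER,
  _modified INTEGER,        -- timestamp of the most recent change
  _deleted INTEGER          -- timestamp of deletion, null if still present
);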

simonw commented

Does SQLite ever reuse rowids? If not then maybe those are the way records are associated with their shadow versions table.

simonw commented

Did an experiment and it looks like rowids are not reused: https://chat.openai.com/chat/525950f5-c84e-4489-aa1c-7eb809972cad

simonw commented

So...

  • content - normal table
  • _content_versions - id that matches rowid in content, version starting at 0 and incrementing on any changes, created timestamp, updated timestamp, deleted timestamp - that's five integers

Next step: design history and snapshot tables that build on these.

simonw commented

Ideally I want _content_history to mirror the content schema but add a _version column. And maybe an _updated timestamp?

Also maybe a _mask bitmap integer to allow me to only store updated values and record which columns were modified.

Not yet convinced I can get away with the bitmap trick in pure SQL, for both saves and restores.

simonw commented

I think storing may be easy - because the trigger can compare the old and new columns and compute a bitmap just using that information.

The harder problem is reassembling a specific version of a row - or restoring the entire table to a specific timestamp.

If I can do this at the individual row level I think restoring the whole table is achievable, albeit slowly.

simonw commented

Ran this prompt:

CREATE TABLE [content] (
   [id] INTEGER PRIMARY KEY,
   [title] TEXT,
   [body] TEXT,
   [created] TEXT
);
CREATE TABLE [_content_meta] (
   [_rowid] INTEGER PRIMARY KEY,
   [version] INTEGER,
   [created] INTEGER,
   [updated] INTEGER,
   [deleted] INTEGER
);

SQLite SQL to create a trigger which does the following:

On insert into content, ensure a row exists in _content_meta with a _rowid set to the rowid of the new content row. It should have version of 1, deleted of null, and created and updated as the current timestamp

On update to row in content, ensure that _content_meta row exists, increment the version and set updated to current timestamp

On delete to row in content, ensure _content_meta row exists and set deleted to the current timestamp

Then added:

Update that trigger to only update _content_meta if at least one of the columns in content has had its value changed

Got this (which I have edited a bit subsequently):

-- Trigger for INSERT operation
CREATE TRIGGER content_insert_trigger
AFTER INSERT ON content
BEGIN
  INSERT INTO _content_meta (_rowid, version, created, updated, deleted)
  VALUES (new.rowid, 1, strftime('%s','now'), strftime('%s','now'), NULL);
END;

-- Trigger for UPDATE operation
CREATE TRIGGER content_update_trigger
AFTER UPDATE ON content
FOR EACH ROW
WHEN old.title != new.title OR old.body != new.body OR old.created != new.created
BEGIN
  UPDATE _content_meta
  SET version = version + 1, updated = strftime('%s','now')
  WHERE _rowid = old.rowid;
END;

-- Trigger for DELETE operation
CREATE TRIGGER content_delete_trigger
AFTER DELETE ON content
BEGIN
  UPDATE _content_meta
  SET updated = strftime('%s','now'), deleted = strftime('%s','now')
  WHERE _rowid = old.rowid;
END;
simonw commented

I want timestamps in milliseconds since the UTC epoch instead. This seems to work:

cast((julianday('now') - 2440587.5) * 86400.0 * 1000 as integer)

Wish it wasn't that convoluted!
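
For comparison, here's the seconds-based expression used in the triggers next to the milliseconds one:

SELECT
  strftime('%s', 'now') AS seconds_since_epoch,
  cast((julianday('now') - 2440587.5) * 86400.0 * 1000 as integer) AS ms_since_epoch;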

simonw commented

Here's SQL to populate that for an existing table:

INSERT INTO _content_meta (_rowid, version, created, updated, deleted)
SELECT c.rowid, 1, strftime('%s','now'), strftime('%s','now'), NULL
FROM content c
WHERE NOT EXISTS (
  SELECT 1
  FROM _content_meta cm
  WHERE cm._rowid = c.rowid
);
simonw commented

Putting this system in place gives me the ability to sync up tables! I can poll the _content_meta table for rows that have been modified since last time I checked.

I've modified the delete trigger to set the updated column as well - that's the column I'll poll against. I'm going to add an index to that column too.

simonw commented

New prompt:

CREATE TABLE [content] (
   [id] INTEGER PRIMARY KEY,
   [title] TEXT,
   [body] TEXT,
   [created] TEXT
);
CREATE TABLE [_content_history] (
   [_rowid] INTEGER PRIMARY KEY,
   [id] INTEGER,
   [title] TEXT,
   [body] TEXT,
   [created] TEXT,
   [_version] INTEGER,
   [_updated] INTEGER,
   [_mask] INTEGER
);

SQLite SQL triggers such that:

If a row is inserted into content, a matching row is inserted into _content_history
with the same values, and _version set to 1, _updated set to the current time, and
_mask set to an integer representing all four columns.

If a row is updated in content, a matching row is inserted into _content_history
with id, title, body and created values ONLY for columns that have changed, and
_version set to the current version + 1, _updated set to the current time, and
_mask set to an integer bitmask representing the columns that have changed.

(Copilot helped write that prompt)

It gave me:

-- Trigger for INSERT operation
CREATE TRIGGER content_insert_history
AFTER INSERT ON content
BEGIN
  INSERT INTO _content_history (id, title, body, created, _version, _updated, _mask)
  VALUES (new.id, new.title, new.body, new.created, 1, strftime('%s', 'now'), 15);
END;

-- Trigger for UPDATE operation
CREATE TRIGGER content_update_history
AFTER UPDATE ON content
FOR EACH ROW
BEGIN
  INSERT INTO _content_history (id, title, body, created, _version, _updated, _mask)
  SELECT new.id,
    CASE WHEN old.title != new.title THEN new.title ELSE NULL END,
    CASE WHEN old.body != new.body THEN new.body ELSE NULL END,
    CASE WHEN old.created != new.created THEN new.created ELSE NULL END,
    (SELECT MAX(_version) FROM _content_history WHERE id = old.id) + 1,
    strftime('%s', 'now'),
    (CASE WHEN old.title != new.title THEN 1 ELSE 0 END) +
    (CASE WHEN old.body != new.body THEN 2 ELSE 0 END) +
    (CASE WHEN old.created != new.created THEN 4 ELSE 0 END) +
    (CASE WHEN old.id != new.id THEN 8 ELSE 0 END)
  WHERE old.title != new.title OR old.body != new.body OR old.created != new.created;
END;
simonw commented

I said:

SQL to populate the _content_history table for the first time based on content

INSERT INTO _content_history (id, title, body, created, _version, _updated, _mask)
SELECT id, title, body, created, 1, strftime('%s', 'now'), 15
FROM content;
simonw commented

Another prompt:

Given this table:

|   _rowid |   id | title   | body         |    created |   _version |   _updated |   _mask |
|----------|------|---------|--------------|------------|------------|------------|---------|
|        4 |    4 | Three   | 3 is here    | 1680992364 |          1 | 1680992364 |      15 |
|        6 |    4 | Four    | Four is here |            |          2 | 1680992431 |       3 |
|        7 |    4 | 4       |              |            |          3 | 1680992437 |       1 |

Write SQLite SQL that can reconstruct the most recent version of the data. It needs to go
through them in order, updating its model of what id, title, body and created are based
on the _mask bitmap - if that bitmap indicates a column, that column should take the
new value from the row

Got back:

To reconstruct the most recent version of the data from the _content_history table, you can use a recursive Common Table Expression (CTE) to process the rows in the _content_history table in order of increasing _version. This will allow you to build a representation of the data by applying the changes indicated by the _mask bitmap.

Here's the SQL query to reconstruct the most recent version of the data:

WITH RECURSIVE
  history_cte(_rowid, id, title, body, created, _version, _updated, _mask, max_version) AS (
    SELECT h._rowid, h.id, h.title, h.body, h.created, h._version, h._updated, h._mask, MAX(h2._version)
    FROM _content_history h
    LEFT JOIN _content_history h2 ON h.id = h2.id
    WHERE h._version = 1
    GROUP BY h.id
    UNION ALL
    SELECT h._rowid, h.id,
      CASE WHEN h._mask & 1 THEN h.title ELSE c.title END,
      CASE WHEN h._mask & 2 THEN h.body ELSE c.body END,
      CASE WHEN h._mask & 4 THEN h.created ELSE c.created END,
      h._version, h._updated, h._mask, c.max_version
    FROM _content_history h
    JOIN history_cte c ON h.id = c.id AND h._version = c._version + 1
  )
SELECT id, title, body, created
FROM history_cte
WHERE _version = max_version;

This query first selects the rows with _version 1 and calculates the maximum _version for each id. Then, it iteratively processes the rows in the _content_history table by updating the fields based on the _mask bitmap. Finally, the query returns the most recent version of the data by selecting the rows where _version equals max_version.

I tried running this and it looked like it worked! I'm not convinced by it yet though - it needs very extensive testing.

simonw commented

I said:

What should a delete trigger do?

It suggested:

A DELETE trigger can be used to track the deletion of rows in the content table by inserting a row in the _content_history table that represents the deleted data. In this case, the DELETE trigger can be created to insert a row in the _content_history table with the deleted values and increment the _version. Additionally, you can set the _mask value to a specific number (e.g., -1) to indicate that this row was deleted.

CREATE TRIGGER content_delete_history
AFTER DELETE ON content
BEGIN
  INSERT INTO _content_history (id, title, body, created, _version, _updated, _mask)
  SELECT old.id, old.title, old.body, old.created,
    (SELECT MAX(_version) FROM _content_history WHERE id = old.id) + 1,
    strftime('%s', 'now'), -1
  FROM content
  WHERE content.id = old.id;
END;

I'll probably need to update that terrifying CTE too somehow.

simonw commented

That didn't seem to work: in an AFTER DELETE trigger the old row is already gone from content, so its FROM content WHERE content.id = old.id clause matches nothing. I got it to rewrite it to use VALUES (...) instead:

CREATE TRIGGER content_delete_history
AFTER DELETE ON content
BEGIN
  INSERT INTO _content_history (id, title, body, created, _version, _updated, _mask)
  VALUES (
    old.id,
    old.title,
    old.body,
    old.created,
    (SELECT COALESCE(MAX(_version), 0) FROM _content_history WHERE id = old.id) + 1,
    strftime('%s', 'now'),
    -1
  );
END;
simonw commented

I think this query might work!

WITH RECURSIVE
  history_cte(_rowid, id, title, body, created, _version, _updated, _mask, max_version) AS (
    SELECT h._rowid, h.id, h.title, h.body, h.created, h._version, h._updated, h._mask, MAX(h2._version)
    FROM _content_history h
    LEFT JOIN _content_history h2 ON h.id = h2.id
    WHERE h._version = 1
    GROUP BY h.id
    UNION ALL
    SELECT h._rowid, h.id,
      CASE WHEN h._mask & 1 THEN h.title ELSE c.title END,
      CASE WHEN h._mask & 2 THEN h.body ELSE c.body END,
      CASE WHEN h._mask & 4 THEN h.created ELSE c.created END,
      h._version, h._updated, h._mask, c.max_version
    FROM _content_history h
    JOIN history_cte c ON h.id = c.id AND h._version = c._version + 1
  )
SELECT id, title, body, created
FROM history_cte
WHERE _version = max_version and _mask != -1;
simonw commented

This query seems to be able to show the specific value for a specific column at a specific version of the row:

SELECT title
FROM _content_history
WHERE id = 2 AND (_mask & 1) AND _mask != -1 AND _version <=3
ORDER BY _version DESC
LIMIT 1;

Remove AND _version <=3 to get the most recent value for that column. The column is identified both by SELECT title and by the _mask & 1, which becomes & 2 for the second column, etc.
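
For example, the same pattern applied to the body column (bit 2 in the mask), with no version cap, gives the most recent value of body:

SELECT body
FROM _content_history
WHERE id = 2 AND (_mask & 2) AND _mask != -1
ORDER BY _version DESC
LIMIT 1;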

simonw commented

It looks like, given the _content_history table alone, I can restore the table contents to a specific point in time AND track the history of individual rows, with efficient storage and reasonable performance.

This is great!

Do I still need the _content_meta table at all? The only benefit that gives me is that it enables external processes to poll a table for changes since last time... but I think I can get the same benefit from the _content_history table too - especially with an index on _updated.
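
That index is a one-liner (the index name is arbitrary):

CREATE INDEX IF NOT EXISTS _content_history_updated ON _content_history(_updated);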

simonw commented

If I want to know everything that changed since 1680993708 I can do:

select * from _content_history where _updated > 1680993708 order by _updated

But I can be smarter than that - if I want to know what rows to copy fresh from the database and what rows to delete I can do:

select distinct
  'fetch' as action, id
from _content_history
  where _updated > 1680993708
union
select 'delete' as action, id
from _content_history
  where _updated > 1680993708 and _mask = -1

Could even join that against the content table to get back the full information needed to update the remote copy.
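
One possible shape for that join - deleted rows drop out naturally because they no longer exist in content, so the _mask = -1 branch above is still needed to spot deletions:

SELECT content.*
FROM content
JOIN (
  SELECT DISTINCT id
  FROM _content_history
  WHERE _updated > 1680993708
) changed ON changed.id = content.id;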

simonw commented

This also means that capturing a snapshot is as easy as grabbing the max _updated timestamp for the table (from _content_history) and recording that somewhere.
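
For example, a hypothetical bookkeeping table for those snapshot markers might look something like this:

CREATE TABLE IF NOT EXISTS _snapshots (
  id INTEGER PRIMARY KEY,
  table_name TEXT,
  as_of INTEGER,  -- max _updated in the history table at snapshot time
  note TEXT
);

INSERT INTO _snapshots (table_name, as_of, note)
SELECT 'content', MAX(_updated), 'before big operation'
FROM _content_history;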

simonw commented

For operations that alter the table in some way, snapshotting will have to work differently: it will have to capture a full copy of the table, and maybe a full copy of the table history as well. That's OK, I can live with that.

simonw commented

I'm going to wrap this all up in a sqlite-history Python library, maybe with a datasette-history plugin on top of it.

simonw commented

It would be neat if the Datasette plugin could automatically add history tables to any created tables in the database.

It could do this by checking for when the schema version increments, then re-scanning for new tables that are missing their history.

Unnecessary at first though - could instead have a datasette track-history data.db table1 table2 etc command.