simonw/datasette

500 "attempt to write a readonly database" error caused by "PRAGMA schema_version"

Opened this issue · 9 comments

simonw commented

I've not been able to replicate this myself yet, but I've seen log files from a user affected by it.

File "/usr/local/lib/python3.11/site-packages/datasette/views/base.py", line 89, in dispatch_request
await self.ds.refresh_schemas()
File "/usr/local/lib/python3.11/site-packages/datasette/app.py", line 371, in refresh_schemas
await self._refresh_schemas()
File "/usr/local/lib/python3.11/site-packages/datasette/app.py", line 386, in _refresh_schemas
schema_version = (await db.execute("PRAGMA schema_version")).first()[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/datasette/database.py", line 267, in execute
results = await self.execute_fn(sql_operation_in_thread)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/datasette/database.py", line 213, in execute_fn
return await asyncio.get_event_loop().run_in_executor(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/datasette/database.py", line 211, in in_thread
return fn(conn)
^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/datasette/database.py", line 237, in sql_operation_in_thread
cursor.execute(sql, params if params is not None else {})
sqlite3.OperationalError: attempt to write a readonly database

That's running the official Datasette Docker image on https://fly.io/ - it's causing 500 errors on every page of their site.

simonw commented

I thought it might relate to the "defensive mode" issue described here:

But I have since determined that the official Datasette Docker image does NOT run anything in defensive mode, so I don't think it's related to that.

simonw commented

Asked on the SQLite Forum if anyone has seen this before: https://sqlite.org/forum/forumpost/793a2ed75b

simonw commented

Here's the relevant code:

datasette/datasette/app.py

Lines 421 to 437 in 5890a20

async def _refresh_schemas(self):
internal_db = self.databases["_internal"]
if not self.internal_db_created:
await init_internal_db(internal_db)
self.internal_db_created = True
current_schema_versions = {
row["database_name"]: row["schema_version"]
for row in await internal_db.execute(
"select database_name, schema_version from databases"
)
}
for database_name, db in self.databases.items():
schema_version = (await db.execute("PRAGMA schema_version")).first()[0]
# Compare schema versions to see if we should skip it
if schema_version == current_schema_versions.get(database_name):
continue

This function is called on almost every request (everything that subclasses BaseView at least - need to remember that for the refactor in #2053 etc).

async def dispatch_request(self, request):
if self.ds:
await self.ds.refresh_schemas()

It uses PRAGMA schema_version as a cheap way to determine if the schema has changed, in which case it needs to refresh the internal schema tables.
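
For anyone unfamiliar with the PRAGMA: schema_version is a counter that SQLite increments on every schema change, so comparing it against a stored value is enough to detect a change. A quick standalone demo of that behaviour (not Datasette code):

import sqlite3

conn = sqlite3.connect(":memory:")
print(conn.execute("PRAGMA schema_version").fetchone()[0])  # 0 for a brand new database

conn.execute("create table foo (id integer primary key)")
print(conn.execute("PRAGMA schema_version").fetchone()[0])  # incremented by the CREATE TABLE

conn.execute("alter table foo add column name text")
print(conn.execute("PRAGMA schema_version").fetchone()[0])  # incremented again by the ALTER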

This was already the cause of a subtle bug here:

simonw commented

Here's a potential workaround: when I store the schema versions, I could also store an MD5 hash of the full schema (select group_concat(sql) from sqlite_master). When I read the schema version with PRAGMA schema_version I could catch that exception and, if I see it, calculate that MD5 hash again as a fallback and use that to determine if the schema has changed instead.
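
Sketched out, the fallback might look something like this (get_schema_fingerprint is a hypothetical helper, not what would necessarily ship):

import hashlib
import sqlite3

def get_schema_fingerprint(conn):
    # Try the cheap check first; fall back to hashing the full schema SQL
    # if the PRAGMA fails with "attempt to write a readonly database".
    # Note the two branches return different types (int vs str) - fine
    # as long as fingerprints are stored and compared consistently.
    try:
        return conn.execute("PRAGMA schema_version").fetchone()[0]
    except sqlite3.OperationalError:
        schema_sql = conn.execute(
            "select group_concat(sql) from sqlite_master"
        ).fetchone()[0]
        return hashlib.md5((schema_sql or "").encode("utf-8")).hexdigest()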

The performance overhead of this needs investigating: how much more expensive is calculating the MD5 of that SQL query result compared to just PRAGMA schema_version, especially on a database with a lot of tables?
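
A micro-benchmark along these lines would answer that (a sketch assuming an in-memory database, not the exact script used for the numbers below):

import hashlib
import sqlite3
import timeit

def make_db(num_tables):
    conn = sqlite3.connect(":memory:")
    for i in range(num_tables):
        conn.execute(f"create table t{i} (id integer primary key, value text)")
    return conn

def via_pragma(conn):
    return conn.execute("PRAGMA schema_version").fetchone()[0]

def via_md5(conn):
    sql = conn.execute("select group_concat(sql) from sqlite_master").fetchone()[0]
    return hashlib.md5(sql.encode("utf-8")).hexdigest()

# Create the tables outside the timed section so only the checks are measured
for num_tables in (100, 200):
    conn = make_db(num_tables)
    pragma_s = timeit.timeit(lambda: via_pragma(conn), number=1000)
    md5_s = timeit.timeit(lambda: via_md5(conn), number=1000)
    print(f"{num_tables} tables: pragma={pragma_s:.4f}s md5={md5_s:.4f}s")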

simonw commented

I got ChatGPT (code execution alpha) to run a micro-benchmark for me. This was the conclusion:

The benchmark using PRAGMA schema_version is approximately 1.36 times faster than the benchmark using hashlib.md5 for the case with 100 tables. For the case with 200 tables, the benchmark using PRAGMA schema_version is approximately 2.33 times faster than the benchmark using hashlib.md5.

Here's the chart it drew me:

(chart: PRAGMA schema_version vs MD5 hash timings, measured at 100 and 200 tables)

(It's a pretty rubbish chart though: it only took measurements at 100 and 200 tables and drew a straight line between the two. I should have told it to measure every 10 tables and plot that.)

And the full transcript: https://gist.github.com/simonw/2fc46effbfbe49e6de0bcfdc9e31b235

The benchmark looks good enough at first glance that I don't feel the need to be more thorough with it. PRAGMA schema_version is faster, but not by enough that the overhead of the MD5 fallback is worth worrying about.

I'm tempted to add something to the /-/versions page that tries to identify if this is a problem or not though.
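
The check itself could be as simple as this (a hypothetical sketch; schema_version_usable is not an existing Datasette function):

import sqlite3

def schema_version_usable(conn):
    # True if PRAGMA schema_version can be read without hitting the
    # "attempt to write a readonly database" error - /-/versions could
    # report this per database.
    try:
        conn.execute("PRAGMA schema_version").fetchone()
        return True
    except sqlite3.OperationalError:
        return False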

simonw commented

Here's a much better chart, which shows that MD5 performance unsurprisingly gets worse as the number of tables increases while schema_version remains constant:

(chart: MD5 timing rising with the number of tables, schema_version staying flat)

simonw commented

I tightened up the benchmark (it was measuring the time taken to create the tables too) and got this:

(chart: updated benchmark with table creation excluded from the timed section)

Really interesting how you are using ChatGPT in this.

I tried deploying the fix you submitted, but I'm still getting the same errors. I can paste the logs here if needed, but I really don't see anything new in them.