EarthScope/rover

ERROR: Cursor exit: database is locked

Closed this issue · 9 comments

These messages are from a relatively simple retrieve:

retrieve  DEFAULT: Downloading TA_034A 2010-084 (N_S 1/636; day 77/359)
retrieve  DEFAULT: Downloading TA_034A 2010-085 (N_S 1/636; day 78/359)
download.28996    ERROR: Cursor exit: database is locked
download.29009    ERROR: Cursor exit: database is locked
retrieve  DEFAULT: Downloading TA_034A 2010-086 (N_S 1/636; day 79/359)
retrieve  DEFAULT: Downloading TA_034A 2010-087 (N_S 1/636; day 80/359)
retrieve  DEFAULT: Downloading TA_034A 2010-088 (N_S 1/636; day 81/359)

My guess is that these are from database contention between workers. Hopefully these are addressed by increasing the timeout to 60 seconds as done in 60ab1e8.

No luck, increasing the connection timeout did not work.

Error is coming from sqlite.py for class CursorContext::

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type:
            if self._quiet:
                self._support._log.debug('Cursor exit: %s' % exc_val)
            else:
                self._support._log.error('Cursor exit: %s' % exc_val)
        else:
            self._support._db.commit()  # probably implied by close?
        self._cursor.close()
        return False  # propagate any exceptions

This occurs on macOS (Python 3.6) and CentOS (Python 3.7).

Arg. This error just killed a multi-day running rover collection. This implies that it occurred on the main, orchestrating process and not just a download or ingestion process.

P.S. perhaps this was just the report of an exception that happening during the run...

In the handful of errors I've collected each failure is an exception at the delete statement due to the database being locked (in lock.py), i.e. this is the symptom of lock collision:

    def release(self):
        self._log.debug('Releasing lock on %s with %s' % (self._table, self._key))
        with self._db:
            c = self._db.cursor()
            c.execute('begin')
            c.execute('delete from %s where key = ?' % self._table, (self._key,))

In an initial review of the LockContext code, an acquire creates a lock by writing a row to the database, but it does so in a transaction that remains uncommitted until the cursor is closed. From my understanding this means the insert'ed lock cannot be seen by other processes, rendering it ineffective. Hmmm, I must be missing something here.

The retrieve request I'm using to test this is here:

TA * * BHZ 2010-01-01 2010-01-10

In an initial review of the LockContext code, an acquire creates a lock by writing a row to the database, but it does so in a transaction that remains uncommitted until the cursor is closed. From my understanding this means the insert'ed lock cannot be seen by other processes, rendering it ineffective. Hmmm, I must be missing something here.

Actually, the with self._db: syntax will issue a commit or rollback depending on success. So that mystery is solved.

This might be solved in 8dfad0e, which removes the code that locked work between indexing workers. The mseedindex processes themselves will not conflict as they are transactional/atomic regarding database changes and there does not appear to be any other reason to lock the indexing workers in rover.

Will close after more testing.

This issue has not been observed in testing since 8dfad0e, closing.