lnx-search/lnx

[BUG][0.8.0] index writer lock issue during k8s pod restart

Plasmatium opened this issue · 6 comments

This issue is not present in 0.7.0, but is present in 0.8.0.

Error log detail

2022-01-16T14:30:18.000995Z ERROR error during lnx runtime: failed to load existing indexes due to error Failed to acquire Lockfile: LockBusy. Some("Failed to acquire index lock. If you are using a regular directory, this means there is already an `IndexWriter` working on this `Directory`, in this process or in a different process.")

Steps to reproduce:
Restart or recreate the pod. k8s will send SIGTERM to the pod. This issue is not present in 0.7.0.
In k8s I can add a preStop hook that executes kill -SIGINT $(pgrep lnx) to send CTRL-C to the lnx process, but if lnx has panicked or crashed, how can I unlock the index writer? For example, if there is a lock file, it could be removed before lnx starts: rm -rf index/some-lock-file && lnx

Hi, the easiest way to resolve the issue is to remove the $(cwd)/index/index-storage/<index-id>/data/.tantivy-lock file.
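
A minimal sketch of automating that cleanup before startup, assuming the directory layout given above and that no other lnx process is currently using the index (removing the lock while a writer is still live could let two writers touch the same index). The function name clear_stale_locks and the default ./index root are illustrative only; they are not part of lnx.

```shell
#!/bin/sh
# Hypothetical helper: remove leftover writer lock files under an lnx index root.
# The layout (index-storage/<index-id>/data/.tantivy-lock) is taken from the
# reply above; the function name and default root are assumptions.
clear_stale_locks() {
    index_root="${1:-./index}"
    for lock in "$index_root"/index-storage/*/data/.tantivy-lock; do
        # When the glob matches nothing, the pattern itself is returned; skip it.
        [ -f "$lock" ] || continue
        rm -f -- "$lock"
        echo "removed stale lock: $lock"
    done
}

# Possible usage in a container entrypoint (only safe when no lnx is running):
#   clear_stale_locks ./index && exec lnx
```

Only the lock files are removed, so the index data itself is left untouched. Running it when no lock file exists is a no-op.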

Are you able to give some more info on how the process was shut down / what was going on beforehand?

The server should handle SIGINT automatically and shut everything down cleanly. However, if you've done a large amount of indexing just before, k8s may be killing the process after a timeout, since shutting down right after indexing (especially a large number of docs) can take a little while to clean up and park all threads.

You should see a set of logs during shutdown saying things like the writer actor has shut down, etc.

For some additional clarification: currently, 0.8.0 only handles SIGINT correctly (I have a fix for that coming soon), so k8s sending SIGTERM is likely causing the server to abort before everything has shut down correctly.

Hi, thank you for replying. k8s definitely sends SIGTERM. When I restart the statefulset (e.g. on a new deployment), I don't see the shutdown logs.

Besides, I hit another bug, involving a date field with fast enabled. My field is named update_at with type date; when I search using "order_by": "update_at", it panics with a chrono error about failing to parse the date. For now I have a workaround: just using i64.

I can create a new issue for this when I have time to reproduce it.

The SIGTERM issue should now be fixed on master; the system should now shut down correctly on SIGINT, SIGTERM and SIGQUIT.

As for the other issue, yes, please do open a separate issue; the panic is definitely a bug. I suspect it's Chrono panicking while trying to parse the date format, which is quite strict.

@Plasmatium I think I've accidentally run into your panic, if you're using multi-value fields / have the fast key set to multi on the index declaration.

Marking this as closed as this should be fixed from 0.8.1 onwards after testing some deployments.