[BUG][0.8.0] index writer lock issue during k8s pod restart

Question

Plasmatium opened this issue 3 years ago · 6 comments

This issue does not occur in 0.7.0 but does occur in 0.8.0.

Error log detail

2022-01-16T14:30:18.000995Z ERROR error during lnx runtime: failed to load existing indexes due to error Failed to acquire Lockfile: LockBusy. Some("Failed to acquire index lock. If you are using a regular directory, this means there is already an `IndexWriter` working on this `Directory`, in this process or in a different process.")

Steps to reproduce:
Restart or recreate the pod. k8s will send SIGTERM to the pod. This issue is not present in 0.7.0.
In k8s I can add a preStop hook that runs kill -SIGINT $(pgrep lnx) to send CTRL-C to the lnx process, but if lnx has panicked or crashed, how can I unlock the index writer? For example, is there a lock file that could be removed before lnx starts: rm -rf index/some-lock-file && lnx

Answer 1 · 2022-01-16T16:08:36.000Z

Hi, the easiest way to resolve the issue is to remove the $(cwd)/index/index-storage/<index-id>/data/.tantivy-lock file.
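As a rough sketch of how that cleanup could be scripted before lnx starts (for example in a container entrypoint): the helper name remove_stale_locks is hypothetical, and the directory layout is assumed to match the path above. This is only safe when no other lnx process is using the same index directory, since the lock exists to stop two writers from touching one index.

```shell
#!/bin/sh
# Hypothetical helper (not part of lnx): delete leftover .tantivy-lock files
# under an index storage root. Assumes the layout
#   <root>/<index-id>/data/.tantivy-lock
# Only safe when no other lnx process is using the same directory.
remove_stale_locks() {
    root="$1"
    # Nothing to clean up if the directory does not exist yet (first start).
    [ -d "$root" ] || return 0
    # Remove only files with the exact lock file name; leave all index data alone.
    find "$root" -type f -name '.tantivy-lock' -exec rm -f {} +
}

# Example entrypoint usage (commented out so the snippet can be sourced safely):
# remove_stale_locks "./index/index-storage" && exec lnx
```

Matching on the exact file name keeps the cleanup narrower than an rm -rf on a directory, so segment files and metadata are left untouched.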

Are you able to give some more info on how the process was shut down / what was going on beforehand?

The server should handle SIGINT automatically and shut everything down correctly. However, if you've done a large amount of indexing just before, k8s may be killing the process after a timeout, since shutting down right after indexing (especially a large number of docs) can take a little while to clean up and park all threads.

You should see a set of logs during shutdown saying things like the writer actor has shut down, etc.

Answer 2 · 2022-01-16T16:46:33.000Z

After some additional clarification: currently, 0.8.0 only handles SIGINT correctly (I have a fix for that coming soon), so k8s sending SIGTERM is likely causing the server to abort before everything has shut down correctly.

Answer 3 · 2022-01-17T02:36:37.000Z

Hi, thank you for replying. k8s definitely sends SIGTERM. When I restart the statefulset (e.g. for a new deployment), I don't see the shutdown logs.

Besides, I hit another bug, about a date field with fast enabled. My field is named update_at with type date; when I search with "order_by": "update_at", it panics with a chrono error about failing to parse the date. For now I have a workaround: just use i64.

I think I can create a new issue when I have time to reproduce this.

Answer 4 · 2022-01-17T08:16:46.000Z

The SIGTERM issue should be fixed now on master; the system should now shut down correctly on SIGINT, SIGTERM and SIGQUIT.

As for the other issue, yes, please do open a separate issue; the panic is definitely a bug. ~~I feel like it's probably Chrono panicking trying to parse the date format which is quite strict.~~

Answer 5 · 2022-01-23T14:25:58.000Z

@Plasmatium I think I've accidentally run into your panic, if you're using multi-value fields / have the fast key set to multi in the index declaration.

Answer 6 · 2022-01-23T23:16:37.000Z

Marking this as closed as this should be fixed from 0.8.1 onwards after testing some deployments.