Stale lock file prevents Prometheus start
tsdb puts a lock file into its directory to ensure exclusive access. This lock file is PID-based (i.e. not flock()-based), which leads to the following problem:
- a lock file is created
- the node crashes
- Prometheus boots, finds the stale lock, checks the PID, and if that PID is now occupied by some other process, decides the database is locked and exits.
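To make the failure mode concrete, here is a minimal Go sketch of such a PID-based staleness check; the lock-file path and format are assumptions for illustration, not TSDB's actual layout:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// lockIsHeld is the naive PID-based check: read a PID from the lock file
// and probe it with signal 0. After a crash and reboot, an unrelated
// process may have been assigned the same PID, so the check reports the
// lock as held even though no Prometheus instance owns it.
func lockIsHeld(path string) (bool, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return false, nil // no lock file at all
	}
	if err != nil {
		return false, err
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return false, err
	}
	// Signal 0 delivers nothing; it only checks whether the PID exists.
	// ESRCH means the PID is free. Any other result means *some* process
	// (not necessarily ours!) currently occupies that PID.
	if err := syscall.Kill(pid, 0); err == syscall.ESRCH {
		return false, nil
	}
	return true, nil
}

func main() {
	held, err := lockIsHeld("data/lock") // path is an assumption
	if err != nil {
		fmt.Fprintln(os.Stderr, "checking lock:", err)
		os.Exit(1)
	}
	if held {
		fmt.Fprintln(os.Stderr, "database appears locked; exiting")
		os.Exit(1)
	}
	fmt.Println("lock is stale or absent; safe to start")
}
```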
I believe there are two possible solutions:
- provide an option to configure the lock file location, so it can be placed on tmpfs
- switch to a flock()-based approach.
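For comparison, a minimal sketch of the flock() approach in Go, assuming Linux/BSD semantics. The kernel ties the lock to the open file description and releases it automatically when the process dies, so a crash can never leave a stale lock behind:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// acquireFlock takes an exclusive, non-blocking flock() on the given file.
// Unlike a PID lock file, there is no staleness to detect: if the process
// crashes or the node reboots, the kernel drops the lock on its own.
func acquireFlock(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, fmt.Errorf("data directory locked by another process: %w", err)
	}
	return f, nil // keep f open for the lifetime of the process
}

func main() {
	lock, err := acquireFlock("data/lock") // path is an assumption
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer lock.Close() // closing releases the lock
	fmt.Println("exclusive access acquired")
}
```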
There's an option to disable the lock file, which should often be fine if your scheduling system ensures safety by other means. Prometheus exposes this option through a flag as well.
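(If I remember correctly, in Prometheus 2.x this is the `--storage.tsdb.no-lockfile` flag.)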
OS file locks were used initially; however, there's a variety of rants about how badly everything around them is broken. Plain PID lock files remain the sanest option according to my research.
Two blog posts describing the madness quite well:
Ok, I see. Thanks!
In the absence of a better option than no locking or a conventional lock file, I'm closing this here.
> OS file locks were used initially; however, there's a variety of rants about how badly everything around them is broken. Plain PID lock files remain the sanest option according to my research.
>
> Two blog posts describing the madness quite well:
Those are interesting reads, and the conclusion I draw from them is that POSIX locks should work just fine if you use them on sentinel files, i.e. not on files holding real data. Modulo NFS, of course, but that's a special case.
Were there any problems in practice with the Prometheus 1.x file locks?
One problem I see is that people will run Prometheus on NFS. Regardless of whether we officially support it, we shouldn't block it unnecessarily.
The other is that we are working with a bunch of files here, some coming, some going as Prometheus runs. Getting additional management of locks over those right adds complexity. What do we get for that complexity, though? All those locks are advisory (both articles explain why mandatory locks are a horrible idea), which means every other application would have to use the same kind of locking we do, and get it right; if not, they can still do whatever they want with the files. And there's no real use case for multiple processes writing our files, e.g. by locking byte ranges and concurrently writing to disjoint ranges.
The only thing we can and want to prevent with locking is two Prometheus processes running against the same data directory. A simple PID lock file seems to do this job perfectly, with virtually no complexity overhead.
There are other things, of course: out-of-process modifications to our immutable blocks can be used to implement things like complex retention policies. However, there's a safe and simple workflow for that: disable compactions, write replacement files (through an atomic `mv`), trigger TSDB to reload its files, re-enable compactions.
I don't have a full understanding of the degree of file locking Prometheus 1 does. I suppose the question is more whether it brings as many benefits as one would like to imagine, as the linked blog posts elaborate.
Edit: reading again, I think you are rather hinting at having a lock file over which a POSIX advisory lock is acquired, which won't suffer from the false-positive problem of PID reuse? I had not considered that, actually, but it sounds feasible.
Prometheus 1.x was only ever locking the sentinel `DIRTY` file.
By using a POSIX lock for it, we avoided exactly the problem @pborzenkov has encountered before filing this issue.
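For concreteness, a sketch of that scheme in Go, using a POSIX record lock (fcntl with F_SETLK) over a sentinel file via golang.org/x/sys/unix. The sentinel name echoes the 1.x DIRTY file; everything else is illustrative:

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// lockSentinel takes an exclusive POSIX advisory lock over the whole
// sentinel file. Like flock(), the lock dies with the process, so a
// crashed node can never leave a lock that looks held; unlike a PID
// file, there is nothing to parse and no PID-reuse false positive.
func lockSentinel(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		return nil, err
	}
	lk := unix.Flock_t{
		Type:   unix.F_WRLCK, // exclusive (write) lock
		Whence: 0,            // relative to start of file
		Start:  0,
		Len:    0, // 0 = lock the entire file
	}
	// F_SETLK is the non-blocking variant: it fails immediately if
	// another process already holds a conflicting lock.
	if err := unix.FcntlFlock(f.Fd(), unix.F_SETLK, &lk); err != nil {
		f.Close()
		return nil, fmt.Errorf("sentinel already locked: %w", err)
	}
	// Must stay open: with POSIX locks, closing ANY descriptor on this
	// file drops the lock (one of the warts the linked posts describe).
	return f, nil
}

func main() {
	lock, err := lockSentinel("data/DIRTY")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer lock.Close()
	fmt.Println("sentinel locked; this process owns the data directory")
}
```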
The only case where the approach didn't work was on (arguably old-ish/broken) NFS setups or other non-POSIX-compliant file systems. I would have thought that the problem @pborzenkov reported above is way more common than the problem of non-POSIX-compliant file systems. The option of disabling locks could also serve NFS users (who could disable locks, or we could offer an option of PID-based locking just for them).
It would be (or would have been, I guess) good to get some stats first on which problem is actually more common.
Just edited my first response, as I realized afterwards what you actually meant. That might actually be a good solution. It's not a breaking change, I'd say, so we can certainly improve in that direction at any point.
> I think you are rather hinting at having a lock file over which a POSIX advisory lock is acquired, which won't suffer from the false-positive problem of PID reuse? I had not considered that, actually, but it sounds feasible.
Precisely. That's what Prometheus 1.x did. (At least in our own code. I think LevelDB did some locking on its own.)