scrapy/queuelib

Failed write corrupts LIFO queue file

ArturGaspar opened this issue · 5 comments

If a write fails when pushing an item to the LIFO queue, some data may be actually written before an exception is raised. When trying to pop, the end of the file is supposed to have the size of the last item, but it actually contains whatever was in the partially written data from previous failed push() calls.

>>> from queuelib import LifoDiskQueue
>>> q = LifoDiskQueue("./small_fs/queue")  # Filesystem has less than 100KB available.
>>> for i in range(100): 
...     q.push(b'a' * 1000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "queuelib/queue.py", line 152, in push
    self.f.write(string)
IOError: [Errno 28] No space left on device
>>> q.pop()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "queuelib/queue.py", line 162, in pop
    self.f.seek(-size-self.SIZE_SIZE, os.SEEK_END)
IOError: [Errno 22] Invalid argument

The error above comes from the value of size, which is decoded from the last SIZE_SIZE bytes of the file. When that value is larger than the file itself the seek will fail.

attempt (not finished yet) to fix of error either for LIFO and FIFO
https://github.com/artscrap/queuelib/tree/fixdiskwrite

I’m thinking we could solve this by:

  • Replacing every \ by \\ in the input string while pushing.
  • Adding \n after the size in every write.
  • If upon reading where \n should be what’s read is not \n, find the earliest \n, and truncate the file contents to that point.
  • Replacing every \\ by \ on the string read while popping.

Because of the format change, as well as a potential performance impact of slash conversions, I wonder whether we should change the existing queue class or implement a new one that follows this approach (e.g. RecoverableLifoDiskQueue).

If we do modify the existing queue class, I wonder how to handle format versioning. For example, should we detect old-format files and upgrade them on disk initialization? Or should we empty such files and start from scratch?

@dangra Thoughts?

Not sure if this makes sense here or as a separate issue, but I ran into a different corruption FIFO issue: When the drive ran out of space, somehow the info.json file got truncated. Right now I'm manually recovering it by iterating through the entries up to the entry I know I popped/added.

I am having the same issue. It should at least be documented that the LifoDiskQueue is not safe against corruption at the moment.