scrapy/queuelib

PickleFifoDiskQueue and FifoMemoryQueue throw EOF when used to persist a crawl

nramirezuy opened this issue · 7 comments

AUTHOR: @Varriount

When using Scrapy's PickleFifoDiskQueue and FifoMemoryQueue objects for persistence and request scheduling, an EOFError is raised:

Traceback (most recent call last):
  File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\commands\crawl.py", line 58, in run
    self.crawler_process.start()
  File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\crawler.py", line 253, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "C:\x64\python27\lib\site-packages\twisted\internet\base.py", line 1194, in run
    self.mainLoop()
  File "C:\x64\python27\lib\site-packages\twisted\internet\base.py", line 1203, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "C:\x64\python27\lib\site-packages\twisted\internet\base.py", line 825, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\utils\reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
  File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\core\engine.py", line 105, in _next_request
    if not self._next_request_from_scheduler(spider):
  File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\core\engine.py", line 132, in _next_request_from_scheduler
    request = slot.scheduler.next_request()
  File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\core\scheduler.py", line 68, in next_request
    request = self._dqpop()
  File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\core\scheduler.py", line 98, in _dqpop
    d = self.dqs.pop()
  File "C:\x64\python27\lib\site-packages\queuelib\pqueue.py", line 43, in pop
    m = q.pop()
  File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\squeues.py", line 21, in pop
    return deserialize(s)
exceptions.EOFError:

There isn't any problem when not persisting the crawl (omitting '-s JOBDIR=crawl-1'), so my best guess is that the problem lies mainly with PickleFifoDiskQueue.
I'm running Python 2.7 x64 on Windows 8 x64, using the latest Scrapy from the master branch. This bug affects the latest stable Scrapy build as well.

Edit: After some investigation, it seems that a race condition is occurring when the FifoDiskQueue object's headf and tailf file descriptors point to the same file. Adding a 'sleep' to the pop() method greatly decreases the occurrence of the EOF exception over multiple runs.
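
A rough sketch of that diagnostic tweak, written as a monkeypatch rather than an in-place edit (the wrapper name and delay value are mine, and, as the rest of the thread shows, the root cause turned out not to be timing-related):

import time
from queuelib import FifoDiskQueue

# Wrap the original pop() so every call pauses briefly before reading.
_original_pop = FifoDiskQueue.pop

def _delayed_pop(self):
    time.sleep(0.01)  # arbitrary small delay, for diagnosis only
    return _original_pop(self)

FifoDiskQueue.pop = _delayed_pop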

Encountered the same error. My environment is Windows 7, both 32-bit and 64-bit.

I looked into this by inserting a logging statement at the beginning of FifoDiskQueue.push:

logging.warning("Writing to disk queue: %s", binascii.b2a_hex(string))

and another at the end of FifoDiskQueue.pop, just before the return statement:

logging.warning("Reading from disk queue: %s", binascii.b2a_hex(data))

Here is the relevant part of the log:

2015-08-23 22:04:41 [root] WARNING: Writing to disk queue: 80027d7101285504626f64797102557e656e67696e653d3526776f72643d254535254139254234254535253834254246254536253841253941254538254137254136254535254241253841254535253845253832254535254145254236267175657279547970653d32266b65793d4a53484825343025323925323125323426636f6d70616e7949643d3236393432710355095f656e636f64696e67710455057574662d3871055507636f6f6b69657371067d710755046d65746171087d71092855056465707468710a4b05550a636f6d70616e795f6964710b4d3e6955076b6579776f7264710c5815000000e5a9b4e584bfe68a9ae8a7a6e5ba8ae58e82e5aeb6710d550b656e67696e655f74797065710e5503333630710f7555076865616465727371107d7111550c436f6e74656e742d5479706571125d711355216170706c69636174696f6e2f782d7777772d666f726d2d75726c656e636f64656471146173550375726c7115584b000000687474703a2f2f6f70656e312e6668743336302e636f6d2f526573744c6974654170692f52616e6b696e672f44656c6574654b6579776f7264517965727942794d6f72652e616374696f6e550b646f6e745f66696c74657271168955087072696f7269747971174b31550863616c6c6261636b7118551770617273655f7375626d697373696f6e5f726573756c74711955066d6574686f64711a5504504f5354711b55076572726261636b711c4e752e
2015-08-23 22:04:42 [scrapy] DEBUG: Crawled (200) <GET https://www.haosou.com/s?ie=utf-8&shb=1&src=360sou_newhome&q=%E5%A9%B4%E5%84%BF%E6%B8%B8%E6%B3%B3%E6%96%B9%E6%B1%A0%E5%8E%82%E5%AE%B6&> (referer: None)
2015-08-23 22:04:42 [root] WARNING: Reading from disk queue: 80027d7101285504626f64797102557e656e67696e653d3526776f72643d254535254139254234254535253834254246254536253841253941254538254137254136254535254241253841254535253845253832254535254145254236267175657279547970653d32266b65793d4a53484825343025323925323125323426636f6d70616e7949643d3236393432710355095f656e636f64696e67710455057574662d3871055507636f6f6b69657371067d710755046d65746171087d71092855056465707468710a4b05550a636f6d70616e795f6964710b4d3e6955076b6579776f7264710c5815000000e5a9b4e584bfe68a9ae8a7a6e5ba8ae58e82e5aeb6710d550b656e67696e655f74797065710e5503333630710f7555076865616465727371107d7111550c436f6e74656e742d5479706571125d711355216170706c69636174696f6e2f782d7777772d666f726d2d75726c656e636f64656471146173550375726c7115584b000000687474703a2f2f6f70656e312e6668743336302e636f6d2f526573744c6974654170692f52616e6b696e672f44656c6574654b6579776f7264517965727942794d6f72652e616374696f6e550b646f6e745f66696c74657271168955087072696f7269747971174b31550863616c6c6261636b7118551770617273655f7375626d697373696f6e5f726573756c74711955066d6574686f6471

Clearly, it reads only a portion of what is written.
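
For anyone wanting to reproduce this instrumentation without editing the installed package, a monkeypatch along these lines should work (a sketch only; the wrapper names are mine):

import binascii
import logging

from queuelib import FifoDiskQueue

_orig_push, _orig_pop = FifoDiskQueue.push, FifoDiskQueue.pop

def _logged_push(self, string):
    logging.warning("Writing to disk queue: %s", binascii.b2a_hex(string))
    return _orig_push(self, string)

def _logged_pop(self):
    data = _orig_pop(self)
    if data is not None:
        logging.warning("Reading from disk queue: %s", binascii.b2a_hex(data))
    return data

FifoDiskQueue.push, FifoDiskQueue.pop = _logged_push, _logged_pop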

It seems to be some weirdness with how the underlying disk file is being updated, possibly something specific to Windows?
In any case, I've tried fixing this, but have had no luck with ordinary file-handling code. Perhaps using the low-level OS file functions would work?

I had the same suspicion. I will try Scrapy on Linux later today.

Just now I changed my log statements to include the thread ID, as follows:

logging.warning("[Thread %s] Writing to disk queue: %s",
                threading.current_thread().name, binascii.b2a_hex(string))


2015-08-24 08:47:18 [root] WARNING: [Thread MainThread] Writing to disk queue: 80027d7101285504626f647971025478010000656e67696e653d3126646f6d61696e3d68786373702e6668743336302e636f6d26776f72643d25453525423825423825453525423725394525453525384425393725453725393325394325453525414425393026717565727956616c75653d254537254143254143312545392541312542352545372541432541433225453525393025384426636f6d70616e7949643d32373935352672616e6b55726c3d687474702533412532462532467777772e62616964752e636f6d2532467325334669652533447574662d3825323666253344382532367273765f6270253344302532367764253344253235453525323542382532354238253235453525323542372532353945253235453525323538442532353937253235453725323539332532353943253235453525323541442532353930253236253236696e70757454253344353132267175657279547970653d3226717565727954696d653d323031352d30382d32342b3038253341343725334131382e333239303030710355095f656e636f64696e67710455057574662d3871055507636f6f6b69657371067d710755046d65746171087d71092855056465707468710a4b03550c6c6173745f75706461746564710b636461746574696d650a6461746574696d650a710c550a07df0818082f120505288552710d55076b6579776f7264710e580f000000e5b8b8e5b79ee58d97e7939ce5ad90710f550a636f6d70616e795f696471104d336d550472616e6b71114b02550e636f6d70616e795f646f6d61696e7112581000000068786373702e6668743336302e636f6d711355047061676571144b01550b656e67696e655f7479706571155505426169647571167555076865616465727371177d7118550c436f6e74656e742d5479706571195d711a55216170706c69636174696f6e2f782d7777772d666f726d2d75726c656e636f646564711b6173550375726c711c584f000000687474703a2f2f6f70656e312e6668743336302e636f6d2f526573744c6974654170692f52616e6b696e672f496e736572744b6579776f72645175657279526573756c74546f44422e616374696f6e550b646f6e745f66696c746572711d8955087072696f72697479711e4b93550863616c6c6261636b711f551770617273655f7375626d697373696f6e5f726573756c74712055066d6574686f6471215504504f5354712255076572726261636b71234e752e
2015-08-24 08:47:18 [root] WARNING: [Thread MainThread] Reading from disk queue: 80027d7101285504626f647971025478010000656e67696e653d3126646f6d61696e3d68786373702e6668743336302e636f6d26776f72643d25453525423825423825453525423725394525453525384425393725453725393325394325453525414425393026717565727956616c75653d254537254143254143312545392541312542352545372541432541433225453525393025384426636f6d70616e7949643d32373935352672616e6b55726c3d687474702533412532462532467777772e62616964752e636f6d2532467325334669652533447574662d3825323666253344382532367273765f6270253344302532367764253344253235453525323542382532354238253235453525323542372532353945253235453525323538442532353937253235453725323539332532353943253235453525323541442532353930253236253236696e70757454253344353132267175657279547970653d3226717565727954696d653d323031352d30382d32342b3038253341343725334131382e333239303030710355095f656e636f64696e67710455057574662d3871055507636f6f6b69657371067d710755046d65746171087d71092855056465707468710a4b03550c6c6173745f75706461746564710b636461746574696d650a6461746574696d650a710c550a07df0818082f120505288552710d55076b6579776f7264710e580f000000e5b8b8e5b79ee58d97e7939ce5ad90710f550a636f6d70616e795f696471104d336d550472616e6b71114b02550e636f6d70616e795f646f6d61696e7112581000000068786373702e6668743336302e636f6d711355047061676571144b01550b656e67696e655f7479706571155505426169647571167555076865616465727371177d7118550c436f6e74656e742d5479706571195d71

The results show that the writes and the reads both happen in the same thread (the MainThread), so this is not a threading race.

In FifoDiskQueue.__init__, headf and tailf are opened as two separate file descriptors via two _openchunk calls. Perhaps Windows doesn't guarantee that, at any given moment, both descriptors see the same contents of the file.

OK. It works smoothly on Linux. This will probably take forever for the Scrapy dev team to fix; Windows is practically abandoned territory.

I added a logging line after the os.read call in the pop method:

def pop(self):
    tnum, tcnt, toffset = self.info['tail']
    if [tnum, tcnt] >= self.info['head']:
        return
    tfd = self.tailf.fileno()
    szhdr = os.read(tfd, self.szhdr_size)
    if not szhdr:
        return
    size, = struct.unpack(self.szhdr_format, szhdr)
    data = os.read(tfd, size)
    logging.warning("size in header is %d, size read is %d", size, len(data)) # <=== this line
    tcnt += 1
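    # ... rest of pop() unchanged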

And it prints:

2015-08-24 04:53:23 [root] WARNING: size in header is 488, size read is 465

I am thinking of adding a check after the os.read and continuously reading until I get the amount of data specified in szhdr. Will update when I have progress.
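
A sketch of that idea (the workaround proposed here, not the fix that was eventually merged; note that with the text-mode behaviour identified below, a Ctrl-Z byte in the data would still be treated as end-of-file, so a retry loop alone would not be enough):

import os

def read_exact(fd, size):
    # Keep calling os.read until the byte count promised by the size header
    # has been collected, or until a genuine end-of-file is reached.
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = os.read(fd, remaining)
        if not chunk:
            break  # real EOF: fewer bytes are available than the header promised
        chunks.append(chunk)
        remaining -= len(chunk)
    return b''.join(chunks)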

OK. I finally traced it down to the root of the problem.

The solution is adding one character:

def _openchunk(self, number, mode='rb'):

Without the b in the mode, Windows interprets the file as a text file, and os.read(size_you_want) may return much less data than you asked for even though more bytes are available.
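
A minimal sketch of the failure mode under Python 2 on Windows (on Linux, text and binary mode behave identically and both reads return the full payload; the file name is made up):

import os

payload = b'spam\r\neggs\x1aham'  # binary data containing CRLF and Ctrl-Z, both of
                                  # which pickled request data can easily contain
with open('chunk0', 'wb') as f:
    f.write(payload)

text_f = open('chunk0', 'r')   # text mode: the CRT descriptor is put in O_TEXT mode
print(len(os.read(text_f.fileno(), len(payload))))  # < 14 on Windows: CRLF is
                                                     # collapsed and Ctrl-Z acts as EOF

bin_f = open('chunk0', 'rb')   # binary mode: the raw bytes come back intact
print(len(os.read(bin_f.fileno(), len(payload))))   # 14 everywhere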

closed by #8