PickleFifoDiskQueue and FifoMemoryQueue raise EOFError when used to persist a crawl
nramirezuy opened this issue · 7 comments
AUTHOR: @Varriount
When using Scrapy's PickleFifoDiskQueue and FifoMemoryQueue objects for persistence and request scheduling, an EOFError is raised:
Traceback (most recent call last):
File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\commands\crawl.py", line 58, in run
self.crawler_process.start()
File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\crawler.py", line 253, in start
reactor.run(installSignalHandlers=False) # blocking call
File "C:\x64\python27\lib\site-packages\twisted\internet\base.py", line 1194, in run
self.mainLoop()
File "C:\x64\python27\lib\site-packages\twisted\internet\base.py", line 1203, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "C:\x64\python27\lib\site-packages\twisted\internet\base.py", line 825, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\utils\reactor.py", line 41, in __call__
return self._func(*self._a, **self._kw)
File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\core\engine.py", line 105, in _next_request
if not self._next_request_from_scheduler(spider):
File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\core\engine.py", line 132, in _next_request_from_scheduler
request = slot.scheduler.next_request()
File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\core\scheduler.py", line 68, in next_request
request = self._dqpop()
File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\core\scheduler.py", line 98, in _dqpop
d = self.dqs.pop()
File "C:\x64\python27\lib\site-packages\queuelib\pqueue.py", line 43, in pop
m = q.pop()
File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\squeues.py", line 21, in pop
return deserialize(s)
exceptions.EOFError:
There isn't any problem when not persisting the crawl (omitting '-s JOBDIR=crawl-1'), so my best guess is that the problem lies mainly with PickleFifoDiskQueue.
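(For reference, a persistent crawl is started with a command of this form; the spider name here is only a placeholder.)
scrapy crawl myspider -s JOBDIR=crawl-1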
I'm running Python 2.7 x64 on Windows 8 x64, using the latest Scrapy from the master branch. This bug affects the latest stable Scrapy build as well.
Edit: After some investigation, it seems that a race condition occurs when the FifoDiskQueue object's headf and tailf file descriptors point to the same file. Adding a 'sleep' to the pop() method greatly reduces how often the EOFError occurs over multiple runs.
Encountered the same error. My environment is Windows 7, both 32-bit and 64-bit.
I found this out by inserting logging statements at the beginning of FifoDiskQueue.push:
logging.warning("Writing to disk queue: %s", binascii.b2a_hex(string))
and at the end of FifoDiskQueue.pop, just before the return statement:
logging.warning("Reading from disk queue: %s", binascii.b2a_hex(data))
Here is the relevant part of the log I got:
2015-08-23 22:04:41 [root] WARNING: Writing to disk queue: 80027d7101285504626f64797102557e656e67696e653d3526776f72643d254535254139254234254535253834254246254536253841253941254538254137254136254535254241253841254535253845253832254535254145254236267175657279547970653d32266b65793d4a53484825343025323925323125323426636f6d70616e7949643d3236393432710355095f656e636f64696e67710455057574662d3871055507636f6f6b69657371067d710755046d65746171087d71092855056465707468710a4b05550a636f6d70616e795f6964710b4d3e6955076b6579776f7264710c5815000000e5a9b4e584bfe68a9ae8a7a6e5ba8ae58e82e5aeb6710d550b656e67696e655f74797065710e5503333630710f7555076865616465727371107d7111550c436f6e74656e742d5479706571125d711355216170706c69636174696f6e2f782d7777772d666f726d2d75726c656e636f64656471146173550375726c7115584b000000687474703a2f2f6f70656e312e6668743336302e636f6d2f526573744c6974654170692f52616e6b696e672f44656c6574654b6579776f7264517965727942794d6f72652e616374696f6e550b646f6e745f66696c74657271168955087072696f7269747971174b31550863616c6c6261636b7118551770617273655f7375626d697373696f6e5f726573756c74711955066d6574686f64711a5504504f5354711b55076572726261636b711c4e752e
2015-08-23 22:04:42 [scrapy] DEBUG: Crawled (200) <GET https://www.haosou.com/s?ie=utf-8&shb=1&src=360sou_newhome&q=%E5%A9%B4%E5%84%BF%E6%B8%B8%E6%B3%B3%E6%96%B9%E6%B1%A0%E5%8E%82%E5%AE%B6&> (referer: None)
2015-08-23 22:04:42 [root] WARNING: Reading from disk queue: 80027d7101285504626f64797102557e656e67696e653d3526776f72643d254535254139254234254535253834254246254536253841253941254538254137254136254535254241253841254535253845253832254535254145254236267175657279547970653d32266b65793d4a53484825343025323925323125323426636f6d70616e7949643d3236393432710355095f656e636f64696e67710455057574662d3871055507636f6f6b69657371067d710755046d65746171087d71092855056465707468710a4b05550a636f6d70616e795f6964710b4d3e6955076b6579776f7264710c5815000000e5a9b4e584bfe68a9ae8a7a6e5ba8ae58e82e5aeb6710d550b656e67696e655f74797065710e5503333630710f7555076865616465727371107d7111550c436f6e74656e742d5479706571125d711355216170706c69636174696f6e2f782d7777772d666f726d2d75726c656e636f64656471146173550375726c7115584b000000687474703a2f2f6f70656e312e6668743336302e636f6d2f526573744c6974654170692f52616e6b696e672f44656c6574654b6579776f7264517965727942794d6f72652e616374696f6e550b646f6e745f66696c74657271168955087072696f7269747971174b31550863616c6c6261636b7118551770617273655f7375626d697373696f6e5f726573756c74711955066d6574686f6471
Clearly, it reads only a portion of what is written.
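The same behaviour can be exercised outside Scrapy with a minimal round trip through queuelib. A sketch (the testqueue directory and the payload are arbitrary; on an affected Windows setup the final assert is expected to fail):
import binascii
import logging

from queuelib import FifoDiskQueue

logging.basicConfig(level=logging.WARNING)

# A payload containing '\r\n' pairs and a Ctrl-Z byte, which pickled requests often do.
payload = b"\x80\x02hello\r\nworld\x1a" + b"x" * 400

q = FifoDiskQueue("testqueue")  # arbitrary scratch directory
logging.warning("Writing to disk queue: %s", binascii.b2a_hex(payload))
q.push(payload)

data = q.pop()
logging.warning("Reading from disk queue: %s", binascii.b2a_hex(data or b""))
q.close()

assert data == payload  # fails on the affected Windows setups described in this thread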
It seems to be some weirdness in how the underlying disk file is being updated, possibly something specific to Windows?
In any case, I've tried fixing this, but have had no luck with ordinary file-handling code. Perhaps using the low-level OS file functions would work?
I had the same suspicion. I will try Scrapy on Linux later today.
Just now I changed my log statements to include the thread ID, as follows:
logging.warning("[Thread %s] Writing to disk queue: %s",
threading.current_thread().name, binascii.b2a_hex(string))
2015-08-24 08:47:18 [root] WARNING: [Thread MainThread] Writing to disk queue: 80027d7101285504626f647971025478010000656e67696e653d3126646f6d61696e3d68786373702e6668743336302e636f6d26776f72643d25453525423825423825453525423725394525453525384425393725453725393325394325453525414425393026717565727956616c75653d254537254143254143312545392541312542352545372541432541433225453525393025384426636f6d70616e7949643d32373935352672616e6b55726c3d687474702533412532462532467777772e62616964752e636f6d2532467325334669652533447574662d3825323666253344382532367273765f6270253344302532367764253344253235453525323542382532354238253235453525323542372532353945253235453525323538442532353937253235453725323539332532353943253235453525323541442532353930253236253236696e70757454253344353132267175657279547970653d3226717565727954696d653d323031352d30382d32342b3038253341343725334131382e333239303030710355095f656e636f64696e67710455057574662d3871055507636f6f6b69657371067d710755046d65746171087d71092855056465707468710a4b03550c6c6173745f75706461746564710b636461746574696d650a6461746574696d650a710c550a07df0818082f120505288552710d55076b6579776f7264710e580f000000e5b8b8e5b79ee58d97e7939ce5ad90710f550a636f6d70616e795f696471104d336d550472616e6b71114b02550e636f6d70616e795f646f6d61696e7112581000000068786373702e6668743336302e636f6d711355047061676571144b01550b656e67696e655f7479706571155505426169647571167555076865616465727371177d7118550c436f6e74656e742d5479706571195d711a55216170706c69636174696f6e2f782d7777772d666f726d2d75726c656e636f646564711b6173550375726c711c584f000000687474703a2f2f6f70656e312e6668743336302e636f6d2f526573744c6974654170692f52616e6b696e672f496e736572744b6579776f72645175657279526573756c74546f44422e616374696f6e550b646f6e745f66696c746572711d8955087072696f72697479711e4b93550863616c6c6261636b711f551770617273655f7375626d697373696f6e5f726573756c74712055066d6574686f6471215504504f5354712255076572726261636b71234e752e
2015-08-24 08:47:18 [root] WARNING: [Thread MainThread] Reading from disk queue: 80027d7101285504626f647971025478010000656e67696e653d3126646f6d61696e3d68786373702e6668743336302e636f6d26776f72643d25453525423825423825453525423725394525453525384425393725453725393325394325453525414425393026717565727956616c75653d254537254143254143312545392541312542352545372541432541433225453525393025384426636f6d70616e7949643d32373935352672616e6b55726c3d687474702533412532462532467777772e62616964752e636f6d2532467325334669652533447574662d3825323666253344382532367273765f6270253344302532367764253344253235453525323542382532354238253235453525323542372532353945253235453525323538442532353937253235453725323539332532353943253235453525323541442532353930253236253236696e70757454253344353132267175657279547970653d3226717565727954696d653d323031352d30382d32342b3038253341343725334131382e333239303030710355095f656e636f64696e67710455057574662d3871055507636f6f6b69657371067d710755046d65746171087d71092855056465707468710a4b03550c6c6173745f75706461746564710b636461746574696d650a6461746574696d650a710c550a07df0818082f120505288552710d55076b6579776f7264710e580f000000e5b8b8e5b79ee58d97e7939ce5ad90710f550a636f6d70616e795f696471104d336d550472616e6b71114b02550e636f6d70616e795f646f6d61696e7112581000000068786373702e6668743336302e636f6d711355047061676571144b01550b656e67696e655f7479706571155505426169647571167555076865616465727371177d7118550c436f6e74656e742d5479706571195d71
The results show that both the write and the read happen in the same thread, the MainThread.
In FifoDiskQueue.__init__, headf and tailf are opened as two separate file descriptors via two _openchunk calls. Probably Windows doesn't guarantee that, at any given moment, both descriptors see the same content of the file.
OK. It works smoothly on Linux. This would probably take forever for the Scrapy dev team to fix; Windows is almost the abandoned land.
I added a line after the os.read call in the pop method:
def pop(self):
    tnum, tcnt, toffset = self.info['tail']
    if [tnum, tcnt] >= self.info['head']:
        return
    tfd = self.tailf.fileno()
    szhdr = os.read(tfd, self.szhdr_size)
    if not szhdr:
        return
    size, = struct.unpack(self.szhdr_format, szhdr)
    data = os.read(tfd, size)
    logging.warning("size in header is %d, size read is %d", size, len(data))  # <=== this line
    tcnt += 1
And it prints:
2015-08-24 04:53:23 [root] WARNING: size in header is 488, size read is 465
I am thinking of adding a check after the os.read call and continuously reading until I get the amount of data specified in the size header (szhdr). Will update when I have progress.
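Such a retry loop would look roughly like this (a sketch with a hypothetical read_exact helper; in the end it was not needed, see the next comment):
import os

def read_exact(fd, size):
    # Keep calling os.read() until `size` bytes have been collected or the file ends.
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = os.read(fd, remaining)
        if not chunk:  # genuine end of file
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# Inside pop(), `data = os.read(tfd, size)` would then become `data = read_exact(tfd, size)`.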
OK. I finally traced this down to the root of the problem.
The solution is adding one character:
def _openchunk(self, number, mode='rb'):
Without the b in the mode, Windows interprets the file as a text file, and os.read(size_you_want) may read much less data than you specified even though more bytes are available.
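A small script that shows the effect (a sketch, assuming Python 2.7 on Windows; modedemo.tmp is just a scratch file name):
import os

# Bytes a pickled request can easily contain: '\r\n' pairs and a Ctrl-Z (0x1a).
payload = b"\x80\x02hello\r\nworld\x1a" + b"x" * 40

with open("modedemo.tmp", "wb") as f:
    f.write(payload)

for mode in ("r", "rb"):
    f = open("modedemo.tmp", mode)
    # os.read() goes through the C runtime; on a text-mode descriptor Windows
    # translates '\r\n' and stops at Ctrl-Z, so the read can come back short.
    data = os.read(f.fileno(), len(payload))
    f.close()
    print("mode=%r: asked for %d bytes, got %d" % (mode, len(payload), len(data)))
On the affected setup the 'r' run reports fewer bytes than requested, while 'rb' returns the full payload; on Linux both modes behave the same, which is why the problem never showed up there.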