alard/warc-proxy

Excessive memory usage when loading a WARC with big files

bzc6p opened this issue · 2 comments

bzc6p commented

I tried to load a WARC containing a few larger files (200-300 MB each). While the WARC was being loaded (indexed), the memory usage of the Python process doing the indexing rose to about 700 MB; the process then ran out of memory and left the following error message in the terminal:

Loading /media/datadisk/upload_queue/hajduvolan_hu_2015_05.warc.gz
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "./warcproxy.py", line 112, in run
    http_response = parse_http_response(record)
  File "./warcproxy.py", line 24, in parse_http_response
    remainder = message.feed(record.content[1])
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 576, in feed
    text = HTTPMessage.feed(self, text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 94, in feed
    text = self.feed_start(text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 179, in feed_start
    line, text = self.feed_line(text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 159, in feed_line
    text = str(self.buffer[pos:])
MemoryError

The progress bar got stuck and the indexing stopped.
I suspect the big files are responsible, as I've been using this great tool for a long time and haven't experienced such a problem before (this was the first time I tried to load a WARC with files larger than a few tens of megabytes). However, I can't imagine why warc-proxy would need 700 MB of memory to index a 250 MB file.
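One possible contributor, judging only from the last frame of the traceback (`str(self.buffer[pos:])`) and not from any profiling, is that slicing a buffer and then converting the slice each create a full copy of the remaining data. The sketch below is not the hanzo code: `split_copying`, `split_view` and `peak_allocation` are hypothetical names, the buffer is assumed to be a `bytearray`, and it is written for Python 3 even though the traceback comes from Python 2.7. It only illustrates how the copying pattern can use a multiple of the payload size, compared with a `memoryview` that copies nothing.

```python
import tracemalloc


def split_copying(buffer, pos):
    """Hypothetical stand-in for the pattern in the traceback:
    slice the remaining data, then convert the slice to a new object."""
    # buffer[pos:] allocates a new bytearray; bytes() then copies it again,
    # so two extra body-sized objects exist at the same moment.
    return bytes(buffer[pos:])


def split_view(buffer, pos):
    """Alternative: expose the remaining data without copying it."""
    return memoryview(buffer)[pos:]


def peak_allocation(func, buffer, pos):
    """Return the peak number of bytes allocated while func runs."""
    tracemalloc.start()
    result = func(buffer, pos)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del result
    return peak


if __name__ == "__main__":
    header = b"HTTP/1.1 200 OK\r\n\r\n"
    body_size = 8 * 1024 * 1024  # small stand-in for a 250 MB record
    buffer = bytearray(header + b"x" * body_size)
    pos = len(header)
    print("copying:", peak_allocation(split_copying, buffer, pos))
    print("view:   ", peak_allocation(split_view, buffer, pos))
```

If the whole 250 MB record sits in one buffer, two extra copies of that size on top of the original would land in the range of the figure reported above. Whether this is the actual cause in `messaging.py` would need to be confirmed by profiling.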

I think you can easily reproduce the problem; the WARC in question is here: https://archive.org/details/hajduvolan_hu_2015_05. The files most likely responsible are http://www.hajduvolan.hu/files/userfiles/Flash/EU_projekt_2010-2012.flv (249 MB) and http://www.hajduvolan.hu/files/userfiles/Flash/EU_projekt.flv (146 MB).

I haven't measured the memory usage, but I'm writing this comment to report that it totally FAILS TO LOAD/index WARC files that are multiple GiB in size. An example that fails to load MIGHT be available at

http://temporary.softf1.com/2017/bugs/www.tldp.org-2017-01-06-c51e36ac-00000.warc.gz

bzc6p commented

I generally don't have problems with WARCs up to a few gigabytes in size (I haven't tried files that are tens of gigabytes, however). Problems arise only when there are files of several hundred megabytes IN the WARC itself.

I've tried your file, and it indexed fine here. Maybe check the command-line messages during indexing for a clue about a missing dependency or some other sort of error.