Reading in an in-memory gzip.GzipFile object breaks warcat.model.binary.BinaryFileRef objects
The following:
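import gzip
import io
import warcat.model
# r is a response object whose .content holds the gzipped WARC bytes
byte_stream = io.BytesIO(r.content)
file_object = gzip.GzipFile(fileobj=byte_stream)
warc = warcat.model.WARC()
warc.read_file_object(file_object)
record = warc.records[0]
binary_block = record.content_block.binary_block.get_file()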
results in an AttributeError in warcat.model.binary.BinaryFileRef:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-23-1319f0884b9c> in <module>()
----> 1 rec.content_block.binary_block.get_file()
/usr/local/lib/python3.5/site-packages/warcat/model/binary.py in get_file(self, safe, spool_size)
128 file_obj = self.file_obj
129
--> 130 original_position = file_obj.tell()
131
132 if self.file_offset:
AttributeError: 'NoneType' object has no attribute 'tell'
The same error also occurs with the Payload.get_file method. This seems to be because the load methods of the BinaryBlock and BlockWithPayload classes pass the file object's name directly to set_file on lines 40, 83, and 96 of warcat/model/block.py; changing these lines to pass in the file object itself instead of its name seems to work.
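For anyone who wants to see the underlying problem without warcat, here is a minimal sketch of why a name is not enough for an in-memory stream. It uses only the standard library; `make_in_memory_gzip` is a hypothetical helper for illustration, and the snippet does not reproduce warcat's `set_file` logic.

```python
import gzip
import io


def make_in_memory_gzip(payload: bytes) -> gzip.GzipFile:
    """Hypothetical helper: gzip payload into a buffer and reopen it for reading."""
    buffer = io.BytesIO()
    # Closing the writer does not close a buffer that was passed in via fileobj.
    with gzip.GzipFile(fileobj=buffer, mode='wb') as writer:
        writer.write(payload)
    buffer.seek(0)
    return gzip.GzipFile(fileobj=buffer)


file_object = make_in_memory_gzip(b'WARC/1.0\r\n')

# No filename was given and BytesIO carries no name, so there is no usable
# path that could be reopened later.
print(repr(file_object.name))

# The object itself supports the calls that appear in the traceback.
start = file_object.tell()
data = file_object.read()
file_object.seek(start)
```

So a reference that only remembers the name has nothing to reopen, while one that keeps the object can still call tell, read, and seek on it.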
I pushed a fix on the develop branch. If you can, could you verify that it is fixed? Thanks.
Thanks! This fixed things for my purposes. There is still an edge case if you define the GzipFile object with a name, like so:
...
file_object = gzip.GzipFile('test', fileobj=byte_stream)
...
If you name the file, you end up with:
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-19-f0659c79356d> in <module>()
----> 1 rec.content_block.payload.get_file().read() == warc_record.content_block.payload.get_file().read()
/usr/local/lib/python3.5/site-packages/warcat/model/binary.py in get_file(self, safe, spool_size)
124 gzip.GzipFile(self.filename))
125 else:
--> 126 file_obj = open(self.filename, 'rb')
127
128 util.file_cache.put(self.filename, file_obj)
FileNotFoundError: [Errno 2] No such file or directory: 'test'
It looks like this can be fixed by swapping this if/else statement or by putting the in-memory file in the cache.
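A rough sketch of one reading of the first option: check for a file object that is already held before trying to reopen by name. `FileRef` is a hypothetical stand-in, not warcat's `BinaryFileRef`, and the file name is made up; the real `get_file` also deals with caching and offsets, which this ignores.

```python
import gzip
import io
import os


class FileRef:
    """Hypothetical stand-in for a file reference; not warcat's BinaryFileRef."""

    def __init__(self, file_obj=None, filename=None):
        self.file_obj = file_obj
        self.filename = filename

    def get_file(self):
        # Prefer the object we already hold; reopen by name only as a fallback.
        if self.file_obj is not None:
            return self.file_obj
        return open(self.filename, 'rb')


payload = b'WARC/1.0\r\n'
name = 'no-such-file.warc.gz'  # made-up name, assumed not to exist on disk

byte_stream = io.BytesIO()
with gzip.GzipFile(name, fileobj=byte_stream, mode='wb') as writer:
    writer.write(payload)
byte_stream.seek(0)

# A named GzipFile backed by memory: the name is set, but no such file exists.
file_object = gzip.GzipFile(name, fileobj=byte_stream)
ref = FileRef(file_obj=file_object, filename=file_object.name)
result = ref.get_file().read()
```

With the check in this order, the name on the GzipFile no longer matters as long as the object itself was kept.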
OK, thanks. I'm going to file that edge case as a separate issue.