'utf-8' codec can't decode byte invalid continuation byte
fanchyna opened this issue · 1 comments
fanchyna commented
I've installed warcat on my server under Python 3.4. Calling warc.load() on a WARC file gives me the following error message:
>>> warc.load("/gstorage01/external-data/internet-archive/archive.org/download/archiveteam_pdf_20160412083746/pdf_20160412083746.megawarc.warc.gz")
Content block length changed from 92850 to 92849
Content block length changed from 150326 to 150325
Content block length changed from 156258 to 156257
Content block length changed from 129362 to 129361
Content block length changed from 156196 to 156195
Content block length changed from 129336 to 129335
Content block length changed from 147763 to 147762
Content block length changed from 129338 to 129337
Content block length changed from 129350 to 129349
Content block length changed from 156195 to 156194
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/site-packages/warcat/model/warc.py", line 32, in load
self.read_file_object(f)
File "/usr/lib/python3.4/site-packages/warcat/model/warc.py", line 39, in read_file_object
record, has_more = self.read_record(file_object)
File "/usr/lib/python3.4/site-packages/warcat/model/warc.py", line 75, in read_record
check_block_length=check_block_length)
File "/usr/lib/python3.4/site-packages/warcat/model/record.py", line 68, in load
content_type)
File "/usr/lib/python3.4/site-packages/warcat/model/block.py", line 21, in load
field_cls=HTTPHeader)
File "/usr/lib/python3.4/site-packages/warcat/model/block.py", line 92, in load
fields = field_cls.parse(file_obj.read(field_length).decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 26: invalid continuation byte
The data is publicly available for download from the Internet Archive website. The file is about 130 GB, but I don't think the size matters. The key question is how a codec error can happen here.
jeffcasavant commented
I think I'm having the same issue. This is a WARC file that was built using the Internet Archive's warc library.
[jeff warc]$ warcat split my.warc.gz
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/lib/python3.5/site-packages/warcat/__main__.py", line 154, in <module>
main()
File "/usr/lib/python3.5/site-packages/warcat/__main__.py", line 70, in main
command_info[1](args)
File "/usr/lib/python3.5/site-packages/warcat/__main__.py", line 126, in split_command
tool.process()
File "/usr/lib/python3.5/site-packages/warcat/tool.py", line 95, in process
check_block_length=self.check_block_length)
File "/usr/lib/python3.5/site-packages/warcat/model/warc.py", line 75, in read_record
check_block_length=check_block_length)
File "/usr/lib/python3.5/site-packages/warcat/model/record.py", line 68, in load
content_type)
File "/usr/lib/python3.5/site-packages/warcat/model/block.py", line 21, in load
field_cls=HTTPHeader)
File "/usr/lib/python3.5/site-packages/warcat/model/block.py", line 92, in load
fields = field_cls.parse(file_obj.read(field_length).decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 712: invalid start byte
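Both tracebacks end at the same line in block.py, where the header bytes are decoded with a bare .decode(), which in Python 3 means strict UTF-8. Below is a minimal sketch of how that can produce the two messages above, assuming the archived HTTP headers contain bytes that are not valid UTF-8 (for example Latin-1 text); that assumption has not been checked against either file. The sample byte strings and the helpers strict_decode_reason and decode_header_bytes are hypothetical and are not part of warcat.

```python
from typing import Optional


def strict_decode_reason(raw: bytes) -> Optional[str]:
    """Return the UnicodeDecodeError reason for a strict UTF-8 decode, or None."""
    try:
        raw.decode()  # same call as in the traceback: defaults to strict UTF-8
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


def decode_header_bytes(raw: bytes) -> str:
    """Hypothetical fallback: try UTF-8 first, then Latin-1.

    Latin-1 maps every byte 0x00-0xFF to a code point, so it cannot fail,
    but the result is only correct if the bytes really were Latin-1.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


# Made-up header fragments, not taken from the WARC files in this issue.
latin1_header = b"X-Name: \xe1bc"  # 0xE1 starts a 3-byte sequence, but "b" follows
binary_header = b"X-Data: \xff"    # 0xFF can never start a UTF-8 sequence

print(strict_decode_reason(latin1_header))  # invalid continuation byte
print(strict_decode_reason(binary_header))  # invalid start byte
print(decode_header_bytes(latin1_header))   # decodes without raising
```

Whether a Latin-1 fallback is the right fix for warcat is a judgment call: it avoids the crash, but it may mislabel bytes that were written in some other encoding.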