BadZipfile: File is not a zip file
dsal1951 opened this issue · 4 comments
I'm trying to pull bill text for previous congresses for some data science research I'm working on. When I run the following command:
./run govinfo --collections=BILLS --extract=mods,text,xml,pdf --congress=114
I get a BadZipfile error on every bill (example for s29 below). If I try to open package.zip manually, I end up in the never-ending zip -> cpgz -> zip cycle.
The strange thing is that if I delete all of the text-versions subdirectories and then rerun the same command, it works fine for the vast majority of bills (~90%). I haven't been able to find any rhyme or reason to this behavior, but I can confirm that I've observed it on both macOS and Ubuntu, and across many congresses.
Error fetching package 114s29is in collection BILLS from https://www.govinfo.gov/app/details/BILLS-114s29is.
Traceback (most recent call last):
File "/Users/trent/Documents/congress/tasks/govinfo.py", line 174, in update_sitemap2
mirror_results = mirror_package(collection, package_name, lastmod, lastmod_cache.setdefault("packages", {}), options)
File "/Users/trent/Documents/congress/tasks/govinfo.py", line 313, in mirror_package
extracted_files = extract_package_files(collection, package_name, file_path, lastmod_cache, options)
File "/Users/trent/Documents/congress/tasks/govinfo.py", line 371, in extract_package_files
with zipfile.ZipFile(package_file) as package:
File "/anaconda2/envs/congress2/lib/python2.7/zipfile.py", line 770, in __init__
self._RealGetContents()
File "/anaconda2/envs/congress2/lib/python2.7/zipfile.py", line 811, in _RealGetContents
raise BadZipfile, "File is not a zip file"
BadZipfile: File is not a zip file
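One way to narrow this down is to check what the downloaded package.zip actually contains. The snippet below is only a diagnostic sketch, not part of the congress scraper: `describe_package` is a hypothetical helper, and the path in the usage comment is a placeholder to be replaced with wherever the package was saved. It uses the standard-library `zipfile.is_zipfile` and looks at the first bytes of the file, since a real zip archive with at least one entry starts with `PK`. If the file instead starts with something like HTML, the download itself probably returned something other than a zip, though that is an assumption I have not verified here.

```python
import os
import zipfile


def describe_package(path):
    """Summarize a downloaded file: its size, whether it parses as a zip,
    and its first bytes (hypothetical helper, for diagnosis only)."""
    with open(path, "rb") as f:
        head = f.read(64)
    return {
        "size": os.path.getsize(path),
        # True only if the zip end-of-central-directory record is found.
        "is_zip": zipfile.is_zipfile(path),
        # Non-empty zip archives begin with the local file header magic "PK".
        "starts_with_zip_magic": head.startswith(b"PK"),
        "head": head,
    }


# Usage (placeholder path, adjust to the actual location of the package):
# print(describe_package("path/to/package.zip"))
```

Comparing the `head` value of a failing package with one that extracts cleanly should show whether the two downloads differ in content or only in how they are read.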
For some additional context, below is a table showing the fraction of bills whose text was properly extracted, broken down by congress. It looks like roughly 22% of bills failed with the BadZipfile error in each congress, except for the 114th, which had no failures for some strange reason.
Congress Fraction_extracted
103 0.784871
104 0.784470
105 0.783371
106 0.785635
107 0.772524
108 0.779064
109 0.773925
110 0.784199
111 0.777987
112 0.772009
113 0.777428
114 1.000000
115 0.783106
116 0.767378
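For reference, the roughly 22% failure figure can be reproduced from the table with a few lines of Python. This is just arithmetic on the numbers above (failure rate = 1 - fraction extracted), leaving out the 114th congress because it had no failures; the variable names are mine.

```python
# Fraction of bills with text extracted, per congress (copied from the table).
extracted = {
    103: 0.784871, 104: 0.784470, 105: 0.783371, 106: 0.785635,
    107: 0.772524, 108: 0.779064, 109: 0.773925, 110: 0.784199,
    111: 0.777987, 112: 0.772009, 113: 0.777428, 114: 1.000000,
    115: 0.783106, 116: 0.767378,
}

# Failure rate per congress, excluding the 114th (the one with no failures).
failure_rates = {c: 1.0 - f for c, f in extracted.items() if c != 114}

mean_failure = sum(failure_rates.values()) / len(failure_rates)
print("mean failure rate: {:.1%}".format(mean_failure))
print("range: {:.1%} to {:.1%}".format(min(failure_rates.values()),
                                       max(failure_rates.values())))
```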