pR0Ps/zipstream-ng

Add files from another generator

Closed this issue · 3 comments

Is it possible to add files from a different generator? I have to download multiple files from an S3 bucket, create a zip of them on the fly, and return a stream response.

I can't use a for loop and the .add() method, because it waits for all files to be downloaded and added.

I want a generator/iterator that downloads files one by one and adds them to the zip.

pR0Ps commented

There's an example of doing something very similar in #11 (comment)

I can't use a for loop and the .add() method, because it waits for all files to be downloaded and added.

If you add a generator via the add() function, the ZipStream generally won't iterate over it or wait for it to finish. The exception is when the ZipStream is in sized mode and you don't provide the size of the data: in order to calculate the final size of the zip, the size of each file must be known when it's added, so if no size is provided for a generator, the generator has to be fully iterated to determine it. I suspect this is what's happening to you.
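
For illustration, here's a minimal sketch of that behavior (untested, and based on the iterate-to-measure behavior described above; the file names are placeholders):

from zipstream import ZipStream

def slow_gen():
    print("iterating!")
    yield b"data"

zs = ZipStream(sized=True)

# No size given: a sized ZipStream must consume the generator
# immediately to measure it, so "iterating!" prints right away.
zs.add(slow_gen(), "file_1.txt")

# Size given up front: the generator stays lazy until the zip is streamed.
zs.add(slow_gen(), "file_2.txt", size=4)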

In your case, I believe that S3 provides a Content-Length header along with the response. You should be able to pass that to add() as the size, avoiding the need to iterate the stream to determine it. Note that if the Content-Length header is incorrect, streaming the zip will fail, since the actual size of each file is checked against the expected size as the stream is generated. Example (untested):

# <snip setup code from the example in the comment linked above>

def stream_response(response):
    """Stream from a response and clean up when finished"""
    try:
        yield from response.stream(amt=32*1024)  # stream data in 32KB chunks, adjust as needed
    finally:
        response.close()
        response.release_conn()

zs = ZipStream(sized=True)

response = client.get_object("bucket_name", "object_name")  # get response
zs.add(
    stream_response(response),  # data is an iterator
    arcname="some_filename",
    size=int(response.headers["Content-Length"]),  # cast the header (a string) to int; providing the size avoids iterating the data to determine it
)

# add more files

Alternatively, you can use an unsized ZipStream (ZipStream(sized=False)), which will never iterate over generators added via add(). The disadvantage is that the total size of the zip can't be calculated up front, so you can't provide a response size to the client downloading it.
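
For example, a minimal unsized variant of the snippet above (untested; it reuses the client and the stream_response helper from the earlier example, and the bucket/object names are placeholders):

zs = ZipStream(sized=False)

for object_name in ("object_1", "object_2"):
    response = client.get_object("bucket_name", object_name)
    # No size needed: an unsized ZipStream never iterates the generator
    # at add() time; data is only pulled from S3 as the zip is streamed.
    zs.add(stream_response(response), arcname=object_name)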

I need to add multiple files from the S3 bucket, not just one; there can be up to 1000 files.
ZipStream waits until all files are downloaded and added.

Example:

import os
from zipfile import ZIP_DEFLATED

from django.http import StreamingHttpResponse
from zipstream import ZipStream

reports = DiagnosticReport.objects.exclude(document="")

zs = ZipStream(compress_type=ZIP_DEFLATED, compress_level=9)
for report in reports:
    file_name = os.path.basename(report.document.name)
    zs.add(report.document.chunks(), file_name)

response = StreamingHttpResponse(zs)
response["Content-Disposition"] = "attachment; filename=reports.zip"
return response
pR0Ps commented

ZipStream waits until all files are downloaded and added.

I'm not sure why this is happening in your implementation, but it's not an issue with ZipStream. You can test this by reducing your code to the simplest possible version: create a ZipStream, add some generators to it, and add print statements to time each step:

import time
from zipfile import ZIP_DEFLATED
from zipstream import ZipStream

start = time.time()
zs = ZipStream(compress_type=ZIP_DEFLATED, compress_level=9)

# fake a really slow download
def slow_generator():
    time.sleep(1)
    yield b"data"

# add multiple files
for x in range(10):
    zs.add(slow_generator(), f"file_{x}.txt")

print("files added:", time.time() - start)

b"".join(zs)  # force full generation of the stream
print("zip generated:", time.time() - start)

The above prints something like:

files added: 0.00005912780
zip generated: 10.090222358

so it's clear that adding generators to a ZipStream doesn't wait for the files to be fully generated at the time they're added.

Something else in your code is causing the blocking to occur. Maybe report.document.chunks() isn't actually a generator?
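
One quick way to check that (a hypothetical debugging snippet, assuming Django's FieldFile API):

import inspect

chunks = report.document.chunks()
print(type(chunks), inspect.isgenerator(chunks))
# If this isn't a generator, the file data is being read eagerly
# before it's ever handed to add().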

I'm going to close this out since, as far as I can tell, there's nothing for me to fix in ZipStream, and you should now have all the information you need to figure out why it's not working in your specific case. Feel free to reopen if you believe there's a bug in ZipStream related to this.