emscripten-core/emsdk

Downloading files with urllib results in missing KB

weypro opened this issue · 3 comments

I encountered an issue while using the urllib version of the download function: the downloaded files were consistently missing a few KB of data, resulting in incomplete downloads.
To address this, I rewrote the download function using the requests library. The rewritten version resolves the issue and consistently downloads complete files without missing data.
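
For reference, one way to confirm the truncation is to compare the size of the file on disk against the Content-Length reported by the server. A minimal standalone check could look roughly like this (check_download is a hypothetical helper for illustration only, not part of emsdk, and it assumes the server answers HEAD requests with a Content-Length header):

import os
import urllib.request

def check_download(url, local_path):
    # Ask the server how large the file should be.
    req = urllib.request.Request(url, method='HEAD')
    with urllib.request.urlopen(req) as resp:
        expected = int(resp.headers.get('Content-Length', 0))
    # Compare against what actually ended up on disk.
    actual = os.path.getsize(local_path)
    print(f'expected {expected} bytes, got {actual} bytes')
    return expected == actual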

Given this improvement, I suggest that the repository prioritize the use of the requests library for downloading files. The requests library offers a more robust and user-friendly API, and in my experience, it handles file downloads more reliably than urllib. We can use urllib only if requests is unavailable or unsuitable for certain edge cases.

Below is the code I used to replace the original urllib implementation. I hope it helps and can be integrated into the repository for a more reliable file download process.

import requests

def download_file(url, dstpath, download_even_if_exists=False, filename_prefix=""):
    debug_print(f"download_file(url={url}, dstpath={dstpath})")
    file_name = get_download_target(url, dstpath, filename_prefix)

    if os.path.exists(file_name) and not download_even_if_exists:
        print(f"File '{file_name}' already downloaded, skipping.")
        return file_name

    try:
        response = requests.get(url, stream=True)
        response.raise_for_status()
        
        mkdir_p(os.path.dirname(file_name))
        
        file_size = int(response.headers.get('content-length', 0))
        if file_size > 0:
            print(f"Downloading: {file_name} from {url}, {file_size} Bytes")
        else:
            print(f"Downloading: {file_name} from {url}")

        file_size_dl = 0
        # Draw a progress bar 80 chars wide (in non-TTY mode)
        progress_max = 80 - 4
        progress_shown = 0
        block_sz = 256 * 1024
        if not TTY_OUTPUT:
            print(' [', end='')

        with open(file_name, 'wb') as f:
            for chunk in response.iter_content(chunk_size=block_sz):
                if not chunk:
                    # Skip empty keep-alive chunks instead of aborting the download early.
                    continue
                file_size_dl += len(chunk)
                f.write(chunk)
                if file_size:
                    percent = file_size_dl * 100.0 / file_size
                    if TTY_OUTPUT:
                        status = f"\r {file_size_dl:10d}  [{percent:3.02f}%]"
                        print(status, end="", flush=True)
                    else:
                        while progress_shown < progress_max * percent / 100:
                            print("-", end="", flush=True)
                            progress_shown += 1

            if not TTY_OUTPUT:
                print("]")
            else:
                print()

    except requests.exceptions.RequestException as e:
        errlog(f"Error: Downloading URL '{url}': {str(e)}")
        if "SSL: CERTIFICATE_VERIFY_FAILED" in str(e) or "urlopen error unknown url type: https" in str(e):
            errlog(
                "Warning: Possibly SSL/TLS issue. Update or install Python SSL root certificates (2048-bit or greater) supplied in Python folder or https://pypi.org/project/certifi/ and try again."
            )
        rmfile(file_name)
        return None
    except KeyboardInterrupt:
        rmfile(file_name)
        exit_with_error("aborted by user, exiting")

    return file_name

Additionally, my changes were made against the emsdk version used as a dependency of Skia's chrome/m127 branch, so they would need some further adjustment to work with the latest emsdk.

What platform are you seeing these partial downloads on? Presumably it's not macOS, where we download using the external curl tool? Is it Linux or Windows?

Sadly, the requests library is not part of the Python standard library, so it's not really possible for us to use it in this script (which is supposed to be highly portable, and kind of bootstraps the whole system).

I'm curious how you think the urllib implementation could be failing. A bug in urllib itself seems highly unlikely, and the fact that nobody else has reported this issue in over a decade also seems odd. I think it's worth investigating further.

I am using Windows. For well-known reasons, developers in China often need to use proxies to get reasonable download speeds. It's not that urllib has a bug as such; it just seems less able to cope with a less stable network environment. That's why I switched to the requests library, which gave me more consistent results.

As for why others haven't reported this issue, there could be various reasons. Many might simply download the files manually rather than chase the root cause of the problem. Also, using a proxy doesn't necessarily mean the network is unstable; at least my friends can run the script normally. I hope this additional context helps explain the situation.

I completely understand the importance of maintaining high portability for the script. Your point about requests not being part of the core Python library is valid. For the sake of easy testing, I simply replaced the original implementation. However, if we were to formally integrate this into the project, we should definitely consider this point. For example, we could try the following approach:

try:
    import requests
    USE_REQUESTS = True
except ImportError:
    USE_REQUESTS = False

This way, we maintain portability while offering a more robust solution for those who need it.
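
For illustration, the dispatch could then look roughly like this (download_file_requests and download_file_urllib are hypothetical names for the requests-based version above and the current urllib-based implementation; they are not existing emsdk functions):

def download_file(url, dstpath, download_even_if_exists=False, filename_prefix=""):
    # Prefer requests when it is installed, otherwise fall back to the
    # existing urllib-based code path.
    if USE_REQUESTS:
        return download_file_requests(url, dstpath, download_even_if_exists, filename_prefix)
    return download_file_urllib(url, dstpath, download_even_if_exists, filename_prefix)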

I would like to understand exactly what the problem is with the urllib implementation before making any change here. Is the problem consistent for you? Perhaps we can start by adding some assertions here? Presumably we should never be returning success if not all the bytes were transferred?
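
For example, something along these lines at the end of the download loop might do it (just a sketch, assuming the urllib implementation tracks file_size and file_size_dl the same way as the snippet above; errlog and rmfile are the existing emsdk helpers):

# After the download loop, before returning file_name: if the server reported
# a Content-Length, verify we actually received that many bytes instead of
# silently returning a truncated file.
if file_size and file_size_dl != file_size:
    errlog(f"Error: Incomplete download of '{url}': received {file_size_dl} of {file_size} bytes")
    rmfile(file_name)
    return None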