upbit/pixivpy

Unable to Save All Images (Concurrency-Related Maybe)

SakiSakiSakiSakiSaki opened this issue · 25 comments

So I have about 25,000 likes/bookmarks on Pixiv I'm trying to download (about 59,000 images). It takes almost 48 hours to download them all from project initialization.

I have two issues:

1. Images AND illustrations are being downloaded out of order. (The order I added them to my bookmarks, from last to first, and illustration album order, p0-p100.)

  • Usually it works right:
    Screenshot_5 - GOOD
  • Sometimes it doesn't work:
    Screenshot_5 - BAD

Sorting through all date formats (created, modified) never yields the expected order. My 5th ever liked illustration shows up as my 2nd most recent download.


2. The bigger issue: some images are straight up not downloading.
My SQLite database gives me the correct number (I know it's correct because it omits invisible works, my profile hasn't liked any ugoira, and my R-18 blocking settings are disabled):
Screenshot_3
My images directory has a lower number:
Screenshot_4
Running a comparison program gives me the missing filenames; here are a few:

100259613_p0.jpg doesn't exist in the directory
100728200_p19.jpg doesn't exist in the directory
100728200_p48.jpg doesn't exist in the directory
101820708_p0.jpg doesn't exist in the directory 
102787429_p0.jpg doesn't exist in the directory 
104981105_p0.jpg doesn't exist in the directory 
104981105_p1.jpg doesn't exist in the directory 
105850627_p0.jpg doesn't exist in the directory 
23580508_p0.jpg doesn't exist in the directory
28564463_p0.jpg doesn't exist in the directory
32759252_p17.jpg doesn't exist in the directory
35812504_p0.jpg doesn't exist in the directory 
38319983_p12.jpg doesn't exist in the directory
40090211_p0.jpg doesn't exist in the directory
44923009_p0.png doesn't exist in the directory
45713411_p0.jpg doesn't exist in the directory

All of the missing files are from visible illustrations. Sometimes in a series, it'll "skip" or fail to download an image and proceed normally afterward (100728200_p0-18 and p20-47 are in my directory).


I know you guys can't see what my program looks like, so your diagnosis is limited (I can share my source code). But I have a feeling it's related to:

with ThreadPoolExecutor(max_workers=5) as executor:
    for index_1, illust in enumerate(illusts_list):
        executor.submit(api.download, url, path=saved_images_path, name=f'{media_file_name}')

My questions are the following:

  • Is this just how concurrency works? Does it do things a little out of order?
    • Is this a compromise I have to deal with for no rate-limits?
    • If I want files to be downloaded in order, do I have to forgo concurrency?
  • Why does it not download certain images? There's no pattern among the ones it missed.
    • What can I do?

Both issues can be caused by concurrency.

  1. The executor doesn't guarantee the order of task execution. A worker will pick up a new task without waiting for prior tasks to complete, and there is no guarantee on task pickup order. This way, the workers won't be stuck waiting on some particular task.
    • So if the order matters, then concurrency is not suitable.
  2. Errors within the function passed to executor.submit() will not be raised directly and can easily be missed. It could be some intermittent network issue. You can wrap the download function to add more logging for debugging. E.g.:
    def download_with_logging(api, *args, **kwargs):
        try:
            api.download(*args, **kwargs)
        except Exception as e:
            print(e)  # Or any logging method you like, e.g. the `logging` module.
    # ...
    executor.submit(download_with_logging, api, url, ...)
upbit commented

The reason for the first issue is that you are using multi-threaded downloading but sorting the images by creation time.
However, time is unreliable, so it is recommended to insert an index prefix before the file name to maintain order, e.g. 00001_illustid.png.
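
For example, a minimal sketch building on the loop from the issue (illusts_list, url, media_file_name, and saved_images_path are assumed from the questioner's code, not part of the library):

# Sketch: a zero-padded bookmark index keeps lexicographic filename order
# equal to bookmark order, regardless of which worker finishes first.
# `illusts_list`, `url`, `media_file_name`, and `saved_images_path` are
# assumed from the questioner's code above.
for index, illust in enumerate(illusts_list):
    executor.submit(api.download, url, path=saved_images_path,
                    name=f'{index:05d}_{media_file_name}')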

The download() API is only a demonstration of how to request images from CDN. It cannot guarantee download accuracy if the network is unstable.
Therefore, I recommend that you refer to the implementation of download() and add two measures:

  1. Add retry when failed
  2. Check if CDN header Content-Length and file size match

@Xdynix @upbit Thank you for your response. Before I switch over and modify my usage of download(), I wanted to confirm: without concurrency, will I be dealing with rate limits? Or is it the same as before (no rate limiting), just slower, since it's doing the process one at a time?

From what we have observed so far, there are no rate limits on downloading. And since switching away from concurrency will only slow down your downloading, it doesn't make sense that you'd run into new rate limits.

upbit commented

In fact, downloads only interact with CDN servers, and the concurrency limit for the same CDN address is higher than you might imagine. Even higher concurrency can be achieved by resolving to different CDN nodes.
So removing concurrency is not necessarily a good strategy. The key is how to check for download failures and retry.

Thanks, I'll try to implement your proposed changes to see if I can get what I'm looking for. The failure/retry is definitely something I can implement for those failed images, but the issue is also that images are downloaded out of order. I'll consider adding an index to the filenames, or seeing if there's a specialized parameter for that, but I am also trying to keep the files as close to the original desktop experience as possible.

@upbit
When you say

Therefore, I recommend that you refer to the implementation of download() and add two measures:

Add retry when failed
Check if CDN header Content-Length and file size match

Do you mean that I should adjust the download() function you wrote itself, or add the retry and Content-Length check after I download the image, outside of the function?

upbit commented

Adding a prefix is quite simple, you just need to use the prefix parameter.
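
With the built-in parameter, that might look like this (a sketch; index, url, and saved_images_path are assumed from the earlier snippets):

# Sketch of download()'s `prefix` parameter; the prefix is prepended to the
# filename derived from the URL, e.g. 00001_104981105_p0.jpg.
# `index`, `url`, and `saved_images_path` are assumed from the code above.
api.download(url, path=saved_images_path, prefix=f'{index:05d}_')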

download() does not provide implementations for retry or the file-length check.

  1. For retry, you can refer to @Xdynix's code above, and use a try...except outside download()
  2. For the Content-Length header, see Get file size from "Content-Length" value, and check whether that length equals the downloaded image's file size.

The key point of downloading an image is to send the header Referer: https://app-api.pixiv.net/. So it's easy to implement your own download function.

This reminds me that I've wanted to reinvent the wheel before. The resulting code is as follows (you may need to modify some syntax according to your Python version).

import shutil
from collections.abc import Container
from functools import wraps
from pathlib import Path
from tempfile import NamedTemporaryFile
from typing import Type

import requests


def retry(num: int, retryable: Container[Type[Exception]] = None):
    """Makes function retryable.
    
    :param num: Maximum execution number.
    :param retryable: Optional, a collection of retryable exception classes.
    """
    def decorator(func):
        @wraps(func)
        def decorated_func(*args, **kwargs):
            error_count = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception as ex:
                    if retryable is not None and ex.__class__ not in retryable:
                        raise
                    error_count += 1
                    if error_count >= num:
                        raise

        return decorated_func

    return decorator


@retry(3)
def download(url: str, path: str | Path, headers: dict[str, str] = None, force=False):
    if isinstance(path, str):
        path = Path(path)
    if path.exists() and not force:
        return  # Already downloaded; skip unless forced.

    # Stream into a temporary file first, so an interrupted download never
    # leaves a truncated file at the final path.
    with requests.get(url, headers=headers, timeout=15, stream=True) as response, \
            NamedTemporaryFile() as temp_file:

        response.raise_for_status()

        downloaded_length = 0
        for chunk in response.iter_content(chunk_size=1024):
            if not chunk:
                continue
            temp_file.write(chunk)
            downloaded_length += len(chunk)

        # Compare the received byte count with the CDN's Content-Length
        # header; a mismatch means a truncated transfer and triggers a retry.
        content_length = response.headers.get('Content-Length')
        if content_length is not None:
            if str(downloaded_length) != content_length:
                raise RuntimeError('Incorrect file size!')

        # Only a fully verified download is copied to its final destination.
        temp_file.seek(0)
        path.parent.mkdir(exist_ok=True, parents=True)
        with path.open('wb') as output_file:
            shutil.copyfileobj(temp_file, output_file)


def main():
    download(
        'https://i.pximg.net/img-original/img/2023/03/09/04/00/01/106042736_p0.jpg',
        Path('download') / '106042736_p0.jpg',
        headers={'Referer': 'https://app-api.pixiv.net/'},
    )


if __name__ == '__main__':
    main()

Thanks a lot! I'll look into that code. I assume it's compatible with concurrency? Or did you write that assuming I'll stop using concurrency?

It's compatible with concurrency (very generic, actually).

I would like to report back on how your (@Xdynix) decorated retry()/download() functions have been going.

They have been working amazingly! They have been implemented with concurrency (the initial issue of images being saved out of order was not concurrency's fault; instead, I had misindented a list reverse method in my old code):

My implementation:

executor.submit(
    download(
        url=url,
        path=f'{saved_images_path}/{media_file_name}',
        headers={'Referer': 'https://app-api.pixiv.net/'},
    )
)  # Saving images through concurrency, no rate limits

I just wanted to know if you had any error handling for deleted illustrations in the download() function.
This function was fed the following Pixiv link: https://www.pixiv.net/en/artworks/91184135

An exception would occur from response.raise_for_status():

Exception has occurred: HTTPError
404 Client Error: Not Found for url: https://i.pximg.net/img-original/img/2021/10/02/20/57/18/91184135_p0.jpg
...
...
...
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://i.pximg.net/img-original/img/2021/10/02/20/57/18/91184135_p0.jpg

I'm brainstorming error handling on my side too, like maybe a try and except that returns something, and checking whether that variable equals said something after calling the function. It sounds very inelegant, so if an idea comes to mind, please let me know!

If the error encountered is HTTPError 404, then I think we can ignore it (try ... except ... pass ...), because that's expected: the illustration was made private. We can also check the response of illust_detail(); if it is a private illustration, skip downloading it.

For other exceptions, we can write them to a log file. They are unexpected and need to be root-caused. When you submit a task to the executor through submit(), it returns a future object. You can check the future object to see whether the task failed at the end (example).
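
The visibility check mentioned above might look like this (a sketch; it assumes the illust_detail() response exposes a visible flag and an error key on failure, so verify both against your own data):

# Sketch: skip works that were deleted or made private before downloading.
# Assumes illust_detail()'s response has illust.visible and an 'error' key
# on failure; verify both against your own data.
detail = api.illust_detail(illust_id)
if detail.get('error') or not detail.illust.visible:
    print(f'Skipping {illust_id}: deleted or private')  # expected, not an error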

If the error encountered is HTTPError 404, then I think we can ignore it

So comment out response.raise_for_status(). But the download() function will still generate an image at the end, albeit a corrupted one. Is there a way I can break out of this function if the response is 404, AND have the future object be read as a failed task? Will a simple return do it?

Or maybe, don't bother reading the object as failed, since these tasks are supposed to process over the course of several iterations, and return something like Failed?

How would I read that result? I think it was future_obj.result()?

I just noticed how you submit the task.

executor.submit(
    download(
            url=url,
            path=f'{saved_images_path}/{media_file_name}',
            headers={'Referer': 'https://app-api.pixiv.net/'},
    )
)

This is not doing the concurrent stuff (and that's why the order you saw is correct). You call download() immediately and wait for the illustration to finish downloading; download() returns None, and then that None is submitted to the executor (and ignored). That's also why you're bothered by the exception thrown by download(). If you were using the executor correctly, you would most likely not notice these exceptions.

To use the executor, you need to write something like:

url_futures = {}

with Executor(...) as executor:
    for url in urls:
        future = executor.submit(download, url=url, path='foobar.jpg', headers={...})
        url_futures[url] = future

for url, future in url_futures.items():
    ex = future.exception()
    if ex is not None:
        print('Failed to download:', url)

Otherwise, you need to write something like:

for url in urls:
    try:
        download(url=url, path='foobar.jpg', headers={...})
    except Exception as ex:
        print('Failed to download:', url)

For your use case I recommend the latter one.

To check the exception, you can use if isinstance(ex, requests.HTTPError) and ex.response.status_code == 404:.
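
Put together with the loop above, that might look like this (a sketch):

# Sketch: treat 404 as an expected skip and log everything else for follow-up.
for url in urls:
    try:
        download(url=url, path='foobar.jpg',
                 headers={'Referer': 'https://app-api.pixiv.net/'})
    except requests.HTTPError as ex:
        if ex.response.status_code == 404:
            continue  # illustration deleted or made private; expected
        print('Failed to download:', url, ex)
    except Exception as ex:
        print('Failed to download:', url, ex)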

Oof, that's embarrassing. So no concurrency for me I suppose.

At least the retry decorator will work.

Thanks.

Otherwise, you need to write something like:

for url in urls:
    try:
        download(url=url, path='foobar.jpg', headers={...})
    except Exception as ex:
        print('Failed to download:', url)

So I tried out something like this solution, and I'm confused about what to do with response.raise_for_status(). If I leave it as is, the error is raised inside the function instead of being handled in the try/except. If I comment it out, no exception is raised and it just saves the "image". Is this a VSC problem?

When we get a response, we need to check its status code (2xx indicates success). raise_for_status() is a handy way of doing so: it raises an exception if the request failed, so that download() can quit early and skip the remaining steps (writing the content to a file). It somewhat acts like an early return (if not 200 <= response.status_code < 300: return), but exception handling with try-except is more semantically appropriate.

The exception then keeps bubbling up, leaves download(), and is caught by the try-except around download().
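
In other words, raise_for_status() is roughly equivalent to this hand-written check (a sketch, ignoring redirects and the 4xx/5xx distinction):

# Roughly the manual check that raise_for_status() replaces:
if not 200 <= response.status_code < 300:
    raise requests.HTTPError(
        f'{response.status_code} Error for url: {response.url}',
        response=response,
    )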

So I tried running this from my .bat file, and it does in fact quit early and print that message in the except.

But running it in VSC opens up the file where I stored the two functions and shows the error instead.

image
Not sure if there's a setting in VSC to handle this.

Hmm, I don't use VSC so I'm not sure about this. It looks like it's running in debug mode or something similar. Maybe check your run configuration.

A Stack Overflow answer: https://stackoverflow.com/questions/54519728/how-to-prevent-visual-studio-code-python-debugger-from-stopping-on-handled-excep

Yep that did it, you're a lifesaver.

@Xdynix @upbit Hey, today I noticed that the download function you provided isn't working anymore. On my end, I get the following error. Have you experienced this?

File "C:\Users\Saki\Pixiv\utilities\helper.py", line 84, in download    response.raise_for_status()
  File "C:\Users\Saki\Pixiv\venv\lib\site-packages\requests\models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://i.pximg.net/img-original/img/2024/08/23/00/14/23/121741319_p0.jpg

I haven't downloaded Pixiv images from a script for a while, so I hadn't observed it.

403 Client Error: Forbidden for url: https://i.pximg.net/...

My guess is that the server now requires some additional headers that we don't send, so the image server recognized the program as a bot and refused the request.

You can try adding different headers and see which one bypasses the check.

Edit: Well, per my test the previous header still works. Have you changed your invocation code? It should have headers={"Referer": "https://app-api.pixiv.net/"} when calling download().

$ http https://i.pximg.net/img-original/img/2024/08/23/00/14/23/121741319_p0.jpg 'Referer:https://app-api.pixiv.net/'
HTTP/1.1 200 OK
# ...

Edit: Well, per my test the previous header still works. Have you changed your invocation code?

The code has been unchanged for months. Any suggestions? I still have access to the site through my browser, so I'm not IP-blocked.

Any suggestions?

I just confirmed: if the User-Agent header is python-requests (the default), the request will be blocked. So we need to provide another, fake value. Adding "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36" to headers should work. You can also replace it with the UA from your browser.
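
So the headers passed to download() would become something like this (a sketch; swap in any current browser UA string):

# Sketch: a browser-like User-Agent alongside the required Referer.
headers = {
    'Referer': 'https://app-api.pixiv.net/',
    'User-Agent': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36'
    ),
}
download(url, path, headers=headers)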

Add "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36" to headers should work. You can also replace it with UA from your browser.

Alright, we're back in business. Thanks for your continued help, brother.