upbit/pixivpy

How to Check Rate-Limits?

Closed this issue · 13 comments

Is there a way I can check rate limits, and pause/wait if I'm getting too close?

For example in,

for i, illust_id in enumerate(bookmarks_id_list):
    json_result = api.illust_detail(illust_id)

    illust = json_result.illust
    if illust.meta_single_page != {}:
        url = illust.meta_single_page.original_image_url
        filename = os.path.basename(url)
        api.download(url, path='Images', name=filename)

    else:
        for meta_page in illust.meta_pages:
            url = meta_page.image_urls.original
            filename = os.path.basename(url)
            api.download(url, path='Images', name=filename)
    print(f"{i} / {total_bookmarks}", end='\r')
    time.sleep(2)

I added a time.sleep(2) to sleep, but I'm still getting rate limited. Any suggestions?

Pixiv does not disclose any API specifications because they are not meant to be used publicly. We can only work around rate limits by guesswork, such as trying to sleep longer. If you're already restricted, it's better to wait a while for the quota to recover/be released from the greylist/etc.

We can only work around rate limits by guesswork, such as trying to sleep longer

Do you have any reliable suggestions? Or is there a particular number of calls you can make before you ideally should sleep?

I don't have a number either. My scripts aren't sleeping anywhere, but never hit the rate limit because I rarely run them. You can only try gradually increasing sleep time until it works.
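One way to "gradually increase sleep time until it works" without hand-tuning is a small retry helper with exponential backoff. This is only a sketch: pixivpy does not document which errors signal rate limiting, so here any exception triggers a retry; in practice you might catch pixivpy's `PixivError` or inspect `json_result.error` instead.

```python
import time


def call_with_backoff(fn, *args, max_tries=5, base_delay=2.0, **kwargs):
    """Call fn, retrying with exponentially growing sleeps on failure."""
    for attempt in range(max_tries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_tries - 1:
                raise  # Out of retries; let the caller see the error.
            # Sleep base_delay, 2*base_delay, 4*base_delay, ... seconds.
            time.sleep(base_delay * (2 ** attempt))
```

Usage would be e.g. `json_result = call_with_backoff(api.illust_detail, illust_id)` in place of the direct call.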

upbit commented

api.illust_detail is the main cause of the rate limiting here. In the Pixiv app this call is not made very often, since most browsing happens on list pages. Therefore, it is recommended to directly use the bookmarks list interface to obtain the URLs of these illusts and save them.

In addition, api.download is an independent request path that goes straight to the CDN (no token is required). If you have multiple proxies, you can refer to the Download code and change it to download in parallel.
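A hedged sketch of what "parallel downloading through multiple proxies" could look like: round-robin the URLs across several client objects via a thread pool. The clients below are stand-ins; with pixivpy you would presumably build one `AppPixivAPI` per proxy (the constructor forwards requests kwargs such as `proxies`, per the README), which is an assumption to verify against your setup.

```python
from itertools import cycle
from concurrent.futures import ThreadPoolExecutor


def round_robin_download(clients, urls, path='Images'):
    """Spread CDN downloads across several API clients.

    Each client could be an AppPixivAPI built with a different proxy,
    e.g. AppPixivAPI(proxies={'https': 'http://127.0.0.1:1087'})
    (hypothetical address). api.download() hits the CDN directly and
    needs no token, so the clients don't need to be authenticated.
    """
    pool = cycle(clients)  # Hand out clients in round-robin order.
    with ThreadPoolExecutor(max_workers=len(clients)) as executor:
        futures = [
            executor.submit(next(pool).download, url, path=path)
            for url in urls
        ]
    # Leaving the with-block waits for all submitted downloads.
    return futures
```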

Therefore, it is recommended to directly use the bookmarks list interface to obtain the URLs of these illusts and save them.

Could you elaborate on this? Do you mean use the actual website and obtain the URLs from there, such as through scraping? Or just use the initial list I get through user_bookmarks_illust? Getting the IDs doesn't seem to rate-limit me as quickly as api.download does. What I use is:

bookmarks_id_list = []
bookmarks = api.user_bookmarks_illust(user_id)

while True:
    for illustration in bookmarks.illusts:
        bookmarks_id_list.append(illustration.id)
        print(f"There are {len(bookmarks_id_list)} bookmarks. On Pixiv it says '{total_bookmarks}'.", end='\r')
    if 'next_url' in bookmarks and bookmarks['next_url']:
        next_qs = api.parse_qs(bookmarks['next_url'])
        bookmarks = api.user_bookmarks_illust(**next_qs)
        time.sleep(2)
    else:
        break

In addition, api.download is an independent request path that goes straight to the CDN (no token is required). If you have multiple proxies, you can refer to the Download code and change it to download in parallel.

So assuming CDN means "Content Delivery Network", is this something I can do aside from sleeping for longer? When you say the Download code, do you mean the example in the README.md file, or a different file? I'd like a bit more direction on this, thank you!

import os
import time
from concurrent.futures import ThreadPoolExecutor

from pixivpy3 import AppPixivAPI


def main():
    api = AppPixivAPI()
    api.auth(refresh_token='REFRESH-TOKEN')

    with ThreadPoolExecutor(max_workers=5) as executor:
        qs = {'user_id': 27517}
        while qs:
            json_result = api.user_bookmarks_illust(**qs)
            qs = api.parse_qs(json_result.next_url)
            for illust in json_result.illusts:
                if illust.type == 'ugoira':
                    img_urls = []  # Dealing with ugoira is hard, ignore at this time.
                elif illust.page_count == 1:
                    img_urls = [illust.meta_single_page.original_image_url]
                else:
                    img_urls = [
                        page.image_urls.original
                        for page in illust.meta_pages
                    ]
                for url in img_urls:
                    filename = os.path.basename(url)
                    executor.submit(api.download, url, path='Images', name=filename)
            time.sleep(10)


if __name__ == '__main__':
    main()

The rate of the API calls in the while loop (api.user_bookmarks_illust()) is limited by time.sleep(), while the download calls (api.download()) are executed in other threads through the executor and are not rate limited.

As the response of api.user_bookmarks_illust() already contains all the URLs we need, we don't have to call api.illust_detail() on each illustration separately.

Thank you @Xdynix. This appears to be part of your personal project to download images from users.

A lot of this is going over my head, but it seems like most of the magic is in ThreadPoolExecutor and executor.submit(). A big part of my personal project is inserting the metadata of every illust into SQLite tables, with the following schema.

posts_table
post_id
user_id
post_url
post_title
post_caption
post_tags (in one string separated by commas), keep original Japanese
pixiv_status
date_downloaded


media_table
media_id
media_type
media_url
media_file_name
md5_hashsum
post_id

users_table
user_id
user_name
user_account
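For concreteness, the tables above might be created like this. Column types and constraints are my guesses, not part of the original schema, and `open_db` is a hypothetical helper name.

```python
import sqlite3

# Sketch of the schema described above; types/constraints are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS users_table (
    user_id      INTEGER PRIMARY KEY,
    user_name    TEXT,
    user_account TEXT
);
CREATE TABLE IF NOT EXISTS posts_table (
    post_id         INTEGER PRIMARY KEY,
    user_id         INTEGER REFERENCES users_table(user_id),
    post_url        TEXT,
    post_title      TEXT,
    post_caption    TEXT,
    post_tags       TEXT,  -- comma-separated, original Japanese kept
    pixiv_status    TEXT,
    date_downloaded TEXT
);
CREATE TABLE IF NOT EXISTS media_table (
    media_id        INTEGER PRIMARY KEY,
    media_type      TEXT,
    media_url       TEXT,
    media_file_name TEXT,
    md5_hashsum     TEXT,
    post_id         INTEGER REFERENCES posts_table(post_id)
);
"""


def open_db(path=':memory:'):
    """Open (or create) the database and ensure all tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```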

I assume all the data that's in illust_detail is probably in user_bookmarks_illust then, meaning I won't need illust_detail anymore. I also suppose that, to download only images I haven't processed yet, I could break as soon as I come across an illust_id that's already in my SQLite table. Whatever the case, thanks!

Regarding metadata: APIs such as user_bookmarks_illust that return multiple illustrations aren't guaranteed to contain all the information; I've observed this on older APIs. It's better to compare their return value with that of illust_detail. If the fields you want are indeed missing, you'll have to call illust_detail to fetch them.
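A quick way to do that comparison is to diff the key sets of the two responses. The helper below is a sketch; it works on plain dicts and should also work on the dict-like objects pixivpy returns, though that is an assumption to verify.

```python
def missing_fields(list_item, detail_item):
    """Field names present in an illust_detail response but absent
    from the matching user_bookmarks_illust entry."""
    # Iterating a dict(-like) yields its keys, so set() gives key sets.
    return sorted(set(detail_item) - set(list_item))
```

For example, comparing one entry from `json_result.illusts` against `api.illust_detail(illust_id).illust` would show exactly which fields (if any) the list API drops.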

Right, I forgot. If I want metadata on a per-image basis for illustrations that contain multiple images, I would have to refactor the code you gave me to include data from illust_detail, right?

Not sure how I'd incorporate ThreadPoolExecutor / executor.submit with that, so I guess I'll investigate and experiment with that.

I might have to refactor more though, tbh, because I'm trying to download the images in the order they were liked, from earliest to most recent, just so the timestamps are intuitive in my file system. I suppose I could append each json_result to a larger list, reverse that list, and then iterate through each illust.
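The "append everything, then reverse" idea can be sketched like this. Here `fetch` and `parse_qs` are stand-ins for `api.user_bookmarks_illust` and `api.parse_qs` (which, as far as I can tell, returns a falsy value once `next_url` runs out; verify against your pixivpy version).

```python
def fetch_all_pages(fetch, parse_qs, first_qs):
    """Collect every illust across all pages before downloading,
    so the full list can be reversed afterwards."""
    items = []
    qs = first_qs
    while qs:
        result = fetch(**qs)          # e.g. api.user_bookmarks_illust(**qs)
        items.extend(result['illusts'])
        qs = parse_qs(result['next_url'])  # falsy when no more pages
    return items
```

Then `for illust in reversed(fetch_all_pages(...))` downloads the earliest-liked illust first.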


As an aside, I just noticed that,

json_result = api.user_bookmarks_illust(**qs)
qs = api.parse_qs(json_result.next_url)

it passes qs, carried over from the previous while-loop iteration, as parameters. I'll remember this, since it seems to be the intended way to paginate.

The code I posted before can already deal with downloading multi-page illustrations. You can just print (or pprint) the json_result to see what user_bookmarks_illust already gives you and decide whether you want another illust_detail call.

Just out of curiosity, why does the download order matter, since you already have the metadata in the database?

decide whether you want another illust_detail call.

Are there any indicators in json_result to see if there's multiple images, and then process that illust_id even further?

Just out of curiosity, why does the download order matter, since you already have the metadata in the database?

I have pretty bad diagnosed OCD, so it bothered me when my Twitter bot would download my most recent likes (say, 10 likes) and the one I liked last would appear first in my file system. I know it doesn't make sense at all, and from my perspective it's even worse since I have to put in extra work to make things all orderly and clean. I hope you can understand.

json_result = api.user_bookmarks_illust(**qs)
from pprint import pprint
pprint(json_result)
{'illusts': [{'caption': '',
              'create_date': '2021-04-09T00:00:02+09:00',
              'height': 1000,
              'id': 89024820,
              'illust_ai_type': 0,
              'illust_book_style': 0,
              'image_urls': {'large': 'https://i.pximg.net/c/600x1200_90/img-master/img/2021/04/09/00/00/02/89024820_p0_master1200.jpg',
                             'medium': 'https://i.pximg.net/c/540x540_70/img-master/img/2021/04/09/00/00/02/89024820_p0_master1200.jpg',
                             'square_medium': 'https://i.pximg.net/c/360x360_70/img-master/img/2021/04/09/00/00/02/89024820_p0_square1200.jpg'},
              'is_bookmarked': False,
              'is_muted': False,
              'meta_pages': [],
              'meta_single_page': {'original_image_url': 'https://i.pximg.net/img-original/img/2021/04/09/00/00/02/89024820_p0.png'},
              'page_count': 1,
              'restrict': 0,
              'sanity_level': 2,
              'series': None,
              'tags': [{'name': 'バーチャルYouTuber',
                        'translated_name': 'virtual youtuber'},
                       {'name': 'にじさんじ', 'translated_name': 'Nijisanji'},
                       {'name': 'リゼ・ヘルエスタ', 'translated_name': 'Lize Helesta'},
                       {'name': 'バーチャルYouTuber50000users入り',
                        'translated_name': None},
                       {'name': '白髪ロング', 'translated_name': 'long white hair'}],
              'title': 'リゼ様',
              'tools': ['SAI'],
              'total_bookmarks': 56423,
              'total_view': 328065,
              'type': 'illust',
              'user': {'account': '23233',
                       'id': 882569,
                       'is_followed': False,
                       'name': '赤倉',
                       'profile_image_urls': {'medium': 'https://i.pximg.net/user-profile/img/2017/05/26/14/33/28/12608711_5883fc4f50f7ad7079e25ba16f5459c8_170.png'}},
              'visible': True,
              'width': 1000,
              'x_restrict': 0},
  # ...

You can see there is a page_count property.


Understandable. You can consider using os.utime() to set the modification date of a downloaded file to any date you want, and then sort your folder by modification date in descending order.
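A minimal sketch of that os.utime() trick; `set_file_mtime` is a hypothetical helper name. Timestamps such as illust.create_date ('2021-04-09T00:00:02+09:00' in the pprint output above) can be parsed with `datetime.fromisoformat()` and passed in as `when`.

```python
import os


def set_file_mtime(path, when):
    """Set the access and modification time of `path` to the
    (timezone-aware) datetime `when`."""
    ts = when.timestamp()
    os.utime(path, (ts, ts))  # (atime, mtime) in seconds since epoch
```

For example, calling it right after `api.download(...)` with the illust's create_date makes sort-by-date match the bookmark order.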

Thank you so much with both. I'll spend some time refactoring my code, while using the rate-limit avoidance tools you mentioned.