upbit/pixivpy

Unable to download all related illustrations with illust_related()

towakuka opened this issue · 8 comments

I would like to use illust_related() to download illustrations related to a certain illustration. However, when I execute the following code, the next_url retrieved the first time and the next_url retrieved the second time are almost the same, and I can only download a small portion of the related illustrations. How can I download all the related illustrations?

  • OS: Windows 10 Home 21H2
  • Python version: 3.7.9
  • pixivpy3 version: 3.7.1

from pixivpy3 import *
import urllib.parse as up

REFRESH_TOKEN = 'xxxxx'

aapi = AppPixivAPI()
aapi.auth(refresh_token=REFRESH_TOKEN)

# https://www.pixiv.net/artworks/98699730
res = aapi.illust_related('98699730')
print(up.unquote(res.next_url))

next_qs = aapi.parse_qs(res.next_url)
print(next_qs)

res = aapi.illust_related(**next_qs)
print(up.unquote(res.next_url))

The results are as follows.

https://app-api.pixiv.net/v2/illust/related?illust_id=98699730&filter=for_ios&seed_illust_ids[0]=98699730
&viewed[0]=66304207&viewed[1]=78519399&viewed[2]=90635714&viewed[3]=100141684&viewed[4]=97112634
&viewed[5]=97789918&viewed[6]=100162560&viewed[7]=96163314&viewed[8]=97490930&viewed[9]=95632136
&viewed[10]=62759610&viewed[11]=98840284&viewed[12]=98876790&viewed[13]=80461026&viewed[14]=99193150
&viewed[15]=76880757&viewed[16]=98376241&viewed[17]=61546752&viewed[18]=93903011&viewed[19]=99171188
&viewed[20]=94524475&viewed[21]=99370255&viewed[22]=87511157&viewed[23]=64346020&viewed[24]=94084919
&viewed[25]=99507076&viewed[26]=60600324&viewed[27]=98921186

{'illust_id': '98699730', 'filter': 'for_ios', 'seed_illust_ids': ['98699730'], 'viewed': ['98921186']}

https://app-api.pixiv.net/v2/illust/related?illust_id=98699730&filter=for_ios&seed_illust_ids[0]=98699730
&viewed[0]=66304207&viewed[1]=100162560&viewed[2]=100104444&viewed[3]=78519399&viewed[4]=97112634
&viewed[5]=97789918&viewed[6]=100141684&viewed[7]=95632136&viewed[8]=97490930&viewed[9]=62759610
&viewed[10]=94745733&viewed[11]=98840284&viewed[12]=98876790&viewed[13]=80461026&viewed[14]=99193150
&viewed[15]=76880757&viewed[16]=98376241&viewed[17]=61546752&viewed[18]=93903011&viewed[19]=99171188
&viewed[20]=94524475&viewed[21]=99370255&viewed[22]=87511157&viewed[23]=64346020&viewed[24]=94084919
&viewed[25]=92555703&viewed[26]=60600324&viewed[27]=98921186

If I extract the values of viewed[num] in next_url (and sort them), most of the illust_ids are duplicates.

# the values of viewed[num] of the first next_url
100141684
100162560
60600324
61546752
62759610
64346020
66304207
76880757
78519399
80461026
87511157
93903011
94084919
94524475
94745733
95632136
97112634
97490930
97789918
98376241
98840284
98876790
98921186
99171188
99193150
99370255
99507076
99928488

# the values of viewed[num] of the second next_url
100141684
100162560
60600324
61546752
62759610
64346020
66304207
76880757
78519399
80461026
87511157
88928934
93717572
93903011
94084919
94524475
95632136
97112634
97490930
98376241
98440469
98482730
98785502
98876790
98921186
99223979
99370255
99507076
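
The comparison above can be automated. The following is a minimal sketch that uses only the standard library; the helper name `viewed_ids` and the shortened URLs with made-up IDs are illustrative assumptions, not part of pixivpy.

```python
from typing import Set
from urllib.parse import parse_qsl, urlsplit


def viewed_ids(next_url: str) -> Set[str]:
    """Return the values carried by the viewed[N] parameters of a next_url."""
    query = urlsplit(next_url).query
    # parse_qsl keeps every key/value pair, so viewed[0], viewed[1], ... all survive
    return {value for key, value in parse_qsl(query) if key.startswith("viewed[")}


# Shortened example URLs with made-up IDs (assumption, for illustration only)
first = "https://app-api.pixiv.net/v2/illust/related?illust_id=1&viewed[0]=10&viewed[1]=20&viewed[2]=30"
second = "https://app-api.pixiv.net/v2/illust/related?illust_id=1&viewed[0]=20&viewed[1]=30&viewed[2]=40"

common = viewed_ids(first) & viewed_ids(second)
print(sorted(common, key=int))  # ['20', '30']
```

Passing two real next_url values instead of the shortened ones gives the number of duplicated IDs as `len(common)`.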

Currently, parse_qs parses array parameters like viewed[num] into a list that contains only the last value.

That looks like problematic behavior, doesn't it?

result_qs[key.split("[")[0]] = value

It probably needs a fix along these lines:

for key, value in up.parse_qs(query).items():
    # merge PHP-style array params such as seed_illust_ids[] into one list
    if "[" in key and key.endswith("]"):
        # keep the original order; the index inside the brackets is ignored
        key_, *_ = key.split("[")
        if key_ not in result_qs:
            result_qs[key_] = value
        elif isinstance(result_qs[key_], list):
            result_qs[key_].extend(value)
        else:
            # a scalar was already stored under this name, so it cannot be merged
            raise ValueError("cannot merge %r into a non-list value" % key)
    else:
        result_qs[key] = value[-1]
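
For trying the idea outside the library, here is a self-contained sketch of the same merge. The function name `parse_next_url` is hypothetical and is not pixivpy's own `parse_qs`; it only illustrates collecting `name[N]` keys into one list in their original order.

```python
from typing import Dict, List, Union
from urllib.parse import parse_qs, urlsplit


def parse_next_url(next_url: str) -> Dict[str, Union[str, List[str]]]:
    """Parse a next_url, merging PHP-style array keys (name[0], name[1], ...)."""
    result: Dict[str, Union[str, List[str]]] = {}
    query = urlsplit(next_url).query
    for key, values in parse_qs(query).items():
        if "[" in key and key.endswith("]"):
            base = key.split("[", 1)[0]
            # assumes no plain scalar key shares the same base name
            result.setdefault(base, []).extend(values)
        else:
            # scalar parameter: keep the last value if it was repeated
            result[key] = values[-1]
    return result


example = (
    "https://app-api.pixiv.net/v2/illust/related"
    "?illust_id=1&filter=for_ios&seed_illust_ids[0]=1&viewed[0]=10&viewed[1]=20"
)
print(parse_next_url(example))
```

With this shape, `viewed` comes back as a list of every ID in the URL instead of only the last one.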

Thank you for your prompt reply. I replaced the relevant part of aapi.py and ran the code in the first question again, and now 'viewed' in next_qs has multiple values, but 22 of the 28 viewed[num]'s were duplicates.

# first next_url
https://app-api.pixiv.net/v2/illust/related?illust_id=98699730&filter=for_ios&seed_illust_ids[0]=98699730
&viewed[0]=80691434&viewed[1]=97043375&viewed[2]=97974102&viewed[3]=100336008&viewed[4]=97112634
&viewed[5]=97490930&viewed[6]=93903011&viewed[7]=100274468&viewed[8]=97789918&viewed[9]=95908114
&viewed[10]=91462133&viewed[11]=98840284&viewed[12]=69925083&viewed[13]=98376241&viewed[14]=99193150
&viewed[15]=62759610&viewed[16]=95943995&viewed[17]=85294754&viewed[18]=99171188&viewed[19]=64346020
&viewed[20]=73188428&viewed[21]=99370255&viewed[22]=78552357&viewed[23]=81176933&viewed[24]=98921186
&viewed[25]=66304207&viewed[26]=84590312&viewed[27]=65293346

# next_qs
{'illust_id': '98699730', 'filter': 'for_ios', 'seed_illust_ids': ['98699730'], 'viewed': ['80691434',
'97043375', '97974102', '100336008', '97112634', '97490930', '93903011', '100274468', '97789918',
'95908114', '91462133', '98840284', '69925083', '98376241', '99193150', '62759610', '95943995',
'85294754', '99171188', '64346020', '73188428', '99370255', '78552357', '81176933', '98921186',
'66304207', '84590312', '65293346']}

# second next_url (22 of the 28 viewed[num]'s were duplicates)
https://app-api.pixiv.net/v2/illust/related?illust_id=98699730&filter=for_ios&seed_illust_ids[0]=98699730
&viewed[0]=80691434&viewed[1]=97043375&viewed[2]=98365209&viewed[3]=100336008&viewed[4]=68233822
&viewed[5]=97490930&viewed[6]=100274468&viewed[7]=99370255&viewed[8]=93903011&viewed[9]=91462133
&viewed[10]=98921186&viewed[11]=69925083&viewed[12]=98376241&viewed[13]=62759610&viewed[14]=98440469
&viewed[15]=95943995&viewed[16]=85294754&viewed[17]=98785502&viewed[18]=64346020&viewed[19]=73188428
&viewed[20]=98482730&viewed[21]=78552357&viewed[22]=81176933&viewed[23]=99223979&viewed[24]=66304207
&viewed[25]=84590312&viewed[26]=97112634&viewed[27]=65293346

The related works shown on the browser's artwork page do not appear to be designed for paging, so returning duplicates in requests made with next_url may simply be the intended behavior.

@upbit How about this?

upbit commented
    def illust_related(
        self,
        illust_id: int | str,
        filter: _FILTER = "for_ios",
        seed_illust_ids: int | str | list[str] | None = None,
        offset: int | str | None = None,
        viewed: list[str] | None = None,
        req_auth: bool = True,
    ) -> ParsedJson:
        url = "%s/v2/illust/related" % self.hosts
        params: dict[str, Any] = {
            "illust_id": illust_id,
            "filter": filter,
            "offset": offset,
        }
        if isinstance(seed_illust_ids, str):
            params["seed_illust_ids[]"] = [seed_illust_ids]
        elif isinstance(seed_illust_ids, list):
            params["seed_illust_ids[]"] = seed_illust_ids
        r = self.no_auth_requests_call("GET", url, params=params, req_auth=req_auth)
        return self.parse_result(r)

Sorry, this looks like a bug: viewed is not passed into params the way seed_illust_ids[] is. Try adding:

        elif isinstance(seed_illust_ids, list):
            params["seed_illust_ids[]"] = seed_illust_ids
+       if isinstance(viewed, list):
+           params["viewed[]"] = viewed
        r = self.no_auth_requests_call("GET", url, params=params, req_auth=req_auth)
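
To see what the patched params would look like on the wire, the sketch below builds the same dictionary and serializes it with the standard library. The helper `related_params` is hypothetical, and using `urlencode(..., doseq=True)` as a stand-in for how the HTTP client encodes list values is an assumption; the exact encoding done by the library's request layer may differ.

```python
from urllib.parse import unquote, urlencode


def related_params(illust_id, seed_illust_ids=None, viewed=None, filter="for_ios"):
    """Build query params the way the patched illust_related() would."""
    params = {"illust_id": illust_id, "filter": filter}
    if isinstance(seed_illust_ids, str):
        seed_illust_ids = [seed_illust_ids]
    if isinstance(seed_illust_ids, list):
        params["seed_illust_ids[]"] = seed_illust_ids
    if isinstance(viewed, list):
        # the line the patch adds: forward every viewed ID
        params["viewed[]"] = viewed
    return params


# Made-up IDs, for illustration only
query = urlencode(related_params("1", ["1"], ["10", "20"]), doseq=True)
print(unquote(query))
# illust_id=1&filter=for_ios&seed_illust_ids[]=1&viewed[]=10&viewed[]=20
```

Each list element becomes its own `viewed[]=...` pair, so every ID from the previous next_url would be sent, not only the last one.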

Thank you for your reply. I have applied your correction and was able to download a large number of related illustrations. However, there still seem to be many duplicates among the viewed IDs in next_url.

The bar graph below shows the number of related illustrations at each step of the next_url retrieval process, which was repeated about 100 times (a post consisting of multiple images is counted as one).

If all duplicate illustrations are removed from here, the bar graph is as follows.

I have found that I can download a sufficient number of illustrations if I set the number of repetitions to about 20, so I will continue to operate with the number of repetitions set to 20 from now on.

Thank you very much for your time.

upbit commented

I'm not sure whether the client merges the viewed parameter. Did you try passing all of the previously returned viewed IDs when paginating?

Also, does the number of repetitions refer to the offset=20 parameter?

The first source code is reproduced below.

from pixivpy3 import *
import urllib.parse as up

REFRESH_TOKEN = 'xxxxx'

aapi = AppPixivAPI()
aapi.auth(refresh_token=REFRESH_TOKEN)

# https://www.pixiv.net/artworks/98699730
res = aapi.illust_related('98699730')
print(up.unquote(res.next_url))

next_qs = aapi.parse_qs(res.next_url)
print(next_qs)

res = aapi.illust_related(**next_qs)
print(up.unquote(res.next_url))

I repeated getting the next_url and running aapi.illust_related(**next_qs) about 100 times, so I'm not passing all the previously returned viewed ids. Only the viewed ids returned in each step are passed.

I chose the number of iterations solely from the bar chart above; it has nothing to do with the offset=20 parameter.
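
For reference, the suggestion to pass all previously returned viewed IDs could be sketched as below. This is an untested idea, not confirmed behavior of the pixiv API: whether the server returns fewer duplicates with a cumulative viewed list is unknown. The function `collect_related` and its `fetch` and `parse_next` arguments are hypothetical; in practice they would be `aapi.illust_related` and `aapi.parse_qs`, and the sketch assumes each response exposes `illusts` (items with an `id`) and `next_url`.

```python
from typing import Any, Callable, Dict, List


def collect_related(
    fetch: Callable[..., Any],
    parse_next: Callable[[str], Dict[str, Any]],
    illust_id: str,
    max_pages: int = 20,
) -> List[Any]:
    """Follow next_url up to max_pages times, accumulating viewed IDs across pages."""
    seen_viewed: List[str] = []   # every viewed ID returned so far, in order
    illusts: Dict[Any, Any] = {}  # de-duplicated results keyed by illust id
    kwargs: Dict[str, Any] = {"illust_id": illust_id}
    for _ in range(max_pages):
        res = fetch(**kwargs)
        for illust in res.illusts:
            illusts.setdefault(illust.id, illust)
        if not res.next_url:
            break
        kwargs = parse_next(res.next_url)
        for vid in kwargs.get("viewed", []):
            if vid not in seen_viewed:
                seen_viewed.append(vid)
        # send the cumulative list instead of only the latest page's IDs
        kwargs["viewed"] = list(seen_viewed)
    return list(illusts.values())
```

A call such as `collect_related(aapi.illust_related, aapi.parse_qs, '98699730')` would then stop after 20 pages or when next_url is empty, whichever comes first. Note that the query string grows with every page under this approach.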