Invalid MDList Regex

Question

striker4150 opened this issue 5 years ago · 9 comments

The "manga" regex for MDList scraping is not working for me. After a bit of debugging, I found out that it is because the href attribute comes before the class attribute in the scraped HTML.

Example tag:
<a title="Beware of the Brothers!" href="/title/45114/beware-of-the-brothers" class="ml-1 manga_title text-truncate">

Regex used:
/<a[^>]*class=["'][^"']+manga_title[^"']+["'][^>]*href=["']\/title\/(\d+)\/[^>]*["'][^>]*>/gmi
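One way to make a pattern like this indifferent to attribute order is to check `class` and `href` with two lookaheads instead of one fixed sequence. The following is a minimal sketch, not the library's actual fix; `MANGA_LINK` and `extractMangaIds` are hypothetical names introduced here for illustration.

```javascript
// Sketch: match manga_title anchors regardless of attribute order.
// Both lookaheads start right after "<a", so class and href may appear
// in either order (and on separate lines, since [^>] also matches \r\n).
const MANGA_LINK =
  /<a\b(?=[^>]*\bclass=["'][^"']*\bmanga_title\b[^"']*["'])(?=[^>]*\bhref=["']\/title\/(\d+)\/)[^>]*>/gi;

// Hypothetical helper: return the numeric title IDs found in an HTML string.
function extractMangaIds(html) {
  return Array.from(html.matchAll(MANGA_LINK), (m) => m[1]);
}

// The href-first tag from this issue, and a class-first tag.
const hrefFirst =
  '<a title="Beware of the Brothers!" href="/title/45114/beware-of-the-brothers" class="ml-1 manga_title text-truncate">';
const classFirst =
  '<a class="ml-1 manga_title text-truncate" title="Official Test Manga" href="/title/47/official-test-manga">';

console.log(extractMangaIds(hrefFirst + classFirst)); // [ '45114', '47' ]
```

This still depends on the class name and the `/title/<id>/` URL shape, so it only removes the ordering assumption, not the fragility of scraping with regex.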

Answer 1 · 2020-06-15T23:34:05.000Z

Personally, I recommend using a proper HTML parser such as Cheerio or jQuery instead, since any change that Mangadex makes to its document structure will break the regex.
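The point of a parser is that it selects on attributes rather than on their textual order. As a dependency-free approximation of that idea (not a substitute for Cheerio, and not the library's code), the sketch below first finds each opening `<a>` tag, then reads its attributes into an object before testing them. `parseAttributes` and `findMangaIds` are hypothetical helpers, and the code assumes attribute values are quoted and contain no `>`.

```javascript
// Hypothetical helper: read the quoted attributes of one opening tag
// into a plain object, e.g. { title: "...", href: "...", class: "..." }.
function parseAttributes(tag) {
  const attrs = {};
  const ATTR = /([^\s"'<>\/=]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
  for (const m of tag.matchAll(ATTR)) {
    attrs[m[1].toLowerCase()] = m[2] !== undefined ? m[2] : m[3];
  }
  return attrs;
}

// Hypothetical helper: collect title IDs from anchors whose class list
// contains "manga_title", whatever order the attributes come in.
function findMangaIds(html) {
  const ids = [];
  for (const [tag] of html.matchAll(/<a\s[^>]*>/gi)) {
    const attrs = parseAttributes(tag);
    const classes = (attrs.class || "").split(/\s+/);
    const href = /^\/title\/(\d+)\//.exec(attrs.href || "");
    if (classes.includes("manga_title") && href) ids.push(href[1]);
  }
  return ids;
}

const sample =
  `<a title="A Returner's Magic Should Be Special"\r\n href="/title/31551/a-returner-s-magic-should-be-special"\r\n class="ml-1 manga_title text-truncate">` +
  '<a class="ml-1 manga_title text-truncate" title="Official Test Manga" href="/title/47/official-test-manga">';

console.log(findMangaIds(sample)); // [ '31551', '47' ]
```

With a real parser, both regexes would be replaced by a single selector on the `manga_title` class.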

Answer 2 · 2020-06-16T00:02:20.000Z

Can you specify what version you are using? I cannot replicate the issue with package version 3.8.0 and node version 12.16.3. In addition, NodeJS, cURL, and Firefox all return the tag with the class attribute before the href attribute, which is the order the regex expects.

However, I do agree that regex is not a good solution for HTML scraping, and I have planned on refactoring the code for Mangadex v5 since that will most likely break existing regex.

Answer 3 · 2020-06-16T00:43:54.000Z

I'm using package version 3.8.0 and node version 12.16.2. When I visit the site using my Firefox install, I get the class attribute after the href attribute, as shown in my example.

Answer 4 · 2020-06-16T00:49:37.000Z

I also tested the Util.getMatches() method in mdlist.js using REPL, and removed href from the regex.
Running Util.getMatches(site, {"manga": /<a[^>]*class=["'][^"']+manga_title[^"']+["'][^>]*>/gmi}).then((matches) => {console.log(matches.manga)}); gave me the output:

'<a title="I Fell in Love, so I Tried Livestreaming."\r\n' +
    '                           href="/title/41973/i-fell-in-love-so-i-tried-livestreaming"\r\n' +
    '                           class="ml-1 manga_title text-truncate">',
  '<a title="Beware of the Brothers!"\r\n' +
    '                           href="/title/45114/beware-of-the-brothers"\r\n' +
    '                           class="ml-1 manga_title text-truncate">',
  `<a title="A Returner's Magic Should Be Special"\r\n` +
    '                           href="/title/31551/a-returner-s-magic-should-be-special"\r\n' +
    '                           class="ml-1 manga_title text-truncate">',

...

Answer 5 · 2020-06-16T01:52:10.000Z

Once again, class returns before title for me.

Your Script with REPL:
<a class="ml-1 manga_title text-truncate" title="Official Test Manga"\r\n href="/title/47/official-test-manga">
Chrome:
<a class="ml-1 manga_title text-truncate" title="Official Test Manga" href="/title/47/official-test-manga">Official Test Manga</a>
Firefox:
<a class="ml-1 manga_title text-truncate" title="Official Test Manga" href="/title/47/official-test-manga">Official Test Manga</a>
cURL:
<a class="ml-1 manga_title text-truncate" title="Official Test Manga" href="/title/47/official-test-manga">Official Test Manga</a>

I have tried this with different MDLists, different NodeJS versions, different IDEs, different operating systems, and even with a VPN in case it was server-specific, but I always got a successful response. I would like to know what is happening before I add a special edge case or make any other change to the code.

Is there a specific error you are getting, or does it just ask if you have permission? And are you using an agent (i.e., logging in)?

Answer 6 · 2020-06-16T03:54:53.000Z

I'm not getting an error, but it's incorrectly telling me that I have 0 manga in my MDList. It should have more than 200. It's retrieving the id and banner correctly. However, the manga and pages properties are wrong. manga is empty (due to the regex not matching any links), and pages is 7 (40 mangas per page) even though the site shows 3 pages (100 per page). I'm fine with the number of pages being off, since at least the count is correct. However, I'm not getting any errors.
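The two page counts are consistent with each other once the per-page size is taken into account. A minimal sketch, assuming the page count is simply the total divided by the page size and rounded up (the library's actual calculation is not shown in this thread); `pageCount` and the total of 250 are hypothetical.

```javascript
// Hypothetical helper: number of pages needed for a list of a given size.
function pageCount(totalManga, perPage) {
  return Math.ceil(totalManga / perPage);
}

// Hypothetical total; any value from 241 to 280 gives the same two results.
const total = 250;

console.log(pageCount(total, 40));  // 7
console.log(pageCount(total, 100)); // 3
```

So 7 pages at 40 per page and 3 pages at 100 per page both describe a list of more than 200 titles; the discrepancy is in the assumed page size, not in the list itself.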

On a side note, I'm apparently actually using package version 3.7.0. I accidentally ran npm show instead of npm list. However, this doesn't seem like it would affect the regex, as the changes seem to be unrelated.

Lines I am running in REPL (NODE_OPTIONS="--experimental-repl-await"):

const api = require("mangadex-full-api");
await api.agent.cacheLogin("./login_cache", process.env.MD_USERNAME, process.env.MD_PASSWORD);
await api.agent.fillUser();
let manga_list = new api.MDList();
await manga_list.fillByUser(api.agent.user);

The filled MDList object:

MDList {
  id: 'MY_ID_HERE',
  manga: [],
  banner: [
    'https://mangadex.org/images/lists/default.dark.png',
    'https://mangadex.org/images/lists/default.dark.png'
  ],
  pages: 7
}
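The object above is internally inconsistent: it reports seven pages but no manga, which is the symptom of the pattern matching nothing rather than of an empty list. A caller could flag that state with a small check. This is a hypothetical helper, not part of mangadex-full-api, and it assumes only the `manga` and `pages` properties shown above.

```javascript
// Hypothetical check: a list that claims to have pages but yielded no
// manga was most likely scraped with a pattern that did not match.
function looksMisparsed(list) {
  return Array.isArray(list.manga) && list.manga.length === 0 && list.pages > 0;
}

console.log(looksMisparsed({ id: "MY_ID_HERE", manga: [], pages: 7 })); // true
```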

Answer 7 · 2020-06-16T03:59:21.000Z

I dumped the results of Util.getHTTPS(url) to a file:

Answer 8 · 2020-06-16T04:43:07.000Z

Alright, I think I have found the problem. The program assumes the website is using the "Detailed" setting for viewing the list. Since you are using the "Simple List" setting, the site changes not only the attribute order but also the number of manga per page. This isn't a problem when an agent is not in use (since the program naively uses the default settings), but because you are using an account with custom settings, the problems arise.

A temporary fix is to change the settings on the agent account to use the default "Detailed List" setting. However, after looking at the code for this object, I've realized it is especially broken, so I'm going to redo it within the next two days for the next version. As for the regex elsewhere, I will redo all of it after the release of Mangadex v5 (or sooner).

Thank you.

Answer 9 · 2020-06-16T04:54:54.000Z

I can't believe I missed that! Thanks for catching it. Once I changed from simple view to detailed, it worked. Good luck with Mangadex v5!