fritzmg/contao-sharebuttons

crawler does not skip links of the sharebuttons

Closed this issue · 14 comments

Hello Fritz,
Here is grashalm again. If I have understood your revision in version 2.1.4 correctly, the Contao crawler should now skip the links of the sharebuttons when it checks for broken links. Is that correct?
Since the sharebuttons appear as a module on every page of my site, the crawler still finds around 65,000 links with every search. If I switch off the module in the layout beforehand, there are only around 8,700 links. Is that still a bug, or have I misinterpreted your revision in version 2.1.4?

That should be the case, yes. Did you check the HTML output?

Sorry, I am an absolute beginner in these matters. What does that mean: "...check the html output"? Where can I check it and what should I see then?
When the crawl had finished, the backend showed the number of checked links (about 65,000)...

Where can I check it and what should I see then?

In your browser (right click - view source). The links that should not be checked by the broken link checker should have the data-skip-broken-link-checker attribute. If this is not the case for the share links, maybe you are using a custom template that you have not updated yet?
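
Checking the source by eye gets tedious on a page with many links. A minimal sketch of the same check in Python, using only the standard library: it parses an HTML string and reports which links carry the data-skip-broken-link-checker attribute. The sample markup and the names LinkAttributeAudit and audit_links are made up for illustration; they are not part of the extension or of Contao.

```python
from html.parser import HTMLParser

SKIP_ATTR = "data-skip-broken-link-checker"


class LinkAttributeAudit(HTMLParser):
    """Collects (href, has_skip_attribute) for every <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)  # attributes without a value map to None
        href = attr_map.get("href")
        if href is not None:
            self.links.append((href, SKIP_ATTR in attr_map))


def audit_links(html):
    """Return a list of (href, has_skip_attribute) tuples for the given HTML."""
    parser = LinkAttributeAudit()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup, not the extension's real template output.
SAMPLE_HTML = """
<a href="https://example.com/share?u=1" rel="nofollow" data-skip-broken-link-checker>Share</a>
<a href="mailto:?subject=Hello">E-mail</a>
<a href="/texte/page.html">Internal</a>
"""

if __name__ == "__main__":
    for href, skipped in audit_links(SAMPLE_HTML):
        print("skip " if skipped else "check", href)
```

To run it against a real page, paste the page source from "view source" into SAMPLE_HTML. Any share link reported as "check" is missing the attribute.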

Hello Fritz,
I had already suspected that you meant the source code. Each of the share button links does have the "data-skip-broken-link-checker" attribute. I use an adapted "sharebuttons_default" template, but after updating your extension I took the new, revised template and made my few changes there again. So that is not the cause. Incidentally, the link checker attribute is missing from the e-mail link, and the title attribute of the e-mail link has no descriptive text either.
As I said: It must be something else. The crawler counts well over 60,000 links, even though my site only has just under 9,000 links. But it works when I deactivate the share buttons module in the layout. Do I actually always have to delete the crawl queue before a new scan? And do I have to delete the prod cache in the manager beforehand? Maybe you still have an idea what it could be ...

grashalm

The crawler counts well over 60,000 links

Does the broken link checker actually check those links or just count them?

That's a good question! I will check that tonight. Tomorrow, when the check has finished, I will download the log files and see whether there are sharebutton links among the broken links. I remember that in earlier scans there were a lot of sharebutton links among the broken links. I will give you a current report tomorrow.

So, the scan has finished.
The debug log file contains many lines like these:

"2021-03-03 22:25:27.860060","Contao\CoreBundle\Crawl\Escargot\Subscriber\BrokenLinkCheckerSubscriber",http://www.linkedin.com/shareArticle?mini=true&url=https%3A%2F%2Fdie-schreibmaus.de%2Ftexte%2Fdu-hast-geweint.html&title=You%20hast%20gewein,https://die-schreibmaus.de/texte/du-hast-geweint.html,3,"skip-broken-link-checker, rel-nofollow, disallowed-robots-txt","Did not check because it was marked to be skipped using the data-skip-broken-link-checker attribute."

"2021-03-03 22:25:27.860250","Contao\CoreBundle\Crawl\Escargot\Subscriber\BrokenLinkCheckerSubscriber",https://www.xing.com/social_plugins/share/new?sc_p=xing-share&h=1&url=https%3A%2F%2Fdie-schreibmaus.de%2Ftexte%2Fdu-hast-geweint.html,https://die-schreibmaus.de/texte/du-hast-geweint.html,3,"skip-broken-link-checker, rel-nofollow","Did not check because it was marked to be skipped using the data-skip-broken-link-checker attribute."

I conclude from this that the crawler counts the sharebutton links but does not check them, so as far as checking is concerned the result is the same as if I deactivated the module in the layout.
Is there a simple way to extend or change the template so that the links are not counted either? The crawl takes a terribly long time, at least for me.
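
Counting such entries by hand is impractical with tens of thousands of lines. A rough sketch of how the debug log could be tallied in Python, assuming it is a CSV file with the column order seen in the lines above (timestamp, source, URI, found-on URI, level, tags, message). The two sample rows, their URLs and messages, and the function name count_skipped are invented for illustration.

```python
import csv
import io
from collections import Counter

SKIP_TAG = "skip-broken-link-checker"

# Two made-up rows in the assumed column order:
# timestamp, source, URI, found-on URI, level, tags, message.
SAMPLE_LOG = '''"2021-03-03 22:25:27.860060","BrokenLinkCheckerSubscriber",https://example.com/share?url=a,https://example.org/page.html,3,"skip-broken-link-checker, rel-nofollow","made-up message for a skipped link"
"2021-03-03 22:25:28.000000","BrokenLinkCheckerSubscriber",https://example.org/other.html,https://example.org/page.html,3,"","made-up message for a checked link"
'''


def count_skipped(log_text):
    """Count rows whose tag column contains the skip tag versus all other rows."""
    counts = Counter()
    for row in csv.reader(io.StringIO(log_text)):
        if len(row) < 7:
            continue  # ignore blank or incomplete lines
        tags = {tag.strip() for tag in row[5].split(",")}
        counts["skipped" if SKIP_TAG in tags else "other"] += 1
    return counts


if __name__ == "__main__":
    totals = count_skipped(SAMPLE_LOG)
    print("skipped:", totals["skipped"], "other:", totals["other"])
```

For a real log, read the downloaded file into a string and pass it to count_skipped. If the actual column layout differs, the index 5 for the tag column needs adjusting.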

I don't think there is anything I can or should do about this. If links with data-skip-broken-link-checker really do increase the runtime of the broken link checker significantly, then this might be an issue that you should raise in contao/contao.

Okay, thank you very much for your assessment!

Hello Fritz, you recently processed my pull request that I had sent to the core team regarding the long running times of the crawler on the share buttons. I have now followed Yanick's suggestion and added the "data-escargot-ignore" attribute to all links in the template. And now it works: the crawler no longer adds these links to the queue. Thanks a lot for this!
My question now is: can the code rel="noopener noreferrer nofollow" data-skip-broken-link-checker that you recently added to the template be removed again completely? Or only partially? Or should it stay in there in any case? Will you update your template in the near future and add Yanick's attribute to it?
Thank you for a short answer.

grashalm
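
The difference between the two attributes, as reported in this thread, can be illustrated with a toy model in Python: links marked data-escargot-ignore never enter the queue, while links marked data-skip-broken-link-checker are queued and logged but not checked. This is only a sketch of the observed behaviour, not Escargot's or Contao's actual implementation; the Link class, the simulate function and the sample links are hypothetical.

```python
from dataclasses import dataclass

IGNORE_ATTR = "data-escargot-ignore"
SKIP_ATTR = "data-skip-broken-link-checker"


@dataclass(frozen=True)
class Link:
    href: str
    attributes: frozenset


def simulate(links):
    """Return (queued, checked) counts under the toy model described above."""
    # Ignored links are dropped before they reach the queue.
    queued = [link for link in links if IGNORE_ATTR not in link.attributes]
    # Skipped links stay in the queue but are not checked.
    checked = [link for link in queued if SKIP_ATTR not in link.attributes]
    return len(queued), len(checked)


# Hypothetical links: one per attribute and one without either.
SAMPLE_LINKS = [
    Link("https://example.com/share-a", frozenset({IGNORE_ATTR})),
    Link("https://example.com/share-b", frozenset({SKIP_ATTR})),
    Link("https://example.org/page.html", frozenset()),
]

if __name__ == "__main__":
    queued, checked = simulate(SAMPLE_LINKS)
    print("queued:", queued, "checked:", checked)
```

In this model, only the ignore attribute reduces the number of queued links, which matches the drop in the link count described above.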

Why do you want it removed? If anything it should be changed to data-escargot-ignore.

That is what I meant. I just wanted to ask whether the snippet rel="noopener noreferrer nofollow" data-skip-broken-link-checker is still necessary now that "data-escargot-ignore" seems to work, or whether the new attribute could simply replace the old one.

The templates now use data-escargot-ignore in version 2.1.8.

That sounds great. Thank you very much!