JayBizzle/Crawler-Detect

The ability to extend crawlers list?

godstanis opened this issue · 7 comments

Hi, really appreciate your work, guys, the package is great.

But why is there no way to mannualy extend the default crawlers list? For example i need to add "Sendsay.Ru/1.0; https://Sendsay.Ru/; ask@sendsay.ru" (it's a niche Russian crawler), but i have no options but to write an issue here. It would be really great if we could have some kind of a configuration. Or at least have Crawlers injected in CrawlerDetect class via constructor so we could extend it and inject our custom Crawlers implementation without rewriting the whole constructor.

Thanks)

Adding additional strings via a constructor could be unsafe if there is a collision with one of the existing regexes.

It also means the community can't benefit from the additional crawler strings :-(

This project is very receptive to Pull Requests though... :-)

Hi,

As @gplumb pointed out, we have deliberately avoided making the crawlers list extendable to hopefully encourage PRs which ultimately help the whole community.

We look forward to receiving your PR 👍

Thanks for the response.

@gplumb i didn't mean to actually pass strings into the constructor) I mean just to change it to something like this:

// CrawlerDetect.php
/**
 * Class constructor.
 */
public function __construct(array $headers = null, $userAgent = null, Crawlers $crawlers = null)
{
    if ($crawlers === null) {
        $this->crawlers = new Crawlers();
    } else {
        $this->crawlers = $crawlers;
    }
    //...

could be unsafe if there is a collision

Patterns are combined via '|', so i dont see any problems with it. Could you explain the type of collisions you are afraid of?

@JayBizzle

we have deliberately avoided making the crawlers list extendable to hopefully encourage PRs

@gplumb mentioned collision problems, are they still relevant?

@stasgar
With your suggested code, are you suggesting you would duplicate the entire Crawler class?

There is no issue with regex collisions. We only test for them so we can speed up scanning. No point checking something twice.

If you would like to get a new user agent into this library we encourage PR's so everyone can benefit. They are normally merged and released within a few days maximum.

My original concern about regex collisions was based on an assumption about how the constructor code might look, something that actually isn't having seen @stasgar's code :-)

@MaxGiting i would extend Crawler class and overwrite it's getAll method.

Ok, then pls close this issue and i will PR this really rare Russian crawler)

@gplumb I see what you mean! Yes there could have been in that scenario.

@stasgar There's probably a few other rare and even defunct user agents in our list by now, but hopefully some other people will benefit from your contribution.