OpenBuildings/spiderling

Ability to pass in user agent header, connection timeout etc.?

Opened this issue · 2 comments

Hello,

I would like to pass in a user agent, connection timeout, etc. to the various drivers, and perhaps also check whether robots.txt allows spidering.
Such options are handled well by curl; I am not sure about the other drivers.

Re RequestFactory: https://github.com/OpenBuildings/spiderling/blob/3f2da1a3bc6b8a7b48639ce159e3668ae65e10b8/src/Openbuildings/Spiderling/Driver/Simple/RequestFactory/HTTP.php

Hi
nice to see some interest in this library :) It was mainly developed to facilitate testing, not crawling, so I didn't really have those concerns.
All the drivers already support setting user_agent, so that's one thing crossed off your list.
You can easily add a method to pass arbitrary curl options in the class you referenced, and make a pull request out of it.
Connection timeout and robots.txt checking could also be added to other drivers, but that's work that I don't really have time to do ATM, sorry. I will be very appreciative of pull requests though.
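
For the curl-based request factory, passing arbitrary options could look roughly like the sketch below. It is only an illustration and not Spiderling's API: `build_curl_options` is a hypothetical helper, and the user agent string, timeouts, and URL are placeholder values. The `CURLOPT_*` constants and `curl_setopt_array` come from PHP's curl extension.

```php
<?php
// Hypothetical helper (not part of Spiderling): overlays caller-supplied
// cURL options on top of a set of defaults.
function build_curl_options(array $defaults, array $custom): array
{
    // array_replace preserves the integer CURLOPT_* keys;
    // array_merge would renumber them and lose the option names.
    return array_replace($defaults, $custom);
}

$defaults = [
    CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
    CURLOPT_FOLLOWLOCATION => true,  // follow redirects
];

$custom = [
    CURLOPT_USERAGENT      => 'ExampleBot/1.0',  // placeholder user agent
    CURLOPT_CONNECTTIMEOUT => 5,                 // seconds allowed for connecting
    CURLOPT_TIMEOUT        => 30,                // seconds allowed for the whole transfer
];

$options = build_curl_options($defaults, $custom);

$curl = curl_init('https://example.com/');       // placeholder URL
curl_setopt_array($curl, $options);
// $response = curl_exec($curl);                 // uncomment to actually send the request
```

Keeping the merge in one small function means the defaults stay in a single place and a caller can override any of them, including the user agent.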

Thanks!

> nice to see some interest in this library :)

Yes, Spiderling is awesome.

> Connection timeout and robots.txt checking could also be added to other drivers, but that's work that I don't really have time to do ATM, sorry. I will be very appreciative of pull requests though.

Meanwhile, I notice there are plenty of robots.txt classes on GitHub... I might just throw something together and run with it.
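
As a rough idea of what "something thrown together" might look like, here is a minimal robots.txt check. It is a simplified sketch, not one of the existing GitHub classes and not part of Spiderling: `robots_allows` is a hypothetical function that only understands `User-agent`, `Allow` and `Disallow` lines with plain prefix matching, so path wildcards (`*`, `$`) are not handled. The longest matching rule wins, and `Allow` wins a tie.

```php
<?php
// Hypothetical, simplified robots.txt check (prefix matching only).
function robots_allows(string $robotsTxt, string $userAgent, string $path): bool
{
    $groups = [];            // each group: ['agents' => [...], 'rules' => [[type, prefix], ...]]
    $lastWasAgent = false;   // consecutive User-agent lines share one group

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line));   // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if (!$lastWasAgent) {
                $groups[] = ['agents' => [], 'rules' => []];
            }
            $groups[count($groups) - 1]['agents'][] = strtolower($value);
            $lastWasAgent = true;
        } elseif (($field === 'allow' || $field === 'disallow') && $groups) {
            $groups[count($groups) - 1]['rules'][] = [$field, $value];
            $lastWasAgent = false;
        }
    }

    // Prefer groups naming this agent; fall back to the "*" groups.
    $ua = strtolower($userAgent);
    $specific = [];
    $wildcard = [];
    $matchedSpecific = false;
    foreach ($groups as $group) {
        foreach ($group['agents'] as $agent) {
            if ($agent === '*') {
                $wildcard = array_merge($wildcard, $group['rules']);
            } elseif ($agent !== '' && strpos($ua, $agent) !== false) {
                $specific = array_merge($specific, $group['rules']);
                $matchedSpecific = true;
            }
        }
    }
    $rules = $matchedSpecific ? $specific : $wildcard;

    // Longest matching prefix decides; with no matching rule the path is allowed.
    $allowed = true;
    $bestLength = -1;
    foreach ($rules as [$type, $prefix]) {
        if ($prefix === '' || strpos($path, $prefix) !== 0) {
            continue;   // an empty Disallow means "no restriction"
        }
        $length = strlen($prefix);
        if ($length > $bestLength || ($length === $bestLength && $type === 'allow')) {
            $bestLength = $length;
            $allowed = ($type === 'allow');
        }
    }
    return $allowed;
}

// Example with an inline robots.txt (made-up content, no network access needed).
$robots = "User-agent: *\nDisallow: /private/\nAllow: /private/public.html\n\n"
        . "User-agent: BadBot\nDisallow: /\n";

var_dump(robots_allows($robots, 'ExampleBot/1.0', '/private/data'));
var_dump(robots_allows($robots, 'ExampleBot/1.0', '/private/public.html'));
```

The function takes the robots.txt body as a string, so fetching it (and caching it per host) can be left to whichever driver is in use.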