Ability to pass in user agent header, connection timeout etc.?
Opened this issue · 2 comments
Hello,
I would like to pass in the user agent, connection timeout, etc. with the various drivers. Perhaps also check whether robots.txt allows spidering.
Such options can be handled well in curl; I am unaware of the rest.
Hi
Nice to see some interest in this library :) It was mainly developed to facilitate testing, not crawling, so I didn't really have those concerns.
All the drivers already support setting user_agent, so that's one thing crossed off your list.
You can easily add a method to pass arbitrary curl options in the class you referenced, and make a pull request out of it.
Connection timeout and robots.txt checking could also be added to other drivers, but that's work that I don't really have time to do ATM, sorry. I will be very appreciative of pull requests though.
Thanks!
> nice to see some interest in this library :)
Yes, Spiderling is awesome.
> Connection timeout and robots.txt checking could also be added to other drivers, but that's work that I don't really have time to do ATM, sorry. I will be very appreciative of pull requests though.
Meanwhile, I notice there are plenty of robots.txt classes on GitHub... I might just throw something together and run with it.
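For anyone landing here later, a minimal sketch of what such a robots.txt check could look like. It is written in Python with the standard library's urllib.robotparser, not against Spiderling itself, so it only illustrates the idea. The helper names, the sample rules, and the "MyCrawler/1.0" user agent string are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, not a single string.
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS)
print(is_allowed(parser, "MyCrawler/1.0", "https://example.com/index.html"))
print(is_allowed(parser, "MyCrawler/1.0", "https://example.com/private/page"))
```

In a real crawler the robots.txt text would be fetched once per host and the parser cached, with the check run before each request using the same user agent string that the driver sends.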