Unexpected behavior of grab.spider
EnzoRondo opened this issue · 15 comments
Hey there (@lorien), thanks a lot for the great library!
I am learning your library and I see unexpected behavior while working with it. Here is my code sample, based on an example in the documentation:
import csv
import logging
import re

from grab.spider import Spider, Task


class ExampleSpider(Spider):
    def create_grab_instance(self, **kwargs):
        g = super(ExampleSpider, self).create_grab_instance(**kwargs)
        g.setup(proxy='127.0.0.1:8090', proxy_type='socks5', timeout=60, connect_timeout=15)
        return g

    def task_generator(self):
        # Only one listing page is generated in this test.
        for i in range(1, 1 + 1):
            page_url = "{}{}/".format("https://www.mourjan.com/properties/", i)
            # print("page url: {}".format(page_url))
            yield Task('stage_two', url=page_url)

    def prepare(self):
        # Prepare the file handler to save results.
        # The method `prepare` is called once, before the
        # spider starts working.
        self.result_file = csv.writer(open('result.txt', 'w'))
        # This counter will be used to enumerate saved results.
        self.result_counter = 0

    def task_stage_two(self, grab, task):
        for elem in grab.doc.select("//li[@itemprop='itemListElement']//p")[0:4]:
            part = elem.attr("onclick")
            url_part = re.search(r"(?<=wo\(\').*(?=\'\))", part).group()
            end_url = grab.make_url_absolute(url_part)
            yield Task('stage_three', url=end_url)

    def task_stage_three(self, grab, task):
        # First, save URL and title into a dictionary.
        post = {
            'url': task.url,
            'title': grab.doc.xpath_text("//title/text()"),
        }
        self.result_file.writerow([
            post['url'],
            post['title'],
        ])
        # Increment the result counter.
        self.result_counter += 1


if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    # Start the spider with two concurrent network streams.
    bot = ExampleSpider(thread_number=2)
    bot.run()
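For reference, the lookaround regex in task_stage_two extracts the relative URL embedded in the onclick attribute. Here is a minimal standalone check; the onclick value below is made up for illustration:

import re

# Hypothetical onclick value of the form wo('<relative-url>')
part = "wo('/kw/kuwait/warehouses/rental/10854564/')"
url_part = re.search(r"(?<=wo\(\').*(?=\'\))", part).group()
print(url_part)  # -> /kw/kuwait/warehouses/rental/10854564/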
First run:
DEBUG:grab.spider.base:Using memory backend for task queue
DEBUG:grab.network:[01] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[02] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[03] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[04] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[05] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.stat:RPS: 7.35 [error:multi-added-already=5, network-count-rejected=1]
DEBUG:grab.spider.parser_pipeline:Started shutdown of parser process: Thread-1
DEBUG:grab.spider.parser_pipeline:Finished joining parser process: Thread-1
DEBUG:grab.spider.base:Main process [pid=4064]: work done
Then I ran the code again, about 20 attempts, and got the same wrong output every time; the 21st run, however, succeeded, and I saw what I expected:
DEBUG:grab.spider.base:Using memory backend for task queue
DEBUG:grab.network:[01] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.stat:RPS: 0.52 []
DEBUG:grab.network:[02] GET https://www.mourjan.com/kw/kuwait/warehouses/rental/10854564/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[03] GET https://www.mourjan.com/ae/abu-dhabi/apartments/rental/11047384/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[04] GET https://www.mourjan.com/kw/kuwait/villas-and-houses/rental/11041455/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[05] GET https://www.mourjan.com/ae/abu-dhabi/apartments/rental/11009663/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.stat:RPS: 2.36 []
DEBUG:grab.stat:RPS: 1.28 []
DEBUG:grab.spider.parser_pipeline:Started shutdown of parser process: Thread-1
DEBUG:grab.spider.parser_pipeline:Finished joining parser process: Thread-1
DEBUG:grab.spider.base:Main process [pid=4860]: work done
Why does this happen?
It is probably related to this part of the code:
    def create_grab_instance(self, **kwargs):
        g = super(ExampleSpider, self).create_grab_instance(**kwargs)
        g.setup(proxy='127.0.0.1:8090', proxy_type='socks5', timeout=60, connect_timeout=15)
        return g
I copied this snippet from the documentation; my goal is to set the proxy and other settings for Grab. Without this part of the code everything works fine. It seems to me that an alternative approach is needed here.
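For what it's worth, one alternative is to configure the proxy at the spider level rather than in create_grab_instance. This is only a sketch: it assumes the Spider.load_proxylist API and a hypothetical proxies.txt file containing the single address 127.0.0.1:8090; timeout and connect_timeout would still have to be set elsewhere:

bot = ExampleSpider(thread_number=2)
# Assumption: load_proxylist is available on Spider and reads
# host:port entries, one per line, from a text file.
bot.load_proxylist('proxies.txt', 'text_file', proxy_type='socks5')
bot.run()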
@EnzoRondo
I do not understand the question.
Why does this happen?
What happens?
You've provided two log outputs. I do not see a big difference between them. From these logs I can't tell what your spider has done incorrectly (or has failed to do correctly).
Closing this issue until @EnzoRondo provides additional details about what he means.
What happens?
Look at this:
DEBUG:grab.network:[01] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[02] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[03] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[04] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[05] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
Grab requests the same URL five times instead of the correct ones.
Does it work without the socks5 proxy? As far as I remember, pycurl has an issue with socks5.
@EnzoRondo Spider does not work correctly with socks5 in multicurl mode.
If you want to use Spider with a socks proxy, then you HAVE TO use the urllib3 grab transport and the threaded network service:
bot = SomeSpider(network_service='threaded', grab_transport='urllib3')
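Applied to the ExampleSpider from the first post, the launch block would look like this (thread_number kept from the original code):

if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    bot = ExampleSpider(thread_number=2,
                        network_service='threaded',
                        grab_transport='urllib3')
    bot.run()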
I translated this post and understood from it that it is possible to use socks5 with the threaded transport, but now you are saying something different. Which is correct?
Tested: bot = SomeSpider(network_service='threaded', grab_transport='urllib3') works perfectly, thanks.
Will that bug be fixed in the future? I am using grab.spider in a different project, and the create_grab_instance function is the one place where I have problems with it.
Works with socks5:
- threaded network service & pycurl grab transport
- threaded network service & urllib3 grab transport
Does not work with socks5:
- multicurl network service & pycurl grab transport (this is not a bug in Grab, it is a bug in the pycurl library)
- threaded network service & urllib3 grab transport: works perfectly
- threaded network service & pycurl grab transport: works, but we can still see the bug from the first post
works, but we can still see the bug from the first post
The code from the first post does NOT use the threaded network service.
Yep, but I have tried:
bot = ExampleSpider(thread_number=2, network_service='threaded', grab_transport='pycurl')
and the bug is still there.
So what do you want from me? I do not know what you have and have not tried.
Please provide exact and detailed information about what you did (minimal working source code), what you expected to get, and what you got instead.
So what do you want from me?
To fix the bug, but I can't reproduce it now on the latest dev build; probably some of your commits fixed this issue. Thanks a lot, friend! The spider now works more stably.
I am very happy to have no problems here. Thanks a lot again, I appreciate it!