lorien/grab

Erratic behavior of grab.spider

EnzoRondo opened this issue · 15 comments

Hey there @lorien, thanks a lot for the great library 😃

I am learning your library and I'm seeing unexpected behavior; here is my code sample, which is based on an example from the documentation:

import csv
import logging
import re

from grab.spider import Spider, Task


class ExampleSpider(Spider):
    def create_grab_instance(self, **kwargs):
        g = super(ExampleSpider, self).create_grab_instance(**kwargs)
        g.setup(proxy='127.0.0.1:8090', proxy_type='socks5', timeout=60, connect_timeout=15)
        return g

    def task_generator(self):
        for i in range(1, 1 + 1):
            page_url = "{}{}/".format("https://www.mourjan.com/properties/", i)
            # print("page url: {}".format(page_url))
            yield Task('stage_two', url=page_url)

    def prepare(self):
        # Prepare the file handler to save results.
        # The method `prepare` is called one time before the
        # spider has started working
        self.result_file = csv.writer(open('result.txt', 'w'))

        # This counter will be used to enumerate found images
        # to simplify image file naming
        self.result_counter = 0

    def task_stage_two(self, grab, task):
        for elem in grab.doc.select("//li[@itemprop='itemListElement']//p")[0:4]:
            part = elem.attr("onclick")
            url_part = re.search(r"(?<=wo\(\').*(?=\'\))", part).group()
            end_url = grab.make_url_absolute(url_part)
            yield Task('stage_three', url=end_url)

    def task_stage_three(self, grab, task):
        # First, save URL and title into dictionary
        post = {
            'url': task.url,
            'title': grab.doc.xpath_text("//title/text()"),
        }
        self.result_file.writerow([
            post['url'],
            post['title'],
        ])
        # Increment image counter
        self.result_counter += 1


if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    # Let's start spider with two network concurrent streams
    bot = ExampleSpider(thread_number=2)
    bot.run()

First run:

DEBUG:grab.spider.base:Using memory backend for task queue
DEBUG:grab.network:[01] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[02] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[03] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[04] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[05] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.stat:RPS: 7.35 [error:multi-added-already=5, network-count-rejected=1]
DEBUG:grab.spider.parser_pipeline:Started shutdown of parser process: Thread-1
DEBUG:grab.spider.parser_pipeline:Finished joining parser process: Thread-1
DEBUG:grab.spider.base:Main process [pid=4064]: work done

😕

Then I run the code again, ~20 attempts with the same result, but the 21st attempt succeeds and I see what I want to see:

DEBUG:grab.spider.base:Using memory backend for task queue
DEBUG:grab.network:[01] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.stat:RPS: 0.52 []
DEBUG:grab.network:[02] GET https://www.mourjan.com/kw/kuwait/warehouses/rental/10854564/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[03] GET https://www.mourjan.com/ae/abu-dhabi/apartments/rental/11047384/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[04] GET https://www.mourjan.com/kw/kuwait/villas-and-houses/rental/11041455/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[05] GET https://www.mourjan.com/ae/abu-dhabi/apartments/rental/11009663/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.stat:RPS: 2.36 []
DEBUG:grab.stat:RPS: 1.28 []
DEBUG:grab.spider.parser_pipeline:Started shutdown of parser process: Thread-1
DEBUG:grab.spider.parser_pipeline:Finished joining parser process: Thread-1
DEBUG:grab.spider.base:Main process [pid=4860]: work done

Why does this happen?

It's probably related to this part of the code:

    def create_grab_instance(self, **kwargs):
        g = super(ExampleSpider, self).create_grab_instance(**kwargs)
        g.setup(proxy='127.0.0.1:8090', proxy_type='socks5', timeout=60, connect_timeout=15)
        return g

I copied this snippet from the documentation; my goal is to set a proxy and other settings for Grab. Without this part of the code everything works fine. It seems to me that an alternative is needed here.
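
For example, maybe something like this would work instead (just a sketch; I'm assuming Task accepts a preconfigured Grab instance through its grab argument, as an alternative to overriding create_grab_instance):

from grab import Grab
from grab.spider import Spider, Task


class AltExampleSpider(Spider):
    def task_generator(self):
        # Configure a Grab instance per task instead of overriding
        # create_grab_instance for the whole spider (assumption: Task
        # accepts a preconfigured Grab instance via its `grab` argument)
        g = Grab()
        g.setup(url='https://www.mourjan.com/properties/1/',
                proxy='127.0.0.1:8090', proxy_type='socks5',
                timeout=60, connect_timeout=15)
        yield Task('stage_two', grab=g)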

@EnzoRondo
I do not understand the question.

Why does this happen?

What happens?

You've provided two log outputs. I do not see a big difference between them. From these logs, I cannot tell what your spider has done incorrectly (or has failed to do correctly).

Closing this issue until @EnzoRondo provides additional details about what he meant.

What happens?

Look at that:

DEBUG:grab.network:[01] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[02] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[03] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[04] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[05] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5

Grab tries to fetch the same URL 5 times instead of the right ones.

oiwn commented

Does it work without the socks5 proxy? As far as I remember, pycurl has an issue with socks5.

@istinspring it works

@EnzoRondo Spider does not work correctly with socks5 in multicurl mode.
If you want to use Spider with a socks proxy, then you HAVE TO use the urllib3 transport and the threaded network service:

bot = SomeSpider(network_service='threaded', grab_transport='urllib3')
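
Applied to the spider from the first post, that is a one-line change (a sketch reusing the ExampleSpider class defined above):

bot = ExampleSpider(thread_number=2, network_service='threaded', grab_transport='urllib3')
bot.run()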

I translated this post and understood from it that it is possible to use socks5 with the threaded transport, but now you are saying something different; where is the truth?

Tested: bot = SomeSpider(network_service='threaded', grab_transport='urllib3') works perfectly, thanks.

Will that bug be fixed in the future? I am using grab.spider in a different project, and that is the one place (the create_grab_instance function) where I have problems with it.

Works with socks5:

  • threaded network service & pycurl grab transport
  • threaded network service & urllib3 grab transport

Does not work with socks5:

  • multicurl network service & pycurl grab transport - this is not a bug in Grab; it is a bug in the pycurl library.
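
In constructor terms, the combinations above look like this (a sketch; ExampleSpider stands in for any Spider subclass, and I'm assuming the default multicurl service can be selected explicitly with network_service='multicurl'):

# Works with socks5: threaded network service with either transport
bot = ExampleSpider(network_service='threaded', grab_transport='pycurl')
bot = ExampleSpider(network_service='threaded', grab_transport='urllib3')

# Does not work with socks5 (pycurl bug, not a Grab bug)
bot = ExampleSpider(network_service='multicurl', grab_transport='pycurl')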

threaded network service & urllib3 grab transport

works perfectly

threaded network service & pycurl grab transport

works, but we can still see the bug from the first post

works, but we can still see the bug from the first post

The code from the first post does NOT use the threaded network service.

Yep, but I have tried:

bot = ExampleSpider(thread_number=2, network_service='threaded', grab_transport='pycurl')

and the bug is still there.

So what do you want from me? I do not know what you have and have not tried.
Please provide exact and detailed information about what you did (minimal working source code), what you expected to get, and what you got instead.

So what do you want from me?

To fix the bug, but I can't reproduce it now on the latest dev build 😕. Probably one of your commits fixed this issue along the way, thanks a lot, friend! Now the spider works more stably 👍

I am very happy to have no problems here, thanks a lot again, I appreciate it 😎