istresearch/scrapy-cluster

Support for custom header and cookies for the initial request from kafka_monitor.py feed

knirbhay opened this issue · 5 comments

I need to request a URL with custom headers and preset cookies. For example:

There is an API at https://xyz.com/test_api/_id that returns JSON.
It must be called with a POST request carrying API keys in custom headers and a few preset cookies.

How do I get it working with scrapy-cluster?

With plain Scrapy I would override the start_requests method and apply custom headers and cookies.
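
For reference, this is roughly what that looks like in plain Scrapy (the URL, header, and cookie values here are placeholders):

```python
import json
import scrapy

# placeholder spider showing the plain-Scrapy approach; header and cookie
# values are illustrative only
class ApiSpider(scrapy.Spider):
    name = 'api_spider'

    def start_requests(self):
        yield scrapy.Request(
            url='https://xyz.com/test_api/_id',
            method='POST',
            headers={'X-Api-Key': 'my-key'},
            cookies={'sessionid': 'preset-value'},
            callback=self.parse,
        )

    def parse(self, response):
        yield json.loads(response.text)
```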

Another problem appears to be a cookie jar issue: cookies are stored on one node and cannot be passed to another node. This shows up when the server uses the Set-Cookie header to store session details.

Cookie support is already provided via the cookie field in the Kafka Monitor API. This is a cookie string that is then deserialized into a Scrapy cookie object.
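
For example, a feed request can carry that cookie string alongside the usual fields (a minimal sketch; the values are placeholders, and the cookie value is a raw Cookie header string):

```python
import json

# minimal sketch of a kafka_monitor.py feed payload that includes a cookie
# string; url/appid/crawlid are the usual feed fields, values are placeholders
feed_request = {
    "url": "https://xyz.com/test_api/_id",
    "appid": "testapp",
    "crawlid": "abc123",
    "cookie": "sessionid=XYZ; apitoken=123",
}
print(json.dumps(feed_request))
# e.g. python kafka_monitor.py feed '<json printed above>'
```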

As for custom request methods, the custom scheduler is where you want to look, since it translates the incoming objects into Scrapy Requests. I think the scheduler can already handle POST requests yielded from the spider, thanks to the Scrapy request-to-dict methods, but supporting them on the initial request is something that could be improved.
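
A rough sketch of what that improvement could look like when the scheduler builds the initial Request; the method, headers, and parsed_cookies fields are assumptions here, not part of the current API:

```python
from scrapy.http import Request

def request_from_feed(item):
    # hypothetical: 'method', 'headers', and 'parsed_cookies' are assumed
    # extensions to the feed object, alongside the existing 'url' handling
    return Request(
        url=item['url'],
        method=item.get('method', 'GET'),
        headers=item.get('headers') or {},
        cookies=item.get('parsed_cookies') or {},
        dont_filter=True,
    )
```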

Scrapy Cluster purposefully does not store cookie information in each spider, because any single chain of requests might go to multiple spiders or machines. You would need to customize the setup a bit to pass those cookies through your calls so they are used in subsequent requests.
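
One way to do that without touching the middleware is to copy the cookies the server just set onto the next request explicitly, so whichever node picks it up sends the same session (a sketch; the URL and spider name are placeholders):

```python
import scrapy

class SessionSpider(scrapy.Spider):
    name = 'session_spider'

    def parse(self, response):
        # collect the name=value pairs the server just set on this response
        cookies = {}
        for raw in response.headers.getlist('Set-Cookie'):
            name, _, value = raw.decode('utf-8').split(';', 1)[0].partition('=')
            cookies[name] = value
        # attach them explicitly so whichever node handles the next request
        # sends the same session cookies
        yield scrapy.Request('https://xyz.com/test_api/next',
                             cookies=cookies,
                             callback=self.parse)
```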

Scrapy Cluster is most suited for large scale on demand crawling, and in its current form (because it is distributed) has some of the limitations or assumptions I noted above. I am always happy to look at or review a PR if you think it would be worthwhile to add to the project!

Working towards it. I got custom requests working with headers and cookies. Now working on shared cookie instances, shared via Redis and separated by crawl/spider IDs.

The custom cookie middleware below worked for me. I am not sure this is the right place to initialize redis_conn; I could not find a way to share the DistributedScheduler's redis_conn.

```python
import pickle

import redis
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware


class SharedCookiesMiddleware(CookiesMiddleware):
    """Cookie middleware that shares cookie jars across nodes through Redis,
    keyed by spiderid and crawlid."""

    def __init__(self, debug=True, server=None):
        CookiesMiddleware.__init__(self, debug)
        self.redis_conn = server
        self.debug = debug

    @classmethod
    def from_crawler(cls, crawler):
        server = redis.Redis(host=crawler.settings.get('REDIS_HOST'),
                             port=crawler.settings.get('REDIS_PORT'),
                             db=crawler.settings.get('REDIS_DB'))
        return cls(crawler.settings.getbool('COOKIES_DEBUG'), server)

    def process_request(self, request, spider):
        if 'dont_merge_cookies' in request.meta:
            return
        cookiejarkey = "{spiderid}:sharedcookies:{crawlid}".format(
            spiderid=request.meta.get("spiderid"),
            crawlid=request.meta.get("crawlid"))

        jar = self.jars[cookiejarkey]
        jar.clear()

        # load the shared jar for this crawl from Redis, if one exists
        if self.redis_conn.exists(cookiejarkey):
            data = self.redis_conn.get(cookiejarkey)
            jar = pickle.loads(data)

        cookies = self._get_request_cookies(jar, request)
        for cookie in cookies:
            jar.set_cookie_if_ok(cookie, request)

        # set the Cookie header
        request.headers.pop('Cookie', None)
        jar.add_cookie_header(request)

        self._debug_cookie(request, spider)
        # persist the jar back to Redis so other nodes see it
        self.redis_conn.set(cookiejarkey, pickle.dumps(jar))

    def process_response(self, request, response, spider):
        if request.meta.get('dont_merge_cookies', False):
            return response
        cookiejarkey = "{spiderid}:sharedcookies:{crawlid}".format(
            spiderid=request.meta.get("spiderid"),
            crawlid=request.meta.get("crawlid"))

        jar = self.jars[cookiejarkey]
        jar.clear()

        if self.redis_conn.exists(cookiejarkey):
            data = self.redis_conn.get(cookiejarkey)
            jar = pickle.loads(data)

        # extract cookies from Set-Cookie and drop invalid/expired cookies
        jar.extract_cookies(response, request)
        self._debug_set_cookie(response, spider)

        # persist the updated jar back to Redis
        self.redis_conn.set(cookiejarkey, pickle.dumps(jar))
        return response
```
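
To try it out, the default cookie middleware needs to be swapped for this one in the spiders' settings (a sketch; the module path is an assumption, adjust it to wherever the class lives in your project):

```python
# settings.py for the crawler nodes
DOWNLOADER_MIDDLEWARES = {
    # disable the stock per-process cookie middleware
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    # assumed module path; point this at wherever SharedCookiesMiddleware lives
    'crawling.custom_cookies.SharedCookiesMiddleware': 700,
}
```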

Thanks @knirbhay! This is a great start; I can try to incorporate it into the project or just leave it as a standalone file. If you make a PR I can review it and get it merged in. There are just a couple of things I would like changed, but otherwise it is great work.

Sure, I will also include the custom request support for headers and cookies. I have enhanced the Kafka feed API, but I still need to check whether it works with Scrapy Cluster's REST API.