istresearch/scrapy-cluster

maxdepth cannot be larger than 2

anthony9981 opened this issue · 5 comments

Hi,
When I try to feed a URL with this:
curl localhost:5343/feed -H "Content-Type: application/json" -d '{"url": "https://domain.com", "appid": "gd2", "crawlid": "gdcrawl2", "maxdepth": 5, "allowed_domains": ["domain.com"]}'

I always got:
{"message": "Did not find schema to validate request", "parsed": true, "valid": false, "logger": "kafka-monitor", "timestamp": "2020-12-18T03:06:51.564078Z", "data": {"url":...

But when I decrease maxdepth to 2, the crawler works.

So what exactly is maxdepth for?
As I understand it:
maxdepth=0: only crawl the current page. I guess this is the default value.
maxdepth=1: also follow links one level deep from that page?
maxdepth=2: follow links two levels deep?
What does each maxdepth value actually mean?

By default, scrapy-cluster won't crawl websites with a maxdepth larger than 3. You need to change the schema first. To do this, log in to your kafka_monitor container:

1- docker exec -it container_id bash
2- cd plugins
3- edit scraper_schema.json (change the max value for maxdepth from 3 to anything you want)

From this point on, you can crawl websites with any maxdepth between the min value and the max value you just set.
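For context, the "Did not find schema to validate request" message above means the kafka-monitor checked the request against its plugin schemas and none of them accepted it, because maxdepth=5 is above the schema's maximum. Here is a minimal sketch of that kind of check using the jsonschema library (the schema fragment below is an assumed simplification, not the real file, and the maximum of 3 is inferred from the behaviour described above):

```python
from jsonschema import ValidationError, validate

# Assumed, simplified stand-in for the maxdepth part of
# kafka-monitor/plugins/scraper_schema.json.
schema = {
    "type": "object",
    "properties": {
        "url": {"type": "string"},
        "maxdepth": {"type": "integer", "minimum": 0, "maximum": 3},
    },
    "required": ["url"],
}

request = {"url": "https://domain.com", "maxdepth": 5}

try:
    validate(instance=request, schema=schema)
except ValidationError as err:
    # With maxdepth=5 this prints something like:
    # "5 is greater than the maximum of 3"
    print(err.message)
```

Raising the maximum in the real scraper_schema.json (step 3 above) is what lets maxdepth values above 3 pass this validation.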

Hi @NeoArio,
Thanks for your reply.
Your answer helps me a lot.
I have a question if you don't mind:

I have some websites I already know, and I need to crawl the title and content with an exact selector for each of them.
Q1: How can I predefine a CSS selector for each of them and then feed the monitor only the domain?
Q2: And where can I pick up the scraped items to store them in a database like Elasticsearch?
I tried with pipelines (scrapy-elasticsearch) but it sends a ton of additional requests to the ES server.

Sorry, I'm new to Scrapy. This is awesome!
Best regards,

Hi! I hope you enjoy scraping :D

Q1: I have another database that stores each website's CSS and XPath patterns. I don't know if what you want to try is really applicable.
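In case a sketch helps with Q1 (this is just a hypothetical approach, not something scrapy-cluster ships with): keep a per-domain selector table and look it up in your spider's parse callback based on the response's domain. The table and field names below are made up for illustration; in practice the mapping could live in any database.

```python
from urllib.parse import urlparse

# Hypothetical per-domain selector table; swap the dict for a real
# database lookup if you have many sites.
SELECTORS = {
    "domain.com": {"title": "h1.post-title::text", "content": "div.article-body"},
    "example.org": {"title": "title::text", "content": "main"},
}

def extract(response):
    """Pick CSS selectors based on the domain of a Scrapy response."""
    domain = urlparse(response.url).netloc
    if domain.startswith("www."):
        domain = domain[4:]
    rules = SELECTORS.get(domain)
    if rules is None:
        return None  # no rules defined for this domain
    return {
        "title": response.css(rules["title"]).get(),
        "content": response.css(rules["content"]).get(),
    }
```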
Q2: scraped items will be pushed to the demo.crawled_firehose topic: https://scrapy-cluster.readthedocs.io/en/latest/topics/kafka-monitor/api.html#kafka-topics
Write code to consume from this topic, then do what you want with that data. Finally, you can send it to Elasticsearch through another Kafka pipeline. I think it is better to insert into Elasticsearch with bulk requests, i.e. each insertion contains 100 crawled links. Add a timeout alongside this and you are set. 100 crawled links or 10 minutes are good conditions for inserting into Elasticsearch.
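A rough sketch of that consume-and-bulk-insert loop (the topic name comes from the docs linked above; the broker address, index name, and the choice of the kafka-python / elasticsearch-py clients are my own assumptions):

```python
import json
import time

from kafka import KafkaConsumer                 # kafka-python
from elasticsearch import Elasticsearch, helpers

consumer = KafkaConsumer(
    "demo.crawled_firehose",
    bootstrap_servers="localhost:9092",          # assumed broker address
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
es = Elasticsearch("http://localhost:9200")      # assumed ES address

BATCH_SIZE = 100         # flush every 100 crawled links...
FLUSH_INTERVAL = 600     # ...or every 10 minutes, whichever comes first

def flush(docs):
    """One bulk request instead of one request per crawled item."""
    helpers.bulk(es, ({"_index": "crawled", "_source": d} for d in docs))

batch = []
last_flush = time.time()

for message in consumer:                         # blocks waiting for items
    batch.append(message.value)
    if len(batch) >= BATCH_SIZE or time.time() - last_flush >= FLUSH_INTERVAL:
        flush(batch)
        batch = []
        last_flush = time.time()
```

(The timeout check here only fires when a new message arrives; for a strict 10-minute guarantee you would poll with a timeout instead, but this shows the batching idea.)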

Hi @NeoArio ,
The idea of a database that stores the selectors is so nice, why didn't I think of it before 👍
Could you please show me yours?
I came from PHP to Python and then ended up here, so Kafka is new to me :)
Thanks for pointing me in the right direction :) Let me dig into it deeper.
Best regards,

I'm happy to chat through custom implementations on Gitter, but per the guidelines I am going to close this issue as a "custom implementation" question which is beyond the scope of a true bug ticket/problem.

More generally - crawling at a depth beyond 2 gets your spider way into the weeds of the internet and is, 99% of the time, not useful for your actual request. If you wish to crawl at a greater depth, you should also implement an allowed_domains filter or regex in the crawl API request to limit your crawler to a specific domain.

If you need to change anything else in the API spec for the request, you can do so in this file: https://github.com/istresearch/scrapy-cluster/blob/master/kafka-monitor/plugins/scraper_schema.json