grammy-jiang/scrapy-pipeline-mongodb

Handling errors in process_item

botzill opened this issue · 4 comments

Hi.

This module is great, good job!

I'm wondering how to properly handle errors in the process_item callback. For example, I want to silently log duplicate entries.

Thx.

Hi, @botzill ,

Duplicates are always a problem (they raise an exception) when you use MongoDB in pipelines.

In my view, this exception is a problem of your project, not of the pipeline. That is why I did not include any exception handling in my repo.

In this case, I always write the try statement separately for each project. You can point the pipeline at your own implementation with this setting: MONGODB_PROCESS_ITEM.

For example, if you put your process_item method at the path scrapy_project.pipeline.process_item, then the setting will be:

MONGODB_PROCESS_ITEM = 'scrapy_project.pipeline.process_item'

And please read my code carefully and make sure your args/kwargs are identical to mine:

Note: This middleware is still under development; some settings may change in the future!

Thx @grammy-jiang !

Yes, that's how I currently do it: I implemented that method. I just wondered about handling errors using some Twisted methods like addErrback or something like that. Is it OK to handle it like this?

        try:
            yield pipeline.coll.insert_one(item)
            spider.logger.info("[%s] item inserted.", item['_id'])
        except Exception as e:
            spider.logger.info("[%s] item already exists, skip it.", item['_id'])

Hi, @botzill

Your code is fine with me, but if I were you, I would:

  • use the pipeline's logger, not the spider's
  • set the log level to debug, not info
  • save the return value of insert_one, even if it may not be used afterwards
  • put the success log under the else clause instead of inside the try block
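
The four suggestions above could look roughly like this. It is only a sketch: FakeCollection, the local DuplicateKeyError class and the process_item(coll, item) signature are hypothetical stand-ins so the snippet runs without MongoDB, and it is synchronous, unlike the yield-based snippet above. Catching only the duplicate-key error instead of Exception is an extra assumption, not something from this repo.

```python
import logging

# Module-level logger for the pipeline, used instead of spider.logger.
logger = logging.getLogger(__name__)


class DuplicateKeyError(Exception):
    """Stand-in for the driver's duplicate-key exception (assumption)."""


class FakeCollection:
    """Hypothetical in-memory stand-in for a MongoDB collection."""

    def __init__(self):
        self.docs = {}

    def insert_one(self, doc):
        if doc["_id"] in self.docs:
            raise DuplicateKeyError(doc["_id"])
        self.docs[doc["_id"]] = doc
        return doc["_id"]  # a real driver returns a result object instead


def process_item(coll, item):
    try:
        # Keep the return value, even if it is only used for logging.
        result = coll.insert_one(item)
    except DuplicateKeyError:
        logger.debug("[%s] item already exists, skip it.", item["_id"])
    else:
        # Runs only when insert_one raised nothing.
        logger.debug("[%s] item inserted (%r).", item["_id"], result)
    return item
```

Calling process_item twice with the same _id stores the item once and only logs the second attempt at debug level.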

As for your other question, I had never thought about it before. But I realized there is a code example in the Scrapy documentation that could help, in the section "Take screenshot of item" (Item Pipeline, Scrapy 1.5.0 documentation). The process_item method can return a Deferred object!

Maybe you can try it and find something interesting! And please let me know!

Thx @grammy-jiang for the tips; yes, your points make sense. I'm not really experienced with Twisted, so I need to look into this Deferred in detail. Right now I don't really know what can be done with a Deferred returned from process_item. Will check more details about this.

Thx a lot.