scrapy/itemadapter

Extend supported classes with 3rd party classes?


Hi,

I was wondering whether it would be feasible to provide an interface to extend the supported classes, e.g. to pydantic models or mongoengine documents.

The reason I ask is that I'm currently using pydantic for schema validation, and it would be nice to be able to return a pydantic BaseModel directly from the spider. (And I can imagine a similar use case for mongoengine Documents and many other classes too.)

Of course, this would introduce unnecessary dependencies so I guess a better way would be to build in extensibility.

I could extend the supported classes myself by subclassing ItemAdapter and extending its methods, but there are a lot of methods to keep track of and it doesn't feel very stable.

Have there been any thoughts on providing an interface to extend the supported classes?

I think it makes sense to support these as part of itemadapter, just as we added support for dataclasses or attrs.

The dependencies should not be an issue. For example, itemadapter supports Scrapy’s Item but does not require scrapy as a dependency. We can enable support based on available packages.
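
For illustration only, such optional support is usually implemented with a guarded import; a minimal sketch of that pattern (not itemadapter's actual source):

    # Sketch of the guarded-import pattern: enable an adapter only if the
    # corresponding third-party package is installed. Not itemadapter's exact code.
    try:
        from scrapy.item import Item as ScrapyItem
    except ImportError:
        ScrapyItem = None

    def looks_like_scrapy_item(obj) -> bool:
        # Hypothetical helper: True only when scrapy is available and obj is an Item
        return ScrapyItem is not None and isinstance(obj, ScrapyItem)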

My only suggestion would be to open a second ticket, so that we can keep separate tickets for each suggested new item type, instead of suggesting 2 different types here, as they can (and probably should) be implemented in separate pull requests.

@LaundroMat Thanks for the idea. Support for arbitrary types was just merged (#45) and released (v0.2.0). Please try it and give us your feedback.

Thanks for this addition!

I'm not using it in production yet (still adding some tests to my code for the changes this entails), but so far things seem to work fine. I've switched from pydantic to marshmallow meanwhile, so the adapter interface is very timely for me 👍

To be honest, I had to wrap my head around the finer details of the implementation to get things rolling, but that's probably just me. FWIW, those were:

  • Having to implement all the methods from the MutableMapping class too. I had to look at the source code of itemadapter.adapter._MixinAttrsDataclassAdapter to get an idea of what I was meant to add (see the sketch after this list);
  • The fact that the first adapter class whose is_item classmethod returns True is used for all subsequent operations confused me. If item data does not validate in my custom adapter (and that adapter's is_item therefore returns False), my code is still under the impression that the item data is valid for the use the adapter was built for (e.g. saving to a mongo document). Or is validation a concern that should be kept outside of the adapter?
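
To illustrate the first point, here is a minimal sketch of a custom adapter that fills in the missing MutableMapping methods. MyItemClass is a hypothetical stand-in for a third-party item type, and the sketch assumes AdapterInterface keeps the wrapped object as self.item:

    from collections.abc import Iterator
    from typing import Any

    from itemadapter import ItemAdapter
    from itemadapter.adapter import AdapterInterface

    class MyItemClass:
        """Hypothetical third-party item type: a thin wrapper around a dict."""
        def __init__(self, **fields: Any) -> None:
            self.data = dict(fields)

    class MyItemAdapter(AdapterInterface):
        @classmethod
        def is_item(cls, item: Any) -> bool:
            # Plain type check, no validation of field values
            return isinstance(item, MyItemClass)

        # MutableMapping methods, delegating to the wrapped item (self.item)
        def __getitem__(self, field_name: str) -> Any:
            return self.item.data[field_name]

        def __setitem__(self, field_name: str, value: Any) -> None:
            self.item.data[field_name] = value

        def __delitem__(self, field_name: str) -> None:
            del self.item.data[field_name]

        def __iter__(self) -> Iterator:
            return iter(self.item.data)

        def __len__(self) -> int:
            return len(self.item.data)

    # Register the adapter so ItemAdapter tries it before the built-in ones
    ItemAdapter.ADAPTER_CLASSES.appendleft(MyItemAdapter)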

To give an example for that last point: suppose I have a document where url is a required field. In my custom MongoItemAdapter, I have this classmethod:

    from typing import Any

    import mongoengine
    from itemadapter import ItemAdapter, is_item
    from itemadapter.adapter import AdapterInterface

    class Webpage(mongoengine.DynamicDocument):
        url = mongoengine.URLField(unique=True, required=True)
        title = mongoengine.StringField()

    class MongoItemAdapter(AdapterInterface):
        ...
        @classmethod
        def is_item(cls, item: Any) -> bool:
            try:
                Webpage(**item).validate()
            except mongoengine.errors.ValidationError:
                return False
            return True
        ...

    ItemAdapter.ADAPTER_CLASSES.appendleft(MongoItemAdapter)
    item = {'title': 'Something something'}    # url is missing, but is required by Webpage
    assert is_item(item) is False    # will fail, because default item adapters will return True

My custom adapter will reject the item, but one of the default adapters (the dict adapter, in this case) will still accept it. So, is validation a responsibility of the adapter? And should I then catch validation errors later (e.g. when saving the document)?

I could also do this:

    ItemAdapter.ADAPTER_CLASSES = [MongoItemAdapter]

but then using itemadapter would of course be unnecessary.

I guess what I'm asking here is: is there a way to know which adapters have failed (if validation should be itemadapter's concern, of course)?

The ItemAdapter interface is not meant to handle validation, no. is_item will usually be as easy to implement as return isinstance(item, <new class to support>).
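
Applied to the Webpage document from the snippet above, that boils down to a plain type check; a sketch reusing the earlier Webpage definition, not code shipped by itemadapter:

    from typing import Any

    from itemadapter.adapter import AdapterInterface

    class MongoItemAdapter(AdapterInterface):
        ...
        @classmethod
        def is_item(cls, item: Any) -> bool:
            # Plain type check; field validation happens elsewhere
            return isinstance(item, Webpage)
        ...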

In Scrapy, validation would be handled in a spider, a spider middleware or an item pipeline.
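
For instance, with the Webpage document from above, a validating item pipeline could look roughly like this (a sketch; ValidateWebpagePipeline is a made-up name, and DropItem is Scrapy's standard way to discard an item):

    import mongoengine
    from scrapy.exceptions import DropItem

    class ValidateWebpagePipeline:
        def process_item(self, item, spider):
            # Assumes the spider yields mongoengine documents (e.g. Webpage)
            try:
                item.validate()
            except mongoengine.errors.ValidationError as exc:
                raise DropItem(f"Invalid item: {exc}")
            return item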

Aha, thanks to your mentioning return isinstance(item, <new class to support>), I now understand the reasoning behind is_item. Keep up the great work!