Extend supported classes with 3rd party classes?
Closed this issue · 5 comments
Hi,
I was wondering it would be feasible to provide an interface to extend the supported classes to e.g.:
The reason I ask is because I'm currently using pydantic for schema validation and it would be nice to be able to return a pydantic BaseModel directly from the spider. (And I can imagine some a similar use case for mongoengine Documents and many other classes too).
Of course, this would introduce unnecessary dependencies so I guess a better way would be to build in extensibility.
I could extend the supported classes myself by subclassing ItemAdapter and extending its methods, but there are a lot of methods to keep track of and it doesn't feel very stable.
Have there been any thoughts on providing an interface to extend the supported classes?
I think it makes sense to support these as part of itemadapter
, just as we added support for dataclasses
or attrs
.
The dependencies should not be an issue. For example, itemadapter
supports Scrapy’s Item
but does not require scrapy
as a dependency. We can enable support based on available packages.
My only suggestion would be to open a second ticket, so that we can keep separate tickets for each suggested new item type, instead of suggesting 2 different types here, as they can (and probably should) be implemented in separate pull requests.
@LaundroMat Thanks for the idea. Support for arbitrary types was just merged (#45) and released (v0.2.0
). Please try it and give us your feedback.
Thanks for this addition!
I'm not using it in production yet (still adding some tests to my code for the changes this entails), but so far things seem to work fine. I've switched from pydantic to marshmallow meanwhile, so the adapter interface is very timely for me 👍
To be honest, I had to wrap my head around the finer details of the implementation to get things rolling, but that's probably just me. FWIW, those were:
- Having to implement all methods from the MutableMapping class too. I had to look at the source code of
itemadapter.adapter._MixinAttrsDataclassAdapter
to get an idea of what I was meant to add; - The fact that the first class to return True for the is_item class method is used for all subsequent operations confused me. If item data does not validate in my customer adapter (and that adapter class' classmethod
is_item
hence returns False), my code is still under the impression that the item data is valid for the use the adapter was built for (e.g. saving to a mongo document). Or is validation a concern that should be moved outside of the adapter?
To give an example for that last point: suppose I have a document where url is a required field. In my custom MongoItemAdapter, I have this classmethod:
from itemadapter import AdapterInterface, ItemAdapter, is_item
class Webpage(mongoengine.DynamicDocument):
url = mongoengine.URLField(unique=True, required=True)
title = mongoengine.StringField()
class MongoItemAdapter(AdapterInterface):
...
@classmethod
def is_item(cls, item: Any) -> bool:
try:
assert Webpage(**item).validate()
except mongoengine.errors.ValidationError:
return False
...
ItemAdapter.ADAPTER_CLASSES.appendleft(MongoItemAdapter)
item = {'title': 'Something something'} # url is missing, but is required by Webpage
assert is_item(item) is False # will fail, because default item adapters will return True
My custom adapter will fail, but any other adapter will not. So, is validation a responsibility of the adapter? And should I then catch validation errors later (e.g. when saving the document)?
I could also do this:
ItemAdapter.ADAPTER_CLASSES = [auction_item_adapter]
but using itemadapter would then be unnecessary of course.
I guess what I'm asking here is: is there a way to know what adapters have failed (if validation should be itemadapter's concern of course)?
The ItemAdapter interface is not meant to handle validation, no. is_item
will usually be as easy to implement as return isinstance(item, <new class to support>)
.
In Scrapy, validation would be handled in a spider, a spider middleware or an item pipeline.
Aha thanks to your mentioning return isinstance(item, <new class to support>)
, I understand the reasoning behind is_item
now. Keep up the great work!