scrapinghub/spidermon

Spidermon fails to add `_validation` field to the item.

proway2 opened this issue · 10 comments

Ubuntu 22.04, Python 3.10.9, Spidermon 1.17.1

If an item defined with attrs library and this condition https://github.com/scrapinghub/spidermon/blob/master/spidermon/contrib/scrapy/pipelines.py#L140 is True, then it fails here https://github.com/scrapinghub/spidermon/blob/master/spidermon/contrib/scrapy/pipelines.py#L141 with this message:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
TypeError: 'ProductList' object does not support item assignment

This is a limitation on the Item definition as it don't support adding new fields.

In theory, using slots=False should work according to the attr dosc, however it didn't work for me.

Alternatively, adding the field _validation to the class should remove the error, however, you will get _validation automatically populated to None on each item which is not the best solution either.

On my experience I had to just convert from Attrs class to dict before spidermon validaton.

@VMRuiz I had a hope that with this PR we wouldn't need this conversion-to-dict pipeline anymore.

That class is just an interface to simplify read and write access to the different types of Item classes. It doesn't add any additional functionality.

I don't know if we could overcharge it to support additional fields not defined on the original item class.

Maybe we can start by capturing the exception and raising a warning instead.

cc. @Gallaecio @rennerocha

Alternatively, adding the field _validation to the class should remove the error

I think this is the way to go. You either disable the creation of this _validation field, if Spidermon allows for that, or you add the field to your item class.

@Gallaecio Shall we close this ticket as wont fix or similar?

Well, I think there might be an action here even if we go with #379 (comment).

  • If the _validation field cannot be disabled without disabling validation itself (I don’t know if that is the case), then that’s something to support.
  • We should document this scenario, i.e. the need to add _validation to an item class that does not support defining unknown fields, and the option to disable the generation of that field instead.

This can be turn off with SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS - https://spidermon.readthedocs.io/en/latest/item-validation.html#spidermon-validation-add-errors-to-items

We may expand the docs to explicitly state that if enabled, it will only work on items that have defined the _validation field or that support adding undefined fields, like dicts .

Edit: Additionally, SPIDERMON_VALIDATION_ERRORS_FIELD can be use to replace the _validation field with any other field name.

This problem was added by #358 and it is a backward incompatible change that we didn't inform our users that it was introduced. So we need to do something about it. Probably one new release that indicates that _validation field will only work if we are dealing with dict item.

We already have this situation in Scrapy related to files and files_urls field for the media pipelines (https://docs.scrapy.org/en/latest/topics/media-pipeline.html#usage-example). So adding the information that we need to specify _validation should be enough. However, I believe that we should put as a warning, not just in the middle of the text.