freedomofpress/securedrop-client

[sdk] Use pydantic for requests/responses

legoktm opened this issue · 6 comments

@nabla-c0d3 had proposed using pydantic for the sdk:

Related to #1766 (and #1765 and #1764), this is to demonstrate the usage of pydantic for defining requests and responses.
Why pydantic:

  • The declarations are very readable.
  • It works with type annotations.
  • It throws an exception if parsing of the message fails (i.e. if a field's type in the JSON message does not match the class declaration, or if a field is missing).
  • I've used it extensively in other projects and it's a great library. Note that pydantic requires Python 3.6.

I made the change to the User object only, and if this approach is approved I could do the rest of the objects.

Other objects are going to be tricky, though, due to #1764. For example, right now the Source and Reply objects are used both to parse a response returned by the server and to pass around a UUID when trying to fetch an object from the server (where the rest of the fields are set to dummy values). This approach is visible, for example, in freedomofpress/securedrop-sdk#32.
So I might have to make API-breaking changes to properly fix this, but I don't know whether that's an option.
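(For readers unfamiliar with the pattern being proposed, here is a dependency-free sketch of the "parse or raise" behavior that pydantic automates from type annotations. The `User` fields and the `parse_user` helper are illustrative stand-ins, not the SDK's actual definitions.)

```python
from dataclasses import dataclass
from uuid import UUID


@dataclass(frozen=True)
class User:
    """Hand-rolled stand-in for a pydantic model: construct it, or raise."""
    uuid: UUID
    username: str

    def __post_init__(self) -> None:
        # Pydantic derives these checks automatically from the annotations;
        # here they are spelled out by hand to show the behavior.
        if not isinstance(self.uuid, UUID):
            raise TypeError(f"uuid: expected UUID, got {type(self.uuid).__name__}")
        if not isinstance(self.username, str):
            raise TypeError(f"username: expected str, got {type(self.username).__name__}")


def parse_user(payload: dict) -> User:
    """Build a User from a decoded JSON object, raising on a bad shape."""
    try:
        return User(uuid=UUID(payload["uuid"]), username=payload["username"])
    except KeyError as exc:
        raise ValueError(f"missing field: {exc}") from exc
```

A well-formed payload parses; a missing or mistyped field raises immediately instead of flowing onward as a wrong value.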

kushaldas wrote:

After digging more into the usage of pydantic, my personal preference would be not to use it, for two main reasons:

  • We are receiving data from our own server instances (no third-party data source), which means we have high confidence in the data we are receiving (note: I am not talking about safe/unsafe data, but about the actual data types).

  • Right now we have mypy checking data types across the codebase, and that does not affect the runtime. Even if we make a mistake and write the wrong data type, the runtime is not affected and the user experience is not hampered. A runtime check, on the other hand, has to be 100% same and accurate for this, or else it will raise exceptions.
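(To illustrate the distinction Kushal draws: mypy flags a wrong argument type during static analysis at no runtime cost, while the program itself keeps running. The helper below is hypothetical, not the SDK's real API.)

```python
def star_source(uuid: str) -> str:
    """Build an add-star path for a source (hypothetical endpoint)."""
    return f"api/v1/sources/{uuid}/add_star"


# mypy would flag this call (int is not str) when the code is type-checked,
# but at runtime the f-string happily interpolates the int and no
# exception is raised -- the mistake stays invisible to the user.
path = star_source(42)
```

A runtime validator would instead raise at this exact line, which is the trade-off being debated.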

rmol wrote:

I'm coming to the same conclusion as Kushal, I'm afraid.

It looks like it might be difficult to integrate pydantic with other class-based validation like SQLAlchemy models or Flask forms. If the SDK is the only place we can apply it in the SecureDrop ecosystem, I'm not sure the benefit would outweigh the cognitive cost.

A bigger problem for me is that incoming data can be altered to fit the model specification, particularly considering the purpose of this project. I think it would be OK for the SDK to double-check the data it's moving, as a backup for the validation happening in the server or workstation, but it should never be silently truncating numeric values or converting between types. It's a conduit.

We have seen problems where a value of the wrong type makes its way into a column, thanks to quirks of the combination of SQLAlchemy and SQLite. The last thing we need is a new validation library being itself a source of those bugs. 🙁

If you have integrated pydantic with other validation systems, or if I've missed a way to prevent inadvertently transforming data as it's passing through the SDK, let me know. I'd be happy to take another look.
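(A dependency-free sketch of the hazard rmol describes: a lenient parser, roughly mimicking pydantic v1's default coercion, quietly reshapes values, while a "verbatim conduit" check raises instead. Both function names are illustrative.)

```python
def lenient_int(value: object) -> int:
    """Roughly mimics lenient parsing: coerce to int whenever possible."""
    if isinstance(value, int):
        return value
    if isinstance(value, float):
        return int(value)  # silently truncates 3.9 -> 3
    if isinstance(value, str):
        return int(value)  # silently converts "42" -> 42
    raise TypeError(f"cannot coerce {type(value).__name__} to int")


def strict_int(value: object) -> int:
    """A verbatim conduit: accept exactly an int, or raise."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"expected int, got {type(value).__name__}")
    return value
```

With the lenient version, a server bug that sends `"42"` or `3.9` is papered over in transit; with the strict version it surfaces immediately.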

nabla-c0d3 wrote:

This will be my last post on this PR/issue as this was just a proposal, but here is another example from the SDK code.

  1. Current version:
class API:

    def get_sources(self) -> List[Source]:
        path_query = "api/v1/sources"
        method = "GET"

        data, status_code, headers = self._send_json_request(
            method, path_query, headers=self.req_headers, timeout=self.default_request_timeout,
        )

        sources = data["sources"]
        result = []  # type: List[Source]

        for source in sources:
            s = Source(**source)
            result.append(s)

        return result
  2. With pydantic:
class SourceField(pydantic.BaseModel):
    uuid: UUID
    add_star_url: str
    is_flagged: bool
    # etc...


class SourcesResponse(pydantic.BaseModel):
    sources: List[SourceField]


class API:

    def get_sources(self) -> List[SourceField]:
        path_query = "api/v1/sources"
        method = "GET"

        data, status_code, headers = self._send_json_request(
            method, path_query, headers=self.req_headers, timeout=self.default_request_timeout,
        )
        # Parse and validate in one step; a malformed response raises
        # pydantic.ValidationError instead of surfacing as a bug later.
        return SourcesResponse(**data).sources

I think that the code to parse the response is a lot easier to read in 2) and less prone to coding mistakes. Also adding a new field to the response will be a lot easier. The same logic can be applied to any code/project that needs to send requests or parse responses, and the pydantic-style declaration tells the reader that the types and the message formats are 100% correct, as they are enforced at runtime. The code becomes self-documenting.

Lastly, to reply to some of your comments and describe my thoughts a bit more:

Even if we make a mistake and write the wrong data type, the runtime is not affected and the user experience is not hampered. A runtime check, on the other hand, has to be 100% same and accurate for this, or else it will raise exceptions.

I would argue that data types/formats being "100% same and accurate for this, or else it will raise exceptions" is exactly the behavior that you want. With a wrong data type, the runtime will be operating on wrong assumptions; it might work for some time or in some cases, but there will definitely be bugs at some point. These bugs will probably be tricky to troubleshoot, because they will be side effects of the initially wrong assumptions. Instead, you want the code to crash as soon as its assumptions are no longer correct, for example as soon as a developer changes a type and makes a mistake doing so. That lets the developer catch problems when they run the tests, instead of letting them reach production.

It looks like it might be difficult to integrate pydantic with other class-based validation like SQLAlchemy models or Flask forms

I've used pydantic extensively with SQLAlchemy and didn't run into problems. I have not used it with Flask, though.

That's all for pydantic... Ultimately, it is of course not up to me to make a decision on this. My personal experience with pydantic has been great, but perhaps there are better approaches for the SecureDrop projects.

rmol wrote:

This will be my last post on this PR/issue as this was just a proposal, but here is another example from the SDK code.

Thanks for the thought you've put into this. I understand not wanting to sink more time into something that we don't seem likely to adopt, so don't feel obligated to respond.

I've used pydantic extensively with SQLAlchemy and didn't run into problems.

Is any of that work public? I'd like to see how you integrated them.

I would argue that data types/formats being "100% same and accurate for this, or else it will raise exceptions" is exactly the behavior that you want.
I think that the code to parse the response is a lot easier to read in 2) and less prone to coding mistakes. Also adding a new field to the response will be a lot easier.

I agree with all of that. But ideally, this SDK should either convey values verbatim or raise an exception, and it's not obvious to me how to avoid pydantic's value coercion right now. I did find pydantic/pydantic#1098, which suggests that better/easier strict parsing might be coming, so maybe we can take another look then; if we've also figured out how to integrate it with the other tools I mentioned, we could bring it in.

nabla-c0d3 wrote:

Our go-to web stack at work has evolved over the years and is now falcon + pydantic + sqlalchemy. Unfortunately, none of this code is public. The code base we have with this stack is pretty big, though.

As for the type coercion done by pydantic, I guess I don't see it as a big problem? The main thing when using pydantic is that the types in the message format/class should not be treated as "type hints" but as the source of truth, and therefore must be correct. If a field is defined as an int but in reality should be a float, that's a mistake in the field's declaration.

OK, after copying over all of those comments... my own thoughts: I'm a big fan of Rust's serde library, which pydantic is conceptually quite similar to. I would like more state encoded in the type system, and tools like pydantic, dataclasses, etc. would move us in that direction.
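(As a standard-library-only sketch of "more state encoded in the type system": replacing a free-form status string with an enum on a frozen dataclass makes invalid states unrepresentable. The `Reply` fields here are illustrative, not the client's actual model.)

```python
from dataclasses import dataclass
from enum import Enum


class ReplyStatus(Enum):
    """The only statuses a reply can have; a typo cannot construct one."""
    PENDING = "pending"
    FAILED = "failed"
    SUCCEEDED = "succeeded"


@dataclass(frozen=True)
class Reply:
    uuid: str
    status: ReplyStatus


# A raw dict would let status="sucess" (a typo) flow through unnoticed;
# with the enum, bad strings fail at the boundary: ReplyStatus("sucess")
# raises ValueError, and valid ones round-trip cleanly.
r = Reply(uuid="abc123", status=ReplyStatus.SUCCEEDED)
```

Dataclasses give the structure without runtime coercion; pydantic would add the parse-time validation on top.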