bloomberg/python-comdb2

Provide a "class" row factory

deppe opened this issue · 14 comments

deppe commented

Is your feature request related to a problem? Please describe.
Currently, comdb2 module provides dict and namedtuple factories. It would also be nice to provide a "class" row factory which would allow rows returned as client-defined dataclasses, pydantic models, etc.

Describe the solution you'd like
See a similar concept in psycopg's class_row factory: https://www.psycopg.org/psycopg3/docs/api/rows.html#psycopg.rows.class_row
This also allows nice static type checking: https://www.psycopg.org/psycopg3/docs/advanced/typing.html#example-returning-records-as-pydantic-models

One way to design the "class" row factory would be to implement the class_row_factory as a class. It would look something like the following:


In comdb2/factories.py

class ClassRowFactory:
    def __init__(self, class_pointer: Any) -> None:
        self.class_pointer = class_pointer

    def __call__(self, col_names) -> Any:
        _raise_on_duplicate_column_names(col_names)

        def class_row(col_values):
            key_value_mapping = dict(zip(col_names, col_values))
            return self.class_pointer(**key_value_mapping)

        return class_row

This fits into the existing code structure in the factories.py file. Usage would look something like:

Example:
        >>> @dataclass
        ... class ABC:
        ...     x: int
        ...     y: int
        ...
        >>> conn.row_factory = ClassRowFactory(ABC)
        >>> row = conn.cursor().execute("select 1 as x, 2 as y").fetchone()
        >>> print(row)
        ABC(x=1, y=2)
        >>> print(row.x)
        1

#57

Here's the PR for the addition

key_value_mapping = dict(zip(col_names, col_vals))
resulting_class = self.class_pointer(**key_value_mapping)

Is this the way that psycopg3 does it, with all columns names being passed to the class constructor as keyword arguments?

The column names aren't passed to the class constructor; rather, a pointer to the particular class is passed to the factory's constructor. To make the factory behave like a function (as in the existing format of comdb2's factories) I used the __call__ method. If you wish to see how it's done in psycopg, it approaches it in a similar fashion:

def class_row(cls: Type[T]) -> BaseRowFactory[T]:
    r"""Generate a row factory to represent rows as instances of the class `!cls`.

    The class must support every output column name as a keyword parameter.

    :param cls: The class to return for each row. It must support the fields
        returned by the query as keyword arguments.
    :rtype: `!Callable[[Cursor],` `RowMaker`\[~T]]
    """
    def class_row_(cursor: "BaseCursor[Any, Any]") -> "RowMaker[T]":
        names = _get_names(cursor)
        if names is None:
            return no_result

        def class_row__(values: Sequence[Any]) -> T:
            return cls(**dict(zip(names, values)))  # <-- this is how it converts the row to a class instance and returns it

        return class_row__

    return class_row_

I adopted a similar method, and to make it easy to understand and read I put it in a class.

The __call__ method enables the class instance to be called like a function. So essentially what is being passed in conn.row_factory = ClassRowFactory(ABC) is a class instance that can also behave like a reference to a function.
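As a generic illustration of the __call__ mechanism described above (the class and names here are hypothetical, unrelated to comdb2): an instance that holds state can be invoked exactly like a function.

```python
class Multiplier:
    def __init__(self, factor: int) -> None:
        self.factor = factor

    def __call__(self, value: int) -> int:
        # calling the instance runs this method
        return self.factor * value

double = Multiplier(2)
print(double(21))  # → 42; the instance behaves like a function
```

This is the same pattern ClassRowFactory relies on: the constructor captures state (the class to instantiate), and __call__ lets the instance be assigned anywhere a row-factory callable is expected.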


I recommended this approach because it doesn't break the format in which the other factories in the factories.py file are used.

The column names aren't passed as a class constructor but the pointer to the particular class is passed to the constructor

Right, I understand that, I was asking about how that class is called. Does psycopg3 turn every column name into a keyword argument for that class constructor?

Does psycopg3 turn every column name into a keyword argument for that class constructor?

In its source code it gets the column names using a helper function and zips them with the row values to form keyword arguments. It obtains the column names through cursor functionality implemented within the library.

It's easy to convert a dictionary to a class object. If there is already a blueprint for building the dictionary within factories.py, couldn't the dict factory be extended to simply pass that dictionary into its respective class?

My concerns about a class row factory have never been about ease of implementation, they're about performance and about robustness.

On the performance side, this approach to a class row factory means that for each row a zip object gets created, and a dict (or I think actually 2 dicts, under the covers, due to the way the interpreter implements **), and an instance of the user-provided class. That's a lot of work, but maybe it's justified for the convenience.
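To make that per-row cost concrete, here is a sketch of the objects created for each fetched row under this approach (Point and the tuples below are stand-ins for the user-provided class and one row of query results):

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int

col_names = ("x", "y")
col_values = (1, 2)                  # one row's values

pairs = zip(col_names, col_values)   # a new zip object per row
mapping = dict(pairs)                # a new dict per row
point = Point(**mapping)             # ** unpacking may copy the dict again,
                                     # then a new Point instance is created
print(point)                         # Point(x=1, y=2)
```

Three to four allocations per row, multiplied across a large result set, is the overhead being weighed against the convenience.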

On the robustness side, I'm wondering how useful it is to assume that column names will exactly match the parameter names of the __init__ of the class that was provided to the factory-factory's constructor.

If psycopg3 has done this the same way - with a dict being created for each row and used to provide keyword arguments to the class constructor based on the column names - that helps to set my mind at ease about both the performance and robustness aspect. That's why I'm inquiring about the approach they took.

On the robustness side, I'm wondering how useful it is to assume that column names will exactly match the parameter names of the __init__ of the class that was provided to the factory-factory's constructor.

The column names and values are returned as a dictionary

{'col1': value1, 'col2': value2}

These will match the parameters of the specified class as long as the parameter names and the key names (in this case, the column names) are the same.

so the class should look something like

@dataclass
class ABC:
    col1: data_type1
    col2: data_type2

Just as described in the original psycopg implementation.

One more important thing: the datatypes of the dataclass attributes should match the datatypes of the corresponding dictionary values.
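A minimal sketch of that name-matching requirement (class and column names below are hypothetical): construction succeeds when the dictionary keys match the dataclass fields, and raises TypeError when any key does not.

```python
from dataclasses import dataclass

@dataclass
class ABC:
    col1: int
    col2: str

row = {"col1": 1, "col2": "a"}
obj = ABC(**row)                 # keys match the field names, so this works
print(obj)                       # ABC(col1=1, col2='a')

bad = {"col1": 1, "colX": "a"}   # 'colX' is not a field of ABC
try:
    ABC(**bad)
except TypeError as exc:
    print("mismatch:", exc)      # unexpected keyword argument 'colX'
```

This is exactly the robustness question raised above: any column alias in the query that doesn't correspond to an __init__ parameter makes the row factory blow up at fetch time.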

I'm pretty sure I don't want to expose this feature, at least as things stand today. There's a few different things that make this make more sense for psycopg3 than it makes for Comdb2, but the biggest one is that I'm reluctant to add more code that makes static type checking difficult.

The psycopg3 Connection and Cursor classes are generic, so a type checker like mypy or pylance can catch a bug like this:

from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int

conn = psycopg.connect(row_factory=class_row(Point))
cur = conn.cursor()
cur.execute("select 1, 2")
for point in cur.fetchall():
    print(f"x={point.x} y={point.y} z={point.z}")

The type checker knows that Connection[Point].cursor().fetchall() returns a Sequence[Point], and it knows that a Point doesn't have a z attribute, so the last line fails type analysis because it tries to access point.z.

Compare that against this example that uses python-comdb2, using the interface proposed in #57:

from dataclasses import dataclass

from comdb2.dbapi2 import connect
from comdb2.factories import ClassRowFactory

@dataclass
class Point:
    x: int
    y: int

conn = connect("mattdb")
conn.row_factory = ClassRowFactory(Point)
cur = conn.cursor()
cur.execute("select 1, 2")
for point in cur.fetchall():
    print(f"x={point.x} y={point.y} z={point.z}")

In the case of python-comdb2, our Connection and Cursor classes aren't currently generic, and so the type checker wouldn't catch the same bug - cur.fetchall() is typed as returning Sequence[Any] in the comdb2 world, and accessing the z attribute of an object of type Any isn't a type error.

It seems a bit unfair to dislike this because of an unrelated feature that python-comdb2 is lacking, except that there's an alternative way to write this code where a type checker would catch the bug. If that Comdb2 example were rewritten to this:

from dataclasses import dataclass

from comdb2.dbapi2 import connect
from comdb2.factories import dict_row_factory

@dataclass
class Point:
    x: int
    y: int

conn = connect("mattdb")
conn.row_factory = dict_row_factory
cur = conn.cursor()
cur.execute("select 1, 2")
for row in cur.fetchall():
    point = Point(**row)
    print(f"x={point.x} y={point.y} z={point.z}")

then the type checker would catch the bug. That's only one more line of code, and it makes it so that type checkers can usefully check the processing of each row. If someone wants to convert each row to a class, I'd rather they do it this way so that the type checker understands what's happening.

That's not a "no" for doing this proposed change, but I think it is a "not yet". I don't think this change makes sense unless and until our Connection and Cursor types become generic.
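For illustration only, here is a hypothetical sketch of what a generic cursor on the comdb2 side could look like (none of these signatures exist in python-comdb2 today); with a Cursor[T], a type checker could propagate the row type the way psycopg3's generic classes do:

```python
from typing import Generic, Sequence, TypeVar

T = TypeVar("T")

class Cursor(Generic[T]):
    """Hypothetical generic cursor, not the real comdb2 Cursor."""

    def __init__(self, rows: Sequence[T]) -> None:
        self._rows = list(rows)

    def fetchall(self) -> Sequence[T]:
        # the checker knows the element type of the returned sequence
        return self._rows

# mypy/pylance would infer Sequence[int] here, so e.g. calling a
# str-only method on an element would be flagged as a type error.
cur: Cursor[int] = Cursor([1, 2, 3])
print(cur.fetchall())
```

Once the real Connection and Cursor carry a type parameter like this, a class row factory could bind T to the user's class and the point.z bug above would be caught statically.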

And sorry for taking so long to reply here. I was hoping I'd find time to try making our Connection and Cursor classes generic, and never managed to get around to it.