kippnorcal/sqlsorcery

Passing Params vs. Engine

Opened this issue · 4 comments

https://github.com/dchess/sqlsorcery/blob/00a14ae7bd65d893d8cfa63a08aad04bd35e71c8/sqlsorcery/__init__.py#L214

The way this is designed currently has a separate class for each type of database this supports. This takes away the advantages of SQLALchemy, that it is DB-agnostic. What are your thoughts on instead passing in the SQLAlchemy engine, or the string to generate that engine, instead? That allows us to focus on the real value-add of this library, which is simplifying some of the DB<->Dataframe interactions, and remove a ton of boilerplate code and dependancies for each DB infrastructure.

@zkagin say more about that. The approach to the classes was to help abstract away the needed methods for creating those connection strings and the specifics of each engine creation (i.e some need the fast_executemany (like MS SQL) or the executemany_mode (like Postgres) params, while others do not. They also don't all have the same params needed for the connection string.

I think I see where you are going, but perhaps you can give an example of what a simplified version would look like? I could see a world in which we used kwargs to pass in a dictionary, but I'm not totally seeing the full benefits of that. One of the benefits I had in mind with inheritance, was that we could replace methods. For instance, BigQuery would need different underlying code to handle the query and insert_into methods because it uses different approaches in pandas.

Happy to talk more. I'm planning on digging in to some of this over the weekend, so perhaps we could talk through an approach beforehand.

@zkagin I'm experimenting with an alternate approach based on your feedback. Tell me what you think:

from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL


def get_engine():
    dialect = os.getenv("DB_DIALECT")
    options = {
        "username": os.getenv("DB_USER"),
        "password": os.getenv("DB_PWD"),
        "host": os.getenv("DB_SERVER"),
        "port": os.getenv("DB_PORT"),
        "database": os.getenv("DB"),
    }
    helpers = {}
    if "mssql" in dialect:
        options["query"] = {"driver": os.getenv("ODBC_DRIVER")}
        helpers["fast_executemany"] = True
    elif "postgres" in dialect:
        helpers["executemany_mode"] = "batch"
    cstr = URL(dialect, **options)
    return create_engine(cstr, **helpers)

I'll need to still figure out how to handle SQLite, Oracle, and BigQuery but I'm thinking I could extend the Connection class init method using this logic.

@zkagin: I've tested this with MS SQL, Postgres, and SQLite and it seems to work fine:

from os import getenv
from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL


class Connection:
    def __init__(
        self,
        dialect=getenv("DB_DIALECT"),
        user=getenv("DB_USER"),
        pwd=getenv("DB_PWD"),
        server=getenv("DB_SERVER"),
        port=getenv("DB_PORT"),
        db=getenv("DB"),
        schema=getenv("DB_SCHEMA"),
    ):
        self.dialect = dialect
        self.options = {
            "username": user,
            "password": pwd,
            "host": server,
            "port": port,
            "database": db,
        }
        self.helpers = {}
        self.engine = self._get_engine()

    def _cstr(self):
        return URL(self.dialect, **self.options)

    def _mssql_options(self):
        if "mssql" in self.dialect:
            self.options["query"] = {"driver": getenv("ODBC_DRIVER")}
            self.helpers["fast_executemany"] = True

    def _postgres_options(self):
        if "postgres" in self.dialect:
            self.helpers["executemany_mode"] = "batch"

    def _get_engine(self):
        self._mssql_options()
        self._postgres_options()
        return create_engine(self._cstr(), **self.helpers)

Then you would use it like:

import os
from sqlsorcery import Connection

sql = Connection()
df = pd.read_sql_table("test", con=sql.engine, schema=sql.schema)
print(df)

I will likely open a new PR to address this after testing the other dialects as well.

Looks cool! I think something like this is at least a nice "option" for cases where you don't need the rest of what sqlsorcery does. That way you could still leave the other classes as extra shortcuts. A few thoughts:

  • I'd probably move the env variables to parameters. As is, the fact that other libraries have these hidden env dependencies can be a little confusing.
  • Rather than check "in", I would use direct string comparison (unless you have an idea where multiple engines could be generated?)
  • If there's a way to retain the magic you have in getting the driver for MSSQL, that would also be useful!