/dbcooper-py

Quickly access and tab-complete the tables in your database.

Primary LanguagePythonMIT LicenseMIT

dbcooper-py

CI Binder

The dbcooper package turns a database connection into a collection of functions, handling logic for keeping track of connections and letting you take advantage of autocompletion when exploring a database.

It's especially helpful to use when authoring database-specific Python packages, for instance in an internal company package or one wrapping a public data source.

For the R version see dgrtwo/dbcooper.

Installation

pip install dbcooper

Example

Initializing the functions

The dbcooper package asks you to create the connection first. As an example, we'll use the Lahman baseball database package (lahman).

from sqlalchemy import create_engine
from dbcooper.data import lahman_sqlite

# connect to sqlite
engine = create_engine("sqlite://")

# load the lahman data into the "lahman" schema
lahman_sqlite(engine)

Next we'll set up dbcooper

from dbcooper import DbCooper

dbc = DbCooper(engine)

The DbCooper object contains two important things:

  • Accessors to fetch specific tables.
  • Functions for interacting with the underlying database.

Using table accessors

In the example below, we'll use the "Lahman"."Salaries" table as an example. By default, dbcooper makes this accessible as .lahman_salaries.

Plain .lahman_salaries prints out table and column info, including types and descriptions.

# show table and column descriptions
dbc.lahman_salaries

salaries

(No table description.)

name type description
index BIGINT
yearID BIGINT
teamID TEXT
lgID TEXT
playerIDTEXT
salary BIGINT

Note that sqlite doesn't support table and columnn descriptions, so these sections are empty.

Calling .lahman_salaries() fetches a lazy version of the data.

dbc.lahman_salaries()
# Source: lazy query
# DB Conn: Engine(sqlite://)
# Preview:
   index  yearID teamID lgID   playerID  salary
0      0    1985    ATL   NL  barkele01  870000
1      1    1985    ATL   NL  bedrost01  550000
2      2    1985    ATL   NL  benedbr01  545000
3      3    1985    ATL   NL   campri01  633333
4      4    1985    ATL   NL  ceronri01  625000
# .. may have more rows

Note that this data is a siuba LazyTbl object, which you can use to analyze the data.

from siuba import _, count

dbc.lahman_salaries() >> count(over_100k = _.salary > 100_000)
# Source: lazy query
# DB Conn: Engine(sqlite://)
# Preview:
   over_100k      n
0       True  25374
1      False   1054
# .. may have more rows

Using database functions

  • .list(): Get a list of tables
  • .tbl(): Access a table that can be worked with using siuba.
  • .query(): Perform a SQL query and work with the result.
  • ._engine: Get the underlying sqlalchemy engine.

For instance, we could start by finding the names of the tables in the Lahman database.

dbc.list()
['lahman.allstar_full',
 'lahman.appearances',
 'lahman.awards_managers',
 'lahman.awards_players',
 'lahman.awards_share_managers',
 'lahman.awards_share_players',
 'lahman.batting',
 'lahman.batting_post',
 'lahman.college_playing',
 'lahman.fielding',
 'lahman.fielding_of',
 'lahman.fielding_ofsplit',
 'lahman.fielding_post',
 'lahman.hall_of_fame',
 'lahman.home_games',
 'lahman.managers',
 'lahman.managers_half',
 'lahman.parks',
 'lahman.people',
 'lahman.pitching',
 'lahman.pitching_post',
 'lahman.salaries',
 'lahman.schools',
 'lahman.series_post',
 'lahman.teams',
 'lahman.teams_franchises',
 'lahman.teams_half']

We can access one of these tables with dbc.tbl(), then put it through any kind of siuba operation.

dbc.tbl("Salaries")
# Source: lazy query
# DB Conn: Engine(sqlite://)
# Preview:
   index  yearID teamID lgID   playerID  salary
0      0    1985    ATL   NL  barkele01  870000
1      1    1985    ATL   NL  bedrost01  550000
2      2    1985    ATL   NL  benedbr01  545000
3      3    1985    ATL   NL   campri01  633333
4      4    1985    ATL   NL  ceronri01  625000
# .. may have more rows
from siuba import _, count
dbc.tbl("Salaries") >> count(_.yearID, sort=True)
# Source: lazy query
# DB Conn: Engine(sqlite://)
# Preview:
   yearID     n
0    1999  1006
1    1998   998
2    1995   986
3    1996   931
4    1997   925
# .. may have more rows

If you'd rather start from a SQL query, use the .query() method.

dbc.query("""
    SELECT
        playerID,
        sum(AB) as AB
    FROM Batting
    GROUP BY playerID
""")
# Source: lazy query
# DB Conn: Engine(sqlite://)
# Preview:
    playerID     AB
0  aardsda01      4
1  aaronha01  12364
2  aaronto01    944
3   aasedo01      5
4   abadan01     21
# .. may have more rows

For anything else you might want to do, the sqlalchemy Engine object is available. For example, the code below shows how you can set its .echo attribute, which tells sqlalchemy to provide useful logs.

dbc._engine.echo = True
table_names = dbc.list()
2022-03-20 22:49:37,553 INFO sqlalchemy.engine.Engine PRAGMA database_list
2022-03-20 22:49:37,554 INFO sqlalchemy.engine.Engine [raw sql] ()
2022-03-20 22:49:37,555 INFO sqlalchemy.engine.Engine SELECT name FROM "main".sqlite_master WHERE type='table' ORDER BY name
2022-03-20 22:49:37,555 INFO sqlalchemy.engine.Engine [raw sql] ()
2022-03-20 22:49:37,556 INFO sqlalchemy.engine.Engine SELECT name FROM "lahman".sqlite_master WHERE type='table' ORDER BY name
2022-03-20 22:49:37,557 INFO sqlalchemy.engine.Engine [raw sql] ()

Note that the log messages above show that the .list() method executed two queries: One to list tables in the "main" schema (which is empty), and one to list tables in the "lahman" schema.

Advanced Configuration

⚠️: These behaviors are well tested, but dbcooper's internals and API may change.

dbcooper can be configured in three ways, each corresponding to a class interface:

  • TableFinder: Which tables will be used by dbcooper.
  • AccessorBuilder: How table names are turned into accessors.
  • DbcDocumentedTable: The class that defines what an accessor will return.
from sqlalchemy import create_engine
from dbcooper.data import lahman_sqlite
from dbcooper import DbCooper, AccessorBuilder

engine = create_engine("sqlite://")
lahman_sqlite(engine)

Excluding a schema

from dbcooper import TableFinder

finder = TableFinder(exclude_schemas=["lahman"])
dbc_no_lahman = DbCooper(engine, table_finder=finder)
dbc_no_lahman.list()
[]

Formatting table names

from dbcooper import AccessorBuilder

# omits schema, and keeps only table name
# e.g. `salaries`, rather than `lahman_salaries`
builder = AccessorBuilder(format_from_part="table")

tbl_flat = DbCooper(engine, accessor_builder=builder)
tbl_flat.salaries()
# Source: lazy query
# DB Conn: Engine(sqlite://)
# Preview:
   index  yearID teamID lgID   playerID  salary
0      0    1985    ATL   NL  barkele01  870000
1      1    1985    ATL   NL  bedrost01  550000
2      2    1985    ATL   NL  benedbr01  545000
3      3    1985    ATL   NL   campri01  633333
4      4    1985    ATL   NL  ceronri01  625000
# .. may have more rows

Grouping tables by schema

from dbcooper import AccessorHierarchyBuilder

tbl_nested = DbCooper(engine, accessor_builder=AccessorHierarchyBuilder())

# note the form: <schema>.<table>
tbl_nested.lahman.salaries()
# Source: lazy query
# DB Conn: Engine(sqlite://)
# Preview:
   index  yearID teamID lgID   playerID  salary
0      0    1985    ATL   NL  barkele01  870000
1      1    1985    ATL   NL  bedrost01  550000
2      2    1985    ATL   NL  benedbr01  545000
3      3    1985    ATL   NL   campri01  633333
4      4    1985    ATL   NL  ceronri01  625000
# .. may have more rows

Don't show table documentation

from dbcooper import DbcSimpleTable

dbc_no_doc = DbCooper(engine, table_factory=DbcSimpleTable)
dbc_no_doc.lahman_salaries
DbcSimpleTable(..., 'salaries', 'lahman')

Note that sqlalchemy dialects like snowflake-sqlalchemy cannot look up things like table and column descriptions as well as other dialects, so DbcSimpleTable may be needed to connect to snowflake (see this issue).

Developing

# install with development dependencies
pip install -e .[dev]

# or install from requirements file
pip install -r requirements/dev.txt

Test

# run all tests, see pytest section of pyproject.toml
pytest

# run specific backends
pytest -m 'not snowflake and not bigquery'

# stop on first failure, drop into debugger
pytest -x --pdb

Release

# set version number
git tag v0.0.1

# (optional) push to github
git push origin --tags

# check version
python -m setuptools_scm