A python package for defensive data analysis. Documentation is at readthedocs.
- pandas
Supports python 2.7+ and 3.4+
Data are messy.
But, our analysis often depends on certain assumptions about our data
that should be invariant across updates to your dataset.
engarde
is a lightweight way to explicitly state your assumptions
and check that they're actually true.
This is especially important when working with flat files like CSV that aren't bound for a more structured destination (e.g. SQL or HDF5).
There are two main ways of using the library, which correspond to the two main ways I use pandas: writing small scripts or interactively at the interpreter.
First, as decorators, which are most useful in .py
scripts
The basic idea is to write each step of your ETL process as a function
that takes and returns a DataFrame. These functions can be decorated with
the invariants that should be true at that step in the process.
from engarde.decorators import none_missing, unique_index, is_shape
@none_missing
def f(df1, df2):
return df1.add(df2)
@is_shape((1290, 10))
@unique_index
def make_design_matrix('data.csv'):
out = ...
return out
Second, interactively.
The cleanest way to integrate this is through the pipe
method,
introduced in pandas 0.16.2 (June 2015).
>>> import engarde.checks as dc
>>> (df1.reindex_like(df2)
... .pipe(dc.unique_index)
... .cumsum()
... .pipe(dc.within_range, (0, 100))
... )