/dfply

dplyr-style piping operations for pandas dataframes

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

dfply

Version: 0.3.2

Note: Version 0.3.0 is the first big update in awhile, and changes a lot of the "base" code. The pandas-ply package is no longer being imported. I have coded my own version of the "symbolic" objects that I was borrowing from pandas-ply. Also, I am no longer supporting Python 2, sorry!

In v0.3 groupby has been renamed to group_by to mirror the dplyr function. If this breaks your legacy code, one possible fix is to have from dfply.group import group_by as groupby in your package imports.

The dfply package makes it possible to do R's dplyr-style data manipulation with pipes in python on pandas DataFrames.

This is an alternative to pandas-ply and dplython, which both engineer dplyr syntax and functionality in python. There are probably more packages that attempt to enable dplyr-style dataframe manipulation in python, but those are the two I am aware of.

dfply uses a decorator-based architecture for the piping functionality and to "categorize" the types of data manipulation functions. The goal of this
architecture is to make dfply concise and easily extensible, simply by chaining together different decorators that each have a distinct effect on the wrapped function. There is a more in-depth overview of the decorators and how dfply can be customized below.

dfply is intended to mimic the functionality of dplyr. The syntax is the same for the most part, but will vary in some cases as Python is a considerably different programming language than R.

A good amount of the core functionality of dplyr is complete, and the remainder is actively being added in. Going forward I hope functionality that is not directly part of dplyr to be incorporated into dfply as well. This is not intended to be an absolute mimic of dplyr, but instead a port of the ease, convenience and readability the dplyr package provides for data manipulation tasks.

Expect frequent updates to the package version as features are added and any bugs are fixed.

Overview of functions

The >> and >>= pipe operators

dfply works directly on pandas DataFrames, chaining operations on the data with the >> operator, or alternatively starting with >>= for inplace operations.

from dfply import *

diamonds >> head(3)

   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31

You can chain piped operations, and of course assign the output to a new DataFrame.

lowprice = diamonds >> head(10) >> tail(3)

lowprice

   carat        cut color clarity  depth  table  price     x     y     z
7   0.26  Very Good     H     SI1   61.9   55.0    337  4.07  4.11  2.53
8   0.22       Fair     E     VS2   65.1   61.0    337  3.87  3.78  2.49
9   0.23  Very Good     H     VS1   59.4   61.0    338  4.00  4.05  2.39

Inplace operations are done with the first pipe as >>= and subsequent pipes as >>.

diamonds >>= head(10) >> tail(3)

diamonds

   carat        cut color clarity  depth  table  price     x     y     z
7   0.26  Very Good     H     SI1   61.9   55.0    337  4.07  4.11  2.53
8   0.22       Fair     E     VS2   65.1   61.0    337  3.87  3.78  2.49
9   0.23  Very Good     H     VS1   59.4   61.0    338  4.00  4.05  2.39

When using the inplace pipe, the DataFrame is not required on the left hand side of the >>= pipe and the DataFrame variable is overwritten with the output of the operations.

The X DataFrame symbol

The DataFrame as it is passed through the piping operations is represented by the symbol X. It records the actions you want to take (represented by the Intention class), but does not evaluate them until the appropriate time. Operations on the DataFrame are deferred. Selecting two of the columns, for example, can be done using the symbolic X DataFrame during the piping operations.

diamonds >> select(X.carat, X.cut) >> head(3)

   carat      cut
0   0.23    Ideal
1   0.21  Premium
2   0.23     Good

Selecting and dropping

select() and drop() functions

There are two functions for selection, inverse of each other: select and drop. The select and drop functions accept string labels, integer positions, and/or symbolically represented column names (X.column). They also accept symbolic "selection filter" functions, which will be covered shortly.

The example below selects "cut", "price", "x", and "y" from the diamonds dataset.

diamonds >> select(1, X.price, ['x', 'y']) >> head(2)

       cut  price     x     y
0    Ideal    326  3.95  3.98
1  Premium    326  3.89  3.84

If you were instead to use drop, you would get back all columns besides those specified.

diamonds >> drop(1, X.price, ['x', 'y']) >> head(2)

   carat color clarity  depth  table     z
0   0.23     E     SI2   61.5   55.0  2.43
1   0.21     E     SI1   59.8   61.0  2.31

Selection using the inversion ~ operator on symbolic columns

One particularly nice thing about dplyr's selection functions is that you can drop columns inside of a select statement by putting a subtraction sign in front, like so: ... %>% select(-col). The same can be done in dfply, but instead of the subtraction operator you use the tilde ~.

For example, let's say I wanted to select any column except carat, color, and clarity in my dataframe. One way to do this is to specify those for removal using the ~ operator like so:

diamonds >> select(~X.carat, ~X.color, ~X.clarity) >> head(2)

       cut  depth  table  price     x     y     z
0    Ideal   61.5   55.0    326  3.95  3.98  2.43
1  Premium   59.8   61.0    326  3.89  3.84  2.31

Note that if you are going to use the inversion operator, you must place it prior to the symbolic X (or a symbolic such as a selection filter function, covered next). For example, using the inversion operator on a list of columns will result in an error:

diamonds >> select(~[X.carat, X.color, X.clarity]) >> head(2)

TypeError: bad operand type for unary ~: 'list'

Selection filter functions

The vanilla select and drop functions are useful, but there are a variety of selection functions inspired by dplyr available to make selecting and dropping columns a breeze. These functions are intended to be put inside of the select and drop functions, and can be paired with the ~ inverter.

First, a quick rundown of the available functions:

  • starts_with(prefix): find columns that start with a string prefix.
  • ends_with(suffix): find columns that end with a string suffix.
  • contains(substr): find columns that contain a substring in their name.
  • everything(): all columns.
  • columns_between(start_col, end_col, inclusive=True): find columns between a specified start and end column. The inclusive boolean keyword argument indicates whether the end column should be included or not.
  • columns_to(end_col, inclusive=True): get columns up to a specified end column. The inclusive argument indicates whether the ending column should be included or not.
  • columns_from(start_col): get the columns starting at a specified column.

The selection filter functions are best explained by example. Let's say I wanted to select only the columns that started with a "c":

diamonds >> select(starts_with('c')) >> head(2)

   carat      cut color clarity
0   0.23    Ideal     E     SI2
1   0.21  Premium     E     SI1

The selection filter functions are instances of the class Intention, just like the X placeholder, and so I can also use the inversion operator with them. For example, I can alternatively select the columns that do not start with "c":

diamonds >> select(~starts_with('c')) >> head(2)

   depth  table  price     x     y     z
0   61.5   55.0    326  3.95  3.98  2.43
1   59.8   61.0    326  3.89  3.84  2.31

They work the same inside the drop function, but with the intention of removal. I could, for example, use the columns_from selection filter to drop all columns from "price" onwards:

diamonds >> drop(columns_from(X.price)) >> head(2)

   carat      cut color clarity  depth  table
0   0.23    Ideal     E     SI2   61.5   55.0
1   0.21  Premium     E     SI1   59.8   61.0

As the example above shows, you can use symbolic column names inside of the selection filter function! You can also mix together selection filters and standard selections inside of the same select or drop command.

For my next trick, I will select the first two columns, the last two columns, and the "depth" column using a mixture of selection techniques:

diamonds >> select(columns_to(1, inclusive=True), 'depth', columns_from(-2)) >> head(2)

   carat      cut  depth     y     z
0   0.23    Ideal   61.5  3.98  2.43
1   0.21  Premium   59.8  3.84  2.31

Subsetting and filtering

row_slice()

Slices of rows can be selected with the row_slice() function. You can pass single integer indices or a list of indices to select rows as with. This is going to be the same as using pandas' .iloc.

diamonds >> row_slice([10,15])

    carat      cut color clarity  depth  table  price     x     y     z
10   0.30     Good     J     SI1   64.0   55.0    339  4.25  4.28  2.73
15   0.32  Premium     E      I1   60.9   58.0    345  4.38  4.42  2.68

Note that this can also be used with the group_by function, and will operate like a call to .iloc on each group. The group_by pipe function is covered later, but it essentially works the same as pandas .groupby (with a few subtle differences).

diamonds >> group_by('cut') >> row_slice(5)

     carat        cut color clarity  depth  table  price     x     y     z
128   0.91       Fair     H     SI2   64.4   57.0   2763  6.11  6.09  3.93
20    0.30       Good     I     SI2   63.3   56.0    351  4.26  4.30  2.71
40    0.33      Ideal     I     SI2   61.2   56.0    403  4.49  4.50  2.75
26    0.24    Premium     I     VS1   62.5   57.0    355  3.97  3.94  2.47
21    0.23  Very Good     E     VS2   63.8   55.0    352  3.85  3.92  2.48

sample()

The sample() function functions exactly the same as pandas' .sample() method for DataFrames. Arguments and keyword arguments will be passed through to the DataFrame sample method.

diamonds >> sample(frac=0.0001, replace=False)

       carat        cut color clarity  depth  table  price     x     y     z
19736   1.02      Ideal     E     VS1   62.2   54.0   8303  6.43  6.46  4.01
37159   0.32    Premium     D     VS2   60.3   60.0    972  4.44  4.42  2.67
1699    0.72  Very Good     E     VS2   63.8   57.0   3035  5.66  5.69  3.62
20955   1.71  Very Good     J     VS2   62.6   55.0   9170  7.58  7.65  4.77
5168    0.91  Very Good     E     SI2   63.0   56.0   3772  6.12  6.16  3.87


diamonds >> sample(n=3, replace=True)

       carat        cut color clarity  depth  table  price     x     y     z
52892   0.73  Very Good     G     SI1   60.6   59.0   2585  5.83  5.85  3.54
39454   0.57      Ideal     H     SI2   62.3   56.0   1077  5.31  5.28  3.30
39751   0.43      Ideal     H    VVS1   62.3   54.0   1094  4.84  4.85  3.02

distinct()

Selection of unique rows is done with distinct(), which similarly passes arguments and keyword arguments through to the DataFrame's .drop_duplicates() method.

diamonds >> distinct(X.color)

    carat        cut color clarity  depth  table  price     x     y     z
0    0.23      Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
3    0.29    Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4    0.31       Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
7    0.26  Very Good     H     SI1   61.9   55.0    337  4.07  4.11  2.53
12   0.22    Premium     F     SI1   60.4   61.0    342  3.88  3.84  2.33
25   0.23  Very Good     G    VVS2   60.4   58.0    354  3.97  4.01  2.41
28   0.23  Very Good     D     VS2   60.5   61.0    357  3.96  3.97  2.40

mask()

Filtering rows with logical criteria is done with mask(), which accepts boolean arrays "masking out" False labeled rows and keeping True labeled rows. These are best created with logical statements on symbolic Series objects as shown below. Multiple criteria can be supplied as arguments and their intersection will be used as the mask.

diamonds >> mask(X.cut == 'Ideal') >> head(4)

    carat    cut color clarity  depth  table  price     x     y     z
0    0.23  Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
11   0.23  Ideal     J     VS1   62.8   56.0    340  3.93  3.90  2.46
13   0.31  Ideal     J     SI2   62.2   54.0    344  4.35  4.37  2.71
16   0.30  Ideal     I     SI2   62.0   54.0    348  4.31  4.34  2.68

diamonds >> mask(X.cut == 'Ideal', X.color == 'E', X.table < 55, X.price < 500)

       carat    cut color clarity  depth  table  price     x     y     z
26683   0.33  Ideal     E     SI2   62.2   54.0    427  4.44  4.46  2.77
32297   0.34  Ideal     E     SI2   62.4   54.0    454  4.49  4.52  2.81
40928   0.30  Ideal     E     SI1   61.6   54.0    499  4.32  4.35  2.67
50623   0.30  Ideal     E     SI2   62.1   54.0    401  4.32  4.35  2.69
50625   0.30  Ideal     E     SI2   62.0   54.0    401  4.33  4.35  2.69

Alternatively, mask() can also be called using the alias filter_by():

diamonds >> filter_by(X.cut == 'Ideal', X.color == 'E', X.table < 55, X.price < 500)

       carat    cut color clarity  depth  table  price     x     y     z
26683   0.33  Ideal     E     SI2   62.2   54.0    427  4.44  4.46  2.77
32297   0.34  Ideal     E     SI2   62.4   54.0    454  4.49  4.52  2.81
40928   0.30  Ideal     E     SI1   61.6   54.0    499  4.32  4.35  2.67
50623   0.30  Ideal     E     SI2   62.1   54.0    401  4.32  4.35  2.69
50625   0.30  Ideal     E     SI2   62.0   54.0    401  4.33  4.35  2.69

pull()

pull simply retrieves a column and returns it as a pandas series, in case you only care about one particular column at the end of your pipeline. Columns can be specified either by their name (string) or an integer.

The default returns the last column (on the assumption that's the column you've created most recently).

Example:

(diamonds
 >> filter_by(X.cut == 'Ideal', X.color == 'E', X.table < 55, X.price < 500)
 >> pull('carat'))

 26683    0.33
 32297    0.34
 40928    0.30
 50623    0.30
 50625    0.30
 Name: carat, dtype: float64

DataFrame transformation

mutate()

New variables can be created with the mutate() function (named that way to match dplyr).

diamonds >> mutate(x_plus_y=X.x + X.y) >> select(columns_from('x')) >> head(3)

      x     y     z  x_plus_y
0  3.95  3.98  2.43      7.93
1  3.89  3.84  2.31      7.73
2  4.05  4.07  2.31      8.12

Multiple variables can be created in a single call.

diamonds >> mutate(x_plus_y=X.x + X.y, y_div_z=(X.y / X.z)) >> select(columns_from('x')) >> head(3)

      x     y     z  x_plus_y   y_div_z
0  3.95  3.98  2.43      7.93  1.637860
1  3.89  3.84  2.31      7.73  1.662338
2  4.05  4.07  2.31      8.12  1.761905

Note: In Python the new variables created with mutate may not be guaranteed to be created in the same order that they are input into the function call, though this may have been changed in Python 3...

transmute()

The transmute() function is a combination of a mutate and a selection of the created variables.

diamonds >> transmute(x_plus_y=X.x + X.y, y_div_z=(X.y / X.z)) >> head(3)

   x_plus_y   y_div_z
0      7.93  1.637860
1      7.73  1.662338
2      8.12  1.761905

Grouping

group_by() and ungroup()

DataFrames are grouped along variables using the group_by() function and ungrouped with the ungroup() function. Functions chained after grouping a DataFrame are applied by group until returning or ungrouping. Hierarchical/multiindexing is automatically removed.

Note: In the example below, the lead() and lag() functions are dfply convenience wrappers around the pandas .shift() Series method.

(diamonds >> group_by(X.cut) >>
 mutate(price_lead=lead(X.price), price_lag=lag(X.price)) >>
 head(2) >> select(X.cut, X.price, X.price_lead, X.price_lag))

          cut  price  price_lead  price_lag
8        Fair    337      2757.0        NaN
91       Fair   2757      2759.0      337.0
2        Good    327       335.0        NaN
4        Good    335       339.0      327.0
0       Ideal    326       340.0        NaN
11      Ideal    340       344.0      326.0
1     Premium    326       334.0        NaN
3     Premium    334       342.0      326.0
5   Very Good    336       336.0        NaN
6   Very Good    336       337.0      336.0

Reshaping

arrange()

Sorting is done by the arrange() function, which wraps around the pandas .sort_values() DataFrame method. Arguments and keyword arguments are passed through to that function.

diamonds >> arrange(X.table, ascending=False) >> head(5)

       carat   cut color clarity  depth  table  price     x     y     z
24932   2.01  Fair     F     SI1   58.6   95.0  13387  8.32  8.31  4.87
50773   0.81  Fair     F     SI2   68.8   79.0   2301  5.26  5.20  3.58
51342   0.79  Fair     G     SI1   65.3   76.0   2362  5.52  5.13  3.35
52860   0.50  Fair     E     VS2   79.0   73.0   2579  5.21  5.18  4.09
49375   0.70  Fair     H     VS1   62.0   73.0   2100  5.65  5.54  3.47


(diamonds >> group_by(X.cut) >> arrange(X.price) >>
 head(3) >> ungroup() >> mask(X.carat < 0.23))

    carat      cut color clarity  depth  table  price     x     y     z
8    0.22     Fair     E     VS2   65.1   61.0    337  3.87  3.78  2.49
1    0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
12   0.22  Premium     F     SI1   60.4   61.0    342  3.88  3.84  2.33

rename()

The rename() function will rename columns provided as values to what you set as the keys in the keyword arguments. You can indicate columns with symbols or with their labels.

diamonds >> rename(CUT=X.cut, COLOR='color') >> head(2)

   carat      CUT COLOR clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31

gather()

Transforming between "wide" and "long" format is a common pattern in data munging. The gather(key, value, *columns) function melts the specified columns in your DataFrame into two key:value columns.

diamonds >> gather('variable', 'value', ['price', 'depth','x','y','z']) >> head(5)

   carat      cut color clarity  table variable  value
0   0.23    Ideal     E     SI2   55.0    price  326.0
1   0.21  Premium     E     SI1   61.0    price  326.0
2   0.23     Good     E     VS1   65.0    price  327.0
3   0.29  Premium     I     VS2   58.0    price  334.0
4   0.31     Good     J     SI2   58.0    price  335.0

Without any columns specified, your entire DataFrame will be transformed into two key:value pair columns.

diamonds >> gather('variable', 'value') >> head(5)

  variable value
0    carat  0.23
1    carat  0.21
2    carat  0.23
3    carat  0.29
4    carat  0.31

If the add_id keyword argument is set to true, an id column is added to the new elongated DataFrame that acts as a row id from the original wide DataFrame.

elongated = diamonds >> gather('variable', 'value', add_id=True)
elongated >> head(5)

   _ID variable value
0    0    carat  0.23
1    1    carat  0.21
2    2    carat  0.23
3    3    carat  0.29
4    4    carat  0.31

spread()

Likewise, you can transform a "long" DataFrame into a "wide" format with the spread(key, values) function. Converting the previously created elongated DataFrame for example would be done like so.

widened = elongated >> spread(X.variable, X.value)
widened >> head(5)

    _ID carat clarity color        cut depth price table     x     y     z
0     0  0.23     SI2     E      Ideal  61.5   326    55  3.95  3.98  2.43
1     1  0.21     SI1     E    Premium  59.8   326    61  3.89  3.84  2.31
2    10   0.3     SI1     J       Good    64   339    55  4.25  4.28  2.73
3   100  0.75     SI1     D  Very Good  63.2  2760    56   5.8  5.75  3.65
4  1000  0.75     SI1     D      Ideal  62.3  2898    55  5.83   5.8  3.62

In this case the _ID column comes in handy since it is necessary to not have any duplicated identifiers.

If you have a mixed datatype column in your long-format DataFrame then the default behavior is for the spread columns to be of type object.

widened.dtypes

_ID         int64
carat      object
clarity    object
color      object
cut        object
depth      object
price      object
table      object
x          object
y          object
z          object
dtype: object

If you want to try to convert dtypes when spreading, you can set the convert keyword argument in spread to True like so.

widened = elongated >> spread(X.variable, X.value, convert=True)
widened.dtypes

_ID          int64
carat      float64
clarity     object
color       object
cut         object
depth      float64
price        int64
table      float64
x          float64
y          float64
z          float64
dtype: object

separate()

Columns can be split into multiple columns with the separate(column, into, sep="[\W_]+", remove=True, convert=False, extra='drop', fill='right') function. separate() takes a variety of arguments:

  • column: the column to split.
  • into: the names of the new columns.
  • sep: either a regex string or integer positions to split the column on.
  • remove: boolean indicating whether to remove the original column.
  • convert: boolean indicating whether the new columns should be converted to the appropriate type (same as in spread above).
  • extra: either drop, where split pieces beyond the specified new columns are dropped, or merge, where the final split piece contains the remainder of the original column.
  • fill: either right, where np.nan values are filled in the right-most columns for missing pieces, or left where np.nan values are filled in the left-most columns.
print d

         a
0    1-a-3
1      1-b
2  1-c-3-4
3    9-d-1
4       10

d >> separate(X.a, ['col1', 'col2'], remove=True, convert=True,
              extra='drop', fill='right')

   col1 col2
0     1    a
1     1    b
2     1    c
3     9    d
4    10  NaN

d >> separate(X.a, ['col1', 'col2'], remove=True, convert=True,
              extra='drop', fill='left')

   col1 col2
0   1.0    a
1   1.0    b
2   1.0    c
3   9.0    d
4   NaN   10

d >> separate(X.a, ['col1', 'col2'], remove=False, convert=True,
              extra='merge', fill='right')

         a  col1   col2
0    1-a-3     1    a-3
1      1-b     1      b
2  1-c-3-4     1  c-3-4
3    9-d-1     9    d-1
4       10    10    NaN

d >> separate(X.a, ['col1', 'col2', 'col3'], sep=[2,4], remove=True, convert=True,
              extra='merge', fill='right')

  col1 col2 col3
0   1-   a-    3
1   1-    b  NaN
2   1-   c-  3-4
3   9-   d-    1
4   10  NaN  NaN

unite()

The unite(colname, *args, sep='_', remove=True, na_action='maintain') function does the inverse of separate(), joining columns together by a separator. Any columns that are not strings will be converted to strings. The arguments for unite() are:

  • colname: the name of the new joined column.
  • *args: list of columns to be joined, which can be strings, symbolic, or integer positions.
  • sep: the string separator to join the columns with.
  • remove: boolean indicating whether or not to remove the original columns.
  • na_action: can be one of "maintain" (the default), "ignore", or "as_string". The default "maintain" will make the new column row a NaN value if any of the original column cells at that row contained NaN. "ignore" will treat any NaN value as an empty string during joining. "as_string" will convert any NaN value to the string "nan" prior to joining.
print d

a  b      c
0  1  a   True
1  2  b  False
2  3  c    NaN

d >> unite('united', X.a, 'b', 2, remove=False, na_action='maintain')

   a  b      c     united
0  1  a   True   1_a_True
1  2  b  False  2_b_False
2  3  c    NaN        NaN

d >> unite('united', ['a','b','c'], remove=True, na_action='ignore', sep='*')

      united
0   1*a*True
1  2*b*False
2        3*c

d >> unite('united', d.columns, remove=True, na_action='as_string')

      united
0   1_a_True
1  2_b_False
2    3_c_nan

Joining

Currently implemented joins are:

  1. inner_join(other, by='column')
  • outer_join(other, by='column') (which works the same as full_join())
  • right_join(other, by='column')
  • left_join(other, by='column')
  • semi_join(other, by='column')
  • anti_join(other, by='column')

The functionality of the join functions are outlined with the toy example DataFrames below.

a = pd.DataFrame({
        'x1':['A','B','C'],
        'x2':[1,2,3]
    })
b = pd.DataFrame({
    'x1':['A','B','D'],
    'x3':[True,False,True]
})

inner_join()

inner_join() joins on values present in both DataFrames' by columns.

a >> inner_join(b, by='x1')

  x1  x2     x3
0  A   1   True
1  B   2  False

outer_join() or full_join()

outer_join merges DataFrame's together on values present in either frame's by columns.

a >> outer_join(b, by='x1')

  x1   x2     x3
0  A  1.0   True
1  B  2.0  False
2  C  3.0    NaN
3  D  NaN   True

left_join()

left_join merges on the values present in the left DataFrame's by columns.

a >> left_join(b, by='x1')

  x1  x2     x3
0  A   1   True
1  B   2  False
2  C   3    NaN

right_join()

right_join merges on the values present in the right DataFrame's by columns.

a >> right_join(b, by='x1')

  x1   x2     x3
0  A  1.0   True
1  B  2.0  False
2  D  NaN   True

semi_join()

semi_join() returns all of the rows in the left DataFrame that have a match in the right DataFrame in the by columns.

a >> semi_join(b, by='x1')

  x1  x2
0  A   1
1  B   2

anti_join()

anti_join() returns all of the rows in the left DataFrame that do not have a match in the right DataFrame within the by columns.

a >> anti_join(b, by='x1')

  x1  x2
2  C   3

Set operations

The set operation functions filter a DataFrame based on row comparisons with another DataFrame.

Each of the set operation functions union(), intersect(), and set_diff() take the same arguments:

  • other: the DataFrame to compare to
  • index: a boolean (default False) indicating whether to consider the pandas index during comparison.
  • keep: string (default "first") to be passed through to .drop_duplicates() controlling how to handle duplicate rows.

With set operations columns are expected to be in the same order in both DataFrames.

The function examples use the following two toy DataFrames.

a = pd.DataFrame({
        'x1':['A','B','C'],
        'x2':[1,2,3]
    })
c = pd.DataFrame({
      'x1':['B','C','D'],
      'x2':[2,3,4]
})

union()

The union() function returns rows that appear in either DataFrame.

a >> union(c)

  x1  x2
0  A   1
1  B   2
2  C   3
2  D   4

intersect()

intersect() returns rows that appear in both DataFrames.

a >> intersect(c)

  x1  x2
0  B   2
1  C   3

set_diff()

set_diff() returns the rows in the left DataFrame that do not appear in the right DataFrame.

a >> set_diff(c)

  x1  x2
0  A   1

Binding

dfply comes with convenience wrappers around pandas.concat() for joining DataFrames by rows or by columns.

The toy DataFrames below (a and b) are the same as the ones used to display the join functions above.

bind_rows()

The bind_rows(other, join='outer', ignore_index=False) function is an exact call to pandas.concat([df, other], join=join, ignore_index=ignore_index, axis=0), joining two DataFrames "vertically".

a >> bind_rows(b, join='inner')

x1
0  A
1  B
2  C
0  A
1  B
2  D

a >> bind_rows(b, join='outer')

  x1   x2     x3
0  A  1.0    NaN
1  B  2.0    NaN
2  C  3.0    NaN
0  A  NaN   True
1  B  NaN  False
2  D  NaN   True

Note that bind_rows() does not reset the index for you!

bind_cols()

The bind_cols(other, join='outer', ignore_index=False) is likewise just a call to pandas.concat([df, other], join=join, ignore_index=ignore_index, axis=1), joining DataFrames "horizontally".

a >> bind_cols(b)

  x1  x2 x1     x3
0  A   1  A   True
1  B   2  B  False
2  C   3  D   True

Note that you may well end up with duplicate column labels after binding columns as can be seen above.

Summarization

There are two summarization functions in dfply that match dplr: summarize and summarize_each (though these functions use the 'z' spelling rather than 's').

summarize()

summarize(**kwargs) takes an arbitrary number of keyword arguments that will return new columns labeled with the keys that are summary functions of columns in the original DataFrame.

diamonds >> summarize(price_mean=X.price.mean(), price_std=X.price.std())

    price_mean    price_std
0  3932.799722  3989.439738

summarize() can of course be used with groupings as well.

diamonds >> group_by('cut') >> summarize(price_mean=X.price.mean(), price_std=X.price.std())

         cut   price_mean    price_std
0       Fair  4358.757764  3560.386612
1       Good  3928.864452  3681.589584
2      Ideal  3457.541970  3808.401172
3    Premium  4584.257704  4349.204961
4  Very Good  3981.759891  3935.862161

summarize_each()

The summarize_each(function_list, *columns) is a more general summarization function. It takes a list of summary functions to apply as its first argument and then a list of columns to apply the summary functions to. Columns can be specified with either symbolic, string label, or integer position like in the selection functions for convenience.

diamonds >> summarize_each([np.mean, np.var], X.price, 'depth')

    price_mean     price_var  depth_mean  depth_var
0  3932.799722  1.591533e+07   61.749405   2.052366

summarize_each() works with groupings as well.

diamonds >> group_by(X.cut) >> summarize_each([np.mean, np.var], X.price, 4)

         cut   price_mean     price_var  depth_mean  depth_var
0       Fair  4358.757764  1.266848e+07   64.041677  13.266319
1       Good  3928.864452  1.355134e+07   62.365879   4.705224
2      Ideal  3457.541970  1.450325e+07   61.709401   0.516274
3    Premium  4584.257704  1.891421e+07   61.264673   1.342755
4  Very Good  3981.759891  1.548973e+07   61.818275   1.900466

Embedded column functions

UNDER CONSTRUCTION: documentation not complete.

Like dplyr, the dfply package provides functions to perform various operations on pandas Series. These are typically window functions and summarization functions, and wrap symbolic arguments in function calls.

Window functions

Window functions perform operations on vectors of values that return a vector of the same length.

lead() and lag()

The lead(series, n) function pushes values in a vector upward, adding NaN values in the end positions. Likewise, the lag(series, n) function pushes values downward, inserting NaN values in the initial positions. Both are calls to pandas Series.shift() function under the hood.

(diamonds >> mutate(price_lead=lead(X.price, 2), price_lag=lag(X.price, 2)) >>
            select(X.price, -2, -1) >>
            head(6))

    price  price_lag  price_lead
 0    326        NaN       327.0
 1    326        NaN       334.0
 2    327      326.0       335.0
 3    334      326.0       336.0
 4    335      327.0       336.0
 5    336      334.0       337.0

between()

The between(series, a, b, inclusive=False) function checks to see if values are between two given bookend values.

diamonds >> select(X.price) >> mutate(price_btwn=between(X.price, 330, 340)) >> head(6)

   price price_btwn
0    326      False
1    326      False
2    327      False
3    334       True
4    335       True
5    336       True

dense_rank()

The dense_rank(series, ascending=True) function is a wrapper around the scipy function for calculating dense rank.

diamonds >> select(X.price) >> mutate(price_drank=dense_rank(X.price)) >> head(6)

   price  price_drank
0    326          1.0
1    326          1.0
2    327          2.0
3    334          3.0
4    335          4.0
5    336          5.0

min_rank()

Likewise, min_rank(series, ascending=True) is a wrapper around the scipy ranking function with min rank specified.

diamonds >> select(X.price) >> mutate(price_mrank=min_rank(X.price)) >> head(6)

price  price_mrank
0    326          1.0
1    326          1.0
2    327          3.0
3    334          4.0
4    335          5.0
5    336          6.0

cumsum()

The cumsum(series) function calculates a cumulative sum of a column.

diamonds >> select(X.price) >> mutate(price_cumsum=cumsum(X.price)) >> head(6)

   price  price_cumsum
0    326           326
1    326           652
2    327           979
3    334          1313
4    335          1648
5    336          1984

cummean()

cummean(series)

diamonds >> select(X.price) >> mutate(price_cummean=cummean(X.price)) >> head(6)

   price  price_cummean
0    326     326.000000
1    326     326.000000
2    327     326.333333
3    334     328.250000
4    335     329.600000
5    336     330.666667

cummax()

cummax(series)

diamonds >> select(X.price) >> mutate(price_cummax=cummax(X.price)) >> head(6)

   price  price_cummax
0    326         326.0
1    326         326.0
2    327         327.0
3    334         334.0
4    335         335.0
5    336         336.0

cummin()

cummin(series)

diamonds >> select(X.price) >> mutate(price_cummin=cummin(X.price)) >> head(6)

   price  price_cummin
0    326         326.0
1    326         326.0
2    327         326.0
3    334         326.0
4    335         326.0
5    336         326.0

cumprod()

cumprod(series)

diamonds >> select(X.price) >> mutate(price_cumprod=cumprod(X.price)) >> head(6)

   price     price_cumprod
0    326               326
1    326            106276
2    327          34752252
3    334       11607252168
4    335     3888429476280
5    336  1306512304030080

Summary functions

mean()

mean(series)

diamonds >> groupby(X.cut) >> summarize(price_mean=mean(X.price))

         cut   price_mean
0       Fair  4358.757764
1       Good  3928.864452
2      Ideal  3457.541970
3    Premium  4584.257704
4  Very Good  3981.759891

first()

first(series, order_by=None)

diamonds >> groupby(X.cut) >> summarize(price_first=first(X.price))

         cut  price_first
0       Fair          337
1       Good          327
2      Ideal          326
3    Premium          326
4  Very Good          336

last()

last(series, order_by=None)

diamonds >> groupby(X.cut) >> summarize(price_last=last(X.price))

         cut  price_last
0       Fair        2747
1       Good        2757
2      Ideal        2757
3    Premium        2757
4  Very Good        2757

nth()

nth(series, n, order_by=None)

diamonds >> groupby(X.cut) >> summarize(price_penultimate=nth(X.price, -2))

         cut  price_penultimate
0       Fair               2745
1       Good               2756
2      Ideal               2757
3    Premium               2757
4  Very Good               2757

n()

n(series)

diamonds >> groupby(X.cut) >> summarize(price_n=n(X.price))

         cut  price_n
0       Fair     1610
1       Good     4906
2      Ideal    21551
3    Premium    13791
4  Very Good    12082

n_distinct()

n_distinct(series)

diamonds >> groupby(X.cut) >> summarize(price_ndistinct=n_distinct(X.price))

         cut  price_ndistinct
0       Fair             1267
1       Good             3086
2      Ideal             7281
3    Premium             6014
4  Very Good             5840

IQR()

IQR(series)

diamonds >> groupby(X.cut) >> summarize(price_iqr=IQR(X.price))

         cut  price_iqr
0       Fair    3155.25
1       Good    3883.00
2      Ideal    3800.50
3    Premium    5250.00
4  Very Good    4460.75

colmin()

colmin(series)

diamonds >> groupby(X.cut) >> summarize(price_min=colmin(X.price))

         cut  price_min
0       Fair        337
1       Good        327
2      Ideal        326
3    Premium        326
4  Very Good        336

colmax()

colmax(series)

diamonds >> groupby(X.cut) >> summarize(price_max=colmax(X.price))

         cut  price_max
0       Fair      18574
1       Good      18788
2      Ideal      18806
3    Premium      18823
4  Very Good      18818

median()

median(series)

diamonds >> groupby(X.cut) >> summarize(price_median=median(X.price))

         cut  price_median
0       Fair        3282.0
1       Good        3050.5
2      Ideal        1810.0
3    Premium        3185.0
4  Very Good        2648.0

var()

var(series)

diamonds >> groupby(X.cut) >> summarize(price_var=var(X.price))

         cut     price_var
0       Fair  1.267635e+07
1       Good  1.355410e+07
2      Ideal  1.450392e+07
3    Premium  1.891558e+07
4  Very Good  1.549101e+07

sd()

sd(series)

diamonds >> groupby(X.cut) >> summarize(price_sd=sd(X.price))

         cut     price_sd
0       Fair  3560.386612
1       Good  3681.589584
2      Ideal  3808.401172
3    Premium  4349.204961
4  Very Good  3935.862161

Extending dfply with custom functions

There are a lot of built-in functions, but you are almost certainly going to reach a point where you want to use some of your own functions with the dfply piping syntax. Luckily, dfply comes with two handy decorators to make this as easy as possible.

For a more detailed walkthrough of these two cases, see the basics-extending-functionality.ipynb jupyter notebook in the examples folder.

Case 1: A custom "pipe" function with @dfpipe

You might want to make a custom function that can be a piece of the pipe chain. For example, say we wanted to write a dfply wrapper around a simplified version of pd.crosstab. For the most part, you'll only need to do two things to make this work:

  1. Make sure that your function's first argument will be the dataframe passed in implicitly by the pipe.
  2. Decorate the function with the @dfpipe decorator.

Here is an example of the dfply-enabled crosstab function:

@dfpipe
def crosstab(df, index, columns):
    return pd.crosstab(index, columns)

Normally you could use pd.crosstab like so:

pd.crosstab(diamonds.cut, diamonds.color)

color         D     E     F     G     H     I    J
cut                                               
Fair        163   224   312   314   303   175  119
Good        662   933   909   871   702   522  307
Ideal      2834  3903  3826  4884  3115  2093  896
Premium    1603  2337  2331  2924  2360  1428  808
Very Good  1513  2400  2164  2299  1824  1204  678

The same result can be achieved now with the custom function in pipe syntax:

diamonds >> crosstab(X.cut, X.color)

color         D     E     F     G     H     I    J
cut                                               
Fair        163   224   312   314   303   175  119
Good        662   933   909   871   702   522  307
Ideal      2834  3903  3826  4884  3115  2093  896
Premium    1603  2337  2331  2924  2360  1428  808
Very Good  1513  2400  2164  2299  1824  1204  678

Case 2: A function that works with symbolic objects using @make_symbolic

Many tasks are simpler and do not require the capacity to work as a pipe function. The dfply window functions are the common examples of this: functions that take a Series (or symbolic Series) and return a modified version.

Let's say we had a dataframe with dates represented by strings that we wanted to convert to pandas datetime objects using the pd.to_datetime function. Below is a tiny example dataframe with this issue.

sales = pd.DataFrame(dict(date=['7/10/17','7/11/17','7/12/17','7/13/17','7/14/17'],
                          sales=[1220, 1592, 908, 1102, 1395]))

sales

      date  sales
0  7/10/17   1220
1  7/11/17   1592
2  7/12/17    908
3  7/13/17   1102
4  7/14/17   1395

Using the pd.to_datetime function inside of a call to mutate will unfortunately break:

sales >> mutate(pd_date=pd.to_datetime(X.date, infer_datetime_format=True))

...

TypeError: __index__ returned non-int (type Intention)

dfply functions are special in that they "know" to delay their evaluation until the data is at that point in the chain. pd.to_datetime is not such a function, and will immediately try to evaluate X.date. With a symbolic Intention argument passed in (X is an Intention object), the function will fail.

Instead, we will need to make a wrapper around pd.to_datetime that can handle these symbolic arguments and delay evaluation until the right time.

It's quite simple: all you need to do is decorate a function with the @make_symbolic decorator:

@make_symbolic
def to_datetime(series, infer_datetime_format=True):
    return pd.to_datetime(series, infer_datetime_format=infer_datetime_format)

Now the function can be used with symbolic arguments:

sales >> mutate(pd_date=to_datetime(X.date))

      date  sales    pd_date
0  7/10/17   1220 2017-07-10
1  7/11/17   1592 2017-07-11
2  7/12/17    908 2017-07-12
3  7/13/17   1102 2017-07-13
4  7/14/17   1395 2017-07-14

Without symbolic arguments, @make_symbolic functions work like normal functions!

A particularly nice thing about functions decorated with @make_symbolic is that they will operate normally if passed arguments that are not Intention symbolic objects.

For example, you can pass in the series itself and it will return the new series of converted dates:

to_datetime(sales.date)

0   2017-07-10
1   2017-07-11
2   2017-07-12
3   2017-07-13
4   2017-07-14
Name: date, dtype: datetime64[ns]

Advanced: understanding base dfply decorators

Under the hood, dfply functions work using a collection of different decorators and special classes. Below the most important ones are detailed. Understanding these are important if you are planning on making big additions or changes to the code.

The Intention class

Python is not a lazily-evaluated language. Typically, something like this would not work:

diamonds >> select(X.carat) >> head(2)

The X is supposed to represent the current state of the data through the piping operator chain, and X.carat indicates "select the carat column from the current data at this point in the chain". But Python will try to evaluate what X is, then what X.carat is, then what select(X.carat) is, all before the diamonds dataset ever gets evaluated.

The solution to this is to delay the evaluation until the appropriate time. I will not get into the granular details here (but feel free to check it out for yourself in base.py). The gist is that things to be delayed are represented by a special Intention class that "waits" until it is time to evaluate the stored commands with a given dataframe. This is the core of how dplyr data manipulation syntax is made possible in dfply.

(Thanks to the creators of the dplython and pandas-ply for trailblazing a lot of this before I made this package.)

@pipe

The primary decorator that enables chaining functions with the >> operator is @pipe. For functions to work with the piping syntax they must be decorated with @pipe.

Any function decorated with @pipe implicitly receives a single first argument expected to be a pandas DataFrame. This is the DataFrame being passed through the pipe. For example, mutate and select have function specifications mutate(df, **kwargs) and select(df, *args, **kwargs), but when used do not require the user to insert the DataFrame as an argument.

# the DataFrame is implicitly passed as the first argument
diamonds >> mutate(new_var=X.price + X.depth) >> select(X.new_var)

If you create a new function decorated by @pipe, the function definition should contain an initial argument that represents the DataFrame being passed through the piping operations.

@pipe
def myfunc(df, *args, **kwargs):
  # code

@group_delegation

In order to delegate a function across specified groupings (assigned by the group_by() function), decorate the function with the @group_delegation decorator. This decorator will query the DataFrame for assigned groupings and apply the function to those groups individually.

Groupings are assigned by dfply as an attribute ._grouped_by to the DataFrame proceeding through the piped functions. @group_delegation checks for the attribute and applies the function by group if groups exist. Any hierarchical indexing is removed by the decorator as well.

Decoration by @group_delegation should come after (internal) to the @pipe decorator to function as intended.

@pipe
@group_delegation
def myfunc(df, *args, **kwargs):
  # code

@symbolic_evaluation

Evaluation of any Intention-class symbolic object (such as X) is handled by the @symbolic_evaluation function. For example, when calling mutate(new_price = X.price * 2.5) the X.price symbolic representation of the price column in the DataFrame will be evaluated to the actual Series by this decorator.

The @symbolic_evaluation decorator can have functionality modified by optional keyword arguments:

Controlling @symbolic_evaluation with the eval_symbols argument

@symbolic_evaluation(eval_symbols=False)
def my_function(df, arg1, arg2):
    ...

If the eval_symbols argument is True, all symbolics will be evaluated with the passed-in dataframe. If False or None, there will be no attempt to evaluate symbolics.

A list can also be passed in. The list can contain a mix of positional integers and string keywords, which reference positional arguments and keyworded arguments respectively. This targets which arguments or keyword arguments to try and evaluate specifically:

# This indicates that arg1, arg2, and kw1 should be targeted for symbolic
# evaluation, but not the other arguments.
# Note that positional indexes reference arguments AFTER the passed-in dataframe.
# For example, 0 refers to arg1, not df.
@symbolic_evaluation(eval_symbols=[0,1,'kw1'])
def my_function(df, arg1, arg2, arg3, kw1=True, kw2=False):
    ...

In reality, you are unlikely to need this behavior unless you really want to prevent dfply from trying to evaluate symbolic arguments. Remember that if an argument is not symbolic it will be evaluated as normal, so there shouldn't be much harm leaving it at default other than a little bit of computational overhead.

@dfpipe

Most new or custom functions for dfply will be decorated with the pattern:

@pipe
@group_delegation
@symbolic_evaluation
def myfunc(df, *args, **kwargs):
  # code

Because of this, the decorator @dfpipe is defined as exactly this combination of decorators for your convenience. The above decoration pattern for the function can be simply written as:

@dfpipe
def myfunc(df, *args, **kwargs):
  # code

This allows you to easily create new functions that can be chained together with pipes, respect grouping, and evaluate symbolic DataFrames and Series correctly.

@make_symbolic

Sometimes, like in the window and summary functions that operate on series, it is necessary to defer the evaluation of a function. For example, in the code below:

diamonds >> summarize(price_third=nth(X.price, 3))

The nth() function would typically be evaluated before summarize() and the symbolic argument would not be evaluated at the right time.

The @make_symbolic decorator can be placed above functions to convert them into symbolic functions that will wait to evaluate. Again, this is used primarily for functions that are embedded inside the function call within the piping syntax.

The nth() code, for example, is below:

@make_symbolic
def nth(series, n, order_by=None):
    if order_by is not None:
        series = order_series_by(series, order_by)
    try:
        return series.iloc[n]
    except:
        return np.nan

Functions you write that you want to be able to embed as an argument can use the @make_symbolic to wait until they have access to the DataFrame to evaluate.

Contributing

By all means please feel free to comment or contribute to the package. The more people adding code the better. If you submit an issue, pull request, or ask for something to be added I will do my best to respond promptly.

The TODO list (now located in the "Projects" section of the repo) has an ongoing list of things that still need to be resolved and features to be added.

If you submit a pull request with features or bugfixes, please target the "develop" branch rather than the "master" branch.