ryp: R inside Python

ryp is a minimalist, powerful Python library for:

- running R code inside Python
- quickly transferring huge datasets between Python (NumPy/pandas/polars) and R without writing to disk
- interactively working in both languages at the same time

ryp is an alternative to the widely used rpy2 library. Compared to rpy2, ryp provides:

- increased stability
- a much simpler API, with less of a learning curve
- interactive printouts of R variables that match what you'd see in R
- a full-featured R terminal inside Python for interactive work
- inline plotting in Jupyter notebooks (requires the svglite R package)
- much faster data conversion with Arrow (also provided by rpy2-arrow)
- support for every NumPy, pandas and polars data type representable in base R, no matter how obscure
- support for sparse arrays/matrices
- recursive conversion of containers like R lists, Python tuples/lists/dicts, and S3/S4/R6 objects
- full Windows support

ryp does the opposite of the reticulate R library, which runs Python inside R.

Installation
Functionality
Conversion rules
- Python to R (to_r())
- R to Python (to_py())
  - Data types
Examples

Installation

Install ryp via pip:

pip install ryp

conda:

conda install ryp

or mamba:

mamba install ryp

Or, install the development version via pip:

pip install git+https://github.com/Wainberg/ryp

ryp's only mandatory dependencies are:

- Python 3.7+
- R
- the cffi Python package
- the pyarrow Python package, which includes NumPy as a dependency
- the arrow R library

R and the arrow R library are automatically installed when installing ryp via conda or mamba, but not via pip. ryp uses the R installation pointed to by the environment variable R_HOME or, if R_HOME is not defined or is not a directory, the one reported by running R RHOME through subprocess.run().
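The lookup order for the R installation can be sketched in a few lines. This is a minimal illustration rather than ryp's actual code: the function name `find_r_home` and its `environ` parameter are hypothetical, and the fallback assumes that `R` is on the PATH.

```python
import os
import subprocess


def find_r_home(environ=os.environ):
    """Return the R home directory, preferring the R_HOME environment variable.

    Hypothetical helper for illustration; not part of ryp's API.
    """
    r_home = environ.get('R_HOME')
    # Use R_HOME only if it is defined and points to an existing directory
    if r_home is not None and os.path.isdir(r_home):
        return r_home
    # Otherwise ask R itself: `R RHOME` prints the R home directory
    result = subprocess.run(['R', 'RHOME'], capture_output=True, text=True,
                            check=True)
    return result.stdout.strip()
```

Setting R_HOME before importing ryp is therefore one way to choose between several R installations on the same machine.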

ryp also has several optional dependencies, which are not installed automatically with pip, conda or mamba. These are:

- pandas, for format='pandas'
- polars, for format='polars'
- SciPy and the Matrix R library, for sparse matrices
- the svglite R library, for inline plotting in Jupyter notebooks

Functionality

ryp consists of just three core functions:

- r(R_code) runs a string of R code. r() with no arguments opens up an R terminal inside your Python terminal for interactive work.
- to_r(python_object, R_variable_name) converts a Python object into an R object named R_variable_name.
- to_py(R_statement) converts the R object produced by evaluating R_statement to Python. R_statement may be a single variable name, or a more complex code snippet that evaluates to the R object you'd like to convert.

and two more functions, get_config() and set_config(), for getting and setting ryp's global configuration options.

`r()`

r(R_code: str = ...) -> None

r(R_code) runs a string of R code inside ryp's R interpreter, which is embedded inside Python. It can contain multiple statements separated by semicolons or newlines (e.g. within a triple-quoted Python string). It returns None; use to_py() instead if you would like to convert the result back to Python.

r() with no arguments opens up an R terminal inside your Python terminal for interactive debugging. Press Ctrl + D to exit back to the Python terminal. R variables defined from Python will be available in the R terminal, and variables defined in the R terminal will be available from Python once you exit:

>>> from ryp import r
>>> r('a = 1')
>>> r()
> a
[1] 1
> b <- 2
>
>>> r('b')
[1] 2

Note that the default value for R_code is the special sentinel value ... (Ellipsis) rather than None. This stops users from inadvertently opening the terminal when passing a variable that is supposed to be a string but is unexpectedly None.
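The sentinel pattern itself is plain Python and can be illustrated without R. In the sketch below, `run` is a hypothetical stand-in for a function with an Ellipsis default. Raising a TypeError for None is an assumption about how such a function might reasonably behave, not a statement about ryp's exact error handling.

```python
def run(code=...):
    """Hypothetical stand-in for a function with an Ellipsis default."""
    if code is ...:
        # Argument omitted entirely: this is where a terminal would open
        return 'terminal'
    if not isinstance(code, str):
        # None (or any other non-string) is rejected rather than silently
        # treated as "no argument"
        raise TypeError(f'code must be a string, got {type(code).__name__}')
    return f'running: {code}'
```

With a None default, `run(maybe_code)` would open the terminal whenever `maybe_code` happened to be None. With the Ellipsis default, that call fails loudly instead.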

`to_r()`

to_r(python_object: object, R_variable_name: str, *, 
     format: Literal['keep', 'matrix', 'data.frame'] | None = None,
     rownames: object = None, colnames: object = None) -> None

to_r(python_object, R_variable_name) converts python_object to R, adding it to R's global namespace (globalenv) as a variable named R_variable_name.

If python_object is a container (list, tuple, or dict), to_r() recursively converts each element, producing an R named list (if python_object is a dict) or unnamed list (if python_object is a list or tuple).

The `format` argument

By default (format='keep'), ryp converts polars and pandas DataFrames (and pandas MultiIndexes) into R data frames, and 2D NumPy arrays into R matrices. Specify format='matrix' to convert everything (even DataFrames) to R matrices (in which case all DataFrame columns must have the same type), and format='data.frame' to convert everything (even 2D NumPy arrays) to R data frames.

format must be None unless python_object is a DataFrame, MultiIndex or 2D NumPy array – or unless python_object is a list, tuple, or dict, in which case the format will apply recursively to any DataFrames, MultiIndexes or 2D NumPy arrays it contains.

The `rownames` and `colnames` arguments

Since NumPy arrays, polars DataFrames and Series, and scipy sparse arrays and matrices lack row and column names, you can specify these separately via the rownames and/or colnames arguments, and they will be added to the converted R object. rownames and colnames can be lists, tuples, string Series, or categorical Series with string categories, and will be automatically converted to R character vectors.

rownames and colnames must match the length or shape[1], respectively, of the object being converted. The one exception is that rownames of any length may be added to a 0 × 0 polars DataFrame, since polars does not have the concept of an N × 0 DataFrame for nonzero N. (Dropping all the columns of a polars DataFrame always results in a 0 × 0 DataFrame, even if the original DataFrame had more than 0 rows.)

Because Python bool, int, float, and str convert to length-1 R vectors that support names, you can pass length-1 rownames when converting objects of these types. You can also pass rownames and/or colnames when python_object is a list, tuple, or dict, in which case row and column names will only be added to elements that support them. All elements that support rownames must have the same length as the rownames, and similarly for the colnames.

rownames cannot be specified if python_object is a pandas Series or DataFrame (since they already have rownames, i.e. an index), or bytes/bytearray (since these convert to raw vectors, which lack rownames). colnames cannot be specified unless python_object is a multidimensional NumPy array or scipy sparse array or matrix, or something that might contain one (list, tuple, or dict).
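The basic length rule for rownames and colnames can be written as a small check. This is a sketch only: `check_names` is a hypothetical helper, `shape` mimics NumPy's `.shape` tuple, and the special case for 0 × 0 polars DataFrames is deliberately left out.

```python
def check_names(shape, rownames=None, colnames=None):
    """Validate row/column names against an object's shape.

    Hypothetical helper for illustration; not part of ryp's API.
    """
    # rownames must have one entry per row (the object's length)
    if rownames is not None and len(rownames) != shape[0]:
        raise ValueError(f'expected {shape[0]} rownames, got {len(rownames)}')
    if colnames is not None:
        # colnames only make sense for objects with a second dimension
        if len(shape) < 2:
            raise ValueError('colnames require a multidimensional object')
        if len(colnames) != shape[1]:
            raise ValueError(
                f'expected {shape[1]} colnames, got {len(colnames)}')
    return True
```

For a 2 × 2 array, `check_names((2, 2), ['row1', 'row2'], ['col1', 'col2'])` passes, while a single rowname raises a ValueError.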

`to_py()`

to_py(R_statement: str, *,
      format: Literal['polars', 'pandas', 'pandas-pyarrow', 'numpy'] |
              dict[Literal['vector', 'matrix', 'data.frame'],
                   Literal['polars', 'pandas', 'pandas-pyarrow',
                           'numpy']] | None = None,
      index: str | Literal[False] | None = None,
      squeeze: bool | None = None) -> Any

to_py(R_statement) runs a single statement of R code (which can be as simple as a single variable name) and converts the resulting R object to Python.

If the object is a list/S3 object, S4 object, or environment/R6 object, it recursively converts each attribute/slot/field and returns a Python dict (or list, if the object is an unnamed list). For R6 objects, only public fields will be converted.

The `format` argument

By default, or when format='polars', R vectors will be converted to polars Series, and R data frames and matrices will be converted to polars DataFrames. You can change this by setting the format argument to 'pandas', 'pandas-pyarrow' (like 'pandas', but converting to pyarrow dtypes wherever possible) or 'numpy'. (You can also change the default format, e.g. with set_config(to_py_format='pandas').)

For finer-grained control, you can set format for only certain R variable types by specifying a dictionary with 'vector', 'matrix', and/or 'data.frame' as keys and 'polars', 'pandas', 'pandas-pyarrow' and/or 'numpy' as values.

format must be None when R_statement evaluates to NULL, when it evaluates to an array of 3 or more dimensions (these are always converted to NumPy arrays), or when the final result would be a Python scalar (see squeeze below).

The `index` argument

By default, the R object's names or rownames will become the index (for pandas) or the first column (for polars) of the output Python object, named 'index'. Set the index argument to a different string to change this name, or set index=False to not convert the names/rownames.

Note that for polars, the output will be a two-column DataFrame (not a Series!) when the input is an R vector, unless index=False.

When the output is a NumPy array, names and rownames will always be discarded, since numeric NumPy arrays cannot store string indexes except with the inefficient dtype=object.

index must be None when format='numpy', or when the final result would be a Python scalar (see squeeze below).

The `squeeze` argument

By default, length-1 R vectors, matrices and arrays will be converted to Python scalars instead of Python arrays, Series or DataFrames. Set squeeze=False to disable this special case. (R data frames are never converted to Python scalars even if squeeze=True.)

squeeze must be None unless the R object is a vector, matrix or array (raw vectors don't count, because they always convert to Python scalars).

`set_config()`

set_config(*, to_r_format=None, to_py_format=None, index=None, squeeze=None, 
           plot_width: int | float | None = None, 
           plot_height: int | float | None = None) -> None

set_config sets ryp's configuration settings. Arguments that are None are left unchanged.

For instance, to set pandas as the default format in to_py(), run set_config(to_py_format='pandas').

- to_r_format: the default value for the format parameter in to_r(); must be 'keep' (the default), 'matrix', or 'data.frame'.
- to_py_format: the default value for the format parameter in to_py(); must be 'polars' (the default), 'pandas', 'pandas-pyarrow', 'numpy', or a dictionary with one of those four Python formats and/or None as values and 'vector', 'matrix' and/or 'data.frame' as keys. Keys that are missing or have None as the format keep their current format.
- index: the default value for the index parameter in to_py(); must be a string (default: 'index') or False.
- squeeze: the default value for the squeeze parameter in to_py(); must be True (the default) or False.
- plot_width: the width, in inches, of inline plots in Jupyter notebooks; must be a positive number. Defaults to 6.4 inches, to match Matplotlib's default.
- plot_height: the height, in inches, of inline plots in Jupyter notebooks; must be a positive number. Defaults to 4.8 inches, to match Matplotlib's default.
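The rule for a dictionary-valued to_py_format (missing keys and None values leave the existing setting alone) amounts to a selective merge. The sketch below shows that logic in plain Python. `merge_to_py_format` and the two constants are hypothetical names, and the assumption that a single format string applies to all three R types is inferred from the description of to_py()'s format argument.

```python
VALID_KEYS = ('vector', 'matrix', 'data.frame')
VALID_FORMATS = ('polars', 'pandas', 'pandas-pyarrow', 'numpy')


def merge_to_py_format(current, new):
    """Return the updated per-type format mapping (hypothetical helper).

    `current` maps each of VALID_KEYS to a format; `new` is a format string,
    a partial dict, or None.
    """
    if new is None:
        return dict(current)  # nothing to change
    if isinstance(new, str):
        if new not in VALID_FORMATS:
            raise ValueError(f'invalid format {new!r}')
        return {key: new for key in VALID_KEYS}  # applies to every R type
    merged = dict(current)
    for key, fmt in new.items():
        if key not in VALID_KEYS:
            raise ValueError(f'invalid key {key!r}')
        if fmt is None:
            continue  # None leaves this key's format unchanged
        if fmt not in VALID_FORMATS:
            raise ValueError(f'invalid format {fmt!r}')
        merged[key] = fmt
    return merged


defaults = {key: 'polars' for key in VALID_KEYS}
updated = merge_to_py_format(defaults, {'matrix': 'numpy', 'vector': None})
print(updated)
# {'vector': 'polars', 'matrix': 'numpy', 'data.frame': 'polars'}
```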

For additional customization, users can specify ryp-specific settings in their .Rprofile:

if ("ryp" %in% commandArgs()) {
    # Custom settings for running R within ryp
} else {
    # Custom settings for native R
}

`get_config()`

get_config() -> dict[str, dict[str, str] | str | bool | int]

get_config returns the current configuration options as a dictionary, with keys to_r_format, to_py_format, index, squeeze, plot_width, and plot_height.

Conversion rules

Python to R (`to_r()`)

Python	R
`None`	`NULL` (if scalar) or `NA` (if inside NumPy, pandas or polars)
`nan`	`NaN` (if scalar or inside polars) or `NA` (if inside NumPy or pandas)
`pd.NA`	`NA`
`pd.NaT`, `np.datetime64('NaT')`, `np.timedelta64('NaT')`	`NA`
`bool`	length-1 `logical` vector
`int`	length-1 `integer` (if `abs(x) <= 2_147_483_647`) or `bit64::integer64` vector
`float`	length-1 `numeric` vector
`str`	length-1 `character` vector
`complex`	length-1 `complex` vector
`datetime.date`	length-1 `Date` vector
`datetime.datetime`	length-1 `POSIXct` vector
`datetime.timedelta`	length-1 `difftime(units='secs')` vector
`datetime.time` (`tzinfo` must be `None`)	length-1 `hms::hms` vector
`bytes`, `bytearray`	`raw` vector
`list`, `tuple`	unnamed list
`dict` (all keys must be strings)	named list
polars Series, pandas Series^*, pandas `Index`	vector
polars DataFrame, pandas DataFrame^*, pandas `MultiIndex`	matrix^† (if `format == 'matrix'`; all columns must have same data type) or `data.frame`
1D NumPy array	vector
2D NumPy array	`data.frame` (if `format == 'data.frame'`) or matrix^†
≥ 3D NumPy array	array^†
0D NumPy array (e.g. `np.array(1)`), NumPy generic (e.g. `np.int32(1)`)	length-1 vector
`csr_array`, `csr_matrix`	`dgRMatrix` (if int or float), `lgRMatrix` (if boolean), -- (if complex)
`csc_array`, `csc_matrix`	`dgCMatrix` (if int or float), `lgCMatrix` (if boolean), -- (if complex)
`coo_array`, `coo_matrix`	`dgTMatrix` (if int or float), `lgTMatrix` (if boolean), -- (if complex)
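
The 2_147_483_647 cutoff in the `int` row reflects the fact that R's integer type is 32-bit and reserves the most negative 32-bit value (-2_147_483_648) for NA, so the usable range is symmetric around zero. A small sketch of that rule, where `r_integer_type` is a hypothetical helper name rather than a ryp function:

```python
R_INT_MAX = 2_147_483_647  # 2**31 - 1; R reserves -2**31 for NA_integer_


def r_integer_type(x):
    """Name the R type a Python int would map to under the table's rule."""
    return 'integer' if abs(x) <= R_INT_MAX else 'bit64::integer64'


print(r_integer_type(2**31 - 1))  # integer
print(r_integer_type(-2**31))     # bit64::integer64
```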

NumPy data types

Python	R
`bool`	`logical`
`int8`, `uint8`, `int16`, `uint16`, `int32`	`integer`
`uint32`, `uint64`	`integer` (if `x <= 2_147_483_647`) or `numeric`
`int64`	`integer` (if `abs(x) <= 2_147_483_647`) or `bit64::integer64`
`float16`, `float32`, `float64`, `float128`	`numeric` (note: `float128` loses precision)
`complex64`, `complex128`	`complex`
`bytes` (e.g. `'S1'`)	--
`str`/`unicode` (e.g. `'U1'`)	`character`
`datetime64`	`POSIXct`
`timedelta64`	`difftime(units='secs')`
`void` (unstructured)	`raw`
`void` (structured)	--
`object`	depends on the contents^‡

pandas-specific data types

Python	R
`BooleanDtype`	`logical`
`Int8Dtype`, `UInt8Dtype`, `Int16Dtype`, `UInt16Dtype`, `Int32Dtype`	`integer`
`UInt32Dtype`, `UInt64Dtype`	`integer` (if `x <= 2_147_483_647`) or `numeric`
`Int64Dtype`	`integer` (if `abs(x) <= 2_147_483_647`) or `bit64::integer64`
`Float32Dtype`, `Float64Dtype`	`numeric`
`StringDtype`	`character`
`CategoricalDtype(ordered=False)`	unordered `factor`
`CategoricalDtype(ordered=True)`	ordered `factor`
`DatetimeTZDtype`, `PeriodDtype`	`POSIXct`
`IntervalDtype`, `SparseDtype`	--

pandas Arrow data types (`pd.ArrowDtype`)

Python	R
`pa.bool_`	`logical`
`pa.int8`, `pa.uint8`, `pa.int16`, `pa.uint16`, `pa.int32`	`integer`
`pa.uint32`, `pa.uint64`	`integer` (if `x <= 2_147_483_647`) or `numeric`
`pa.int64`	`integer` (if `abs(x) <= 2_147_483_647`) or `bit64::integer64`
`pa.float32`, `pa.float64`	`numeric`
`pa.string`, `pa.large_string`	`character`
`pa.date32`	`Date`
`pa.date64`, `pa.timestamp`	`POSIXct`
`pa.duration`	`difftime(units='secs')`
`pa.time32`, `pa.time64`	`hms::hms`
`pa.dictionary(any integer type, pa.string(), ordered=0)`	unordered `factor`
`pa.dictionary(any integer type, pa.string(), ordered=1)`	ordered `factor`
`pa.null()`	`vctrs::unspecified`

Polars data types

Python	R
`Boolean`	`logical`
`Int8`, `UInt8`, `Int16`, `UInt16`, `Int32`	`integer`
`UInt32`, `UInt64`	`integer` (if `x <= 2_147_483_647`) or `numeric`
`Int64`	`integer` (if `abs(x) <= 2_147_483_647`) or `bit64::integer64`
`Float32`, `Float64`	`numeric`
`Date`	`Date`
`Datetime`	`POSIXct`
`Duration`	`difftime(units='secs')`
`Time`	`hms::hms`
`String`	`character`
`Categorical`	unordered `factor`
`Enum`	ordered `factor`
`Object`	depends on the contents^‡
`Null`	`vctrs::unspecified`
`Binary`, `Decimal`, `List`, `Array`	--

Notes

^* For pandas Series and DataFrames, string indexes (and categorical indexes where the categories are strings) will be automatically converted to names/rownames. The default index (pd.RangeIndex(len(python_object))) will be ignored. All other indexes are disallowed.

^† Because R does not support POSIXct and Date matrices or arrays, dates and datetimes cannot be converted to R matrices or arrays.

^‡ For dtype=object and dtype=pl.Object, the output R type depends on the contents, e.g. 'character' if all elements are strings. Some additional notes on ryp's handling of object data types:

- None, np.nan, pd.NA, pd.NaT, np.datetime64('NaT'), and np.timedelta64('NaT') are all treated as missing values, even for polars, where np.nan is ordinarily treated as a floating-point number rather than a missing value.
- Length-0 and all-missing data will be converted to the vctrs::unspecified R type (vctrs is part of the tidyverse).
- If the elements are objects with a mix of types (or datetimes with a mix of time zones), Arrow will generally cause the conversion to fail, though mixes of related types (e.g. int and float) will be automatically cast to the common supertype and succeed.
- Conversion will also fail if the contents are objects that are not representable as R vector elements. This includes bytes/bytearray (which are only representable in R when scalar, as a raw vector) and Python containers (list, tuple, and dict).
- pandas Timedelta objects will be rounded down to the nearest microsecond, following the behavior of Arrow.

R to Python (`to_py()`)

R	Python
`NULL`	`None`
`NA`	`None` (if scalar or `format='polars'`), `None`/`nan`/`pd.NA`/`pd.NaT`/`np.datetime64('NaT', 'us')`/`np.timedelta64('NaT', 'ns')`/etc. (if `format='numpy'`, `'pandas'` or `'pandas-pyarrow'`)
`NaN`	`nan`
length-1 vector, matrix or array, `squeeze == True`	scalar
vector or 1D array, `format == 'numpy'`	1D NumPy array
vector or 1D array, `format == 'pandas'` or `format == 'pandas-pyarrow'`	pandas Series
vector or 1D array, `format == 'polars'`	polars Series (if `index=False`) or two-column DataFrame
matrix or `data.frame`, `format == 'numpy'`	2D NumPy array
matrix or `data.frame`, `format == 'pandas'` or `format == 'pandas-pyarrow'`	pandas DataFrame
matrix or `data.frame`, `format == 'polars'`	polars DataFrame
≥ 3D array	NumPy array
unnamed list	`list`
named list, S3 object, S4 object, environment, R6 object	`dict`
`dgRMatrix`	`csr_array(dtype='int32')`
`dgCMatrix`	`csc_array(dtype='int32')`
`dgTMatrix`	`coo_array(dtype='int32')`
`lgRMatrix`, `ngRMatrix`	`csr_array(dtype=bool)`
`lgCMatrix`, `ngCMatrix`	`csc_array(dtype=bool)`
`lgTMatrix`, `ngTMatrix`	`coo_array(dtype=bool)`
formula (`~`)	--

Data types

R	Python scalar	NumPy	pandas	pandas-pyarrow	polars
`logical`	`bool`	`bool`	`bool`	`ArrowDtype(pa.bool_())`	`Boolean`
`integer`	`int`	`int32`	`int32`	`ArrowDtype(pa.int32())`	`Int32`
`bit64::integer64`	`int`	`int64`	`int64`	`ArrowDtype(pa.int64())`	`Int64`
`numeric`	`float`	`float64`	`float64`	`ArrowDtype(pa.float64())`	`Float64`
`character`	`str`	`object` (with `str` elements)	`object` (with `str` elements)	`ArrowDtype(pa.string())`	`String`
`complex`	`complex`	`complex128`	`complex128`	`complex128`	--
`raw`	`bytearray`	--	--	--	--
unordered `factor`	`str`	`object` (with `str` elements)	`CategoricalDtype(ordered=False)`	`ArrowDtype(pa.dictionary(pa.int8(), pa.string(), ordered=0))`	`Categorical`
ordered `factor`	`str`	`object` (with `str` elements)	`CategoricalDtype(ordered=True)`	`ArrowDtype(pa.dictionary(pa.int8(), pa.string(), ordered=1))`	`Enum`
`POSIXct` without time zone	`datetime.datetime`^*	`datetime64[us]`^*	`datetime64[us]`^*	`ArrowDtype(pa.timestamp('us'))`^*	`Datetime('us')`^*
`POSIXct` with time zone	`datetime.datetime`^*	`datetime64[us]`^* (time zone discarded)	`DatetimeTZDtype('us', time_zone)`^*	`ArrowDtype(pa.timestamp('us', time_zone))`^*	`Datetime('us', time_zone)`^*
`POSIXlt`	`dict` of scalars	`dict` of NumPy arrays	`dict` of pandas Series	`dict` of pandas Series	`dict` of polars Series
`Date`	`datetime.date`	`datetime64[D]`	`datetime64[ms]`	`ArrowDtype(pa.date32('day'))`	`Date`
`difftime`	`datetime.timedelta`^*	`timedelta64[ns]`	`timedelta64[ns]`	`ArrowDtype(pa.duration('ns'))`	`Duration(time_unit='ns')`
`hms::hms`	`datetime.time`^*	`object` (with `datetime.time` elements)^*	`object` (with `datetime.time` elements)^*	`ArrowDtype(pa.time64('ns'))`^*	`Time`
`vctrs::unspecified`	`None`	`object` (with `None` elements)	`object` (with `None` elements)	`ArrowDtype(pa.null())`	`Null`

^* Due to the limitations of conversion with Arrow, POSIXct and hms::hms values are rounded down to the nearest microsecond when converting to Python, except for hms::hms when converting to polars. difftime values are also rounded down to the nearest microsecond, but only when converting to scalar datetime.timedelta values (which cannot represent nanoseconds).
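
To make the rounding concrete: datetime.timedelta has microsecond resolution, so a nanosecond-precision duration has to drop its last three digits when it becomes a Python scalar. The sketch below shows rounding down with floor division. `ns_to_timedelta` is a hypothetical helper, not part of ryp, and it takes an integer nanosecond count as an assumed input.

```python
import datetime


def ns_to_timedelta(nanoseconds):
    """Convert an integer nanosecond count to datetime.timedelta, rounding
    down to the nearest microsecond (timedelta has microsecond resolution)."""
    microseconds = nanoseconds // 1_000  # floor division rounds down
    return datetime.timedelta(microseconds=microseconds)


print(ns_to_timedelta(1_500_000_999))  # 0:00:01.500000
```

The trailing 999 nanoseconds are discarded rather than rounded up to the next microsecond.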

Examples

Apply R's scale() function to a pandas DataFrame:

import pandas as pd
from ryp import r, to_py, to_r
data = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 3, 4]})
to_r(data, 'data')
r('data')
#   a b
# 1 1 1
# 2 2 3
# 3 3 4

r('data <- scale(data)')  # scale each column; scale() returns an R matrix
scaled_data = to_py('data', format='pandas')  # convert the scaled R matrix to Python
scaled_data
#      a         b
# 0 -1.0 -1.091089
# 1  0.0  0.218218
# 2  1.0  0.872872

Note: we could have just written to_py('scale(data)') instead of r('data <- scale(data)') followed by to_py('data'). We could also have run set_config(to_py_format='pandas') at the top, to avoid having to specify format='pandas' in each to_py() call.
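
As a sanity check on the numbers in this example, the scaled values can be reproduced in pure Python. R's scale() by default centers each column to mean 0 and divides by the sample standard deviation. `scale_column` below is a hypothetical helper written only for this check, using the standard library's statistics module.

```python
import statistics


def scale_column(values):
    """Center to mean 0 and divide by the sample standard deviation,
    which is what R's scale() does by default for each column."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator), like R's sd()
    return [(value - mean) / sd for value in values]


print([round(z, 6) for z in scale_column([1, 3, 4])])
# [-1.091089, 0.218218, 0.872872]
```

These match column b of the scaled DataFrame. Column a ([1, 2, 3]) has mean 2 and standard deviation 1, giving -1, 0 and 1.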

Run a linear model on a polars DataFrame:

import polars as pl
from ryp import r, to_py, to_r
data = pl.DataFrame({'y': [7, 1, 2, 3, 6], 'x': [5, 2, 3, 2, 5]})
to_r(data, 'data')
r('model <- lm(y ~ x, data=data)')
coef = to_py('summary(model)$coefficients', index='variable')
p_value = coef.filter(variable='x').select('Pr(>|t|)')[0, 0]
p_value
# 0.02831035772841884

Recursive conversion, showcasing the keyword arguments of to_r() and the format and index arguments of to_py():

import numpy as np
from ryp import r, to_py, to_r
arrays = {'ints': np.array([[1, 2], [3, 4]]),
          'floats': np.array([[0.5, 1.5], [2.5, 3.5]])}
to_r(arrays, 'arrays', format='data.frame',
     rownames=['row1', 'row2'], colnames=['col1', 'col2'])
r('arrays')
# $ints
#      col1 col2
# row1    1    2
# row2    3    4
# 
# $floats
#      col1 col2
# row1  0.5  1.5
# row2  2.5  3.5
arrays = to_py('arrays', format='pandas', index='foo')
arrays['ints']
#       col1  col2
# foo
# row1     1     2
# row2     3     4
arrays['floats']
#       col1  col2
# foo
# row1   0.5   1.5
# row2   2.5   3.5
