py_to_r returns array in the wrong order when the array is a numpy.array with character type
Closed this issue · 3 comments
Dear reticulate team,
Thank you so much for writing this wonderful package! I ran into an array conversion inconsistency today.
In a Python script:
import pickle
import numpy as np
s1 = np.array([1,2,3])
s2 = np.array([3,4,5])
s = np.stack([s1,s2], axis = 1)
with open("s.pl", 'wb') as f:
pickle.dump(s, f)
c1 = np.array(["1","2","3"])
c2 = np.array(["3","4","5"])
c = np.stack([c1,c2], axis = 1)
with open("c.pl", 'wb') as f:
pickle.dump(c, f)
Then, in an R script in the same directory:
library(reticulate)
# I use python_path to set a python version in my local env,
# which I ignore here.
py <- import_builtins()
pickle <- import("pickle")
with(py$open("s.pl", 'rb') %as% f, {
s <- pickle$load(f)
})
with(py$open(".c.pl", 'rb') %as% f, {
c <- pickle$load(f)
})
Then, in the R console, if I print s and c:
> s
[,1] [,2]
[1,] 1 3
[2,] 2 4
[3,] 3 5
> c
[,1] [,2]
[1,] "1" "4"
[2,] "3" "3"
[3,] "2" "5"
s is in the right order, but c is not.
My R session is:
sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin22.4.0 (64-bit)
Running under: macOS Ventura 13.4.1
Matrix products: default
BLAS: /usr/local/Cellar/openblas/0.3.23/lib/libopenblasp-r0.3.23.dylib
LAPACK: /usr/local/Cellar/r/4.3.0_1/lib/R/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] reticulate_1.30
loaded via a namespace (and not attached):
[1] compiler_4.3.0 here_1.0.1 Matrix_1.5-4 rprojroot_2.0.3 cli_3.6.1 withr_2.5.0 Rcpp_1.0.10
[8] grid_4.3.0 jsonlite_1.8.7 rlang_1.1.1 png_0.1-8 lattice_0.21-8
Would you mind double-checking this, or is there something I need to pay specific attention to?
Thanks!
Songpeng
@beyondpie Thanks for reporting! I can reproduce.
@dfalbel Can you please take a look? It seems like our numpy conversion routines in C++ that detect string-type arrays need to be updated.
A more minimal reprex (without pickle):
library(reticulate)
py_run_string(
r"(
import numpy as np
s1 = np.array([1,2,3])
s2 = np.array([3,4,5])
s = np.stack([s1,s2], axis = 1)
c1 = np.array(["1","2","3"])
c2 = np.array(["3","4","5"])
c = np.stack([c1,c2], axis = 1)
)")
py$c
#> [,1] [,2]
#> [1,] "1" "4"
#> [2,] "3" "3"
#> [3,] "2" "5"
py$s
#> [,1] [,2]
#> [1,] 1 3
#> [2,] 2 4
#> [3,] 3 5
# flags for both c and s are identical
py_eval("c.flags")
#> C_CONTIGUOUS : True
#> F_CONTIGUOUS : False
#> OWNDATA : True
#> WRITEABLE : True
#> ALIGNED : True
#> WRITEBACKIFCOPY : False
py_eval("s.flags")
#> C_CONTIGUOUS : True
#> F_CONTIGUOUS : False
#> OWNDATA : True
#> WRITEABLE : True
#> ALIGNED : True
#> WRITEBACKIFCOPY : False
Created on 2023-07-11 with reprex v2.0.2
We wrongly assume that the item() method will return elements in the order the array is stored. I'm not sure if this is a bug in numpy or if it is expected, but this doesn't seem to be the recommended way to iterate over array elements in numpy.
(reticulate, lines 1128 to 1137 at commit 6d93f3f)
See for instance the array iteration docs, which recommend using the nditer() iterator instead to get an order corresponding to the memory layout: https://numpy.org/doc/stable/reference/arrays.nditer.html#arrays-nditer
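To make the ordering difference concrete, here is a small Python check (mine, not from the thread) using the same c array as the reprex above; the #> lines show the output I would expect from numpy:

import numpy as np

c1 = np.array(["1", "2", "3"])
c2 = np.array(["3", "4", "5"])
c = np.stack([c1, c2], axis=1)  # 3x2 character array, C-contiguous

# item(i) resolves a flat index in row-major (C) order, which is the
# order the elements end up being read in per the comment above:
[c.item(i) for i in range(c.size)]
#> ['1', '3', '2', '4', '3', '5']

# nditer can instead walk the elements in column-major (Fortran) order,
# which is the order an R matrix is filled in:
[str(x) for x in np.nditer(c, order="F")]
#> ['1', '2', '3', '3', '4', '5']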
We have a few options to fix this issue:
- Permute the dimensions of the array to match the Fortran ordering, then iterate. Looks a bit hacky and requires a full copy of the array (besides the R copy).
- Add strides to the indexing so that item() returns the correct element (see the sketch after this list). This can be quite complicated, especially for multi-dimensional arrays, and some unexpected numpy flags can cause breakage.
- Use numpy's proper iterators: https://numpy.org/doc/stable/reference/c-api/iterator.html. This requires adding support for them in reticulate's libpython.h, which is quite tricky, but it seems to allow for faster iteration and is a safer way to get the correct Fortran ordering.
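To give a sense of the bookkeeping the second option would involve, here is a sketch at the Python level using numpy's index helpers rather than raw byte strides (the real code would have to do the equivalent arithmetic in C++); fortran_index_to_item_index is a hypothetical helper, not reticulate code:

import numpy as np

def fortran_index_to_item_index(shape, k):
    # Interpret k as the k-th element in column-major (Fortran) order,
    # then convert that coordinate to the flat row-major index that
    # ndarray.item() expects.
    coord = np.unravel_index(k, shape, order="F")
    return int(np.ravel_multi_index(coord, shape, order="C"))

c = np.stack([np.array(["1", "2", "3"]), np.array(["3", "4", "5"])], axis=1)
[c.item(fortran_index_to_item_index(c.shape, k)) for k in range(c.size)]
#> ['1', '2', '3', '3', '4', '5']  # column-major order, as R expects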
I'm leaning towards the 3rd approach. Adding support for numpy iterators could even allow us to improve casting for other data types in the future. @t-kalinowski what do you think?
Thanks @dfalbel for investigating and chatting just now!
We decided to take the approach of calling into Python to create an np.nditer() from C, and to use the standard Python C API (PyIter_Next) for iteration, rather than the NpyIter_* API. This is a simple and safe approach, and doesn't require us to increase the number of symbols loaded when Python is initialized.