py_to_r returns array in the wrong order when the array is a numpy.array with character type
Closed this issue · 3 comments
Dear reticulate team,
Thank you so much for writing this wonderful package! I ran into an array conversion inconsistency today.
In a Python script:
import pickle
import numpy as np
s1 = np.array([1,2,3])
s2 = np.array([3,4,5])
s = np.stack([s1,s2], axis = 1)
with open("s.pl", 'wb') as f:
pickle.dump(s, f)
c1 = np.array(["1","2","3"])
c2 = np.array(["3","4","5"])
c = np.stack([c1,c2], axis = 1)
with open("c.pl", 'wb') as f:
pickle.dump(c, f)
Then, in an R script in the same directory:
library(reticulate)
# I use python_path to set a python version in my local env,
# which I ignore here.
py <- import_builtins()
pickle <- import("pickle")
with(py$open("s.pl", 'rb') %as% f, {
s <- pickle$load(f)
})
with(py$open(".c.pl", 'rb') %as% f, {
c <- pickle$load(f)
})
Then, in the R console, if I print s and c:
> s
[,1] [,2]
[1,] 1 3
[2,] 2 4
[3,] 3 5
> c
[,1] [,2]
[1,] "1" "4"
[2,] "3" "3"
[3,] "2" "5"
s is in the right order, but c is not.
My R session is:
sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin22.4.0 (64-bit)
Running under: macOS Ventura 13.4.1
Matrix products: default
BLAS: /usr/local/Cellar/openblas/0.3.23/lib/libopenblasp-r0.3.23.dylib
LAPACK: /usr/local/Cellar/r/4.3.0_1/lib/R/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] reticulate_1.30
loaded via a namespace (and not attached):
[1] compiler_4.3.0 here_1.0.1 Matrix_1.5-4 rprojroot_2.0.3 cli_3.6.1 withr_2.5.0 Rcpp_1.0.10
[8] grid_4.3.0 jsonlite_1.8.7 rlang_1.1.1 png_0.1-8 lattice_0.21-8
Would you mind double-checking this, or is there something I need to pay specific attention to?
Thanks!
Songpeng
@beyondpie Thanks for reporting! I can reproduce.
@dfalbel Can you please take a look? It seems like our numpy conversion routines in C++ that detect string-type arrays need to be updated.
A more minimal reprex (without pickle):
library(reticulate)
py_run_string(
r"(
import numpy as np
s1 = np.array([1,2,3])
s2 = np.array([3,4,5])
s = np.stack([s1,s2], axis = 1)
c1 = np.array(["1","2","3"])
c2 = np.array(["3","4","5"])
c = np.stack([c1,c2], axis = 1)
)")
py$c
#> [,1] [,2]
#> [1,] "1" "4"
#> [2,] "3" "3"
#> [3,] "2" "5"
py$s
#> [,1] [,2]
#> [1,] 1 3
#> [2,] 2 4
#> [3,] 3 5
# flags for both c and s are identical
py_eval("c.flags")
#> C_CONTIGUOUS : True
#> F_CONTIGUOUS : False
#> OWNDATA : True
#> WRITEABLE : True
#> ALIGNED : True
#> WRITEBACKIFCOPY : False
py_eval("s.flags")
#> C_CONTIGUOUS : True
#> F_CONTIGUOUS : False
#> OWNDATA : True
#> WRITEABLE : True
#> ALIGNED : True
#> WRITEBACKIFCOPY : False
Created on 2023-07-11 with reprex v2.0.2
We wrongly assume that the item() method will return elements in the order the array is stored. I'm not sure if this is a bug in numpy or if it is expected, but this doesn't seem to be the recommended way to iterate over array elements in numpy.
(reticulate, lines 1128 to 1137 at commit 6d93f3f)
See for instance the array iteration docs, which recommend using the nditer() iterator instead to get an order corresponding to the memory layout: https://numpy.org/doc/stable/reference/arrays.nditer.html#arrays-nditer
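To make the ordering difference concrete, here is a small Python check (mine, not from the thread) using the same c array as the reprex above; the #> lines show the output I would expect from numpy:

import numpy as np

c1 = np.array(["1", "2", "3"])
c2 = np.array(["3", "4", "5"])
c = np.stack([c1, c2], axis=1)  # 3x2 character array, C-contiguous

# item(i) resolves a flat index in row-major (C) order, which is the
# order the elements end up being read in per the comment above:
[c.item(i) for i in range(c.size)]
#> ['1', '3', '2', '4', '3', '5']

# nditer can instead walk the elements in column-major (Fortran) order,
# which is the order an R matrix is filled in:
[str(x) for x in np.nditer(c, order="F")]
#> ['1', '2', '3', '3', '4', '5']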
We have a few options to fix this issue:
- Permute the dimensions of the array to match the Fortran ordering, then iterate. Looks a bit hacky and requires a full copy of the array (besides the R copy).
- Add strides to the indexing so that item() returns the correct element (see the sketch after this list). This can be quite complicated, especially for multi-dimensional arrays, and some unexpected numpy flags can cause breakage.
- Use numpy's proper iterators: https://numpy.org/doc/stable/reference/c-api/iterator.html. This requires adding support for them in reticulate's libpython.h, which is quite tricky, but it seems to allow for faster iteration and is a safer way to get the correct Fortran ordering.
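To give a sense of the bookkeeping the second option would involve, here is a sketch at the Python level using numpy's index helpers rather than raw byte strides (the real code would have to do the equivalent arithmetic in C++); fortran_index_to_item_index is a hypothetical helper, not reticulate code:

import numpy as np

def fortran_index_to_item_index(shape, k):
    # Interpret k as the k-th element in column-major (Fortran) order,
    # then convert that coordinate to the flat row-major index that
    # ndarray.item() expects.
    coord = np.unravel_index(k, shape, order="F")
    return int(np.ravel_multi_index(coord, shape, order="C"))

c = np.stack([np.array(["1", "2", "3"]), np.array(["3", "4", "5"])], axis=1)
[c.item(fortran_index_to_item_index(c.shape, k)) for k in range(c.size)]
#> ['1', '2', '3', '3', '4', '5']  # column-major order, as R expects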
I'm leaning towards the 3rd approach. Adding support for numpy iterators could even allow us to improve casting for other data types in the future. @t-kalinowski what do you think?
Thanks @dfalbel for investigating and chatting just now!
We decided to take the approach of calling into Python to create an np.nditer() from C, and to use the standard Python C API (PyIter_Next) for iteration, rather than the NpyIter_* API. This is a simple and safe approach, and doesn't require us to increase the number of symbols loaded when Python is initialized.