WizardMac/librdata

Reading matrices with only column or row names

ofajardo opened this issue · 10 comments

When reading matrices with no column and no row names (mat_simple.rds), or with both column and row names (mat_rowcolnames.rds), the new code works like a charm!

However, when reading matrices with only row names (mat_rownames.rds) or only column names (mat_colnames.rds), an error arises:

Invalid file, or file has unsupported features

See attached zip for the examples

matrices.zip

It looks like with only row names or column names present, the matrix has a NIL value for the other names. I can skip over that NIL value unless you are expecting different behavior.

The problem with skipping the NIL is that right now I get the dimensions as an array, say [4,3] for 4 rows and 3 columns, and when retrieving the names I expect the function to give me 4x3=12 names. So I guess the output would need to be filled with as many NULLs as necessary to preserve that count? Otherwise I wouldn't know which dimension the names correspond to (maybe there is another way to know that?).
Alternatively, and perhaps a bit more confusing for the API user, if a single NULL is returned when getting the names, I could interpret that as meaning the names for that whole dimension should be skipped.

The NIL applies to the row or column name vector, not to the names themselves. So I think it may behave as expected.
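To illustrate the distinction, here is a minimal Python sketch (not librdata code) of the structure R itself uses: `dimnames` holds one entry per dimension, and each entry is either NULL or a vector with one name per index along that dimension. The helper `check_dimnames` and the variable names are hypothetical, chosen only for this example.

```python
from typing import List, Optional, Sequence


def check_dimnames(dims: Sequence[int],
                   dimnames: Sequence[Optional[List[str]]]) -> bool:
    """Hypothetical helper: True if there is one entry per dimension and
    every non-None entry has exactly as many names as that dimension is long."""
    if len(dimnames) != len(dims):
        return False
    return all(names is None or len(names) == extent
               for extent, names in zip(dims, dimnames))


# 4 rows, 3 columns, only column names: the row entry is a single None,
# standing in for the whole (missing) row-name vector.
only_cols = [None, ["V1", "V2", "V3"]]

# 4 rows, 3 columns, only row names.
only_rows = [["A", "B", "C", "D"], None]
```

Under this layout a 4x3 matrix carries at most 4 + 3 names, not 4 x 3, and a missing vector is marked once per dimension.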

Please try this branch and let me know if it works:

https://github.com/WizardMac/librdata/tree/issue-36

The file that has only row names (A to D for 4 rows) is perfectly fine: I get the names of the rows (i.e. an array ["A", "B", "C", "D"]) and nothing for the columns, from which I can infer there are no column names (but it would be problematic if there were 3 dimensions and the second one didn't have names).

For the file with names only on columns (3 columns, V1 to V3), I again get only the names of the columns (an array ["V1", "V2", "V3"]), but this time there is no signal that the row names are missing, so I cannot skip them. Basically, I don't know that the names should be applied to the columns and not to the rows. Maybe inserting a NULL in place of the row names would help? Otherwise, is there any other way to know? I guess I could count how many values I get: if I get 3 in this case, which doesn't match the number of rows, I can infer the names are for the columns. But what if I get a square array of 3 rows and 3 columns? There is no way to know in that case.
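The counting heuristic and its failure on square matrices can be shown with a short Python sketch. This is only an illustration of the reasoning in this comment, not librdata or reader code, and `candidate_dimensions` is a hypothetical name.

```python
from typing import List, Sequence


def candidate_dimensions(dims: Sequence[int], n_names: int) -> List[int]:
    """Return the indices of all dimensions whose length equals n_names.

    One candidate means the names can be assigned unambiguously;
    more than one means counting alone cannot decide.
    """
    return [i for i, extent in enumerate(dims) if extent == n_names]


# 4x3 matrix, 3 names received: only dimension 1 (columns) fits.
rect = candidate_dimensions([4, 3], 3)

# 3x3 matrix, 3 names received: rows and columns both fit.
square = candidate_dimensions([3, 3], 3)
```

For the 4x3 case the result has a single entry, while for the 3x3 case both dimensions match, which is why an explicit marker for the missing names is needed.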

Okay, understood. I'll see if I can return NULL once for each (missing) row name.
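On the reader side, that approach could be consumed roughly as in the Python sketch below. It assumes (this is an assumption, not confirmed behavior of the branch) that the names arrive as one flat sequence in dimension order, with exactly one NULL per missing name; `split_names` is a hypothetical helper.

```python
from typing import List, Optional, Sequence


def split_names(dims: Sequence[int],
                stream: Sequence[Optional[str]]) -> List[Optional[List[Optional[str]]]]:
    """Split a flat name stream into one entry per dimension.

    Assumes the stream holds sum(dims) values in dimension order,
    with None standing in for each missing name. A dimension whose
    chunk is entirely None is reported as None (no names at all).
    """
    if len(stream) != sum(dims):
        raise ValueError("stream length does not match the dimensions")
    result = []
    start = 0
    for extent in dims:
        chunk = list(stream[start:start + extent])
        start += extent
        result.append(None if all(n is None for n in chunk) else chunk)
    return result


# 3x3 matrix with only column names: no longer ambiguous.
square = split_names([3, 3], [None, None, None, "V1", "V2", "V3"])

# 2x2x2 array whose second dimension has no names.
cube = split_names([2, 2, 2], ["a", "b", None, None, "x", "y"])
```

Because every dimension contributes a fixed number of slots, the same split works for any number of dimensions, including the case where a middle dimension is unnamed.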

Please try the updated code in the issue-36 branch.

It works well for the test files proposed initially!

But there are two strange things:

  • In my handle_dim function, the rdata_type_t type argument comes through as NULL when it should be RDATA_TYPE_INT32.
  • I have two files with 3D arrays, one with no names and one with names for every dimension. These worked previously, but now they fail with "Invalid file, or file has unsupported features" (see attached files).

arrays3d.zip

Okay, the latest code should fix both oddities. If you need more than 3 dimensions, please let me know.

Yes, it's good! Thanks a lot! So I guess the only edge case remaining is #37.