ofajardo/pyreadr

I try to read a rds file, but get the following error:

Erikvvats opened this issue · 13 comments

This is my code:

import pyreadr
result = pyreadr.read_r('data/injuryTimeDataset.rds')

This is the error:
parser.parse(path)
File "pyreadr\librdata.pyx", line 117, in pyreadr.librdata.Parser.parse
File "pyreadr\librdata.pyx", line 139, in pyreadr.librdata.Parser.parse
File "pyreadr\librdata.pyx", line 102, in pyreadr.librdata._handle_value_label
File "pyreadr\librdata.pyx", line 197, in pyreadr.librdata.Parser.__handle_value_label
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 1: invalid start byte

What should I do? I have not looked in the rds file, but it is supposed to be a mixture of strings, ints and floats. Lastly, this works:
pyreadr.object_list
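
For readers hitting the same traceback, here is a small sketch of what the message itself means. The byte 0xfc can never start a UTF-8 sequence, but in Latin-1 it is the letter "ü", so one plausible cause is a string stored in a non-UTF-8 encoding. The sample value below is hypothetical and not taken from the dataset in this issue.

```python
# Hypothetical Latin-1 encoded string: the byte at position 1 is 0xfc,
# matching the position reported in the traceback.
raw = b"M\xfcller"

def decode_or_none(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

as_utf8 = decode_or_none(raw, "utf-8")      # fails: 0xfc is not a valid UTF-8 start byte
as_latin1 = decode_or_none(raw, "latin-1")  # succeeds: 0xfc maps to U+00FC
print(as_utf8, as_latin1)
```

This only illustrates the error class; it does not show where in the file the offending byte lives.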

Make sure you are using the latest version of pyreadr. If the problem persists send a file to reproduce the issue. If I cannot reproduce it, I cannot fix it.

I have the latest pyreadr. However, I am not allowed to share the dataset.

That's unfortunate, because if I can't reproduce it there is nothing I can do for now.
If you can prepare a minimal synthetic dataset that reproduces the error, that would be ideal.
Otherwise, we will have to wait until somebody else finds the same issue and generates a file to reproduce it.

Hello,

I have the same issue and have made a reproducible file for you to check out (however, I cannot find how to upload it here). I have tried a lot to get it to work, most of it picked up during a long internet search. I suspect my file is not a "good" .RData file and tried to find the reason why, but so far without success. Could you have a look? This is what I tried:

  1. Loading it into RStudio and saving it again in a different way: saveRDS() with different parameters (compress, version, ascii).
  2. Changing the encoding with an R script, in case that helps, and then saving it as an .RData file:
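fix.encoding <- function(df, originalEncoding = "UTF-8") {
  df <- data.frame(df)
  for (col in seq_len(ncol(df))) {
    if (is.character(df[, col])) {
      # Encoding<- only declares the encoding; it does not convert any bytes
      Encoding(df[, col]) <- originalEncoding
    } else if (is.factor(df[, col])) {
      # For factors the strings live in the levels
      Encoding(levels(df[, col])) <- originalEncoding
    }
    # Other column types (numeric, logical, ...) have no encoding and are skipped
  }
  # as_data_frame() is deprecated in tibble; as_tibble() is its replacement
  return(tibble::as_tibble(df))
}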
  3. Opening it with this Python code, which kind of works, but not really:
with open(file, 'rb') as f:
    text = f.read()
    text = text.decode("utf-8") 
  4. Removing the row names, or changing all my factors to characters, but I still get an error:
df <- tibble::rownames_to_column(df, "VALUE")
and
i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)
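
A note on the Python attempt above: files written by save() or saveRDS() are normally compressed, so decoding the raw file contents as UTF-8 is expected to fail regardless of the strings stored inside. The sketch below checks the magic bytes of the three compression formats R can write (gzip, bzip2, xz) and returns the decompressed payload. The helper name `read_payload` is made up for illustration and is not part of pyreadr; the payload is still a binary serialization, not text.

```python
import bz2
import gzip
import lzma

# (magic bytes, label, opener) for the compression formats R can write.
_OPENERS = [
    (b"\x1f\x8b", "gzip", gzip.open),
    (b"BZh", "bzip2", bz2.open),
    (b"\xfd7zXZ\x00", "xz", lzma.open),
]

def read_payload(path):
    """Return (compression_label, decompressed_bytes) for the file at `path`."""
    with open(path, "rb") as f:
        head = f.read(6)
    for magic, label, opener in _OPENERS:
        if head.startswith(magic):
            with opener(path, "rb") as f:
                return label, f.read()
    # No known magic found: treat the file as uncompressed.
    with open(path, "rb") as f:
        return "none", f.read()
```

Calling `read_payload("test8.RData")` would then give bytes that can be inspected in a hex viewer or searched for suspicious byte values.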

Thanks. I need the file to take a look. Zip it and upload it here by dragging and dropping it into this text box. If the file is too big, put it on Dropbox, Google Drive or similar, make it accessible to everyone, and paste the link here. You can also look for other services that let you upload a file without having an account.

Without the file it is impossible for me to take a look.

Sorry, I uploaded some corrupt files earlier. This one should work:
test8.RData.zip

I finally found a workaround! However, I would rather not have to load the file into R, re-save it, and then use the re-saved copy in my code. I would prefer to use the original RData files directly. I was trying all kinds of things as a proof of concept.

I load this file into R and run the following to remove the factors (rlvnc2 is the name of the data frame, change accordingly):

i <- sapply(rlvnc2, is.factor)
rlvnc2[i] <- lapply(rlvnc2[i], as.character)

Then I save it with R's standard save() function:
save(rlvnc2, file = "/file/path/test9.RData")

Then it works fine with pyreadr. But if I save it with saveRDS() it doesn't work anymore. The original file (with factors instead of characters) doesn't work either.

OK, thanks, I can reproduce it. The issue comes from the C library, so I have submitted a new issue about this.

I see that in the file every factor has a lot of levels. I wonder if there is a non-UTF-8 character hidden somewhere among them. On the other hand, it seems that you already tried changing the encoding of all factors and that didn't work.

To be sure, I tried changing the encoding again and saving with save() instead of saveRDS(), and I still get the error.

Good luck finding the exact problem. If you need any help with trial and error, let me know.

Interesting: when I save the file myself, it looks completely different in a hex editor. What version of R are you using, and on which platform (Windows, Mac, Linux ...)?

I think the original file (the one that isn't working) was made either on a Linux machine or on a Windows machine, in both cases with an old version of R. I do not know the exact origin, because I only work with this file and it was created before I was involved.

The new file (after changing the factors to characters) was made with R version 4.0.2 and RStudio 2021.09.0 Build 351 "Ghost Orchid" Release (077589bc, 2021-09-20) for macOS.
test9.RData.zip

EDIT:
Now that I think about it, both files were made with R version 4.0.2 on macOS. I made the reproducible example on my own computer. The original-original file is much bigger, but it also contains information that I am unable to share. This one (test8) is just the first 4 lines of the original file, saved with R version 4.0.2 on macOS.
After changing the factors to characters (as explained earlier), the same data frame works again (test9).

OK. Anyway, saving the file again with 4.0.2 gives exactly the same error. I think the C library is somehow reading one of the fields in the binary file from the wrong byte offset.

I saved a working version for you in a previous post. Might be a good way to compare the two.
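
One way to compare the working and non-working files, sketched under the assumption that both are gzip-compressed (the default for save(), as far as I know): decompress each, find the first byte offset where they diverge, and print a short hex window there. The function names are illustrative only.

```python
import gzip

def payload(path):
    """Read a file, un-gzipping it if it starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        data = f.read()
    return gzip.decompress(data) if data[:2] == b"\x1f\x8b" else data

def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # Common prefix matches: they differ only if one input is longer.
    return None if len(a) == len(b) else min(len(a), len(b))

def hex_window(data: bytes, offset: int, width: int = 16) -> str:
    """Format `width` bytes starting at `offset` as space-separated hex."""
    return data[offset:offset + width].hex(" ")

# Example usage with the two attachments from this thread:
# a, b = payload("test8.RData"), payload("test9.RData")
# i = first_difference(a, b)
# print(i, hex_window(a, i), hex_window(b, i), sep="\n")
```

Since test8 stores factors and test9 stores character columns, the two payloads are expected to diverge early; the offset is only a starting point for inspection.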