ofajardo/pyreadr

it shows me this error LibrdataError: Unable to convert string to the requested encoding (invalid byte sequence)

69hed opened this issue · 12 comments

69hed commented

I want to open the dataset below in Python, but it keeps raising an error. The code is:

  import pyreadr
  result = pyreadr.read_r(r"~/Desktop/review2020.rda")
  print(result.keys())
  df1 = result["df1"]

The error:
~/opt/anaconda3/lib/python3.8/site-packages/pyreadr/pyreadr.py in read_r(path, use_objects, timezone)
46 if not os.path.isfile(path):
47 raise PyreadrError("File {0} does not exist!".format(path))
---> 48 parser.parse(path)
49
50 result = OrderedDict()

~/opt/anaconda3/lib/python3.8/site-packages/pyreadr/librdata.pyx in pyreadr.librdata.Parser.parse()

~/opt/anaconda3/lib/python3.8/site-packages/pyreadr/librdata.pyx in pyreadr.librdata.Parser.parse()

LibrdataError: Unable to convert string to the requested encoding (invalid byte sequence)

How can I fix this?

As suggested in the issue template, please include a file (with no sensitive data) so that I can reproduce the issue. If I cannot reproduce it, I cannot fix it.

69hed commented

I can't access the file, it gives me an error. Please zip it and drag and drop here directly.

69hed commented

After signing in, it keeps giving me a permission denied error. Please attach the file here on GitHub (you need to zip it, not to reduce the size, but because GitHub accepts zip files) or look for another way to share it.

69hed commented

I managed to download the file and reproduce the error. Reading the first bytes of the file I got this:

b'RDX3\nX\n\x00\x00\x00\x03\x00\x03\x06\x01\x00\x03\x05\x00\x00\x00\x00\x06CP1252\x00'
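For anyone who wants to run the same check on their own file, here is a minimal sketch of how the header could be inspected. It assumes the layout visible in the bytes above: the `RDX3\nX\n` magic, three big-endian 4-byte integers (format version, writer version, minimum reader version), then a length-prefixed encoding name. It also assumes the file may be gzip, bzip2, or xz compressed. The helper names `read_rdata_head` and `declared_encoding` are made up for illustration and are not part of pyreadr.

```python
import bz2
import gzip
import lzma
import struct


def read_rdata_head(path, n=64):
    """Return the first n decompressed bytes of an RData file (hypothetical helper)."""
    with open(path, "rb") as fh:
        magic = fh.read(6)
    # Pick a decompressor from the container's magic bytes.
    if magic[:2] == b"\x1f\x8b":
        opener = gzip.open
    elif magic[:3] == b"BZh":
        opener = bz2.open
    elif magic[:6] == b"\xfd7zXZ\x00":
        opener = lzma.open
    else:
        opener = open  # assume the file is stored uncompressed
    with opener(path, "rb") as fh:
        return fh.read(n)


def declared_encoding(head):
    """Return the encoding name declared in an RDX3/XDR header, or None."""
    prefix = b"RDX3\nX\n"
    if not head.startswith(prefix):
        return None
    offset = len(prefix)
    # Three big-endian int32 values; only the format version is used here.
    fmt_version, _writer, _min_reader = struct.unpack_from(">3i", head, offset)
    if fmt_version < 3:
        return None  # assumption: older format versions carry no encoding field
    offset += 12
    (length,) = struct.unpack_from(">i", head, offset)
    offset += 4
    return head[offset:offset + length].decode("ascii")
```

Calling `declared_encoding` on the bytes quoted above returns `'CP1252'`; for a file on disk, something like `declared_encoding(read_rdata_head(path))` should do the same.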

I think CP1252 is the encoding, meaning Windows-1252. Right now, as indicated in the Known limitations section of this repo's README, pyreadr does not support encodings other than UTF-8:

Cannot read RData or rds files in encodings other than utf-8.

That means this file is not supported.
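To illustrate why the encoding matters at the byte level, here is a small sketch in plain Python. It does not reproduce what librdata does internally; it only shows that the same text has different bytes in Windows-1252 and UTF-8, and that Windows-1252 bytes for non-ASCII characters are generally not valid UTF-8. The sample string and the helper `is_valid_utf8` are chosen only for illustration.

```python
text = "caf\u00e9"  # "café", an arbitrary sample with one non-ASCII character

cp1252_bytes = text.encode("cp1252")  # one byte for the accented letter
utf8_bytes = text.encode("utf-8")     # two bytes for the accented letter


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(cp1252_bytes, is_valid_utf8(cp1252_bytes))
print(utf8_bytes, is_valid_utf8(utf8_bytes))
```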

This limitation comes from the C backend, librdata. Looking at the C source code, I have the feeling the error message should be different, so I am going to open an issue there for them to take a look. I will also ask whether other encodings could be supported. It may come at some point in the future.

If you have control over how the rda files are generated, try saving them with UTF-8 encoding.

69hed commented

@69hed could you please share the file again? It has been deleted from dropbox.

@69hed recovered the file and hosted it here: https://github.com/ofajardo/readstat_test_files/blob/master/tip2020.rda for easier sharing with the librdata people, who are looking at it.