ofajardo/pyreadr

Why don't you compress files?

pablodegrande opened this issue · 5 comments

I investigated a bit, and I believe that creating compressed rdata files takes no more than calling:

import gzip
import shutil

# Stream the uncompressed file through gzip into a new file.
with open('uncompressedfile.rdata', 'rb') as f_in:
    with gzip.open('compressedfile.gz.rdata', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

I was wondering why your library doesn't do that when saving... Is there any other format issue I am not aware of?

Thanks a lot, Pablo.

No, it is as you say. But it increases the writing time, and there is no general agreement on what the compression should be: you like gzip, but others prefer bzip2, and so on.
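To illustrate the point about differing preferences, here is a minimal sketch that leaves the codec choice to the caller, using only the Python standard library. The helper `compress_file` and its `method` argument are hypothetical names for illustration, not part of pyreadr.

```python
import bz2
import gzip
import lzma
import shutil

# Hypothetical helper, not part of pyreadr: maps a codec name to the
# standard-library function that opens a compressed file for writing.
_OPENERS = {"gzip": gzip.open, "bzip2": bz2.open, "xz": lzma.open}


def compress_file(src, dst, method="gzip"):
    """Copy src to dst, compressing the bytes with the chosen codec."""
    try:
        opener = _OPENERS[method]
    except KeyError:
        raise ValueError(f"unsupported compression method: {method!r}")
    # Stream the data so large files are never loaded fully into memory.
    with open(src, "rb") as f_in, opener(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
```

For example, `compress_file("uncompressedfile.rdata", "compressedfile.rdata", method="bzip2")`. As far as I know, R's load() detects gzip, bzip2 and xz compression on its own when given a file name, but that is worth checking against your R version.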

The other reason is that the writing feature is meant for quick and dirty data exchange with R, so the files will be destroyed after use and their size is not very relevant. If you need to store the files, then doing it as rdata or rds sounds like a very bad idea... Better use arrow.

Great! I will use your library in this project https://github.com/poblaciones/poblaciones (which renders a collaborative, data-oriented map, https://poblaciones.org). Users will be OK downloading an rdata file, and I will gzip it for them before retrieval... Thanks a lot!!

Yeah, I see, for your case you need it compressed. Maybe I will add it as an option in the future (the default will be no compression so that it doesn't break existing code).

Just as a piece of advice, the interoperability of R files is terrible. Only R can read and write them correctly, because the format is undocumented and changes all the time. For that reason it would be better to provide files in an interoperable, documented format. But of course if you have a lot of R users they won't like it (and if you have users from other systems they won't like R formats).

OK, gzip compression is implemented as an option in pyreadr 0.3.2:

pyreadr.write_rdata("test.RData", df, df_name="dataset", compress="gzip")
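If you want to confirm that a written file really is gzip-compressed, one quick check is to look at its first two bytes, which are always 0x1f 0x8b for gzip. A minimal sketch, where `is_gzip_file` is a hypothetical helper and not part of pyreadr:

```python
def is_gzip_file(path):
    """Return True if the file starts with the gzip magic bytes."""
    # Every gzip stream begins with the two bytes 0x1f 0x8b.
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"
```

Calling `is_gzip_file("test.RData")` on the file from the example above should then return True, assuming it was written with compress="gzip".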

Now I also remember that this was not implemented before partly because it was not a high priority, as explained above, but also because a bug on Windows that prevented deleting the created files (Roche/pyreadstat#49) was blocking it.

Hope it helps