WizardMac/ReadStat

Unable to parse sas7bdat when data set page size > ~1MiB

curtisalexander opened this issue · 3 comments

Issue

I am attempting to parse a sas7bdat with a data set page size of 2097152 (~ 2 MiB). When attempting to parse this file I get the error:

Stopping with error: Error when attempting to parse sas7bdat: READSTAT_ERROR_PARSE

SAS file

As noted, the data set page size of the file is 2097152. The SAS dataset is a randomly created dataset with 2,000 rows and 110 created columns and is ~6MB in size.

I can regenerate the same file, reducing the page size from 2097152 (2 MiB) down to 1048576 (1 MiB), and the file parses without issue.

The files I used for testing are in the following location:

To generate the tables linked to above, I manually adjusted the page size using the BUFSIZE= data set option in SAS.

/* Unable to parse */
data rand_ds_largepage_err(bufsize=2M);
  /* elided */
run;

/* Able to parse */
data rand_ds_largepage_ok(bufsize=1M);
  /* elided */
run;

SAS BUFSIZE option

According to the SAS documentation on the BUFSIZE option, the page size may be adjusted by altering the system or dataset BUFSIZE. Again from the documentation, the maximum data set page size that may be set is 2147483647.

MAX → sets the page size to the maximum possible number in your operating environment, up to the largest 4-byte, signed integer, which is 2^31–1, or approximately 2 billion bytes.

Troubleshooting / Potential Fix

On line 257 of the readstat_sas.c file I note the hinfo->header_size is checked against 1<<20 (1048576 in decimal). If I alter this line to check against 1<<21 (2097152 in decimal), the file parses without issue.
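To illustrate why the change from 1<<20 to 1<<21 matters, here is a minimal sketch of an upper-bound check. It is not copied from readstat_sas.c: the macro names and the `size_within_limit` helper are hypothetical, and the inclusive comparison is an assumption based on the observation that a limit of exactly 2097152 lets the file through.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the two limits discussed above. */
#define HEADER_SIZE_LIMIT_OLD ((int64_t)1 << 20)
#define HEADER_SIZE_LIMIT_NEW ((int64_t)1 << 21)

/* Compile-time confirmation of the decimal values quoted in the text. */
static_assert(HEADER_SIZE_LIMIT_OLD == 1048576, "1<<20 is 1048576");
static_assert(HEADER_SIZE_LIMIT_NEW == 2097152, "1<<21 is 2097152");

/*
 * Hypothetical helper: returns 1 when `size` is positive and does not
 * exceed `limit` (inclusive upper bound, an assumption), otherwise 0.
 */
int size_within_limit(int64_t size, int64_t limit) {
    return size > 0 && size <= limit;
}
```

Under these assumptions, `size_within_limit(2097152, HEADER_SIZE_LIMIT_OLD)` fails while the same size passes against `HEADER_SIZE_LIMIT_NEW`, which matches the behavior I see with the two test files.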

Because the SAS documentation notes that users can set the data set page size to as much as 2^31–1 bytes (~2 GiB), I wonder if the line should be adjusted to check against INT32_MAX. Making that adjustment may have ramifications that I don't immediately see, as I'm not very familiar with the repository.

Finally, I am glad to submit a PR with the change I noted above. Or if you have suggestions on a set of other changes I would need to trace through, I am glad to put in the work. All in all, I am glad to help in any way! Thanks so much for all the effort you (and others) have put into the library!

Thanks for the detailed report. The purpose of the page and header size checks is to prevent excessive memory allocations with malformed input – right now ReadStat allocates a buffer equal to the page size, and so we don't want that running into the gigabytes.

I'll add some slack to the header size test so it matches the page size test – (1 << 24) works out to 16 MB so it should work fine with your file. If you encounter a file in the wild with a larger page size than that, just open another issue.
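The idea of capping the declared size before allocating a buffer of that size could be sketched roughly as follows. This is an illustration only, not ReadStat's actual code: `MAX_PAGE_SIZE` and `alloc_page_buffer` are hypothetical names, and the inclusive comparison is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cap using the (1 << 24) figure mentioned above. */
#define MAX_PAGE_SIZE ((int64_t)1 << 24)

/* (1 << 24) is 16777216 bytes, i.e. 16 MiB. */
static_assert(MAX_PAGE_SIZE == 16777216, "1<<24 is 16777216");

/*
 * Hypothetical helper: allocate a buffer for one page, but only when the
 * size declared in the file header is positive and no larger than the cap.
 * Returns NULL for an out-of-range size (or if malloc itself fails), so a
 * malformed header cannot trigger a multi-gigabyte allocation.
 * The caller owns the returned buffer and must free() it.
 */
void *alloc_page_buffer(int64_t declared_page_size) {
    if (declared_page_size <= 0 || declared_page_size > MAX_PAGE_SIZE) {
        return NULL;
    }
    return malloc((size_t)declared_page_size);
}
```

With this cap, a declared size of 2097152 is accepted, while a header claiming INT32_MAX bytes is rejected before any allocation happens.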

Thanks so much @evanmiller!

Completely understand not wanting to allocate a buffer in the GBs. I have seen SAS datasets in the wild with a data set page size of 10 MB. If I encounter others that are larger I will post another issue.

Closing, as f6766cd corrects this. I will open a new issue if I encounter a page size greater than 16 MB.