BioStatMatt/sas7bdat

New code for reading compressed SAS data

Opened this issue · 8 comments

Just found this: http://ggasoftware.com/opensource/parso

I haven't tried it yet, but knew they'd been working on it for some time.

Thanks for the link, Harry. I couldn't find the source code on their web
page. Judging by their file naming, it looks like they may have used the
information that we gathered about the file format, which is great. There
are several other sas7bdat reader implementations that are faster than
mine. My focus has been mostly on the reverse-engineering effort, rather
than the reader-implementation effort.

Best,
Matt


They started with something called SassyReader, a Java library reverse-engineered from your code. That library had trouble dealing with format catalogues (or at least broken ones) and with compressed SAS files, so GGA put some extra effort in and got it working much more generally.

I got Parso running via a Java wrapper with some help from a friend. I've transformed a few dozen SAS datasets now, including compressed ones, and including at least one that had newlines in character fields that caused the SAS CSV writer to create garbage, and Parso has done the trick every time. If you want to figure out how to add it as an alternative engine to your sas7bdat package, let me know and I'll help out as much as I'm able.

Harry

Harry,

Thanks for your offer to help. I am familiar with Parso. However, I'm not
sure it would be worth the effort to interface the sas7bdat package with
the Java code (unless you are very familiar with that mechanism). It
might be best implemented as a separate package.

Ultimately, I'd like to incorporate the extra work those authors have done
(compression, etc.) into the sas7bdat package in a more natural way.
Unfortunately, adding new functionality to the package is a low priority at
the moment.

If you are willing and able to help, I think the biggest contribution that
you could make is to either wrap the Parso library into an R package, if
its license permits it, or to help improve the sas7bdat file format
documentation in the sas7bdat package (i.e., the 'sas7bdat' vignette). I'm
sure that we can learn a lot from reading their source code. I would, of
course, acknowledge any effort you provide within the 'sas7bdat'
documentation.

Thanks for your interest!

Best,
Matt


OK. I went ahead and wrote an R package that wraps the Parso library. It's
a bit slow though.

https://github.com/biostatmatt/sas7bdat.parso


It /is/ slow, but it works where read.sas7bdat fails due to data compression! I had to give that .parso repo its first star for that!

I just added a function s7b2csv that should be a bit faster at converting
the file, since the full read occurs without switching control back and
forth between the Java and R code. It's still a sequential process (i.e.,
read a bit of the sas7bdat file, then write a bit of the CSV file). This is
a safety play, since some files may not fit completely into memory.
Ideally, the code would look at the size of the data (from the header
information) and then decide whether to do a sequential read, or to read
everything into memory and then write a CSV, say if the data size were less
than 2 GB.
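The decision logic described above could be sketched roughly as follows. This is a hypothetical illustration, not code from the sas7bdat.parso package: choose_strategy, convert_rows, and the 2 GB limit are names and values invented here to show the idea.

```python
# Hypothetical sketch of the strategy above: estimate the data size
# (as the sas7bdat header would report it via row count and row length)
# and pick a conversion mode accordingly.

MEM_LIMIT = 2 * 1024**3  # the 2 GB threshold suggested above

def choose_strategy(row_count, row_length, limit=MEM_LIMIT):
    """Return 'in-memory' if the whole table fits under the limit,
    otherwise 'sequential' (read a bit, write a bit)."""
    data_size = row_count * row_length
    return "in-memory" if data_size < limit else "sequential"

def convert_rows(rows, writerow, strategy):
    """Drive the conversion: buffer the whole table for 'in-memory',
    or hand each row to the CSV writer as it arrives (constant memory)."""
    if strategy == "in-memory":
        for row in list(rows):   # one pass: everything read, then written
            writerow(row)
    else:
        for row in rows:         # stream: never holds more than one row
            writerow(row)
```

In practice `rows` would be the iterator over records produced by the Parso reader and `writerow` a CSV writer's row callback; the point is only that the mode can be chosen once, up front, from header metadata.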


Great, duly noted.

FWIW, I could not find the source code on their website, but it is apparently accessible from: http://search.maven.org/remotecontent?filepath=com/ggasoftware/parso/1.2.1/parso-1.2.1-sources.jar