epam/parso

Make the SasFileParser() API Public

thesuperzapper opened this issue · 11 comments

Hi Guys, I am the maintainer of the spark-sas7bdat package, which many people use to read large SAS7BDAT files with Apache Spark across many servers.

The issue is that, because we need to start reading at arbitrary offsets within the SAS files (so that each server reads only a piece of the file), we have to call some of the protected methods and constructors in Parso.

Currently we use a crazy hack (PrivateMethodExposer.scala) to break into the protected methods and constructors.

I would love it if you could publicly expose the following things:

  1. The constructor for com.epam.parso.impl.SasFileParser
  2. The getSasFileProperties() method of SasFileParser
  3. The readNext() method of SasFileParser
  4. The readNextPage() method of SasFileParser
  5. The constants: TIME_FORMAT_STRINGS, DATE_FORMAT_STRINGS, and EPSILON

Here is the actual class we use in conjunction with PrivateMethodExposer.scala so that we can use com.epam.parso.impl.SasFileParser: ParsoWrapper.scala. A rough sketch of that kind of reflective shim is below.
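
For context, this is roughly what the reflective workaround looks like. It is a minimal sketch, not the actual PrivateMethodExposer.scala: the object and helper names are hypothetical, and it only shows the two no-argument members named above being unlocked via setAccessible.

    import com.epam.parso.SasFileProperties
    import com.epam.parso.impl.SasFileParser

    object ProtectedParsoAccess {
      // Invoke the protected readNextPage() on an existing parser instance.
      def readNextPage(parser: SasFileParser): Unit = {
        val m = classOf[SasFileParser].getDeclaredMethod("readNextPage")
        m.setAccessible(true) // defeats the protected modifier; fragile across Parso versions
        m.invoke(parser)
      }

      // Invoke the protected getSasFileProperties() the same way.
      def sasFileProperties(parser: SasFileParser): SasFileProperties = {
        val m = classOf[SasFileParser].getDeclaredMethod("getSasFileProperties")
        m.setAccessible(true)
        m.invoke(parser).asInstanceOf[SasFileProperties]
      }
    }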

Hi Mathew,

Thanks for sharing your issue. Could you please answer the following questions so that we can find the best way to help you:

  1. Would it be possible for you to use SasFileReaderImpl?
  2. Why do you need readNextPage() to be public, given that it is an internal method used for file parsing?

Thanks

@printsev
This is the process we currently use to distribute the reading of sas files across multiple servers:

  1. Each worker is provided a rough "start point" and "end point" as byte offsets from the beginning of the file. (These are coordinated by a master node and span roughly "file size" / "number of workers" bytes.)
  2. Each worker initialises a com.epam.parso.impl.SasFileParser.
    1. With the input stream starting from byte 0, so that it can read the metadata in the header.
  3. Each worker moves its "start point" and "end point" backwards such that they sit on the closest preceding page end.
  4. Each worker seeks the SasFileParser's input stream to the "start point".
    1. After this we call readNextPage() so that the internals of Parso get reset to the new offset.
  5. Each worker reads new rows until the input stream is at "end point" offset.

Here is the code, which currently uses hacks to expose the protected methods described above; a condensed sketch of the per-worker flow follows.
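
To make steps 3-5 concrete, here is a condensed sketch of the per-worker read loop. This is not the real ParsoWrapper.scala: the readPage and readRow parameters stand in for the protected readNextPage() and readNext() calls, the page-alignment arithmetic is an illustration rather than the exact spark-sas7bdat logic, and org.apache.commons.io.input.CountingInputStream is used only to show how a worker can track its byte offset.

    import com.epam.parso.SasFileProperties
    import org.apache.commons.io.input.CountingInputStream

    def readSplit(counting: CountingInputStream,     // wraps the stream handed to the parser (step 2)
                  props: SasFileProperties,          // obtained via getSasFileProperties()
                  readPage: () => Unit,              // the protected readNextPage()
                  readRow: () => Array[Object],      // the protected readNext(); null at end of file
                  start: Long, end: Long): Seq[Array[Object]] = {
      // Step 3: snap both offsets back to the closest preceding page boundary.
      def alignToPage(offset: Long): Long = {
        val header = props.getHeaderLength
        header + ((offset - header) / props.getPageLength) * props.getPageLength
      }
      val alignedStart = alignToPage(start)
      val alignedEnd   = alignToPage(end)

      // Step 4: seek to the aligned start, then reset Parso's internal page state.
      counting.skip(alignedStart - counting.getByteCount)
      readPage()

      // Step 5: read rows until the stream reaches the aligned end offset.
      val rows = Seq.newBuilder[Array[Object]]
      var done = false
      while (!done && counting.getByteCount < alignedEnd) {
        val row = readRow()
        if (row == null) done = true else rows += row
      }
      rows.result()
    }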

Hi @thesuperzapper,

Would you be comfortable if we provided public access to the readNextPage() method via the SasFileReaderImpl class? That class already contains public methods for getSasFileProperties() and readNext(), which you use in your code, and the constants TIME_FORMAT_STRINGS, DATE_FORMAT_STRINGS, and EPSILON have been moved to public interfaces (to be released soon). So it looks like we can add a public void readNextPage() method to SasFileReaderImpl, and you can use that instead of creating an instance of SasFileParser directly. Or are there other reasons why you need an instance of the SasFileParser class? Thank you.
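
Roughly, the worker-side setup could then look like the sketch below. It assumes the proposed readNextPage() has been added to SasFileReaderImpl (not yet released at the time of writing), and the seekToSplitStart helper is a placeholder for your own offset logic, not Parso API.

    import java.io.InputStream
    import com.epam.parso.impl.SasFileReaderImpl

    def openSplitReader(stream: InputStream,
                        seekToSplitStart: InputStream => Unit): SasFileReaderImpl = {
      val reader = new SasFileReaderImpl(stream) // public constructor; reads the header metadata
      seekToSplitStart(stream)                   // move to the split's page-aligned "start point"
      reader.readNextPage()                      // proposed public method: reset page state at the new offset
      reader                                     // rows can then be read with the public readNext()
    }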

@Yana-Guseva That would probably work.

@Yana-Guseva any progress on this?
We would love to make spark-sas7bdat use this new public API.

@thesuperzapper currently all changes related to this issue are available in the master branch. Please let me know if you run into any problems.

@Yana-Guseva @printsev While these changes are now in the master branch, there has been no release in many months.

When are you planning to cut a release with these changes? (the spark-sas7bdat package needs this change urgently to support Spark 3.0)

Tagar commented

@printsev @Yana-Guseva any chance a new release can be cut from master? thx!!

Sorry for the delay in my answer -- vacation time (even though the year is absolutely crazy). I've deployed 2.0.12-SNAPSHOT to the Maven snapshot repository. I hope that works for now; after we deal with some failing tests (I believe the issue is not in the application code, as they fail even with code from 2016), we can make the 2.0.12 release.
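
For anyone who wants to try the snapshot before the release, a minimal sbt setup might look like this (the com.epam:parso coordinates and the Sonatype OSS snapshots URL are assumptions here; point the resolver at wherever the snapshot was actually deployed):

    resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
    libraryDependencies += "com.epam" % "parso" % "2.0.12-SNAPSHOT"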

Tagar commented

Thank you Igor

I've made the 2.0.12 release; please let me know if it works for you.