epam/parso

Deleted Records.

PCaff opened this issue · 6 comments

PCaff commented

I'm one of the users who has experienced issues while using this in Spark.
I continually get ArrayIndexOutOfBoundsException errors, similar to the issues other users have reported with the Spark version. I recently cloned this repo and wrote a quick program to read one of the failing files using just parso. The error persisted, though with a different stack trace.

The issue, I believe, is that the readAll() method places null records into the returned data, so iterating over the 2D array without a null check throws a NullPointerException.
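As a stopgap on the caller's side, the null rows can be counted and skipped before any cell is touched. The following is only a minimal sketch: it works on a plain Object[][] standing in for whatever readAll() returned, and the class and method names (NullRowCheck, countNullRows, nonNullRows) are made up for illustration and are not part of parso.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of parso: guards against null rows in a row-major table.
class NullRowCheck {

    /** Counts the entries of the outer array that are null. */
    static int countNullRows(Object[][] rows) {
        int nulls = 0;
        for (Object[] row : rows) {
            if (row == null) {
                nulls++;
            }
        }
        return nulls;
    }

    /** Returns the non-null rows in their original order. */
    static List<Object[]> nonNullRows(Object[][] rows) {
        List<Object[]> kept = new ArrayList<>();
        for (Object[] row : rows) {
            if (row != null) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in for the array returned by readAll(); the null entry mimics the rows described above.
        Object[][] rows = { {1L, "a"}, null, {2L, "b"} };
        System.out.println("null rows: " + countNullRows(rows));
        for (Object[] row : nonNullRows(rows)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Comparing the count of null rows against the number of deleted rows reported by SAS is a quick way to check whether the two are related. Skipping the nulls only avoids the exception; it does not recover any valid records that may have been lost.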

One interesting observation I made while checking how many records I could read before hitting one of these errors: the number of null rows is very close to the number of deleted rows.

Does parso handle deleted rows? If not, is any logic for deleted records planned? The error can be worked around by resaving the SAS file from a SAS program, but for large files this takes a very long time.

Thanks for reporting the issue, @PCaff. I don't think parso supports deleted records. Do you think the error you're facing is related to this? If you have some test examples, it would be great if you could share them, and we could even include them in our test suite.

PCaff commented

I'm trying to replicate the issue with the test data provided in this repo, @printsev. I'll hopefully have a commit/PR up in the next hour or so tonight.

Currently, I am able to produce a deleted record in the test data. This doesn't replicate my issue exactly, but I am convinced it is related. It seems parso ignores a page that contains deleted records: using all_rand_normal.sas7bdat with a deleted record, I receive 0 columns and 0 rows.

PCaff commented

Actually, to make it easier, I'll just post the file here:

all_rand_normal_with_deleted.zip

PCaff commented

It also looks like this issue may encompass records split between pages. If someone can point me to the right place in the logic, I can try to enhance it.

PCaff commented

I'll be running tests this week on new logic related to this issue. New pages were found that contain deleted records, and the new logic includes mapping the deleted markers to the records.
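To make the idea of mapping deleted markers to records concrete, here is a purely illustrative sketch. The thread does not describe how the markers are actually stored, so the layout is an assumption: one bit per row in a per-page byte array, read most-significant bit first, with a set bit meaning "deleted". The names DeletedMarkerSketch, isDeleted, and liveRows are hypothetical and do not come from parso.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: assumes a hypothetical per-page bitmap, one bit per row, MSB first.
class DeletedMarkerSketch {

    /** Returns true if the bit for rowIndex is set in the assumed bitmap. */
    static boolean isDeleted(byte[] deletedBitmap, int rowIndex) {
        int octet = deletedBitmap[rowIndex / 8] & 0xFF;   // treat the byte as unsigned
        return ((octet >> (7 - rowIndex % 8)) & 1) == 1;  // pick the row's bit, MSB first
    }

    /** Keeps only the rows of a page whose marker bit is not set. */
    static List<Object[]> liveRows(Object[][] pageRows, byte[] deletedBitmap) {
        List<Object[]> live = new ArrayList<>();
        for (int i = 0; i < pageRows.length; i++) {
            if (!isDeleted(deletedBitmap, i)) {
                live.add(pageRows[i]);
            }
        }
        return live;
    }
}
```

With this approach a deleted row is dropped explicitly instead of surfacing later as a null entry, which is the symptom reported at the top of the thread. The real marker layout should be taken from the merged change, not from this sketch.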

Merged to master, so closing the issue.