AsperaGmbH/xlsx-reader

No option to skip empty rows

ralfstrobel opened this issue · 6 comments

Excel sadly often saves xlsx files which end in a large number of explicit empty row elements (for example if the cells were formatted but then never used), which may take a form such as the following:

<row r="659" spans="1:10" x14ac:dyDescent="0.2">
	<c r="D659" s="3"/>
	<c r="E659" s="3"/>
</row>

Note that not a single <v> element is contained.

It would be nice to have a native option to ignore empty rows, ideally only at the end of the file. However, since a sequential reader cannot look ahead to know whether there will be further non-empty rows, this would probably be difficult to implement.

I would once again suggest to create feature parity with akeneo-labs/spreadsheet-parser, which skips any empty rows by default, but still updates the return value of Iterator::key() to reflect the actual excel row and therefore expose to the caller that an empty row was skipped.

While debugging this further, I've also noticed that the Implementation of Reader::next() returns the current value, while the Iterator interface defines the return as void. This doesn't waste much performance, but is still unnecessary, especially given the additional call to adjustRowOutput(). During iteration the value will always be obtained via a separate call to current().

Hey there,

The just-released version v0.10.0 should address these issues now. Please check it out.

To make the reader behave in the way you described, configure it similarly to this:

$reader = new Reader(
    (new ReaderConfiguration())
        ->setSkipEmptyRows(ReaderSkipConfiguration::SKIP_TRAILING_EMPTY)
);

If you wish to skip all empty rows instead of only trailing ones, replace SKIP_TRAILING_EMPTY with SKIP_EMPTY.

Sidenote, if you happen to make use of SkipEmptyCells:
Be advised that SkipEmptyCells now also requires ReaderSkipConfiguration values, making its interface similar to that of SkipEmptyRows. You may need to adapt your usages accordingly.

Thank you for your work!

Sadly, I was unable to test the new feature, or even get the new library version to run due to two breaking changes you introduced...

  1. ReaderConfiguration::setSkipEmptyCells() used to enforce a boolean argument, now enforces an integer constant, which results in an InvalidArgumentException with our old code. Not a major problem, but still undesirable for a minor version release.
  2. More importantly for us, ITST-21347 has changed the semantics of ReaderConfiguration::setReturnUnformatted(). In version 0.9, it did not affect date values, i.e. it could be used in conjunction with ReaderConfiguration::setReturnDateTimeObjects(false) and the format setters to receive ISO8601 strings for date values, which is what we did. In version 0.10, the unformatted directive now takes precedence and thus an internal Excel date floating point representation is returned. We can open a separate issue to discuss this further if desired.

1.: Yes, that's what I wanted to point out with the sidenote I added to the end of my previous comment. That integer should be a ReaderSkipConfiguration constant now. It behaves the same way the new option setSkipEmptyRows() does.

Regarding your versioning concerns, be advised that this library is currently using a version <v1. This implies slightly different semantics for the versioning pattern. In particular, it implies that all version updates should be considered unsafe.
To at least provide a bit of safety though, this library adheres to the behavior e.g. composer exposes when dealing with such versions, which is that the 2nd digit of the version is considered the "major" one. As such, in the context of this library, for breaking changes, said digit of the version number will be bumped, and for minor releases, the following ones.
(e.g.: v0.8.0 -> v0.9.0: major release, v0.8.0 -> v0.8.1: minor release)
The respective note can be found here: https://getcomposer.org/doc/articles/versions.md#caret-version-range-

Of course, should we decide to bump the version to v1.x.x at some point, the semantic versioning rules you already expected will begin to apply.

2.: Yes, that's a bug. : ) Thank you for testing and reporting, a fix is in the works.

The just-released v0.10.1 contains a fix for the aforementioned issue regarding the usage of returnUnformatted with date/time configuration options.

Yeah, seems to work fine now, at least as far as our own automated tests are concerned. Thanks again!