dbmdz/solr-ocrhighlighting

ALTO coordinates as xsd:float

mspalti opened this issue · 3 comments

First, thanks very much for this plugin! We're using it in a project to index ALTO content and retrieve word coordinates.

One issue we've noticed: a small number of our ALTO files express coordinates as floats, e.g. 158.0, rather than integers. I checked the ALTO standard and this appears to be allowed as of version 2.1 (February 20, 2014). Unfortunately, in the AltoPassageFormatter class, Integer.parseInt() throws an error when attempting to parse these non-integer coordinates.

I added a small helper function to the AltoPassageFormatter class that inspects the String before parsing and then calls either parseInt() or parseDouble(). Do you think this is valid? I'd be happy to submit a PR if that's useful.
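
For illustration, a helper of this kind might look roughly like the sketch below. This is not the code from the pull request; the class name `AltoCoordinateHelper` and the method `parseCoordinate` are hypothetical, and the decimal-point check is an assumption about how the String could be evaluated.

```java
// Hypothetical sketch; not the plugin's actual code.
final class AltoCoordinateHelper {
    private AltoCoordinateHelper() {}

    /**
     * Parses an ALTO coordinate attribute that may be written
     * as an integer ("158") or as a float ("158.0").
     */
    static int parseCoordinate(String raw) {
        String value = raw.trim();
        if (value.indexOf('.') >= 0) {
            // Float notation: parse as double, then truncate toward zero.
            return (int) Double.parseDouble(value);
        }
        // Plain integer notation.
        return Integer.parseInt(value);
    }
}
```

With this sketch, `AltoCoordinateHelper.parseCoordinate("158.0")` and `AltoCoordinateHelper.parseCoordinate("158")` both yield 158. It only looks for a decimal point, so other float spellings such as exponent notation would still fail in parseInt().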

A better solution might be to use parseDouble() and then cast to int.
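
A minimal sketch of that simpler variant, assuming truncation is acceptable for pixel coordinates (the class name `AltoCoordinateCast` and the method `toInt` are hypothetical):

```java
// Hypothetical sketch; not the plugin's actual code.
final class AltoCoordinateCast {
    private AltoCoordinateCast() {}

    /** Parses any numeric coordinate String as a double and truncates it to an int. */
    static int toInt(String raw) {
        // Double.parseDouble accepts both "158" and "158.0";
        // the cast discards the fractional part (truncation toward zero).
        return (int) Double.parseDouble(raw);
    }
}
```

This avoids inspecting the String at all. If rounding to the nearest pixel is preferred over truncation, `Math.round` could be used in place of the cast.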

I created a pull request with minor code changes. Please feel free to ignore it if you don't agree with the approach taken. Again, thanks for sharing the plugin!

Thanks for pointing this out and providing the fix :-) Keep the PRs coming if you notice anything else, we're very glad about contributions! There are certainly a lot more of those small corner cases lurking around!