Reference to line/byte from each segment
nerdoc opened this issue · 4 comments
Btw: It would be nice if each segment had a reference to the exact input line (byte position start/end) to ease debugging. That might also help handling malformed files. (I just swapped our self-written edifact parser against pydifact and it was a pretty pleasant experience. However exact byte positions was the one feature we lost in that process.)
Originally posted by @FelixSchwarz in #24 (comment)
This is not completely trivial. pydifact stores its data in lists, and does not create Token objects. This is mainly for performance reasons. Creating an own object for each Token would cause much more memory usage. But OTOH, it's not possible to store information additionally to the Token into the list, without breaking compatibility to all structures using Tokens and Segments. One approach would be, creating a separate list in parallel, where line information is stored with the matching index like in the token list. This seems a bit clumsy to me, but maybe is the most straightforward way to go.
Refactoring pydifact to use Token objects (with a linenumber/byte attribute is a no-go IMHO, as it uses to much overhead.
Even the "parallel list" approach could be made "switch-off"-able - to use less memory when parsing bigger EDI files, at the cost of one additional if clause in each iter.
I'll test that a bit.
OMG. Big mistake. I should read my own code. Tokens ARE in fact objects, so we can store line information in them. ;-)
I also like using namedtuple
. Previously our own parser used something like this
Segment = namedtuple('Segment', ('data', 'line_nr', 'offset_start', 'offset_end', 'raw_data'))
You could easily add some extra methods there for extra convenience like you do now for Token
.
Maybe it would be an option to skip the step of creating Segments as nested lists, and instead directly create the Segment objects using the factory.