jeppe742/DeltaLakeReader

Predicates in JSON log files can contain spaces, which breaks the parsing

Closed this issue · 1 comment

The function _apply_partial_logs splits multiple JSON lines on whitespace characters. When a whitespace character occurs inside a string field, the log is split into invalid JSON fragments, which breaks the parsing of the log.

Example value that breaks the parser (this is the raw string, with escape characters included):

b'{"commitInfo":{"timestamp":1614330479493,"operation":"WRITE","operationParameters":{"mode":"Overwrite","partitionBy":"[\"Feature\",\"Period\"]","predicate":"Feature = 'xxx' AND Period = '1'"},"readVersion":49,"isBlindAppend":false,"operationMetrics":{"numFiles":"1","numOutputBytes":"667","numOutputRows":"2"}}}\n{"add":{"path":"Feature=xxx/Period=1/part-00000-b6191164-a8d8-4b53-9b6b-fb04a55bb5d8.c000.snappy.parquet","partitionValues":{"Feature":"xxx","Period":"1"},"size":667,"modificationTime":1614330479064,"dataChange":true}}\n'

Splitting on newline only ( \n ) would solve this issue but I don't know if that will have undesired side effects. I will see what I can do :).
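To make the difference concrete, here is a minimal sketch and not the library's actual code. It uses a trimmed-down version of the payload above (the path and most fields are shortened for readability), and `parse_log_lines` is a hypothetical helper name. It assumes each action sits on its own line, which holds for the example because JSON string values cannot contain a raw newline.

```python
import json

# Trimmed-down payload modelled on the example above: two actions,
# one per line, where the predicate string contains spaces.
raw = (
    b'{"commitInfo":{"operation":"WRITE","operationParameters":'
    b'{"predicate":"Feature = \'xxx\' AND Period = \'1\'"}}}\n'
    b'{"add":{"path":"Feature=xxx/Period=1/part-00000.snappy.parquet","size":667}}\n'
)


def parse_log_lines(payload: bytes) -> list:
    """Hypothetical helper: parse newline-delimited JSON, one action per line."""
    text = payload.decode("utf-8")
    # Split on "\n" only, and skip the empty piece after the trailing newline.
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Splitting on any whitespace cuts the predicate string into pieces,
# so the first fragment is no longer valid JSON.
whitespace_parts = raw.decode("utf-8").split()
try:
    json.loads(whitespace_parts[0])
    whitespace_split_ok = True
except json.JSONDecodeError:
    whitespace_split_ok = False

# Splitting on newlines keeps each action intact.
newline_actions = parse_log_lines(raw)
```

I used `split("\n")` rather than `str.splitlines()` here because `splitlines()` also breaks on characters such as U+2028, which may appear unescaped inside a JSON string.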

@Autom8edChaos, merged your PR and uploaded the new version to PyPI. Thanks for creating and fixing the issue 😄