frictionlessdata/datapackage-pipelines

Automatic datetime validation for incorrect format

cschloer opened this issue · 5 comments

Hi, I'm running Python 3.6.5 on Ubuntu 18.04. I'm upgrading a datapackage-pipelines project to v2.0 and running into problems. This might be related to a previous issue I opened (#132), but it is different and a bit more serious.

I have a column whose values use the format %Y-%m-%dT%H:%M:%SZ. I have not labeled it as a datetime type; it's just a string. However, when I run dump_to_path (or dump.to_path, for that matter), the column is somehow automatically interpreted as a datetime. That wouldn't necessarily be a problem, but the dumper then tries to match each value against the format %Y-%m-%dT%H:%M:%S, which obviously doesn't work because there is a Z at the end. The pipeline then crashes.
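
The mismatch can be reproduced outside the pipeline with the standard library alone. This is only a minimal sketch: it assumes the values are parsed in a way that behaves like `datetime.strptime`, which is not confirmed for dataflows, and the names `FORMAT_WITHOUT_Z`, `FORMAT_WITH_Z`, and `parses` are made up for the illustration.

```python
from datetime import datetime

# Format string reported in the failure (no trailing "Z").
FORMAT_WITHOUT_Z = '%Y-%m-%dT%H:%M:%S'
# Format the data actually uses (literal "Z" at the end).
FORMAT_WITH_Z = '%Y-%m-%dT%H:%M:%SZ'

value = '2017-03-03T11:24:32Z'


def parses(text, fmt):
    """Return True if `text` matches `fmt` exactly (hypothetical helper)."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        # strptime raises ValueError when unconverted data (here "Z") remains.
        return False


print(parses(value, FORMAT_WITHOUT_Z))
print(parses(value, FORMAT_WITH_Z))
```

The first call fails because the trailing Z is left over after the format is consumed; the second succeeds because the Z is part of the format.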

I've narrowed the bug down to just a two-step pipeline and a single-column dataset. Here is an example pipeline-spec.yaml (you will need to update the paths) and an example dataset (I can't attach .yaml or .csv file types :/). Note that I've tried both with and without validate as a parameter in the load step.

test-pipeline:
  title: Test Pipeline
  description: Possible bug
  pipeline:
  - run: load
    parameters: {
      from: PATH_TO_TEST_CSV,
      name: default,
      validate: false
    }
  - run: dump_to_path
    parameters: {
      out-path: PATH_TO_RESULT_FOLDER,
      validate: false
    }

TestDatetime
2017-03-03T11:24:32Z
2017-03-03T11:24:32Z
2017-03-03T11:24:32Z
2017-03-03T11:24:32Z
2017-03-03T11:24:32Z
2017-03-03T11:24:32Z
2017-03-03T11:24:32Z
2017-03-03T11:24:32Z
2018-03-03T11:24:32Z
2018-03-03T11:24:32Z
2018-03-03T11:24:32Z
2019-03-03T11:24:32Z
2019-03-03T11:24:32Z
2019-03-03T11:24:32Z
2020-03-03T11:24:32Z
2020-03-03T11:24:32Z
2020-03-03T11:24:32Z
2021-03-03T11:24:32Z
2021-03-03T11:24:32Z
2021-03-03T11:24:32Z
2022-03-03T11:24:32Z
2022-03-03T11:24:32Z
2018-03-03T11:24:32Z

This might actually be a problem for dataflows (https://github.com/datahq/dataflows).

I was able to find two lines within dataflows that have different template strings for dates:

./helpers/extended_json.py:9:DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S.%f'

and

./processors/dumpers/file_formats.py:9:DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S'
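
To illustrate why two different templates could matter, here is a small sketch showing that a value written with one of these format strings cannot be parsed with the other. It only uses `datetime.strftime` and `datetime.strptime`; whether these two constants ever meet on the same value inside dataflows is an assumption, and the constant and helper names below are hypothetical.

```python
from datetime import datetime

# Copied from the two lines above; the variable names are hypothetical.
EXTENDED_JSON_FORMAT = '%Y-%m-%dT%H:%M:%S.%f'
FILE_FORMATS_FORMAT = '%Y-%m-%dT%H:%M:%S'

moment = datetime(2017, 3, 3, 11, 24, 32)
with_fraction = moment.strftime(EXTENDED_JSON_FORMAT)
without_fraction = moment.strftime(FILE_FORMATS_FORMAT)


def can_parse(text, fmt):
    """Return True if `text` can be parsed with `fmt` (hypothetical helper)."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


# Each string round-trips with its own format but not with the other one.
print(can_parse(with_fraction, EXTENDED_JSON_FORMAT))
print(can_parse(with_fraction, FILE_FORMATS_FORMAT))
print(can_parse(without_fraction, EXTENDED_JSON_FORMAT))
```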

Hope that helps!

Thanks for reporting this @cschloer !

This pipeline spec worked for me:

test-pipeline:
  title: Test Pipeline
  description: Possible bug
  pipeline:
  - run: load
    parameters: {
      from: "bla.csv",
      name: default,
      validate: true
    }
  - run: dump_to_path
    parameters: {
      out-path: "bla3",
    }

I will look into the reason why your original pipeline fails.

Thanks!

It's interesting that setting validate: true actually made it work. I didn't even try that because I assumed it was defaulting to validate: true and thus failing.

roll commented

Hi @cschloer, is this one still relevant?