datopian/aircan

Use resource schema to set schema fields in data resource that is passed to Factory (and error if absent)

mbeilin opened this issue · 4 comments

When a user uploads a file with tabular data to the portal (CKAN), we want to guess the column names properly so that they can be passed to a DAG as part of the payload.

Acceptance

  • The column names of the uploaded file determine the schema_fields_array payload parameter (currently hard-coded).

Tasks

  • Add a mechanism for guessing column names from a tabular data file (currently CSV).

  • Populate the schema_fields_array payload parameter using the column-name guessing mechanism above.

  • Write tests to check the functionality.
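
A rough sketch of what the first two tasks could look like, using only the standard-library csv module. The function names `guess_column_names` and `build_payload` are hypothetical, and treating schema_fields_array as a plain list of column names is an assumption, not the actual aircan payload format:

```python
import csv
import io


def guess_column_names(csv_text, delimiter=","):
    """Return the first non-empty row of a CSV document as column names.

    Assumption: the first non-empty row is the header row.
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    for row in reader:
        # Skip blank lines and rows that contain only whitespace.
        if any(cell.strip() for cell in row):
            return [cell.strip() for cell in row]
    return []


def build_payload(csv_text):
    """Build a payload dict carrying the guessed column names.

    Hypothetical shape: only the schema_fields_array key is shown here.
    """
    return {"schema_fields_array": guess_column_names(csv_text)}


if __name__ == "__main__":
    sample = "id,name,price\n1,apple,0.5\n2,pear,0.7\n"
    print(build_payload(sample))
```

This reads only the header row, so it does not try to guess column types.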

Relates to #22

@mbeilin Would it make sense to just send the file to Airflow and let it infer the fields (and, further on, the types of these fields)? Or should we send the fields beforehand?

@hannelita in datapusher/xloader, header detection is implemented with messytables, so I just finished implementing a similar mechanism in our aircan-connector - PR.
We can try to detect headers this way, and if detection is aborted for any reason, we can send schema_fields_array as empty and let pandas do the job (pandas is the module that handles files in aircan, correct?).
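
To make the fallback concrete, here is a minimal sketch of the "detect, else send empty" idea. It uses the standard-library csv module in place of messytables, and the names `detect_headers` and `schema_fields_or_empty` are hypothetical, not taken from the aircan-connector PR:

```python
import csv
import io


def detect_headers(csv_text):
    """Return the header row, raising ValueError if it looks unusable.

    Assumption: the first row is the header and must have no empty names.
    """
    reader = csv.reader(io.StringIO(csv_text))
    first_row = next(reader, None)
    if first_row is None:
        raise ValueError("file is empty")
    headers = [cell.strip() for cell in first_row]
    if not headers or any(not name for name in headers):
        raise ValueError("header row is missing or has empty names")
    return headers


def schema_fields_or_empty(csv_text):
    """Return detected headers, or an empty list if detection fails.

    An empty list signals that downstream code should infer the fields.
    """
    try:
        return detect_headers(csv_text)
    except (csv.Error, ValueError):
        return []


if __name__ == "__main__":
    print(schema_fields_or_empty("a,b\n1,2\n"))
    print(schema_fields_or_empty(""))
```

An empty schema_fields_array would then leave field inference to whatever reads the file downstream.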

As proposed in #1, this is out of scope. We'll send this information in the request.