Use resource schema to set schema fields in data resource that is passed to Factory (and error if absent)
mbeilin opened this issue · 4 comments
When a user uploads a file with tabular data to the portal (CKAN), we want to guess the column names properly, so that they can be passed as part of the payload to a DAG.
Acceptance
- The column names of the uploaded file determine the schema_fields_array payload parameter (currently hard-coded).
Tasks
- Add a mechanism for guessing column names from tabular data files (currently CSV).
- Populate the schema_fields_array payload parameter using the column-name guessing mechanism above.
- Write tests to check the functionality.
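As a rough illustration of the first task, here is a minimal sketch of guessing column names from CSV text. It uses only Python's standard `csv` module as a stand-in and is not the connector's actual implementation; the function name `guess_column_names` and the `sample_size` parameter are hypothetical, and it assumes the first row of the file is the header.

```python
import csv
import io


def guess_column_names(csv_text, sample_size=4096):
    """Return the column names found in the first row of CSV text.

    Hypothetical helper: assumes the first row is the header row.
    """
    sample = csv_text[:sample_size]
    try:
        # Guess the delimiter from a sample, limited to common candidates.
        dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    except csv.Error:
        # Sniffing can fail on small or irregular samples; fall back
        # to the default comma-separated dialect.
        dialect = csv.excel
    reader = csv.reader(io.StringIO(csv_text), dialect)
    header = next(reader, [])  # empty input yields an empty header
    return [name.strip() for name in header]


if __name__ == "__main__":
    print(guess_column_names("id,name,age\n1,Ann,30\n2,Bob,25\n"))
```

A test for the third task could feed small in-memory CSV strings like the one above and compare the returned list against the expected names.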
@mbeilin Would it make sense to just send the file to Airflow and let it infer the fields (or, further on, the types of these fields)? Or should we send the fields beforehand?
@hannelita According to datapusher/xloader, the header detection process is implemented with messytables, so I have just finished implementing a similar mechanism in our aircan-connector (PR).
We can try to detect headers this way and, if detection is aborted for any reason, send schema_fields_array as empty and let pandas do the job (pandas is the module that deals with files in aircan, correct?).
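The fallback described here could look roughly like the sketch below. It is only an illustration under stated assumptions: `build_payload` is a hypothetical name, the payload is assumed to be a plain dict, and schema_fields_array is assumed to be a list of column names. Header detection is done with the standard `csv` module rather than messytables.

```python
import csv
import io


def build_payload(csv_text):
    """Build a DAG payload dict (hypothetical shape) from CSV text.

    If header detection fails, schema_fields_array is left empty so the
    downstream step can infer the columns itself.
    """
    try:
        header = next(csv.reader(io.StringIO(csv_text)))
        fields = [name.strip() for name in header]
        # Treat a header with blank names as a failed detection.
        if not fields or any(not name for name in fields):
            fields = []
    except (StopIteration, csv.Error):
        # Empty or unparsable input: send no fields.
        fields = []
    return {"schema_fields_array": fields}


if __name__ == "__main__":
    print(build_payload("city,population\nOslo,700000\n"))
    print(build_payload(""))
```

Sending an empty list rather than raising keeps the upload flow working when detection fails, at the cost of deferring any error to the DAG side.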