os-climate/osc-ingest-tools

Missing(?) conversions in _p2smap

MichaelTiemannOSC opened this issue · 4 comments

In https://github.com/os-climate/osc-ingest-tools/blob/main/osc_ingest_trino/sqltypes.py _p2smap does not have a translation for the object dtype. Is it intentional to force upstream users to convert dtype to str before converting to SQL, or should we support object as a varchar?

Similarly, _p2smap does not support datetime64[ns]. Should upstream users be forced to do everything relative to UTC? Or should we allow unconverted timestamps? Or should we perform some kind of conversion which, if there is no local timezone variable set, throws the appropriate exception so that the upstream user can properly initialize their local timezone so that the ingest tools can convert the data as needed?

Having thought this through, I think it is dangerous to supply a default mapping for object - it is best practice to apply df.convert_dtypes() to make DF column types as specific as possible.

Regarding datetime64[ns], it is best practice to ensure that these are qualified with some time zone. It may further be best practice to require that timezone to be UTC. It is almost certainly bad to have timestamps that are not qualified by timezone, since that introduces significant ambiguity.

It is best practice to use the convert_dtypes() pandas DF method, so that pandas data types can be made as specific as possible.

However I have filed #3 to address the likely use case of occasionally needing to either add new mappings, or override existing mappings.

This should be now be fixable using the new typemap parameter to create_table_schema_pairs:
https://github.com/os-climate/osc-ingest-tools#adding-custom-type-mappings-to-create_table_schema_pairs

pip install osc-ingest-tools==0.1.1

Indeed. I'll take it from here.