aws/aws-sdk-pandas

Timestamps not being saved correctly to arrow dataset

david-waterworth opened this issue · 2 comments

My source data is Polars, but I don't think that's causing this issue. I'm trying to write a partitioned (Hive-style) parquet dataset to S3 and have been experimenting with aws-sdk-pandas.

My code is

wr.s3.to_parquet(
    df.collect().to_pandas(use_pyarrow_extension_array=True),
    path="s3://data-science.cimenviro.com/watchtower/sites",
    dataset=True,
    mode="overwrite_partitions",
    partition_cols=["id"],
)

What I've noticed is that even though the datetime columns have dtype timestamp[ms, UTC], the written parquet files have this coerced to timestamp[ns]. I can see this by first enabling logging.DEBUG (a minimal example of that config follows the trace below): despite my data originally being Polars, the types are still correct when the Arrow table is created, i.e. my debug trace log shows

DEBUG:awswrangler.s3._write:Resolved pyarrow schema: 
task_id: int32
task_name: large_string
created_at: timestamp[us, tz=UTC]
deleted_at: timestamp[us, tz=UTC]
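
(For completeness, the schema trace above comes from the standard library logging module, roughly along these lines:)

import logging

# Show awswrangler's internal debug output, including the resolved pyarrow schema
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("awswrangler").setLevel(logging.DEBUG)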

So up until the parquet file(s) are written, the datatype is tz-aware. But if I read back the parquet file without using awswrangler, the datatype is now timestamp[ns] rather than timestamp[us, tz=UTC].
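
For reference, a minimal sketch of how the schema can be inspected outside awswrangler with pyarrow alone (the object key below is only a placeholder, not an actual file from my dataset):

import pyarrow.parquet as pq
from pyarrow import fs

# Read the schema of a single written partition file directly from S3,
# bypassing pandas/awswrangler entirely
s3 = fs.S3FileSystem()
schema = pq.read_schema(
    "data-science.cimenviro.com/watchtower/sites/id=1/part-0.parquet",  # placeholder key
    filesystem=s3,
)
print(schema)  # the timestamp columns come back as timestamp[ns], not timestamp[us, tz=UTC]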

I can also replicate this using awswrangler: if I pass dtype_backend="pyarrow" rather than dtype_backend="numpy_nullable", I get the same (non-round-tripped) datatypes (i.e. timestamp[ns] rather than timestamp[us, tz=UTC]):

wr.s3.read_parquet("s3://data-science.cimenviro.com/watchtower/test", dtype_backend="pyarrow").dtypes

This doesn't seem ideal. Even if awswrangler correctly round-trips pandas -> parquet -> arrow, the parquet files don't have the correct schema for non-pandas clients, so (unless I missed explicit timestamp coercion somewhere) there seems to be some reliance on pandas automatically converting timezone-naive datetimes to tz-aware (UTC) so that it appears to work?

I'm wondering though if I'm missing something here?

Hi @david-waterworth, coercion defaults are different between parquet versions, with [ms] being set as the default in SDK for pandas. Not sure why you are getting [ns], but I will run a test to check.

In the meantime, you can override coercion via coerce_timestamps (valid values are {None, 'ms', 'us'}):

wr.s3.to_parquet(
    df,
    path="s3://...",
    dataset=True,
    mode="overwrite_partitions",
    partition_cols=["id"],
    pyarrow_additional_kwargs={"coerce_timestamps": None},
)
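
To confirm the override worked, the read from the original report can be reused (sketch only, same placeholder path as above):

# Sketch: read the dataset back and check the timestamp resolution/timezone survived
wr.s3.read_parquet(
    "s3://...",
    dataset=True,
    dtype_backend="pyarrow",
).dtypes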

Addressed in #2953