ankane/ruby-polars

Cannot serialize when datetime column contains all nulls

Closed this issue · 4 comments

Hi 👋🏼

When querying a Postgres table that has a datetime column where every value is null, I cannot run describe on the dataframe or serialize it to anything other than a Ruby hash (I tried write_parquet and to_csv).

Here's the error when trying to run describe on a dataframe with only that column selected:

    irb(main):025:0> df.describe
    thread '<unnamed>' panicked at 'not implemented for dtype Object("object")', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/polars-core-0.31.1/src/series/ops/null.rs:76:17
    /usr/local/bundle/gems/polars-df-0.6.0-x86_64-linux/lib/polars/data_frame.rb:3910:in `mean': not implemented for dtype Object("object") (fatal)

It looks like when all rows are null, the column gets cast to the Object dtype, and that dtype then can't be serialized.
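
Here's a minimal way to reproduce it without Postgres (just a sketch; I'm assuming an all-nil Ruby array hits the same Object fallback as the query result does):

    require "polars-df"

    # A column containing only nils carries no type information, so the
    # dataframe falls back to the Object dtype.
    df = Polars::DataFrame.new({"created_at" => [nil, nil, nil]})

    df.dtypes                        # the column comes back as the Object dtype
    df.describe                      # raises: not implemented for dtype Object("object")
    df.write_parquet("out.parquet")  # fails for the same reason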


Hi @geclos, thanks for reporting! Addressed in the commit above.

Hey @ankane, thanks for your work!

FYI, I've noticed this behaviour happens with any column that is all nils. I ended up implementing a pre-cleaning step that casts any all-nil column to string so that Polars doesn't fall back to the Object type, since that type comes with so many constraints. Maybe it would be better to default to casting to str instead of object?

      def self.clean(payload)
        # All-nil columns get cast to the Object dtype by Polars, which causes
        # serialization issues when we try to persist these dataframes as
        # Parquet files. Replacing the nils with empty strings prevents this.
        payload.each_key do |key|
          # Array#any? is false for both empty and all-nil columns.
          next if payload[key].any?

          payload[key] = payload[key].map { '' }
        end

        payload
      end
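
It gets applied to the raw query result before the dataframe is built, roughly like this (DataCleaner is just an illustrative name for the class holding the method above):

    # Hypothetical wrapper around the clean step shown above.
    rows = { "id" => [1, 2, 3], "deleted_at" => [nil, nil, nil] }
    df = Polars::DataFrame.new(DataCleaner.clean(rows))

    df.dtypes  # the all-nil column now comes through as a string column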

Changed it to use the Polars::Null type in the commit above.
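
For anyone who wants to verify the new behaviour, a quick check would look something like this (the dtype reported assumes the Polars::Null change above):

    require "polars-df"

    df = Polars::DataFrame.new({"created_at" => [nil, nil, nil]})
    df.dtypes                        # the all-nil column is now the Null dtype
    df.write_parquet("out.parquet")  # should serialize without the Object error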

amazing!