ankane/ruby-polars

Enhance initialization of Series with temporal types

I have a use case where I'm creating small DataFrames in unit tests. I want to ensure that these DataFrames have the same schema as the parquet files my product will be reading, so I'm attempting to be explicit about the types.

I initially attempted to use the dtype argument to DataFrame#initialize, but it does not appear to support temporal types.

After that, I fell back to letting the gem infer the type, which it does well. However, the constructor appears to default the time unit to ns, whereas the column has a time unit of us when I actually load the parquet file. I'm not sure whether this difference actually matters, but I am trying to be precise.

In order to work around this, I ended up constructing the series as follows:

Polars::Series.new("timestamp", timestamps).dt.cast_time_unit("us")

This works, and I'm happy with this as a solution. However, the code that I use to create the DataFrames is in a helper function as we want to create different DataFrames for different test cases. One test case we always write is for an empty file. In that case, the above code boils down to:

Polars::Series.new("timestamp", []).dt.cast_time_unit("us")

This unfortunately fails with:

thread '<unnamed>' panicked at 'expected duration or datetime', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/polars-plan-0.29.0/src/dsl/dt.rs:46:22
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: <F as polars_plan::dsl::expr::FunctionOutputField>::get_field
   3: polars_plan::logical_plan::aexpr::schema::<impl polars_plan::logical_plan::aexpr::AExpr>::to_field
   4: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
   5: <polars_core::schema::Schema as core::iter::traits::collect::FromIterator<F>>::from_iter
   6: core::iter::adapters::try_process
   7: polars_plan::utils::expressions_to_schema
   8: polars_plan::logical_plan::builder::prepare_projection
   9: polars_plan::logical_plan::builder::LogicalPlanBuilder::project
  10: polars_lazy::frame::LazyFrame::select
  11: polars::lazyframe::RbLazyFrame::select
  12: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
  13: polars::init::anon
  14: _vm_call_cfunc_with_frame
  15: _vm_sendish
  16: _vm_exec_core
  17: _rb_vm_exec
  18: _rb_ec_exec_node
  19: _ruby_run_node
  20: _main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
/Users/ajakowpa/.gem/ruby/3.2.1/gems/polars-df-0.5.0-arm64-darwin/lib/polars/lazy_frame.rb:837:in `select': expected duration or datetime (fatal)
	from /Users/ajakowpa/.gem/ruby/3.2.1/gems/polars-df-0.5.0-arm64-darwin/lib/polars/lazy_frame.rb:837:in `select'
	from /Users/ajakowpa/.gem/ruby/3.2.1/gems/polars-df-0.5.0-arm64-darwin/lib/polars/data_frame.rb:3678:in `select'
	from /Users/ajakowpa/.gem/ruby/3.2.1/gems/polars-df-0.5.0-arm64-darwin/lib/polars/expr_dispatch.rb:19:in `method_missing'
	from /Users/ajakowpa/.gem/ruby/3.2.1/gems/polars-df-0.5.0-arm64-darwin/lib/polars/date_time_name_space.rb:877:in `cast_time_unit'
	from /Users/ajakowpa/Library/Application Support/JetBrains/RubyMine2023.1/scratches/polars.rb:17:in `<main>'

I've worked around this by doing:

series = Polars::Series.new("timestamp", timestamps)
series = series.dt.cast_time_unit("us") unless timestamps.empty?

This works, but it's a bit verbose, and the DataFrame has the wrong type in the empty case. (It doesn't actually matter in my use case, because if there's no data, my code doesn't do anything. I can imagine it being a problem if our production code attempted some kind of schema validation.)

To summarize, I think I'm requesting two things:

  • Allow temporal types and time units to be specified when creating a new Series (doing this via the Series initializer would be great, being able to use the columns argument of the DataFrame initializer would be even more awesome)
  • Allow the DateTimeNamespace to be used for empty series.

ankane commented

Hi @simbasdad, thanks for the suggestions (and providing great context). For the first one, you can now do:

Polars::Series.new("timestamp", timestamps, dtype: Polars::Datetime.new("us")) # or ms, ns

and

Polars::DataFrame.new(..., schema: {"timestamp" => Polars::Datetime.new("us")})

For the second one, empty series will still have a type (Float32), and I'm not sure it makes sense to support DateTimeNamespace for Float32 series only if they're empty (also, it wouldn't know if you want a Duration or Datetime).