Enhance initialization of Series with temporal types
I have a use case where I'm creating small `DataFrame`s in unit tests. I want to ensure that these `DataFrame`s have the same schema as the parquet files my product will be reading, so I'm attempting to be explicit about the types.
I initially attempted to use the `dtype` argument to `DataFrame#initialize`, but it appears that it does not support temporal types.
After that, I fell back to allowing the gem to infer the type, which it does well. However, it appears that it defaults the units to `ns` when using the constructor, but when I actually load the parquet file, it is using `us` units. I'm not sure whether or not this is actually important, but I am trying to be precise.
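For reference, the time unit only sets the resolution of the stored value: a Polars `Datetime` is a 64-bit integer count of units since the Unix epoch, so the same instant is stored as a different integer in `ns` and in `us`. A plain-Ruby sketch of that relationship (no Polars calls; the variable names are illustrative):

```ruby
# One instant with microsecond precision (the 7th argument to Time.utc is microseconds).
instant = Time.utc(2023, 5, 1, 12, 0, 0, 123_456)

# Integer counts since the Unix epoch at each resolution.
epoch_us = instant.to_i * 1_000_000 + instant.usec
epoch_ns = instant.to_i * 1_000_000_000 + instant.nsec

# Same instant, different integers: the ns value is the us value scaled by 1000.
puts epoch_ns == epoch_us * 1_000
```

The values agree after conversion, but a strict schema comparison would still treat a `ns` column and a `us` column as different types.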
In order to work around this, I ended up constructing the series as follows:
Polars::Series.new("timestamp", timestamps).dt.cast_time_unit("us")
This works, and I'm happy with this as a solution. However, the code that I use to create the `DataFrame`s is in a helper function, as we want to create different `DataFrame`s for different test cases. One test case we always write is for an empty file. In that case, the above code boils down to:
Polars::Series.new("timestamp", []).dt.cast_time_unit("us")
This unfortunately fails with:
thread '<unnamed>' panicked at 'expected duration or datetime', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/polars-plan-0.29.0/src/dsl/dt.rs:46:22
stack backtrace:
0: _rust_begin_unwind
1: core::panicking::panic_fmt
2: <F as polars_plan::dsl::expr::FunctionOutputField>::get_field
3: polars_plan::logical_plan::aexpr::schema::<impl polars_plan::logical_plan::aexpr::AExpr>::to_field
4: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
5: <polars_core::schema::Schema as core::iter::traits::collect::FromIterator<F>>::from_iter
6: core::iter::adapters::try_process
7: polars_plan::utils::expressions_to_schema
8: polars_plan::logical_plan::builder::prepare_projection
9: polars_plan::logical_plan::builder::LogicalPlanBuilder::project
10: polars_lazy::frame::LazyFrame::select
11: polars::lazyframe::RbLazyFrame::select
12: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
13: polars::init::anon
14: _vm_call_cfunc_with_frame
15: _vm_sendish
16: _vm_exec_core
17: _rb_vm_exec
18: _rb_ec_exec_node
19: _ruby_run_node
20: _main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
/Users/ajakowpa/.gem/ruby/3.2.1/gems/polars-df-0.5.0-arm64-darwin/lib/polars/lazy_frame.rb:837:in `select': expected duration or datetime (fatal)
from /Users/ajakowpa/.gem/ruby/3.2.1/gems/polars-df-0.5.0-arm64-darwin/lib/polars/lazy_frame.rb:837:in `select'
from /Users/ajakowpa/.gem/ruby/3.2.1/gems/polars-df-0.5.0-arm64-darwin/lib/polars/data_frame.rb:3678:in `select'
from /Users/ajakowpa/.gem/ruby/3.2.1/gems/polars-df-0.5.0-arm64-darwin/lib/polars/expr_dispatch.rb:19:in `method_missing'
from /Users/ajakowpa/.gem/ruby/3.2.1/gems/polars-df-0.5.0-arm64-darwin/lib/polars/date_time_name_space.rb:877:in `cast_time_unit'
from /Users/ajakowpa/Library/Application Support/JetBrains/RubyMine2023.1/scratches/polars.rb:17:in `<main>'
I've worked around this by doing:
series = Polars::Series.new("timestamp", timestamps)
series = series.dt.cast_time_unit("us") unless timestamps.empty?
This works, but it's a bit verbose, and the `DataFrame` has the wrong type (it doesn't actually matter in my use case because if there's no data, my code doesn't do anything, but I can imagine it being a problem if our production code attempted some kind of schema validation).
To summarize, I think I'm requesting two things:
- Allow temporal types and time units to be specified when creating a new `Series` (doing this via the `Series` initializer would be great, being able to use the `columns` argument of the `DataFrame` initializer would be even more awesome)
- Allow the `DateTimeNamespace` to be used for empty series.
Hi @simbasdad, thanks for the suggestions (and providing great context). For the first one, you can now do:
Polars::Series.new("timestamp", timestamps, dtype: Polars::Datetime.new("us")) # or ms, ns
and
Polars::DataFrame.new(..., schema: {"timestamp" => Polars::Datetime.new("us")})
For the second one, empty series will still have a type (`Float32`), and I'm not sure it makes sense to support `DateTimeNamespace` for `Float32` series only if they're empty (also, it wouldn't know if you want a `Duration` or `Datetime`).