ankane/ruby-polars

Decimal types in Parquet files cannot be converted to Ruby

Closed · 1 comment

I have a Parquet file that contains a decimal column:

stringio = StringIO.new(File.binread("example.parquet"))
df = T.let(Polars.read_parquet(stringio), Polars::DataFrame)

puts df[["revenue"]]
shape: (438, 1)
┌────────────────┐
│ revenue        │
│ ---            │
│ decimal[.20,3] │
╞════════════════╡
│ 409.59         │
│ 72             │
│ 584.34         │
│ 5              │
│ …              │
│ 241.71         │
│ 15.11          │
│ 78.16          │
│ 147            │
└────────────────┘

When I try to convert the DataFrame to an array of hashes with to_hashes, it panics:

df.to_hashes
thread '<unnamed>' panicked at 'not yet implemented', ext/polars/src/conversion.rs:209:46
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic
   3: <polars::conversion::Wrap<polars_core::datatypes::any_value::AnyValue> as magnus::into_value::IntoValue>::into_value_with
   4: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
   5: <magnus::r_array::RArray as core::iter::traits::collect::FromIterator<T>>::from_iter
   6: polars::dataframe::RbDataFrame::row_tuple
   7: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
   8: polars::init::anon
   9: _vm_call_cfunc_with_frame
  10: _vm_sendish
  11: _vm_exec_core
  12: _rb_vm_exec
  13: _invoke_block_from_c_bh
  14: _rb_yield_values2
  15: _collect_i
  16: _invoke_block_from_c_bh
  17: _rb_yield_1
  18: _int_dotimes
  19: _vm_call0_body
  20: _rb_call0
  21: _rb_iterate0
  22: _rb_block_call_kw
  23: _vm_call0_body
  24: _rb_call0
  25: _rb_iterate0
  26: _rb_lambda_call
  27: _enum_collect
  28: _vm_call_cfunc_with_frame
  29: _vm_sendish
  30: _vm_exec_core
  31: _rb_vm_exec
  32: _rb_ec_exec_node
  33: _ruby_run_node
  34: _main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
~/.gem/ruby/3.2.1/gems/polars-df-0.5.0-arm64-darwin/lib/polars/data_frame.rb:763:in `row_tuple': not yet implemented (fatal)
	from ~/.gem/ruby/3.2.1/gems/polars-df-0.5.0-arm64-darwin/lib/polars/data_frame.rb:763:in `block in to_hashes'
	from ~/.gem/ruby/3.2.1/gems/polars-df-0.5.0-arm64-darwin/lib/polars/data_frame.rb:762:in `times'
	from ~/.gem/ruby/3.2.1/gems/polars-df-0.5.0-arm64-darwin/lib/polars/data_frame.rb:762:in `each'
	from ~/.gem/ruby/3.2.1/gems/polars-df-0.5.0-arm64-darwin/lib/polars/data_frame.rb:762:in `map'
	from ~/.gem/ruby/3.2.1/gems/polars-df-0.5.0-arm64-darwin/lib/polars/data_frame.rb:762:in `to_hashes'
	from ~/Library/Application Support/JetBrains/RubyMine2023.1/scratches/polars.rb:9:in `<main>'

It appears that this can be worked around by casting:

df = df.select(
  [
    Polars.col("revenue").cast(:f64),
  ],
)

This works, but it is inconvenient: the conversion happens in a base class that processes many files and is not schema-aware.
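For a base class that does not know column names in advance, one option is to find the decimal columns from the schema at runtime and cast only those. The following is a minimal sketch, not part of polars-df: `decimal_column_names` is a hypothetical helper, and matching on the dtype's string form is an assumption made so the check does not depend on how a given gem version represents data types. The commented usage assumes `df.schema` returns a hash of column name to dtype and that `with_columns` and `Polars::Float64` are available in the installed version. As with the cast above, going through Float64 may not preserve every decimal value exactly.

```ruby
# Hypothetical helper (not part of polars-df): given a schema hash of
# column name => dtype, return the names of columns whose dtype looks like
# a decimal. Matching on the string form of the dtype is an assumption.
def decimal_column_names(schema)
  schema.select { |_name, dtype| dtype.to_s.match?(/decimal/i) }.keys
end

# Assumed usage inside the base class, with a Polars::DataFrame in df:
#
#   names = decimal_column_names(df.schema)
#   unless names.empty?
#     df = df.with_columns(names.map { |n| Polars.col(n).cast(Polars::Float64) })
#   end

# Stand-in schema using plain strings, so the sketch runs without the gem.
example_schema = { "id" => "Int64", "revenue" => "Decimal(20, 3)", "region" => "Utf8" }
p decimal_column_names(example_schema)
```

Keeping the detection separate from the cast makes it easy to swap the target type later, for example once the installed version can return decimal values directly.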

ankane commented

Thanks @simbasdad! Improved support for the Decimal type in the commit above.

Note: Casting to f64 will lose precision.
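
To illustrate the precision point in plain Ruby, without polars: the sketch below round-trips a decimal value through Float (a 64-bit float, like f64) and compares it with the original BigDecimal. The helper name `float_round_trip` and the sample value are made up for the example; the value was chosen to fit 20 digits with 3 decimal places.

```ruby
require "bigdecimal"

# Hypothetical helper: convert a decimal string to Float (f64) and back
# to BigDecimal, to see whether the value survives the trip.
def float_round_trip(decimal_string)
  BigDecimal(BigDecimal(decimal_string).to_f.to_s)
end

# 20 significant digits with 3 decimal places. A 64-bit float cannot hold
# the fractional part at this magnitude, so the round trip changes the value.
exact = BigDecimal("12345678901234567.891")
puts(exact == float_round_trip("12345678901234567.891")) # false

# Small values are affected too once arithmetic is involved.
puts((0.1 + 0.2) == 0.3)                                              # false
puts((BigDecimal("0.1") + BigDecimal("0.2")) == BigDecimal("0.3"))    # true
```

For monetary columns such as revenue, this is why keeping values as BigDecimal on the Ruby side is usually preferable to a float cast when exact results matter.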