ankane/ruby-polars

Polars::Binary type

Tseyang opened this issue · 5 comments

Thanks for making this library! I'm playing around with it and trying to read a Parquet file that has a string column encoded as binary data. The resulting DataFrame shows the column as [binary data]:

[17] pry(main)> df[["content"]]
=> shape: (1035, 1)
┌───────────────┐
│ content       │
│ ---           │
│ binary        │
╞═══════════════╡
│ [binary data] │
│ [binary data] │
│ [binary data] │
│ [binary data] │
│ ...           │
│ [binary data] │
│ [binary data] │
│ [binary data] │
│ [binary data] │
└───────────────┘

I can't find a way to transform this data into its string representation, and everything I've tried (indexing into the Series, to_a.map, apply) suggests that this data type is not properly supported yet:

[18] pry(main)> df["content"][0]
thread '<unnamed>' panicked at 'not yet implemented', ext/polars/src/conversion.rs:164:37
fatal: not yet implemented
from /Users/tyl/.gem/ruby/3.2.1/gems/polars-df-0.3.1-arm64-darwin/lib/polars/series.rb:282:in `get_idx'

I'm wondering if there's an existing way to achieve what I want - to transform a Series holding Polars::Binary data into Polars::Utf8?

Thank you!

Hey @Tseyang, you can use series.cast(Polars::Utf8) for this.

Also, df["content"].to_a and df["content"][0] will work in the next release.

ah I see. Thanks for the fast turnaround!

@ankane sorry, I had another small question: does read_parquet accept raw bytes directly instead of a filename? E.g. if I have a means of obtaining the raw bytes for the Parquet data, can I create a DataFrame from them directly without writing to a Tempfile?

It didn't seem so from the function declaration but maybe I overlooked something.

You can pass an object that responds to read, like:

require "stringio"

io = StringIO.new("binary-data")
df = Polars.read_parquet(io)
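
For bytes that are already in memory, a minimal sketch of the wrapping step might look like the following. The helper name bytes_to_io is hypothetical (it is not part of polars-df), and the placeholder string stands in for real Parquet bytes. The read_parquet call is left commented out because it needs the gem and valid Parquet data.

```ruby
require "stringio"

# Hypothetical helper (not part of polars-df): wrap raw bytes in an
# in-memory IO object that responds to `read`.
def bytes_to_io(bytes)
  # Copy the input and tag it as binary so no text encoding is assumed.
  io = StringIO.new(bytes.dup.force_encoding(Encoding::BINARY))
  io.binmode
  io
end

# Placeholder bytes; in practice these would be the raw Parquet contents.
raw = "binary-data".b
io = bytes_to_io(raw)

# With polars-df installed and real Parquet bytes, the call from the
# answer above would then be:
# df = Polars.read_parquet(io)
```

Copying the input with dup leaves the caller's string untouched, and marking it as binary avoids treating the bytes as UTF-8 text.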