
Polars::Binary type

Tseyang opened this issue · 5 comments

Thanks for making this library! I'm playing around with it and trying to read a Parquet file that has a string column encoded as binary data. The resulting DataFrame shows that column as [binary data]:

[17] pry(main)> df[["content"]]
=> shape: (1035, 1)
 [binary data] 
 [binary data] 
 [binary data] 
 [binary data] 
 [binary data] 
 [binary data] 
 [binary data] 
 [binary data] 

I can't seem to find a way to transform this data into its string representation, and everything I've tried (indexing into the Series, apply) seems to indicate that this data type is not properly supported yet:

[18] pry(main)> df["content"][0]
thread '<unnamed>' panicked at 'not yet implemented', ext/polars/src/
fatal: not yet implemented
from /Users/tyl/.gem/ruby/3.2.1/gems/polars-df-0.3.1-arm64-darwin/lib/polars/series.rb:282:in `get_idx'

I'm wondering if there's an existing way to achieve what I want - to transform a Series holding Polars::Binary data into Polars::Utf8?

Thank you!

Hey @Tseyang, you can use series.cast(Polars::Utf8) for this.

Also, df["content"].to_a and df["content"][0] will work in the next release.

ah I see. Thanks for the fast turnaround!

@ankane sorry I had another small question: I don't suppose read_parquet takes in Bytes directly vs. a filename? E.g. if I have a means of obtaining the raw bytes for the parquet data, can I directly create a DataFrame from it without writing to a Tempfile?

It didn't seem so from the function declaration but maybe I overlooked something.

You can pass an object that responds to read, like:

require "stringio"

io = StringIO.new("binary-data")
df = Polars.read_parquet(io)
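
For anyone landing here with raw bytes already in memory, here is a minimal sketch of the wrapping step, using only the Ruby standard library. The helper name `parquet_bytes_to_io` and the placeholder bytes are made up for illustration and are not part of polars-df. The `Polars.read_parquet(io)` call is left as a comment because it is taken from the reply above and needs the gem plus real Parquet data.

```ruby
require "stringio"

# Hypothetical helper (not part of polars-df): wrap raw Parquet bytes in an
# in-memory object that responds to `read`, so no Tempfile is needed.
def parquet_bytes_to_io(bytes)
  # String#b returns a copy tagged as binary (ASCII-8BIT), so the caller's
  # string is left untouched and no text encoding is applied to the data.
  StringIO.new(bytes.b)
end

# Placeholder bytes for illustration only; real input would be the full
# contents of a Parquet file obtained from a download, a database blob, etc.
raw = "PAR1 placeholder bytes"
io = parquet_bytes_to_io(raw)

puts io.respond_to?(:read)

# With polars-df installed and real Parquet bytes, as suggested above:
# df = Polars.read_parquet(io)
```

Tagging the copy as binary is a defensive choice: it stops Ruby from treating the payload as UTF-8 text if it is read back or inspected before being handed to the reader.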