Methods for casting from one dtype to another
MarcoGorelli opened this issue · 5 comments
Should the standard support casting columns from one dtype to another? Like, Int64 -> Float64?
Could have:
- DataFrame.cast_columns, which takes a mapping between column names and dtypes
- Column.cast, which accepts a single dtype
It seems to me like that will be needed sooner or later.
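To make the two proposed shapes concrete, here is a minimal sketch in plain Python. Everything in it is a stand-in for illustration, not the standard's API: the toy Column and DataFrame classes, the use of None for null, and the Int64/Float64 names (bound here to the built-in int and float) are all assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping, Tuple

# Hypothetical dtype markers: in this toy, a dtype is just a value converter.
Int64: Callable[[Any], Any] = int
Float64: Callable[[Any], Any] = float


@dataclass(frozen=True)
class Column:
    values: Tuple[Any, ...]
    dtype: Callable[[Any], Any]

    def cast(self, dtype: Callable[[Any], Any]) -> "Column":
        """Convert to a single target dtype; nulls (None) pass through unchanged."""
        converted = tuple(None if v is None else dtype(v) for v in self.values)
        return Column(converted, dtype)


@dataclass(frozen=True)
class DataFrame:
    columns: Mapping[str, Column]

    def cast_columns(self, dtypes: Mapping[str, Callable[[Any], Any]]) -> "DataFrame":
        """Cast only the columns named in the mapping; leave the rest as they are."""
        unknown = set(dtypes) - set(self.columns)
        if unknown:
            raise KeyError(f"unknown columns: {sorted(unknown)}")
        return DataFrame(
            {
                name: col.cast(dtypes[name]) if name in dtypes else col
                for name, col in self.columns.items()
            }
        )


df = DataFrame({"a": Column((1, 2, None, 4), Int64), "b": Column((5, 6, 7, 8), Int64)})
out = df.cast_columns({"a": Float64})
print(out.columns["a"].values)  # (1.0, 2.0, None, 4.0)
```

Column "b" is not in the mapping, so it keeps its original dtype; raising on unknown column names is a choice made for this sketch only.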
My initial thought/expectation for the method name would be astype, a la pandas.DataFrame.astype and the method or function in array libraries/standard. The _columns flavor at the dataframe level sounds good though; the question is whether it should only take a mapping or also a dtype. The pandas version is overly flexible:
dtype: str, data type, Series or Mapping of column name -> data type
(and that description still misses that Python scalar types are also allowed). I don't see a need to support strings, columns or Python types. But a single dtype may be useful.
I presume that to_array_object would never upcast, right? And if there are null values, it's up to the caller to fill them appropriately.
Which means that if you start with [1, 2, null, 4] and want to end up with [1., 2., nan, 4.], then you'd need to do:
column.cast(namespace.Float64).fill_null(float('nan')).to_array_object()
I'd suggest:
- mapping of column names to target dtypes for dataframe
- single dtype for column
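The cast-then-fill chain from this comment can be sketched in plain Python as follows. This is a toy model under stated assumptions: the Column class is hypothetical, None stands for null, the built-in float stands in for namespace.Float64, and raising when nulls remain is only one possible reading of "it's up to the caller to fill them".

```python
import math
from typing import Any, Callable, List, Optional


class Column:
    """Toy column: a list of values in which None stands for null."""

    def __init__(self, values: List[Optional[Any]]) -> None:
        self.values = list(values)

    def cast(self, convert: Callable[[Any], Any]) -> "Column":
        # Convert non-null entries only; nulls stay null.
        return Column([None if v is None else convert(v) for v in self.values])

    def fill_null(self, value: Any) -> "Column":
        # Replace every null with the caller-supplied value.
        return Column([value if v is None else v for v in self.values])

    def to_array_object(self) -> List[Any]:
        # Assumption for this sketch: never upcast and never pick a fill value
        # implicitly, so remaining nulls are an error.
        if any(v is None for v in self.values):
            raise ValueError("column contains nulls; call fill_null first")
        return list(self.values)


column = Column([1, 2, None, 4])
# float stands in for namespace.Float64
result = column.cast(float).fill_null(float("nan")).to_array_object()
print(result)  # [1.0, 2.0, nan, 4.0]
```

Skipping either step fails in this model: without the cast the filled column would mix ints with a float nan, and without fill_null the export raises.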
I also prefer astype for the name, although the proposal seems entirely sensible to me! And the less that is implicit the better here regarding the null handling, in my opinion. column.cast(namespace.Float64).to_array_object() should, in your example, yield [1., 2., null, 4.].
yield [1., 2., null, 4.]
Sorry, how would it yield null in an array library (which typically doesn't distinguish nan and null)? Do you mean [1., 2., nan, 4.]?
You're right actually, given the to_array_object at the end. I meant merely not to close the door for libraries to distinguish them, if possible, in the column casting (up to column.cast(namespace.Float64)), a prominent example being polars Series, where null and nan are distinguished.
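A small plain-Python illustration of the distinction under discussion, as a sketch only: None models null, float("nan") models nan, and the helper names (cast_to_float, null_mask, nan_mask, to_array_like) are hypothetical and do not come from any library.

```python
import math
from typing import List, Optional


def cast_to_float(values: List[Optional[int]]) -> List[Optional[float]]:
    # Casting converts present values and leaves null (None) as null.
    return [None if v is None else float(v) for v in values]


def null_mask(col: List[Optional[float]]) -> List[bool]:
    return [v is None for v in col]


def nan_mask(col: List[Optional[float]]) -> List[bool]:
    # In this toy, a null entry is reported as "not nan".
    return [v is not None and math.isnan(v) for v in col]


def to_array_like(col: List[Optional[float]]) -> List[float]:
    # A plain float array has no null, so null is mapped to nan here.
    return [float("nan") if v is None else v for v in col]


# One missing value (null) from the input, plus one explicit nan appended.
col = cast_to_float([1, 2, None, 4]) + [float("nan")]

print(null_mask(col))  # [False, False, True, False, False]
print(nan_mask(col))   # [False, False, False, False, True]

arr = to_array_like(col)
print([math.isnan(v) for v in arr])  # [False, False, True, False, True]
```

Up to the cast, the two kinds of missing value remain separable; only the export to an array-like collapses them, which is why the fill step is worth keeping explicit.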