data-apis/dataframe-api

Methods for casting from one dtype to another

MarcoGorelli opened this issue · 5 comments

Should the standard support casting columns from one dtype to another, for example Int64 -> Float64?

We could have:

  • DataFrame.cast_columns, which takes a mapping between column names and dtypes
  • Column.cast which accepts a single dtype
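For concreteness, the two methods could look roughly like the toy sketch below. Everything in it is a hypothetical stand-in rather than the standard's actual API: the DType, Int64 and Float64 markers, the Column and DataFrame classes, and the use of None for null.

```python
from dataclasses import dataclass
from typing import Mapping


class DType:
    """Hypothetical dtype marker; `python_type` converts a single value."""
    python_type: type


class Int64(DType):
    python_type = int


class Float64(DType):
    python_type = float


@dataclass
class Column:
    values: list  # None stands in for a null value
    dtype: type

    def cast(self, dtype: type) -> "Column":
        # Convert every non-null value; nulls pass through untouched.
        converted = [None if v is None else dtype.python_type(v) for v in self.values]
        return Column(converted, dtype)


@dataclass
class DataFrame:
    columns: dict  # column name -> Column

    def cast_columns(self, dtypes: Mapping[str, type]) -> "DataFrame":
        # Only columns named in the mapping are cast; the rest are kept as-is.
        unknown = set(dtypes) - set(self.columns)
        if unknown:
            raise KeyError(f"unknown columns: {sorted(unknown)}")
        return DataFrame({
            name: col.cast(dtypes[name]) if name in dtypes else col
            for name, col in self.columns.items()
        })


df = DataFrame({
    "a": Column([1, 2, None, 4], Int64),
    "b": Column([5, 6, 7, 8], Int64),
})
result = df.cast_columns({"a": Float64})
```

Here only column "a" changes dtype, and its null entry stays null after the cast.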

It seems to me like that will be needed sooner or later.

My initial expectation for the method name would be astype, a la pandas.DataFrame.astype and the astype method or function in array libraries and the array API standard. The _columns flavor at the dataframe level sounds good though; the question is whether it should take only a mapping or also a single dtype. The pandas version is overly flexible:

dtype: str, data type, Series or Mapping of column name -> data type

(and that description still misses that Python scalar types are also allowed). I don't see a need to support strings, columns or Python types. But a single dtype may be useful.
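A narrower argument check along these lines might be enough. This is only a sketch: the DType base class and the normalize_dtypes helper are made-up names for illustration, and dtypes are modelled as classes purely for convenience.

```python
from typing import Mapping


class DType:
    """Hypothetical base class for the standard's dtype objects."""


class Int64(DType):
    pass


class Float64(DType):
    pass


def _check_dtype(dtype):
    # Reject strings, Python scalar types and anything else that is not a DType.
    if not (isinstance(dtype, type) and issubclass(dtype, DType)):
        raise TypeError(f"expected a dtype object, got {dtype!r}")
    return dtype


def normalize_dtypes(column_names, dtypes):
    """Return a {column name: dtype} dict from a mapping or a single dtype."""
    if isinstance(dtypes, Mapping):
        return {name: _check_dtype(dtype) for name, dtype in dtypes.items()}
    # A single dtype is broadcast to every column.
    dtype = _check_dtype(dtypes)
    return {name: dtype for name in column_names}
```

With this, normalize_dtypes(["a", "b"], Float64) and normalize_dtypes(["a", "b"], {"a": Float64}) are both accepted, while "float64" or the Python type float raise TypeError.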

I presume that to_array_object would never upcast, right? And if there are null values, it's up to the caller to fill them appropriately.

Which means that if you start with [1, 2, null, 4] and want to end up with [1., 2., nan, 4.], then you'd need to do:

column.cast(namespace.Float64).fill_null(float('nan')).to_array_object()
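To make that chain concrete, here is a toy pure-Python sketch. The Column class, the Float64 stand-in, and the choice to raise when nulls remain are all assumptions made for illustration; the stdlib array type plays the role of the array object.

```python
import math
from array import array

Float64 = float  # hypothetical stand-in for namespace.Float64


class Column:
    def __init__(self, values):
        self.values = list(values)  # None stands in for null

    def cast(self, dtype):
        # Convert non-null values only; nulls are preserved.
        return Column(None if v is None else dtype(v) for v in self.values)

    def fill_null(self, value):
        return Column(value if v is None else v for v in self.values)

    def to_array_object(self):
        # Assumption: no implicit null handling, so remaining nulls are an error.
        if any(v is None for v in self.values):
            raise ValueError("column contains nulls; fill them before converting")
        return array("d", self.values)


column = Column([1, 2, None, 4])
result = column.cast(Float64).fill_null(float("nan")).to_array_object()
```

In this sketch, skipping the fill_null step raises ValueError instead of silently producing nan.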

I'd suggest:

  • mapping of column names to target dtypes for dataframe
  • single dtype for column

I also prefer astype for the name, although the proposal seems entirely sensible to me! And in my opinion, the less that is implicit about null handling here, the better. column.cast(namespace.Float64).to_array_object() should, in your example, yield [1., 2., null, 4.].

yield [1., 2., null, 4.]

Sorry, how would it yield null in an array library (which typically doesn't distinguish nan from null)? Do you mean [1., 2., nan, 4.]?

You're right, actually, given the to_array_object at the end. I merely meant not to close the door on libraries distinguishing them where possible in the column casting step (that is, up to column.cast(namespace.Float64)); a prominent example is polars Series, where null and nan are distinguished.
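The distinction can be illustrated without any dataframe library. In the sketch below, None stands in for null, and the helper names are invented for this example (this is not polars' API). The last helper assumes nulls become nan at the array boundary, whether the caller or the library does the filling.

```python
import math

# A float column holding both a nan and a null (None stands in for null).
values = [1.0, float("nan"), None, 4.0]


def is_null(values):
    # Null means "no value present".
    return [v is None for v in values]


def is_nan(values):
    # nan is a present float value that happens to be not-a-number.
    return [v is not None and math.isnan(v) for v in values]


def to_array_values(values):
    # At the array boundary the two collapse: null is replaced by nan.
    return [float("nan") if v is None else v for v in values]


collapsed = to_array_values(values)
```

At the column level, is_null and is_nan flag different positions; after to_array_values both positions hold nan and can no longer be told apart.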