data-apis/dataframe-api

`set_column_by_name` / `set_column`?

Closed this issue · 11 comments

Thanks for the work that you're doing here!

I would like to request the addition of DataFrame.set_item_by_name to the API.

From the demonstration at euroscipy last week, the ugliest part of the usage was adding columns by:

for col in cols:
    df.insert(len(df.get_column_names()), col)

This seems like an obvious omission, but I haven't found any documentation saying why it hasn't been included. Seems like it was discussed in a call (#138 (comment))?
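
For illustration, the requested method might reduce that loop to something like the following (both the name and the signature here are assumptions modelled on get_column_by_name, not part of the standard):

# hypothetical sketch - set_item_by_name is the requested addition,
# and this signature is a guess by analogy with get_column_by_name
for col in cols:
    df = df.set_item_by_name(col.name, col)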

thanks @ivirshup !

I've got a PR open to add DataFrame.update_column and DataFrame.update_columns - I think this is what you're after? #232

my general issue with set_column was that it tried to do two things:

  • add a column if not present (but where? not clear)
  • update a column

From the demonstration at euroscipy last week, the ugliest part of the usage was

😄 yeah I realised it was pretty "ugly" whilst preparing and giving the presentation - would

  • insert_column(loc, column)
  • update_column(column)

be less ugly in your opinion?
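
As a rough sketch, the loop from the demo would then become (names from the proposal above; exact behaviour is assumed, not finalised):

# assuming insert_column(loc, column) and update_column(column)
# as proposed above
for col in cols:
    df = df.insert_column(len(df.get_column_names()), col)  # append at the end

# and, for a column whose name already exists:
df = df.update_column(col)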

my general issue with set_column was that it tried to do two things:

  • add a column if not present (but where? not clear)
  • update a column

I guess I want the behavior you are describing 😅. But I also want there to be a single, obvious counterpart to DataFrame.get_column_by_name.

Personally, I think of dataframes as Mapping[str, Array] where every array has the same length. This model, which works well across the R and Python libraries I've used, would say "you are overwriting or inserting the column", and I don't think this has been confusing. Of course, many of these implementations have strange ways you can construct dataframes with non-unique column names, which leads to extremely confusing behavior:

e.g. confusing definitions of intersection and union:
In [1]: import pandas as pd

In [2]: pd.Index(["a", "a", "b"]).intersection(pd.Index(["a", "a", "b"]))
Out[2]: Index(['a', 'b'], dtype='object')

In [3]: pd.Index(["a", "a", "b"]).union(pd.Index(["a", "b", "a"]))
Out[3]: Index(['a', 'a', 'b'], dtype='object')

  • add a column if not present (but where? not clear)

I think most libraries do "key insertion order", but I don't really care since I'll be accessing columns by name anyways.
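
A plain Python dict is the mental model here: assignment either appends a new key at the end or overwrites an existing one in place, and iteration order stays stable.

d = {"a": 1, "b": 2}
d["c"] = 3   # new key: appended at the end
d["a"] = 10  # existing key: value replaced, position unchanged
print(list(d))  # ['a', 'b', 'c']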


I sort of see an argument for saying overwrite is different than insert and these should be distinct. I don't really see an argument for set by position, get by name. I'm also fine with writing:

if col.name not in df.get_column_names():
    df = df.set_column(col)  # or whatever other than position
else:
    raise SomeError(f"{col.name} was taken, be more original")

From the demonstration at euroscipy last week, the ugliest part of the usage was

😄 yeah I realised it was pretty "ugly"

I would actually like to rephrase from "the ugliest" to "the only ugly part".

Well... maybe the names could be shorter

Thanks for your thoughts, much appreciated

@kkraus14 has brought up the idea of unordered columns before - maybe we should do that? So then we'd have:

  • insert_column
  • update_column
  • get_column
  • drop_column

and corresponding plural versions.

Then there'd be no need for by_name, because accessing by name would be the only option

And get_column_names would return a set.

The name get_columns_by_name has been a source of annoyance for me, to be honest - I'd be glad if we could move away from it
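
To make the idea concrete, the name-only API might look like this in use (a sketch; the behaviour with unordered columns is assumed):

# sketch of the name-only API proposed above - with unordered
# columns, position never enters the picture
df = df.insert_column(col)
df = df.update_column(col)
column = df.get_column("a")
df = df.drop_column("a")
names = df.get_column_names()  # would return a set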

@kkraus14 has brought up the idea of unordered columns before - maybe we should do that?

I think I brought up a question about whether we had explicitly decided that columns were ordered or not. From my perspective, it's a common pattern to iterate over the columns in a dataframe, and returning column names as a set doesn't guarantee a consistent iteration order as far as I'm aware, which is undesired behavior.
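
For example, CPython salts str hashes per process, so the iteration order of a set of column names can differ between runs of the same script:

# run this twice: the printed order may differ between interpreter
# runs because str hashing is randomized per process (PYTHONHASHSEED)
print(list({"alpha", "beta", "gamma"}))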

My stance here is that columns should be both ordered and named. We could update the behavior of df.insert to default to inserting at the end of the dataframe?

We could update the behavior of df.insert to default to inserting at the end of the dataframe?

sounds good - you OK with the other renamings suggested?

@MarcoGorelli your proposal sounds right, but I agree that set's unstable iteration order is undesirable.

A case where order really matters: visual inspection of the table. It's quite annoying when:

df.head()
# a | b | c
# ...
df.write(store)
df = store.read()
df.head()
# c | a | b
# ...

But you don't need that much order awareness to accomplish this, basically just insertion order and stable iteration order.


We could update the behavior of df.insert to default to inserting at the end of the dataframe?

For insert_column to have a default location, the argument order would have to change as well, no? Does position become a keyword argument?
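
One possible signature, purely as an assumption to make the question concrete, would take the column first and make the location keyword-only with an end-of-frame default:

# hypothetical signature - not part of the standard
def insert_column(self, column, *, loc=None):
    """Insert column at position loc, or at the end if loc is None."""
    ...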


I would also like to make a case for a set_column that overwrites or inserts, but maybe I'll leave that to a separate issue?

Sure, let's keep columns ordered then, and default to inserting columns at the end. People can always reorder them with get_columns_by_name and get_column_names() anyway

for set_column - can't you do:

if col.name in df.get_column_names():
    df = df.update_column(col)
else:
    df = df.insert_column(col)

?
If this is meant to be a developer-focused API, I'd suggest forcing such explicitness

But I also don't really care too much about this particular topic - I'm much more concerned about Ibis and polars-lazy not being compatible with some column-related parts of the API. If you had any thoughts / inputs on #229, that would be appreciated

Just for my understanding, what's your use case for the dataframe standard?

Just for my understanding, what's your use case for the dataframe standard?

Context

I work on anndata. This library provides the AnnData object, which most tools in the scverse ecosystem (focussed on bioinformatics, in particular single-cell data analysis) use to interchange data. It also has hdf5 and zarr based storage formats. This object contains arrays and dataframe objects and keeps them aligned along two dimensions: observations and variables.

It's sort of like a 2-D xarray.Dataset specialized for exploratory analysis of high dimensional data, where your data looks like something that could be passed into a scikit-learn transformer. It's not actually an xarray.Dataset because we need support for sparse arrays and dataframes (for 1d annotations along the observations or variables, e.g. "cell type"). We're currently expanding the types of arrays we can work with, sometimes with help from the array-api, but not as often as we'd like since we have sparse data.

However, we currently only support pandas dataframes, and I'd like to change that. This initiative seems like a great way to do that.

What I'd like to be able to do with arbitrary dataframe types

The big operations we need are:

  • Creation, e.g.
df = DataFrame()

store: h5py.Group | zarr.Group = ...
df = Dataframe.from_dict(
    {k: v.decode() for k, v in store.items()}
)
  • Row based indexing
  • Concatenation

And ideally we'd be able to do these without having to specialize for each dataframe type, as we've been doing to support sparse arrays.


Eventually I would also like to support more dataframe types in our analysis libraries, but being able to handle them in our data container would need to happen first.

Thanks @ivirshup

Row based indexing

Could you expand on what you need please? Row labels are disallowed by the standard, would you be able to manage without?

Could you expand on what you need please? Row labels are disallowed by the standard, would you be able to manage without?

Yes. We just need positional indexing of rows. We can implement our own labels on top of that.
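
A minimal sketch of that approach, assuming only some positional row-selection method from the standard (get_rows here is hypothetical, as are the labels):

# labels live outside the dataframe; the standard only needs to
# provide positional row selection
df = ...  # some standard-compliant dataframe
labels = {"cell_1": 0, "cell_2": 1, "cell_3": 2}
wanted = ["cell_3", "cell_1"]
positions = [labels[label] for label in wanted]
df = df.get_rows(positions)  # hypothetical positional-indexing method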

ok that should be fine then, thanks!

How about

df = Dataframe.from_dict(
    {k: v.decode() for k, v in store.items()}
)

when you do v.decode() there, what object is that?

The standard doesn't have a DataFrame.from_dict method where you could create a dataframe from a dict of columns, but the namespace does have dataframe_from_dict. So you could do something like:

# get a standard-compliant dataframe and its namespace
df_standard = df.__dataframe_consortium_standard__()
namespace = df_standard.__dataframe_namespace__()
# build the new dataframe from standard-compliant columns
df = namespace.dataframe_from_dict(
    {k: v.decode().__column_consortium_standard__() for k, v in store.items()}
)