data-apis/dataframe-api

Typing `names` argument in `select_columns_by_name`

mroeschke opened this issue · 3 comments

Currently DataFrame.select_columns_by_name has names typed as names: Sequence[str].

I think the intention here is for Sequence to mean a non-scalar container of string labels, but Sequence[str] also matches a plain str.
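A minimal sketch of why this happens: str is itself registered as a Sequence, and iterating it yields one-character strings. `take_names` is a hypothetical stand-in for illustration, not the actual method.

```python
from collections.abc import Sequence


def take_names(names: Sequence[str]) -> list[str]:
    # Hypothetical stand-in: materialize the labels the way a
    # Sequence[str] parameter would typically be consumed.
    return list(names)


# A plain str passes an isinstance check against Sequence...
is_seq = isinstance("ab", Sequence)

# ...and iterating it yields its individual characters.
from_str = take_names("ab")
from_list = take_names(["ab"])
```

So a type checker accepts both calls, but they select different labels.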

hey - this looks correct to me

a str is a sequence of str. If you have single-letter column names and pass a string made up of those names, I'd expect it to work

and it does:

In [20]: pd.api.interchange.from_dataframe(
    ...:     pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7,8,9]})
    ...:     .__dataframe__()
    ...:     .select_columns_by_name('ab')
    ...: )
Out[20]:
   a  b
0  1  4
1  2  5
2  3  6

Ah okay I forgot about that possibility. Closing then

.select_columns_by_name('ab')

seems pretty ambiguous and bug-prone to me. I'd expect that to give me a single column named 'ab', not two columns named 'a' and 'b'. I think the intention was for this to be spelled ['a', 'b'].

Unfortunately I'm not sure there's a way to make this unambiguous in the annotation while still allowing all non-string sequences, and list[str] may be too restrictive. So I guess we have to leave it as is either way.
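For what it's worth, a runtime check could reject the bare-string case even if the annotation can't express it. This is only a sketch under the assumption that an implementation wants to fail loudly; `normalize_names` is a hypothetical helper, not part of the dataframe API.

```python
from collections.abc import Sequence


def normalize_names(names: Sequence[str]) -> list[str]:
    # Hypothetical helper: the annotation stays Sequence[str], but a bare
    # str is rejected at runtime so 'ab' is never read as ['a', 'b'].
    if isinstance(names, str):
        raise TypeError(
            f"expected a sequence of column names, got a single str: {names!r}"
        )
    return list(names)
```

With this, normalize_names(['a', 'b']) and normalize_names(('a', 'b')) both return ['a', 'b'], while normalize_names('ab') raises TypeError. The trade-off is that it changes the behaviour shown earlier in the thread, where 'ab' selects columns 'a' and 'b'.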