pandas-dev/pandas-stubs

Boolean indexing doesn't work with subclass and TypeVar

Closed this issue · 2 comments

Describe the bug
As suggested in #908 (comment), I'm trying to use DataFrameT = TypeVar("DataFrameT", bound=DataFrame),
but with boolean indexing instead of a pipe.

To Reproduce

from typing import TypeVar, reveal_type

from pandas import DataFrame, Series


class SubDF(DataFrame):
    # https://pandas.pydata.org/pandas-docs/stable/development/extending.html#override-constructor-properties
    @property
    def _constructor(self):
        return SubDF

    @property
    def _constructor_sliced(self):
        return Series


sub = SubDF({'a': [1, 2, 3]})

DataFrameT = TypeVar("DataFrameT", bound=DataFrame)


def func(df: DataFrameT) -> DataFrameT:
    index = Series([True, False, True])

    df_ = df.loc[index]
    reveal_type(df_)

    return df_  # Type "DataFrame" is not assignable to return type "DataFrameT@func"


reveal_type(func(sub))

pyright:

  /workspaces/ng/repro.py:27:17 - information: Type of "df_" is "DataFrame"
  /workspaces/ng/repro.py:29:12 - error: Type "DataFrame" is not assignable to return type "DataFrameT@func"
    Type "DataFrame" is not assignable to type "DataFrameT@func" (reportReturnType)
  /workspaces/ng/repro.py:32:13 - information: Type of "func(sub)" is "SubDF"
1 error, 0 warnings, 2 informations 

Please complete the following information:

  • OS: Linux
  • OS version: 20.04.6
  • Python version: 3.12.2
  • Version of type checker: pyright 1.1.390
  • Version of installed pandas-stubs: 2.2.3.241126

Additional context

Repro inspired by:

Might be same root cause as:

Dr-Irv commented

It's not a .loc issue. You get a similar result with any DataFrame method that returns a DataFrame, e.g.:

def func2(df: DataFrameT) -> DataFrameT:
    df_ = df.query("x <= 10")

    return df_  # Type "DataFrame" is not assignable to return type "DataFrameT@func2"

reveal_type(func2(sub))

The type revealed is still correct (SubDF) in this case.

That particular example can be fixed by changing query() to return Self instead of DataFrame. I imagine we'd have to do the same for every method in class DataFrame that currently returns DataFrame, i.e., change each of them to return Self.

But loc is different, because it returns an instance of the class _LocIndexerFrame. I think that class would have to become generic, i.e., a subclass of Generic[_T], with Self passed in as the type parameter.

I tried this idea and it worked on your example.

PR with tests welcome.

Fixed in #1091