pandas-dev/pandas

BUG: `pivot_table` drops rows and columns despite values being non-`NaN`

Opened this issue · 9 comments

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd
import pandas._testing as tm


def test_pivot_table_index_and_column_with_nan() -> None:
    """Index and columns should exist if any non-null values.

    Input data
    ----------
        row  col  val
    0  NaN  0.0    0
    1  0.0  1.0    1
    2  1.0  2.0    2
    3  2.0  3.0    3
    4  3.0  NaN    4

    Expected output
    ---------------
    col  0.0  1.0  2.0  3.0  NaN
    row                         
    NaN  0.0  NaN  NaN  NaN  NaN
    0.0  NaN  1.0  NaN  NaN  NaN
    1.0  NaN  NaN  2.0  NaN  NaN
    2.0  NaN  NaN  NaN  3.0  NaN
    3.0  NaN  NaN  NaN  NaN  4.0
    """
    data = {
        "row": [None, *range(4)],
        "col": [*range(4), None],
        "val": range(5)
    }
    df = pd.DataFrame(data)
    actual = df.pivot_table(values="val", index="row", columns="col")
    e_index = [None, *range(4)]
    e_columns = [*range(4), None]
    e_data = np.zeros(shape=(5, 5))
    e_data.fill(np.NaN)
    np.fill_diagonal(a=e_data, val=range(5))
    expected = pd.DataFrame(
        data=e_data,
        index=pd.Index(data=e_index, name="row"),
        columns=pd.Index(data=e_columns, name="col")
    )
    tm.assert_frame_equal(left=actual, right=expected)  # fails

Issue Description

Rows and columns are unexpectedly dropped while creating a pivot table with a single NaN index label and a single NaN column label and no NaN values in the input data.

Input data

   row  col  val
0  NaN  0.0    0
1  0.0  1.0    1
2  1.0  2.0    2
3  2.0  3.0    3
4  3.0  NaN    4

Actual output

col  1.0  2.0  3.0
row               
0.0  1.0  NaN  NaN
1.0  NaN  2.0  NaN
2.0  NaN  NaN  3.0

Expected Behavior

The docs for dropna state:

Do not include columns whose entries are all NaN. If True, rows with a NaN value in any column will be omitted before computing margins.

Given that dropna=True by default, I'd expect all columns and rows to be present as each has at least one non-NaN value.

Expected output

col  0.0  1.0  2.0  3.0  NaN
row                         
NaN  0.0  NaN  NaN  NaN  NaN
0.0  NaN  1.0  NaN  NaN  NaN
1.0  NaN  NaN  2.0  NaN  NaN
2.0  NaN  NaN  NaN  3.0  NaN
3.0  NaN  NaN  NaN  NaN  4.0

Installed Versions

INSTALLED VERSIONS

commit : 3c93d06
python : 3.13.2
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : AMD64 Family 25 Model 116 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 3.0.0.dev0+2007.g3c93d06d64
numpy : 1.26.4
dateutil : 2.9.0.post0
pip : 25.0.1
Cython : 3.0.12
sphinx : 8.1.3
IPython : 9.0.2
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.3
blosc : None
bottleneck : 1.4.2
fastparquet : 2024.11.0
fsspec : 2025.3.0
html5lib : 1.1
hypothesis : 6.129.0
gcsfs : 2025.3.0
jinja2 : 3.1.6
lxml.etree : 5.3.1
matplotlib : 3.10.1
numba : 0.61.0
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
psycopg2 : 2.9.10
pymysql : 1.4.6
pyarrow : 19.0.1
pyreadstat : 1.2.8
pytest : 8.3.5
python-calamine : None
pytz : 2025.1
pyxlsb : 1.0.10
s3fs : 2025.3.0
scipy : 1.15.2
sqlalchemy : 2.0.39
tables : 3.10.2
tabulate : 0.9.0
xarray : 2024.9.0
xlrd : 2.0.1
xlsxwriter : 3.2.2
zstandard : 0.23.0
tzdata : 2025.1
qtpy : None
pyqt5 : None

Strangely, pivot almost produces the expected output (the columns are sorted with NaN first):

data = {
    "row": [None, *range(4)],
    "col": [*range(4), None],
    "val": range(5)
}
df = pd.DataFrame(data)
out = df.pivot(index="col", columns="row", values="val")
print(out)
col  NaN  0.0  1.0  2.0  3.0
row                         
NaN  NaN  0.0  NaN  NaN  NaN
0.0  NaN  NaN  1.0  NaN  NaN
1.0  NaN  NaN  NaN  2.0  NaN
2.0  NaN  NaN  NaN  NaN  3.0
3.0  4.0  NaN  NaN  NaN  NaN

It looks like dropna is used in multiple places inside __internal_pivot_table:

  1. In the groupby() which means that NaN keys (either in index (row) or columns (col)) will be dropped.
grouped = data.groupby(keys, observed=observed, sort=sort, dropna=dropna)
  1. For dropping all-NaN rows and columns
if dropna and ...
    agged = agged.dropna(how="all")

Due to the usage in 1 above, your result df doesn't contain cases where row or col is NaN. It looks like this behavior is not documented well.
Seems this is a known issue: #53521

@snitish

Thanks for the reply! What options should I explore to fix and/or improve this? I will look at #53521 to see if suggestions have already been made there as well.

I'll let @rhshadrach chime in

Indeed, this seems duplicative of #53521. For now I recommend using dropna=False followed by dropping all NA rows / columns in the output. Does that satisfy your use case?

@rhshadrach I was actually looking to keep all the NA rows and columns in the output as in the expected output above. dropna=False does this, but wondering if I should update the docs in pivot_table so users know that dropna=True does more than what it says in the current docs. Maybe something like the following:

    dropna : bool, default True
         Do not include columns whose entries are all NaN. If True,
-        rows with a NaN value in any column will be omitted before
-        computing margins.
+        - rows with a NaN value in any column will be omitted before computing margins
+        - index/column keys containing NA values will be dropped (see ``dropna`` parameter in ``pandas.DataFrame.groupby``)

Shared groupby docs for dropna

I'm +1 on adding to the docstring as suggested.

I'll link a PR

@rhshadrach PR has been linked