BUG: `pivot_table` drops rows and columns despite values being non-`NaN`
Opened this issue · 9 comments
Pandas version checks
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import numpy as np
import pandas as pd
import pandas._testing as tm
def test_pivot_table_index_and_column_with_nan() -> None:
"""Index and columns should exist if any non-null values.
Input data
----------
row col val
0 NaN 0.0 0
1 0.0 1.0 1
2 1.0 2.0 2
3 2.0 3.0 3
4 3.0 NaN 4
Expected output
---------------
col 0.0 1.0 2.0 3.0 NaN
row
NaN 0.0 NaN NaN NaN NaN
0.0 NaN 1.0 NaN NaN NaN
1.0 NaN NaN 2.0 NaN NaN
2.0 NaN NaN NaN 3.0 NaN
3.0 NaN NaN NaN NaN 4.0
"""
data = {
"row": [None, *range(4)],
"col": [*range(4), None],
"val": range(5)
}
df = pd.DataFrame(data)
actual = df.pivot_table(values="val", index="row", columns="col")
e_index = [None, *range(4)]
e_columns = [*range(4), None]
e_data = np.zeros(shape=(5, 5))
e_data.fill(np.NaN)
np.fill_diagonal(a=e_data, val=range(5))
expected = pd.DataFrame(
data=e_data,
index=pd.Index(data=e_index, name="row"),
columns=pd.Index(data=e_columns, name="col")
)
tm.assert_frame_equal(left=actual, right=expected) # fails
Issue Description
Rows and columns are unexpectedly dropped while creating a pivot table with a single NaN
index label and a single NaN
column label and no NaN
values in the input data.
Input data
row col val
0 NaN 0.0 0
1 0.0 1.0 1
2 1.0 2.0 2
3 2.0 3.0 3
4 3.0 NaN 4
Actual output
col 1.0 2.0 3.0
row
0.0 1.0 NaN NaN
1.0 NaN 2.0 NaN
2.0 NaN NaN 3.0
Expected Behavior
The docs for dropna
state:
Do not include columns whose entries are all NaN. If True, rows with a NaN value in any column will be omitted before computing margins.
Given that dropna=True
by default, I'd expect all columns and rows to be present as each has at least one non-NaN
value.
Expected output
col 0.0 1.0 2.0 3.0 NaN
row
NaN 0.0 NaN NaN NaN NaN
0.0 NaN 1.0 NaN NaN NaN
1.0 NaN NaN 2.0 NaN NaN
2.0 NaN NaN NaN 3.0 NaN
3.0 NaN NaN NaN NaN 4.0
Installed Versions
INSTALLED VERSIONS
commit : 3c93d06
python : 3.13.2
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : AMD64 Family 25 Model 116 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 3.0.0.dev0+2007.g3c93d06d64
numpy : 1.26.4
dateutil : 2.9.0.post0
pip : 25.0.1
Cython : 3.0.12
sphinx : 8.1.3
IPython : 9.0.2
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.3
blosc : None
bottleneck : 1.4.2
fastparquet : 2024.11.0
fsspec : 2025.3.0
html5lib : 1.1
hypothesis : 6.129.0
gcsfs : 2025.3.0
jinja2 : 3.1.6
lxml.etree : 5.3.1
matplotlib : 3.10.1
numba : 0.61.0
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
psycopg2 : 2.9.10
pymysql : 1.4.6
pyarrow : 19.0.1
pyreadstat : 1.2.8
pytest : 8.3.5
python-calamine : None
pytz : 2025.1
pyxlsb : 1.0.10
s3fs : 2025.3.0
scipy : 1.15.2
sqlalchemy : 2.0.39
tables : 3.10.2
tabulate : 0.9.0
xarray : 2024.9.0
xlrd : 2.0.1
xlsxwriter : 3.2.2
zstandard : 0.23.0
tzdata : 2025.1
qtpy : None
pyqt5 : None
Strangely, pivot
almost produces the expected output (the columns are sorted with NaN
first):
data = {
"row": [None, *range(4)],
"col": [*range(4), None],
"val": range(5)
}
df = pd.DataFrame(data)
out = df.pivot(index="col", columns="row", values="val")
print(out)
col NaN 0.0 1.0 2.0 3.0
row
NaN NaN 0.0 NaN NaN NaN
0.0 NaN NaN 1.0 NaN NaN
1.0 NaN NaN NaN 2.0 NaN
2.0 NaN NaN NaN NaN 3.0
3.0 4.0 NaN NaN NaN NaN
It looks like dropna
is used in multiple places inside __internal_pivot_table
:
- In the
groupby()
which means thatNaN
keys (either inindex
(row) orcolumns
(col)) will be dropped.
grouped = data.groupby(keys, observed=observed, sort=sort, dropna=dropna)
- For dropping all-NaN rows and columns
if dropna and ...
agged = agged.dropna(how="all")
Due to the usage in 1 above, your result df doesn't contain cases where row
or col
is NaN
. It looks like this behavior is not documented well.
Seems this is a known issue: #53521
I'll let @rhshadrach chime in
Indeed, this seems duplicative of #53521. For now I recommend using dropna=False
followed by dropping all NA rows / columns in the output. Does that satisfy your use case?
@rhshadrach I was actually looking to keep all the NA rows and columns in the output as in the expected output above. dropna=False
does this, but wondering if I should update the docs in pivot_table
so users know that dropna=True
does more than what it says in the current docs. Maybe something like the following:
dropna : bool, default True
Do not include columns whose entries are all NaN. If True,
- rows with a NaN value in any column will be omitted before
- computing margins.
+ - rows with a NaN value in any column will be omitted before computing margins
+ - index/column keys containing NA values will be dropped (see ``dropna`` parameter in ``pandas.DataFrame.groupby``)
I'm +1 on adding to the docstring as suggested.
I'll link a PR
@rhshadrach PR has been linked