pandas-dev/pandas

REGR: setting column with setitem should not modify existing array inplace


So consider this example of a small dataframe with a nullable integer column:

import numpy as np
import pandas as pd

def recreate_df():
    return pd.DataFrame({'int': [1, 2, 3], 'int2': [3, 4, 5],
                         'float': [.1, .2, .3],
                         'EA': pd.array([1, 2, None], dtype="Int64")
                        })

Assigning a new column with __setitem__ (df[col] = ...) normally does not even preserve the dtype:

In [2]: df = recreate_df() 
   ...: df['EA'] = np.array([1., 2., 3.]) 
   ...: df['EA'].dtype 
Out[2]: dtype('float64')

In [3]: df = recreate_df() 
   ...: df['EA'] = np.array([1, 2, 3]) 
   ...: df['EA'].dtype
Out[3]: dtype('int64')

When assigning a new nullable integer array, it of course keeps the dtype of the assigned values:

In [4]: df = recreate_df() 
   ...: df['EA'] = pd.array([1, 2, 3], dtype="Int64") 
   ...: df['EA'].dtype 
Out[4]: Int64Dtype()

However, in this case you now also have the tricky side effect that the assignment happens in place:

In [5]: df = recreate_df() 
   ...: original_arr = df.EA.array 
   ...: df['EA'] = pd.array([1, 2, 3], dtype="Int64") 
   ...: original_arr is df.EA.array  
Out[5]: True

I don't think this behaviour should depend on the values being set, and setitem should always replace the array of the ExtensionBlock.

With the current behaviour, you can unexpectedly alter the data with which you created the DataFrame. See also a different example of this using Categorical at the original PR that introduced it: #32831 (comment)
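A minimal sketch of that side-effect (on a version with the in-place behaviour, and assuming the constructor does not copy the extension array, as it did not at the time):

import pandas as pd

arr = pd.array([1, 2, None], dtype="Int64")
df = pd.DataFrame({"EA": arr})
df["EA"] = pd.array([1, 2, 3], dtype="Int64")

print(arr)  # on affected versions: [1, 2, 3] -- the original data was mutated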

cc @jbrockmendel

setitem should always replace the array of the ExtensionBlock

I'd be OK with this creating a new EB with a new EA; I'm not wild about having a Block get a new array.

xref #33198, in both cases AFAICT the issue involves _can_hold_element being too permissive.

I don't care about the Block (for an end user / from the EA perspective, the Block does not have any state you might want to preserve, while the actual array does), so creating a new block is certainly fine. Probably the cleanest API anyway (assigning new column = new Block, regardless of the data type)

@jbrockmendel would you be able to look into this?

probably not before 1.1; I'm expecting my pandas time to diminish and am focused on wrapping up frequencies and ops pushes ATM, will return to indexing after those

OK. Alternatively, could you take a minimal look at your original PR that caused this regression (#32831) to see if you have an idea how (broadly speaking) it could be solved there? That could help me get started on trying to fix this myself.

This is a regression in master, so a blocker for 1.1, IMO

I'll take a look

Alternatively, could you take a minimal look at your original PR that caused this regression (#32831) to see if you have an idea how (broadly speaking) it could be solved there? That could help me get started on trying to fix this myself.

In #32831 the behavior being addressed was that a new array was being pinned with ExtensionBlock.values = values, which didn't match the behavior of other Block subclasses, which do Block.values[locs] = values

I think the "make a new ExtensionBlock instead of calling eb.set" idea discussed above is likely the best bet in the short run. Longer-run, the base class implementation block.values[locs] = values becomes viable with 2D EAs; I haven't looked into whether that solves the issue at hand.

@jreback please don't move issues from the 1.1 milestone that other people have labeled as "blocker" without discussing it (or at least commenting that you changed it; changing a milestone doesn't trigger a notification)

@jorisvandenbossche this release is way behind; if you want to move it to 1.1.1, please do,
but we are moving forward and releasing the RC

there are way too many blockers that don't have PRs; if you want to put some up, great,
but if they don't have anything now then they aren't blockers at all

Things that are labeled as regressions / blockers need to be discussed. I personally rely on milestones to track what needs to be closed out before the release.

w.r.t. this specific issue, I think I'm OK with releasing the RC without it.

Things that are labeled as regressions / blockers need to be discussed. I personally rely on milestones to track what needs to be closed out before the release.

w.r.t. this specific issue, I think I'm OK with releasing the RC without it.

maybe so, but these need to be done ASAP; we cannot keep delaying things. So either remove the blocker or put a comment on WHY this is a blocker AND WHY it needs to be fixed for 1.1. Just because something is a regression does not mean it absolutely needs fixing for 1.1; there is 1.1.x of course, and blocking the entire release is silly. @pandas-dev/pandas-core

if there is NO PR up for an issue, I will remove the blocker labels this Wednesday.

@jreback can you comment on the issues when you're removing them?

sure

I need to clarify the expected/desired behavior. Using the example from the OP:

df = recreate_df()
ea_orig = df["EA"]
fa_orig = df["float"]

Doing either df['EA'] = np.array([1., 2., 3.]) or df['EA'] = np.array([.1, .2, .3]) does not alter ea_orig, while doing df['EA'] = pd.array([1, 2, 3], dtype="Int64") does.

The OP focuses on the EA column, but we get the same behavior if we set df["float"] = fa_orig * 2:

fa_orig_copy = fa_orig.copy()
df["float"] = fa_orig * 2

>>> (fa_orig == fa_orig_copy).all()
False

@jorisvandenbossche the OP focuses on the EA column, but would you want to change the behavior for non-EA columns too? (Changing the behavior for all columns is a one-line edit; I haven't run the full test suite yet, though.)

but would you want to change the behavior for non-EA columns too?

My expectation is that DataFrame.__setitem__ should add / replace whole columns, not mutate existing values.

I think this should differ from df.loc[:, "column"] = array, which should mutate inplace.
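A sketch of the proposed semantics (illustrative only; the in-place .loc path assumes the assigned values fit the column's existing dtype):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
backing = df["a"].to_numpy()

df["a"] = np.array([4.0, 5.0, 6.0])
# desired: the column is backed by a new array; `backing` is untouched

df.loc[:, "a"] = np.array([7.0, 8.0, 9.0])
# desired: the values are written into the column's existing array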

I haven't thought through it too much, but I think I'm comfortable making the change, calling it a bugfix. It'll be a behavior change, but I don't see an easy way to deprecate it, and I think it matches most people's mental model for DataFrames.

@TomAugspurger I tend to agree. @jorisvandenbossche @jreback how about you?

All it takes to do this is to remove the if blk.should_store(value) check in BlockManager.iset. This breaks 13 tests for me (haven't checked which are testing this specific behavior vs. coincidence).
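For context, an illustrative sketch (not actual pandas source) of the dispatch in question; make_new_block is a hypothetical stand-in for the real block-construction path:

def iset_sketch(blk, locs, value):
    if blk.should_store(value):
        # existing block can hold the values -> write into it in place;
        # removing this branch forces the replacement path below
        blk.values[locs] = value
        return blk
    return make_new_block(value)  # replacement: new block, new array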

xref #33198, which this does not fix

If we're OK making this behavior change, I think we should be OK with #33780

I agree with your analysis @jbrockmendel, so OK to make this change (but let's explain it in a whatsnew section)

Doing this breaks (among other things which I'm still tracking down) tests.internals.test_internals.test_get_bool_data and test_get_numeric_data. If we're OK with changing the expected behavior of those, then I think we should be OK with the behavior change to those two methods in #34918.

FYI #35369 also tracks back to #32831

cc @jreback @TomAugspurger @jorisvandenbossche is everyone OK with changing get_bool_data and get_numeric_data as necessary? If so, I'll continue with this branch.

what changes about those tests?

what changes about those tests?

Never mind, the changes needed for those turn out to be benign. I've got 8 more tests failing that I need to troubleshoot, then will make a PR.

I think our three options right now are

Of these I think we should go with #35271 for 1.1.0. It's the smallest change from 1.0.x.

Long-term, I think we want something like Brock's #35417, which gets us consistency. But that probably should wait for 2.x

But that probably should wait for 2.x

Thinking through this a bit more. Hopefully the whatsnew over at https://github.com/pandas-dev/pandas/pull/35417/files is clarifying, but this is probably OK to do in 1.2.

The bit about consistently assigning a new array regardless of dtype is the important part. I hope that not too many people are relying on the current behavior one way or another, given the inconsistency.

The bit about consistently assigning a new array regardless of dtype is the important part. I hope that not too many people are relying on the current behavior one way or another, given the inconsistency.

It has more consequences than just whether or not the column in question is overwritten, though.

One aspect I am thinking of / checking now is how this impacts consolidated blocks. Normally, assigning to an existing column (for a consolidated dtype) leaves the block structure intact:

In [1]: df = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])

In [2]: df._mgr
Out[2]: 
BlockManager
Items: Index(['a', 'b', 'c', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=10, step=1)
FloatBlock: slice(0, 4, 1), 4 x 10, dtype: float64

In [3]: block_values = df._mgr.blocks[0].values

In [4]: df['b'] = 0.0

In [5]: df._mgr
Out[5]: 
BlockManager
Items: Index(['a', 'b', 'c', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=10, step=1)
FloatBlock: slice(0, 4, 1), 4 x 10, dtype: float64

In [6]: df._mgr.blocks[0].values is block_values
Out[6]: True

In [7]: pd.__version__
Out[7]: '1.1.0'

While using the #35417 branch (current state at time of posting):

In [1]: df = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])    

In [2]: df._mgr      
Out[2]: 
BlockManager
Items: Index(['a', 'b', 'c', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=10, step=1)
FloatBlock: slice(0, 4, 1), 4 x 10, dtype: float64

In [3]: df['b'] = 0.0   

In [4]: df._mgr  
Out[4]: 
BlockManager
Items: Index(['a', 'b', 'c', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=10, step=1)
FloatBlock: [0 2 3], 3 x 10, dtype: float64
FloatBlock: slice(1, 2, 1), 1 x 10, dtype: float64

So this creates different blocks, and thus the assignment of one float column triggers a copy of all float columns (in this case it is actually already a copy due to the block layout; in some cases it might also be a slice, but once a subsequent step performs consolidation, it will become a copy anyway).

So this creates different blocks, and thus the assignment of one float column triggers a copy of all float columns

One option to avoid this copy would be to end up with three blocks, corresponding to df[["a"]], df[["b"]], and df[["c", "d"]], respectively. The first and third could contain views on the original array.

That certainly avoids the copy initially, but as also mentioned above, once a later step in your analysis performs consolidation, this will still result in a full copy due to the assignment.
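For what it's worth, a quick way to check whether sibling columns were copied by such an assignment (np.shares_memory compares the underlying buffers):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
a_before = df['a'].to_numpy()
df['b'] = 0.0

# True on 1.1.0 (block left intact); False on a branch that splits
# the block and copies the sibling columns
print(np.shares_memory(a_before, df['a'].to_numpy()))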

But that probably should wait for 2.x

Thinking through this a bit more. Hopefully the whatsnew over at https://github.com/pandas-dev/pandas/pull/35417/files is clarifying, but this is probably OK to do in 1.2.

I am personally not yet convinced that we should do this for 1.2:

  • It has API-breaking changes. It is of course difficult to assess the potential impact, but I think we should at least make a bit more effort to think this through before saying that this is OK for 1.2 and shouldn't wait until 2.0.
  • It introduces additional copies (and pandas is already copy-heavy) when assigning to a single column with consolidated blocks.

As I understand it, a large part of the motivation is the inconsistency in behaviour between different dtypes?
Personally, I would be fine with accepting this difference (if the difference is between consolidated blocks vs non-consolidated extension blocks). There are several things we are doing differently for the new extension dtypes, and assuming they will become the default in 2.0, I think it is fine to change the default behaviour that way, instead of already trying to align the behaviour for consolidated vs extension blocks right now.

It has API-breaking changes.

I consider the internal inconsistency to be a bug, plus there are reported bugs, e.g. #35731, caused by the current behavior (no doubt that could be fixed with some other patch, but it's better to get at the root of the problem)

removing milestone and blocker label

@jorisvandenbossche can you see if there is anything left to do here? AFAICT DataFrame.__setitem__ should now never modify the existing array inplace.
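A quick check of that, using the example from the OP (expected outputs on a version with the fix noted in the comments):

import pandas as pd

df = pd.DataFrame({"EA": pd.array([1, 2, None], dtype="Int64")})
orig = df["EA"].array
df["EA"] = pd.array([1, 2, 3], dtype="Int64")

print(orig is df["EA"].array)  # expected: False -- the array was replaced
print(orig)                    # expected: [1, 2, <NA>] -- not mutated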