pandas-dev/pandas

REGR: setting column with setitem should not modify existing array inplace


So consider this example of a small dataframe with a nullable integer column:

import numpy as np
import pandas as pd

def recreate_df():
    return pd.DataFrame({'int': [1, 2, 3], 'int2': [3, 4, 5],
                         'float': [.1, .2, .3],
                         'EA': pd.array([1, 2, None], dtype="Int64")
                        })

Assigning a new column with __setitem__ (df[col] = ...) normally does not even preserve the dtype:

In [2]: df = recreate_df() 
   ...: df['EA'] = np.array([1., 2., 3.]) 
   ...: df['EA'].dtype 
Out[2]: dtype('float64')

In [3]: df = recreate_df() 
   ...: df['EA'] = np.array([1, 2, 3]) 
   ...: df['EA'].dtype
Out[3]: dtype('int64')

When assigning a new nullable integer array, it of course keeps the dtype of the assigned values:

In [4]: df = recreate_df() 
   ...: df['EA'] = pd.array([1, 2, 3], dtype="Int64") 
   ...: df['EA'].dtype 
Out[4]: Int64Dtype()

However, in this case you now also have the tricky side effect that the assignment happens in place:

In [5]: df = recreate_df() 
   ...: original_arr = df.EA.array 
   ...: df['EA'] = pd.array([1, 2, 3], dtype="Int64") 
   ...: original_arr is df.EA.array  
Out[5]: True

I don't think this behaviour should depend on the values being set, and setitem should always replace the array of the ExtensionBlock.

With the current behaviour, you can unexpectedly alter the data with which you created the DataFrame. See also a different example of this using Categorical at the original PR that introduced it: #32831 (comment)
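A minimal sketch of that side-effect (on a version with the in-place behaviour, and assuming the constructor does not copy the extension array, as it did not at the time):

import pandas as pd

arr = pd.array([1, 2, None], dtype="Int64")
df = pd.DataFrame({"EA": arr})
df["EA"] = pd.array([1, 2, 3], dtype="Int64")

print(arr)  # on affected versions: [1, 2, 3] -- the original data was mutated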

cc @jbrockmendel

setitem should always replace the array of the ExtensionBlock

I'd be OK with this creating a new EB with a new EA; I'm not wild about having a Block get a new array.

xref #33198, in both cases AFAICT the issue involves _can_hold_element being too permissive.

I don't care about the Block (for an end user / from the EA perspective, the Block does not have any state you might want to preserve, while the actual array does), so creating a new block is certainly fine. Probably the cleanest API anyway (assigning new column = new Block, regardless of the data type)

@jbrockmendel would you be able to look into this?

probably not before 1.1; I'm expecting my pandas time to diminish and am focused on wrapping up frequencies and ops pushes ATM, will return to indexing after those

OK. Alternatively, could you take a minimal look at your original PR that caused this regression (#32831) to see if you have an idea how (broadly speaking) it could be solved there? That could help me get started on trying to fix this myself.

This is a regression in master, so a blocker for 1.1, IMO

I'll take a look

Alternatively, could you take a minimal look at your original PR that caused this regression (#32831) to see if you have an idea how (broadly speaking) it could be solved there? That could help me get started on trying to fix this myself.

In #32831 the behavior being addressed was that a new array was being pinned with ExtensionBlock.values = values, which didn't match the behavior of other Block subclasses, which do Block.values[locs] = values

I think the "make a new ExtensionBlock instead of calling eb.set" idea discussed above is likely the best bet in the short run. Longer-run, the base class implementation block.values[locs] = values becomes viable with 2D EAs; I haven't looked into whether that solves the issue at hand.

@jreback please don't move issues from the 1.1 milestone that other people have labeled as "blocker" without discussing it (or at least commenting that you changed it; changing a milestone doesn't trigger a notification)

@jorisvandenbossche this release is way behind; if you want to move it to 1.1.1, please do,
but we are moving forward and releasing the RC

there are way too many blockers that don't have PRs; if you want to put some up, great,
but if they don't have anything now then they aren't blockers at all

Things that are labeled as regressions / blockers need to be discussed. I personally rely on milestones to track what needs to be closed out before the release.

w.r.t. this specific issue, I think I'm OK with releasing the RC without it.

Things that are labeled as regressions / blockers need to be discussed. I personally rely on milestones to track what needs to be closed out before the release.

w.r.t. this specific issue, I think I'm OK with releasing the RC without it.

maybe so, but these need to be done ASAP; we cannot keep delaying things. So either remove the blocker or put a comment on WHY this is a blocker AND WHY it needs to be fixed for 1.1. Just because something is a regression does not mean it absolutely needs fixing for 1.1; there is 1.1.x of course, and blocking the entire release is silly. @pandas-dev/pandas-core

if there is NO PR up for an issue, I will remove the blocker labels this Wednesday.

@jreback can you comment on the issues when you're removing them?

sure

I need to clarify the expected/desired behavior. Using the example from the OP:

df = recreate_df()
ea_orig = df["EA"]
fa_orig = df["float"]

Doing either df['EA'] = np.array([1., 2., 3.]) or df['EA'] = np.array([.1, .2, .3]) does not alter ea_orig, while doing df['EA'] = pd.array([1, 2, 3], dtype="Int64") does.

The OP focuses on the EA column, but we get the same behavior if we set df["float"] = fa_orig * 2:

fa_orig_copy = fa_orig.copy()
df["float"] = fa_orig * 2

>>> (fa_orig == fa_orig_copy).all()
False

@jorisvandenbossche the OP focuses on the EA column, but would you want to change the behavior for non-EA columns too? (Changing the behavior for all columns is a one-line edit; I haven't run the full test suite yet, though.)

but would you want to change the behavior for non-EA columns too?

My expectation is that DataFrame.__setitem__ should add / replace whole columns, not mutate existing values.

I think this should differ from df.loc[:, "column"] = array, which should mutate inplace.
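A sketch of the proposed semantics (illustrative only; the in-place .loc path assumes the assigned values fit the column's existing dtype):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
backing = df["a"].to_numpy()

df["a"] = np.array([4.0, 5.0, 6.0])
# desired: the column is backed by a new array; `backing` is untouched

df.loc[:, "a"] = np.array([7.0, 8.0, 9.0])
# desired: the values are written into the column's existing array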

I haven't thought through it too much, but I think I'm comfortable making the change, calling it a bugfix. It'll be a behavior change, but I don't see an easy way to deprecate it, and I think it matches most people's mental model for DataFrames.

@TomAugspurger I tend to agree. @jorisvandenbossche @jreback how about you?

All it takes to do this is to remove the if blk.should_store(value) check in BlockManager.iset. This breaks 13 tests for me (haven't checked which are testing this specific behavior vs. coincidence).
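For context, an illustrative sketch (not actual pandas source) of the dispatch in question; make_new_block is a hypothetical stand-in for the real block-construction path:

def iset_sketch(blk, locs, value):
    if blk.should_store(value):
        # existing block can hold the values -> write into it in place;
        # removing this branch forces the replacement path below
        blk.values[locs] = value
        return blk
    return make_new_block(value)  # replacement: new block, new array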

xref #33198, which this does not fix

If we're OK making this behavior change, I think we should be OK with #33780

I agree with your analysis @jbrockmendel, so OK to make this change (but let's explain it in a whatsnew section)

Doing this breaks (among other things which I'm still tracking down) tests.internals.test_internals.test_get_bool_data and test_get_numeric_data. If we're OK with changing the expected behavior of those, then I think we should be OK with the behavior change to those two methods in #34918.

FYI #35369 also tracks back to #32831

cc @jreback @TomAugspurger @jorisvandenbossche is everyone OK with changing get_bool_data and get_numeric_data as necessary? If so, I'll continue with this branch.

what changes about those tests?

what changes about those tests?

Never mind, the changes needed for those turn out to be benign. I've got 8 more tests failing that I need to troubleshoot, then will make a PR.

I think our three options right now are

Of these I think we should go with #35271 for 1.1.0. It's the smallest change from 1.0.x.

Long-term, I think we want something like Brock's #35417, which gets us consistency. But that probably should wait for 2.x

But that probably should wait for 2.x

Thinking through this a bit more. Hopefully the whatsnew over at https://github.com/pandas-dev/pandas/pull/35417/files is clarifying, but this is probably OK to do in 1.2.

The bit about consistently assigning a new array regardless of dtype is the important part. I hope that not too many people are relying on the current behavior one way or another, given the inconsistency.

The bit about consistently assigning a new array regardless of dtype is the important part. I hope that not too many people are relying on the current behavior one way or another, given the inconsistency.

It has more consequences than just whether or not the column in question is overwritten, though.

One aspect I am thinking of / checking now is how this impacts consolidated blocks. Normally, assigning to an existing column (for a consolidated dtype) leaves the block structure intact:

In [1]: df = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])

In [2]: df._mgr
Out[2]: 
BlockManager
Items: Index(['a', 'b', 'c', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=10, step=1)
FloatBlock: slice(0, 4, 1), 4 x 10, dtype: float64

In [3]: block_values = df._mgr.blocks[0].values

In [4]: df['b'] = 0.0

In [5]: df._mgr
Out[5]: 
BlockManager
Items: Index(['a', 'b', 'c', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=10, step=1)
FloatBlock: slice(0, 4, 1), 4 x 10, dtype: float64

In [6]: df._mgr.blocks[0].values is block_values
Out[6]: True

In [7]: pd.__version__
Out[7]: '1.1.0'

While using the #35417 branch (current state at time of posting):

In [1]: df = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])    

In [2]: df._mgr      
Out[2]: 
BlockManager
Items: Index(['a', 'b', 'c', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=10, step=1)
FloatBlock: slice(0, 4, 1), 4 x 10, dtype: float64

In [3]: df['b'] = 0.0   

In [4]: df._mgr  
Out[4]: 
BlockManager
Items: Index(['a', 'b', 'c', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=10, step=1)
FloatBlock: [0 2 3], 3 x 10, dtype: float64
FloatBlock: slice(1, 2, 1), 1 x 10, dtype: float64

So this creates different blocks, and thus the assignment of one float column triggers a copy of all float columns (in this case it is actually already a copy due to the block layout; in some cases it might also be a slice, but once a subsequent step performs consolidation, it will become a copy anyway).

So this creates different blocks, and thus the assignment of one float column triggers a copy of all float columns

One option to avoid this copy would be to end up with three blocks, corresponding to df[["a"]], df[["b"]], and df[["c", "d"]], respectively. The first and third could contain views on the original array.

That certainly avoids the copy initially, but as also mentioned above, once a later step in your analysis performs consolidation, this will still result in a full copy due to the assignment.
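For what it's worth, a quick way to check whether sibling columns were copied by such an assignment (np.shares_memory compares the underlying buffers):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
a_before = df['a'].to_numpy()
df['b'] = 0.0

# True on 1.1.0 (block left intact); False on a branch that splits
# the block and copies the sibling columns
print(np.shares_memory(a_before, df['a'].to_numpy()))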

But that probably should wait for 2.x

Thinking through this a bit more. Hopefully the whatsnew over at https://github.com/pandas-dev/pandas/pull/35417/files is clarifying, but this is probably OK to do in 1.2.

I am personally not yet convinced that we should do this for 1.2:

  • It has API-breaking changes. It is of course difficult to assess the potential impact, but I think we should at least make a bit more effort to think this through before saying that this is OK for 1.2 and shouldn't wait until 2.0.
  • It introduces additional copies (and pandas is already copy-heavy) when assigning to a single column with consolidated blocks.

As I understand it, a large part of the motivation is the inconsistency in behaviour between different dtypes?
Personally, I would be fine with accepting this difference (if the difference is between consolidated blocks vs non-consolidated extension blocks). There are several things we are doing differently for the new extension dtypes, and assuming they will become the default in 2.0, I think it is fine to change the default behaviour that way, instead of already trying to align the behaviour for consolidated vs extension blocks right now.

It has API-breaking changes.

I consider the internal inconsistency to be a bug, plus there are reported bugs, e.g. #35731, caused by the current behavior (no doubt that could be fixed with some other patch, but it's better to get at the root of the problem)

removing milestone and blocker label

@jorisvandenbossche can you see if there is anything left to do here? AFAICT DataFrame.__setitem__ should now never modify the existing array inplace.
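A quick check of that, using the example from the OP (expected outputs on a version with the fix noted in the comments):

import pandas as pd

df = pd.DataFrame({"EA": pd.array([1, 2, None], dtype="Int64")})
orig = df["EA"].array
df["EA"] = pd.array([1, 2, 3], dtype="Int64")

print(orig is df["EA"].array)  # expected: False -- the array was replaced
print(orig)                    # expected: [1, 2, <NA>] -- not mutated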