Performance improvements to bool replacement
Opened this issue · 4 comments
Current behaviour is to run df.replace({True: 1, False: 0})
on all DataFrames passed, this can have quite a noticeable performance impact on large dataframes.
For example, on a 1.25M row, 8 column, 150MB DataFrame, running the replace takes roughly 15 seconds.
Some options to avoid this that seem sensible:
- Only run replace on bool columns (or perhaps Object columns too?)
- Allow the user to specify if they would like the replacement to go ahead via an argument or flag
- Allow the user which columns should be replaced, defaulting to current behaviour
I think the first point is probably the most pragmatic, but happy to raise an MR with whatever change you see fit - this change would save quite a lot of time when loading large data sets into the database.
This is a good point, thanks for bringing it up. Yeah, go ahead and submit a pull request, let's see what we can do
can you provide a sample schema of the dataframe you are loading that takes "roughly 15 seconds"? I.e. the output of df.info()
I came across this and tried to recreate it, I've got a snippet which exhibits a similar time profile, the following runs in 12s on my M1 Macbook Pro:
import time
import numpy as np
import pandas as pd
def test_bool_replace():
length = 1_000_000
df = pd.DataFrame({
'bool': [bool(x) for x in np.random.randint(0, 2, size=(length, 1))],
'emptyNDArray': [np.random.rand(0, 0) for _ in range(length)],
})
start = time.time()
df = df.replace({True: 1, False: 0})
end = time.time()
print(end - start)
I created another df before this which had 8 int columns and 1 million rows, that took about 0.5 seconds.
It seems the issue is caused by empty ndarrays (if I give them a non-zero size then the .replace
seems to fail - this is perhaps a bug to be fixed too!).
Of course this example isn't that realistic, so getting @AaronCritchley 's original df format would be handy.
Using the above example, replacing
df = df.replace({True: 1, False: 0})
with
bool_df = df.select_dtypes("bool")
df[bool_df.columns] = bool_df.replace({True: 1, False: 0})
Brings the time down to 0.2s on the same M1 MacBook Pro.