yehoshuadimarsky/bcpandas

Performance improvements to bool replacement

Opened this issue · 4 comments

Current behaviour is to run df.replace({True: 1, False: 0}) on all DataFrames passed, this can have quite a noticeable performance impact on large dataframes.

For example, on a 1.25M row, 8 column, 150MB DataFrame, running the replace takes roughly 15 seconds.

Some options to avoid this that seem sensible:

  • Only run replace on bool columns (or perhaps Object columns too?)
  • Allow the user to specify if they would like the replacement to go ahead via an argument or flag
  • Allow the user which columns should be replaced, defaulting to current behaviour

I think the first point is probably the most pragmatic, but happy to raise an MR with whatever change you see fit - this change would save quite a lot of time when loading large data sets into the database.

This is a good point, thanks for bringing it up. Yeah, go ahead and submit a pull request, let's see what we can do

can you provide a sample schema of the dataframe you are loading that takes "roughly 15 seconds"? I.e. the output of df.info()

I came across this and tried to recreate it, I've got a snippet which exhibits a similar time profile, the following runs in 12s on my M1 Macbook Pro:


import time
import numpy as np
import pandas as pd

def test_bool_replace():
    length = 1_000_000

    df = pd.DataFrame({
        'bool': [bool(x) for x in np.random.randint(0, 2, size=(length, 1))],
        'emptyNDArray': [np.random.rand(0, 0) for _ in range(length)],
    })

    start = time.time()
    df = df.replace({True: 1, False: 0})
    end = time.time()
    print(end - start)

I created another df before this which had 8 int columns and 1 million rows, that took about 0.5 seconds.

It seems the issue is caused by empty ndarrays (if I give them a non-zero size then the .replace seems to fail - this is perhaps a bug to be fixed too!).

Of course this example isn't that realistic, so getting @AaronCritchley 's original df format would be handy.

Using the above example, replacing

df = df.replace({True: 1, False: 0})

with

    bool_df = df.select_dtypes("bool")
    df[bool_df.columns] = bool_df.replace({True: 1, False: 0})

Brings the time down to 0.2s on the same M1 MacBook Pro.