scikit-learn-contrib/sklearn-pandas

transform() is not threadsafe

tatome opened this issue · 5 comments

self.transformed_names_ = []

The attribute DataFrameMapper.transformed_names_ is reassigned and then mutated during _transform(). That makes transform() not thread-safe, so a Pipeline containing a DataFrameMapper cannot safely be used from multiple threads.

I guess this can be quite easily resolved by changing

self.transformed_names_ += self.get_names(
    columns, transformers, Xt, alias)

to something like

self.transformed_names_.extend(
    self.get_names(columns, transformers, Xt, alias) 
)

Or am I mistaken that extend is threadsafe?
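As far as I can tell, a single `list.extend` call on a built-in list is atomic on standard CPython builds, but that alone would not help here: the reset `self.transformed_names_ = []` at the start of each call still lets two concurrent calls write into the same list. Below is a minimal sketch of that interleaving. `ToyMapper`, `transform_shared`, `transform_local` and `run_pair` are hypothetical names made up for illustration, not the real DataFrameMapper code, and the barrier is only there to force the interleaving deterministically.

```python
import threading


class ToyMapper:
    """Hypothetical stand-in for DataFrameMapper; not the library's code."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.transformed_names_ = []

    def transform_shared(self, barrier):
        # Pattern from the issue: reset the shared attribute, then grow it.
        self.transformed_names_ = []
        for col in self.columns:
            barrier.wait()  # make both threads reach this point together
            self.transformed_names_.extend([col])
        return self.transformed_names_

    def transform_local(self, barrier):
        # Alternative: accumulate in a local list, publish it with one assignment.
        names = []
        for col in self.columns:
            barrier.wait()
            names.extend([col])
        self.transformed_names_ = names
        return names


def run_pair(method_name, columns=("a", "b", "c")):
    """Call the given method from two threads at once; return the final attribute."""
    mapper = ToyMapper(columns)
    barrier = threading.Barrier(2, timeout=5)
    method = getattr(mapper, method_name)
    threads = [threading.Thread(target=method, args=(barrier,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mapper.transformed_names_


if __name__ == "__main__":
    # Both threads end up extending the same list, so names are duplicated.
    print(run_pair("transform_shared"))
    # Each thread builds its own list, so the attribute holds one clean copy.
    print(run_pair("transform_local"))
```

With the shared pattern both threads append to whichever list was assigned last, so every name shows up twice even though each individual `extend` is atomic. Building the list locally and assigning it once avoids that.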

I'm sorry, I think I misunderstood the issue initially. I've prepared a PR that should resolve the actual issue.

Hi @FlorisHoogenboom
Thanks for raising this issue. I took over the maintenance of sklearn-pandas and am going through all the old issues. I think this is an important issue and should be fixed. I've seen your PR and am happy to merge it. I'm wondering whether there is any way we can test this.

In general it is very hard to test these kinds of concurrency safety properties. I think the proposed MR at least fixes some very obvious problems by making those operations more atomic. I don't know of any way to validate programmatically (other than by deducing it from the operations performed) that this method is indeed thread-safe now.
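The closest thing to a test I can think of is a stress test: call transform() on one shared object from many threads and check that the names seen after each call match what a single-threaded call gives. A rough sketch follows. `LocalNamesMapper` and `stress_transform` are hypothetical names, the mapper is a toy stand-in and not DataFrameMapper, and the thread and call counts are arbitrary. A test like this can only expose a race if one happens to trigger; a clean run proves nothing.

```python
from concurrent.futures import ThreadPoolExecutor


class LocalNamesMapper:
    """Hypothetical stand-in that builds names locally, then assigns them once."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.transformed_names_ = []

    def transform(self, X):
        names = []
        for col in self.columns:
            names.append(f"{col}_out")
        # Single assignment: readers see either the old or the new full list.
        self.transformed_names_ = names
        return X


def stress_transform(mapper, X, expected, n_threads=8, n_calls=200):
    """Return every observed names list that differs from `expected`."""

    def call(_):
        mapper.transform(X)
        # Snapshot the shared attribute right after the call.
        return list(mapper.transformed_names_)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        observed = list(pool.map(call, range(n_calls)))
    return [names for names in observed if names != expected]


if __name__ == "__main__":
    mapper = LocalNamesMapper(["a", "b"])
    mismatches = stress_transform(mapper, X=None, expected=["a_out", "b_out"])
    print(f"{len(mismatches)} mismatching observations")
```

Run against an implementation that resets and grows the shared list in place, the same harness might occasionally report partial or duplicated lists, but it is not guaranteed to.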

okie. I will review the MR and merge it. Thanks for your submission.