IntelPython/sdc

Why does @sdc_overload_method use func_text instead of a normal Python function definition?

Closed this issue · 4 comments

I am currently digging into the SDC source code, trying to implement a limited version of pandas.DataFrame.apply() for myself.

While doing so, I found your implementation of pandas.DataFrame.head(). Why did you choose to implement the head overload method using func_text? Is there a reason a normal Python function definition would not work for SDC?


I think this 'strange' implementation is related to SDC's currently limited support for pandas.DataFrame(), which can only accept {'col_1': series_1, 'col_2': series_2, ...} as raw data. Is that right? Even so, isn't there a more conventional, equivalent way to achieve the same goal?

Another important issue is that the number of lines of Python code produced by the codegen function is proportional to the number of DataFrame columns. I suspect this hurts compilation performance significantly: if one column takes 1 second, 1k columns could take much longer. Not verified yet.
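To make the func_text pattern concrete, here is a minimal, illustrative sketch (not the actual SDC code) of how an overload can build a per-column implementation by generating source text and exec-ing it; the generated source unrolls one line per column, which is why its length scales with the column count. All names here (gen_head_impl, _head_impl) are hypothetical:

```python
def gen_head_impl(column_names):
    # Build source text with one slice expression per column name.
    # The names must be known at "compile time" of the overload.
    func_text = "def _head_impl(df, n=5):\n"
    func_text += "    return {\n"
    for name in column_names:
        func_text += f"        '{name}': df['{name}'][:n],\n"
    func_text += "    }\n"
    namespace = {}
    exec(func_text, {}, namespace)  # materialize the generated function
    return namespace["_head_impl"]

# Toy "DataFrame" as a dict of column name -> series-like list.
head_impl = gen_head_impl(["a", "b"])
df = {"a": [1, 2, 3, 4, 5, 6], "b": [10, 20, 30, 40, 50, 60]}
result = head_impl(df, 2)
# result == {'a': [1, 2], 'b': [10, 20]}
```

In real SDC the generated function operates on the typed internal DataFrame representation rather than a plain dict, but the text-generation mechanism is the same idea.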

I want to contribute a PR for pandas.DataFrame.apply(), a limited version of the original pandas one. rolling.apply() looks like a good reference for me.

I think this 'strange' implementation is related to SDC's currently limited support for pandas.DataFrame(), which can only accept {'col_1': series_1, 'col_2': series_2, ...} as raw data. Is that right?

@dlee992 Hi, yes, this is related. In general, series types can differ, so to infer the resulting DataFrame type a const mapping from column names to series data is needed (and, at least when this was written, Numba had no support for dicts with const literal names and heterogeneously typed values). You are right that this impacts compilation times, of course, but our tests showed that with recent improvements (namely #936) the DF constructor compiles quite fast (several minutes for a DF of ~500 columns).
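A plain-Python sketch of the typing problem described above (illustrative only, not actual Numba or SDC typing code): because each series may have a different dtype, the overload must resolve a constant mapping from literal column names to per-column types in order to construct the result type at compile time.

```python
# Constant name -> series-type mapping, as the overload would see it.
# The type strings here are purely illustrative placeholders.
column_types = {"col_1": "series(int64)", "col_2": "series(float64)"}

def infer_head_result_type(column_types):
    # head() preserves every column and its type, so the inferred
    # result type reuses the same literal name -> type mapping.
    # Without literal names, a generic loop could not produce a
    # statically known heterogeneous result type.
    return {name: typ for name, typ in column_types.items()}

result_type = infer_head_result_type(column_types)
# result_type == {'col_1': 'series(int64)', 'col_2': 'series(float64)'}
```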

I want to contribute a PR about pandas.dataframe.apply()

Any PRs are very welcome! There's an alternative to using exec for building the resulting DF data. You can refer to the example below, where df.drop is refactored via functions in sdc.functions.tuple_utils:
https://gist.github.com/kozlov-alexey/f29e8d2703789491e8e24e41de16536b
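The tuple-based alternative can be sketched in plain Python as follows. Instead of generating source text, columns are processed by recursion over a tuple, which a compiler like Numba can unroll so that each level handles exactly one (possibly differently typed) column. The function name take_head is hypothetical and is not the actual sdc.functions.tuple_utils API:

```python
def take_head(columns, n):
    # Recurse over a (possibly heterogeneous) tuple of columns.
    # Each recursion level slices one column, so no exec is needed;
    # an unrolling compiler sees one concrete column type per level.
    if not columns:
        return ()
    return (columns[0][:n],) + take_head(columns[1:], n)

# Toy columns of different element types.
cols = ([1, 2, 3, 4], [1.0, 2.0, 3.0, 4.0])
heads = take_head(cols, 2)
# heads == ([1, 2], [1.0, 2.0])
```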

@kozlov-alexey Hi, thank you very much!

I am reading your df.drop example; it is a very good hint.

I will try to mock one up for df.apply and, along the way, learn the underlying data structures of the SDC DataFrame.