IntelPython/sdc

[Question] Can groupby agg scale in HPAT?

Hi,

I am trying to use HPAT to accelerate data science workloads, especially the ETL process.

The data frame I am using contains 21,721,922 rows and 45 columns. All the data entries use float64 dtype. There is no missing data after cleaning.

I put the following code into an HPAT-decorated function. It groups the data frame by "YEAR" and computes the mean of "INCTOT" for each year. I am timing the groupby-agg operation.

    t0 = time.time()
    tmp1 = df.groupby('YEAR')['INCTOT'].mean()
    tt = time.time() - t0
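A single measurement like this can mix one-time costs with steady-state time, so it may help to repeat the call. The following is only a minimal sketch: `time_repeated` and `workload` are hypothetical names, and `workload` is a stand-in for the groupby-agg call, not HPAT code.

```python
import statistics
import time


def time_repeated(fn, repeats=5):
    """Call fn() `repeats` times; return (first call, median of later calls) in seconds."""
    durations = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - t0)
    first = durations[0]
    later = durations[1:]
    # With a single run there are no later calls, so fall back to the first one.
    steady = statistics.median(later) if later else first
    return first, steady


def workload():
    # Hypothetical stand-in for the operation being measured.
    return sum(i * i for i in range(100_000))


first, steady = time_repeated(workload)
print(f"first call: {first:.6f} s, later calls (median): {steady:.6f} s")
```

If the first call is much slower than the later ones, the difference is one-time overhead rather than the cost of the operation itself.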

I am using a server with 2 x Intel(R) Xeon(R) E5-2699 v4 CPUs, which has 44 cores in total.

The results look like this:

The baseline uses pandas only, without HPAT.

| Num of cores | groupby-agg time (sec.) |
|--------------|-------------------------|
| baseline     | 0.227021694             |
| 1            | 1.437                   |
| 2            | 1.39                    |
| 3            | 1.398                   |
| 4            | 1.427                   |
| 11           | 1.51                    |
| 22           | 1.794                   |
| 44           | 2.838                   |

As the number of processes increases, the time spent on groupby-agg also increases. Since groupby-agg is a simple map-reduce pattern that should parallelize well, this observation is surprising to me.
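To make the map-reduce reading concrete, here is a minimal pure-Python sketch of a grouped mean: each chunk produces per-key (sum, count) pairs, and the pairs are then combined. The function names and the toy data are made up for illustration, and this is not how HPAT implements groupby.

```python
from collections import defaultdict


def partial_sums(keys, values):
    """Map step: per-key [sum, count] for one chunk of rows."""
    acc = defaultdict(lambda: [0.0, 0])
    for k, v in zip(keys, values):
        acc[k][0] += v
        acc[k][1] += 1
    return acc


def combine(partials):
    """Reduce step: merge per-chunk [sum, count] pairs and divide to get means."""
    total = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for k, (s, n) in part.items():
            total[k][0] += s
            total[k][1] += n
    return {k: s / n for k, (s, n) in total.items()}


def grouped_mean(keys, values, n_chunks):
    """Split rows into contiguous chunks, aggregate each, then combine."""
    size = max(1, -(-len(keys) // n_chunks))  # ceiling division, at least 1
    partials = [
        partial_sums(keys[i:i + size], values[i:i + size])
        for i in range(0, len(keys), size)
    ]
    return combine(partials)


# Toy data, not taken from the real dataset.
years = [1970, 1980, 1980, 1990, 1970, 1980]
income = [10.0, 20.0, 40.0, 5.0, 30.0, 60.0]
result = grouped_mean(years, income, n_chunks=3)
print(result)
```

In this formulation each chunk can be processed independently, and only one small (sum, count) pair per key per chunk has to be exchanged, which is why I expected the operation to scale.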

Second, even when we use only one core, applying HPAT gives a slowdown compared to pandas.

Below are the groupby-count results for my dataset. Note that each year has plenty of data entries, so there should be sufficient parallelism.

| YEAR | count   |
|------|---------|
| 1970 | 1486744 |
| 1980 | 8746006 |
| 1990 | 1906165 |
| 2000 | 2199860 |
| 2010 | 2494822 |

Am I missing something? Could you give some suggestions on how to accelerate the groupby-agg operation using HPAT?

Thank you so much.

Best regards,
Hongyuan Liu

Thank you @bigwater! We're currently working on groupby