[Question] Can groupby agg scale in HPAT?
Hi,
I am trying to use HPAT to accelerate data science workloads, especially the ETL process.
The data frame I am using contains 21,721,922 rows and 45 columns. All the data entries use float64 dtype. There is no missing data after cleaning.
I put the following code into an HPAT-decorated function. It groups the data frame by "YEAR" and calculates the mean of "INCTOT" for each year. I am measuring the execution time of the groupby-agg operation.
    t0 = time.time()
    tmp1 = df.groupby('YEAR')['INCTOT'].mean()
    tt = time.time() - t0
I am using a server with two Intel(R) Xeon(R) E5-2699 v4 CPUs, which has 44 cores in total.
The results are shown below.
The baseline uses Pandas only, without HPAT.
Number of cores | groupby-agg time (s) |
---|---|
baseline | 0.227021694 |
1 | 1.437 |
2 | 1.39 |
3 | 1.398 |
4 | 1.427 |
11 | 1.51 |
22 | 1.794 |
44 | 2.838 |
We observe that the time spent on groupby-agg increases as the number of processes increases. Since groupby-agg is a simple map-reduce pattern that should parallelize well, this result is surprising to me.
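To make concrete what I mean by a map-reduce pattern here, below is a minimal pure-Python sketch of a grouped mean. It is only an illustration of the pattern, not how HPAT or Pandas implements groupby. The function names (`partial_sums`, `merge`, `grouped_mean`) and the toy rows are hypothetical, and the chunks are processed sequentially rather than on separate processes.

```python
from collections import defaultdict


def partial_sums(rows):
    """Map step: per-key [sum, count] for one chunk of (key, value) rows."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return acc


def merge(partials):
    """Reduce step: combine per-chunk [sum, count] pairs, then divide."""
    total = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for key, (s, n) in part.items():
            total[key][0] += s
            total[key][1] += n
    return {key: s / n for key, (s, n) in total.items()}


def grouped_mean(rows, n_chunks):
    """Split rows into contiguous chunks, map each chunk, then reduce."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    return merge(partial_sums(chunk) for chunk in chunks)


# Toy (YEAR, INCTOT) rows; the values are made up for illustration.
rows = [(1970, 10.0), (1980, 20.0), (1970, 30.0), (1980, 40.0), (1990, 50.0)]
result = grouped_mean(rows, n_chunks=2)
```

In this sketch the reduce step only combines one (sum, count) pair per key per chunk, so with a handful of distinct years the amount of data exchanged between workers would be small compared to the per-chunk work.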
Second, even with a single process, HPAT is roughly 6x slower than plain Pandas (1.437 s vs. 0.227 s).
Below are the groupby-count results for my dataset. Each year has plenty of entries, so there should be sufficient parallelism.
YEAR | count |
---|---|
1970 | 1486744 |
1980 | 8746006 |
1990 | 1906165 |
2000 | 2199860 |
2010 | 2494822 |
Am I missing something? Could you give some suggestions on how to accelerate the groupby-agg operation using HPAT?
Thank you so much.
Best regards,
Hongyuan Liu
Thank you @bigwater! We're currently working on groupby.