greenplum-db/pxf

Poor performance querying large HDFS data with PXF

LittleWuCoding opened this issue · 12 comments

Problem summary:
The performance of Greenplum PXF is very poor, whether it reads the data on HDFS directly or accesses the data on HDFS through Hive. The stored data is in Parquet format; its actual data size is 1.2 TB, and the size stored in HDFS is about 350 GB. In a TPC-H test scenario, the execution time of Q1 is around 1680 seconds, and the other queries all take over 1000 seconds. During the test, we found that the Java process consumed a lot of CPU and HDFS generated a lot of I/O. Judging from the timing statistics, the cause of this behavior appears to be that all of the data on HDFS was read and the required data was not filtered out. We want to know the detailed process PXF follows during a query and which step causes the low performance. This includes the following questions:

Question 1: What metadata does the PXF agent get from the NameNode in HDFS? What is the difference between accessing HDFS directly and reading metadata through a Hive table? And why, on the surface, is the performance of accessing HDFS directly similar to that of accessing a Hive table?
Question 2: Does PXF use Tomcat to read and forward data? Does Tomcat read the whole dataset onto each host?
Question 3: Does PXF's WHERE-condition pushdown perform the filtering in Tomcat, or while reading the data from HDFS?
Question 4: Does Tomcat on each host read the data from HDFS separately, or is the data read from HDFS only once across the host nodes?
Question 5: When the segments on each Greenplum host read data, does each segment read only a part of the data from Tomcat, or all of it?
Question 6: For this scenario, where HDFS holds a large amount of data (more than 300 GB for a single table), are there any good optimization methods?

Testing process:

  1. Test version and environment:
    Greenplum version: GP 6.4
    Operating system: RHEL 7
    pxf version: 5.10.1

  2. Network environment:
    (network topology diagram attached to the original issue)

Test results (GP + PXF + Parquet):

| TPC-H query | Execution time (s) |
| --- | --- |
| Q1 | 1681.623 |
| Q3 | 1466.411 |
| Q5 | 1683.317 |
| Q6 | 1175.192 |
| Q8 | 1286.907 |
| Q9 | 1514.672 |
| Q14 | 1151.166 |
| Q17 | 2346.653 |

Looking forward to your replies~
@yydzero

@LittleWuCoding Thanks for reporting. @vraghavan78 @frankgh could you kindly help with this? Thanks!

@oliverralbertini @yydzero
Thanks for your kind help.
I tried testing a Hive table with partitioned data, and that really works: PXF did read less data from Hive.
But our data is not partitioned, and it is not possible to change it to a partitioned table; besides, a partitioned table does not help a query whose filter conditions do not involve the partition column.
So the critical problem remains the same performance for HDFS and Hive without partitioned data. In other words: how do we push filters down to Hive?
We know from the manual that PXF supports filter pushdown, so I set the following parameters in both the Hive and the PXF hive-site.xml:

hive.optimize.ppd = true
hive.metastore.integral.jdo.pushdown = true

but it seems this did not work: the performance for HDFS and Hive is the same through PXF. Also, I noticed that the HDFS I/O is the same in both situations, which indicates that the filter is not pushed down when querying a Hive table through PXF.
I also ran the same query directly in Hive, and the "explain analyze" output showed that the filter was pushed down there.
So, did I set anything wrong, or am I missing some configuration?
Or does PXF only push down filters on partition columns?
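One more thing I am unsure about is the Greenplum side. If I read the docs correctly, filter pushdown to external tables is governed by a session parameter rather than by hive-site.xml. A minimal check, assuming the GP 6 parameter name gp_external_enable_filter_pushdown (please correct me if the relevant GUC differs):

-- Assumed GP 6 parameter name for external-table filter pushdown.
SHOW gp_external_enable_filter_pushdown;
-- Enable it for the session if it is off.
SET gp_external_enable_filter_pushdown = on;

Is that the right knob, or is there more to configure on the PXF side?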
I am looking forward to your reply~

@LittleWuCoding as an alternative way to run queries against Hive, you can use the JDBC profile to issue queries against the external system. PXF has native JDBC support for querying Hive (meaning that you don't need to install additional drivers). Here is the documentation: https://gpdb.docs.pivotal.io/5270/pxf/jdbc_pxf.html
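A minimal sketch of the external-table DDL (the HiveServer2 host, port, database, table name, and columns below are hypothetical placeholders):

-- Hypothetical endpoint and schema; adjust DB_URL and the column list to your Hive table.
CREATE EXTERNAL TABLE pxf_hive_jdbc_example (id int, amount numeric)
LOCATION ('pxf://default.my_table?PROFILE=Jdbc&JDBC_DRIVER=org.apache.hive.jdbc.HiveDriver&DB_URL=jdbc:hive2://hiveserver:10000/default')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

With this profile, PXF issues SQL to HiveServer2, so supported WHERE clauses can be evaluated on the Hive side rather than after transfer.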

Let me know if this will help your use case.

@frankgh hello, I have tried the JDBC profile to issue the query, but found worse performance than with the PXF external table.
I also found many limitations in the JDBC profile; for example, for complex queries including aggregate functions and joins, I have to use a named query, which is neither easy nor convenient.
Besides, could you give me more advice about querying data on Hive or HDFS?

@LittleWuCoding have you tried querying the same Hive data using the hdfs:parquet profile? I believe the Hive profile does not perform predicate pushdown when accessing Parquet files. The downside of using hdfs:parquet is that you need to know the HDFS path of your Hive table, but once you obtain that piece of information, can you try hdfs:parquet? The hdfs:parquet profile uses both predicate pushdown and column projection; a sketch of the external-table definition follows below.
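For example, a minimal sketch (the warehouse path and column list are hypothetical placeholders; substitute your table's actual HDFS location and schema, which you can find with DESCRIBE FORMATTED in Hive):

-- Hypothetical path and columns; point LOCATION at the Hive table's warehouse directory.
CREATE EXTERNAL TABLE pxf_b_parquet (bb int, yyy text)
LOCATION ('pxf://user/hive/warehouse/mydb.db/b?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');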

@frankgh We have already tried hdfs:parquet and compared it with hive:parquet. We found that the performance of hdfs:parquet is slightly better than Hive's, but it is still very slow. For example, a simple query like the one below:

select a.xxx, b.yyy from a join b on a.aa = b.bb where a.cc = qqq;

where table a has about 200k rows, table b has 200 million rows, and the final result set of the query is 9 rows:
hive:parquet: 500s
hdfs:parquet: 400s
But when using Apache Presto to run the same query, the time is just 20s, which is a huge gap.
So, can PXF achieve performance similar to Presto's? Or are they not even within an order of magnitude of each other?

> But when using Apache Presto to run the same query, the time is just 20s, which is a huge gap.

Interesting. Is Presto installed on the Greenplum cluster, or where does Presto live?

> So, can PXF achieve performance similar to Presto's? Or are they not even within an order of magnitude of each other?

If PXF is installed closer to the Hadoop cluster (as opposed to the Greenplum cluster), we can reduce the data transfer between PXF and Greenplum. It'd be interesting to understand the differences between your Presto installation and your PXF installation.

@frankgh Thanks for your quick reply.
In my environment, Presto, GP, PXF, and Hadoop are installed on the same cluster, as shown in the figure at the top of this issue.

@LittleWuCoding looking at your query, I see that you are doing a join. PXF does not support join pushdown (yet), but we are working on adding performance features to PXF. The first feature we want to get out is support for ANALYZE, so joins like the one above will benefit from table statistics. We will be rolling out these performance features closer to the end of the year; it's an area that we are actively looking to improve and have been working towards.

select a.xxx, b.yyy from a join b on a.aa = b.bb where a.cc = qqq;

Out of curiosity, can you share the EXPLAIN plan for the query above, and tell us which table has 200k rows and which has 200M rows?
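Something like this would be helpful (reusing the placeholder table and column names from your query):

-- Placeholders as in the query above; run against the PXF external tables.
EXPLAIN ANALYZE
SELECT a.xxx, b.yyy
FROM a JOIN b ON a.aa = b.bb
WHERE a.cc = qqq;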

Looks like this thread has gone stale, so we will be closing this for now. If this issue persists, please reopen this issue or create a new issue with all the details.