Poor performance quering large HDFS data with PXF

Question

Poor performance quering large HDFS data with PXF

LittleWuCoding opened this issue 4 years ago · 12 comments

Problem summary:
The performance of Greenplum PXF is very poor whether it reads the data on HDFS directly or accesses the data on HDFS through hive. The format of the stored data is parquet, it's actual data size is 1.2T, and the data size stored in HDFS is about 350G. Based on the TPCH test scenario, the execution time of Q1 is around 1680 seconds, and other SQL is over 1000 seconds. During the test, it was found that Java process occupied a lot of CPU, and HDFS also occupied a lot of I/O. From the perspective of statistical time, the reason for this phenomenon should be that all the data on HDFS was read, and the required data was not filtered. We want to know the detailed process of PXF in the query process and which process causes the low performance. It includes the following questions:

Question 1: What metadata does PXF's agent get from Name Node in HDFS? What's the difference between accessing HDFS directly and reading metadata through hive table access? On the surface, why is the performance of accessing HDFS directly similar to that of accessing hive table?
Question 2: PXF uses Tomcat to read and forward data? Tomcat is to read the whole data to each host?
Question 3: Whether the where condition pushdown of PXF do filtering in Tomcat or when reading HDFS data?
Question 4: Does Tomcat on each host read data from HDFS once or multiple host nodes only read data once?
Question 5: When the segments on each GP read data, do they read only a part of the data or all data from Tomcat?
Question 6: For this scenario, HDFS has a large amount of data (more than 300GB for a single table). Is there any good optimization methods?

Testing process:

Test version and environment:
Greenplum version: GP 6.4
Operating system: REHL7
pxf version: 5.10.1
Network environment:

Scene No.	Test results(s)
GP+PXF+Parquet
1	1681.623
3	1466.411
5	1683.317
6	1175.192
8	1286.907
9	1514.672
14	1151.166
17	2346.653

Look forward to your replies~
@yydzero

Answer 1 · 2020-06-14T00:13:33.000Z

@LittleWuCoding Thanks for reporting. @vraghavan78 @frankgh would you kindly please help on this? Thanks!

Answer 2 · 2020-06-19T16:11:17.000Z

Hi @yydzero, thanks for the detailed report and questions. I'll do my best to answer your questions about PXF and how it interacts with HDFS.

Question 1: What metadata does PXF's agent get from Name Node in HDFS? What's the difference between accessing HDFS directly and reading metadata through hive table access? On the surface, why is the performance of accessing HDFS directly similar to that of accessing hive table?

The metadata that PXF needs from NameNode is the location of files on HDFS so that the data can then be split across segments on GPDB. When accessing metadata through Hive, PXF relies on a Hive table definition to determine which HDFS files to access. Performance should be similar because the act of fetching metadata is only a small portion of most (large) queries. The fetching of the actual data from HDFS should take the vast majority of the run time for any large query.

Question 2: PXF uses Tomcat to read and forward data? Tomcat is to read the whole data to each host?

Yes, Tomcat handles the requests from Greenplum to fetch data from HDFS (or conversely to write to HDFS) and then feeds the data back to each respective Greenplum segment. Each host asks for a subset of the total data on HDFS that corresponds to the query. A PXF/Tomcat running on a segment host is responsible for serving each segment on that host.

Question 3: Whether the where condition pushdown of PXF do filtering in Tomcat or when reading HDFS data?

PXF column projection and filter pushdown means filtering should be done on the external source, not on PXF. However this isn't supported when accessing HDFS because it's just a filesystem. When accessing data on S3, for example, you could use S3 Select feature of AWS to pushdown filters to S3. This also works with JDBC because the underlying SQL database can perform filtering operations. Filter pushdown should also work with Hive, but if you are seeing the same performance for HDFS and Hive, we should discuss further to see why that is.

Question 4: Does Tomcat on each host read data from HDFS once or multiple host nodes only read data once?

After accessing the metadata from the NameNode, the data is split across GPDB segments. If X is number of segments per host, each Tomcat server will then serve X requests per query (one for each segment on the host). Each request will be for a different part of the underlying HDFS data.

Question 5: When the segments on each GP read data, do they read only a part of the data or all data from Tomcat?

Each segment reads just the data that it requested from Tomcat/PXF.

Question 6: For this scenario, HDFS has a large amount of data (more than 300GB for a single table). Is there any good optimization methods?

There are ways to read Hive tables that are partitioned. If you can find a way to partition your data that makes sense for the number of segments on your GPDB cluster, this can provide significant speedup. https://gpdb.docs.pivotal.io/6-8/pxf/hive_pxf.html#hive_homog_part https://gpdb.docs.pivotal.io/6-8/pxf/hive_pxf.html#hive_heter_part

Answer 3 · 2020-07-10T06:57:59.000Z

@oliverralbertini @yydzero
Thanks for your kindly help.
I had tried testing hive table with data partitioned, and found that really works, PXF did read less data from Hive.
But our data is not partitioned and it is not possible to change it to partitioned table, and partitioned table also can not help for query's filter conditions without partitioned column.
So, the critical problem is the same performance for HDFS and Hive without data partitioned. In other words, how to pushdown filter to hive.
We know that PXF support Filter pushdown from the mamual, so I set the follow params in both hive and PXF hive-site.xml :

hive.optimize.ppd = true
hive.metastore.integral.jdo.pushdown = true

but it seems did not works :performance for HDFS and Hive are the same though PXF. Also, I noticed that the HDFS I/O are the same on both situations which indicated that the Filter is not pushdown when querying on hive table through PXF.
And I ran the same query on hive, found that the filter was pushed down by "explain analyze" command
So, did I set anyting wrong or missing any configuration?
OR did PXF just Filter pushdown on partitioned column?
I am looking forward your reply~

Answer 4 · 2020-07-12T12:08:09.000Z

@LittleWuCoding as an alternative to run queries against Hive, you can use the JDBC profile to issue queries against the external system. There is native JDBC support for querying Hive in PXF (meaning that you don't need to install additional drivers). Here is the documentation: https://gpdb.docs.pivotal.io/5270/pxf/jdbc_pxf.html

Let me know if this will help your use case.

Answer 5 · 2020-07-13T08:48:47.000Z

@frankgh hello, I have tried the JDBC profile to issue the query, but found worse performance than PXF external table.
And I found many limitation on JDBC profile, for example, for complex querys including aggregate functions and joins , I have to use named query which is not easy and convenient.
Besides，could you give me more advice about query data on hive or hdfs？

Answer 6 · 2020-07-13T18:48:52.000Z

@LittleWuCoding have you tried querying the same hive data using the hdfs:parquet format. I believe the Hive profile accessing parquet files does not perform predicate pushdown. The downside for using the hdfs:parquet format is that you need to know the path for you hive table, but once you obtain that piece of information can you try using hdfs:parquet. The hdfs:parquet format uses both predicate pushdown and column projection.

Answer 7 · 2020-07-14T01:18:14.000Z

@frankgh We have aready tried hdfs:parquet compared with hive:parquet. We found that the performance of hdfs:parquet are slightly better than hive, but also very slow. For example, a simple query likes below:

select a.xxx, b.yyy from a join b where a.aa = b.bb and a.cc = qqq;

which table a about 200k row and table b 200 million row and the final resultset of the query is 9 row:
hive:parquet: 500s
hdfs:parquet: 400s
but where using the apache presto to query, the time is just 20s， which is a huge gap.
So, can PXF has similar performance of presto ? Or their performance is not an order of magnitude?

Answer 8 · 2020-07-14T01:25:32.000Z

but where using the apache presto to query, the time is just 20s， which is a huge gap.

Interesting, is presto installed on the Greenplum cluster, or where does presto live?

So, can PXF has similar performance of presto ? Or their performance is not an order of magnitude?

If PXF is installed closer to the hadoop cluster (as opposed to the Greenplum cluster), we can reduce the data transfer between PXF and Greenplum. It'd be interesting to understand what are the differences between your presto installation and PXF installation.

Answer 9 · 2020-07-14T02:03:47.000Z

@frankgh Thanks for your quick reply.
In my env, the presto, GP, PXF and Hadoop are installed in the same cluster as the figure on the top of this issue.

Answer 10 · 2020-07-14T02:20:49.000Z

@LittleWuCoding looking at your query I realize that you are doing a join. PXF does not support join pushdown (yet), but we are working on adding performance features for PXF. The first feature we want to get out is support for ANALYZE, so joins like the one above will benefit from table statistics. We will be rolling out these performance features closer to the end of the year, it's an area that we are actively looking to improve, and have been working towards that goal.

Answer 11 · 2020-07-14T03:12:30.000Z

select a.xxx, b.yyy from a join b where a.aa = b.bb and a.cc = qqq;

Out of curiosity, can you share the explain plan for the query above, and which table has 200k rows and which table has 200M rows?

Answer 12 · 2023-03-03T22:47:36.000Z

Looks like this thread has gone stale, so we will be closing this for now. If this issue persists, please reopen this issue or create a new issue with all the details.