greenplum-db/pxf

PXF Hive: filtering does not work if partition column is string data type

akotuc opened this issue · 1 comments

Hello,
it seems that filtering on a partition column in the PXF Hive profile is not applied correctly when the partition column is of string type. I tested several cases in the same environment: partition filtering always works for partition columns of integer types, but does not work if the partition column is of string type.

PXF version: 5.12.0
GPDB version: 6.7.1

How I tested: I created two otherwise identical tables that differ only in the type of the partition column being filtered on:
a) the partition column being filtered on is of integer type
Partition columns:
year_month int
platform string

Query: SELECT * FROM adhoc.table WHERE year_month = 202004;

PXF log:

2020-05-20 08:56:05.0671 DEBUG tomcat-http--15 org.greenplum.pxf.plugins.hive.HiveMetaStoreClientCompatibility1xx - Attempting to fallback
2020-05-20 08:56:05.0955 DEBUG tomcat-http--15 org.greenplum.pxf.plugins.hive.HiveClientWrapper - Item: dw.table, type: EXTERNAL_TABLE
2020-05-20 08:56:05.0956 DEBUG tomcat-http--15 org.greenplum.pxf.plugins.hive.HiveClientWrapper - Hive table: 12 fields. 2 partitions.
2020-05-20 08:56:05.0956 DEBUG tomcat-http--15 org.greenplum.pxf.plugins.hive.HiveDataFragmenter - setPartitions: [platform, year_month]
2020-05-20 08:56:05.0961 DEBUG tomcat-http--15 org.greenplum.pxf.plugins.hive.HiveDataFragmenter - Filter String for Hive partition retrieval : year_month = "202004"
2020-05-20 08:56:05.0986 DEBUG tomcat-http--3 org.greenplum.pxf.service.rest.BridgeResource - Starting streaming fragment 0 of resource hdfs://nameservice1/projects/dw/table/year_month=201707/platform=windows/part-05750-42764e67-c785-4ee3-8984-484a964af736.c000.gz.parquet
2020-05-20 08:56:06.0023 DEBUG tomcat-http--15 org.greenplum.pxf.plugins.hive.HiveDataFragmenter - Table - dw.table matched partitions list size: 8
2020-05-20 08:56:06.0155 INFO tomcat-http--15 org.greenplum.pxf.service.rest.FragmenterResource - org.greenplum.pxf.plugins.hive.HiveDataFragmenter returns 28 fragments for path dw.table in 635 ms for Session = gpadmin:1588764755-0000007762:2:hadoop-hive [profile Hive filter is available]

b) the partition column being filtered on is of string type

Partition columns:
year_month string
platform string

Query: SELECT * FROM adhoc.table WHERE year_month = '202004';

PXF log:

2020-05-20 06:37:20.0100 DEBUG tomcat-http--6 org.greenplum.pxf.plugins.hive.HiveMetaStoreClientCompatibility1xx - Attempting to fallback
2020-05-20 06:37:20.0259 DEBUG tomcat-http--6 org.greenplum.pxf.plugins.hive.HiveClientWrapper - Item:dw.table, type: EXTERNAL_TABLE
2020-05-20 06:37:20.0259 DEBUG tomcat-http--6 org.greenplum.pxf.plugins.hive.HiveClientWrapper - Hive table: 12 fields. 2 partitions.
2020-05-20 06:37:24.0734 INFO tomcat-http--6 org.greenplum.pxf.service.rest.FragmenterResource - org.greenplum.pxf.plugins.hive.HiveDataFragmenter returns 25119 fragments for path dw.table in 5131 ms for Session = gpadmin:1588764755-0000007756:0:hadoop-hive [profile Hive filter is not available]

It seems to me the problem starts on this line: https://github.com/greenplum-db/pxf/blob/master/server/pxf-hive/src/main/java/org/greenplum/pxf/plugins/hive/HiveDataFragmenter.java#L198
with the context.hasFilter() condition, since tbl.getPartitionKeysSize() returns 2 according to the log (Hive table: 12 fields. 2 partitions.). That log line comes from https://github.com/greenplum-db/pxf/blob/master/server/pxf-hive/src/main/java/org/greenplum/pxf/plugins/hive/HiveClientWrapper.java#L125, which uses the same function, tbl.getPartitionKeysSize(), to get the number of partition columns. This is consistent with the "filter is not available" note at the end of the log in case b).

Best,

Ales

Closing; this was caused by the external table definition. The partition column has to be declared with the TEXT data type in the external table (VARCHAR does not work!).
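
For anyone hitting the same problem, a minimal sketch of what the working definition might look like. This is an illustration only: the external table name adhoc.table_text and the column some_value are hypothetical, the remaining data columns of the real 12-field table are omitted and would have to match the Hive table, and the LOCATION assumes the default PXF server.

```sql
-- Hypothetical external table over the Hive table dw.table.
-- Column list is abbreviated; adapt it to the real Hive schema.
CREATE EXTERNAL TABLE adhoc.table_text (
    some_value  bigint,  -- hypothetical data column
    year_month  text,    -- string partition column: declare as TEXT, not VARCHAR
    platform    text     -- string partition column: declare as TEXT, not VARCHAR
)
LOCATION ('pxf://dw.table?PROFILE=Hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

-- With TEXT partition columns, the filter should reach PXF
-- (the FragmenterResource log line then ends with "filter is available").
SELECT * FROM adhoc.table_text WHERE year_month = '202004';
```

After recreating the table, the fragment count in the PXF log should drop to only the matching partitions, as in case a).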