greenplum-db/pxf

PXF when loading unescaped regex

sutera opened this issue · 2 comments

I am having an issue in PXF 5.16.0 on Greenplum 6.14.0: if I try to load a column whose data contains an unescaped regex, the row breaks into a new line at the characters \.. Minimal reproducible example:

parse.txt
account_id	banner	other
11999	www.butter.com	big words
11999	/^https?:\/\/([a-z0-9_\-\.]+\.)?abc\.net(:\d{1,5})?\//i.test(src)	jaffa kree
11999	www.google.com	text stuff
create external table staging.parsing_issue_ext
(
account_id INTEGER,
banner TEXT,
other TEXT
)
LOCATION ('pxf://<external_file_path>/parse.txt?PROFILE=gs:text&SERVER=gcs')
FORMAT 'TEXT' (NULL AS 'staq_no_value_found' ESCAPE 'OFF')
LOG ERRORS
SEGMENT REJECT LIMIT 2500;

The first and third rows load, while the second breaks at each occurrence of "\.".

Ideally I'd like to keep the regex unescaped, because we process the data further down in our pipeline using those regexes. Is this intended behavior, or is loading this type of data intentionally not supported?

Thanks.

@sutera -- the issue you're seeing arises because PXF uses Greenplum's COPY code on the C side to parse the data when the TEXT format is specified, and in that format the \. sequence has a special meaning: when it appears on a line by itself, it signifies the end of the data stream. In your file it appears in the middle of a line, so Greenplum fails with an "end-of-copy marker corrupt" error (you can see it if you run SELECT * FROM gp_read_error_log('staging.parsing_issue_ext');).

This is a known issue with Postgres as well (Greenplum is based on Postgres), especially on Windows platforms, where this pattern is quite common when referring to file paths.
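As a quick illustration (this is not part of PXF; the data and helper name are made up for the sketch), you can pre-scan a tab-delimited file for the \. sequence that trips the TEXT parser before attempting a load:

```python
# Sketch: flag lines containing the backslash-dot sequence that the
# Greenplum COPY TEXT parser treats as (part of) an end-of-copy marker.
# The sample rows mirror parse.txt; find_suspect_lines is a hypothetical helper.

def find_suspect_lines(lines):
    """Return (line_number, line) pairs whose text contains the '\\.' sequence."""
    return [(n, line) for n, line in enumerate(lines, start=1) if "\\." in line]

rows = [
    "11999\twww.butter.com\tbig words",
    "11999\t/^https?:\\/\\/([a-z0-9_\\-\\.]+\\.)?abc\\.net(:\\d{1,5})?\\//i.test(src)\tjaffa kree",
    "11999\twww.google.com\ttext stuff",
]

suspects = find_suspect_lines(rows)
print(suspects)  # only the regex row (line 2) is flagged
```

Only the second row contains \., which matches the observed behavior: rows one and three load cleanly while row two corrupts the end-of-copy marker.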

A workaround that might work for your data is to use the CSV format with the corresponding PXF profile instead, explicitly specifying the delimiter and instructing PXF to skip one header line:

CREATE EXTERNAL TABLE parsing_issue_csv(account_id INTEGER, banner TEXT, other TEXT)
LOCATION ('pxf://tmp/parse.txt?profile=hdfs:csv&SKIP_HEADER_COUNT=1')
FORMAT 'CSV' (NULL AS 'staq_no_value_found' DELIMITER AS E'\t')
LOG ERRORS
SEGMENT REJECT LIMIT 2500;

SELECT * FROM parsing_issue_csv;
 account_id |                              banner                               |   other
------------+-------------------------------------------------------------------+------------
      11999 | www.butter.com                                                    | big words
      11999 | /^https?:\/\/([a-z0-9_\-\.]+\.)?abc\.net(:\d{1,5})?\//i.test(src) | jaffa kree
      11999 | www.google.com                                                    | text stuff
(3 rows)

While this works on your sample data, be careful with the default escape and quote characters for the CSV format (they are both "). If your data contains quotes, make sure to change the escape and quote characters in the CSV format definition.
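To illustrate the quoting concern with a small sketch (Python's csv module follows the same convention as Greenplum's CSV format, where QUOTE and ESCAPE both default to "): an embedded quote is escaped by doubling it, and the whole field gets wrapped in quotes.

```python
import csv
import io

# Sketch: default CSV quoting doubles embedded quote characters,
# mirroring Greenplum's CSV format defaults (QUOTE = ESCAPE = '"').
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
writer.writerow(["11999", 'he said "hi"', "other"])
encoded = buf.getvalue().strip()
print(encoded)  # 11999	"he said ""hi"""	other

# Round-trip: the reader restores the original field value.
row = next(csv.reader(io.StringIO(encoded), delimiter="\t"))
print(row[1])  # he said "hi"
```

If your source file contains bare, undoubled quotes inside fields, this default scheme will misparse them, which is why switching QUOTE/ESCAPE to a character absent from the data is the safer option.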

Thank you for the response. That is good to know -- we will adjust the way we load such data.