PXF breaks rows when loading an unescaped regex containing \.
sutera opened this issue · 2 comments
I am having an issue with PXF 5.16.0 on Greenplum 6.14.0: if I try to load a column whose data contains an un-escaped regex, the row breaks into a new line at the characters \.
Minimal reproducible example:
parse.txt
account_id banner other
11999 www.butter.com big words
11999 /^https?:\/\/([a-z0-9_\-\.]+\.)?abc\.net(:\d{1,5})?\//i.test(src) jaffa kree
11999 www.google.com text stuff
create external table staging.parsing_issue_ext
(
account_id INTEGER,
banner TEXT,
other TEXT
)
LOCATION ('pxf://<external_file_path>/parse.txt?PROFILE=gs:text&SERVER=gcs')
FORMAT 'TEXT' (NULL AS 'staq_no_value_found' ESCAPE 'OFF')
LOG ERRORS
SEGMENT REJECT LIMIT 2500;
The first and third rows load correctly, while the second breaks at each occurrence of \.
Ideally I'd like to keep the regex un-escaped, because we process the data further down our pipeline using those regexes. Is this intended behavior, or is loading this type of data intentionally not supported?
Thanks.
@sutera -- the issue you're seeing is due to the fact that PXF uses Greenplum COPY code on the C side to parse the data when format TEXT is specified, and in that format the \. sequence has a special meaning: when it appears on a line by itself, it signifies the end of the data stream. In your file it appears in the middle of a line, so Greenplum fails with the "end-of-copy marker corrupt" message (you can see it if you run SELECT * FROM gp_read_error_log('staging.parsing_issue_ext');).
This is a known issue with Postgres as well (Greenplum is based on Postgres), especially on Windows platforms, where this pattern is quite common when referring to file paths.
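The end-of-data rule can be sketched in Python. This is a toy illustration of the behavior described above, not the actual Greenplum C parser, and `parse_copy_text` is a made-up name:

```python
def parse_copy_text(lines, delimiter="\t"):
    """Toy model of COPY TEXT line handling (not the real Greenplum parser)."""
    rows = []
    for line in lines:
        stripped = line.rstrip("\n")
        if stripped == "\\.":
            # A line consisting solely of \. marks the end of the data stream.
            break
        if "\\." in stripped:
            # \. anywhere else looks like a damaged end-of-data marker.
            raise ValueError("end-of-copy marker corrupt")
        rows.append(stripped.split(delimiter))
    return rows


# A clean line parses fine...
parse_copy_text(["11999\twww.butter.com\tbig words\n"])

# ...but the regex line trips the marker check.
try:
    parse_copy_text(["11999\t/^https?:\\/\\/x\\.net/\tjaffa kree\n"])
except ValueError as e:
    print(e)  # -> end-of-copy marker corrupt
```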
A workaround that might work for your data is to use the CSV format and PXF Profile instead, explicitly specifying the delimiter and instructing the PXF Profile to skip 1 header line:
CREATE EXTERNAL TABLE parsing_issue_csv(account_id INTEGER, banner TEXT, other TEXT)
LOCATION ('pxf://tmp/parse.txt?profile=hdfs:csv&SKIP_HEADER_COUNT=1')
FORMAT 'CSV' (NULL AS 'staq_no_value_found' DELIMITER AS E'\t')
LOG ERRORS
SEGMENT REJECT LIMIT 2500;
SELECT * FROM parsing_issue_csv;
account_id | banner | other
------------+-------------------------------------------------------------------+------------
11999 | www.butter.com | big words
11999 | /^https?:\/\/([a-z0-9_\-\.]+\.)?abc\.net(:\d{1,5})?\//i.test(src) | jaffa kree
11999 | www.google.com | text stuff
(3 rows)
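The CSV format side-steps the problem because CSV parsing gives the backslash no special meaning; only the delimiter and quote characters matter. Python's standard csv module shows the same behavior (used here as an analogy, not as the PXF implementation):

```python
import csv
import io

# The problematic line from parse.txt, tab-delimited.
line = ("11999\t"
        "/^https?:\\/\\/([a-z0-9_\\-\\.]+\\.)?abc\\.net(:\\d{1,5})?\\//i.test(src)\t"
        "jaffa kree\n")

row = next(csv.reader(io.StringIO(line), delimiter="\t"))
# Backslash is an ordinary character to a CSV parser,
# so the whole regex stays in one field.
print(row[1])
```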
While this works on your sample data, be careful with the default escape and quote characters for the CSV format (they are both "). If your data contains quotes, make sure to change the escape and quote characters in the CSV format definition.
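To see why this matters, Python's csv module (again as a stand-in for CSV parsing in general, not for PXF itself) shows how a field that begins with an unmatched quote swallows the delimiters after it, while disabling quoting keeps the fields intact:

```python
import csv
import io

# The second field starts with an unmatched double quote.
line = '11999\t"jaffa\tkree\tx\n'

# With default quoting, the opening quote starts a quoted field that never
# closes, so the remaining tabs are consumed into a single field.
default_row = next(csv.reader(io.StringIO(line), delimiter="\t"))

# With quoting disabled, the quote is just another character and the
# tabs split the line as expected.
raw_row = next(csv.reader(io.StringIO(line), delimiter="\t",
                          quoting=csv.QUOTE_NONE))

print(len(default_row), len(raw_row))
```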
Thank you for the response. That is good to know -- we will adjust the way we load such data.