pacman82/odbc2parquet

StarRocks import of a parquet file generated by odbc2parquet fails with encoding error

chrisfw opened this issue · 11 comments

Hi @pacman82 ,

I hope you are well. I am struggling with an error when importing a parquet file generated by odbc2parquet into a StarRocks database. It seems to occur at a certain record within the file, but I can't figure out how to identify the problematic row and column. From googling, the underlying error seems to occur in the C/C++ Arrow parquet reader being used under the covers by StarRocks. The only information I get in the error is below.

/build/starrocks/be/src/exec/parquet_reader.cpp:392 _parquet_reader->init_parquet_reader(_src_slot_descs)
/build/starrocks/be/src/exec/pipeline/scan/scan_operator.cpp:226 _get_scan_status()
W0903 14:51:10.263104  9367 pipeline_driver_executor.cpp:161] [Driver] Process error, query_id=2930bff9-b598-4010-915f-5f286d6bb406, instance_id=2930bff9-b598-4010-915f-5f286d6bb407, status=Internal error: IOError: Not yet implemented: Unsupported encoding.. filename: s3a://importdata/futcmas.parquet

This is the first time I have experienced this error when importing files produced by odbc2parquet into StarRocks. I am using odbc2parquet 2.0.1, Oracle 19c, and StarRocks 3.1.2.

Do you have any suggestions as to how I can identify the problem here? Any assistance you can provide would be greatly appreciated.

Regards,
Chris

Hello Chris,

I cannot tell you what is wrong either, but here are some things you might want to try:

  • Does it fail even with a parquet file containing only a single row? Such a file could be produced by changing the input query so that only a single row is returned. It would also speed up the feedback loop for the other experiments.
  • Which column causes this? Change the query so only certain columns are in the odbc2parquet output/StarRocks input. Maybe we could zero in on the exact column causing it?
  • --parquet-column-encoding can be used to change the encoding of a column. See odbc2parquet query --help and the sketch after this list.
  • Ask the maintainers of StarRocks/Parquet kindly to provide a better error message, detailing what exactly is unsupported.
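
For illustration, here is a minimal sketch of such an invocation, modelled on the command syntax used later in this thread. MY_COLUMN, the connection string, and the file name are placeholders; check odbc2parquet query --help for the exact syntax of your version:

```sh
# Hypothetical sketch: re-export a single row while forcing a specific encoding
# for one column. MY_COLUMN and the connection details are placeholders.
odbc2parquet query \
  --parquet-column-encoding 'MY_COLUMN:plain' \
  --column-compression-default snappy \
  -c "Driver={Oracle 19 ODBC driver};Dbq=...;Uid=...;Pwd=...;" \
  single_row.parquet \
  "select MY_COLUMN from my_table where rownum < 2"
```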

Best, Markus

@pacman82 , thank you very much for this information.

  • Does it fail even with a parquet file containing only a single row? This could be produced by changing the input query so only a single row gets returned.
    No. I narrowed it down to a particular row set size after which the import fails: it fails on a select with no sort applied and rownum < 56322, but that row does not appear to contain any unusual text when I export it to CSV using SQL Developer.

  • Which column causes this? Change the query so only certain columns are in the odbc2parquet output/StarRocks input.
    I started down this path by only including the varchar columns, but I wasn't able to identify the culprit. I will revisit this though and try each column. Sadly there are 181 columns in this table to eliminate. :(

  • --parquet-column-encoding can be used to change the encoding of a column. See odbc2parquet query --help
    I will try this. Can you tell me what the default encoding used by odbc2parquet is? Also, do you have a suggested sequence in which to try the others: plain, bit-packed, delta-binary-packed, delta-byte-array, delta-length-byte-array or rle?

  • Ask the maintainers of StarRocks/Parquet kindly to provide a better error message, detailing what exactly it is which is unsupported.
    I did this and I am waiting on an answer.

@pacman82 , FYI I identified the problematic column and specific rows in Oracle. The problem values were in a non-nullable varchar2 column and were populated with a single space. I tried the encoding options, but they were either not recognized or not applicable for the column data type, e.g.

`ENCODING` must be one of: `plain`, `bit-packed`, `delta-binary-packed`, `delta-byte-array`, `delta-length-byte-array` or `rle`

```
odbc2parquet query --no-empty-file --driver-does-not-support-64bit-integers --parquet-column-encoding 'UTCM_REF_ACCT:bit-packed' --column-compression-default snappy -c "Driver={Oracle 19 ODBC driver};Dbq=UNITST6_UNITIL;Uid=enqjsql;Pwd=enq;" /extdatemp/futcmas.parquet "select * from futcmasview"
error: invalid value 'UTCM_REF_ACCT:bit-packed' for '--parquet-column-encoding <PARQUET_COLUMN_ENCODING>': Sorry, I do not know a column encoding called 'bit-packed'.
```

I worked around this issue by importing the odbc2parquet-generated parquet file into duckdb and then exporting it back out from duckdb as a parquet file. The newly generated file loaded into StarRocks without an issue.
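
For reference, a minimal sketch of that round trip using the DuckDB CLI. The output file name is a placeholder and the single-statement -c invocation is just one way to run it; any DuckDB shell works:

```sh
# Read the odbc2parquet output with DuckDB and write it back out as a fresh parquet file.
duckdb -c "COPY (SELECT * FROM read_parquet('futcmas.parquet')) TO 'futcmas_duckdb.parquet' (FORMAT PARQUET);"
```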

Regards,
Chris

Released odbc2parquet 2.0.2 with updated --help text for the query subcommand. The new text no longer mentions the bit-packed encoding, since it is not supported.

I would be interested to learn what changed when importing/exporting the file into and out of DuckDB. The value of the field or its schema?

Command line tools like parquet-schema and parquet-read allow you to inspect the parquet file. Alternatively, if you produce two minimal parquet files containing only that column and that field, one from before and one from after the export from DuckDB, and share them with me, I could also take a look at them.
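
For example, with the CLI tools that ship with the Rust parquet crate (a sketch; the file name is a placeholder and the exact arguments may differ between tool versions):

```sh
# Print the schema, including whether columns are required or optional.
parquet-schema futcmas.parquet
# Dump the first few records to inspect the actual values.
parquet-read futcmas.parquet 5
```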

Did you verify that the field is a normal ASCII white space? I could imagine a bunch of control characters also rendering that way.

@pacman82 , thank you for updating the release to exclude the invalid encoding option.

Thanks also for offering to take a look - I will create a test case for you and share the odbc2parquet and duckdb generated parquet files. I did verify through a query that the column contained ascii 32.

@pacman82 , I recreated the issue and solved it the same way. The error in StarRocks was different, but that may just be because the column was in the middle of the schema in the previous case, whereas here it is right after the primary key, with only 2 columns in total.
type:LOAD_RUN_FAIL; msg:Read Batch Error With Libarrow. filename: s3a://importdata/jobtest.parquet

Attached are the two files for your review. I hope this is helpful and thank you for your assistance. Please let me know if you need anything else.
jobtest.zip

Regards,
Chris

Hello @chrisfw ,

looking at both files I can see that the one created by odbc2parquet uses parquet format 2. More importantly, all columns are required. The one created by DuckDB uses parquet format 1 and contains only optional columns. I was surprised by how many rows were in these files. Was a single record not enough to reproduce the issue?

I recreated the issue and solved it the same way.

How did you recreate and solve the issue?

Best, Markus

Hi Markus,

I tried a couple of times, without success, to reproduce the issue using just a single record. Both an export of a single row with normal data in the JOBACCOUNT column and an export of a single row that contained only a " " in the JOBACCOUNT column imported into StarRocks without issue. As a result, I just exported the entire table to cause the StarRocks import to fail.

How did you recreate and solve the issue?

By "solve", I just meant that I used the workaround of importing the file into duckdb and then exporting from duckdb to parquet and finally importing the newly generated file into StarRocks successfully.

Regards,
Chris

Hello Chris,

so far it is not clear what StarRocks takes issue with: the schema, the encoding, the parquet version, or something else. Moreover, I currently like the fact that odbc2parquet uses the (recent) Parquet version 2 and correctly identifies required columns as such. At this point I fear any random change might make a file compatible or incompatible.

I inspected the files and could not find fault with them. I really wish StarRocks would say precisely what it is they do not support, or what is wrong with the file. It is really hard for me to debug the issue further at the moment.

Best, Markus

@pacman82 , thanks for looking into this. I totally agree based on my own and your investigation that this is a StarRocks issue. The StarRocks team indicated they are using the Apache Arrow libs to load the data, but I don't have any further information and it doesn't seem there is anything further to be done at this point. Thanks again for your help.

Regards,
Chris

Thanks @chrisfw , feel free to reopen this or open a new issue, if you learn something actionable.

Best, Markus