silent:true does not silence the error
kevinguzo opened this issue · 8 comments
Tried to use silent option to ignore problem records when extract from csv. for example:
USING Extractors.Csv(silent:true, skipFirstNRows:1);
The job still failed with following error:
Error occured while extracting row after processing 121549 record(s) in the vertex' input split. Column index: 0, column name: 'unique_transaction_id
195887140
Can you please provide the full error message? Please note that silent does not skip all errors. Only misaligned number of columns and it turns cast errors into nulls if the target type is nullable. It still fails if the target type is not nullable.
Does somebody solve this issue?
I have the same kind of error.
Sorry for the late reply. The above error message does unfortunately not show the details of the InnerError that is important.
If the columns are nullable, then there is a large probability though that the job failed because of an encoding mismatch. E.g., you tried to extract the data as UTF-8 (e.g. default encoding), but the file may contain a character that is not UTF-8 or the file is actually Windows-125x encoded but mostly uses the ASCII range (with the exception of this row). If the encoding is mismatched, you should specify the right encoding on the EXTRACT statement. If a single character is wrongly encoded, you will have to get the data clean from the originator.
We have an option on the backlog to give you the choice of converting wrongly encoded byte sequences into the illegal character character. So please let me know what the exact error and root cause is to help us with the prioritization of this feature.
Thanks for the responding, I've solved the issue a bit differently.
The problem was that the corrupted rows had the same size as healthy ones, so "silent: true" didn't skip it.
Additionally, I used the first row from the csv with "int" type as a primary key for the DataLake Table. However, a single row with incorrect type will crash the job.
I solved this problem by adding the calculated column with the hash code of the index column and use it as a primary key-
csvindexcol.GetHashCode() as tablerowid
and changed csvindexcol to "int?" type in EXTRACT stage.
Then, in INSERT INTO stage I added WHERE csvindexcol!=0 to handle the corrupted rows.
If anyone knows a better way to solve this, please, share.
I think the reason why you still received an error with silent:true was that your conversion into an int failed and since the column type was not nullable, it could not turn it into a null.
So you could still use silent:true in this case, but keep the data type of your primary key column nullable (e.g., int?). Then silent:true would turn the invalid value into null and you could filter the rows out in a second step. That may be easier than calculating the Hash code.
Thanks, that works for me. And it looks much easier.
Good :). Also, we have added support for Windows-125x code pages in the built-in CSV-style extractors. I will be closing this issue for now.