pacman82/odbc2parquet

Error reading Bit / Boolean columns

Closed this issue · 6 comments

Hi,

I am testing this tool and it have been great for most of our tables extraction.
However, I am encountering errors while trying to read columns with Bit data type.

odbc2parquet -vvv query --connection-string "Driver={ODBC Driver 17 for SQL Server};Server=<...>;database=<...>;UID=<...>;PWD=<...>;" Pessoas.parquet "SELECT HabilitadoRiscoSacado_pes FROM Pessoas"
2021-07-30T18:03:35-03:00 - DEBUG - ODBC Environment created.
2021-07-30T18:03:36-03:00 - WARN - State: 01000, Native error: 5701, Message: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Contexto do banco de dados alterado para '<...>'.
2021-07-30T18:03:36-03:00 - WARN - State: 01000, Native error: 5703, Message: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Definição do idioma alterada para Português (Brasil).
2021-07-30T18:03:36-03:00 - DEBUG - ODBC column description for column 1: ColumnDescription { name: [72, 97, 98, 105, 108, 105, 116, 97, 100, 111, 82, 105, 115, 99, 111, 83, 97, 99, 97, 100, 111, 95, 112, 101, 115], data_type: Bit, nullability: NoNulls }
2021-07-30T18:03:36-03:00 - DEBUG - ODBC buffer description for column 1: BufferDescription { nullable: true, kind: Bit }
2021-07-30T18:03:36-03:00 - INFO - Memory usage per row is 68 bytes. This excludes memory directly allocated by the ODBC driver.
2021-07-30T18:03:36-03:00 - INFO - Batch size set to 31580641
2021-07-30T18:03:37-03:00 - INFO - Fetched batch 1 with 83817 rows.
2021-07-30T18:03:37-03:00 - DEBUG - Writing column with index 0 and name 'HabilitadoRiscoSacado_pes'.
Error: EOF: unable to put boolean value

If I cast it as a Tinyint it works. Do you have any idea of why is this happening?

Thanks,
René

Hello René,

thanks for taking the time to file a bug report. Based on the error message and the log I would say the bug happens then writing to parquet, rather than reading from the database. I've written a minimal test case, but sadly it failed to reproduce the issue.

I setup an empty table with a bit column, and inserted five values into it.

setup_empty_table(&conn, table_name, &["BIT"]).unwrap();
let insert = format!(
        "INSERT INTO {}
        (a)
        VALUES
        (0), (1), (NULL), (1), (0);",
        table_name
    );

Yet odbc2parquet runs fine and produces a parquet file for which parquet read shows the following output.

let expected_values = "\
        {a: false}\n\
        {a: true}\n\
        {a: null}\n\
        {a: true}\n\
        {a: false}\n\
    ";

I am happy to see that you use a Microsoft SQL Database. Same as I use for the test setup for this tool. Yet my minimal mock data fails to reproduce the issue. I wonder if we need all of your 83817 rows to reproduce the issue. What would happen if you limit the query to only return part of the data? (Syntax for MSSQL can be found is this stack overflow issue: https://stackoverflow.com/questions/13220743/implement-paging-skip-take-functionality-with-this-query).

Your help in figuring this out would be greatly appreciated.

Cheers, Markus

Hi Markus, thanks for the quick response.
It seems like it fails when it tries to fetch more than 4096 rows.

Fetching 4096 rows

odbc2parquet -vvv query --connection-string "Driver={ODBC Driver 17 for SQL Server};---" Pessoas.parquet "SELECT TOP(4096) HabilitadoRiscoSacado_pes FROM Pessoas"
2021-07-31T09:10:19-03:00 - DEBUG - ODBC Environment created.
2021-07-31T09:10:20-03:00 - WARN - State: 01000, Native error: 5701, Message: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Contexto do banco de dados alterado para '---'.
2021-07-31T09:10:20-03:00 - WARN - State: 01000, Native error: 5703, Message: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Definição do idioma alterada para Português (Brasil).
2021-07-31T09:10:20-03:00 - DEBUG - ODBC column description for column 1: ColumnDescription { name: [72, 97, 98, 105, 108, 105, 116, 97, 100, 111, 82, 105, 115, 99, 111, 83, 97, 99, 97, 100, 111, 95, 112, 101, 115], data_type: Bit, nullability: NoNulls }
2021-07-31T09:10:20-03:00 - DEBUG - ODBC buffer description for column 1: BufferDescription { nullable: true, kind: Bit }
2021-07-31T09:10:20-03:00 - INFO - Memory usage per row is 68 bytes. This excludes memory directly allocated by the ODBC driver.
2021-07-31T09:10:20-03:00 - INFO - Batch size set to 31580641
2021-07-31T09:10:20-03:00 - INFO - Fetched batch 1 with 4096 rows.
2021-07-31T09:10:20-03:00 - DEBUG - Writing column with index 0 and name 'HabilitadoRiscoSacado_pes'.

Fetching 4097 rows

odbc2parquet -vvv query --connection-string "Driver={ODBC Driver 17 for SQL Server};---" Pessoas.parquet "SELECT TOP(4097) HabilitadoRiscoSacado_pes FROM Pessoas"
2021-07-31T09:10:14-03:00 - DEBUG - ODBC Environment created.
2021-07-31T09:10:15-03:00 - WARN - State: 01000, Native error: 5701, Message: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Contexto do banco de dados alterado para '---'.
2021-07-31T09:10:15-03:00 - WARN - State: 01000, Native error: 5703, Message: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Definição do idioma alterada para Português (Brasil).
2021-07-31T09:10:15-03:00 - DEBUG - ODBC column description for column 1: ColumnDescription { name: [72, 97, 98, 105, 108, 105, 116, 97, 100, 111, 82, 105, 115, 99, 111, 83, 97, 99, 97, 100, 111, 95, 112, 101, 115], data_type: Bit, nullability: NoNulls }
2021-07-31T09:10:15-03:00 - DEBUG - ODBC buffer description for column 1: BufferDescription { nullable: true, kind: Bit }
2021-07-31T09:10:15-03:00 - INFO - Memory usage per row is 68 bytes. This excludes memory directly allocated by the ODBC driver.
2021-07-31T09:10:15-03:00 - INFO - Batch size set to 31580641
2021-07-31T09:10:15-03:00 - INFO - Fetched batch 1 with 4097 rows.
2021-07-31T09:10:15-03:00 - DEBUG - Writing column with index 0 and name 'HabilitadoRiscoSacado_pes'.
Error: EOF: unable to put boolean value

Fetching 4096 rows offsetting 4096

odbc2parquet -vvv query --connection-string "Driver={ODBC Driver 17 for SQL Server};---" Pessoas.parquet "SELECT HabilitadoRiscoSacado_pes FROM Pessoas ORDER BY HabilitadoRiscoSacado_pes OFFSET 4096 ROWS FETCH NEXT 4096 ROWS ONLY"
2021-07-31T09:12:36-03:00 - DEBUG - ODBC Environment created.
2021-07-31T09:12:37-03:00 - WARN - State: 01000, Native error: 5701, Message: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Contexto do banco de dados alterado para '---'.
2021-07-31T09:12:37-03:00 - WARN - State: 01000, Native error: 5703, Message: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Definição do idioma alterada para Português (Brasil).
2021-07-31T09:12:37-03:00 - DEBUG - ODBC column description for column 1: ColumnDescription { name: [72, 97, 98, 105, 108, 105, 116, 97, 100, 111, 82, 105, 115, 99, 111, 83, 97, 99, 97, 100, 111, 95, 112, 101, 115], data_type: Bit, nullability: NoNulls }
2021-07-31T09:12:37-03:00 - DEBUG - ODBC buffer description for column 1: BufferDescription { nullable: true, kind: Bit }
2021-07-31T09:12:37-03:00 - INFO - Memory usage per row is 68 bytes. This excludes memory directly allocated by the ODBC driver.
2021-07-31T09:12:37-03:00 - INFO - Batch size set to 31580641
2021-07-31T09:12:37-03:00 - INFO - Fetched batch 1 with 4096 rows.
2021-07-31T09:12:37-03:00 - DEBUG - Writing column with index 0 and name 'HabilitadoRiscoSacado_pes'.

Fetching 4097 rows offsetting 4096

odbc2parquet -vvv query --connection-string "Driver={ODBC Driver 17 for SQL Server};Server=172.20.16.4;---" Pessoas.parquet "SELECT HabilitadoRiscoSacado_pes FROM Pessoas ORDER BY HabilitadoRiscoSacado_pes OFFSET 4096 ROWS FETCH NEXT 4097 ROWS ONLY"
2021-07-31T09:12:40-03:00 - DEBUG - ODBC Environment created.
2021-07-31T09:12:41-03:00 - WARN - State: 01000, Native error: 5701, Message: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Contexto do banco de dados alterado para '---'.
2021-07-31T09:12:41-03:00 - WARN - State: 01000, Native error: 5703, Message: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Definição do idioma alterada para Português (Brasil).
2021-07-31T09:12:41-03:00 - DEBUG - ODBC column description for column 1: ColumnDescription { name: [72, 97, 98, 105, 108, 105, 116, 97, 100, 111, 82, 105, 115, 99, 111, 83, 97, 99, 97, 100, 111, 95, 112, 101, 115], data_type: Bit, nullability: NoNulls }
2021-07-31T09:12:41-03:00 - DEBUG - ODBC buffer description for column 1: BufferDescription { nullable: true, kind: Bit }
2021-07-31T09:12:41-03:00 - INFO - Memory usage per row is 68 bytes. This excludes memory directly allocated by the ODBC driver.
2021-07-31T09:12:41-03:00 - INFO - Batch size set to 31580641
2021-07-31T09:12:41-03:00 - INFO - Fetched batch 1 with 4097 rows.
2021-07-31T09:12:41-03:00 - DEBUG - Writing column with index 0 and name 'HabilitadoRiscoSacado_pes'.
Error: EOF: unable to put boolean value

Hello René,

now there is a smoking gun. Thanks for running the experiment. I'm currently very busy moving. I'll try to reproduce the issue during train rides. In the meantime you may try to query all the data but set the --batch-size-row option to 4096. Could be a workaround.

Cheers, Markus

Have a nice reproducing test, now. Thanks, I'll keep you in the loop.

Cheers, Markus

Should work now in 0.6.8. Tests pass fine. I misunderstood what the write batch size property in the parquet library meant.

Thanks Markus, working great now!