ClickHouse/clickhouse-rs

Strange utf-8 error


Describe the bug

I've imported a large Parquet file. In Rust I'm getting this error:

FromUtf8Error(FromUtf8Error { bytes: [48, 48, 205, 204, 204, 204, 204, 204, 244, 63, 1, 110, 154, 153, 153, 153, 153, 153, 40, 64, 51, 51, 51, 51, 51, 179, 75, 64, 0, 0, 0, 0, 0, 0, 36, 64, 3, 3, 111, 119, 110, 5, 49, 50, 50, 51, 53, 205, 204, 204, 204, 204, 204, 244], error: Utf8Error { valid_up_to: 2, error_len: Some(1) } })

I have a struct with a Vec<u8> field, and calling String::from_utf8(field) on that field yields this error.
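For context, this is roughly the conversion step (the byte vector here is a stand-in; the real one comes from the fetched row):

fn main() {
    // Stand-in for the Vec<u8> field fetched by the client.
    let field: Vec<u8> = vec![48, 48, 205, 204];
    match String::from_utf8(field) {
        Ok(s) => println!("ok: {s}"),
        // FromUtf8Error keeps the original bytes, which is how the
        // dump above was obtained.
        Err(e) => eprintln!("{e:?} valid_up_to={}", e.utf8_error().valid_up_to()),
    }
}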

I tried String for the field, but then I get a cryptic message about invalid encoding from deep inside clickhouse-rs. Anyway, clickhouse-cli has no problem with this field, and neither does the JS ClickHouse client.

Steps to reproduce

I'm not able to provide the actual data; however, there are no non-ASCII characters in that field.

When I run SELECT 'x' AS myfield I get the correct "x" string, but when I do SELECT field (where the column is of type String), I get the UTF-8 error above.

Expected behaviour

No error.

Configuration

Environment

  • Client version: 0.13.1
  • OS: Linux (docker)

ClickHouse server

  • ClickHouse Server version: 24.10.2

  • CREATE TABLE statements for tables involved: well, DESCRIBE mytable returns myfield as String

BTW, I'm able to run a similar query with the Go client, and it works fine (over both the native and HTTP connections)...

loyd commented

Can you provide your struct and client.query() call?

Also, I don't understand what you mean by "I've imported a large Parquet file" and how it relates to the problem described.

Also, if you're sure that the column contains valid UTF-8, you should prefer myfield: String over myfield: Vec<u8> (or use serde_bytes as described in the documentation to avoid the performance overhead).
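For reference, the serde_bytes pattern from the docs looks roughly like this (struct and field names here are placeholders; serde_bytes is a separate crate dependency):

use clickhouse::Row;
use serde::Deserialize;

#[derive(Row, Deserialize)]
struct MyRow {
    // Deserialized as a raw byte blob without per-element overhead;
    // no UTF-8 validation is performed on this field.
    #[serde(with = "serde_bytes")]
    station_id: Vec<u8>,
}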

My struct is like this:

#[derive(Debug, Row, Deserialize)]
struct CHObservation {
    // station_id: String,
    station_id: Vec<u8>,
    value: f64,
    // name: String,
    name: Vec<u8>,
    latitude: f64,
    longitude: f64,
    height: f64,
    score: f64,
    // owner_id: String,
    owner_id: Vec<u8>,
}

I tried String as well, but it didn't work either. The advantage of Vec<u8> seems to be that I can postpone calling String::from_utf8 in my own code, which gives me insight into the exact bytes the client received. Those bytes make no sense to me, though; the stationId strings are short ASCII codes.

My query is of this form:

 let q = r#"
       SELECT observations_ts.stationId AS station_id,
              1.3 AS value,
              'n' AS name,
              12.3 AS latitude,
              55.4 AS longitude,
              10.0 AS height,
              3 AS score,
              'own' AS owner_id
          FROM observations_ts
            WHERE ts BETWEEN (toDateTime(?) - toIntervalSecond(?)) AND (toDateTime(?) + toIntervalSecond(?))
       "#;
    let mut rows = db
        .query(&q)
        // .bind(Identifier(param_name))
        .bind(ti.t.to_string())
        .bind(ti.d.num_seconds())
        .bind(ti.t.to_string())
        .bind(ti.d.num_seconds())
        // .fetch::<(Vec<u8>, f64, Vec<u8>, f64, f64, f64, f64, Vec<u8>)>()?;
        .fetch::<CHObservation>()?;

When I put a dummy string like 'x' in place of observations_ts.stationId in the query, it works.

The Parquet file isn't directly related; I mentioned it because I suspected some issue with string encoding/decoding (the Parquet file is a dump of a DuckDB table, and I can read it back in DuckDB just fine).

Also, as you can see from the commented-out code above, I tried a plain tuple instead of the custom Row derive. All the examples I've seen use Row together with SELECT ?fields (roughly the form sketched below), while my query is more complex and doesn't map struct field names directly to SQL columns, so I suspected the fields might be deserialized in the wrong places. That didn't work either.
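For comparison, the ?fields form those examples use is roughly this (table and struct here are illustrative, not my real schema):

use clickhouse::{Client, Row};
use serde::Deserialize;

#[derive(Row, Deserialize)]
struct StationRow {
    station_id: String,
    value: f64,
}

async fn fetch_rows(client: &Client) -> Result<Vec<StationRow>, clickhouse::error::Error> {
    // `?fields` expands to the struct's field names ("station_id, value"),
    // so this form only works when struct fields map 1:1 to column names.
    client
        .query("SELECT ?fields FROM some_table")
        .fetch_all::<StationRow>()
        .await
}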

duplicate of #173

OK, I created a side project with just clickhouse and this query and played with it a bit, adding fields one by one. It works now, but if you get a field's type wrong (e.g. mix up f64 and u8), you get UTF-8 encoding errors. This is quite misleading...
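For what it's worth, the query above contains exactly such a mismatch: ClickHouse types the literal 3 AS score as UInt8, while the struct declares score: f64, so the RowBinary stream gets misaligned and a later length byte lands in the middle of other data. That reading also fits the reported error bytes; a quick check (my own interpretation of the dump, not verified against the server):

fn main() {
    // The 8-byte run right after the leading "00" (48, 48) in the
    // reported error bytes.
    let run = [205u8, 204, 204, 204, 204, 204, 244, 63];
    // Read as a little-endian f64 this is exactly 1.3 -- the `value`
    // column from the query -- so the failing "string" is really a
    // misaligned slice of the row stream, not corrupted text.
    assert_eq!(f64::from_le_bytes(run), 1.3);
}

Declaring score: u8 in the struct (or casting in SQL, e.g. toFloat64(3) AS score) should keep the struct and the columns in sync.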