Strange utf-8 error
Describe the bug
I've imported a large Parquet file. In Rust I'm getting this error:

```
FromUtf8Error(FromUtf8Error { bytes: [48, 48, 205, 204, 204, 204, 204, 204, 244, 63, 1, 110, 154, 153, 153, 153, 153, 153, 40, 64, 51, 51, 51, 51, 51, 179, 75, 64, 0, 0, 0, 0, 0, 0, 36, 64, 3, 3, 111, 119, 110, 5, 49, 50, 50, 51, 53, 205, 204, 204, 204, 204, 204, 244], error: Utf8Error { valid_up_to: 2, error_len: Some(1) } })
```

I have a struct type with a `Vec<u8>` field, and calling `String::from_utf8(field)` on that field produces this error.
I tried the `String` type for the field instead, but then I get a cryptic message about invalid encoding from deep inside `clickhouse-rs`. Anyway, the `clickhouse-cli` has no problems with this field, and neither does the JS ClickHouse client.
Steps to reproduce
I'm not able to provide detailed data; however, there are no non-ASCII characters in that field.
When I invoke `SELECT 'x' AS myfield` I get the correct `"x"` string, but when I do `SELECT field` (where the column is of `String` type), I get that UTF-8 error.
Expected behaviour
No error.
Configuration
Environment
- Client version: 0.13.1
- OS: Linux (docker)
ClickHouse server
- ClickHouse Server version: 24.10.2
- `CREATE TABLE` statements for tables involved: well, `DESCRIBE mytable` returns `myfield` as `String`
BTW, it seems I'm able to make a similar query with the Go client, and it works fine (both with native and HTTP connections)...
Can you provide your `struct` and your `client.query()` call?
Also, I don't understand what you mean by "I've imported a large Parquet file" and how it's related to the problem described.
Also, if you're sure that that column contains valid UTF-8, you should prefer `myfield: String` over `myfield: Vec<u8>` (or use `serde_bytes` as described in the documentation to avoid the performance overhead).
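A sketch of what that might look like (assuming the `clickhouse` crate's `Row` derive together with the `serde_bytes` crate; the field names here are illustrative, not from this thread's schema):

```rust
use clickhouse::Row;
use serde::Deserialize;

#[derive(Debug, Row, Deserialize)]
struct MyRow {
    // Deserialized as a ClickHouse String column, validated as UTF-8.
    name: String,
    // Raw bytes with no UTF-8 validation; `serde_bytes` also avoids the
    // per-element overhead of deserializing Vec<u8> element by element.
    #[serde(with = "serde_bytes")]
    blob: Vec<u8>,
}
```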
My struct is like this:
```rust
#[derive(Debug, Row, Deserialize)]
struct CHObservation {
    // station_id: String,
    station_id: Vec<u8>,
    value: f64,
    // name: String,
    name: Vec<u8>,
    latitude: f64,
    longitude: f64,
    height: f64,
    score: f64,
    // owner_id: String,
    owner_id: Vec<u8>,
}
```
I tried `String` but it didn't work either. The advantage of `Vec<u8>` seems to be that I can postpone calling `String::from_utf8` to my own code and so get insight into the bytes the client actually received. Those bytes don't make sense to me, though; the `stationId` strings are short ASCII codes.
My query is of this form:
```rust
let q = r#"
    SELECT observations_ts.stationId AS station_id,
           1.3 AS value,
           'n' AS name,
           12.3 AS latitude,
           55.4 AS longitude,
           10.0 AS height,
           3 AS score,
           'own' AS owner_id
    FROM observations_ts
    WHERE ts BETWEEN (toDateTime(?) - toIntervalSecond(?)) AND (toDateTime(?) + toIntervalSecond(?))
"#;
let mut rows = db
    .query(&q)
    // .bind(Identifier(param_name))
    .bind(ti.t.to_string())
    .bind(ti.d.num_seconds())
    .bind(ti.t.to_string())
    .bind(ti.d.num_seconds())
    // .fetch::<(Vec<u8>, f64, Vec<u8>, f64, f64, f64, f64, Vec<u8>)>()?;
    .fetch::<CHObservation>()?;
```
When I specify some dummy string `'x'` instead of `observations_ts.stationId` in the query, it works.
Mentioning the Parquet file isn't directly related, but I had a thought that there might be some issue encoding/decoding strings (the Parquet file itself is generated as a dump from a DuckDB table; I can read that Parquet file in DuckDB just fine).
Also, as you can see from the commented-out code above, I tried using a plain tuple instead of a custom `Row` derive, since all the examples I've seen pair `Row` with `SELECT ?fields`, while I use a more complex query that doesn't directly map field names to SQL columns, and I thought the fields might be deserialized into the wrong places. But that didn't work either.
OK, I created a side project with just the clickhouse crate and that query, and played with it a bit, adding fields one by one. It seems to work now, but if you mistype a field's type (e.g. declare an `f64` column as `u8`), you get UTF-8 encoding errors. This is quite misleading...
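That outcome is consistent with a row format that carries no per-field type tags: if the struct declares a field with the wrong width, the reader's byte offset shifts, and a later length-prefixed string field starts reading inside numeric data, which is rarely valid UTF-8. A self-contained sketch (this hand-rolls a RowBinary-like layout purely for illustration; it is not the crate's actual decoder):

```rust
fn main() {
    // Encode one row as: String "ab" (1-byte length prefix + bytes), then f64 1.3.
    let mut wire = Vec::new();
    wire.push(2u8);
    wire.extend_from_slice(b"ab");
    wire.extend_from_slice(&1.3f64.to_le_bytes());

    // Correct schema: read the length, then the payload. Valid UTF-8.
    let len = wire[0] as usize;
    assert_eq!(String::from_utf8(wire[1..1 + len].to_vec()).unwrap(), "ab");

    // Mistyped schema: treat the first field as a u8, then read a "string".
    // Byte 'a' (0x61 = 97) is misread as the length, and the "payload"
    // runs into the f64 bytes (.. CC CC F4 3F), which are not valid UTF-8.
    let bogus_len = wire[1] as usize;
    let end = (2 + bogus_len).min(wire.len());
    let err = String::from_utf8(wire[2..end].to_vec()).unwrap_err();
    println!("misaligned read fails: {:?}", err.utf8_error());
}
```

Note that the bytes `205, 204, 204, ...` in the original error report are exactly the little-endian fraction bytes of an `f64`, which fits this explanation.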