Unexpected read result after write date as INT96 parquet type
Mort4lis opened this issue · 2 comments
Hi everyone! I have a problem with writing and reading a Parquet file.
Let's take a look at an example: I create a JSON writer and a schema with one column (INT96) and try to write one row with the current date. Before writing, I convert the time.Time to a string by calling types.TimeToINT96. But after reading the output Parquet file, I get a wrong result.
If I replace the JSONWriter with the usual ParquetWriter, then it works correctly, but I need to write JSON.
I will be glad for any help!
Code:
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"time"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/reader"
	"github.com/xitongsys/parquet-go/types"
	"github.com/xitongsys/parquet-go/writer"
)

type Value struct {
	OrderDate string `json:"order_date" parquet:"name=order_date, type=INT96"`
}

const writeJSONSchema = `
{
	"Tag": "name=Schema, repetitiontype=REQUIRED",
	"Fields": [
		{"Tag": "name=order_date, type=INT96, repetitiontype=OPTIONAL"}
	]
}
`

func main() {
	now := time.Now()

	fw, err := local.NewLocalFileWriter("output.parquet")
	if err != nil {
		log.Fatalf("Can't create file: %v", err)
	}

	pw, err := writer.NewJSONWriter(writeJSONSchema, fw, 1)
	if err != nil {
		log.Fatalf("Can't create parquet writer: %v", err)
	}

	// The INT96 value is a raw 12-byte string; json.Marshal is where it gets altered.
	val := Value{OrderDate: types.TimeToINT96(now)}
	valBytes, err := json.Marshal(val)
	if err != nil {
		log.Fatalf("Can't marshal value: %v", err)
	}

	if err = pw.Write(valBytes); err != nil {
		log.Fatalf("Can't write value: %v", err)
	}
	if err = pw.WriteStop(); err != nil {
		log.Fatalf("Can't stop write: %v", err)
	}
	if err = fw.Close(); err != nil {
		log.Fatalf("Can't close file: %v", err)
	}

	fr, err := local.NewLocalFileReader("output.parquet")
	if err != nil {
		log.Fatalf("Can't read file: %v", err)
	}

	pr, err := reader.NewParquetReader(fr, new(Value), 1)
	if err != nil {
		log.Fatalf("Can't create parquet reader: %v", err)
	}

	num := int(pr.GetNumRows())
	vals := make([]Value, num)
	if err = pr.Read(&vals); err != nil {
		log.Fatalf("Read error: %v", err)
	}

	orderDate := types.INT96ToTime(vals[0].OrderDate)

	// Wrong OrderDate
	fmt.Printf("Expected = %v\n", now)
	fmt.Printf("Got = %v\n", orderDate)

	pr.ReadStop()
	_ = fr.Close()
}
First of all, INT96 is deprecated; consider using something else if you can.
The problem is that INT96 is stored as a string internally, even though it is not a valid UTF-8 string, so when Marshal tries to serialize it to a UTF-8 string, it cannot keep the invalid bytes and substitutes the Unicode replacement character for them.
This is related to #434 and #321; both are problems caused by the internal representation of []byte as string.
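The effect described above can be reproduced with the standard library alone. The following is a minimal sketch, not code from parquet-go: the `row` type, the `roundTrip` helper, and the sample bytes are hypothetical and only stand in for an INT96 value carried in a Go string.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

// row is a hypothetical stand-in for a struct with a string field
// that actually holds raw binary data.
type row struct {
	OrderDate string `json:"order_date"`
}

// roundTrip marshals s to JSON and unmarshals it back, returning
// whatever survives the trip.
func roundTrip(s string) (string, error) {
	encoded, err := json.Marshal(row{OrderDate: s})
	if err != nil {
		return "", err
	}
	var decoded row
	if err := json.Unmarshal(encoded, &decoded); err != nil {
		return "", err
	}
	return decoded.OrderDate, nil
}

func main() {
	// Twelve arbitrary bytes, some of which are not valid UTF-8.
	raw := string([]byte{0x00, 0xe4, 0x0b, 0x54, 0x02, 0x00, 0x00, 0x00, 0x8d, 0x85, 0x25, 0x00})

	got, err := roundTrip(raw)
	if err != nil {
		panic(err)
	}

	fmt.Println("input is valid UTF-8:", utf8.ValidString(raw))
	fmt.Println("bytes preserved:     ", got == raw)
	fmt.Printf("before: % x\n", raw)
	fmt.Printf("after:  % x\n", got)
}
```

json.Marshal reports no error here; it silently replaces each invalid byte with U+FFFD, so the corruption only shows up when the value is read back.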
Thank you for the reply, man! Yes, indeed, I store the Julian date as a byte representation in the INT96 column type, and these bytes are not Unicode code points.