xitongsys/parquet-go

Unexpected read result after writing a date as INT96 parquet type

Mort4lis opened this issue · 2 comments

Hi everyone! I have a problem writing and then reading a parquet file.

Here is an example: I create a JSON writer with a schema that has one INT96 column and write a single row containing the current date. Before writing, I convert the time.Time to a string by calling types.TimeToINT96. But when I read the output parquet file back, I get the wrong result.

If I replace the JSON writer with the usual ParquetWriter, it works correctly, but I need to write JSON.
I will be glad for any help!

Code:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"time"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/reader"
	"github.com/xitongsys/parquet-go/types"
	"github.com/xitongsys/parquet-go/writer"
)

type Value struct {
	OrderDate string `json:"order_date" parquet:"name=order_date, type=INT96"`
}

const writeJSONSchema = `
{
  "Tag": "name=Schema, repetitiontype=REQUIRED",
  "Fields": [
    {"Tag": "name=order_date, type=INT96, repetitiontype=OPTIONAL"}
  ]
}
`

func main() {
	now := time.Now()

	fw, err := local.NewLocalFileWriter("output.parquet")
	if err != nil {
		log.Fatalf("Can't create file: %v", err)
	}

	pw, err := writer.NewJSONWriter(writeJSONSchema, fw, 1)
	if err != nil {
		log.Fatalf("Can't create parquet writer: %v", err)
	}

	val := Value{OrderDate: types.TimeToINT96(now)}

	valBytes, err := json.Marshal(val)
	if err != nil {
		log.Fatalf("Can't marshal value: %v", err)
	}

	if err = pw.Write(valBytes); err != nil {
		log.Fatalf("Can't write value: %v", err)
	}

	if err = pw.WriteStop(); err != nil {
		log.Fatalf("Can't stop write: %v", err)
	}

	if err = fw.Close(); err != nil {
		log.Fatalf("Can't close file: %v", err)
	}

	fr, err := local.NewLocalFileReader("output.parquet")
	if err != nil {
		log.Fatalf("Can't read file: %v", err)
	}

	pr, err := reader.NewParquetReader(fr, new(Value), 1)
	if err != nil {
		log.Fatalf("Can't create parquet reader: %v", err)
	}

	num := int(pr.GetNumRows())

	vals := make([]Value, num)

	if err = pr.Read(&vals); err != nil {
		log.Fatalf("Read error: %v", err)
	}

	orderDate := types.INT96ToTime(vals[0].OrderDate)

	// Wrong OrderDate
	fmt.Printf("Expected = %v\n", now)
	fmt.Printf("Got = %v\n", orderDate)

	pr.ReadStop()
	_ = fr.Close()
}

First of all, INT96 is deprecated, consider using something else if you can.

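The problem is that INT96 is stored internally as a string, even though its bytes are not a valid UTF-8 string. When json.Marshal serializes that string, it cannot encode the invalid bytes and substitutes the Unicode replacement character (U+FFFD) for each of them, so the original value is lost before it ever reaches the writer.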
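The effect can be reproduced with the standard library alone. The following is a minimal sketch, not the library's code: `roundTrip` is a hypothetical helper, and the twelve bytes are arbitrary stand-ins for an INT96 value rather than a real output of types.TimeToINT96.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// roundTrip (hypothetical helper) marshals s as a JSON string and
// unmarshals the result back into a Go string.
func roundTrip(s string) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	var out string
	if err := json.Unmarshal(b, &out); err != nil {
		return "", err
	}
	return out, nil
}

func main() {
	// Twelve arbitrary bytes standing in for an INT96 value (assumption:
	// not a real timestamp). Several of them are not valid UTF-8.
	raw := string([]byte{0x00, 0x10, 0xff, 0xfe, 0x80, 0x01, 0x02, 0x03, 0x8d, 0x3e, 0x25, 0x00})

	got, err := roundTrip(raw)
	if err != nil {
		log.Fatalf("round trip failed: %v", err)
	}

	// Each invalid byte comes back as U+FFFD (bytes ef bf bd), so the
	// string no longer matches the original and is no longer 12 bytes long.
	fmt.Printf("before: % x (len %d)\n", raw, len(raw))
	fmt.Printf("after:  % x (len %d)\n", got, len(got))
	fmt.Println("unchanged:", raw == got)
}
```

Note that json.Marshal returns no error here; the substitution is silent, which is why the write succeeds and only the read shows the wrong date.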

This is related to #434 and #321; both are problems caused by the internal representation of []byte as string.

Thank you for the reply! Yes, indeed, I store the Julian date as its byte representation in an INT96 column, and those bytes are not valid Unicode code points.