issue with dot "." in field name
pwmcintyre opened this issue · 10 comments
hi
I know the trouble with using "." in field names has been briefly mentioned in another issue, but I'm hoping you can help.
Using the Java parquet-tools to inspect the schema of an existing Parquet file I have, I can see that it contains "." in the field names and works fine:
$ docker run -it --rm -v ${PWD}:/data nathanhowell/parquet-tools schema /data/part-00001-d82e5581-88f1-4203-85db-861c8d907350.c000.snappy.parquet
message spark_schema {
optional binary version (STRING);
optional binary meta.format (STRING);
optional binary meta.id (STRING);
}
But when using your tool, I get the following:
$ parquet-tools -cmd schema -file ./part-00001-d82e5581-88f1-4203-85db-861c8d907350.c000.snappy.parquet
----- Go struct -----
Spark_schema struct {
Version *string
Meta46format *string
Meta46id *string
}
----- Json schema -----
{
"Tag": "name=Spark_schema, repetitiontype=REQUIRED",
"Fields": [
{
"Tag": "name=Version, type=UTF8, repetitiontype=OPTIONAL",
"Fields": null
},
{
"Tag": "name=Meta46format, type=UTF8, repetitiontype=OPTIONAL",
"Fields": null
},
{
"Tag": "name=Meta46id, type=UTF8, repetitiontype=OPTIONAL",
"Fields": null
}
]
}
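For reference, the Meta46format and Meta46id names above look as if each "." was replaced by its decimal character code (46) and the first letter was upper-cased to make an exported Go identifier. The sketch below reproduces that mapping. It is only an illustration under that assumption: mangleFieldName is a hypothetical helper, not the function parquet-go actually uses.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// mangleFieldName is a hypothetical helper (not parquet-go's real code).
// It keeps letters, digits, and underscores, replaces every other rune
// with its decimal code point, and upper-cases the first rune.
func mangleFieldName(name string) string {
	var b strings.Builder
	for _, r := range name {
		if r == '_' || unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		} else {
			b.WriteString(strconv.Itoa(int(r)))
		}
	}
	rs := []rune(b.String())
	if len(rs) == 0 {
		return ""
	}
	rs[0] = unicode.ToUpper(rs[0])
	return string(rs)
}

func main() {
	for _, n := range []string{"version", "meta.format", "meta.id"} {
		fmt.Printf("%s -> %s\n", n, mangleFieldName(n))
	}
}
```

With the three column names from the schema above, this prints Version, Meta46format, and Meta46id, matching the tool's output.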
I'm similarly having trouble writing files with "." in the key, e.g. with this struct:
type Event struct {
Version *string `parquet:"name=version, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
MetaID *string `parquet:"name=meta.id, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
}
I get the following error when attempting to read it:
$ docker run -it --rm -v ${PWD}:/data nathanhowell/parquet-tools schema /data/output_test/struct/output.parquet
org.apache.parquet.io.InvalidRecordException: meta not found in message parquet_go_root {
optional binary version (STRING) = 0;
optional binary meta.id (STRING) = 0;
}
any ideas?
hi, @pwmcintyre
Go doesn't allow a dot in an identifier, so you need to give the Go struct field a legal name.
The following is an example of writing and reading a Parquet file with a field whose name contains a ".":
package main
import (
"log"
"github.com/xitongsys/parquet-go-source/local"
"github.com/xitongsys/parquet-go/parquet"
"github.com/xitongsys/parquet-go/reader"
"github.com/xitongsys/parquet-go/writer"
)
type Student struct {
// name is the Parquet field name; inname is the Go variable name
Name string `parquet:"name=student.name, inname=name, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
Age int32 `parquet:"name=age, type=INT32, encoding=PLAIN"`
}
func main() {
var err error
fw, err := local.NewLocalFileWriter("output/flat.parquet")
if err != nil {
log.Println("Can't create local file", err)
return
}
//write
pw, err := writer.NewParquetWriter(fw, new(Student), 4)
if err != nil {
log.Println("Can't create parquet writer", err)
return
}
pw.RowGroupSize = 128 * 1024 * 1024 //128M
pw.PageSize = 8 * 1024 //8K
pw.CompressionType = parquet.CompressionCodec_SNAPPY
num := 10
for i := 0; i < num; i++ {
stu := Student{
Name: "StudentName",
Age: int32(20 + i%5),
}
if err = pw.Write(stu); err != nil {
log.Println("Write error", err)
}
}
if err = pw.WriteStop(); err != nil {
log.Println("WriteStop error", err)
return
}
log.Println("Write Finished")
fw.Close()
///read
fr, err := local.NewLocalFileReader("output/flat.parquet")
if err != nil {
log.Println("Can't open file", err)
return
}
pr, err := reader.NewParquetReader(fr, new(Student), 4)
if err != nil {
log.Println("Can't create parquet reader", err)
return
}
num = int(pr.GetNumRows())
stus := make([]Student, num) //read 10 rows
if err = pr.Read(&stus); err != nil {
log.Println("Read error", err)
}
log.Println(stus)
pr.ReadStop()
fr.Close()
}
running result:
2021/01/28 08:38:46 Write Finished
2021/01/28 08:38:46 [{StudentName 20} {StudentName 21} {StudentName 22} {StudentName 23} {StudentName 24} {StudentName 20} {StudentName 21} {StudentName 22} {StudentName 23} {StudentName 24}]
@xitongsys — appreciate your time, thank you
I have reproduced your result above. However, as in my earlier example, when I try to read this new Parquet file with my existing systems (I'm using AWS Athena), I get an error similar to the one below from the Java parquet-tools:
$ docker run -it --rm -v ${PWD}:/data nathanhowell/parquet-tools schema /data/output.parquet
org.apache.parquet.io.InvalidRecordException: student not found in message parquet_go_root {
required binary student.name (STRING) = 0;
required int32 age = 0;
}
similarly, using another Go implementation, i still cannot read this file:
$ parquet-tool schema output.parquet
panic: line 2: expected ;, got unknown start of token '46' instead
and so i suspect there may be an issue in the handling of the "." in the output file?
hi, @pwmcintyre
Could you provide a sample file like "/data/part-00001-d82e5581-88f1-4203-85db-861c8d907350.c000.snappy.parquet"?
@xitongsys — emailed, and while not sensitive, we would prefer it not shared publicly :)
hi @xitongsys ... did your post about the Java implementation get deleted? Did you find the answer?
hi, @pwmcintyre
I have found the cause. parquet-go uses "." as the field delimiter, which causes this issue. I'm considering how to fix it while keeping backward compatibility.
@xitongsys — thanks for the update, please let me know if there's anything I can help with
hi, @pwmcintyre
Fixed in this pull
Actually, I just use \x01 as the delimiter instead of ".".
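To illustrate why the choice of delimiter matters, here is a minimal sketch, assuming schema paths are stored as a single joined string and split again later. joinPath and splitPath are hypothetical helpers for this example only, not parquet-go functions. With "." as the delimiter, a column named "student.name" is indistinguishable from a nested field "name" inside a group "student", which is consistent with the "student not found" error reported by the Java tool. A control character such as \x01 is very unlikely to appear in a real field name, so the round trip keeps the name intact.

```go
package main

import (
	"fmt"
	"strings"
)

// joinPath and splitPath are hypothetical helpers for illustration only.
func joinPath(delim string, parts ...string) string {
	return strings.Join(parts, delim)
}

func splitPath(delim, path string) []string {
	return strings.Split(path, delim)
}

func main() {
	// Root name and one column whose name contains a dot.
	parts := []string{"parquet_go_root", "student.name"}

	// "." as delimiter: the column name is split into two path elements.
	dotted := splitPath(".", joinPath(".", parts...))
	fmt.Println(len(dotted), dotted) // 3 [parquet_go_root student name]

	// "\x01" as delimiter: the column name survives the round trip.
	safe := splitPath("\x01", joinPath("\x01", parts...))
	fmt.Println(len(safe), safe) // 2 [parquet_go_root student.name]
}
```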
You can find an example file here
@xitongsys — well done! thanks again
I can confirm AWS Athena is happy with this change 👌 (ignore the nulls, it's just a test)
ok, I will close this issue.