segmentio/parquet-go

Not able to read .parquet files created by Julia and/or Python

Closed this issue · 5 comments

The code below fails to read .parquet files created with Julia and/or Python.
The code follows almost verbatim the read/write sample given in the docs.
When the files are read with the functions below, every row comes back with 0 values in the int64 case and "" in the string case.
For reference, I am attaching two .parquet files written with codec=ZSTD. The same issue occurs with SNAPPY or GZIP.
Thank you for your work and your help.

type FiRowType struct{ x1, x2, x3 int64 }
type FsRowType struct{ x1, x2, x3 string }

func RdFiPrqFile() {
	rows, err := parquet.ReadFile[FiRowType]("fileName_ZSTD.parquet")
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range rows {
		fmt.Printf("%+v\n", c)
	}
}

func RdFsPrqFile() {
	rows, err := parquet.ReadFile[FsRowType]("fileName_ZSTD.parquet")
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range rows {
		fmt.Printf("%+v\n", c)
	}
}

Linked Stack Overflow question. Copying my answer from there:

In Go, variables that start with a lower-case character are not exported, so they cannot be updated from other packages (e.g. segmentio/parquet-go). Try the below:

package main

import (
	"fmt"
	"log"

	"github.com/segmentio/parquet-go"
)

type FiRowType struct {
	X1 int64 `parquet:"x1,optional"`
	X2 int64 `parquet:"x2,optional"`
	X3 int64 `parquet:"x3,optional"`
}

func RdFiFile() {
	rows, err := parquet.ReadFile[FiRowType]("fi_ZSTD.parquet")
	if err != nil {
		log.Fatal(err)
	}
	for i, c := range rows {
		fmt.Printf("%d %+v\n", i, c)
	}
}

func main() {
	RdFiFile()
}

I replicated the issue by running Julia under Docker with the following:

# Run with: docker run -it --rm -v PATHTODATA:/usr/myapp -w /usr/myapp julia julia FILENAME.jl
using Pkg
pkg"add Parquet"
pkg"add DataFrames"
using Parquet;
using DataFrames;
function WrForGo()
    max = 10
    # arrays of size (10,3): ai holds Int64, as holds String
    ai = Array{Int64, 2}(undef, 10, 3)
    as = Array{String, 2}(undef, 10, 3)
    for i = 1:max
        for j = 1:3
            as[i, j] = string(i, pad=2) * "_" * string(j, pad=2)
            ai[i, j] = (j - 1) * 10 + i
        end
    end

    dfi = DataFrame(ai, :auto)
    dfs = DataFrame(as, :auto)
    print(dfi)
    print(dfs)

    Parquet.write_parquet("fi_ZSTD.parquet", dfi; compression_codec = "ZSTD")
    Parquet.write_parquet("fs_ZSTD.parquet", dfs; compression_codec = "ZSTD")

    Parquet.write_parquet("fi_GZIP.parquet", dfi; compression_codec = "GZIP")
    Parquet.write_parquet("fs_GZIP.parquet", dfs; compression_codec = "GZIP")

    Parquet.write_parquet("fi_SNAPPY.parquet", dfi; compression_codec = "SNAPPY")
    Parquet.write_parquet("fs_SNAPPY.parquet", dfs; compression_codec = "SNAPPY")
end
WrForGo()

Thank you for your response. The issue was on my side, not in the API.

Thanks @MattBrittan for chiming in with an answer here!