Not able to read .parquet files created by Julia and/or Python
Closed this issue · 5 comments
The code below fails to read .parquet files created with Julia and/or Python.
The code follows almost verbatim the sample given in the docs for reading/writing a file.
When the files are read by the functions below, the rows contain 0 values in the int64 case and "" in the string case.
For reference, I am attaching two .parquet files written with codec=ZSTD. The same issue occurs with SNAPPY or GZIP.
Thank you for your work and your help.
type FiRowType struct{ x1, x2, x3 int64 }
type FsRowType struct{ x1, x2, x3 string }

func RdFiPrqFile() {
	rows, err := parquet.ReadFile[FiRowType]("fileName_ZSTD.parquet")
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range rows {
		fmt.Printf("%+v\n", c)
	}
}

func RdFsPrqFile() {
	rows, err := parquet.ReadFile[FsRowType]("fileName_ZSTD.parquet")
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range rows {
		fmt.Printf("%+v\n", c)
	}
}
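A quick way to see why the fields come back as zero values is to check how Go's reflection-based decoders treat unexported (lower-case) struct fields. A minimal sketch using encoding/json from the standard library, which follows the same visibility rule as parquet-go (the struct and field names here are illustrative only):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lowerRow has only unexported fields: invisible to any
// reflection-based decoder outside this package's control.
type lowerRow struct {
	x1 int64
}

// upperRow exports the field and uses a tag to map it back to
// the lower-case name used in the encoded data.
type upperRow struct {
	X1 int64 `json:"x1"`
}

// decodeBoth unmarshals the same document into both struct shapes
// and returns the two resulting x1 values.
func decodeBoth() (int64, int64) {
	data := []byte(`{"x1": 42}`)
	var lo lowerRow
	var hi upperRow
	_ = json.Unmarshal(data, &lo) // lo.x1 stays 0: the field is unexported
	_ = json.Unmarshal(data, &hi) // hi.X1 is set to 42 via the tag
	return lo.x1, hi.X1
}

func main() {
	a, b := decodeBoth()
	fmt.Println(a, b) // prints: 0 42
}
```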
Linked Stack Overflow question. Copying my answer from there:
In Go, struct fields that start with a lower-case character are not exported, so they cannot be set by other packages (e.g. segmentio/parquet-go). Try the below:
package main

import (
	"fmt"
	"log"

	"github.com/segmentio/parquet-go"
)

type FiRowType struct {
	X1 int64 `parquet:"x1,optional"`
	X2 int64 `parquet:"x2,optional"`
	X3 int64 `parquet:"x3,optional"`
}

func RdFiFile() {
	rows, err := parquet.ReadFile[FiRowType]("fi_ZSTD.parquet")
	if err != nil {
		log.Fatal(err)
	}
	for i, c := range rows {
		fmt.Printf("%d %+v\n", i, c)
	}
}

func main() {
	RdFiFile()
}
I replicated the issue by running Julia under Docker with the following:
# Command to run: docker run -it --rm -v PATHTODATA:/usr/myapp -w /usr/myapp julia julia FILENAME.jl
using Pkg
pkg"add Parquet"
pkg"add DataFrames"
using Parquet;
using DataFrames;
function WrForGo()
    min = 1
    max = 10
    # arrays of size (10,3): ai is Int64 and as is String
    ai = Array{Int64, 2}(undef, 10, 3)
    as = Array{String, 2}(undef, 10, 3)
    for i = 1:max
        for j = 1:3
            as[i, j] = string(i, pad = 2) * "_" * string(j, pad = 2)
            ai[i, j] = (j - 1) * 10 + i
        end
    end
    dfi = DataFrame(ai, :auto); dfs = DataFrame(as, :auto)
    print(dfi); print(dfs)
    Parquet.write_parquet("fi_ZSTD.parquet", compression_codec = "ZSTD", dfi)
    Parquet.write_parquet("fs_ZSTD.parquet", compression_codec = "ZSTD", dfs)
    Parquet.write_parquet("fi_GZIP.parquet", compression_codec = "GZIP", dfi)
    Parquet.write_parquet("fs_GZIP.parquet", compression_codec = "GZIP", dfs)
    Parquet.write_parquet("fi_SNAPPY.parquet", compression_codec = "SNAPPY", dfi)
    Parquet.write_parquet("fs_SNAPPY.parquet", compression_codec = "SNAPPY", dfs)
end
WrForGo()
Thank you for your response. The issue was mine, not the API's.
Thanks @MattBrittan for chiming in with an answer here!