Reading files written by parquet-go fails in pandas using pyarrow
Error returned by pyarrow:
Not yet implemented: Unsupported encoding.
I am able to read and write the file in parquet-go. Here is an example of how I am writing it out right now:
// The struct is used for both CSV and Parquet.
type Data struct {
    PropertyKeyID int64  `csv:"PropertyKey_ID" parquet:"PropertyKey_ID"`
    Dealid        int64  `csv:"Deal_id" parquet:"Deal_id"`
    Propertyid    int64  `csv:"Property_id" parquet:"Property_id"`
    Portfolio     string `csv:"Portfolio" parquet:"Portfolio"`
    Propertynb    int64  `csv:"Property_nb" parquet:"Property_nb"`
    Statustx      string `csv:"Status_tx" parquet:"Status_tx"`
    Statusdt      string `csv:"Status_dt" parquet:"Status_dt"`
    IntConveyednb string `csv:"IntConveyed_nb" parquet:"IntConveyed_nb"`
    IntConveytx   string `csv:"IntConvey_tx" parquet:"IntConvey_tx"`
    TransTypetx   string `csv:"TransType_tx" parquet:"TransType_tx"`
}
schema := pq.SchemaOf(new(Data))
pw := pq.NewGenericWriter[Data](pqWriter, schema)
defer pw.Close()
_, err = pw.Write([]Data{data})
if err != nil {
    return errSummary, err
}
If I load the file and read it back in Go, I can decode it into the struct and get all the values. Is there a way to write it out so that pyarrow, or any other Python library, can read it into a dataframe?
I think I figured it out: if you change the encoding for string fields to dict, everything seems to work. Example:
type Data struct {
    PropertyKeyID int64  `csv:"PropertyKey_ID" parquet:"PropertyKey_ID,snappy"`
    Dealid        int64  `csv:"Deal_id" parquet:"Deal_id,snappy"`
    Propertyid    int64  `csv:"Property_id" parquet:"Property_id,snappy"`
    Portfolio     string `csv:"Portfolio" parquet:"Portfolio,dict,snappy"`
    Propertynb    int64  `csv:"Property_nb" parquet:"Property_nb,snappy"`
    Statustx      string `csv:"Status_tx" parquet:"Status_tx,dict,snappy"`
    Statusdt      string `csv:"Status_dt" parquet:"Status_dt,dict,snappy"`
    IntConveyednb string `csv:"IntConveyed_nb" parquet:"IntConveyed_nb,dict,snappy"`
    IntConveytx   string `csv:"IntConvey_tx" parquet:"IntConvey_tx,dict,snappy"`
    TransTypetx   string `csv:"TransType_tx" parquet:"TransType_tx,dict,snappy"`
}
Hello, thanks for reporting the issue!
Parquet is a complex file format. It appears that pyarrow does not support the DELTA_LENGTH_BYTE_ARRAY encoding, which parquet-go uses by default and which is documented as the default in the Parquet specification.
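For context, here is a rough sketch of the idea behind DELTA_LENGTH_BYTE_ARRAY: the lengths of all values are stored first, followed by the concatenated bytes. This is a simplified illustration and not parquet-go's implementation. The helper names are hypothetical, and the real encoding additionally packs the lengths with DELTA_BINARY_PACKED, which is left out here.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitLengthsAndData (hypothetical helper) shows the conceptual layout of
// DELTA_LENGTH_BYTE_ARRAY: every value length first, then the concatenated
// bytes. The real encoding also compresses the lengths with
// DELTA_BINARY_PACKED; this sketch skips that step.
func splitLengthsAndData(values []string) ([]int32, []byte) {
	lengths := make([]int32, 0, len(values))
	var data bytes.Buffer
	for _, v := range values {
		lengths = append(lengths, int32(len(v)))
		data.WriteString(v)
	}
	return lengths, data.Bytes()
}

// joinLengthsAndData (hypothetical helper) reverses the split by slicing the
// concatenated bytes according to the stored lengths.
func joinLengthsAndData(lengths []int32, data []byte) []string {
	values := make([]string, 0, len(lengths))
	offset := 0
	for _, n := range lengths {
		values = append(values, string(data[offset:offset+int(n)]))
		offset += int(n)
	}
	return values
}

func main() {
	lengths, data := splitLengthsAndData([]string{"Hello", "World", "Foobar"})
	fmt.Println(lengths, string(data))             // [5 5 6] HelloWorldFoobar
	fmt.Println(joinLengthsAndData(lengths, data)) // [Hello World Foobar]
}
```

A reader that only knows plain or dictionary encoding cannot interpret this layout, which would be consistent with the "Unsupported encoding" error above.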
Adding the dict tag switches to dictionary encoding; you may also use plain to force plain encoding, which all libraries should support.
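If it helps, here is a minimal sketch of a check that every string field in a struct names either plain or dict in its parquet tag, so none falls back to the writer's default encoding. It uses only the standard library and inspects the tags by reflection; it does not call parquet-go. The trimmed Data struct, the Legacy struct, and the helper name are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// Data is a trimmed version of the struct above. One string column uses
// "plain" and one uses "dict".
type Data struct {
	PropertyKeyID int64  `parquet:"PropertyKey_ID,snappy"`
	Portfolio     string `parquet:"Portfolio,plain,snappy"`
	Statustx      string `parquet:"Status_tx,dict,snappy"`
}

// Legacy (hypothetical) has a string column with no explicit encoding, so
// the writer's default encoding would apply to it.
type Legacy struct {
	Portfolio string `parquet:"Portfolio"`
}

// stringFieldsWithoutEncoding (hypothetical helper) returns the names of
// string fields whose parquet tag lists neither "plain" nor "dict".
// v must be a struct value, not a pointer.
func stringFieldsWithoutEncoding(v any) []string {
	t := reflect.TypeOf(v)
	var missing []string
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if f.Type.Kind() != reflect.String {
			continue
		}
		// The first tag element is the column name; the rest are options.
		opts := strings.Split(f.Tag.Get("parquet"), ",")[1:]
		explicit := false
		for _, o := range opts {
			if o == "plain" || o == "dict" {
				explicit = true
			}
		}
		if !explicit {
			missing = append(missing, f.Name)
		}
	}
	return missing
}

func main() {
	fmt.Println(stringFieldsWithoutEncoding(Data{}))   // []
	fmt.Println(stringFieldsWithoutEncoding(Legacy{})) // [Portfolio]
}
```

A check like this could run in a unit test so that a newly added string column without an explicit encoding is caught before the file reaches pyarrow.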
I appreciate your help looking at this.