segmentio/parquet-go

Reading files written by parquet-go fails in pandas using pyarrow


Error returned by pyarrow:

Not yet implemented: Unsupported encoding.

I am able to read and write the file with parquet-go; here is how I am writing it out right now:

// The struct is used for both CSV and parquet
type Data struct {
	PropertyKeyID                       int64   `csv:"PropertyKey_ID" parquet:"PropertyKey_ID"`
	Dealid                              int64   `csv:"Deal_id" parquet:"Deal_id"`
	Propertyid                          int64   `csv:"Property_id" parquet:"Property_id"`
	Portfolio                           string  `csv:"Portfolio" parquet:"Portfolio"`
	Propertynb                          int64   `csv:"Property_nb" parquet:"Property_nb"`
	Statustx                            string  `csv:"Status_tx" parquet:"Status_tx"`
	Statusdt                            string  `csv:"Status_dt" parquet:"Status_dt"`
	IntConveyednb                       string  `csv:"IntConveyed_nb" parquet:"IntConveyed_nb"`
	IntConveytx                         string  `csv:"IntConvey_tx" parquet:"IntConvey_tx"`
	TransTypetx                         string  `csv:"TransType_tx" parquet:"TransType_tx"`
}

// Build the schema from the struct tags and stream the rows to pqWriter.
schema := pq.SchemaOf(new(Data))
pw := pq.NewGenericWriter[Data](pqWriter, schema)
defer pw.Close()

_, err = pw.Write([]Data{data})
if err != nil {
	return errSummary, err
}

If I load the file back up and read it in Go, I can read it back into the struct and get all the values (a rough sketch of the read path is below). Is there a way to write the file out so that pyarrow, or any other Python library, can read it into a dataframe?
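Roughly how I read it back for that round trip (just a sketch; "data.parquet" stands in for my actual path):

// Sketch only: ReadFile decodes every row of the file back into the tagged struct.
rows, err := pq.ReadFile[Data]("data.parquet")
if err != nil {
	return errSummary, err
}
// rows now holds the decoded Data values.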

I think I figured it out: if you change the encoding for the string columns to dict (I also added snappy compression to every column), everything seems to work. Example:

type Data struct {
	PropertyKeyID                       int64   `csv:"PropertyKey_ID" parquet:"PropertyKey_ID,snappy"`
	Dealid                              int64   `csv:"Deal_id" parquet:"Deal_id,snappy"`
	Propertyid                          int64   `csv:"Property_id" parquet:"Property_id,snappy"`
	Portfolio                           string  `csv:"Portfolio" parquet:"Portfolio,dict,snappy"`
	Propertynb                          int64   `csv:"Property_nb" parquet:"Property_nb,snappy"`
	Statustx                            string  `csv:"Status_tx" parquet:"Status_tx,dict,snappy"`
	Statusdt                            string  `csv:"Status_dt" parquet:"Status_dt,dict,snappy"`
	IntConveyednb                       string  `csv:"IntConveyed_nb" parquet:"IntConveyed_nb,dict,snappy"`
	IntConveytx                         string  `csv:"IntConvey_tx" parquet:"IntConvey_tx,dict,snappy"`
	TransTypetx                         string  `csv:"TransType_tx" parquet:"TransType_tx,dict,snappy"`
}

Hello, thanks for reporting the issue!

Parquet is a complex file format. It appears that pyarrow does not support the DELTA_LENGTH_BYTE_ARRAY encoding, which parquet-go uses by default and which is documented as the default in the Parquet specification.

Adding the dict tag switches the column to dictionary encoding; you may also use the plain tag to force PLAIN encoding, which all libraries should support.
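For example, just a sketch on one of your string columns (field names copied from your struct):

type Data struct {
	// ...
	// plain forces PLAIN encoding for this column; snappy still selects the compression codec.
	Portfolio string `csv:"Portfolio" parquet:"Portfolio,plain,snappy"`
	// ...
}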

I appreciate your help looking at this.