talariadb/talaria

ORC file produced by Talaria ingestion gives different results when read by the Go ORC reader and the Spark ORC reader

kumarankit1234 opened this issue · 2 comments

Talaria version: 1.1.17

We run two deployments of Talaria: one for ingestion and one for real-time serving of events. The first produces ORC files, which are the input to the second.

The issue we are facing is that an ORC file produced by the first deployment gives different results when read by the second Talaria deployment than when read by Spark.

For example, take 2021-01-04-10-45-00--mcd--b5de9c69-e92d-4369-9278-b594b6c73f27.orc.zip (unzip it before testing). When Talaria reads it for real-time serving, it returns 450 for column val, but Spark SQL returns 0.0 for the same file. The expected value is 0.0. The row in question can be found with the filters event == "bs.calculationServerGBCConfigID" and bch == "8515ec5141fd48758a96f0e6cdb282a1".

Sample code to read the ORC file in Go

`package main

import (
	"fmt"
	"log"

	"github.com/crphang/orc"
)

func main() {
	r, err := orc.Open("2021-01-04-10-45-00--mcd--b5de9c69-e92d-4369-9278-b594b6c73f27.orc")
	if err != nil {
		// Stop here: r is not usable if the file could not be opened.
		log.Fatalf("%+v", err)
	}
	defer r.Close()

	// Create a new Cursor reading the provided columns.
	c := r.Select("req", "event", "val", "bch")

	// Iterate over each stripe in the file.
	for c.Stripes() {

		// Iterate over each row in the stripe.
		for c.Next() {

			// Retrieve a slice of interface values for the current row.
			// Column order follows Select: req, event, val, bch.
			row := c.Row()
			event, bch := row[1], row[3]

			// Print only the row that shows the discrepancy.
			if event == "bs.calculationServerGBCConfigID" && bch == "8515ec5141fd48758a96f0e6cdb282a1" {
				fmt.Printf("%+v\n", row)
			}
		}
	}

	if err := c.Err(); err != nil {
		log.Fatal(err)
	}
}
`
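
When comparing the two readers, it may also help to print the dynamic Go type of each value, since a value printed as 450 or 0 does not show whether the reader decoded it as an integer or a float. The following is a minimal sketch using only the standard library; it assumes rows arrive as []interface{} in the column order of the Select call above. The helper names matchRow and describeRow and the sample rows are hypothetical and are not taken from the real file.

```go
package main

import "fmt"

// matchRow reports whether a row (column order: req, event, val, bch)
// has the given event and bch values.
func matchRow(row []interface{}, event, bch string) bool {
	if len(row) < 4 {
		return false
	}
	return row[1] == event && row[3] == bch
}

// describeRow formats each value together with its dynamic Go type,
// so an int64 and a float64 can be told apart in the output.
func describeRow(row []interface{}) string {
	out := ""
	for i, v := range row {
		if i > 0 {
			out += ", "
		}
		out += fmt.Sprintf("%v (%T)", v, v)
	}
	return out
}

func main() {
	// Hypothetical rows for illustration only.
	rows := [][]interface{}{
		{"r1", "bs.calculationServerGBCConfigID", 0.0, "8515ec5141fd48758a96f0e6cdb282a1"},
		{"r2", "other.event", int64(450), "another-bch"},
	}
	for _, row := range rows {
		if matchRow(row, "bs.calculationServerGBCConfigID", "8515ec5141fd48758a96f0e6cdb282a1") {
			fmt.Println(describeRow(row))
		}
	}
}
```

In the reader loop, describeRow(c.Row()) could replace the plain Printf call for the matching row.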

Sample code to read the file in Spark

`val df = spark.read.orc("2021-01-04-10-45-00--mcd--b5de9c69-e92d-4369-9278-b594b6c73f27.orc")

df.createOrReplaceTempView("stalkerprocessordata")

spark.sql("""
select * from stalkerprocessordata
where event = 'bs.calculationServerGBCConfigID'
and bch = '8515ec5141fd48758a96f0e6cdb282a1'
""").show(false)`

cc @kelindar @crphang

Also, if we read the ORC file with Spark, apply the filters above, and write the result to a new ORC file using Spark, the Go ORC library returns the correct value when reading that new file.

@kumarankit1234 remind me again how you resolved it?