YotpoLtd/metorikku

how to use schemaJson and get partition columns from s3?

Opened this issue · 4 comments

I have S3 paths like:
s3://mybucket/a/b/c/d/e/col1=x/col2=y/some1.csv
s3://mybucket/a/b/c/d/e/col1=a/col2=b/some2.csv
s3://mybucket/a/b/c/d/e/col1=i/col2=k/some3.csv
s3://mybucket/a/b/c/d/e/col1=t/col2=p/some4.csv

In my input YAML I put s3://mybucket/a/b/c/d/e/, since I want the col1 and col2 values to be part of the DataFrame and I want to read/select from all 4 files in one go. I also set a schemaJson because I want to rename columns in the DataFrame. But my metric YAML complains that it can't find columns col1 and col2. How do I solve this?
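For background: with Hive-style directory names (col1=x/col2=y), Spark's partition discovery parses the key=value segments out of each file path and appends them to the DataFrame as partition columns. A rough pure-Python sketch of that path parsing (illustrative only, not metorikku or Spark code):

```python
import re

def partition_columns(path: str) -> dict:
    """Extract Hive-style key=value partition segments from an object path.

    Mimics how Spark's partition discovery derives col1/col2 from paths
    like .../col1=x/col2=y/some1.csv. Sketch only; Spark's real logic
    also unifies types and values across all discovered paths.
    """
    # Match "/key=value/" directory segments; the filename itself is skipped
    # because it is not followed by a "/".
    return dict(re.findall(r"/([^/=]+)=([^/]+)(?=/)", path))

print(partition_columns("s3://mybucket/a/b/c/d/e/col1=x/col2=y/some1.csv"))
```

So when the input path is the common prefix s3://mybucket/a/b/c/d/e/, col1 and col2 should appear as columns alongside the columns read from the CSV files themselves.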

So your input is:

input1: s3://mybucket/a/b/c/d/e/col1=x/col2=y/some1.csv,s3://mybucket/a/b/c/d/e/col1=a/col2=b/some2.csv,...

?

And then in your select you're getting "can't find col1", right?
Without the schema file it works?
Can you share a sample of the data, and the metric, job, and schema files?

The input is just: s3://mybucket/a/b/c/d/e/
since I want to pick up all the data recursively without hardcoding paths.

In the select I'm getting "can't find col1".
Without the schema it works.

Can you attach your schema file?

@lyogev

{
  "$schema": "smallTestSchema",
  "id": "smallTestSchema",
  "type": "object",
  "name": "/",
  "properties": {
    "row_no": { "id": "smallTestSchema/row_no/", "type": "string", "name": "row_no" },
    "start_date": { "id": "smallTestSchema/start_date/", "type": "string", "name": "start_date" },
    "end_date": { "id": "smallTestSchema/end_date/", "type": "string", "name": "end_date" },
    "unit_name": { "id": "smallTestSchema/unit_name/", "type": "string", "name": "unit_name" }
  }
}
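One thing that stands out: the schema only declares the columns inside the CSV files (row_no, start_date, end_date, unit_name) and leaves out col1 and col2. If the reader treats the supplied schema as the complete DataFrame schema, the partition columns won't survive, which would match the "can't find col1" error. A variant worth trying (I'm not certain of metorikku's exact behavior here; the col1/col2 entries below are my addition, in the same format as the existing ones):

{
  "col1": { "id": "smallTestSchema/col1/", "type": "string", "name": "col1" },
  "col2": { "id": "smallTestSchema/col2/", "type": "string", "name": "col2" }
}

If that works, it suggests the user-supplied schema replaces, rather than merges with, the schema Spark infers from partition discovery.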