YotpoLtd/metorikku

how to use schemaJson and get partition columns from s3?

Opened this issue · 4 comments

I have S3 paths like:
s3://mybucket/a/b/c/d/e/col1=x/col2=y/some1.csv
s3://mybucket/a/b/c/d/e/col1=a/col2=b/some2.csv
s3://mybucket/a/b/c/d/e/col1=i/col2=k/some3.csv
s3://mybucket/a/b/c/d/e/col1=t/col2=p/some4.csv

In my input YAML I put s3://mybucket/a/b/c/d/e/, since I want the col1 and col2 values to be part of the DataFrame and I want to read/select from all 4 files in one go. I also set a schemaJson because I want to rename columns in the DataFrame. But my metric YAML complains that it can't find columns col1 and col2. How do I solve this?
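For background: with Hive-style directory names (col1=x/col2=y), Spark's partition discovery parses the key=value segments out of each file path and appends them to the DataFrame as partition columns. A rough pure-Python sketch of that path parsing (illustrative only, not metorikku or Spark code):

```python
import re

def partition_columns(path: str) -> dict:
    """Extract Hive-style key=value partition segments from an object path.

    Mimics how Spark's partition discovery derives col1/col2 from paths
    like .../col1=x/col2=y/some1.csv. Sketch only; Spark's real logic
    also unifies types and values across all discovered paths.
    """
    # Match "/key=value/" directory segments; the filename itself is skipped
    # because it is not followed by a "/".
    return dict(re.findall(r"/([^/=]+)=([^/]+)(?=/)", path))

print(partition_columns("s3://mybucket/a/b/c/d/e/col1=x/col2=y/some1.csv"))
```

So when the input path is the common prefix s3://mybucket/a/b/c/d/e/, col1 and col2 should appear as columns alongside the columns read from the CSV files themselves.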

So your input is:

input1: s3://mybucket/a/b/c/d/e/col1=x/col2=y/some1.csv,s3://mybucket/a/b/c/d/e/col1=a/col2=b/some2.csv,...

?

And then in your select you're getting "can't find col1", right?
Without the schema file it works?
Can you share a sample of the data, and the metric, job, and schema files?

The input is just: s3://mybucket/a/b/c/d/e/
since I want to pick up all the data recursively without hardcoding paths.

In the select I'm getting "can't find col1".
Without the schema it works.

Can you attach your schema file?

@lyogev

{
  "$schema": "smallTestSchema",
  "id": "smallTestSchema",
  "type": "object",
  "name": "/",
  "properties": {
    "row_no": { "id": "smallTestSchema/row_no/", "type": "string", "name": "row_no" },
    "start_date": { "id": "smallTestSchema/start_date/", "type": "string", "name": "start_date" },
    "end_date": { "id": "smallTestSchema/end_date/", "type": "string", "name": "end_date" },
    "unit_name": { "id": "smallTestSchema/unit_name/", "type": "string", "name": "unit_name" }
  }
}
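One thing that stands out: the schema only declares the columns inside the CSV files (row_no, start_date, end_date, unit_name) and leaves out col1 and col2. If the reader treats the supplied schema as the complete DataFrame schema, the partition columns won't survive, which would match the "can't find col1" error. A variant worth trying (I'm not certain of metorikku's exact behavior here; the col1/col2 entries below are my addition, in the same format as the existing ones):

{
  "col1": { "id": "smallTestSchema/col1/", "type": "string", "name": "col1" },
  "col2": { "id": "smallTestSchema/col2/", "type": "string", "name": "col2" }
}

If that works, it suggests the user-supplied schema replaces, rather than merges with, the schema Spark infers from partition discovery.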