Don't use object files as app input and outputs

Question

Closed this issue 8 years ago · 7 comments

The options are TSV, CSV, or Parquet (or staying with object files). We need some interoperability with other formats on HDFS; in our case, for example, graph building reads from TSV. What should the standard interface to the apps be? Note that you can always build your own apps; the bundled ones are only there for reference, convenience, and testing.

/cc @maxjakob

Answer 1 · 2017-04-18T14:33:47.000Z

This is related to the example apps. I think the level of standardization does not need to be high there.

The only thing is that object files force one to pull in this package for further processing; the other suggestions don't.
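
To make the dependency point concrete, here is a rough sketch rather than code from this repo. `Edge` is a hypothetical case class standing in for whatever record type the package defines, and both HDFS paths are made up. Object files contain Java-serialized records, so the reading job needs the record class on its classpath; Parquet stores its schema alongside the data.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; stands in for a class defined in this package.
case class Edge(src: Long, dst: Long, weight: Double)

object ObjectFileVsParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("object-file-vs-parquet").getOrCreate()
    val sc = spark.sparkContext

    // Object files hold Java-serialized records, so this job must have the
    // Edge class (i.e. this package) available to deserialize them.
    val edges = sc.objectFile[Edge]("hdfs:///tmp/edges.obj") // hypothetical path
    println(edges.count())

    // Parquet is self-describing: the schema travels with the data, so a
    // downstream job can load it without depending on Edge at all.
    val df = spark.read.parquet("hdfs:///tmp/edges.parquet") // hypothetical path
    df.printSchema()

    spark.stop()
  }
}
```

TSV and CSV share that property with Parquet, at the cost of losing column types.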

Answer 2 · 2017-04-18T15:20:37.000Z

Decided this wasn't necessary.

Answer 3 · 2017-04-19T08:53:40.000Z

Yeah, I see the argument. I will reopen this for the 1.0.0 release then.

Answer 4 · 2017-04-28T09:31:22.000Z

Quick reference: http://www.agildata.com/apache-spark-2-0-api-improvements-rdd-dataframe-dataset-sql/

Answer 5 · 2017-05-11T09:01:08.000Z

OK, the plan is to read from TSV by default in the graph-building driver and then use Parquet for all output and intermediate data. Working on this next.
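
Roughly, the plan could look like the sketch below. This is only an illustration under assumptions, not the actual driver: the object name, the two-column `src`/`dst` layout, and the command-line arguments are all hypothetical.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object TsvToParquetSketch {
  def main(args: Array[String]): Unit = {
    // Assumed usage: <input TSV path> <output Parquet path>
    val Array(inputPath, outputPath) = args.take(2)

    val spark = SparkSession.builder().appName("tsv-to-parquet-sketch").getOrCreate()

    // Read tab-separated input with the CSV reader. The column names are
    // an assumption; toDF fails if the file does not have exactly two columns.
    val edges = spark.read
      .option("sep", "\t")
      .option("header", "false")
      .csv(inputPath)
      .toDF("src", "dst")

    // Write intermediate/output data as Parquet so that later stages, or
    // other tools, can read it without this package on the classpath.
    edges.write.mode(SaveMode.Overwrite).parquet(outputPath)

    spark.stop()
  }
}
```

Without an explicit schema the CSV reader yields string columns, so a real driver would probably pass a schema or cast before writing.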

@maxjakob WDYT?

Answer 6 · 2017-05-11T09:30:06.000Z

This should include using the new readers/writers for Metadata.

Answer 7 · 2017-05-16T09:47:27.000Z

Metadata has moved over first.