soundcloud/spark-pagerank

Don't use object files as app input and outputs

Closed this issue · 7 comments

The options are TSV, CSV, or Parquet (or staying with object files). We need some interoperability with other formats in HDFS; for example, graph building reads from TSV in our case. What should the standard interface to the apps be? You can always build your own apps; the bundled ones are just there for reference and for handy usage and testing.

/cc @maxjakob

This is related to the example apps. I think the level of standardization does not need to be high there.

The only thing is that object files force one to pull in this package for further processing; the other suggestions don't.

Decided this wasn't necessary.

Yeah, I see the argument. I will reopen this for the 1.0.0 release then.

OK, the plan is to read from TSV by default in the graph building driver and then use Parquet for all output and intermediate data. Working on this next.
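
For illustration only, here is a rough sketch of what "TSV in, Parquet out" could look like. It is written in PySpark rather than taken from this repo, and the three-column layout (source ID, destination ID, weight), the column names, and the helper names `parse_edge_line` and `tsv_to_parquet` are all assumptions made for the example, not the project's actual format or API.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """One weighted edge; the field layout is an assumption for this sketch."""
    src_id: int
    dst_id: int
    weight: float


def parse_edge_line(line: str) -> Edge:
    """Parse one 'src<TAB>dst<TAB>weight' line (assumed layout) into an Edge."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    src, dst, weight = fields
    return Edge(int(src), int(dst), float(weight))


def tsv_to_parquet(input_path: str, output_path: str) -> None:
    """Read a TSV edge list with Spark and write it back out as Parquet."""
    # Imported lazily so the pure-Python parser above works without PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tsv-to-parquet-sketch").getOrCreate()
    edges = spark.read.csv(
        input_path,
        sep="\t",
        # Hypothetical column names and types, given as a DDL schema string.
        schema="srcId LONG, dstId LONG, weight DOUBLE",
    )
    # Parquet stores the schema with the data, so downstream jobs can read
    # the result with any Parquet reader instead of depending on this package.
    edges.write.parquet(output_path)
```

The point of the Parquet side is the one raised earlier in the thread: consumers of the output no longer need this package's classes on their classpath to deserialize the data.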

@maxjakob WDYT?

This should include using the new readers/writers for Metadata.

Metadata has moved over first.