treeverse/hadoop-router-fs

Spark: co-existing with existing underlying object store


Spark users commonly load RDDs and DataFrames directly from an object store URI (e.g. s3a://bucket/path/to/table).
In many cases, these paths tend to sprawl, spread across configuration files, job code, and external sources such as the Hive Metastore or Airflow parameters.

This makes migrating a collection to lakeFS hard, because it forces a URI change.

The only way to do this safely is to make sure every single occurrence of the previous URI now points to the new location (every consumer in existence, down to people's personal notebooks). This is hard to guarantee and error-prone.

I suggest adding a layer of abstraction/indirection to ease that transition: provide a configurable set of overrides to translate paths at runtime from an old location to a new one.

Some pseudo-configuration to illustrate such a solution:

{
    "path_overrides": [
        {
            "from": "s3a://bucket/path/*",
            "to": "lakefs://repository/main/$1"
        },
        {
            "from": "gcs://*",
            "to": "s3a://$1"
        }
    ]
}

Spark would then be configured to load such a config file (TBD: how?) and apply these translations at runtime, even for dynamically generated or composed paths. With the first rule above, for example, s3a://bucket/path/to/table would resolve to lakefs://repository/main/to/table.
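Purely as an illustration of what "loading such a config" could look like (the mechanism itself is the open question above), one option is to inline the overrides as spark.hadoop.* properties, which Spark forwards into the Hadoop Configuration. The path.override.* key names below are invented for this sketch and do not exist in Spark or lakeFS.

import org.apache.spark.sql.SparkSession;

public class PathOverrideExample {
    public static void main(String[] args) {
        // spark.hadoop.* properties are copied into the Hadoop Configuration,
        // where a translation layer could read them at runtime.
        // The "path.override.*" keys are hypothetical, shown only to sketch the idea.
        SparkSession spark = SparkSession.builder()
                .appName("path-override-example")
                .config("spark.hadoop.path.override.1.from", "s3a://bucket/path/")
                .config("spark.hadoop.path.override.1.to", "lakefs://repository/main/")
                .getOrCreate();

        // If the translation were in place, this read would transparently resolve to
        // lakefs://repository/main/to/table.
        spark.read().parquet("s3a://bucket/path/to/table").show();
    }
}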

Nice to have: Since this is probably useful for non-lakeFS users as well, this should be a standalone component with no direct dependency on anything other than Spark itself (i.e. serialization and configuration should use Spark's bundled methods even if it means changing from JSON to... XML).

This requires some discovery to make sure whatever we do adds as little friction as possible for Spark users.

@ozkatz, can you add some information to this issue?

I have an idea for how to solve this.

How about creating a Hadoop file system that does the following (a rough sketch follows the list):

  1. Holds an underlying file system, or uses FileSystem.get to fetch the underlying file system without holding it.
  2. Checks a configurable mapping between object-store prefixes and lakeFS paths.
  3. If a prefix is in the mapping, translates it to something that lakeFS will understand; otherwise does nothing.
  4. Uses the underlying fs to invoke the operation.
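
To make these steps concrete, here is a minimal Java sketch of the routing core. It is illustrative only, not the actual RouterFS code: the class name, the way the mapping is supplied, and the fact that only open() is shown are simplifications; a real Hadoop FileSystem implementation would route every operation through the same translation.

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrefixRouter {
    // Ordered prefix -> replacement mapping, e.g.
    //   "s3a://bucket/path/" -> "lakefs://repository/main/"
    private final Map<String, String> mapping = new LinkedHashMap<>();
    private final Configuration conf;

    public PrefixRouter(Configuration conf, Map<String, String> mapping) {
        this.conf = conf;
        this.mapping.putAll(mapping);
    }

    // Steps 2 + 3: if the path starts with a mapped prefix, rewrite it; otherwise leave it alone.
    public Path translate(Path original) {
        String uri = original.toUri().toString();
        for (Map.Entry<String, String> rule : mapping.entrySet()) {
            if (uri.startsWith(rule.getKey())) {
                return new Path(rule.getValue() + uri.substring(rule.getKey().length()));
            }
        }
        return original;
    }

    // Steps 1 + 4: fetch the file system that actually serves the (possibly translated)
    // path and invoke the operation on it. open() is shown as a representative example.
    public FSDataInputStream open(Path original, int bufferSize) throws IOException {
        Path translated = translate(original);
        FileSystem underlying = FileSystem.get(translated.toUri(), conf);
        return underlying.open(translated, bufferSize);
    }
}

Delegating through FileSystem.get (rather than holding a single wrapped instance) lets one router handle translations across different target schemes, since Hadoop caches FileSystem instances per scheme and authority.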

@arielshaqed @nopcoder let me know if you think this won’t work, and I’ll write a more detailed proposal.

@ozkatz can you please add more info to this issue? thanks!

@talSofer @johnnyaug description updated with more information.

Since this is probably useful for non-lakeFS users as well, this should be a standalone component with no direct dependency on anything other than Spark itself

@ozkatz is this a nice-to-have requirement? I mean, it would be great if we could find such a generic solution that is useful for non-lakeFS users, but I think it may limit us if we treat it as a must-do.

@talSofer it is indeed nice-to-have! Edited the original comment to reflect that.

We've decided to split this delivery into multiple steps, and the following is the scope of the first step:

  • RouterFS works with lakeFS only via the S3 Gateway.
  • RouterFS' JAR is not published to Maven Central; users can build it according to the provided build instructions.
  • The mapping configuration format is validated, and the intent is to keep it.
  • Usage is documented on GitHub.
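
For illustration of what this first step could look like from a user's perspective, routing through the lakeFS S3 Gateway would combine standard S3A endpoint settings with a mapping rule. The routerfs.mapping.* keys below are hypothetical placeholders; the real keys are whatever the build/usage instructions on GitHub document.

import org.apache.hadoop.conf.Configuration;

public class GatewayRoutingSketch {
    public static Configuration configure() {
        Configuration conf = new Configuration();

        // Standard S3A settings: point the connector at the lakeFS S3 Gateway.
        conf.set("fs.s3a.endpoint", "https://lakefs.example.com");
        conf.set("fs.s3a.path.style.access", "true");

        // Hypothetical mapping keys: rewrite the legacy bucket prefix to a lakeFS
        // repository/branch prefix before the request goes through the gateway.
        conf.set("routerfs.mapping.example.from", "s3a://old-bucket/");
        conf.set("routerfs.mapping.example.to", "s3a://example-repo/main/");
        return conf;
    }
}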

cc @johnnyaug @ozkatz