hadoop-router-fs

RouterFileSystem (RouterFS) is a Hadoop FileSystem implementation that transforms URIs at runtime according to the mapping configurations you provide. It then routes file system operations to another Hadoop file system, which executes them against the underlying object store.

Primary language: Java. License: Apache License 2.0.

Use-cases

  • Interact with multiple storage systems side by side, without making any changes to your code.
  • Migrate a collection to a new storage location without changing or breaking your Spark application code.

Build instructions

Prerequisites

  • Install Maven

Steps

  1. Clone the repo:

    git clone git@github.com:treeverse/hadoop-router-fs.git
  2. Build with Maven:

    mvn clean install

How to configure RouterFS

Configure Spark to use RouterFS

Instruct Spark to use RouterFS as the file system implementation for the URIs you would like to transform at runtime by adding the following property to your Spark configurations:

fs.${fromFsScheme}.impl=io.lakefs.routerfs.RouterFileSystem

For example, adding fs.s3a.impl=io.lakefs.routerfs.RouterFileSystem instructs Spark to use RouterFS as the file system for any URI with scheme=s3a.

Add custom mapping configurations

RouterFS consumes your mapping configurations to understand which paths it needs to modify and how to modify them. It then performs a simple prefix replacement accordingly.
Mapping configurations are Hadoop properties of the form:

routerfs.mapping.${fromFsScheme}.${mappingIdx}.(replace|with)=${path-prefix}

Here, replace holds the source prefix to match and with holds the destination prefix that substitutes it. For a given URI, RouterFS scans the mapping configurations defined for the URI's scheme, finds the first mapping whose source prefix matches the start of the URI, and transforms the URI according to that mapping.

Notes about mapping configurations:

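  • End the source prefix with a slash when it names a directory. Matching is a simple prefix comparison, so a prefix such as s3a://bucket/dir would also match s3a://bucket/dir2/file.
  • Mapping configurations apply in order and the first match wins, so it is up to you to keep them non-conflicting.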
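
The two notes above can be illustrated with a minimal Java sketch of first-match prefix replacement. It is not RouterFS's actual code: the class PrefixMappingSketch, the method rewrite, and the dir-archive paths are hypothetical names chosen for illustration, and an insertion-ordered map stands in for the ${mappingIdx} ordering.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of first-match prefix replacement; not RouterFS's actual code.
class PrefixMappingSketch {

    // Returns the rewritten URI, or the input unchanged when no source prefix matches.
    static String rewrite(Map<String, String> orderedMappings, String uri) {
        for (Map.Entry<String, String> mapping : orderedMappings.entrySet()) {
            String sourcePrefix = mapping.getKey();
            if (uri.startsWith(sourcePrefix)) {
                // First match wins: later mappings are never consulted.
                return mapping.getValue() + uri.substring(sourcePrefix.length());
            }
        }
        return uri;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, standing in for the ${mappingIdx} order.
        Map<String, String> mappings = new LinkedHashMap<>();
        mappings.put("s3a://bucket/dir", "lakefs://repo/main/"); // source prefix lacks a trailing slash
        mappings.put("s3a://bucket/dir-archive/", "lakefs://archive/main/");

        // The slash-less prefix also captures the sibling directory, so the second
        // mapping is shadowed and the rewritten path is malformed.
        // Prints lakefs://repo/main/-archive/a.parquet
        System.out.println(rewrite(mappings, "s3a://bucket/dir-archive/a.parquet"));
    }
}
```

Ending the first source prefix with a slash (s3a://bucket/dir/) would let the second mapping take effect for the dir-archive path.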

Default file system

For each mapped scheme, you should configure a default file system implementation to use in case no mapping is found.
Add the following configuration for each scheme you configured RouterFS to handle:

routerfs.default.fs.${fromFsScheme}=${the file system you used for this scheme without routerFS}

For example, adding:

routerfs.default.fs.s3a=org.apache.hadoop.fs.s3a.S3AFileSystem

instructs RouterFS to use S3AFileSystem for any URI with scheme=s3a for which RouterFS did not find a mapping configuration.

When no mapping is found

If RouterFS can't find a matching mapping configuration for a URI, it leaves the URI unchanged and hands it to the default file system configured for the URI's scheme.

Example

Given the following mapping configurations:

fs.s3a.impl=io.lakefs.routerfs.RouterFileSystem
# Mapping 1: source prefix (replace) and destination prefix (with)
routerfs.mapping.s3a.1.replace=s3a://bucket/dir1/
routerfs.mapping.s3a.1.with=lakefs://repo/main/
# Mapping 2: source prefix (replace) and destination prefix (with)
routerfs.mapping.s3a.2.replace=s3a://bucket/dir2/
routerfs.mapping.s3a.2.with=lakefs://example-repo/dev/
# Default file system implementation for the `s3a` scheme
routerfs.default.fs.s3a=org.apache.hadoop.fs.s3a.S3AFileSystem

  • For the URI s3a://bucket/dir1/foo.parquet, RouterFS performs the following steps:

    1. Scan all RouterFS mapping configurations whose key includes the s3a scheme: routerfs.mapping.s3a.${mappingIdx}.replace.
    2. Iterate over these configurations in the priority order given by ${mappingIdx}, trying to match the URI's prefix against each configuration value. The iteration stops at the s3a://bucket/dir1/ prefix, which matches the URI s3a://bucket/dir1/foo.parquet.
    3. Replace the matched prefix with the mapping's destination value, lakefs://repo/main/, to produce the desired URI: lakefs://repo/main/foo.parquet.
  • For the URI s3a://bucket/dir3/bar.parquet, RouterFS performs the following steps:

    1. Scan all RouterFS mapping configurations whose key includes the s3a scheme: routerfs.mapping.s3a.${mappingIdx}.replace.
    2. Iterate over these configurations in the priority order given by ${mappingIdx}, trying to match the URI's prefix against each configuration value. The iteration ends without finding a matching mapping.
    3. Fall back to the default file system implementation (S3AFileSystem) and leave the URI unchanged.
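
The steps above can be traced with a small, self-contained Java sketch. It is only an illustration under stated assumptions, not the RouterFS implementation: the class RouterExampleSketch and its methods are hypothetical, lower ${mappingIdx} values are assumed to be tried first, every replace key is assumed to have a matching with key, and a plain Map stands in for the Hadoop configuration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the steps above; class and method names are illustrative only.
class RouterExampleSketch {

    // Step 1: collect the mappings for one scheme, sorted by ${mappingIdx}
    // (assumes numeric indices and a "with" key for every "replace" key).
    static TreeMap<Integer, String[]> loadMappings(Map<String, String> conf, String scheme) {
        String keyPrefix = "routerfs.mapping." + scheme + ".";
        TreeMap<Integer, String[]> byIndex = new TreeMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            String key = e.getKey();
            if (key.startsWith(keyPrefix) && key.endsWith(".replace")) {
                String idx = key.substring(keyPrefix.length(), key.length() - ".replace".length());
                String with = conf.get(keyPrefix + idx + ".with");
                byIndex.put(Integer.parseInt(idx), new String[] {e.getValue(), with});
            }
        }
        return byIndex;
    }

    // Steps 2 and 3: replace the first matching prefix, or leave the URI unchanged.
    // Assumes the URI contains "://".
    static String resolve(Map<String, String> conf, String uri) {
        String scheme = uri.substring(0, uri.indexOf("://"));
        for (String[] mapping : loadMappings(conf, scheme).values()) {
            if (uri.startsWith(mapping[0])) {
                return mapping[1] + uri.substring(mapping[0].length());
            }
        }
        return uri; // would be handled by routerfs.default.fs.<scheme>
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of(
            "routerfs.mapping.s3a.1.replace", "s3a://bucket/dir1/",
            "routerfs.mapping.s3a.1.with", "lakefs://repo/main/",
            "routerfs.mapping.s3a.2.replace", "s3a://bucket/dir2/",
            "routerfs.mapping.s3a.2.with", "lakefs://example-repo/dev/",
            "routerfs.default.fs.s3a", "org.apache.hadoop.fs.s3a.S3AFileSystem");

        System.out.println(resolve(conf, "s3a://bucket/dir1/foo.parquet")); // lakefs://repo/main/foo.parquet
        System.out.println(resolve(conf, "s3a://bucket/dir3/bar.parquet")); // s3a://bucket/dir3/bar.parquet
    }
}
```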

Configure File Systems Implementations

The final configuration step is to tell Spark which file system to use for each URI scheme. Make sure to add this configuration for every URI scheme that appears as a mapping destination (a with value). For example, to instruct Spark to use S3AFileSystem for any URI with scheme=lakefs:

fs.lakefs.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
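
To see how the two fs.<scheme>.impl settings work together, the following Java sketch looks up the implementation class name for a URI before and after RouterFS rewrites it. It only models the fs.<scheme>.impl naming convention with a plain Map; the class SchemeLookupSketch and the method implFor are hypothetical and do not call Hadoop.

```java
import java.net.URI;
import java.util.Map;

// Hypothetical sketch: models the fs.<scheme>.impl lookup with a plain map.
class SchemeLookupSketch {

    // Returns the configured implementation class name for the URI's scheme,
    // or null when no fs.<scheme>.impl entry exists.
    static String implFor(Map<String, String> conf, String uri) {
        String scheme = URI.create(uri).getScheme();
        return conf.get("fs." + scheme + ".impl");
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of(
            "fs.s3a.impl", "io.lakefs.routerfs.RouterFileSystem",
            "fs.lakefs.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

        // The original URI is handled by RouterFS first...
        System.out.println(implFor(conf, "s3a://bucket/dir1/foo.parquet"));
        // ...and the rewritten URI is then handled by S3AFileSystem.
        System.out.println(implFor(conf, "lakefs://repo/main/foo.parquet"));
    }
}
```

Without the fs.lakefs.impl entry, the second lookup in this sketch returns null, which is why the destination scheme needs its own configuration.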

Usage

Run your Spark Application with RouterFS

Building RouterFS produces a jar under the target directory. Supply this jar to your Spark application, either by passing it when you run the application or by placing it under your $SPARK_HOME/jars directory.

Usage with lakeFS

The current version of RouterFS only works for Spark applications that interact with lakeFS via the S3 Gateway. That is, you can't currently use RouterFS together with LakeFSFileSystem, but we have concrete plans to make this work.

S3AFileSystem

The current version of RouterFS relies on S3AFileSystem's per-bucket configuration functionality to support multiple mappings that use S3AFileSystem as their file system implementation. This means that the Hadoop version you compile against must be >= 2.8.0.
Per-bucket configuration treats the URI's authority (the first component after the scheme, such as repo in lakefs://repo/branch/) as the bucket name, and applies properties of the form fs.s3a.bucket.<bucket>.<conf> to that bucket only.
For example, given the following configurations:

fs.s3a.impl=io.lakefs.routerfs.RouterFileSystem
routerfs.mapping.s3a.1.replace=s3a://bucket/dir/
routerfs.mapping.s3a.1.with=lakefs://repo/branch/
routerfs.default.fs.s3a=org.apache.hadoop.fs.s3a.S3AFileSystem

fs.lakefs.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

# The following configs are used when accessing URIs of the form `lakefs://repo/...`
fs.s3a.bucket.repo.endpoint=https://lakefs.example.com
fs.s3a.bucket.repo.access.key=AKIAlakefs12345EXAMPLE
fs.s3a.bucket.repo.secret.key=abc/lakefs/1234567bPxRfiCYEXAMPLEKEY
...
# The following configs are used when accessing any non-mapped s3a URI
fs.s3a.endpoint=https://s3.us-east-1.amazonaws.com
fs.s3a.access.key=...
fs.s3a.secret.key=...

The configurations that begin with fs.s3a.bucket.repo are used when accessing lakefs://repo/<path>.
All other fs.s3a.<conf> properties are used for the general case.
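
The override behaviour described above can be modelled with a short Java sketch. It is a simplified lookup model under stated assumptions, not how S3AFileSystem is implemented internally: the class PerBucketConfSketch and the method effectiveOption are hypothetical, and a plain Map stands in for the Hadoop configuration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of per-bucket overrides; not S3AFileSystem's internal code.
class PerBucketConfSketch {

    // Prefers fs.s3a.bucket.<authority>.<option>, falling back to fs.s3a.<option>.
    static String effectiveOption(Map<String, String> conf, String uri, String option) {
        String bucket = URI.create(uri).getAuthority();
        String perBucket = conf.get("fs.s3a.bucket." + bucket + "." + option);
        return perBucket != null ? perBucket : conf.get("fs.s3a." + option);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("fs.s3a.bucket.repo.endpoint", "https://lakefs.example.com");
        conf.put("fs.s3a.endpoint", "https://s3.us-east-1.amazonaws.com");

        // Authority "repo" has a per-bucket override.
        // Prints https://lakefs.example.com
        System.out.println(effectiveOption(conf, "lakefs://repo/branch/file.parquet", "endpoint"));

        // Authority "bucket" has none, so the general setting applies.
        // Prints https://s3.us-east-1.amazonaws.com
        System.out.println(effectiveOption(conf, "s3a://bucket/other/file.parquet", "endpoint"));
    }
}
```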

Working example

Please refer to the sample app.