GlobalMentor Hadoop Bare Naked Local FileSystem - A Sparkutils fork

This fork was created because this PR has not progressed. The original project's lack of an AbstractFileSystem implementation blocks the use of Delta and Spark Streaming. Only the published artifact coordinates have changed; the class packages remain as in the pull request.

A Hadoop local FileSystem implementation directly accessing the Java API without Winutils, suitable for use with Spark.

The name of this project refers to the BareLocalFileSystem and NakedLocalFileSystem classes, and is a lighthearted reference to the Hadoop RawLocalFileSystem class which NakedLocalFileSystem extends—a play on the Portuguese expression "a verdade, nua e crua" ("the raw, naked truth").

Usage

  1. If you have an application that needs Hadoop local FileSystem support without relying on Winutils, import the latest com.sparkutils:hadoop-bare-naked-local-fs library into your project, e.g. in Maven for v0.2.0:
<dependency>
  <groupId>com.sparkutils</groupId>
  <artifactId>hadoop-bare-naked-local-fs</artifactId>
  <version>0.2.0</version>
</dependency>
  2. Then specify that you want to use the Bare Local File System implementation com.globalmentor.apache.hadoop.fs.BareLocalFileSystem for the file scheme. (BareLocalFileSystem internally uses NakedLocalFileSystem.) The following example does this for Spark in Java:
SparkSession spark = SparkSession.builder()
        .appName("Foo Bar")
        .master("local")
        .config("spark.hadoop.fs.file.impl", BareLocalFileSystem.class.getName())
        .config("spark.hadoop.fs.AbstractFileSystem.file.impl", BareStreamingLocalFileSystem.class.getName())
        .getOrCreate();

The config must be set before getOrCreate, because Spark components such as the local metastore evaluate the Hadoop FileSystem before any Hadoop Configuration.setClass call takes effect. The AbstractFileSystem property is required to enable streaming checkpoint operations.
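For contrast, setting the implementation on the Hadoop configuration after the session has been created is too late. A sketch of the anti-pattern, continuing the example above:

// Anti-pattern: by this point Spark components such as the local
// metastore may already have instantiated the default LocalFileSystem.
spark.sparkContext().hadoopConfiguration().setClass(
        "fs.file.impl", BareLocalFileSystem.class, org.apache.hadoop.fs.FileSystem.class);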

Note that you may still get warnings that "HADOOP_HOME and hadoop.home.dir are unset" and "Did not find winutils.exe". This is because the Winutils kludge permeates the Hadoop code and is hard-coded at a low level, executed statically upon class loading, even for code completely unrelated to file access. See HADOOP-13223: winutils.exe is a bug nexus and should be killed with an axe.
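Outside Spark, the same file scheme override can presumably be applied to a plain Hadoop Configuration. A minimal sketch, assuming the library is on the classpath:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import com.globalmentor.apache.hadoop.fs.BareLocalFileSystem;

Configuration conf = new Configuration();
// Use BareLocalFileSystem for the `file` scheme instead of the default LocalFileSystem.
conf.setClass("fs.file.impl", BareLocalFileSystem.class, FileSystem.class);
FileSystem fs = FileSystem.get(URI.create("file:///"), conf);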

Limitations

  • The current implementation does not handle symbolic links, but this is planned.
  • The current implementation does not register as a service supporting the file scheme, and instead must be specified manually. A future version will register with Java's service-loading mechanism as org.apache.hadoop.fs.LocalFileSystem does (see the sketch after this list), although manual specification will still be required in an environment such as Spark, because there would then be two conflicting file systems registered for the file scheme.
  • The current implementation does not support the *nix sticky bit.
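For reference, Hadoop discovers FileSystem implementations through Java's standard ServiceLoader mechanism, so registration would presumably amount to shipping a provider configuration file on the classpath. A sketch (this project does not yet include such a file; the class name is the one shown in the Usage section):

# File: META-INF/services/org.apache.hadoop.fs.FileSystem
com.globalmentor.apache.hadoop.fs.BareLocalFileSystem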

Background

The Apache Hadoop FileSystem was designed tightly coupled to *nix file systems. It assumes POSIX file permissions, and its formal model definition is still sparse. Little thought was put into creating a general file access API that could be implemented across platforms; file access on non-*nix systems such as Windows was largely ignored, and few cared.

Unfortunately the Hadoop FileSystem API has become somewhat of a de facto common file storage layer for big data processing, essentially tying big data to *nix systems if local file access is desired. For example, Apache Spark pulls in Hadoop's FileSystem (and the entire Hadoop client access layer) to write output files to the local file system. Running Spark on Windows, even for prototyping, would be impossible without a Windows implementation of FileSystem.

RawLocalFileSystem, accessed indirectly via LocalFileSystem, is Hadoop's attempt at Java access of a local file system. It was written before Java added access to *nix-centric features such as POSIX file permissions. RawLocalFileSystem attempts to access the local file system using system libraries via JNI, and if that is not possible it falls back to creating Shell processes that run *nix commands such as chmod or bash. This in itself represents a security concern, not to mention an inefficient kludge.
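To illustrate the pattern, here is a simplified sketch of the shell fallback approach described above (not Hadoop's actual code; Shell.execCommand is a real Hadoop utility that spawns a subprocess and returns its output):

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.util.Shell;

class ShellFallbackSketch {
  // Sets permissions by spawning a *nix `chmod` process: one external
  // process per permission change, with all the attendant overhead.
  static void setPermission(File f, String octalPerm) throws IOException {
    Shell.execCommand("chmod", octalPerm, f.getAbsolutePath());
  }
}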

In order to allow RawLocalFileSystem to function on Windows (for example to run Spark), one Hadoop contributor created the winutils package. This represents a set of binary files that run on Windows and "intercept" the RawLocalFileSystem native calls. While the ability to run Spark on Windows was of course welcome, the design represents one more kludge on top of an existing kludge, requiring the trust of more binary distributions and another potential vector for malicious code. (For these reasons there are Hadoop tickets such as HADOOP-13223: winutils.exe is a bug nexus and should be killed with an axe.)

This Hadoop Bare Naked Local File System project bypasses Winutils and forces Hadoop to access the file system via pure Java. The BareLocalFileSystem and NakedLocalFileSystem classes are versions of LocalFileSystem and RawLocalFileSystem, respectively, which bypass the outdated native and shell access to the local file system and use the Java API instead. This means that projects like Spark can access the file system on Windows as well as on other platforms, without the need to pull in a third-party kludge such as Winutils.
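As an illustration of the pure-Java approach, a minimal sketch assuming Java 7+ (not necessarily this project's actual code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

class NioPermissionSketch {
  // Reads POSIX permissions via the standard Java API: no JNI, no shell
  // processes, no Winutils.
  static String permissionString(Path path) throws IOException {
    Set<PosixFilePermission> perms = Files.getPosixFilePermissions(path);
    return PosixFilePermissions.toString(perms); // e.g. "rwxr-x---"
  }
  // On Windows, where the POSIX view is unsupported, an implementation
  // would fall back to Files.isReadable(path), Files.isWritable(path),
  // and Files.isExecutable(path) to synthesize a permission set.
}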

Implementation Caveats (problems inherited from LocalFileSystem and RawLocalFileSystem)

This Hadoop Bare Naked Local File System implementation extends LocalFileSystem and RawLocalFileSystem and "reverses" or "undoes" as much of the JNI and shell access as possible. Much of the original Hadoop kludge implementation is still present beneath the surface (meaning that "Bare Naked" is for the moment a bit of a misnomer).

Unfortunately, solving the problem of Hadoop's default local file system access isn't as simple as changing native/shell calls to their modern Java equivalents. The current LocalFileSystem and RawLocalFileSystem implementations have evolved haphazardly, with halfway-implemented features scattered about, special-case code for ill-documented corner cases, and implementation-specific assumptions permeating the design itself. Here are a few examples.

  • Tentacled references to Winutils are practically everywhere, even where you least expect them. When Spark starts up, the (shaded) org.apache.hadoop.security.SecurityUtil class statically sets up an internal configuration object using setConfigurationInternal(new Configuration()). Part of this initialization calls conf.getBoolean(CommonConfigurationKeys.HADOOP_SECURITY_TOKEN_SERVICE_USE_IP, CommonConfigurationKeys.HADOOP_SECURITY_TOKEN_SERVICE_USE_IP_DEFAULT). What could Configuration.getBoolean() have to do with Winutils? The (shaded) org.apache.hadoop.conf.Configuration.getBoolean(String name, boolean defaultValue) method uses StringUtils.equalsIgnoreCase("true", valueString) to convert the String to a boolean, and the (shaded) org.apache.hadoop.util.StringUtils class has a static reference to Shell in public static final Pattern ENV_VAR_PATTERN = Shell.WINDOWS ? WIN_ENV_VAR_PATTERN : SHELL_ENV_VAR_PATTERN. Simply referencing Shell brings in the whole static initialization block that looks for Winutils. This is low-level, hard-coded stuff that represented a bad design to begin with. (These sorts of things should be pluggable and configurable, not hard-coded into static initializations.)
    if (WINDOWS) {
      try {
        file = getQualifiedBin(WINUTILS_EXE);
        path = file.getCanonicalPath();
        ioe = null;
      } catch (IOException e) {
        LOG.warn("Did not find {}: {}", WINUTILS_EXE, e);
        // stack trace comes at debug level
        LOG.debug("Failed to find " + WINUTILS_EXE, e);
        file = null;
        path = null;
        ioe = e;
      }
    } else {
  • At the FileSystem level, Winutils-related logic isn't confined to RawLocalFileSystem, which would have allowed it to be easily overridden, but instead relies on the static FileUtil class, which acts like a separate file system implementation that relies on Winutils and can't be modified. For example, here is FileUtil code that would need to be updated, unfortunately independently of the FileSystem implementation (a pure-Java replacement is sketched at the end of this section):
  public static String readLink(File f) {
    /* NB: Use readSymbolicLink in java.nio.file.Path once available. Could
     * use getCanonicalPath in File to get the target of the symlink but that
     * does not indicate if the given path refers to a symlink.
     */
    try {
      return Shell.execCommand(
          Shell.getReadlinkCommand(f.toString())).trim();
    } catch (IOException x) {
      return "";
    }
  }
  • Apparently there is a "new Stat based implementation" of many methods, but RawLocalFileSystem instead uses deprecated implementations such as DeprecatedRawLocalFileStatus, which is full of workarounds and special cases, is package-private so it can't be accessed by subclasses, yet can't be removed because of HADOOP-9652. The useDeprecatedFileStatus switch is hard-coded so that it can't be modified by a subclass, forcing a re-implementation of everything it touches. In other words, even the new, less-kludgey approach is switched off in the code, has been for years, and no one seems to be paying it any mind.
  • DeprecatedRawLocalFileStatus tries to detect whether permissions have already been loaded for a file by checking for an empty-string owner; under some conditions (a shell error) the owner is set to null, which under the covers actually sets the value to "", making the whole process brittle. Moreover, an error condition would cause an endless cycle of attempts to load permissions. (And what is CopyFiles.FilePair(), and does the current implementation break it, or would it only be broken if "extra fields" are added?)
    /* We can add extra fields here. It breaks at least CopyFiles.FilePair().
     * We recognize if the information is already loaded by check if
     * onwer.equals("").
     */
    private boolean isPermissionLoaded() {
      return !super.getOwner().isEmpty();
    }
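
For comparison with the FileUtil.readLink code above, the pure-Java replacement anticipated by its own "NB" comment is now straightforward. A sketch assuming Java 7+ (not necessarily how this project implements it):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

class ReadLinkSketch {
  // Resolves a symbolic link via the Java API instead of spawning a
  // `readlink` shell process; returns "" on failure, mirroring FileUtil.
  static String readLink(File f) {
    try {
      return Files.readSymbolicLink(f.toPath()).toString();
    } catch (IOException | UnsupportedOperationException e) {
      return "";
    }
  }
}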