This fork was created because this PR has not progressed. The lack of an `AbstractFileSystem` implementation in the original project prevents use with Delta and Spark Streaming. Only the published artifact coordinates have changed; the class packages remain as in the pull request.
A Hadoop local `FileSystem` implementation directly accessing the Java API without Winutils, suitable for use with Spark.
The name of this project refers to the `BareLocalFileSystem` and `NakedLocalFileSystem` classes, and is a lighthearted reference to the Hadoop `RawLocalFileSystem` class which `NakedLocalFileSystem` extends, a play on the Portuguese expression "a verdade, nua e crua" ("the raw, naked truth").
- If you have an application that needs Hadoop local `FileSystem` support without relying on Winutils, import the latest `com.sparkutils:hadoop-bare-naked-local-fs` library into your project, e.g. in Maven for v0.2.0:
```xml
<dependency>
  <groupId>com.sparkutils</groupId>
  <artifactId>hadoop-bare-naked-local-fs</artifactId>
  <version>0.2.0</version>
</dependency>
```
- Then specify that you want to use the Bare Local File System implementation `com.globalmentor.apache.hadoop.fs.BareLocalFileSystem` for the `file` scheme. (`BareLocalFileSystem` internally uses `NakedLocalFileSystem`.) The following example does this for Spark in Java:
```java
SparkSession spark = SparkSession.builder()
    .appName("Foo Bar")
    .master("local")
    .config("spark.hadoop.fs.file.impl", BareLocalFileSystem.class.getName())
    .config("spark.hadoop.fs.AbstractFileSystem.file.impl", BareStreamingLocalFileSystem.class.getName())
    .getOrCreate();
```
The configuration must be set before `getOrCreate()` is called: Spark components such as the local metastore evaluate the Hadoop file system early, before any Hadoop configuration `setClass` call would take effect. The `fs.AbstractFileSystem.file.impl` property is required to enable streaming checkpoint operations.
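As a quick way to verify the streaming checkpoint path, here is a minimal sketch that runs a trivial structured streaming query checkpointing to the local file system. The class name `CheckpointSmokeTest` and the checkpoint directory are arbitrary examples, and `BareStreamingLocalFileSystem` is assumed to live in the same `com.globalmentor.apache.hadoop.fs` package as `BareLocalFileSystem`:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public final class CheckpointSmokeTest {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("Checkpoint Smoke Test")
        .master("local")
        // Class names given as strings to avoid a compile-time dependency;
        // the BareStreamingLocalFileSystem package is an assumption here.
        .config("spark.hadoop.fs.file.impl",
            "com.globalmentor.apache.hadoop.fs.BareLocalFileSystem")
        .config("spark.hadoop.fs.AbstractFileSystem.file.impl",
            "com.globalmentor.apache.hadoop.fs.BareStreamingLocalFileSystem")
        .getOrCreate();

    // The built-in "rate" source emits rows continuously; checkpointing to
    // a file: URI exercises the AbstractFileSystem implementation above.
    Dataset<Row> rate = spark.readStream().format("rate").load();
    StreamingQuery query = rate.writeStream()
        .format("console")
        .option("checkpointLocation", "file:///tmp/bare-naked-checkpoint") // example path
        .start();
    query.awaitTermination(10_000); // run briefly, then shut down
    query.stop();
    spark.stop();
  }
}
```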
Note that you may still get warnings that "HADOOP_HOME and hadoop.home.dir are unset" and "Did not find winutils.exe". This is because the Winutils kludge permeates the Hadoop code and is hard-coded at a low level, executed statically upon class loading, even for code completely unrelated to file access. See HADOOP-13223: winutils.exe is a bug nexus and should be killed with an axe.
- The current implementation does not handle symbolic links, but this is planned.
- The current implementation does not register as a service supporting the `file` scheme, and instead must be specified manually. A future version will register with Java's service loading mechanism as `org.apache.hadoop.fs.LocalFileSystem` does (see the sketch after this list), although this will still require manual specification in an environment such as Spark, because there would be two conflicting registered file systems for `file`.
- The current implementation does not support the *nix sticky bit.
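For context, Hadoop discovers `FileSystem` implementations through Java's `ServiceLoader` mechanism, reading entries from `META-INF/services/org.apache.hadoop.fs.FileSystem` on the classpath. The sketch below simply lists what is registered in a given environment; the registration entry shown in the comment is hypothetical, since this version does not ship one:

```java
import java.util.ServiceLoader;
import org.apache.hadoop.fs.FileSystem;

// A registration entry for this project would be a single line in
// META-INF/services/org.apache.hadoop.fs.FileSystem (hypothetical):
//
//     com.globalmentor.apache.hadoop.fs.BareLocalFileSystem
//
public final class ListRegisteredFileSystems {
  public static void main(String[] args) {
    // Enumerate every FileSystem implementation registered on the
    // classpath, with the scheme each one claims.
    for (FileSystem fs : ServiceLoader.load(FileSystem.class)) {
      try {
        System.out.println(fs.getScheme() + " -> " + fs.getClass().getName());
      } catch (UnsupportedOperationException e) {
        System.out.println("(no scheme) -> " + fs.getClass().getName());
      }
    }
  }
}
```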
The Apache Hadoop `FileSystem` was designed tightly coupled to *nix file systems. It assumes POSIX file permissions, and its model definition is still sparse. Little thought was put into creating a general file access API that could be implemented across platforms; file access on non-*nix systems such as Windows was largely ignored, and few cared.
Unfortunately the Hadoop `FileSystem` API has become somewhat of a de facto common file storage layer for big data processing, essentially tying big data to *nix systems if local file access is desired. For example, Apache Spark pulls in Hadoop's `FileSystem` (and the entire Hadoop client access layer) to write output files to the local file system. Running Spark on Windows, even for prototyping, would be impossible without a Windows implementation of `FileSystem`.
`RawLocalFileSystem`, accessed indirectly via `LocalFileSystem`, is Hadoop's attempt at Java access of a local file system. It was written before Java added access to *nix-centric features such as POSIX file permissions. `RawLocalFileSystem` attempts to access the local file system using system libraries via JNI, and if that is not possible it falls back to creating `Shell` processes that run *nix commands such as `chmod` or `bash`. This in itself represents a security concern, not to mention an inefficient kludge.
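To make the pattern concrete, here is a simplified sketch of that shell fallback (not Hadoop's actual code): each permission change spawns an external `chmod` process.

```java
import java.io.IOException;

// Simplified sketch of the shell-fallback pattern described above
// (not Hadoop's actual code).
final class ShellChmodSketch {
  static void setPermissionViaShell(String path, short mode)
      throws IOException, InterruptedException {
    // One external process per file operation: inefficient, and the spawned
    // command is one more attack surface to worry about.
    Process process = new ProcessBuilder(
        "chmod", String.format("%04o", mode), path).start();
    if (process.waitFor() != 0) {
      throw new IOException("chmod failed for " + path);
    }
  }

  public static void main(String[] args) throws Exception {
    setPermissionViaShell("/tmp/example.txt", (short) 0644); // example path
  }
}
```

Needless to say, `chmod` does not exist on stock Windows, which is exactly the gap Winutils was created to fill.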
In order to allow `RawLocalFileSystem` to function on Windows (for example to run Spark), one Hadoop contributor created the winutils package: a set of binary files that run on Windows and "intercept" the `RawLocalFileSystem` native calls. While the ability to run Spark on Windows was of course welcome, the design represents one more kludge on top of an existing kludge, requiring the trust of more binary distributions and another potential vector for malicious code. (For these reasons there are Hadoop tickets such as HADOOP-13223: winutils.exe is a bug nexus and should be killed with an axe.)
This Hadoop Bare Naked Local File System project bypasses Winutils and forces Hadoop to access the file system via pure Java. The `BareLocalFileSystem` and `NakedLocalFileSystem` classes are versions of `LocalFileSystem` and `RawLocalFileSystem`, respectively, which bypass the outdated native and shell access to the local file system and use the Java API instead. This means that projects like Spark can access the file system on Windows as well as on other platforms, without the need to pull in some third-party kludge such as Winutils.
This Hadoop Bare Naked Local File System implementation extends `LocalFileSystem` and `RawLocalFileSystem` and "reverses" or "undoes" as much of the JNI and shell access as possible. Much of the original Hadoop kludge implementation is still present beneath the surface (meaning that "Bare Naked" is for the moment a bit of a misnomer).
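To illustrate the general approach (a minimal sketch with a hypothetical class name, not the project's actual code), one such override replaces the `chmod` shell-out with pure-Java permission setters:

```java
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RawLocalFileSystem;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

// Hypothetical subclass illustrating the technique; the real classes are
// NakedLocalFileSystem and BareLocalFileSystem.
public class PureJavaLocalFileSystem extends RawLocalFileSystem {
  @Override
  public void setPermission(Path p, FsPermission permission) throws IOException {
    File file = pathToFile(p); // RawLocalFileSystem helper: Hadoop Path -> java.io.File
    FsAction user = permission.getUserAction();
    // java.io.File permission setters are pure Java and work on Windows,
    // unlike the chmod/JNI paths in RawLocalFileSystem. (Simplified: only
    // the owner bits are applied; group/other handling is elided.)
    file.setReadable(user.implies(FsAction.READ), true);
    file.setWritable(user.implies(FsAction.WRITE), true);
    file.setExecutable(user.implies(FsAction.EXECUTE), true);
  }
}
```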
Unfortunately, solving the problem of Hadoop's default local file system access isn't as simple as just changing native/shell calls to their modern Java equivalents. The current `LocalFileSystem` and `RawLocalFileSystem` implementations have evolved haphazardly, with halfway-implemented features scattered about, special-case code for ill-documented corner cases, and implementation-specific assumptions permeating the design itself. Here are a few examples.
- Tentacled references to Winutils are practically everywhere, even where you least expect them. When Spark starts up, the (shaded) `org.apache.hadoop.security.SecurityUtil` class statically sets up an internal configuration object using `setConfigurationInternal(new Configuration())`. Part of this initialization calls `conf.getBoolean(CommonConfigurationKeys.HADOOP_SECURITY_TOKEN_SERVICE_USE_IP, CommonConfigurationKeys.HADOOP_SECURITY_TOKEN_SERVICE_USE_IP_DEFAULT)`. What could `Configuration.getBoolean()` have to do with Winutils? The (shaded) `org.apache.hadoop.conf.Configuration.getBoolean(String name, boolean defaultValue)` method uses `StringUtils.equalsIgnoreCase("true", valueString)` to convert the string to a boolean, and the (shaded) `org.apache.hadoop.util.StringUtils` class has a static reference to `Shell` in `public static final Pattern ENV_VAR_PATTERN = Shell.WINDOWS ? WIN_ENV_VAR_PATTERN : SHELL_ENV_VAR_PATTERN`. Simply referencing `Shell` brings in its whole static initialization block, which looks for Winutils. This is low-level, hard-coded trash that was a bad design to begin with. (These sorts of things should be pluggable and configurable, not hard-coded into static initializations.)
```java
if (WINDOWS) {
  try {
    file = getQualifiedBin(WINUTILS_EXE);
    path = file.getCanonicalPath();
    ioe = null;
  } catch (IOException e) {
    LOG.warn("Did not find {}: {}", WINUTILS_EXE, e);
    // stack trace comes at debug level
    LOG.debug("Failed to find " + WINUTILS_EXE, e);
    file = null;
    path = null;
    ioe = e;
  }
} else {
```
- At the `FileSystem` level, Winutils-related logic isn't contained in `RawLocalFileSystem`, which would have allowed it to be easily overridden; instead it relies on the static `FileUtil` class, which is like a separate file system implementation that relies on Winutils and can't be modified. For example here is `FileUtil` code that would need to be updated, unfortunately independently of the `FileSystem` implementation (a pure-Java alternative follows the snippet):
```java
public static String readLink(File f) {
  /* NB: Use readSymbolicLink in java.nio.file.Path once available. Could
   * use getCanonicalPath in File to get the target of the symlink but that
   * does not indicate if the given path refers to a symlink.
   */
  …
  try {
    return Shell.execCommand(
        Shell.getReadlinkCommand(f.toString())).trim();
  } catch (IOException x) {
    return "";
  }
}
```
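For comparison, a pure-Java equivalent of the same contract is now trivial; `java.nio.file` long ago added the `readSymbolicLink` that the comment above wishes for. A sketch (matching the empty-string convention of the shell version):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class PureJavaReadLink {
  // Return the symlink target, or "" if the path isn't a symlink or the
  // link can't be read, matching FileUtil.readLink's observable behavior.
  static String readLink(File f) {
    Path path = f.toPath();
    if (!Files.isSymbolicLink(path)) {
      return "";
    }
    try {
      return Files.readSymbolicLink(path).toString();
    } catch (IOException e) {
      return "";
    }
  }
}
```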
- Apparently there is a "new `Stat` based implementation" of many methods, but `RawLocalFileSystem` instead uses deprecated implementations such as `DeprecatedRawLocalFileStatus`, which is full of workarounds and special cases, is package-private so it can't be accessed by subclasses, yet can't be removed because of HADOOP-9652. The `useDeprecatedFileStatus` switch is hard-coded so that it can't be modified by a subclass, forcing a re-implementation of everything it touches. In other words, even the new, less-kludgey approach is switched off in the code, has been for years, and no one seems to be paying it any mind. `DeprecatedRawLocalFileStatus` tries to detect whether permissions have already been loaded for a file by checking for an empty-string owner; in some conditions (a shell error) the owner is set to `null`, which under the covers actually sets the value to `""`, making the whole process brittle. Moreover an error condition would cause an endless cycle of attempting to load permissions. (And what is `CopyFiles.FilePair()`, and does the current implementation break it, or would it only be broken if "extra fields" are added?)
```java
/* We can add extra fields here. It breaks at least CopyFiles.FilePair().
 * We recognize if the information is already loaded by check if
 * onwer.equals("").
 */
private boolean isPermissionLoaded() {
  return !super.getOwner().isEmpty();
}
```
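A pure-Java status load sidesteps the empty-owner sentinel entirely by reading everything eagerly in one call. A minimal sketch using `java.nio.file` attributes (assuming a POSIX-capable platform; Windows would need the owner/DOS attribute views instead):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFileAttributes;

final class EagerStatusSketch {
  public static void main(String[] args) throws IOException {
    Path path = Paths.get(args.length > 0 ? args[0] : ".");
    // Read owner, group, and permissions eagerly in a single call, rather
    // than lazily shelling out and signaling "not loaded" via owner == "".
    PosixFileAttributes attrs = Files.readAttributes(
        path, PosixFileAttributes.class, LinkOption.NOFOLLOW_LINKS);
    System.out.println("owner: " + attrs.owner().getName());
    System.out.println("group: " + attrs.group().getName());
    System.out.println("perms: " + attrs.permissions());
  }
}
```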