This is a general-purpose cloudstore CLI command for Hadoop which can evolve fast, as it can be released daily if need be.
License: Apache ASF 2.0
All the implementation classes are under the org.apache.hadoop.fs package tree, with the goal of ultimately moving this into Hadoop itself. It has been kept separate for now for a few reasons:
- Faster release cycle, so the diagnostics can evolve to track features going into Hadoop-trunk.
- Fewer test requirements. This is naughty, but...
- Ability to compile against older versions. (We have currently switched to Hadoop 3.3+, due to the need to make API calls and use operations not in older versions.)
Author: Steve Loughran, Hadoop Committer, plus anyone else who has debugging needs.
Why?
- Sometimes things fail, and the first problem is classpath;
- The second, invariably some client-side config.
- Then there's networking and permissions...
- The Hadoop FS connectors all assume a well configured system, and don't do much in terms of meaningful diagnostics.
- This is compounded by the fact that we dare not log secret credentials.
- And in support calls, it's all too easy to get hold of those secrets, even though it's a major security breach to share them.
The main hadoop fs commands are written assuming a filesystem, where:
- Recursive treewalks are the way to traverse the store.
- The code was written for Hadoop 1.0 and uses the filesystem APIs of that era.
- The commands are often used in shell scripts and workflows, including parsing the output: we do not dare change the behaviour or output for this reason.
- And the shell removes stack traces on failures, making it of "limited value" when things don't work. Object stores are also fairly fussy to get working, primarily due to authentication, classpath and network settings.
The storediag
entry point is designed to pick up the FS settings, dump them
with sanitized secrets, and display their provenance. It then
bootstraps connectivity with an attempt to initiate (unauthed) HTTP connections
to the store's endpoints. This should be sufficient to detect proxy and
endpoint configuration problems.
Then it tries to perform some reads and writes against the store. If these fail, then there's clearly a problem. Hopefully though, there's now enough information to begin determining what it is.
Finally, if things do fail, the printed configuration excludes the login secrets, so it is safer to include in bug reports.
hadoop jar cloudstore-1.0.jar storediag -j -5 s3a://landsat-pds/
hadoop jar cloudstore-1.0.jar storediag --tokenfile mytokens.bin s3a://my-readwrite-bucket/
hadoop jar cloudstore-1.0.jar storediag wasb://container@user/subdir
hadoop jar cloudstore-1.0.jar storediag abfs://container@user/
The remote store is required to grant full R/W access to the caller, otherwise the creation tests will fail.
The --tokenfile option loads tokens saved with hdfs fetchdt. It does not need Kerberos, though most filesystems expect Kerberos to be enabled before they pick up tokens (S3A does not; other stores may vary).
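For example, a sketch of the two-step sequence on a Kerberized cluster (the token file name is illustrative, and the exact hdfs fetchdt options vary by release):
kinit user@REALM
hdfs fetchdt mytokens.bin
hadoop jar cloudstore-1.0.jar storediag --tokenfile mytokens.bin s3a://my-readwrite-bucket/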
Usage: storediag [options] <filesystem>
-D <key=value> Define a property
-e List the environment variables. *danger: does not redact secrets*
-h redact all chars in sensitive options
-j List the JARs on the classpath
-l Dump the Log4J settings
-o Downgrade all 'required' classes to optional
-principal <principal> kerberos principal to request a token for
-required <file> text file of extra classes+resources to require
-s List the JVM System Properties
-t Require delegation tokens to be issued
-tokenfile <file> Hadoop token file to load
-verbose Verbose output
-w attempt write operations on the filesystem
-xmlfile <file> XML config file to load
-5 Print MD5 checksums of the jars listed (requires -j)
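For example, a run that also attempts write operations, requires a delegation token to be issued and prints verbose output (the bucket name is illustrative):
hadoop jar cloudstore-1.0.jar storediag -w -t -verbose s3a://my-readwrite-bucket/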
The -required option takes a text file where every line is one of: a #-prefixed comment, a blank line, a classname, or a resource (with a "/" in it). All of the classes and resources listed are loaded.
hadoop jar cloudstore-1.0.jar storediag -j -5 -required required.txt s3a://something/
and with a required.txt file listing the classes and resources you require:
# S3A
org.apache.hadoop.fs.s3a.auth.delegation.S3ADelegationTokens
# Misc
org.apache.commons.configuration.Configuration
org.apache.commons.lang3.StringUtils
This is useful to dynamically add some extra mandatory classes to the list of classes you need to work with a store...most useful when either you are developing new features and want to verify they are on the classpath, or you are working with an unknown object store and just want to check its dependencies up front.
A missing class or resource will result in an error and the command failing.
The comments are printed too! This means you can use them in the reports.
Measure upload/download bandwidth.
bin/hadoop jar cloudstore-1.0.jar bandwidth
Usage: bandwidth [options] size <path>
-D <key=value> Define a property
-tokenfile <file> Hadoop token file to load
-verbose print verbose output
-xmlfile <file> XML config file to load
See bandwidth for details.
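For example, a sketch of an upload/download test against a writable bucket (the 256M size argument and the path are illustrative; check the bandwidth docs for the exact size syntax your release accepts):
hadoop jar cloudstore-1.0.jar bandwidth 256M s3a://my-readwrite-bucket/tmp/bandwidth.dat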
Prints some of the low-level diagnostic information about an S3 bucket which can be obtained via the AWS APIs.
hadoop jar cloudstore-1.0.jar \
bucketstate \
s3a://mybucket/
2019-07-25 16:54:50,678 [main] INFO tools.BucketState (DurationInfo.java:<init>(53)) - Starting: Bucket State
2019-07-25 16:54:54,216 [main] WARN s3a.S3AFileSystem (S3AFileSystem.java:getAmazonS3ClientForTesting(675)) - Access to S3A client requested, reason Diagnostics
Bucket owner is alice (ID=593...e1)
Bucket policy:
NONE
If you don't have the permissions to read the bucket policy, you get a stack trace.
hadoop jar cloudstore-1.0.jar \
bucketstate \
s3a://mybucket/
2019-07-25 16:55:23,023 [main] INFO tools.BucketState (DurationInfo.java:<init>(53)) - Starting: Bucket State
2019-07-25 16:55:25,993 [main] WARN s3a.S3AFileSystem (S3AFileSystem.java:getAmazonS3ClientForTesting(675)) - Access to S3A client requested, reason Diagnostics
Bucket owner is alice (ID=593...e1)
2019-07-25 16:55:26,883 [main] INFO tools.BucketState (DurationInfo.java:close(100)) - Bucket State: duration 0:03:862
com.amazonaws.services.s3.model.AmazonS3Exception: The specified method is not allowed against this resource. (Service: Amazon S3; Status Code: 405; Error Code: MethodNotAllowed; Request ID: 3844E3089E3801D8; S3 Extended Request ID: 3HJVN5+MvOGit087AFqKLUyOUCU9inCakvJ44GW5Wb4toiVipEiv5uK6A54LQBjdKFYUU8ZI5XQ=), S3 Extended Request ID: 3HJVN5+MvOGit087AFqKLUyOUCU9inCakvJ44GW5Wb4toiVipEiv5uK6A54LQBjdKFYUU8ZI5XQ=
Bulk download of everything from s3a://bucket/qelogs/ to the local dir localqelogs (assuming the default fs is file://)
Usage
cloudup -s source -d dest [-o] [-i] [-l <largest>] [-t threads]
-s <uri> : source
-d <uri> : dest
-o: overwrite
-i: ignore failures
-t <n> : number of threads
-l <n> : number of "largest" files to start uploading before just randomly picking files.
Algorithm
- Source files are listed.
- A pool of worker threads is created.
- The largest N files are queued for upload first, where N is a default or the value set by -l.
- The remainder of the files are randomized to avoid throttling and then queued.
- The program waits for everything to complete.
- Source and dest FS stats are printed.
This is not distcp run across a cluster; it's a single process with some threads. It works best for reading lots of small files from an object store, or when you have a mix of large and small files to download or upload.
bin/hadoop jar cloudstore-1.0.jar cloudup \
-s s3a://bucket/qelogs/ \
-d localqelogs \
-t 32 -o
and the other way
bin/hadoop jar cloudstore-1.0.jar cloudup \
-d localqelogs \
-s s3a://bucket/qelogs/ \
-t 32 -o -l 4
Tries to instantiate a committer using the Hadoop 3.1+ committer factory mechanism, printing out what committer a specific path will create.
If this command fails with a ClassNotFoundException, it can mean that the version of Hadoop the command is being run against doesn't have this new API; the committer used will therefore always be the classic FileOutputCommitter.
Good: ABFS container with the ManifestCommitter
hadoop jar cloudstore-1.0.jar committerinfo abfs://testing@ukwest.dfs.core.windows.net/
2021-09-16 19:42:59,731 [main] INFO commands.CommitterInfo (StoreDurationInfo.java:<init>(53)) - Starting: Create committer Committer factory for path abfs://testing@ukwest.dfs.core.windows.net/ is org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory@3315d2d7
(classname org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory)
2021-09-16 19:43:00,897 [main] INFO manifest.ManifestCommitter (ManifestCommitter.java:<init>(144)) - Created ManifestCommitter with JobID job__0000, Task Attempt attempt__0000_r_000000_1 and destination abfs://testing@ukwest.dfs.core.windows.net/ Created committer of class org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitter:
ManifestCommitter{ManifestCommitterConfig{destinationDir=abfs://testing@ukwest.dfs.core.windows.net/, role='task committer', taskAttemptDir=abfs://testing@ukwest.dfs.core.windows.net/_temporary/manifest_job__0000/0/_temporary/attempt__0000_r_000000_1, createJobMarker=true, jobUniqueId='job__0000', jobUniqueIdSource='JobID', jobAttemptNumber=0, jobAttemptId='job__0000_0', taskId='task__0000_r_000000', taskAttemptId='attempt__0000_r_000000_1'}, iostatistics=counters=();
gauges=();
minimums=();
maximums=();
means=(); }
This is the new committer optimised for performance on abfs and gcs, where directory listings are performed in task commit, and the lists of files to rename and directories to create saved in manifest files. Job commit consists of loading the manifests and renaming all the files, operations which can all be parallelized. ABFS adds extra integration, including rate limiting of rename operations and recovery from rename failures under load. If your hadoop release supports this committer, do try it.
Danger: S3A Bucket with the classic FileOutputCommitter
> hadoop jar cloudstore-1.0.jar committerinfo s3a://landsat-pds/
2019-08-05 17:38:38,213 [main] INFO commands.CommitterInfo (DurationInfo.java:<init>(53)) - Starting: Create committer
2019-08-05 17:38:40,968 [main] WARN commit.AbstractS3ACommitterFactory (S3ACommitterFactory.java:createTaskCommitter(90)) - Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
2019-08-05 17:38:40,968 [main] INFO output.FileOutputCommitter (FileOutputCommitter.java:<init>(141)) - File Output Committer Algorithm version is 2
2019-08-05 17:38:40,968 [main] INFO output.FileOutputCommitter (FileOutputCommitter.java:<init>(156)) - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2019-08-05 17:38:40,970 [main] INFO commit.AbstractS3ACommitterFactory (AbstractS3ACommitterFactory.java:createOutputCommitter(54)) - Using Commmitter FileOutputCommitter{PathOutputCommitter{context=TaskAttemptContextImpl{JobContextImpl{jobId=job__0000}; taskId=attempt__0000_r_000000_1, status=''}; org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter@6cc0bcf6}; outputPath=s3a://landsat-pds/, workPath=s3a://landsat-pds/_temporary/0/_temporary/attempt__0000_r_000000_1, algorithmVersion=2, skipCleanup=false, ignoreCleanupFailures=false} for s3a://landsat-pds/
Created committer of class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter{PathOutputCommitter{context=TaskAttemptContextImpl{JobContextImpl{jobId=job__0000}; taskId=attempt__0000_r_000000_1, status=''}; org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter@6cc0bcf6}; outputPath=s3a://landsat-pds/, workPath=s3a://landsat-pds/_temporary/0/_temporary/attempt__0000_r_000000_1, algorithmVersion=2, skipCleanup=false, ignoreCleanupFailures=false}
2019-08-05 17:38:40,970 [main] INFO commands.CommitterInfo (DurationInfo.java:close(100)) - Create committer: duration 0:02:758
Good: S3A bucket with a staging committer:
> hadoop jar cloudstore-1.0.jar committerinfo s3a://hwdev-steve-ireland-new/
2019-08-05 17:42:53,563 [main] INFO commands.CommitterInfo (DurationInfo.java:<init>(53)) - Starting: Create committer
Committer factory for path s3a://hwdev-steve-ireland-new/ is org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory@3088660d (classname org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory)
2019-08-05 17:42:55,433 [main] INFO output.FileOutputCommitter (FileOutputCommitter.java:<init>(141)) - File Output Committer Algorithm version is 1
2019-08-05 17:42:55,434 [main] INFO output.FileOutputCommitter (FileOutputCommitter.java:<init>(156)) - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2019-08-05 17:42:55,434 [main] INFO commit.AbstractS3ACommitterFactory (S3ACommitterFactory.java:createTaskCommitter(83)) - Using committer directory to output data to s3a://hwdev-steve-ireland-new/
2019-08-05 17:42:55,435 [main] INFO commit.AbstractS3ACommitterFactory (AbstractS3ACommitterFactory.java:createOutputCommitter(54)) - Using Commmitter StagingCommitter{AbstractS3ACommitter{role=Task committer attempt__0000_r_000000_1, name=directory, outputPath=s3a://hwdev-steve-ireland-new/, workPath=file:/tmp/hadoop-stevel/s3a/job__0000/_temporary/0/_temporary/attempt__0000_r_000000_1}, conflictResolution=APPEND, wrappedCommitter=FileOutputCommitter{PathOutputCommitter{context=TaskAttemptContextImpl{JobContextImpl{jobId=job__0000}; taskId=attempt__0000_r_000000_1, status=''}; org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter@5a865416}; outputPath=file:/Users/stevel/Projects/Releases/hadoop-3.3.0-SNAPSHOT/tmp/staging/stevel/job__0000/staging-uploads, workPath=null, algorithmVersion=1, skipCleanup=false, ignoreCleanupFailures=false}} for s3a://hwdev-steve-ireland-new/
Created committer of class org.apache.hadoop.fs.s3a.commit.staging.DirectoryStagingCommitter: StagingCommitter{AbstractS3ACommitter{role=Task committer attempt__0000_r_000000_1, name=directory, outputPath=s3a://hwdev-steve-ireland-new/, workPath=file:/tmp/hadoop-stevel/s3a/job__0000/_temporary/0/_temporary/attempt__0000_r_000000_1}, conflictResolution=APPEND, wrappedCommitter=FileOutputCommitter{PathOutputCommitter{context=TaskAttemptContextImpl{JobContextImpl{jobId=job__0000}; taskId=attempt__0000_r_000000_1, status=''}; org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter@5a865416}; outputPath=file:/Users/stevel/Projects/Releases/hadoop-3.3.0-SNAPSHOT/tmp/staging/stevel/job__0000/staging-uploads, workPath=null, algorithmVersion=1, skipCleanup=false, ignoreCleanupFailures=false}}
2019-08-05 17:42:55,435 [main] INFO commands.CommitterInfo (DurationInfo.java:close(100)) - Create committer: duration 0:01:874
The log entry about a FileOutputCommitter
appears because the Staging Committers use the cluster filesystem (HDFS, etc) to safely pass information from the workers to the application master.
The classic filesystem committer v1 is used because it works well here: the filesystem is consistent and operations are fast. Neither of those conditions is met with AWS S3.
Good: S3A bucket with a magic committer:
> hadoop jar cloudstore-1.0.jar committerinfo s3a://hwdev-steve-ireland-new/
2019-08-05 17:37:42,615 [main] INFO commands.CommitterInfo (DurationInfo.java:<init>(53)) - Starting: Create committer
2019-08-05 17:37:44,462 [main] INFO commit.AbstractS3ACommitterFactory (S3ACommitterFactory.java:createTaskCommitter(83)) - Using committer magic to output data to s3a://hwdev-steve-ireland-new/
2019-08-05 17:37:44,462 [main] INFO commit.AbstractS3ACommitterFactory (AbstractS3ACommitterFactory.java:createOutputCommitter(54)) - Using Commmitter MagicCommitter{} for s3a://hwdev-steve-ireland-new/
Created committer of class org.apache.hadoop.fs.s3a.commit.magic.MagicS3GuardCommitter: MagicCommitter{}
2019-08-05 17:37:44,462 [main] INFO commands.CommitterInfo (DurationInfo.java:close(100)) - Create committer: duration 0:01:849
A variant on the hadoop du command which issues a recursive listFiles() call on every directory immediately under the source path, in separate threads.
For any store which supports higher-performance deep tree listing (S3A in particular), this can be significantly faster than du's normal treewalk.
Even without that, because the lists are done in separate threads, a speedup is almost guaranteed.
There is no scheduling of work into separate threads within a directory; those stores which do prefetching in separate threads (recent ABFS and S3A builds) add some parallelism here.
Usage: dux
-D <key=value> Define a property
-tokenfile <file> Hadoop token file to load
-xmlfile <file> XML config file to load
-threads <threads> number of threads
-limit <limit> limit of files to list
-verbose print verbose output
-bfs do a breadth first search of the path
<path>
The -verbose option prints out more filesystem statistics, and those of the list iterators (useful if they publish statistics).
The -limit option puts a limit on the total number of files to scan; this is useful when doing deep scans of buckets so as to put an upper bound on the scan. Note: when used against S3, an ERROR may be printed by the AWS SDK. This is harmless; it comes from the SDK thread pool being closed while a list page prefetch is in progress.
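For example, a sketch of a capped, multi-threaded scan (the thread count, limit and bucket are illustrative):
hadoop jar cloudstore-1.0.jar dux -threads 16 -limit 10000 -verbose s3a://mybucket/datasets/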
This is an extension of hdfs fetchdt
which collects delegation tokens
from a list of filesystems, saving them to a file.
hadoop jar cloudstore-1.0.jar fetchdt hdfs://tokens.bin s3a://landsat-pds/ s3a://bucket2
Usage: fetchdt <file> [-renewer <renewer>] [-r] [-p] <url1> ... <url999>
-r: require each filesystem to issue a token
-p: protobuf format
Successful query of an S3A session delegation token.
> bin/hadoop jar cloudstore-1.0.jar fetchdt -p -r file:/tmp/secrets.bin s3a://landsat-pds/
Collecting tokens for 1 filesystem to to file:/tmp/secrets.bin
2018-12-05 17:50:44,276 INFO fs.FetchTokens: Starting: Fetching token for s3a://landsat-pds/
2018-12-05 17:50:44,399 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2018-12-05 17:50:44,458 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2018-12-05 17:50:44,459 INFO impl.MetricsSystemImpl: s3a-file-system metrics system started
2018-12-05 17:50:44,474 INFO delegation.S3ADelegationTokens: Filesystem s3a://landsat-pds is using delegation tokens of kind S3ADelegationToken/Session
2018-12-05 17:50:44,547 INFO delegation.S3ADelegationTokens: No delegation tokens present: using direct authentication
2018-12-05 17:50:44,547 INFO delegation.S3ADelegationTokens: S3A Delegation support token (none) with Session token binding for user hrt_qa, with STS endpoint "", region "" and token duration 30:00
2018-12-05 17:50:46,604 INFO delegation.S3ADelegationTokens: Starting: Creating New Delegation Token
2018-12-05 17:50:46,620 INFO delegation.SessionTokenBinding: Creating STS client for Session token binding for user hrt_qa, with STS endpoint "", region "" and token duration 30:00
2018-12-05 17:50:46,646 INFO auth.STSClientFactory: Requesting Amazon STS Session credentials
2018-12-05 17:50:47,099 INFO delegation.S3ADelegationTokens: Created S3A Delegation Token: Kind: S3ADelegationToken/Session, Service: s3a://landsat-pds, Ident: (S3ATokenIdentifier{S3ADelegationToken/Session; uri=s3a://landsat-pds; timestamp=1544032247065; encryption=(no encryption); 80fc87d2-0da2-4438-9ba6-7ae82751aba5; Created on HW13176.local/192.168.99.1 at time 2018-12-05T17:50:46.608Z.}; session credentials expiring on Wed Dec 05 18:20:46 GMT 2018; (valid))
2018-12-05 17:50:47,099 INFO delegation.S3ADelegationTokens: Creating New Delegation Token: duration 0:00.495s
Fetched token: Kind: S3ADelegationToken/Session, Service: s3a://landsat-pds, Ident: (S3ATokenIdentifier{S3ADelegationToken/Session; uri=s3a://landsat-pds; timestamp=1544032247065; encryption=(no encryption); 80fc87d2-0da2-4438-9ba6-7ae82751aba5; Created on HW13176.local/192.168.99.1 at time 2018-12-05T17:50:46.608Z.}; session credentials expiring on Wed Dec 05 18:20:46 GMT 2018; (valid))
2018-12-05 17:50:47,100 INFO fs.FetchTokens: Fetching token for s3a://landsat-pds/: duration 0:02:825
Saved 1 token to file:/tmp/secrets.bin
2018-12-05 17:50:47,166 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
2018-12-05 17:50:47,166 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
2018-12-05 17:50:47,166 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
Failure to get a token from a filesystem, with the -r option to require one:
> hadoop jar cloudstore-1.0.jar fetchdt -p -r file:/tmm/secrets.bin file:///
Collecting tokens for 1 filesystem to to file:/tmm/secrets.bin
2018-12-05 17:47:00,970 INFO fs.FetchTokens: Starting: Fetching token for file:/
No token for file:/
2018-12-05 17:47:00,972 INFO fs.FetchTokens: Fetching token for file:/: duration 0:00:003
2018-12-05 17:47:00,973 INFO util.ExitUtil: Exiting with status 44: No token issued by filesystem file:///
Same command, without the -r.
> hadoop jar cloudstore-1.0.jar fetchdt -p file:/tmm/secrets.bin file:///
Collecting tokens for 1 filesystem to to file:/tmp/secrets.bin
2018-12-05 17:54:26,776 INFO fs.FetchTokens: Starting: Fetching token for file:/tmp
No token for file:/tmp
2018-12-05 17:54:26,778 INFO fs.FetchTokens: Fetching token for file:/tmp: duration 0:00:002
No tokens collected, file file:/tmp/secrets.bin unchanged
The token file is not modified.
Calls getFileStatus on the listed paths and prints the values. For stores whose FileStatus subclasses include more detail in their toString values, this can be more meaningful.
Also prints the time to execute each operation (including instantiating the store),
and with the -verbose
option, the store statistics.
hadoop jar cloudstore-1.0.jar \
filestatus \
s3a://guarded-table/example
2019-07-31 21:48:34,963 [main] INFO commands.PrintStatus (DurationInfo.java:<init>(53)) - Starting: get path status
s3a://guarded-table/example S3AFileStatus{path=s3a://guarded-table/example; isDirectory=false; length=0; replication=1;
blocksize=33554432; modification_time=1564602680000;
access_time=0; owner=stevel; group=stevel;
permission=rw-rw-rw-; isSymlink=false; hasAcl=false; isEncrypted=true; isErasureCoded=false}
isEmptyDirectory=FALSE eTag=d41d8cd98f00b204e9800998ecf8427e versionId=null
2019-07-31 21:48:37,182 [main] INFO commands.PrintStatus (DurationInfo.java:close(100)) - get path status: duration 0:02:221
Helps debug GCS credential bindings as set in fs.gs.auth.service.account.private.key; it does this with some better diagnostics of parsing problems.
Warning: with -verbose, this prints your private key.
hadoop jar cloudstore-1.0.jar gcscreds gs://bucket/
key uses \n for separator -gs connector must convert to line endings
2022-01-19 17:55:51,016 [main] INFO gs.PemReader (PemReader.java:readNextSection(86)) - title match at line 1
2022-01-19 17:55:51,020 [main] INFO gs.PemReader (PemReader.java:readNextSection(88)) - scanning for end
Parsed private key -entry length 28 lines
factory com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.CredentialFactory@d706f19
Generates an AWS IAM policy for a given bucket.
hadoop jar cloudstore-1.0.jar iampolicy s3a://example-bucket/
{
"Version" : "2012-10-17",
"Statement" : [ {
"Sid" : "7",
"Effect" : "Allow",
"Action" : [ "s3:GetBucketLocation", "s3:ListBucket*" ],
"Resource" : "arn:aws:s3:::example-bucket"
}, {
"Sid" : "8",
"Effect" : "Allow",
"Action" : [ "s3:Get*", "s3:PutObject", "s3:PutObjectAcl", "s3:DeleteObject", "s3:AbortMultipartUpload" ],
"Resource" : "arn:aws:s3:::example-bucket/*"
}, {
"Sid" : "1",
"Effect" : "Allow",
"Action" : "kms:*",
"Resource" : "*"
} ]
}
Notes:
- The KMS policy is always added in case data is encrypted/decrypted with S3-KMS; it is not needed if this is not the case.
- Read and write access up the tree is needed. If you enable directory marker retention, writing from the root may become optional.
- "s3:GetBucketLocation" is used by the bucket existence v2 check. If that probe is set to 0, it is never called.
Does a recursive listing of a path. Uses listFiles(path, recursive), so for any object store which can implement this as a deep paginated scan, it is much, much faster.
Usage: list
-D <key=value> Define a property
-tokenfile <file> Hadoop token file to load
-limit <limit> limit of files to list
-verbose print verbose output
-xmlfile <file> XML config file to load
Example: list some of the AWS public landsat store.
> bin/hadoop jar cloudstore-1.0.jar list -limit 10 s3a://landsat-pds/
Listing up to 10 files under s3a://landsat-pds/
2019-04-05 21:32:14,523 [main] INFO tools.ListFiles (StoreDurationInfo.java:<init>(53)) - Starting: Directory list
2019-04-05 21:32:14,524 [main] INFO tools.ListFiles (StoreDurationInfo.java:<init>(53)) - Starting: First listing
2019-04-05 21:32:15,754 [main] INFO tools.ListFiles (DurationInfo.java:close(100)) - First listing: duration 0:01:230
[1] s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B1.TIF 63,786,465 stevel stevel [encrypted]
[2] s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B1.TIF.ovr 8,475,353 stevel stevel [encrypted]
[3] s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B10.TIF 35,027,713 stevel stevel [encrypted]
[4] s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B10.TIF.ovr 6,029,012 stevel stevel [encrypted]
[5] s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B10_wrk.IMD 10,213 stevel stevel [encrypted]
[6] s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B11.TIF 34,131,348 stevel stevel [encrypted]
[7] s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B11.TIF.ovr 5,891,395 stevel stevel [encrypted]
[8] s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B11_wrk.IMD 10,213 stevel stevel [encrypted]
[9] s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B1_wrk.IMD 10,213 stevel stevel [encrypted]
[10] s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B2.TIF 64,369,211 stevel stevel [encrypted]
2019-04-05 21:32:15,757 [main] INFO tools.ListFiles (DurationInfo.java:close(100)) - Directory list: duration 0:01:235
Found 10 files, 124 milliseconds per file
Data size 217,741,136 bytes, 21,774,113 bytes per file
List all objects under a path through the low-level S3 APIs. This bypasses the filesystem metaphor and gives the real view of the object store.
The -purge
option will remove all directory markers.
Usage: listobjects <path>
-D <key=value> Define a property
-limit <limit> limit of files to list
-purge purge directory markers
-tokenfile <file> Hadoop token file to load
-verbose print verbose output
-xmlfile <file> XML config file to load
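For example, a sketch listing the first 50 objects under a path with verbose output (the bucket, path and limit are illustrative):
hadoop jar cloudstore-1.0.jar listobjects -limit 50 -verbose s3a://mybucket/data/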
See listversions.
Prints out localhost information from the Java APIs and then the Hadoop network APIs.
Use the mapreduce LocatedFileStatusFetcher
to scan for all non-hidden
files under a path. This matches exactly the scan process used in FileInputFormat
,
so offers a command line way to view and tune scans of object stores.
It can also be used alongside the list command to compare the maximum performance of scanning the directory tree with the actual performance you are likely to see during query planning.
To control the number of threads used to scan directories in production
jobs, set the value in the configuration option
mapreduce.input.fileinputformat.list-status.num-threads
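As a sketch, a job submitted through a driver that supports GenericOptionsParser could set this on the command line (the job JAR, class name and paths here are hypothetical); Spark and Hive have their own mechanisms for passing Hadoop configuration options:
hadoop jar my-application.jar MyJob \
  -D mapreduce.input.fileinputformat.list-status.num-threads=32 \
  s3a://mybucket/input s3a://mybucket/output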
Usage:
hadoop jar cloudstore-1.0.jar locatefiles
Usage: locatefiles
-D <key=value> Define a property
-tokenfile <file> Hadoop token file to load
-xmlfile <file> XML config file to load
-threads <threads> number of threads
-verbose print verbose output
[<path>|<pattern>]
Example
> hadoop jar cloudstore-1.0.jar locatefiles \
-threads 8 -verbose \
s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/
Locating files under s3a://landsat-pds/L8/001/002/LC80010022016230LGN00 with thread count 8
===========================================================================================
2019-07-29 16:48:19,844 [main] INFO tools.LocateFiles (DurationInfo.java:<init>(53)) - Starting: List located files
2019-07-29 16:48:19,847 [main] INFO tools.LocateFiles (DurationInfo.java:<init>(53)) - Starting: LocateFileStatus execution
2019-07-29 16:48:24,645 [main] INFO tools.LocateFiles (DurationInfo.java:close(100)) - LocateFileStatus execution: duration 0:04:798
[0001] s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B1.TIF 63,786,465 stevel stevel [encrypted]
[0002] s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B1.TIF.ovr 8,475,353 stevel stevel [encrypted]
[0003] s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B10.TIF 35,027,713 stevel stevel [encrypted]
[0004] s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_B10.TIF.ovr 6,029,012 stevel stevel [encrypted]
...
[0039] s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_thumb_large.jpg 270,427 stevel stevel [encrypted]
[0040] s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/LC80010022016230LGN00_thumb_small.jpg 12,873 stevel stevel [encrypted]
[0041] s3a://landsat-pds/L8/001/002/LC80010022016230LGN00/index.html 3,780 stevel stevel [encrypted]
2019-07-29 16:48:24,652 [main] INFO tools.LocateFiles (DurationInfo.java:close(100)) - List located files: duration 0:04:810
Found 41 files, 117 milliseconds per file
Data size 923,518,099 bytes, 22,524,831 bytes per file
Storage Statistics
==================
directories_created 0
directories_deleted 0
files_copied 0
files_copied_bytes 0
files_created 0
files_deleted 0
files_delete_rejected 0
fake_directories_created 0
fake_directories_deleted 0
ignored_errors 0
op_copy_from_local_file 0
op_create 0
op_create_non_recursive 0
op_delete 0
op_exists 0
op_get_delegation_token 0
op_get_file_checksum 0
op_get_file_status 2
op_glob_status 1
op_is_directory 0
op_is_file 0
op_list_files 0
op_list_located_status 1
op_list_status 0
op_mkdirs 0
op_open 0
op_rename 0
object_copy_requests 0
object_delete_requests 0
object_list_requests 3
object_continue_list_requests 0
object_metadata_requests 4
object_multipart_initiated 0
object_multipart_aborted 0
object_put_requests 0
object_put_requests_completed 0
object_put_requests_active 0
object_put_bytes 0
object_put_bytes_pending 0
object_select_requests 0
...
store_io_throttled 0
You get the metrics with the -verbose option.
There is plenty of room for improvement in S3A directory tree scanning. Patches welcome!
You can also explore which directory tree structure is most efficient here.
Creates a large CSV file designed to trigger/validate the ABFS prefetching bug which came in HADOOP-17156.
See mkcsv.
Creates a new bucket.
See mkbucket.
Probes a filesystem for a specific named capability on the given path. Requires a version of Hadoop with the PathCapabilities interface, i.e. Hadoop 3.3 onwards.
bin/hadoop jar cloudstore-1.0.jar pathcapability
Usage: pathcapability [options] <capability> <path>
-D <key=value> Define a property
-tokenfile <file> Hadoop token file to load
-verbose print verbose output
-xmlfile <file> XML config file to load
hadoop jar cloudstore-1.0.jar pathcapability fs.s3a.capability.select.sql s3a://landsat-pds/
Using filesystem s3a://landsat-pds
Path s3a://landsat-pds/ has capability fs.s3a.capability.select.sql
The exit code of the command is 0 if the capability is present, -1 if absent, and 55 if the hadoop version does not support the API.
Approximate HTTP equivalent: 505 "Version Not Supported".
As the PathCapabilities API is in Hadoop 3.3, all APIs new to that release (including openFile()) can absolutely be probed for. Otherwise, the 55 response may mean "an API is implemented, just not the probe".
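Because of these exit codes, the command can be used as a probe in shell scripts; a sketch, where the capability name fs.s3a.capability.magic.committer is one the S3A connector publishes in recent releases but should be treated as illustrative:
if hadoop jar cloudstore-1.0.jar pathcapability fs.s3a.capability.magic.committer s3a://my-readwrite-bucket/ ; then
  echo "magic committer available"
else
  echo "capability absent, or the hadoop version lacks the PathCapabilities API"
fi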
Invokes the AWS region provider chain to see if the client can automatically determine the region of AWS SDK calls.
This is how all AWS service clients determine the region for sending/signing requests if not explicitly set.
hadoop jar cloudstore-1.0.jar regions
Determining AWS region for SDK clients
======================================
Determining region using AwsEnvVarOverrideRegionProvider
========================================================
Use environment variable AWS_REGION
2021-06-22 12:04:59,277 [main] INFO extra.Regions (StoreDurationInfo.java:<init>(53)) - Starting: AwsEnvVarOverrideRegionProvider.getRegion()
2021-06-22 12:04:59,284 [main] INFO extra.Regions (StoreDurationInfo.java:close(100)) - AwsEnvVarOverrideRegionProvider.getRegion(): duration 0:00:010
region is not known
Determining region using AwsSystemPropertyRegionProvider
========================================================
System property aws.region
2021-06-22 12:04:59,286 [main] INFO extra.Regions (StoreDurationInfo.java:<init>(53)) - Starting: AwsSystemPropertyRegionProvider.getRegion()
2021-06-22 12:04:59,287 [main] INFO extra.Regions (StoreDurationInfo.java:close(100)) - AwsSystemPropertyRegionProvider.getRegion(): duration 0:00:000
region is not known
Determining region using AwsProfileRegionProvider
=================================================
Region info in ~/.aws/config
2021-06-22 12:04:59,336 [main] INFO extra.Regions (StoreDurationInfo.java:<init>(53)) - Starting: AwsProfileRegionProvider.getRegion()
2021-06-22 12:04:59,359 [main] INFO extra.Regions (StoreDurationInfo.java:close(100)) - AwsProfileRegionProvider.getRegion(): duration 0:00:023
Region is determined as "eu-west-2"
Determining region using InstanceMetadataRegionProvider
=======================================================
EC2 metadata; will only work in AWS infrastructure
2021-06-22 12:04:59,361 [main] INFO extra.Regions (StoreDurationInfo.java:<init>(53)) - Starting: InstanceMetadataRegionProvider.getRegion()
2021-06-22 12:04:59,363 [main] INFO extra.Regions (StoreDurationInfo.java:close(100)) - InstanceMetadataRegionProvider.getRegion(): duration 0:00:002
WARNING: Provider raised an exception com.amazonaws.AmazonClientException:
AWS_EC2_METADATA_DISABLED is set to true, not loading region from EC2 Instance Metadata service
region is not known
Region found: "eu-west-2"
=========================
Region was determined by AwsProfileRegionProvider as "eu-west-2"
This setup has the environment variable AWS_EC2_METADATA_DISABLED set; if this variable were unset and the command executed outside AWS infrastructure, then after a 15 second delay a stack trace warning of a failure to connect to the instance metadata server would be printed.
2021-06-22 11:54:15,774 [main] WARN util.EC2MetadataUtils (EC2MetadataUtils.java:getItems(410)) -
Unable to retrieve the requested metadata (/latest/dynamic/instance-identity/document).
Failed to connect to service endpoint:
com.amazonaws.SdkClientException: Failed to connect to service endpoint:
This is to be expected, given that the service isn't there.
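To confirm that an explicit region takes precedence over the profile, the environment variable mentioned above can be set before rerunning the command; a sketch, with an illustrative region value:
AWS_REGION=us-west-1 hadoop jar cloudstore-1.0.jar regions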
Restores a versioned S3 Object to a path within the same bucket.
See versioned objects.
Probes an abfs store for being vulnerable to prefetch data corruption, providing the configuration information to disable it if so.
See safeprefetch
Verifies that the Hadoop release has had its untar command hardened and will not evaluate commands passed in as filenames.
bin/hadoop jar $CLOUDSTORE tarhardened "file.tar; true"
Bad
Attempting to untar file with name "file.tar; true"
untar operation reported success
2023-01-27 16:42:35,931 [main] INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 0
Although the file doesn't exist, the bash "true" command was executed after the untar, so the operation was reported as a success.
Good
2023-01-27 16:48:44,461 [main] INFO util.ExitUtil (ExitUtil.java:terminate(210)) - Exiting with status -1: ExitCodeException exitCode=1: tar: Error opening archive: Failed to open 'file.tar; true'
An attempt was made to open the file "file.tar; true"; as it is not present, the operation failed. Expect a stack trace in the report.
Prints out TLS information and X.509 certificates.
The -verbose
option prints the full details about each CA, rather than just their principals' names.
"undeletes" S3 objects by removing directory tombstones from a bucket path.
See versioned objects.
DiagnosticsAWSCredentialsProvider is an AWS credential provider which logs the fs.s3a login secrets (obfuscated and md5).
Roadmap: Whatever we need to debug things.
The JAR file can be grabbed via curl and executed to help automate testing of cluster deployments. To help with doing this against the latest Hadoop releases, it may be enhanced regularly, with new releases. There is no real release plan other than this.
Possible future work
- Exploration of higher performance IO.
- Diagnostics/testing of integration with foundational Hadoop operations.
- Improving CLI testing with various probes designed to be invoked in a shell and fail with meaningful exit codes. E.g: verifying that a filesystem has a specific (key, val) property, that a specific env var made it through.
- Something to scan Hadoop installations for duplicate artifacts, with knowledge of JARs which may contain the same classes (aws-shaded, aws-s3, etc.), and knowledge of required consistencies (hadoop-, jackson-).
- And extend to SPARK_HOME, Hive, etc.
Contributions through PRs welcome.
Bug reports: please include environment and ideally patches.
There is no formal support for this. Sorry.
To build against the latest Hadoop 3.3:
mvn clean install -Phadoop-3.3 -Pextra
The extra profile pulls in extra source which makes some S3A FS API calls not present in earlier Hadoop versions (note: they are in CDP 7.x/CDP Cloud).
To build against Hadoop 3.2:
mvn clean install -Phadoop-3.2
Building against older versions. This is generally done only in an emergency and hasn't been done for a while; it's probably not going to compile. Sorry.