A Hadoop Yarn cluster running in a docker compose deployment.
The deployment uses the docker images created by
The deployment requires Docker Compose 2.x. Older versions may work but are untested. Installation instructions for Docker Compose are available on docs.docker.com.
docker compose up
A client node is running and can be accessed using SSH (username: sandbox, password: sandbox):
ssh -p 2222 sandbox@localhost
The web user interfaces of the different cluster services can be reached at:
Typing the port and username every time quickly becomes tedious. You can configure a host entry in your SSH client's ~/.ssh/config:
Host yarn
Hostname localhost
Port 2222
User sandbox
IdentityFile ~/.ssh/yarn
IdentitiesOnly yes
To enable password-less access to the client node, you can set up SSH keys.
ssh-keygen -f ~/.ssh/yarn
will create a key pair on your local machine. The key should be added to your local ssh-agent:
ssh-add ~/.ssh/yarn
The key then has to be installed on the client node:
ssh-copy-id -i ~/.ssh/yarn.pub yarn
To log in to the client node, you can then use
ssh yarn
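If you provision your environment with a script, the config stanza above can also be rendered programmatically. A minimal sketch (the helper name is made up for illustration):

```python
def ssh_config_entry(alias: str, hostname: str, port: int,
                     user: str, identity: str) -> str:
    """Render an ~/.ssh/config Host stanza like the one shown above."""
    return "\n".join([
        f"Host {alias}",
        f"    Hostname {hostname}",
        f"    Port {port}",
        f"    User {user}",
        f"    IdentityFile {identity}",
        "    IdentitiesOnly yes",
    ])

# Reproduces the stanza from this README; append it to ~/.ssh/config yourself.
print(ssh_config_entry("yarn", "localhost", 2222, "sandbox", "~/.ssh/yarn"))
```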
Start the cluster and log on to the client node (see above). You can run Teragen/Terasort/Teravalidate to check that the cluster can run Hadoop MapReduce jobs and read from and write to HDFS. To generate a 9.3 GiB dataset, use:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
teragen 100000000 /user/sandbox/teragen
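Teragen's only sizing knob is the row count: every row is exactly 100 bytes, which is where the 9.3 GiB figure comes from. A quick sanity check of the arithmetic:

```python
# Each teragen/terasort row is exactly 100 bytes (10-byte key + 90-byte value).
ROW_SIZE_BYTES = 100

def teragen_size_gib(rows: int) -> float:
    """Return the dataset size in GiB for a given teragen row count."""
    return rows * ROW_SIZE_BYTES / 2**30

# 100,000,000 rows -> 10^10 bytes, i.e. roughly 9.31 GiB
print(f"{teragen_size_gib(100_000_000):.2f} GiB")
# → 9.31 GiB
```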
Teragen generates test data to be sorted by Terasort. To sort the generated dataset, use:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
terasort /user/sandbox/teragen /user/sandbox/terasort
Terasort outputs the generated dataset globally sorted; Teravalidate then verifies that the result really is globally sorted. You can run it like this:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
teravalidate /user/sandbox/terasort /user/sandbox/teravalidate
Loading of native code dependencies can be verified on the client node as well. To check, issue the following command:
hadoop checknative
The output should look like this:
2021-10-25 19:31:48,186 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
2021-10-25 19:31:48,187 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2021-10-25 19:31:48,207 INFO nativeio.NativeIO: The native code was built without PMDK support.
Native library checking:
hadoop: true /hadoop/lib/native/libhadoop.so.1.0.0
zlib: true /lib/x86_64-linux-gnu/libz.so.1
zstd : true /lib/x86_64-linux-gnu/libzstd.so.1
bzip2: true /lib/x86_64-linux-gnu/libbz2.so.1
openssl: true /lib/x86_64-linux-gnu/libcrypto.so
ISA-L: true /lib/x86_64-linux-gnu/libisal.so.2
PMDK: false The native code was built without PMDK support.
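The tabular part of this output is easy to consume from a script, for example when asserting in CI that all native libraries loaded. A minimal sketch of such a parser (the function name is made up for illustration):

```python
def parse_checknative(output: str) -> dict:
    """Parse the table printed by `hadoop checknative` into {library: loaded?}."""
    libs = {}
    for line in output.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Table rows look like "zstd : true /lib/.../libzstd.so.1"; log lines
        # and the header never have true/false as the first field after ":".
        if sep and fields and fields[0] in ("true", "false"):
            libs[name.strip()] = fields[0] == "true"
    return libs

sample = """\
Native library checking:
hadoop: true /hadoop/lib/native/libhadoop.so.1.0.0
zstd : true /lib/x86_64-linux-gnu/libzstd.so.1
PMDK: false The native code was built without PMDK support."""
print(parse_checknative(sample))
# → {'hadoop': True, 'zstd': True, 'PMDK': False}
```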
The example uses Python and pywebhdfs. To set it up, create a venv and install pywebhdfs:
python -m venv .venv && \
. .venv/bin/activate && \
python -m pip install pywebhdfs
Then you should be able to list the directory contents of the sandbox user home:
from pywebhdfs.webhdfs import PyWebHdfsClient
client = PyWebHdfsClient(host="localhost", port=9870, user_name="sandbox")
listing = client.list_dir("/user/sandbox")
print(listing)
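Under the hood, pywebhdfs issues plain HTTP requests against the WebHDFS REST API exposed on port 9870; the listing above corresponds to a GET of an `op=LISTSTATUS` URL. A sketch of how such a URL is assembled (the helper is illustrative, not part of pywebhdfs):

```python
from urllib.parse import urlencode

def webhdfs_url(host: str, port: int, path: str, op: str, user: str) -> str:
    """Build a WebHDFS REST URL, e.g. for op=LISTSTATUS or op=OPEN."""
    query = urlencode({"op": op, "user.name": user})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# The request pywebhdfs makes for the list_dir() call above:
print(webhdfs_url("localhost", 9870, "/user/sandbox", "LISTSTATUS", "sandbox"))
# → http://localhost:9870/webhdfs/v1/user/sandbox?op=LISTSTATUS&user.name=sandbox
```

You can fetch that URL with curl on the host to inspect the raw JSON response the client parses.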
Yarn applications can be profiled using the async-profiler. First fetch a release tarball of async-profiler and unpack on the client node (assuming your host is x86-64):
curl -fsSLo async-profiler.tgz 'https://github.com/jvm-profiling-tools/async-profiler/releases/download/v2.9/async-profiler-2.9-linux-x64.tar.gz'
sha256sum -c <<<"b9a094bc480f233f72141b7793c098800054438e0e6cfe5b7f2fe13ef4ad11f0 *async-profiler.tgz"
tar -xzf async-profiler.tgz --strip-components=1 --one-top-level=async-profiler
This command line then profiles terasort, rendering a flame graph (plus a separate profiler log) for each Yarn container into that container's log directory:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
terasort \
-files async-profiler/build/libasyncProfiler.so \
-D '"mapred.child.java.opts=-agentpath:libasyncProfiler.so=start,event=cpu,simple,title=@taskid@,file=$(dirname $STDOUT_LOGFILE_ENV)/@taskid@.html,log=$(dirname $STDOUT_LOGFILE_ENV)/@taskid@-profiler.log"' \
/user/sandbox/teragen /user/sandbox/terasort
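The long -D value above is just a comma-joined list of async-profiler agent options. If you want to vary the profiled event (cpu, alloc, lock, ...), it can help to assemble the string programmatically rather than edit it by hand. A hedged sketch (the helper is illustrative, not part of Hadoop or async-profiler):

```python
def profiler_java_opts(event: str = "cpu") -> str:
    """Assemble the mapred.child.java.opts value used in the command above.

    @taskid@ is expanded by Hadoop and $STDOUT_LOGFILE_ENV by the container
    shell, so both placeholders are deliberately left verbatim.
    """
    out_dir = "$(dirname $STDOUT_LOGFILE_ENV)"
    opts = [
        "start",
        f"event={event}",
        "simple",
        "title=@taskid@",
        f"file={out_dir}/@taskid@.html",
        f"log={out_dir}/@taskid@-profiler.log",
    ]
    return "-agentpath:libasyncProfiler.so=" + ",".join(opts)

print(profiler_java_opts("cpu"))
```

With event="cpu" this reproduces the option string from the command above; swap in "alloc" or "lock" to profile allocations or lock contention instead.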
PID 1 now runs an injected docker-init binary via init: true in the docker-compose.yaml. Previously, tini was included in the docker images. If you pulled the old images you will see
[WARN tini (7)] Tini is not running as PID 1 and isn't registered as a child subreaper.
in the logs. This is harmless and can easily be corrected by pulling the latest images:
docker compose pull
docker compose down
docker compose up
Writable volume mounts have been changed from bind mounts to the default driver for docker volumes. That means:
- previously created data stored in the data/ subfolder will not be accessible by the hadoop-sandbox docker compose deployment; that includes data stored on HDFS and data stored in the home folder of the sandbox user
- additionally, SSH host keys will be regenerated
For users that have been using the deployment before 2022-08-18, there are two alternatives for migration:
- migrate the old data, or
- remove the old data
Both alternatives are described below. You do not need to do anything unless you used the deployment before 2022-08-18.
If you used an older version of this deployment and would like to retain the data, you can follow these steps to transfer it to the new deployment:
- Start the old deployment
- Copy HDFS data to the local filesystem:
ssh -p 2222 sandbox@localhost hdfs dfs -copyToLocal /user/sandbox hdfs-data
- Copy the sandbox user's home dir to the host:
ssh -p 2222 sandbox@localhost tar -cf - . > sandbox.tar
- Stop the old deployment
- Update the deployment via git pull
- Start the new deployment
- Clear SSH's known_hosts file via ssh-keygen -R '[localhost]:2222'
- Update your known_hosts file via ssh -p 2222 sandbox@localhost true
- Upload the old data via SSH:
ssh -p 2222 sandbox@localhost tar -xf - < sandbox.tar
- Upload the old HDFS data to HDFS via
ssh -p 2222 sandbox@localhost hdfs dfs -copyFromLocal hdfs-data /user/sandbox
- Remove the local copy on the client node via ssh -p 2222 sandbox@localhost rm -rf hdfs-data
- Remove the backup on the host via rm sandbox.tar
- Remove the old bind-mounted data stored under the data/ subfolder on the host (might require sudo)
If you used an older version of this deployment and would like to start from scratch, you can follow these steps to remove the old data:
- Update the deployment via git pull
- Start the deployment
- Clear SSH's known_hosts file via ssh-keygen -R '[localhost]:2222'
- Update your known_hosts file via ssh -p 2222 sandbox@localhost true
- Install your SSH key via ssh-copy-id -i ~/.ssh/yarn.pub -p 2222 sandbox@localhost
- Remove the old bind-mounted data stored under the data/ subfolder on the host (might require sudo)