Easily connect to multiple Hadoop clusters

DEK! (Data Engineering toolKit)

DEK! is a tool for managing versions and configurations of multiple clients (HDFS, HBase, MongoDB…) on Windows and most Unix-based systems, so that you can easily connect to multiple clusters. It provides a convenient Command Line Interface (CLI) and an API for installing, switching, removing and listing clusters and their associated clients.

Formerly known as HHSL, the Hadoop Hyper Shell, DEK! was inspired by the very useful RVM and rbenv tools, which are widely used by the Ruby community. It has been developed by Hadoop sysadmins and developers who deal with multiple datasources or clusters, or even a single one, and who no longer want to do the tedious work of configuring every client for each environment.

By Data Engineers, for Data Engineers

Making life easier. No more trawling download pages, extracting archives, or messing with $HOME/$PATH/$HADOOP_HOME/… environment variables. DEK! allows you to easily connect to multiple Hadoop clusters by downloading the right clients, setting all the environment variables and preparing the -defaults.xml/-site.xml files.

Extensible

Install data clients such as HDFS, HBase, Spark, ZooKeeper, Oozie, Sqoop and Hive; many others are also supported.

Lightweight

Written in Java and only requires a JRE to be present on your system.

Multi-platform

Runs on Windows and any UNIX-based platform (macOS, Linux, Cygwin, Solaris and FreeBSD).

Requirements

DEK works on Linux, macOS and Windows and requires:

  • Java (1.8+)

Installation

  • Download the tarball and extract it.

  • Optional: update your PATH environment variable so that the dek command can be run from any directory.
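
As a minimal sketch of the optional PATH step, the helper below appends a directory to PATH only if it is not already listed. The function name add_to_path and the example directory in the comment are assumptions made for illustration; use whichever directory of the extracted tarball actually contains the dek launcher.

```shell
# Append a directory to PATH unless it is already listed.
# add_to_path is a hypothetical helper name, not something shipped with DEK.
add_to_path() {
  dir="$1"
  case ":$PATH:" in
    *":$dir:"*) ;;            # already present, nothing to do
    *) PATH="$PATH:$dir" ;;   # otherwise append it at the end
  esac
  export PATH
}

# Example call, assuming the tarball was extracted to "$HOME/tools/dek":
# add_to_path "$HOME/tools/dek"
```

To make the change permanent, the same call can be placed in your shell startup file.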

About security

DEK uses a secure store to manage secrets and credentials for registries and resolvers. Most DEK resolvers require credentials to interact with the third-party service that manages their configuration: Ambari for Hortonworks, Cloudera Manager for Cloudera… DEK has a simple mechanism to protect secrets: by default, it uses a secure store implementation based on an encrypted H2 database (see plugins/local-store). Therefore, if you want to use DEK, you need to specify a key to unlock the store. The first time you use DEK, you will be prompted to enter the root password.

With DEK, you can unlock the store in 4 ways with the --unlock option:

  • direct password

  • prompt, if you enter @prompt

  • file, if you enter an option starting with "file://…"

  • default secret file, if a file named 'secret' exists in the root path of DEK
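
The four modes listed above can be summarized with a small classification sketch. This is not DEK's actual implementation: the function name unlock_mode is hypothetical, ~/.dek is assumed to be the root path, and the sketch assumes that the default secret file is only consulted when no --unlock value is given.

```shell
# Classify an --unlock value according to the four modes described above.
# $1 = value passed to --unlock (may be empty)
# $2 = DEK root path (assumed to default to ~/.dek)
unlock_mode() {
  value="$1"
  root="${2:-$HOME/.dek}"
  case "$value" in
    @prompt)  echo "prompt" ;;                      # ask the user interactively
    file://*) echo "file:${value#file://}" ;;       # read the password from this file
    "")       if [ -f "$root/secret" ]; then        # nothing given: try the default file
                echo "default-secret-file"
              else
                echo "none"
              fi ;;
    *)        echo "direct-password" ;;             # anything else is the password itself
  esac
}
```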

The preferred way to unlock the secured store is @prompt, which asks the user to enter the root password:

dek ... --unlock @prompt
Root password:>

For background processes, or to supply the root password automatically, you can use the file (or default secret file) or direct password options:

dek ... --unlock <some password>
dek ... --unlock file:///...

Be aware that the command, including the password, will be stored in your shell history if you use the direct password option. The file and default secret file options let you keep the root password in a file instead. Make sure to restrict access to that file (chown + chmod 700).
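
The file restriction mentioned above could look like the following sketch. The helper name protect_secret is hypothetical; the mode 700 follows the text, although 600 would be enough since the file does not need to be executable.

```shell
# Restrict a secret file so that only its owner can access it.
# protect_secret is a hypothetical helper name used for this sketch.
protect_secret() {
  secret_file="$1"
  # Make the current user the owner, then remove all group/other permissions.
  chown "$(id -un)" "$secret_file" && chmod 700 "$secret_file"
}

# Example call, assuming the default secret file location:
# protect_secret "$HOME/.dek/secret"
```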

How to

> dek init
password: ****
retype password: ****
Do you want to store the password in ~/.dek/secret?[Y/n]

Initializing the database defines a local registry. A registry holds references to clusters, and DEK can refer to multiple registries. By default, every command that needs a registry uses this local registry.

Display the info:

> dek info
-----------------------------------------------------------------------------
 DEK
 Version: 0.3-SNAPSHOT
 Commit: 181c0a9e0415efe0fd01b3cc088f9e97ed78845a ()
 Build: 2018-11-15T07:29:44+0000
 Terminal
 - size: Size[cols=77, rows=30]
 - class: PosixSysTerminal
 - type: xterm-256color
 - attributes: Attributes[lflags: echoke echoe echok echo echoctl isig icanon, iflags: brkint icrnl ixon ixany imaxbel iutf8, oflags: opost onlcr, cflags: cs6 cs7 cs8 cread hupcl, cchars: eof=^D eol=<undef> eol2=<undef> erase=^? werase=^W kill=^U reprint=^R intr=^C quit=^\ susp=^Z dsusp=<undef> start=^Q stop=^S lnext=^V discard=^O min=1 time=0 status=<undef>]
-----------------------------------------------------------------------------

You can then add a cluster. For instance, we add a cluster managed by Ambari, available at http://ambari:8080 and named PROD-HDP3.0:

dek add cluster local ambari PROD-HDP3.0 http://ambari:8080
> user:
> password:

#or

dek add cluster local hadoop-unit LOCAL-HU file:///home/devil/tools/hadoop-unit-standalone-2.9

#or

dek add cluster local cloudera PROD-CDH http://cloudera-manager:8080
> user:
> password:

Verify that the cluster has been added:

dek list cluster

┌────────┬───────────┬──────┬──────────────────┐
│Registry│Name       │Type  │URI               │
├────────┼───────────┼──────┼──────────────────┤
│local   │PROD-HDP3.0│ambari│http://ambari:8080│
└────────┴───────────┴──────┴──────────────────┘

Now use it:

dek use PROD-HDP3.0
# or dek use cluster PROD-HDP3.0

The 'use' command downloads the missing clients, prepares the environment variables and renders the configuration files. You can check them with the 'env' command, on Linux for instance:

env
....
HADOOP_HOME=/xxxxx
SPARK_HOME=/xxxx
HBASE_HOME=/xxxxx
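
To narrow the output down to the variables shown above, a small filter can help. This is only a sketch: the function name show_client_homes is hypothetical, and the list of variables is limited to the three that appear in the example output.

```shell
# Print only the client home variables from the current environment.
# show_client_homes is a hypothetical helper; extend the pattern for other clients.
show_client_homes() {
  env | grep -E '^(HADOOP|SPARK|HBASE)_HOME=' | sort
}
```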

Also have a look at the configuration files:

cat

DEK can also be used just to install binaries, i.e. without cluster configuration or environment management:

dek install hadoop 2.7.7

Man

init

install

list env

set env

get env

Architecture

DEK stores all its data in ~/.dek by default (this will be configurable in a future release, see Roadmap).

~/.dek

archives: contains all the binaries, compressed and uncompressed.

confs: contains all the configurations generated when the command 'use cluster …' is called. confs is a multi-level directory structured like this: registry connection id > cluster id > service name

db: the content of the local db

Since both /archives and /confs contain only generated content, these directories can be wiped safely; their content will be regenerated on the next call to 'use cluster …'
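
The layout and the cleanup described above can be sketched as two small helpers. Both function names are hypothetical, and the exact identifiers used for each directory level (for instance "hdfs" as a service name) are assumptions; only the three-level nesting comes from the description.

```shell
# Build the directory that would hold the rendered configuration of one service.
# Assumed layout: <root>/confs/<registry connection id>/<cluster id>/<service name>
conf_dir() {
  root="$1"; registry="$2"; cluster="$3"; service="$4"
  printf '%s/confs/%s/%s/%s\n' "$root" "$registry" "$cluster" "$service"
}

# Remove only the generated content; the db directory is left untouched.
wipe_generated() {
  root="${1:?DEK data directory required}"   # refuse to run without an explicit root
  rm -rf "$root/archives" "$root/confs"
}

# Example calls, assuming the default data directory:
# conf_dir "$HOME/.dek" local PROD-HDP3.0 hdfs
# wipe_generated "$HOME/.dek"
```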

Configuration

Binaries

Not currently implemented

Table 1. Available properties for binaries configuration

Property        Default value  Mandatory  Description
binaries.check  true           no

URLs

Not currently implemented

Table 2. Available properties for URL configuration

Property                   Default value                                Mandatory  Description
url.mirror.apache.enabled  true                                         yes
url.mirror.apache          http://www.apache.org/dyn/closer.cgi/        yes
url.dist.apache            https://dist.apache.org/repos/dist/release/  no         Used when mirror.enabled is false
url.signature.apache       https://dist.apache.org/repos/dist/release/  yes if     Used when mirror.enabled is false

HTTP

Table 3. Available properties for HTTP configuration

Property              Default value  Mandatory  Description
http.socket.timeout   30000          yes        Socket timeout
http.connect.timeout  30000          yes        Connect timeout
http.insecure         false          yes        Allow insecure SSL connections and transfers.
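
Assuming the HTTP properties are changed with the same 'dek set env' command that the FAQ uses for proxy properties, several of them can be applied in one go with a sketch like the one below. The function name apply_settings and the DEK_BIN override (handy for a dry run with DEK_BIN=echo) are hypothetical, and the values in the usage comment are arbitrary examples.

```shell
# Read key=value lines on standard input and apply each one with 'dek set env'.
# DEK_BIN is a hypothetical override of the command to run (defaults to dek).
apply_settings() {
  while IFS='=' read -r key value || [ -n "$key" ]; do
    case "$key" in
      ''|'#'*) continue ;;   # skip blank lines and comments
    esac
    # </dev/null keeps the called command from consuming the remaining lines
    "${DEK_BIN:-dek}" set env "$key" "$value" </dev/null
  done
}

# Example call:
# printf 'http.socket.timeout=60000\nhttp.connect.timeout=10000\n' | apply_settings
```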

Proxy

DEK uses external resources hosted on mirrors, such as the Apache mirrors and many others. You may need to go through a proxy if your company or private network requires it. In DEK, you can change these properties:

Table 4. Available properties for proxy configuration

Property                   Default value  Mandatory                        Description
proxy.enabled              false          no                               Enables the proxy configuration
proxy.host                 -              yes
proxy.port                 -              yes
proxy.non-proxy-hosts      127.0.0.1      no
                           localhost
proxy.auth.type            none           yes                              Possible values: none, ntlm, basic
proxy.auth.ntlm.user       -              yes if proxy.auth.type is ntlm
proxy.auth.ntlm.password   -              yes if proxy.auth.type is ntlm
proxy.auth.ntlm.domain     -              yes if proxy.auth.type is ntlm
proxy.auth.basic.user      -              yes if proxy.auth.type is basic
proxy.auth.basic.password  -              yes if proxy.auth.type is basic

Build

Build a distribution from sources

On the root project, just run:

mvn clean package

At the end, you should find the final archive in cli/target/cli-X.X.X.tar.gz

Release

mvn --batch-mode release:clean release:prepare -DignoreSnapshots=true -Dtag=v0.1 -DreleaseVersion=0.1 -DdevelopmentVersion=0.2-SNAPSHOT
mvn release:perform

Or, skipping the tests:

mvn --batch-mode release:clean release:prepare -DignoreSnapshots=true -Dtag=v0.1 -DreleaseVersion=0.1 -DdevelopmentVersion=0.2-SNAPSHOT -DskipTests -DskipITs -Dmaven.javadoc.skip=true -Darguments="-Dmaven.javadoc.skip=true -DskipTests -DskipITs"

Force update the version:

mvn --batch-mode release:update-versions -DautoVersionSubmodules=true -DdevelopmentVersion=0.4-SNAPSHOT

FAQ

I need to use a proxy

See Proxy

Example for proxies without authentication

dek set env proxy.enabled true
dek set env proxy.host leproxy.intern
dek set env proxy.port 8888

Example for proxies requiring basic authentication

dek set env proxy.enabled true
dek set env proxy.host leproxy.intern
dek set env proxy.port 8888
dek set env proxy.auth.type basic
dek set env proxy.auth.basic.user myuser
dek set env proxy.auth.basic.password lepassword

Example for proxies requiring NTLM authentication

dek set env proxy.enabled true
dek set env proxy.host leproxy.intern
dek set env proxy.port 8888
dek set env proxy.auth.type ntlm
dek set env proxy.auth.ntlm.user myuser
dek set env proxy.auth.ntlm.password lepassword
dek set env proxy.auth.ntlm.domain INTERN

My service is not managed by DEK

DEK manages some services (HDFS, HBase and many others). If your service is not yet managed by DEK, just create an implementation of fr.layer4.dek.binaries.ClientPreparer. See fr.layer4.dek.binaries.HdfsClientPreparer for an example.

How is security managed in DEK?

Kerberos authentication is not yet managed by DEK. This feature will be added soon.

Roadmap

  • Allow using a custom path for DEK instead of the default ~/.dek via the configuration

  • Allow using private binary repositories instead of the default Apache mirrors

  • Allow skipping file integrity checks (especially for private repositories)

  • Add an option to skip winutils for Hadoop

  • Add a security (Kerberos) switch

  • Manage other clients (Cassandra, MongoDB…)

  • Add MapR

  • Add an offline mode

  • Force a version of a client for a cluster

Contributions

Contributions are welcome! To submit a pull request you should fork the project repository, and make your change on a feature branch of your fork.

License

Copyright © 2012-2018 Vincent Devillers, and the individual contributors to DEK. Use of this software is granted under the terms of the MIT License.

See the LICENSE for the full license text.

Update the third-party license file

Update the content of the file THIRD-PARTY.txt:

mvn org.codehaus.mojo:license-maven-plugin:aggregate-add-third-party@aggregate-add-third-party

Update license header on files

Update the license header of the source files:

mvn org.codehaus.mojo:license-maven-plugin:update-file-header@update-file-header

Authors

  • Vincent Devillers