This file contains commonly used commands for basic tasks in the open-source Hadoop big data framework and its major components.
HDFS (Hadoop Distributed File System) is the primary storage layer of the Hadoop ecosystem.
N.B.: HDFS is navigated with the familiar Linux file commands (ls, mkdir, cat, rm, ...), each prefixed with a '-' and passed to hdfs dfs, e.g. hdfs dfs -ls /. Also, run the commands without the braces {}, which only mark placeholders.
Upload a local file to an HDFS directory
hdfs dfs -put {local-source-file-path} {hdfs-destination-path}
Download a file from HDFS to a local directory
hdfs dfs -get {hdfs-source-file-path} {local-destination-path}
Append the contents of a local file to a file on HDFS
hdfs dfs -appendToFile {local-source-file-path} {hdfs-destination-file-path}
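When these transfers need to run from a script, the same commands can be assembled as argument lists. The following is a minimal Python sketch, not part of the original command set: the helper names (hdfs_put, hdfs_get, hdfs_append, run) and the sample paths are hypothetical, and actually executing a command assumes the hdfs binary is on the PATH.

```python
import shlex
import subprocess


def hdfs_put(local_path, hdfs_path):
    """Argument list for uploading a local file to HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_path]


def hdfs_get(hdfs_path, local_path):
    """Argument list for downloading an HDFS file to the local disk."""
    return ["hdfs", "dfs", "-get", hdfs_path, local_path]


def hdfs_append(local_path, hdfs_file):
    """Argument list for appending a local file to an existing HDFS file."""
    return ["hdfs", "dfs", "-appendToFile", local_path, hdfs_file]


def run(argv, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(argv))
    if not dry_run:
        subprocess.run(argv, check=True)


# Hypothetical paths, printed only (dry run).
run(hdfs_put("sales.csv", "/user/demo/input/"))
run(hdfs_get("/user/demo/input/sales.csv", "./downloads/"))
run(hdfs_append("more-sales.csv", "/user/demo/input/sales.csv"))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces; calling run(..., dry_run=False) would hand the list to subprocess.run unchanged.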
Merge the contents of multiple files in an HDFS directory, download the result to a single local file, then view its contents to confirm.
hdfs dfs -getmerge {hdfs-source-directory-or-space-separated-source-paths} {path-to-local-file}
cat {path-to-local-file}
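To make the effect of the merge concrete, here is a small local emulation in Python. It is only a sketch of the idea (concatenate every file of a directory into one output file); the function name merge_directory and the part-file names are hypothetical, and the name-sorted order is an assumption of this sketch rather than a documented guarantee of getmerge.

```python
import tempfile
from pathlib import Path


def merge_directory(src_dir, dest_file):
    """Concatenate every regular file in src_dir into dest_file, in name order."""
    parts = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    with open(dest_file, "wb") as out:
        for part in parts:
            out.write(part.read_bytes())
    return [p.name for p in parts]


# Demo on throwaway local files standing in for the HDFS directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "parts"
    src.mkdir()
    (src / "part-00000").write_text("a,1\n")
    (src / "part-00001").write_text("b,2\n")
    merged = Path(tmp) / "merged.txt"
    merge_directory(src, merged)
    print(merged.read_text(), end="")
```

The merged file simply contains the bytes of each part back to back, which is why viewing it with cat is a quick way to confirm the result.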
Merge multiple files on HDFS into a single file on HDFS
hadoop fs -cat {source-file-paths-separated-by-spaces, or path-to-source-folder/*} | hadoop fs -put - {path-to-destination-file}
HBase is a NoSQL, column-oriented database for the Hadoop big data ecosystem.
- Create a common table
create 'table-name','column-family-name'
A column family can be likened to a sub-category of a table: it groups columns of related information that will likely be queried together.
- Create a Namespace
create_namespace 'namespace-name'
A namespace is analogous to a database in a relational database system. It groups tables logically and avoids conflicts between tables that share a common name.
- Create a table in a namespace
create 'namespace-name:table-name', 'column-family-name'
For all commands involving a table name, if the table was created in a namespace, the namespace must be included right before the table name, i.e. 'namespace-name:table-name'.
Add data to a table
put 'table-name', 'RowKey', 'column-family-name:column-name', 'value'
Query all entries for a row key
get 'table-name', 'RowKey'
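Query the entries in a column (get always needs a row key; use scan to read the column across all rows)
get 'table-name', 'RowKey', 'column-family-name:column-name'
scan 'table-name', {COLUMNS => 'column-family-name:column-name'}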
Query a table for a particular value
scan 'table-name', {FILTER => "ValueFilter(=, 'binary:value')"}
Query a particular column for a particular value
scan 'table-name', {FILTER => "ColumnPrefixFilter('column-name') AND ValueFilter(=, 'binary:value')"}
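The filter strings have a regular shape: a filter name, then a comparison operator and a comparator written as 'type:value', with several filters joined by AND. The Python sketch below assembles such scan statements as text to be pasted into the HBase shell. The function names and the sample table, column, and value are hypothetical; the sketch does no escaping, so it assumes values without single or double quotes.

```python
COMPARE_OPS = {"<", "<=", "=", "!=", ">", ">="}
COMPARATORS = {"binary", "binaryprefix", "regexstring", "substring"}


def value_filter(op, comparator, value):
    """Build a ValueFilter expression such as ValueFilter(=, 'binary:value')."""
    if op not in COMPARE_OPS:
        raise ValueError(f"unsupported operator: {op}")
    if comparator not in COMPARATORS:
        raise ValueError(f"unsupported comparator: {comparator}")
    return f"ValueFilter({op}, '{comparator}:{value}')"


def column_prefix_filter(prefix):
    """Build a ColumnPrefixFilter expression for a column qualifier prefix."""
    return f"ColumnPrefixFilter('{prefix}')"


def scan_command(table, *filters):
    """Join the filter expressions with AND inside one scan statement."""
    expr = " AND ".join(filters)
    return f"scan '{table}', {{FILTER => \"{expr}\"}}"


print(scan_command("users", value_filter("=", "binary", "Lagos")))
print(scan_command("users",
                   column_prefix_filter("city"),
                   value_filter("=", "substring", "Lag")))
```

The binary comparator matches the whole cell value, while substring matches any cell that contains the given text.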
Delete a value from a column
delete 'table-name', 'RowKey', 'column-family-name:column-name'
Delete all data in a row, across all of its columns
deleteall 'table-name', 'RowKey'
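The put, get, delete, and deleteall commands are easier to keep apart with a picture of the data model in mind: a table maps each row key to a set of 'column-family:column' cells. The toy Python class below mimics only that shape. ToyTable and the sample names are hypothetical, and the sketch deliberately ignores timestamps, cell versions, and everything about how HBase stores data.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for one table: row key -> {'family:column': value}."""

    def __init__(self, *families):
        self.families = set(families)
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell; the column family must exist on the table."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][column] = value

    def get(self, row_key):
        """Return a copy of every cell stored under the row key."""
        return dict(self.rows.get(row_key, {}))

    def delete(self, row_key, column):
        """Remove a single cell from a row."""
        self.rows.get(row_key, {}).pop(column, None)

    def deleteall(self, row_key):
        """Remove the whole row, whatever columns it holds."""
        self.rows.pop(row_key, None)


table = ToyTable("info")
table.put("row1", "info:name", "Ada")
table.put("row1", "info:city", "Lagos")
table.delete("row1", "info:city")
print(table.get("row1"))  # {'info:name': 'Ada'}
table.deleteall("row1")
print(table.get("row1"))  # {}
```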
Delete a whole table
disable 'table-name'
drop 'table-name'
Create a common table pre-split into regions, specifying the start row key of each region
create 'table-name', 'column-family-name', SPLITS => ['first-start-key', 'second-start-key', ...]
Specifying n split keys creates n+1 regions: the first region has an empty start key and ends just before the first split key, and the last region starts at the last split key and has an empty end key.
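The n keys to n+1 regions rule can be checked with a few lines of Python. This is an illustrative sketch only: region_ranges and region_index are hypothetical helpers, '' stands for an open-ended boundary, and plain string comparison is assumed to match HBase's byte-wise key ordering, which holds for ASCII keys.

```python
from bisect import bisect_right


def region_ranges(split_keys):
    """Return one (start, end) pair per region; '' marks an open boundary."""
    keys = sorted(split_keys)
    return list(zip([""] + keys, keys + [""]))


def region_index(split_keys, row_key):
    """Index of the region whose [start, end) range contains row_key."""
    return bisect_right(sorted(split_keys), row_key)


splits = ["g", "n", "t"]
print(region_ranges(splits))
# [('', 'g'), ('g', 'n'), ('n', 't'), ('t', '')]
print(region_index(splits, "apple"))  # 0
print(region_index(splits, "g"))      # 1
print(region_index(splits, "zebra"))  # 3
```

A row key equal to a split key lands in the region that starts at that key, because each region includes its start key and excludes its end key.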
Hive is a data warehouse used to query and analyze data stored in the different databases and file systems that integrate with Hadoop, using an SQL-like interface. In the statements below, <...> marks a placeholder; do not type the angle brackets.
Get the current date and time (the format pattern must be a quoted string)
select from_unixtime(unix_timestamp(), 'dd-MM-yyyy HH:mm');
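The pattern letters are case sensitive: MM is the month while mm is the minutes, and HH is the hour on a 24-hour clock. As a rough cross-check outside Hive, the Python sketch below translates this particular pattern into strftime codes and formats an epoch value. The helper names are hypothetical, only the five pattern letters used above are handled, and UTC is assumed here, whereas Hive formats in its own configured time zone.

```python
from datetime import datetime, timezone

# Pattern letters from the Hive query and their strftime equivalents.
JAVA_TO_STRFTIME = (
    ("yyyy", "%Y"),  # four-digit year
    ("MM", "%m"),    # month, 01-12
    ("dd", "%d"),    # day of month
    ("HH", "%H"),    # hour, 00-23
    ("mm", "%M"),    # minutes
)


def to_strftime(java_pattern):
    """Translate the handful of supported pattern letters to strftime codes."""
    result = java_pattern
    for java_token, py_token in JAVA_TO_STRFTIME:
        result = result.replace(java_token, py_token)
    return result


def format_epoch(seconds, java_pattern, tz=timezone.utc):
    """Format epoch seconds the way the pattern describes, in the given zone."""
    moment = datetime.fromtimestamp(seconds, tz=tz)
    return moment.strftime(to_strftime(java_pattern))


print(to_strftime("dd-MM-yyyy HH:mm"))      # %d-%m-%Y %H:%M
print(format_epoch(0, "dd-MM-yyyy HH:mm"))  # 01-01-1970 00:00
```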
Create an internal (managed) table
create table <table-name> (<column-name> <data-type>, <column-name> <data-type>, ...) row format delimited fields terminated by ',' stored as textfile;
Internal tables are owned by Hive: dropping one deletes both its metadata and its data. For an external table Hive tracks only the metadata, so dropping it leaves the underlying files in place, where other tools can still use them.
Create an external table
create external table <table-name> (<column-name> <data-type>, <column-name> <data-type>, ...) row format delimited fields terminated by ',' stored as textfile;
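The clause row format delimited fields terminated by ',' tells Hive to cut each text line at every comma and read the pieces as the declared columns, in order. The Python sketch below is a simplified model of that behaviour, not Hive's actual reader: parse_row and the sample schema are hypothetical, and the handling of bad or missing values (turned into None, standing in for NULL) and of quotes (not honoured) are assumptions of this model.

```python
def parse_row(line, schema, delimiter=","):
    """Split one text line and cast each field; bad or missing values become None."""
    fields = line.rstrip("\n").split(delimiter)
    row = {}
    for position, (name, cast) in enumerate(schema):
        raw = fields[position] if position < len(fields) else None
        try:
            row[name] = cast(raw) if raw is not None else None
        except ValueError:
            row[name] = None  # value does not fit the declared type
    return row


# Hypothetical table: (id int, name string, price double)
schema = [("id", int), ("name", str), ("price", float)]

print(parse_row("1,widget,9.5", schema))
# {'id': 1, 'name': 'widget', 'price': 9.5}
print(parse_row("x,gadget", schema))
# {'id': None, 'name': 'gadget', 'price': None}
print(parse_row('2,"a,b",3.0', schema))
# {'id': 2, 'name': '"a', 'price': None}
```

The last call shows why a plain delimiter is a poor fit for data whose values contain the delimiter itself: the quoted comma still splits the field and shifts every later column.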
Load data from a local file into Hive
load data local inpath '<path-to-local-file>' into table <table-name>;
Load data from an HDFS file
load data inpath '<path-to-hdfs-file>' into table <table-name>;
Point the table at existing data when creating it (location must be a directory that holds the data files, not a single file)
create table <table-name> (<column-name> <data-type>, <column-name> <data-type>, ...) row format delimited fields terminated by ',' stored as textfile location '<path-to-source-directory>';
Load only the rows of a table that contain a given column value into another table
insert into <destination-table-name> select * from <source-table-name> where <column-name> = <value>;
Load data from one Hive table to another, either into a new table or into an existing one
create table <new-table-name> as select * from <source-table-name>;
insert into <destination-table-name> select * from <source-table-name>;
Create an empty table with the same definition as an existing table
create table <new-table-name> like <existing-table-name>;
Query a table for all rows in which a column holds a particular value
select * from <table-name> where <column-name> = '<value>';
Query all entries in a table
select * from <table-name>;
Associate a Hive table with an HBase table on table creation
create external table <external-table-name> (key int, gid map<<key-data-type>,<value-data-type>>) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with SERDEPROPERTIES ("hbase.columns.mapping" = ":key,<hbase-table-column-family-name>:") TBLPROPERTIES ("hbase.table.name" = "<hbase-table-name>");
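The value of hbase.columns.mapping is positional: it lists one HBase target per Hive column, in the order the Hive columns are declared, separated by commas with no spaces. The special target :key stands for the HBase row key, and a target ending in a colon maps a whole column family onto a Hive map column. The Python sketch below only builds that string; hbase_columns_mapping and the family name cf are hypothetical.

```python
def hbase_columns_mapping(hive_columns):
    """Build the mapping string from (hive_column, hbase_target) pairs in table order."""
    targets = []
    for hive_column, hbase_target in hive_columns:
        if " " in hbase_target or "," in hbase_target:
            raise ValueError(f"invalid target for {hive_column}: {hbase_target!r}")
        targets.append(hbase_target)
    return ",".join(targets)


# Hive columns of the statement above and where each one comes from in HBase.
columns = [
    ("key", ":key"),  # Hive column 'key' <- HBase row key
    ("gid", "cf:"),   # Hive map column 'gid' <- every column of family 'cf'
]
print(hbase_columns_mapping(columns))  # :key,cf:
```

Keeping the Hive column next to its target in one list makes it harder to get the order wrong when a table has many mapped columns.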