
beeswax

Monadic wrapper for the Hive API.

Scaladoc

Usage

See https://commbank.github.io/beeswax/index.html

Creating a table

Table schemas are derived from a thrift struct. The underlying storage format can be either text or Parquet. For example:

Hive.createParquetTable[Pair]("database", "table", List(("year", "int"), ("name", "string")))
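
A text-backed table is created in the same way; a minimal sketch, reusing the Pair thrift struct and the empty column list from the example further below:

// Sketch only: same Pair-derived schema, stored as text instead of Parquet.
Hive.createTextTable[Pair]("database", "table", List.empty)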

Querying

Create actions that perform Hive queries; query output is returned as a list of strings:

Hive.query("SELECT COUNT(*) FROM database.table")
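
Since the raw output is a list of strings, derived values can be extracted by mapping over the action. A minimal sketch, assuming the count comes back as a single numeric row and that Hive exposes map, as the for-comprehension below relies on:

// Sketch: parse the single COUNT(*) row into a Long.
val count: Hive[Option[Long]] =
  Hive.query("SELECT COUNT(*) FROM database.table").map(_.headOption.map(_.toLong))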

Multiple operations

Create a table and insert data into it from another table:

for {
  _ <- Hive.createTextTable[Pair]("test", "pairs", List.empty)
  _ <- Hive.query("INSERT INTO TABLE test.pairs SELECT * FROM test2.pairs")
} yield ()
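
The for-comprehension is syntactic sugar over the monadic combinators; assuming Hive exposes flatMap and map directly, the same composition can be written explicitly:

// Sketch: the desugared form of the for-comprehension above.
Hive.createTextTable[Pair]("test", "pairs", List.empty)
  .flatMap(_ => Hive.query("INSERT INTO TABLE test.pairs SELECT * FROM test2.pairs"))
  .map(_ => ())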

Running the Hive monad

import org.apache.hadoop.hive.conf.HiveConf

val hc: HiveConf                 = new HiveConf
val q: Hive[List[String]]        = Hive.query("SELECT COUNT(*) FROM database.table")
val result: Result[List[String]] = q.run(hc)
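
Putting the pieces together, a composite action can be assembled first and then run against a single HiveConf. A minimal sketch, reusing the Pair thrift struct and the test/test2 tables from the earlier examples:

import org.apache.hadoop.hive.conf.HiveConf

// Build the whole program as one Hive action, then run it with a configuration.
val hc = new HiveConf

val program: Hive[List[String]] = for {
  _    <- Hive.createTextTable[Pair]("test", "pairs", List.empty)
  _    <- Hive.query("INSERT INTO TABLE test.pairs SELECT * FROM test2.pairs")
  rows <- Hive.query("SELECT COUNT(*) FROM test.pairs")
} yield rows

val result: Result[List[String]] = program.run(hc)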

Known Issues

  • The Hive metastore needs to be specified as a thrift endpoint instead of as a database:

      <property>
        <name>hive.metastore.uris</name>
        <value>thrift://metastore:9083</value>
      </property>
    
  • In order to run queries, hive-site.xml needs to include the yarn.resourcemanager.address property, even if the value is bogus.

      <property>
        <name>yarn.resourcemanager.address</name>
        <value>bogus</value>
      </property>
    
  • In order to run queries with partitioning, the dynamic partition mode needs to be set to nonstrict.

      <property>
        <name>hive.exec.dynamic.partition.mode</name>
        <value>nonstrict</value>
      </property>