/hackboxen-public

Infochimps' Public Hackboxen -- Archiving the World's Data Together

Primary LanguageJava

Hackboxen

This a repository of process packages (“hackboxen”) that are used to acquire dataset from raw sources and convert them into a form a easily digested and published by troop. Schema and other metadata are also generated.

A hackbox consists of:

  • A hackbox engine, which processes the configuration data and executes the actual data processing
  • A data directory, typically created by the get_data task of the engine. This directory may be local or remote (e.g. S3/HDFS)

The Hackbox Engine

A hackbox engine contains:

  • config: (required) A subdirectory containing one or more default configuration yaml (.yaml) files
  • engine: (required) A subdirectory containing:
    • (required) An executable main file
    • (optional) other executable and support files. There is no restriction on language and complexity.
  • Rakefile: (required) Used to read and combine all the sources of config metadata and execute main

Example hackbox directory structure (this one creates a tsv of word stats):


hb/language
└── corpora
    └── word_freq
        └── bnc
            ├── config
            │   └── config.yaml
            ├── engine
            │   ├── main
            │   └── word_freq_bnc_endpoint.rb
            └── Rakefile

Data Directory

The data directory is where all of the data that a hackbox acquires, reads, or creates lives. The location of the data directory is determind by the dataroot variable set in your hackbox.yaml configuration file as well as the namespace and protocol variables set in the hackbox’s own config.yaml file. This can be visualized in the following way:


dataroot
    └── namespace
          └── protocol
              ├── config
              ├── fixd
              ├── log
              ├── rawd
              ├── ripd
              └── tmp
  • dataroot (required) All work (inputs, outputs, log, etc) for all hackboxen will take place here.
  • namespace: (required) A “.” separated hierarchy eg. category.subcategory.subsubcategory...
  • protocol: (required) Name for this collection of datasets (often the same as the hackbox name)
  • log: (optional) All logging should go here.
  • tmp: (optional) Any truly ephemeral output of the workflow should go here.
  • fixd: (required) See the output interface described below.
  • rawd: (optional) This will contain all intermediate data processing outputs.
  • ripd (required) This will contain virginal downloaded source data adhering to the directory structure from which it was pulled.

Input directories are generally created dynamically and are not meant to be archival. Configuration files in the input directory as usually created as part of the data ingest process.

Output Interface (fixd)

fixd is the final output directory and contain to the following:

  • env: (required) This is a directory containing files describing the environment in which the hackbox is run. Any config"ish" files may go here. Often this will include:
    • working_environment.json: (required) The “final” runtime config metdata used to generate the schema and output data for all of the output datasets and configuration specific to the collection of output data sets.
  • protocol.icss.json: (required) The icss file for the entire collection, see ICSS
  • code: (optional) A directory containing the code assets described in the icss.
  • data: (optional) A directory containing subdirectories named for each created dataset ID. Each subdirectory contains:
  • One or more data files that collectively adhere to the schema this dataset. If there is only a single file, as convention, it should start with #{dataset ID}. (required)

FIXME————————

Example Hackbox

An example hackbox directory for Yahoo stock data might look like:


├── money
│   └── stock
│       └── historical_stock_yahoo
│           ├── config
│           │   └── config.yaml
│           ├── engine
│           │   └── main
│           └── Rakefile

The working_environment.json could look like:


{
    "namespace": "money.stock", 
    "protocol": "historical_stock_yahoo", 
    "datasets": {
        "yahoo_nasdaq_dividends": {
            "type": "yahoo_dividends_tsv"
        }, 
        "yahoo_amex_daily_prices": {
            "type": "yahoo_daily_price_tsv"
        }, 
        "yahoo_nyse_dividends": {
            "type": "yahoo_dividends_tsv"
        }, 
        "yahoo_amex_dividends": {
            "type": "yahoo_dividends_tsv"
        }, 
        "yahoo_nyse_daily_prices": {
            "type": "yahoo_daily_price_tsv"
        }, 
        "yahoo_nasdaq_daily_prices": {
            "type": "yahoo_daily_price_tsv"
        }
    }, 
    "tags": [
        "money", 
        "stocks", 
        "yahoo", 
        "timeseries", 
        "finance", 
        "economics"
    ], 
    "sources": [
        "http://www.nasdaq.com/screening/companies-by-industry.aspx?render=download", 
        "http://ichart.finance.yahoo.com/table.csv"
    ], 
    "types": [
        "yahoo_daily_price_tsv", 
        "yahoo_dividends_tsv"
    ]
}

The output of running this hackbox looks like:


├── fixd
│   ├── env
│   │   └── working_environment.json
│   ├── data
│   │   ├── yahoo_amex_daily_prices
│   │   │   ├── yahoo_amex_daily_prices.tsv
│   │   │   └── yahoo_amex_daily_prices.icss.json
│   │   ├── yahoo_amex_dividends
│   │   │   ├── yahoo_amex_dividends.tsv
│   │   │   └── yahoo_amex_dividends.icss.json
│   │   ├── yahoo_nasdaq_daily_prices
│   │   │   ├── yahoo_nasdaq_daily_prices.tsv
│   │   │   └── yahoo_nasdaq_daily_prices.icss.json
│   │   ├── yahoo_nasdaq_dividends
│   │   │   ├── yahoo_nasdaq_dividends.tsv
│   │   │   └── yahoo_nasdaq_dividends.icss.json
│   │   ├── yahoo_nyse_daily_prices
│   │   │   ├── yahoo_nyse_daily_prices.tsv
│   │   │   └── yahoo_nyse_daily_prices.icss.json
│   │   └── yahoo_nyse_dividends
│   │       ├── yahoo_nyse_dividends.tsv
│   │       └── yahoo_nyse_dividends.icss.json


END_FIXME————————

Configuration for a hackbox

A hackbox configuration file is a hash in yaml format and reads its
config data using configliere. Configuration will be read in the following order:

  • The command line
  • /etc/hackbox/hackbox.yaml: Machine-wide config
  • ~/.hackbox/hackbox.yaml: User-wide config
  • #{dataroot}/#{namespace}/#{protocol}/config/*.yaml: Input dataset specific config
  • #{hackboxdir}/config/*.yaml: #{hackboxdir} is where the Rakefile is located.

Later sources on this list overwrite earlier sources. Where *.yaml is specified in a directory, multiple yaml files should be read in filename sorted order. The combined configuration metdata is created as a yaml file in the output (troop) directory as working_config.yaml. This file is read by the hackbox engine. The only globally required configuration items for all hackboxen are:


FIXME————————

  • provides: An array of resource tokens that the system the hackbox is ran on provides. For example, localfilesystem, mysql, hadoop, and so on. See the resources section.
    -——————-END_FIXME————————
  • dataroot: All hackbox-related data (inputs, outputs, log, etc) goes here
  • namespace: The path under dataroot to the data input directory (dots (.) are replaced with slashes (/))
  • protocol: Unique id for the output collection

FIXME————————

Resources

The system running the hackbox presumably provides some pre-exsisting (properly configured) resources. These can be quite general, such as the operating system os, or as specific as mysql or hadoop. The canonical list is as follows (add more as you see fit):

  • os – the generic operating system. Eg. linux
  • machine – the machine architecture. Eg. x86_64
  • platform – the system platform. Eg. ubuntu, mac_os_x, and so on.
  • filesystems – listing of filesystems. Eg hadoop, local, s3
  • mysql – this setup provides access to a running mysql instance. Somewhere in the configuration a mysql_user, mysql_pass, mysql_host, and mysql_port are defined.
  • pig – this setup has Apache Pig setup and configured for use.
  • wukong – this setup has the Wukong ruby library available in the rubylib

For a hackbox to require a given resource use the requires directive in the #{hackboxdir}/config/*.yaml.:

Here’s an example of how the hackbox config would require resources:


requires:
  os: linux
  machine: x86_64
  gems:
    json:
      version: 1.4.6
    erubis:
      version: 2.6.6
  mysql:
  filesystems:
    hadoop:
    s3:

And here’s how the system would provide them:


provides:
  os: linux
  machine: x86_64
  filesystems:
    local:
    hadoop:
      host: localhost
      port: 8020
    s3:
      aws_secret_key: 1234secretkey
      aws_access_key: abcdaccesskey

Note we use nested hashes here as it provides the greatest flexibility.

When the hackbox is ran the requires for the hackbox are compared with the system provides. If they don’t match up then an error is thrown with a list of missing requires. At the moment it is up to the hackbox user to determine how to acquire the missing dependencies.

TODO: Write the class that actually implements this logic.


END_FIXME————————

Running a hackbox

Before running the hackbox, make sure that the hackboxen repo lib directory is in your RUBYLIB environment variable.

Externally, the execution of a hackbox appears as:

  • A Rakefile is run with rake from the shell with one of the following targets:
    • get_data: Perform the ingest step. The input data (in ripd/rawd) and any required metadata should exist after this step.
    • default: Perform the processing step.
    • all: Perform both get_data and default.

Execution Results:

  • If there is no failure, rake can be silent.
  • If there is a failure, rake ends with a thrown exception
  • After a successful execution, the complete output interface (fixd) must exist.

The rough steps of hackbox internal execution are:

  • The configuration sources (command line and files) are read and combined.
  • The output directory structure (fixd) is created.
  • The hackbox engine is run and the “troop ready” ouput datasets are created in fixd.

Note: Hackbox execution should be idempotent (when it is sensible and efficient), leveraging this behavior from rake.

When Creating a Hackbox

One should try to avoid redundant computation. In particular, idempotency of output creation should be observed. Sometimes incrementally updated information makes this hard, but should be done if not too painful.

Files read and written by the hackbox should use the File System Abstraction. See Swineherd

Utilities

To aid in the creation of a new hackbox there is a rake task in the default Rakefile called ‘scaffold’. Say you wanted to create a hackbox namespaced under ‘web/analytics’ and with the name ‘digital_element’:


rake scaffold -- --protocol=digital_element --namespace=web.analytics

will create subdirectories under hb and a running hackbox. You’ll have to add the rest of the code yourself.

TODO / Questions

  • A hackbox can be run in either development or production mode (needs further elaboration)
  • Resources logic and spec (eg. provides and requires)