This is a repository of process packages (“hackboxen”) that are used to acquire datasets from raw sources and convert them into a form easily digested and published by troop. Schema and other metadata are also generated.
A hackbox consists of:
- A hackbox engine, which processes the configuration data and executes the actual data processing
- A data directory, typically created by the `get_data` task of the engine. This directory may be local or remote (e.g. S3/HDFS)
A hackbox engine contains:
- `config`: (required) A subdirectory containing one or more default configuration yaml (`.yaml`) files
- `engine`: (required) A subdirectory containing:
  - (required) An executable `main` file
  - (optional) Other executable and support files. There is no restriction on language and complexity.
- `Rakefile`: (required) An executable Rakefile used to read and combine all the sources of config metadata and execute `main`
Example hackbox directory structure (this one creates a tsv of word stats):
```
hb/language
└── corpora
    └── word_freq
        └── bnc
            ├── config
            │   └── config.yaml
            ├── engine
            │   ├── main
            │   └── word_freq_bnc_endpoint.rb
            └── Rakefile
```
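As an illustration of what an engine's `main` might look like, here is a minimal Ruby sketch (not the actual BNC word-frequency engine; the config keys and default paths are assumptions for the example):

```ruby
#!/usr/bin/env ruby
# Minimal illustrative engine `main`: read the combined configuration prepared
# by the Rakefile, count word frequencies in a ripped source file, and write a
# TSV into the output area. The config keys and default paths are assumptions.
require 'yaml'
require 'fileutils'

config  = YAML.load_file(ARGV[0] || 'working_config.yaml')
out_dir = config['output_dir'] || 'fixd/data/word_freq'
FileUtils.mkdir_p(out_dir)

counts = Hash.new(0)
File.foreach(config['input_file'] || 'ripd/corpus.txt') do |line|
  line.downcase.scan(/[a-z']+/).each { |word| counts[word] += 1 }
end

File.open(File.join(out_dir, 'word_freq.tsv'), 'w') do |tsv|
  counts.sort_by { |_, n| -n }.each { |word, n| tsv.puts "#{word}\t#{n}" }
end
```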
The data directory is where all of the data that a hackbox acquires, reads, or creates lives. The location of the data directory is determined by the `dataroot` variable set in your `hackbox.yaml` configuration file as well as the `namespace` and `protocol` variables set in the hackbox's own `config.yaml` file. This can be visualized in the following way:
```
dataroot
└── namespace
    └── protocol
        ├── config
        ├── fixd
        ├── log
        ├── rawd
        ├── ripd
        └── tmp
```
- `dataroot`: (required) All work (inputs, outputs, log, etc) for all hackboxen will take place here.
- `namespace`: (required) A “.” separated hierarchy, e.g. `category.subcategory.subsubcategory...`
- `protocol`: (required) Name for this collection of datasets (often the same as the hackbox name)
- `log`: (optional) All logging should go here.
- `tmp`: (optional) Any truly ephemeral output of the workflow should go here.
- `fixd`: (required) See the output interface described below.
- `rawd`: (optional) This will contain all intermediate data processing outputs.
- `ripd`: (required) This will contain virginal downloaded source data adhering to the directory structure from which it was pulled.
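For example, the full data directory path can be composed from these settings (a sketch; the `/data` dataroot is arbitrary, and the namespace/protocol values are chosen to mirror the word-frequency example above):

```ruby
# Sketch: composing a hackbox's data directory from its settings.
# Dots in the namespace become path separators (see the configuration section).
require 'pathname'

dataroot  = '/data'                       # assumed value from hackbox.yaml
namespace = 'language.corpora.word_freq'  # assumed value from the hackbox's config.yaml
protocol  = 'bnc'

data_dir = Pathname.new(dataroot).join(namespace.tr('.', '/'), protocol)
%w[config fixd log rawd ripd tmp].each { |d| puts data_dir.join(d) }
# => /data/language/corpora/word_freq/bnc/config, .../fixd, ...
```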
Input directories are generally created dynamically and are not meant to be archival. Configuration files in the input directory are usually created as part of the data ingest process.
`fixd` is the final output directory and contains the following:
- `env`: (required) A directory containing files describing the environment in which the hackbox is run. Any config-ish files may go here. Often this will include:
  - `working_environment.json`: (required) The “final” runtime config metadata used to generate the schema and output data for all of the output datasets, along with configuration specific to the collection of output datasets.
  - `protocol.icss.json`: (required) The icss file for the entire collection, see ICSS
- `code`: (optional) A directory containing the code assets described in the icss.
- `data`: (optional) A directory containing subdirectories named for each created dataset ID. Each subdirectory contains:
  - (required) One or more data files that collectively adhere to the schema for this dataset. If there is only a single file, by convention it should start with `#{dataset ID}`.
FIXME————————
An example hackbox directory for Yahoo stock data might look like:
```
├── money
│   └── stock
│       └── historical_stock_yahoo
│           ├── config
│           │   └── config.yaml
│           ├── engine
│           │   └── main
│           └── Rakefile
```
The `working_environment.json` could look like:
{ "namespace": "money.stock", "protocol": "historical_stock_yahoo", "datasets": { "yahoo_nasdaq_dividends": { "type": "yahoo_dividends_tsv" }, "yahoo_amex_daily_prices": { "type": "yahoo_daily_price_tsv" }, "yahoo_nyse_dividends": { "type": "yahoo_dividends_tsv" }, "yahoo_amex_dividends": { "type": "yahoo_dividends_tsv" }, "yahoo_nyse_daily_prices": { "type": "yahoo_daily_price_tsv" }, "yahoo_nasdaq_daily_prices": { "type": "yahoo_daily_price_tsv" } }, "tags": [ "money", "stocks", "yahoo", "timeseries", "finance", "economics" ], "sources": [ "http://www.nasdaq.com/screening/companies-by-industry.aspx?render=download", "http://ichart.finance.yahoo.com/table.csv" ], "types": [ "yahoo_daily_price_tsv", "yahoo_dividends_tsv" ] }
The output of running this hackbox looks like:
```
├── fixd
│   ├── env
│   │   └── working_environment.json
│   ├── data
│   │   ├── yahoo_amex_daily_prices
│   │   │   ├── yahoo_amex_daily_prices.tsv
│   │   │   └── yahoo_amex_daily_prices.icss.json
│   │   ├── yahoo_amex_dividends
│   │   │   ├── yahoo_amex_dividends.tsv
│   │   │   └── yahoo_amex_dividends.icss.json
│   │   ├── yahoo_nasdaq_daily_prices
│   │   │   ├── yahoo_nasdaq_daily_prices.tsv
│   │   │   └── yahoo_nasdaq_daily_prices.icss.json
│   │   ├── yahoo_nasdaq_dividends
│   │   │   ├── yahoo_nasdaq_dividends.tsv
│   │   │   └── yahoo_nasdaq_dividends.icss.json
│   │   ├── yahoo_nyse_daily_prices
│   │   │   ├── yahoo_nyse_daily_prices.tsv
│   │   │   └── yahoo_nyse_daily_prices.icss.json
│   │   └── yahoo_nyse_dividends
│   │       ├── yahoo_nyse_dividends.tsv
│   │       └── yahoo_nyse_dividends.icss.json
```
END_FIXME————————
A hackbox configuration file is a hash in yaml format; its config data is read using configliere. Configuration will be read in the following order:
- The command line
- `/etc/hackbox/hackbox.yaml`: Machine-wide config
- `~/.hackbox/hackbox.yaml`: User-wide config
- `#{dataroot}/#{namespace}/#{protocol}/config/*.yaml`: Input dataset specific config
- `#{hackboxdir}/config/*.yaml`: `#{hackboxdir}` is where the `Rakefile` is located.
Later sources on this list overwrite earlier sources. Where `*.yaml` is specified in a directory, multiple yaml files should be read in filename-sorted order. The combined configuration metadata is created as a yaml file in the output (troop) directory as `working_config.yaml`. This file is read by the hackbox engine. The only globally required configuration items for all hackboxen are:
FIXME————————
- `provides`: An array of resource tokens that the system the hackbox is run on provides. For example, `localfilesystem`, `mysql`, `hadoop`, and so on. See the resources section.
-——————-END_FIXME————————
- `dataroot`: All hackbox-related data (inputs, outputs, log, etc) goes here
- `namespace`: The path under `dataroot` to the data input directory (dots (.) are replaced with slashes (/))
- `protocol`: Unique id for the output collection
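As a rough sketch of how the configuration sources above might be layered with Configliere (the helper name, directory arguments, and output path are illustrative, not the shipped implementation):

```ruby
require 'configliere'
require 'yaml'

# Read each config source in the documented order; later reads overwrite
# earlier values. The two directory arguments would be supplied by the Rakefile.
def read_config_sources(input_config_dir, hackbox_config_dir)
  Settings.use :commandline   # command-line flags are handled by Configliere

  ['/etc/hackbox/hackbox.yaml', File.expand_path('~/.hackbox/hackbox.yaml')].each do |path|
    Settings.read(path) if File.exist?(path)
  end

  # Input-dataset config, then the hackbox's own defaults; *.yaml in sorted order
  [input_config_dir, hackbox_config_dir].each do |dir|
    Dir[File.join(dir, '*.yaml')].sort.each { |yaml| Settings.read(yaml) }
  end

  Settings.resolve!
  Settings
end

settings = read_config_sources('/data/money/stock/historical_stock_yahoo/config', './config')

# Persist the combined metadata where the engine expects to find it.
File.write('working_config.yaml', YAML.dump(settings.to_hash))
```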
FIXME————————
The system running the hackbox presumably provides some pre-existing (properly configured) resources. These can be quite general, such as the operating system `os`, or as specific as `mysql` or `hadoop`. The canonical list is as follows (add more as you see fit):
- `os` – the generic operating system, e.g. linux
- `machine` – the machine architecture, e.g. x86_64
- `platform` – the system platform, e.g. ubuntu, mac_os_x, and so on.
- `filesystems` – listing of filesystems, e.g. hadoop, local, s3
- `mysql` – this setup provides access to a running mysql instance. Somewhere in the configuration a `mysql_user`, `mysql_pass`, `mysql_host`, and `mysql_port` are defined.
- `pig` – this setup has Apache Pig set up and configured for use.
- `wukong` – this setup has the Wukong ruby library available in the rubylib
For a hackbox to require a given resource, use the `requires` directive in `#{hackboxdir}/config/*.yaml`.
Here’s an example of how the hackbox config would require resources:
```yaml
requires:
  os: linux
  machine: x86_64
  gems:
    json:
      version: 1.4.6
    erubis:
      version: 2.6.6
  mysql:
  filesystems:
    hadoop:
    s3:
```
And here’s how the system would provide them:
```yaml
provides:
  os: linux
  machine: x86_64
  filesystems:
    local:
    hadoop:
      host: localhost
      port: 8020
    s3:
      aws_secret_key: 1234secretkey
      aws_access_key: abcdaccesskey
```
Note we use nested hashes here as they provide the greatest flexibility.
When the hackbox is run, the `requires` for the hackbox are compared with the system `provides`. If they don’t match up then an error is thrown with a list of missing `requires`. At the moment it is up to the hackbox user to determine how to acquire the missing dependencies.
TODO: Write the class that actually implements this logic.
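A rough sketch of the intended check (hypothetical, since per the TODO the class does not exist yet; value and gem-version matching is glossed over):

```ruby
# Walk the nested `requires` hash and collect any dotted path that the
# system's `provides` hash does not cover. Leaf values (e.g. os: linux)
# are treated as satisfied once the key is present.
def missing_requires(requires, provides, prefix = [])
  return [] unless requires.is_a?(Hash)
  requires.flat_map do |key, sub|
    if provides.is_a?(Hash) && provides.key?(key)
      missing_requires(sub, provides[key], prefix + [key])
    else
      [(prefix + [key]).join('.')]
    end
  end
end

# `hackbox_config` and `system_config` stand in for the parsed yaml hashes above.
missing = missing_requires(hackbox_config['requires'] || {}, system_config['provides'] || {})
raise "Missing required resources: #{missing.join(', ')}" unless missing.empty?
```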
END_FIXME————————
Before running the hackbox, make sure that the hackboxen repo `lib` directory is in your `RUBYLIB` environment variable.
Externally, the execution of a hackbox appears as:
- A `Rakefile` is run with `rake` from the shell with one of the following targets:
  - `get_data`: Perform the ingest step. The input data (in `ripd`/`rawd`) and any required metadata should exist after this step.
  - `default`: Perform the processing step.
  - `all`: Perform both `get_data` and `default`.
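A stripped-down Rakefile wiring up these targets might look like this (illustrative only; a real hackbox will also build the combined configuration, and the `engine/fetch_sources` helper is hypothetical):

```ruby
# Minimal illustrative hackbox Rakefile; task bodies are placeholders.
task :get_data do
  # Ingest: fetch the raw sources into ripd/ (and rawd/ if needed).
  sh 'engine/fetch_sources'               # hypothetical helper script
end

task :default do
  # Processing: run the engine to populate fixd/.
  sh 'engine/main working_config.yaml'
end

task :all => [:get_data, :default]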
Execution Results:
- If there is no failure, `rake` can be silent.
- If there is a failure, `rake` ends with a thrown exception.
- After a successful execution, the complete output interface (`fixd`) must exist.
The rough steps of hackbox internal execution are:
- The configuration sources (command line and files) are read and combined.
- The output directory structure (`fixd`) is created.
- The hackbox engine is run and the “troop ready” output datasets are created in `fixd`.
Note: Hackbox execution should be idempotent (when it is sensible and efficient), leveraging this behavior from `rake`.
One should try to avoid redundant computation. In particular, idempotency of output creation should be observed. Sometimes incrementally updated information makes this hard, but it should be done if not too painful.
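One common way to get this idempotency from rake is to express outputs as `file` tasks, so work is skipped when the output already exists and is newer than its inputs (a sketch, not a prescribed pattern; paths are illustrative):

```ruby
# A file task only re-runs when its target is missing or older than its inputs.
file 'fixd/data/word_freq/word_freq.tsv' => ['ripd/corpus.txt'] do |t|
  sh 'engine/main working_config.yaml'   # regenerates t.name
end

task :default => 'fixd/data/word_freq/word_freq.tsv'
```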
Files read and written by the hackbox should use the File System Abstraction; see Swineherd.
To aid in the creation of a new hackbox there is a rake task in the default Rakefile called ‘scaffold’. Say you wanted to create a hackbox namespaced under ‘web/analytics’ and with the name ‘digital_element’:
```
rake scaffold -- --protocol=digital_element --namespace=web.analytics
```
This will create subdirectories under `hb` and a running hackbox. You’ll have to add the rest of the code yourself.
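Based on the standard layout described above, the scaffolded hackbox should look roughly like this (illustrative; the exact files generated depend on the scaffold task):

```
hb/web
└── analytics
    └── digital_element
        ├── config
        │   └── config.yaml
        ├── engine
        │   └── main
        └── Rakefile
```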
- A hackbox can be run in either `development` or `production` mode (needs further elaboration)
- Resources logic and spec (e.g. provides and requires)