A column oriented, embarrassingly distributed, relational event database.
- column oriented - super fast queries
- events - write only semantics
- distributed insert - designed for petabyte scale distributed datasets with massive write loads
- compressed - bitmap indexes, lz4, and prefix trie compression
- relational - join gigantic data sets
- partitioned - smart shards
- embarrassingly distributed (based on Disco)
- embarrassingly fast (uses LMDB)
- NoSQL - Python DSL
- bulk append only semantics
- highly available, horizontally scalable
- REPL/CLI query interface
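To give a feel for why bitmap indexes over separately stored columns make filtering cheap, here is a minimal pure-Python sketch. It is a toy illustration only, not Hustle's actual implementation: the helper names `build_bitmap_index` and `rows_of` and the sample data are hypothetical, and plain Python integers stand in for compressed bitmaps.

```python
# Toy illustration of a bitmap index; NOT Hustle's actual implementation.
# Each distinct column value maps to an integer whose set bits mark the
# row numbers that hold that value.

def build_bitmap_index(column):
    """Return {value: bitmap} where bit i is set if column[i] == value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap):
    """Decode a bitmap back into a sorted list of row numbers."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical columns, stored separately as in a column-oriented layout.
ad_id = [30010, 30011, 30010, 30012, 30010]
date = ['2014-01-10', '2014-01-10', '2014-01-14', '2014-01-11', '2014-01-12']

ad_index = build_bitmap_index(ad_id)
date_index = build_bitmap_index(date)

# date < '2014-01-13': OR together the bitmaps of every qualifying date.
# ISO-formatted date strings compare correctly as plain strings.
before = 0
for value, bitmap in date_index.items():
    if value < '2014-01-13':
        before |= bitmap

# (date < '2014-01-13') & (ad_id == 30010) becomes a single bitwise AND.
matching = rows_of(before & ad_index.get(30010, 0))
```

Only the two columns named in the condition are touched, and the combined predicate is resolved with bitwise operations before any row is read.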
    select(impressions.ad_id, impressions.date, h_sum(pix.amount), h_count(),
           where=((impressions.date < '2014-01-13') & (impressions.ad_id == 30010),
                  pix.date < '2014-01-13'),
           join=(impressions.site_id, pix.site_id),
           order_by=impressions.date)
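For readers new to the DSL, the following plain-Python sketch spells out one reading of what the query above computes. It does not use Hustle at all. The sample rows are hypothetical, and the grouping by the non-aggregated columns (`ad_id` and `date`) is an assumption about how `h_sum` and `h_count` behave, not something this README states.

```python
from collections import defaultdict

# Hypothetical toy rows; the column names follow the query above.
impressions = [
    {'ad_id': 30010, 'date': '2014-01-10', 'site_id': 1},
    {'ad_id': 30010, 'date': '2014-01-12', 'site_id': 2},
    {'ad_id': 30010, 'date': '2014-01-14', 'site_id': 1},  # fails date filter
    {'ad_id': 30099, 'date': '2014-01-10', 'site_id': 1},  # fails ad_id filter
]
pix = [
    {'site_id': 1, 'date': '2014-01-09', 'amount': 5},
    {'site_id': 1, 'date': '2014-01-11', 'amount': 7},
    {'site_id': 2, 'date': '2014-01-12', 'amount': 3},
    {'site_id': 2, 'date': '2014-01-15', 'amount': 100},  # fails date filter
]

cutoff = '2014-01-13'

# where: each table is filtered by its own condition.
imps = [r for r in impressions
        if r['date'] < cutoff and r['ad_id'] == 30010]
pixels = [r for r in pix if r['date'] < cutoff]

# join: match rows from the two tables on site_id.
pix_by_site = defaultdict(list)
for p in pixels:
    pix_by_site[p['site_id']].append(p)

# h_sum / h_count: aggregate per (ad_id, date) group (assumed grouping).
totals = defaultdict(lambda: [0, 0])
for imp in imps:
    for p in pix_by_site[imp['site_id']]:
        group = totals[(imp['ad_id'], imp['date'])]
        group[0] += p['amount']  # running sum of pix.amount
        group[1] += 1            # running count of joined rows

# order_by: sort the result rows by impressions.date.
result = sorted(
    ((ad, d, s, c) for (ad, d), (s, c) in totals.items()),
    key=lambda row: row[1])
```

Each tuple in `result` has the shape `(ad_id, date, sum_of_amount, count)`, mirroring the four expressions passed to `select`.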
After cloning this repo, keep the following in mind:
- you will need Python 2.7 or higher; it probably won't work on 2.6 (this has to do with pickling lambdas)
- you need to install Disco 0.5 and its dependencies; get that working first
- you need to install Hustle and its dependencies as follows:
    cd hustle
    sudo ./bootstrap.sh
Please refer to the Installation Guide for more details.
Special thanks to the following open-source projects: