medallia: A Java repository from lautarobock

Medallia ECI Challenge 2013
---------------------------

 A bit of background
 -------------------

 This year's challenge is inspired by a problem we tried to solve with our 
internal OLAP engine. Slug (which is the name of the engine) is an in-memory, 
extremely fast OLAP engine, that allows our customers to slice-and-dice their 
data in any way they want, answering queries under a second.
 
 An OLAP engine works over discrete data cubes (hypercubes usually). Where each 
dimension of the cube is a variable, with its own data domain (i.e. range of 
legal values).
 
 For example, you could have a cube with three dimensions:
 	- age (0 to 120)
 	- gender (male or female)
 	- overall satisfaction (1 to 5)
 
 Each coordinate within the cube represents a data point. Usually an OLAP cube 
is mostly empty.
 
 Typical OLAP queries are:
 	- get the score distribution grouped by gender
 	- get the average score overall
 	- get the average score grouped by 10 year age buckets
 	
 These are modelled as a series of slices (e.g. slice the cube across the age 
dimension, use 10 year slices) and agregation operations (e.g. compute the 
average of each slice).
 
 Since cubes are mostly empty, only coordinates that hold values are stored, 
for example:
 
 	age | gender | score
  -----+--------+-------
    23 |  male  |   4
    41 |  male  |   5
    34 | female |   3
    52 |  ----  |   5  
 
 Note that some data points may not contain a value across a dimension. In 
general, from the engine point of view, all variables are nullable. Some may 
have a default value if not present.
 
 In practice, a cube as above would have an additional dimension that will 
serve as a record id and makes each point unique, but we can ignore that for 
now.
 
 In Slug teminology, each of the cube's dimensions is a field, which has a name 
and metadata about valid values that it can hold.
 
 Internally, Slug stores it's data in several segments, each segment having 
several storage columns, and each column holding values for one or more fields:
 
 Segment 0: 
 	+----------------+----------------+----------------+----------------+
 	|      col 0     |      col 1     |      col 2     |      col 3     |
 	+----------------+----------------+----------------+----------------+
 	| f1 |     f2    |       f5       |    f3  |   f4  | f6 |   f7   |f8|
 	|....|...........|................|........|.......|....|........|..|
	
	...

  Segment N: 
 	+----------------+----------------+----------------+----------------+
 	|      col 0     |      col 1     |      col 2     |      col 3     |
 	+----------------+----------------+----------------+----------------+
 	| f1 |     f2    |       f5       |    f3  |   f4  | f6 |   f7   |f8|
 	|....|...........|................|........|.......|....|........|..|
 	
 The field to column layout is the same for all segments. Each column is an 
integer array, and fields are packed in as little space as needed. Packing 
colun-wise helps with adding/removing fields, and with cache locality when 
executing queries that do not touch many fields.
 
 Segments sizes are typically bounded between 1-64K rows per segment, rounded 
to the next power of two.
 
 For example, the space requirements for the fields of the cube defined above:
  	- age (0 to 120): needs 7 bits
 	- gender (male or female): needs 2 bits (assuming is nullable, so 3 
possible values)
 	- overall satisfaction (1 to 5): needs 3 bits
 	
 Typical Slug cubes are much larger, some with 1000s of fields/dimensions, and 
millions of rows.
 
 One thing that we observe in customer datasets is that fields are added over 
time and some fields become unused, but since the data is kept for querying, 
datasets are typically sparsely populated. 
 
 Medallia is in the business of customer experience management, so many of the 
records are the results of surveys. Each survey question is modelled as a 
field, and a customer may have many different survey programs running at any 
given time, each of which fills only a few fields of all the available ones.
 
 So a dataset segment will look something like the following:

 +col0|col1|col2|col3|col4|col5+
 +----+----+----+----+----+----+
 +XXXX|XXXX|____|____|____|____+
 +XXXX|XXXX|____|____|____|____+
 +XXXX|XXXX|____|____|____|____+
 +XXXX|XXXX|____|____|____|____+
 +____|____|___X|XXXX|__XX|X___+
 +____|____|___X|XXXX|__XX|X___+
 +____|____|___X|XXXX|__XX|X___+
 +____|____|___X|XXXX|__XX|X___+
 
  With Xs representing used bits and dashes representing zeroes, what you'd see 
is large chunks of data with large chunks of zeroes.
 
  By carefully re-arranging the fields into columns:

 +col0|col1|col2|col3|col4|col5+
 +----+----+----+----+----+----+
 +XXXX|XXXX|____|____|____|____+
 +XXXX|XXXX|____|____|____|____+
 +XXXX|XXXX|____|____|____|____+
 +XXXX|XXXX|____|____|____|____+
 +____|____|XXXX|XXXX|____|____+
 +____|____|XXXX|XXXX|____|____+
 +____|____|XXXX|XXXX|____|____+
 +____|____|XXXX|XXXX|____|____+
  
 You can represent the same data, but columns 4 and 5 end up being nothing but 
zeroes, so as long as these records are not modified, they could share the 
underlying storage, that is, sharing the same null vector.
 
 This null vector could be shared across all segments in the dataset, producing 
significant savings.
 
 We implemented this on Slug, getting 3% to 40% memory savings on customer 
datasets just by reordering columns (the average is around 20%). For example in 
one large dataset it means we save close to 30GB of RAM.
 
 The other thing we could do to make the savings better is reordering rows, so 
as to group similar rows in the same segments, increasing the chance that a 
column of nulls will arise.
 
 For example, splitting the segment above into two:
 
 +col0|col1|col2|col3|col4|col5+
 +----+----+----+----+----+----+
 +XXXX|XXXX|____|____|____|____+
 +XXXX|XXXX|____|____|____|____+
 +XXXX|XXXX|____|____|____|____+
 +XXXX|XXXX|____|____|____|____+
 
 +col0|col1|col2|col3|col4|col5+
 +____|____|XXXX|XXXX|____|____+
 +____|____|XXXX|XXXX|____|____+
 +____|____|XXXX|XXXX|____|____+
 +____|____|XXXX|XXXX|____|____+
 
 There are now many more null vectors, at the cost of potentially lost savings 
in having more segments (segments use some memory, even if empty).
 
 The Challenge
 -------------
 
 This year's challenge is (perhaps unsurprisingly) also known as the 
Null-vector challenge.
 
 Your mission, should you decide to accept it, is to implement two algorithms:
 	- column reordering: re-arrange fields into columns so as to maximize 
chances of null columns.
 	- row reordering: re-arrange rows into segments, trying to maximize 
null columns
 	
 This will run in a simulator, on anonymized production data detailing used 
field layouts, field to column mappings and record to segments.
 
 The amount of data will be quite large and the runs will be time-constrained.
 Currently, runs over 50 minutes are aborted. The JVM runs with 3GB of ram, 
your submission should have at least 1GB of RAM available to it (the data files 
are fully loaded in memory and have different sizes, so memory pressure will 
vary during the run).
 
 Setup
 -----
 
 First of all, you'll need the SDK (if you're reading this from the SDK's 
README file, I assume you already have it ;).
 
 Go to http://challenge.medallia.com and download it and extract it.
 
 You'll need to have a recent version of Apache Maven and a recent Java SDK 
too. If you don't have them ready, go ahead an install both.
 
 Building and running
 --------------------
 
 You can build the sample simulator by doing:
 
 	~/eci-challenge$ mvn clean package
 
 
 It can be run by calling:
 
 	~/eci-challenge$ ./run.sh
 
 If the script fails, check that java is in your search path.
 
 Getting Started
 ---------------

 You should have the following files in your challenge SDK: 
 
 .
├── README
├── pom.xml   <-- sdk root pom.xml
├── run.sh    <-- utility script to execute your submission with sample data
├── runner    <-- runner module, contains the source code for the simulator
└── simulator <-- your code goes in this module
    ├── pom.xml
    └── src
        └── main
            └── java
                └── medallia
                    └── sim
                        └── Submission.java <-- submission entry point

 Take a peek at Submission.java, it contains a very simple (and not very good) 
implementation of a column and row packing simulator.
 
 The submission itself implements SimulatorFactory interface, used to create 
both a RecordLayoutSimulator and a FieldLayoutSimulator.
 
 There are basic implementations of both, and several utility classes and 
methods available to help with them.
 
 Simulators
 ----------
  
 There are two fundamental pieces of information available to each of the 
simulators:
 
 	- BitSet[] layouts: an array of unique field layouts. 
 	- List<Field> fields: a list of fields
 
 The simulators will receive records as indices into the layouts array. 
 The layouts have a bit set for each field, and fields in the 'fields' array 
are in the same order than bits in the layout.
 
 For example, if you have a dataset with 5 fields:
 { a, b , c, d, e }
 
 And the following field usage data:
 
 01001
 11010
 10101
 01001
 11010
 10101

 There are three unique layouts:
 
 0: 01001
 1: 11010
 2: 10101
 
 So the records will be passed to the simulator in the following order:
 
 {0, 1, 2, 0, 1, 2}
 
 Where each layout is passed as an argument in subsequent calls to the 
processRecord() method.
 
 You also have an 'index' attribute in each field that gives its original 
position. This will be useful when re-arranging fields into columns.
 
 So if you want to know if field.get(42) is set in layout 5, you can do:
 
 	layouts[5].get(field.get(42).index) 
 
 Be careful not to modify the layouts, it will be checked in the analysis phase 
and might cause hard to track problems.
 
 You can run Submission's main method from an IDE such as InteliJ IDEA or 
Eclipse to test your submission.
 
 The output of a single run is similar to the following:
 
company12:dataset1:Simple 72894 rows (7951 layouts), 545 fields (197 columns in 
2 segments): 14.2% used-data, 89.3% used-columns, 51M
company12:dataset2:Simple 142345 rows (510 layouts), 107 fields (42 columns in 
3 segments): 30.1% used-data, 92.9% used-columns, 22M
company11:dataset1:Simple 3150 rows (14 layouts), 93 fields (37 columns in 1 
segments): 38.8% used-data, 78.4% used-columns, 371K
...
** Simple - total bytes used: 6.5G

 It will output, for each company's dataset:
 	- # of records (rows) in the dataset
 	- # of unique field layouts
 	- # of fields
 	- how many columns are used by your layout and how many segments
 	- % of bits that are used by non-zero fields
 	- % of columns that contain at least one non-zero value
 	- used memory taking the null vector savings in consideration
 
 In the end it will print the total memory usage of a dataset if the data was 
laid out as specified by your algorithm.
 
 
 Submitting your code
 --------------------
 
 Go to the SDK directory, and rebuild your submission:
 
  	~/eci-challenge$ mvn clean package

 The generated code will be named 'simulator-1.0.jar' under simulator/target:

  	~/eci-challenge$ls simulator/target/
	classes           generated-sources maven-archiver    simulator-1.0.jar 
	surefire
 
 Submit that jar in http://challenge.medallia.com
lautarobock/medallia