main-memory-selection-query-optimizer: A Java repository from Jiaza492

README - COMS W4112, Project 2 (Phase II)



In this phase of project 2, we implement a query optimizer for selection
conditions in main memory as described in algorithm 4.11 of Kenneth Ross's
Selection Conditions in Main Memory [Columbia University, 2004]. 





   ----------------------------SUMMARY-------------------------------

The optimization is based on avoiding branch mispredictions by optimizing
both the order of and the type of &-conjunction between the elements in a
set of input selection conditions. The decisions made by the optimizer are
based on the selectivities of each input condition as well as several
(user-specified) properties of the machine on which the query will be run
on. The types of &-conjunctions we choose between are:

    Branching-And: This implementation uses the "&&" (boolean) operator
    between two selection conditions and creates a branch in the compiled
    code, which leaves us vulnerable to branch misprediction penalties
    but saves work when the first term is very selective.
    
         for(i=0; i < number_of_records; i++) {
             if(f1(r1[i]) && ... && fk(rk[i])) {
                 answer[j++] = i;
             }
         }
     

    Logical-And: This implementation uses the "&" (bitwise) operator
    between two selection conditions and does not create a branch in
    the compiled code. Because there is no branch created in the resulting
    compiled code, there is no possibility of branch misprediction.
    However, it can still perform poorly when the first term is very
    selective, as we must consider the right operand even when the left
    operand evaluates to false.

         for(i = 0; i < number_of_records; i++) {
             if(f1(r1[i]) & ... & fk(rk[i])) {
                 answer[j++] = i;
             }
         }
     
    No-Branch: This implementation pulls the selection condition out of
    the if-statement altogether, instead placing a series of "Logical-And"
    ("&") conjunctions directly in the body of the for-loop. Because there
    is neither an if-statement nor any branching operators, there is no
    branching at all in the resulting compiled code.

         for(i = 0; i < number_of_records; i++) {
             answer[j] = i;
             j += (f1(r1[i]) & ... & fk(rk[i]));
         }




   ----------------------------USAGE-------------------------------

The program can be compiled with the included Makefile.

    $ make

Once compiled, the program can be run with the included shell script.

    $ ./stage2.sh query.txt config.txt

The query file should consist of space-delimited lists of selectivities
separated by newlines, as follows:

    0.8 0.5 0.3 0.2
    0.2 0.1 0.9
    0.6 0.75 0.8 1 0.9
    0.8 0.8 0.9 0.7 0.7 0.7

    In this example, the first line represents that f1's selectivity
    p1 = 0.8, f2's selectivity p2 = 0.5, f3's selectivity p3 = 0.3, 
    and f4's selectivity p4 = 0.2. The second line is another case 
    wherein f1's selectivity p1 = 0.2, f2's selectivity p2 = 0.1, and
    f3's selectivity p3 = 0.9. (fi is a selection function for a basic
    term.) Each line is separated by newline, and each selectivity 
    value is separated by single white space. All values in this file
    are 0 <= x <= 1.

The configuration file includes values of estimated costs. Those values
are estimated from CPU specification. The file format follows standard
java property convention.

    r = 1
    t = 2
    l = 1
    m = 16
    a = 2
    f = 4

    All values shown in this file represents parameters for the cost 
    estimation explained in section 4.2. We assume that the costs of
    applying function are equal for all functions fi, therefore "f"
    represents all of them. "m" means the cost of a branch misprediction. 
    It corresponds to the length of stage pipeline.

The program will print results to STDOUT in the following format.

    ==================================================================
    0.7 0.4 0.2 0.3 0.6
    ------------------------------------------------------------------
    if((t1[o1[i]] & t2[o2[i]]) && t3[o3[i]]) {
    answer[j] = i;
    j += (t4[o4[i]] & t5[o5[i]]);
    }
    ------------------------------------------------------------------
    cost: 10.5
    ==================================================================
    0.7 0.8 0.8 0.9
    ------------------------------------------------------------------
    if(t1[o1[i]] && (t2[o2[i]] && (t3[o3[i]] && t4[o4[i]]))) {
    answer[j++] = i;
    }
    ------------------------------------------------------------------
    cost: 28.3
    ==================================================================





   -------------------------IMPLEMENTATION----------------------------
 
We chose to use Java to implement the optimization algorithm. We used a class
"QueryOptimizer" to represent an optimizer for a single query containing only
simple selection conditions. Multiple queries can be optimized concurrently by
instantiating multiple QueryOptimizers. 

Another class, "QueryPlan" is an abstraction of the preliminary plans (or plan
subtrees) that we consider during the optimization. We use a bitmap to keep track
of the conditions present in each plan, and this format also allows us to
easily and efficiently calculate power sets, unions, and intersections (all of
which are required by the optimization algorithm.

Third, we refactored some of the static code into a utility class (stuff like
bitmap operations, output formatting, and configuration)
Jiaza492/main-memory-selection-query-optimizer