README - COMS W4112, Project 2 (Phase II) In this phase of project 2, we implement a query optimizer for selection conditions in main memory as described in algorithm 4.11 of Kenneth Ross's Selection Conditions in Main Memory [Columbia University, 2004]. ----------------------------SUMMARY------------------------------- The optimization is based on avoiding branch mispredictions by optimizing both the order of and the type of &-conjunction between the elements in a set of input selection conditions. The decisions made by the optimizer are based on the selectivities of each input condition as well as several (user-specified) properties of the machine on which the query will be run on. The types of &-conjunctions we choose between are: Branching-And: This implementation uses the "&&" (boolean) operator between two selection conditions and creates a branch in the compiled code, which leaves us vulnerable to branch misprediction penalties but saves work when the first term is very selective. for(i=0; i < number_of_records; i++) { if(f1(r1[i]) && ... && fk(rk[i])) { answer[j++] = i; } } Logical-And: This implementation uses the "&" (bitwise) operator between two selection conditions and does not create a branch in the compiled code. Because there is no branch created in the resulting compiled code, there is no possibility of branch misprediction. However, it can still perform poorly when the first term is very selective, as we must consider the right operand even when the left operand evaluates to false. for(i = 0; i < number_of_records; i++) { if(f1(r1[i]) & ... & fk(rk[i])) { answer[j++] = i; } } No-Branch: This implementation pulls the selection condition out of the if-statement altogether, instead placing a series of "Logical-And" ("&") conjunctions directly in the body of the for-loop. Because there is neither an if-statement nor any branching operators, there is no branching at all in the resulting compiled code. for(i = 0; i < number_of_records; i++) { answer[j] = i; j += (f1(r1[i]) & ... & fk(rk[i])); } ----------------------------USAGE------------------------------- The program can be compiled with the included Makefile. $ make Once compiled, the program can be run with the included shell script. $ ./stage2.sh query.txt config.txt The query file should consist of space-delimited lists of selectivities separated by newlines, as follows: 0.8 0.5 0.3 0.2 0.2 0.1 0.9 0.6 0.75 0.8 1 0.9 0.8 0.8 0.9 0.7 0.7 0.7 In this example, the first line represents that f1's selectivity p1 = 0.8, f2's selectivity p2 = 0.5, f3's selectivity p3 = 0.3, and f4's selectivity p4 = 0.2. The second line is another case wherein f1's selectivity p1 = 0.2, f2's selectivity p2 = 0.1, and f3's selectivity p3 = 0.9. (fi is a selection function for a basic term.) Each line is separated by newline, and each selectivity value is separated by single white space. All values in this file are 0 <= x <= 1. The configuration file includes values of estimated costs. Those values are estimated from CPU specification. The file format follows standard java property convention. r = 1 t = 2 l = 1 m = 16 a = 2 f = 4 All values shown in this file represents parameters for the cost estimation explained in section 4.2. We assume that the costs of applying function are equal for all functions fi, therefore "f" represents all of them. "m" means the cost of a branch misprediction. It corresponds to the length of stage pipeline. The program will print results to STDOUT in the following format. ================================================================== 0.7 0.4 0.2 0.3 0.6 ------------------------------------------------------------------ if((t1[o1[i]] & t2[o2[i]]) && t3[o3[i]]) { answer[j] = i; j += (t4[o4[i]] & t5[o5[i]]); } ------------------------------------------------------------------ cost: 10.5 ================================================================== 0.7 0.8 0.8 0.9 ------------------------------------------------------------------ if(t1[o1[i]] && (t2[o2[i]] && (t3[o3[i]] && t4[o4[i]]))) { answer[j++] = i; } ------------------------------------------------------------------ cost: 28.3 ================================================================== -------------------------IMPLEMENTATION---------------------------- We chose to use Java to implement the optimization algorithm. We used a class "QueryOptimizer" to represent an optimizer for a single query containing only simple selection conditions. Multiple queries can be optimized concurrently by instantiating multiple QueryOptimizers. Another class, "QueryPlan" is an abstraction of the preliminary plans (or plan subtrees) that we consider during the optimization. We use a bitmap to keep track of the conditions present in each plan, and this format also allows us to easily and efficiently calculate power sets, unions, and intersections (all of which are required by the optimization algorithm. Third, we refactored some of the static code into a utility class (stuff like bitmap operations, output formatting, and configuration)
Jiaza492/main-memory-selection-query-optimizer
A query optimizer for selection conditions in main memory as described in algorithm 4.11 of Kenneth Ross's Selection Conditions in Main Memory [Columbia University, 2004].
Java