hpi-epic/column_selection_example

Some questions about the reproduction of the paper.

Opened this issue · 6 comments

Hi Bouncner,
Thanks for your ampl code!
Is there any relevant source code for the paper to be implemented on Hyrise? Especially about the modified components compared to vanilla Hyrise in Fig.1 and benchmark performance.

Hey walt676,

Unfortunately, there is no sharable code for this project. The code heavily relied on the AMM library, which is proprietary code of EMC.
Since we started rewriting Hyrise in 2016/17, there is been no further work on hybrid table layouts.

Hey walt676,

Unfortunately, there is no sharable code for this project. The code heavily relied on the AMM library, which is proprietary code of EMC. Since we started rewriting Hyrise in 2016/17, there is been no further work on hybrid table layouts.

Thank you so much for taking the time to answer my question.
According to your reply, is it very difficult to implement the change of table layout in the rewritten Hyrise?
If it is convenient for you, can I ask you some questions about the paper? It‘s ok if you do not reply.

  1. I found that 'A. Model Description' of the thrid section 'COLUMN SELECTION' tries to extend the integer optimization problem into the continuous framework. I would like to ask what does it do to have $x_i\in [0,1]$ rather than $x_i\in {0,1}$?
  2. From step 2 to step 3 in the forementioned subsection, (5) omits the constraint (3) and include a penalty term. Why could it 'guarantees admissible integer solutions in a continuous framework'?

Thank you again for your previous reply. My mathematical foundation is somewhat lacking. If you think these questions are too basic, could you please recommend me some mathematical content or courses?

According to your reply, is it very difficult to implement the change of table layout in the rewritten Hyrise?

It's actually possible in the current Hyrise version, but it hasn't been done so far.
All accesses to segments are done via iterators. It's certainly possible to implement such iterators for a "hybrid" segment. As soon as this iterator exists, all operators etc. should immediately work with the new segment type.

Concerning your questions:

  1. I am not sure if I get the question right. What do you mean by "what does it do"? Generally, the placement problem of this paper is to determine where a column is kept. Thus, it is a binary decision (e.g., column XYZ is in DRAM, if the configuration X stores a 1 for it). A continuous decision wouldn't make sense in our model, since there is no interpretation for "0.35 of column XYZ is in DRAM, 0.65 is on SSD". However, due to the later discussed properties of the model, there will be integral solutions even for the continuous problem. As the continuous problem is faster to solve, that's a big advantage.
  2. Even though we remove the budget constraint, please note that x elements still are constrained to be within [0,1]. Please have a look at the slide "Pareto-Optimal Relaxations " and following at https://hpi.de/fileadmin/user_upload/fachgebiete/plattner/teaching/Dynamic_Pricing/Slides_ILP2_SS19.pdf. Does that help?

Thank you very much for your reply, it was very helpful.

  1. Even though we remove the budget constraint, please note that x elements still are constrained to be within [0,1]. Please have a look at the slide "Pareto-Optimal Relaxations " and following at https://hpi.de/fileadmin/user_upload/fachgebiete/plattner/teaching/Dynamic_Pricing/Slides_ILP2_SS19.pdf. Does that help?

Sorry to bother again, where could I get the full course content (slideshow and other possible content)?
I would be very grateful if you could also answer the following two questions :

  1. In Fig.3, is for a continuous solution only the part that intersects with the integer solution is drawn?

  2. In the calculation of the scan cost in the relative performance of the continuous solution, is it necessary to directly bring x_i into EQ. 1 or to treat the value above 0.5 as 1, and vice versa as 0?

  3. The selectivity s_i of column i is the average share of rows with the same attribute. Probably because I'm not a native English speaker and I don't quite understand what it means.

  4. As the continuous problem is faster to solve, that's a big advantage.

    Sorry for my poor math foundation, I don't quite understand why continuous problem is faster to solve.

Thank you again for your previous reply.

Sorry, walt 676 for the late response.

  1. you can find the course material here: https://hpi.de/en/plattner/teaching/archive/summer-term-2019/data-driven-decision-making-in-enterprise-applications.html
  2. No, no need to truncate or round values. Even float values will be either 0.0 or 1.0 due to the given constraints (the paper discusses this in more detail).
  3. Assume you have a column with ten values, and each values occurs in every tenth line (uniform distribution assumed here). Then the share of each value is 0.1, which is also the selectivity for an equal predicate filter on this column.
    If you would have five values, the selectivity would be 0.2 etc. etc.
  4. Continuous problems are much less restricted than integer ones. At first, once might think that integers are "easier" to handle. But in fact, variables enforced to be integers is just a much heavier constraint on each variable.