chop-dbhi/harvest

Progress bar and explain estimator


In thinking about this more, I think an explain estimator might be helpful, but it may very well not be, at least not without a lot of effort. Since it is very hard to find existing code that does this, it is probably quite hard to predict query times. However, this would make for a fun and interesting thesis project ;-) I see that NEC has a recent patent for query time estimation, using PostgreSQL no less, which is kind of cool but also annoying; I haven't looked at it.

The PostgreSQL planner uses a handful of abstract cost factors (seq_page_cost, random_page_cost, cpu_tuple_cost, cpu_index_tuple_cost, cpu_operator_cost), and if the ratios between these are sufficiently "wrong", it's hard to imagine coming up with an estimating algorithm that would work really well. On the other hand, the planner actually uses these, which is somewhat encouraging.

So one challenge would be to figure out a way of calibrating these as well as possible (which would also improve plans, presumably). I haven't come across a methodical way of calibrating these constants for a particular machine. You could imagine using specially designed calibration data and queries, timing those, and doing some fancy math to calculate these cost factors. The disk costs would be challenging to measure; you'd have to shut the database down, flush the OS cache, and so on.

Anyway, we could take a few hours to try making some simple calibration queries based on temporary test data such that the planner makes good/perfect row estimates. Then run some sample Harvest queries and see how far off our wall-clock estimates are based on simple math. I wouldn't be surprised if this produces iffy results without calibrating the cost factors first.
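
A rough sketch of what that experiment might look like, assuming a psycopg2 connection and PostgreSQL 9.x; the connection string and the calibration query here are placeholders, not actual Harvest queries:

    import json

    import psycopg2

    # Placeholder connection and calibration query -- substitute real Harvest
    # connection details and sample queries.
    conn = psycopg2.connect('dbname=harvest')
    sample_queries = [
        'SELECT count(*) FROM generate_series(1, 100000)',
    ]

    def cost_vs_time(sql):
        "Return (planner total cost, actual execution time in ms) for a query."
        cur = conn.cursor()
        cur.execute('EXPLAIN (ANALYZE, FORMAT JSON) ' + sql)
        row = cur.fetchone()[0]
        # Depending on the driver, the JSON plan may already be parsed.
        plan = row if isinstance(row, list) else json.loads(row)
        top = plan[0]
        # Pre-9.4 PostgreSQL reports 'Total Runtime' instead of 'Execution Time'.
        ms = top.get('Execution Time', top.get('Total Runtime'))
        return top['Plan']['Total Cost'], ms

    for sql in sample_queries:
        cost, ms = cost_vs_time(sql)
        print('%8.1f cost units  %8.1f ms  %.4f ms/unit' % (cost, ms, ms / cost))

If the ms-per-cost-unit ratio stays roughly constant across very different queries, simple math might be good enough; if it swings by orders of magnitude, that tells us the cost factors need tuning first.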

I think the indication of query complexity does not have to be extremely accurate. Even if we can provide some sort of rough estimation (maybe something as naive as the size of the tables being joined in the query?) we can provide the user with some feedback regarding complexity. In this case, any indication begins to provide some level of transparency to the data model driving the application and helps inform user behavior.

A visual indicator immediately sets a level of expectation. I'd agree that calibration queries might be the best way to handle this. Calibration queries could highlight potential problem areas right off the bat and provide some transparency to the end user. I think we'll find that in some cases the "problem" queries uncovered might not be worth optimizing because those queries will never be hit. In other cases we can log the complexity of queries in relation to how often they're being hit to prioritize optimization.

I agree that extreme accuracy isn't necessary. Is there a way to calibrate the "explain cost" to "time/complexity" function based on actual usage? Maybe the function parameters are stored in the database and the accuracy is evaluated and the parameters adjusted with each query?
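
One very simple version of that feedback loop, just as a sketch with made-up defaults: keep a single scale factor for milliseconds per planner cost unit and nudge it toward the observed ratio after each query; the factor (and alpha) could live in the database as suggested.

    class CostScale(object):
        """Running estimate of milliseconds per planner cost unit."""

        def __init__(self, alpha=0.1, initial=1.0):
            self.alpha = alpha      # how quickly we adapt to recent queries
            self.scale = initial    # current ms-per-cost-unit guess

        def estimate_ms(self, plan_cost):
            return plan_cost * self.scale

        def update(self, plan_cost, actual_ms):
            if plan_cost > 0:
                observed = actual_ms / plan_cost
                # Exponentially weighted moving average of the observed ratio;
                # this value could be persisted and re-evaluated with each query.
                self.scale += self.alpha * (observed - self.scale)

    # e.g. scale = CostScale(); scale.update(1200.0, 340.0); scale.estimate_ms(5000.0)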

I agree about accuracy. We just have to experiment. Even when the planner makes "good" row estimates, the cost and time can still be off by orders of magnitude. Keeping track of plans, execution times, and estimates over time is a good idea.

Briefly talked to @masinoa about building a simple model to estimate the time of a Harvest query. The general idea is to break the query down into components, either high-level ones such as filter and view concepts or low-level ones like tables, joins, etc. When a query is executed with a component present in it, the execution time will be logged for that individual component. We could include the query cost estimate as well, but I am not sure how much that is needed if we have the execution time.

One could ask the question, what is the estimated query time when a component is used? The standard deviation would be calculated to determine roughly how accurate the estimate is. A query composed of multiple components would be some weighted average of the query times across components.
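
A minimal sketch of that calculation, assuming we already have the logged execution times per component; the component names, numbers, and the equal-weight default are hypothetical:

    from math import sqrt

    def component_stats(times):
        "Mean and standard deviation of the logged execution times for one component."
        n = float(len(times))
        mean = sum(times) / n
        var = sum((t - mean) ** 2 for t in times) / n
        return mean, sqrt(var)

    def estimate_query_time(components, weights=None):
        """Weighted average of per-component mean times, plus a rough error bar.

        components -- mapping of component key -> list of logged times (seconds)
        weights    -- optional mapping of component key -> weight (defaults to 1.0)
        """
        weights = weights or {}
        total_weight = estimate = variance = 0.0
        for key, times in components.items():
            w = weights.get(key, 1.0)
            mean, std = component_stats(times)
            estimate += w * mean
            variance += (w * std) ** 2
            total_weight += w
        return estimate / total_weight, sqrt(variance) / total_weight

    # Hypothetical logged times (seconds):
    # estimate_query_time({'portal_subject': [0.8, 1.1, 0.9], 'path_histology': [2.4, 2.9]})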

To build the model we could execute a set of queries built from the published Harvest fields and concepts, recording the query and execution times. At the end we would calculate the estimated query time for each original query from our data and compare it with the actual execution time that was recorded.

One advantage to this approach (if it works) is that it's database agnostic since it doesn't depend on a query plan.

Another indicator that could be useful would be the query planner's estimated number of rows to be returned (db-specific, of course, but much easier than estimated time). This could be used as a placeholder value for the root model count while the real count is being performed. "Approx. 1.1M individuals, counting ..." .
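
A sketch of pulling that estimate out of the plan, PostgreSQL-specific as noted, assuming a Django connection and a queryset produced by Harvest (this is not the actual Harvest API, just an illustration):

    import json

    from django.db import connection

    def estimated_rows(queryset):
        "Planner's row estimate for a queryset -- a cheap placeholder for the real count."
        sql, params = queryset.query.sql_with_params()
        cursor = connection.cursor()
        cursor.execute('EXPLAIN (FORMAT JSON) ' + sql, params)
        row = cursor.fetchone()[0]
        # Depending on the driver, the JSON plan may already be parsed.
        plan = row if isinstance(row, list) else json.loads(row)
        return plan[0]['Plan']['Plan Rows']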

I think it could be even more valuable in the "Change Columns" dialog. Because of the Cartesian product problem, adding the "wrong" column could result in millions or even billions of rows. Even with the offset/limit fix, a very large result set which is the result of multiple joins and is sorted to boot is going to be very slow.

@bruth I'm fascinated by your query time estimation model, but it sounds like a lot of work...

My suggestion would be that we first build a time-estimation performance tracker, which compares some estimate of time against the measured actual time of a query and outputs the difference.

Then I think we should start iterating through time-estimate models, starting with the simplest, until we are happy with the results. Your suggestion is certainly somewhere on that path, but I'm not sure it represents the optimal effort-to-results ratio.

I'm also in favor of starting with a simple cost model. The idea that I started sketching out earlier today creates an additional avocado subcommand called analyze which would (surprise) analyze the DataFields associated with a project and build a (very) naive cost model based on the associated models.

        import time

        # assuming DataField is Avocado's metadata model
        from avocado.models import DataField

        models = []
        seen = set()

        # dm_baseline would represent the most expensive table in the data model so that we could
        # potentially display relative cost for each model
        dm_baseline = 0

        for f in DataField.objects.iterator():
            if f.model not in seen:
                # time a full count on the model's table as a naive per-table cost
                t0 = time.time()
                c = f.model.objects.count()
                td = time.time() - t0

                models.append({
                    'model': f.model,
                    'count': c,
                    'cost': td,
                })
                seen.add(f.model)

                # keep track of the slowest (most expensive) table as the baseline
                if td > dm_baseline:
                    dm_baseline = td
Again, I realize this is super simple -- and maybe not the best cost model -- but I wanted to start somewhere.

I then thought we could perform a lookup on our cost model here, retrieving the alias_refcount of the associated query with something like ref_count = instance.apply(queryset=queryset).query.alias_refcount. This gives us a mapping of tables and their join counts, i.e. {u'path_histology': 1, u'tumor_location_in': 1, u'portal_subject': 3, u'nautilus_aliquot': 1}. We could then apply our cost model to that table mapping and come up with some kind of quick estimate of query time/complexity. Not sure yet what the feedback loop would look like for calibration, but I figured I'd throw this out there for consideration.
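
Putting those two pieces together might look something like the sketch below, where table_costs is the per-table cost mapping built by the analyze step above; the numbers in the docstring are illustrative only.

    def estimate_complexity(queryset, table_costs):
        """Naive complexity estimate: sum of per-table cost times its join refcount.

        table_costs -- mapping of table name -> cost from the analyze step,
                       e.g. {u'portal_subject': 0.02, u'path_histology': 1.7}
        """
        ref_counts = queryset.query.alias_refcount
        return sum(table_costs.get(alias, 0.0) * count
                   for alias, count in ref_counts.items())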

It is great that we all have an idea of how to pursue this estimator. The best thing to do is for each of us to implement our models and try them out at the next retreat day. It may be desirable to be able to swap the model in a Harvest project to customize how things are calculated.

It appears that all of the models described take zero or more of the following as inputs:

  • Parts of a query (expressions, tables, joins)
  • External cost factors of the query (query plan cost, execution time)

and produce an estimated query cost. The query itself and its execution time would be logged for testing and validation of the model.

To expand on my idea a bit more: in order for the model to learn as the database changes, the inputs and outputs of the model need to be persisted. This also provides a way of calculating the frequency of a component. Each unique component would be stored once, but will likely be present in multiple queries. More importantly, it provides a way of seeing how the model changes over time. We can tweak the model and run it against the same set of queries to see if it improves.

A relational model for storing these data could look like this:

from django.db import models

class Query(models.Model):
    sql = models.TextField()
    actual_cost = models.IntegerField()
    estimated_cost = models.IntegerField()

class Component(models.Model):
    key = models.TextField(unique=True)
    weight = models.FloatField(default=1.0)

class QueryComponent(models.Model):
    query = models.ForeignKey(Query)
    component = models.ForeignKey(Component)
    value = models.TextField(null=True)
    estimated_cost = models.IntegerField()

The Component.key field would be the table name, join pair, unbounded or bounded condition expression, external cost factor name, etc. QueryComponent.value would store the runtime value if the component has one. A few examples:

  • An unbounded component age > ? would be stored once, assuming the cost is the same regardless of the value. When a query is executed using that component, the stored value would be the actual parameter value, e.g. 10.
  • A bounded component could be something like text ilike '%foo%', which is a notoriously expensive query. The cost would be much different from text ilike 'foo%', which is anchored on one side. To prevent lumping the two together, the former may be stored as a bounded component while others would be unbounded.
  • An external cost factor, such as the Postgres query plan cost, where the value would be the cost reported by an explain.

In addition, something like Component.weight could be added to allow for tuning each component if there is a need.
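
For example, once some QueryComponent rows have been logged, the estimate for a new query built from a given set of component keys could be assembled roughly like this (a sketch against the models above, answering "what is the typical actual cost of queries that used this component?"):

    from django.db.models import Avg

    def estimate_from_history(component_keys):
        "Weighted average of the actual cost of past queries containing each component."
        total_weight = 0.0
        total_cost = 0.0
        for key in component_keys:
            try:
                component = Component.objects.get(key=key)
            except Component.DoesNotExist:
                continue
            avg = QueryComponent.objects.filter(component=component) \
                .aggregate(avg=Avg('query__actual_cost'))['avg']
            if avg is None:
                continue
            total_cost += component.weight * avg
            total_weight += component.weight
        return total_cost / total_weight if total_weight else None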

Re: Byron's discovery of http://yunchi.org/publication/13icde_prediction.pdf -- cool, that conclusion makes sense to me. The optimizer/planner takes into consideration a lot of meta-information about the data (most common values, distributions, ...) and is also in a good position to guess how many sequential or random reads or memory accesses each operation will take. As flawed as the optimizer can sometimes be with large data, multiple joins, and complex query conditions, it would seem like a good idea to leverage its smarts. One thing the PG optimizer does not do (but commercial DBs might?) is learn from its failures, but we can't do anything about that except remember the costs of queries we have run before. My understanding is that the optimizer's flaws are mostly related to it having inadequate statistics about the data ("this column has a bimodal distribution with such and such characteristics") and having no tools yet for dealing with statistics it doesn't have access to.