dotnet/efcore

OrderBy lifting specification

divega opened this issue · 7 comments

Copying here our old spec from EF 4 (produced originally by Colin Meek and I), because I think it can still help set direction for EF Core every time we need to make a decision, e.g. recently in #16144. I have made a few edits to remove details that aren't relevant, like ASP.NET WebForms query extenders or EF DbExpression.

OrderBy Lifting for LINQ to Entities

Current behavior

The current design of LINQ to Entities requires that OrderBy is specified immediately before paging operators like Skip and Take/Top. Also, when certain operators (especially filters) are applied on top of a query containing an OderBy operator, the OrderBy is often ignored.

The basic contract is that OrderBy applies only to the next operation in the query tree, which is in alignment with the behavior of the underlying relational database. For instance, in SQL, when composing a query on top of another query, it is illegal for the inner query to be ordered:

SELECT TOP(1) *
FROM (SELECT *
      FROM Products 
      ORDER BY UnitPrice DESC) AS OrderedProducts 
WHERE ProductName LIKE 'c%'

In theory, this query could be modified to preserve the order by placing a SELECT TOP (100) PERCENT in the inner query, but there are no guarantees that further operations applied in the outer query will preserve the ordering (i.e. the database engine could choose a completely different index to optimize a filter operation), and therefore the end result could be that this query would be forcing an expensive sorting operation to happen in vain. For this reason, Entity Framework chooses to assume that the OrderBy in an inner query like this can be ignored.

The following query is not an actual query translated from EF, but it server the purpose of illustrating how even if the inner query specifies ordering, this one is overridden by the filter in the outer query:

SELECT TOP(1) *
FROM (SELECT TOP (100) PERCENT *
      FROM Products 
      ORDER BY UnitPrice DESC) AS OrderedProducts 
WHERE ProductName LIKE 'c%'

Here the results of for the inner query when executed alone on a version of the Northwind sample database:

ProductID ProductName SupplierID CategoryID QuantityPerUnit UnitPrice
38 Côte de Blaye 18 1 12 - 75 cl bottles 263.50
29 Thüringer Rostbratwurst 12 6 50 bags x 30 sausgs. 123.79
9 Mishi Kobe Niku 4 6 18 - 500 g pkgs. 97.00

However, the results for the full query are like this:

ProductID ProductName SupplierID CategoryID QuantityPerUnit UnitPrice
1 Chai 1 1 10 boxes x 20 bags 18.00

Expected behavior

The current design differs from common user expectations and intent, in particular, of LINQ users who are accustomed to a contract in which the query processor tries to honor the semantics of IEnumerable<T>.

In this contract, OrderBy becomes more of a declaration of the order in which results need to be returned thereafter, rather than as a query operator that is applied to the query results at a certain point.

LINQ to Objects produces ordered results trivially because operators are applied directly to an actual in-memory sequence. LINQ to SQL on the other side will very often lift the OrderBy operation in the query expression to produce a behavior that is consistent with LINQ to Objects. For the aforementioned example, LINQ to SQL will most likely produce a query similar to this:

SELECT TOP (1) *
FROM Products 
WHERE ProductName LIKE 'c%'
ORDER BY UnitPrice DESC

This will return the expected row:

ProductID ProductName SupplierID CategoryID QuantityPerUnit UnitPrice
38 Côte de Blaye 18 1 12 - 75 cl bottles 263.50

Here is another way to write this query using LINQ that clearly illustrates the significance of the gap between the current and the expected behavior:

context.Products
    .OrderBy(p => -p.UnitPrice)
    .First(p => p.ProdcutName.StartsWith("c"));

In LINQ to SQL or LINQ to Objects, First() will correctly return the most expensive product with a name that starts with “c”.

In LINQ to Entities, on the other hand, the price is ignored and the first product that has a name that starts with “c” in an arbitrary order (generally the order of the most efficient index the database finds that optimizes the application of the filter predicate) is returned.

Proposed solution

The proposed solution consists of applying an OrderBy lifting operation as a step in LINQ translation . This will result in a change of the observable behavior of LINQ to Entities in the sense that the order requested will be more often preserved on query composition.

Details

The basic strategy is to rewrite expressions bottom up, lifting sorts as we go. One slightly catch: as the sorted expression is lifted, it may optionally include a projection, in which case some additional rebinding may be required. The following argument patterns to collection operators are recognized:

  1. PS: Projectp(Sorts(TheRest))
  2. S: Sorts(TheRest)
  3. expr: arbitrary expression
Operator Arguments Translation
Apply/ CrossJoin/ Join   None: if the left hand-side is sorted with an identifying expression, it is possible to preserve order. However, we don’t currently have a mechanism to determine if an expression is identifying for a collection.
Distinct (PS) None: If we attempt to pull the sort above the distinct, we risk returning duplicates: Projectp(Sorts(Distinct(TheRest))). We cannot pull the sort above the project in general because we cannot invert the projection function: Sorts compose inverse(p)(Distinct(Projectp(TheRest)))
  (S) Sorts(Distinct(TheRest))
Except/ Intersect/ UnionAll (PS, expr) None: we can’t invert the projection to lift the sort, e.g.: Sorts compose inverse(p)(op(Projectp(TheRest), expr)
  (S, expr) The first sequence argument controls the order of the result, which is consistent with the behavior of LINQ to Objects: Sorts(op(TheRest, expr))
Filterf (PS) Projectp(Sorts(Filterf compose p(TheRest))
  (S) Sorts(Filterf(TheRest))
GroupBy   None: hopefully for obvious reasons
Limit/ Skip   We replicate the sort abovethe Limit/Skip: Sorts(Limitl(Skips(Sorts(TheRest))))
OfType (PS) Slightly interesting: this operator is mostly like a filter, but requires rebinding of elements when the target type is a supertype of the input. Sorts [compose treat] compose p(OfType(Projectp(TheRest)))
  (S) Sorts [compose treat](OfType(TheRest))
Projectq (PS) Projectq compose p(Sorts(TheRest))
  (S) None: this works already.
Sortt (PS) Projectp(Sort{s compose p, t}(TheRest))
  (S) Sort{s, t}(TheRest)
LINQSkip (S, expr) Skips(TheRest, expr) current behavior
  (PS, expr) Projectp(Skips compose p(TheRest)) new behavior to account for above patterns

Possible additional cases

  1. We might consider handling expr.OrderBy(…).Take(…).Skip(…), although it is an odd construct.
  2. LINQSkip could also get the Sort operation replicated above.
  3. We could also lift OrderBy for Project if the projection includes the Sort expression.

Note on LINQSkip

LINQSkip is just the LINQ Skip operator. In EF there is a DbSkipExpression which includes an Input, a Count and a SortOrder. The LINQ Skip operator takes only an Input and a Count. DbSkip and LINQSkip therefore have different handling.

Also see #16086

Here is a different take on the transformations implemented in EF4, using a different notation that may be more helpful. I haven't gone trough it yet to verify if there are any incongruities.

  1. Root: source.Sort(o)
    1. source.Sort(o).Filter(f) → source.Filter(f).Sort(o)
    2. source.Sort(o).Project(p)
    3. source.Sort(o).Limit (k)
    4. source.Sort(o).Skip(k) → source.Skip(k, o)
  2. Root: source.Skip(k, o)
    1. source.Skip(k, o).Filter(f) → source.Skip(k, o).Filter(f).Sort(o)
    2. source.Skip(k, o).Project(p)
    3. source.Skip(k, o).Limit (k2)
    4. source.Skip(k, o).Skip(k2) → source.Skip(k + k2, o) or source.Skip(k, o).Skip(k2, o) when either k or k2 is not a constant.
  3. Root: source.Sort(o).Project(p)
    1. source.Sort(o).Project(p).Filter(f) → source.Filter(e => f(p(e))).Sort(o).Project(p)
    2. source.Sort(o).Project(p).Project(p2) → source.Sort(o).Project(e => p2(p(e)))
    3. source.Sort(o).Project(p). Limit (k)
    4. source.Sort(o).Project(p).Skip(k) → source.Skip(k, o).Project(p)
  4. Root: source.Skip(k, o).Project(p)
    1. source.Skip(k, o).Project(p).Filter(f) → source.Skip(k, o).Filter(e => f(p(e))).Sort(o).Project(p)
    2. source.Skip(k, o).Project(p).Project(p2) → source.Skip(k, o).Project(e => p2(p(e)))
    3. source.Skip(k, o).Project(p).Limit (k)
    4. source.Skip(k, o).Project(p).Skip(k2) → source.Skip(k + k2, o).Project(p) or source.Skip(k, o).Skip(k2, o).Project(p) when either k or k2 is not a constant.
  5. Root: source.Sort(o).Limit(k).Project(p)
    1. source.Sort(o).Limit(k).Project(p).Filter(f) → source.Sort(o).Limit(k).Filter(e => f(p(e))).Sort(o).Project(p)
    2. source.Sort(o).Limit(k).Project(p).Project(p2) → source.Sort(o).Limit(k).Project(e => p2(p(e)))
    3. source.Sort(o).Limit(k).Project(p).Limit(k2) → source.Sort(o).Limit(Min(k, k2)).Project(p) or source.Sort(o).Limit(k).Sort(o).Limit(k2).Project(p) when either k or k2 is not a constant
    4. source.Sort(o).Limit(k).Project(p).Skip(k2) → source.Sort(o).Limit(k).Skip(k2, o).Project(p)
  6. Root: source.Skip(k, o).Limit(k2).Project(p)
    1. source.Skip(k, o).Limit(k2).Project(p).Filter(f) → source.Skip(k, o).Limit(k2).Filter(e => f(p(e))).Sort(o).Project(p)
    2. source.Skip(k, o).Limit(k2).Project(p).Project(p2) → source.Skip(k, o).Limit(k2).Project(e => p2(p(e)))
    3. source.Skip(k, o).Limit(k2).Project(p).Limit(k3) → source.Skip(k, o).Limit(Min(k2, k3)).Project(p) or source.Skip(k, o).Limit(k2).Sort(o).Limit(k3).Project(p) when either k or k2 is not a constant.
    4. source.Skip(k, o).Limit(k2).Project(p).Skip(k3) → source.Skip(k + k3, o).Limit(k2 – k3).Project(p) or source.Skip(k, o).Limit(k2).Skip(k3, o).Project(p) when either k, k2 or k3 is not a constant.
  7. Root: source.Sort(o).Limit(k)
    1. source.Sort(o).Limit(k).Filter(f) → source.Sort(o).Limit(k).Filter(f).Sort(o)
    2. source.Sort(o).Limit(k).Project(p)
    3. source.Sort(o).Limit(k).Limit(k2) → source.Sort(o).Limit(Min(k, k2)) or source.Sort(o).Limit(k).Sort(o).Limit(k2) when either k or k2 is not a constant.
    4. source.Sort(o).Limit(k).Skip(k2) → source.Sort(o).Limit(k).Skip(k2, o)
  8. Root: source.Skip(k, o).Limit(k2)
    1. source.Skip(k, o).Limit(k2).Filter(f) → source.Skip(k, o).Limit(k2).Filter(f).Sort(o)
    2. source.Skip(k, o).Limit(k2).Project(p)
    3. source.Skip(k, o).Limit(k2).Limit(k3) → source.Skip(k, o).Limit(Min(k2, k3)) or source.Skip(k, o).Limit(k2).Sort(o).Limit(k3) when either k2 or k3 is not a constant.
    4. source.Skip(k, o).Limit(k2).Skip(k3) → source.Skip(k, o).Limit(k2).Skip(k3, o)
  9. Root: source.Sort(p).Project(p).Limit(k)
    1. source.Sort(o).Project(p).Limit(k).* is equivalent to transformation for source.Sort(o).Limit(k).Project(p).*
  10. Root: source.Skip(k, o).Project(p).Limit(k2)
    1. source.Skip(k, o).Project(p).Limit(k2).* is equivalent to transformation for source.Skip(k, o).Limit(k2).Project(p).*

I have added some observations #16086 (comment)

While a lot of things in above specification seem to be working already. I don't think we need to do anything anymore unless we have a customer issue.

Putting this in the backlog to revisit and see if there are any other parts we should do.

3 cases where we are differing from original specification

  • We don't lift ordering if Distinct is specified.

    • We clear ordering when distinct is applied (unless skip/take in which case we do a pushdown first). This was one point been discussed few times in past if we should do it or not.
    • OrderBy term needs to appear in projection to apply distinct
    • Where we deviate from spec is pushdown. Spec lifts ordering out of distinct if all the columns of distinct are present.
  • We preserve ordering from outer in case of Cross join. Spec says not possible. It just happens to be. In any join outer ordering will remain.

  • Set operations - Spec preserves ordering from source1. We currently don't. We could possibly add it.

Removing from milestone to discuss with the team.

We decided in design meeting that we are preserving current behaviour in source code even though it does not match with specification. We will reconsider if we get a customer issue specifying current behaviour is incorrect.