OrderBy lifting specification
divega opened this issue · 7 comments
Copying here our old spec from EF 4 (produced originally by Colin Meek and I), because I think it can still help set direction for EF Core every time we need to make a decision, e.g. recently in #16144. I have made a few edits to remove details that aren't relevant, like ASP.NET WebForms query extenders or EF DbExpression
.
OrderBy Lifting for LINQ to Entities
Current behavior
The current design of LINQ to Entities requires that OrderBy
is specified immediately before paging operators like Skip and Take/Top. Also, when certain operators (especially filters) are applied on top of a query containing an OderBy
operator, the OrderBy
is often ignored.
The basic contract is that OrderBy
applies only to the next operation in the query tree, which is in alignment with the behavior of the underlying relational database. For instance, in SQL, when composing a query on top of another query, it is illegal for the inner query to be ordered:
SELECT TOP(1) *
FROM (SELECT *
FROM Products
ORDER BY UnitPrice DESC) AS OrderedProducts
WHERE ProductName LIKE 'c%'
In theory, this query could be modified to preserve the order by placing a SELECT TOP (100) PERCENT
in the inner query, but there are no guarantees that further operations applied in the outer query will preserve the ordering (i.e. the database engine could choose a completely different index to optimize a filter operation), and therefore the end result could be that this query would be forcing an expensive sorting operation to happen in vain. For this reason, Entity Framework chooses to assume that the OrderBy
in an inner query like this can be ignored.
The following query is not an actual query translated from EF, but it server the purpose of illustrating how even if the inner query specifies ordering, this one is overridden by the filter in the outer query:
SELECT TOP(1) *
FROM (SELECT TOP (100) PERCENT *
FROM Products
ORDER BY UnitPrice DESC) AS OrderedProducts
WHERE ProductName LIKE 'c%'
Here the results of for the inner query when executed alone on a version of the Northwind sample database:
ProductID | ProductName | SupplierID | CategoryID | QuantityPerUnit | UnitPrice |
---|---|---|---|---|---|
38 | Côte de Blaye | 18 | 1 | 12 - 75 cl bottles | 263.50 |
29 | Thüringer Rostbratwurst | 12 | 6 | 50 bags x 30 sausgs. | 123.79 |
9 | Mishi Kobe Niku | 4 | 6 | 18 - 500 g pkgs. | 97.00 |
However, the results for the full query are like this:
ProductID | ProductName | SupplierID | CategoryID | QuantityPerUnit | UnitPrice |
---|---|---|---|---|---|
1 | Chai | 1 | 1 | 10 boxes x 20 bags | 18.00 |
Expected behavior
The current design differs from common user expectations and intent, in particular, of LINQ users who are accustomed to a contract in which the query processor tries to honor the semantics of IEnumerable<T>
.
In this contract, OrderBy
becomes more of a declaration of the order in which results need to be returned thereafter, rather than as a query operator that is applied to the query results at a certain point.
LINQ to Objects produces ordered results trivially because operators are applied directly to an actual in-memory sequence. LINQ to SQL on the other side will very often lift the OrderBy
operation in the query expression to produce a behavior that is consistent with LINQ to Objects. For the aforementioned example, LINQ to SQL will most likely produce a query similar to this:
SELECT TOP (1) *
FROM Products
WHERE ProductName LIKE 'c%'
ORDER BY UnitPrice DESC
This will return the expected row:
ProductID | ProductName | SupplierID | CategoryID | QuantityPerUnit | UnitPrice |
---|---|---|---|---|---|
38 | Côte de Blaye | 18 | 1 | 12 - 75 cl bottles | 263.50 |
Here is another way to write this query using LINQ that clearly illustrates the significance of the gap between the current and the expected behavior:
context.Products
.OrderBy(p => -p.UnitPrice)
.First(p => p.ProdcutName.StartsWith("c"));
In LINQ to SQL or LINQ to Objects, First() will correctly return the most expensive product with a name that starts with “c”.
In LINQ to Entities, on the other hand, the price is ignored and the first product that has a name that starts with “c” in an arbitrary order (generally the order of the most efficient index the database finds that optimizes the application of the filter predicate) is returned.
Proposed solution
The proposed solution consists of applying an OrderBy
lifting operation as a step in LINQ translation . This will result in a change of the observable behavior of LINQ to Entities in the sense that the order requested will be more often preserved on query composition.
Details
The basic strategy is to rewrite expressions bottom up, lifting sorts as we go. One slightly catch: as the sorted expression is lifted, it may optionally include a projection, in which case some additional rebinding may be required. The following argument patterns to collection operators are recognized:
- PS: Projectp(Sorts(TheRest))
- S: Sorts(TheRest)
- expr: arbitrary expression
Operator | Arguments | Translation |
---|---|---|
Apply/ CrossJoin/ Join | None: if the left hand-side is sorted with an identifying expression, it is possible to preserve order. However, we don’t currently have a mechanism to determine if an expression is identifying for a collection. | |
Distinct | (PS) | None: If we attempt to pull the sort above the distinct, we risk returning duplicates: Projectp(Sorts(Distinct(TheRest))). We cannot pull the sort above the project in general because we cannot invert the projection function: Sorts compose inverse(p)(Distinct(Projectp(TheRest))) |
(S) | Sorts(Distinct(TheRest)) | |
Except/ Intersect/ UnionAll | (PS, expr) | None: we can’t invert the projection to lift the sort, e.g.: Sorts compose inverse(p)(op(Projectp(TheRest), expr) |
(S, expr) | The first sequence argument controls the order of the result, which is consistent with the behavior of LINQ to Objects: Sorts(op(TheRest, expr)) | |
Filterf | (PS) | Projectp(Sorts(Filterf compose p(TheRest)) |
(S) | Sorts(Filterf(TheRest)) | |
GroupBy | None: hopefully for obvious reasons | |
Limit/ Skip | We replicate the sort abovethe Limit/Skip: Sorts(Limitl(Skips(Sorts(TheRest)))) | |
OfType | (PS) | Slightly interesting: this operator is mostly like a filter, but requires rebinding of elements when the target type is a supertype of the input. Sorts [compose treat] compose p(OfType(Projectp(TheRest))) |
(S) | Sorts [compose treat](OfType(TheRest)) | |
Projectq | (PS) | Projectq compose p(Sorts(TheRest)) |
(S) | None: this works already. | |
Sortt | (PS) | Projectp(Sort{s compose p, t}(TheRest)) |
(S) | Sort{s, t}(TheRest) | |
LINQSkip | (S, expr) | Skips(TheRest, expr) current behavior |
(PS, expr) | Projectp(Skips compose p(TheRest)) new behavior to account for above patterns |
Possible additional cases
- We might consider handling expr.OrderBy(…).Take(…).Skip(…), although it is an odd construct.
- LINQSkip could also get the Sort operation replicated above.
- We could also lift OrderBy for Project if the projection includes the Sort expression.
Note on LINQSkip
LINQSkip is just the LINQ Skip operator. In EF there is a DbSkipExpression which includes an Input, a Count and a SortOrder. The LINQ Skip operator takes only an Input and a Count. DbSkip and LINQSkip therefore have different handling.
Here is a different take on the transformations implemented in EF4, using a different notation that may be more helpful. I haven't gone trough it yet to verify if there are any incongruities.
- Root: source.Sort(o)
- source.Sort(o).Filter(f) → source.Filter(f).Sort(o)
- source.Sort(o).Project(p)
- source.Sort(o).Limit (k)
- source.Sort(o).Skip(k) → source.Skip(k, o)
- Root: source.Skip(k, o)
- source.Skip(k, o).Filter(f) → source.Skip(k, o).Filter(f).Sort(o)
- source.Skip(k, o).Project(p)
- source.Skip(k, o).Limit (k2)
- source.Skip(k, o).Skip(k2) → source.Skip(k + k2, o) or source.Skip(k, o).Skip(k2, o) when either k or k2 is not a constant.
- Root: source.Sort(o).Project(p)
- source.Sort(o).Project(p).Filter(f) → source.Filter(e => f(p(e))).Sort(o).Project(p)
- source.Sort(o).Project(p).Project(p2) → source.Sort(o).Project(e => p2(p(e)))
- source.Sort(o).Project(p). Limit (k)
- source.Sort(o).Project(p).Skip(k) → source.Skip(k, o).Project(p)
- Root: source.Skip(k, o).Project(p)
- source.Skip(k, o).Project(p).Filter(f) → source.Skip(k, o).Filter(e => f(p(e))).Sort(o).Project(p)
- source.Skip(k, o).Project(p).Project(p2) → source.Skip(k, o).Project(e => p2(p(e)))
- source.Skip(k, o).Project(p).Limit (k)
- source.Skip(k, o).Project(p).Skip(k2) → source.Skip(k + k2, o).Project(p) or source.Skip(k, o).Skip(k2, o).Project(p) when either k or k2 is not a constant.
- Root: source.Sort(o).Limit(k).Project(p)
- source.Sort(o).Limit(k).Project(p).Filter(f) → source.Sort(o).Limit(k).Filter(e => f(p(e))).Sort(o).Project(p)
- source.Sort(o).Limit(k).Project(p).Project(p2) → source.Sort(o).Limit(k).Project(e => p2(p(e)))
- source.Sort(o).Limit(k).Project(p).Limit(k2) → source.Sort(o).Limit(Min(k, k2)).Project(p) or source.Sort(o).Limit(k).Sort(o).Limit(k2).Project(p) when either k or k2 is not a constant
- source.Sort(o).Limit(k).Project(p).Skip(k2) → source.Sort(o).Limit(k).Skip(k2, o).Project(p)
- Root: source.Skip(k, o).Limit(k2).Project(p)
- source.Skip(k, o).Limit(k2).Project(p).Filter(f) → source.Skip(k, o).Limit(k2).Filter(e => f(p(e))).Sort(o).Project(p)
- source.Skip(k, o).Limit(k2).Project(p).Project(p2) → source.Skip(k, o).Limit(k2).Project(e => p2(p(e)))
- source.Skip(k, o).Limit(k2).Project(p).Limit(k3) → source.Skip(k, o).Limit(Min(k2, k3)).Project(p) or source.Skip(k, o).Limit(k2).Sort(o).Limit(k3).Project(p) when either k or k2 is not a constant.
- source.Skip(k, o).Limit(k2).Project(p).Skip(k3) → source.Skip(k + k3, o).Limit(k2 – k3).Project(p) or source.Skip(k, o).Limit(k2).Skip(k3, o).Project(p) when either k, k2 or k3 is not a constant.
- Root: source.Sort(o).Limit(k)
- source.Sort(o).Limit(k).Filter(f) → source.Sort(o).Limit(k).Filter(f).Sort(o)
- source.Sort(o).Limit(k).Project(p)
- source.Sort(o).Limit(k).Limit(k2) → source.Sort(o).Limit(Min(k, k2)) or source.Sort(o).Limit(k).Sort(o).Limit(k2) when either k or k2 is not a constant.
- source.Sort(o).Limit(k).Skip(k2) → source.Sort(o).Limit(k).Skip(k2, o)
- Root: source.Skip(k, o).Limit(k2)
- source.Skip(k, o).Limit(k2).Filter(f) → source.Skip(k, o).Limit(k2).Filter(f).Sort(o)
- source.Skip(k, o).Limit(k2).Project(p)
- source.Skip(k, o).Limit(k2).Limit(k3) → source.Skip(k, o).Limit(Min(k2, k3)) or source.Skip(k, o).Limit(k2).Sort(o).Limit(k3) when either k2 or k3 is not a constant.
- source.Skip(k, o).Limit(k2).Skip(k3) → source.Skip(k, o).Limit(k2).Skip(k3, o)
- Root: source.Sort(p).Project(p).Limit(k)
- source.Sort(o).Project(p).Limit(k).* is equivalent to transformation for source.Sort(o).Limit(k).Project(p).*
- Root: source.Skip(k, o).Project(p).Limit(k2)
- source.Skip(k, o).Project(p).Limit(k2).* is equivalent to transformation for source.Skip(k, o).Limit(k2).Project(p).*
I have added some observations #16086 (comment)
While a lot of things in above specification seem to be working already. I don't think we need to do anything anymore unless we have a customer issue.
Putting this in the backlog to revisit and see if there are any other parts we should do.
3 cases where we are differing from original specification
-
We don't lift ordering if Distinct is specified.
- We clear ordering when distinct is applied (unless skip/take in which case we do a pushdown first). This was one point been discussed few times in past if we should do it or not.
- OrderBy term needs to appear in projection to apply distinct
- Where we deviate from spec is pushdown. Spec lifts ordering out of distinct if all the columns of distinct are present.
-
We preserve ordering from outer in case of Cross join. Spec says not possible. It just happens to be. In any join outer ordering will remain.
-
Set operations - Spec preserves ordering from source1. We currently don't. We could possibly add it.
Removing from milestone to discuss with the team.
We decided in design meeting that we are preserving current behaviour in source code even though it does not match with specification. We will reconsider if we get a customer issue specifying current behaviour is incorrect.