dotnet/runtime

New System.Data.Common batching API

roji opened this issue · 245 comments

roji commented

This proposal has been approved in principle, subject to confirmation by several database providers that the API shape is good.

Issues tracking provider implementations:


This issue is based on previous discussions in #15375 and #17445.

Background

Batching multiple SQL statements can be critical for database performance, as they can be executed in a single roundtrip rather than waiting for each statement to complete before sending the next one. System.Data.Common doesn't currently provide a first-class API for this, but some providers (e.g. SqlClient) allow batching statements by concatenating them together into DbCommand.CommandText, separated by semicolons:

var cmd = connection.CreateCommand();
cmd.CommandText = "INSERT INTO table VALUES (1); INSERT INTO table VALUES (2)";

The problem with this approach is that most databases require a separate protocol message for each statement (e.g. PostgreSQL, MySQL), forcing the database's ADO.NET provider to parse the SQL client-side and split it on semicolons. This is both unreliable (parsing SQL is difficult) and bad for performance - ideally an ADO.NET provider should simply forward user-provided SQL to the database, without parsing or rewriting it (see #25022 for a related issue forcing providers to parse SQL).

As a specific case, SQL Server does natively support multi-statement commands, but even there this has drawbacks for performance, detailed by @GSPP in dotnet/SqlClient#19. Also, as @bgribaudo pointed out, the current concatenation-based approach doesn't allow invoking multiple stored procedures in a batched fashion. SqlClient does include SqlCommandSet, which performs efficient, non-concatenation-based batching, but that class is internal, currently usable only via SqlDataAdapter, and not exposed for general usage (NHibernate apparently accesses it via reflection). This proposal would allow exposing SqlCommandSet via a public API.

Goals

  • Provide a structured way to execute multiple SQL statements in a single roundtrip, without any need for client-side parsing of SQL.
  • Keep the API consistent with other ADO.NET APIs, and specifically close to DbCommand (both are "executable"), to reduce the conceptual complexity for adoption.
  • Allow mixing in different types of statements in the same batch (insert, update, select). The current concatenation approach supports this, and so does our reader API (multiple resultsets).
  • Provide non-aggregated access to the number of rows affected, for each individual command in the batch.

Proposed API

public abstract class DbBatch : IDisposable, IAsyncDisposable
{
    public DbBatchCommandCollection BatchCommands => DbBatchCommands;
    protected abstract DbBatchCommandCollection DbBatchCommands { get; }

    public DbBatchCommand CreateBatchCommand() => CreateDbBatchCommand();
    protected abstract DbBatchCommand CreateDbBatchCommand();

    #region Execution (mirrors DbCommand)

    // Delegates to ExecuteDbDataReader
    public DbDataReader ExecuteReader(CommandBehavior behavior = CommandBehavior.Default);
    protected abstract DbDataReader ExecuteDbDataReader(CommandBehavior behavior);

    // Delegate to ExecuteDbDataReaderAsync
    public Task<DbDataReader> ExecuteReaderAsync(CancellationToken cancellationToken = default);
    public Task<DbDataReader> ExecuteReaderAsync(CommandBehavior behavior, CancellationToken cancellationToken = default);

    protected abstract Task<DbDataReader> ExecuteDbDataReaderAsync(CommandBehavior behavior, CancellationToken cancellationToken);

    public abstract int ExecuteNonQuery();
    public abstract Task<int> ExecuteNonQueryAsync(CancellationToken cancellationToken = default);

    public abstract object ExecuteScalar();
    public abstract Task<object> ExecuteScalarAsync(CancellationToken cancellationToken = default);

    #endregion

    #region Execution properties (mirrors DbCommand)

    public abstract int Timeout { get; set; }

    // Delegates to DbConnection
    public DbConnection Connection { get; set; }
    protected abstract DbConnection DbConnection { get; set; }

    // Delegates to DbTransaction
    public DbTransaction Transaction { get; set; }
    protected abstract DbTransaction DbTransaction { get; set; }

    #endregion
    
    #region Other methods mirroring DbCommand
    
    public abstract void Prepare();
    public abstract Task PrepareAsync(CancellationToken cancellationToken = default);
    public abstract void Cancel();
        
    #endregion

    #region Standard dispose pattern

    public void Dispose() { ... }
    protected virtual void Dispose(bool disposing) {}

    #endregion
}

public class DbBatchCommandCollection : Collection<DbBatchCommand>
{
}

public abstract class DbBatchCommand
{
    public abstract string CommandText { get; set; }
    public abstract CommandType CommandType { get; set; }
    public abstract int RecordsAffected { get; }
    
    public DbParameterCollection Parameters => DbParameterCollection;
    protected abstract DbParameterCollection DbParameterCollection { get; }
}

public class DbConnection
{
    public DbBatch CreateBatch() => CreateDbBatch();
    protected virtual DbBatch CreateDbBatch() => throw new NotSupportedException();
    
    public DbBatchCommand CreateBatchCommand() => CreateDbBatchCommand();
    protected virtual DbBatchCommand CreateDbBatchCommand() => throw new NotSupportedException();
    
    // Covers both CreateBatch and CreateBatchCommand
    public virtual bool CanCreateBatch => false;
}

public class DbProviderFactory
{
    public virtual DbBatch CreateBatch() => throw new NotSupportedException();
    public virtual DbBatchCommand CreateBatchCommand() => throw new NotSupportedException();
    
    // Covers both CreateBatch and CreateBatchCommand
    public virtual bool CanCreateBatch => false;
}

public class DbException
{
    public DbBatchCommand BatchCommand => DbBatchCommand;
    protected virtual DbBatchCommand DbBatchCommand => null;
}

General usage and examples

Usage is fairly trivial and aligned with the existing DbCommand APIs. Users first create a new DbBatch, either by calling DbProviderFactory.CreateBatch() (or DbConnection.CreateBatch()), or by instantiating one directly. Commands are added to the batch, execution properties (e.g. Timeout) are set on it, and finally one of the Execute*() methods is called on it. Connection, Transaction and Timeout are specified on the DbBatch, just as they are set on DbCommand for un-batched operations.

Here is a code sample using DbProviderFactory for database portability:

using var batch = dbProviderFactory.CreateBatch();

var cmd1 = dbProviderFactory.CreateBatchCommand();
cmd1.CommandText = "UPDATE table SET f1=@p1 WHERE f2=@p2";
var p1 = dbProviderFactory.CreateParameter();
var p2 = dbProviderFactory.CreateParameter();
p1.Value = 8;
p2.Value = 9;
cmd1.Parameters.Add(p1);
cmd1.Parameters.Add(p2);
batch.BatchCommands.Add(cmd1);

var cmd2 = dbProviderFactory.CreateBatchCommand();
cmd2.CommandText = "SELECT * FROM table WHERE f2=@p3";
var p3 = dbProviderFactory.CreateParameter();
p3.Value = 8;
cmd2.Parameters.Add(p3);
batch.BatchCommands.Add(cmd2);

batch.Connection = conn;
batch.Transaction = transaction;

using var reader = batch.ExecuteReader();
// reader contains one resultset, from the SELECT

The verbosity of this API corresponds to how ADO.NET currently looks; general usability improvements are planned for later.

Here is a suggested code sample working against a specific provider and using initializers for better readability and terseness:

using var batch = new XyzBatch
{
    Connection = conn,
    Transaction = transaction,
    BatchCommands =
    {
        new XyzBatchCommand("UPDATE table SET f1=@p1 WHERE f2=@p2")
        {
            Parameters =
            {
                new XyzParameter("p1", 8),
                new XyzParameter("p2", 9),
            }
        },
        new XyzBatchCommand("SELECT * FROM table WHERE f2=@p1")
        {
            Parameters =
            {
                new XyzParameter("p1", 8),
            }
        }
    }
};

using var reader = batch.ExecuteReader();
// reader contains one resultset, from the SELECT
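For completeness, here is a rough sketch of how the results of the batches above might then be consumed. Nothing here is new API - it's just the existing DbDataReader surface - and it assumes, as with today's concatenation-based batching on most providers, that the UPDATE contributes no resultset:

// The single reader spans the whole batch; the UPDATE produces no resultset,
// so the first resultset comes from the SELECT.
while (reader.Read())
{
    var f1 = reader.GetInt32(0);
    // ... process the row
}

// If the batch contained further statements returning resultsets, they would
// be reached the usual way, via NextResult():
while (reader.NextResult())
{
    while (reader.Read())
    {
        // ... process rows from the next resultset
    }
}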

Design and naming motivations

The original proposal had the batch holding a list of DbCommand instead of the new DbBatchCommand type. The change, originally proposed by @bgribaudo, was motivated by the following reasons:

  • DbCommand would have had some dead properties when executed as part of a batch (Connection, Transaction, CommandTimeout).
  • DbCommand is disposable, raising various questions around its lifespan (does the command get disposed with the batch or not, is it OK for DbException to reference a disposable object...).

The name DbBatchCommand will hopefully convey the similarity between that type and DbCommand (both hold SQL, parameters and a CommandType), while at the same time keeping them distinct (DbBatchCommand isn't executable outside of a DbBatch). The name DbStatement was also considered, although it seems this would introduce more confusion:

  • It's not directly identifiable as a batch component ("why can't I execute it?")
  • On some providers (e.g. SqlClient) a single DbBatchCommand can still contain concatenation batching, so a DbStatement would actually contain multiple statements.

Affected rows

DbCommand has APIs for finding out how many rows were affected by the command:

  • ExecuteNonQuery[Async]() returns an int
  • The DbDataReader.RecordsAffected property

However, both of these provide an aggregate number; if the command contained more than one statement, it's currently impossible to find out how many rows were affected by each statement separately.

Providing non-aggregated affected rows could be done via an overload of ExecuteNonQuery[Async]() which returns an int[], or which accepts a user-provided int[] and populates it. Both approaches have their complexities (perf, array management...), and as we're introducing a new DbBatchCommand type dedicated to batching, we propose to simply add a RecordsAffected property to it instead. The provider would populate this property for each DbBatchCommand to indicate how many rows were affected by that command.

As with DbDataReader.RecordsAffected, this property would contain -1 for SELECT statements, and 0 if no rows were affected or the statement failed.
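As a hedged illustration of how a caller might consume this (using the API shape proposed above, and assuming the provider has populated RecordsAffected during execution):

var totalAffected = batch.ExecuteNonQuery();
Console.WriteLine($"Total rows affected: {totalAffected}");

// Per-command breakdown, without the aggregation problem described above:
foreach (DbBatchCommand batchCommand in batch.BatchCommands)
{
    // -1 for SELECT statements, 0 if nothing was affected or the statement failed
    Console.WriteLine($"{batchCommand.CommandText}: {batchCommand.RecordsAffected}");
}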

Command behavior

DbCommand.ExecuteReader() accepts a CommandBehavior enum which allows execution behavior to be tweaked. An earlier revision of this proposal placed a CommandBehavior property on each DbBatchCommand, since in principle users may want different behaviors for different commands in a batch. However, a batch execution produces a single reader, and most behaviors only make sense at that level; requiring the behavior to be set on each and every batch command also made the common case clunkier, and would have required a way to distinguish a batch-wide default from the enum's default value (CommandBehavior.Default). The introduced complexity didn't seem to be worth it.

The current shape therefore mirrors DbCommand: DbBatch.ExecuteReader() accepts a CommandBehavior parameter which applies to the entire batch (see the 2021-06-29 entry in the edit history below). Note that not all providers will necessarily support all command behaviors when executing batches, and a provider should throw if a requested behavior can't be honored.

DbException.Command

Since we now execute batches, we should provide a way of letting users know which command in the batch triggered an exception. We propose to do this by introducing a new BatchCommand property on DbException (backed by a protected virtual DbBatchCommand property, as shown above), pointing to the relevant DbBatchCommand instance. In non-batched command execution this property would be left unassigned.
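A minimal sketch of the intended consumption, assuming the provider populates the property when a batched command fails:

try
{
    batch.ExecuteNonQuery();
}
catch (DbException e) when (e.BatchCommand != null)
{
    // The provider has identified which command in the batch triggered the failure.
    Console.WriteLine($"Failed batch command: {e.BatchCommand.CommandText}");
    throw;
}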

Notes on batch size management

In some cases, there may be some value in having the provider's batching implementation implicitly manage batch sizes. Examples:

  • There may be some database hard limit (on the number of statements, on the number of parameters). The implementation could break down the user-provided batch into several sub-batches, to save the user the trouble of doing that. This would mean wrapping multiple internal readers with a single reader, which isn't trivial (although we could provide an implementation).
    • Note: it seems that the SQL Server 2100-parameter limit will not apply here, as it applies to each individual statement rather than to the batch. So we're not sure we have a good real-world example.
  • It may be inefficient to batch statements below a certain number (because of overheads associated with batches). In this case the implementation would simply not batch.
    • For example, with current concatenation-based batching on SQL Server, batching 2 statements performs worse than not batching.

The pro here is to free the user from knowing low-level, database-specific details, transparently increasing perf and reducing friction for everyone (we're assuming the user has no special knowledge which would assist in the decision-making). The con here is more complexity and a general uncertainty that this is needed - any examples from the community would be helpful.

Regardless, this can be viewed as an implementation detail of a provider; although we can provide guidelines, at the end of the day each provider will implement the behavior they desire.

Backwards-compatibility

As a new API, no breakage is being introduced, but it has been proposed to provide a backwards compatibility shim for the new API that either performs concatenation-based batching, or simply executes without batching at all.

As of now, the decision is to not provide any sort of shim: trying to use the new batching API with providers that haven't implemented support for it will throw a NotSupportedException. The reasons for this are:

  • Silently falling back to non-batched execution would give the false expectation/illusion of batching without delivering on it.
  • Concatenation-based batching doesn't work if the different batched commands have named parameters with the same name (since the parameter lists would need to be unified, and names must be unique). This is expected to be a common scenario, as multiple identical statements are often executed with different parameter values (but identical names); see the sketch after this list.
  • We want to avoid the complexity and effort of implementing the shim. For example, a non-batching shim would need to somehow combine the readers from the individual commands into a single reader interface for user consumption. We don't think this is worth the effort.
  • Most applications work against a single database type, and so can either use the new API or not, based on knowledge of what the provider supports. Multi-database applications (and layers) are much more rare, and it's expected they can take the added burden of detecting whether batching is supported or not, and implementing two code paths.
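To illustrate the duplicate-parameter-name problem mentioned above (a sketch only, reusing the placeholder XyzBatchCommand/XyzParameter types from the earlier example):

// Both commands legitimately use a parameter named @p - fine for true batching,
// where each DbBatchCommand has its own parameter collection.
var cmd1 = new XyzBatchCommand("UPDATE data SET value=@p WHERE id=1")
{
    Parameters = { new XyzParameter("p", 8) }
};
var cmd2 = new XyzBatchCommand("UPDATE data SET value=@p WHERE id=2")
{
    Parameters = { new XyzParameter("p", 9) }
};

// A concatenation-based shim would have to merge these into a single DbCommand
// with one parameter collection:
//   "UPDATE data SET value=@p WHERE id=1; UPDATE data SET value=@p WHERE id=2"
// where @p can only be added once - so it must either rewrite the SQL to rename
// parameters (which requires parsing) or give up.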

Feature detection

As the API will throw for providers which don't implement it, a way must be provided for consumers to know whether the API is supported. The proposal is to add a CanCreateBatch bool property to both DbProviderFactory and DbConnection, alongside the CreateBatch() / CreateBatchCommand() methods.
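A short sketch of the intended consumption pattern for multi-database code, using only the members proposed above (statements here stands in for whatever SQL the application needs to run):

if (connection.CanCreateBatch)
{
    using var batch = connection.CreateBatch();
    // ... populate batch.BatchCommands and execute everything in one roundtrip
    batch.ExecuteNonQuery();
}
else
{
    // Provider hasn't implemented batching yet: fall back to one DbCommand
    // (and one roundtrip) per statement.
    foreach (var sql in statements)
    {
        using var cmd = connection.CreateCommand();
        cmd.CommandText = sql;
        cmd.ExecuteNonQuery();
    }
}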

Additional Notes

  • No guarantees are made with regards to batch transactionality (do the batch commands execute in an implicit transaction) or to error handling (if a command fails, are later commands automatically skipped). These details are provider-dependent. However, wherever possible, it is suggested that implementations provide both transactionality and skip-on-failure behavior.
  • DbBatch is disposable, like DbCommand. This is to allow for cases where it maps to a native resource that needs to be freed, and may also help with pooling/recycling of instances (to be elaborated in a separate proposal).
  • We hope to add a DbBatch.Add(string commandText) overload, which would create the DbBatchCommand internally and provide a lighter API, but some design issues need to be resolved around parameter management. We're deferring this for now as it's non-critical sugar.
  • An alternative proposal was made by @bgribaudo in #28794, in which a separate DbDataReader is returned for each DbBatchCommand, instead of one covering the entire batch. The goal was to help users identify command boundaries in resultsets when commands return more than one resultset (or even a variable number of resultsets). Due to the assumed rareness of the problem and the existence of other workarounds, we think it's better to keep a single DbDataReader for the entire batch.

Open questions

  • DbCommand and DbBatch share a lot of API surface. In theory there could be an interface capturing those similarities, allowing users to abstract away what's being executed (but this suffers from the usual versioning flaws of interfaces, modulo default interface methods).
  • Does DbBatch need to derive from System.ComponentModel.Component, like all other ADO.NET objects (for the Visual Studio designer)? There seems to be little chance that VS would be updated for this, but it may be worth doing for consistency.

Edit history

Date Modification
2019-02-14 DbCommandSet now implements IEnumerable<DbCommand> and includes the two GetEnumerator() methods, as per @bgrainger's suggestion
2019-02-18 Updated proposal with different concrete options for backwards compatibility, batch size management, and added a usage example.
2019-02-22 Updated with the current decision on backwards compatibility (no shim), added feature detection discussion, added non-aggregated rows affected to goals.
2019-02-23 Added missing DbConnection.CreateCommandSet() and renamed DbProviderFactory.CreateCommandSet() (it previously had a Db which doesn't belong). Added CanCreateCommandSet property to DbProviderFactory, DbConnection.
2019-02-25 Added public non-virtual DbConnection.CreateCommandSet() along protected virtual CreateDbCommandSet()
2019-03-04 Added DbConnection, DbTransaction to DbCommandSet. Corrected ExecuteReaderAsync overloads, corrected some comments.
2019-03-06 DbCommandSet now includes a Commands list instead of implementing IEnumerable<DbCommand> itself.
2019-03-12 Renamed DbCommandSet to DbBatch, and changed it to hold a new DbBatchCommand type instead of DbCommand. Added readable/terse code sample, added explanatory notes and rearranged sections slightly.
2019-04-18 Added CommandBehavior to DbBatchCommand and removed the commandBehavior parameter to DbBatch.ExecuteReader(); added a section to explain. Added RecordsAffected to DbBatchCommand and updated the affected rows section.
2019-04-20 Final typo pass (thanks @bgrainger) and added sentence on CommandBehavior support.
2019-05-14 Note on values of RecordsAffected (for SELECT, for failed statements), add standard Dispose pattern methods to DbBatch, fixed issues raised by @bgrainger
2019-05-23 Apply standard ADO.NET pattern to DbBatch with virtual ExecuteDbDataReader and non-virtual ExecuteDataReader
2019-06-17 Removed DbBatch.CancelAsync() following removal of DbCommand.CancelAsync() in #28596, thanks @Wraith2. Replaced IList<DbBatchCommand> with DbBatchCommandCollection.
2021-06-21 Make DbException.DbBatchCommand protected
2021-06-29 Move CommandBehavior from DbBatchCommand to DbBatch.ExecuteReader
2021-07-01 Remove the setter for DbBatchCommand.RecordsAffected
2021-07-01 Added DbBatch.CreateBatchCommand and CreateDbBatchCommand

Thanks @roji for putting this together. A couple of points:

  • It seems this completely supersedes
    https://github.com/dotnet/corefx/issues/31070 created by @GSPP. There is some good information there, but there is no point in doing this in a provider specific way if it is possible to do it in a provider agnostic way. Once we have this API, we can create a new issue for the SqlClient optimized implementation. Let me know if you agree and I can close dotnet/SqlClient#19.
  • In one of our latest email discussions, I proposed overloads of the Execute* methods that take an int[] argument for the records affected breakdown. Can we update the list of options with this?

Typo: IAsyncDisposal.

Will DbCommandSet implement the Dispose pattern (i.e., protected virtual Dispose(bool disposing), or public virtual Dispose() { /* no op */ }, or force all derived types to implement both Dispose and DisposeAsync (to satisfy the interfaces' requirements)?

Is it worth implementing IEnumerable a) to retrieve the commands in the batch, b) to enable C# collection initialization syntax?

Backwards-compatibility shim

Add:

  • The shim will set the CommandTimeout, Connection and Transaction properties on the batch command to the values of its corresponding properties.
  • Prepare[Async]() will simply call the same method on each of commands in the batch.

Or should it call the same method on the batch command it creates? (I don't understand the benefit of preparing the DbCommand objects in the DbCommandSet, which won't actually be executed themselves.)


MySQL implementation thoughts:

For the "text" protocol (the default), MySQL already supports all commands in a batch being sent in a single network packet. MySqlConnector would probably end up concatenating all the SQL from the DbCommandSet into one payload (maybe by performing multiple UTF-8 conversions into one output buffer, not by really concatenating actual C# strings). The DbCommandSet may not provide a practical enhancement for most MySQL users (who use the "text" protocol). Additionally, concatenating all the SQL means that it would be highly unlikely that DbException.Command could be set correctly (just as with the backwards-compatibility shim).

For the "binary" protocol (i.e., prepared commands), this would be a helpful API improvement. MySQL disallows more than one statement being prepared at once, so MySqlConnector has to parse SQL and split commands apart. The design goals of this proposal eliminate that requirement.

However, for MySQL, the second command in the batch wouldn't actually be sent until DbDataReader.NextResult[Async] is called. (I.e., DbCommandSet.ExecuteReaderAsync doesn't send all "binary protocol" commands in one network payload.) I don't think that would cause any problems with this proposal but executing “multiple SQL statements in a single roundtrip” would not be possible.

@roji, thank you for writing this up. I am excited about the possibilities!

Currently, the proposal allows command behavior to be specified once for the entire set of commands. However, command behavior can be command-specific. The same behavior may not apply to all commands that could go into a command set. To accommodate this, what would you think of changing the proposal so that behavior may optionally be specified when commands are added (e.g. Add(someCommand, CommandBehavior.KeyInfo))?

roji commented

@divega maybe it makes sense to repurpose dotnet/SqlClient#19 to track the implementation of the new batching API in SqlClient specifically? That seems to be its main focus in any case. Regarding the additional overload accepting an int[] for records affected, I'm not sure what you mean anymore... The proposal above includes an overload which returns int[], but of course feel free to edit and add. I think this is one of the main open questions which still need to be resolved.

roji commented

@bgrainger

Typo: IAsyncDisposal.

Thanks, fixed.

Will DbCommandSet implement the Dispose pattern (i.e., protected virtual Dispose(bool disposing), or public virtual Dispose() { /* no op */ }, or force all derived types to implement both Dispose and DisposeAsync (to satisfy the interfaces' requirements)?

Good question. Most of the ADO.NET base classes extend System.ComponentModel.Component, which implements IDisposable with the dispose pattern (i.e. a virtual, empty implementation of Dispose(bool)). It definitely seems like DbCommandSet should implement that pattern as well (possibly by extending Component). I'm guessing the same pattern applies to IAsyncDisposable - I'll look into this soon and update the proposal. Let me know what you think.

Is it worth implementing IEnumerable a) to retrieve the commands in the batch, b) to enable C# collection initialization syntax?

Richer access to the commands in the set is one thing I wanted to get feedback on. I'm not sure it makes sense for the command set to expose full list semantics - what's your view?

Unless I'm mistaken, C# collection initialization syntax actually has nothing to do with IEnumerable (which says nothing about addition/modification); all it requires is for the type in question to have an Add() method, which we already have in the proposal above.

The shim will set the CommandTimeout, Connection and Transaction properties on the batch command to the values of its corresponding properties.

Are you referring only to the shim, or to the DbCommandSet spec in general? For the latter, I'm not sure this makes much sense - what value does it provide over clearly specifying that they are ignored in batched execution? In addition, setting CommandTimeout could be construed as meaning that each command has its own timeout, which definitely doesn't work in the general batching case. So all in all, unless there's a persuasive reason, I'd rather we just ignored these.

  • Prepare[Async]() will simply call the same method on each of commands in the batch.

Or should it call the same method on the batch command it creates? (I don't understand the benefit of preparing the DbCommand objects in the DbCommandSet, which won't actually be executed themselves.)

You're right, this was a mistake. In theory it's possible to prepare the batch command as you suggest, and reuse it again with different parameter values. The issue is how to invalidate if any of the user-provided commands are modified (e.g. change in CommandText) - it doesn't seem like it will be possible to handle this by the shim. It's probably better to simply not do anything here.

Thanks for your comments on the MySQL provider. To be sure I'm understanding things correctly, here are a few questions:

  • Are parameters supported in the text protocol? If so, does your provider format parameter values in text, and are they integrated inline in the SQL or sent out-of-band?
  • Is the choice between the text and binary protocols something that the user controls somehow? Or is text used for regular commands and binary for prepared?
  • I seem to remember a conversation about this, but to be sure - so the MySQL binary protocol does not and cannot allow for batching at all (or is this just not yet supported by your driver)? Can't multiple messages (with increasing sequence IDs) be pipelined without waiting for previous responses? Also, this page seems to indicate that it may at least be possible to concatenate multiple statements in a single COM_STMT_PREPARE. I'm definitely far from a MySQL expert though, so all this may be wrong.
roji commented

@bgribaudo,

Currently, the proposal allows command behavior to be specified once for the entire set of commands. However, command behavior can be command-specific. The same behavior may not apply to all commands that could go into a command set. To accommodate this, what would you think of changing the proposal so that behavior may optionally be specified when commands are added (e.g. Add(someCommand, CommandBehavior.KeyInfo))?

Most of the values of CommandBehavior seem like they make sense only at the batch level - CloseConnection, SingleResult, SingleRow, SequentialAccess (since a single reader is returned, and it would likely not support switching between sequential and non-sequential access between commands across providers). A good way to think about CommandBehavior is that it actually affects the reader rather than the command - it's a parameter to DbCommand.ExecuteReader(), and cannot be specified where there is no reader (ExecuteNonQuery(), ExecuteScalar()). In batched mode, we're still going to have a single reader (possibly with multiple resultsets), so multiple CommandBehaviors don't really seem to fit well.

Will DbCommandSet implement the Dispose pattern

In favour of implementing the standard Dispose pattern for consistency. I have no experience with IAsyncDisposable but would lean towards implementing the same pattern if one exists.

Unless I'm mistaken, C# collection initialization syntax actually has nothing to do with IEnumerable

I believe you are mistaken; see docs.

Perhaps unfortunately, this does immediately imply access to the commands in the collection, which feels like a distraction from the point of this class. It does already have Add and Clear, and Count might make sense, which is halfway to ICollection<T>. But Contains, CopyTo, and Remove feel completely unnecessary, make implementations more complex, and imply some kind of equality for DbCommand objects. (This would probably end up just being reference equality, but is that what users expect?)

It doesn't feel like the point of DbCommandSet is to be a generic container of commands with full IList API. There's no obviously best answer here (to me), so I think I would lean towards keeping your proposed API, not implementing IEnumerable, and giving up C# collection initialisation until it becomes clear that that would be a high-value addition for typical users of the API. (I don't yet have a good sense for who will be the typical user; if it ends up primarily being used in ORM implementations, then some C# syntax sugar is less important.)

The shim will set the CommandTimeout, Connection and Transaction properties

Are you referring only to the shim

Yes, I was suggesting that you specify the behaviour of the shim implementation w.r.t. those properties. I think the shim should do something with them when creating its batch command, and this seems like the right thing to do.

Are parameters supported in the text protocol?

No, they are not. MySqlConnector parses the CommandText, finds ? or @param tokens (with the minimum smarts necessary to not detect these in string literals, comments, etc.) then substitutes in the (escaped) parameter values.

Is the choice between the text and binary protocols something that the user controls somehow?

Yes, through a convoluted mechanism. Binary protocol is implied by calling MySqlCommand.Prepare. However, because prepared commands support only a limited subset of SQL, and this might be surprising to users, Connector/NET added a IgnorePrepare option, which defaults to True. MySqlConnector implements the same convention. To use the binary protocol, you have to set IgnorePrepare=False, then call .Prepare() on each command.

so the MySQL binary protocol does not and cannot allow for batching at all

Correct. Each individual statement is given its own (integral) statement ID, and an "execute statement" command specifies a single statement ID. The client prepares all the statements in advance, then sends "execute statement ID 1", then reads the result set, then sends "execute statement ID 2", then reads its result set, etc.

Some MySQL Servers will permit pipelining, i.e., the client sending one network packet that contains multiple "execute statement ID n" payloads, then the server reading those one-at-a-time and processing them in order. However, there are a wide variety of proxies and backends (Aurora, Azure, TiDB, Percona, etc.) that speak the MySQL protocol and pipelining tends to break them. (It could possibly be added as an opt-in option for people who needed maximum performance and were confident their DB supported it.)

Also, this page seems to indicate that it may at least be possible to concatenate multiple statements in a single COM_STMT_PREPARE.

I have not found this to be true in practice. I thought I had once found documentation confirming that it's not supported but don't remember where that was. But that's a good point that some future MySQL Server might permit it.

Regarding the additional overload accepting an int[] for records affected, I'm not sure what you mean anymore... The proposal above includes an overload which returns int[], but of course feel free to edit and add. I think this is one of the main open questions which still need to be resolved.

@roji no problem. I added what I described in my emails as option 2, since it is a variation of option 1.

maybe it makes sense to repurpose dotnet/SqlClient#19 to track the implementation of the new batching API in SqlClient specifically? That seems to be its main focus in any case.

Ok, done.

roji commented

Unless I'm mistaken, C# collection initialization syntax actually has nothing to do with IEnumerable

I believe you are mistaken; see docs.

Sorry, my bad - you're right of course. That's what comes out of sloppy testing at 3AM.

Perhaps unfortunately, this does immediately imply access to the commands in the collection, which feels like a distraction from the point of this class. It does already have Add and Clear, and Count might make sense, which is halfway to ICollection<T>. But Contains, CopyTo, and Remove feel completely unnecessary, make implementations more complex, and imply some kind of equality for DbCommand objects. (This would probably end up just being reference equality, but is that what users expect?)

It doesn't feel like the point of DbCommandSet is to be a generic container of commands with full IList API. There's no obviously best answer here (to me), so I think I would lean towards keeping your proposed API, not implementing IEnumerable, and giving up C# collection initialisation until it becomes clear that that would be a high-value addition for typical users of the API. (I don't yet have a good sense for who will be the typical user; if it ends up primarily being used in ORM implementations, then some C# syntax sugar is less important.)

I agree with most of what you say, although I do think there's real value in providing collection initialization - maybe a good mid-way solution is to implement IEnumerable<DbCommand> but not IList<DbCommand>.
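For reference, a minimal standalone sketch of what the compiler needs for collection-initializer support - IEnumerable plus an accessible Add method; this is illustrative only and not part of the proposal itself:

using System.Collections;
using System.Collections.Generic;
using System.Data.Common;

// Illustrative only: with IEnumerable<DbCommand> + Add(), the following compiles:
//   var set = new CommandSetSketch { cmd1, cmd2 };
public class CommandSetSketch : IEnumerable<DbCommand>
{
    private readonly List<DbCommand> _commands = new List<DbCommand>();

    public void Add(DbCommand command) => _commands.Add(command);

    public IEnumerator<DbCommand> GetEnumerator() => _commands.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}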

The shim will set the CommandTimeout, Connection and Transaction properties

Are you referring only to the shim

Yes, I was suggesting that you specify the behaviour of the shim implementation w.r.t. those properties. I think the shim should do something with them when creating its batch command, and this seems like the right thing to do.

OK, I understand, but I don't see why the shim should do anything with these properties - can you explain why you think that's important? The shim would obviously set those properties on the internally-used batch command it creates (since that's the one that actually executes), but why the others? Another possible issue is that it would be somewhat incoherent to have the shim do it, but not to require the same of actual provider implementations, and I don't really see the sense in that either...

Are parameters supported in the text protocol?

No, they are not. MySqlConnector parses the CommandText, finds ? or @param tokens (with the minimum smarts necessary to not detect these in string literals, comments, etc.) then substitutes in the (escaped) parameter values.

I see, so if I understand correctly you support formatting and parsing parameter values both in text (for unprepared) and in binary (for prepared)... Npgsql used to work the same way a long while ago. Note also #25022 which I plan to tackle soon, and which may also obviate the need to parse for parameter substitution/interpolation. The ultimate goal here is to allow you to write your driver without any sort of parsing/rewriting - hopefully that can work out for you. If there are any other reasons you parse/rewrite (aside from batching and parameter substitution), could you post a note about them?

Is the choice between the text and binary protocols something that the user controls somehow?

Yes, through a convoluted mechanism. Binary protocol is implied by calling MySqlCommand.Prepare. However, because prepared commands support only a limited subset of SQL, and this might be surprising to users, Connector/NET added a IgnorePrepare option, which defaults to True. MySqlConnector implements the same convention. To use the binary protocol, you have to set IgnorePrepare=False, then call .Prepare() on each command.

Thanks for this information - this kind of cross-provider/cross-database information is very important. In any case, it's unfortunate that MySQL imposes these restrictions...

so the MySQL binary protocol does not and cannot allow for batching at all

Correct. Each individual statement is given its own (integral) statement ID, and an "execute statement" command specifies a single statement ID. The client prepares all the statements in advance, then sends "execute statement ID 1", then reads the result set, then sends "execute statement ID 2", then reads its result set, etc.

Some MySQL Servers will permit pipelining, i.e., the client sending one network packet that contains multiple "execute statement ID n" payloads, then the server reading those one-at-a-time and processing them in order. However, there are a wide variety of proxies and backends (Aurora, Azure, TiDB, Percona, etc.) that speak the MySQL protocol and pipelining tends to break them. (It could possibly be added as an opt-in option for people who needed maximum performance and were confident their DB supported it.)

OK. IMHO batching is important enough for perf that it may make sense to provide an opt-in for this, although it's true that you also have the text protocol that already supports this. Regarding special backends and their capabilities, in the PostgreSQL case it's sometimes possible for the driver to detect this and automatically switch off/on certain features, maybe it's possible in your case as well (just thinking out loud).

Thanks again for all the detail and the valuable input into this - please don't hesitate to share any other thoughts on this proposal etc.

Looks great :)

For the shim: e.g. DB2 requires BEGIN/END around batched statements, so the shim itself needs to consult the ADO.NET provider on whether it can simply concatenate the statements or needs to do additional things. Some databases don't support select statements batched together with update statements, for instance, so the shim should anticipate that, again consulting the ADO.NET provider. Some databases don't support batching at all (Access/OLE DB), so for these the shim should simply execute the commands one by one.

When concatenating statements, you obviously run into the problem of parameters with the same name. You therefore have to make the parameter names unique. This can be problematic as it requires you to parse the statement (as far as I can see).

Most ADO.NET providers have no real limit on the # of parameters except... SQL Server! :) So in the shim you also have to consult the ADO.NET provider on how many statements you can batch in one go, based on the # of parameters. I don't think it's that important to offer a 'batch size' value for the shim.

This is for .NET Core x.y only? Or will this be available on .net full as well?

roji commented

@FransBouma

For the shim: e.g. DB2 requires BEGIN/END around batched statements, so the shim itself needs to consult the ADO.NET provider whether it can simply concat the statements or needs to do additional things. Some databases don't support select statements batched together with update statements for instance, so your shim should anticipate on that, consulting the ADO.NET provider. Some databases don't support batching at all (access/oledb) so for these, the shim should simply execute them one by one.

I wasn't planning on the shim actually implementing all the different behaviors, but it's an interesting thought... Since I don't think a way exists to ask providers whether/how they support batching, I assume you're expecting the shim to check which provider is being used and to hardcode the different behaviors, right?

When concatenating statements, you obviously run into the problem of parameters with the same name. You therefore have to make the parameter names unique. This can be problematic as it requires you to parse the statement (as far as I can see).

Yes, I noted this problem above in the proposal - my plan is currently to throw in this case. It definitely seems like a non-starter for the shim to parse SQL to uniquify parameter names (especially as different databases have different SQL dialects etc.). This is after all only a backwards-compat shim, and it seems reasonable for it not to work for some edge cases - of course provider implementations of DbCommandSet can do what they want.

Most ADO.NET providers have no real limit on the # of parameters except... SQL Server! :) So you have to consult the ADO.NET provider there too in the shim how much statements you can batch in 1 go automatically based on the # of parameters. I don't think it's that important to offer a 'batch size' value for the shim.

That's a good point. I'd expect this kind of thing to be managed as an implementation detail by SqlClient's future implementation of DbCommandSet (as mentioned above, SqlClient already has an internal SqlCommandSet which isn't publicly accessible). I do agree that we shouldn't concern ourselves with this at the general API level. Of course it's also possible for the user to execute two batches instead of one in certain situations (not that passing this onto the user is a good idea).

This is for .NET Core x.y only? Or will this be available on .net full as well?

I don't have a definite answer for this yet, but as a general rule new APIs get added to .NET Core (or rather to .NET Standard) and not necessarily to .NET Framework.

@roji

Since I don't think a way exists to ask providers whether/how they support batching, I assume you're expecting the shim to check which provider is being used and to hardcode the different behaviors, right?

yes, in the case when it's created directly. It could do that when the first DbCommand is added. With the DbProviderFactory route I reckon the ADO.NET provider could create a specific derived type of the DbCommandSet shim and simply execute things sequentially if batching isn't supported.

I likely overlooked it, but if you don't have this yet, I think if there are commands added to the DbCommandSet which are assigned to different connections it should throw.

If I'm understanding this correctly, when a command set returns multiple result sets, the first result set is accessed by calling ExecuteReader() (or ExecuteReaderAsync()). Then, subsequent result sets are fetched by calling NextResult() on the last-returned DbDataReader instance (as returned by execute or the last call to NextResult()). All result sets returned by the set of commands are returned one after the other, with no distinction made as to which result set came from which command.

It's possible (at least in the SQL Server world) for a single command to return zero, one or multiple result sets--and the number returned can dynamically vary (e.g. a stored procedure might be programmed to return one result set under certain conditions and two under others). Making sense out of the returned result sets in a situation like this can get tricky/be impossible without knowing which set came from which command.

For example, let's say I execute a command set consisting of two commands, both of which call StoredProcA. This proc returns zero, one or two result sets depending on circumstances. If executing the command set returns four result sets, I know that the first two correspond with the first call to StoredProcA and the last two correspond with the second call. However, if, say, only two total result sets are returned, sorting out which came from where is a bit trickier--did they both come from the first proc invocation, the second invocation or did one come from each?

To accommodate situations like this, is something like the following an option?

public class DbCommandSet … {
  // executes each command in the set
  CommandSetResults Execute();
  Task<CommandSetResults> ExecuteAsync();
  …
}

// corresponds with the results returned by a single command in the set
public class CommandSetResults {
  // returns a data reader for the first result set from the command; subsequent
  // result sets from the same command are accessed by calling NextResult() on
  // the last-returned data reader instance
  DbDataReader GetReader();
  object GetScalar();
  // returns a CommandSetResults for the next command's set of result sets,
  // similar to how DbDataReader.NextResult() is used to get the next result
  // set from a single command
  CommandSetResults NextCommandResults();
  int AffectedRows { get; }
}

With an API like this, the developer passes in a set of commands, then fetches out a set of command results - the granularity passed in is the same as what is returned. It also eliminates the need to figure out how to return a combined records-affected count (because that number is available on a per-command basis), and potentially simplifies the exception situation (an exception associated with a command's results can be thrown while that command's results are being worked with, eliminating the need to figure out how to tag the exception to indicate which command triggered it).

GSPP commented

if the command contained more than one statement, it's currently impossible to find out how many rows were affected by each statement separately

This is critical for ORMs.

In general, this new API should expose all the raw power that is there. It should not drop information or limit the possible actions. This API is for consumers that maximally care about performance and flexibility.


The int[] recordsAffectedBreakdown allocation seems insignificant compared to the cost of making a database call. Even SELECT NULL is like 10000x the cost of that allocation. It's also gen-0 garbage.


Regarding error handling: It must be possible for providers to keep executing in case of error and keep incurring errors. SQL Server will happily continue in case of many errors. All those errors must be reported to the API caller somehow. SQL Server commands can also generate info messages. These must be provided as well.


Is the timeout considered global or per command? This was not clearly stated as far as I can tell. IMO, it should be a global timeout. Most of the time, you want a timeout for the entire business transactions. If you expect creating a customer order takes 100ms then you want to abort if the entire thing takes e.g. 10s. You don't care about the duration of individual queries at all.


Feel free to close my now superseded issue https://github.com/dotnet/corefx/issues/31070. I am very happy to see progress on this. This is going to be a major performance improvement.

roji commented

@FransBouma

Since I don't think a way exists to ask providers whether/how they support batching, I assume you're expecting the shim to check which provider is being used and to hardcode the different behaviors, right?

yes, in the case when it's created directly. It could do that when the first DbCommand is added. With the DbProviderFactory route I recon the ADO.NET provider could create a specific derived type of the DbCommandSet shim and simply execute things sequentially if batching isn't supported.

The way I was seeing it, the shim is only there for providers which haven't (yet) implemented any support. Providers which do simply have their own implementation of DbCommandSet which does not derive from the shim - they can do whatever it is they want. Thus, DB2 would automatically add BEGIN/END, SqlClient would split the batch when there are too many parameters, etc.

This raises the question of how desirable it is for the shim to start implementing provider-specific logic, which really belongs in the provider implementations. Some basic functionality may make sense: knowing that some specific providers simply don't support batching, and executing the command sequentially could be useful for the transitional period. But ultimately the expectation is for providers to provide their own implementation, totally unrelated to the shim.

I likely overlooked it, but if you don't have this yet, I think if there are commands added to the DbCommandSet which are assigned to different connections it should throw.

The idea was simply to ignore the Connection property on the commands in the set, as well as the Transaction and Timeout properties - only those on the DbCommandSet are used.

GSPP commented

Maybe the provider should be able to tell whether it supports batching, and expose other parameters such as the maximum efficient number of commands and parameters (before splitting needs to happen). Callers such as EF could interrogate the provider and adapt their behavior.

We'd need to identify all the common but relevant properties in which providers differ.

roji commented

@GSPP

if the command contained more than one statement, it's currently impossible to find out how many rows were affected by each statement separately

This is critical for ORMs.

In general, this new API should expose all the raw power that is there. It should not drop information or limit the possible actions. This API is for consumers that maximally care about performance and flexibility.

We're in agreement, the question is really about how best to expose the non-aggregated rows affected.

The int[] recordsAffectedBreakdown allocation seems insignificant compared to the cost of making a database call. Even SELECT NULL is like 10000x the cost of that allocation. It's also G0 garbage.

That may be true in the general case, but don't forget that this is an API abstraction that's also used to contact local and even in-memory databases such as Sqlite. We've also seen that when fetching cached data from PostgreSQL over a low-latency 10Gbps network link, even minor-seeming allocations start to have an effect in the TechEmpower benchmarks. I agree that these cases may be somewhat artificial and not related to real-world usage, but I'm generally averse to introducing APIs that necessarily cause allocations if there's another way.

Is the timeout considered global or per command? This was not clearly stated as far as I can tell. IMO, it should be a global timeout. Most of the time, you want a timeout for the entire business transactions. If you expect creating a customer order takes 100ms then you want to abort if the entire thing takes e.g. 10s. You don't care about the duration of individual queries at all.

I agree - and per-command timeouts are probably not technically possible in at least some (or even most) of the providers. There's a note on all this in the proposal above - CommandTimeout is ignored on the individual commands (like Connection and Transaction), and DbCommandSet has its own Timeout property.

Regarding error handling: It must be possible for providers to keep executing in case of error and keep incurring errors. SQL Server will happily continue in case of many errors. All those errors must be reported the API caller somehow. SQL Server commands can also generate info messages. These must be provided as well.

This is of course very provider-specific; the general batching API doesn't define any mandatory behavior but definitely should not get in the way either (there's already a note in the proposal above). I personally think that it makes more sense to abort the batch once an error occurs, but if continuing is the desired behavior for a specific provider (e.g. SQL Server), then it could throw an aggregate exception containing all the individual exceptions at the end of batch execution. At the moment I don't see anything in the proposal that needs to be changed, let me know if you see a problem.

Feel free to close my now superseded issue dotnet/SqlClient#19. I am very happy to see progress on this.

As @divega wrote on that issue, it's still useful to keep it to track the implementation of the SqlClient DbCommandSet - presumably the existing internal SqlCommandSet would be changed to conform to the new API being shaped here. So I think we should keep it open for now.

This is going to be a major performance improvement.

Great :) Thanks for your own analysis and description in dotnet/SqlClient#19, it helped push the realization that a true batching API really is necessary.

Maybe the provider should be able to tell if it does support batching and other parameters such as what the maximum efficient number of commands and parameters is (before splitting needs to happen). Callers such as EF could interrogate the provider and adapt their behavior.

This is similar to the point @FransBouma made above. The way I'm looking at it, the API consumer (e.g. EF Core) should not be concerned with such implementation details - they should simply construct a single big batch and execute it. If a specific provider's batching implementation considers that the batch should be split - either because it must be (e.g. too many parameters) or because it would run more efficiently that way - then it would do so internally in a transparent way.

Unless I'm mistaken, for SQL Server there's even a threshold below which batching isn't efficient at all (see dotnet/efcore#9270). The SqlClient implementation would simply not batch at all for these cases.

There may be some details which make it impossible or undesirable to totally abstract this away from the user - possibly around exception handling - but at the moment I can't see a reason not to.

@roji

Since I don't think a way exists to ask providers whether/how they support batching, I assume you're expecting the shim to check which provider is being used and to hardcode the different behaviors, right?

yes, in the case when it's created directly. It could do that when the first DbCommand is added. With the DbProviderFactory route I reckon the ADO.NET provider could create a specific derived type of the DbCommandSet shim and simply execute things sequentially if batching isn't supported.

The way I was seeing it, the shim is only there for providers which haven't (yet) implemented any support.

Ok, though if the shim fails, it's of no use to the user, so the user should be aware of the limits up front. E.g. if DB2 doesn't yet have its own implementation (and thus falls back on the shim, like any other provider), it will fail. This might be OK, but the user might be confused: "I use the default one and it breaks". So documentation might be in order, stating that using the shim likely won't work in some cases.

Be aware too that mixing select commands and update commands might not always work with all databases. So if you concatenate statements in the shim, the user might run into this limitation too. I think documentation is enough here, as it breaks at runtime and it's not determinable up front whether it will.

This raises the question of how desirable it is for the shim to start implementing provider-specific logic, which really belongs in the provider implementations. Some basic functionality may make sense: knowing that some specific providers simply don't support batching, and executing the command sequentially could be useful for the transitional period. But ultimately the expectation is for providers to provide their own implementation, totally unrelated to the shim.

yeah this is a bit of a tough call: concatenating statements and then not uniquifying the parameter names already makes it fail on most systems, so if you don't do that, the user has to do it themselves, and they likely already take other precautions (like grouping inserts together in one shim instance, updates in another, etc.).

Regarding the values returned by ExecuteNonQuery(Async) over the concatenated query command: there is a trick which allows us to determine, per update statement, whether it affected 0 or more rows: append ;SELECT param=rowcountvar to the update query, where param is an output parameter and rowcountvar is the rowcount variable of the db (e.g. @@ROWCOUNT). If the parameter is 0, the update statement affected 0 rows and should be reported as such to the user.

So for each query concatenated you append a unique parameter and a unique snippet. A bit overkill (I decided not to go for this in my framework: when one query fails, e.g. in the middle, the whole batch is rolled back, and you likely have to rerun the whole batch anyway, as that's way easier than sifting through which command was the wrong one and why and then scheduling the batch again without the ones that already succeeded).
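A hedged sketch of the trick being described, using SQL Server syntax (table, column and parameter names here are purely illustrative):

// Concatenation-based batch of two updates, each followed by a capture of
// @@ROWCOUNT into its own output parameter.
cmd.CommandText =
    "UPDATE orders SET status=@s1 WHERE id=@id1; SELECT @rc1 = @@ROWCOUNT; " +
    "UPDATE orders SET status=@s2 WHERE id=@id2; SELECT @rc2 = @@ROWCOUNT;";

var rc1 = cmd.CreateParameter();
rc1.ParameterName = "@rc1";
rc1.DbType = DbType.Int32;
rc1.Direction = ParameterDirection.Output;
cmd.Parameters.Add(rc1);
// ... @rc2 is set up the same way, plus the regular input parameters.

cmd.ExecuteNonQuery();

// rc1.Value / rc2.Value now hold the rows affected by each individual UPDATE;
// a value of 0 means that statement affected no rows.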

roji commented

@bgribaudo

If I'm understanding this correctly, when a command set returns multiple result sets, the first result set is accessed by calling ExecuteReader() (or ExecuteReaderAsync()). Then, subsequent result sets are fetched by calling NextResult() on the last-returned DbDataReader instance (as returned by execute or the last call to NextResult()). All result sets returned by the set of commands are returned one after the other, with no distinction made as to which result set came from which command.

That's correct. Note that this is how the ADO.NET API already works when you use concatenation-based batching.

It's possible (at least in the SQL Server world) for a single command to return zero, one or multiple result sets--and the number returned can dynamically vary (e.g. a stored procedure might be programmed to return one result set under certain conditions and two under others). Making sense out of the returned result sets in a situation like this can get tricky/be impossible without knowing which set came from which command.
[...]

That's a very valid point - the scenario of a single command returning a variable number of resultsets is indeed tricky, even if it seems like quite an edge case. Your proposal for a CommandSetResults (or some version thereof) could help, but IMHO this approach has the following drawbacks:

  • It makes the API for the common case much clunkier and more complicated. A user batching a couple queries now has to go through (and understand) an additional layer. Everyone already knows DbDataReader, so it's desirable to leave that in place.
  • There's an advantage in having DbCommandSet have the same API as DbCommand - the two are quite similar at the moment, and that similarity could even be captured by an interface. Aside from facilitating a coherent and more understandable API, this could enable scenarios where one part of a program creates a command or commandset, and another part executes it and consumes the results without knowing whether it was a batch or not.

In addition, even without something like this, there are some workarounds which can allow users to distinguish which resultset belongs to which command: one could look at the shape of the resultset (which columns it has) to determine which stored procedure generated it, or one could even insert an arbitrary SELECT 'delimiter' in the batch, which consuming code would look for as a marker that the previous command completed. This could be considered a bit hacky, but it works well and doesn't seem to have any downsides.
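As a rough sketch of that delimiter workaround with today's concatenation-based batching (the marker column and procedure names below are made up):

var cmd = connection.CreateCommand();
cmd.CommandText = "EXEC dbo.FirstProc; SELECT 1 AS __marker; EXEC dbo.SecondProc;";
using (var reader = cmd.ExecuteReader())
{
    var readingFirstProc = true;
    do
    {
        // A single-column resultset named __marker signals that FirstProc's resultsets are done.
        if (reader.FieldCount == 1 && reader.GetName(0) == "__marker")
        {
            readingFirstProc = false;
            continue;
        }
        // ... consume the resultset, attributing it to FirstProc or SecondProc based on readingFirstProc ...
    }
    while (reader.NextResult());
}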

To summarize, I think the concern you raise is valid. However, in my opinion it's too much of a rare case to consider (significantly) complicating the API surface for the general, simple case - especially when valid workarounds seem to exist.

roji commented

@FransBouma

The way I was seeing it, the shim is only there for providers which haven't (yet) implemented any support.

Ok, though if the shim fails, it's of no use to the user, so the user should be aware of the limits up front. E.g. if DB2 doesn't yet have their own implementation (and thus falls back on the shim, like any other provider) it will fail. This might be OK, but the user might be confused: "I use the default one and it breaks". So documentation might be in order that using the shim likely doesn't work in some cases.

Agreed. Another approach would be for the shim to contain a hard-coded list of the specific providers which support concatenation-based batching, and only do it for them; for other providers it would fall back to non-batched execution. This would ensure that the shim always works, although it doesn't always actually batch.

A more extreme approach would be to give up entirely on actually batching in the shim, and always execute in non-batched fashion (this could make sense if we think that the duplicate parameter name problem is going to occur a lot - not sure about this).
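For concreteness, that "never batch" fallback could look roughly like this inside the shim (a sketch only - _commands and the surrounding members are assumed shim internals, not part of the proposal):

public override int ExecuteNonQuery()
{
    var totalAffected = 0;
    foreach (DbCommand command in _commands)   // the DbCommands the user added to the set
    {
        command.Connection = Connection;
        command.Transaction = Transaction;
        totalAffected += command.ExecuteNonQuery();   // one roundtrip per command, no actual batching
    }
    return totalAffected;
}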

Be aware too that mixing select commands and update commands might not always work with all db's. So if you concatenate statements in the shim, the user might run into this limitation too. I think documentation is enough here, as it breaks at runtime and it's not determinable up front if it will.

Absolutely - this is a database-specific detail that we shouldn't make any attempt to handle. Documentation will be important here.

yeah this is a bit of a tough call: concatenating statements and then not uniquing the parameter names already makes it fail on most systems, so if you don't do that, the user has to do that, and they likely already take other precautions (like group inserts together in 1 shim instance, updates in another etc.).

That's true - I think we should do our reasonable best, but rewriting user-provided SQL of unknown dialects seems like it's a non-starter. Here as well, documentation will hopefully help people out, and providers will hopefully implement their own DbCommandSet to obviate the shim entirely.

Regarding the values returned by ExecuteNonQuery(Async) which uses the concatenated query command: There is a trick to use here which allows us to determine per update statement whether it affected 0 or more rows: append ;SELECT param=rowcountvar to the update query, where param is an output parameter and rowcountvar is the rowcount var of the db (e.g. @@ROWCOUNT) . If the parameter is 0, the update statement affected 0 rows and should be reported as such to the user.
[...]

Yes, this is similar to what EF Core does for SQL Server in order to implement optimistic concurrency, which requires knowing for each update if it worked or not. This is exactly the kind of detail that I'd expect the SqlDbCommandBatch to take care of internally, abstracting away the task from O/RMs and users. By the way, PostgreSQL provides this information as part of the wire protocol, so no additional hacky queries are needed.

GSPP commented

Thanks @roji. It seems I overlooked a few statements in this long thread.

should not be concerned with such implementation details

Unless I'm mistaken, for SQL Server there's even a threshold below which batching isn't efficient at all (see aspnet/EntityFrameworkCore#9270). The SqlClient implementation would simply not batch at all for these cases.

These details do not matter for semantics but for performance. The caller can adjust its behavior in ways that the provider cannot. For example, the provider does not understand the nature of the queries and the network latency environment.

EF might have 1 billion inserts queued but it will likely not want to create 1 billion command objects. So some moderation/negotiation on the number of commands per set has to take place in any case. If EF could query the maximum command and parameter count it could directly materialize the perfect number of commands.

I believe quite strongly, that the provider should never try to optimize the batch size or suppress batching in some cases. It's the wrong place to make these decisions.

The EF issue does not state what mechanism of batching was used. Maybe just concatenated SQL which has different properties than command sets (most significantly, recompilation). If command sets ever are slower I'd like to know why. What specific property (of the protocol?) caused that? It would be surprising to me.


A concatenation based shim would not work for statements that must be the single statement in the batch e.g. CREATE VIEW. Also, renaming parameters is observable. T-SQL can query the catalog views to obtain the currently executing SQL text. It might have implications for parameterization, query store and other tooling. Of course, SQL Server will get a true implementation but these examples show how difficult it is to shim this.

Maybe the shim should be the safest option possible and all optimizations should be left to the provider.

OK, I understand, but I don't see why the shim should do anything with these properties - can you explain why you think that's important? The shim would obviously set those properties on the internally-used batch command it creates (since that's the one that actually executes), but why the others?

I think we're saying the exact same thing here; I may have just worded it confusingly.

I am not proposing that DbCommandSet modify properties on any DbCommand that is added to the set.

All I'm suggesting is adding to the spec a bullet point that explains how the shim will set the properties on the internally-created batch command. Your spec mentions how CommandText is used; I thought it would be good to be thorough and explicitly document how all properties on DbCommandSet get applied to the internally-created batch command. (The bullet point I suggested adding deliberately paralleled your existing second bullet point.)

roji commented

@bgrainger ahh, understood! I indeed thought you wanted the shim to modify the properties on the user-provided commands. Added a bullet point as suggested.

roji commented

@GSPP

These details do not matter for semantics but for performance. The caller can adjust its behavior in ways that the provider cannot. For example, the provider does not understand the nature of the queries and the network latency environment.
[...]
I believe quite strongly, that the provider should never try to optimize the batch size or suppress batching in some cases. It's the wrong place to make these decisions.

I'd like to understand this point a bit better - it also seems to somewhat contradict your comment above that proposes the provider expose the maximum efficient number of commands/parameters. After all, if the provider can expose this knowledge, why shouldn't it just act on it and remove the burden from the user?

But more importantly, I'm interested in exactly what kind of user knowledge could be useful for determining batch sizes/thresholds, that the provider does not have. Shouldn't batching always be desirable regardless of the network latency environment? How would the nature of the query impact the decision-making here? Some real-world examples could be useful on this.

Note that the user would always have the option of creating multiple DbCommandSets, according to whatever criteria they see fit. The question here is whether the provider implementation should make its own decisions on how to execute the commands given by the user. Don't forget we have some very clear provider-specific cases, like the max parameter limit per batch in the SQL Server case - do you really think this kind of thing should be the user's responsibility rather than the provider's?

The EF issue does not state what mechanism of batching was used. Maybe just concatenated SQL which has different properties than command sets (most significantly, recompilation). If command sets ever are slower I'd like to know why. What specific property (of the protocol?) caused that? It would be surprising to me.

That's very possible - I was also very surprised that batching two statements could actually be slower than not batching them, and it was explained to me that it's related to overhead involved in batching. @divega and/or @AndriySvyryd may be able to provide more info.

EF might have 1 billion inserts queued but it will likely not want to create 1 billion command objects. So some moderation/negotiation on the number of commands per set has to take place in any case. If EF could query the maximum command and parameter count it could directly materialize the perfect number of commands.

Regardless of the above, if there really are many commands to execute, EF may decide to execute them in several batches - regardless of anything provider- or environment-specific, i.e. as a means of avoiding instantiation of lots of command instances. I'm not sure this is actually necessary, but the point is that it's a different kind of batch-breaking-down than the one discussed above.

roji commented

@GSPP one more comment - note that whether a particular implementation of DbCommandBatch splits the batch or not is entirely internal and provider-specific. We can of course publish guidelines and recommendations, but at the end of the day each provider will do whatever it is they think is necessary - in other words, it's not part of the API itself.

roji commented

@GSPP missed this last comment of yours:

A concatenation based shim would not work for statements that must be the single statement in the batch e.g. CREATE VIEW

Would it work in a non-concatenation-based batch? If not, then that seems OK - it's simply a limitation of SQL Server for both forms of batching. If CREATE VIEW can be batched, then it's an odd limitation :) But even in this case, it seems OK for the shim not to always work - it's a transitional mechanism until SqlClient gets its own optimized implementation, and it's a new API so nothing is being broken.

Also, renaming parameters is observable. T-SQL can query the catalog views to obtain the currently executing SQL text. It might have implications for parameterization, query store and other tooling. Of course, SQL Server will get a true implementation but these examples show how difficult it is to shim this.

I assume that by renaming parameters you mean changing their names so that they can be unique across all commands in the batch. If so, then I agree that this isn't viable, if only for practical reasons - we're not going to be able build an SQL parser into the shim that can detect placeholders which aren't part of SQL literals etc.

Maybe the shim should be the safest option possible and all optimizations should be left to the provider.

That seems to point towards the shim executing the commands without any sort of batching, as proposed above in the discussion with @FransBouma. I agree there's some merit in this...

Are there going to be any considerations for T-SQL's GO batch separator? One thing that has always been a pain when building DB migration bits is when I want to execute scripts generated by SSMS, which typically contain a number of batches separated by GO.

Also, how will the different providers handle scope? For instance, if I do the following:

DECLARE @val INT;
SELECT @val = 1;

If this were two commands, what happens to the scope of @val or is this actually a non-issue here?

Could an API be provided to accept a SQL statement and return a command batch? Is there any merit in such an API?

[@GSPP] Regarding error handling: It must be possible for providers to keep executing in case of error and keep incurring errors. SQL Server will happily continue in case of many errors. All those errors must be reported to the API caller somehow. SQL Server commands can also generate info messages. These must be provided as well.

[@roji] This is of course very provider-specific; the general batching API doesn't define any mandatory behavior but definitely should not get in the way either (there's already a note in the proposal above). I personally think that it makes more sense to abort the batch once an error occurs, but if continuing is the desired behavior for a specific provider (e.g. SQL Server), then it could throw an aggregate exception containing all the individual exceptions at the end of batch execution. At the moment I don't see anything in the proposal that needs to be changed, let me know if you see a problem.

I concur with the idea of allowing execution to continue on error when the server supports this. However, I think it's also important to have the ability to know about errors as they are encountered vs. after the entire batch has been executed.

Currently, with ADO.Net (at least, the SQL Server implementation), error exceptions are raised when the error is read from the incoming data stream. The consumer can then respond/react to the exception, as appropriate.

With this API as it currently stands, if a consumer is reading a result set that was not fully provided by the server due to an error, it sounds like the result set will look like a valid, complete result set. The consumer won't be able to rely on that actually being the case until the entire set of commands has finished executing. If the consumer wants to perform an irrevocable action on a per-data set basis, the consumer will need to cache all data sets as they are received while they wait for execution to complete and the possible aggregate exception before taking action.

If, instead, errors are raised as they are encountered, the consumer knows that when they read a complete data set, it is actually complete, and so can be actioned on immediately. If a related error is encountered during reading, the consumer is informed and gets to decide whether or not to continue on (e.g. move to the next data reader).

Another benefit of immediate exceptions on error is that, if the consumer decides that an error warrants cancelling the command set's execution, they can close the connection (perhaps automatically as the exception exits a using statement). The server can then detect this and cancel execution instead of continuing on only for the consumer to ignore/discard the remainder of what's returned.

Currently, with ADO.Net (at least, the SQL Server implementation), error exceptions are raised when the error is read from the incoming data stream. The consumer can then respond/react to the exception, as appropriate.

I don't know, if the error isn't a transient error, the current transaction is likely to be aborted on the server side, meaning whatever you decide to do, you have to run the previous commands again as they've been rolled back due to the aborted transaction.

roji commented

@bgribaudo I'm not too sure about the handling of multiple errors in general and especially as they come - but some more information on exactly how SQL Server behaves is probably needed here (@divega do you have this knowledge?).

First, I agree with @FransBouma that in general, an error is likely to trigger a batch failure and/or transaction rollback - although this kind of thing is definitely database-dependent.

With this API as it currently stands, if a consumer is reading a result set that was not fully provided by the server due to an error, it sounds like the result set will look like a valid, complete result set.

You're talking about a partial resultset that gets terminated by an error, without a way for the user to know this? If the error is transient (e.g. network partition, server out of memory) then it's very unlikely that the batch can continue in any case (since connectivity has been lost). If you're thinking about a non-transient error, then I have no idea how that could happen in the middle of a resultset - do you have any concrete, real-world example?

roji commented

@Antaris

Are there going to be any considerations for T-SQL's GO batch separator?

I'm not an expert on SQL Server, but unless I'm mistaken the GO bit is required in SQL scripts to signal the end of a batch, and isn't part of the protocol in any way (as per the docs). In other words, it operates at a higher level than what we're discussing here, which is the low-level interface with the ADO.NET provider. A higher-level tool reading an SQL script may look for GO in order to execute the statements in a batch using this API, but it doesn't seem like this API is itself supposed to be aware of GO.

Also, how will the different providers handle scope?

That is a very SQL Server-specific question, and doesn't seem like it should be the concern of this API abstraction. If a variable's scope is restricted to its batch (as seems to be indicated in the GO docs), then it's the user's responsibility to make sure all statements referring to that variable are in the same batch as its declaration.

Could an API be provided to accept a SQL statement and return a command batch? Is there any merit in such an API?

The existing proposal already allows you to construct a DbCommandSet containing only a single DbCommand. However, I'm not sure what would be the point of that.

@roji:

I personally think that it makes more sense to abort the batch once an error occurs, but if continuing is the desired behavior for a specific provider (e.g. SQL Server), then it could throw an aggregate exception containing all the individual exceptions at the end of batch execution.

@roji:

You're talking about a partial resultset that gets terminated by an error, without a way for the user to know this? If the error is transient (e.g. network partition, server out of memory) then it's very unlikely that the batch can continue in any case (since connectivity has been lost). If you're thinking about a non-transient error, then I have no idea how that could happen in the middle of a resultset - do you have any concrete, real-world example?

Hope you're having a great Friday!

My comment has to do with the "but if continuing" idea from the first quote above where errors are saved until the end of batch execution then thrown in an aggregate exception.

Say a single batch is executed against a SQL Server which returns two result sets. An error is encountered with the first result set resulting in the server being unable to fully return it. The second result set is returned in its entirety without any errors. In order to hold throwing the exception for the first result set's error until after execution of the entire batch has been completed (e.g. after the second result set has been read in its entirety), wouldn't the first result set have to be readable by the consumer without throwing exceptions?

In this scenario, the consumer would read as many rows out of the first result set as were returned before the error then be told by the data reader that the result set is complete (e.g. reader.Read() would return false), then move to the next result set (reader.NextResult()), read it in its entirety, and then and only then receive the aggregate exception letting them know that the first result set really wasn't complete.

I think this would be dangerous; that it would be better to throw an exception for a fatal error when it is encountered. If the consumer wants to continue working with additional result sets from the batch, they can catch the exception and move to the next reader. This is how SqlCommand currently works.

Example of SqlCommand's behavior:

using (var command = connection.CreateCommand())
{
	command.CommandText = "SELECT 1/0; SELECT 2";

	using (var reader = command.ExecuteReader())
	{
		try
		{
			reader.Read(); // Throws an exception because of the error in the first result set.
		}
		catch (SqlException e) when (e.Message == "Divide by zero error encountered.") {
			// However, if we catch the exception, we can move to reading the next result set.
		}

		// The fact that the first result set died on an error didn't cause SQL Server to abort the entire batch.
		// The second result set is still returned.

		reader.NextResult();

		reader.Read();
		reader.GetInt32(0); // returns 2
	}
}
roji commented

@bgribaudo, if you're happy with the DbCommand API as in the example above, then that's exactly what DbCommandSet will have according to the current proposal: when you invoke DbCommandSet.ExecuteReader(), you get a reader that can behave in the exact same way as above - depending on how a particular provider implements it. The API doesn't force a particular error handling behavior on implementors. In fact, the idea is for the DbCommandSet API to be as close as possible to the DbCommand API.

Does that make sense?

Add an overload of ExecuteNonQuery[Async]() which returns an int[]. Calling it would incur a heap allocation (although pay-per-play), and the naming for this isn't going to be pretty. In addition, add a similar int[] property on DbDataReader;

I don't think that having this property is a good idea. You can already get the rows affected by each statement by iterating over the results.

Returning an int as a result of ExecuteNonQuery doesn't provide meaningful information. For example, you have two DML statements. The first one should modify 2 rows, and the second - 3 rows. But because of an error you could get 3 and 2. The sum still is 5.

roji commented

I don't think that having this property is a good idea. You can already get the rows affected by each statement by iterating over the results.

Affected rows is mainly for DELETE, UPDATE and INSERT, where there are no results to iterate over... Think of a batch containing several updates - we want to provide the user with a way of knowing which update affected how many rows.

Returning an int as a result of ExecuteNonQuery doesn't provide meaningful information. For example, you have two DML statements. The first one should modify 2 rows, and the second - 3 rows. But because of an error you could get 3 and 2. The sum still is 5.

That's precisely the problem we're trying to solve. There's an int-returning version of ExecuteNonQuery() because users rarely care about the command-by-command breakdown (and an array allocation is involved), and also because we want to keep the API similar to DbCommand. This is why we're discussing the possibility of having two ways to get the affected records count.

Affected rows is mainly for DELETE, UPDATE and INSERT, where there are no results to iterate over... Think of a batch containing several updates - we want to provide the user with a way of knowing which update affected how many rows.

var recordsAffected = new int[10];
set.ExecuteNonQuery(recordsAffected);

// or

var recordsAffected = new int[10];
var currentIndex = 0;
using (var reader = set.ExecuteReader())
{
    do
    {
        recordsAffected[currentIndex++] = reader.RecordsAffected;
    }
    while (reader.NextResult());
}

The first way perfectly fits the requirement. You know how many rows were affected by each command in a batch. The second approach shouldn't be used when you're doing some data manipulations, but it's the right way to get the count of affected records in cases when you mix data selection and manipulation.

roji commented

@YohDeadfall, your first proposal is essentially what @divega proposed (option 2 under "Affected Rows").

The 2nd option of repeatedly checking RecordsAffected doesn't work because UPDATE, DELETE and INSERT don't have resultsets. For example, if you currently send a batch containing "SELECT ...; UPDATE...; SELECT ..." you get only two resultsets back. Also, in the docs for DbDataReader.RecordsAffected it's specified that "The RecordsAffected property is not set until all rows are read and you close the DataReader".

Also, in the docs for DbDataReader.RecordsAffected it's specified that "The RecordsAffected property is not set until all rows are read and you close the DataReader".

Could this behavior be changed for command sets only at least? I mean the last part of the sentence.

roji commented

Could this behavior be changed for command sets only at least? I mean the last part of the sentence.

It could, but there's still the other more important issue of INSERT, UPDATE and DELETE not returning a resultset.

One more thing with regards to your 1st proposal (overload of ExecuteNonQuery() which accepts an array)... See the note in https://github.com/dotnet/corefx/issues/35135#issuecomment-461422211 about stored procedures returning an arbitrary number of resultsets - this makes it harder for the user to know in advance exactly how many resultsets there's going to be.

It could, but there's still the other more important issue of INSERT, UPDATE and DELETE not returning a resultset.

Don't know how other providers work, but Npgsql (source) and SqlClient (source) use data readers to get the number of modified rows.

See the note in dotnet/corefx#35135 (comment) about stored procedures returning an arbitrary number of resultsets

It would be strange to get the count of affected rows for each result set from the stored procedure if the user doesn't know what it's doing. Is it a real scenario?

roji commented

It could, but there's still the other more important issue of INSERT, UPDATE and DELETE not returning a resultset.

Don't know how other providers work, but Npgsql (source) and SqlClient (source) use data readers to get the number of modified rows.

Yes, readers aggregate the rows affected from all executed commands. But the point is that there is no way to make them "stop" on each delete/update/insert and get the individual rows affected by it, because they have no resultset. It's not that they have empty resultsets - they have no resultsets at all, so when would you access DbDataReader.RecordsAffected to get the value for a specific delete/update/insert statement?
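To make that concrete, here's roughly what a consumer sees today with a concatenation-based batch of updates (table/column names made up; per the RecordsAffected docs, the value is only reliable once the reader is closed):

var cmd = connection.CreateCommand();
cmd.CommandText = "UPDATE a SET x = 1; UPDATE b SET y = 2; UPDATE c SET z = 3";
int aggregate;
using (var reader = cmd.ExecuteReader())
{
    // None of the statements produces a resultset, so there's nothing to position on:
    while (reader.NextResult()) { /* never entered */ }
    reader.Close();
    aggregate = reader.RecordsAffected;   // sum across all three UPDATEs; the per-statement breakdown is gone
}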

See the note in dotnet/corefx#35135 (comment) about stored procedures returning an arbitrary number of resultsets

It would be strange to get the count of affected rows for each result set from the stored procedure if the user doesn't know what it's doing. Is it a real scenario?

@bgribaudo hasn't provided a real scenario, but it doesn't seem completely implausible - you provide some parameter to the function which determines in some way how many resultsets are returned. In any case, I don't think this is decisive.

But the point is that there is no way to make them "stop" on each delete/update/insert and get the individual rows affected by it, because they have no resultset.

The next statement must be executed when NextResult is called, and the rest of them on closing.

roji commented

But the point is that there is no way to make them "stop" on each delete/update/insert and get the individual rows affected by it, because they have no resultset.

The next statement must be executed when NextResult is called, and the rest of them on closing.

I'm sorry but I'm not following... Can you please explain your proposal for fetching the non-aggregated affected records for a batch of updates (UPDATE ...; UPDATE ...; UPDATE ...)? With the current reader API, this kind of batch produces no resultsets whatsoever, so there's no way to get the records affected by each individual command.

@roji:

@bgribaudo, if you're happy with the DbCommand API as in the example above, then that's exactly what DbCommandSet will have according to the current proposal: when you invoke DbCommandSet.ExecuteReader(), you get a reader that can behave in the exact same way as above - depending on how a particular provider implements it. The API doesn't force a particular error handling behavior on implementors. In fact, the idea is for the DbCommandSet API to be as close as possible to the DbCommand API.
Does that make sense?

Yes! Thanks!

See the note in dotnet/corefx#35135 (comment) about stored procedures returning an arbitrary number of resultsets

It would be strange to get the count of affected rows for each result set from the stored procedure if the user doesn't know what it's doing. Is it a real scenario?

@bgribaudo hasn't provided a real scenario, but it doesn't seem completely implausible - you provide some parameter to the function which determines in some way how many resultsets are returned. In any case, I don't think this is decisive.

For a real-world example, how about SQL Server's sp_help? "The result sets that are returned depend on whether name is specified, when it is specified, and which database object it is." For example, if that sp is passed a table name as its parameter, the number of result sets returned can vary--e.g. based on whether the table has indexes and/or constraints defined.

The number of result sets returned can vary even when each invoked procedure/batch has been built to return a fixed number of result sets. Say, several commands are batched together with no transaction in use. The first is an invocation of a particular stored proc that was built to return exactly two result sets. However, while assembling the first result set, an error is encountered, causing the procedure's execution to terminate. Execution then moves on to the next batch in the command set, which successfully returns one result set (as expected). Execution of the entire command set returned two result sets (one dying with an error, one without errors). The assumption that the first two result sets returned correspond with the first stored procedure doesn't hold here, even though that procedure is expected to return a fixed number of result sets.

@YohDeadfall

Also, in the docs for DbDataReader.RecordsAffected it's specified that "The RecordsAffected property is not set until all rows are read and you close the DataReader".

Could this behavior be changed for command sets only at least? I mean the last part of the sentence.

Are you looking for a way to, say, get the affected row count from the first command in a batch then use that value to decide whether to read the results from the second command in a batch vs. having to wait until you've read all result sets returned from all commands before being able to access the affected count (or the array of affected counts)?

@roji, @GSPP & others, here is a Wayback Machine link to a short article that explained how batching worked in SqlClient with SqlDataAdapter in ADO.NET 2.0:

https://web.archive.org/web/20080808113743/https://blogs.msdn.com/dataaccess/archive/2005/05/19/420065.aspx

Copying the important bits here:

The other way of doing batching is “simpler”, but happens at a lower level. I’m going to describe the SQL Server implementation here, but other database servers might have a similar feature. At the client-server protocol level (TDS), each command is executed against the server using a “request”. For parametrized and batched statements, this request is an “RPC request” (nothing to do with the DCE RPC, just same name). It turns out that in SQL Server we can put one RPC request on the network, and right before putting the “we’re done” mark on the wire, we can actually put a “here comes another RPC” mark. This way, we can keep sending RPCs over the wire all together. Once we sent all of them, we switch to listen for responses, and we get the responses for each RPC in the order we sent them. The protocol has the appropriate infrastructure that allows us to distinguish when a response for a given RPC ends and the next starts.

This means that we get rid of the following problems:

  • We can send multiple request for statement execution in a single network round-trip, which was the original goal
  • No need to parse statements and rename parameters, because each statement goes in a different request
  • No procedure cache bloat, because regardless of the order of execution, each row has an RPC of its own
  • No limit to the number of rows that can be updated with a single batch, because we don’t depend on lots of parameters for a single RPC

There are some minor specific drawbacks to RPC-level batching, such as paying a little bit of extra cost in the server per-RPC execution, but the benefits above in general out-weigh those.

The key here is that if we can leverage the existing batch support (currently only accessible through DataTable or DataSet) there is no 2100 parameter limit for the whole batch.

roji commented

@divega that's some great info! Something tells me that we may also be surprised by the optimal batch size with this batching style (i.e. it may be 2 😄 ). And yeah, it's also great to know that the 2100 parameter limit goes away...

@bgribaudo, I'm looking for other ways to get the count of affected rows than adding a property or something like that for which I should pay even if it isn't used.

roji commented

Note to all, I've updated the original proposal above with some more thoughts and options, and an example.

roji commented

@YohDeadfall

@bgribaudo, I'm looking for other ways to get the count of affected rows than adding a property or something like that for which I should pay even if it isn't used.

Option 2 and 3 in the proposal above don't involve paying for anything more than populating a user-provided array (option 2) or writing an int property on DbCommand (option 3), which already seems reasonable. But if you have any other ideas that would be great.

GSPP commented

I'd like to understand this point a bit better - it also seems to somewhat contradict your comment above that proposes the provider expose the maximum efficient number of commands/parameters. After all, if the provider can expose this knowledge, why shouldn't it just act on it and remove the burden from the user?

This was about why the provider should expose the maximum efficient limits. Imagine EF wanting to insert 1 billion rows. It certainly cannot create a command set with 1 billion subcommands. That would be very inefficient. Rather, it would need to perform its own batching. What batch size should it pick? If it picks too low, it will be inefficient. If it picks too high, the provider will introduce unnecessarily small batches (e.g. provider limit 1000, EF sent 1500 items. This results in batches of 1000 and 500 items.). EF has no good way of picking a number for its own batching.
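As a sketch of the kind of chunking an O/RM would otherwise have to do itself - assuming a hypothetical maxBatchCommands limit obtained from the provider, plus the CreateCommandSet/Add shape from the proposal:

// using System; using System.Collections.Generic; using System.Data.Common;
static IEnumerable<DbCommandSet> Chunk(DbConnection connection, IReadOnlyList<DbCommand> commands, int maxBatchCommands)
{
    for (var i = 0; i < commands.Count; i += maxBatchCommands)
    {
        var set = connection.CreateCommandSet();
        for (var j = i; j < Math.Min(i + maxBatchCommands, commands.Count); j++)
            set.Add(commands[j]);
        yield return set;
    }
}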


CREATE VIEW Would it work in a non-concatenation-based batch

Yes, totally possible. SqlCommandSet can do this just fine because it sends multiple batches at the TDS layer.

That seems to point towards the shim executing the commands without any sort of batching, as proposed above in the discussion with @FransBouma. I agree there's some merit in this...

I recommend that. All popular providers will get an efficient implementation quickly.

GO batch separator

Yes, I can confirm that this has nothing to do with SQL Server. This is an SSMS feature.


Also, @Antaris mentions local variables. This alone is a reason concatenation can never be used.


I concur with the idea of allowing execution to continue on error when the server supports this. However, I think it's also important to have the ability to know about errors as they are encountered vs. after the entire batch has been executed.

Currently, with ADO.Net (at least, the SQL Server implementation), error exceptions are raised when the error is read from the incoming data stream. The consumer can then respond/react to the exception, as appropriate.

As explained elsewhere there can be no assumption about what does or does not terminate batch and/or transaction.

Also, it is possible for errors to occur mid-way through a result set. This can happen when a computation such as 1/0 fails. Example query: select *, 1/ (ID - 1000000) from LargeTable. This will crash when the row with ID 1000000 is encountered.

roji commented

I'd like to understand this point a bit better - it also seems to somewhat contradict your comment above that proposes the provider expose the maximum efficient number of commands/parameters. After all, if the provider can expose this knowledge, why shouldn't it just act on it and remove the burden from the user?

This was about why the provider should expose the maximum efficient limits. Imagine EF wanting to insert 1 billion rows. It certainly cannot create a command set with 1 billion subcommands. That would be very inefficient. Rather, it would need to perform its own batching. What batch size should it pick? If it picks too low, it will be inefficient. If it picks too high, the provider will introduce unnecessarily small batches (e.g. provider limit 1000, EF sent 1500 items. This results in batches of 1000 and 500 items.). EF has no good way of picking a number for its own batching.

There are several things here... First, as I wrote above, if the provider can expose a simple numeric limit, then I don't really see a reason for it not to simply apply it. Why burden all users with handling this manually when it can be taken care of?

Second, a simple "maximum efficient limit" (or whatever) seems very specific and inflexible. What if the batch size is determined more by the number of parameters, or the total byte size or similar? It doesn't make sense to me to try to expose such things, as different databases will have different constraints.

Third, I was thinking of this feature a bit more for hard, protocol-related limits and not for "optimal efficiency" limits. For example, when I wrote this I was under the impression that batches would be limited to 2100 parameters (@divega since wrote that it's not the case). With such a hard limit it could make sense for the implementation to split.

Otherwise, if there's indeed an "optimal efficiency number", it's likely to depend on environmental factors (server specs, network), which the provider doesn't know about any more than the EF provider does. I can't say I see any value in trying to guess that.

In any case, as I've written earlier today in the updated proposal, as of now we're not aware of many uses for this, and in any case it's a totally internal provider detail.

That seems to point towards the shim executing the commands without any sort of batching, as proposed above in the discussion with @FransBouma. I agree there's some merit in this...

I recommend that. All popular providers will get an efficient implementation quickly.

I wouldn't necessarily bet too much on that :) We still have providers out there which haven't even implemented actual async I/O, which has been around for a very long time - I think experience shows that we should be a little pessimistic here. In any case I've updated the proposal with more analysis of the different options.

CREATE VIEW Would it work in a non-concatenation-based batch

Yes, totally possible. SqlCommandSet can do this just fine because it sends multiple batches at the TDS layer.
[...]

GO batch separator

Yes, I can confirm that this has nothing to do with SQL Server. This is an SSMS feature.

Great, thanks for confirming!

As explained elsewhere there can be no assumption about what does or does not terminate batch and/or transaction.

Also, it is possible for errors to occur mid-way through a result set. This can happen when a computation such as 1/0 fails. Example query: select *, 1/ (ID - 1000000) from LargeTable. This will crash when the row with ID 1000000 is encountered.

That's a good example. It's worth mentioning that I/O errors can also occur at absolutely any place which may involve I/O, that's just the way things are. The question in this context seems to be about the error behavior, which is going to be provider-specific - but I definitely hope there's a way for batches to be aborted on error across all relevant providers.

GSPP commented

For example, when I wrote this I was under the impression that batches would be limited to 2100 parameters

Yes, I was under the same impression. I think you're right. This doesn't make much sense.

For example, when I wrote this I was under the impression that batches would be limited to 2100 parameters

Yes, I was under the same impression. I think you're right. This doesn't make much sense.

By batches, do you mean sets of batches? :-) In the SQL Server world, the approx. 2,100 parameter limit applies to each batch (which may, in turn, contain multiple SQL statements) but not across multiple batches that are executed as part of a single TDS message.

(Behind the scenes, executing a SQL batch with parameters involves calling a stored procedure. SQL Server limits stored procedures to 2,100 parameters. All user-supplied parameters plus a small number of driver/API provided parameters [like one containing the query's text] need to fit within that 2,100 limit.)

From Does ADO.NET update batching really do something?:

It turns out that some databases such as SQL Server have an upper limit to the number of parameters that you can pass to a single parametrized statement invocation. In the case of SQL Server this is around 2100. That means that your batch cannot sum up more than those parameters total (not per statement).

roji commented

@bgribaudo

By batches, do you mean sets of batches? :-) In the SQL Server world, the approx. 2,100 parameter limit applies to each batch (which may, in turn, contain multiple SQL statements) but not across multiple batches that are executed as part of a single TDS message.

Yeah, there's a danger of a terminology mismatch here... I don't know much about SQL Server, but I'm just referring to @divega's comment above, according to which the batch is made up of RPC messages, each of which can contain 2100 parameters. I assume that each of these RPC messages can contain multiple statements - this is probably the concatenation-based approach used today which we're trying to get away from.

It turns out that some databases such as SQL Server have an upper limit to the number of parameters that you can pass to a single parametrized statement invocation.

Right, the idea is (I think) to make DbCommandSet correspond to a set of parameterized statement invocations.

GSPP commented

Can the content from that Data Access blog be saved somehow? This is very valuable information. The wayback machine is not an appropriate place. Maybe activate the SQL Server documentation team...

Thanks, @roji!

I assume that each of these RPC messages can contain multiple statements - this is probably the concatenation-based approach used today which we're trying to get away from.

That is correct--each RPC message can contain one batch, which consists of one or more SQL statements.

roji commented

Note to everyone: I've updated the proposal again (updated with the current decision on backwards compatibility (no shim), added feature detection discussion, added non-aggregated rows affected to goals).

@roji I agree ADO.NET doesn't have a very developed pattern for feature detection, but it has the beginning of one.

For example, if you look at DbProviderFactory, there is a method/property pair:

    public abstract partial class DbProviderFactory
    {
        ...
        public virtual bool CanCreateDataSourceEnumerator => false;
        ...
        public virtual DbDataSourceEnumerator CreateDataSourceEnumerator() => null;
    }

At the time this was designed, DbDataSourceEnumerator was probably considered the only optional abstraction in the ADO.NET provider model. Since then, we have stretched things a bit, and on Microsoft.Data.Sqlite (which started as a .NET Core 1.0 ADO.NET provider, before DataSet was ported to .NET Core), SqliteFactory doesn't implement CreateCommandBuilder or CreateDataAdapter.

Note that:

  1. The Create* methods on DbProviderFactory return null rather than being abstract or throwing NotSupportedException. I am not 100% sure how intentional this was. However, you could argue that any client code that needs to reason about capabilities can (and should) check for the result of the Create* method being null.
  2. We did not try to add CanCreateDataAdapter and CanCreateCommandBuilder when we effectively made them optional, probably because we did not think it was necessary. But even if we wanted to do it, the only way to implement it that would work for existing providers would be to call the corresponding Create* method and check if the result is null. I am not sure this is worth doing, but we could consider it for consistency.

Since we are intentionally adding a new optional abstraction to the provider model, I think the most consistent approach would be to add CanCreateCommandSet and CreateCommandSet to DbProviderFactory:

public class DbProviderFactory
{
    public virtual bool CanCreateCommandSet => false;

    public virtual DbCommandSet CreateCommandSet() => null;
}
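Calling code that needs to stay provider-agnostic could then opt in along these lines (a sketch based on the shape above, assuming factory is the DbProviderFactory in use):

if (factory.CanCreateCommandSet)
{
    var commandSet = factory.CreateCommandSet();
    // ... add commands and execute them in a single roundtrip ...
}
else
{
    // Fall back to executing the commands one by one.
}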

BTW, there is a small bug in the proposal: the "Db" prefix is not used in the Create* methods on DbProviderFactory.

We also discussed that we don't want to require customers to have to resort to DbProviderFactory, so we can have a similar API directly on DbConnection. Note existing Create* methods on DbConnection follow a bit more elaborate template method pattern, and the template (protected virtual) method itself uses the "Db" prefix and is abstract:

https://github.com/dotnet/corefx/blob/7817a109400b5b22c02ae6667706cda8a7f3d168/src/System.Data.Common/src/System/Data/Common/DbConnection.cs#L70-L74

Obviously, we cannot add a new abstract method on DbConnection, so we can add something similar to what we put in DbProviderFactory, but aligned to the template method pattern for consistency, e.g.:

public class DbConnection
{
    public virtual bool CanCreateCommandSet => false;

    public DbCommandSet CreateCommandSet() => CreateDbCommandSet();

    protected virtual DbCommandSet CreateDbCommandSet() => null;
}

Though for the latter we could choose to throw instead of returning null.

To be consistent with all other APIs in ADO.NET, provider implementations should override the template method and hide the public CreateCommandSet to return a provider-specific DbCommandSet instance.

All in all, I don't think this is the most elegant API, but at the moment it seems the most pragmatic solution given the existing public surface.

roji commented

@roji I agree ADO.NET doesn't have a very developed pattern for feature detection, but it has the beginning of one.

I admit I didn't have that pattern in mind...

I don't have very strong feelings on this - I think adding CanCreateDbCommandSet() on both DbProviderFactory and DbConnection is definitely a possible solution. The (non-critical) issue I have with this is that checking capabilities is likely to be an advanced/rare feature, used only in applications or layers which have to handle multiple databases without knowledge of specific providers. In other words, my operating assumption (which may be wrong) has been that the typical application uses one database, and is coded to work against that database exactly: it contains SQL designed for that database, parameter placeholders which work on it, and possibly other things that will not necessarily port easily. In the same way, such a typical application will use the new batching API because it knows it's supported, or it will not - I don't think we expect it to check the flag (why would it need to).

If we agree on this, and capability checking is in fact restricted to advanced, multi-database applications, then it makes sense to me to concentrate those methods in a separate, metadata-dedicated place rather than adding clutter to frequently-used basic classes such as DbConnection (I'm thinking about keeping Intellisense as useful as possible). In that sense, the DataSourceInformation metadata API is doing the right thing by keeping everything out of the way, even if it uses DataTable and is ugly.

Finally, I'm trying to also think about capabilities/features beyond whether a certain Create*() would succeed or fail (see the JDBC metadata class for some examples) - I'm not sure the approach of Can*() methods scales well to that. I'm not saying we should implement everything that's there, but it's something to think about.

Having said all that, I don't have super-strong feelings on it. I will add a note to dotnet/corefx#35521 detailing the Can*() pattern.

The Create* methods on DbProviderFactory return null rather than being abstract or throwing NotSupportedException. I am not 100% sure how intentional this was. However, you could argue that any client code that needs to reason about capabilities can (and should) check for the result of the Create* method being null.

That seems... odd, and seems like it will become especially problematic with the coming C# 8 non-nullable references. If certain providers can return null to indicate that the feature isn't supported, then it seems like the Create*() method will return a nullable reference. This would force everyone to either check for null, or use the bang (!) operator to tell the compiler that no null is actually possible there - not good practice since users simply know that their specific provider supports the feature. Conversely, if the Create*() method returns a non-nullable reference, then you may get warnings for checking it for null. In other words, this seems like it will very soon become an anti-pattern. A method returning a non-nullable reference and throwing NotSupportedException seems like the right way in this sense (not to mention that it's more immediately understandable to users than a NullReferenceException).
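A small illustration of the friction, assuming the method were annotated as returning a nullable reference:

// If the API were annotated as 'public virtual DbCommandSet? CreateCommandSet() => null;':
var set = factory.CreateCommandSet();            // inferred as DbCommandSet?
set.ExecuteNonQuery();                           // warning: possible null dereference
factory.CreateCommandSet()!.ExecuteNonQuery();   // or the user suppresses with '!', even though their provider always supports it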

We could fix this for existing APIs in 3.0 and switch to throwing since we can introduce breaking changes - we'd add Can*() method as you suggest which would by default attempt to call the corresponding Create*() method and catch an exception. Let me know what you think.

BTW, there is a small bug in the proposal: the "Db" prefix is not used in the Create* methods on DbProviderFactory.

Thanks, fixed!

We also discussed that we don't want to require customers to have to resort to DbProviderFactory, so we can have a similar API directly on DbConnection. Note existing Create* methods on DbConnection follow a bit more elaborate template method pattern, and the template method itself uses the "Db" prefix and is abstract:

You're right, I did intend for DbConnection to have CreateCommandSet() but left it out by mistake. I think the more complex CreateCommand()/CreateDbCommand() pattern is a result of the old IDbConnection being implemented by DbConnection (IDbConnection.CreateCommand() returns an IDbCommand, so a separate method had to be defined to return DbCommand?).

Since I don't think we intend to update IDbConnection, we can maybe skip this and simply have a public CreateCommandSet() on DbConnection - for now I've updated the proposal with that. It's true that this isn't 100% consistent with other ADO.NET APIs, but the inconsistency isn't public so it could be OK.

I'm not sure the approach of Can*() methods scales well to that.

It certainly doesn't scale to every scenario, but it doesn't need to. It is a pragmatic solution. A metadata oriented API can be more useful for things that vary even on the same provider, e.g. depending on the database server version, the database schema, etc. But I would say having a completely different story for that is ok.

If certain providers can return null to indicate that the feature isn't supported, then it seems like the Create*() method will return a nullable reference... anti-pattern... We could fix this for existing APIs in 3.0...

Good point. Yep, let's discuss it. I am glad that you think there is a way to fix it. I believe very few providers will be impacted if the default implementation of DbProviderFactory.Create* methods starts throwing. cc @bricelam

Since I don't think we intend to update IDbConnection, we can maybe skip this and simply have a public CreateCommandSet()

I am not sure that the System.Data interfaces are the only reason. When you write client code that is provider specific, you expect the return type from this public method on XyzConnection to be the provider-specific type XyzCommandSet, so that you don't need to cast to get access to any provider-specific APIs. Perhaps providers could override the base method and hide it at the same time, but that feels weird. Having a separate protected virtual template method that is overridden seems cleaner.

CC @NickCraver @mgravell in case they want to chime in for the feature detection discussion. Probably worth looking at https://github.com/dotnet/corefx/issues/35135#issuecomment-466452344 as a starting point.

roji commented

I'm not sure the approach of Can*() methods scales well to that.

It certainly doesn't scale to every scenario, but it doesn't need to. It is a pragmatic solution.

OK, I've changed the proposal to add the Can*() pattern.

If certain providers can return null to indicate that the feature isn't supported, then it seems like the Create*() method will return a nullable reference... anti-pattern... We could fix this for existing APIs in 3.0...

Good point. Yep, let's discuss it. I am glad that you think there is a way to fix it. I believe very few providers will be impacted if the default implementation of DbProviderFactory.Create* methods starts throwing. cc @bricelam

I agree. Opened dotnet/corefx#35535 to track, we should also try to bring this to review for .NET Standard 2.1.

Since I don't think we intend to update IDbConnection, we can maybe skip this and simply have a public CreateCommandSet()

I am not sure that the System.Data interfaces are the only reason. When you write client code that is provider specific, you expect the return type from this public method on XyzConnection to be the provider-specific type XyzCommandSet, so that you don't need to cast to get access to any provider-specific APIs. Perhaps providers could override the base method and hide it at the same time, but that feels weird. Having a separate protected virtual template method that is overridden seems cleaner.

Isn't it enough for the provider to simply declare a new method CreateCommandSet() returning XyzCommandSet, with the new modifier? This way if the user has a variable of type XyzConnection they get an XyzCommandSet, and if they have a DbConnection they get DbCommandSet, no?

Isn't it enough for the provider to simply declare a new method CreateCommandSet() returning XyzCommandSet, with the new modifier? This way if the user has a variable of type XyzConnection they get an XyzCommandSet, and if they have a DbConnection they get DbCommandSet, no?

Consider this:

DbConnection conn = new XyzConnection();
Setup(conn);
var commandSet = conn.CreateCommandSet();

At this point, the type of the commandSet variable is DbCommandSet, but the actual instance should be an XyzCommandSet. In other words, the provider needs to both:

  • Override a virtual method on DbConnection so that any calls hit its own implementation that returns the right thing
  • Provide a strongly typed method that returns XyzCommandSet, usually hiding the public method on DbConnection (roughly as in the sketch below).
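A rough sketch of that shape, using the protected template-method pattern mentioned earlier (XyzConnection / XyzCommandSet are placeholders, and only the members relevant here are shown):

// Base class: public non-virtual method delegating to a protected template method.
public abstract class DbConnection
{
    public DbCommandSet CreateCommandSet() => CreateDbCommandSet();
    protected abstract DbCommandSet CreateDbCommandSet();
}

// Provider: hides the base method with a strongly-typed one, and overrides the template
// method so that code written against DbConnection still gets an XyzCommandSet instance.
public class XyzConnection : DbConnection
{
    public new XyzCommandSet CreateCommandSet() => new XyzCommandSet(this);
    protected override DbCommandSet CreateDbCommandSet() => CreateCommandSet();
}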
roji commented

@divega makes sense, I updated the proposal.

GSPP commented

If we agree on this, and capability checking is in fact restricted to advanced, multi-database applications, then it makes sense to me to concentrate those methods in a separate, metadata-dedicated place rather than adding clutter to frequently-used basic classes such as DbConnection (I'm thinking about keeping IntelliSense as useful as possible).

I think that's a good point. Having feature detection in general seems very desirable to have, though. The cost of implementing that is very low as well.

All in all, I don't think this is the most elegant API, but at the moment it seems the most pragmatic solution given the existing public surface.

I think a lot of potential was lost in ADO.NET 1.0 due to clumsy API design. We have to live with it now.

That seems... odd, and seems like it will become especially problematic with the coming C# 8 non-nullable references.

I see a lot of problems like that coming with this C# 8 feature. Whether something can be null or not is often context specific. I feel this C# feature is going to lead to a lot of warning noise in many APIs (both old and new).

roji commented

Note: reconsider providing a way to specify a CommandBehavior on a per-command basis, and not on the batch. This has been briefly discussed above, but raised again by @bgribaudo in dotnet/corefx#35592.

roji commented

@divega I've added public non-virtual Connection and Transaction properties to DbCommandSet, and made DbConnection and DbTransaction protected abstract, following the general ADO.NET pattern. Providers will be able to hide Connection/Transaction for their own provider-specific classes.
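For illustration, this mirrors the existing DbCommand.Connection / DbCommand.DbConnection pattern and would look roughly like this (only the relevant members shown):

public abstract class DbCommandSet
{
    // Public, non-virtual surface; providers hide these with strongly-typed versions.
    public DbConnection Connection
    {
        get => DbConnection;
        set => DbConnection = value;
    }
    protected abstract DbConnection DbConnection { get; set; }

    public DbTransaction Transaction
    {
        get => DbTransaction;
        set => DbTransaction = value;
    }
    protected abstract DbTransaction DbTransaction { get; set; }
}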

GSPP commented

I noticed that the command set class is enumerable:

public class DbCommandSet : IDisposable, IAsyncDisposable, IEnumerable<DbCommand>

I think this should be done differently. The command set is not a collection of commands. It aggregates a set of commands. The command set should expose a collection of commands. Inheriting from IEnumerable is a misuse of inheritance. It also makes for a bad API because it exposes LINQ stuff on the command set which is just noise. This also applies to https://github.com/dotnet/corefx/issues/35592.

I tend to agree with @GSPP on that point: a DbCommandSet is more than an ISet; it has its own behaviours and properties not related to its collection nature. I can see why you might want to make it directly enumerable, but it seems unlikely that this will be useful other than to allow implicit collection initialization.

I'm not opposed to extending DbException, but I would note that SqlException doesn't inherit from it. Is the idea behind the new property to allow identification of the command that caused the problem during execution? And if so, would you expect a specific provider to produce the original error (in my focus that would be SqlException) or DbException?

roji commented

As @Wraith2 wrote, the IEnumerable<DbCommand> was indeed added to allow for collection initialization, which seems to make a lot of sense in the context of DbCommandSet (as proposed by @bgrainger above). Without this, initialization is going to be quite clunky for many cases.

The command set is not a collection of commands. It aggregates a set of commands. The command set should expose a collection of commands. Inheriting from IEnumerable is a misuse of inheritance. It also makes for a bad API because it exposes LINQ stuff on the command set which is just noise.

Does the fact that DbCommandSet has behavior and properties beyond the collection aspect preclude it from also being a collection? Containing commands does seem to be its primary purpose, and I can see many classes implementing IEnumerable<T> in the .NET API docs which could be said to be aggregates of their elements, with additional functionality added on top.

What exact harm do you see in this, and what value would there be in forcing users to go through a property on DbCommandSet in order to access the commands?

roji commented

@Wraith2, regarding the exception:

I'm not opposed to extending DbException but i would note that SqlException doesn't inherit from it.

Yeah, that's unfortunate. SqlException could still add it independently though.

Is the idea behind the new property to allow identification of the command that caused the problem during execution?

That's right.

and if so would you expect a specific provider to produce the original error (in my focus that would be SqlException) or DbException?

I'm not sure I understand... Are you referring to the SqlException/SqlError distinction for SqlClient? In any case, I'd expect specific providers to simply reference the command which caused the problem in the same DbException-derived exception class that they likely already produce to signal errors.

Assume an SqlClient implementation of DbCommandSet and add multiple commands to it. Run the command set, and one of the commands fails. The individual command execution will throw an SqlException; should this be kept, or changed/wrapped somehow to become a DbException supporting the new Command property?

What exact harm do you see in this, and what value would there be in forcing users to go through a property on DbCommandSet in order to access the commands?

I don't see any harm, but I'd see the DbCommandSet public surface as the major API, with the IEnumerable being there as a convenience; as such, it might be worth considering having a specific Commands collection. At the moment there's no way to tell if a command is already present in the set, or to get a count. None of these are blocking issues - I'm just thinking through the surface I see proposed.

roji commented

Assume an SqlClient implementation of DbCommandSet and add multiple commands to it. Run the command set, and one of the commands fails. The individual command execution will throw an SqlException; should this be kept, or changed/wrapped somehow to become a DbException supporting the new Command property?

I'd expect SqlClient to maintain its current behavior, in the sense that it would continue throwing SqlException which would optionally contain a new property identical to the one we're adding to DbException. If SqlClient doesn't currently ever throw a DbException, I don't see why that should change...

roji commented

I don't see any harm, but I'd see the DbCommandSet public surface as the major API, with the IEnumerable being there as a convenience; as such, it might be worth considering having a specific Commands collection.

What would you use as the type of the Commands property on DbCommandSet?

At the moment there's no way to tell if a command is already present in the set, or count. None of these are blocking issues i'm just thinking through the surface i see proposed.

I assume that means you'd want the proposed Commands property to be of type IList<DbCommand>? Note that it's still possible to get a count or to tell if a command is in the set by accessing the IEnumerable interface (e.g. cmdSet.Count(), cmdSet.Any(c => c == something)), although that is indeed not efficient or ideal.

Note that I'm not against the idea of introducing Commands instead of implementing IEnumerable<DbCommand> - am just looking for more info/suggestions.

It also makes for a bad API because it exposes LINQ stuff on the command set which is just noise.

Having a lot of LINQ extension methods show up in IntelliSense would be unhelpful noise.

I wrote earlier (emphasis added):

I would lean towards keeping your proposed API, not implementing IEnumerable, and giving up C# collection initialisation until it becomes clear that that would be a high-value addition for typical users of the API.

If most of the usage actually ends up being similar to this (database-independent example), then supporting collection initialisation is pointless:

var commandSet = connection.CreateCommandSet();
var command = connection.CreateCommand();
command.CommandText = sql[0];
commandSet.Add(command);
command = connection.CreateCommand();
command.CommandText = sql[1];
commandSet.Add(command);

The collection initialisation alternative would be something like:

var commandSet = new SomeDbCommandSet
{
    new SomeDbCommand
    {
        CommandText = sql[0],
    },
    new SomeDbCommand
    {
        CommandText = sql[1],
    },
};
commandSet.Connection = connection;

(Possibly shorter depending on the available SomeDbCommandSet and SomeDbCommand constructors.)

It's a little less dense, but it doesn't save that much typing. Additionally, it prevents Connection and Transaction being initialised with property initialisation syntax. I'm really not sure if overall it's a win.

roji commented

@bgrainger

If most of the usage actually ends up being similar to this (database-independent example), then supporting collection initialisation is pointless

My general assumption (which may be wrong) is that database-independent code is very rare. Our experience seems to show that most applications (as opposed to O/RMs and other similar layers) code directly against their provider's specific classes, rather than via the factory methods.

However, there seems to be a consensus here against having DbCommandSet implement IEnumerable<DbCommand> - and I do agree that there are issues with that. Would you all rather:

  1. Have a Commands property on DbCommandSet? If so, what would be its exact type (IEnumerable<DbCommand>, IList<DbCommand>...?). Note that it would be possible to use a collection initializer on this property, possibly giving the best of all worlds.
  2. Not have a Commands property, and simply expose some traversal/addition methods on DbCommandSet without any interface (it could be the same API as currently proposed).

If we do option 1, then at the risk of bloating the API (and implementation) substantially, I wonder if for familiarity and consistency of usage, DbCommandBatch's relationship with DbCommand should be similar to DbCommand's with DbParameter (via DbParameterCollection)?

(BTW, I'm renaming DbCommandSet to DbCommandBatch here because a) both are used interchangeably in the OP, and b) I'm introducing DbCommandCollection so making it clear which type is used for batch-execution is important.)

public abstract class DbCommandBatch
{
    public DbCommandCollection Commands => DbCommandCollection;
    protected abstract DbCommandCollection DbCommandCollection { get; }

    // other methods as above
}

public abstract class DbCommandCollection : IList<DbCommand>
{
    // mostly public abstract methods implementing IList<T>
    public abstract void Add(DbCommand command);
    // not totally thought through yet, do we need 'protected GetCommand(int)' and make `DbCommand IList<DbCommand>.this[int]` explicitly implemented?
    // ...

    // some virtual implementations
    public virtual bool IsReadOnly => false;
}

Then each concrete provider implements subclasses and exposes them via new properties:

public class SomeDbCommandBatch : DbCommandBatch
{
    public new SomeDbCommandCollection Commands { get; private set; }
    protected override DbCommandCollection DbCommandCollection => Commands;
}

public class SomeDbCommandCollection : DbCommandCollection
{
    public override void Add(DbCommand command);
    public new void Add(SomeDbCommand command);
    public new SomeDbCommand this[int index];
    public override int Count { get; }
    // ... mix of new & override methods, should be straightforward if you've implemented SomeDbParameterCollection already
}

Thus, clients programming "directly against their provider's specific classes" get strongly-typed and provider-specific APIs when adding or retrieving commands from a batch, DbCommandBatch retains property and collection initialisation syntax, and implementers get to write a lot more code 😃.

roji commented

If we do option 1, then at the risk of bloating the API (and implementation) substantially, I wonder if for familiarity and consistency of usage, DbCommandBatch's relationship with DbCommand should be similar to DbCommand's with DbParameter (via DbParameterCollection)?

It's an interesting analogy... although I'm not sure DbParameterCollection is an exemplary model to follow :) One fundamental difference is that DbParameterCollection was designed to index its contained parameters both by position and by name, since different database systems support different parameter schemes. In our case, there doesn't seem to be a need for anything beyond an ordered collection that would specify the commands to be sent. But you're right that there's a similarity, namely in the amount of code that DbParameterCollection implementors have to write :)

Note that in your example above, it's not possible to use a collection initializer on the SomeDbCommandBatch.Commands property since it isn't settable (just like DbCommand.Parameters, by the way). It seems possible to have a public setter for that property, as long as the super class's property is read-only - otherwise one can inject any subclass of DbCommandCollection, not just the provider's.

But thinking about all this again, I'm not very comfortable with the API allowing external instances to be injected into the Commands property - this may force providers to track which instance belongs where. ADO.NET contains several such loose relationships, between connections/commands/parameters, which have proven to be a liability - it seems healthier to not allow setting the Commands property. One advantage of having DbCommandSet itself be the collection (by implementing IEnumerable<DbCommand>) is to avoid this kind of mess (although I acknowledge there are downsides).

If the above is accepted and the Commands property remains read-only, then I don't really see any value in having an additional DbCommandCollection type - it seems like it may as well just be an IList<DbCommand>. An extra type is an extra confusing concept, and at the moment I don't really see what else it would add. We could still follow the pattern of a public non-virtual Commands property (hidden in subclasses) alongside a protected virtual DbCommands property.

Note that I expect most implementations to simply have a List<DbCommand> internally, making the implementation of an IList<DbCommand> Commands property trivial. Of course, if an implementation wants to wrap something else it would have to do more work.

Let me know what you think and whether I've missed anything. I'm frankly not completely sure what I prefer at the moment.

(BTW, I'm renaming DbCommandSet to DbCommandBatch here because a) both are used interchangeably in the OP, and b) I'm introducing DbCommandCollection so making it clear which type is used for batch-execution is important.)

I've been going back and forth myself on the naming, between DbCommandSet, DbCommandBatch or even just DbBatch (maybe it's time for some conciseness in this API...). DbCommandSet still feels a bit more correct and less committed to exactly how things are executed, but I'm definitely interested in other opinions.

Note that in your example above, it's not possible to use a collection initializer on the SomeDbCommandBatch.Commands property since it isn't settable (just like DbCommand.Parameters, by the way).

That isn't true. SqlCommand.Parameters isn't settable; the collection initialization syntax that looks as though it's assigning it is actually getting the property value then calling .Add on it: https://sharplab.io/#v2:C4LghgzgtgPgAgBgARwIwG4CwAoRLUB0AImMGAQMoCOANgMI0CWApgHbBbY5wDMKATEjpIA3jiQSUfOABYkAWQAUASlHjJGgG5gATkgDGAeyhQwrACZIAvElbMA7kmr1jpi+o0Sx2T76GHWO31gRgDrWwcnWjoAoJCAlQAaDz8JAAVdMChmYGYdCHCU1O9U1LtHZwydLJy8pKLSiXKomiqa3J16n0akAF9k7t9ezg1enF6gA

(Apparently it calls SqlCommand.get_Parameters twice, which was unexpected to me.)
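(In other words, a nested initializer such as new SqlCommand { Parameters = { p1, p2 } } is lowered to roughly the following - a sketch assuming p1 and p2 are existing SqlParameter instances:)

// Approximate compiler lowering of:
//   var cmd = new SqlCommand { Parameters = { p1, p2 } };
var temp = new SqlCommand();
temp.Parameters.Add(p1);   // the Parameters getter is invoked for each Add
temp.Parameters.Add(p2);
var cmd = temp;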

roji commented

You're right.

However, do you see any value in providing DbCommandCollection rather than using a simple IList<DbCommand>?

DbCommandCollection would be the .NET 1.0 pattern; given that we now have generics, I think Collection<DbCommand> or List<DbCommand> would be best. I'd rather avoid IList because it prevents extensions and some useful methods from being available.

However, do you see any value in providing DbCommandCollection rather than using a simple IList<DbCommand>?

I guess not.

If it were important for foreach (var command in batch.Commands) to return the most-derived type (as opposed to using the implicit cast feature of foreach) then a derived DbCommandBatch type might want to offer public new IList<SomeDbCommand> Commands { get; }. It might then need to build a custom implementation that implements both IList<DbCommand> and IList<SomeDbCommand> so it can return the same object for the base-class property and its own property (since the interface isn't covariant). This doesn't seem like a major implementation burden.

I think Collection<DbCommand> or List<DbCommand> would be best. I'd rather avoid IList

Assuming you mean "avoid IList<T>" I disagree because if DbCommandBatch.Commands is List<DbCommand>, there's no efficient way to implement both that and public new List<SomeDbCommand> Commands { get; } in a derived type.

roji commented

@bgrainger

If it were important for foreach (var command in batch.Commands) to return the most-derived type (as opposed to using the implicit cast feature of foreach) then a derived DbCommandBatch type might want to offer public new IList<SomeDbCommand> Commands { get; }. It might then need to build a custom implementation that implements both IList<DbCommand> and IList<SomeDbCommand> so it can return the same object for the base-class property and its own property (since the interface isn't covariant). This doesn't seem like a major implementation burden.

This seems like a necessity in any case, even without the foreach-on-most-derived-type feature - following the pattern, we'd need to have a protected abstract IList<DbCommand> DbCommands on DbCommandSet, which would be called by the non-virtual Commands property. Assuming a provider implementation will internally contain a List<SomeDbCommand>, a custom implementation/adapter would be needed to access that via the virtual DbCommands property (because as you say IList<T> isn't covariant).

The alternative here is to go back to IEnumerable<DbCommand>, which is covariant and would place less of an implementation burden on providers (this is one of the reasons it was first proposed). But I admit that at this point it makes sense to me to burden provider writers a bit more in order to provide more command management features to users.
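To make that burden concrete, here is a minimal sketch of such an adapter collection. It assumes a hypothetical provider command type SomeDbCommand deriving from DbCommand; a single List<SomeDbCommand> backs both interface views:

using System.Collections;
using System.Collections.Generic;
using System.Data.Common;

// SomeDbCommand is assumed to be the provider's DbCommand subclass (hypothetical).
public class SomeDbCommandCollection : IList<SomeDbCommand>, IList<DbCommand>
{
    private readonly List<SomeDbCommand> _commands = new List<SomeDbCommand>();

    // Strongly-typed surface, exposed by the provider's `new Commands` property.
    public SomeDbCommand this[int index] { get => _commands[index]; set => _commands[index] = value; }
    public int Count => _commands.Count;
    public bool IsReadOnly => false;
    public void Add(SomeDbCommand command) => _commands.Add(command);
    public void Clear() => _commands.Clear();
    public bool Contains(SomeDbCommand command) => _commands.Contains(command);
    public void CopyTo(SomeDbCommand[] array, int index) => _commands.CopyTo(array, index);
    public int IndexOf(SomeDbCommand command) => _commands.IndexOf(command);
    public void Insert(int index, SomeDbCommand command) => _commands.Insert(index, command);
    public bool Remove(SomeDbCommand command) => _commands.Remove(command);
    public void RemoveAt(int index) => _commands.RemoveAt(index);
    public IEnumerator<SomeDbCommand> GetEnumerator() => _commands.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    // Weakly-typed surface for the base-class Commands property, delegating and downcasting.
    DbCommand IList<DbCommand>.this[int index]
    {
        get => this[index];
        set => this[index] = (SomeDbCommand)value;
    }
    void ICollection<DbCommand>.Add(DbCommand command) => Add((SomeDbCommand)command);
    bool ICollection<DbCommand>.Contains(DbCommand command) => command is SomeDbCommand c && Contains(c);
    bool ICollection<DbCommand>.Remove(DbCommand command) => command is SomeDbCommand c && Remove(c);
    int IList<DbCommand>.IndexOf(DbCommand command) => command is SomeDbCommand c ? IndexOf(c) : -1;
    void IList<DbCommand>.Insert(int index, DbCommand command) => Insert(index, (SomeDbCommand)command);
    void ICollection<DbCommand>.CopyTo(DbCommand[] array, int index)
    {
        for (var i = 0; i < _commands.Count; i++)
            array[index + i] = _commands[i];
    }
    IEnumerator<DbCommand> IEnumerable<DbCommand>.GetEnumerator() => GetEnumerator();
}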

@Wraith2

DbCommandCollection would be the .NET 1.0 pattern; given that we now have generics, I think Collection<DbCommand> or List<DbCommand> would be best. I'd rather avoid IList because it prevents extensions and some useful methods from being available.

I agree with @bgrainger on this, the need to expose the same command list as both DbCommand and SomeDbCommand means we need to specify an interface here rather than a concrete implementation. It also seems in general better to not force a specific implementation on providers, if for whatever reasons something other than List<T> is more appropriate in some scenario.

public class DbException
{
    // Naming: possibly TriggeringCommand for clarity? See below for details
    public virtual DbCommand Command { get; }
}

Since DbCommand is IDisposable, shouldn't DbException be made IDisposable and in its implementation of that interface call Dispose() on the Command?

Taking a step back, is it okay for an exception to be carrying around an IDisposable? I don't usually think of exceptions carrying properties that need disposing, but maybe that's because the exceptions I've worked with haven't done that. Still, it seems strange... :-)

Above, the idea of calling a set of commands a batch is mentioned (e.g. "I'm renaming DbCommandSet to DbCommandBatch…").

For what it's worth, this introduces some name overloading in the SQL Server world. With SQL Server, a batch is a single unit of SQL text (which may consist of one or more SQL statements). Ref1, Ref2 -- see in particular Explicit Batches.

So, DbCommandBatch could be interpreted as a synonym for DbCommand because a single DbCommand maps to/corresponds with a database server batch.

How about the exception holding the index of the failed command in the list?

roji commented

Since DbCommand is IDisposable, shouldn't DbException be made IDisposable and in its implementation of that interface call Dispose() on the Command?

Taking a step back, is it okay for an exception to be carrying around an IDisposable? I don't usually think of exceptions carrying properties that need disposing, but maybe that's because the exceptions I've worked with haven't done that. Still, it seems strange... :-)

Hmm, I'm not aware of a particular problem with an exception referencing a disposable... Specifics are going to depend on exactly what the user's code looks like. The most basic pattern seems to be something like:

using (var cmd = new SomeCommandSet { .... })
{
    try
    {
        cmd.ExecuteNonQuery();
    }
    catch (DbException e)
    {
        // do something with e.Command
    }
}

If the user code happens to dispose the command as the DbException bubbles up, then yes, accessing the command property may trigger an ObjectDisposedException. But that doesn't seem to be problematic in any general sense.

@Wraith2 @bgribaudo can you provide an example of a general lifetime issue this triggers?

How about the exception holding the index of the failed command in the list?

How would that be better than just referencing the command?

roji commented

One additional question this does raise is whether disposing the command set should also dispose its commands - at this point I really think it should not. For example, one can easily imagine mixing and matching the same commands in different batches (they may even be prepared).

Above, the idea of calling a set of commands a batch is mentioned (e.g. "I'm renaming DbCommandSet to DbCommandBatch…").

I suggested that because I was also proposing that a new DbCommandCollection class be introduced. That was rejected, so I'm fine keeping the original DbCommandSet name.

roji commented

I suggested that because I was also proposing that a new DbCommandCollection class be introduced. That was rejected, so I'm fine keeping the original DbCommandSet name.

FWIW nothing was/is rejected :) If you still think it brings something to the table we should definitely discuss.

@bgribaudo thanks for the SQL Server terminology context, I think the term "batch" has already been the source of some confusion in our conversation. I guess it's indeed better to keep the neutral DbCommandSet term.

Consider:

try
{
    using (var connection = new SqlConnection("connectionString"))
    using (var commands = new SqlCommandSet(connection))
    {
        commands.Add(new SqlCommand("bad query here"));
        ...
        commands.Execute();
    }
}
catch (DbException dbex)
{
    logger.Log(dbex.Command.CommandText); // command is disposed, can we access anything?
}

If you don't dispose, then the DbException holds a reference to a live command, which refers to a connection and possibly a transaction, all of which are disposable and possibly scarce resources that should be cleaned up. In SqlClient that'll keep the connection from returning to the pool, which could end up with pool exhaustion or a DoS on the server.

That was rejected

Read: I was convinced it was unnecessary and withdrew the proposal.

Better? 🙂

roji commented

Much better 😈

One additional question this does raise is whether disposing the command set should also dispose its commands - at this point I really think it should not.

I hadn't thought about it much, but I had assumed the DbCommandSet would "own" the DbCommands added to it. I can see the arguments either way, though.

Not owning would make the "simple" API usage (i.e., new up a DbCommandSet, add new DbCommands to it with collection initialisation) very complex when it comes to disposing. Only a complete stickler would bother tracking the commands separately to dispose each one. (Fortunately, I think disposing DbCommand is "optional" for most providers, in the sense that it doesn't actually release native objects, so failing to dispose one doesn't put pressure on the finalizer. (EDIT: @Wraith2 says it does matter for SqlCommand.) However, there are teams with coding guidelines that disposable objects must be disposed, and memory leak tools that track that.)

You can avoid this by creating the commands first, separately (here using C# 8 using declarations):

using var command1 = new SomeDbCommand(sql1) { Parameters = { ... } };
using var command2 = new SomeDbCommand(sql2) { Parameters = { ... } };
using var set = new SomeDbCommandSet { Commands = { command1, command2 } };
using var reader = await set.ExecuteReaderAsync();
// ...

However, that feels like a lot of overhead when (in most cases?) the lifetime of commands is tied to the lifetime of the command set.

If you think that advanced scenarios will share and reuse commands across multiple command sets (and the commands will have lifetimes exceeding the command set), should there be an advanced option to control disposal, similar to leaveOpen in the StreamWriter constructor? (Maybe a DbCommandSet.DisposeCommands property, which is true by default?)
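A rough sketch of what that could look like (DisposeCommands is purely hypothetical at this point, and the Commands property is simplified to a plain list):

using System;
using System.Collections.Generic;
using System.Data.Common;

public abstract class DbCommandSet : IDisposable
{
    // Commands that will be executed as a batch.
    public IList<DbCommand> Commands { get; } = new List<DbCommand>();

    // If true (the default), disposing the set also disposes the commands it contains.
    // Callers sharing commands across sets would switch this off, similar in spirit to
    // StreamWriter's leaveOpen parameter (inverted).
    public bool DisposeCommands { get; set; } = true;

    public virtual void Dispose()
    {
        if (!DisposeCommands)
            return;

        foreach (var command in Commands)
            command.Dispose();
    }
}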

roji commented

@Wraith2 I think that it's OK for your code above to throw ObjectDisposedException - you've explicitly disposed the command involved. User code could be refactored to avoid this by not disposing the command before catching the exception (although it may not be simple).

Regarding cleanup/resource exhaustion, I'm not sure I see the danger. If you explicitly dispose your commands/connections (as you should be), then everything should be fine anyway (as the connection is returned to the pool immediately). If you don't dispose, then it's true the connection may live a bit longer, until the Sql/DbException instance referencing it is garbage-collected as well - but that seems reasonable.

Am I missing something, can you provide more info? Do you have an alternate suggestion, or are you trying to say that it's better to remove the reference from DbException altogether?

RE: DbCommand ownership; for implementation guidance, it would be nice to specify concretely what the expectations are around DbCommandSet using/mutating DbCommand objects.

E.g., (as above) does DbCommandSet.Dispose call Dispose on each command? Does/can DbCommandSet.Prepare call Prepare on each command? What is the result of this code?

var command = new SomeDbCommand(...);
var commandSet = new SomeDbCommandSet { Commands = { command } };
commandSet.Prepare();
Console.WriteLine(commandSet.Commands[0].IsPrepared); // true or false?

That is, is the command set expected to (or allowed to) mutate the commands passed to it, or should (or must) it simply take them as informative, and use their CommandText and Parameters to build whatever data is necessary internally to effectively prepare/execute those commands as a batch?