dotnet/runtime

Broken System.Net.Http 4.1.1-4.3.0 post-mortem

karelz opened this issue · 41 comments

Issue #18280 caused a lot of problems for a long time. The overall road towards a solution (a fix in the new NuGet package 4.3.1) was less than ideal.
Let's track the post-mortem here (as initiated in https://github.com/dotnet/corefx/issues/11100#issuecomment-281827797):

High-level plan to cover:

  1. How did the issue slip through into release?
    • How to prevent such situations in the future?
  2. Why did it take 6 months to fix?
    • Why wasn't it treated/communicated/recognized as a high-impact issue earlier?
    • How to recognize and react to high-impact issues earlier in the future?
  3. Other concerns (e.g. overall communication)

See the writeup in https://github.com/dotnet/corefx/issues/17522#issuecomment-338418610

So it has been a month (again) and I still haven't done it. Sigh. I am sorry! There are just so many important things to do.
I am not trying to dodge it; it will happen. After the 2.0 ZBB (May 10 - dotnet/corefx#17619) at the latest - driving 2.0 and all the other fires takes quite a lot of my time lately :(

While I understand the competing priorities, I am concerned that you'll "lose the trail" in figuring out what went wrong. People will move on and forget details. Do you know yet who the key people involved in the efforts that created the situation were, or is this still in the pre-investigation stage?

It is not about people (I won't point fingers, it is not productive). It is about decisions and motivations. The first decisions that led to the unfortunate state happened before June 2016, the last decisions (affecting our reaction time) around October 2016. So waiting a couple more weeks won't really make a difference in tracking it down.

That wasn't what I was implying. I meant do you have the right people to talk to or do you have to go fishing and risk ending up in a situation where you can't get the answers because you've waited too long.

My concern is that "a couple more weeks" will turn into "a few more months". It took a long time to fix the issue and there appears to be even less motivation to figure out what went wrong.

I have the right people. I talked to them (a few months ago). I have to sit down, write it down (in a way that avoids finger pointing, just stating the facts), and get it reviewed. Fill in additional gaps in the story I might have missed ... a solid day or two of work.

On the technical side, we have already taken steps to avoid such problems in the future (esp. for 2.0):

  1. We're talking more about these things.
  2. We added Desktop testing into our signoff matrix.
  3. We reviewed all 'problematic' Desktop-only packages (#16823) and we are assessing what to do about them in 2.0 -- #20502 and #20074 (note this started ~3 months ago)
    • Note: We will likely back out the changes in #20074, because it brings further unpleasant side-effects in the end-to-end story (I still have to get educated a bit more about it) -- so we will still run with a small risk that we're missing some details in the whole story and it may turn bad, but we at least have an immediate fix ready from #20074 (its current state).
    • This is a result of dozens of discussions with the experts. It is a hard problem (every time I understand a bit more, and I have lost hope that I will understand all the teeny-tiny details any time soon). Plenty of tradeoffs to make; it is not a black-and-white problem and not even a one-dimensional problem.
  4. We also reviewed and did our best for the related end-to-end story with bindingRedirects: #17770 and dotnet/corefx#18300 (I still need to follow up on where exactly we landed and which scenarios are now fixed / still broken)

1 and 2 are the most important - having recognised that corefx packages are used on netfx and need to be tested on netfx will go a long way towards preventing this class of problem. When the answer to "how did this make it to production, did anybody test this?" is "no", you're in constant danger!

The approach in dotnet/corefx#18300 is interesting, but I hope it doesn't become a license to require more and more binding redirects. There are still environments where you simply cannot specify them, or at least where no machinery exists to generate them. Consider msbuild tasks, non-exe unit tests, plugins loaded with Assembly.LoadFrom...

I hope you get a task force standing by to pounce on such issues with immediate response time if they keep occurring. Right now I just hit Update in NuGet and suddenly I start getting "Type System.X.Y.Z does not match constraint ABC" style exceptions that I suspect are exactly because of some .NET Core cancer I do not even care about ruining my .NET Framework app.

@sandersaares .NET Core "cancer" won't affect .NET Framework apps. Whatever we shipped (and broke) on .NET Framework as NuGet packages was because we wanted to deliver additional value to .NET Framework developers.
Our main approach from now on is "don't ship anything replacing .NET Framework via NuGet, unless we really, really have to". That will take care of that.

It's good to see the lessons learned from this package dependency complexity, and the focus to fix this with .NET Standard 2.0.

However, what about libraries that target versions before .NET 4.6.1 (.NET Standard 2 compatible baseline), and may not be able to switch to the .NET Standard 2 target?

This is what originally caused this issue: some libraries target the framework version, and other libraries then target newer Out-of-Band (OOB) NuGet variants of the same libraries. The combination of the two in an app requires binding redirects, where the consumer (developer) has to make the trade-off between either the lower or upper version.

Isn't there a limbo for libraries that use any OOB NuGet that isn't targeting .NET Standard 2 yet? And don't those libraries cause the same binding redirect issues in apps with framework targets?

So:

  • If I create a .NET 4.5 app, I'd have to ensure that all NuGet packages do not use an OOB NuGet dependency (like System.Net.Http). Otherwise I have to deal with binding redirects, which is the root of all agony.
  • If I create a 4.6.1 app, I'd have to ensure that all NuGet packages target at least .NET Standard 2, to avoid dealing with binding redirects.
  • If I create a .NET Core 2.0 app, I'd have to ensure that all packages I depend on target at least .NET Standard 2.

Correct?

the focus to fix this with .NET Standard 2.0

The focus is to fix it in all NuGet packages (not scoped to .NET Standard 2.0).

... some libraries target the framework version, and other libraries then target newer Out-of-Band (OOB) NuGet variants of the same libraries. The combination of the two in an app requires binding redirects, where the consumer (developer) has to make the trade-off between either the lower or upper version.

That is correct and it will never go away until we either stop shipping those OOBs entirely (my original plan in #20502 and #20074 - which we are most likely going to back out), or until we change the .NET Framework loader to follow the .NET Core loader policy of upgrading the version to the latest available without any bindingRedirects (which is being considered, but is also very tricky to do - the code (Fusion) is extremely challenging, and it is easy to break other things with any change to it, or to make it work only in some scenarios).

If you create a .NET 4.5 or 4.6.1 app and your dependencies target 2 different versions of the same package, you will have to make a choice - upgrade to the latest or downgrade to the lowest - via bindingRedirects. dotnet/corefx#18300 will help, but as @gulbanana points out, it doesn't solve all scenarios.
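
For illustration, a "unify on the latest" redirect in the app's App.config/Web.config looks roughly like the sketch below (the version numbers here are placeholders; the actual assembly versions depend on which packages end up in your app):

    <configuration>
      <runtime>
        <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
          <dependentAssembly>
            <assemblyIdentity name="System.Net.Http" publicKeyToken="b03f5f7f11d50a3a" culture="neutral" />
            <!-- redirect every older assembly version to the one actually deployed with the app -->
            <bindingRedirect oldVersion="0.0.0.0-4.1.1.1" newVersion="4.1.1.1" />
          </dependentAssembly>
        </assemblyBinding>
      </runtime>
    </configuration>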

We do not plan to ship more (replacement) NuGet packages except the 2 we ship today.

If you create a .NET Core 2.0 app, then you don't have to ensure anything - things will just work out for you. Of course, unless you start referencing (indirectly) packages from higher (2.1+) .NET Core versions while trying to force them to run on .NET Core 2.0; then you might need some bindingRedirects (or whatever the alternative in Core is).

Update: Ignore this comment (dumb oversight)

Were the issues resolved? I'm still experiencing very weird issues with HttpClient, like doing PostAsync and it does a "GET" call...

I'm targeting .NET Core 1.1; the library that wraps HttpClient is .NET Standard 1.3.

Yes, all known problems were resolved. If you see any new issues, please file a new bug with a description of what happens and when. Thank you!

@karelz Guess what I would like to know ;-) Hint: It's been 4 weeks since last update.

@jahmai know what?
Re: Hint: Yep, understood. Should happen before 2.0 ships. After the Ask mode & bug driving madness (it takes a LOT of my time).

I'm a little confused here, please be patient with me if I'm posting in the wrong location. I created a new netstandard2.0 project using bash on a mac:

dotnet new classlib

Then I added the IdentityModel nuget package:

dotnet add package IdentityModel

In the default Class1.cs I instantiate HttpClient:

    using System;
    using System.Net.Http;
    namespace Test
    {
        public class Class1
        {
            HttpClient client = new HttpClient();
            public Class1() {}
        }
    }

Now when I run dotnet build I get the following error:

Class1.cs(7,13): error CS0433: The type 'HttpClient' exists in both
'System.Net.Http, Version=4.1.1.1, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a' and
'netstandard, Version=2.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'

The beginning of my project.assets.json looks like:

    {
      "version": 2,
      "targets": {
        ".NETStandard,Version=v2.0": {
          "IdentityModel/2.8.1": {
            "type": "package",
            "dependencies": {
              "NETStandard.Library": "1.6.1",
              "Newtonsoft.Json": "9.0.1",
              "System.Net.Http": "4.3.2",
              "System.Security.Claims": "4.3.0",
              "System.Security.Cryptography.Algorithms": "4.3.0",
              "System.Security.Cryptography.X509Certificates": "4.3.0",
              "System.ValueTuple": "4.3.1"
            },
            "compile": {
              "lib/netstandard1.4/IdentityModel.dll": {}
            },
            "runtime": {
              "lib/netstandard1.4/IdentityModel.dll": {}
            }
          },

The IdentityModel package depends on System.Net.Http >= 4.3.2; I'm not sure where version 4.1.1.1 is coming from.

sadly the assembly versions do NOT correspond to package versions

@daveclarke fyi this is a bug in the netstandard2.0 conflict resolution that will be fixed for preview2: dotnet/standard#372, dotnet/sdk#1313
Maybe follow up on one of these repos if you still have this problem with preview2 bits.

So is this issue considered resolved? Back in April it was a "couple more weeks". I was concerned it would turn into a "couple more months". Well. It's July.

It is still on my personal backlog. Pretty high up now. As I said earlier:

Should happen before 2.0 ships. After the Ask mode & bug driving madness (it takes a LOT of my time).

Happy 1-year anniversary of #18280!

Kesmy commented

This just broke a project I'm working on, again, and took quite a lot of searching to find, again. Where's the postmortem?

The issue is also apparently still not fixed; I've upgraded a project to the latest (4.3.2), and still have to manually adjust the bindingRedirect from 4.1.1.1 to 4.0.0.0.

In our case we've seen it because of two problems:

  1. When the host project imports the Entity Framework package (EntityFramework 6.1.3), which is needed for EF related App.config sections. I believe that package pulls in an older version of Newtonsoft.Json which has a framework dependency on System.Runtime.Serialization.Primitives (related?). That framework dependency seems to be incompatible with the OOB System.Net.Http NuGet package.
  2. When we fetch data from Azure KeyVault (Microsoft.Azure.KeyVault.Core 2.0.4) in a child project, its HTTP connections will fail at runtime. Similar to the above, the host project seems to have loaded the framework version of System.Runtime.Serialization.Primitives and not the OOB version that is needed for the KeyVault package.

To solve these problems, in our host project (e.g. console app, Windows service, web site) we explicitly include the following NuGet packages (even though they are used in child projects, and there is no direct use of them in the host project), to make sure the binding redirects are in place so that the correct versions are loaded on startup (see the sketch after the list):

  • Newtonsoft.Json 9.0.1
  • System.Net.Http 4.3.2
  • System.Runtime.Serialization.Primitives 4.3.0

That fixes the problem for us. Hope it helps.
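
As a rough sketch of the approach above (assuming a classic packages.config project; the targetFramework values are just examples), the host project ends up listing the three packages explicitly, so NuGet generates the binding redirects in the host's config:

    <?xml version="1.0" encoding="utf-8"?>
    <packages>
      <!-- referenced directly in the host project only so that the binding redirects are generated there -->
      <package id="Newtonsoft.Json" version="9.0.1" targetFramework="net461" />
      <package id="System.Net.Http" version="4.3.2" targetFramework="net461" />
      <package id="System.Runtime.Serialization.Primitives" version="4.3.0" targetFramework="net461" />
    </packages>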

Kesmy commented

@bgever Unfortunately, we're already using Newtonsoft.Json 10.0.3, and have no dependency on Primitives (well, we do, on 4.3.0, but not at the point where System.Net.Http fails to load), so this specific case isn't valid for us.

That might help me debug, at least. If I can track down exactly what's not loading, I'll update here.

Thanks, regardless.

To save anyone else the hours of trouble should they find this...

Had to do two new things to get it working:

  1. Set a privatePath in the bindingRedirects section, because resolution was ignoring /bin and still not finding the assembly (see the sketch after this list; maybe this is unrelated, but it has never been necessary before in this project).
  2. Set "Load User Profile" to true for the application pools, as System.Net.Http couldn't be cached, which is also brand-new behaviour for this app.

@karelz Status update please?

Thanks to @Kesmy for indirectly posting a fix that actually works!
@karelz it's time to get everyone in a room or on Skype and get this fixed! At this point this is not just a development issue, it is also a management issue. Get the execs in the room and get the resources you need to get this done.

    <dependentAssembly>
        <assemblyIdentity name="System.Net.Http" publicKeyToken="b03f5f7f11d50a3a" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-4.1.1.2" newVersion="4.0.0.0" />
    </dependentAssembly>

Edit:
I see @daveclarke has posted steps to reproduce and @dasMulli has indicated there will be a fix in preview2.
My scenario involves .NET Framework. Here are steps to reproduce:

Create a .NET Framework 4.7 class library project (in my case it is my test project).
Create a second .NET Framework 4.7 class library project. Add the NuGet package System.Net.Http version 4.3.3 -OR- add a NuGet package that depends on the aforementioned. Add a bit of code to new up an HttpClient.
Reference the second class lib from the first and attempt to run the second class lib's code.

You need a Meta-Post-Mortem now :(

I'm giving @karelz the benefit of the doubt and assuming he is on vacation, because the alternative would be extremely disappointing.

Just lost 2 full days of my team playing around with packages, updating NuGet for the whole project, and going back again to the old version in hopes that we could find a stable solution that doesn't involve custom VS setups or weird assembly redirects. I insisted to the team that we trust Microsoft and the VS team, that it must be something we were doing wrong. Very angry, sad and disappointed about this issue that, from what I understand, has been around for over a year.

Hi @karelz,

For clarity, here are the topics I'd like addressed by the post-mortem:

  1. We need clear instructions on how to mitigate this problem in various scenarios.

Even though the problem was resolved some time ago, there are still customers experiencing the symptoms because there isn't enough clear guidance on how to resolve it.
Going by reports in this issue alone, there are clearly some scenarios where just updating the System.Net.Http package and changing the binding redirect isn't enough.
Someone at Microsoft needs to put together a list of "if this is your symptom, here's how you fix it" steps.

  2. Was this problem ever considered a risk during planning or was it overlooked?

There were obviously some big decisions made about delivering .NET Framework package updates through NuGet.
On the surface this seems to have some great benefits, including .NET versioning isolation from other installed applications and accelerated delivery of .NET assemblies.

Was the question ever asked: "What happens if we ship something from NuGet that is inherently incompatible with the OOB assembly?"
If so, what was the discussion, what risks were identified, and how would they be managed?
If not, why not, where was the failure?
How has this experience changed the way these kinds of decisions are made moving forward?

  3. How did this problem make its way to customers?

Hindsight shows us that this is actually a pretty easy problem to reproduce.
All you need is an application that takes a dependency on two packages - one that references the new NuGet package and another that references the OOB assembly - and to try to run it.

What kind of testing was done before shipping?
Where was the gap that allowed this problem to reach customers?
What changes have been made to the testing process to prevent these issues finding their way to customers?

  4. Why did it take so long to recognize this issue as significant?

There was a huge lead time between the issue getting reported and being recognized as significant.

What is the process for triaging issues that come through GitHub?
Why did this process fail to pick up on the gravity of the problem sooner?
What can the customer do to change the visibility of an issue if there is a belief it deserves more attention?
What changes have been made to the triage process to try to catch these issues sooner?

  5. Why did it take so long to get the resources to fix this?

Once the gravity of the problem was recognized, it's obvious that immediate and decisive action should have been taken, such as gathering those with the authority to approve resources and gathering the best engineers to design and implement a solution.

What is the process for escalating an issue in a .NET deliverable to get more resources and attention?
Which points in that process contributed the most to the delays?
What changes have been made to that process to reduce the delay in addressing these kinds of problems in the future?

Thanks @jahmai for the pings. I started putting the draft from my head into written form on Friday. I expect to review it and publish it by the end of the week.

Your questions [2]-[5] are aligned with the post-mortem plan/scope as outlined in the top post. It is the "looking back at what happened and why, and how to make it better in the future".

Your question [1] is already addressed by using the fixed version of the System.Net.Http (4.3.1+) NuGet package.
Any problem specific to System.Net.Http 4.3.1+ and/or to "Inheritance security rules violated by type: 'System.Net.Http.WebRequestHandler'. ..." would be directed into a separate issue (as hinted in https://github.com/dotnet/corefx/issues/11100#issuecomment-289234304). However, so far we haven't received any such reports.
The "only" related problems I have seen mentioned a couple of times here are general bindingRedirects problems (or the dislike of them), which typically have their own workarounds / solutions in various parts of the .NET Framework tooling -- with quite a few being filed originally against the CoreFX repo and then routed appropriately by our teams - see query. I fully understand that bindingRedirects are a key source of frustration with any OOB package (believe me, it pains us as well). However, to the best of our knowledge, there is nothing specific to System.Net.Http, or to a specific version of it, in the bindingRedirects problems (i.e. the same problems appear with both the 4.1.1-4.3.0 NuGet packages and the 4.3.1+ NuGet packages). For general bindingRedirects discussion, I think it would be best to join the discussion on one of the existing issues or to file a new one.

First, I want to apologize for the additional delay of the post-mortem after 2.0 shipped. Very sorry for that; I am to blame for the delay.

A little bit of history first

At the time we introduced the System.Net.Http 4.1.0 NuGet package in 2016/6, we wanted to deliver additional value to .NET Framework developers. The new implementation of HttpClient on top of WinHTTP that we had in .NET Core/CoreFX was in many ways superior to the "older one" in .NET Framework (based on the existing .NET Framework HTTP stack). There were perf gains and feature gains (HTTP/2 support) in the "new one". We wanted to deliver that value also to .NET Framework developers, without the need for them to wait for a new .NET Framework update (incl. the delay of the .NET Framework update being deployed to end customers at larger scale).

We call it an out-of-band (OOB) package, and we believed back then that we could make it work and deliver value to customers (developers) directly and much faster than the cycle of .NET Framework updates being deployed to the customers of our customers.

The problem with the System.Net.Http 4.1.0-4.3.0 OOB package was that when it is used together with System.Net.Http.WebRequest (which is part of .NET Framework and used indirectly by ASP.NET and many popular Azure APIs), the user gets an exception at runtime: Inheritance security rules violated by type: 'System.Net.Http.WebRequestHandler'. Derived types must either match the security accessibility of the base type or be less accessible. See #18280 for more details.
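
As a rough illustration (a hypothetical minimal repro, not code from the original reports), the failing combination is as small as constructing the inbox handler type while the OOB System.Net.Http assembly is the one that gets loaded:

    using System;
    using System.Net.Http;

    class Repro
    {
        static void Main()
        {
            // WebRequestHandler lives in the inbox System.Net.Http.WebRequest assembly and derives from
            // HttpClientHandler in System.Net.Http; with the 4.1.0-4.3.0 OOB assembly loaded, using it
            // fails with "Inheritance security rules violated by type: 'System.Net.Http.WebRequestHandler'".
            var handler = new WebRequestHandler();
            var client = new HttpClient(handler);
            Console.WriteLine(client.Timeout);
        }
    }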

So how did the issue slip through into release?

Here are key contributing factors:

  1. The end-to-end scenarios form a very complex and large matrix. There are many .NET Framework versions you can build against and then run on (the version you build against may differ from the one you run on). There are many tools contributing to the end-to-end scenario (NuGet, msbuild, Visual Studio), each with several supported versions. Also, OOB libraries have different versions and can be combined in almost any way.
    • Obviously, we can't test the whole matrix, so we chose a representative combination which we focus on in our automated runs. We test all the latest libraries together (we skip any combinations of old and new libraries, as that doesn't scale and devs can always upgrade the remaining libraries as a workaround). We test on one .NET Framework version (either the one we want to be compatible with, or the latest released version). We leave upstack tooling testing to upstack components (as automation and test cases for tooling are very different from how CoreFX tests look, and they are better co-located with the other tooling tests).
    • Note, for completeness: Besides automated unit testing and component testing, we also do integration testing and end-to-end testing (in the functional testing category). Some of that is automated in upstack components. Some of that is done manually, e.g. via dogfooding efforts (especially before shipping final product versions).
  2. We didn't have automated OOB test runs on .NET Framework until early 2017. All OOB/CoreFX test runs on .NET Framework were done manually by developers prior to checkin.
  3. The bug can't be uncovered by simple unit testing or component testing. It requires usage of 2 components (the System.Net.Http OOB package and the System.Net.Http.WebRequest inbox DLL) and execution of tests which are in the .NET Framework test bed, not part of CoreFX. It's basically integration testing spanning 2 independent products (the CoreFX OOB package and .NET Framework).
    • Also note that the .NET Framework tests have to be specifically augmented to use the System.Net.Http OOB, as there is no global installation of OOB packages.

We didn't realize the dependency danger during development of the System.Net.Http OOB package. Our test coverage was focused on component testing of System.Net.Http on .NET Framework (see [2]). Those tests had no chance of uncovering the problem we introduced (see [3]).

How to prevent such situations in the future?

There were several results of this issue:

  1. We introduced .NET Framework test runs for all CoreFX packages (incl. OOBs) as part of our PR legs and full test passes.
    • This also helps us keep the compatibility of .NET Core with .NET Framework under control.
    • Note that we would not have caught the original issue #18280 even if we had the test runs in place before shipping the original 'bad' package 4.1.0 in 2016/6.
    • As a side note, we have an outstanding issue that we currently test the .NET Framework version and not the OOB package, due to a regression in our test infrastructure - #23497
  2. We reviewed all OOB packages shipping from CoreFX, and assessed the risk of pulling them out vs. continuing to ship them - #20505
    • As a result we identified 2 problematic OOB packages which are not leaf nodes in the platform itself and which the platform depends on - System.Net.Http and System.IO.Compression.
      We decided to keep shipping them instead of pulling them out. Pulling them out would have required mysterious bindingRedirects asking devs to downgrade the version, which is not very standard. The tooling around bindingRedirect downgrades is also not well hardened (as the workaround for #18280 showed), because it is not a mainline scenario. We expected a bad user experience if we took that approach.
    • Note that the System.Net.Http NuGet package now builds & ships only from the 1.1 servicing branch (PR dotnet/corefx#15659). It does not build or ship from the master and 2.0 branches. This helps us prevent accidental changes to the code and forces us to carefully review any change we might want to make to the package in the future.
  3. We are extra cautious whenever we ship new OOB packages:
    • Replacing existing platform assemblies "in the middle" via OOB is entirely out of the question based on this experience. Currently, we can't imagine shipping more OOB packages like that.
    • We had lengthy discussions and reviews of the impact on user scenarios of other kinds of OOB packages, e.g. OOB packages which also appear in one of the platforms like .NET Framework or .NET Core -- examples are Tuples, Span<T> and friends.

Why did it take 6 months to fix?

Here's a brief recap of the issue's history (with +?m denoting the time from issue creation, in months):

  • 8/24 - Issue created with a repro (which is great).
    • No sign of large impact yet. Only 3 upvotes until today.
  • 9/12 (+0.5m) - Workaround fine-tuned.
    • First note about "significant pain" (assumption: painful to apply and come up with the workaround).
    • Still not clarified that it happens on each change to NuGet dependencies.
    • 13 upvotes (unclear if right away, or months later) "confirm" the workaround (or maybe want to upvote the pain?)
  • 11/2 (+2m) - First escalation, "hours wasted" mentioned (that's a red flag), we now have quite a few replies and quite a few upvoting replies.
  • 11/4 (+2m) - Issue acknowledged as being actively looked into (by networking experts).
  • 11/14 (+2.5m) - First mention that bindingRedirect changes are reverted on each NuGet update.
  • 12/9 (+3.5m) - Second escalation due to community frustration and lack of communication from our side.
    • Started active meetings between networking experts and packaging experts to explore alternative workarounds/solutions.
  • 12/15 (+3.5m) - First list of solutions published, info about investments into alternative solutions.
  • 12/17 (+4m) - Solutions refinement published.
  • 1/10 (+4.5m) - Detailed execution plan published.
  • 1/11 (+4.5m) - Fix is checked in master (PR dotnet/corefx#15036).
  • 1/24 (+5m) - Ask for reports on end-to-end validation of scenarios by the original reporters.
  • 1/30 (+5m) - 6 scenarios confirmed as working.
  • 1/31 (+5m) - Change ported to 1.1 hotfix branch.
  • 2/11 (+5.5m) - 1.1 hotfix package available for validation.
  • 2/19 (+6m) - 5 scenarios confirmed as working with the 1.1 hotfix package.
  • 2/21 (+6m) - Final 1.1 hotfix package published on nuget.org.

So where did most of the time go?

  • It took 2-3 weeks to document a workaround (there were only 3-4 reports by this time).
    • This seems to be a fairly reasonable timeframe, given the number of people affected (at that point it looked like any other new issue).
  • It took 2 months to gather enough reports to escalate the issue as serious/impactful (we had ~10 reports by that time and a red flag that it "wastes hours").
    • From this point, the response-time measurement should start.
    • Open question: Would it help to document how to escalate / point out serious/impactful issues? The path used (pinging @terrajobst on GH/Twitter) worked, but was known only to MVPs and some active community members at the time. One option would be to have a group of people (a GitHub team) with an expected time to response (i.e. when it is ok to ping again). Although it may suffer from higher noise, as everyone has a different bar for what is important. Another option would be to have alerts for indicators like a higher number of replies on an issue or upvotes on an issue/reply.
  • It took 2 months from the workaround's availability to clarify that the workaround is not suitable long-term (the worst part of this issue), because the bindingRedirect entries are reverted with each change to NuGet dependencies.
    • Open question: I am not sure whether some of the reporters were aware of this fact earlier and it just was not expressed explicitly or clarified until this point (+2.5m).
  • There was 1 month of silence after the first escalation.
    • While area experts were working on the issue, we did not communicate what is happening.
    • Given the rising number of reporters and the new fact that the workaround is not suitable long-term, we should have engaged in communication with the community earlier, even if just by acknowledging the pain and informing about the status of the work and how complicated it is. I would personally expect an update every 1-2 weeks at minimum at this point.
    • For future: We need to make sure all engineers on CoreFX repo know when and how to communicate in exceptional cases like this one, or know how to get help/support from their peers.
  • It took 1 week (12/9-12/17) after the second escalation to decide on a solution and explore alternatives.
    • It was a very productive week; we worked through several dead ends and were able to identify the best solution by balancing the technical part and time to delivery.
  • It took 3 weeks (12/17-1/11) to deliver the solution.
    • A fairly reasonable timeframe, given the holiday season and the fact that we had a networking security issue distracting our area experts.
    • Overall we delivered preview packages within 2 months of the first escalation (when the clock started).
  • It took 3 weeks (1/11-1/30) to validate the end-to-end scenarios.
    • It took us 1 week to validate some of the scenarios in-house and to ask for help with validation of the end-to-end scenarios we didn't have repros for.
      • Question: Maybe we should have parallelized the in-house validation with the ask for help from the community?
    • It took 1 week and 1 ping to get the first end-to-end scenario validated.
      • Question: Overall a fairly reasonable timeframe, although we were a bit surprised that people didn't jump on the preview package right away. Maybe everyone had found a way to work around the problem one way or another?
  • It took 3 weeks (1/31-2/21) to port, validate and ship the change as an official hotfix.
    • A fairly reasonable and fast turnaround for an official hotfix.

Overall 6-month breakdown:

  • 10 weeks until the issue impact was clarified as serious/impactful (with ~10 reporters) (8/24-11/2)
  • Time to hotfix preview in master:
    • 10 weeks from the time the impact was recognized to having the hotfix preview in master (11/2-1/11).
    • OR: 8 weeks from the time we realized the workaround is very painful to use to having the hotfix preview in master (11/14-1/11).
    • We could likely have saved about 2 weeks if we had communicated better earlier on.
  • 6 weeks from hotfix preview in master to final hotfix release on nuget.org (1/11-2/21)
    • We could probably have saved 1-2 weeks if we had driven the end-to-end validation more aggressively.

Note: The post-mortem above is not trying to hide or marginalize any impact, shift blame, or point fingers. It's an honest attempt to reiterate what happened when and why, with a focus on the future -- how to improve our process and engineering practices to streamline similar events in the future.
If you think some of the recapturing of the past is not honest/correct, please let me know what and why. I'll be happy to discuss.

Thanks for the detailed writeup!

I ran into the issue in its early days and would like to point out one thing that particularly frustrated me: it was not possible to understand what was happening - how exactly installing a NuGet package update can break something seemingly unrelated. Even after having read the threads on the topic, I still only have a vague understanding of the technical details.

This caused frustration because it was not even possible to effectively experiment with it, beyond simply stating that a problem exists and hoping someone from Microsoft took an interest. Similar issues where "the post-net46 universe" does something illogical without presenting any obvious investigative threads to even start unraveling exist even today (NuGet/Home#5812 for one example out of several that still impact me).

In such a case, I would expect Microsoft to respond promptly to issues and, if not provide a resolution, at least maintain a steady dialogue to attempt to diagnose it (I am fine with being asked to experiment but I am not fine with a month of silence in what I consider a blocking issue).

It feels like receiving a bouncy ball promised by Microsoft documentation to bounce most excellently and then finding it is in fact a puddle of water that does not bounce at all and that you will be returning to the merchant as part of some Monty Python sketch.

Especially for a new product like .NET Core, with new tooling and with complex interactions, I expect greater engagement from Microsoft in helping people diagnose and report such issues. Right now there are still so many broken things with the tooling when trying to use .NET Core/Standard, which get barely any comments on GitHub or the VS issues portal, that it takes some real motivation to even report them, as I do not feel that there is a meaningful dialogue when I do so.

The linked issue above indicates a problem with installing a very common library, a problem that only occurs when .NET Standard libraries are in the mix, and a problem that flat-out prevents one from using that library. For the package manager component of VS to not be able to install seemingly perfectly valid packages, and for it to give a clearly invalid error message to explain the failure, boggles the mind as much as the fact that this issue is assigned to "Backlog" without obvious signs of investigation and will probably not even be fixed.

I continue to avoid .NET Core and .NET Standard due to the number of such issues I encounter and due to a very underwhelming response when I do report them. This is despite the fact that both .NET Core and .NET Standard would (if I could use them) satisfy a lot of my business needs.

Now I realize that most of this is not really your area, @karelz - issues these days are generally with bad tooling - but it is hard to see different parts of the VS/.NET ecosystem as separate from the emotional standpoint. It is all just a big pile of stuff I need and pay a lot of MSDN subscription money for in order to do my job.

jnm2 commented

I am fine with being asked to experiment but I am not fine with a month of silence

This cannot be emphasized enough.

(Side note: there needs to be some consistency in terminology. Some issues use OOB to mean out-of-box, and some use it to mean out-of-band, which in the context of these issues refers to the completely opposite set of packages.)

First of all, thank you @karelz for finally taking the time to write that out. I do not know why you are in the unenviable position of taking that responsibility, but you have fielded the varying levels of customer feedback (constructive through outraged) with a level of grace and professionalism, and I thank you for that.

However, I believe that given the timeline highlighted, this breakdown could have been done 6 months ago, and I have my doubts that people monitoring this thread will be nearly as satisfied with the response now that so much time has passed (I know I am not). The prioritization of shipping the new-hot-thing over addressing grave customer concerns raised since shipping the broken last-hot-thing could perhaps be considered part of "the problem" at Microsoft right now.

I also think the timeline should account for #17770 and #17786 (reported early July), as the writing was already on the wall that there were fundamental incompatibilities between out-of-band and out-of-box versions of the assembly, so even at that time someone on the team should have been prepared to test the combination of the two prior to shipping subsequent versions of the package.

There was obviously a communication breakdown with customers which needs improvement, but I want to highlight that good communication needs to be accompanied with appropriate action. All the status updates in the world are meaningless unless they are reporting on, or setting the expectations for, acceptable progress.

As the founder of a company that has been using the latest (released/final) Microsoft products for the past 6 years, I have to say the last 18-24 months stand out as an extremely tumultuous and painful time to be a developer on Microsoft platforms. I don't think it's appropriate to list out all of the separate incidents of pain my team experienced over that period (I don't want this issue to be the flash-point of a dozen Microsoft failures), but what I do want to communicate is that this issue was a significant blow to my and my team's confidence in Microsoft's ability to deliver quality, and that our attitude regarding big platform changes has shifted from excitement to skepticism and anxiety.

I will say, to Microsoft's credit, that the transition from netstandard1.x to netstandard2.0 was fairly trivial and incident-free for us, which is exactly how it should be, but there is still a lot of work to do to earn back that trust for our team.

Thanks for the detailed post-mortem; it definitely helps in understanding the process and the learnings.

A point that I'd like to address:

You seem to have improved the testing of OOB packages in .NET Framework scenarios, which is good. However, the other way around - the impact of .NET Framework updates on apps consuming OOB libraries - seems to be missing a few test scenarios.

For example, in 4.7.1 the following issues surfaced:

  • The tooling shipped in .NET Core 2.0.0/2.0.2 and VS 15.3/15.4 assumed that .NET 4.7.1 would implement binary compatibility with .NET Standard 2.0; however, the assembly versions of System.Net.Http and System.IO.Compression were not updated accordingly, creating type load issues - see dotnet/sdk#1647
    (From the outside perspective, it looks like this assumption about the net471 behaviour was not validated.)
  • A similar issue happened for System.ValueTuple: apps built for .NET 4.6.1 using ValueTuple will fail to load on .NET 4.7.1 unless they explicitly target 4.7.1, because of the generated binding redirects. See microsoft/dotnet-framework-early-access#9, dotnet/standard#543, known issues document
    This one is quite serious for everyone who adopted C# 7 tuples and now tries to run their apps on Win10 1709.

@karelz I'm just wondering if something could be done to prevent or reduce such issues in future updates. .NET Framework used to be a highly compatible upgrade, but the introduction of OOB packages seems to have complicated this a lot. I fully understand that it is very hard to test these scenarios as they are quite specific to the combination of "fx version built for"/"OOB package version"/"platform run on"/"tools used to build".

In effect, System.ValueTuple is a new OOB package. It was built as a package first, then incorporated into the BCL, which is also the plan of record for many other APIs. Will this keep happening? It's not really reassuring to hear that there are now OOB test runs but they
a) do not cover the original system.net.http scenario, because it's in the middle of a dep chain
b) do not cover this new scenario, and
c) are broken anyway (#23889)

The postmortem is appreciated, but frankly it indicates that these process issues are not solved. We've gone from NuGet updates breaking desktop apps to framework auto-updates breaking desktop apps, which if anything is worse.

Perhaps people who are clearly knowledgeable such as @onovotny, who raised the issue, should have the privileges to tag particular issues as serious and urgent.

Just reading the first sentence of the issue tells me that it was a significant one, but I understand that triage needs to consider who is raising it and whether they have enough reputation. Otherwise they might just be somebody inexperienced who is doing something silly.

My experience with the current VS tooling and all its issues is similar to others and it does 'waste hours' for me. The number of open issues against NuGet for example is crazy!

Kesmy commented

Just need to throw out there that "We didn't test things" is not a great reason to force us back to the inbox assemblies.

Now we appear to be (and correct me if I'm wrong) stuck in this limbo where installing a package with a dependency on one of these packages is a crapshoot. Maybe the library depends on bugfixes in the package, maybe it works with the inbox version; we won't know until runtime, when everything fails spectacularly on an edge case and we're up at 0200 after production goes down in flames.

Followup. My issue was that I did not even realise I was on older tooling on some older ASP.NET projects.

Fix:
<Project ToolsVersion="12.0" .. > => <Project ToolsVersion="14.0" ..>
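
For context, a sketch of where that attribute lives, assuming an old-style project file (the xmlns is the standard MSBuild 2003 namespace):

    <?xml version="1.0" encoding="utf-8"?>
    <!-- old-style project file: bump ToolsVersion from 12.0 to 14.0 -->
    <Project ToolsVersion="14.0" DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
      <!-- rest of the project file unchanged -->
    </Project>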