dotnet test keeps hanging
muratg opened this issue · 18 comments
From @livarcocc on Tuesday, 21 August 2018 23:19:09
From @dustinmoris on August 21, 2018 20:20
Steps to reproduce
- Go to https://github.com/giraffe-fsharp/Giraffe
- Clone the repo and switch to the develop branch
- Run
.\build.ps1
(or just rundotnet test ./tests/Giraffe.Tests/Giraffe.Tests.fsproj -f netcoreapp2.1
)
The build will succeed.
If you go to the Giraffe.Tests.fsproj
file and change
<PackageReference Include="Microsoft.AspNetCore.TestHost" Version="2.0.*" />
to
<PackageReference Include="Microsoft.AspNetCore.TestHost" Version="2.1.*" />
and re-run the tests then you'll see that it will hang forever as if something went into a deadlock.
Previously I was using dotnet xunit
which also didn't have problems, but since dotnet xunit
has been discontinued I switched over to dotnet test
which is giving me problems now.
Also this issue doesn't exist if I downgrade the test project to netcoreapp2.0.
Expected behavior
Tests will not deadlock with Microsoft.AspNetCore.TestHost
version 2.1.*
.
Actual behavior
Tests hang forever.
Environment data
dotnet --info
output:
Dustins-MBP:giraffe dustinmoris$ dotnet --info
.NET Core SDK (reflecting any global.json):
Version: 2.1.401
Commit: 91b1c13032
Runtime Environment:
OS Name: Mac OS X
OS Version: 10.13
OS Platform: Darwin
RID: osx.10.13-x64
Base Path: /usr/local/share/dotnet/sdk/2.1.401/
Host (useful for support):
Version: 2.1.3
Commit: 124038c13e
.NET Core SDKs installed:
2.0.0 [/usr/local/share/dotnet/sdk]
2.0.3 [/usr/local/share/dotnet/sdk]
2.1.3 [/usr/local/share/dotnet/sdk]
2.1.4 [/usr/local/share/dotnet/sdk]
2.1.300 [/usr/local/share/dotnet/sdk]
2.1.301 [/usr/local/share/dotnet/sdk]
2.1.302 [/usr/local/share/dotnet/sdk]
2.1.400 [/usr/local/share/dotnet/sdk]
2.1.401 [/usr/local/share/dotnet/sdk]
.NET Core runtimes installed:
Microsoft.AspNetCore.All 2.1.0 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.All]
Microsoft.AspNetCore.All 2.1.1 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.All]
Microsoft.AspNetCore.All 2.1.2 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.All]
Microsoft.AspNetCore.All 2.1.3 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.1.0 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 2.1.1 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 2.1.2 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 2.1.3 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 2.0.0 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 2.0.3 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 2.0.4 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 2.0.5 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.0 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.1 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.2 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.3 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
To install additional .NET Core runtimes or SDKs:
https://aka.ms/dotnet-download
Copied from original issue: dotnet/cli#9874
Copied from original issue: dotnet/aspnetcore#3443
From @livarcocc on Tuesday, 21 August 2018 23:19:10
@natemcmaster do you know who owns Microsoft.AspNetCore.TestHost
?
From @livarcocc on Tuesday, 21 August 2018 23:19:10
From @natemcmaster on August 21, 2018 21:11
That's my team. You can move this to aspnet/Home.
From @natemcmaster on Wednesday, 22 August 2018 00:21:54
@dustinmoris can share more logs? What do you see in console before the tests hang? Also, VSTest diagnostic logs may provide information about which process is stuck. Unlike dotnet xunit,
dotnet testuses multiple processes to launch and run tests.
dotnet test --diag mylog.txt`. If you take a look inside those logs, you'll be able to see which processes were created and which are hanging. This will help if we need you to gather a process dump.
From @dustinmoris on Friday, 24 August 2018 22:31:36
Before the tests I just see the normal .NET Core output from building the project:
Dustins-MBP:giraffe dustinmoris$ dotnet build ./tests/Giraffe.Tests/Giraffe.Tests.fsproj
Microsoft (R) Build Engine version 15.8.166+gd4e8d81a88 for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.
Restoring packages for /Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/Giraffe.Tests.fsproj...
Restore completed in 32.02 ms for /Users/dustinmoris/GitHub/Giraffe/src/Giraffe/Giraffe.fsproj.
Restore completed in 989.08 ms for /Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/Giraffe.Tests.fsproj.
Giraffe -> /Users/dustinmoris/GitHub/Giraffe/src/Giraffe/bin/Debug/netstandard2.0/Giraffe.dll
Giraffe -> /Users/dustinmoris/GitHub/Giraffe/src/Giraffe/bin/Debug/net461/Giraffe.dll
Giraffe.Tests -> /Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/bin/Debug/net461/Giraffe.Tests.dll
Giraffe.Tests -> /Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/bin/Debug/netcoreapp2.1/Giraffe.Tests.dll
Build succeeded.
0 Warning(s)
0 Error(s)
Time Elapsed 00:00:18.40
Dustins-MBP:giraffe dustinmoris$ dotnet test ./tests/Giraffe.Tests/Giraffe.Tests.fsproj -f netcoreapp2.1 --diag testlog.txt
Microsoft (R) Build Engine version 15.8.166+gd4e8d81a88 for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.
Restore completed in 41.46 ms for /Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/Giraffe.Tests.fsproj.
Restore completed in 41.44 ms for /Users/dustinmoris/GitHub/Giraffe/src/Giraffe/Giraffe.fsproj.
Build started, please wait...
Build completed.
Test run for /Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/bin/Debug/netcoreapp2.1/Giraffe.Tests.dll(.NETCoreApp,Version=v2.1)
Microsoft (R) Test Execution Command Line Tool Version 15.8.0
Copyright (c) Microsoft Corporation. All rights reserved.
Starting test execution, please wait...
Logging Vstest Diagnostics in file: /Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/testlog.txt
^C
Attempting to cancel the build...
Dustins-MBP:giraffe dustinmoris$
Because it hangs I had to cancel the build.
After running dotnet test
with the --diag
flag it actually created two log files:
testlog.host.18-08-24_23-22-45_88253_4.txt
testlog.txt
Seems like this is the point at which it starts hanging. After this line it just keeps saying Polling on remoteEndPoint
:
"TestRunStatistics":{"ExecutedTests":59,"Stats":{"Passed":59}},"ActiveTests":[{"Id":"1c4254f9-7162-8077-35e3-3e0a4cbbd847","FullyQualifiedName":"Giraffe.Tests.HttpContextExtensionsTests.WriteHtmlFileAsync should return html from content folder","DisplayName":"Giraffe.Tests.HttpContextExtensionsTests.WriteHtmlFileAsync should return html from content folder","ExecutorUri":"executor://xunit/VsTestRunner2/netcoreapp","Source":"/Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/bin/Debug/netcoreapp2.1/Giraffe.Tests.dll","CodeFilePath":null,"LineNumber":0,"Properties":[]}]}}
TpTrace Verbose: 0 : 61463, 5, 2018/08/24, 23:22:48.749, 158168527666476, vstest.console.dll, TestRunRequest:SendTestRunStatsChange: Starting.
TpTrace Verbose: 0 : 61463, 5, 2018/08/24, 23:22:48.749, 158168527877462, vstest.console.dll, InProgress is Giraffe.Tests.HttpContextExtensionsTests.WriteHtmlFileAsync should return html from content folder
TpTrace Information: 0 : 61463, 5, 2018/08/24, 23:22:48.750, 158168528048414, vstest.console.dll, TestRunRequest:SendTestRunStatsChange: Completed.
TpTrace Verbose: 0 : 61463, 5, 2018/08/24, 23:22:48.750, 158168528196945, vstest.console.dll, TcpClientExtensions.MessageLoopAsync: Polling on remoteEndPoint: 127.0.0.1:55624 localEndPoint: 127.0.0.1:55623
TpTrace Verbose: 0 : 61463, 5, 2018/08/24, 23:22:49.755, 158169533575644, vstest.console.dll, TcpClientExtensions.MessageLoopAsync: Polling on remoteEndPoint: 127.0.0.1:55624 localEndPoint: 127.0.0.1:55623
Let me know if there's more information I can provide!
From @natemcmaster on Monday, 27 August 2018 15:39:47
Thanks for the test logs. As far as I can tell, this rules out issues in the test runner itself. There appears to be some kind of deadlock in the test code. Gathering a process dump is the next step for investigating this. When dotnet test runs, there will be several "dotnet.exe" processes. I recommend using Process Explorer to find the right dotnet.exe process for which to collect a full process dump. The dotnet.exe process which is hanging will have startup arguments like this:
dotnet.exe exec --runtimeconfig "/Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/bin/Debug/netcoreapp2.1/Giraffe.Tests.runtimeconfig.json" --depsfile "/Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/bin/Debug/netcoreapp2.1/Giraffe.Tests.deps.json" "/Users/dustinmoris/.nuget/packages/microsoft.testplatform.testhost/15.8.0/lib/netstandard1.5/testhost.dll" --port 55623 --endpoint 127.0.0.1:055623 --role client --parentprocessid 61463 --diag "/Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/testlog.host.18-08-24_23-22-45_88253_4.txt" --telemetryoptedin false
Once you have a process dump, we can look at where the threads are stuck. You should be able to open .dmp files in Visual Studio.
Hi,
So I have figured out that the deadlock was caused with Microsoft.AspNetCore.TestHost
2.1.*
when I was testing routes which accessed files from ContentRoot, however these tests wouldn't deadlock with the TestHost 2.0.*
.
Initially I had this problem on my MBP and on my Win 10 machine, but after changing some tests I have fixed the issue now, however I am still getting a deadlock on AppVeyor, Travis and Azure Pipelines for Ubuntu and Windows images, which is really weird because on two independent Windows 10 machines and on my MacBook Pro I can't reproduce this anymore.
I feel like I am wasting here so much time trying to figure out why tests deadlock after upgrading the NuGet package Microsoft.AspNetCore.TestHost
from 2.0 to 2.1. Can someone shed some light on what you guys have changed inbetween those versions or maybe someone from the team who can go on a Google Hangout with me and where I can screenshare and demo the issue. I even have a GitHub project which you guys could just pull to investigate, but I am literally losing patience now.
You should be able to pull the latest version from the develop branch of Giraffe and then run dotnet test
for the Giraffe.Tests.fsproj
project.
With TestHost 2.0 the build works perfectly fine.
After updating only the NuGet version of TestHost from 2.0 to 2.1 the dotnet test
command deadlocks for the exact same build.
The weird thing is that on my macOS Mojave MBP, Windows 10 Home laptop and Windows 10 Professional Desktop I am not able to get the tests hanging either.
I don't think it is an AppVeyor issue though, because the same also happens in TravisCI and Azure Pipelines.
TestHost 2.0 works perfectly on ALL environments.
There is something incredibly weird going on here and I hope you can look into this as there might be some major flaw with TestHost 2.1.
I'm more than happy to assist, but I would hope that Microsoft could take a little bit more the lead on investigating an issue which has only emerged after you made changes to your TestHost library.
FYI I see two changes in TestHost 2.1 that might have an impact on test execution:
Response Stream Pipes
HttpContext pattern
Edit: hadn't read the history above... Were you able to narrow down which test is hanging? You can turn on xunit diagnosticMessages to track which test is running. https://xunit.github.io/docs/configuring-with-json
This doesn't repro locally for me on Win10 either. We use AppVeyor and TravisCI ourselves and haven't seen any issues like this. We have known those systems to be flakier if only because they run so much slower. That can make them prone to uncovering race conditions that you wouldn't normally hit during development. It seems like the only way to repro this right now is to send you a PR and try to debug it.
Thanks @Tratcher for getting back so quickly and checking this yourself! I have narrowed it down that it has to do with these tests:
What they both have in common is that the TestHost is being created with a content root and that they both stream a file. The streamFile handler is implemented here, which essentially calls into StreamCopyOperation.CopyToAsync.
When I remove both files from the compilation by commenting them in the .fsproj
file then the build works fine with TestHost 2.1.
We have known those systems to be flakier if only because they run so much slower. That can make them prone to uncovering race conditions that you wouldn't normally hit during development.
That's a good point, however I think this is not the case here because the tests are deadlocking consistently on 3 different CI environments (AppVeyor, Travis and Azure Pipelines) every single time for many weeks now (since TestHost 2.1 has been released). They are also passing on all my machines consistently and some of them are not particularly strong either. I was thinking that perhaps what the CI servers have in common is that the tests are being executed from a docker container? Not sure how these environments are spawned but that could be a reason or maybe because they all run on a VM/container which has only a single core/CPU allocated?
I've got time this evening to look into this in more detail again if you have some concrete ideas what to try or how to debug it the best I'll do my best to get this done today.
The best thing to do is to narrow it down to a single test that fails. It may take a bit of binary searching where you delete half of the tests and see what happens.
I also see that the tests target both net461 and netcoreapp2.1. I've tried dropping net461 in my PR.
Ok, I can help you with this, because I've already done a lot of these tests where I was removing everything but a few tests. I'll add commits to your PR if that is ok with you?
Go right ahead.
@davidfowl here's another threadpool exhastion variant. giraffe-fsharp/Giraffe#317 (comment)
xunit limits the number of threads dynamically based on the number of cores. On CIs like AppVeyor and Travis that could be a very low number (2?). It also runs tests in parallel. If tests start blocking any of those threads then it can quickly starve the threadpool, block any async operations, and cause a full deadlock.
x-thread update. This is mostly resolved by the debugging/discussion at giraffe-fsharp/Giraffe#317. The tests were starving threads via Task.Result and making them properly async addressed the issue. The TestHost 2.1 changes discussed in #1537 (comment) appear to have made TestHost more sensitive to thread starvation, likely by running more callbacks on the threadpool rather than inline. The only question remaining is which of the two checkins caused this and what background operation is stuck.
Closing as there's no plan to pursue this.
I just ran across the same issue, and switching everything to async/await fixed it, cheers all