Frequent timeouts/failures when deploying locally
kentcb opened this issue · 10 comments
I'm getting frequent local deployment problems, solved by resetting my cluster (which takes quite some time). Here is an example build log (detailed).
I'm using the latest SDK (3.3.644
). This is running in a Windows 10 VM on Azure using latest VS2017 bits.
Also, I did check the SF explorer after and it didn't appear to tell me anything useful regarding the problem.
Literally every time I try and deploy right now. This is murdering my productivity.
Digging around event viewer, I see a slew of errors from the deployment, including all these.
Log Name: Microsoft-ServiceFabric/Admin
Source: Microsoft-ServiceFabric
Date: 3/13/2019 5:17:52 PM
Event ID: 58368
Task Category: FabricDeployer
Level: Error
Keywords: Default
User: Development\kent
Computer: Development
Description:
Failed to execute command dockerd -H localhost:2375 -H npipe:// --pidfile=C:\SfDevCluster\Data\_sf_docker_pid\636880942720384145_sfdocker.pid from working dir C:\Program Files\Docker wait for exit False exception System.ComponentModel.Win32Exception (0x80004005): The system cannot find the file specified
at System.Diagnostics.Process.StartWithCreateProcess(ProcessStartInfo startInfo)
at System.Fabric.FabricDeployer.Utility.ExecuteCommand(String command, String arguments, String workingDir, Boolean waitForExit, Nullable`1 timeout)
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-ServiceFabric" Guid="{cbd93bc2-71e5-4566-b3a7-595d8eeca6e8}" />
<EventID>58368</EventID>
<Version>2</Version>
<Level>2</Level>
<Task>228</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000001</Keywords>
<TimeCreated SystemTime="2019-03-13T07:17:52.090514100Z" />
<EventRecordID>3739</EventRecordID>
<Correlation />
<Execution ProcessID="3120" ThreadID="3868" />
<Channel>Microsoft-ServiceFabric/Admin</Channel>
<Computer>Development</Computer>
<Security UserID="S-1-5-21-813763744-1233322525-2174010731-500" />
</System>
<EventData>
<Data Name="id">
</Data>
<Data Name="type">FabricDeployer</Data>
<Data Name="message">Failed to execute command dockerd -H localhost:2375 -H npipe:// --pidfile=C:\SfDevCluster\Data\_sf_docker_pid\636880942720384145_sfdocker.pid from working dir C:\Program Files\Docker wait for exit False exception System.ComponentModel.Win32Exception (0x80004005): The system cannot find the file specified
at System.Diagnostics.Process.StartWithCreateProcess(ProcessStartInfo startInfo)
at System.Fabric.FabricDeployer.Utility.ExecuteCommand(String command, String arguments, String workingDir, Boolean waitForExit, Nullable`1 timeout)</Data>
</EventData>
</Event>
Log Name: Microsoft-ServiceFabric/Admin
Source: Microsoft-ServiceFabric
Date: 3/13/2019 5:17:52 PM
Event ID: 58368
Task Category: FabricDeployer
Level: Error
Keywords: Default
User: Development\kent
Computer: Development
Description:
Failed to clean up container network.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-ServiceFabric" Guid="{cbd93bc2-71e5-4566-b3a7-595d8eeca6e8}" />
<EventID>58368</EventID>
<Version>2</Version>
<Level>2</Level>
<Task>228</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000001</Keywords>
<TimeCreated SystemTime="2019-03-13T07:17:52.910937900Z" />
<EventRecordID>3741</EventRecordID>
<Correlation />
<Execution ProcessID="3120" ThreadID="3868" />
<Channel>Microsoft-ServiceFabric/Admin</Channel>
<Computer>Development</Computer>
<Security UserID="S-1-5-21-813763744-1233322525-2174010731-500" />
</System>
<EventData>
<Data Name="id">
</Data>
<Data Name="type">FabricDeployer</Data>
<Data Name="message">Failed to clean up container network.</Data>
</EventData>
</Event>
Log Name: Microsoft-ServiceFabric/Admin
Source: Microsoft-ServiceFabric
Date: 3/13/2019 5:18:47 PM
Event ID: 2560
Task Category: General
Level: Error
Keywords: Default
User: Development\kent
Computer: Development
Description:
GetFileAttributesEx failed with the following error 5
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-ServiceFabric" Guid="{cbd93bc2-71e5-4566-b3a7-595d8eeca6e8}" />
<EventID>2560</EventID>
<Version>2</Version>
<Level>2</Level>
<Task>10</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000001</Keywords>
<TimeCreated SystemTime="2019-03-13T07:18:47.548031400Z" />
<EventRecordID>3762</EventRecordID>
<Correlation />
<Execution ProcessID="12756" ThreadID="9648" />
<Channel>Microsoft-ServiceFabric/Admin</Channel>
<Computer>Development</Computer>
<Security UserID="S-1-5-21-813763744-1233322525-2174010731-500" />
</System>
<EventData>
<Data Name="id">File.GetLastWriteTime</Data>
<Data Name="type">File</Data>
<Data Name="text">GetFileAttributesEx failed with the following error 5</Data>
</EventData>
</Event>
From this build log, all I can make out is that the cluster was running but the mesh application deployment failed:
3> Deploying application to local Service Fabric cluster...
3> C:\Windows\sysnative\WindowsPowerShell\v1.0\powershell.exe -NoProfile -ExecutionPolicy Bypass -Command "import-module 'C:\Program Files\Microsoft SDKs\Service Fabric\Tools\Mesh\Scripts\PSModule\Microsoft.ServiceFabric.Powershell.Http.psd1'; Connect-SFCluster; New-SFMeshResourceDeployment -ResourceDescriptionList 'C:\Users\kent\Repository<>\src<>\MESH<>\App Resources\app.yaml', 'C:\Users\kent\Repository<>\src<>\MESH\MyService\Service Resources\service.yaml', 'C:\Users\kent\Repository<>\src<>\MESH\API\Service Resources\service.yaml', 'C:\Users\kent\Repository<>\src<>\MESH<>\obj\SFApp\App.debug.yaml' -ParameterFileName 'C:\Users\kent\Repository<>\src<>\MESH<>\Environments\Local\parameters.yaml'"
3> Connected Successfully!
3> Deploying the application to the Service Fabric local cluster failed. See Service Fabric Explorer for additional details.
3> Done executing task "LocalDeployApplication" -- FAILED.
If would help if you can reset your cluster again, and try app deployment and share the cluster traces from here: C:\SFDevCluster\Log\Traces, I can look and tell you why the app deployment is failing.
Our messages collided.
From the event logs it seems to be an issue with the docker failing to start the process. I would suggest you to reboot your machine or reset docker to factory settings.
@anantshankar17 Thanks. The traces are large. Can you provide an email address with which I can share them via OneDrive?
You can share the traces to this id: ashank at microsoft dot com. Also did you try to reboot machine or reset the docker to factory settings ? From the event logs, it seems the docker is failing to start the container, in which case SF traces will also indicate the same thing only.
OK, it's still happening even after Docker reset/SF reset/reboot, so I've shared some traces with you.
@kentcb Also, can you please check if there is sufficient free disk space on the disk the cluster is deployed ? Can you please share the resource descriptions that you are trying to deploy ? Which VM type are you using ? How many apps have been deployed on this cluster ?
Since, I see all kinds of insufficient resources available traces i suspect this is some incorrect config in the resource description:
2019-3-13 22:48:37.859 PLB.InsufficientResourcesDetected 9900 8640 Insufficient resources when adding service fabric:/inf/API for metric servicefabric:/_CpuCores: Requested 500000, Available 0, Reserved0 BufferPercentage 0, Capacity 1000000
2019-3-13 22:48:37.876 CM.CreateSingleInstanceDeploymentTraceComponent 9900 1932 [(00000000-0000-0000-0000-000000002000:131969890879859490)+eee2219d-a093-4836-817b-eff781ddbefa:0] single instance deployment create failed due to FABRIC_E_INSUFFICIENT_CLUSTER_CAPACITY
2019-3-13 22:51:01.446 FabricDCA.AppConfig 6096 11704 Manifest C:\SfDevCluster\Data_App_Node_0\SingleInstance_0_App2\App.1.0.xml was not found. This could be because the application has been deleted, but we haven't yet processed the event that notifies us about the deletion.
Also, you can use this cmdlet to check the resource availability on the cluster: (Get-ServiceFabricClusterLoadInformation).LoadMetricInformationList | select Name, ClusterCapacity, ClusterLoad, IsClusterCapacityViolation
https://docs.microsoft.com/en-us/powershell/module/servicefabric/Get-ServiceFabricClusterLoadInformation?view=azureservicefabricps#outputs
@anantshankar17 There's over 54GB free disk space, so guessing that's not the issue.
Those resource requests do look odd! At this point it's just a spike project, so only two dummy services. Here are their definitions:
API
service.yaml
## Service definition ##
application:
schemaVersion: 1.0.0-preview2
name: inf
properties:
services:
- name: API
properties:
description: API description.
osType: Windows
codePackages:
- name: API
image: api:dev
endpoints:
- name: gRPCEndpoint
port: 9026
environmentVariables:
- name: ENVIRONMENT
value: "[parameters('ENVIRONMENT')]"
- name: RESOURCE_GROUP
value: "[parameters('RESOURCE_GROUP')]"
- name: SEQ_SERVER_URL
value: "[parameters('SEQ_SERVER_URL')]"
- name: SEQ_API_KEY
value: "[parameters('SEQ_API_KEY')]"
- name: GRPC_PORT
value: "[parameters('GRPC_PORT')]"
resources:
requests:
cpu: 0.5
memoryInGB: 1
replicaCount: 1
networkRefs:
- name: infNetwork
endpointRefs:
- name: gRPCEndpoint
Dockerfile
FROM microsoft/dotnet:2.1-runtime-nanoserver-1803 AS base
WORKDIR /app
EXPOSE 9026
FROM microsoft/dotnet:2.1-sdk AS build
WORKDIR /src
COPY API/API.csproj API/
RUN dotnet restore API/API.csproj
COPY . .
WORKDIR /src/API
RUN dotnet build API.csproj -c Release -o /app
FROM build AS publish
RUN dotnet publish API.csproj -c Release -o /app
FROM base AS final
WORKDIR /app
COPY --from=publish /app .
ENTRYPOINT ["dotnet", "API.dll"]
InvitationCodes
service.yaml
## Service definition ##
application:
schemaVersion: 1.0.0-preview2
name: inf
properties:
services:
- name: InvitationCodes
properties:
description: InvitationCodes description.
osType: Windows
codePackages:
- name: InvitationCodes
image: invitationcodes:dev
environmentVariables:
- name: ENVIRONMENT
value: "[parameters('ENVIRONMENT')]"
- name: RESOURCE_GROUP
value: "[parameters('RESOURCE_GROUP')]"
- name: SEQ_SERVER_URL
value: "[parameters('SEQ_SERVER_URL')]"
- name: SEQ_API_KEY
value: "[parameters('SEQ_API_KEY')]"
- name: GRPC_PORT
value: "[parameters('GRPC_PORT')]"
resources:
requests:
cpu: 0.5
memoryInGB: 1
replicaCount: 1
networkRefs:
- name: infNetwork
Dockerfile
FROM microsoft/dotnet:2.1-runtime-nanoserver-1803 AS base
WORKDIR /app
FROM microsoft/dotnet:2.1-sdk AS build
WORKDIR /src
COPY InvitationCodes/InvitationCodes.csproj InvitationCodes/
RUN dotnet restore InvitationCodes/InvitationCodes.csproj
COPY . .
WORKDIR /src/InvitationCodes
RUN dotnet build InvitationCodes.csproj -c Release -o /app
FROM build AS publish
RUN dotnet publish InvitationCodes.csproj -c Release -o /app
FROM base AS final
WORKDIR /app
COPY --from=publish /app .
ENTRYPOINT ["dotnet", "InvitationCodes.dll"]
App Resources
app.yaml
application:
schemaVersion: 1.0.0-preview2
name: inf
properties:
description: inf description.
gateway.yaml
gateway:
schemaVersion: 1.0.0-preview2
name: infGateway
properties:
description: INF Gateway Resource
sourceNetwork:
name: Open
destinationNetwork:
name: infNetwork
tcp:
- name: gRPCTCP
port: 9026
destination:
applicationName: inf
serviceName: API
endpointName: gRPCEndpoint
network.yaml
## Network definition ##
network:
schemaVersion: 1.0.0-preview2
name: infNetwork
properties:
description: infNetwork description.
networkAddressPrefix: 10.192.0.0/16
kind: Local
@kentcb which VM type are you using, how many cores does that have ?
Gateway resource would itself use 1 core, and your service will also consume 1.
After app deployment, did you run this cmdlet and check the resource utilization ?
(Get-ServiceFabricClusterLoadInformation).LoadMetricInformationList | select Name, ClusterCapacity, ClusterLoad, IsClusterCapacityViolation
https://docs.microsoft.com/en-us/powershell/module/servicefabric/Get-ServiceFabricClusterLoadInformation?view=azureservicefabricps#outputs
Please reopen if you see this issue even after resolving the resource constraints.