proxb/PoshRSJob

Start-RSJob -Timeout

opustecnica opened this issue · 8 comments

Do you want to request a feature or report a bug?

Would it be possible to add a "-Timeout" feature to Start-RSJob?

What is the current behavior?

There is currently no timeout. When working with a large number of computers, particularly when making WMI calls, a few might suffer from a hung WMI connection. When the jobs are piped to Wait-RSJob, the process never completes because some RSJobs are still running. The Wait-RSJob Timeout applies, if I understand correctly, to the entire batch of RSJobs, and there seems to be no way to apply a timeout to individual RSJobs.

Thank you.

Yes, it is applied to the entire batch, but do you really need something else?
What problem do you see with the current implementation? Please give more detail.

I can see why this would be needed. I run scans across our network on a few thousand machines, and sometimes 1-2 machines do not error out when I cannot remote into them. The script never completes because of those.

If each job had its own timeout, I could mitigate those issues. I'd prefer not to use a batch timeout and instead give each job a specific timeout. That way I can be sure the entire scan is complete, and any machines that don't error out still get closed out so the script can end properly.

@brian-heidrich That is exactly the issue I am trying to resolve. Particularly when dealing with WMI, I always experience a few "hung" jobs out of a couple thousand.

Likely not very elegant, but this is the workaround I have developed:

$RSJobTimeout = 3600
$RSJobPollingMilliseconds = 2000
$RSBatchTimeout = 3600 * 24

$RSJobArguments = @{
    Throttle        = $Throttle
    Batch           = $Batch
    FunctionsToLoad = $FunctionsToLoad
    Name            = $Name
    ScriptBlock     = $ScriptBlock
}

$RSJob = $Computers | Start-RSJob @RSJobArguments

$RSJob | ForEach-Object { $_ | Add-Member -NotePropertyName ExpirationTime -NotePropertyValue (Get-Date).AddSeconds($RSJobTimeout) }

# Record batch start time.
$BatchStartTime = Get-Date

While ($true) {
    $StartTime = Get-Date

    # Update ExpirationTime on NotStarted Jobs
    $NotStartedRSJobs = Get-RSJob -Batch $Batch -State NotStarted
    $NotStartedRSJobs | ForEach-Object { $_.ExpirationTime = (Get-Date).AddSeconds($RSJobTimeout) }
            
    # Count the number of unfinished jobs before acting on the queues to ensure we don't miss any output.
    $UnfinishedRSJobs = @(Get-RSJob -Batch $Batch | Where-Object -Property State -NE "Completed")
                
    if ($UnfinishedRSJobs.Count -gt 0) {
        $CurrentTime = Get-Date
        $UnfinishedRSJobs | ForEach-Object { 
            if (($CurrentTime - $_.ExpirationTime).TotalSeconds -gt 0) { $_ | Remove-RSJob -Force }
        }
    }

    # Output completed jobs.
    $FinishedRSJobs = Get-RSJob -Batch $Batch -State Completed 
    $FinishedRSJobs | Receive-RSJob
    $FinishedRSJobs | Remove-RSJob -Force

    if ($UnfinishedRSJobs.Count -eq 0 -or ((Get-Date) - $BatchStartTime).TotalSeconds -gt $RSBatchTimeout) {
        # All jobs finished or the batch timed out.
        break
    } else {
        # Calculate how long we just spent polling the queues
        $LoopDuration = ((Get-Date) - $StartTime).TotalMilliseconds

        # Use that to adjust how long we wait.
        Start-Sleep -Milliseconds ([System.Math]::Max(0, $RSJobPollingMilliseconds - $LoopDuration))
    }
}

# Remove background jobs. If the batch timed out some of them may still be running.
Get-RSJob -Batch $Batch | Remove-RSJob -Force
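The core decision in the loop above (is this job unfinished and past its own deadline?) can be isolated into a small filter. This is only a sketch: `Get-ExpiredJob` and its `-Now` parameter are hypothetical names, not part of PoshRSJob, and the stand-in objects merely mimic the `State` and `ExpirationTime` properties used in the workaround.

```powershell
# Hypothetical helper (not a PoshRSJob cmdlet): emits job-like objects that are
# not Completed and whose per-job deadline has already passed.
function Get-ExpiredJob {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [object]$Job,

        # Injected clock so the check can be exercised deterministically.
        [datetime]$Now = (Get-Date)
    )
    process {
        # Same test as ($CurrentTime - $_.ExpirationTime).TotalSeconds -gt 0
        if ($Job.State -ne 'Completed' -and $Job.ExpirationTime -lt $Now) {
            $Job
        }
    }
}

# Stand-in objects; real RSJob objects would come from Get-RSJob -Batch $Batch.
$now = Get-Date '2020-01-01T12:00:00'
$jobs = @(
    [pscustomobject]@{ Name = 'pc01'; State = 'Running';   ExpirationTime = $now.AddSeconds(-5) }
    [pscustomobject]@{ Name = 'pc02'; State = 'Running';   ExpirationTime = $now.AddSeconds(60) }
    [pscustomobject]@{ Name = 'pc03'; State = 'Completed'; ExpirationTime = $now.AddSeconds(-5) }
)
$expired = @($jobs | Get-ExpiredJob -Now $now)
```

With real jobs, the output of such a filter would presumably be piped to `Remove-RSJob -Force`, as the workaround does inline.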

So, what exactly is the problem with

$completed_jobs = $jobs | Wait-RSJob -Timeout xxx
$incompleted = $jobs | where { -not $_.Completed }
$incompleted | Remove-RSJob -Force

in this scenario?

Does it never return, or what?

@MVKozlov the issue is that the Timeout value needs to be guessed.

Scenarios:

  1. If the job batch requires less time than the timeout value, then we need to wait for the timeout to retrieve completed jobs. Why wait?
  2. If the timeout value is lower than the time required to complete the job batch, then we "interrupt" the batch and retrieve an incomplete batch.
  3. The value of the timeout needs to be guessed. A timeout should be used to avoid "runaway" tasks and to terminate a job batch only if it lasts more than "X" amount of time.

The code above sets both: a timeout for each individual RSJob and a timeout for the job batch. This returns the collection as soon as feasible, while terminating individual unresponsive jobs, and sets a batch timeout to handle runaway processes.

One more important point. A timeout on individual jobs maximizes the "throttle" usage. Assuming a throttle value of 5 (default), if a job hangs, we will only be able to have batches of 4. If two hang, we will only have batches of 3. That is inefficient when dealing with a large number of endpoints.
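To put rough numbers on the throttle argument, here is a back-of-the-envelope sketch. `Get-BatchEstimate` is a hypothetical helper, and the 2000 machines and 30 seconds per job are made-up figures; it also assumes the hangs happen early and that every healthy job takes the same time, which is a worst-case simplification.

```powershell
# Hypothetical helper: rough wall-clock estimate for a batch when some
# throttle slots are permanently occupied by hung jobs.
function Get-BatchEstimate {
    param(
        [int]$JobCount,
        [int]$Throttle = 5,      # default throttle mentioned above
        [int]$HungJobs = 0,
        [double]$SecondsPerJob
    )
    $usableSlots = $Throttle - $HungJobs
    if ($usableSlots -le 0) {
        throw 'All throttle slots are occupied by hung jobs; the batch cannot progress.'
    }
    # Number of "rounds" needed when only $usableSlots jobs run side by side.
    $rounds = [math]::Ceiling([double]$JobCount / $usableSlots)
    [pscustomobject]@{
        UsableSlots      = $usableSlots
        Rounds           = $rounds
        EstimatedSeconds = $rounds * $SecondsPerJob
    }
}

# Assumed workload: 2000 machines, 30 seconds per healthy job.
$healthy  = Get-BatchEstimate -JobCount 2000 -SecondsPerJob 30
$degraded = Get-BatchEstimate -JobCount 2000 -SecondsPerJob 30 -HungJobs 2
```

Under these assumptions, two hung jobs stretch the estimate from 12000 to 20010 seconds, which is the inefficiency a per-job timeout is meant to remove.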

@MVKozlov Exactly what @opustecnica said. I never know for sure how long my global timeout needs to be because the number of machines on my network keeps changing. We are always imaging new machines and taking machines off the network, so there is never a fixed count from which I could work out a global timeout.

A per job timeout would be a lot more useful, as I know how long 1 job should take.

@opustecnica

  1. No, you are wrong. Wait-RSJob doesn't always wait for the full timeout; it exits on timeout only if your last job runs longer than that.
    So the code above mimics the original Wait-RSJob loop :)
Do{
    $JustFinishedJobs = New-Object System.Collections.ArrayList
    $RunningJobs = New-Object System.Collections.ArrayList
    ForEach ($WaitJob in $WaitJobs) {
        If($WaitJob.State -match 'Completed|Failed|Stopped|Suspended|Disconnected') {
            [void]$JustFinishedJobs.Add($WaitJob)
        } Else {
            [void]$RunningJobs.Add($WaitJob)
        }
    }
    $WaitJobs = $RunningJobs

    #Wait just a bit so the HasMoreData can update if needed
    Start-Sleep -Milliseconds 100
    $JustFinishedJobs

    $Completed += $JustFinishedJobs.Count
    if($Timeout){
        if((New-Timespan $Date).TotalSeconds -ge $Timeout){
            $TimedOut = $True
            break
        }
    }
}
While($Waitjobs.Count -ne 0)
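The early-exit behavior of that loop can be illustrated without PoshRSJob at all. The following is only a simulation under stated assumptions: `Wait-SimulatedJob` is a hypothetical function, and each fake job is reduced to the number of milliseconds after which it counts as finished.

```powershell
# Hypothetical simulation; no PoshRSJob cmdlets are called.
function Wait-SimulatedJob {
    param(
        # Each entry: milliseconds after which that fake job counts as finished.
        [int[]]$DurationMs,
        [double]$TimeoutSeconds
    )
    $timer = [System.Diagnostics.Stopwatch]::StartNew()
    $finished = 0
    $timedOut = $false
    do {
        # Same 100 ms polling pause as in the loop quoted above.
        Start-Sleep -Milliseconds 100
        $elapsed = $timer.Elapsed
        $finished = @($DurationMs | Where-Object { $_ -le $elapsed.TotalMilliseconds }).Count
        if ($TimeoutSeconds -and $elapsed.TotalSeconds -ge $TimeoutSeconds) {
            $timedOut = $true
            break
        }
    } while ($finished -lt $DurationMs.Count)
    [pscustomobject]@{
        Finished  = $finished
        TimedOut  = $timedOut
        ElapsedMs = [int]$timer.Elapsed.TotalMilliseconds
    }
}

# All fake jobs finish within 300 ms, so the 5 s timeout is never reached.
$quick = Wait-SimulatedJob -DurationMs 100, 200, 300 -TimeoutSeconds 5
# One fake job "hangs" for 60 s, so the loop gives up after about 1 s.
$stuck = Wait-SimulatedJob -DurationMs 100, 60000 -TimeoutSeconds 1
```

In this simulation the timeout acts only as an upper bound: `$quick` returns well before 5 seconds, while `$stuck` returns at the timeout with one job still unfinished.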

@opustecnica , @brian-heidrich
So you can always set the maximum acceptable timeout for all of your jobs and get the results as soon as possible (or at the timeout).

The only reason I see for an individual timeout is throttling: the actual job start may be long after Start-RSJob was launched. That is the problem you are trying to solve with $NotStartedRSJobs …

Unfortunately, @proxb has not been here since April, so if I have some time I will try to implement it next week in my own fork.

Please test the PerJobTimeout branch at https://github.com/MVKozlov/PoshRSJob/tree/PerJobTimeout

Sorry, the throttling slot is still not freed, because Stop-RSJob is launched only at the end of the pipeline.
More work is needed on this.

But for now you can use it like this:

$goodJobs = $data | Start-RSJob { ... } | Wait-RSJob -Timeout 120 -PerJobTimeout
$goodJobs | WorkOnThis
$goodJobs | Remove-RSJob
$badJobs = Get-RSJob