mco-gh/gcslock

Unsuitable semantics for long-running processes

Consider a program which runs indefinitely:

  1. A call to gcslock.Lock(l, time.Second) fails for any reason (e.g. it times out).
  2. The underlying locking goroutine will keep running indefinitely, until it succeeds.

[In the meantime, another part of the program may have successfully acquired and released the lock.]

  3. Eventually, the goroutine started in step 1 succeeds in acquiring the lock.
  4. The program is now in an inconsistent state: every subsequent gcslock.Lock will fail until the lock is removed by an external job.

One solution is to simply limit goroutine retries to some max.

Another solution is to stop the locking/unlocking goroutine before returning an error.

Actually, limiting the number of retries of the locking goroutine won't work. I believe the best approach is to always terminate the locking/unlocking goroutine before returning.
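
Roughly, the timed helper follows this shape (a minimal sketch of the pattern, not the library's exact source; it only needs errors, sync and time from the standard library, and timedLock is just an illustrative name):

func timedLock(l sync.Locker, d time.Duration) error {
  done := make(chan struct{})
  go func() {
    // Keeps retrying against GCS until the lock is won, whether or not
    // the caller is still waiting for the result.
    l.Lock()
    close(done)
  }()
  select {
  case <-done:
    return nil
  case <-time.After(d):
    // Returning here leaves the goroutine above running; if it later
    // acquires the lock, nothing will ever release it.
    return errors.New("lock request timed out")
  }
}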

So, some code which does something like this in a goroutine?

// Retry the timed lock until it succeeds.
for {
  err := gcslock.Lock(l, time.Second)
  if err == nil {
    break
  }
}

The timed Lock fails for some reason (e.g. a timeout), and some other part of the same program acquires and releases the lock. Won't this loop eventually obtain the lock properly? I'm not seeing the inconsistency.

That example is not a long-running process. This one is:

// Loop indefinitely, processing one task at a time.
for {
  // Receive a new task from a channel ch.
  task := <-ch
  // Try to obtain the global lock, or skip the task and report the error.
  if err := gcslock.Lock(l, time.Second); err != nil {
    reportf("lock failed: %v", err)
    // Lock failed, so we skip the task and wait for a new one.
    // Unfortunately, the above lock goroutine will keep running,
    // trying to acquire a lock we don't need anymore.
    continue
  }
  execute(task)
  // Release the global lock once the task is done.
  if err := gcslock.Unlock(l, time.Second); err != nil {
    reportf("unlock failed: %v", err)
  }
}

Most importantly, whenever the above gcslock.Lock(l, time.Second) fails, the underlying gcslock goroutine needlessly keeps trying to acquire the lock and may very well succeed at some point. As a result, any future "useful" gcslock.Lock(l, time.Second) call will never succeed until the lock is removed by an external process.

This is very similar to what would happen with a push-to-publish git lock.

What I think needs to happen instead is for the locking goroutine to stop whenever gcslock.Lock returns an error, instead of continuing to run.

I get it. Thanks for clarifying. I need to stop the goroutine when I get a timeout, something like this:

  case <-time.After(d):
    // stop running goroutine here 
    return errors.New("lock request timed out")

Looks like there's no way to externally stop a goroutine, but I'm sure there's some trick I can use. Will do some googling.

Exactly, the // stop running goroutine here part is what's missing.
The locking loop needs a way to be stopped; it currently never exits.

There's no built-in way to stop a goroutine via some kind of "goroutine handle", but it can be controlled using context.WithTimeout or a channel. I would probably use a context with a deadline equal to the d argument passed to gcslock.Lock.
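
Something like this, for instance (just a sketch: contextLocker and LockWithContext are made-up names, assuming the underlying acquisition loop can be made to honour a context; it uses context and time from the standard library):

// A lock whose acquisition loop checks ctx and gives up once ctx is cancelled.
type contextLocker interface {
  LockWithContext(ctx context.Context) error
}

// The deadline travels with the context, so the retry loop inside
// LockWithContext stops by itself once the deadline passes: when this
// returns an error, nothing is left polling GCS in the background.
func timedLock(l contextLocker, d time.Duration) error {
  ctx, cancel := context.WithTimeout(context.Background(), d)
  defer cancel()
  return l.LockWithContext(ctx)
}

The nice property is that the caller's timeout and the retry loop's lifetime become the same thing, so a failed Lock can't leave anything behind.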

On it, first thing tomorrow am. Thanks!

This looks like exactly my scenario.

sync.WaitGroup feels like overkill for just one goroutine:

A WaitGroup waits for a collection of goroutines to finish ...
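
For a single goroutine, a plain done channel does the same job (a quick sketch; the names are illustrative):

done := make(chan struct{})
go func() {
  defer close(done)
  // ... the blocking work, e.g. the l.Lock() call ...
}()
<-done // waits for that one goroutine, same effect as wg.Wait()

For the timed-lock case, the same done channel can also sit in a select next to the context's Done channel.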