zio/zio-zmx

Summary is not threadsafe

Closed this issue · 1 comments

vigoo commented

The ConcurrentSummary#observe function is not thread-safe:

      def observe(value: Double, t: java.time.Instant): Unit = {
        if (currentCount.intValue() == maxSize) {
          values.removeFirst()
        } else {
          currentCount.increment()
        }
        values.add((t, value))

        count.increment()
        sum.add(value)
        ()
      }

I have seen a crash in production on the removeFirst line with a summary where it's maxSize was set to 1, my assumption is that the currentCount.intValue() == maxSize check was evaluated on two fibers in parallel and both tried to remove the single item. (Could not reproduce yet)

Another stress test to show the issue:

object Test extends App {
  override def run(args: List[String]): URIO[zio.ZEnv, ExitCode] = {
    val summary = MetricAspect.observeSummary("test", 10.minutes, 100, 0.0, Chunk.empty)

    (for {
      _ <- PrometheusClient.snapshot
             .flatMap(snapshot => console.putStrLn(snapshot.value))
             .repeat(Schedule.fixed(10.seconds))
             .forkDaemon
      _ <- (ZIO.succeed(0.1) @@ summary).forever.forkDaemon
      _ <- (ZIO.succeed(-0.1) @@ summary).forever
    } yield ExitCode.success).provideCustomLayer(PrometheusClient.live)
  }
}

With this test the currentCount and the length of values in ConcurrentSummary jumps over maxSize (usually) and grows infinitely.

This is definitely not thread-safe. Summaries should be using arrays that never change length.