awslabs/aws-embedded-metrics-java

putMetric in MetricsLogger is not ThreadSafe and leading to ArrayIndexOutOfBoundsExceptions

dbadami opened this issue · 9 comments

Hi,

We're adding metrics using using the MetricsLogger putMetric method in our application. We have multiple threads adding a metric for the same key and we received an ArrayIndexOutOfBoundsException when doing so. In our logs we were able to track it down to the addValue method in MetricDefinition.

Our hypothesis is that this is happening because the resizing of the array list is synchronized on so you can have an element that can being inserted at index that is greater than the current size of the array.

Happy to make the fix if the change is as simple as:

   MetricDefinition(String name, Unit unit, double value) {
        this(name, unit, Collections.synchronizedList(new ArrayList<>(Arrays.asList(value))));
    }

Thanks,

Devavrat

The library is not thread-safe. Your fix can fix one place but there're also other places needs. We will prioritize the enhancement of this.

We also need to make sure that the size of the list does not go beyond the current 100 limit.

#72

The current API is not compatible with being used in a multi-threaded application at all.

Since putProperty, putDimensions, and putMetric are separate calls, it is not possible to publish a metric with a given set of properties and dimensions without other threads potentially changing the properties and dimensions concurrently.

Are there any plans to address this?

@goldmar this assumes you’re using the same metrics logger instance on multiple threads, right? Can you use a different instance per thread? You could access it through a ThreadLocal to simplify this. There are definitely other enhancements we need to make to improve throughput for socket writing but even with that I think you’d still want a distinct logger per thread. Can you help us understand your use case a little better? It’s possible I’ve misunderstood. Is there some issue you’re hitting where distinct metrics loggers per thread are not working as expected?

@jaredcnance Thank you for the quick reply! We have started building our first service on ECS Fargate and are just trying to figure out the best way to emit metrics. We have configured Firelens with fluentbit and it works well, however I am not sure yet what is the best way to use this logging library. We have an Akka Streams application which consumes data from Kinesis and SQS and writes to Elasticsearch.

I don't think using ThreadLocals is the way to go in an async context; I think I'll create a custom Sink for emitting metrics in Akka instead. This will guarantee that there is only a single thread calling the MetricsLogger.

I see, so you want to be able to do something like:

  1. Collect some metrics or metadata on a single MetricsLogger
  2. Pass the logger into one or more async contexts where they can add new metrics or metadata
  3. Join the async contexts (e.g. Future.get()) and flush the metrics

Or even:

  1. Collect some metrics or metadata on a single MetricsLogger
  2. Clone the metrics logger and pass into an async context
  3. Flush from the async context

Is that the generalized use case?

So what I would like to do, is pass around a single MetricsLogger around in my async contexts and emit my metrics. This is obviously not possible because MetricsLogger is not only not threadsafe but also has the above mentioned limitations of its API. If you look at the implementations of different Logger classes (log4j1, log4j2, logback), all of these classes are threadsafe. So I think it is reasonable to expect this to be the case here as well, since the use cases are somewhat related.

Since this is currently not possible, a workaround is required. It could be using a ThreadLocal as you suggested above (I am not sure whether this plays well together with Akka though) or some other way of ensuring that the object is not called from multiple threads at the same time.

We discussed offline and to restate goldmar@'s specific ask is for an additional interface that simplifies the emission of metrics where flush() can be called immediately. This provides a better DX for multithreaded scenarios since user's don't have to be concerned with state and concurrency, but also comes at the loss of colocating related values and increased log volume. In addition, this kind of flush every time behavior should likely be implemented (or used carefully) after we support async writing. Otherwise every individual metric can block on a socket write.

add(
  String key,
  Double value,
  CloudWatchUnit unit,
  List<DimensionSet> dimensions,
  Map<String, String> properties
);

addAll(
  List<Metric> metrics,
  List<DimensionSet> dimensions,
  Map<String, String> properties
);

@dbadami Hi Devavrat,
As we’re continuing working on enhancing thread safety of the library, we’re trying to gather all available information on reported thread-safe problems. Since now the version has changed to v1.0.5, I would greatly appreciate it if you can provide more information about whether you’re still experiencing issues while using this library with multi-threading, what exception type you get, and any other details about your issue.
I would like to simulate the issue using your specific example as well. Will it be possible to share a part of your application(if not complete) that can help me do that?