open-telemetry/community

[Donation Proposal]: OTEL Arrow Adapter

lquerel opened this issue · 28 comments

Description

This project extends, in a compatible way, the existing OTLP protocol with a generic columnar representation for metrics, logs, and traces based on the cross-language Apache Arrow transport system for columnar data. This extension significantly improves the efficiency of the OTLP protocol for scenarios involving the transmission of large batches of OTLP entities. Results show a 2-3 times better compression rate for typical data, while Apache Arrow’s zero-copy serialization and deserialization techniques help lower overhead.

The first phase of this project will deliver the following primary components, contained in the donated repository:

  1. A Golang-based reference implementation of the adapter, translating to and from OTLP-Arrow via the OTel Collector's “pdata” interface (a minimal usage sketch follows this list).
  2. Protobuf definition for OTLP-Arrow that would be migrated into the opentelemetry-proto repo as it stabilizes. This includes a representation for multivariate metrics, which allows metrics with shared attributes to be represented and processed compactly.
  3. Arrow-enabled OTel Collector Exporter and Receiver components that are drop-in compatible with the OTLP Exporter and Receiver core components.
  4. Arrow-enabled OTel-Go SDK “shim” to the Arrow-enabled Exporter component, which is mainly useful for validation purposes.
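A minimal sketch of the pdata side of item 1. The `ptrace` calls below are the Collector's real “pdata” API; the `arrowProducer` interface and its `BatchArrowRecordsFrom` method are hypothetical stand-ins for the adapter's conversion entry point, which this proposal does not spell out.

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

// arrowProducer is a hypothetical stand-in for the adapter's conversion
// API; only the pdata calls in main below are the real Collector interface.
type arrowProducer interface {
	// BatchArrowRecordsFrom converts a pdata trace batch into serialized
	// Arrow record batches ready to be sent on a stream (name invented).
	BatchArrowRecordsFrom(td ptrace.Traces) ([][]byte, error)
}

func main() {
	// Build a small OTLP trace batch through the pdata interface.
	td := ptrace.NewTraces()
	rs := td.ResourceSpans().AppendEmpty()
	rs.Resource().Attributes().PutStr("service.name", "checkout")
	span := rs.ScopeSpans().AppendEmpty().Spans().AppendEmpty()
	span.SetName("GET /cart")

	fmt.Println("spans to convert:", td.SpanCount())
	// A real program would hand td to the adapter here, e.g.:
	//   records, err := producer.BatchArrowRecordsFrom(td)
}
```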

In a second phase of the project, we propose developing new OpenTelemetry components and mechanisms for exchanging and processing data with the Apache Arrow ecosystem, including:

  1. Arrow-enabled Collector: an OTel Collector pipeline for processing Arrow batches directly
  2. Support for reading/writing Apache Parquet files, and other methods to leverage non-Golang Apache Arrow ecosystem components (e.g., the Arrow Query engine)
  3. Arrow-enabled OpenTelemetry SDKs (i.e., without a “shim” and adapter).

The repository for donation: https://github.com/f5/otel-arrow-adapter

Collector repo fork containing the drop-in compatible exporter and receiver components under development: https://github.com/open-telemetry/experimental-arrow-collector

More details are available in the associated OTEP: https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md. The OTEP is still pending and unmerged.

Benefits to the OpenTelemetry community

Compared to the existing OpenTelemetry protocol, this compatible extension offers the following improvements:

  • Reduce the bandwidth requirements of the protocol. The two main levers are: 1) a more compact, columnar representation of the telemetry data, and 2) a stream-oriented gRPC endpoint that transmits batches of OTLP entities more efficiently.
  • Provide a more optimal representation for multivariate time-series data. With the current version of the OpenTelemetry protocol, users have to transform multivariate time series (i.e., multiple related metrics sharing the same attributes and timestamp) into a collection of univariate time series, resulting in substantial duplication and overhead across the entire chain from exporters to backends (illustrated in the sketch after this list).
  • Provide more advanced and efficient telemetry data processing capabilities. Growing data volumes, cost efficiency, and data minimization all call for additional data processing capabilities such as projection, aggregation, and filtering.
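To make the multivariate point concrete, here is a small sketch of the duplication that today's row-oriented OTLP forces. The `pmetric`/`pcommon` calls are the Collector's real pdata API; the metric names and attributes are invented for illustration.

```go
package main

import (
	"fmt"
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

func main() {
	// One multivariate measurement (two related values, one timestamp,
	// one attribute set) must become two univariate metrics today...
	ts := pcommon.NewTimestampFromTime(time.Now())
	md := pmetric.NewMetrics()
	sm := md.ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty()

	for name, value := range map[string]float64{
		"system.memory.used": 3.2e9,
		"system.memory.free": 4.8e9,
	} {
		m := sm.Metrics().AppendEmpty()
		m.SetName(name)
		dp := m.SetEmptyGauge().DataPoints().AppendEmpty()
		dp.SetTimestamp(ts)
		dp.SetDoubleValue(value)
		// ...and the shared attributes are duplicated on every data point.
		dp.Attributes().PutStr("host.name", "host-1")
		dp.Attributes().PutStr("region", "us-east-1")
	}
	fmt.Println("univariate metrics emitted:", md.MetricCount())
}
```

In the proposed columnar representation, the shared timestamp and attributes would be stored once per row, with one column per related value.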

Reasons for donation

The proposed protocol is tightly integrated with the existing OTLP protocol. A fallback mechanism between the two protocols has been implemented via a new pair of OTLP Arrow Exporter/Receiver components that are drop-in compatible with the OTLP Exporter/Receiver core components, justifying the integration of this extension directly into the upstream project.

Repository

https://github.com/f5/otel-arrow-adapter

Existing usage

The project is developed and maintained jointly by F5 and Lightstep. F5 seeks to align the open-source community built around NGINX with the open-source community built around OpenTelemetry; this effort will deliver industrial-grade analytics capabilities for the telemetry data produced by its software components for its customers. Improving the compression performance and the representation of multivariate time series are also among F5's goals. Lightstep, wanting to recommend OpenTelemetry collectors for its customers to use, seeks to improve the compression performance and efficiency achieved by OTLP when used for bulk data transport.

We are actively testing the components for donation on production data using an OpenTelemetry collector built from our drop-in compatible OTLP exporter and receiver. We are extending test coverage and eliminating gaps. We are planning to start beta tests with the community by providing documentation and tooling for benchmarking and troubleshooting.

Maintenance

F5 and Lightstep will continue to develop and maintain the project. We will encourage and help all new contributors to participate in this project. We are open to suggestions and ideas.

Our current roadmap is as follows:

  1. Continue to work on performance, reliability, and robustness; extend test coverage; test with more production data.
  2. Phase 2 of this project
    a) New client SDK natively supporting OTLP Arrow and multivariate metrics.
    b) Continuing the migration to Apache Arrow to achieve end-to-end performance gains.
    c) Integrating processing, aggregation, and filtering capabilities that leverage the Apache Arrow ecosystem.
    d) Parquet integration.

Licenses

This project is licensed under the terms of the Apache 2.0 open source license.

Trademarks

This project uses Apache Arrow.

Other notes

No response

I am beyond excited to see this proposal! Excellent work, and I can't wait to see phase 2 of this project. 🚀

I think it might be good to call out some of the large technical differences in OTLP-arrow vs. OTLP and when to use it vs. other options.

I was looking over the details.

  1. This new protocol is streaming gRPC. That's a change from batch OTLP. How would this interact with load balancers and the overall design goals of OTLP?
  2. Is the new protocol stateful (or able to be stateful)? I saw it has optional dictionaries that can be passed and such. I'm curious if this is the case.

I'm definitely supportive of OTLP-arrow as a viable alternative to OTLP where it makes sense, but I'd love it if you could outline where it shines and where it doesn't. Specifically, when should you pick one vs. the other?

Thanks @jsuereth for your support.

Regarding your first comment, I added the following note in the OTEP to explain the motivations behind streaming gRPC for OTLP Arrow.

Unary RPC vs Stream RPC: We use a stream-oriented protocol to get rid of the overhead of specifying the schema and dictionaries for each batch. State is maintained on the receiver side to keep track of the schemas and dictionaries. The Arrow IPC format has been designed to follow this pattern and also allows the dictionaries to be sent incrementally. Similarly, ZSTD dictionaries can also be transferred over the RPC stream to optimize the transfer of small batches (for more details see the description of the compression field in the following paragraphs). To mitigate the usual pitfalls of a stream-oriented protocol, please see this paragraph in the implementation recommendations section.

For the second comment, I confirm that the protocol is stateful. We use this property to maintain schemas and dictionaries on the receiver side and amortize their cost. The small extra cost of the first messages is paid back quite quickly, yielding smaller messages as well as a better compression ratio.
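A schematic sketch of the receiver-side state described above. All type and method names here are hypothetical; the real protocol is defined in the OTEP and the donated repository.

```go
package main

// schemaID identifies one schema negotiated on a stream (name invented).
type schemaID uint64

// streamState is the per-stream bookkeeping: schemas and dictionaries are
// received once (or as incremental deltas) and reused for every batch.
type streamState struct {
	schemas      map[schemaID][]byte // Arrow schemas seen on this stream
	dictionaries map[schemaID][]byte // dictionary payloads, grown by deltas
}

func newStreamState() *streamState {
	return &streamState{
		schemas:      make(map[schemaID][]byte),
		dictionaries: make(map[schemaID][]byte),
	}
}

// onMessage applies one stream message: schema and dictionary messages
// update the state; data batches are decoded against it.
func (s *streamState) onMessage(id schemaID, kind string, payload []byte) {
	switch kind {
	case "schema":
		s.schemas[id] = payload
	case "dictionary_delta":
		// Arrow IPC allows dictionaries to be sent incrementally.
		s.dictionaries[id] = append(s.dictionaries[id], payload...)
	case "batch":
		// Decode payload using s.schemas[id] and s.dictionaries[id].
	}
}

func main() { _ = newStreamState() }
```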

For your last question, the following two scenarios are particularly suitable for OTLP Arrow (non-exhaustive list):

  • Production environments consisting of a large number of OTel data producers that send their telemetry data to an external system (over the Internet). An OTLP Arrow collector acting as a batching point at the edge of the environment will yield substantial network cost savings.
  • Individual producers of telemetry data with massive trace streams (e.g., API gateways, load balancers) will be able to use this new protocol with a native OTLP Arrow client (currently in the design phase). The same applies to systems that emit massive amounts of metrics and logs.

A future enhancement (based on ZSTD dictionaries) should also make this project interesting for scenarios where the batch size is small.

A couple questions:

  • There is no support for this new protocol over plain HTTP; it requires gRPC?
  • Since it uses gRPC streams, how does it handle load balancing? Streams are long-lived and can cause imbalance over time.

@tigrannajaryan

The design currently calls for the use of gRPC streams to benefit from OTLP-Arrow transport. We believe that some of this benefit can be had even for unary gRPC and HTTP requests, by using large request batches to amortize the sending of dictionary and schema information. This remains an area for study.

Regarding the "imbalance over time" question, we rely on the MaxAge and MaxAgeGrace of the gRPC connection to mitigate with this kind of issue. This is discussed in the "Implement recommendations" section of the OTEP.

@jsuereth @tigrannajaryan

I updated my benchmark tool to compare the three following experiments:

  • OTLP
  • OTLP Arrow stream mode
  • OTLP Arrow unary RPC mode (or plain HTTP mode)

The following results are based on traces coming from a production environment (~400K traces).

[Screenshot: benchmark results comparing OTLP, OTLP Arrow stream mode, and OTLP Arrow unary RPC mode across batch sizes]

The streaming mode is definitely the best of these three experiments. The protocol is stateful, which makes it possible to:

  • amortize the initial cost of the schema
  • transfer delta dictionaries to entirely remove redundant information between batches in the same stream

It is interesting to note that for batches with more than 100 traces, the unary RPC mode is slightly better than the standard OTLP protocol. This is because the organization of the data is better suited to compression.

I updated my benchmark tool to compare the three following experiments

@lquerel thanks for the update! What does the batch size reflect? Is it literally the number of traces or number of spans? If number of traces, how many spans in each trace?

@lquerel I am surprised streaming saves so much for large batch sizes. I would expect that for sufficiently large batches the dictionary is a small part of the payload, so re-sending it with every batch shouldn't result in much bigger payloads.
Do you have this particular benchmark's code so that I can run it myself and understand a bit better?

Also, if I understand correctly, the columnar format shows little advantage in compressed form (and is sometimes even worse, for very small batches). So essentially most of the gains come from streaming, not from the columnar format.

@tigrannajaryan The following pull request contains the modifications to benchmark the "unary RPC" mode for traces and logs. I'm still working on finalizing a memory optimization for metrics. These benchmarks reproduce all the steps, from the conversion between OTLP and OTLP Arrow to the serialization, the compression, and so on. These steps are clearly represented in the output of the benchmark tool.

f5/otel-arrow-adapter#108

To run the benchmark you need some data in protobuf format: basically, a trace request serialized in binary format and saved as a file. To run the benchmark for traces I'm using the following command:

go run tools/trace_benchmark/main.go -unaryrpc traces.bin

Here, traces.bin is the file containing the traces. The file must contain enough traces to run all the different tested scenarios (batch sizes 10, 100, 1000, 2000, 5000, 10000). I personally used a dataset of ~400k traces.
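For anyone preparing their own input file: a serialized trace request of the kind described above can be produced with the Collector's pdata packages. A minimal sketch (the single span is only there to show the serialization step; real input should be production data):

```go
package main

import (
	"log"
	"os"

	"go.opentelemetry.io/collector/pdata/ptrace"
	"go.opentelemetry.io/collector/pdata/ptrace/ptraceotlp"
)

func main() {
	// Build a trivial trace batch; a real file should hold many traces.
	td := ptrace.NewTraces()
	span := td.ResourceSpans().AppendEmpty().
		ScopeSpans().AppendEmpty().
		Spans().AppendEmpty()
	span.SetName("example-span")

	// Wrap it in an export request and serialize it as binary protobuf.
	req := ptraceotlp.NewExportRequestFromTraces(td)
	buf, err := req.MarshalProto()
	if err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("traces.bin", buf, 0o644); err != nil {
		log.Fatal(err)
	}
}
```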

I will soon update the benchmark tool to produce more detailed stats in order to present the distribution of spans and traces in the input file.

UPDATE: stats on the data set.
[Screenshot: statistics on the distribution of spans and traces in the input data set]

Here, traces.bin is the file containing the traces.

Can you share the traces.bin that you used? I want to rerun the same benchmarks.

@tigrannajaryan

Regarding the impact of streaming on the compression ratio: not having to send dictionaries over and over again for many fields is definitely an advantage. By nature, a dictionary doesn't compress well because it contains few repetitions, so being able to drop the dictionaries entirely is a direct net gain. This is what happens in the streaming scenario, although the gain may vary a bit depending on the nature of the data set.

In the unary RPC scenario, we pay the overhead of both the schema and the dictionaries, which are rebuilt for each batch; compared against a good compression algorithm, this mode does not do much better. However, in terms of allocation, serialization, and processing speed (e.g., for filtering or aggregation), the columnar format will be incomparably better (this will become visible in phase 2). I think it is theoretically possible to do better by sorting some columns to optimize the compression ratio, a well-known optimization in the world of columnar data (see the sketch below). However, it is quite difficult to implement generically in the OTel context because OTel entities are strongly hierarchical and attributes are not uniform across the entities that compose a batch. This will be the subject of a future optimization that will not impact the protocol or the data format.
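The column-sorting effect is easy to demonstrate in isolation. The sketch below compresses the same synthetic low-cardinality column twice with zstd (via github.com/klauspost/compress/zstd), unsorted and sorted; the sorted version typically compresses noticeably better. The data is invented purely for illustration.

```go
package main

import (
	"fmt"
	"log"
	"math/rand"
	"sort"
	"strings"

	"github.com/klauspost/compress/zstd"
)

// compressedSize returns the zstd-compressed size of data.
func compressedSize(data []byte) int {
	enc, err := zstd.NewWriter(nil)
	if err != nil {
		log.Fatal(err)
	}
	defer enc.Close()
	return len(enc.EncodeAll(data, nil))
}

func main() {
	// A synthetic low-cardinality column, e.g. a host attribute.
	hosts := []string{"host-a", "host-b", "host-c", "host-d"}
	column := make([]string, 10000)
	for i := range column {
		column[i] = hosts[rand.Intn(len(hosts))]
	}

	unsorted := []byte(strings.Join(column, ","))
	sort.Strings(column)
	sorted := []byte(strings.Join(column, ","))

	// Sorting groups identical values into long runs, which the
	// compressor exploits; the payload content is otherwise identical.
	fmt.Println("unsorted compressed bytes:", compressedSize(unsorted))
	fmt.Println("sorted compressed bytes:  ", compressedSize(sorted))
}
```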

Finally, the fact that the compression ratio is a bit worse with small batches was expected. The goal of this effort is to optimize for batch sizes of at least 100-200 elements.

jmacd commented

For the record:

This new protocol is streaming gRPC. That's a change from batch OTLP. How would this interact with load balancers and the overall design goals of OTLP?

We did a study of the impact of unary vs. streaming gRPC. There is nothing preventing us from using unary RPC, but you will need large batches to get good compression and you will still be leaving compression "on the table".

Is the new protocol stateful (or able to be stateful?) I saw it has optional dictionaries that can be passed and such. I'm curious if this is the case.

It is point-to-point stateful. You can reset a connection at any time and it will rebuild the state needed in a future connection.

Support for reading/writing Apache Parquet files

I'm not very familiar with this project, and so might be rehashing something that has already been discussed, but I thought I would highlight that Parquet does not support UnionArrays, something the current schema appears to use extensively. There was a proposal to add it a while back, apache/parquet-format#44, but it has stalled out. Given that adoption of Parquet v2, released almost 10 years ago, is still ongoing, I wouldn't hold out much hope for broad ecosystem support even if this were to be picked up again.

In general, support for UnionArray within the Arrow ecosystem is fairly inconsistent, and I would strongly advise against using them if ecosystem interoperability, especially with Parquet-based tooling, is a key design principle.

Otherwise very excited to see movement in this space, keep up the good work 👍

Regarding the protocol, has Arrow Flight RPC been considered? It's essentially a gRPC service built to move Arrow record batches over the network.

Arrow Flight is an RPC framework for high-performance data services based on Arrow data, and is built on top of gRPC and the IPC format.

Flight is organized around streams of Arrow record batches, being either downloaded from or uploaded to another service. A set of metadata methods offers discovery and introspection of streams, as well as the ability to implement application-specific methods.

Methods and message wire formats are defined by Protobuf, enabling interoperability with clients that may support gRPC and Arrow separately, but not Flight. However, Flight implementations include further optimizations to avoid overhead in usage of Protobuf (mostly around avoiding excessive memory copies).

jmacd commented

@tustvold Thank you for your point about UnionArrays. We absolutely want to avoid taking steps that limit the future use of Parquet. While we were aiming for network compression on a bridge between two OTel collectors, it is natural that users want to record files of telemetry data for other uses.

jmacd commented

@jacobmarble Our implementation uses Arrow IPC under the cover of a gRPC stream encapsulating Arrow IPC messages. We took this approach (a) to fit with the OTel collector which provides a rich set of helpers for setting up and monitoring gRPC connections, and (b) to make it easy to upgrade by sharing the gRPC server instance used by an OTLP receiver, so that one server can support both protocols on the same port.
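Schematically, this server sharing works because a single grpc.Server can host multiple registered services on one listener. The two Register calls below are placeholders, not the real generated registration functions:

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	// One *grpc.Server instance can host several services, which is how a
	// single listener/port can serve both standard OTLP and the Arrow
	// stream protocol side by side.
	srv := grpc.NewServer()

	// Placeholders for the real generated registration calls:
	// otlpgrpc.RegisterTracesServer(srv, otlpHandler)        // standard OTLP
	// arrowpb.RegisterArrowStreamServiceServer(srv, handler) // OTel-Arrow

	lis, err := net.Listen("tcp", ":4317")
	if err != nil {
		log.Fatal(err)
	}
	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```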

jmacd commented

@open-telemetry/technical-committee please consider whether we are willing to accept this donation based on the technical diligence carried out by @tigrannajaryan in open-telemetry/oteps#171 (comment).

Before focusing on current, unresolved feedback, I want to step back and remind us of the goal here. The goal is to have an OTel-Arrow project supporting generation and exchange of column-oriented telemetry data using OpenTelemetry's data model.

When @lquerel presented his first proposal on this, members of @open-telemetry/technical-committee including myself and @tigrannajaryan requested the project be split into at least two phases. The first phase was to bring a high-compression bridge feature to the OTel collector and pave the way for further development of OTel-Arrow. The first phase is essentially complete at this point.

@tigrannajaryan's results, obtained by comparing against a bespoke, row-based encoding, indicate that slightly more compression ought to be achievable. He is correct, and we take the potential for a slight compression improvement as a bug in our implementation. @lquerel is currently working on an enhancement to overcome this problem.

While @tigrannajaryan's results show there is an easier way to achieve a high-compression bridge between OTel collectors, a bespoke encoding does not bring us closer to supporting column-oriented telemetry data in the Arrow ecosystem. I ultimately see the result as confirming that our implementation, built on the Arrow-Go library, has good performance. Moreover, this sort of bespoke, row-oriented encoding will not be able to achieve the significant gains in processing speed that Arrow and its columnar representation will offer in Phase 2.

Accepting this donation means agreeing to form a new OTel group focused on the kind of technical and user feedback we are now receiving. Here are some items that we regard as future work blocked until this donation is accepted:

  • We care deeply about support for Parquet files, which was not scoped in Phase 1. This may require changes relative to what we have, but like any protocol OTel-Arrow needs to build in support for change.
  • We want to gather more data to tune the algorithms used to choose Arrow representations from OTLP data. It may be that we are overfitting for the small number of data sets that we have today and we expect to continue making compression improvements.

We should not stall this donation because there is room for technical improvement in our Phase 1 product. Further, I suggest we should approve and merge @lquerel's OTEP as it captures our Phase 1 results. The kinds of discussion points raised by @tigrannajaryan should be discussed in the future when we move to add protocol support to core opentelemetry-proto repository, not at this stage.

On this last point, @tigrannajaryan's feedback was well received. I agree we should not require the use of Arrow to achieve reasonably good compression in OpenTelemetry. I support the notion that OTLP's existing, row-oriented protocol would benefit from a dictionary mechanism. Since @tigrannajaryan demonstrated that a bespoke encoding can easily achieve very good results, it occurred to us that we should try to separate our work on bringing a streaming bridge to the OTel collector from our work on the OTel-Arrow adapter. The observation is that the changes made to the OTLP exporter and receiver components in https://github.com/open-telemetry/experimental-arrow-collector for streaming are decoupled from the OTel-Arrow adapter itself; with some work, the OTLP exporter and receiver components could support pluggable stream encodings. For instance, we could engineer the core OTLP exporter and receiver components to support both the OTel-Arrow mechanism (based on Arrow IPC) or the bespoke mechanism designed by @tigrannajaryan; this also appears to be a way we could build support for versioning in the OTel-Arrow protocol.
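A hypothetical sketch of what such a pluggable stream-encoding seam could look like. None of these names exist today, and a real design would have to cover all signals and per-stream state management; this only illustrates the shape of the idea.

```go
package streamencoding

import "go.opentelemetry.io/collector/pdata/ptrace"

// StreamEncoding is a hypothetical seam the OTLP exporter and receiver
// could negotiate per stream, with one implementation backed by Arrow IPC
// and another by a bespoke row-oriented dictionary encoding.
type StreamEncoding interface {
	// Name identifies the encoding during stream negotiation.
	Name() string
	// Encode turns a pdata batch into stream messages, reusing per-stream
	// state (schemas, dictionaries) across calls.
	Encode(td ptrace.Traces) ([][]byte, error)
	// Decode applies a stream message to per-stream state and returns any
	// completed batches.
	Decode(msg []byte) ([]ptrace.Traces, error)
}
```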

@jmacd great points, thank you.

Just wanted to add that I am still working with @lquerel on the review and will need a bit more time to complete it and post my findings.

The TC looked at the donation proposal and the associated OTEP and discussed it with Collector maintainers.

Here are TC's observations:

  • The Arrow-based telemetry data representation can bring significant wire size savings, as demonstrated by the benchmarks. The savings can be particularly high for multivariate metric data (over 10x in uncompressed form or over 3x in compressed form); however, they are more modest for other data sets (1.5x or 2x is typical).
  • The experimental implementation in some rare cases shows an increase in wire size (negative savings), which will need further investigation.
  • For certain datasets, the experimental implementation crashes during encoding/decoding operations.
  • Phase 2 of the proposal intends to add built-in support for Arrow-based pdata in the Collector. There are currently no experimental implementations of this idea that demonstrate the benefits of the approach end-to-end, including the expected performance gains (reduced CPU usage).
  • Collector maintainers told me that they would like to see such end-to-end implementation in the Collector to understand the Phase 2 benefits better.

Based on the above the TC makes the following recommendations:

  • The Arrow-based telemetry data representation and processing based on the new data format looks promising. OpenTelemetry accepts the proposed donation in experimental form, with the understanding that further incorporation into the OpenTelemetry ecosystem is conditional on items listed below.
  • The donated code should be clearly marked as experimental until more progress is made in having production-grade and stable implementations.
  • While in the experimental phase, the Arrow-based protocol should use a different name to avoid creating confusion and to avoid reputational damage to OTLP.
    • The Arrow-based protocol implementations in the Collector should be added as a dedicated contrib receiver and exporter instead of being incorporated in the OTLP receiver and exporter.
    • The Arrow-based protocol implementations in the SDK should be added as a dedicated experimental exporter instead of being incorporated in the OTLP exporter.
  • After proven and stable implementations exist, the Arrow-based data representation can be considered to be added to OTLP specification as a runtime-negotiated extension and be incorporated in the OTLP receiver and exporter in the Collector and in the OTLP exporter of the SDKs. The choice of the default data format (row-based Protobuf or columnar Arrow) will be made later based on benchmarking and feedback from users.
  • The proposal authors are encouraged to add arrow-based data format support to pdata in the experimental-arrow-collector repository and work together with Collector maintainers to demonstrate the benefits. Engineering cost-benefit analysis is necessary to show that the benefits outweigh increased complexity and maintenance costs.

Independently of the above, as was discovered during the OTEP review, OpenTelemetry should consider adding a dictionary encoding extension to OTLP to gain significant wire size benefits.
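To illustrate the idea (a sketch of the concept only, not a proposed wire format): a per-stream dictionary lets each distinct attribute string cross the wire once, after which only small integer references are sent.

```go
package dictenc

// Dictionary maps repeated attribute strings to small integer references,
// so each distinct string is transmitted once per stream.
type Dictionary struct {
	index map[string]uint32
	words []string
}

func NewDictionary() *Dictionary {
	return &Dictionary{index: make(map[string]uint32)}
}

// Ref returns the reference for s, adding it on first use. The caller
// transmits newly added strings once, then only the references.
func (d *Dictionary) Ref(s string) (ref uint32, isNew bool) {
	if ref, ok := d.index[s]; ok {
		return ref, false
	}
	ref = uint32(len(d.words))
	d.words = append(d.words, s)
	d.index[s] = ref
	return ref, true
}

// Lookup resolves a reference back to its string on the receiver side.
func (d *Dictionary) Lookup(ref uint32) string {
	return d.words[ref]
}
```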

@lquerel if you can adjust your OTEP to match the conditions listed in the TC response we should be able to also merge the OTEP.

@lquerel if you can adjust your OTEP to match the conditions listed in the TC response we should be able to also merge the OTEP.

@tigrannajaryan yes, I plan to update the OTEP soon to reflect my latest changes and to integrate the conditions listed in the TC response. Thanks.

jmacd commented

@tigrannajaryan We believe the OTEP has been updated as you requested. Are we clear to merge this donation and the OTEP?

The GC voted to accept the donation (#1332) on July 6, 2023.

jmacd commented

I've requested a formal statement from F5 and Lightstep's sales and marketing departments as detailed in the community donation process, after which I will close this issue. Thanks all!

ServiceNow, formerly known as Lightstep, acknowledges the marketing guidelines (https://github.com/open-telemetry/community/blob/main/marketing-guidelines.md).

jmacd commented

The donation is complete. Thank you especially to @lquerel and F5.