OpenTelemetry tracer client has infinite loop tracing itself
Pascal-Delange opened this issue · 12 comments
Environment details
- Programming language: Go
- OS: Docker image based on alpine:3.19
- Language runtime version: 1.22.0
- Package version: google.golang.org/api v0.154.0
Steps to reproduce
I'm stuck in a state where I can't update my Go GCP client library dependencies. When I try to go to or above v0.154.0 for google.golang.org/api (or anything that depends on it), I start seeing endless traces coming from the trace service itself, similar to what is described here. Up to v0.153.0 everything works fine.
I cannot find an obvious way to tell the tracer client not to trace itself.
Any suggestion on where to look would be most welcome.
My go.mod file is attached.
go.mod.txt
@Pascal-Delange Can you please provide some more detail about your configuration? Are you using the OpenCensus Bridge? Please show your imports as well as your complete OpenCensus and OpenTelemetry configuration. If you can write a simple demo that I can easily run that reproduces the loop, that would be ideal, thanks!
Are you using the OpenCensus Bridge?
Because you do not directly require go.opencensus.io in the attached go.mod file, I think you are not using the bridge. But please provide your full configuration, thanks.
@Pascal-Delange Are you able to upgrade libraries such as cloud.google.com/go/storage, firebase.google.com/go/v4, and google.golang.org/api to the latest versions?
Thanks for the answers.
OK, I had a bit of a hard time extracting only the minimal parts of the code, but here you go:
package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/GoogleCloudPlatform/opentelemetry-operations-go/exporter/trace"
    "github.com/exaring/otelpgx"
    "github.com/jackc/pgx/v5/pgxpool"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
    oteltrace "go.opentelemetry.io/otel/trace"
)

func main() {
    type ContextKey int
    var ContextKeyOpenTelemetryTracer ContextKey = 0

    connectionString := fmt.Sprintf("host=%s user=%s password=%s database=%s sslmode=disable port=%s",
        "localhost", "postgres", "password", "database", "5432")

    // Set up the Cloud Trace exporter and the OpenTelemetry tracer provider.
    exporter, err := trace.New(trace.WithProjectID("projectId"))
    if err != nil {
        panic(err)
    }
    res, err := resource.New(context.Background(),
        resource.WithTelemetrySDK(),
        resource.WithAttributes(
            semconv.ServiceNameKey.String("test app"),
        ),
    )
    if err != nil {
        panic(err)
    }
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
    )
    otel.SetTextMapPropagator(
        propagation.NewCompositeTextMapPropagator(
            propagation.TraceContext{},
            propagation.Baggage{},
        ),
    )
    otel.SetTracerProvider(tp)
    tracer := tp.Tracer("test app")

    connectionPool, err := newPostgresConnectionPool(connectionString)
    if err != nil {
        log.Fatal("error creating postgres connection to database", err.Error())
    }

    ctx := context.Background()
    ctx = context.WithValue(ctx, ContextKeyOpenTelemetryTracer, tracer)

    // Start one span every two seconds, wrapping a trivial instrumented DB query.
    for i := range 1000 {
        fmt.Println(i)
        tracer, _ := ctx.Value(ContextKeyOpenTelemetryTracer).(oteltrace.Tracer)
        ctx, span := tracer.Start(ctx, fmt.Sprintf("span %v", i))
        _, err := connectionPool.Exec(ctx, "SELECT 1")
        if err != nil {
            panic(err)
        }
        span.End()
        time.Sleep(2 * time.Second)
    }
}

func newPostgresConnectionPool(connectionString string) (*pgxpool.Pool, error) {
    cfg, err := pgxpool.ParseConfig(connectionString)
    if err != nil {
        return nil, fmt.Errorf("create connection pool: %w", err)
    }
    // Instrument pgx with the otelpgx tracer so each query produces a span.
    cfg.ConnConfig.Tracer = otelpgx.NewTracer()
    cfg.MaxConns = 10
    cfg.MaxConnIdleTime = 5 * time.Minute
    pool, err := pgxpool.NewWithConfig(context.Background(), cfg)
    if err != nil {
        return nil, fmt.Errorf("unable to create connection pool: %w", err)
    }
    return pool, nil
}
(It's perhaps a bit clumsy, but it should illustrate what happens.)
So after running this test, it appears that what is actually happening is that upgrading to v0.154.0 or beyond makes the trace client trace itself. It does not seem to do so abnormally (it flushes every few seconds, and each flush takes ~200-300ms), but if activity in the app is not too high, this still drowns the trees (app request traces) behind a forest of trace-client self-traces.
See the screenshot below, where the "fast" traces are from the db query spans and the "slow" traces are from the tracer client.
This is with the full go.mod as attached:
go.mod.txt
How are you changing the version of google.golang.org/api if it is not a direct dependency in go.mod? Can you also provide the < v0.154.0 go.mod for the example you just posted? I am wondering what the version differences are for the direct dependencies that indirectly require google.golang.org/api. It would be good to establish the change in the direct dependency (or dependencies) that results in the issue.
Within google.golang.org/api itself, there really isn't very much telemetry code. I believe it's just the trace context propagation support for HTTP and gRPC. Please correct me if I'm wrong.
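(For context, a minimal sketch, not the library's actual code, of the kind of wiring that instrumentation adds: when a gRPC connection is dialed with an otelgrpc stats handler, every RPC made over it gets recorded as a client span, including the flush calls the exporter itself makes to the Cloud Trace API. The endpoint and dial options below are illustrative assumptions.)

package main

import (
    "crypto/tls"
    "log"

    "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials"
)

func main() {
    // Every RPC on this connection is traced by the otelgrpc stats handler,
    // which is how the exporter's own export RPCs end up as spans.
    conn, err := grpc.Dial(
        "cloudtrace.googleapis.com:443",
        grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})),
        grpc.WithStatsHandler(otelgrpc.NewClientHandler()),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
}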
Hi,
I pinpointed google.golang.org/api as the likely root cause by trying to narrow down at which version my issue appears, after observing that bumping any GCP library also bumped google.golang.org/api along with it. I specifically pinned the version by running go get google.golang.org/api@version.
Attached are the go.mod files with v0.153.0 and v0.154.0, along with the diff below:
pascal % diff go.mod_153 go.mod_154
20a21
> github.com/felixge/httpsnoop v1.0.4 // indirect
31a33,34
> go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc v0.46.1 // indirect
> go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.46.1 // indirect
40c43
< google.golang.org/api v0.153.0 // indirect
---
> google.golang.org/api v0.154.0 // indirect
So three additional indirect dependencies are included if I switch to v0.154.0, with the following (or similar) dependency path:
pascal % go mod why -m go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc
# go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc
github.com/checkmarble/marble-backend
github.com/GoogleCloudPlatform/opentelemetry-operations-go/exporter/trace
cloud.google.com/go/trace/apiv2
google.golang.org/api/transport/grpc
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc
Release v0.154.0 added OpenTelemetry trace context propagation via 46421d4. It also added the following (direct and indirect) dependencies to go.mod:
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc v0.46.1
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.46.0
github.com/felixge/httpsnoop v1.0.4 // indirect
github.com/go-logr/logr v1.3.0 // indirect
github.com/go-logr/stdr v1.2.2 // indirect
go.opentelemetry.io/otel v1.21.0 // indirect
go.opentelemetry.io/otel/metric v1.21.0 // indirect
go.opentelemetry.io/otel/trace v1.21.0 // indirect
Can you modify your v153 demo script above to directly import these dependencies, and specify these versions (or whatever versions are in your v154 script) in your go.mod, while keeping google.golang.org/api at v0.153.0? The goal is to isolate the problem at its source. If you can still reproduce the problem while google.golang.org/api is at v0.153.0, we will know that the context propagation mechanism in v0.154.0 is not the problem. And if you can't reproduce, then I will investigate the context propagation mechanism in v0.154.0.
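(For illustration only, a minimal sketch, not the change actually used in this thread, of one way to do this: add a file next to the demo's main.go that blank-imports the packages so they become direct dependencies, then pin their versions in go.mod while keeping google.golang.org/api at v0.153.0.)

package main

import (
    // Blank imports promote these packages to direct dependencies, so their
    // versions can be pinned in go.mod independently of google.golang.org/api.
    _ "github.com/felixge/httpsnoop"
    _ "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
    _ "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)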
Thanks for bearing with me - and for your helpful indications.
With the modified script that directly requires the additional dependencies added in v0.154.0 (script and modified go.mod attached), the trace client does not trace itself.
go.mod_153_modified.txt
main.go.txt
If I now further bump to v0.154.0 from this base, I get the self-tracing issue again. Note that bumping to v0.154.0 also bumps the (direct) dependency go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp from v0.46.0 to v0.46.1 - but doing just that bump while staying on v0.153.0 also does not reproduce the issue.
Thanks for the quick response. The next thing I think we should try is to programmatically disable the trace context propagation in the Cloud Trace client (cloud.google.com/go/trace) with option.WithTelemetryDisabled. In your script above, change this line:
exporter, err := trace.New(trace.WithProjectID("projectId"))
to
exporter, err := trace.New(trace.WithProjectID("projectId"), trace.WithTraceClientOptions([]option.ClientOption{option.WithTelemetryDisabled()}))
This should disable the trace context propagation (source), which was the only OpenTelemetry-related code change in v0.154.0.
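(For completeness, a minimal sketch of the exporter setup with that option applied, including the extra import of google.golang.org/api/option; the project ID is a placeholder, as in the demo above.)

package main

import (
    "log"

    "github.com/GoogleCloudPlatform/opentelemetry-operations-go/exporter/trace"
    "google.golang.org/api/option"
)

func main() {
    // WithTelemetryDisabled turns off telemetry in the underlying Cloud Trace
    // API client, so the exporter no longer produces spans for its own flush RPCs.
    exporter, err := trace.New(
        trace.WithProjectID("projectId"),
        trace.WithTraceClientOptions([]option.ClientOption{option.WithTelemetryDisabled()}),
    )
    if err != nil {
        log.Fatal(err)
    }
    _ = exporter
}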
Thank you! Passing this option seems to do the job.
I honestly tried to look for such an option, but TBH I didn't find it and didn't think of moving up to the "google.golang.org/api/option" package.
All good for me then. It might be a good thing for developer convenience to expose a WithoutTelemetry option directly on the trace exporter :-)
@Pascal-Delange Thanks for your quick responses and for doing the legwork on this issue. Very much appreciated!
I am wondering if the Cloud Trace client (cloud.google.com/go/trace) should internally set WithTelemetryDisabled. Can you think of any use cases in which the Cloud Trace client should propagate the context for its own tracing to the Cloud Trace service? Or will it always just cause extra noise?
Hi,
Considering that tracing is meant to estimate how long things (tasks, requests...) take to execute, and that the tracing operation itself happens asynchronously (except perhaps the odd "flush" operation), I would say it's safe to assume that the traces from the trace client sending its own traces are going to be noise >99.99% of the time (more or less visible, and hence more or less annoying, depending on the type of app being traced).
(it may still show up if you're profiling the application, but that is a different story)
A default behavior of not sending traces would seem reasonable to me.