equinix-labs/otel-cli

GitLab CI jobs tracing

Petromar88 opened this issue · 6 comments

Hi everyone, I'm trying to trace GitLab CI pipeline jobs using otel-cli, but I'm having trouble with unexpected span tracing.

When CI jobs run, I can correctly see each span being traced on Cloud Trace with the expected context propagation.
Inside each job I'm running one Earthly command that causes a lot of other operations to be performed.

The problem is that all of these operations are traced within different unrelated spans.
I've tried many different solutions to force those spans to inherit the current context (traceparent) but none worked.

So for example, running the following pipeline:

[Screenshot 2024-07-03 at 12:45:42]

where the "start monitoring" and "2. test" jobs do nothing other than create a named span, and the "1. lint" job is declared this way:

  1. lint:
    stage: build
    script:
      - otel-cli span --name ${CI_JOB_NAME_SLUG} --tp-export --tp-carrier ${OTEL_ENV_FILE}
      - earthly +lint --ENV=${TASK_ENV}
    artifacts:
      paths:
        - "${OTEL_ENV_FILE}"

results in the following on Cloud Trace:

[Screenshot 2024-07-03 at 12:41:12]

[Screenshot 2024-07-03 at 12:41:43]

All the spans following the one in the first screenshot are generated by buildkit operations performed by the job 1. lint without any explicit otel-cli command.

I've already tried using otel-cli exec, sourcing the exported traceparent declaration before running earthly, and sending a span with explicit --start and --end dates after running the earthly command. On top of that, I've also tried those solutions with different span kinds, but nothing worked.

Could you please help me understand how those spans are being created and how to make all of them inherit the traceparent from previous spans in the same job?

Thanks in advance!

@tobert can you please help me with this one? Was I clear enough in explaining the problem and the context?

Many thanks

Hello, apologies for the delay. I started a new job this week.

It looks like the traceparent isn't being passed to otel-cli. Are you setting the TRACEPARENT envvar or writing out a file with the traceparent in it?

One way to see what's going on is to replace your otel-cli span with otel-cli status which will do the same thing but dump a bunch of data in JSON to look through and see what otel-cli is doing internally. If you like, please send me a gist and I'll take a look.
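Before reaching for otel-cli status, a quick shell check can show whether a traceparent is present in the job's environment at all. The following is a minimal sketch, not part of otel-cli: the helper name is_traceparent is hypothetical, and it only checks the overall shape of a version-00 W3C traceparent (it does not reject all-zero IDs).

```shell
#!/bin/sh
# Hypothetical helper: returns 0 when the argument has the shape of a
# W3C traceparent, i.e. 00-<32 hex chars>-<16 hex chars>-<2 hex chars>.
is_traceparent() {
  printf '%s\n' "$1" | grep -Eq '^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$'
}

# Report what the current shell (for example, a CI job step) actually sees.
if is_traceparent "${TRACEPARENT:-}"; then
  echo "TRACEPARENT is set: ${TRACEPARENT}"
else
  echo "TRACEPARENT is missing or malformed" >&2
fi

# With otel-cli installed, the same context can be inspected in more detail:
#   otel-cli status
```

Running this immediately before the otel-cli span step would show whether the variable survives from the previous job or step.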

No problem at all, thanks for your answer!
I managed to create this gist with a simplified version of our CI and Earthfile, just to give you an overview of what I'm executing at the moment.

Unfortunately we have a lot more that is included in both the .gitlab-ci.yml and the Earthfile from private repositories so it would be hard to provide a fully functional gist.

Anyway, the problem can be summarized as follows: everything inside the Docker container started by Earthly is being traced, even if it's not contained in a dedicated span. Any other command (like an echo) is, as expected, not traced.

These are two screenshots from Cloud Trace after running the jobs I shared in the gist:
[Screenshot: cloud-trace-main-trace-details]
[Screenshot: cloud-trace-earthly-trace-example]

I've also tried replacing otel-cli span with otel-cli status as you suggested, but then everything was traced with a different trace ID.

I suspect what's missing is using 1 of 2 approaches:

1.) since you're setting --tp-carrier you could use volume mounts to share the same file into docker containers. In this approach, the carrier file is what transmits the traceparent across invocations of otel-cli. You can either have otel-cli use it directly or in shell you can source the carrier file (with --tp-export enabled) and it will set the environment variable.

2.) you could also set the TRACEPARENT envvar somewhere before calling into your tools, and make sure it's propagated into Docker e.g. docker run -e TRACEPARENT="${TRACEPARENT}". This is the other way to communicate traceparent to otel-cli runs.

Does this help?
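The two approaches above can be sketched in shell as follows. This is an illustration under stated assumptions: the carrier file is simulated with a single export line (standing in for what otel-cli writes with --tp-carrier and --tp-export), the traceparent value is made up, and the image name my-image is hypothetical.

```shell
#!/bin/sh
# Stand-in for the carrier file; in CI this would be ${OTEL_ENV_FILE},
# written by an earlier otel-cli span call. The value is a made-up example.
carrier="$(mktemp)"
printf 'export TRACEPARENT=%s\n' \
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01' > "$carrier"

# Approach 1: source the carrier file so TRACEPARENT lands in the environment
# of this shell and of every process it starts afterwards.
. "$carrier"
echo "TRACEPARENT=${TRACEPARENT}"

# Approach 2: forward the variable explicitly into a container.
# Left as a comment because it needs Docker and a real image:
#   docker run -e TRACEPARENT="${TRACEPARENT}" my-image otel-cli span --name child

rm -f "$carrier"
```

Either way, the child otel-cli invocation only joins the trace if the value actually reaches its environment or its carrier file.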

I realized the carrier file was already being copied into the Docker container, as I'm copying the whole workdir during lint operations. I nevertheless tried passing the TRACEPARENT as an argument and then setting it as an env var as you suggested, but with no luck.

I suspect all of the operations being traced with different trace IDs come from the IMPORTs in the Earthfile, but what I don't understand is: why are those operations being traced without an explicit otel-cli invocation from inside any of the Earthly targets?

It looks as if the single otel-cli span command keeps listening until the end of the CI job, even though I'm not using the background approach, or as if I were using the otel-cli Docker image as a base image, which I'm not.

@tobert I just found out that Earthly pulls in the Go OTEL library as an indirect dependency in order to trace analytics data.

So basically my problem was setting most of the otel-cli's params through env vars which were also used by that dependency.

Even though Earthly's analytics data collection can be disabled in the .earthly/config.yaml file, setting the trace endpoint on the span command itself instead of using the OTEL_EXPORTER_OTLP_TRACES_ENDPOINT env var did resolve the problem.
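The effect described here can be reproduced with a small shell sketch: an exported OTEL_* variable is visible to every child process (including any tool that embeds an OTel SDK), while a plain shell variable passed as a flag is not. The endpoint value is hypothetical, and the --endpoint flag in the trailing comment is an assumption to verify against otel-cli span --help.

```shell
#!/bin/sh
endpoint="localhost:4317"   # hypothetical collector address

# Exported: every child process inherits it, not just otel-cli.
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="$endpoint"
seen_exported="$(sh -c 'echo "${OTEL_EXPORTER_OTLP_TRACES_ENDPOINT:-unset}"')"

# Scoped: drop the env var and keep the value in a non-exported shell variable.
unset OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
seen_scoped="$(sh -c 'echo "${OTEL_EXPORTER_OTLP_TRACES_ENDPOINT:-unset}"')"

echo "child sees (exported): $seen_exported"
echo "child sees (scoped):   $seen_scoped"

# The CI script would then pass the endpoint to otel-cli only, roughly:
#   otel-cli span --endpoint "$endpoint" --name "${CI_JOB_NAME_SLUG}" \
#     --tp-export --tp-carrier "${OTEL_ENV_FILE}"
#   earthly +lint --ENV="${TASK_ENV}"
```

With the variable unexported, earthly and the processes it starts no longer find an OTLP endpoint in their environment.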

Thank you very much for your help!