open-telemetry/opentelemetry-collector-contrib

New component: Github Actions Receiver

krzko opened this issue · 36 comments

krzko commented

The purpose and use-cases of the new component

The GitHub Actions Receiver processes GitHub Actions webhook events to observe workflows and jobs. It handles workflow_job and workflow_run event payloads, transforming them into trace telemetry.

Each GitHub Actions workflow or job, along with its steps, is converted into trace spans, allowing the observation of workflow execution times and success and failure rates.

If a secret is configured (recommended), the receiver validates the payload signature, ensuring data integrity before processing.
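
For context, GitHub signs each webhook delivery with an HMAC-SHA256 of the request body, sent in the X-Hub-Signature-256 header. A minimal sketch of that check in Go (illustrative only, not necessarily the receiver's exact implementation):

package main

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "strings"
)

// validSignature reports whether sigHeader (the X-Hub-Signature-256
// header, formatted "sha256=<hex>") matches the HMAC-SHA256 of body
// computed with the configured secret.
func validSignature(secret, sigHeader string, body []byte) bool {
    const prefix = "sha256="
    if !strings.HasPrefix(sigHeader, prefix) {
        return false
    }
    mac := hmac.New(sha256.New, []byte(secret))
    mac.Write(body) // hash.Hash.Write never returns an error
    expected := hex.EncodeToString(mac.Sum(nil))
    // Constant-time comparison to avoid timing side channels.
    return hmac.Equal([]byte(expected), []byte(strings.TrimPrefix(sigHeader, prefix)))
}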

Example configuration for the component

receivers:
  githubactions:
    endpoint: 0.0.0.0:443
    path: /ghaevents
    secret: YourSecr3t
    tls:
      certfile: /path/to/cert
      keyfile: /path/to/key

Telemetry data types supported

traces

Is this a vendor-specific component?

  • This is a vendor-specific component
  • If this is a vendor-specific component, I am proposing to contribute and support it as a representative of the vendor.

Code Owner(s)

No response

Sponsor (optional)

No response

Additional context

Multi Job

[screenshot]

Matrix Strategy

[screenshot]

Deterministic Step Spans

[screenshot]

hi @krzko,

Thank you for the new component proposal. If you have not already, please make sure you review the new component guidelines.

If you have not found a volunteer sponsor yet then I encourage you to come to our weekly collector sig meetings. You can add an item to the agenda to discuss this new component proposal.

krzko commented

Thanks @bryan-aguilar for the heads up.

No sponsor as of yet, so I'll have to add an agenda item for this component and try and make the EU-APAC meeting, as I'm based out of Australia.

krzko commented

Jumped into the Collector SIG (EU-APAC) meeting today, but nobody around 🤷

Not looking forward to doing a 3AM call for the main SIG meeting.

What experience have you had with long-running workflows/jobs? We have some workflows that will run for 3+ hours.
Are there any examples of what a trace looks like for jobs that heavily use job matrices? Another scenario my team is in is that we have one job that will fire off 100+ jobs based on its matrix config at the time.

Matrix strategies are represented as traces quite well using all the o11y backends that I've used, so I think this use case is covered.

I use the workflow_run event as the root span; for some long-running workflows your o11y backend might report a missing root span if you view the trace before the workflow run has completed. We emit the root span when the status is completed and then set the span status based on the conclusion.
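
For illustration, that conclusion-to-status mapping could look roughly like this with the collector's pdata API (a sketch; which conclusions map to an error status is an assumed choice, not confirmed above):

package main

import "go.opentelemetry.io/collector/pdata/ptrace"

// setSpanStatus maps a GitHub workflow_run/workflow_job "conclusion"
// onto the emitted span's status once the run has completed.
func setSpanStatus(span ptrace.Span, conclusion string) {
    switch conclusion {
    case "success":
        span.Status().SetCode(ptrace.StatusCodeOk)
    case "failure", "timed_out", "startup_failure":
        span.Status().SetCode(ptrace.StatusCodeError)
        span.Status().SetMessage(conclusion)
    default: // cancelled, skipped, neutral, action_required, stale
        span.Status().SetCode(ptrace.StatusCodeUnset)
    }
}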

For the 100+ jobs scenario, that might be a rather large waterfall view. I have been toying around with potentially using span links, but so far I'm keeping it simple.
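
Purely exploratory, that links idea could look something like this in the OTel Go API (illustrative names; not something the receiver does today):

package main

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/trace"
)

// startLinkedJobSpan starts a span for one matrix job that links back to
// the workflow-run span rather than nesting under it, keeping each
// fan-out job out of one giant parent/child waterfall.
func startLinkedJobSpan(ctx context.Context, runSpanCtx trace.SpanContext, jobName string) (context.Context, trace.Span) {
    tracer := otel.Tracer("githubactionsreceiver")
    return tracer.Start(ctx, jobName,
        trace.WithLinks(trace.Link{SpanContext: runSpanCtx}))
}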

Just curious here @krzko - had you tried the webhook receiver? We were playing around with event logging through GitHub apps leveraging the Webhook receiver from OTEL, but from what I recall, we found two main issues.

  1. support for auth extensions
  2. flattening of data support in the transform processor

It's been a while, so this is from memory, but curious if you had tried/experimented with that.

Additionally, wondering if you've tried tracing at the runner level instead of event logging to infer traces after the fact. We had found that in order to accomplish tracing at the runner level we'd have to use env vars as propagators (which isn't in the spec yet, but is currently being looked at as an OTEP) across workflows.

One benefit of doing it at the runner level is being able to see exactly when the runner starts up, etc.

@adrielp I've not had a chance to play around with the extensions as of yet, or to apply transforms to the data that we emit. The receiver is based on the standard internal components, so there is no reason why that would not work.

The design was based on the Zipkin (HTTP) receiver, so whatever that supports, we would likewise support.

With respect to tracing from the runner, we looked into that as well and went down another route, as we didn't want to have the chore of updating user workflows for tracing to work.

But, to complement this receiver, if you want additional telemetry within the steps: since we create deterministic IDs for traces and spans, I also wrote a run-with-telemetry action that will use the receiver's step span as the parent and emit the associated telemetry. It's already working quite well for us.
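
As a hypothetical sketch of the deterministic-ID idea: hash identifiers that both the receiver and an external action can observe, so the action can recompute the step span's IDs without any coordination (the actual key material the receiver hashes may differ):

package main

import (
    "crypto/sha256"
    "fmt"

    "go.opentelemetry.io/collector/pdata/pcommon"
)

// deterministicIDs derives a stable trace ID for a workflow run and a
// stable span ID for one of its steps from identifiers present in the
// webhook payload, so the same IDs can be recomputed elsewhere.
func deterministicIDs(runID int64, attempt int, jobName, stepName string) (pcommon.TraceID, pcommon.SpanID) {
    t := sha256.Sum256([]byte(fmt.Sprintf("%d-%d", runID, attempt)))
    s := sha256.Sum256([]byte(fmt.Sprintf("%d-%d-%s-%s", runID, attempt, jobName, stepName)))
    var traceID [16]byte
    var spanID [8]byte
    copy(traceID[:], t[:16])
    copy(spanID[:], s[:8])
    return pcommon.TraceID(traceID), pcommon.SpanID(spanID)
}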

With that design, I'm actually injecting a bunch of env vars into the shell, such as TRACEPARENT, amongst others, so if you use other tools like the excellent otel-cli by @tobert, it'll just work (tm). Have a look at this screenshot and note the otel-cli-curl service.

[screenshot: trace showing the otel-cli-curl service]

This provides fine-grained timings within the step, which sadly GitHub doesn't offer, as it only uses seconds as the unit of measure.
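
For completeness, a minimal sketch of how a tool running inside a step could consume the injected TRACEPARENT using the standard W3C propagator (assuming stock OTel Go APIs; the carrier type here is illustrative):

package main

import (
    "context"
    "os"
    "strings"

    "go.opentelemetry.io/otel/propagation"
)

// envCarrier adapts process env vars to the TextMapCarrier interface the
// W3C TraceContext propagator expects; the propagator asks for the
// lowercase "traceparent" key, which we upcase to TRACEPARENT.
type envCarrier struct{}

func (envCarrier) Get(key string) string { return os.Getenv(strings.ToUpper(key)) }
func (envCarrier) Set(key, value string) { os.Setenv(strings.ToUpper(key), value) } // error ignored for brevity
func (envCarrier) Keys() []string        { return []string{"TRACEPARENT", "TRACESTATE"} }

// parentContext returns a context carrying the remote span context from
// the environment; spans started from it parent to the step span.
func parentContext() context.Context {
    return propagation.TraceContext{}.Extract(context.Background(), envCarrier{})
}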

Anything stopping us from naming it simply "GitHub Actions" receiver instead of "GitHub Actions Event" receiver?

No there isn't, we can simplify it to that. I'll make it so.

There was some reason I used that name, but the reasoning escapes me at the moment 🤷🏻

Thanks for the response on that @krzko! I still maintain my earlier statement from the SIG that this is a great step in the right direction for CI/CD observability 😄

My one caveat here is that to do distributed tracing, the CI/CD system itself needs to inject the carriers such that all pieces in the pipeline can leverage them. The Jenkins plugin does a good job of this, even though environment variables as carriers weren't part of the spec. Once they do become part of the spec, I think we're going to see vendors enable that support more broadly, which would thereby enable the GitHub runner to instantiate and propagate the carriers to all steps, enabling distributed tracing. GitLab has had a related issue open for years on this. Some of our folks did a quick PoC on what that could look like at the runner level when using GitLab.

It's possible that some time in the (potentially "near" but still speculative) future this might become OBE (overcome by events). However, in the real world it definitely has value today!

I'd be curious to know @astencel-sumo & other's thoughts on that.

re: the name

Personally I'd call it the githubworkflowevents receiver due to the data it's ingesting and how it lines up with the terminology on the GitHub side. But that's a SUPER long name, and hard to type 😅.

I've renamed and refactored from githubactionseventreceiver to githubactionsreceiver via krzko@8c8d9c8.

Keeping the name short and succinct as per @astencel-sumo's suggestion, as opposed to the longer descriptive name.

Only thing I see outstanding is adding more test cases. Will look to add in the next couple of days.

Built and deployed internally. Working as expected.

Just a heads up @krzko - we talked about this in yesterday's SIG, and a sponsor has to be someone who is an approver or a maintainer in the OTEL collector contrib repo. I'm neither, so I can't sponsor at this time. Apologies for the confusion on my end there.

Thanks for the heads up. Hopefully we can find one soon and progress this along.

Will look to start the merge process on the PR, now that I'm back from an extended holiday.

PR submitted; hopefully there'll be a minimal number of issues to sort out. This is my first contribution to the otelcol contrib repo.

@krzko what is the status of the component you've linked in the description? Are you using it in production?

Hey @TylerHelmuth , yes this new component is in production and we've been running it for several months now.

@krzko glad to hear you're getting value! @adrielp what is the status of the CI/CD semantic conventions?

@krzko is the component you're using already hosted somewhere public where others could use it if they wanted?

@TylerHelmuth - right now we're working on the data model. The last couple of weeks have been slow for me, but I'm hoping to get portions of the model up for review by the next SemConv meeting. There's a lot of information to work through in figuring out the overlap between CDEvents & Eiffel.

We've also been talking in the #otel-cicd channel within Slack. A few conversations have brought up the real need for instrumentation to be native at the runner level, provided by the SCM vendors, enabling propagation of the TRACEPARENT at that level. That would be the ideal move as the Semantic Conventions come to fruition, and as the env propagators get added to the spec. We're just not there yet & there's value in being able to iterate.

@TylerHelmuth, it's not hosted publicly for use, but anyone can build a custom collector and stand it up internally, like we have.

I've intentionally left a build config for folks to use if they want to compile it themselves, as per #31326

I very much want a component like this, if accepted, to be in alignment with the CICD Working Group. @adrielp would you be willing to be an additional code owner with @krzko if the component is accepted?

I've added a public image of this component via this repo https://github.com/krzko/otelcol-distributions, built using ocb.

Here's the direct link to the GHCR image - https://github.com/krzko/otelcol-distributions/pkgs/container/otelcol-distributions%2Fgithubactions

@TylerHelmuth Yes, I'd be willing to be an additional code owner. @krzko has been very active in soliciting comments from the folks in #otel-cicd from day one and working to be part of the convention initiative, so I definitely think that as the conventions evolve, so will this.

My somewhat hot take is that components like this would ideally not be needed in the future, and SCM Vendors like GitHub / GitLab would just provide tracing out of the box at the runner level emitting over OTLP, and there wouldn't be post processing from event logs needed to build traces through a receiver.

We have both GitHub members & GitLab members in the CI/CD WG so maybe in the future that'll happen. Even if that end state is reached, I think there's value now in the industry for stuff like this and hopefully this will help accelerate that end goal as this evolves alongside the conventions.

The expectation, if this gets accepted, is still to follow the fundamental process of submitting multiple pull requests to break down the cognitive overhead of reviewing them, correct?

The expectation, if this gets accepted, is still to follow the fundamental process of submitting multiple pull requests to break down the cognitive overhead of reviewing them, correct?

Correct.

My somewhat hot take is that components like this would ideally not be needed in the future, and SCM Vendors like GitHub / GitLab would just provide tracing out of the box at the runner level emitting over OTLP, and there wouldn't be post processing from event logs needed to build traces through a receiver.

Agreed, this is something that makes me hesitant to put it directly in Contrib. Luckily @krzko is hosting the component already for others if they want it.

@krzko does this component handle measuring queue times?

I will sponsor this component. I believe it could end up being used to observe our actions here in Contrib.

@krzko please move forward with the first PR outlined in CONTRIBUTING.md.

Thanks @TylerHelmuth, much appreciated.

I'll start to kick off the process outlined, over the weekend.

Yes, we will be able to derive the queue times when waiting for the runners to pick up a job, based on the implemented span attributes.
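
For example, assuming the workflow_job payload's created_at and started_at timestamps are carried through (the exact attribute names the receiver uses aren't shown here), the queue time is just their difference:

package main

import "time"

// queueTime derives how long a job waited for a runner: the gap between
// the job being queued (created_at) and a runner starting it
// (started_at), both present on the workflow_job webhook payload.
func queueTime(createdAtRFC3339, startedAtRFC3339 string) (time.Duration, error) {
    created, err := time.Parse(time.RFC3339, createdAtRFC3339)
    if err != nil {
        return 0, err
    }
    started, err := time.Parse(time.RFC3339, startedAtRFC3339)
    if err != nil {
        return 0, err
    }
    return started.Sub(created), nil
}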

Hey there.
I'm also interested in this receiver and would like to use it in my project. Any updates regarding release dates? (no push, just curious)

Hey there. I'm also interested in this receiver and would like to use it in my project. Any updates regarding release dates? (no push, just curious)

Some refactoring needed to be done prior to submitting. This has now been done, so will move ahead with it.

In the meantime you can use the component via a custom collector build here:

docker pull ghcr.io/krzko/otelcol-distributions/githubactions:0.99.1

The PR readme has the steps for getting it configured.

Thanks for the update!

@krzko - any update on getting the first skeleton pull request opened up? Wanted to make sure I hadn't missed an update here.

@krzko - any update on getting the first skeleton pull request opened up? Wanted to make sure I hadn't missed an update here.

@adrielp Started on it, and then #life. I'll see if I can restart the effort again.

@TylerHelmuth - are you still willing to sponsor this component? I talked to @krzko and got permission to go ahead and make the contribution myself.

Yes