version: every will skip versions if a parallel upstream job's latest build finishes before an older one
jpalermo opened this issue · 21 comments
I'm also seeing this behavior on my end on the pullrequest-resource, currently running Concourse 2.4.0
So this is a tricky one. The semantics for version: every
coming from a job whose builds run in parallel are complicated by how the passed
constraints work with version: every
. See #358 (comment)
tl;dr: passed
constraints allow it to skip versions, otherwise when a bad version enters the mix, all downstream jobs can no longer run.
The scheduling for the second job would have to know that a build is in-flight for the ideal next version, and know to wait for it to finish before determining the inputs.
Our team also makes heavy use of the pullrequest-resource, and reliably running all passed
versions of a PR in downstream jobs is important. As I currently understand it, we need to set serial: true
or max_in_flight: 1
to ensure all PRs make it through a pipeline?
@oppegard That's right. Until we figure out whether there's something better we could do. I'm a little wary of the changes this kind of implies; having scheduling so dependent on ephemeral job state feels wrong.
Possible workaround:
Have each job do a put
to an otherwise-unused time
resource. Then thread that through the subsequent jobs with trigger: true
and a passed
constraint. Do this for each job. This would make it so there's always a newer version once the build completes.
e.g.:
resources:
- name: my-pr
type: pr
source: {...}
- name: dummy-time
type: time
source: {interval: 24h} # doesn't matter, but something has to be there
jobs:
- name: job-1
plan:
- get: my-pr
version: every
trigger: true
- task: test-pr
# ...
- put: dummy-time
- name: job-2
plan:
- get: my-pr
trigger: true
passed: [job-1]
- get: dummy-time
version: every
trigger: true
passed: [job-1]
- # ...
- put: dummy-time
- name: job-3
plan:
- get: my-pr
trigger: true
passed: [job-2]
- get: dummy-time
version: every
trigger: true
passed: [job-2]
- # ...
- put: dummy-time
Note that version: every
is configured on the dummy-time
resource in subsequent jobs.
Hi @vito,
Is still an issue (and is the above workaround is still required) when using the pullrequest-resource, even though the mentioned problems with using passed
and every
together have been (I believe?) resolved by #358 (comment) ?
If so, could you please explain the details, and why the problem seems to be specific to pullrequest-resource?
My guess is that it is because the pullrequest-resource conflates two independent "versions" (PR numbers and SHAs) into a single "version", which breaks some of the concourse assumptions about how resources and versioning work. It would be great to have you confirm or deny that, and if so, possibly explain why from your perspective.
Thanks!
-- Chad
The problem is not specific to the PR resource. This issue is still open. The linked comment explains why this issue exists; it's a fundamental issue with passed
and every
together, where it's either broken this way or broken in the way that comment mentioned, and I think this way is the less broken of the two (as the alternative is a deadlocked pipeline).
@vito OK, thanks.
EDIT
I think understand now. Actually, I'm still confused. I thought when you said above "the second job", you meant a downstream job, but now I realize you probably meant a concurrently executing instance of the same job. So, I'm still trying to understand the crux of the problem, and why it only occurs with the combination of version: every
, passed:
, and non-serial or max-in-flight > 1 (and others on my team are also having trouble understanding it). So, it would be great if you could expand and clarify exactly what problem this causes, and why:
tl;dr: passed constraints allow it to skip versions, otherwise when a bad version enters the mix, all downstream jobs can no longer run.
The scheduling for the second job would have to know that a build is in-flight for the ideal next version, and know to wait for it to finish before determining the inputs.
END EDIT
But, with respect to the pullrequest-resource specifically:
- The only reason we need to use "every" is because the pullrequest-resource conflates two independent "versions" (PR# and SHA) into one emitted "version". If we want the latest (not every) SHA to be built for every PR#.
- If we only used the standard concourse git resource, which only emits one "version" (a SHA), in conjunction with a pipeline-per-pr# (i.e. pipeline-per-branch) approach as you discuss here, we would not need to use "every", because the separate pipeline-per-pr# (each with separate standard git resources defined for the PRs' branches) would ensure the latest SHA for the per-branch git resource would always be run (not every SHA, which is fine because we only care about running the latest commit on the PR's branch).
- Thus, since we wouldn't need to use "every", we would not be exposed to this issue.
Correct?
If not, then please explain in more detail, because I'm still confused :)
Thanks,
-- Chad
@vito Also, I tried out the workaround above.
It doesn't appear to work if you ever have more than one entry in a passed:
array.
E.g., if you try this, job-3
hang and never start with the message "waiting for a suitable set of input versions\n dummy-time - no versions satisfy passed constraints"
resources:
- name: my-pr
type: pr
source: {...}
- name: dummy-time
type: time
source: {interval: 24h} # doesn't matter, but something has to be there
jobs:
- name: job-1
plan:
- get: my-pr
version: every
trigger: true
- task: test-pr
# ...
- put: dummy-time
- name: job-2a
plan:
- get: my-pr
trigger: true
passed: [job-1]
- get: dummy-time
version: every
trigger: true
passed: [job-1]
- # ...
- put: dummy-time
- name: job-2b
plan:
- get: my-pr
trigger: true
passed: [job-1]
- get: dummy-time
version: every
trigger: true
passed: [job-1]
- # ...
- put: dummy-time
- name: job-3
plan:
- get: my-pr
trigger: true
passed: [job-2a, job-2b]
- get: dummy-time
version: every
trigger: true
passed: [job-2a, job-2b]
- # ...
- put: dummy-time
Is this expected (and thus the workaround doesn't work in this situation), or do I have something wrong?
Thanks,
-- Chad
+1 We are also seeing this issue at our end. We are going to try the workaround now.
Hey @vito we have the same setup as @thewoolleyman and, while the workaround seems to accomplish what it is meant for, we also experience jobs-2x
being triggered multiple times (with the same my-pr
version but different dummy-time
).
Is this meant to work when there are multiple entries in the passed
array downstream?
@nazrhom (and, well, everyone) To be honest we're starting to see version: every
+ passed:
as a bit of a mis-feature. By saying "run with every version" but also "oh wait but only the ones that made it through this job" it can be seen as kind of an oxymoron, and from that perspective the the (mis)behavior described in this issue is somewhat unsurprising. It's also not super surprising that the workaround has led to other things that need working-around. 🙂 I don't see why having multiple entries in passed
would break it any more or any less than with a single entry, fwiw.
It seems like what y'all are really trying to do is have a full pipeline run for every version. I think that's actually a challenge that can be met by a concept similar to spaces (#1707). We've got a new plan laid out in concourse/rfcs#1 (comment) that should make this kind of thing possible. At that point we might deprecate configuring both version: every
and passed:
at the same time and encourage users to adopt the new pattern.
@vito thanks, will look into the rfcs. I did some more digging and the issue I am experiencing does not seem to be related to fan-in/out as I can reproduce the issue even with a setup like:
jobs:
- name: job-1
plan:
- get: my-pr
version: every
trigger: true
- task: test-pr
# ...
- put: dummy-time
- name: job-2
plan:
- get: my-pr
trigger: true
passed: [job-1]
- get: dummy-time
version: every
trigger: true
passed: [job-1]
- # ...
- put: dummy-time
- name: job-3
plan:
- get: my-pr
trigger: true
passed: [job-2]
- get: dummy-time
version: every
trigger: true
passed: [job-2]
- # ...
- put: dummy-time
I have a question on the (current) expected semantics here: job-2
will get
a certain version of dummy-time
that was put
by job-1
; after it does its thing it will put
a new version of dummy-time
. Doesn't this create 2 versions of dummy-time
that have passed job-2
(the one that job-2
get
s and the one it put
s?) that can potentially both trigger job-3
?
Beep boop! This issue has been idle for long enough that it's time to check
in and see if it's still important.
If it is, what is blocking it? Would anyone be interested in submitting a
PR or
continuing the discussion to help move things forward?
If no activity is observed within the next week, this issue will be
exterminated closed, in accordance with our stale issue
process.
Would ask that this issue and/or #1298 remain open as it's a common problem and the proposed fix (spaces) is a large and uncertain features (concourse/rfcs#24).
I'll leave this issue open as long as the issue still stands. The resolution will probably be deprecating/disallowing version: every
+ passed:
in favor of spatial pipelines using the across
step (concourse/rfcs#29). The problem with version: every
+ passed:
is pretty fundamental as noted in #736 (comment), and supporting it adds a lot of complexity in our scheduling algorithm.
I'll keep the stale bot at bay by placing this in our Spatial Resources epic as a long-term goal to close out once the epic is complete.
We've been hitting this same issue. So, I've been trying to understand how the scheduling algorithm is currently implemented. I'd like to check my understanding, and see if it inspires any reasonable fixes.
My understanding is,
- A per-pipeline
atc/scheduler/Runner
runs, and on a regular basis,tick()
s. (https://github.com/concourse/concourse/blob/release/5.6.x/atc/scheduler/runner.go#L67) tick()
loads, among other things, a (possibly-cached) copy of the pipeline config, and aVersionDB
containing the set of all resource versions, build inputs & outputs related to a pipeline.- This is passed to a
atc/scheduler/Scheduler.Schedule()
- for each job in the pipeline, this attempts to
ensurePendingBuildExists
. The purpose ofensurePendingBuildExists
seems to be to find-or-create the single next build that should start. (https://github.com/concourse/concourse/blob/release/5.6.x/atc/scheduler/scheduler.go#L62) ensurePendingBuildExists
callsatc/scheduler/InputMapper.SaveNextInputMapping()
.SaveNextInputMapping
calls, per-resource,atc/db/algorithm/InputConfigs.Resolve()
(https://github.com/concourse/concourse/blob/release/5.6.x/atc/scheduler/inputmapper/inputmapper.go#L66) [there's an additional 'final resolve' afterward, which isn't relevant to the current thread]Resolve()
fetches version candidates. In particular, it fetches exactly one of "all candidates", "the latest candidate", "a specific pinned candidate", or "the candidates which passed the configured jobs". (https://github.com/concourse/concourse/blob/release/5.6.x/atc/db/algorithm/input_configs.go#L19-L54)- As an implementation detail, these candidates are returned in descending CheckOrder order, due to
atc/db/algorithm/VersionDB.*
callingatc/db/algorithm/Versions.With()
, which finds-or-adds versions in sorted order. (https://github.com/concourse/concourse/blob/release/5.6.x/atc/db/algorithm/version.go#L36) - The collected candidates are then passed to
atc/db/algorithm/InputCandidates.Reduce()
. Reduce()
experimentally pins each version of a resource, in order, until it finds a version that "IsNext()
" https://github.com/concourse/concourse/blob/release/5.6.x/atc/db/algorithm/input_candidates.go#L99IsNext()
returns true if a build already exists for the given version of a resource (https://github.com/concourse/concourse/blob/release/5.6.x/atc/db/algorithm/input_candidates.go#L30)- Once we've found a version for this resource that
IsNext()
, we return it as a mapping, we collect all those mappings, ensure that we have a mapping for each input config, do a final group resolve to check compatibility, save the input mapping to the DB, and then determine whether we should trigger the build.
If my understanding is correct about this process, the cause of this bug is pretty straightforward - When
A) multiple versions of a resource exist, and
B) the newest version has a build that exists
then no other version of that resource is ever considered as an input to a potential job, regardless of version: every
.
Given this, it seems like there might also be edge cases in which inputs that were version: every
and didn't have a passed
could also skip versions. I'd have to dig in to confirm or reproduce that, though.
A naiive suggestion might be, in cases where an input_config
is version: every
, to consider candidates in ascending order of check_order, skipping candidates that already have a build. I wouldn't be surprised if this is actually how it was implemented before, and that caused other subtle problems of "job gets stuck and never schedules any new builds". When i get some more time, I'll do some history spelunking.
@YenTheFirst Impressive sleuthing! We're actually near the end of a complete re-design and re-implementation of the scheduling algorithm, currently living on the algorithm-v3
and release/6.0.x
branch. If you'd like to read a bit about it, I had a section on it in the v10 roadmap post. We're aiming to push it out as a 6.0 beta release sometime soon.
For inputs with version: every
the new algorithm actually does exactly what you proposed: walk through versions in ascending order from whichever version was previously used, falling back on descending order if none of those are valid.
However, at least at some level, the problem with passed
and version: every
with parallel upstream jobs is still pretty fundamental. Both attributes demand two things that are naturally in conflict: "use every version" and "use only versions that passed through the job". "Correcting" the behavior with parallel upstream builds would likely mean changing the semantics of version: every
to either a) wait for in-fight builds to complete before deciding which version to use or b) support walking "backwards" to execute older versions.
Neither of these options really feel compelling to me. Even version: every
without passed
constraints is actually really awkward to support, and I really want to replace it with something else in the future as it has been the root cause of a ton of complexity/confusion in both the old implementation and the new one.
I don't think the concepts of passed
and version every
are fundamentally at-odds, at least from a user's point of view. It's typically either "I want to use every version of a resource, as long as it passed an upstream job", or "I want to use every version of a resource, but wait until a sure-to-succeed upstream job has produced an accompanying artifact", or "I want to use every output of an upstream job (even if that's not every version of the resource)". It may make sense for those use-cases to get fulfilled by something other than version: every
, but I do think those are use-cases that should be supported.
In our particular case, the pipeline is something like:
- For every commit on every pr-branch, build a docker image
- for every successfully built docker image, run unit tests
- for every successfully built docker image, run integration tests
- for every commit that was successfully built, run the linter
I suspect we could work around this by removing the initial "build a docker image", and instead having unit-tests, integration-tests do their own building, but that duplicates work and code.
[In this particular case, yes, we're using version: every
as a hack around the fact that Concourse does not have spaces yet, but...Concourse doesn't have spaces yet. I suspect that there's reasonable variations on this pipeline where version: every
isn't merely a hack]
I'll take a closer look at those algorithms-v3
and release/6.0.x
branches. Where would be a reasonable place to leave feedback related to those branches, if applicable?
Based only on your description,
walk through versions in ascending order from whichever version was previously used
At first glance, this sounds like it would have the exact same problem. If a downstream build for D completes before an upstream build for C, the downstream build for C will never start, since "D" will be the version that was previously used.
b) support walking "backwards" to execute older versions.
I expect this is basically what's needed.
To be clear, I don't doubt the validity of your use case at all. I just don't see us improving this specific approach any time soon, because it will take a nontrivial amount of work, and it will just delay the real fix which we've recently gained a pretty clear picture of.
The problem is that, while version: every
+ passed
may not seem to be at odds from a user-level, it's technically impossible to satisfy without significant changes to how Concourse works, which would mean other user-level changes. I just don't think these changes are the right thing to do.
b) support walking "backwards" to execute older versions.
I expect this is basically what's needed.
This is why I think it's a fundamental issue. Concourse pipelines go forwards, not backwards. Changing this would have a ton of implications, and it isn't a clean fix anyway because now there's a question of when does version: every
stop going backwards. For example, if I apply version: every
to a regular Git repo without passed
constraints, it's not likely that I would want it to literally go all the way back to the initial commit (resources will eventually collect all versions, not just from initial configuration time).
We already know this feature is a pain in the butt to support, and doubling down on it to get it working correctly would require re-designing significant portions of the product around it, both internally and in the UI. So at this point I'm not on board with supporting version: every
+ passed
, and I would like to even phase out version: every
in the long run. It's a crutch that makes things "work" just enough for us to not find better solutions.
Your use case of building every commit of every PR is totally fine. I think a model that better fits how Concourse works would be to configure a pipeline for each commit. We just need to improve the automation and navigation of pipelines. This is all broken down into many small features, laid out in the v10 roadmap post. Implementing most of these features will be easier or at least more beneficial in the long-term than doubling down on version: every
.
I'll take a closer look at those
algorithms-v3
andrelease/6.0.x
branches. Where would be a reasonable place to leave feedback related to those branches, if applicable?
Hmm good point, not yet. We should just open a PR for it. 🤔 I'll link to it here once we do (or just keep an eye out if I forget!)
edit: PR is open #4721