version: every will skip versions if a parallel upstream job's latest build finishes before an older one

Question

version: every will skip versions if a parallel upstream job's latest build finishes before an older one

jpalermo opened this issue 8 years ago · 21 comments

Bug Report

Follow up to #666 and #563

We're running 2.4.0.

What I saw this morning, Job starts multiple copies of itself; A, B, C, and D. In that order.

The jobs finished in A, C, D, B order. The next job that depends on that resource (which is version: every) did not run with resource version B.

Answer 1 · 2016-11-05T01:01:30.000Z

I'm also seeing this behavior on my end on the pullrequest-resource, currently running Concourse 2.4.0

Answer 2 · 2016-11-07T18:53:09.000Z

So this is a tricky one. The semantics for version: every coming from a job whose builds run in parallel are complicated by how the passed constraints work with version: every. See #358 (comment)

tl;dr: passed constraints allow it to skip versions, otherwise when a bad version enters the mix, all downstream jobs can no longer run.

The scheduling for the second job would have to know that a build is in-flight for the ideal next version, and know to wait for it to finish before determining the inputs.

Answer 3 · 2016-11-11T21:27:09.000Z

Our team also makes heavy use of the pullrequest-resource, and reliably running all passed versions of a PR in downstream jobs is important. As I currently understand it, we need to set serial: true or max_in_flight: 1 to ensure all PRs make it through a pipeline?

Answer 4 · 2016-11-14T16:12:41.000Z

@oppegard That's right. Until we figure out whether there's something better we could do. I'm a little wary of the changes this kind of implies; having scheduling so dependent on ephemeral job state feels wrong.

Answer 5 · 2016-12-02T19:50:08.000Z

Possible workaround:

Have each job do a put to an otherwise-unused time resource. Then thread that through the subsequent jobs with trigger: true and a passed constraint. Do this for each job. This would make it so there's always a newer version once the build completes.

e.g.:

resources:
- name: my-pr
  type: pr
  source: {...}

- name: dummy-time
  type: time
  source: {interval: 24h} # doesn't matter, but something has to be there

jobs:
- name: job-1
  plan:
  - get: my-pr
    version: every
    trigger: true
  - task: test-pr
    # ...
  - put: dummy-time

- name: job-2
  plan:
  - get: my-pr
    trigger: true
    passed: [job-1]
  - get: dummy-time
    version: every
    trigger: true
    passed: [job-1]
  - # ...
  - put: dummy-time

- name: job-3
  plan:
  - get: my-pr
    trigger: true
    passed: [job-2]
  - get: dummy-time
    version: every
    trigger: true
    passed: [job-2]
  - # ...
  - put: dummy-time

Note that version: every is configured on the dummy-time resource in subsequent jobs.

Answer 6 · 2017-01-19T18:01:31.000Z

Hi @vito,

Is still an issue (and is the above workaround is still required) when using the pullrequest-resource, even though the mentioned problems with using passed and every together have been (I believe?) resolved by #358 (comment) ?

If so, could you please explain the details, and why the problem seems to be specific to pullrequest-resource?

My guess is that it is because the pullrequest-resource conflates two independent "versions" (PR numbers and SHAs) into a single "version", which breaks some of the concourse assumptions about how resources and versioning work. It would be great to have you confirm or deny that, and if so, possibly explain why from your perspective.

Thanks!
-- Chad

Answer 7 · 2017-01-19T18:44:10.000Z

The problem is not specific to the PR resource. This issue is still open. The linked comment explains why this issue exists; it's a fundamental issue with passed and every together, where it's either broken this way or broken in the way that comment mentioned, and I think this way is the less broken of the two (as the alternative is a deadlocked pipeline).

Answer 8 · 2017-01-19T20:23:27.000Z

@vito OK, thanks.

EDIT
~~I think understand now~~. Actually, I'm still confused. I thought when you said above "the second job", you meant a downstream job, but now I realize you probably meant a concurrently executing instance of the same job. So, I'm still trying to understand the crux of the problem, and why it only occurs with the combination of version: every, passed:, and non-serial or max-in-flight > 1 (and others on my team are also having trouble understanding it). So, it would be great if you could expand and clarify exactly what problem this causes, and why:

tl;dr: passed constraints allow it to skip versions, otherwise when a bad version enters the mix, all downstream jobs can no longer run.

The scheduling for the second job would have to know that a build is in-flight for the ideal next version, and know to wait for it to finish before determining the inputs.

END EDIT

But, with respect to the pullrequest-resource specifically:

The only reason we need to use "every" is because the pullrequest-resource conflates two independent "versions" (PR# and SHA) into one emitted "version". If we want the latest (not every) SHA to be built for every PR#.
If we only used the standard concourse git resource, which only emits one "version" (a SHA), in conjunction with a pipeline-per-pr# (i.e. pipeline-per-branch) approach as you discuss here, we would not need to use "every", because the separate pipeline-per-pr# (each with separate standard git resources defined for the PRs' branches) would ensure the latest SHA for the per-branch git resource would always be run (not every SHA, which is fine because we only care about running the latest commit on the PR's branch).
Thus, since we wouldn't need to use "every", we would not be exposed to this issue.

Correct?

If not, then please explain in more detail, because I'm still confused :)

Thanks,
-- Chad

Answer 9 · 2017-01-20T15:43:28.000Z

@vito Also, I tried out the workaround above.

It doesn't appear to work if you ever have more than one entry in a passed: array.

E.g., if you try this, job-3 hang and never start with the message "waiting for a suitable set of input versions\n dummy-time - no versions satisfy passed constraints"

resources:
- name: my-pr
  type: pr
  source: {...}

- name: dummy-time
  type: time
  source: {interval: 24h} # doesn't matter, but something has to be there

jobs:
- name: job-1
  plan:
  - get: my-pr
    version: every
    trigger: true
  - task: test-pr
    # ...
  - put: dummy-time

- name: job-2a
  plan:
  - get: my-pr
    trigger: true
    passed: [job-1]
  - get: dummy-time
    version: every
    trigger: true
    passed: [job-1]
  - # ...
  - put: dummy-time

- name: job-2b
  plan:
  - get: my-pr
    trigger: true
    passed: [job-1]
  - get: dummy-time
    version: every
    trigger: true
    passed: [job-1]
  - # ...
  - put: dummy-time

- name: job-3
  plan:
  - get: my-pr
    trigger: true
    passed: [job-2a, job-2b]
  - get: dummy-time
    version: every
    trigger: true
    passed: [job-2a, job-2b]
  - # ...
  - put: dummy-time

Is this expected (and thus the workaround doesn't work in this situation), or do I have something wrong?

Thanks,
-- Chad

Answer 10 · 2017-02-23T01:34:57.000Z

+1 We are also seeing this issue at our end. We are going to try the workaround now.

Answer 11 · 2019-04-01T18:38:45.000Z

Hey @vito we have the same setup as @thewoolleyman and, while the workaround seems to accomplish what it is meant for, we also experience jobs-2x being triggered multiple times (with the same my-pr version but different dummy-time).

Is this meant to work when there are multiple entries in the passed array downstream?

Answer 12 · 2019-04-03T16:13:38.000Z

@nazrhom (and, well, everyone) To be honest we're starting to see version: every + passed: as a bit of a mis-feature. By saying "run with every version" but also "oh wait but only the ones that made it through this job" it can be seen as kind of an oxymoron, and from that perspective the the (mis)behavior described in this issue is somewhat unsurprising. It's also not super surprising that the workaround has led to other things that need working-around. 🙂 I don't see why having multiple entries in passed would break it any more or any less than with a single entry, fwiw.

It seems like what y'all are really trying to do is have a full pipeline run for every version. I think that's actually a challenge that can be met by a concept similar to spaces (#1707). We've got a new plan laid out in concourse/rfcs#1 (comment) that should make this kind of thing possible. At that point we might deprecate configuring both version: every and passed: at the same time and encourage users to adopt the new pattern.

Answer 13 · 2019-04-04T10:34:17.000Z

@vito thanks, will look into the rfcs. I did some more digging and the issue I am experiencing does not seem to be related to fan-in/out as I can reproduce the issue even with a setup like:

jobs:
- name: job-1
  plan:
  - get: my-pr
    version: every
    trigger: true
  - task: test-pr
    # ...
  - put: dummy-time

- name: job-2
  plan:
  - get: my-pr
    trigger: true
    passed: [job-1]
  - get: dummy-time
    version: every
    trigger: true
    passed: [job-1]
  - # ...
  - put: dummy-time

- name: job-3
  plan:
  - get: my-pr
    trigger: true
    passed: [job-2]
  - get: dummy-time
    version: every
    trigger: true
    passed: [job-2]
  - # ...
  - put: dummy-time

I have a question on the (current) expected semantics here: job-2 will get a certain version of dummy-time that was put by job-1; after it does its thing it will put a new version of dummy-time. Doesn't this create 2 versions of dummy-time that have passed job-2 (the one that job-2 gets and the one it puts?) that can potentially both trigger job-3?

Answer 14 · 2019-04-10T17:45:27.000Z

@nazrhom Yeah, that may be what's happening.

Answer 15 · 2019-07-16T15:00:44.000Z

Beep boop! This issue has been idle for long enough that it's time to check
in and see if it's still important.

If it is, what is blocking it? Would anyone be interested in submitting a
PR or
continuing the discussion to help move things forward?

If no activity is observed within the next week, this issue will be
~~exterminated~~ closed, in accordance with our stale issue
process.

Answer 16 · 2019-07-17T00:31:59.000Z

Would ask that this issue and/or #1298 remain open as it's a common problem and the proposed fix (spaces) is a large and uncertain features (concourse/rfcs#24).

Answer 17 · 2019-07-30T20:06:07.000Z

I'll leave this issue open as long as the issue still stands. The resolution will probably be deprecating/disallowing version: every + passed: in favor of spatial pipelines using the across step (concourse/rfcs#29). The problem with version: every + passed: is pretty fundamental as noted in #736 (comment), and supporting it adds a lot of complexity in our scheduling algorithm.

I'll keep the stale bot at bay by placing this in our Spatial Resources epic as a long-term goal to close out once the epic is complete.

Answer 18 · 2019-11-02T01:24:00.000Z

We've been hitting this same issue. So, I've been trying to understand how the scheduling algorithm is currently implemented. I'd like to check my understanding, and see if it inspires any reasonable fixes.

My understanding is,

A per-pipeline atc/scheduler/Runner runs, and on a regular basis, tick()s. (https://github.com/concourse/concourse/blob/release/5.6.x/atc/scheduler/runner.go#L67)
tick() loads, among other things, a (possibly-cached) copy of the pipeline config, and a VersionDB containing the set of all resource versions, build inputs & outputs related to a pipeline.
This is passed to a atc/scheduler/Scheduler.Schedule()
for each job in the pipeline, this attempts to ensurePendingBuildExists. The purpose of ensurePendingBuildExists seems to be to find-or-create the single next build that should start. (https://github.com/concourse/concourse/blob/release/5.6.x/atc/scheduler/scheduler.go#L62)
ensurePendingBuildExists calls atc/scheduler/InputMapper.SaveNextInputMapping().
SaveNextInputMapping calls, per-resource, atc/db/algorithm/InputConfigs.Resolve() (https://github.com/concourse/concourse/blob/release/5.6.x/atc/scheduler/inputmapper/inputmapper.go#L66) [there's an additional 'final resolve' afterward, which isn't relevant to the current thread]
Resolve() fetches version candidates. In particular, it fetches exactly one of "all candidates", "the latest candidate", "a specific pinned candidate", or "the candidates which passed the configured jobs". (https://github.com/concourse/concourse/blob/release/5.6.x/atc/db/algorithm/input_configs.go#L19-L54)
As an implementation detail, these candidates are returned in descending CheckOrder order, due to atc/db/algorithm/VersionDB.* calling atc/db/algorithm/Versions.With(), which finds-or-adds versions in sorted order. (https://github.com/concourse/concourse/blob/release/5.6.x/atc/db/algorithm/version.go#L36)
The collected candidates are then passed to atc/db/algorithm/InputCandidates.Reduce().
Reduce() experimentally pins each version of a resource, in order, until it finds a version that "IsNext()" https://github.com/concourse/concourse/blob/release/5.6.x/atc/db/algorithm/input_candidates.go#L99
IsNext() returns true if a build already exists for the given version of a resource (https://github.com/concourse/concourse/blob/release/5.6.x/atc/db/algorithm/input_candidates.go#L30)
Once we've found a version for this resource that IsNext(), we return it as a mapping, we collect all those mappings, ensure that we have a mapping for each input config, do a final group resolve to check compatibility, save the input mapping to the DB, and then determine whether we should trigger the build.

If my understanding is correct about this process, the cause of this bug is pretty straightforward - When
A) multiple versions of a resource exist, and
B) the newest version has a build that exists
then no other version of that resource is ever considered as an input to a potential job, regardless of version: every.

Given this, it seems like there might also be edge cases in which inputs that were version: every and didn't have a passed could also skip versions. I'd have to dig in to confirm or reproduce that, though.

A naiive suggestion might be, in cases where an input_config is version: every, to consider candidates in ascending order of check_order, skipping candidates that already have a build. I wouldn't be surprised if this is actually how it was implemented before, and that caused other subtle problems of "job gets stuck and never schedules any new builds". When i get some more time, I'll do some history spelunking.

Answer 19 · 2019-11-03T22:57:00.000Z

@YenTheFirst Impressive sleuthing! We're actually near the end of a complete re-design and re-implementation of the scheduling algorithm, currently living on the algorithm-v3 and release/6.0.x branch. If you'd like to read a bit about it, I had a section on it in the v10 roadmap post. We're aiming to push it out as a 6.0 beta release sometime soon.

For inputs with version: every the new algorithm actually does exactly what you proposed: walk through versions in ascending order from whichever version was previously used, falling back on descending order if none of those are valid.

However, at least at some level, the problem with passed and version: every with parallel upstream jobs is still pretty fundamental. Both attributes demand two things that are naturally in conflict: "use every version" and "use only versions that passed through the job". "Correcting" the behavior with parallel upstream builds would likely mean changing the semantics of version: every to either a) wait for in-fight builds to complete before deciding which version to use or b) support walking "backwards" to execute older versions.

Neither of these options really feel compelling to me. Even version: every without passed constraints is actually really awkward to support, and I really want to replace it with something else in the future as it has been the root cause of a ton of complexity/confusion in both the old implementation and the new one.

Answer 20 · 2019-11-04T19:28:51.000Z

I don't think the concepts of passed and version every are fundamentally at-odds, at least from a user's point of view. It's typically either "I want to use every version of a resource, as long as it passed an upstream job", or "I want to use every version of a resource, but wait until a sure-to-succeed upstream job has produced an accompanying artifact", or "I want to use every output of an upstream job (even if that's not every version of the resource)". It may make sense for those use-cases to get fulfilled by something other than version: every, but I do think those are use-cases that should be supported.

In our particular case, the pipeline is something like:

For every commit on every pr-branch, build a docker image
for every successfully built docker image, run unit tests
for every successfully built docker image, run integration tests
for every commit that was successfully built, run the linter

I suspect we could work around this by removing the initial "build a docker image", and instead having unit-tests, integration-tests do their own building, but that duplicates work and code.

[In this particular case, yes, we're using version: every as a hack around the fact that Concourse does not have spaces yet, but...Concourse doesn't have spaces yet. I suspect that there's reasonable variations on this pipeline where version: every isn't merely a hack]

I'll take a closer look at those algorithms-v3 and release/6.0.x branches. Where would be a reasonable place to leave feedback related to those branches, if applicable?

Based only on your description,

walk through versions in ascending order from whichever version was previously used

At first glance, this sounds like it would have the exact same problem. If a downstream build for D completes before an upstream build for C, the downstream build for C will never start, since "D" will be the version that was previously used.

b) support walking "backwards" to execute older versions.
I expect this is basically what's needed.

Answer 21 · 2019-11-05T15:03:43.000Z

To be clear, I don't doubt the validity of your use case at all. I just don't see us improving this specific approach any time soon, because it will take a nontrivial amount of work, and it will just delay the real fix which we've recently gained a pretty clear picture of.

The problem is that, while version: every + passed may not seem to be at odds from a user-level, it's technically impossible to satisfy without significant changes to how Concourse works, which would mean other user-level changes. I just don't think these changes are the right thing to do.

b) support walking "backwards" to execute older versions.
I expect this is basically what's needed.

This is why I think it's a fundamental issue. Concourse pipelines go forwards, not backwards. Changing this would have a ton of implications, and it isn't a clean fix anyway because now there's a question of when does version: every stop going backwards. For example, if I apply version: every to a regular Git repo without passed constraints, it's not likely that I would want it to literally go all the way back to the initial commit (resources will eventually collect all versions, not just from initial configuration time).

We already know this feature is a pain in the butt to support, and doubling down on it to get it working correctly would require re-designing significant portions of the product around it, both internally and in the UI. So at this point I'm not on board with supporting version: every + passed, and I would like to even phase out version: every in the long run. It's a crutch that makes things "work" just enough for us to not find better solutions.

Your use case of building every commit of every PR is totally fine. I think a model that better fits how Concourse works would be to configure a pipeline for each commit. We just need to improve the automation and navigation of pipelines. This is all broken down into many small features, laid out in the v10 roadmap post. Implementing most of these features will be easier or at least more beneficial in the long-term than doubling down on version: every.

I'll take a closer look at those algorithms-v3 and release/6.0.x branches. Where would be a reasonable place to leave feedback related to those branches, if applicable?

Hmm good point, not yet. We should just open a PR for it. 🤔 I'll link to it here once we do (or just keep an eye out if I forget!)

edit: PR is open #4721