ansible-community/community-topics

Generic sanity check for all collections

felixfontein opened this issue · 27 comments

Summary

As ansible-community/ansible-build-data#114 showed, we should really check the existing collections included in the Ansible package. Some ideas:

  1. Set up some nightly CI that runs some basic sanity checks (like ansible-test sanity --docker -v) on all collections. Let's specify one or two of the stable branches for that. (A rough sketch of what such a job could do follows at the end of this comment.)
  2. Give all collections that fail this some time to fix it. If they don't, say, in two months, let's deprecate them with planned removal from Ansible 7. (Doing this for Ansible 6 is too close IMO.)
  3. From then on, warn in advance once new stable branches are added to this CI (and remove old ones so that at most 1-2 of them are active at the same time), with the same rules: if a collection fails CI and doesn't fix it within a certain amount of time (with a new release), it will be deprecated and removed in the next major version of Ansible.

What do you think?

Also, it would be great to do some manual review (check whether CI is set up, look at least at some of the conditions from https://github.com/ansible-collections/overview/blob/main/collection_requirements.rst, ...), but that is very time-consuming, and since we already struggle with reviewing new collections, I doubt we will find enough time to do this.
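To make idea 1 above more concrete, here is a rough, hypothetical sketch of what such a nightly job could do: iterate over the collections installed for the package and run ansible-test sanity in each. The collections root path (the default ansible-galaxy install location) and the reporting format are assumptions; this is not an existing tool.

```python
#!/usr/bin/env python3
# Hypothetical sketch: run `ansible-test sanity --docker` for every installed
# collection and report which ones fail. The root path below is the default
# ansible-galaxy install location; adjust as needed.
import subprocess
from pathlib import Path

COLLECTIONS_ROOT = Path("~/.ansible/collections/ansible_collections").expanduser()

failures = []
for collection_dir in sorted(COLLECTIONS_ROOT.glob("*/*")):
    if not collection_dir.is_dir():
        continue
    namespace, name = collection_dir.parts[-2:]
    # ansible-test must be invoked from inside the collection directory.
    result = subprocess.run(
        ["ansible-test", "sanity", "--docker", "-v"],
        cwd=collection_dir,
    )
    if result.returncode != 0:
        failures.append(f"{namespace}.{name}")

print(f"{len(failures)} collection(s) failed sanity checks:")
for collection in failures:
    print(f"  - {collection}")
```

In practice this would probably live in ansible-build-data or antsibull CI rather than a standalone script, but the core loop would look similar.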

Sounds reasonable. But I have a question about this:

  1. Give all collections that fail this some time to fix it. If they don't, say, in two months, let's deprecate them with planned removal from Ansible 7. (Doing this for Ansible 6 is too close IMO.)

If a collection takes, say, two and a half months to fix it, would we still remove it from Ansible 7? Or would we "un-deprecate" it? Could we even do that? I mean, how would we add to the changelog: We've deprecated this collection and wanted to remove it from Ansible 7, but we've changed our minds.

I guess we can still un-deprecate it if we're happy with the result (and the explanation why it took so long). I guess this would work similarly to the document we're voting on in #90, i.e. we'd have a procedure described for this with possible conditions.

I guess we can still un-deprecate it if we're happy with the result (and the explanation why it took so long).

But how would we document this in the changelog? There's deprecated_features but not undeprecated_features or something similar, is there?

I'd probably use major_changes, since both show up in the porting guide. We can also remove the existing changelog entry if we want to.

All the ideas in the description sound good to me
+1

Do all collections actually have sanity tests? Especially if they only contain roles, I'm uncertain whether these are even doing much; from a quick look they seem very geared towards Python code in plugins.

@MarkusTeufelberger the ones included in Ansible are required to run (and pass) sanity checks (though it's not clear how many actually do - fortinet.fortios is an example of a collection which obviously didn't run them before release). ansible-test's sanity checks don't do much for collections that only contain roles, but I think it's still a good idea to run them (it will also be rather quick), because a) the tests also check meta/runtime.yml (which you can have without plugins/modules) and changelog fragments (in case you use them), and b) hopefully sooner or later there will be a sanity check for role argument specs.

FYI I have a WIP PR in which I provide an implementation to (optionally) run sanity tests on all the collections included in the package as part of the build process: ansible-community/antsibull#419

I've run a comprehensive sanity test on every collection included in 6.0.0a2 (minus pylint) and found that 67 collections (out of 98) are currently failing sanity tests; the data is here: https://gist.github.com/dmsimard/8a0526917398b19410581568947ee0dc

I'm not entirely positive whether part of it could be due to ansible-test somehow not picking up the ignore files, and I need to double-check, but I figured I would at least share my preliminary findings for the time being and will update them if need be.

Thanks @dmsimard, great work!

I'm having a look at the output for community.vmware. vca_fw and vca_nat fail the tests:

ERROR: Found 2 validate-modules issue(s) which need to be resolved:
ERROR: plugins/modules/vca_fw.py:0:0: no-default-for-required-parameter: DOCUMENTATION.options.fw_rules: Argument is marked as required but specifies a default. Arguments with a default should not be marked as required for dictionary value @ data['options']['fw_rules']. Got {'description': ['A list of firewall rules to be added to the gateway, Please see examples on valid entries'], 'required': True, 'default': False}
ERROR: plugins/modules/vca_nat.py:0:0: no-default-for-required-parameter: DOCUMENTATION.options.nat_rules: Argument is marked as required but specifies a default. Arguments with a default should not be marked as required for dictionary value @ data['options']['nat_rules']. Got {'description': ['A list of rules to be added to the gateway, Please see examples on valid entries'], 'required': True, 'default': False}
ERROR: The 2 sanity test(s) listed below (out of 43) failed. See error output above for details.

They're deprecated, so we didn't want to invest any more time in them. I thought this would ignore the errors:

https://github.com/ansible-collections/community.vmware/blob/2ce7a1313af74bdc96e9f476330c808f25c16033/tests/sanity/ignore-2.13.txt#L1-L12

Interestingly, vca_vapp doesn't fail the tests although we had to ignore some things there, too. That's a bit weird.
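For reference, no-default-for-required-parameter just means that an option declares both required: true and a default, which contradict each other. If these modules weren't deprecated, the fix would be to drop one of the two; a purely hypothetical snippet (not the actual vca_fw.py) could look like this:

```python
# Hypothetical excerpt of a module's DOCUMENTATION string; not the actual vca_fw.py fix.
DOCUMENTATION = r"""
options:
  fw_rules:
    description:
      - A list of firewall rules to be added to the gateway.
    required: true
    # no `default` line: an argument with a default must not also be marked as
    # required, which is exactly what validate-modules complains about above
"""
```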

Then there are a lot of these:

DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives

I wonder why we don't see this in our Zuul jobs (see here for a recent CI run).

Hmmm, we've already replaced LooseVersion from distutils.version with the one from ansible.module_utils.compat.version in ansible-collections/community.vmware#1165. Looks like we have to do the same for StrictVersion.

Nevertheless, the question remains: Why don't we see this error in the Zuul jobs?

The problems in community.general and community.network are very likely due to symlinks being converted to actual files by the build process (ansible-community/antsibull#218). (Also for c.g the wrong version is used, that's fixed in ansible-community/antsibull#421.)

@mariolenz you don't see these errors in Zuul since you don't test against stable-2.13. You only test against devel (and the errors are correctly ignored in the ignore-2.14.txt), but then you test against milestone, which is a very out of date version of stable-2.13 (for example it does not have ansible/ansible#77268 which was merged on March 17th, which is what produces these errors). It basically is the state of the devel branch on 2022-02-14.

Hmmm, we've already replaced LooseVersion from distutils.version with the one from ansible.module_utils.compat.version in ansible-collections/community.vmware#1165. Looks like we have to do the same for StrictVersion.

Since the whole of distutils is deprecated: yes, you do. LooseVersion is the most commonly used thing from distutils.version, but everything else from distutils also has to be removed. StrictVersion can be 'removed' the same way as LooseVersion.
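To make the swap concrete, here is a minimal sketch of the import change; ansible-core vendors these classes as a drop-in replacement, so the comparison logic stays the same (the version strings below are just illustrative, not taken from community.vmware):

```python
# Before (deprecated, and the source of the DeprecationWarning quoted above):
#   from distutils.version import LooseVersion, StrictVersion

# After: use the copies vendored in ansible-core instead.
from ansible.module_utils.compat.version import LooseVersion, StrictVersion

# Illustrative comparisons only.
assert LooseVersion("2.13.0") > LooseVersion("2.12.5")
assert StrictVersion("1.10.0") > StrictVersion("1.2.0")
```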

you don't see these errors in Zuul since you don't test against stable-2.13. You only test against devel (and the errors are correctly ignored in the ignore-2.14.txt), but then you test against milestone, which is a very out of date version of stable-2.13

Thanks, that explains it. And now I remember: I couldn't ignore those in ignore-2.13.txt because milestone is so old that it doesn't know about them.

Since the whole of distutils is deprecated: yes, you do. LooseVersion is the most commonly used thing from distutils.version, but everything else from distutils also has to be removed.

I think distutils.version.StrictVersion is the only thing left from distutils.

StrictVersion can be 'removed' the same way as LooseVersion.

ansible-collections/community.vmware#1306

you test against milestone, which is a very out of date version of stable-2.13. It basically is the state of the devel branch on 2022-02-14.

Shouldn't the milestone branch have been advanced on 2022-05-02?

I've run a comprehensive sanity test on every collection included in 6.0.0a2 (minus pylint) and found that 67 collections (out of 98) are currently failing sanity tests

Maybe some collections test against the milestone branch, not against stable-2.13. We're doing exactly this in community.vmware, and it looks like this is part of the problem.

Now that the milestone branch has been advanced (ansible-collections/news-for-maintainers#16), we will hopefully see some improvements here, because collections will now run into these problems in their own CI.

I've finished creating standalone issues in collection repositories where we picked up errors from sanity tests.

@felixfontein pointed out in ansible-collections/community.dns#100 that running tests from the release tarball is not ideal because of a symlink issue described in ansible-community/antsibull#218 so I will address that by the time we run tests again.

The sanity tests can actually be fixed, as in ansible-collections/community.rabbitmq#121, but the fix may not be released yet.

  • The question is: if such fixes should appear on Galaxy but there's nothing else to ship (no bugfixes or major/minor changes), what should maintainers do? Would it be OK to consider such fixes bugfixes and to release a patch release containing only test fixes?
  • If so, should we recommend that maintainers (at least in the collection requirements) release their collections on the spot after fixing sanity tests?

The sanity tests can actually be fixed, as in ansible-collections/community.rabbitmq#121, but the fix may not be released yet.

* The question is: if such fixes should appear on Galaxy but there's nothing else to ship (no bugfixes or major/minor changes), what should maintainers do? Would it be OK to consider such fixes bugfixes and to release a patch release containing only test fixes?

I should say: Yes.

* If so, should we recommend that maintainers (at least in the collection requirements) release their collections on the spot after fixing sanity tests?

Generally: Yes. But I think "on the spot" might be a bit extreme; as soon as possible, and preferably before the next Ansible release, would be sufficient, wouldn't it?

Generally: Yes. But I think "on the spot" might be a bit extreme; as soon as possible, and preferably before the next Ansible release, would be sufficient, wouldn't it?

I fully agree!

A quick recap and then an update:

I've run a comprehensive sanity test on every collection included in 6.0.0a2 (minus pylint) and found that 67 collections (out of 98) are currently failing sanity tests; the data is here: https://gist.github.com/dmsimard/8a0526917398b19410581568947ee0dc

I'm not entirely positive whether part of it could be due to ansible-test somehow not picking up the ignore files, and I need to double-check, but I figured I would at least share my preliminary findings for the time being and will update them if need be.

It turns out there was something wrong:

@felixfontein pointed out in ansible-collections/community.dns#100 that running tests from the release tarball is not ideal because of a symlink issue described in ansible-community/antsibull#218 so I will address that by the time we run tests again.

I've re-run sanity tests for every packaged collection, this time installed with ansible-galaxy collection install -r galaxy-requirements.yaml provided by the release of 6.0.0a2.

This time there were 38 failures out of 98, with many now passing and a few now failing (edit: this was mistakenly with 2.12, see the edit at the end).

I'll update the issues I've created and I've sent a WIP PR to ansible-build-data (changing the approach from within antsibull) so we can test this on a regular basis: ansible-community/ansible-build-data#122

Edit: While troubleshooting, I realized I mistakenly ran the tests with "ansible-galaxy collection install" from ansible-core 2.12 instead of 2.13.0rc1 🤦

It looks like that may be why there were fewer failures, because it's back up to 59 now:
[screenshot of sanity test results, 2022-05-12]

I apologize for the confusion, even though the comparison between 2.12 and 2.13 may be interesting.

@dmsimard The community.dns issue is already closed, and I don't see that you've opened one in community.general. So it looks like these issues are left to be closed since the tests don't fail anymore:

Is it OK with you if I do?

@mariolenz your help would be appreciated, sure!

I'd still like to understand why it is that these particular collections are failing from the tarball (and not from galaxy collection install) and whether it is fixable or not, though.

My initial objective was to test the package itself as it is shipped, so it's unfortunate that we get different results, but we have to start somewhere, and it may be a better investment to focus on the collections that are failing when installing with galaxy.

We can always circle back to the packaging later.

@dmsimard I can't close those issues due to missing permissions :-/ But I've added a comment that they can be closed because of a false positive from our side.

@dmsimard I can't close those issues due to missing permissions :-/ But I've added a comment that they can be closed because of a false positive from our side.

I've closed the remaining ones, thanks!

Worked through some of the collections and updated the linked issues above (I will edit this list with further batches of updates to keep the noise down):

  • amazon.aws issues are different from the release tarball vs ansible-galaxy collection install. Fixes are WIP.
  • chocolatey.chocolatey issues are the same as the release tarball, maintainers are responsive.
  • community.digitalocean issues are different from the release tarball vs ansible-galaxy collection install. Fixes are WIP.
  • hetzner.hcloud issues are different from the release tarball vs ansible-galaxy collection install. It has new failures with 2.13 that don't reproduce with 2.12, and has a responsive maintainer.
  • kubernetes.core issues are different from the release tarball vs ansible-galaxy collection install. Has not yet been acknowledged.

I haven't gotten around to revisiting every collection issue yet, but in the meantime 6.0.0a3 is upon us and I've run sanity tests again with the latest releases of the collections.

We're down from 59 failures to 51, which is progress (full data here):
[screenshot of sanity test results, 2022-05-17]

@dmsimard Most of the issues in the collections that don't fail any more have been closed already. To the two or three remaining ones, I have added a comment that the tests seem to succeed with 6.0.0a3.

vmware.vmware_rest is new on the list, but I think the tests are already fixed, just not released yet: ansible-collections/vmware.vmware_rest#332

edit:

Still open: