falcosecurity/kernel-crawler

Bug: index out of range in OpenSuse

FedeDP opened this issue · 13 comments

Describe the bug

Latest update-kernels job in prow failed with:

Listing packages
Traceback (most recent call last):
  File "/usr/local/bin/kernel-crawler", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/kernel_crawler/main.py", line 56, in crawl
    res = crawl_kernels(distro, version, arch, image, out_fmt == 'driverkit')
  File "/usr/local/lib/python3.8/site-packages/kernel_crawler/crawler.py", line 89, in crawl_kernels
    res = d.get_package_tree(version)
  File "/usr/local/lib/python3.8/site-packages/kernel_crawler/repo.py", line 46, in get_package_tree
    for release, dependencies in repo.get_package_tree(version).items():
  File "/usr/local/lib/python3.8/site-packages/kernel_crawler/rpm.py", line 228, in get_package_tree
    kernel_default_devel_pkg_url = self.get_loc_by_xpath(repodb, expression)
  File "/usr/local/lib/python3.8/site-packages/kernel_crawler/rpm.py", line 34, in get_loc_by_xpath
    return loc[0]
IndexError: list index out of range

The issue is in the SUSE RPM repository class, in get_package_tree.
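For context, the crash comes from an XPath lookup that assumes at least one match. A minimal sketch of the failing pattern, assuming lxml is used for the XML parsing and simplifying the real method in rpm.py (the actual XPath expression and method signature differ):

from lxml import etree

def get_loc_by_xpath(root, expression):
    # 'root' is the parsed repository metadata, 'expression' the XPath query.
    loc = root.xpath(expression)
    # If the expression matches nothing, 'loc' is an empty list and
    # loc[0] raises "IndexError: list index out of range".
    return loc[0]

# For example, metadata without the expected kernel-default-devel entry:
root = etree.fromstring(b"<metadata/>")
get_loc_by_xpath(root, "//location/@href")  # raises IndexError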

How to reproduce it

Just run the latest kernel-crawler image against OpenSUSE:

docker run -ti --rm falcosecurity/kernel-crawler:0.4.1 crawl --distro OpenSUSE

Expected behaviour

We should not crash :)

/cc @EXONER4TED

Mmh, I added a CI check in kernel-crawler that verifies crawling is successful; it seems it is passing now: #64.
Perhaps it was a timing issue?

Nope, we tried to restart the prow job, but the issue persists: https://prow.falco.org/view/s3/falco-prow-logs/logs/update-kernels/1589623173425401856.

Perhaps it is a Python module version mismatch?

I am able to reproduce the issue on my Arch Linux laptop; it seems broken using the docker image too.
Moreover, I also tested on the ubuntu:22.04 docker image, following the exact same steps as the GitHub Actions CI, and it is not working:

kernel-crawler crawl --distro OpenSUSE
Checking repositories  [####################################]  100%
Listing packages  [#-----------------------------------]    5%  00:03:25  https://mirrors.edge.kernel.org/opensuse/distribution/leap/15.4/repo/oss/Killed

(Notice the Killed printed right after https://mirrors.edge.kernel.org/opensuse/distribution/leap/15.4/repo/oss/ in the progress line.)

I am not sure how it is possible that the GitHub Actions CI is working, though; the JSON it produces has OpenSUSE entries in it, and the process exits with 0.

Hmm... this is interesting. We have been running just fine for a few weeks now crawling OpenSUSE on our end in a container. I wonder if the Killed has to do with being rate limited or something?

At the base URL: https://mirrors.edge.kernel.org/opensuse/distribution/leap/15.4/repo/oss/, Killed does not even exist as a subdir or file.

Just double-checked our nightly to confirm, we were able to run OpenSUSE crawling last night without issue. I'll try to spend some time to see if I can replicate this behavior on my machine...

Thanks!
Mmh care to try using the kernel-crawler docker image? (Perhaps you're already using it!)
I upgraded my Arch system, and the issue seems gone; I will double-check tomorrow.
Fact is, it is still dying on test-infra :/

Okay, I was able to reproduce the Killed issue with the falcosecurity/kernel-crawler:latest image. I am wondering if it is getting OOM-killed... it works just fine outside a container. And it works just fine in our environment (we are not using that container, but rather installing kernel-crawler from source).
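One way to test the OOM hypothesis (just a debugging idea, not something in the crawler today) would be to log the process's peak RSS while packages are being listed, e.g. with Python's resource module; checking docker inspect on the exited container should also show OOMKilled: true if the cgroup memory limit was hit:

import resource

def log_peak_rss(note=""):
    # ru_maxrss is reported in kilobytes on Linux; print the peak resident
    # set size so far, to see whether memory keeps growing until the kill.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"peak RSS {note}: {peak_kb / 1024:.1f} MiB")

Calling that before and after each repository is processed would show whether the OpenSUSE repos are the ones blowing up memory inside the container.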

I think this Killed issue is separate from the original issue posted, though, where loc[0] is out of bounds - that happens when parsing the repomd XML manifest produces no results. I agree, it shouldn't crash there, but handle the failure more elegantly.

Let me try to figure out this container issue first.

I am now unable to get it to fail 🤔 What happens if you try to rerun the job again? Is there a chance this was an upstream blip with a mirror?

To continue debugging, I have rewritten the container so the crawler is no longer wrapped in a bash script. The full error I've been getting (when I was able to reproduce this) is:

/usr/bin/entrypoint: line 3:     7 Killed                  kernel-crawler "$@"

...which isn't super helpful 😅 I'm hoping that removing the shell script wrapper will tell us the line of code where it bailed. But I also wonder if this was the upstream server sending some HTTP response that killed our requests... and I can't get it to do that anymore, haha.
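If the suspicion is the mirror itself misbehaving, a standalone fetch of the repo index (outside the crawler) could help rule that out. A rough check, assuming the standard repodata/repomd.xml layout under the base URL above:

import requests

BASE = "https://mirrors.edge.kernel.org/opensuse/distribution/leap/15.4/repo/oss/"

# Fetch only the repomd.xml index and report status, size and content type;
# an unexpected status code or a tiny/odd body would point at the mirror
# rather than at the crawler.
resp = requests.get(BASE + "repodata/repomd.xml", timeout=30)
print(resp.status_code, len(resp.content), resp.headers.get("Content-Type"))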

Yep, I don't get it either :/ I even tried on an EC2 instance just to rule out an issue with my own network (but Prow itself runs on an AWS cluster anyway).
And the GitHub Actions run works just fine too, pretty solid and stable!

Yeah... I am still unable to reproduce it. I think we can attribute some of this to weirdness with the mirrors. Let me pull a branch and work on a fix that elegantly handles a failure to parse the repomd.xml file.

Were you able to rerun the test-infra job successfully, or does it still fail there?

Without the ability to replicate the issue... it's hard to test the changes, haha. But in #66 I have changed the behavior when parsing the repomd.xml to return None if the lookup result is empty, rather than trying to pull the first element out of an empty result. At the very least, it should give us a better error next time it runs if it continues to fail...
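For reference, the more defensive lookup described above would look roughly like this (a sketch of the idea behind #66, not the actual diff):

def get_loc_by_xpath(root, expression):
    # Return None instead of raising IndexError when the XPath expression
    # finds nothing in the repository metadata.
    loc = root.xpath(expression)
    if not loc:
        return None
    return loc[0]

Callers can then skip (or log) repositories whose metadata is missing the expected entry instead of crashing the whole crawl.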