lkrg-org/lkrg

Inconsistent use of get_seccomp_filter vs. __put_seccomp_filter

Opened this issue · 4 comments

On 5.9+, we use __put_seccomp_filter instead of the (no longer existent) put_seccomp_filter. However, we still pair it with get_seccomp_filter. This looks wrong at least for current mainline because get_seccomp_filter increments two refcounts:

static void __get_seccomp_filter(struct seccomp_filter *filter)
{
        refcount_inc(&filter->refs);
}

/* get_seccomp_filter - increments the reference count of the filter on @tsk */
void get_seccomp_filter(struct task_struct *tsk)
{
        struct seccomp_filter *orig = tsk->seccomp.filter;
        if (!orig)
                return;
        __get_seccomp_filter(orig);
        refcount_inc(&orig->users);
}

whereas __put_seccomp_filter decrements just one. Do we possibly trigger a resource leak there? Refcount overflow or hopefully saturation risk, too?

static void __seccomp_filter_orphan(struct seccomp_filter *orig)
{
        while (orig && refcount_dec_and_test(&orig->users)) {
                if (waitqueue_active(&orig->wqh))
                        wake_up_poll(&orig->wqh, EPOLLHUP);
                orig = orig->prev;
        }
}

static void __put_seccomp_filter(struct seccomp_filter *orig)
{
        /* Clean up single-reference branches iteratively. */
        while (orig && refcount_dec_and_test(&orig->refs)) {
                struct seccomp_filter *freeme = orig;
                orig = orig->prev;
                seccomp_filter_free(freeme);
        }
}

static void __seccomp_filter_release(struct seccomp_filter *orig)
{
        /* Notify about any unused filters in the task's former filter tree. */
        __seccomp_filter_orphan(orig);
        /* Finally drop all references to the task's former tree. */
        __put_seccomp_filter(orig);
}

Looks like we'd need to use __seccomp_filter_release instead, or even the non-static seccomp_filter_release (for which we'd need a dummy task struct). Those wake_up_poll things don't look like something we normally want to do, but perhaps if it just happened that the seccomp filter was otherwise freed while we held it, then it becomes our responsibility to do this, too?

Looks like we'd need to use __seccomp_filter_release instead, or even the non-static seccomp_filter_release (for which we'd need a dummy task struct).

Alternatively, since __get_seccomp_filter is currently just a refcount_inc, we could do the refcount_inc directly. Then pairing it with __put_seccomp_filter would remain correct for now (but is risky with future kernel changes).

We could also have our own implementation of __put_seccomp_filter and its inlined seccomp_filter_free (which is tiny) to avoid discrepancy risk and dependency on that static symbol. But then we'd assume we have more knowledge on the cleanup needed (which is differently risky with future kernel changes).

Refcount overflow or hopefully saturation risk, too?

I think the saturation should produce warnings in kernel messages. Just need to trigger over 2 billion credentials validations in one task with seccomp.

Hm... I think you are correct @solardiz. We have problems with seccomp while they starting to inline a lot of functions. Implementing own refcount logic may solve that problem. However, struct seccomp_filter is also not "public":
https://elixir.bootlin.com/linux/latest/source/kernel/seccomp.c#L226

So we would need to carry an own definition of it to be able to implement the refcount logic. Nevertheless, we already do something similar for few other structures so maybe it's OKish... @solardiz what do you think?

struct seccomp_filter is also not "public"

We only need the refs field, which is the first:

e2cfabdfd0756 (Will Drewry              2012-04-12 16:47:57 -0500  226) struct seccomp_filter {
b707ddee11d1d (Christian Brauner        2020-05-31 13:50:28 +0200  227)         refcount_t refs;
99cdb8b9a5739 (Christian Brauner        2020-05-31 13:50:30 +0200  228)         refcount_t users;

prior to 2020, it was called usage, but was still the first:

commit b707ddee11d1dc4518ab7f1aa5e7af9ceaa23317
Author: Christian Brauner <brauner@kernel.org>
Date:   Sun May 31 13:50:28 2020 +0200

    seccomp: rename "usage" to "refs" and document

and it was this way since 2017:

e2cfabdfd0756 (Will Drewry              2012-04-12 16:47:57 -0500  130) struct seccomp_filter {
0b5fa2290637a (Kees Cook                2017-06-26 09:24:00 -0700  131)         refcount_t usage;
e66a39977985b (Tyler Hicks              2017-08-11 04:33:56 +0000  132)         bool log;

The 2017 commit was a type change:

commit 0b5fa2290637a3235898d18dc0e7a136783f1bd2
Author: Kees Cook <keescook@chromium.org>
Date:   Mon Jun 26 09:24:00 2017 -0700

    seccomp: Switch from atomic_t to recount_t
    
    This switches the seccomp usage tracking from atomic_t to refcount_t to
    gain refcount overflow protections.

going even further back, we had since 2012:

e2cfabdfd0756 (Will Drewry        2012-04-12 16:47:57 -0500  58) struct seccomp_filter {
e2cfabdfd0756 (Will Drewry        2012-04-12 16:47:57 -0500  59)        atomic_t usage;
e2cfabdfd0756 (Will Drewry        2012-04-12 16:47:57 -0500  60)        struct seccomp_filter *prev;
7ae457c1e5b45 (Alexei Starovoitov 2014-07-30 20:34:16 -0700  61)        struct bpf_prog *prog;
e2cfabdfd0756 (Will Drewry        2012-04-12 16:47:57 -0500  62) };

which covers all of our supported kernel versions. So we're good in terms of this field's placement, but ideally we'll need to use the right type and inc/dec macros to match the 2017 change. In fact, if we try to use refcount_t on a too ancient kernel, I guess our code wouldn't even build. But if we use atomic_t (and macros) on a kernel that actually already has refcount overflow protection, I guess we'll simply stay without such protection in this one place.

typedef struct refcount_struct {
        atomic_t refs;
} refcount_t;

So defining and using our own struct with just the first field feels low risk for our supported kernels so far. We may.