Inconsistent use of get_seccomp_filter vs. __put_seccomp_filter
Opened this issue · 4 comments
On 5.9+, we use __put_seccomp_filter
instead of the (no longer existent) put_seccomp_filter
. However, we still pair it with get_seccomp_filter
. This looks wrong at least for current mainline because get_seccomp_filter
increments two refcounts:
static void __get_seccomp_filter(struct seccomp_filter *filter)
{
refcount_inc(&filter->refs);
}
/* get_seccomp_filter - increments the reference count of the filter on @tsk */
void get_seccomp_filter(struct task_struct *tsk)
{
struct seccomp_filter *orig = tsk->seccomp.filter;
if (!orig)
return;
__get_seccomp_filter(orig);
refcount_inc(&orig->users);
}
whereas __put_seccomp_filter
decrements just one. Do we possibly trigger a resource leak there? Refcount overflow or hopefully saturation risk, too?
static void __seccomp_filter_orphan(struct seccomp_filter *orig)
{
while (orig && refcount_dec_and_test(&orig->users)) {
if (waitqueue_active(&orig->wqh))
wake_up_poll(&orig->wqh, EPOLLHUP);
orig = orig->prev;
}
}
static void __put_seccomp_filter(struct seccomp_filter *orig)
{
/* Clean up single-reference branches iteratively. */
while (orig && refcount_dec_and_test(&orig->refs)) {
struct seccomp_filter *freeme = orig;
orig = orig->prev;
seccomp_filter_free(freeme);
}
}
static void __seccomp_filter_release(struct seccomp_filter *orig)
{
/* Notify about any unused filters in the task's former filter tree. */
__seccomp_filter_orphan(orig);
/* Finally drop all references to the task's former tree. */
__put_seccomp_filter(orig);
}
Looks like we'd need to use __seccomp_filter_release
instead, or even the non-static seccomp_filter_release
(for which we'd need a dummy task struct). Those wake_up_poll
things don't look like something we normally want to do, but perhaps if it just happened that the seccomp filter was otherwise freed while we held it, then it becomes our responsibility to do this, too?
Looks like we'd need to use
__seccomp_filter_release
instead, or even the non-staticseccomp_filter_release
(for which we'd need a dummy task struct).
Alternatively, since __get_seccomp_filter
is currently just a refcount_inc
, we could do the refcount_inc
directly. Then pairing it with __put_seccomp_filter
would remain correct for now (but is risky with future kernel changes).
We could also have our own implementation of __put_seccomp_filter
and its inlined seccomp_filter_free
(which is tiny) to avoid discrepancy risk and dependency on that static symbol. But then we'd assume we have more knowledge on the cleanup needed (which is differently risky with future kernel changes).
Refcount overflow or hopefully saturation risk, too?
I think the saturation should produce warnings in kernel messages. Just need to trigger over 2 billion credentials validations in one task with seccomp.
Hm... I think you are correct @solardiz. We have problems with seccomp
while they starting to inline a lot of functions. Implementing own refcount logic may solve that problem. However, struct seccomp_filter
is also not "public":
https://elixir.bootlin.com/linux/latest/source/kernel/seccomp.c#L226
So we would need to carry an own definition of it to be able to implement the refcount logic. Nevertheless, we already do something similar for few other structures so maybe it's OKish... @solardiz what do you think?
struct seccomp_filter
is also not "public"
We only need the refs
field, which is the first:
e2cfabdfd0756 (Will Drewry 2012-04-12 16:47:57 -0500 226) struct seccomp_filter {
b707ddee11d1d (Christian Brauner 2020-05-31 13:50:28 +0200 227) refcount_t refs;
99cdb8b9a5739 (Christian Brauner 2020-05-31 13:50:30 +0200 228) refcount_t users;
prior to 2020, it was called usage
, but was still the first:
commit b707ddee11d1dc4518ab7f1aa5e7af9ceaa23317
Author: Christian Brauner <brauner@kernel.org>
Date: Sun May 31 13:50:28 2020 +0200
seccomp: rename "usage" to "refs" and document
and it was this way since 2017:
e2cfabdfd0756 (Will Drewry 2012-04-12 16:47:57 -0500 130) struct seccomp_filter {
0b5fa2290637a (Kees Cook 2017-06-26 09:24:00 -0700 131) refcount_t usage;
e66a39977985b (Tyler Hicks 2017-08-11 04:33:56 +0000 132) bool log;
The 2017 commit was a type change:
commit 0b5fa2290637a3235898d18dc0e7a136783f1bd2
Author: Kees Cook <keescook@chromium.org>
Date: Mon Jun 26 09:24:00 2017 -0700
seccomp: Switch from atomic_t to recount_t
This switches the seccomp usage tracking from atomic_t to refcount_t to
gain refcount overflow protections.
going even further back, we had since 2012:
e2cfabdfd0756 (Will Drewry 2012-04-12 16:47:57 -0500 58) struct seccomp_filter {
e2cfabdfd0756 (Will Drewry 2012-04-12 16:47:57 -0500 59) atomic_t usage;
e2cfabdfd0756 (Will Drewry 2012-04-12 16:47:57 -0500 60) struct seccomp_filter *prev;
7ae457c1e5b45 (Alexei Starovoitov 2014-07-30 20:34:16 -0700 61) struct bpf_prog *prog;
e2cfabdfd0756 (Will Drewry 2012-04-12 16:47:57 -0500 62) };
which covers all of our supported kernel versions. So we're good in terms of this field's placement, but ideally we'll need to use the right type and inc/dec macros to match the 2017 change. In fact, if we try to use refcount_t
on a too ancient kernel, I guess our code wouldn't even build. But if we use atomic_t
(and macros) on a kernel that actually already has refcount overflow protection, I guess we'll simply stay without such protection in this one place.
typedef struct refcount_struct {
atomic_t refs;
} refcount_t;
So defining and using our own struct with just the first field feels low risk for our supported kernels so far. We may.