lkrg-org/lkrg

ovl_create_or_link inlining leads to false positive 'off' flag corruption for dockerd

Closed this issue · 40 comments

ajakk commented

I'm running lkrg-0.9.4 on Gentoo. Docker is at 20.10.16.

$ docker run hello-world
docker: error during connect: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.41/containers/create": EOF.
See 'docker run --help'.

This seems to be the interesting part of dmesg, with log level set to 6:

Jul 23 12:35:23 sol kernel: LKRG: ALERT: DETECT: Task: 'off' flag corruption for pid 977874, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: DEBUG: 'off' flag[0x35d561dd823507c9] (normalization via 0x5fb43c3475b39c1)
Jul 23 12:35:23 sol kernel: LKRG: ALERT: BLOCK: Task: Killing pid 977874, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977875
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977880
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977879
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977881
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977910
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977908
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977884
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977870
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977883
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977878
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977876
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977907
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977882
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977877
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977868
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 978087
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977873
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977871
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977909
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Can't find in internal tracking list pid 977910, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977867
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Can't find in internal tracking list pid 977882, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Can't find in internal tracking list pid 977883, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: ALERT: DETECT: Task: 'off' flag corruption for pid 977874, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977872
Jul 23 12:35:23 sol kernel: LKRG: DEBUG: 'off' flag[0x35d561dd823507c9] (normalization via 0x5fb43c3475b39c1)
Jul 23 12:35:23 sol kernel: LKRG: ALERT: BLOCK: Task: Killing pid 977874, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: WATCH: Removing pid 977869
Jul 23 12:35:23 sol kernel: LKRG: ALERT: DETECT: Task: 'off' flag corruption for pid 977874, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: DEBUG: 'off' flag[0x35d561dd823507c9] (normalization via 0x5fb43c3475b39c1)
Jul 23 12:35:23 sol kernel: LKRG: ALERT: BLOCK: Task: Killing pid 977874, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: ALERT: DETECT: Task: 'off' flag corruption for pid 977874, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: DEBUG: 'off' flag[0x35d561dd823507c9] (normalization via 0x5fb43c3475b39c1)
Jul 23 12:35:23 sol kernel: LKRG: ALERT: BLOCK: Task: Killing pid 977874, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: ALERT: DETECT: Task: 'off' flag corruption for pid 977874, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: DEBUG: 'off' flag[0x35d561dd823507c9] (normalization via 0x5fb43c3475b39c1)
Jul 23 12:35:23 sol kernel: LKRG: ALERT: BLOCK: Task: Killing pid 977874, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: ALERT: DETECT: Task: 'off' flag corruption for pid 977874, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: DEBUG: 'off' flag[0x35d561dd823507c9] (normalization via 0x5fb43c3475b39c1)
Jul 23 12:35:23 sol kernel: LKRG: ALERT: BLOCK: Task: Killing pid 977874, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: ALERT: DETECT: Task: 'off' flag corruption for pid 977874, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: DEBUG: 'off' flag[0x35d561dd823507c9] (normalization via 0x5fb43c3475b39c1)
Jul 23 12:35:23 sol kernel: LKRG: ALERT: BLOCK: Task: Killing pid 977874, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: ALERT: DETECT: Task: 'off' flag corruption for pid 977874, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: DEBUG: 'off' flag[0x35d561dd823507c9] (normalization via 0x5fb43c3475b39c1)
Jul 23 12:35:23 sol kernel: LKRG: ALERT: BLOCK: Task: Killing pid 977874, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: ALERT: DETECT: Task: 'off' flag corruption for pid 977874, name dockerd
Jul 23 12:35:23 sol kernel: LKRG: DEBUG: 'off' flag[0x35d561dd823507c9] (normalization via 0x5fb43c3475b39c1)
Jul 23 12:35:23 sol kernel: LKRG: ALERT: BLOCK: Task: Killing pid 977874, name dockerd

Thank you for reporting this, @ajakk!

Is this issue new with 0.9.4 or does it also occur with 0.9.3?

Either way, can you please try uncommenting //#define P_LKRG_TASK_OFF_DEBUG in src/modules/print_log/p_lkrg_print_log.h? By the way, this is something we hadn't tested recently - I hope it still builds and works - let's see.

Looks like the reported flag value is exactly 9x the normalization multiplier:

$ printf "%x\n" $[0x5fb43c3475b39c1*9]
35d561dd823507c9

This is close to the maximum off nesting depth we support in p_validate_off_flag(), and moreover if we still have this in the debugging output the actual depth must have been about twice that, so 18x. That's weird. When we set the maximum at 9x, the expectation was that the actual maximum to be seen would be 2x or maybe 3x. Perhaps we overlooked some possibility for this to be way higher.

Another possibility is we unexpectedly see this in p_ed_is_off_off() reached other than via p_validate_off_flag(), in which case we expect no nesting at all (or rather 1x in terms of the value), so seeing 8x (or the 9x value) there is also very weird.
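
The multiple can also be derived programmatically instead of guessing the factor. A minimal C sketch, assuming only that a valid flag value is an exact positive multiple of the per-load normalization value; off_depth is a hypothetical helper, not an LKRG function:

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of LKRG): return how many times the
 * normalization value fits into the logged 'off' flag, or -1 when the
 * flag is not an exact positive multiple (which would suggest real
 * corruption rather than unexpected nesting).
 */
static long off_depth(uint64_t flag, uint64_t normalization)
{
    if (normalization == 0 || flag == 0 || flag % normalization != 0)
        return -1;
    return (long)(flag / normalization);
}
```

With the values from the log above, off_depth(0x35d561dd823507c9ULL, 0x5fb43c3475b39c1ULL) evaluates to 9.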

Anyway, @ajakk can you try the below patch and see what happens? -

@@ -970,7 +970,7 @@ inline void p_validate_off_flag(struct p_ed_process *p_source, long p_val, int *
 
    while (p_val > p_global_cnt_cookie) {
       p_val -= p_global_cnt_cookie;
-      if (unlikely(p_val > (p_global_cnt_cookie << 3)))
+      if (unlikely(p_val > (p_global_cnt_cookie << 8)))
          break;
    }
 

This isn't a full/proper fix anyhow. If we do need to support off nesting this deep, then we also need to lower the maximum multiplier used to avoid signed integer wraparound, and we might want to optimize the normalization loop differently (such as back to usage of the % operator when the nesting level is high).

Sorry, that very patch won't even work because of the integer wraparound I mentioned. The maximum that would work reliably (regardless of what the random values are) is 4 in place of 3, but that's not enough to cover the 18x we presumably had. Here's a more elaborate version:

diff --git a/src/modules/exploit_detection/p_exploit_detection.c b/src/modules/exploit_detection/p_exploit_detection.c
index f0e987d..9907018 100644
--- a/src/modules/exploit_detection/p_exploit_detection.c
+++ b/src/modules/exploit_detection/p_exploit_detection.c
@@ -970,7 +970,7 @@ inline void p_validate_off_flag(struct p_ed_process *p_source, long p_val, int *
 
    while (p_val > p_global_cnt_cookie) {
       p_val -= p_global_cnt_cookie;
-      if (unlikely(p_val > (p_global_cnt_cookie << 3)))
+      if (unlikely(p_val > (p_global_cnt_cookie << 8)))
          break;
    }
 
diff --git a/src/modules/exploit_detection/p_exploit_detection.h b/src/modules/exploit_detection/p_exploit_detection.h
index 21e3837..97927b4 100644
--- a/src/modules/exploit_detection/p_exploit_detection.h
+++ b/src/modules/exploit_detection/p_exploit_detection.h
@@ -239,10 +239,10 @@ struct p_seccomp {
 
 #ifdef CONFIG_X86_64
  #define P_NORMALIZE_LONG 0x0101010101010101
- #define P_MASK_COUNTER   0x07FFFFFFFFFFFFFF
+ #define P_MASK_COUNTER   0x003FFFFFFFFFFFFF
 #else
  #define P_NORMALIZE_LONG 0x01010101
- #define P_MASK_COUNTER   0x07FFFFFF
+ #define P_MASK_COUNTER   0x003FFFFF
 #endif
 
 #ifdef P_LKRG_TASK_OFF_DEBUG
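
Why the mask has to shrink together with the larger shift can be checked with a few lines of C. This is a sketch for the 64-bit case only, under the assumption that the cookie never exceeds P_MASK_COUNTER and that the shifted value must stay within a signed 64-bit long; max_safe_shift is a hypothetical name:

```c
#include <stdint.h>

/*
 * Hypothetical helper: largest shift s such that (mask << s) still fits
 * into a signed 64-bit value, i.e. stays at or below INT64_MAX.
 * Returns -1 if the mask itself already exceeds INT64_MAX.
 */
static int max_safe_shift(uint64_t mask)
{
    int shift = -1;

    /* Keep shifting while bit 63 (the sign bit) stays clear. */
    while (shift < 63 && mask <= (uint64_t)INT64_MAX) {
        shift++;
        mask <<= 1;
    }
    return shift;
}
```

For the old mask 0x07FFFFFFFFFFFFFF this gives 4, matching the limit mentioned above; for the new mask 0x003FFFFFFFFFFFFF it gives 9, so a shift of 8 leaves one bit of headroom.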

ajakk commented

> Thank you for reporting this, @ajakk!
>
> Is this issue new with 0.9.4 or does it also occur with 0.9.3?

I believe this has been an issue since I started using lkrg, around 0.9.2. So, certainly not a recent regression.

The first patch works! The second patch seems to be an improvement, but it produces inconsistent results:

jake@sol ~/git/lkrg $ docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

[snip success message]

jake@sol ~/git/lkrg $ docker run hello-world
ERRO[0000] error waiting for container: unexpected EOF
docker: error during connect: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.41/containers/111b36d5ca3291350975fcff3e0d972103fa286e2a930b8eaddeb82ac42fe70a/start": EOF.
jake@sol ~/git/lkrg $ docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

[snip success message]

ERRO[0002] error waiting for container: unexpected EOF

> The first patch works! The second patch seems to be an improvement, but it produces inconsistent results:

That's weird. I'd suspect that if the second patch doesn't consistently fix the issue, the first patch would even more likely/frequently not do so. Maybe you need to unload/reload LKRG with the first patch a few times to trigger it failing too. The random values used in off flag protection vary per-LKRG-load.

When you get that ERRO[0000] error waiting for container: unexpected EOF message with the second patch applied, does LKRG log messages similar to what you had posted?

Would you please also try P_LKRG_TASK_OFF_DEBUG?

ajakk commented

So, with this patch:

diff --git a/src/modules/exploit_detection/p_exploit_detection.c b/src/modules/exploit_detection/p_exploit_detection.c
index f0e987d..9907018 100644
--- a/src/modules/exploit_detection/p_exploit_detection.c
+++ b/src/modules/exploit_detection/p_exploit_detection.c
@@ -970,7 +970,7 @@ inline void p_validate_off_flag(struct p_ed_process *p_source, long p_val, int *

    while (p_val > p_global_cnt_cookie) {
       p_val -= p_global_cnt_cookie;
-      if (unlikely(p_val > (p_global_cnt_cookie << 3)))
+      if (unlikely(p_val > (p_global_cnt_cookie << 8)))
          break;
    }

diff --git a/src/modules/exploit_detection/p_exploit_detection.h b/src/modules/exploit_detection/p_exploit_detection.h
index 21e3837..97927b4 100644
--- a/src/modules/exploit_detection/p_exploit_detection.h
+++ b/src/modules/exploit_detection/p_exploit_detection.h
@@ -239,10 +239,10 @@ struct p_seccomp {

 #ifdef CONFIG_X86_64
  #define P_NORMALIZE_LONG 0x0101010101010101
- #define P_MASK_COUNTER   0x07FFFFFFFFFFFFFF
+ #define P_MASK_COUNTER   0x003FFFFFFFFFFFFF
 #else
  #define P_NORMALIZE_LONG 0x01010101
- #define P_MASK_COUNTER   0x07FFFFFF
+ #define P_MASK_COUNTER   0x003FFFFF
 #endif

 #ifdef P_LKRG_TASK_OFF_DEBUG
diff --git a/src/modules/print_log/p_lkrg_print_log.h b/src/modules/print_log/p_lkrg_print_log.h
index 3a41c3d..dc88d2b 100644
--- a/src/modules/print_log/p_lkrg_print_log.h
+++ b/src/modules/print_log/p_lkrg_print_log.h
@@ -27,7 +27,7 @@

 /* Do we want to precisely track changes of 'off' flag per each process?
  * If yes, uncomment it here */
-//#define P_LKRG_TASK_OFF_DEBUG
+#define P_LKRG_TASK_OFF_DEBUG

 // Do we want to precisely track all kernel .text section changes?
 // By default NO. If you want it (and print relevant information)

I've reloaded lkrg a few times, and run docker run hello-world twice. This is lkrg's logs during the docker-run executions:

lkrg-log.txt

This time we have:

Jul 23 15:12:24 sol kernel: LKRG: WATCH: 'off' flag[0x1b4b0c27975c7c9] (normalization via 0x2197716bcdfbad)
Jul 23 15:12:24 sol kernel: LKRG: WATCH: OFF debug: normalization[0x2197716bcdfbad] cookie[0x5363ff438f9bb179]

and it's 13x:

$ printf "%x\n" $[0x2197716bcdfbad*13]
1b4b0c27975c7c9

I really don't see why the loop in p_validate_off_flag() didn't bring it in range now. Possibilities are it was actually something like 256+13 or we're seeing this in p_ed_is_off_off() called directly. Maybe some other possibility I overlook.

Further debugging output suggests this involved override_creds() / revert_creds() (this is no surprise) and did probably bring the value to 13 (indirectly seen via p_off_debug_cnt). So apparently we have a call to p_ed_is_off_off() in some place where the flag shouldn't necessarily be expected to be off. Also, if this logic is valid, then none of the patches I posted here should have helped.

There were also stack traces apparently omitted from the log because of grepping for LKRG. They don't include LKRG on each line because we're invoking the kernel's own function to print them. @ajakk Can you share those?

ajakk commented

Sorry, didn't realize!

This is generated with journalctl -b -k | grep -E '15:12:(22|23|24|25)', limiting the output to the few seconds of the other log:

lkrg-log.txt

Thanks. Can you try increasing P_LKRG_TASK_OFF_MAXBUF from 256 to, say, 10000? I guess the debug logging here is limited by that buffer's size, and we'd have even more of it with the buffer increased.

The overrides look mostly balanced (interleaved on/off), but not fully, and perhaps this adds up...

At the same time, you can revert to default log_level now. Enabling P_LKRG_TASK_OFF_DEBUG increases severity of just the messages we're interested in here. I think there's no further need to see all other debugging as well. So let's have a more focused debugging log now.

@ajakk would you be able to answer a few questions?

  1. Which kernel version do you use?
  2. Do you have custom build / compilation or distro kernel?
  3. Do you have any aggressive optimization enabled?
  4. Do you use overlayfs / overlayfs2 (which one if you do)?
  5. Can you check if you see the same problem if you revert to the commit 6e30ac2 ?

@Adam-pi3 Let's also hear from @ajakk, but meanwhile I can answer some of these:

The kernel is 5.15.52-gentoo-dist-hardened. I don't know if it's custom or distro, or if aggressive optimization was used - that's for @ajakk to answer.

Per the logs the overlay module is in use, and various ovl_ functions are seen in stack traces triggered by P_LKRG_TASK_OFF_DEBUG.

That commit you reference is very recent. @ajakk had already answered my similar question as follows:

> I believe this has been an issue since I started using lkrg, around 0.9.2. So, certainly not a recent regression.

@ajakk Adam's questions reminded me, do you by any chance get this message triggered? -

      if (p_install_ovl_create_or_link_hook(1)) {
         p_print_log(P_LOG_FAULT,
                "OverlayFS is being loaded but LKRG can't hook 'ovl_create_or_link' function. "
                "It is very likely that LKRG will produce false positives. Please reload LKRG.");
      }

If you do, please try to reload LKRG as it says and see if the problem persists.

Also, upon reloading do you possibly get the message from here? -

   /* OverlayFS
    *
    * OverlayFS might not be installed in that system - it is not critical
    * scenario. If OverlayFS is installed, used but not found (unlikely)
    * in worst case, we might have FP. Continue...
    */
   { "ovl_create_or_link",
     p_install_ovl_create_or_link_hook,
     p_uninstall_ovl_create_or_link_hook,
     0,
     "Can't hook 'ovl_create_or_link' function. This is expected when OverlayFS is not used.",
     1
   },

Despite the overlay module already appearing in lsmod?

If so, that would indicate that aggressive optimizations of the kernel prevented ovl_create_or_link from being hookable by us, leading to the false positives seen here.

ajakk commented

> @ajakk would you be able to answer a few questions?
>
> 1. Which kernel version do you use?

Linux sol 5.15.52-gentoo-dist-hardened #1 SMP Fri Jul 8 13:10:28 CDT 2022 x86_64 AMD Ryzen 7 5800X 8-Core Processor AuthenticAMD GNU/Linux

I've relatively recently switched back down to upstream-LTS from upstream-stable kernels, though.

> 2. Do you have custom build / compilation or distro kernel?

This is Gentoo's sys-kernel/gentoo-kernel distribution kernel (adds -dist) with USE=hardened (adds -hardened), which is the upstream kernel with Gentoo's patches (genpatches), with Fedora's kernel config, with some "hardening" kernel configuration settings from the following:

https://github.com/mgorny/gentoo-kernel-config/blob/master/hardened-base.config
https://github.com/mgorny/gentoo-kernel-config/blob/master/hardened-amd64.config
https://github.com/mgorny/gentoo-kernel-config/blob/master/hardened-gcc-plugins.config

Additionally, I've done some mild meddling in /etc/kernel/config.d:

$ cat /etc/kernel/config.d/*
CONFIG_KFENCE=y
CONFIG_KFENCE_SAMPLE_INTERVAL=y
CONFIG_DEBUG_KERNEL=y
CONFIG_KALLSYMS_ALL=y
CONFIG_AUDIT=y
CONFIG_SECURITY_APPARMOR=y

CONFIG_LSM="apparmor,yama"

For good measure, I've attached my config.gz.

> 3. Do you have any aggressive optimization enabled?

Nothing that I'm aware of.

> 4. Do you use overlayfs / overlayfs2 (which one if you do)?

I'm unsure, but I don't think I've done any meddling related to overlayfs.

> 5. Can you check if you see the same problem if you revert to the commit [6e30ac2](https://github.com/lkrg-org/lkrg/commit/6e30ac23b9d1d70e966767bbbede70cba1eb6262) ?

Not a 0.9.4 regression, so I'm not sure this would be informative.

> Also, upon reloading do you possibly get the message from here? -
>
>    /* OverlayFS
>     *
>     * OverlayFS might not be installed in that system - it is not critical
>     * scenario. If OverlayFS is installed, used but not found (unlikely)
>     * in worst case, we might have FP. Continue...
>     */
>    { "ovl_create_or_link",
>      p_install_ovl_create_or_link_hook,
>      p_uninstall_ovl_create_or_link_hook,
>      0,
>      "Can't hook 'ovl_create_or_link' function. This is expected when OverlayFS is not used.",
>      1
>    },
>
> Despite the overlay module already appearing in lsmod?
>
> If so, that would indicate that aggressive optimizations of the kernel prevented ovl_create_or_link from being hookable by us, leading to the false positives seen here.

This is what I (reproducibly) see on reloading lkrg with a systemctl restart lkrg:

Jul 23 15:33:22 sol kernel: LKRG: DYING: LKRG unloaded
Jul 23 15:33:22 sol kernel: LKRG: ALIVE: Loading LKRG
Jul 23 15:33:22 sol kernel: Freezing user space processes ... (elapsed 0.001 seconds) done.
Jul 23 15:33:22 sol kernel: OOM killer disabled.
Jul 23 15:33:22 sol kernel: LKRG: ISSUE: Can't enforce SELinux validation (CONFIG_GCC_PLUGIN_RANDSTRUCT detected)
Jul 23 15:33:22 sol kernel: LKRG: ISSUE: [kretprobe] register_kretprobe() for <ovl_create_or_link> failed! [err=-2]
Jul 23 15:33:22 sol kernel: LKRG: ISSUE: Can't hook 'ovl_create_or_link' function. This is expected when OverlayFS is not used.
Jul 23 15:33:22 sol kernel: LKRG: ISSUE: [kretprobe] register_kretprobe() for <lookup_fast> failed! [err=-2]
Jul 23 15:33:22 sol kernel: LKRG: ISSUE: Won't enforce pCFI validation on 'lookup_fast'
Jul 23 15:33:22 sol kernel: LKRG: ALIVE: LKRG initialized successfully
Jul 23 15:33:22 sol kernel: OOM killer enabled.
Jul 23 15:33:22 sol kernel: Restarting tasks ... done.

@Adam-pi3 ovl_create_or_link is static and with only two callers, both also static. So it's not too surprising it might have gotten e.g. inlined into those or into their callers. I think we need to look into something else to hook.

ovl_create_or_link uses security_dentry_create_files_as, which is reliably hookable and isn't currently used by anything else in-tree. However, the concerns are possible other callers (added later or out-of-tree) and whether the placement of this call inside ovl_create_or_link is suitable for our needs. There are several other calls we could consider as well.

> This is what I (reproducibly) see on reloading lkrg with a systemctl restart lkrg:

Can you also check lsmod | grep overlay just before systemctl restart lkrg?

> CONFIG_KFENCE=y

Unrelated to this issue, but that setting is otherwise problematic with LKRG, see #79.

ajakk commented

> At the same time, you can revert to default log_level now. Enabling P_LKRG_TASK_OFF_DEBUG increases severity of just the messages we're interested in here. I think there's no further need to see all other debugging as well. So let's have a more focused debugging log now.

Sorry, here's that too (with kernel logs from the start of lkrg's first ALERT):

$ docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

ERRO[0000] error waiting for container: unexpected EOF
jake@sol ~ $ docker run hello-world
ERRO[0002] error waiting for container: unexpected EOF
docker: error during connect: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.41/containers/ed3e36bb2609a7edf447f9fbcbec413706fe776f510337f526ab669a9b29ec23/start": EOF.

lkrg-log.txt

Just a few notes, I've just checked the recent Ubuntu LTS 22.04:

pi3@darkstar:~$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04 LTS
Release:	22.04
Codename:	jammy
pi3@darkstar:~$ uname -a
Linux darkstar 5.15.0-41-generic #44-Ubuntu SMP Wed Jun 22 14:20:53 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
pi3@darkstar:~$ lsmod|grep lkrg
lkrg                  208896  0
pi3@darkstar:~$ docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

pi3@darkstar:~$ 

@ajakk Thanks for all the information. The problem is likely related to the following lines:

Jul 23 15:33:22 sol kernel: LKRG: ISSUE: [kretprobe] register_kretprobe() for <ovl_create_or_link> failed! [err=-2]
Jul 23 15:33:22 sol kernel: LKRG: ISSUE: Can't hook 'ovl_create_or_link' function. This is expected when OverlayFS is not used.

@solardiz We would need to do the homework to verify what can or cannot be hooked. Be aware that 'static' can be added at any time to whichever other function we choose (it happens all the time).
@ajakk Can you provide the output from the following example command?

pi3@darkstar:~$ cat /proc/kallsyms |grep ovl_create_or_link
0000000000000000 t ovl_create_or_link	[overlay]

ajakk commented

> > This is what I (reproducibly) see on reloading lkrg with a systemctl restart lkrg:
>
> Can you also check lsmod | grep overlay just before systemctl restart lkrg?

$ lsmod | grep overlay
overlay               139264  11

> @ajakk Can you provide the output from the following example command?
>
> pi3@darkstar:~$ cat /proc/kallsyms |grep ovl_create_or_link
> 0000000000000000 t ovl_create_or_link	[overlay]

$ cat /proc/kallsyms |grep ovl_create_or_link
0000000000000000 t p_ovl_create_or_link_entry   [lkrg]
0000000000000000 t p_uninstall_ovl_create_or_link_hook  [lkrg]
0000000000000000 t p_ovl_create_or_link_ret     [lkrg]
0000000000000000 b p_ovl_create_or_link_kretprobe_state [lkrg]
0000000000000000 t p_verify_ovl_create_or_link  [lkrg]
0000000000000000 t p_install_ovl_create_or_link_hook    [lkrg]
0000000000000000 t p_reinit_ovl_create_or_link_kretprobe        [lkrg]
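
Note that the grep above matches substrings, so it lists LKRG's own p_..._ovl_create_or_link... helpers while the kernel's ovl_create_or_link itself is absent. A stricter check compares the symbol column exactly. A minimal C sketch, assuming the usual "address type name [module]" line layout of /proc/kallsyms; kallsyms_line_has_symbol is a hypothetical helper:

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if one /proc/kallsyms-style line
 * ("address type name [module]") names exactly `symbol`, else 0.
 */
static int kallsyms_line_has_symbol(const char *line, const char *symbol)
{
    char addr[32], type[8], name[256];

    /* The third whitespace-separated field is the symbol name. */
    if (sscanf(line, "%31s %7s %255s", addr, type, name) != 3)
        return 0;
    return strcmp(name, symbol) == 0;
}
```

Applied to the two outputs in this thread, the Ubuntu line matches ovl_create_or_link, whereas none of the seven [lkrg] lines above do.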

> Sorry, here's that too (with kernel logs from the start of lkrg's first ALERT):

Sorry, I was wrong about using default log_level. Should have been log_level=4. Anyway, we don't need this now - we seem to have figured out the problem already.

@ajakk Can you do a test where you modify the following file in the Linux kernel sources? Execute the following command:

# echo "EXPORT_SYMBOL(ovl_create_or_link);" >>  fs/overlayfs/dir.c

Next, recompile the overlay module and reload it.

@Adam-pi3 I think that test could be rather time-consuming for @ajakk if he's not used to that on this system (uses a distro kernel). Maybe it's better use of everyone's time to ask @ajakk to experiment with a possible LKRG fix instead.

Looking at the code, I think simply hooking security_dentry_create_files_as instead of ovl_create_or_link should do the trick. So let's try?

@solardiz Off the top of my head, I don't even remember why this specific function must be hooked to double-align the flag. My understanding is that somewhere down the path there is a 'forgotten' call to 'revert', or it is called twice. I would first like to find out what exactly the case was and whether anything has changed in the kernel since we developed this hook, and then we can look for possible solutions. Sure, we can do this 'blind' check/fix, but we would not know the side effects.
From my understanding, Gentoo doesn't provide a binary of the kernel but rather compiles it on your local machine, and that's why @ajakk has this function inlined (it's not a stock kernel in binary form).

@ajakk Can you try this LKRG hack, please? -

diff --git a/src/modules/exploit_detection/p_exploit_detection.c b/src/modules/exploit_detection/p_exploit_detection.c
index f0e987d..fb462bd 100644
--- a/src/modules/exploit_detection/p_exploit_detection.c
+++ b/src/modules/exploit_detection/p_exploit_detection.c
@@ -304,7 +304,7 @@ static const struct p_functions_hooks {
     * scenario. If OverlayFS is installed, used but not found (unlikely)
     * in worst case, we might have FP. Continue...
     */
-   { "ovl_create_or_link",
+   { "security_dentry_create_files_as",
      p_install_ovl_create_or_link_hook,
      p_uninstall_ovl_create_or_link_hook,
      0,
diff --git a/src/modules/exploit_detection/syscalls/override/overlayfs/p_ovl_create_or_link/p_ovl_create_or_link.c b/src/modules/exploit_detection/syscalls/override/overlayfs/p_ovl_create_or_link/p_ovl_create_or_link.c
index bb4767e..5f64625 100644
--- a/src/modules/exploit_detection/syscalls/override/overlayfs/p_ovl_create_or_link/p_ovl_create_or_link.c
+++ b/src/modules/exploit_detection/syscalls/override/overlayfs/p_ovl_create_or_link/p_ovl_create_or_link.c
@@ -25,7 +25,7 @@
 char p_ovl_create_or_link_kretprobe_state = 0;
 
 static struct kretprobe p_ovl_create_or_link_kretprobe = {
-    .kp.symbol_name = "ovl_create_or_link",
+    .kp.symbol_name = "security_dentry_create_files_as",
     .handler = p_ovl_create_or_link_ret,
     .entry_handler = p_ovl_create_or_link_entry,
     .data_size = sizeof(struct p_ovl_create_or_link_data),
@@ -37,7 +37,7 @@ static struct kretprobe p_ovl_create_or_link_kretprobe = {
 void p_reinit_ovl_create_or_link_kretprobe(void) {
 
    memset(&p_ovl_create_or_link_kretprobe,0x0,sizeof(struct kretprobe));
-   p_ovl_create_or_link_kretprobe.kp.symbol_name = "ovl_create_or_link";
+   p_ovl_create_or_link_kretprobe.kp.symbol_name = "security_dentry_create_files_as";
    p_ovl_create_or_link_kretprobe.handler = p_ovl_create_or_link_ret;
    p_ovl_create_or_link_kretprobe.entry_handler = p_ovl_create_or_link_entry;
    p_ovl_create_or_link_kretprobe.data_size = sizeof(struct p_ovl_create_or_link_data);

@Adam-pi3 Sure, we need a fresh understanding of the underlying issue before fixing it for real.

ovl_create_or_link looks like it usually does two overrides followed by one revert. The first override is inside its call of ovl_override_creds. The second is in put_cred(override_creds(override_cred));, which is conditional upon prepare_creds having succeeded (nasty for us that it's conditional, but I assume that does succeed except when the system is badly out-of-memory). And then there's one revert_creds.

There are many other ovl_ functions that also use ovl_override_creds, but a few I checked look balanced.

BTW, that the call to security_dentry_create_files_as is also conditioned on prepare_creds having succeeded is actually good for us. Unfortunately, it's also conditioned on if (!attr->hardlink), which is irrelevant to the imbalance we're trying to work around, and thus is problematic for us.
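
To see how such an imbalance accumulates, here is a toy model in C. It assumes, as a simplification of the description above, that every override adds one cookie to the flag and every revert subtracts one. The names and the cookie value are made up for illustration, and none of this is LKRG or kernel code:

```c
#include <stdint.h>

/* Toy stand-in for the per-load normalization value (arbitrary). */
#define TOY_COOKIE 1000ULL

/* Assumed bookkeeping: one cookie per override, minus one per revert. */
static void toy_override(uint64_t *flag) { *flag += TOY_COOKIE; }
static void toy_revert(uint64_t *flag)   { *flag -= TOY_COOKIE; }

/*
 * Models one pass through a function that overrides credentials twice
 * but reverts only once, as described for ovl_create_or_link.
 */
static void toy_create_or_link(uint64_t *flag)
{
    toy_override(flag); /* via ovl_override_creds */
    toy_override(flag); /* via put_cred(override_creds(...)) */
    toy_revert(flag);   /* the single revert_creds */
}

/* Flag depth (in cookies) after `calls` unhooked passes, starting at 1x. */
static uint64_t toy_depth_after(unsigned calls)
{
    uint64_t flag = TOY_COOKIE;

    while (calls--)
        toy_create_or_link(&flag);
    return flag / TOY_COOKIE;
}
```

Under these assumptions each uncompensated pass leaves the flag one cookie higher, so toy_depth_after(8) is 9 and toy_depth_after(12) is 13. That would be consistent with the 9x and 13x multiples reported earlier, though it does not prove this is the exact path taken.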

@ajakk In fact, we can try a different approach. Would you be able to run the following command? Example output:

pi3@darkstar:~$ cat /proc/kallsyms |grep ovl_dentry_is_whiteout
0000000000000000 t ovl_dentry_is_whiteout	[overlay]

If you see the same function as being visible in your kernel (ovl_dentry_is_whiteout), then you can try the following LKRG patch and verify if docker works fine:

diff --git a/src/modules/exploit_detection/p_exploit_detection.c b/src/modules/exploit_detection/p_exploit_detection.c
index f0e987d..ec7c14c 100644
--- a/src/modules/exploit_detection/p_exploit_detection.c
+++ b/src/modules/exploit_detection/p_exploit_detection.c
@@ -983,7 +983,7 @@ notrace int p_verify_ovl_create_or_link(struct p_ed_process *p_source) {
 
    p_validate_off_flag(p_source,p_off,NULL);   // Validate
 
-   return p_off == 2 * p_global_cnt_cookie;
+   return p_off == 3 * p_global_cnt_cookie;
 }
 
 notrace void p_ed_is_off_off_wrap(struct p_ed_process *p_source) {
diff --git a/src/modules/exploit_detection/syscalls/override/overlayfs/p_ovl_create_or_link/p_ovl_create_or_link.c b/src/modules/exploit_detection/syscalls/override/overlayfs/p_ovl_create_or_link/p_ovl_create_or_link.c
index bb4767e..2a800bb 100644
--- a/src/modules/exploit_detection/syscalls/override/overlayfs/p_ovl_create_or_link/p_ovl_create_or_link.c
+++ b/src/modules/exploit_detection/syscalls/override/overlayfs/p_ovl_create_or_link/p_ovl_create_or_link.c
@@ -25,7 +25,8 @@
 char p_ovl_create_or_link_kretprobe_state = 0;
 
 static struct kretprobe p_ovl_create_or_link_kretprobe = {
-    .kp.symbol_name = "ovl_create_or_link",
+//    .kp.symbol_name = "ovl_create_or_link",
+    .kp.symbol_name = "ovl_dentry_is_whiteout",
     .handler = p_ovl_create_or_link_ret,
     .entry_handler = p_ovl_create_or_link_entry,
     .data_size = sizeof(struct p_ovl_create_or_link_data),
@@ -37,7 +38,8 @@ static struct kretprobe p_ovl_create_or_link_kretprobe = {
 void p_reinit_ovl_create_or_link_kretprobe(void) {
 
    memset(&p_ovl_create_or_link_kretprobe,0x0,sizeof(struct kretprobe));
-   p_ovl_create_or_link_kretprobe.kp.symbol_name = "ovl_create_or_link";
+//   p_ovl_create_or_link_kretprobe.kp.symbol_name = "ovl_create_or_link";
+   p_ovl_create_or_link_kretprobe.kp.symbol_name = "ovl_dentry_is_whiteout";
    p_ovl_create_or_link_kretprobe.handler = p_ovl_create_or_link_ret;
    p_ovl_create_or_link_kretprobe.entry_handler = p_ovl_create_or_link_entry;
    p_ovl_create_or_link_kretprobe.data_size = sizeof(struct p_ovl_create_or_link_data);

@Adam-pi3 Yes, let's try that. However, don't we even more importantly need to update the hooked function name in p_exploit_detection.c?

@solardiz This is just a temporary patch to verify whether @ajakk's environment handles it correctly. We do not change any function names here; we only changed the logic inside the function body and the name of the hooked function (while still using the old LKRG function name).
If this patch is fine, we would need to rename the directories and filenames together with the function names.

ajakk commented

@Adam-pi3 I think that test could be rather time-consuming for @ajakk if he's not used to that on this system (uses a distro kernel). Maybe it's better use of everyone's time to ask @ajakk to experiment with a possible LKRG fix instead.

It's not too terribly time-consuming: there's a binary distribution kernel package (sys-kernel/gentoo-kernel-bin) and a package that compiles the distribution kernel locally (sys-kernel/gentoo-kernel, which is what I'm using) so that the user can change build-time specifics. In this case, patching the kernel is as simple as generating the patch, putting it in the right directory, and rebuilding the kernel.

In any case, the latest patch seems to work! Happy to test the kernel patch too, if necessary. LKRG doesn't produce any log messages if I try to run Docker containers, and the only messages I see when lkrg stops/starts are:

[84470.729871] LKRG: DYING: LKRG unloaded
[84470.898793] LKRG: ALIVE: Loading LKRG
[84470.901081] Freezing user space processes ... (elapsed 0.001 seconds) done.
[84470.902724] OOM killer disabled.
[84470.904816] LKRG: ISSUE: Can't enforce SELinux validation (CONFIG_GCC_PLUGIN_RANDSTRUCT detected)
[84471.055414] LKRG: ISSUE: [kretprobe] register_kretprobe() for <lookup_fast> failed! [err=-2]
[84471.055416] LKRG: ISSUE: Won't enforce pCFI validation on 'lookup_fast'
[84471.232921] LKRG: ALIVE: LKRG initialized successfully
[84471.232922] OOM killer enabled.
[84471.232923] Restarting tasks ... done.

@Adam-pi3 Sure. I wrongly thought we actually used function names from the p_functions_hooks array for hooking, but I now see they're only for reporting. Maybe that's actually something to change in an unrelated cleanup later.

Great news that the patch worked. Meanwhile, I was thinking - maybe this issue is a reminder for us to reconsider and instead of insisting that off is off where expected, just check its integrity (that it's a multiple) and reset it. Indeed, that would weaken LKRG's self-defense a little bit, but I think that's acceptable - wouldn't be its weakest point anyway - e.g., at least we use random magic values to protect off, whereas for task credentials we don't (maybe we should, but even then we wouldn't protect them more than by that, so protecting off more would still be inconsistent). And then maybe we can drop the special handling of overlay. What do you think?

@solardiz I see this problem the other way around. Instead of 'weakening' off, we should add hardening in other places (like uid, as you mentioned). Let's take this discussion offline (meeting?). I will look a bit more at the proposed patch (comparing older kernels), and if it is stable and portable, I will prepare a PR.

Summary of evaluation:

  • The proposed patch should be valid and backwards compatible back to kernel 4.16 (inclusive).
  • The ovl_dentry_is_whiteout function first appeared in kernel 4.10, but with slightly different logic until version 4.16: in ovl_rename, the override is called before ovl_dentry_is_whiteout. The logic in the proposed patch should nevertheless handle this situation correctly (not tested).
  • The ovl_create_or_link function first appeared in kernel 3.18, whereas ovl_dentry_is_whiteout appeared only in 4.10. What do we do for the kernels in between?
    • Up to kernel 4.6 (inclusive), the logic of ovl_create_or_link is correct and does not require any correction.
      • The exception, of course, is 4.4 LTS, where the new, problematic logic was back-ported in kernel 4.4.179.
    • Between the kernels 4.10 until 4.16 we must rely on ovl_create_or_link hook.
      • It is worth mentioning that these old kernels are usually compiled with an older compiler and a less aggressive optimizer.

Thanks for your analysis, @Adam-pi3. I've confirmed some of this (looking at those kernels' code) and didn't find any discrepancies with your description. Meanwhile, I got the below thoughts:

  1. Per Adam's findings above, we also have an issue on most kernels below and including 4.6, where our current code tries to correct for unbalanced override/revert that doesn't actually exist. I think the impact is resetting off too early, perhaps leading to false positives from credentials mismatches? We don't currently restrict the hooking of ovl_create_or_link by kernel version - we just try and hook where we can. So we need to fix more than just the inability to hook ovl_create_or_link on some kernels.

  2. The option of "weakening" off checks (and dropping of all ovl_ hooking) that I brought up above is not good in that it'd permit off to silently stay set (after a path going through ovl_create_or_link) until a hooked call where that's definitely unexpected is invoked. There can be a variety of credentials-using calls in that period, especially if the condition is deliberately triggered and calls are carefully chosen by an exploit. In fact, this same weakness exists on systems where this present issue (reported by @ajakk here) shows up now, whereas with successful hooking of ovl_create_or_link (as is probably the case on most systems that have overlay loaded at all) we avoid it (as long as it's the only such place in the kernel).

  3. The fact that unbalanced override/revert can stay in the kernel as a non-issue suggests that maybe we are wrong or just unnecessarily strict in expecting and requiring the balance. The first idea would have been to reset (not decrement) off on revert_creds, however I recall @Adam-pi3 said (in private discussions) that actual nesting was also seen, where it would be too early to reset off on the first revert (would trigger false positives from credentials discrepancies). Another obvious idea would be, instead of having that off flag at all, to update our shadow credentials on overrides and reverts. However, that would make these functions just as usable as commit_creds by exploits, which would then bypass LKRG by simply using override_creds in place of commit_creds. This is also why we don't just hook and update on commit_creds, but instead hook the many other places where credentials are modified. Finally, moving to the idea that I think is new (wasn't discussed privately before): maybe on override_creds we can store the new credentials not in our main shadow credentials (that we validate current ones against when not off), but in a separate new overridden credentials stack (essentially an array indexed by off depth), which we'd only use on revert_creds to see how many levels up we're reverting to. Then we end up handling unbalanced override/revert in a generic manner. The only drawback I see is it's actually more complicated than hooking the different ovl_ functions and doing some hard-coded checks/adjustments.

  4. Unfortunately, the issue with keeping off set for too long (over many non-hooked calls into the kernel) is not avoided by implementing any of these ideas. Exploits will continue to be able to abuse calls to override_creds to set off. So maybe the option of accepting that and "weakening" off checks doesn't make things much worse (they're pretty bad anyhow). Similarly, maybe even updating shadow credentials on commit_creds (and then ditto on override_creds and revert_creds) is within consideration, which would let us simplify LKRG a lot. As I now recall we had discussed privately before, this is something we could do if we accept relying only on pCFI to protect from exploits' calls to (or ROP'ing into) those functions. (Our decision not to update on commit_creds pre-dates the introduction of pCFI into LKRG.) Maybe that's the way to go, if we can't reasonably do much better anyway?
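The "overridden credentials stack" from item 3 could look roughly like the user-space sketch below. Everything here is hypothetical (struct, function names, the depth bound), and plain pointers stand in for const struct cred *. The assumption is that the hook on override_creds records the newly installed credentials, and the hook on revert_creds searches the levels below the current one for the credentials being reverted to.

```c
#include <assert.h>
#include <stddef.h>

#define OVR_MAX_DEPTH 16   /* arbitrary bound chosen for this sketch */

/* Hypothetical per-task state; 'depth' plays the role of the off counter. */
struct ovr_stack {
    const void *cred[OVR_MAX_DEPTH]; /* creds installed by each override */
    size_t depth;
};

/* Called on override_creds(new_cred). Returns -1 if nesting is too deep. */
static int ovr_push(struct ovr_stack *s, const void *new_cred)
{
    assert(s != NULL);
    if (s->depth >= OVR_MAX_DEPTH)
        return -1;
    s->cred[s->depth++] = new_cred;
    return 0;
}

/*
 * Called on revert_creds(target). Looks below the current level for the
 * credentials being restored and drops every level above the match.
 * If nothing matches, the task is assumed to be back at its base
 * credentials (depth 0), where the normal shadow-credentials check
 * applies again. Returns the new depth.
 */
static size_t ovr_revert(struct ovr_stack *s, const void *target)
{
    size_t i;

    assert(s != NULL);
    i = s->depth ? s->depth - 1 : 0;
    while (i > 0 && s->cred[i - 1] != target)
        i--;
    s->depth = i;
    return i;
}
```

A single revert that skips several levels (the unbalanced case) then collapses the depth in one step, while ordinary nesting unwinds one level at a time. The search deliberately starts below the current top so that two nested overrides to the same credentials still unwind correctly.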

I don't mind implementing a fix like Adam seems to have suggested above for now, but longer-term we really might want to reconsider whether we possibly incur a lot of complexity and kernel incompatibility risks for too little gain in protection.

Between the kernels 4.10 until 4.16 we must rely on ovl_create_or_link hook.

I guess you meant between 4.7 and 4.9?

Yes, 4.7 until 4.9
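With that correction, one possible reading of the version split from the evaluation above is sketched below. This is a user-space illustration only: the enum and function names are invented, KVER mirrors the encoding of the kernel's KERNEL_VERSION() macro, and treating 4.4.179 and later 4.4.x releases like 4.7 to 4.9 is an assumption based on the back-port mentioned in the summary.

```c
/* Same encoding as the kernel's KERNEL_VERSION(), redefined for user space. */
#define KVER(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical strategy names for this sketch; not LKRG identifiers. */
enum ovl_strategy {
    OVL_NO_HOOK,                  /* ovl_create_or_link does not exist yet */
    OVL_CREATE_OR_LINK_BALANCED,  /* override/revert balanced, no fix-up */
    OVL_CREATE_OR_LINK_CORRECTED, /* hook ovl_create_or_link and fix up off */
    OVL_DENTRY_IS_WHITEOUT        /* hook ovl_dentry_is_whiteout instead */
};

static enum ovl_strategy ovl_pick_strategy(unsigned int v)
{
    if (v < KVER(3, 18, 0))
        return OVL_NO_HOOK;
    if (v >= KVER(4, 10, 0))
        return OVL_DENTRY_IS_WHITEOUT;   /* 4.10 to 4.15: reported as untested */
    if (v >= KVER(4, 7, 0))
        return OVL_CREATE_OR_LINK_CORRECTED;
    /* Assumption: 4.4 LTS from 4.4.179 on behaves like 4.7 to 4.9. */
    if (v >= KVER(4, 4, 179) && v < KVER(4, 5, 0))
        return OVL_CREATE_OR_LINK_CORRECTED;
    return OVL_CREATE_OR_LINK_BALANCED;  /* 3.18 to 4.6 otherwise */
}
```

In a real module this choice would more likely be made at build time with #if LINUX_VERSION_CODE comparisons. Distribution kernels with their own back-ports may not fit a table like this at all, which is part of the kernel incompatibility risk raised earlier in the thread.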

@ajakk We've just merged Adam's fix for this issue. We'd appreciate you testing. I might make a follow-up PR cleaning up this fix, so you might want to test before and/or after that. Thanks!

I might make a follow-up PR cleaning up this fix

@ajakk This is now #224, hopefully to be merged real soon.