coreos/bugs

pthread_create priority thread returns EPERM when run as root

Closed this issue · 13 comments

Summary:

Setting a thread priority to anything other than zero using a scheduling policy of Round Robin fails with an EPERM when running as root.

Details:

After building the SDK and an image on Ubuntu 14.04 LTS, I tried to set the priority of a thread created as root and received an EPERM. To investigate further, I took the source code below directly from the man page for pthread_setschedparam; it too returns EPERM when run as root with the command-line options -ar20 -ie. To make sure neither my SDK nor my image was at fault, I scp'd the program to a Rackspace machine running the Alpha release, and it also failed with EPERM.

To compile: gcc -Wall pthreads_sched_test.c -lpthread -o sched_test

Usage:

Usage: ./sched_test [options]
Options are:
    -a<policy><prio> Set scheduling policy and priority in
                     thread attributes object
                     <policy> can be
                         f  SCHED_FIFO
                         r  SCHED_RR
                         o  SCHED_OTHER
    -A               Use default thread attributes object
    -i {e|i}         Set inherit scheduler attribute to
                     'explicit' or 'inherit'
    -m<policy><prio> Set scheduling policy and priority on
                     main thread before pthread_create() call

Receive an EPERM executing as root:

Set the scheduling policy to SCHED_RR (r), the priority to 20, and the inherit-scheduler attribute to "explicit": ./sched_test -ar20 -ie

localhost core # ./sched_test -ar20 -ie
Scheduler settings of main thread
    policy=SCHED_OTHER, priority=0

Scheduler settings in 'attr'
    policy=SCHED_RR, priority=20
    inheritsched is EXPLICIT

pthread_create: Operation not permitted

Ulimits are set to "unlimited":

localhost core # ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 3825
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) unlimited
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 3825
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Contents of /etc/os-release

core@localhost ~ $ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=734.0.0+2015-07-05-1552
VERSION_ID=734.0.0
BUILD_ID=2015-07-05-1552
PRETTY_NAME="CoreOS 734.0.0+2015-07-05-1552"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

Source code from pthread_setschedparam man page:

/* pthreads_sched_test.c */

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>

#define handle_error_en(en, msg) \
       do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)

static void
usage(char *prog_name, char *msg)
{
   if (msg != NULL)
       fputs(msg, stderr);

   fprintf(stderr, "Usage: %s [options]\n", prog_name);
   fprintf(stderr, "Options are:\n");
#define fpe(msg) fprintf(stderr, "\t%s", msg);          /* Shorter */
   fpe("-a<policy><prio> Set scheduling policy and priority in\n");
   fpe("                 thread attributes object\n");
   fpe("                 <policy> can be\n");
   fpe("                     f  SCHED_FIFO\n");
   fpe("                     r  SCHED_RR\n");
   fpe("                     o  SCHED_OTHER\n");
   fpe("-A               Use default thread attributes object\n");
   fpe("-i {e|i}         Set inherit scheduler attribute to\n");
   fpe("                 'explicit' or 'inherit'\n");
   fpe("-m<policy><prio> Set scheduling policy and priority on\n");
   fpe("                 main thread before pthread_create() call\n");
   exit(EXIT_FAILURE);
}

static int
get_policy(char p, int *policy)
{
   switch (p) {
   case 'f': *policy = SCHED_FIFO;     return 1;
   case 'r': *policy = SCHED_RR;       return 1;
   case 'o': *policy = SCHED_OTHER;    return 1;
   default:  return 0;
   }
}

static void
display_sched_attr(int policy, struct sched_param *param)
{
   printf("    policy=%s, priority=%d\n",
           (policy == SCHED_FIFO)  ? "SCHED_FIFO" :
           (policy == SCHED_RR)    ? "SCHED_RR" :
           (policy == SCHED_OTHER) ? "SCHED_OTHER" :
           "???",
           param->sched_priority);
}

static void
display_thread_sched_attr(char *msg)
{
   int policy, s;
   struct sched_param param;

   s = pthread_getschedparam(pthread_self(), &policy, &param);
   if (s != 0)
       handle_error_en(s, "pthread_getschedparam");

   printf("%s\n", msg);
   display_sched_attr(policy, &param);
}

static void *
thread_start(void *arg)
{
   display_thread_sched_attr("Scheduler attributes of new thread");

   return NULL;
}

int
main(int argc, char *argv[])
{
   int s, opt, inheritsched, use_null_attrib, policy;
   pthread_t thread;
   pthread_attr_t attr;
   pthread_attr_t *attrp;
   char *attr_sched_str, *main_sched_str, *inheritsched_str;
   struct sched_param param;

   /* Process command-line options */

   use_null_attrib = 0;
   attr_sched_str = NULL;
   main_sched_str = NULL;
   inheritsched_str = NULL;

   while ((opt = getopt(argc, argv, "a:Ai:m:")) != -1) {
       switch (opt) {
       case 'a': attr_sched_str = optarg;      break;
       case 'A': use_null_attrib = 1;          break;
       case 'i': inheritsched_str = optarg;    break;
       case 'm': main_sched_str = optarg;      break;
       default:  usage(argv[0], "Unrecognized option\n");
       }
   }

   if (use_null_attrib &&
           (inheritsched_str != NULL || attr_sched_str != NULL))
       usage(argv[0], "Can't specify -A with -i or -a\n");

   /* Optionally set scheduling attributes of main thread,
      and display the attributes */

   if (main_sched_str != NULL) {
       if (!get_policy(main_sched_str[0], &policy))
           usage(argv[0], "Bad policy for main thread (-m)\n");
       param.sched_priority = strtol(&main_sched_str[1], NULL, 0);

       s = pthread_setschedparam(pthread_self(), policy, &param);
       if (s != 0)
           handle_error_en(s, "pthread_setschedparam");
   }

   display_thread_sched_attr("Scheduler settings of main thread");
   printf("\n");

   /* Initialize thread attributes object according to options */

   attrp = NULL;

   if (!use_null_attrib) {
       s = pthread_attr_init(&attr);
       if (s != 0)
           handle_error_en(s, "pthread_attr_init");
       attrp = &attr;
   }

   if (inheritsched_str != NULL) {
       if (inheritsched_str[0] == 'e')
           inheritsched = PTHREAD_EXPLICIT_SCHED;
       else if (inheritsched_str[0] == 'i')
           inheritsched = PTHREAD_INHERIT_SCHED;
       else
           usage(argv[0], "Value for -i must be 'e' or 'i'\n");

       s = pthread_attr_setinheritsched(&attr, inheritsched);
       if (s != 0)
           handle_error_en(s, "pthread_attr_setinheritsched");
   }

   if (attr_sched_str != NULL) {
       if (!get_policy(attr_sched_str[0], &policy))
           usage(argv[0],
                   "Bad policy for 'attr' (-a)\n");
       param.sched_priority = strtol(&attr_sched_str[1], NULL, 0);

       s = pthread_attr_setschedpolicy(&attr, policy);
       if (s != 0)
           handle_error_en(s, "pthread_attr_setschedpolicy");
       s = pthread_attr_setschedparam(&attr, &param);
       if (s != 0)
           handle_error_en(s, "pthread_attr_setschedparam");
   }

   /* If we initialized a thread attributes object, display
      the scheduling attributes that were set in the object */

   if (attrp != NULL) {
       s = pthread_attr_getschedparam(&attr, &param);
       if (s != 0)
           handle_error_en(s, "pthread_attr_getschedparam");
       s = pthread_attr_getschedpolicy(&attr, &policy);
       if (s != 0)
           handle_error_en(s, "pthread_attr_getschedpolicy");

       printf("Scheduler settings in 'attr'\n");
       display_sched_attr(policy, &param);

       s = pthread_attr_getinheritsched(&attr, &inheritsched);
       if (s != 0)
           handle_error_en(s, "pthread_attr_getinheritsched");
       printf("    inheritsched is %s\n",
               (inheritsched == PTHREAD_INHERIT_SCHED)  ? "INHERIT" :
               (inheritsched == PTHREAD_EXPLICIT_SCHED) ? "EXPLICIT" :
               "???");
       printf("\n");
   }

   /* Create a thread that will display its scheduling attributes */

   s = pthread_create(&thread, attrp, &thread_start, NULL);
   if (s != 0)
       handle_error_en(s, "pthread_create");

   /* Destroy unneeded thread attributes object */

   if (!use_null_attrib) {
       s = pthread_attr_destroy(&attr);
       if (s != 0)
           handle_error_en(s, "pthread_attr_destroy");
   }

   s = pthread_join(thread, NULL);
   if (s != 0)
       handle_error_en(s, "pthread_join");

   exit(EXIT_SUCCESS);
}

this workaround here seems to fix the issue but i haven't found the underlying cause yet. what other systems have you tried this on and what are the results?

Yes. That fixed it - thank you.
sysctl -w kernel.sched_rt_runtime_us=-1

On all of the other systems our code runs on, it works out of the box without setting this parameter:
CentOS 5 and 6, Debian 7, SLES 11 SP3, SLES 12, and the Ubuntu 12 and 14 LTS releases. If you want, I can run this test program on them, but our code calls the same functions as the example.
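To compare this setting across machines from inside a program rather than through the sysctl command, something like the following could work. It is only a sketch: it assumes the sysctl is exposed at /proc/sys/kernel/sched_rt_runtime_us (the usual procfs mapping for kernel.* sysctls), and the function names parse_rt_runtime, read_rt_runtime, and rt_throttling_disabled are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse the text of a sysctl-style file such as "950000\n" or "-1\n".
   Returns 0 and stores the value in *out on success, -1 on malformed input. */
int
parse_rt_runtime(const char *text, long *out)
{
    char *end;
    long v;

    assert(text != NULL && out != NULL);

    errno = 0;
    v = strtol(text, &end, 10);
    if (errno != 0 || end == text)
        return -1;                      /* overflow or no digits at all */

    while (*end == ' ' || *end == '\n') /* allow trailing whitespace */
        end++;
    if (*end != '\0')
        return -1;                      /* trailing garbage */

    *out = v;
    return 0;
}

/* Read the first line of `path` and parse it as a signed integer.
   Returns 0 on success, -1 if the file cannot be read or parsed. */
int
read_rt_runtime(const char *path, long *out)
{
    char buf[64];
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_rt_runtime(buf, out);
}

/* A value of -1 means the real-time runtime limit is switched off,
   which is what the workaround in this thread sets. */
int
rt_throttling_disabled(long runtime_us)
{
    return runtime_us == -1;
}
```

Calling read_rt_runtime("/proc/sys/kernel/sched_rt_runtime_us", &v) should give 950000 on the Ubuntu and Arch machines shown in this thread, and -1 on a machine where the workaround has been applied.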

Update: Here is Ubuntu 14.04 LTS
(Linux trusty 3.13.0-57-generic #95-Ubuntu SMP Fri Jun 19 09:28:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux)

root@trusty:~# sysctl kernel.sched_rt_runtime_us
kernel.sched_rt_runtime_us = 950000

root@trusty# ./sched_test -ar20 -ie
Scheduler settings of main thread
    policy=SCHED_OTHER, priority=0

Scheduler settings in 'attr'
    policy=SCHED_RR, priority=20
    inheritsched is EXPLICIT

Scheduler attributes of new thread
    policy=SCHED_RR, priority=20

Here is Arch Linux:
(Linux vagrant-arch.vagrantup.com 4.0.7-2-ARCH #1 SMP PREEMPT Tue Jun 30 07:50:21 UTC 2015 x86_64 GNU/Linux)

[root@vagrant-arch cppTest]#  sysctl kernel.sched_rt_runtime_us
kernel.sched_rt_runtime_us = 950000

[root@vagrant-arch cppTest]# ./sched_test -ar20 -ie
Scheduler settings of main thread
    policy=SCHED_OTHER, priority=0

Scheduler settings in 'attr'
    policy=SCHED_RR, priority=20
    inheritsched is EXPLICIT

Scheduler attributes of new thread
    policy=SCHED_RR, priority=20

@mischief: I have updated my reply with two examples. Do you need more information from me to investigate this bug? I am more than willing to give output on other distributions if you think that will help.

what may or may not be very related, systemd defaults put ssh connections into a non-RT cgroup. if you're trying to ssh in and run FIFO/RR tasks, it just won't work. you need to move your shell into the default cpu:/ group which does permit priority scheduling.

cgclassify -g cpu:/ $$

your systemd unit file can start your service and get priority scheduling if you put ControlGroup=cpu:/ under [Service]

@FirefighterBlu3 Thank you - this is good information. I just built an image from the CoreOS SDK and booted it using VMware Fusion. I tried the same test from the console window and it failed. I am not up to speed on all of systemd yet - so this is all good info.

Here is the screenshot from the console of my coreos-image:
[screenshot, 2015-07-22 8:57:55 AM]

yup. now run cgclassify -g cpu:/ $$

then retry your sched_test program.

@FirefighterBlu3
Unless I missed something the cgclassify command is not part of the CoreOS distro.

you can do it alternatively with: echo $$ > /sys/fs/cgroup/cpu/tasks

(check path, this is from memory)
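A service that cannot rely on a shell could do the equivalent of that echo itself before creating its real-time threads. The sketch below assumes a cgroup v1 layout with the cpu controller mounted at /sys/fs/cgroup/cpu, as in the command above; move_pid_to_cgroup is a hypothetical helper name, not an existing API. Note that on cgroup v1 the tasks file moves a single thread, so writing the PID moves only the main thread.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write `pid` followed by a newline to the cgroup "tasks" file at
   `tasks_path`, which asks the kernel to move that task into the cgroup.
   Returns 0 on success or an errno value on failure. */
int
move_pid_to_cgroup(const char *tasks_path, pid_t pid)
{
    char buf[32];
    int fd, len, err;
    ssize_t written;

    assert(tasks_path != NULL);

    len = snprintf(buf, sizeof(buf), "%ld\n", (long)pid);
    if (len < 0 || (size_t)len >= sizeof(buf))
        return EINVAL;

    /* No O_CREAT: the tasks file is provided by the kernel. */
    fd = open(tasks_path, O_WRONLY);
    if (fd == -1)
        return errno;

    written = write(fd, buf, (size_t)len);
    if (written == (ssize_t)len)
        err = 0;
    else if (written == -1)
        err = errno;
    else
        err = EIO;                      /* short write */

    if (close(fd) == -1 && err == 0)
        err = errno;
    return err;
}
```

For example, move_pid_to_cgroup("/sys/fs/cgroup/cpu/tasks", getpid()) called early in main() (as root) would mirror the shell command; a nonzero return value can be passed to strerror() for a message.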

That worked. It looks like even the console is restricted. Very interesting.
Clearly I have much to learn about systemd. Where can I get more information about what systemd is doing with cgroups? Is this an attribute of systemd, of CoreOS, or a cgroups best practice?

[screenshot, 2015-07-22 9:39:44 AM]

@FirefighterBlu3
What's also interesting is that other distros don't appear to do this out of the box. Here is Arch Linux's output from an ssh session:

[root@vagrant-arch vagrant]# cat /proc/$$/cgroup
8:blkio:/
7:memory:/
6:cpuset:/
5:net_cls:/
4:devices:/user.slice
3:cpu,cpuacct:/
2:freezer:/
1:name=systemd:/user.slice/user-1000.slice/session-c2.scope
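To check the same thing from inside a program, the cpu line can be picked out of that listing. This is a rough sketch that assumes the cgroup v1 line format shown above (ID:controller-list:path); the function names are illustrative only. A result of "/" corresponds to the root cpu group that the earlier comments say permits priority scheduling.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if the comma-separated controller list (`len` bytes long)
   contains the exact token "cpu", so "cpuset" and "cpuacct" do not match. */
int
has_cpu_controller(const char *list, size_t len)
{
    size_t start = 0, i;

    for (i = 0; i <= len; i++) {
        if (i == len || list[i] == ',') {
            if (i - start == 3 && strncmp(list + start, "cpu", 3) == 0)
                return 1;
            start = i + 1;
        }
    }
    return 0;
}

/* Parse one "ID:controllers:path" line. If the controller list includes
   cpu, copy the path (without the trailing newline) into `out` and
   return 1. Return 0 for any other line, or if `out` is too small. */
int
cpu_cgroup_from_line(const char *line, char *out, size_t outlen)
{
    const char *c1, *c2;
    size_t n;

    assert(line != NULL && out != NULL && outlen > 0);

    c1 = strchr(line, ':');
    if (c1 == NULL)
        return 0;
    c2 = strchr(c1 + 1, ':');
    if (c2 == NULL)
        return 0;
    if (!has_cpu_controller(c1 + 1, (size_t)(c2 - (c1 + 1))))
        return 0;

    n = strcspn(c2 + 1, "\n");
    if (n >= outlen)
        return 0;
    memcpy(out, c2 + 1, n);
    out[n] = '\0';
    return 1;
}

/* Scan a file such as /proc/self/cgroup for the cpu controller line.
   Returns 1 and fills `out` if found, 0 otherwise. */
int
cpu_cgroup_of_file(const char *path, char *out, size_t outlen)
{
    char line[512];
    int found = 0;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return 0;
    while (!found && fgets(line, sizeof(line), f) != NULL)
        found = cpu_cgroup_from_line(line, out, outlen);
    fclose(f);
    return found;
}
```

Calling cpu_cgroup_of_file("/proc/self/cgroup", buf, sizeof(buf)) at startup and comparing buf with "/" would show whether the process is already in the root cpu group before it tries SCHED_RR or SCHED_FIFO.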

perhaps it depends on your installation. my Arch installation that i'm working on does exactly this and i have a special need for sched_fifo so i had to figure out why it didn't work when by all expected accounts, it should have. unfortunately the documentation searches on google for cgroup information are still rather sparse but they're starting to get a lot better. arch has some good documentation, redhat has some too. if i had stumbled upon https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-Moving_a_Process_to_a_Control_Group.html some time back, then i wouldn't have been pulling my hair out for hours.

Still an issue with 1262.0.0.

Thank you for reporting this issue. Unfortunately, we don't think we'll end up addressing it in Container Linux.

We're now working on Fedora CoreOS, the successor to Container Linux, and we expect most major development to occur there instead. Meanwhile, Container Linux will be fully maintained into 2020 but won't see many new features. We appreciate your taking the time to report this issue and we're sorry that we won't be able to address it.