aliyun/plugsched

Docker build error and how to use host container

Opened this issue · 107 comments

Thank you for developing this project.
I'm using this repository for research activities in Japan.

In my Ubuntu 22 environment, when executing

sudo docker build -t plugsched_host .

this error occurs.
Should I edit the Dockerfile?

Errors during downloading metadata for repository 'epel':
  - Curl error (6): Couldn't resolve host name for http://mirrors.cloud.aliyuncs.com/epel/8/Everything/x86_64/repodata/repomd.xml [Could not resolve host: mirrors.cloud.aliyuncs.com]
Error: Failed to download metadata for repo 'epel': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried

Second question: can I use the container built from this Dockerfile as the host for plugsched?
I imagine it is difficult to use a kernel module from inside a container.

dtcccc commented

For the first question: I'm sorry, we found that "mirrors.cloud.aliyuncs.com" is an internal address that only ECS instances on Aliyun can access. You may edit the Dockerfile to change the address to "mirrors.aliyun.com" in /etc/yum.repos.d/epel.repo after installing epel-aliyuncs-release, or directly use our image (plugsched-registry.cn-hangzhou.cr.aliyuncs.com/plugsched/plugsched-sdk).

For the second question: are you trying to install the plugsched rpm on Ubuntu? We do not support deb yet, but we can suggest a simulated way to install plugsched on the host (not in a container).

dtcccc commented

First, create a working dir:
mkdir -p /var/plugsched/$(uname -r)

Then, move necessary files into it:
install -m 755 working/symbol_resolve/symbol_resolve /var/plugsched/$(uname -r)/symbol_resolve
install -m 755 kernel/sched/mod/scheduler.ko /var/plugsched/$(uname -r)/scheduler.ko
install -m 444 working/tainted_functions /var/plugsched/$(uname -r)/tainted_functions
install -m 755 working/scheduler-installer /var/plugsched/$(uname -r)/scheduler-installer
install -m 755 working/hotfix_conflict_check /var/plugsched/$(uname -r)/hotfix_conflict_check

Last, run the installer script:
/var/plugsched/$(uname -r)/scheduler-installer install

I haven't tested this myself, so feel free to report any problems you meet.

Thank you for your reply.
So the purpose of the container is to prepare the necessary files (which are then copied to the host), right?

If I use CentOS as the host, can I actually run the new scheduler this way?

In addition, I have one more question.
We implement a new scheduler by editing files under kernel/sched/mod/ after running the boundary analyzer (plugsched-cli init?), right?
How do I build this new scheduler as a kernel module?
Are the source files under src/ (like main.c and sched_rebuild.c) involved?

dtcccc commented

Thank you for your reply. So the purpose of the container is to prepare the necessary files (which are then copied to the host), right?

Yes

If I use CentOS as the host, can I actually run the new scheduler this way?

Yes. After copying the rpm from the container to the host, you can use "rpm -i" to install the new scheduler directly on the host.

In addition, I have one more question. We implement a new scheduler by editing files under kernel/sched/mod/ after running the boundary analyzer (plugsched-cli init?), right? How do I build this new scheduler as a kernel module? Are the source files under src/ (like main.c and sched_rebuild.c) involved?

main.c and sched_rebuild.c under src/ are copied to kernel/sched/mod/ automatically, so yes, they are involved.

You can refer to cmd_build() in cli.py to see how it works in detail. The key file is kernel/sched/mod/Makefile (which is copied from src/Makefile), and the kernel module is built as kernel/sched/mod/scheduler.ko

Directly running insmod scheduler.ko may fail because some additional work is still needed; see scheduler-installer.

Thank you very much.

When executing
plugsched-cli init $(uname -r) ./kernel ./scheduler
in the podman container, the error below was generated.

(screenshot: error230801)

I use a Docker image I built myself from the Dockerfile.
Also, I use /work as the working dir instead of /tmp/work because of tmpfs capacity.

If you have any ideas about this error, please tell me.

dtcccc commented

I'm sorry: we found that our anolisos:latest image recently updated the gcc minor version, so gcc-python-plugin needs a rebuild. We will fix this problem soon.

As a workaround, please downgrade gcc (using yum install):
gcc-8.5.0-10.1.0.3.an8
gcc-c++-8.5.0-10.1.0.3.an8
gcc-plugin-devel-8.5.0-10.1.0.3.an8
libstdc++-static-8.5.0-10.1.0.3.an8
gcc-python-plugin-0.17-1.4.an8

What's more, we found there may be something wrong with "pip3 install pyyaml".

We suggest just replacing:
RUN yum install epel-aliyuncs-release -y &&
with:
RUN yum install epel-release -y &&
in the Dockerfile. This will use the source from mirrors.fedoraproject.org directly; then run "yum install python3-pyyaml".

After modifying the two points you described (gcc version and epel-release), the gcc-python-plugin error is fixed. Thank you.
(My modified Dockerfile is this.)
However, another error was generated.

(screenshot: error230802)

ImportError: cannot import name 'CLoader'
Is this the pyyaml error you mentioned?

What's more, we found there may be something wrong with "pip3 install pyyaml".

What should I do?

dtcccc commented

Do not use
pip3 install pyyaml

Use
yum install python3-pyyaml
instead

I had overlooked that.
Thank you.

In my environment, executing plugsched-cli init $(uname -r) ./kernel ./scheduler in a container has not finished after 3 hours. (My current environment is not powerful.)
Is this process really that time-consuming?
For example, is it more time-consuming than compiling the Linux kernel?

dtcccc commented

Is this process really that time-consuming?

Yes, it is time-consuming, but...
Hmm... it should be faster than compiling the whole Linux kernel, I think.

Is your terminal printing info continuously, or is it stuck at one step?

dtcccc commented

We usually work on servers with 100+ CPUs, so in our environment "init" only costs several minutes.

After "init", the "build" step is very fast.

We suggest running the "init" task in the background. When the task is done, you might back up the "./scheduler" folder if you want to develop different branches.

Before pressing Ctrl-C, CC/LD/AR lines are printed continuously.

(Screenshot from 2023-08-02 23-10-04)

By the way, I use kernel 6.4.6.
However, the content of boundary.yaml is the same as that of kernel 5.10, because I couldn't find how to modify it.
Is this related to the unfinished process?

Do we need any options for parallel processing?

dtcccc commented

By the way, I use kernel 6.4.6.

Oh... that's really new. We've not tested on this version. The boundary of 5.10 may not fit 6.4, but I think plugsched can still work when directly using the 5.10 boundary. The mismatched boundary config will result in a smaller scope of modifiable code (many functions will be analyzed as "outer functions", so you cannot modify them). Functions that cannot be edited will be removed, or will have a comment ("DON'T MODIFY INLINE EXTERNAL FUNCTION") below them in kernel/sched/mod/.

Do we need any options for parallel processing?

No need. It will automatically use all CPUs.

Which process prints CC, LD, and AR?
collect.py, analyze.py, extract.py, or something else?
It seems to be the most time-consuming process.

It seems to be collect.py.
Which part of collect.py?
I cannot find strings like gcc or ld in collect.py.

In addition, can you tell what this error is?
It seems no error reason is written.
A hardware error?
(Screenshot from 2023-08-07 17-16-45)

Is it because there is no target named modules_prepare declared in Makefile.plugsched?

dtcccc commented

It seems to be collect.py. Which part of collect.py? I cannot find strings like gcc or ld in collect.py.

See src/Makefile.plugsched

collect: modules_prepare
	$(MAKE) CFLAGS_KERNEL="$(GCC_PLUGIN_FLAGS)" \
		CFLAGS_MODULE="$(GCC_PLUGIN_FLAGS)" $(vmlinux-dirs)

In addition, can you tell what this error is? It seems no error reason is written.

Due to parallel processing, you need to page up to find the reason.

The "collect" process is similar to compiling the Linux kernel.

Thank you for replying.
I see: the process is like building the Linux kernel, and collect.py does its work during the compilation.
Where is vmlinux-dirs declared?
How does make collect differ from a normal Linux kernel build?

(Screenshot from 2023-08-07 19-26-59)
The error is in the lower part of this screenshot.
Is CONFIG_DEBUG_INFO_BTF one of the host's configuration options?

dtcccc commented

You need to yum install dwarves.

Where is vmlinux-dirs declared? How does make collect differ from a normal Linux kernel build?

I've just realized that these are related questions.
$(vmlinux-dirs) is defined in the top-level Makefile in kernel 5.10:

vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, \
		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
		     $(libs-y) $(libs-m)))

With make $(vmlinux-dirs), we only compile the necessary files, not the whole Linux kernel (so we do not generate BTF either).

But since Linux 6.1 it has disappeared, so our Makefile's "collect" target builds the whole kernel...

Thank you.
I installed dwarves in the Docker image, and the error was fixed.

But since Linux 6.1 it has disappeared, so our Makefile's "collect" target builds the whole kernel...

I see. That's why collecting is so time-consuming...

Could you figure out this error?
(Screenshot from 2023-08-08 01-02-13)

Do I need to create dynamic_springboard.patch myself?
(Is this where I write the changes to the switch_to function?)

dtcccc commented

Oops.
I've tried with Linux 6.4 and I see the huge difference.

These two commits totally break our work:
f96eca432015 ("sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there")
801c14195510 ("sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there")

Files inside and outside our boundary are "mixed" into single files, so the analyze and extract steps are totally broken.

Let me look for a way to solve it...

BTW, you can replace $(vmlinux-dirs) with $(build-dir) in src/Makefile.plugsched on the latest Linux to speed up the collect stage.

Thank you.
I should update boundary.yaml.
How do you figure out what to fix in boundary.yaml?
Is it difficult?

you can replace $(vmlinux-dirs) with $(build-dir)

Thank you. I'll try.

dtcccc commented

f96eca432015 ("sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there")
801c14195510 ("sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there")

These two commits landed in Linux 5.18, so plugsched seems not to support >= 5.18 (not sure about 5.11~5.17).

Sorry.

Even if I add kernel/sched/build_policy.c and so on to boundary.yaml, will it not work with kernel 6.2.6?

dtcccc commented

Yes. I've tried to adapt to Linux 6.x, but it's too hard...

I see. Thank you.
I'll try with kernel 5.x.

Please allow me to ask another question.

(Screenshot: 2023-08-16 19:53:27)

I think this code is for jumping to the original scheduler.
It seems this code is patched into kernel/sched/mod/core.c in the /tmp/work dir.

Question 1: When is this code built?
During analyze, collect, or extract?
I think src/Makefile is involved, but I cannot work it out.

Question 2: When is this code (kernel/sched/mod/core.c in the /tmp/work dir) executed, and by whom?
Is it included in the kernel module?
I haven't understood why the old scheduler in the kernel binary is no longer executed after loading the kernel module.

P.S.
https://dl.acm.org/doi/10.1145/3582016.3582054
I read this paper.
Thank you.

dtcccc commented

I think this code is for jumping to the original scheduler. It seems this code is patched into kernel/sched/mod/core.c in the /tmp/work dir.

Yes.

Question 1: When is this code built? During analyze, collect, or extract? I think src/Makefile is involved, but I cannot work it out.

This code is patched in after "extract" and is built by "plugsched-cli build", because "core.o" is listed in src/Makefile.

Question 2: When is this code (kernel/sched/mod/core.c in the /tmp/work dir) executed, and by whom? Is it included in the kernel module? I haven't understood why the old scheduler in the kernel binary is no longer executed after loading the kernel module.

It's a bit like a hotfix. We dynamically patch each function's entry (i.e., the first 5 bytes on x86) to "jmp our_module", so the old scheduler in the kernel binary is no longer executed.
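
As a concrete illustration, here is a minimal userspace sketch of how such a 5-byte x86 "jmp rel32" patch can be encoded. This is not plugsched's actual code (the real logic is JUMP_OPERATION() in src/head_jump.h, which must also handle text write protection and CPU synchronization):

#include <stdint.h>
#include <string.h>

/* Encode "jmp rel32" (5 bytes) redirecting the entry of the old
 * function at `from` to its replacement at `to`. */
static void make_jmp_patch(unsigned char patch[5],
                           unsigned long from, unsigned long to)
{
	/* rel32 is relative to the address of the next instruction */
	int32_t rel = (int32_t)(to - (from + 5));

	patch[0] = 0xE9;                      /* x86 near-jmp opcode */
	memcpy(&patch[1], &rel, sizeof(rel)); /* little-endian displacement */
}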

The reason for the patch in your picture is mainly about sleeping threads. Their RIP stays in switch_to() while they sleep. If we rmmod plugsched, the module memory (including its text) is freed. If these sleeping threads are then woken up, they cannot find the text code that their saved RIP points to.

It's a bit like a hotfix. We dynamically patch each function's entry (i.e., the first 5 bytes on x86) to "jmp our_module".

Please point me to the code or scripts for this operation in plugsched.

I'm sorry for one more question.
Which do you think is better when using the boundary.yaml of 5.10: kernel 5.8 or 5.11?
(Fedora 33 ships kernel 5.8, and Fedora 34 ships kernel 5.11.)

dtcccc commented

Please point me to the code or scripts for this operation in plugsched.

JUMP_OPERATION() in src/head_jump.h (called from __sync_sched_install() in src/main.c)

dtcccc commented

I'm sorry for one more question. Which do you think is better when using the boundary.yaml of 5.10: kernel 5.8 or 5.11? (Fedora 33 ships kernel 5.8, and Fedora 34 ships kernel 5.11.)

I'm not sure, but maybe you could try 5.8 first

Thank you very much.

When is JUMP_INIT_FUNC executed?

dtcccc commented

See jump_init_all() in sched_mod_init()

Thank you.

On kernel 5.x, this error occurs during extract.

Do you recognize this error?

(screenshot: error0818)

dtcccc commented

It seems there may be something wrong with your src.rpm?

Actually, this step just fetches the kernel source code (which is not strongly related to plugsched).
Do you have another way to get this code? :-/

dtcccc commented

Maybe you could try rpm2cpio xxx.rpm | cpio -idmv to extract it directly

I use
yumdownloader --source kernel-$(uname -r)

and the result of the ls command after rpm2cpio xxx.rpm | cpio -idmv is this:

(screenshot: cpio)

I'll try with a source rpm downloaded from a web page.

dtcccc commented

Take a look at *.spec.

This file will show the kernel path.

If we just want the source code, can I get the kernel code from GitHub instead?
Will some errors occur?

dtcccc commented

Yes, we do not care where the source code comes from.
You can get it from GitHub or git.kernel.org.

Thank you.

Is a difference between the running kernel version and the extracted kernel version allowed?
(For example, Fedora running kernel 5.10.13-200rcXXX while extracting the kernel 5.10 code from GitHub.)

dtcccc commented

They are allowed if the code does not break the boundary. (Even if it does, it should only cause a smaller boundary and more outside functions that you cannot modify.)

The "init" step will rely on both source code and debuginfo package (including vmlinux and .config for your target host kernel).
The source code decides how the code works after installing plugsched, while debuginfo package decides how to replace old functions (with jmp xxx).
Ensure your kernel-debuginfo rpm is the right version.

dtcccc commented

The source code decides how the code works after installing plugsched

For example, if your Fedora kernel has its own feature which upstream Linux does not:

static int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
	...
	printk("This is fedora own feature, it will print a line.\n"); // just an example, not real.
	...
}

This printk is added by Fedora and does not appear in the upstream Linux code. If you use the upstream code, you may lose this feature. IMO it's not a critical problem.

When I executed init on kernel 5, there was no Makefile in ./scheduler/usr/include. (This caused an error.)

(Screenshots: usr_include5 on kernel 5; usr_include6 on kernel 6)

Do you have any idea why this happened?
I rewrote the extract_src part in cli.py, but that has nothing to do with it, right?
(How is the ./scheduler directory generated?)

@dtcccc
Can I become a GitHub sponsor of yours?

dtcccc commented

When I executed init on kernel 5, there was no Makefile in ./scheduler/usr/include. (This caused an error.)

Take a look at your kernel source path (extracted by "plugsched-cli extract_src xxx")?

Can I become a GitHub sponsor of yours?

Thank you, but there's no need: I maintain this project as part of my work at my company :-)

Take a look at your kernel source path?

Yes. I extract the kernel code with
rpm2cpio your-package.rpm | cpio -idmv
and a tar command, using my modified plugsched-cli extract_src:
https://github.com/yushoyamaguchi/plugsched/blob/yama_build1/cli.py#L244

https://github.com/aliyun/plugsched/blob/master/cli.py#L68
From this code, I cannot find what constructs ./scheduler (tmp_dir), apart from ./scheduler/kernel/sched/mod/ (mod_path) and ./scheduler/working (tmp_dir).
(These parts are constructed by file_mapping().)
How are the other parts of ./scheduler constructed?

dtcccc commented

I think you may be confusing "extract_src" with "init"?

Take a look at your <target_dir> (which is the same as <release_kernel> in "init") after running "extract_src", and see whether <target_dir>/usr/include/Makefile exists.

dtcccc commented

How are the other parts of ./scheduler constructed?

This step happens during "init". See this line:
https://github.com/aliyun/plugsched/blob/master/cli.py#L166

Take a look at your <target_dir> (which is the same as <release_kernel> in "init") after running "extract_src", and see whether <target_dir>/usr/include/Makefile exists.

After
plugsched-cli extract_src kernel-${uname_r%.*}.src.rpm ./kernel
there is a Makefile in kernel/usr/include.

(screenshot: usr_include5)

However, after
plugsched-cli init $(uname -r) ./kernel ./scheduler
there is no Makefile in scheduler/usr/include.
(Instead of the Makefile, there are other directories like asm and asm-generic.)

dtcccc commented

This is really strange, because "make collect" should never touch usr/include/, since it is not in $(vmlinux-dirs).

What happens if you run the sync command manually?

rm -rf ./scheduler
rsync ./kernel ./scheduler --archive --verbose --delete --exclude=.git --filter=":- .gitignore"

When running the sync command manually, there is no Makefile in scheduler/usr/include.
However, when I delete --filter=":- .gitignore", the Makefile appears in this directory.

Next, this error occurred at the end of plugsched-cli init.
(In my environment, it occurred 40 minutes after starting init.)
Have you seen it?

FAILED: load BTF from vmlinux: Unknown error -22
make[1]: *** [Makefile:1167: vmlinux] Error 255
make: *** [/work5/scheduler/working/Makefile.plugsched:16: collect] Error 2

(screenshot: error_0905)

dtcccc commented

Oh, I know the reason after reading the code in your repo:

yushoyamaguchi@25e3e66

Please revert this, because build-dir is not defined in Linux 5.x;
just use vmlinux-dirs.

dtcccc commented

The step that loads btfids from BTF is related to the BPF code (not to the scheduler). It should never be executed if "collect" works properly.

dtcccc commented

I think you should try the original plugsched now. The suggestions I've given are mostly for Linux 6.x.

Thank you very much !
I'll try.

I'd like to give you one more suggestion, since it takes a long time for you to do "init".

You may hit a failure when applying "post_extract.patch" (https://github.com/aliyun/plugsched/blob/master/cli.py#L194), because our patch may not apply properly to other Linux kernels. You can check the .rej files to see the parts that were not applied, and modify the code manually. After this work, you can remove lines 184-194 and run "init" again to continue the following work. The same applies if you meet problems when applying "dynamic_springboard.patch" or "dynamic_springboard_2.patch".

This can save a lot of time.

@dtcccc

plugsched-cli init has finished!!
Thank you very much.

(Screenshot from 2023-09-15 15-32-45)

@dtcccc

plugsched works correctly.
Thank you very much.

By the way, who executes module-contrib/scheduler-installer?
It seems to be plugsched.service, but what is plugsched.service?

By the way, who executes module-contrib/scheduler-installer? It seems to be plugsched.service, but what is plugsched.service?

Yes, scheduler-installer is executed by plugsched.service.
We register this service with systemd so that after every system startup, systemd starts the service (which then runs scheduler-installer).
The registration happens when you install the rpm. See:
https://github.com/aliyun/plugsched/blob/master/module-contrib/scheduler.spec#L50
and
https://github.com/aliyun/plugsched/blob/master/module-contrib/scheduler.spec#L84

https://github.com/aliyun/plugsched/blob/master/configs/5.10/dynamic_springboard.patch#L39
https://github.com/aliyun/plugsched/blob/master/src/main.c#L599

Is sched_springboard used only for replacing the context_switch function?
Where does the SPRINGBOARD variable come from? (src/main.c)

It looks like the replacement of the many other boundary functions is done elsewhere, such as JUMP_INSTALL_FUNC in src/head_jump.h.
(Is that correct?)
https://github.com/aliyun/plugsched/blob/master/src/head_jump.h#L90

It looks like the replacement of the many other boundary functions is done elsewhere, such as JUMP_INSTALL_FUNC in src/head_jump.h. (Is that correct?) https://github.com/aliyun/plugsched/blob/master/src/head_jump.h#L90

Yes. All the function replacement happens there.

Is sched_springboard used only for replacing the context_switch function? Where does the SPRINGBOARD variable come from? (src/main.c)

No. It is not for "replacing" but for jumping to the origin function (from the new plugsched module). SPRINGBOARD comes from the last few lines of kernel/sched/mod/Makefile. It is dynamically generated by
https://github.com/aliyun/plugsched/blob/master/tools/springboard_search.sh#L125
to ensure we jump to the right place in the origin vmlinux.

The reason we need SPRINGBOARD is explained in Chapter 4.4.1 of our paper.

I would like to rewrite it to work with kernel 6.x if possible.
Where should I modify it? (Which files should I rewrite?)

You mentioned before that it would be very difficult; which part is difficult?
How is this different from the kernel 4.19 -> 5.10 transition?

I'm sorry, I had overlooked this:

These two commits totally break our work:
f96eca432015 ("sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there")
801c14195510 ("sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there")

Files inside and outside our boundary are "mixed" into single files, so the analyze and extract steps are totally broken.

Let me look for a way to solve it...

Do you think this problem cannot be solved?

Do you think this problem cannot be solved?

I believe all problems can be solved eventually, but some may be really difficult. One would need to be familiar with the ELF layout of vmlinux.

f96eca432015 ("sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there")
801c14195510 ("sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there")

These two commits break the (function, file) mapping relationship in our plugsched.

For example, the symbol info for init_rt_rq() should be (init_rt_rq, kernel/sched/rt.c). But after resolving the vmlinux of Linux 6.x, we get (init_rt_rq, kernel/sched/build_policy.c), so all the following steps are broken.

we get (init_rt_rq, kernel/sched/build_policy.c)

This comes from the vmlinux of Linux 6.x, right?

the symbol info for init_rt_rq() should be (init_rt_rq, kernel/sched/rt.c)

Why can we say this? From which part of plugsched?

In addition, does "the mapping relationship in our plugsched" refer to specific files or a specific part of plugsched?

Also, does the result of plugsched-cli init have to be that accurate?

What about running plugsched-cli init on kernel 5 and moving the scheduler directory (the result of init) to kernel 6?
Would that cause errors?

It will likely cause errors. The "init" step reads the symbols and other info from vmlinux. You may get wrong boundary info when resolving a mismatched vmlinux.

However, we have in fact tried a similar way to backport new features. But that is a different topic...

we get (init_rt_rq, kernel/sched/build_policy.c)

This comes from the vmlinux of Linux 6.x, right?

right.

the symbol info for init_rt_rq() should be (init_rt_rq, kernel/sched/rt.c)

Why can we say this? From which part of plugsched?

In addition, does "the mapping relationship in our plugsched" refer to specific files or a specific part of plugsched?

See fn.signature in https://github.com/aliyun/plugsched/blob/master/boundary/analyze.py#L345
It should be something like (init_rt_rq, kernel/sched/rt.c)

We trace a function by both its name and its file, because there may be static functions with the same name in different files.
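
A minimal illustration (hypothetical files and names) of why the file must be part of the signature:

/* a.c */
static int helper(void) { return 1; }

/* b.c */
static int helper(void) { return 2; }  /* same name, but a different function */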

What's more, in the "extract" step we need to rewrite some code: e.g., remove the definitions of outside functions (keeping only their declarations), add comments like "/* DON'T MODIFY SIGNATURE OF INTERFACE FUNCTION {} */", and fix some variable declarations. See https://github.com/aliyun/plugsched/blob/master/boundary/extract.py#L298 to L300.

We need to know the location (mainly the line number) of these symbols in order to rewrite them.

So we record that init_rt_rq() is at, say, line 1234 in kernel/sched/rt.c. But now we find it in build_policy.c, and the location info is then totally wrong.

See fn.signature in https://github.com/aliyun/plugsched/blob/master/boundary/analyze.py#L345
It should be something like (init_rt_rq, kernel/sched/rt.c)

Where does this tuple information come from?
(I understand it is not from vmlinux.)
boundary.yaml?

Where does this tuple information come from? (I understand it is not from vmlinux.) boundary.yaml?

From kernel/sched/rt.c.boundary, which is generated in the "collect" step.
https://github.com/aliyun/plugsched/blob/master/boundary/collect.py#L356

Thank you very much.

@dtcccc

From kernel/sched/rt.c.boundary, which is generated in the "collect" step.
https://github.com/aliyun/plugsched/blob/master/boundary/collect.py#L356

This operation in collect.py refers to the call graph made from the actual kernel code in 'extract.py', right?

If that is correct, vmlinux and the call graph (from the actual kernel code) are contradictory on kernel 6?

I'm not sure I understand your meaning...

extract.py generates the code files in kernel/sched/mod.
extract.py relies on the results from collect.py.

vmlinux and the call graph (from the actual kernel code) are contradictory on kernel 6?

I think this is 99% yes. In our example, init_rt_rq() is in rt.c, not build_policy.c.
But build_policy.c includes rt.c, so hmm.....

From a human's view, init_rt_rq() is in rt.c,
but from the compiler's view, init_rt_rq() is in build_policy.c.
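
For reference, kernel/sched/build_policy.c in recent kernels looks roughly like this (abridged sketch; the real file includes more .c files):

/* kernel/sched/build_policy.c (abridged): the policy .c files are
 * #included into one translation unit, so the compiler and the debug
 * info attribute their symbols to build_policy.c instead of rt.c. */
#include "idle.c"
#include "rt.c"
#include "deadline.c"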

I think this is 99% yes.

OMG...
I understand why it is so difficult.
I will give up for now.
Thank you very much for teaching me so kindly.

@dtcccc

I'm sorry for the repeated questions.
After rpm -ivh /path/to/scheduler-xxx.rpm,
a permission error occurred.
(screenshot: error_msg)
Do you know how to fix it?
I could not find which permission is insufficient.

Is any error reported in dmesg? Or are you installing the rpm in a Docker container?

OK, thank you. I should be able to solve that problem.
And one more question, I'm sorry.

(screenshot: error_msg)

As shown, the version of the kernel module object is 5.10.23-200.fc33.x86_64+, even though uname -r reports 5.10.23-200.fc33.x86_64.
This results in an insmod error.
This is the output of sudo dmesg after sudo insmod /run/plugsched/scheduler.ko:
(screenshot: dmesg)

Do you know how to fix it?

Did you get the source code from git and build it yourself?
I usually touch .scmversion in my kernel source tree before I build it.

See https://stackoverflow.com/a/32699989

@dtcccc
I want to add an ioctl interface to the kernel module generated by plugsched.
Can we register the kernel module as a character device in sched_mod_init() in kernel/sched/mod/main.c?

Sorry, I'm not familiar with ioctl :-(
What do you actually want to do?

I'm sorry.
I want to invoke functions in the sched module (built by plugsched) from a user program.
I run a remodeled sched_deadline and want to manipulate task information in the remodeled sched_deadline from the user programs that are the task entities.

@dtcccc

In a scheduler built by plugsched, are the callback functions for load balancing in sched_rt disabled?
https://github.com/aliyun/plugsched/blob/master/src/main.c#L124

Why is this processing required?

https://github.com/aliyun/plugsched/blob/master/boundary/extract.py#L145
https://github.com/aliyun/plugsched/blob/master/src/head_jump.h#L28
Are these parts related?

In a scheduler built by plugsched, are the callback functions for load balancing in sched_rt disabled?

Not disabled; we just clear the lists to prevent asynchronous calls, e.g., a balance_callback() from the origin kernel firing after the new plugsched is installed. New items will be added to these lists after the new plugsched is installed.
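
Schematically, the idea is something like this (an illustration only, not the exact code in src/main.c):

/* Drop balance callbacks queued by the origin kernel so they are not
 * invoked after the module's scheduler takes over; the new scheduler
 * queues its own items afterwards. */
int cpu;

for_each_possible_cpu(cpu)
	cpu_rq(cpu)->balance_callback = NULL;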

Are these parts related?

Partly (they are similar), but not exactly.
These parts aim to replace all function pointers. If we did not handle this, the kernel could crash as follows (see the sketch after the list):

  1. __mod_func() is added to a list via schedule_work() or call_rcu(), etc.
  2. plugsched is uninstalled and freed.
  3. The asynchronous call runs __mod_func(), but it has been freed.
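
Schematically (illustrative names, not plugsched code):

/* 1. Module code queues an async callback whose handler lives in
 *    module text: */
static void __mod_func(struct work_struct *work) { /* ... */ }
static DECLARE_WORK(mod_work, __mod_func);
/* ... somewhere in the module: schedule_work(&mod_work); ... */

/* 2. rmmod plugsched frees the module memory, including __mod_func's text.
 * 3. The workqueue later calls mod_work.func and jumps into freed
 *    memory -> crash. Hence such function pointers must be redirected. */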
1. I have understood balance_callback(). Thank you very much.

2. Why can we avoid crashing the kernel by renaming (prefixing "__cb" to the function name)?
I have not understood the relationship between replacing function pointers and renaming.

I'm sorry for one more question.

Some structs' definitions are in sched/sched.h but not in sched/mod/sched.h.
For example, the sched_rt_entity struct is like that.
Can I redefine these structs in sched/mod/sched.h?
(I want to add some fields to the sched_rt_entity struct.)

Why can we avoid crashing the kernel by renaming (prefixing "__cb" to the function name)?
I have not understood the relationship between replacing function pointers and renaming.

Yeah, the key point is replacing function pointers. The renaming itself is not important, but it is a convenient way to replace function pointers without other changes, because it distinguishes __cb_func() from the origin func().

For example, in the origin source code we call schedule_work(func) in many places.
In plugsched's sched/mod/, we keep these call sites without any changes and modify the definition as:

extern void func();  /* declaration only: resolves to the origin kernel's symbol */
void __cb_func()     /* the module's new definition, under a distinct name */
{
    xxx
}

This way we tell the compiler to define a new function named "__cb_func" instead of "func", so schedule_work(func) looks up the symbol in the origin kernel, not in plugsched.

I'm sorry for one more question.

Some structs' definitions are in sched/sched.h but not in sched/mod/sched.h. For example, the sched_rt_entity struct is like that.

No, sched_rt_entity is defined in include/linux/sched.h; this is a global include header and outside the scheduler scope (unlike kernel/sched/sched.h).

Can I redefine these structs in sched/mod/sched.h? (I want to add some fields to the sched_rt_entity struct.)

Some distro kernels reserve a few fields in structs to prepare for hotfixes. If the kernel you are using has reserved fields in sched_rt_entity, just use them. Otherwise I do not suggest redefining it.
You should NEVER change the size of sched_rt_entity, because it is embedded in task_struct, which is used frequently by other subsystems. If you want to change the meaning of a specific field, you must be very careful, because the field may be used by other subsystems. See Limitations in the README:

We don't recommend modifying structures and the semantics of their members as well. If you really need to, please refer to the working/boundary_doc.yaml documentation.
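
As a hypothetical illustration (the field name here is invented; check your own distro's headers), a reserved slot can carry new per-task data without changing the struct size:

/* include/linux/sched.h (hypothetical distro kernel) */
struct sched_rt_entity {
	struct list_head	run_list;
	/* ... existing fields ... */
	unsigned long		reserved1;	/* spare slot left for hotfixes */
};

/* The module can then reuse the spare slot, e.g.: */
#define rt_se_my_counter(rt_se)	((rt_se)->reserved1)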

I understand.
Thank you very much.

@dtcccc
Is it allowed to use global variables in the scheduler module?

I declare my own queue as a global variable. (https://github.com/yushoyamaguchi/sched_as_plug/blob/c8c968759f27aa19f079f484d093849e263119f8/kernel/sched/mod/rt.c#L11)

At the point where the queue is used (https://github.com/yushoyamaguchi/sched_as_plug/blob/c8c968759f27aa19f079f484d093849e263119f8/kernel/sched/mod/rt.c#L1293), a null pointer error occurs. (That part is currently commented out.)

(Screenshot: 2024-03-06 2:30:09)

The error is an access to address 0.

(Screenshot: 2024-03-06 2:30:23)

Also, I have confirmed the error location.

It seems to me that this is simply a coding error on my part... but I can't find it.

Is it allowed to use global variables in the scheduler module?

Of course, yes. But you cannot define a list_head like that...

A single list_head should be defined by LIST_HEAD(yama_rt_rq_list);
but you want to define an array... Hmm...
Maybe something like:

struct list_head yama_rt_rq_list[NR_CPUS];
int i;

/* each list head must be initialized before use */
for (i = 0; i < NR_CPUS; i++)
    INIT_LIST_HEAD(&yama_rt_rq_list[i]);

For your usage, I suggest using a percpu variable. However, it does require more knowledge from the developer, so it is harder for Linux beginners to use...
You can refer to xskmap_flush_list in net/xdp/xsk.c:

static DEFINE_PER_CPU(struct list_head, xskmap_flush_list);

static int __init xsk_init(void)
{
    ......
    for_each_possible_cpu(cpu)
        INIT_LIST_HEAD(&per_cpu(xskmap_flush_list, cpu));
    ......
}
}