containerd/overlaybd

Performance regression for image pull with concurrent container creation

shuochen0311 opened this issue · 5 comments

What happened in your environment?

We have multiple containers running on the same node with overlaybd as the container snapshotter, all lazily pulling their rootfs contents. In production, we found a large gap between the P95 and P50 image pull latency (20s vs. 10s). After checking the logs, we noticed an interesting coincidence.

For the image pulls with unexpectedly high latency:

Oct 02 06:04:28 [Event] Start to pull image for container executor: image harbor-xxxxx
Oct 02 06:04:37  [Event] Finish pulling image for container executor: image harbor-xxxx

There are container creation events happening inside containerd at the same time:

Oct 02 06:04:29 ip-10-1-162-245 containerd[387]: time="2023-10-02T06:04:29.671653617Z" level=info msg="CreateContainer within sandbox \"e0d9308c3259dc01251575ad5c27d2efdbdaf00b7c267f06a7ab15ed6d827e23\""
Oct 02 06:04:29 ip-10-1-162-245 containerd[387]: time="2023-10-02T06:04:29.672341423Z" level=info msg="StartContainer for \"3a6b0dce5e9168993ccd0c3213929af87e4765304774773188e1830631e2ff39\""
Oct 02 06:04:29 ip-10-1-162-245 containerd[387]: time="2023-10-02T06:04:29.672417656Z" level=info msg="container start request for xxxx"
Oct 02 06:04:29 ip-10-1-162-245 containerd[387]: time="2023-10-02T06:04:29.837229175Z" level=info msg="StartContainer for \"3a6b0dce5e9168993ccd0c3213929af87e4765304774773188e1830631e2ff39\" returns successfully"

We suspect that the container creation events (which include part of the container rootfs construction process) are interfering with image pulling and increasing the lazy pull latency.
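To check whether this overlap holds across many slow pulls rather than a single one, a small script can list the containerd log lines that fall inside a given pull window. The following is only a sketch: the function names are hypothetical, it assumes the containerd `time="..."` field is RFC 3339 in UTC as in the lines above, and it assumes the pull start/end times are supplied separately as UTC datetimes.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the time="2023-10-02T06:04:29.671653617Z" field in containerd logs.
TIME_RE = re.compile(r'time="(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z"')


def parse_containerd_time(line):
    """Return the UTC timestamp embedded in a log line, or None if absent."""
    m = TIME_RE.search(line)
    if m is None:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    base = base.replace(tzinfo=timezone.utc)
    # containerd logs nanoseconds; datetime only keeps microseconds,
    # so truncate the fraction to six digits.
    frac = m.group(2) or ""
    micros = int(frac[:6].ljust(6, "0")) if frac else 0
    return base + timedelta(microseconds=micros)


def events_in_window(lines, start, end):
    """Return (timestamp, line) pairs whose timestamp lies in [start, end]."""
    hits = []
    for line in lines:
        ts = parse_containerd_time(line)
        if ts is not None and start <= ts <= end:
            hits.append((ts, line))
    return hits
```

Running `events_in_window` over the containerd journal for each slow pull window (for example 06:04:28 to 06:04:37 above) and for a few fast ones would show whether CreateContainer/StartContainer events really cluster in the slow windows.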

We are looking for insights from upstream on the potential cause of this performance regression.

What did you expect to happen?

No response

How can we reproduce it?

Use overlaybd as the snapshotter and overlap some container creation with a container image download.

What is the version of your Overlaybd?

0.6.17

What is your OS environment?

Ubuntu 20.04

Are you willing to submit PRs to fix it?

  • Yes, I am willing to fix it.

@shuochen0311
What was the workload in the container created at 06:04:29? Did it load a large amount of data, which could have affected image pulling?
Were there any other logs between 06:04:29 and 06:04:37?

@liulanzheng thanks for responding. Let me see what else I can find in the logs from that period.

One question on my side: if the container creation/start requires a lot of data pulling, will it affect the performance of rpull (metadata pulling), which is on the critical path before the container starts?

if the container creation/start requires a lot of data

It depends on the application itself. If it is busybox, it requires little data.

@lihuiba how do I know if my container is downloading a lot of data? Also, I think the real question is: is it expected that data downloading will affect rpull performance?

@shuochen0311 iostat can show you how much data has been read from a block device. It can also show you the real-time I/O speed.
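If it is easier to collect this from a script than from iostat output, the same kind of number can be derived from the kernel counters in /proc/diskstats. The following is a rough sketch, not part of overlaybd: the function names are hypothetical, and which device name corresponds to a given container's overlaybd device depends on your setup.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_sectors_read(diskstats_text):
    """Map device name -> cumulative sectors read, from /proc/diskstats text."""
    result = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # fields: major, minor, name, reads completed, reads merged, sectors read, ...
        result[fields[2]] = int(fields[5])
    return result


def read_throughput(before_text, after_text, interval_s):
    """Return device name -> bytes read per second between two snapshots."""
    before = parse_sectors_read(before_text)
    after = parse_sectors_read(after_text)
    return {
        dev: (after[dev] - before[dev]) * SECTOR_BYTES / interval_s
        for dev in after
        if dev in before
    }


def sample(interval_s=1.0, path="/proc/diskstats"):
    """Take two snapshots interval_s apart and return the read throughput."""
    with open(path) as f:
        first = f.read()
    time.sleep(interval_s)
    with open(path) as f:
        second = f.read()
    return read_throughput(first, second, interval_s)
```

Calling `sample()` in a loop while a container starts and comparing the per-device rates during slow and fast pulls would indicate whether heavy reads from another container's device coincide with the slow rpull.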