"Generating global index page" took too long...
r00t1900 opened this issue · 22 comments
description
I've noticed that either tunasync or bandersnatch itself takes a consistently long time to do "generating global index page": about 2 hours on a Raspberry Pi 4B, no matter how many packages are synced.
example
Here is an example showing that even when only "5 packages had changes", bandersnatch still took about 90 minutes (from 07:34 a.m. to 09:03 a.m.) to do the "generating global index page" job:
2022-04-08 07:33:08,581 INFO: Selected storage backend: filesystem (configuration.py:128)
2022-04-08 07:33:08,581 INFO: Selected compare method: stat (configuration.py:174)
2022-04-08 07:33:08,582 INFO: Selected alternative download mirror https://pypi.tuna.tsinghua.edu.cn (configuration.py:179)
2022-04-08 07:33:09,137 INFO: Initialized project plugin blocklist_project, filtering ['tf-nightly', 'torchrec-nightly', 'tf-nightly-gpu', 'tf-nightly-cpu', 'tensorflow-io-nightly', 'pyagrum-nightly'] (blocklist_name.py:27)
2022-04-08 07:33:09,296 INFO: Syncing with https://pypi.org. (mirror.py:56)
2022-04-08 07:33:09,297 INFO: Current mirror serial: 13437808 (mirror.py:267)
2022-04-08 07:33:09,297 INFO: Resuming interrupted sync from local todo list. (mirror.py:274)
2022-04-08 07:33:09,298 INFO: Trying to reach serial: 13441154 (mirror.py:299)
2022-04-08 07:33:09,298 INFO: 10 packages to sync. (mirror.py:301)
2022-04-08 07:33:09,298 INFO: No metadata filters are enabled. Skipping metadata filtering (mirror.py:75)
2022-04-08 07:33:09,298 INFO: No release filters are enabled. Skipping release filtering (mirror.py:77)
2022-04-08 07:33:09,298 INFO: No release file filters are enabled. Skipping release file filtering (mirror.py:79)
2022-04-08 07:33:09,300 INFO: Fetching metadata for package: anybinding (serial 13439002) (package.py:57)
2022-04-08 07:33:09,308 INFO: Fetching metadata for package: cs18-api-client (serial 13440888) (package.py:57)
2022-04-08 07:33:09,310 INFO: Fetching metadata for package: datacenter-pyarmor (serial 13438941) (package.py:57)
2022-04-08 07:33:09,313 INFO: Fetching metadata for package: dustilock (serial 13440143) (package.py:57)
2022-04-08 07:33:09,315 INFO: Fetching metadata for package: gardener-cicd-base (serial 13440013) (package.py:57)
2022-04-08 07:33:09,317 INFO: Fetching metadata for package: gardener-cicd-whd (serial 13440017) (package.py:57)
2022-04-08 07:33:09,320 INFO: Fetching metadata for package: get-all-slack-emojis (serial 13438779) (package.py:57)
2022-04-08 07:33:09,322 INFO: Fetching metadata for package: saiph (serial 13438350) (package.py:57)
2022-04-08 07:33:09,324 INFO: Fetching metadata for package: scs-test-0704 (serial 13439826) (package.py:57)
2022-04-08 07:33:09,327 INFO: Fetching metadata for package: zeev-test (serial 13440618) (package.py:57)
2022-04-08 07:33:10,399 INFO: zeev-test no longer exists on PyPI (package.py:65)
2022-04-08 07:33:10,502 INFO: dustilock no longer exists on PyPI (package.py:65)
2022-04-08 07:33:10,558 INFO: Downloading: https://pypi.tuna.tsinghua.edu.cn/packages/79/8b/2e822598c7f15faf1ae5a79fef68d51932a2b0983622638b5a01b9e0e6db/anybinding-1.0.0.tar.gz (mirror.py:933)
2022-04-08 07:33:10,680 INFO: Downloading: https://pypi.tuna.tsinghua.edu.cn/packages/86/54/3e7b5fc5dde318c9a3246a605e44fe9581c5c8a4e143709160dfa405fac3/anybinding-1.0.1.tar.gz (mirror.py:933)
...
...
2022-04-08 07:34:19,334 INFO: Downloading: https://pypi.tuna.tsinghua.edu.cn/packages/fe/ba/a58bda355460b0ca31943beb4e270c9278162fd98d51f4e335d1b20e02e3/gardener-cicd-base-1.1656.0.tar.gz (mirror.py:933)
2022-04-08 07:34:26,038 INFO: Downloading: https://pypi.tuna.tsinghua.edu.cn/packages/59/ad/85e11a200aee64497fce01d80402601ba93562740dced1ad3b23d06895c1/cs18-api-client-0.0.2.2768.tar.gz (mirror.py:933)
2022-04-08 07:34:26,178 INFO: Storing index page: cs18-api-client - in /mnt/storage/data/web/simple/cs18-api-client (mirror.py:791)
2022-04-08 07:34:58,279 INFO: Storing index page: gardener-cicd-whd - in /mnt/storage/data/web/simple/gardener-cicd-whd (mirror.py:791)
2022-04-08 07:34:58,405 INFO: Storing index page: gardener-cicd-base - in /mnt/storage/data/web/simple/gardener-cicd-base (mirror.py:791)
2022-04-08 07:34:58,525 INFO: Generating global index page. (mirror.py:483)
2022-04-08 09:03:04,315 INFO: New mirror serial: 13441154 (mirror.py:507)
2022-04-08 09:03:04,545 INFO: 5 packages had changes (mirror.py:1043)
2022-04-08 09:03:04,546 INFO: Writing diff file to mirrored-files (mirror.py:1053)
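For reference, the roughly 90-minute figure follows directly from the two timestamps around the index generation step in the log above:

```python
# Gap between "Generating global index page." and the next log line,
# taken from the timestamps in the log above.
from datetime import datetime

start = datetime.fromisoformat("2022-04-08 07:34:58.525")
end = datetime.fromisoformat("2022-04-08 09:03:04.315")
print(end - start)  # 1:28:05.790, i.e. roughly 90 minutes
```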
And here is the related configuration file bs.conf:
[mirror]
directory = /mnt/storage/data
master = https://pypi.org
json = true
timeout = 300
workers = 10
hash-index = false
stop-on-error = false
delete-packages = true
compare-method = stat
download-mirror = https://pypi.tuna.tsinghua.edu.cn
download-mirror-no-fallback = false
[plugins]
enabled =
blocklist_project
[blocklist]
packages =
tf-nightly
tf-nightly-gpu
tf-nightly-cpu
tensorflow-io-nightly
pyagrum-nightly
torchrec_nightly
I added torchrec_nightly in addition to the given example because I found this package often fails to sync.
question
- I've noticed that on the server status page of the TUNA mirror, the sync status of pypi is shown as "xx minutes ago", which is effectively real-time, unlike the situation I'm seeing. Why is that, and how can I achieve what the TUNA server does?
- I use bandersnatch on a Raspberry Pi 4B to mirror PyPI because the TUNA docker image has some privilege problem when I run docker on the Pi. However, after the log output shows "0 packages to sync", the size of the final mirrored data is only 8949G, which is far smaller than the 9.75T shown in the server status of the TUNA mirror. Of course, I've excluded some of the nightly builds, just like the given pypi.sh in https://github.com/tuna/tunasync-scripts, so I think the size I get should be exactly the same as yours. What could cause this difference?
- I've also noticed that bandersnatch sometimes starts with "Resuming interrupted sync from local todo list.", even though the previous run did not report any interruption. This makes syncing very painful because I do not use crontab or anything else to restart it automatically. Besides, even if it did restart automatically, as I said above, it always takes about 90 minutes to regenerate the index. Under these two conditions the syncing job is pretty inefficient.
What type of storage are you using?
Regarding "tuna docker image has some privilege problem": please elaborate on this. We use our image on both x86_64 and aarch64 servers and there should not be any problem.
A SATA3 HDD with an EXT4 filesystem. The IO performance is about 120 MB/s. I don't think the IO performance is the cause; rather it is the bandersnatch logic, which always regenerates the whole index page no matter how many packages have actually changed.
I will report it later, because right now I only have limited access to aarch64 boards.
You can refer to #167; the symptom and the cause are the same.
When starting cleanly, bandersnatch will generate a "todo" file, and after a successful run it will be deleted. On startup, it detects whether the file exists; if the todo file exists, it will try to continue the last interrupted sync rather than start a new one.
About the size difference: it is simply because bandersnatch does not delete packages that have already been deleted from upstream.
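An illustrative sketch of that startup check (not bandersnatch's actual code; the todo file is assumed to live in the configured mirror directory, /mnt/storage/data in the configuration above):

```python
# Illustrative only: mimic the check bandersnatch does at startup.
from pathlib import Path

mirror_dir = Path("/mnt/storage/data")   # "directory" from bs.conf
todo = mirror_dir / "todo"

if todo.exists():
    # A previous run was interrupted, so the next run logs
    # "Resuming interrupted sync from local todo list." and works through
    # the packages recorded in this file instead of asking PyPI for changes.
    print("unfinished sync, will resume:", len(todo.read_text().splitlines()), "lines")
else:
    # Clean start: the last sync completed and the todo file was deleted.
    print("no todo file, last sync completed cleanly")
```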
Of course, that issue was submitted by me too...
And about the speed of generating the index page: the bottleneck is not the bandwidth of your HDD but its IOPS. In your particular case, it might also be related to CPU performance.
So why are you posting a new issue? Is there anything new that happened?
- "bandersnatch does not delete packages which are already deleted from the upstream", it that point to my
bandersnatch
on Pi4B orbandersnatch
on tuna server? - Conducting that is true, but why the size is different between tuna server and my pi server, we share the same configuration file.
As for why I opened a new issue: in this thread I've also posted other questions, mainly the mismatched size problem. And recently the "generating global index page" problem confused me again, so I re-posted it. It's my fault; I should have been more careful.
Yes, I agree that IOPS is the bottleneck. So I have some further questions (sorry to bother you again, and thank you for your patience):
- Is there a recommended IOPS, and what performance can be expected with it?
- How can I test my IOPS?
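As a rough starting point: dedicated tools such as fio are the usual way to measure raw IOPS. A quick and dirty sketch (paths taken from the configuration above) is to time how long it takes just to list and stat every project directory under web/simple, which approximates the small random metadata IO that generating the global index causes on an HDD:

```python
# Rough IO sanity check, not a proper benchmark: walk web/simple and stat
# every entry, then report how many metadata operations per second the
# disk sustains. Run with a cold page cache for a realistic number.
import time
from pathlib import Path

simple = Path("/mnt/storage/data/web/simple")   # path from bs.conf

start = time.monotonic()
count = 0
for entry in simple.iterdir():   # one entry per mirrored project
    entry.stat()                 # forces a metadata read (a seek on an HDD)
    count += 1
elapsed = time.monotonic() - start

print(f"stat'ed {count} entries in {elapsed:.1f}s, ~{count / max(elapsed, 1e-9):.0f} ops/s")
```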
Because bandersnatch won't delete packages that have already been deleted from upstream, and the initial sync on the TUNA server happened much earlier than the one on your Pi, there is more such garbage on the TUNA server and its size is therefore larger.
Bingo! I think that is the real reason! And since the "extra" data is essentially garbage, I can just ignore the size difference and still be sure that my local pypi mirror works well. Is that correct? There is no need to sync that "garbage" data.
Correct. However, I strongly recommend against self-building a mirror of pypi, especially using a Pi-like board and a magnetic hard drive. Instead, a local caching server might be more suitable.
Totally agreed. The reason I use a Pi board for the pypi mirror job is that it is portable and silent, and I can place it somewhere with a good network. Once I've confirmed that the pypi mirror is fully synced (this is the confusing part, which I will explain below), I will move the hard drive to a real server to provide the service, because the place where I will use this local pypi mirror has no internet access.
etc
- What is the local caching server you just mentioned? I have no idea what it is or how it differs from a self-built pypi mirror. Isn't it a pypi mirror station, or a brand new form of caching pypi data? Whatever it is, I still need a complete copy of the pypi mirror (like cloning from an rsync source); can the local cache server do that?
- The reason I cannot confirm whether the pypi mirror is completely synced is that every time bandersnatch finishes a run with "generating global index page...", it is not really finished. The next time you run it, you will find it starting over from a todo file, yet there was no hint that any error occurred during the previous run. I cannot describe or explain this very well; maybe it is just a coincidence.
I've just searched for "local cache server" and now I have some idea of what it is.
However, that is just a cache, not a full copy: you first have to request a package, and only then does the cache server cache it. But if there is a way to make the cache server keep a full copy, I think I could use it. Besides, what is the recommended solution for caching pypi? I found Nexus and apt-cacher, but they don't seem intended for mirroring pypi.
Actually, you can write in Chinese here. A full sync of PyPI puts a very heavy IOPS load on the storage. If you don't actually need to serve it offline, a caching approach is in fact enough; you don't even need special software, since an nginx reverse proxy can do the caching.
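To illustrate the caching idea, here is a toy, assumption-heavy sketch of such a cache in Python (in practice an nginx proxy_cache or a dedicated tool such as devpi would be used; the port and cache path below are made up):

```python
# Toy disk-backed caching proxy: serve a request from the local cache if we
# have it, otherwise fetch it from upstream once and keep a copy.
# Note: PyPI's simple pages link to files.pythonhosted.org directly, so a
# real caching setup must also proxy or rewrite those file URLs.
import hashlib
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path

UPSTREAM = "https://pypi.org"                 # where cache misses are fetched
CACHE_DIR = Path("/mnt/storage/pypi-cache")   # hypothetical cache location


class CachingHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        cached = CACHE_DIR / hashlib.sha256(self.path.encode()).hexdigest()
        if cached.exists():                               # cache hit
            body = cached.read_bytes()
        else:                                             # cache miss
            with urllib.request.urlopen(UPSTREAM + self.path) as resp:
                body = resp.read()
            CACHE_DIR.mkdir(parents=True, exist_ok=True)
            cached.write_bytes(body)
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), CachingHandler).serve_forever()
```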
I saw that most issues are in English, so I used English too, to stay consistent :)
Yes, what I actually need is to serve it offline, which is why I'm considering making a complete copy. The current situation is as follows:
- I'm storing onto a mechanical hard drive attached to a Raspberry Pi 4B and have already reached 8.9T. Download speed is not a problem, but as you said, the final index-page generation step needs IOPS to back it up. As for the size mismatch, based on the earlier replies, I will never reach the 9.8T shown in the server status, because there are undeleted packages on the TUNA mirror that my sync cannot download now.
- Now that the sync is in its later stage, it has become very slow, because many packages go through "storing index page" right after being fetched, so downloading and IO alternate, unlike earlier when it was all "Downloading".
- The later stage also involves generating the global index page, which takes 90 minutes every single time. The reason I keep running bandersnatch mirror is that I cannot tell whether the mirror has actually finished syncing (in the narrow sense, i.e. whether the last sync task completed without missing packages or terminating with an error, rather than being exactly up to date with the server, which changes all the time). So I often run the command just to check, and sometimes that starts a new sync, and in the new sync download errors sometimes really do occur, so I can never get out of this loop.
Also, do you have any guiding advice? If the goal is to produce a full offline PyPI mirror, is there a recommended solution?
For now, it looks like the only way to tell whether a sync task finished normally is to check bandersnatch's exit status.
For a full offline mirror, just build it with bandersnatch directly.
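A small sketch of that advice, wrapping bandersnatch mirror in a retry loop and relying only on its exit status (the config path and retry policy below are assumptions for illustration):

```python
# Re-run "bandersnatch mirror" until it exits with status 0, so an
# interrupted sync (the todo-file case above) is picked up automatically.
import subprocess
import time

CONFIG = "/etc/bandersnatch/bs.conf"   # assumed location of bs.conf
MAX_RETRIES = 5

for attempt in range(1, MAX_RETRIES + 1):
    result = subprocess.run(["bandersnatch", "--config", CONFIG, "mirror"])
    if result.returncode == 0:
        print("sync finished cleanly")
        break
    print(f"attempt {attempt} exited with {result.returncode}, retrying in 60s")
    time.sleep(60)
else:
    print("giving up after repeated failures")
```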