"Generating global index page" took too long...
r00t1900 opened this issue · 22 comments
description
I've noticed that either tunasync or bandersnatch itself takes a consistently long time to do "generating global index page": about 2 hours on a Raspberry Pi 4B, no matter how many packages are synced.
example
Here is an example showing that even when only "5 packages had changes", bandersnatch still took about 90 minutes (from 07:34 a.m. to 09:03 a.m.) to do the "generating global index page" job:
2022-04-08 07:33:08,581 INFO: Selected storage backend: filesystem (configuration.py:128)
2022-04-08 07:33:08,581 INFO: Selected compare method: stat (configuration.py:174)
2022-04-08 07:33:08,582 INFO: Selected alternative download mirror https://pypi.tuna.tsinghua.edu.cn (configuration.py:179)
2022-04-08 07:33:09,137 INFO: Initialized project plugin blocklist_project, filtering ['tf-nightly', 'torchrec-nightly', 'tf-nightly-gpu', 'tf-nightly-cpu', 'tensorflow-io-nightly', 'pyagrum-nightly'] (blocklist_name.py:27)
2022-04-08 07:33:09,296 INFO: Syncing with https://pypi.org. (mirror.py:56)
2022-04-08 07:33:09,297 INFO: Current mirror serial: 13437808 (mirror.py:267)
2022-04-08 07:33:09,297 INFO: Resuming interrupted sync from local todo list. (mirror.py:274)
2022-04-08 07:33:09,298 INFO: Trying to reach serial: 13441154 (mirror.py:299)
2022-04-08 07:33:09,298 INFO: 10 packages to sync. (mirror.py:301)
2022-04-08 07:33:09,298 INFO: No metadata filters are enabled. Skipping metadata filtering (mirror.py:75)
2022-04-08 07:33:09,298 INFO: No release filters are enabled. Skipping release filtering (mirror.py:77)
2022-04-08 07:33:09,298 INFO: No release file filters are enabled. Skipping release file filtering (mirror.py:79)
2022-04-08 07:33:09,300 INFO: Fetching metadata for package: anybinding (serial 13439002) (package.py:57)
2022-04-08 07:33:09,308 INFO: Fetching metadata for package: cs18-api-client (serial 13440888) (package.py:57)
2022-04-08 07:33:09,310 INFO: Fetching metadata for package: datacenter-pyarmor (serial 13438941) (package.py:57)
2022-04-08 07:33:09,313 INFO: Fetching metadata for package: dustilock (serial 13440143) (package.py:57)
2022-04-08 07:33:09,315 INFO: Fetching metadata for package: gardener-cicd-base (serial 13440013) (package.py:57)
2022-04-08 07:33:09,317 INFO: Fetching metadata for package: gardener-cicd-whd (serial 13440017) (package.py:57)
2022-04-08 07:33:09,320 INFO: Fetching metadata for package: get-all-slack-emojis (serial 13438779) (package.py:57)
2022-04-08 07:33:09,322 INFO: Fetching metadata for package: saiph (serial 13438350) (package.py:57)
2022-04-08 07:33:09,324 INFO: Fetching metadata for package: scs-test-0704 (serial 13439826) (package.py:57)
2022-04-08 07:33:09,327 INFO: Fetching metadata for package: zeev-test (serial 13440618) (package.py:57)
2022-04-08 07:33:10,399 INFO: zeev-test no longer exists on PyPI (package.py:65)
2022-04-08 07:33:10,502 INFO: dustilock no longer exists on PyPI (package.py:65)
2022-04-08 07:33:10,558 INFO: Downloading: https://pypi.tuna.tsinghua.edu.cn/packages/79/8b/2e822598c7f15faf1ae5a79fef68d51932a2b0983622638b5a01b9e0e6db/anybinding-1.0.0.tar.gz (mirror.py:933)
2022-04-08 07:33:10,680 INFO: Downloading: https://pypi.tuna.tsinghua.edu.cn/packages/86/54/3e7b5fc5dde318c9a3246a605e44fe9581c5c8a4e143709160dfa405fac3/anybinding-1.0.1.tar.gz (mirror.py:933)
...
...
2022-04-08 07:34:19,334 INFO: Downloading: https://pypi.tuna.tsinghua.edu.cn/packages/fe/ba/a58bda355460b0ca31943beb4e270c9278162fd98d51f4e335d1b20e02e3/gardener-cicd-base-1.1656.0.tar.gz (mirror.py:933)
2022-04-08 07:34:26,038 INFO: Downloading: https://pypi.tuna.tsinghua.edu.cn/packages/59/ad/85e11a200aee64497fce01d80402601ba93562740dced1ad3b23d06895c1/cs18-api-client-0.0.2.2768.tar.gz (mirror.py:933)
2022-04-08 07:34:26,178 INFO: Storing index page: cs18-api-client - in /mnt/storage/data/web/simple/cs18-api-client (mirror.py:791)
2022-04-08 07:34:58,279 INFO: Storing index page: gardener-cicd-whd - in /mnt/storage/data/web/simple/gardener-cicd-whd (mirror.py:791)
2022-04-08 07:34:58,405 INFO: Storing index page: gardener-cicd-base - in /mnt/storage/data/web/simple/gardener-cicd-base (mirror.py:791)
2022-04-08 07:34:58,525 INFO: Generating global index page. (mirror.py:483)
2022-04-08 09:03:04,315 INFO: New mirror serial: 13441154 (mirror.py:507)
2022-04-08 09:03:04,545 INFO: 5 packages had changes (mirror.py:1043)
2022-04-08 09:03:04,546 INFO: Writing diff file to mirrored-files (mirror.py:1053)
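For reference, the roughly 90-minute figure follows directly from the two timestamps around the index generation step in the log above:

```python
# Gap between "Generating global index page." and the next log line,
# taken from the timestamps in the log above.
from datetime import datetime

start = datetime.fromisoformat("2022-04-08 07:34:58.525")
end = datetime.fromisoformat("2022-04-08 09:03:04.315")
print(end - start)  # 1:28:05.790, i.e. roughly 90 minutes
```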
And here is the related configuration file bs.conf:
[mirror]
directory = /mnt/storage/data
master = https://pypi.org
json = true
timeout = 300
workers = 10
hash-index = false
stop-on-error = false
delete-packages = true
compare-method = stat
download-mirror = https://pypi.tuna.tsinghua.edu.cn
download-mirror-no-fallback = false
[plugins]
enabled =
blocklist_project
[blocklist]
packages =
tf-nightly
tf-nightly-gpu
tf-nightly-cpu
tensorflow-io-nightly
pyagrum-nightly
torchrec_nightly
I added torchrec_nightly in addition to the given example because I found this package often fails to sync.
question
- I've noticed that on the server status page of the TUNA mirror, the sync status of pypi is shown as "xx minutes ago", which is effectively real-time, unlike the situation I'm seeing. Why is that, and how can I achieve what the TUNA server does?
- I use bandersnatch on a Raspberry Pi 4B to mirror PyPI because the TUNA docker image has some privilege problem when I run docker on the Pi. However, after the log output shows "0 packages to sync", the size of the final mirrored data is only 8949G, which is far smaller than the 9.75T shown in the server status of the TUNA mirror. Of course, I've excluded some of the nightly builds, just like the given pypi.sh in https://github.com/tuna/tunasync-scripts, so I think the size I get should be exactly the same as yours. What could cause this difference?
- I've also noticed that bandersnatch sometimes starts with "Resuming interrupted sync from local todo list.", even though the previous run did not report any interruption. This makes syncing very painful because I do not use crontab or anything else to restart it automatically. Besides, even if it did restart automatically, as I said above, it always takes about 90 minutes to regenerate the index. Under these two conditions the syncing job is pretty inefficient.
What type of storage are you using?
Regarding "tuna docker image has some privilege problem": please elaborate on this. We use our image on both x86_64 and aarch64 servers and there should not be any problem.
A SATA3 HDD with an EXT4 filesystem. The IO performance is about 120 MB/s. I don't think the IO performance is the cause; rather it is the bandersnatch logic, which always regenerates the whole index page no matter how many packages have actually changed.
I will report it later, because right now I only have limited access to aarch64 boards.
You can refer to #167; the symptom and the cause are the same.
When starting cleanly, bandersnatch will generate a "todo" file, and after a successful run it will be deleted. On startup, it detects whether the file exists; if the todo file exists, it will try to continue the last interrupted sync rather than start a new one.
About the size difference: it is simply because bandersnatch does not delete packages that have already been deleted from upstream.
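An illustrative sketch of that startup check (not bandersnatch's actual code; the todo file is assumed to live in the configured mirror directory, /mnt/storage/data in the configuration above):

```python
# Illustrative only: mimic the check bandersnatch does at startup.
from pathlib import Path

mirror_dir = Path("/mnt/storage/data")   # "directory" from bs.conf
todo = mirror_dir / "todo"

if todo.exists():
    # A previous run was interrupted, so the next run logs
    # "Resuming interrupted sync from local todo list." and works through
    # the packages recorded in this file instead of asking PyPI for changes.
    print("unfinished sync, will resume:", len(todo.read_text().splitlines()), "lines")
else:
    # Clean start: the last sync completed and the todo file was deleted.
    print("no todo file, last sync completed cleanly")
```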
Of course, that issue was submitted by me too...
And about the speed of generating the index page: the bottleneck is not the bandwidth of your HDD but its IOPS. In your particular case, it might also be related to CPU performance.
So why are you posting a new issue? Is there anything new that happened?
- "bandersnatch does not delete packages which are already deleted from the upstream", it that point to my
bandersnatch
on Pi4B orbandersnatch
on tuna server? - Conducting that is true, but why the size is different between tuna server and my pi server, we share the same configuration file.
As for why I opened a new issue: in this thread I've also posted other questions, mainly the mismatched size problem. And recently the "generating global index page" problem confused me again, so I re-posted it. It's my fault; I should have been more careful.
Yes, I agree that IOPS is the bottleneck. So I have some further questions (sorry to bother you again, and thank you for your patience):
- Is there a recommended IOPS, and what performance can be expected with it?
- How can I test my IOPS?
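As a rough starting point: dedicated tools such as fio are the usual way to measure raw IOPS. A quick and dirty sketch (paths taken from the configuration above) is to time how long it takes just to list and stat every project directory under web/simple, which approximates the small random metadata IO that generating the global index causes on an HDD:

```python
# Rough IO sanity check, not a proper benchmark: walk web/simple and stat
# every entry, then report how many metadata operations per second the
# disk sustains. Run with a cold page cache for a realistic number.
import time
from pathlib import Path

simple = Path("/mnt/storage/data/web/simple")   # path from bs.conf

start = time.monotonic()
count = 0
for entry in simple.iterdir():   # one entry per mirrored project
    entry.stat()                 # forces a metadata read (a seek on an HDD)
    count += 1
elapsed = time.monotonic() - start

print(f"stat'ed {count} entries in {elapsed:.1f}s, ~{count / max(elapsed, 1e-9):.0f} ops/s")
```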
Because bandersnatch won't delete packages that have already been deleted from upstream, and the initial sync on the TUNA server happened much earlier than the one on your Pi, there is more such garbage on the TUNA server and its size is therefore larger.
Bingo! I think that is the real reason! And since the "extra" data is essentially garbage, I can just ignore the size difference and still be sure that my local pypi mirror works well. Is that correct? There is no need to sync that "garbage" data.
Correct. However, I strongly recommend against self-building a mirror of pypi, especially using a Pi-like board and a magnetic hard drive. Instead, a local caching server might be more suitable.
Totally agreed. The reason I use a Pi board for the pypi mirror job is that it is portable and silent, and I can place it somewhere with a good network. Once I've confirmed that the pypi mirror is fully synced (this is the confusing part, which I will explain below), I will move the hard drive to a real server to provide the service, because the place where I will use this local pypi mirror has no internet access.
etc
- What is the local caching server you just mentioned? I have no idea what it is or how it differs from a self-built pypi mirror. Isn't it a pypi mirror station, or a brand new form of caching pypi data? Whatever it is, I still need a complete copy of the pypi mirror (like cloning from an rsync source); can the local cache server do that?
- The reason I cannot confirm whether the pypi mirror is completely synced is that every time bandersnatch finishes a run with "generating global index page...", it is not really finished. The next time you run it, you will find it starting over from a todo file, yet there was no hint that any error occurred during the previous run. I cannot describe or explain this very well; maybe it is just a coincidence.
I've just searched for "local cache server" and now I have some idea of what it is.
However, that is just a cache, not a full copy: you first have to request a package, and only then does the cache server cache it. But if there is a way to make the cache server keep a full copy, I think I could use it. Besides, what is the recommended solution for caching pypi? I found Nexus and apt-cacher, but they don't seem intended for mirroring pypi.
Actually, you can write in Chinese here. A full sync of PyPI puts a very heavy IOPS load on the storage. If you don't actually need to serve it offline, a caching approach is in fact enough; you don't even need special software, since an nginx reverse proxy can do the caching.
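To illustrate the caching idea, here is a toy, assumption-heavy sketch of such a cache in Python (in practice an nginx proxy_cache or a dedicated tool such as devpi would be used; the port and cache path below are made up):

```python
# Toy disk-backed caching proxy: serve a request from the local cache if we
# have it, otherwise fetch it from upstream once and keep a copy.
# Note: PyPI's simple pages link to files.pythonhosted.org directly, so a
# real caching setup must also proxy or rewrite those file URLs.
import hashlib
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path

UPSTREAM = "https://pypi.org"                 # where cache misses are fetched
CACHE_DIR = Path("/mnt/storage/pypi-cache")   # hypothetical cache location


class CachingHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        cached = CACHE_DIR / hashlib.sha256(self.path.encode()).hexdigest()
        if cached.exists():                               # cache hit
            body = cached.read_bytes()
        else:                                             # cache miss
            with urllib.request.urlopen(UPSTREAM + self.path) as resp:
                body = resp.read()
            CACHE_DIR.mkdir(parents=True, exist_ok=True)
            cached.write_bytes(body)
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), CachingHandler).serve_forever()
```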
I saw that most issues are in English, so I used English too, to stay consistent :)
Yes, what I actually need is to serve it offline, which is why I'm considering making a complete copy. The current situation is as follows:
- I'm storing onto a mechanical hard drive attached to a Raspberry Pi 4B and have already reached 8.9T. Download speed is not a problem, but as you said, the final index-page generation step needs IOPS to back it up. As for the size mismatch, based on the earlier replies, I will never reach the 9.8T shown in the server status, because there are undeleted packages on the TUNA mirror that my sync cannot download now.
- Now that the sync is in its later stage, it has become very slow, because many packages go through "storing index page" right after being fetched, so downloading and IO alternate, unlike earlier when it was all "Downloading".
- The later stage also involves generating the global index page, which takes 90 minutes every single time. The reason I keep running bandersnatch mirror is that I cannot tell whether the mirror has actually finished syncing (in the narrow sense, i.e. whether the last sync task completed without missing packages or terminating with an error, rather than being exactly up to date with the server, which changes all the time). So I often run the command just to check, and sometimes that starts a new sync, and in the new sync download errors sometimes really do occur, so I can never get out of this loop.
Also, do you have any guiding advice? If the goal is to produce a full offline PyPI mirror, is there a recommended solution?
For now, it looks like the only way to tell whether a sync task finished normally is to check bandersnatch's exit status.
For a full offline mirror, just build it with bandersnatch directly.
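A small sketch of that advice, wrapping bandersnatch mirror in a retry loop and relying only on its exit status (the config path and retry policy below are assumptions for illustration):

```python
# Re-run "bandersnatch mirror" until it exits with status 0, so an
# interrupted sync (the todo-file case above) is picked up automatically.
import subprocess
import time

CONFIG = "/etc/bandersnatch/bs.conf"   # assumed location of bs.conf
MAX_RETRIES = 5

for attempt in range(1, MAX_RETRIES + 1):
    result = subprocess.run(["bandersnatch", "--config", CONFIG, "mirror"])
    if result.returncode == 0:
        print("sync finished cleanly")
        break
    print(f"attempt {attempt} exited with {result.returncode}, retrying in 60s")
    time.sleep(60)
else:
    print("giving up after repeated failures")
```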