CPU/IO not fully utilized when under heavy load
pnorman opened this issue · 8 comments
The OSMF got a new tile server, and in scaling it up, I've observed some odd behavior.
When I throw a heavy load onto the server, a queue builds up. This is semi-expected, but the average time per tile, as computed by TimeRenderedZoom/ZoomRendered
becomes approximately 4.2 seconds. Normally it ranges from <1s for high zooms to 5-10s for zoom 13.
This can be seen in prometheus
With a bit of work I was able to see this with a partially filled queue
although in this case it's just the high zooms.
Looking at the code that produces the render time and number of tiles stats, they're set in
Lines 371 to 401 in 24a5752
Line 505 in 24a5752
maps[i].store->tile_stat
and render
, in ms.
I'm somewhat stumped here, but a few thoughts are
- this is limiting render capacity well below what it should be
- the machine has idle capacity throughout, particularly CPU
- why 4.2s?
- how does the average zoom 13 time per tile decrease?
Tied in with this is the fact it's starting to queue tiles when CPU is under 50%, while num_threads is approximately equal to the number of hardware threads.
In an attempt to isolate what's going on, I left the load normal and used render_list to render z17 tiles. It started off behaving as above, with 4 seconds per tile rendered and CPU usage <50%. Then without changing anything, throughput spiked up, time/tile went down, and CPU usage went to 100%.
render_list was with -n 80
and num_threads is 120, and the machine has 80 hardware threads
In an effort to reduce the problem down, I shut off apache2/mod_tile and tried render_list with an assortment of -n. Increasing -n doesn't increase throughput or CPU usage, except that sometimes it goes from not running at capacity to generating more tiles and being CPU bottlenecked.
I'm wondering, could some kind of lock be limiting how many requests per second it can deal with? But that wouldn't explain the bursts of good performance.
I wonder if this is related to
mod_tile/src/render_submit_queue.c
Lines 207 to 213 in 24a5752
This limits the queue size to 32, which might be too low modern hardware. But as far as I can tell, this file is only included by the render_* helper programs.
There are also a bunch of constants for mod_tile that might not make sense on modern hardware:
Lines 24 to 33 in 24a5752
mod_tile/includes/render_config.h
Lines 112 to 114 in 24a5752
We did some debugging on piasa today.
Increasing QMAX to 256 had no impact. Increasing REQ_LIMIT and DIRTY_LIMIT to 256 and 8000 had no impact. The following patchset had no impact
download.compton.nu_tmp_queue-concurrency.patch.txt.
The testing with the patchset and with the REQ_LIMIT and DIRTY_LIMIT changes was not identical to the previous testing, because it was built with march=native.
Running for longer established that the spikes of working at 100% happen every 8.4 minutes. In tests on the 18th, they were happening 7.25 minutes, and on the 15th, every 7.5 minutes.
I attached gdb and got some useful information with a backtrace of all the threads. Of the 81 threads, 61 of them were at futex_wait
within mapnik::freetype_engine::create_face_impl
, inside the rendering of an image. 9 threads were doing other Mapnik work, 4 were in __pthread_cond_wait_common
with no Mapnik calls, 2 were within PQgetResult, and the last two were overall renderd threads.
Thread 80 (Thread 0x7f97cbfff640 (LWP 495657) "renderd"):
#0 __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x55a35fb6c26c) at ./nptl/futex-internal.c:57
#1 __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x55a35fb6c26c) at ./nptl/futex-internal.c:87
#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x55a35fb6c26c, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3 0x00007f98de693ac1 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55a35fb6c218, cond=0x55a35fb6c240) at ./nptl/pthread_cond_wait.c:503
#4 ___pthread_cond_wait (cond=0x55a35fb6c240, mutex=0x55a35fb6c218) at ./nptl/pthread_cond_wait.c:627
#5 0x000055a35e730a1b in ?? ()
#6 0x000055a35e7343cc in ?? ()
#7 0x00007f98de694b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#8 0x00007f98de726a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Thread 78 (Thread 0x7f97dd7fa640 (LWP 495655) "renderd"):
#0 futex_wait (private=0, expected=2, futex_word=0x7f98df96b060 <mapnik::singleton<mapnik::freetype_engine, mapnik::CreateUsingNew>::mutex_>) at ../sysdeps/nptl/futex-internal.h:146
#1 __GI___lll_lock_wait (futex=futex@entry=0x7f98df96b060 <mapnik::singleton<mapnik::freetype_engine, mapnik::CreateUsingNew>::mutex_>, private=0) at ./nptl/lowlevellock.c:49
#2 0x00007f98de698082 in lll_mutex_lock_optimized (mutex=0x7f98df96b060 <mapnik::singleton<mapnik::freetype_engine, mapnik::CreateUsingNew>::mutex_>) at ./nptl/pthread_mutex_lock.c:48
#3 ___pthread_mutex_lock (mutex=0x7f98df96b060 <mapnik::singleton<mapnik::freetype_engine, mapnik::CreateUsingNew>::mutex_>) at ./nptl/pthread_mutex_lock.c:93
#4 0x00007f98def6cf13 in mapnik::freetype_engine::create_face_impl(...) () from target:/lib/libmapnik.so.3.1
#5 0x00007f98def6d3a1 in mapnik::freetype_engine::create_face(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mapnik::font_library&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::pair<int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::unique_ptr<char [], std::default_delete<char []> >, unsigned long>, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::pair<std::unique_ptr<char [], std::default_delete<char []> >, unsigned long> > > > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::pair<int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::unique_ptr<char [], std::default_delete<char []> >, unsigned long>, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::pair<std::unique_ptr<char [], std::default_delete<char []> >, unsigned long> > > >&) () from target:/lib/libmapnik.so.3.1
#6 0x00007f98def6d817 in mapnik::face_manager::get_face(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from target:/lib/libmapnik.so.3.1
#7 0x00007f98def6dc63 in mapnik::face_manager::get_face_set(mapnik::font_set const&) () from target:/lib/libmapnik.so.3.1
#8 0x00007f98def6de9e in mapnik::face_manager::get_face_set(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::optional<mapnik::font_set>) () from target:/lib/libmapnik.so.3.1
#9 0x00007f98df333038 in ?? () from target:/lib/libmapnik.so.3.1
#10 0x00007f98df32dcc6 in ?? () from target:/lib/libmapnik.so.3.1
#11 0x00007f98df32ea31 in ?? () from target:/lib/libmapnik.so.3.1
#12 0x00007f98df32eceb in ?? () from target:/lib/libmapnik.so.3.1
#13 0x00007f98df338868 in ?? () from target:/lib/libmapnik.so.3.1
#14 0x00007f98df3faf45 in ?? () from target:/lib/libmapnik.so.3.1
#15 0x00007f98df74817a in mapnik::agg_renderer<mapnik::image<mapnik::rgba8_t>, mapnik::label_collision_detector4>::process(mapnik::text_symbolizer const&, mapnik::feature_impl&, mapnik::proj_transform const&) () from target:/lib/libmapnik.so.3.1
#16 0x00007f98def67885 in mapnik::feature_style_processor<mapnik::agg_renderer<mapnik::image<mapnik::rgba8_t>, mapnik::label_collision_detector4> >::render_style(mapnik::agg_renderer<mapnik::image<mapnik::rgba8_t>, mapnik::label_collision_detector4>&, mapnik::feature_type_style const*, mapnik::rule_cache const&, std::shared_ptr<mapnik::Featureset>, mapnik::proj_transform const&) () from target:/lib/libmapnik.so.3.1
#17 0x00007f98def68225 in mapnik::feature_style_processor<mapnik::agg_renderer<mapnik::image<mapnik::rgba8_t>, mapnik::label_collision_detector4> >::render_material(mapnik::layer_rendering_material const&, mapnik::agg_renderer<mapnik::image<mapnik::rgba8_t>, mapnik::label_collision_detector4>&) () from target:/lib/libmapnik.so.3.1
#18 0x00007f98def68eee in mapnik::feature_style_processor<mapnik::agg_renderer<mapnik::image<mapnik::rgba8_t>, mapnik::label_collision_detector4> >::apply(double) () from target:/lib/libmapnik.so.3.1
#19 0x000055a35e7350c7 in ?? ()
#20 0x00007f98de694b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#21 0x00007f98de726a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Hi @pnorman, I'm seeing a pull request which seems to address this issue, #334. As well as another related pull request, #335. Let me know if you have yet had a chance to try out those changes and if they have resolved this issue for you so that this issue can be closed out (or not.) I am currently planning to get a new release put together soon and so just wanted to gather some information and close some issues & pull requests out beforehand.