saviorand/lightbug_http

Performance Improvements

saviorand opened this issue · 8 comments

Parallelization and performance optimizations

I appreciate you may not have optimised yet, but FYI I get approximately:

  • 50 req/s with mojo lightbug.🔥
  • 100 req/s compiled

whereas Python Flask does about 1,000 req/s on a single core.
Performance profile attached
[flame graph image]

Never mind, it's just something in the Welcome handler.
With my own handler, I get 2,700 req/s.
[flame graph image]

Woah, nice! Thanks for testing! Yes, the welcome handler serves an HTML page with an image, which might be slower. Can I ask how you're profiling this? The charts look sick.

The profile was taken with Linux's built-in kernel profiler and the "perf" user-mode tool; I couldn't find a profiler specifically for Mojo yet. This technique does have the advantage of showing all user- and kernel-mode activity, i.e. the libc and CPython work.

I suspect there is a lot of memory allocation or copying happening in the welcome handler, but I'm not all that familiar with Mojo and haven't found a technique to profile memory allocation.
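Lacking a Mojo-aware heap profiler, one crude way to gauge allocation pressure with stock perf is to count page faults and memory-related syscalls over a sampling window. This is a sketch, not something from the thread; the "lightbug" process name and the chosen event list are assumptions:

```shell
# Rough allocation-pressure proxy: page faults plus mmap/brk syscalls
# over a 10 s window of the running server. Needs root for perf.
sudo perf stat -e page-faults,syscalls:sys_enter_mmap,syscalls:sys_enter_brk \
    -p `pgrep lightbug` -- sleep 10
```

A handler that copies or allocates heavily per request should show noticeably higher counts under load than at idle.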

I'm also suspicious that the use of Python sockets might be suboptimal, but what do I know?

The flame graph is via Brendan Gregg's FlameGraph tools: https://www.brendangregg.com/perf.html

# grab the FlameGraph scripts
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
# sample stacks at 99 Hz from the running lightbug process for 60 s
sudo perf record -F99 -g -p `pgrep lightbug` -- sleep 60
# fold the recorded stacks and render them as an interactive SVG
sudo perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > perf.svg
google-chrome perf.svg

You might also enjoy "perf top".

Yeah, 1,500 req/s with the base64 image removed.

@crunchy-vonage we're actually doing external_calls to C in the Mojo server implementation in the sys folder (this one is enabled by default), not talking to Python! Python is only invoked in the separate Python implementation in the python folder.
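For anyone curious what that mechanism looks like, Mojo's FFI into libc is roughly as below. This is a minimal sketch, not lightbug's actual code: external_call lives in sys.ffi, and exact signatures vary across Mojo versions.

```mojo
from sys.ffi import external_call

fn main():
    # Call libc's getpid() directly; no Python interpreter involved.
    # lightbug_http's sys server uses the same mechanism for the
    # socket/bind/listen/accept calls (illustrative only).
    var pid = external_call["getpid", Int32]()
    print("pid:", pid)
```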

I've made some improvements in #40 , getting 10,468 requests per second now with wrk. wrk is the tool used, among other things, for the TechEmpower benchmarks. I have a fork for potential submission here, but the performance is not satisfying enough yet, and we don't even have JSON serialization support, which is needed to submit it to the listing. Would be cool if we can make an entry at some point though.
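The exact wrk flags aren't shown in the thread, but judging from the "1 threads and 1 connections" header in the output below, the invocation was presumably something like this (flags inferred, not confirmed):

```shell
# 1 thread, 1 connection, 1 second run, with latency distribution
wrk -t1 -c1 -d1s --latency http://localhost:8080
```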

Running 1s test @ http://localhost:8080
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.67ms   11.33ms  78.56ms   94.60%
    Req/Sec     9.56k     2.40k   11.29k    72.73%
  Latency Distribution
     50%   53.00us
     75%   58.00us
     90%   98.00us
     99%   66.70ms
  10468 requests in 1.10s, 1.59MB read