audetto/AppleWin

New features (SDL)

Closed this issue · 48 comments

Qt app: quit from menu. Ctrl-Q
SDL2: F6 full screen and some command line options.

SDL2: F2 quit, Left / Right Alt Open Solid Apple
--qt-ini will reuse the Qt config file.

OK, the keyboard works.
But I do not like it. I was not sure if I should take a physical view of the keyboard or an ASCII one, and the result is a bit of both.
It is impossible to do CTRL-ASCII. Need to sort it out.
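For what it's worth, the ASCII-oriented approach usually boils down to masking the pressed character when CTRL is held; a minimal sketch (asciiWithCtrl is a hypothetical helper, not code from this repo):

```cpp
#include <cstdint>

// Hypothetical helper: map a printable ASCII key to the control code the
// Apple II expects when CTRL is held. Masking with 0x1F turns
// 'A' (0x41) or 'a' (0x61) into 0x01 (Ctrl-A), 'G' into 0x07 (bell), etc.
uint8_t asciiWithCtrl(const uint8_t ascii, const bool ctrl)
{
  if (!ctrl)
    return ascii;
  return ascii & 0x1F;
}
```

This only works once the key has already been resolved to ASCII, which is exactly where a physical-keyboard view and an ASCII view collide.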

This all looks very promising and it's great to see the SDL build included alongside the Qt build. A quick informal test (Mario Bros) shows definite input delays for Qt. SDL appears more responsive, but although CPU is only about 30% on the Pi4, emulation appears to run slower than normal (enhanced speed is unchecked). The --qt-ini flag makes changing configurations very easy. Exciting to see the rapid progress!

Added audio.
Currently there is a 200ms delay. I need to see how much it can be reduced while avoiding underruns.

Enhanced speed only affects the emulator when the disk is spinning.
CPU utilisation is not always obvious. Are you looking at 30% of 1 CPU, or 30% of all CPUs?
On the Pi3 it uses 1 entire CPU, and it is easy to fall behind.

I need to find some tradeoff quality / speed.

Would it be possible to allocate audio processing to another core? I can imagine synchronization might be an issue... what about dedicating a core to the 6502 CPU responsibilities and another core to everything else (or developing the model of add-on cards using separate threads / cores)?

Have you reached out to the main AppleWin group to inquire about your code repository being wrapped in as a first-class citizen?

The problem with what you suggest is that it requires a departure from AppleWin, which would make merging 10x more complicated.
The only thing I could try to put on a different thread is the non-AppleWin audio processing, but I doubt it will make any difference. At the moment it is just copying the buffer to SDL.

What might be more effective is a decrease in video quality where most of the time is actually spent.

As for merging the code with AW, you could mention it there and see what they say: AppleWin#538

I periodically create PRs to fix compiler issues in the files that are shared, but have never tried to unify the code completely. AW would require some work to become more modular, which I don't think they are interested in.

I stumbled across the following value in the linapple.conf file from https://github.com/linappleii/linapple:
Singlethreaded = 0

The comment surrounding this config value reads "By default the emulator's draw code, a large share of the processing, is performed in a separate thread, probably on a different core." I can confirm that on the Pi two cores are being consumed for smooth sound & video.

I understand not wanting to deviate from the AppleWin core, but since the video component deviates anyway, offloading video to a separate core may be reasonably straightforward and reconcile issues on Pi4.

Could you please do me a favour and check this

https://github.com/audetto/AppleWin/blob/master/source/frontends/sa2/emulator.cpp#L197

Using top, check the %CPU with master, and then comment out lines 197-205.

The emulator will not run and just render a black screen.
For me on a Pi3, it still goes at 77% CPU which means the SDL2 rendering is the bottleneck.

No idea how to improve it though, short of reducing FPS.

With top, CPU utilization is at 97-98%. Looks like pretty close to a full core being utilized based on overall CPU utilization of around 33%. Don't see a latency / performance problem with the normal screen.

Switching to fullscreen, I now see lag (guessing top would show 100% utilization). Commenting out lines dropped overall utilization by 5-8%, but obviously isn't displaying anything.

The above referenced linapple repository successfully runs SDL on a separate core, but it's SDL 1.2. I'm attaching A2 emulator code I've been working on from James Hammons that pulls some of the SDL2 initialization routines from GSPlus as a possible point of reference.
apple2.cpp.txt

The code may or may not be useful... the emulator itself only uses ~2/3 of a CPU core, so I have no idea whether it would span an additional core if needed.

An additional note... SDL seems to be very sensitive to how it's initialized. It's still early to say definitively, but I'm seeing that initializing with the wrong audio values can sometimes sap performance.

I am still puzzled by the results.
In Windows AW runs at < 1% CPU, on linux (same machine) at about 20%.

What definitely takes a long time is the video update, which you can switch off to check.

If you disable this

https://github.com/audetto/AppleWin/blob/master/source/frontends/sa2/emulator.cpp#L199

it will still run the CPU and draw the (black) screen, but will not update the Apple bitmap.
This takes a very long time.

I suspect that other emulators have a less precise video generation and so run quicker. I originally wrote my own video update (non precise) but dropped it as it was a lot of extra work.

You are definitely right about SDL, and I still have to find a quick way to just paint a black screen at 60FPS on a Pi using SDL2.

Try this
I get a 10% CPU improvement

https://github.com/audetto/AppleWin/tree/pi

On a Pi3 with FakeKMS ~77% at 1x size, ~99% at 2x size.
What do you get on Pi4?

I have tried
https://github.com/robmcmullen/apple2

and it runs at very much the same speed as my SDL port.
I tweaked the makefile for optimisation, but it runs at 98% CPU with very bad audio.

My version runs at 78% or 99% as mentioned above and in 1x size audio is ok.

I think they all suffer the same problem: SDL screen drawing.

If you find any other emulator that runs quickly on a Pi at 60FPS, I am happy to copy their video rendering.

I've really been surprised by the performance differential between SheepShaver (PPC Mac emulator) versus several different Apple II emulators. I'm not sure whether the build I'm running is using SDL1.2 or SDL2, but resource utilization is only about 10% of CPU. My guess is the A2 emulators dedicate more cycles to ensure proper 6502 timing (whereas PPC Mac emulators mainly don't seem to care about CPU timing). It looks like SheepShaver can be compiled against either SDL1.2 or SDL2, which may be a solid test to identify whether SDL2 is the performance culprit.

For GSPlus (and several other A2 emulators), increasing audio sample buffer initialization up to 4096, (e.g. wanted.samples = 4096;) addressed both sound problems AND reduced CPU load for SDL1.2 AND SDL2. A key advantage of SDL2 is that you get "free" scaling and texture overlays. It's allowed me to make the screen resizeable (up to 1080p) without significantly affecting performance as well as simulating scanlines. HOWEVER, in order to do this, I had to disable where the emulator was relying on code routines to double the image size (i.e. switched emulator FROM running at 560x384 with software scaling BACK to 280x192 relying on SDL2 hardware scaling for arbitrary resolution). This emulator is running at 60FPS along with some video improvements at about 66% core utilization. It seems that increasing the video buffer from 280x192 to 560x384 is pushing too much data to the SDL video buffer.

I haven't taken a close look at how you're rendering video yet (i.e. a whole frame at a time or only refreshed regions). I don't know of any easy way to calculate a "buffer checksum," but for some emulators I think CPU utilization would go WAY down if it were possible to quickly calculate the video buffer checksum (something like an array_sum function) and only push video updates to SDL1.2/2 video buffer if the checksum changes.

I'm not really surprised that DirectX on Windows is rendering faster than SDL2 on Linux. Unless you're using OpenGL within SDL, it's probably not taking full advantage of the GPU.

GSPlus idles at about 10% CPU utilization, which is much more in line with SheepShaver (PPC Mac emulator). It uses SDL2 and is probably the best reference (check the Issues section for tips on proper compilation). Linapple happily runs over 100% (top), meaning it appears to be truly multi-core, but it's SDL1.2. The emulator I've optimized by James Hammons runs at 66% of a CPU core. Happy to drop ARM compiled binaries if you want to check for yourself. Alternatively, happy to supply whatever source you'd like (if you can't find it on GitHub).

One thing at a time.

  1. We are only talking about Pi (3/4). On my PC it runs at 20% CPU maximum, no matter what the window size is.
  2. I've tried https://github.com/robmcmullen/apple2 and it behaves exactly like my code on a Pi3. If you believe it is faster, then please fork it, make all necessary changes and I will compare it.
  3. I thought that this was your code, I will try the cpp file directly
  4. SheepShaver, GSPlus: please post some github links I can try (again, please fork and modify if they need tweaks)
  5. Without running it: Vice has exactly the same drawing code: https://github.com/hpingel/vice-emu-mirror/blob/9842c45458aea54a05cbf081636cb013fa4d2de5/vice/src/arch/sdl/video_sdl2.c#L686
  6. I already asked the Pi forums about fast screen drawing and did not get any useful ideas: https://www.raspberrypi.org/forums/viewtopic.php?f=67&t=259450&p=1581132#p1581073
  7. I have an idea about splitting SDL to a separate thread which I will try today

I sat down and took a look at your rendering code. I think you're spending a lot of time on the memcpy operation inside emulator.cpp (~40% of CPU time on a Pi4) to basically copy your video buffer for SDL (instead of using SDL's own buffer, I believe). My own past tests with memcpy weren't very good for a block of data as large as what you're generating (I think you mentioned 560 x 384 in your SDL forum post).

Here's relevant (working) code that renders without a memcpy operation:
SDL_LockTexture(sdlTexture, NULL, (void **)&scrBuffer, &scrPitch);
... operations that draw graphics / text to scrBuffer ...
SDL_UnlockTexture(sdlTexture);
SDL_RenderCopy(renderer, sdlTexture, NULL, NULL);

Here's a really simple example that renders a black screen:
SDL_LockTexture(sdlTexture, NULL, (void **)&scrBuffer, &scrPitch);
memset(scrBuffer, 0, VIRTUAL_SCREEN_WIDTH * VIRTUAL_SCREEN_HEIGHT * sizeof(uint32_t));
SDL_UnlockTexture(sdlTexture);
SDL_RenderCopy(renderer, sdlTexture, NULL, NULL);

You've got a similar set of operations within your refreshTexture method, but are returning a rectangle and then pursuing subsequent rendering operations. I tried a relatively straightforward code swap, but am getting a black screen. Top, however, shows %CPU right at 60%, so if you can get rid of memcpy I believe it'll run with breathing room on a Pi 4.

If you do manage to engage the Broadcom GPU blob mentioned within the SDL forums (perhaps using OpenGL ES), combined with eliminating memcpy, I think you'll be able to get this running with desired performance characteristics on a Pi 3.

Yes,

this is something I changed here

https://github.com/audetto/AppleWin/tree/pi

It seems that SDL_UpdateTexture is faster, even though the docs suggest otherwise, and it requires no memcpy.

The problem with your code is that AW manages its own memory buffer for the video, so it would require deeper changes.
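For reference, the SDL_UpdateTexture path reduces to something like the fragment below (a sketch only: texture, renderer and frameBuffer are stand-ins for whatever the frontend already holds):

```c
/* Sketch: push AW's existing framebuffer straight to the texture,
   with no intermediate memcpy into a locked texture. */
SDL_UpdateTexture(texture, NULL, frameBuffer, width * (int)sizeof(uint32_t));
SDL_RenderCopy(renderer, texture, NULL, NULL);
SDL_RenderPresent(renderer);
```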

I think your best bet is to really investigate how GSPlus is doing things:
https://github.com/digarok/gsplus (v015 branch)

The only change I've made locally is documented here (to get sound working): digarok/gsplus#106

When I start this up, the emulator indicates that OpenGL is being used. Running at 2.8 MHz, top indicates about 10% CPU utilization. At 8 MHz it's up to 25%. At "unlimited" I'm hitting 100% CPU, but that's no surprise. Looks like the key is to get OpenGL involved. Based on what I'm seeing with GSPlus, that should fix the problems on the Pi3 too.

Try this

https://github.com/audetto/AppleWin/tree/threads

I've moved AW CPU to a separate thread.
CPU utilisation overall has gone up, but I can hear audio without glitches both at 2x size and full screen.
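The split can be pictured as a producer/consumer pair: the emulated CPU produces frames under a mutex, and the render side grabs whichever frame is latest. A toy stand-alone sketch of the idea (plain C++, no SDL; all names are made up):

```cpp
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Toy model of the two-thread split (no SDL): the "CPU" thread produces
// frames under a mutex; the "render" side polls for the latest one.
// Only the shared frame is protected, mirroring the idea that the
// expensive vsync wait itself stays outside the lock.
struct SharedFrame
{
  std::mutex mutex;
  std::vector<uint32_t> pixels = std::vector<uint32_t>(280 * 192, 0);
  uint64_t frameNumber = 0;
};

// Runs the producer to completion while the caller's loop consumes;
// returns the last frame number the "render" side observed.
uint64_t runSimulation(const uint64_t frames)
{
  SharedFrame shared;

  std::thread cpu([&shared, frames] {
    for (uint64_t i = 0; i < frames; ++i)
    {
      std::lock_guard<std::mutex> lock(shared.mutex);
      shared.pixels[0] = static_cast<uint32_t>(i); // pretend emulation output
      ++shared.frameNumber;
    }
  });

  uint64_t lastSeen = 0;
  while (lastSeen < frames) // "render" side: grab whatever frame is latest
  {
    std::lock_guard<std::mutex> lock(shared.mutex);
    lastSeen = shared.frameNumber;
  }

  cpu.join();
  return lastSeen;
}
```

If the render side misses a vsync it simply picks up a later frame next time round; the CPU thread is almost never blocked.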

Tested threads branch. I do see that CPU is exceeding 100% (multi-threading) and emulator is playable at full-screen with a minor increase in load. Sound, for me, is still significantly delayed (by at least a few seconds).

As another point of reference, GSPort (https://github.com/david-schmidt/gsport/) is running at under 5% utilization (aoss ./gsportx for sound). Pretty sure it's not SDL2 though.

Could this be useful: digarok/gsplus#58?

https://github.com/digarok/gsplus : this one is fast because it does not redraw at a constant rate. If the screen changes, CPU goes up. It is a good idea, but it needs cooperation from AW. The code was more complicated than I was able to understand at a quick glance. I put some counters around the texture update and render copy.

https://github.com/david-schmidt/gsport/ : this one uses Xlib and maybe the same trick as gsplus. I don't want to try Xlib. If you can put together a simple loop refreshing at 60Hz, and it is fast, then we can move from SDL to Xlib (a very sad decision).

digarok/gsplus#58 : it does not say much.

Have you run a profiler on it? Seems like gprof and Gperftools work on Pi.

I actually rewrote the core Windows BitBlt/StretchBlt in GSport back in 2015 to add full-screen integer scaling. I'm pretty sure some partial redraws were done, but I completely forget the details, sorry. I did test other GSport features I was developing on a Pi 2 and performance seemed fine.

Short of profiling SDL2 and the Pi kernel, I don't know what else to do.

In order to increase the SNR, I've created a small project that does exactly the same as this SDL2 port of AW, so we can experiment and find the best configuration.

https://github.com/audetto/SDL_Demo

Compiled in Release, on a Pi3 I get 66% CPU just to redraw the screen.
This leaves very little for the emulator (which could be profiled on ARM anyway).

If anybody can do better than this, I'd be happy to know.

Doing a smart update of the screen requires invasive changes to AW video update routines, which will not happen anytime soon.
Running at 30Hz is another possibility.

This is a good approach. I took a few stabs at it and don't see a way to significantly reduce the CPU load. It DOES look like the problem relates to SDL and there are special ways to build SDL for Pi that interface directly with the Broadcom GPU (which then has to be statically linked). A discussion here that looks particularly relevant: grimfang4/sdl-gpu#87

...and this: https://sourceforge.net/projects/raspberry-pi-cross-compilers/

In that discussion they were suggesting OpenGL ES, and if one uses the opengles2 driver in SDL, CPU usage drops to 42% in the demo. Good.

I've added a few options to ./sa2 (see --help):

At the end of the run, it will print stats about timings:

Video refresh rate: 60 Hz, 16.67 ms
Global:  [. .], total =    7789.16 ms, mean =    7789.16 ms, std =       0.00 ms, n =      1
Events:  [0 M], total =      22.42 ms, mean =       0.05 ms, std =       0.17 ms, n =    471
Texture: [0 M], total =     113.32 ms, mean =       0.24 ms, std =       0.06 ms, n =    471
Screen:  [0 .], total =    7624.87 ms, mean =      16.19 ms, std =       1.66 ms, n =    471
CPU:     [1 M], total =     647.21 ms, mean =       1.34 ms, std =       0.48 ms, n =    484
Expected clock: 1020484.45 Hz, 7.74 s
Actual clock:   1014560.11 Hz, 7.79 s

The meaning of [0 M] is: 0/1 indicates which thread, and M whether it is in the mutex-protected area.

  • events: SDL events and audio
  • texture: SDL_UpdateTexture
  • screen: SDL_RenderCopyEx and SDL_RenderPresent (this includes vsync)
  • cpu: AW's code

They do not include time spent in locking.

The clock shows expected vs actual speed (crucial for correct audio play).
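The total/mean/std columns only need a tiny running accumulator per timer; a sketch of the bookkeeping (hypothetical class, not the actual code):

```cpp
#include <cmath>

// Minimal running-statistics accumulator for per-call timings, enough to
// print the total / mean / std / n columns of the table above.
class TimeStats
{
public:
  void add(double ms)
  {
    ++n_;
    sum_ += ms;
    sumSq_ += ms * ms;
  }

  int n() const { return n_; }
  double total() const { return sum_; }
  double mean() const { return n_ ? sum_ / n_ : 0.0; }
  double stdDev() const
  {
    if (n_ < 2)
      return 0.0;
    const double m = mean();
    return std::sqrt(sumSq_ / n_ - m * m); // population standard deviation
  }

private:
  int n_ = 0;
  double sum_ = 0.0;
  double sumSq_ = 0.0;
};
```

Each timed region calls add() with its elapsed milliseconds, and the table is printed from the accumulated values at shutdown.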

FYI, the changes drop CPU from ~50% (top) on Pi4 to 27%. Just about cuts resource usage in half.

This is what implicitly happens with the 2-thread version (I think).
If the main thread lags behind, it will skip to the next vsync without affecting the CPU thread too much (the vsync is not mutex protected).
It will delay audio by 16ms in the worst case (just missed the vsync), but I am now running with a 200ms buffer, which is about 12 frames (probably excessive anyway).

What I was toying with is exactly what you said: trying to be smart about detecting duplicate frames.
Unfortunately, doing it "outside" the AppleWin bitmap presents some challenges:

  1. check sum: one must scan the whole buffer always, and this seems to take forever (is there a super fast CRC for ARM? we do not need "cryptographically" secure hash, just a quick check). This is the code I tried and it does not perform well at all:
  template <class T>
  inline void hash_combine(std::size_t& seed, T const& v)
  {
    seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
  }

  size_t seed = 0;
  for (size_t i = 0; i < width * height * 4; ++i)
  {
    hash_combine(seed, *(data + i));
  }
  2. detect the updated "rectangle": in this case I need to scan it for the first diff (from both ends, ideally). Incredible how slow memcpy can be. True, you only memcpy between the first and last diff, but in total one still has to check or copy the entire buffer every time.

With the latest findings about opengles2, none of this is urgent, but I find the problem challenging and interesting, so I will try to see what can be done.

Here are a few compiler options compatible w/ cmake:
set(CMAKE_CXX_FLAGS "-Wall -Wextra -fomit-frame-pointer -mcpu=cortex-a72 -mfloat-abi=hard -mfpu=crypto-neon-fp-armv8 -mneon-for-64bits")

The 'crypto' option for the FPU might help with those hashes. It enables the ARMv8 cryptographic instructions (AES/SHA), so you may need to choose your hash algorithms carefully for it to kick in.

  1. check sum: one must scan the whole buffer always, and this seems to take forever (is there a super fast CRC for ARM? we do not need "cryptographically" secure hash, just a quick check). This is the code I tried and it does not perform well at all:
    seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);

That is very complicated considering we are trying to fix a performance problem. ; - ) Using an 8-bit (size_t) hash is also a bad idea because you only have 256 values, so a high chance of collision.

I would just XOR or ADD the data to get the simplest checksum possible. Because we want the fastest result, and also the smallest chance of hash (checksum) collision, we should choose the longest data unit available. If you can use 64-bit then do that, otherwise just a running 32-bit XOR or ADD should be good enough.

It's probably worth looking at the generated assembly language and using it to optimize your C++ loop. For instance, it might be like 6502 assembly in that counting backwards to 0 is more efficient than counting from 0 up to a constant.

Cheers,
Nick.

All you say is true except: size_t is 32 / 64 bits depending on architecture.

You are right about the data size as well: the loop should work over 32/64 bits at least; right now it goes byte by byte.

Of course, you're right. I've been working in C#, JavaScript, and PowerShell for weeks, so my C++ personality is paged out. ; - )

I thought I would have a quick look in VS 2017, and got a surprise.

Counting down, I was pleased that it unrolled the loop, and this version took 20 microseconds:

	int const length = 560 * 192;
	int data[length] = {};
00B4109C  mov         esi,4600h  
	auto t1 = std::chrono::high_resolution_clock::now();
00B410A1  adc         dword ptr [esp+10h],edx  
	size_t result = 0;
00B410A5  xor         edx,edx  
	for (int i = length - 1; i >= 0 ; i -= 1)
	{
		result ^= (data[i]);
00B410A7  mov         ecx,dword ptr [eax+10h]  
00B410AA  lea         eax,[eax-18h]  
00B410AD  xor         ecx,dword ptr [eax+24h]  
00B410B0  xor         ecx,dword ptr [eax+20h]  
00B410B3  xor         ecx,dword ptr [eax+14h]  
00B410B6  xor         ecx,dword ptr [eax+1Ch]  
00B410B9  xor         ecx,dword ptr [eax+18h]  
00B410BC  xor         edx,ecx  
00B410BE  sub         esi,1  
00B410C1  jne         main+0A7h (0B410A7h)  
00B410C3  mov         dword ptr [esp+24h],edx  
	}

If your compiler doesn't do this you could manually unroll the loop - which didn't change the time here of course:

	for (int i = length - 1; i >= 0 ; i -= 4)
	{
		result ^= (data[i]);
		result ^= (data[i - 1]);
		result ^= (data[i - 2]);
		result ^= (data[i - 3]);
	}

But then I tried counting up, and VS unleashed the SIMD magic. This code ran in 11 microseconds:

	for (int i = 0; i < length; i += 1)
	{
		result ^= (data[i]);
00B810A1  movups      xmm0,xmmword ptr [esp+eax*4+28h]  
00B810A6  pxor        xmm1,xmm0  
00B810AA  movups      xmm0,xmmword ptr [esp+eax*4+38h]  
00B810AF  add         eax,8  
00B810B2  pxor        xmm2,xmm0  
00B810B6  cmp         eax,1A400h  
00B810BB  jl          main+0A1h (0B810A1h)  
	int const length = 560 * 192;
	int data[length] = {};
00B810BD  pxor        xmm1,xmm2  
00B810C1  movaps      xmm0,xmm1  
00B810C4  psrldq      xmm0,8  
00B810C9  pxor        xmm1,xmm0  
00B810CD  movups      xmm0,xmm1  
00B810D0  psrldq      xmm0,4  
00B810D5  pxor        xmm1,xmm0  
00B810D9  movd        dword ptr [esp+24h],xmm1  
	}

I would have to look those instructions up(!) but I know ARM has SIMD these days too.

Cheers,
Nick.

Audio: made the emulator speed stick to wall clock.
Removed some hacks around audio speed to leave AW adaptive algorithm to decide.
Press F1 during emulation and it will print what it thinks the audio buffer size / queue is:

Channels: 1, buffer: 32768, SDL:  8804, queue: 0.47 s
Channels: 2, buffer: 45000, SDL: 65536, queue: 0.63 s

Channels 1 is Speaker, 2 is Mboard.
The rest is the actual number of bytes to be played in the internal and SDL buffers, and queue is the total lag in seconds.
It is probably twice as bad as AW in Windows but should be stable at least.

Awesome to see you making headway! I've been investigating ways to integrate a proper UI (Gtk3 or Qt5) with SDL2, in order to provide responsiveness along with a full-fledged interface. SDL2 provides the SDL_CreateWindowFrom(window_id) call, which I'm able to attach to a window I create with Gtk3. It should, in theory, work with something like Qt's WId QWidget::winId() const (https://doc.qt.io/qt-5/qwidget.html#winId).

There's not a whole lot of documentation available for mixing SDL with UI libraries, but it looks like Bsnes (https://github.com/bsnes-emu/bsnes) is using SDL2 with Gtk2, Gtk3, Qt4, and Qt5 (apparently selectable at compile time). It might be possible to come full circle back to the Qt interface you built, along with the optimized SDL2 rendering code.

Can you post an example of how you display a Gtk dialog in SDL?

Unfortunately I don't have Gtk code that goes further than attaching SDL2 to a Gtk3 window. For my own needs, I think ImGui (https://github.com/ocornut/imgui) is the more straightforward approach. It provides GUI elements with native SDL2 support. It's pretty lightweight w/ extensive demo code. Since it uses the native rendering engine (e.g. SDL2) it doesn't require the "window hack" (for Gtk/Qt) that probably isn't portable to Windows and appears reasonably cross-platform.

Since you're rewriting the display mechanism, I assume you have access to it, but the caveat with ImGui is that it ties into the main rendering loop. If you're interested in taking the ImGui route, I'll try to throw together sample dialog code.
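For a flavour of what that looks like, an ImGui dialog is just a handful of immediate-mode calls issued every frame between ImGui::NewFrame() and ImGui::Render(); a sketch (the labels and the diskName/ejectDisk names are made up):

```cpp
// Sketch of an immediate-mode dialog, issued once per rendered frame.
static bool showSettings = true;
if (showSettings)
{
  ImGui::Begin("Settings", &showSettings);  // closable window
  ImGui::Text("Disk 1: %s", diskName);      // diskName: hypothetical state
  if (ImGui::Button("Eject"))
    ejectDisk();                            // hypothetical callback
  ImGui::End();
}
```

There is no retained widget tree to manage: if the code stops issuing the calls, the dialog simply disappears.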

My first reaction was: not another GUI toolkit! Gtk and Qt are stable, supported, and available everywhere.

But, but, but ....

Their front page looks really impressive.
It is still a one-man effort though.

Are packages available in the main distros: Fedora, Ubuntu, Raspbian? That would definitely help.

I've learned what ImGui is and made a 2nd SDL version using it.
No dialogs yet, but they seem easy to add.

It uses OpenGL2, but they say one should jump to OpenGL3, and I need to see how they both work on a Pi.

https://github.com/audetto/AppleWin/tree/imgui

one needs to pass -DIMGUI_PATH=/path/to/imgui to cmake.

Most of the SDL code can be reused.

I think this is a positive development! Keeping the GUI elements managed by SDL reduces the dependencies and, I suspect, will provide greater longevity.

XGS (https://github.com/jmthompson/xgs) also uses ImGui + OpenGL3 and this has been tested on a Pi. In private exchanges with the author, he had this to say about rendering: "The VideoCore stuff isn't supported on the Pi 4 or 400; there is now a fully open source OpenGL driver for the Pi 4/400 as well as the Pi 3. But, I had to custom compile SDL to enable it (the KMSDRM driver). If you don't do this, then on the Pi 4/400 Mesa (and by extension, SDL) will fall back to the llvmpipe software rendering pipeline..and that would certainly spike your CPU because it really will be copying textures around manually. Why Raspberry Pi OS still ships without KMSDRM enabled in SDL baffles me." A new commit should be coming out soon that includes ImGui 1.80 and some further speed optimizations.

Looking forward to trying out the ImGui version of AppleWin and will post back if I encounter any problems.

Good to hear.
It would be good to understand why the good version is not shipped with the Pi. Has he tried to open an issue or write to the forums?

Apologies, haven't had a chance to compile the latest code on my build machine over the last few weeks. Attempting the following:
cmake -DIMGUI_PATH=../imgui ..

Returns CMake Warning: Manually-specified variables were not used by the project: IMGUI_PATH

I git cloned the imgui library into the imgui path within the AppleWin project directory. I assume if there were a path issue or something that cmake would complain a little more loudly. The only other error message cmake displays is "Bad LIBRETRO_COMMON_PATH=NONE, skipping libretro code." The inclusion of imgui sounds like a promising development, but I'm not sure how to test it.

have you used the imgui branch?
it is currently very much behind, but I have been busy integrating all the changes from AW that make x-platform support a lot easier.

if the path is not found, you should get a warning
https://github.com/audetto/AppleWin/blob/imgui/source/frontends/imgui/CMakeLists.txt#L6

have you used the imgui branch?
Well, that explains it :-/ No problems with the imgui compile :-)

The imgui version runs very well on the Pi4 (very responsive). Not sure why, but the Qt build seems to run slow, even though the CPU core isn't hitting 100%. Scaling on imgui didn't seem to noticeably affect performance.

Try this

https://github.com/audetto/AppleWin/tree/imgui3

It uses SDL2+OpenGLES2 which should be a better option.
The other branch is OpenGL2, which according to imgui is not a good choice.

You need libgles-dev.

I tried the imgui3 branch, but don't see a performance difference (on Pi4).

I believe that the ImGui note on "OpenGL2 being a non-ideal choice" relates to a couple of factors:
1.) I think the OpenGL2 code example in ImGui uses older style initialization syntax whereas the OpenGLES code uses newer syntax
2.) OpenGL2 has technically been deprecated by OpenGL3

I don't think there's a specific reason to opt for OpenGLES 2 over OpenGL 2 in ImGui if you use the newer style syntax to initialize your OpenGL2 rendering engine. By comparison, OpenGL ES support was more recently added, so the examples provide "modern" syntax.

Hope this provides some clarity. It's how I understand the situation with ImGui.

Let's use a separate Issue for ImGui related opinions: #22

Ability to use OpenGL2 or OpenGLES has been added

option(SA2_OPENGL "Prefer OpenGL over OpenGL ES" OFF)
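So, assuming the usual CMake workflow, picking the renderer at configure time would look like:

```shell
# hypothetical invocation: prefer desktop OpenGL over OpenGL ES
cmake -DSA2_OPENGL=ON ..
cmake --build .
```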