audetto/AppleWin

Execute loop target cycles computation ignores full-speed?

sh95014 opened this issue · 19 comments

Hiya, me again. So I was trying to chase down why my app (which does not use the main loop in source/frontends/sdl/main.cpp) seems to be much slower than sa2. I've found that sa2 actually calls the writeAudio-ProcessEvents-ExecuteOneFrame-VideoPresentScreen sequence in a tight loop. This is surprising to me, and it pegs my CPU at ~42%.

I might be missing something big here, but my implementation is actually driven off of a 60 Hz timer, so if it finishes executing a 1/60 slice of 1 MHz 6502 code in less time, it'll just take a nap. This approach keeps the CPU at ~25%, but is of course quite slow. As an experiment, I hacked in the same nap (which computes to about 6 ms each time) to sa2:

diff --git a/source/frontends/sdl/main.cpp b/source/frontends/sdl/main.cpp
index d7f3f0ff..25fe286e 100644
--- a/source/frontends/sdl/main.cpp
+++ b/source/frontends/sdl/main.cpp
@@ -138,6 +138,8 @@ void run_sdl(int argc, const char * argv [])
 
     do
     {
+      uint64_t begin = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
+            
       frameTimer.tic();
 
       eventTimer.tic();
@@ -156,6 +158,12 @@ void run_sdl(int argc, const char * argv [])
         refreshScreenTimer.toc();
       }
       frameTimer.toc();
+      
+      uint64_t end = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
+      if (oneFrame > end - begin)
+      {
+        usleep((oneFrame - (end - begin)) * 1000);
+      }
     } while (!quit);
 
     global.toc();

and it did slow down just like mine did, but CPU utilization also dropped to ~31%.

I dug into it, and I think the problem is that Speed::getCyclesTillNext() doesn't actually know how many cycles to run at full speed. It calls Speed::getCyclesAtFixedSpeed(), which computes 16 ms (because 60 Hz) worth of 1 MHz cycles and returns that, when I think it should instead be computing how many 6502 cycles can actually be executed in a 16 ms slice of the host CPU's time.
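To make the difference concrete, here is roughly what I mean by the two budgets (just a sketch with a hard-coded clock, not the actual speed.cpp code):

#include <cstddef>
#include <cstdint>

// Fixed speed: a 16 ms slice at ~1 MHz is only ~16000 cycles, no matter how fast the host is.
uint64_t cyclesAtFixedSpeed(const size_t microseconds)
{
  const double clk6502 = 1020484.45;  // effective Apple II clock, ~1.02 MHz
  return static_cast<uint64_t>(microseconds * clk6502 / 1000000.0);
}

// Full speed: the budget should be "as many cycles as the host can actually emulate
// in those 16 ms", which has to be measured rather than derived from the 1 MHz clock.
uint64_t cyclesAtFullSpeed(const size_t microseconds, const double measuredCyclesPerMicrosecond)
{
  return static_cast<uint64_t>(microseconds * measuredCyclesPerMicrosecond);
}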

I tried to basically live-benchmark the emulation to get a rough "6502 cycles per microsecond" figure and use that to decide how many 6502 cycles to try to run in our 16 ms timeslice when at full speed:

diff --git a/source/frontends/common2/speed.cpp b/source/frontends/common2/speed.cpp
index 9933ce05..aee6ff70 100644
--- a/source/frontends/common2/speed.cpp
+++ b/source/frontends/common2/speed.cpp
@@ -25,9 +25,20 @@ namespace common2
     return cycles;
   }
 
-  uint64_t Speed::getCyclesTillNext(const size_t microseconds) const
+  uint64_t Speed::getCyclesAtFullSpeed(const size_t microseconds, const uint64_t microsecondsExecuting) const
   {
-    if (myFixedSpeed || g_bFullSpeed)
+    const uint64_t instructionsPerMicrosecond = g_nCumulativeCycles / microsecondsExecuting;
+    const uint64_t cycles = microseconds * instructionsPerMicrosecond;
+    return cycles;
+  }
+
+  uint64_t Speed::getCyclesTillNext(const size_t microseconds, const uint64_t microsecondsExecuting) const
+  {
+    if (g_bFullSpeed)
+    {
+        return getCyclesAtFullSpeed(microseconds, microsecondsExecuting);
+    }
+    else if (myFixedSpeed)
     {
       return getCyclesAtFixedSpeed(microseconds);
     }
diff --git a/source/frontends/common2/speed.h b/source/frontends/common2/speed.h
index ceb661bb..d3dd89ac 100644
--- a/source/frontends/common2/speed.h
+++ b/source/frontends/common2/speed.h
@@ -14,8 +14,9 @@ namespace common2
 
     // calculate the number of cycles to execute in the current period
     // assuming the next call will happen in x microseconds
-    uint64_t getCyclesTillNext(const size_t microseconds) const;
+    uint64_t getCyclesTillNext(const size_t microseconds, const uint64_t microsecondsExecuting) const;
     uint64_t getCyclesAtFixedSpeed(const size_t microseconds) const;
+    uint64_t getCyclesAtFullSpeed(const size_t microseconds, const uint64_t microsecondsExecuting) const;
 
   private:
 
diff --git a/source/frontends/sdl/sdlframe.cpp b/source/frontends/sdl/sdlframe.cpp
index dfb6448f..2364c063 100644
--- a/source/frontends/sdl/sdlframe.cpp
+++ b/source/frontends/sdl/sdlframe.cpp
@@ -133,6 +133,7 @@ namespace sa2
     , myDragAndDropDrive(DRIVE_1)
     , myScrollLockFullSpeed(false)
     , mySpeed(options.fixedSpeed)
+    , myMicrosecondsExecuting(0)
   {
   }
 
@@ -609,8 +610,12 @@ namespace sa2
   void SDLFrame::ExecuteInRunningMode(const size_t msNextFrame)
   {
     SetFullSpeed(CanDoFullSpeed());
-    const uint64_t cyclesToExecute = mySpeed.getCyclesTillNext(msNextFrame * 1000);  // this checks g_bFullSpeed
+    const uint64_t cyclesToExecute = mySpeed.getCyclesTillNext(msNextFrame * 1000, myMicrosecondsExecuting);  // this checks g_bFullSpeed
+      
+    uint64_t microsecondsBeforeExecution = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
     Execute(cyclesToExecute);
+    uint64_t microsecondsAfterExecution = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
+    myMicrosecondsExecuting += (microsecondsAfterExecution - microsecondsBeforeExecution);
   }
 
   void SDLFrame::ExecuteInDebugMode(const size_t msNextFrame)
diff --git a/source/frontends/sdl/sdlframe.h b/source/frontends/sdl/sdlframe.h
index a262b392..bfbfe23a 100644
--- a/source/frontends/sdl/sdlframe.h
+++ b/source/frontends/sdl/sdlframe.h
@@ -79,6 +79,9 @@ namespace sa2
     std::shared_ptr<SDL_Window> myWindow;
 
     CConfigNeedingRestart myHardwareConfig;
+      
+  private:
+    uint64_t myMicrosecondsExecuting;
   };
 
 }

Before I go too far down this rabbit hole, am I actually on the right track, or is the tight loop actually the preferred implementation? The patch above allows both sa2 and my app to run at approximately the same speed but with lower CPU utilization in both cases.

Thanks!

I am not sure.
It is one of the parts of the code where I differ from AW and each frontend has a different implementation.

Before we look at your solution, are you saying that my version of full speed does not do anything (other than skipping the video update)? I can't figure it out any longer...

This is sa2 with no changes, with enhanced disk speed off, just added logging in SDLFrame::ExecuteInRunningMode():

1641839436073764: run 2710 cycles
1641839436079347: run 5696 cycles
1641839436081357: run 2051 cycles
1641839436087736: run 6508 cycles
1641839436089788: run 2093 cycles
1641839436096011: run 6350 cycles
1641839436097631: run 1653 cycles
1641839436104465: run 6972 cycles
1641839436107923: run 3528 cycles
1641839436112692: run 4867 cycles

The first number is a microsecond timestamp, so this appears to be called every ~3.9ms (though the individual gaps vary quite a bit if you do the math), but given:

Video refresh rate: 120 Hz, 8.33 ms

I would expect the gap between calls to SDLFrame::ExecuteInRunningMode() to be around 8 ms.

With the patch, the access pattern becomes:

1641839759170619: run 9474 cycles, sleep 6000us
1641839759179591: run 9157 cycles, sleep 5000us
1641839759187962: run 8542 cycles, sleep 4000us
1641839759196563: run 8777 cycles, sleep 4000us
1641839759205422: run 9040 cycles, sleep 5000us
1641839759214175: run 8933 cycles, sleep 5000us
1641839759222428: run 8417 cycles, sleep 5000us
1641839759230749: run 8492 cycles, sleep 5000us
1641839759239586: run 9017 cycles, sleep 5000us
1641839759247805: run 8387 cycles, sleep 4000us

so it gets called every ~7.7ms.

With enhanced disk speed on, I see stretches in the unmodified code that are the same ~3.5ms, but also bursts where it's executing more cycles (8163 versus 2000-6000) in very short (~932μs!) frames:

1641840176379568: run 8163 cycles
1641840176381966: run 8163 cycles
1641840176382732: run 8163 cycles
1641840176383800: run 8163 cycles
1641840176384494: run 8163 cycles
1641840176385609: run 8163 cycles
1641840176386237: run 8163 cycles
1641840176387319: run 8163 cycles
1641840176388096: run 8163 cycles
1641840176388894: run 8163 cycles

With the patch and enhanced disk speed on, the bursts look like this:

1641840446988086: run 200000 cycles, sleep 1000us
1641840446995482: run 200000 cycles,
1641840447003758: run 200000 cycles,
1641840447011282: run 200000 cycles, sleep 1000us
1641840447019904: run 208000 cycles, sleep 1000us
1641840447027457: run 208000 cycles, sleep 1000us
1641840447036041: run 208000 cycles, sleep 1000us
1641840447043770: run 208000 cycles, sleep 2000us
1641840447051919: run 208000 cycles,
1641840447060336: run 208000 cycles, sleep 2000us

so sleeping a lot less and in some cases not at all (where the target number of instructions might be too high), but the 10 logs cover 72.2ms while the prior (enhanced disk speed off) logs covered 77.1ms.

In the unmodified case, the pasted logs show ~8.75 cycles per μs. The logs of the patched case cover a lot more time, but the logs show ~28.3 cycles per μs. This might be because it didn't have to run through the audio/event/video bits as many times as the unmodified code had to.

My expectation when reading the code was that turning off enhanced disk speed would have no effect on the unpatched version, because it's already executing instructions as fast as it possibly can whenever it's not doing audio/events/video. But it does slow down when I turn it off, so I think the m_enhanceDisk code is still taking effect regardless.

So I think the only way enhanced disk speed can be taken advantage of is if the CPU isn't starved by getCyclesAtFixedSpeed() underestimating how much can be executed. That issue was then masked by the main loop running without any delay rather than trying to match a frame rate. It seems to me the right approach would be for the main loop to keep to a refresh rate (120 Hz in this case) and for ExecuteInRunningMode() to decide whether to run just 8.3 ms worth of 1 MHz cycles or as many cycles as actually fit in 8.3 ms.

I was looking at your patch: you try to estimate the actual speed and re-apply it next time.
But the actual speed of Execute(cyclesToExecute) can be very different depending on g_bFullSpeed.
Moreover, you cannot detect whether full speed should end sooner.

How AppleWin deals with this is different, and I should probably just do the same (roughly sketched in code after the list):

  1. start
  2. run 1ms (of 6502 time) (with no video update)
  3. if 16ms wall clock have passed, update & repaint the host screen
  4. if still full speed go back to 2
  5. sleep 1ms
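To make the shape of that loop concrete, something like this (only a sketch; runFor1ms, presentScreen and fullSpeedActive are placeholders, not the real AppleWin calls):

#include <chrono>
#include <thread>

void runFor1ms();        // placeholder: 1 ms of 6502 time, no video update
void presentScreen();    // placeholder: update & repaint the host screen
bool fullSpeedActive();  // placeholder: is full speed still wanted?

void mainLoopSketch()
{
  auto lastRepaint = std::chrono::steady_clock::now();
  for (;;)
  {
    runFor1ms();                                                 // step 2

    const auto now = std::chrono::steady_clock::now();
    if (now - lastRepaint >= std::chrono::milliseconds(16))      // step 3
    {
      presentScreen();
      lastRepaint = now;
    }

    if (fullSpeedActive())                                       // step 4: back to step 2
      continue;

    std::this_thread::sleep_for(std::chrono::milliseconds(1));   // step 5
  }
}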

First of all, I run in blocks of 16 ms, not 1 ms (Windows timers seem to be a lot better; here I rely on OpenGL to keep the correct speed), I don't do step 3, and you can see here

if (!options.headless)
{
  refreshScreenTimer.tic();
  frame->VideoPresentScreen();
  refreshScreenTimer.toc();
}
that I always present the frame.
So, re-reading my code, I can't figure out where I gain speed, but if I simply skipped VideoPresentScreen in full speed (except once per wall-clock frame), it would be simpler.
In that case I should ignore the real speed and always run at fixed speed (which I do).

Found it:

setGLSwapInterval(0);

So, yes.
What I should do as well is not call VideoPresentScreen unless 16 ms of wall clock have passed.
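Something along these lines (a sketch only; PresentGate and the names in it are made up, nothing like this is in the code yet):

#include <chrono>

class PresentGate
{
public:
  explicit PresentGate(const std::chrono::microseconds period) : myPeriod(period) {}

  // true at most once per wall-clock video frame
  bool shouldPresent()
  {
    const auto now = std::chrono::steady_clock::now();
    if (now - myLast < myPeriod)
      return false;
    myLast = now;
    return true;
  }

private:
  const std::chrono::microseconds myPeriod;   // e.g. 16667 us at 60 Hz, 8333 us at 120 Hz
  std::chrono::steady_clock::time_point myLast{};
};

The main loop would keep calling writeAudio / ProcessEvents / ExecuteOneFrame as it does now, but only call frame->VideoPresentScreen() when the gate says so.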

Ok. I'm going to wait to see how you go about this and pattern my real fix after that. Thanks!

Could you please have a look at that commit and see if it works for you?

sa2 still runs very fast with the fullspeed branch changes, but it also pegs the CPU at around 45%, so it's not letting the host CPU rest at all even when running at 1 MHz. (The fix also doesn't seem to help my app, which still doesn't execute enough 6502 cycles in "turbo" mode, because getCyclesAtFixedSpeed() just tells it to run as if it were only at 1 MHz.)

I think the delay/sleep in the main loop is necessary, which then shifts the problem to correctly determining the right number of cycles to execute in "turbo" mode.

Avoiding the screen refresh is probably also part of the right fix (because refreshing faster than 60/120 Hz is certainly pointless), but if you call right back into Execute then you'll peg the host CPU unnecessarily, because the 16 ms worth of 1 MHz CPU emulation has already completed.

So basically I'm thinking there are two modes:

  1. Normal - execute what a 1 MHz 6502 would've executed within the frame (16 ms), and sleep whatever time is left.
  2. Turbo - execute as many instructions as possible within the 16 ms minus the audio/event/video overhead time.

and the tight main loop does not allow the normal mode (a rough sketch of what I mean follows).
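Roughly what I have in mind for a single 16 ms frame (only a sketch; executeCycles and the measured turbo throughput are stand-ins, not existing code):

#include <chrono>
#include <cstdint>
#include <thread>

void executeCycles(uint64_t cycles);   // stand-in for Execute() plus the audio/event/video work

void runOneFrame(const std::chrono::microseconds framePeriod,   // e.g. 16667 us
                 const bool turbo,
                 const double turboCyclesPerMicrosecond)        // measured, not assumed
{
  const auto start = std::chrono::steady_clock::now();

  // Normal: what a ~1 MHz 6502 would execute in one frame.
  // Turbo: as many cycles as the host can actually emulate in one frame.
  const uint64_t cycles = turbo
    ? static_cast<uint64_t>(framePeriod.count() * turboCyclesPerMicrosecond)
    : static_cast<uint64_t>(framePeriod.count() * 1.02);        // ~1 MHz

  executeCycles(cycles);

  // Sleep off whatever is left of the frame; in turbo mode this is roughly zero.
  const auto elapsed = std::chrono::steady_clock::now() - start;
  if (elapsed < framePeriod)
    std::this_thread::sleep_for(framePeriod - elapsed);
}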

When running at full speed we do indeed want 100% CPU utilisation, either because it is running the emulation or repainting the screen.
I am not sure exactly what your goal for full speed is. Try AppleWin and see if it does what you expect it to do.

With my change (modulo a bug), the number of cycles executed each time in full speed should make little or no difference.
It only affects how often writeAudio or processEvents are called; before the change it drove the repaint too.

The only difference between your Turbo and the one available now is how often audio/events are executed, but I avoid the difficult task of estimating the turbo speed.

Sorry, I must be explaining it poorly. The problem isn't that it pegs the CPU when in turbo mode, but that it pegs the CPU during normal mode.

My expectation is that in normal mode we call Execute once every 16 ms, and it runs 16 ms worth of what the 6502 would've run at 1 MHz; since our host CPU is much faster, it finishes early and sleeps the rest of the time until the next 16 ms slice starts. In turbo mode, we still call Execute once every 16 ms, but now it runs for the entire 16 ms (minus the audio/event/video time) and does not go to sleep. So a lot more gets done.

The key to the fix that I was proposing is the usleep() call in the main loop. That call would do (approximately) nothing in turbo mode because we've already used up the entire 16 ms executing code, but will take effect in normal mode because we've finished all the work needed for the 16 ms and have time to spare.

But inserting the usleep() means that Execute() needs to know how many instructions to execute in order to fill up the 16 ms when in turbo mode.
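One way Execute() could learn that number is from how long the previous calls actually took; something like this (again just a sketch, with a guard for the very first frame that my patch above is missing; the names are illustrative, not AppleWin symbols):

#include <cstdint>

struct TurboEstimate
{
  uint64_t cyclesExecuted = 0;
  uint64_t microsecondsExecuting = 0;

  // How many cycles should fit in a timeslice, based on the measured throughput so far.
  uint64_t cyclesFor(const uint64_t microseconds) const
  {
    if (microsecondsExecuting == 0)
      return microseconds;                 // nothing measured yet: fall back to ~1 cycle/us
    return microseconds * cyclesExecuted / microsecondsExecuting;
  }

  // Call after each Execute() with the cycles run and the wall-clock time it took.
  void record(const uint64_t cycles, const uint64_t microseconds)
  {
    cyclesExecuted += cycles;
    microsecondsExecuting += microseconds;
  }
};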

I have to admit that measuring CPU utilisation has always been complicated.
On Windows, AppleWin runs at < 1%, but I think that is not comparable.

When I run sa2 normally, top tells me 6% CPU utilisation, which goes to 100% if I ramp the speed up to the maximum.

What I find more interesting is the maximum FPS when you do not wait for VSYNC: in the same situation I get 400 FPS with sa2 --fixed-speed --gl-swap 0.

If I skip ExecuteOneFrame() I get down to 3% CPU but the maximum FPS stays the same.

Are you saying that you see high CPU utilisation during normal execution or full speed?

I intentionally do not want any usleep, because VSYNC does a very good job without it.

Can you try running both sa2 --fixed-speed and sa2 and paste the console output here?

build % sa2 --fixed-speed
IMGUI_VERSION: 1.87 WIP
GL_VENDOR: Apple
GL_RENDERER: Apple M1 Pro
GL_VERSION: 4.1 Metal - 76.1
GL_SHADING_LANGUAGE_VERSION: 4.10
[DSInit] PC=00000000, WC=00000000, Diff=00000000
found Apple2e_Enhanced.rom at /usr/local/share/applewin/resource/Apple2e_Enhanced.rom
found Parallel.rom at /usr/local/share/applewin/resource/Parallel.rom
found DISK2-13sector.rom at /usr/local/share/applewin/resource/DISK2-13sector.rom
found DISK2.rom at /usr/local/share/applewin/resource/DISK2.rom
found Apple2_JPlus_Video.rom at /usr/local/share/applewin/resource/Apple2_JPlus_Video.rom
found Base64A_German_Video.rom at /usr/local/share/applewin/resource/Base64A_German_Video.rom
Default GL swap interval: 1
Video refresh rate: 120 Hz, 8.33 ms
Global:  total =   26848.36 ms, mean =   26848.36 ms, std =       0.00 ms, n =      1
Frame:   total =   26843.98 ms, mean =       3.20 ms, std =       5.39 ms, n =   8389
Screen:  total =   23610.16 ms, mean =       2.81 ms, std =       2.31 ms, n =   8389
Events:  total =    1199.05 ms, mean =       0.14 ms, std =       4.73 ms, n =   8389
CPU:     total =    2025.58 ms, mean =       0.24 ms, std =       0.12 ms, n =   8389
Expected clock: 1020484.45 Hz, 67.12 s
Actual clock:   2551004.99 Hz, 26.85 s
build % ./sa2
IMGUI_VERSION: 1.87 WIP
GL_VENDOR: Apple
GL_RENDERER: Apple M1 Pro
GL_VERSION: 4.1 Metal - 76.1
GL_SHADING_LANGUAGE_VERSION: 4.10
[DSInit] PC=00000000, WC=00000000, Diff=00000000
Default GL swap interval: 1
Video refresh rate: 120 Hz, 8.33 ms
Global:  total =   38598.72 ms, mean =   38598.72 ms, std =       0.00 ms, n =      1
Frame:   total =   38592.76 ms, mean =       3.27 ms, std =       4.91 ms, n =  11801
Screen:  total =   35618.55 ms, mean =       3.91 ms, std =       2.21 ms, n =   9108
Events:  total =    1471.90 ms, mean =       0.12 ms, std =       3.91 ms, n =  11801
CPU:     total =    1490.71 ms, mean =       0.13 ms, std =       0.20 ms, n =  11801
Expected clock: 1020484.45 Hz, 60.07 s
Actual clock:   1588278.34 Hz, 38.60 s

It varies a bit. This is sa2 running in the Karateka crack screen:

[Screenshot: Screen Shot 2022-01-13 at 12 20 58 AM]

Although I normally see it in the high 30s. It goes up to 47% (I've seen it close to 50%) while loading the game itself, but I've also seen this screen steady at around 25%. Just never 6% or 1%.

Frame:   total =   26843.98 ms, mean =       3.20 ms, std =       5.39 ms, n =   8389

This is odd.
Mine is

Frame:   total =   22541.80 ms, mean =      16.67 ms, std =       1.16 ms, n =   1352

And I run at 60 FPS; you seem to be running at 300 FPS. What does the Help dialog say?
I suspect your OpenGL implementation does not respect VSync, which would explain everything.

In a normal setup, the call to SDL_GL_SwapWindow has an implicit sleep until the next vertical refresh, to minimise flickering.
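For what it's worth, whether the swap really blocks depends on the swap interval being accepted by the driver; with plain SDL (not sa2's setGLSwapInterval wrapper) you can check it like this, once a GL context is current:

#include <SDL.h>

void enableVSync()
{
  // 1 = synchronise buffer swaps with the vertical retrace
  if (SDL_GL_SetSwapInterval(1) != 0)
    SDL_Log("Could not enable vsync: %s", SDL_GetError());

  SDL_Log("GL swap interval is now: %d", SDL_GL_GetSwapInterval());
  // With an interval of 1, SDL_GL_SwapWindow() should not return before the next
  // vertical refresh, which is the implicit sleep described above.
}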

Can you run sa2 --no-imgui and try the different SDL drivers to see if they behave differently (sa2 --no-imgui --sdl-driver 1 and so on)?

andrea@bomba:~/projects/cvs/a2e/build-release$ ./sa2 --no-imgui
SDL: 3 drivers
 0: opengl
 1: opengles2
 2: software
Active driver (-1): opengl
 SDL_RENDERER_SOFTWARE: 0
 SDL_RENDERER_ACCELERATED: 1
 SDL_RENDERER_PRESENTVSYNC: 0
 SDL_RENDERER_TARGETTEXTURE: 1

The help screen says 23x-24x fps.

build % ./sa2 --no-imgui
SDL: 4 drivers
 0: metal
 1: opengl
 2: opengles2
 3: software
Active driver (-1): metal
 SDL_RENDERER_SOFTWARE: 0
 SDL_RENDERER_ACCELERATED: 1
 SDL_RENDERER_PRESENTVSYNC: 1
 SDL_RENDERER_TARGETTEXTURE: 1

Try any other driver and I get:

SDL_GL_SetSwapInterval: No OpenGL context has been made current

The help screen says 23x-24x fps.

This does not add up.
23 FPS = 43ms, not 3ms as you reported.

Either my timer or the ImGui framerate is wrong.

frameTimer.tic();

https://github.com/ocornut/imgui/blob/6d5388448794f76cc92576a74ad3cb17da2fca84/imgui.cpp#L3915-L3919

When I run it here, the two agree.

Sorry, by “23x-24x” I mean 230-something to 240-something.

I will try out the ideas. To summarize: you think the right way to run this loop is with the main outer loop calling into Execute as often as it can, limiting the video update to the actual display frame rate, with the idle period basically being the wait for vsync?

I may have less time these next few days, but I’ll let you know how it goes. Thanks!

Ok, then it all makes sense.
I don't have any sleep because the video repaint (i.e. SDL_GL_SwapWindow) blocks until the next vsync.
And yes, the idle period is inside this function.

Default GL swap interval: 1

This seems to imply that SDL knows it should sync on every video frame, but then it does not actually do it.

Maybe you could reach out to the SDL forum and see if they know anything about it.