libretro/beetle-pce-libretro

[Performance Discussion] Libretro PCE is slower than Mednafen PCE

negativeExponent opened this issue · 9 comments

Original discussion was here: libretro/beetle-pce-fast-libretro#142

(THIS IS NOT A "THIS CORE VS. THAT CORE" DISCUSSION. THE PURPOSE IS TO OPTIMIZE THE PORT FURTHER, IF POSSIBLE.)

The libretro_pce core is significantly slower than Mednafen PCE.

I tested this by running both cores in fast-forward mode. I'm aware this is not the best way to compare, since running this way pins my CPU at 100%, but the libretro core is still slower:

Common setup:

  • Everything is on default settings in both versions, as much as possible.
  • Both seem to prefer audio enabled when fast-forwarding, so let's leave that as is.
  • RetroArch uses 16-bit color, so I assume the slowdown is not about the color format.
  • Using Arch Linux with xfce4's window manager disabled for the best framerate.
  • RetroArch uses RGUI with bling disabled (at least the animations).
  • Mednafen has a maximum fast-forward rate of 15x, but based on the screenshots below I don't think I am hitting that limit yet.

libretro PCE:
[screenshot: pce_libretro]

Mednafen PCE with frameskip enabled:
[screenshot: pce_mednafen]

Mednafen PCE with frameskip disabled:
[screenshot: pce_mednafen_no_frameskip]

I cannot do the same comparison with pce_fast or supergrafx, since Mednafen always runs at max FPS with those (about 900 fps), and I haven't yet found out whether Mednafen's fast-forward multiplier limit can be raised or set to unlimited.

Mednafen PCE_Fast with SuperGrafx enabled, frameskip enabled:
[screenshot: pce_fast_supergrafx_mednafen]

My opinions are:

  • The libretro core's CPU overclocking method. It should eat some cycles, since it's done very heavily. If it were switched to Mesen-like PPU overclocking, maybe that would help recover some speed.

  • Doesn't standalone have SIMD/AVX support for the resampler? I suspect the libretro core does not enable it.

  • Are RetroArch shaders off? The ultra-wide base textures are huge (2048x2048)!
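
On the SIMD question: whether the standalone resampler has such paths isn't confirmed in this thread, but a common pattern is a compile-time guard, where the vector path only exists if the build enables that instruction set. The sketch below is a hypothetical FIR dot-product loop; `fir_dot` and `fir_uses_avx` are made-up names, not Mednafen's resampler API. If a build never passes `-mavx` (or a `-march` that implies AVX), `__AVX__` is undefined and only the scalar loop gets compiled, silently.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical FIR inner loop: one output sample = dot(window, coefficients). */
float fir_dot(const float *in, const float *coef, size_t taps)
{
    size_t i = 0;
    float sum = 0.0f;
#if defined(__AVX__)
    /* Only compiled when the build enables AVX (e.g. -mavx or -march=native). */
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= taps; i += 8)
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(in + i),
                                               _mm256_loadu_ps(coef + i)));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (int k = 0; k < 8; k++)
        sum += lanes[k];
#endif
    /* Scalar path: the leftover taps, or every tap if AVX is not enabled. */
    for (; i < taps; i++)
        sum += in[i] * coef[i];
    return sum;
}

/* Reports which path this translation unit was built with (1 = AVX). */
int fir_uses_avx(void)
{
#if defined(__AVX__)
    return 1;
#else
    return 0;
#endif
}
```

Logging `fir_uses_avx()` (or an equivalent check) in both builds would show whether they actually run the same code path.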

It's important to bear in mind that RetroArch does not use frameskipping while fast-forwarding; it just disables audio and video sync instead. I am not sure how Mednafen's fast-forwarding works (somebody would have to look at the SDL implementation), but it's possible it is accomplished through frameskip. Either way, it should be taken into consideration that we cannot compare FPS results 1:1 because of this.
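
To see why the two FPS counters may not line up 1:1, here is a toy cost model, assuming (hypothetically) that each emulated frame costs `emu_ms` of emulation plus, only when the frame is actually presented, `render_ms` of video work. With frameskip `skip`, one frame in every `skip + 1` is presented. `ff_emulated_fps` is an illustrative name, not a RetroArch or Mednafen function.

```c
/* Toy model: emulated frames per second during fast-forward.
 * emu_ms:    time spent emulating one frame (paid for every frame)
 * render_ms: time spent presenting one frame (paid only for shown frames)
 * skip:      frames skipped between presented frames (0 = present all) */
double ff_emulated_fps(double emu_ms, double render_ms, unsigned skip)
{
    /* Presentation cost is amortized over the skip + 1 frames in each group. */
    double per_frame_ms = emu_ms + render_ms / (double)(skip + 1);
    return 1000.0 / per_frame_ms;
}
```

With `emu_ms = 1.0` and `render_ms = 1.0`, the model gives 500 fps with no frameskip and 800 fps with `skip = 3`, although the per-frame emulation cost is identical. That is the kind of gap a frameskipping fast-forward could open against one that presents every frame.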

That being said, I do not discount the fact that possibilities might exist for things to become more performant on the libretro core side here.

Let me remind everyone that this is not a "comparison" for debate purposes; it's just intended to find more optimizations so the core can be equal to, if not better than, upstream. As for the frameskip point, frameskip can also be disabled in Mednafen; I think that's in the screenshots too.

I compared the core with standalone a while ago and reached the same conclusion as negativeExponent: the core lags behind it performance-wise (according to the notes I took, standalone was about 80% faster).
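
Because "X% faster" figures are easy to misread in the other direction, here is the arithmetic as two small helpers (hypothetical names). They assume both inputs are emulated frames per second measured the same way.

```c
/* How much faster A is than B, in percent. */
double percent_faster(double fps_a, double fps_b)
{
    return (fps_a / fps_b - 1.0) * 100.0;
}

/* B's throughput as a percentage of A's. */
double percent_of(double fps_b, double fps_a)
{
    return fps_b / fps_a * 100.0;
}
```

With hypothetical readings of 180 fps for standalone and 100 fps for the core, `percent_faster(180.0, 100.0)` is 80, while `percent_of(100.0, 180.0)` is about 55.6: the core would deliver roughly 56% of standalone's throughput, i.e. about 44% slower, not 80% slower.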

I tried removing some code in the internal video path in case anything costly had been added in a timing-sensitive place, but I found nothing relevant.

I wonder if it's just a multi-threading thing we don't do here?

Perhaps there are optimizations we apply when compiling that the original doesn't have. I've seen how picky Mednafen is about some optimization flags, and perhaps forcing -O3 has adverse effects. This needs more investigative work for sure.
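
One low-effort way to start comparing the two builds is to compile a tiny probe with each project's CFLAGS and see which optimization-related macros GCC/Clang predefine. This is only a sketch (`probe_build` is a made-up name): it cannot distinguish -O2 from -O3, nor reveal flags that have no predefined macro, so diffing the actual compiler command lines is still needed.

```c
#include <stdio.h>

/* Optimization-related macros GCC/Clang predefined for this translation unit. */
struct build_probe {
    int optimize;      /* __OPTIMIZE__: any optimization level above -O0 */
    int optimize_size; /* __OPTIMIZE_SIZE__: size optimization such as -Os */
    int fast_math;     /* __FAST_MATH__: -ffast-math (also implied by -Ofast) */
    int avx;           /* __AVX__: AVX code generation enabled */
};

struct build_probe probe_build(void)
{
    struct build_probe p = {0, 0, 0, 0};
#ifdef __OPTIMIZE__
    p.optimize = 1;
#endif
#ifdef __OPTIMIZE_SIZE__
    p.optimize_size = 1;
#endif
#ifdef __FAST_MATH__
    p.fast_math = 1;
#endif
#ifdef __AVX__
    p.avx = 1;
#endif
    return p;
}

void print_build_probe(void)
{
    struct build_probe p = probe_build();
    printf("optimize=%d optimize_size=%d fast_math=%d avx=%d\n",
           p.optimize, p.optimize_size, p.fast_math, p.avx);
}
```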

Wonder what's going on here:

[screenshot: out-of-bounds]

It looks like the horizontal display registers go way out of bounds for a frame or two when the last scanline is increased past 239.

Perhaps this is related to the "fixes" made in the VCE to allow for more scanlines, but I can't see how:
d05b6cc

Normally, you should be able to show all 243 visible lines (whether they are just data, extra background, or overscan), and that is fine. In this case, though, it's probably switching late. It might even make sense to always use all 243 lines and let the initial/last scanline options decide how many scanlines are drawn and set the height. I haven't looked at the core closely, though.
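
The suggestion above (always keep the full 243-line area and let the first/last scanline options pick the window) could look roughly like this. It is a hypothetical helper, not code from the core: `clamp_line_window` and `PCE_MAX_VISIBLE_LINES` are made-up names, and it only shows clamping the options so that a last-scanline value past the visible range can never select lines outside the rendered area.

```c
/* Full visible area, per the discussion: 243 lines (0..242). */
#define PCE_MAX_VISIBLE_LINES 243

struct line_window {
    int first;  /* first scanline to output */
    int last;   /* last scanline to output (inclusive) */
    int height; /* number of output lines */
};

/* Clamp user-selected first/last scanlines to the visible area. */
struct line_window clamp_line_window(int first, int last)
{
    struct line_window w;
    if (first < 0)
        first = 0;
    if (first > PCE_MAX_VISIBLE_LINES - 1)
        first = PCE_MAX_VISIBLE_LINES - 1;
    if (last > PCE_MAX_VISIBLE_LINES - 1)
        last = PCE_MAX_VISIBLE_LINES - 1;
    if (last < first)
        last = first; /* always output at least one line */
    w.first = first;
    w.last = last;
    w.height = last - first + 1;
    return w;
}
```

For example, `clamp_line_window(0, 250)` yields lines 0 to 242 and a height of 243, so the output height follows the options while the emulated area stays fixed.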

When the core reaches a good performance level, make it available on all, or at least most, of the platforms libretro supports; then it can become the de facto PCE core and the supergrafx one can be retired (pce_fast still needs to be maintained for performance reasons). But as of now, I'll stay with pce_fast/supergrafx.