Question about DDR3 usage
gyurco opened this issue · 33 comments
Hi!
I'd like to port this core to a board with dual SDRAMs, but without DDR3. Do you think it is possible?
As I see, even with the PSX_DualSDRAM.qpf, DDR3 is still used for savestates, memcard(?), etc.
Currently I wired up the two SDRAMs as in the MiSTer top-level, but the iCPU halts very soon, with mem_request remaining asserted.
Looks like VRAM is also in DDR3, even with dual SDRAM?
+1 question: is it possible to disable savestates? It takes a lot of time to compile with them. Just commenting out the modules, however, makes the SPU disappear as well, so it looks like some signals are needed from it anyway.
...ahh, normal reset doesn't reset the CPU perfectly, need ss_reset to set PC to the correct reset address.
Several registers, LUTRAMs and BRAMs only have one "initial value" write port that is either filled with savestate values or reset values (typically zeros). That's why it can't be removed so easily.
For the DDR3: the main use is for VRAM. The whole GPU is designed to make use of its 64-bit Avalon-like interface. You could certainly use SDRAM instead, but it would take effort to change the interface. To get started, it would be easiest if the SDRAM side provided a 64-bit burst-read and single-write interface. It can be changed/optimized later, of course.
Some other parts like memory card are also in DDR3, but to get started those can be removed.
If the CPU halts, this is most likely due to either the reset not working properly, bios not available in sdram or the sdram handling not being the same as in mister, so the memmux module waits for some response.
Thanks for the answer!
As I see, after a reset, the savestate goes into 'loading mode', even if nothing was saved before (e.g. state goes IDLE->WAITPAUSE->WAITIDLE->LOAD_WAITSETTLE->etc..). Is this the intended way? Will it then fill the components with garbage from DDR3, or in my case, with 0?
With leaving it in, I actually get some sounds from the BIOS opening theme, so it seems to work to some extent. However, vram_WE is never asserted, so the GPU probably gets stuck before even trying to reach the VRAM. I wonder why.
Do you think the core (at least the BIOS) can be tested without the SPU, just to focus on getting the GPU working, and then the SPU can be re-enabled once there's a picture? Or will the BIOS probably hang without the SPU?
I had the core running without SPU for several months, even within games, so it should be possible, but I cannot remember if at least some status bits needed to be reported.
For the savestate reset: there is a "resetMode" bit, which is used when loading data. It's loading a fake savestate with all values replaced by zeros, see line 576, instead of DDR3 reading.
For VRAM: make sure the busy and pause is at 0. I'm not sure if vram_WE or vram_RD comes first, but one of these should get active at some point. Maybe you can get away with having vram_DOUT_READY at 1 so even reads will at least not get stuck.
If neither read nor write ever happens, I would check the bus or DMA signals.
Also keep in mind that the BIOS in PSX is one of the harder things to get running. Some demo .exe can be much easier, and even games (reached by using fastboot) often require less functionality than the BIOS.
That's good news, I can concentrate on VRAM then. I hope the SDRAM's speed will be enough. I can use an SDRAM controller with bank interleaving, so various subsystems can use it more in parallel, but the burst speed still won't match DDR3.
I see the vramState is just stuck in READVRAM, so probably one of the signals you mentioned is missing.
Yes, it's reading and waiting for the data coming with vram_DOUT_READY being high when stuck in that state.
For the VRAM speed: normal VRAM access speed is not hard to reach, it's only 32 bit at 33 MHz to match the most common PSX model. However, the original SGRAM could write a whole row (16 words if I remember right) in a single access. This is used, e.g., to clear the screen with a solid color.
I wouldn't worry about it too much for now. Most games do not care that much, and even without that feature you can still easily reach the original SCPH-1000 VRAM speed.
I was successful with the port. The main core had to be slightly modified to get rid of MLABs (Cyclone 10LP doesn't support them). In some cases it was even possible to replace them with BRAM, which conserves logic (and compiles faster).
Here's my port:
https://github.com/gyurco/PSX_MiSTer/tree/mist/mist
Currently I disabled savestates, the extra muxes/logic make the timings very bad.
A DDR3 'emulator' for SDRAM is here:
https://github.com/gyurco/PSX_MiSTer/blob/mist/mist/sdram_4w.sv
It's great that it works!
I'm surprised that sometimes you could use BRAM, as MLAB will deliver the result from the read address in the same cycle (unclocked). At least with Cyclone 5 BRAMs that is not possible as far as I know.
Using FFs for the CPU registers will work, I had it like this before. It just costs a lot of resources, especially if you enable savestates.
Overall, great work porting it so fast.
The trick is to use the negated signal for the read clock, and it'll have the same effect. It's not very good for the timings, thus I didn't use it for the CPU's tag RAM.
However I think the logic can be changed in some places to not rely on same-cycle reads. The SPU doesn't seem to need to run fast, the states are even slowed down by a CE, so maybe I'll try to convert it to normal BRAM.
BTW, I noticed a bug in the game MDK (EU version): it needs at least low turbo, or it occasionally locks up. I don't know if it happens on MiSTer, too. If somebody reads this, it would be good to check.
The SPU can be slowed down if you have dedicated memory for it. It only needs to complete all channels in 768 clock cycles.
It's made as fast as possible in the Mister core, because with single SDRAM, the SPU-RAM would be stored in DDR3, which is shared with the GPU and isn't always fast enough, so having more headroom for unpredictable RAM latency solved this issue.
You may also remove the SPU cache then.
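As a rough sanity check of that 768-cycle budget, here is a small arithmetic sketch. The clock frequency and the voice count are assumptions taken from general PSX documentation, not from this thread, and the per-voice figure is only an upper bound, since other SPU work has to fit into the same window.

```python
# Assumed from general PSX documentation, not stated in this thread:
CPU_CLOCK_HZ = 33_868_800   # PSX system clock
SPU_VOICES = 24             # number of SPU voices (channels)

# From the comment above: all channels must finish within this window.
CYCLES_PER_SAMPLE = 768

sample_rate_hz = CPU_CLOCK_HZ / CYCLES_PER_SAMPLE
cycles_per_voice = CYCLES_PER_SAMPLE // SPU_VOICES

print(f"output sample rate: {sample_rate_hz:.0f} Hz")
print(f"cycle budget per voice (upper bound): {cycles_per_voice}")
```

Under these assumptions the 768-cycle window corresponds to one output sample, which is why a dedicated SPU memory leaves plenty of room to slow the state machine down.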
So far that MDK crash was not reported. It's possible this is due to the GPU being much slower with the SDRAM.
You could try with the GPU benchmark that can be found here: https://github.com/JaCzekanski/ps1-tests
and see how it compares against MiSTer or the console.
Unfortunately it is true, there are occasional lockups in MDK (PAL), but there is no easy way to reproduce them.
Sometimes for 2-3 hours I had no problems. Sometimes right at the beginning the game would hang. Most often you can encounter this problem in tunnels where the game data is being read.
Good to know, maybe we should add an issue about it then?
Especially if turbo low does resolve it as workaround.
done!
Thanks for pointing to those tests. The write rate kind of sucks: 64 bits per 9 cycles at 100 MHz is about 22 MHz/32 bit, only 66% of the original PS1, which can be seen exactly in the first tests in each group. Just curious, why didn't you use burst writes? Or was it fast enough without that?
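The arithmetic behind that estimate can be written out as a short sketch. The function name is made up for illustration, and the console rate is taken as a nominal 33 MHz of 32-bit words, as quoted earlier in the thread.

```python
def words32_per_second(bits_per_write, cycles_per_write, clock_hz):
    """Effective rate in 32-bit words/s when each write of
    `bits_per_write` bits occupies `cycles_per_write` clock cycles."""
    writes_per_second = clock_hz / cycles_per_write
    return writes_per_second * bits_per_write / 32

SDRAM_CLOCK_HZ = 100e6   # SDRAM clock from the comment above
PSX_WORD_RATE = 33e6     # nominal: 32 bit at 33 MHz

rate = words32_per_second(64, 9, SDRAM_CLOCK_HZ)
ratio = rate / PSX_WORD_RATE

print(f"{rate / 1e6:.1f} M words/s, {ratio:.0%} of the console")
```

The same function can be fed other write widths and cycle counts to compare alternative controller timings.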
Check the FPS columns, the MiSTer PSX core is faster than a real PSX in these benchmarks on every test except for Lines Flat and Lines Shaded with transparency on (which I'm not sure, but aren't these benchmarks outdated after some later fixes to some of the GPU transparency issues? Remembering the FF9 slowdown in the rain issue for instance).
The hard DDR3 controller in the DE10-Nano is amazing. It doesn't care if writes are bursts or not. If they are not, it will group them by itself, which makes the handling much easier. You can reach >85% of the total bandwidth with single writes. Only for reads bursts are required.
For SDRAM it would be worth replacing the write FIFO with a double-buffered block RAM and burst logic.
The PSX renders in scanlines, so each burst could be 1-1024 pixels (16 bit each) long with increasing address and no row switch in between.
You could burst out all drawn pixels from one rendered line while the next is already being drawn. This should give a significant speedup.
The whole store logic can be found here. Currently it groups writes into 64-bit data words to make use of the 64-bit DDR3 interface, but that is not necessary. You could replace it with your own store logic that better fits your RAM/controller.
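To illustrate the grouping idea, here is a minimal software model that turns the pixels drawn on one scanline into bursts with increasing addresses. It is only a sketch: the names are hypothetical, it assumes at most one write per x position per line, and in hardware this would be burst logic reading the line buffer that is not currently being drawn into.

```python
from typing import List, Tuple

Pixel = Tuple[int, int]        # (x position, 16-bit color)
Burst = Tuple[int, List[int]]  # (start x, colors at consecutive x)

def coalesce_line(pixels: List[Pixel]) -> List[Burst]:
    """Group the pixels of one scanline into runs of consecutive x,
    so each run can be written as a single burst."""
    bursts: List[Burst] = []
    for x, color in sorted(pixels):
        # Extend the last burst if this pixel directly follows it.
        if bursts and bursts[-1][0] + len(bursts[-1][1]) == x:
            bursts[-1][1].append(color)
        else:
            bursts.append((x, [color]))
    return bursts

line = [(10, 0x7FFF), (11, 0x7FFF), (12, 0x001F), (40, 0x03E0)]
bursts = coalesce_line(line)
print(bursts)
```

Here the three adjacent pixels starting at x=10 become one burst, and the isolated pixel at x=40 becomes a burst of length 1.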
https://github.com/MiSTer-devel/PSX_MiSTer/blob/main/rtl/gpu.vhd#L1516-L1588
The FF9 slowdown fix was related to pixels outside of the screen. The game draws >50% of the rain outside the screen, and the GPU was skipping only the store for outside pixels, but was still fetching them from the framebuffer. Once outside pixels were also no longer read back for the transparency merge, it was fine.
The benchmark speed from the table is still mostly correct. The speed only dropped slightly (~5%) due to SPU RAM also being added to DDR3.
I could make it 7 cycles/64 bit write (when SPU RAM access is not kicking in), and it already reached 90% of the PSX speed. I think with 128 bit writes (11 cycles/128 bit) it could easily pass that. But maybe modifying the FIFO for bigger bursts would be easy then, too.
If you get close to the original PSX in the benchmark, I wouldn't worry too much.
The values in the table are from the more common model. The first gen PSX was slower but games still work fine with it.
Most games are bound to CPU speed anyway.
It's definitely much better now. Even MDK doesn't want to freeze (or at least not that often).
There's only one strange thing happening with these test programs: some of them (like triangle) only show the bottom third of the picture, the upper two thirds of the screen are just white. With the slower VRAM, the whole picture was visible, but flickered heavily.
This test uses a single framebuffer only, so it depends a lot on the draw speed. Otherwise lines are output after the buffer clear, before they have been drawn again.
Fortunately only very few games really depend on the draw speed.
I think all of them that are known to have issues on the Mister core are mentioned here: #244
Another one not mentioned there is Colin 2.0 in the car setup screen. It will flicker when the draw speed is too slow.
This was one of the toughest things to get right, because the clear speed of the real console is super fast due to the full row clear in a single RAM access. The core does have a 64 bit = 4 pixels per 67 MHz cycle clear for that purpose, but I guess you cannot really do that, so maybe some games will have issues. This mostly hits 480i games, because they all use a single frame buffer.
It's this part in the core: https://github.com/MiSTer-devel/PSX_MiSTer/blob/main/rtl/gpu.vhd#L1533-L1540
Maybe the most critical part in terms of GPU->RAM timing.
Good insights. Actually I think creating a special 'fill' cycle with any number of writes would not be that hard, as there's no need to pick up new values for the RAM writes, only a burst count with fixed input data. And even on the SDRAM with CL2, it can run at a 100 MHz/16 bit rate (200 MB/sec).
Interesting that I could reach ~41 MHz/32 bit fill rate, and it's nowhere near enough. In the triangle test, about half of the upper area is cleared. At the end of the vblank, it only finished half of the job. What's the original fill rate? It must be much more than 33 MHz/32 bit (about double). Maybe I have to use 133MHz/CL3 for the SDRAM.
Hmm, found the SGRAM's datasheet about block write: "This cycle writes the color register data in 256 bits (8 columns x 32 I/Os) memory cell in one cycle"
If the memory is clocked the same speed as the CPU (33 MHz), then 64 bits/33MHz = 264MB/s. I wonder how these tests perform on the older VRAM based console.
The fill rate for shaded polygons on the console is 2 pixels per 33 MHz clock, so 32 bit/33 MHz.
The core is running the GPU at double clock speed instead of duplicating the pixel pipeline, so in the core it's 16 bit/66 MHz.
The fill rate for textured polygons is lower and depends on the texture size for the fetches, which also cost memory time of course.
The fill rate for solid color, however, is much higher. It's 16 pixels/32 bytes per access.
You found the datasheet for it. On page 39 it says that it can do that every second clock cycle, so the maximum bandwidth in that mode is 128 bit/33 MHz.
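To put the rates quoted in this thread side by side, here is a small conversion sketch. It uses the nominal clock values as written in the comments (33, 67, 41 and 100 MHz) and counts 1 MB as 10^6 bytes. The helper name is made up for illustration.

```python
def mb_per_s(bits_per_transfer, clock_hz, clocks_per_transfer=1):
    """Bandwidth in MB/s (10^6 bytes) for one transfer of
    `bits_per_transfer` bits every `clocks_per_transfer` clocks."""
    return bits_per_transfer / 8 * clock_hz / clocks_per_transfer / 1e6

rates = {
    "console, shaded polygons (32 bit @ 33 MHz)": mb_per_s(32, 33e6),
    "console, block write (256 bit every 2nd clock @ 33 MHz)": mb_per_s(256, 33e6, 2),
    "MiSTer core clear (64 bit @ 67 MHz)": mb_per_s(64, 67e6),
    "SDRAM port, measured fill (32 bit @ 41 MHz)": mb_per_s(32, 41e6),
    "SDRAM, 16 bit @ 100 MHz": mb_per_s(16, 100e6),
}

for name, rate in rates.items():
    print(f"{name}: {rate:.0f} MB/s")
```

With these nominal figures the block write mode comes out at about four times the shaded polygon rate, and close to the core's 64-bit clear at 67 MHz, which fits the observation that a fill rate of roughly 41 MHz/32 bit was not enough for the single-framebuffer tests.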
The older consoles are much slower. I can't find the absolute numbers anymore, but I still have this comparison:
There is however no pure fill benchmark in the suite.
I wonder what the original burst size was for that shaded fill rate of 2 pixels per 33 MHz clock, as I can hardly believe it achieved this with single writes.
I'm not sure anyone has measured this, but the rectangle test in the benchmark gives a good hint.
Unlike triangles/quads, the rectangle commands have very little overhead, so most of the time goes into writing the pixels.
It scales nearly perfectly with those 2 pixels per clock, with maybe only 10% less fill rate.
That means the original GPU has good knowledge of the usable burst sizes. In a fill or rect rendering, it's easy to calculate ahead, but I'm not sure about polygons or lines.
My change for VRAM fill to use a burst instead of separate 64 bit writes:
gyurco@b7ef1b5#diff-77fd441776bd36b382e7e4052fa712ee7efa2e4321d141d96956a3496a503da9
Also I saw in the PSX specs that there's a texture cache in the GPU, but I didn't discover it in the code. Is it there and am I just blind?
Yes, the GPU cache is here:
https://github.com/MiSTer-devel/PSX_MiSTer/blob/main/rtl/gpu_pixelpipeline.vhd#L333-L379
Typically it would only be there once, but to implement texture filtering as an option it's there 4 times.
Closing this as it seems to be done. If you have further questions, you can still ask, of course.
Yepp, thanks for the help!