Gameboy on 32X

mic_ · Post by **mic_** » Sat Feb 21, 2009 2:46 pm

This runs as fast as one MS-DOS NES emulator on a 25MHz 486SX2

You mean Nesticle? Anyway, I've successfully moved all the rendering code to the slave SH2 (there are some minor sync issues, like scrolling happening one scanline too early in some cases) and there was a noticeable speedup in Fusion. The annoying part is that when I ran it on my 32X there was no speedup at all. I'll have to look into it more later and see if something can be done about it.

TmEE co.(TM) · Post by **TmEE co.(TM)** » Sat Feb 21, 2009 2:48 pm

No, it's not Nesticle... it's some emulator whose name I don't know, and I have no documentation of it left... Nesticle should be faster than that emulator (but less accurate)...

Can't wait for a release

haroldoop · Post by **haroldoop** » Sat Feb 21, 2009 10:34 pm

Really impressive, who could imagine that the 32X would be able to emulate a GameBoy at an acceptable speed?

Chilly Willy · Post by **Chilly Willy** » Sun Feb 22, 2009 1:44 am

mic_ wrote: Anyway, I've successfully moved all the rendering code to the slave SH2 (there are some minor sync issues, like scrolling happening one scanline too early in some cases) and there was a noticeable speedup in Fusion. The annoying part is that when I ran it on my 32X there was no speedup at all. I'll have to look into it more later and see if something can be done about it.

Remember that it's mostly about bus contention. You want to stay off the bus as much as possible while still doing the job. In the case of Wolf32X, my original attempt slowed the main game down because it stayed on the bus too much. My latest code runs from the cache, only purges the cache line of the variables updated (using one associative purge area write), copies the vars into cache, and doesn't constantly check the hardware (while reading the system registers is fast, it's still using the bus - using a tight loop that waits on a sys reg will consume a LOT of bus time).

mic_ · Post by **mic_** » Sun Feb 22, 2009 8:51 am

All the slave code is copied to cache for that reason. Almost all the data it uses is in cache as well, except for the 32X framebuffer, and the GB I/O registers, VRAM and OAM, because both the main and slave SH2 need access to them.

The slave's flag polling loop looks something like:

Code:

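	mov.l	COMM_PORT_0,r4	! r4 = comm port address, loaded from the literal at COMM_PORT_0
wait:
	nop
	nop
	mov.l	@r4,r5		! read the comm port (every read goes out on the bus)
	nop
	nop
	tst	r5,r5		! T = 1 while the port still reads zero
	bt	wait		! no command yet, poll again
	! do stuff..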

Maybe that's too tight, I don't know.

Chilly Willy · Post by **Chilly Willy** » Sun Feb 22, 2009 9:18 am

Could be. My slave code is waiting on the audio FIFO to not be full, so I have a delay loop of a few hundred cycles between checks. If you're waiting on a vertical blank signal, I imagine you could stretch that time between checks considerably. Or perhaps you could go with an interrupt driven routine. The vertical blank is one of the interrupts the SH2 can respond to.

mic_ · Post by **mic_** » Sun Feb 22, 2009 9:43 am

No, not for a vertical blank - for the master SH2 to place a command on the comm port (which is currently either "draw a scanline" or "flip the framebuffer").

Chilly Willy · Post by **Chilly Willy** » Sun Feb 22, 2009 10:03 am

Okay - you really didn't go into any detail about what it was doing... not that you have to, of course.

Well, the two SH2s can't interrupt each other directly, but you COULD poke a value in the comm port telling the 68000 to interrupt the slave. The 68000 CAN be in a really tight loop on the comm registers as they're dual-ported. The interrupt would be as fast as the 68000 loop, assuming you aren't also using the 68000 to do other things.

Another thought - depending on where you wait for the slave to finish the command, you could just put up with a longer delay between checks. For example, instead of

Poke command
Wait for response

do

Wait for previous response
Poke command

That allows the master to do other things while the slave works on the task. For example, an emu would be

Do one line worth of emulated cycles
Wait for previous response
Poke command
<repeat>
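A rough way to see why the second ordering helps is a small cost model in C. This is only a sketch: the `line_cost` struct, the function names, and any cycle counts fed to it are hypothetical, not taken from the emulator.

```c
#include <stdint.h>

/* Hypothetical per-scanline costs, in SH2 cycles. */
typedef struct {
    uint32_t emulate_cycles; /* master: emulate one line of GB CPU time */
    uint32_t draw_cycles;    /* slave: draw one scanline */
} line_cost;

/* "Poke command, wait for response": the master idles while the slave draws. */
uint64_t serial_total(line_cost c, uint32_t lines)
{
    return (uint64_t)lines * ((uint64_t)c.emulate_cycles + c.draw_cycles);
}

/* "Wait for previous response, poke command": drawing line n overlaps
   with emulating line n+1. */
uint64_t pipelined_total(line_cost c, uint32_t lines)
{
    uint64_t now = 0;           /* master's clock */
    uint64_t slave_free_at = 0; /* when the slave finishes its current job */

    for (uint32_t i = 0; i < lines; i++) {
        now += c.emulate_cycles;             /* do one line of emulated cycles */
        if (slave_free_at > now)
            now = slave_free_at;             /* wait for previous response */
        slave_free_at = now + c.draw_cycles; /* poke command */
    }
    /* The frame is done once the last scanline has been drawn. */
    return slave_free_at > now ? slave_free_at : now;
}
```

With made-up costs of 5000 cycles to emulate and 8000 to draw, over 144 lines, the serial order costs 144 * 13000 cycles while the pipelined order costs 5000 + 144 * 8000: the per-line cost approaches the slower of the two tasks instead of their sum.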

mic_ · Post by **mic_** » Sun Feb 22, 2009 10:19 am

Yeah that's what I'm doing:

Code:

! The Master<->Slave communication is implemented like this:
!
!  The Master adds a command for the Slave to execute by calling
!  _slave_send_command. This function first waits for the Slave's
!  status to become SLAVE_READY, then sends the command and whatever
!  data is needed through the communication port.
!  The Slave just sits in a loop looking for some command other than
!  SLAVE_CMD_NULL. Once it gets a command, the Slave sets its status
!  to SLAVE_BUSY and executes whatever task it has been given. Once it's
!  finished the Slave changes its status back to SLAVE_READY and starts
!  polling for a new command.
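The handshake described in that comment can be sketched in C as a single-threaded model. `SLAVE_READY`, `SLAVE_BUSY` and `SLAVE_CMD_NULL` are the names from the comment, but their numeric values, the `comm_port` layout, the two command names and the non-blocking `master_try_send` are assumptions made for illustration (the real `_slave_send_command` waits rather than returning).

```c
#include <stdint.h>

/* Values are assumed; only the three status/null names come from the comment. */
enum { SLAVE_CMD_NULL = 0, SLAVE_CMD_DRAW_LINE = 1, SLAVE_CMD_FLIP = 2 };
enum { SLAVE_READY = 0, SLAVE_BUSY = 1 };

/* Hypothetical layout of the shared communication port. */
typedef struct {
    volatile uint16_t command;
    volatile uint16_t status;
} comm_port;

/* Master side: post a command only if the slave is idle and the previous
   command has already been picked up. Returns 1 if the command was posted. */
int master_try_send(comm_port *p, uint16_t cmd)
{
    if (p->status != SLAVE_READY || p->command != SLAVE_CMD_NULL)
        return 0;
    p->command = cmd;
    return 1;
}

/* Slave side: one pass of the polling loop. Returns the command taken,
   or SLAVE_CMD_NULL if nothing was waiting. */
uint16_t slave_poll(comm_port *p)
{
    uint16_t cmd = p->command;
    if (cmd == SLAVE_CMD_NULL)
        return SLAVE_CMD_NULL;
    p->status = SLAVE_BUSY;      /* mark busy before acknowledging */
    p->command = SLAVE_CMD_NULL; /* acknowledge by clearing the port */
    return cmd;
}

/* Slave side: called when the task is finished. */
void slave_done(comm_port *p)
{
    p->status = SLAVE_READY;
}
```

The extra check on `command` in `master_try_send` covers the window in which a command has been posted but the slave has not yet switched to `SLAVE_BUSY`; whether the original code needs such a check depends on details not shown in the post.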

Chilly Willy · Post by **Chilly Willy** » Sun Feb 22, 2009 12:41 pm

Well, one thing I'd do is just try playing with the delay between checks. Too often and you'll use too much bus time. Too long and you'll make the master wait. Try various values to see if it makes any difference at all. That should be easy enough to do without any major changes.
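To get a feel for that trade-off before touching the assembly, a tiny model in C may help. It assumes the slave polls at a fixed period (loop cost plus delay) starting at cycle 0; the function names and any cycle counts are illustrative, not measured on hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Number of polls (bus reads) that find no command, if the command
   arrives after wait_cycles and polls happen at t = 0, P, 2P, ... */
uint32_t idle_polls(uint32_t wait_cycles, uint32_t loop_cycles,
                    uint32_t delay_cycles)
{
    uint32_t period = loop_cycles + delay_cycles;
    assert(period > 0);
    return (wait_cycles + period - 1) / period; /* ceiling division */
}

/* Upper bound on how long a freshly posted command can sit unnoticed. */
uint32_t worst_case_latency(uint32_t loop_cycles, uint32_t delay_cycles)
{
    return loop_cycles + delay_cycles;
}
```

For example, with a hypothetical 7-cycle loop and an 8000-cycle wait, adding a 200-cycle delay cuts the wasted reads from 1143 to 39, at the price of up to 207 cycles before the slave notices a new command.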

mic_ · Post by **mic_** » Sun Feb 22, 2009 2:51 pm

You can get all the code in its current state here. You'll need the GNU binutils for SH and M68k to build it. Add whatever ROMs you want to roms.s, but there's no GUI yet so it'll only load the first ROM in the table.

Known issues:

* Speed, obviously.
* Joypad emulation is buggy.
* The SBC instruction sets the flags incorrectly (at least H, possibly C).
* ..so do a few others, like ADD SP,n.
* Scrolling sometimes occurs on the wrong scanline since CPU emulation and PPU emulation are run in parallel, and are allowed to get out of sync by up to about 450 cycles (GB CPU cycles, that is).
* It assumes that every ROM uses MBC1, so only ROMs using MBC1 (or no memory bank controller at all) will work.

Chilly Willy · Post by **Chilly Willy** » Sun Feb 22, 2009 9:08 pm

Cool - I'll do a little playing to see what I find.

EDIT:
The problem with the pad was you forgot to move r1 to r0 in one place in gui.s:

Code:

	mov	r1,r0
	and	#3,r0
	shll2	r0

You just had the and & shift, so you were using leftover data instead of the pad value.

I've made a makefile, and updated to my latest m68k_crt1.s which has my latest pad code (not that this needs 6-button controller support).

My compile works well - I can actually play DigDug now! Great job! Now I'll try playing with the timing on the command handling just to see how it reacts.

EDIT: Well, after testing and looking at the code, the emu is either completely graphics bound, or completely (emulated) CPU bound. Considering the code, I'd say completely CPU bound. The interpreter "main loop" (if it can even be thought of as such a thing) is the most hideous piece of coding I've ever had the misfortune of looking at. A quick look at the original C code shows it's pretty much just the original C code done straight as assembly. I've never before seen an interpreter "main loop" that took longer per instruction than the rest of the program put together. Sad to say, now I have. This needs a rewrite DESPERATELY. I'll see what I can whip up.

mic_ · Post by **mic_** » Mon Feb 23, 2009 5:53 am

There are a lot of branches in there as you can see, so not all the code is executed on every iteration.
But if you can reduce the size of the main loop without breaking any functionality (with timers and interrupts still working properly) that would be good.

I think the time the main SH2 has to spend waiting for the slave when it wants to draw another scanline while the slave is still drawing the previous one is an important factor as well, though.
Just consider that for each scanline it needs to draw 160 pixels. And each of those pixels may be written to multiple times; you've got your background, window and sprites - any game will at least have the background and sprites enabled. So let's say that any given pixel takes 50 SH2 clocks to render (it could be way more in some cases, but let's just assume 50 for now). 160*50 would mean 8000 clocks per scanline. In the meantime the main SH2's task is to run the GB-Z80 for 456 GB-Z80 clocks. Let's say the average instruction takes 7.5 GB-Z80 clocks. That is 456/7.5 = 60.8 instructions per scanline, which gives the main SH2 8000*7.5/456 = about 131.6 SH2 clocks to emulate each instruction.
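The same estimate as a small C helper, so the assumed numbers (50 SH2 clocks per pixel, 7.5 GB-Z80 clocks per instruction) can be varied. The function name and parameters are made up for this sketch.

```c
#include <assert.h>

/* SH2 clocks the master can spend per emulated instruction if it is to
   finish a line's worth of GB-Z80 work in the time the slave needs to
   draw that line. All inputs are estimates, not measurements. */
double sh2_clocks_per_instruction(double pixels_per_line,
                                  double sh2_clocks_per_pixel,
                                  double gb_clocks_per_line,
                                  double avg_gb_clocks_per_instr)
{
    assert(gb_clocks_per_line > 0.0 && avg_gb_clocks_per_instr > 0.0);

    /* Time the slave spends drawing one scanline, e.g. 160 * 50 = 8000. */
    double sh2_clocks_per_line = pixels_per_line * sh2_clocks_per_pixel;

    /* Instructions emulated per scanline, e.g. 456 / 7.5 = 60.8. */
    double instrs_per_line = gb_clocks_per_line / avg_gb_clocks_per_instr;

    return sh2_clocks_per_line / instrs_per_line;
}
```

Calling it with `(160, 50, 456, 7.5)` reproduces the figure of roughly 131.6; doubling the per-pixel cost doubles the budget, which is another way of saying the slave becomes the bottleneck.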

Chilly Willy · Post by **Chilly Willy** » Mon Feb 23, 2009 7:09 am

Yeah, I noticed that much of the code is bypassed by branches, but even the best path through the main loop is ridiculously long for an interpreted core. I can shorten it considerably. The main issue I have with it is that it checks the DIV and timer every time through the loop when you KNOW that the DIV only needs to be checked once every 256 clocks, and the timer every N+1 clocks. It needs a main loop like the one I used on the Atari emu my brother and I did for the PowerMac - find the minimum clocks from the cycles per H, cycles until the next DIV increment, and the next timer underflow, then do a TIGHT loop executing instructions until the cycles run out or you get some kind of event demanding attention.
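A minimal C sketch of that kind of loop, under stated assumptions: the `event_clocks` struct, `run_until_next_event`, and the `step_fn` callback are hypothetical names, and `fixed_cost_step` is only a stand-in for a real opcode dispatcher. Interrupts and other events that would force an early exit are not modelled.

```c
#include <stdint.h>

/* Cycles remaining until each periodic event (hypothetical layout). */
typedef struct {
    int32_t until_line_end; /* rest of the current scanline */
    int32_t until_div;      /* next DIV increment */
    int32_t until_timer;    /* next timer overflow */
} event_clocks;

/* Executes one emulated instruction and returns its cost in cycles. */
typedef int32_t (*step_fn)(void *cpu);

static int32_t min3(int32_t a, int32_t b, int32_t c)
{
    int32_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Run instructions until the nearest event is due. The inner loop does
   no timer or DIV bookkeeping at all. */
int32_t run_until_next_event(event_clocks *ev, step_fn step, void *cpu)
{
    int32_t budget = min3(ev->until_line_end, ev->until_div, ev->until_timer);
    int32_t elapsed = 0;

    while (elapsed < budget)
        elapsed += step(cpu);

    /* Charge the elapsed time once. A counter that reaches zero or goes
       negative is due; the caller services it and adds its period back,
       which preserves any overshoot from the last instruction. */
    ev->until_line_end -= elapsed;
    ev->until_div      -= elapsed;
    ev->until_timer    -= elapsed;
    return elapsed;
}

/* Stand-in for a real dispatcher: counts calls and charges 4 cycles. */
int32_t fixed_cost_step(void *cpu)
{
    uint32_t *executed = cpu;
    (*executed)++;
    return 4;
}
```

After the call, the caller would check which counters are at or below zero, for example incrementing DIV and adding 256 back to `until_div`, before running the next slice.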

The opcode code looks fairly decent... the flag calculations take a bit of time, but it's not something to worry about now. The main loop for dispatching instructions (even best case) is longer than most instructions - something that must be changed first. As for the drawing, I like how you put the line in the cache. That should help considerably given that you may need to set points repeatedly (as you pointed out).

mic_ · Post by **mic_** » Mon Feb 23, 2009 7:28 am

Chilly Willy wrote: find the minimum clocks from the cycles per H, cycles until the next DIV increment

I've used the event-based approach in another project, just for the reason of not having to update timers after every instruction. I was unsure if it'd be worth the extra overhead when emulating a machine as slow as the Gameboy, but it could be worth a try.

Chilly Willy wrote: the flag calculations take a bit of time

Unfortunately the SH doesn't have things like a subtraction flag or a half-carry flag, so it has to be done in software. I tried to make pretty much all the flag calculations branchless, as long as a branched path wouldn't be faster.
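For illustration, here is one common branchless way to derive the Game Boy flags for 8-bit add and subtract, written in C rather than SH-2 assembly. It is a sketch of the general technique, not the emulator's actual code; the `alu8` struct and function names are hypothetical. The flag bit positions (Z=0x80, N=0x40, H=0x20, C=0x10) are those of the Game Boy's F register.

```c
#include <stdint.h>

enum { FLAG_Z = 0x80, FLAG_N = 0x40, FLAG_H = 0x20, FLAG_C = 0x10 };

typedef struct {
    uint8_t result;
    uint8_t flags;
} alu8;

/* Z flag without a branch: for r in 0..255, r - 1 wraps to 0xFFFFFFFF
   only when r == 0, so bit 31 shifted down lands on bit 7. */
static uint32_t zero_flag(uint32_t r)
{
    return ((r - 1u) >> 24) & FLAG_Z;
}

/* ADD A,n: the carry into bit 4 is a ^ b ^ sum at bit 4; the carry out
   of bit 7 is bit 8 of the wide sum. */
alu8 add8(uint8_t a, uint8_t b)
{
    uint32_t sum = (uint32_t)a + b;
    uint32_t r = sum & 0xFFu;
    uint32_t f = zero_flag(r)
               | (((a ^ b ^ sum) & 0x10u) << 1) /* H */
               | ((sum >> 4) & FLAG_C);         /* C: bit 8 -> bit 4 */
    alu8 out = { (uint8_t)r, (uint8_t)f };
    return out;
}

/* SUB n: same bit tricks on the wrapped difference, with N always set. */
alu8 sub8(uint8_t a, uint8_t b)
{
    uint32_t diff = (uint32_t)a - b; /* wraps when a < b */
    uint32_t r = diff & 0xFFu;
    uint32_t f = zero_flag(r)
               | FLAG_N
               | (((a ^ b ^ diff) & 0x10u) << 1) /* H: borrow from bit 4 */
               | ((diff >> 4) & FLAG_C);         /* C: borrow, bit 8 -> bit 4 */
    alu8 out = { (uint8_t)r, (uint8_t)f };
    return out;
}
```

For instance, `add8(0x0F, 0x01)` yields result 0x10 with only H set, and `sub8(0x00, 0x01)` yields 0xFF with N, H and C set. SBC and ADC would need the incoming carry folded into the same expressions.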