Quote: This runs as fast as one MS-DOS NES emulator on a 25MHz 486SX2
You mean Nesticle? Anyway, I've successfully moved all the rendering code to the slave SH2 (there are some minor sync issues, like scrolling happening one scanline too early in some cases) and there was a noticeable speedup in Fusion. The annoying part is that when I ran it on my 32X there was no speedup at all... I'll have to look into it more later and see if something can be done about it.
Gameboy on 32X
Moderator: Mask of Destiny
-
- Very interested
- Posts: 2442
- Joined: Tue Dec 05, 2006 1:37 pm
- Location: Estonia, Rapla City
- Contact:
No, it's not Nesticle... it's some emulator whose name I don't know, and I have no documentation of it left... Nesticle should be faster than that emulator (but less accurate)...
Can't wait for a release
What are you reading? You won't understand it anyway
http://www.tmeeco.eu
Files of all broken links and images of mine are found here : http://www.tmeeco.eu/FileDen
-
- Very interested
- Posts: 2984
- Joined: Fri Aug 17, 2007 9:33 pm
mic_ wrote: Anyway, I've successfully moved all the rendering code to the slave SH2 (there are some minor sync issues, like scrolling happening one scanline too early in some cases) and there was a noticeable speedup in Fusion. The annoying part is that when I ran it on my 32X there was no speedup at all... I'll have to look into it more later and see if something can be done about it.
Remember that it's mostly about bus contention. You want to stay off the bus as much as possible while still doing the job. In the case of Wolf32X, my original attempt slowed the main game down because it stayed on the bus too much. My latest code runs from the cache, only purges the cache line of the variables updated (using one associative purge area write), copies the vars into cache, and doesn't constantly check the hardware (while reading the system registers is fast, it's still using the bus - using a tight loop that waits on a sys reg will consume a LOT of bus time).
All the slave code is copied to cache for that reason. Almost all the data it uses is in cache as well, except for the 32X framebuffer and the GB I/O registers, VRAM and OAM, because both the main and slave SH2 need access to them.
The slave's flag polling loop looks something like:
Code: Select all
mov.l COMM_PORT_0,r4   ! r4 = address of the comm port
wait:
nop
nop
mov.l @r4,r5           ! read the flag (this read uses the bus)
nop
nop
tst r5,r5              ! T = 1 while the flag is still zero
bt wait                ! keep polling until it becomes nonzero
! do stuff..
Maybe that's too tight, I don't know.
-
- Very interested
- Posts: 2984
- Joined: Fri Aug 17, 2007 9:33 pm
Could be. My slave code is waiting on the audio FIFO to not be full, so I have a delay loop of a few hundred cycles between checks. If you're waiting on a vertical blank signal, I imagine you could stretch that time between checks considerably. Or perhaps you could go with an interrupt driven routine. The vertical blank is one of the interrupts the SH2 can respond to.
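To put rough numbers on the trade-off between a tight poll and a delayed one, here is a minimal model in C. It is only a sketch: it assumes the flag becomes nonzero at a known cycle and that the poller reads it once every `period` cycles starting at cycle 0. The function names and the cycle figures in the note below are made up for illustration and are not measurements from real hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: the flag is set at cycle `ready_at`; the poller reads
 * it at cycles 0, period, 2*period, ... Returns how many bus reads happen
 * up to and including the first read that sees the flag set. */
uint32_t bus_reads_until_seen(uint32_t ready_at, uint32_t period)
{
    assert(period > 0);
    uint32_t polls_before = (ready_at + period - 1) / period; /* ceil(ready_at / period) */
    return polls_before + 1;
}

/* Cycles between the flag being set and the poller noticing it. */
uint32_t detection_latency(uint32_t ready_at, uint32_t period)
{
    assert(period > 0);
    uint32_t seen_at = ((ready_at + period - 1) / period) * period;
    return seen_at - ready_at;
}
```

With a flag that is set after 8000 cycles, polling every 7 cycles costs 1144 bus reads, while polling every 200 cycles costs 41, and polling every 300 cycles costs 28 at the price of up to 100 cycles of extra latency.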
-
- Very interested
- Posts: 2984
- Joined: Fri Aug 17, 2007 9:33 pm
Okay - you really didn't go into any detail about what it was doing... not that you have to, of course.
Well, the two SH2s can't interrupt each other directly, but you COULD poke a value in the comm port telling the 68000 to interrupt the slave. The 68000 CAN be in a really tight loop on the comm registers as they're dual-ported. The interrupt would be as fast as the 68000 loop, assuming you aren't also using the 68000 to do other things.
Another thought - depending on where you wait for the slave to finish the command, you could just put up with a longer delay between checks. For example, instead of
Poke command
Wait for response
do
Wait for previous response
Poke command
That allows the master to do other things while the slave works on the task. For example, an emu would be
Do one line worth of emulated cycles
Wait for previous response
Poke command
<repeat>
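The difference between the two orderings can be sketched as a simple timing model in C. This assumes every scanline takes the same number of clocks to emulate (`cpu`) and to draw (`draw`), which a real emulator will not do; the function names and the figures in the note below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* "Poke command, wait for response": the master idles while the slave draws,
 * so every line costs emulation time plus drawing time. */
uint32_t frame_clocks_serial(uint32_t lines, uint32_t cpu, uint32_t draw)
{
    return lines * (cpu + draw);
}

/* "Wait for previous response, poke command": line N is drawn while the
 * master emulates line N+1, so each step costs only the slower of the two. */
uint32_t frame_clocks_overlapped(uint32_t lines, uint32_t cpu, uint32_t draw)
{
    assert(lines > 0);
    uint32_t slower = cpu > draw ? cpu : draw;
    /* The first line's emulation and the last line's drawing have nothing
     * to overlap with. */
    return cpu + (lines - 1) * slower + draw;
}
```

For 144 lines with an assumed 6000 clocks of emulation and 8000 clocks of drawing per line, the serial ordering needs 2,016,000 clocks and the overlapped one 1,158,000. The overlapped time is bounded by whichever side is slower, so it only helps as long as the master has useful work to do in the meantime.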
Yeah that's what I'm doing:
Code: Select all
! The Master<->Slave communication is implemented like this:
!
! The Master adds a command for the Slave to execute by calling
! _slave_send_command. This function first waits for the Slave's
! status to become SLAVE_READY, then sends the command and whatever
! data is needed through the communication port.
! The Slave just sits in a loop looking for some command other than
! SLAVE_CMD_NULL. Once it gets a command, the Slave sets its status
! to SLAVE_BUSY and executes whatever task it has been given. Once it's
! finished the Slave changes its status back to SLAVE_READY and starts
! polling for a new command.
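The handshake described in that comment could be sketched in C roughly as follows. This is not the actual code from the emulator: the `comm_port_t` layout, the numeric values of the constants, `SLAVE_CMD_DRAW_LINE` and `slave_poll_once` are all assumptions made for the example, and the "rendering" is a stand-in that just records the line number.

```c
#include <assert.h>
#include <stdint.h>

enum { SLAVE_CMD_NULL = 0, SLAVE_CMD_DRAW_LINE = 1 };  /* values are assumptions */
enum { SLAVE_READY = 0, SLAVE_BUSY = 1 };

/* Hypothetical layout of the shared communication area. */
typedef struct {
    volatile uint16_t command;
    volatile uint16_t data;
    volatile uint16_t status;
} comm_port_t;

/* Master side: wait until the slave is idle, then hand over a command. */
void slave_send_command(comm_port_t *port, uint16_t cmd, uint16_t data)
{
    assert(cmd != SLAVE_CMD_NULL);
    while (port->status != SLAVE_READY) { /* spin until the slave is idle */ }
    port->data = data;     /* write the argument first... */
    port->command = cmd;   /* ...then the command word the slave is polling */
}

/* Slave side: one pass of the polling loop. Returns 1 if a command ran. */
int slave_poll_once(comm_port_t *port, uint16_t *last_line_drawn)
{
    uint16_t cmd = port->command;
    if (cmd == SLAVE_CMD_NULL)
        return 0;                       /* nothing to do yet */
    port->status = SLAVE_BUSY;
    uint16_t arg = port->data;
    port->command = SLAVE_CMD_NULL;     /* consume the command */
    if (cmd == SLAVE_CMD_DRAW_LINE)
        *last_line_drawn = arg;         /* stand-in for the real rendering */
    port->status = SLAVE_READY;
    return 1;
}
```

Note that in this sketch a second send issued before the slave has picked up the first would overwrite it, because the status is still SLAVE_READY at that point. How the real code handles that window isn't shown in the thread.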
-
- Very interested
- Posts: 2984
- Joined: Fri Aug 17, 2007 9:33 pm
You can get all the code in its current state here. You'll need the GNU binutils for SH and M68k to build it. Add whatever ROMs you want to roms.s, but there's no GUI yet so it'll only load the first ROM in the table.
Known issues:
* Speed, obviously.
* Joypad emulation is buggy.
* The SBC instruction sets the flags incorrectly (at least H, possibly C).
* ..so do a few others, like ADD SP,n.
* Scrolling sometimes occurs on the wrong scanline since CPU emulation and PPU emulation are run in parallel, and are allowed to get out of sync by up to about 450 cycles (GB CPU cycles, that is).
* It assumes that every ROM uses MBC1, so only ROMs using MBC1 (or no memory bank controller at all) will work.
-
- Very interested
- Posts: 2984
- Joined: Fri Aug 17, 2007 9:33 pm
Cool - I'll do a little playing to see what I find.
EDIT:
The problem with the pad was you forgot to move r1 to r0 in one place in gui.s:
Code: Select all
mov r1,r0    ! the missing move: copy the pad value into r0
and #3,r0    ! and #imm only works on r0
shll2 r0
You just had the and & shift, so you were using leftover data instead of the pad value.
I've made a makefile, and updated to my latest m68k_crt1.s which has my latest pad code (not that this needs 6 button controller support).
My compile works well - I can actually play DigDug now! Great job! Now I'll try playing with the timing on the command handling just to see how it reacts.
EDIT: Well, after testing and looking at the code, the emu is either completely graphics bound, or completely (emulated) CPU bound. Considering the code, I'd say completely CPU bound. The interpreter "main loop" (if it can even be thought of as such a thing) is the most hideous piece of coding I've ever had the misfortune of looking at. A quick look at the original C code shows it's pretty much just the original C code done straight as assembly. I've never before seen an interpreter "main loop" that took longer per instruction than the rest of the program put together. Sad to say, now I have. This needs a rewrite DESPERATELY. I'll see what I can whip up.
There are a lot of branches in there as you can see, so not all the code is executed on every iteration.
But if you can reduce the size of the main loop without breaking any functionality (with timers and interrupts still working properly) that would be good.
I think the time the main SH2 has to spend waiting for the slave when it wants to draw another scanline while the slave is still drawing the previous one is an important factor as well, though.
Just consider that for each scanline it needs to draw 160 pixels. And each of those pixels may be written to multiple times; you've got your background, window and sprites - any game will at least have the background and sprites enabled. So let's say that any given pixel takes 50 SH2 clocks to render (it could be way more in some cases, but let's just assume 50 for now). 160*50 would mean 8000 clocks per scanline. In the meantime the main SH2's task is to run the GB-Z80 for 456 GB-Z80 clocks. Let's say the average instruction takes 7.5 GB-Z80 clocks, so a scanline is about 456/7.5 = 61 instructions. That gives the main SH2 8000*7.5/456 = about 131.6 SH2 clocks to emulate each instruction.
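The same back-of-the-envelope estimate, written out in C so the inputs are easy to change. The 50 clocks per pixel and 7.5 clocks per instruction are the guesses from the post above, not measurements, and the function names are made up for this sketch.

```c
#include <assert.h>

/* SH2 clocks the slave needs to render one scanline. */
double scanline_render_clocks(int pixels, double sh2_clocks_per_pixel)
{
    return pixels * sh2_clocks_per_pixel;
}

/* SH2 clocks the master can spend per emulated instruction if it is to
 * finish one scanline's worth of GB-Z80 clocks in the time the slave takes
 * to draw one scanline. */
double sh2_clocks_per_instruction(double render_clocks,
                                  double gb_clocks_per_line,
                                  double avg_gb_clocks_per_instruction)
{
    assert(gb_clocks_per_line > 0);
    return render_clocks * avg_gb_clocks_per_instruction / gb_clocks_per_line;
}
```

With 160 pixels at 50 clocks each, 456 GB-Z80 clocks per line and 7.5 clocks per instruction, this works out to 60000/456, or roughly 131.6 SH2 clocks per emulated instruction.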
-
- Very interested
- Posts: 2984
- Joined: Fri Aug 17, 2007 9:33 pm
Yeah, I noticed that much of the code is bypassed by branches, but even the best path through the main loop is ridiculously long for an interpreted core. I can shorten it considerably. The main issue I have with it is that it checks the DIV and timer every time through the loop when you KNOW that the DIV only needs to be checked once every 256 clocks, and the timer every N+1 clocks. It needs a main loop like the one I used on the Atari emu my brother and I did for the PowerMac - find the minimum clocks from the cycles per H, cycles until the next DIV increment, and the next timer underflow, then do a TIGHT loop executing instructions until the cycles run out or you get some kind of event demanding attention.
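A minimal sketch of that kind of event-driven loop in C, to show the shape of it. This is not the code from either emulator: the `gb_sched_t` struct, `run_for` and the fixed 4-clock `execute_one` stand-in are all hypothetical, and the timer period is just a parameter rather than being derived from the GB timer registers.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int32_t to_line_end;   /* GB clocks left in the current scanline */
    int32_t to_div;        /* GB clocks until DIV increments again */
    int32_t to_timer;      /* GB clocks until the next timer event */
    int32_t timer_period;  /* assumed fixed here */
    uint32_t lines, div_ticks, timer_ticks, instructions;
} gb_sched_t;

static int32_t min3(int32_t a, int32_t b, int32_t c)
{
    int32_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Stand-in for the opcode dispatcher: pretends every instruction takes 4 clocks. */
static int32_t execute_one(gb_sched_t *s)
{
    s->instructions++;
    return 4;
}

void run_for(gb_sched_t *s, int32_t clocks)
{
    assert(s->timer_period > 0);
    while (clocks > 0) {
        /* Find the nearest event, then run up to it without checking anything. */
        int32_t budget = min3(s->to_line_end, s->to_div, s->to_timer);
        if (budget > clocks)
            budget = clocks;
        int32_t elapsed = 0;
        while (elapsed < budget)          /* the tight loop */
            elapsed += execute_one(s);
        clocks -= elapsed;

        /* Only now update the counters and handle whichever events came due. */
        s->to_line_end -= elapsed;
        if (s->to_line_end <= 0) { s->lines++;       s->to_line_end += 456; }
        s->to_div -= elapsed;
        if (s->to_div <= 0)      { s->div_ticks++;   s->to_div += 256; }
        s->to_timer -= elapsed;
        if (s->to_timer <= 0)    { s->timer_ticks++; s->to_timer += s->timer_period; }
    }
}
```

The counters are signed so that an instruction overshooting an event boundary by a few clocks simply carries the remainder over into the next period.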
The opcode code looks fairly decent... the flag calculations take a bit of time, but it's not something to worry about now. The main loop for dispatching instructions (even best case) is longer than most instructions - something that must be changed first. As for the drawing, I like how you put the line in the cache. That should help considerably given that you may need to set points repeatedly (as you pointed out).
Quote: find the minimum clocks from the cycles per H, cycles until the next DIV increment
I've used the event-based approach in another project, just for the reason of not having to update timers after every instruction. I was unsure if it'd be worth the extra overhead when emulating a machine as slow as the Gameboy, but it could be worth a try.
Quote: the flag calculations take a bit of time
Unfortunately the SH doesn't have things like a subtraction flag or a half-carry flag, so it has to be done in software. I tried to make pretty much all the flag calculations branchless, as long as a branched path wouldn't be faster.
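For illustration, here is one common way to compute the Game Boy flags without branches, written in C. This is a sketch of the general technique and not the emulator's actual SH2 code; the function names are made up. The flag bit positions (Z = 0x80, N = 0x40, H = 0x20, C = 0x10) are those of the GB F register.

```c
#include <assert.h>
#include <stdint.h>

#define FLAG_Z 0x80u
#define FLAG_N 0x40u
#define FLAG_H 0x20u
#define FLAG_C 0x10u

/* Z/H/C from the operands and the untruncated 32-bit result, no branches.
 * a ^ b ^ r has a 1 in every bit position that received a carry or borrow,
 * so bit 4 gives the half-carry and bit 8 the carry. */
static uint8_t zhc_flags(uint32_t a, uint32_t b, uint32_t r)
{
    uint32_t h = ((a ^ b ^ r) & 0x10u) << 1;       /* bit 4 -> FLAG_H */
    uint32_t c = (r >> 4) & 0x10u;                 /* bit 8 -> FLAG_C */
    uint32_t z = (((r & 0xFFu) - 1u) >> 24) & 0x80u; /* wraps to all ones only when the result is 0 */
    return (uint8_t)(z | h | c);
}

/* ADD A,n */
uint8_t gb_add8(uint8_t a, uint8_t b, uint8_t *flags)
{
    uint32_t r = (uint32_t)a + b;
    *flags = zhc_flags(a, b, r);
    return (uint8_t)r;
}

/* SBC A,n: subtract with borrow; N is always set. */
uint8_t gb_sbc8(uint8_t a, uint8_t b, uint8_t carry_in, uint8_t *flags)
{
    assert(carry_in <= 1);
    uint32_t r = (uint32_t)a - b - carry_in;       /* wraps on borrow, setting bit 8 */
    *flags = (uint8_t)(zhc_flags(a, b, r) | FLAG_N);
    return (uint8_t)r;
}
```

For example, 0xFF + 0x01 gives a result of 0x00 with Z, H and C set (0xB0), and 0x10 - 0x0F with carry-in 1 gives 0x00 with Z, N and H set (0xE0).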