I wanted to write some sprite blitting routines on the 32X. I first coded my routine in C, and it looked like this (note that I'm using an 8-bit paletted framebuffer):
Code:
// draws a sprite to the screen
inline void drawSprite( u8* sprite, int x, int y, int w, int h )
{
    // get fb pointer
    volatile u8 *fb = (volatile u8 *)&MARS_FRAMEBUFFER;

    // deal with sprites partially intersecting screen edges
    u8 hwritebegin = 0;
    if( x < 0 )
    {
        hwritebegin = -x;
        x = 0;
    }
    u8 hwritelength = w - hwritebegin;
    if( ( x + hwritelength ) >= DISPLAY_WIDTH )
        hwritelength = DISPLAY_WIDTH - x;

    u8 vwritebegin = 0;
    if( y < 0 )
    {
        vwritebegin = -y;
        y = 0;
    }
    u8 vwritelength = h - vwritebegin;
    if( ( y + vwritelength ) >= DISPLAY_HEIGHT )
        vwritelength = DISPLAY_HEIGHT - y;

    // pointer to the sprite's top left corner in the framebuffer
    int vram_ptr;
    vram_ptr = 0x200 + x;
    vram_ptr += (y * DISPLAY_WIDTH);

    // pointer to sprite pixel
    int sprite_ptr;
    sprite_ptr = ( vwritebegin * w ) + hwritebegin;
    u8 half_hwritelength = hwritelength / 4;
    fb += vram_ptr;
    sprite += sprite_ptr;

    // memcpy sprite line-by-line into frame buffer
    u8 v;
    for( v = 0; v < vwritelength; v++ )
    {
        // memcpy pixel row into framebuffer
        memcpy( fb, sprite, hwritelength );
        // increment vram_ptr to next fb row
        fb += DISPLAY_WIDTH;
        // increment sprite_ptr to next img row
        sprite += w;
    }
}
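As an aside on the edge handling above: it covers sprites that partially cross an edge, but not ones that are entirely off screen, and the `u8` counters wrap for widths over 255. A minimal sketch of the same clipping with signed ints follows. `ClipRect`, `clip_sprite`, `SCREEN_W`, and `SCREEN_H` are hypothetical names; 320 matches the framebuffer pitch used elsewhere in this post, and 224 is an assumed display height.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed screen size; the post only refers to DISPLAY_WIDTH / DISPLAY_HEIGHT. */
enum { SCREEN_W = 320, SCREEN_H = 224 };

typedef struct {
    int src_x, src_y;   /* first visible pixel inside the sprite */
    int dst_x, dst_y;   /* where that pixel lands on screen      */
    int w, h;           /* visible size in pixels                */
} ClipRect;

/* Clips a w*h sprite placed at (x, y) against the screen.
   Returns false when no part of the sprite is visible. */
static bool clip_sprite(int x, int y, int w, int h, ClipRect *out)
{
    int src_x = 0, src_y = 0;

    assert(out != NULL);

    /* left / top edges: skip the hidden part of the source */
    if (x < 0) { src_x = -x; w += x; x = 0; }
    if (y < 0) { src_y = -y; h += y; y = 0; }

    /* right / bottom edges: shorten the copy */
    if (x + w > SCREEN_W) w = SCREEN_W - x;
    if (y + h > SCREEN_H) h = SCREEN_H - y;

    /* fully off screen in either direction */
    if (w <= 0 || h <= 0)
        return false;

    out->src_x = src_x; out->src_y = src_y;
    out->dst_x = x;     out->dst_y = y;
    out->w = w;         out->h = h;
    return true;
}
```

The caller would skip the blit entirely when `clip_sprite` returns false, and otherwise derive the sprite and framebuffer offsets from the returned rectangle.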
So I thought to myself, "This might even be good enough for a game, coupled with writing some code to utilize the Genny's hardware planes for tilemapping, but just because I can I want to see if I can implement this in assembly".
My approach ended up being to continue doing all of the upfront work in C, but reimplement that memcpy loop in assembly. That turned out to look like this:
Code:
! void wtf( u8* spritePtr, u8* fbPtr, int blitW, int blitH, int blitStep )
.align 4
.global _wtf
_wtf:
! {
! On entry: r4 = spritePtr, r5 = fbPtr, r6 = blitW, r7 = blitH, blitStep pushed onto stack
! copy blitStep into r0
mov.l @r15,r0
! push r8, r9, and r10 onto stack (they're callee-saved and we're going to use them as scratch registers, so we need to save them)
mov.l r8,@-r15
mov.l r9,@-r15
mov.l r10,@-r15
! initialize r8 as framebuffer pitch so we can sum with fbPtr in the copy loop
mov.w fb_pitch,r8
! initialize r1 as rows to copy
mov r7,r1
! copy scanline loop
wtf__copy_scan_loop:
! {
! copy number of bytes from spritePtr to fbPtr
! this should be equivalent to a standard memcpy in terms of perf, right?
! memory copy loop
! initialize r2 as length to copy
mov r6,r2
! copy sprite and fb pointers for iteration
mov r4,r9
mov r5,r10
wtf__memcpy_loop:
!{
! load from spritePtr into r3, increment spritePtr
mov.b @r9+,r3
! store from r3 to fbPtr
mov.b r3,@r10
! loop check
dt r2
bf/s wtf__memcpy_loop
add #1,r10 ! increment fbPtr
!} // wtf__memcpy_loop
! finished copying a row of pixels
add r8,r5 ! move framebuffer pointer to next scanline by adding fb_pitch
! loop check
dt r1 ! decrement and compare scan iterator
bf/s wtf__copy_scan_loop
add r0,r4 ! move sprite pointer to next sprite pixel row by adding blitStep
! } // wtf__copy_scan_loop
! restore r8, r9, and r10 off of stack
mov.l @r15+,r10
mov.l @r15+,r9
mov.l @r15+,r8
rts
nop
! framebuffer pitch in bytes, used for incrementing framebuffer pointer by one scanline
fb_pitch:
.word 320
! } // _wtf
Code:
wtf( sprite, fb, hwritelength, vwritelength, w );
I'm absolutely convinced there's just something stupid I'm doing, but I'd love to get some insights into how my C code is getting noticeably better performance than my hand-written assembly.
I also tried to see if it was the fact that I nested my externed function inside another function, but splitting off the original pure-C loop into a separate function with a similar signature and calling that instead still yields the performance of the original C code, much faster than my assembly.
EDIT: Hm, actually I might have a clue. Noticed that my pure C memcpy-based version actually doesn't seem to do the zero-byte-ignore copy thing quite as I'd expect. I'm using that feature to my advantage so that zero means transparent, but my memcpy code results in odd flickering when sprites cross. My assembly-based version, on the other hand, appears to be working precisely as expected, with no flickering. I wonder if that means memcpy is trying to do some sort of optimization involving not just straight copying byte-by-byte. That could explain both the artifacts and the performance difference.
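To make the difference concrete, here is a host-side sketch that emulates in software the "zero means transparent" behavior described above and contrasts it with a plain copy. It only models the behavior as this post describes it and is not a statement about what the 32X hardware or any particular `memcpy` actually does. `copy_row_skip_zero` and `copy_row_opaque` are hypothetical helper names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise copy that treats 0 as transparent:
   wherever the source pixel is 0, the destination keeps its old pixel. */
static void copy_row_skip_zero(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (src[i] != 0)
            dst[i] = src[i];
    }
}

/* Plain copy: every byte is written, zeros included,
   so whatever was behind the sprite is overwritten. */
static void copy_row_opaque(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}
```

With a background row of `{9, 9, 9, 9}` and a sprite row of `{1, 0, 2, 0}`, the first helper leaves `{1, 9, 2, 9}` while the second leaves `{1, 0, 2, 0}`. If a copy routine ends up on the second path for some rows and the first for others, overlapping sprites would show exactly that kind of inconsistency.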
EDIT 2: OK, so I did make an optimization to my assembly, and this actually makes it a little bit more performant than my C code. I made it copy entire longs at a time into the framebuffer instead of copying byte-by-byte (which adds the requirement that the sprite width is a multiple of 4), and then modified the calling code so that it copies into the overwrite buffer to preserve the zero-byte-ignore behavior. I assume memcpy is trying to do some kind of similar optimization, but its speedup is slightly less reliable than that of my asm code (which I guess would make sense if memcpy was doing something like checking whether what remains to be copied can be copied as a long or a word, etc., whereas mine just blasts through using longs with no checks).
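For comparison, the long-at-a-time idea can be sketched in portable C like this. It is only an illustration of the technique, not the asm itself: `copy_row_longs` is a hypothetical name, and the fixed-size `memcpy` calls are just the portable way to spell a 32-bit load and store, which a compiler would normally turn into single instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies n bytes four at a time. n must be a multiple of 4,
   mirroring the "sprite width is a multiple of 4" requirement. */
static void copy_row_longs(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(n % 4 == 0);

    for (size_t i = 0; i < n; i += 4) {
        uint32_t v;
        memcpy(&v, src + i, sizeof v);   /* one 32-bit load  */
        memcpy(dst + i, &v, sizeof v);   /* one 32-bit store */
    }
}
```

The loop runs `n / 4` times instead of `n` times, so the per-iteration overhead (decrement, branch) is paid a quarter as often. That is where most of the gain over a byte loop comes from.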
EDIT 3: OK is it normal for stuff like this to look SERIOUSLY GODDAMN CHOPPY on Gens? Good lord. It's not so much a framerate problem as far as I can tell as much as stuff looks like it's aligned on a 2 pixel boundary or something. They jump pixel positions like nobody's business. Works perfectly in Fusion though.
EDIT 4: Also caught an issue with that optimization: for objects partially overlapping the left or right screen edge, the copy length would not be a multiple of 4 bytes, so sometimes it'd just crash the game. So now it branches between two different methods (a per-byte copy and a per-long copy), picking the per-byte copy whenever the horizontal copy length is not a multiple of 4 bytes. This also removes the multiple-of-4-pixels constraint, since a sprite whose width is not a multiple of 4 simply falls back to the per-byte copy.
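A rough C sketch of that dispatch follows, with the decision made once per sprite rather than once per row. One thing it adds that is not in the description above: it also checks pointer and pitch alignment, because the SH-2 raises an address error on a long access that is not 4-byte aligned, and a sprite clipped at the left edge can start at an odd source offset. `can_copy_longs` and `blit_rows` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Long copies are only safe when the length is a non-zero multiple of 4
   and both pointers are 4-byte aligned. Returns 1 or 0. */
static int can_copy_longs(const void *dst, const void *src, size_t n)
{
    return n > 0 && n % 4 == 0
        && (uintptr_t)dst % 4 == 0
        && (uintptr_t)src % 4 == 0;
}

/* Copies h rows of w bytes, choosing the copy method once up front. */
static void blit_rows(uint8_t *dst, size_t dst_pitch,
                      const uint8_t *src, size_t src_pitch,
                      size_t w, size_t h)
{
    /* Pitches must be multiples of 4 too, or later rows lose alignment. */
    int use_longs = can_copy_longs(dst, src, w)
                 && dst_pitch % 4 == 0 && src_pitch % 4 == 0;

    for (size_t row = 0; row < h; row++) {
        if (use_longs) {
            for (size_t i = 0; i < w; i += 4) {
                uint32_t v;
                memcpy(&v, src + i, sizeof v);
                memcpy(dst + i, &v, sizeof v);
            }
        } else {
            for (size_t i = 0; i < w; i++)
                dst[i] = src[i];
        }
        dst += dst_pitch;
        src += src_pitch;
    }
}
```

Because `use_longs` does not change between rows, the same test could equally be hoisted out of the scanline loop in assembly, selecting one of two complete row loops before the first row is copied.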
My new asm looks like this. I'm like 99% certain I've made these branches way less than optimal.
Code:
! // Blits a sprite into the framebuffer. Source data width must be a multiple of 4 bytes unless byteCopy is TRUE
! void GFX_BlitSprite( u8* spritePtr, u8* fbPtr, int blitW, int blitH, int blitStep, int byteCopy )
.align 4
.global _GFX_BlitSprite
_GFX_BlitSprite:
! {
! On entry: r4 = spritePtr, r5 = fbPtr, r6 = blitW, r7 = blitH, blitStep pushed onto stack, byteCopy pushed onto stack
! copy blitStep into r0
mov.l @r15,r0
! copy byteCopy into r1
mov.l @(4,r15),r1
! push r8, r9, r10, and r11 onto stack (they're callee-saved and we're going to use them as scratch registers, so we need to save them)
mov.l r8,@-r15
mov.l r9,@-r15
mov.l r10,@-r15
mov.l r11,@-r15
! initialize r8 as framebuffer pitch
mov.w fb_pitch,r8
! initialize r11 as rows to copy
mov r7,r11
! copy scanline loop
0:
! {
! copy number of bytes from spritePtr to fbPtr
! memory copy loop
! initialize r2 as length to copy
mov r6,r2
! copy sprite and fb pointers for iteration
mov r4,r9
mov r5,r10
! if byteCopy is 0 we can use a faster copy operation
! otherwise, resort to manual byte copy
cmp/pl r1
bf/s skip1
nop
1:
!{
! load from spritePtr into r3, increment spritePtr
mov.b @r9+,r3
! store from r3 to fbPtr
mov.b r3,@r10
! loop check
dt r2
bf/s 1b
add #1,r10
!}
bra skip2
nop
skip1:
2:
!{
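! load a long from spritePtr into r3, increment spritePtr by 4
! (dt r2 below runs once per long, so on this path blitW has to be a count of longs, not bytes)
mov.l @r9+,r3
! store the long from r3 to fbPtr
mov.l r3,@r10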
! loop check
dt r2
bf/s 2b
add #4,r10
!}
skip2:
! finished copying a row of pixels
add r8,r5 ! move framebuffer pointer to next scanline by adding fb_pitch
! loop check
dt r11 ! decrement and compare scan iterator
bf/s 0b
add r0,r4 ! move sprite pointer to next sprite pixel row by adding blitStep
! }
! restore r8, r9, r10, and r11 off of stack
mov.l @r15+,r11
mov.l @r15+,r10
mov.l @r15+,r9
mov.l @r15+,r8
rts
nop
fb_pitch:
.word 320
! } // _GFX_BlitSprite