[SDL] [PATCH] Altivec blitters...

Ryan C. Gordon icculus at clutteredmind.org
Thu Jan 6 07:15:57 PST 2005

I keep hearing from people that complain that SDL is slow on MacOSX. 
Having shipped commercial projects using it, I couldn't ever understand 
why they'd say that. I think I figured out why: the GL codepath is fast, 
but the 2D codepaths are not.

I was surprised to find that a 2D MacOSX project I am working on was 
sitting in BlitNtoN() for 55% of my CPU time, so I set out to optimize a 

The project in question blits a 32-bit surface to the screen surface 
once per frame, usually the whole 640x480 area (but less, in some 
cases). Preconverting or producing source surfaces in screen format 
isn't practical, so the conversion gets done in SDL_BlitSurface(). The 
application wants to write exclusively to a BGRA8888 surface. MacOS is 
handing me an ARGB8888 surface, so a blit requires some basic swizzling 
but no serious conversion. Having no optimized blitters for anything but 
MMX-based CPUs, we fall into BlitNtoN, which is inefficient for several 
reasons, even for scalar code.

Attached is a patch to add the start of Altivec-based blitters. Besides 
the needed structure, I've filled in just the one function, which 
swizzles from one 8888 format to another, and even there, just the 
format I need for my project vs what OSX gives me. New swizzlers can 
reuse the same function, at the cost of 64 bytes of data per swizzler. 
It's a total win.
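
To illustrate what "swizzling from one 8888 format to another" means, here 
is a minimal scalar sketch in plain C. It is not the code from the patch 
(that uses Altivec permutes); the Fmt8888 struct and the function names are 
made up for this example, and it assumes both formats carry four full 8-bit 
channels, alpha included. The idea is the same though: build a small table 
once from the channel masks, then apply it to every pixel.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of an 8888 format: one 8-bit mask per channel. */
typedef struct {
    uint32_t rmask, gmask, bmask, amask;
} Fmt8888;

/* Bit offset (0, 8, 16 or 24) of a byte-sized channel mask. */
static int mask_shift(uint32_t mask)
{
    int shift = 0;
    while (shift < 24 && ((mask >> shift) & 0xFFu) != 0xFFu)
        shift += 8;
    return shift;
}

/* For each destination byte lane, record which source bit offset feeds it. */
static void build_swizzle(const Fmt8888 *src, const Fmt8888 *dst,
                          int src_shift[4])
{
    const uint32_t s[4] = { src->rmask, src->gmask, src->bmask, src->amask };
    const uint32_t d[4] = { dst->rmask, dst->gmask, dst->bmask, dst->amask };
    for (int ch = 0; ch < 4; ch++)
        src_shift[mask_shift(d[ch]) / 8] = mask_shift(s[ch]);
}

/* Apply the table to a run of pixels; no per-pixel format decoding. */
static void swizzle_8888(const uint32_t *src, uint32_t *dst, size_t npixels,
                         const int src_shift[4])
{
    for (size_t i = 0; i < npixels; i++) {
        uint32_t px = src[i], out = 0;
        for (int lane = 0; lane < 4; lane++)
            out |= ((px >> src_shift[lane]) & 0xFFu) << (lane * 8);
        dst[i] = out;
    }
}
```

A vector version would presumably do the same lookup on 16 bytes (four 
pixels) per instruction, with the table held in a register as a permute 
mask.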

Other blitters (16-bit handlers, etc) would need more work, but are 

The end result was hard to gauge, since Shark seems to kill the 
performance boost you'd get from cache prefetching. Even before adding 
prefetching, though, the CPU time spent in the blitter dropped from 55% 
to about 13%, and my framerate went from a consistent 25-27 to 150-300. 
Once I added prefetching, it went up to over 4500. Not a typo. Like I 
said, total win.

Improvements needed:
- Needs other swizzle data filled in.
- Needs non 32-bit blitters written.
- Move this to a separate file; SDL_blit_N.c is getting cluttered.
- vec_dst gives a HUGE improvement on a G4, but apparently stalls the 
pipeline on a G5. Someone should fix that by figuring out how to toggle 
use_software_prefetch to 0 on a G5 system (and how to do that on 
non-MacOS platforms).
- Configure.in should let you enable/disable the altivec code, and 
should let non-Macs (AmigaOS, PowerPC Linux, etc) use it. Right now it's 
a hardcoded #define to turn it on.
- Configure.in _must_ add -faltivec to gcc's CFLAGS or it won't 
compile...I hacked the generated Makefile because I'm lazy.
- Someone should have MacOSX builds compile with -O3 instead of -O2 
(this comes at Apple's general recommendation that O3 is a significant 
boost over O2, unlike, say, x86 Linux). -falign-loops=32 can be a big 
help in some cases (especially in the blitters on a G5, if I had to guess).
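
As a rough illustration of the use_software_prefetch item above, the sketch 
below gates a prefetch hint behind a runtime flag. It uses GCC's 
__builtin_prefetch as a portable stand-in for vec_dst, so it is not what the 
patch does; copy_row, PREFETCH_READ, and the look-ahead distance are 
assumptions made up for this example, and how the flag should be set on a G5 
is still the open question.

```c
#include <stddef.h>
#include <stdint.h>

/* Prefetch hint where the compiler offers one, otherwise a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

/* Copy a row of pixels, optionally hinting at data further ahead. */
static void copy_row(const uint32_t *src, uint32_t *dst, size_t npixels,
                     int use_software_prefetch)
{
    const size_t ahead = 64; /* pixels to look ahead; an untuned guess */
    for (size_t i = 0; i < npixels; i++) {
        /* Issue one hint every 16 pixels, and only when enabled. */
        if (use_software_prefetch && (i % 16) == 0 && i + ahead < npixels)
            PREFETCH_READ(&src[i + ahead]);
        dst[i] = src[i];
    }
}
```

The caller decides the flag once (per CPU type, say) and passes it down, so 
the same blitter can run with or without prefetching.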

If someone wants to give this patch some love, I'd like to get it into 
CVS eventually.


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: SDL-altivec-swizzle-RYAN-1.diff
URL: <http://lists.libsdl.org/pipermail/sdl-libsdl.org/attachments/20050106/21b4136e/attachment-0007.txt>
