[SDL] HW acceleration

dcsin at islandnet.com
Thu Aug 19 07:10:33 PDT 1999


On Wed, 18 Aug 1999 11:16:11 -0400 (EDT), you wrote:

>On Wed, 18 Aug 1999 dcsin at islandnet.com wrote:
>
>> dest = (source << 8) | source;
>> 
>> 80386 method:
>> 
>> mov   al, source
>> mov   ah, al
>> mov   dest, ax
>
>DEAR GOD, THE PIPELINE STALL!!!

Yeah, I thought as much. My assembly days were pre-Pentium, so I don't
know much about the details of optimizing for them. Also, I've never
actually owned an Intel CPU - I've always bought AMDs, so the details
are different for those.

>> Wow. It's been a LOOONG time since I've used assembly.
>
>I see that.  However, I agree that this is better than using a lookup
>table. How about this asm instead:

It was sort of meant to be pseudo-assembly :)
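For reference, here is the same operation as a minimal C sketch. The function names (expand8to16, expand_row) are made up for illustration and are not SDL calls; the only thing taken from the thread is the expression dest = (source << 8) | source.

```c
#include <stdint.h>
#include <stddef.h>

/* Expand one 8-bit value to 16 bits by copying it into both bytes.
   Hypothetical helper name, not part of SDL. */
static inline uint16_t expand8to16(uint8_t source)
{
    return (uint16_t)(((unsigned)source << 8) | source);
}

/* Expand a run of 8-bit values into 16-bit values, one at a time.
   Hypothetical helper name, not part of SDL. */
static void expand_row(uint16_t *dest, const uint8_t *src, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dest[i] = expand8to16(src[i]);
}
```

A compiler is free to turn this into whatever instruction sequence suits the target CPU, so it also works as a baseline to compare hand-written asm against.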

>lodsd		; read 4 bytes from [esi] into eax, and increment esi
>mov ebx,eax	; save rest for later
>
>mov edx,eax	; load into edx and ebp for masking
>mov ebp,eax	
>shl edx,16	; move into position
>shl ebp,8
>
>and eax,0x000000FF	; isolate bitmasks, and merge into output pixels
>and edx,0xFF000000
>and ebp,0x00FFFF00
>or eax,edx
>or eax,ebp
>
>stosd		; save eax to [edi], and increment edi
>
>;; do something similar for next 16 bits in ebx
>;;
>
>stosd

I don't feel like dissecting that right now, but why the ANDs? Using
masks like that requires a memory access, which is of course slow.
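To make the shift/mask/or sequence easier to follow, here is a rough C equivalent of the quoted routine. The function names (expand_low_pair, expand_high_pair) are made up for illustration, and the sketch assumes the four source bytes are packed lowest byte first, as a 32-bit load on x86 would give. One note on the masks: constants written this way are immediate operands, encoded in the instruction itself, so the ANDs do not read a separate data location.

```c
#include <stdint.h>

/* Given four 8-bit values packed in one 32-bit word (b0 in bits 0-7,
   b1 in bits 8-15, ...), expand b0 and b1 into two 16-bit values
   packed in one 32-bit result. Hypothetical helper name. */
static inline uint32_t expand_low_pair(uint32_t in)
{
    uint32_t lo  = in & 0x000000FFu;          /* b0 stays in bits 0-7          */
    uint32_t mid = (in << 8)  & 0x00FFFF00u;  /* b0 -> bits 8-15, b1 -> 16-23  */
    uint32_t hi  = (in << 16) & 0xFF000000u;  /* b1 -> bits 24-31              */
    return lo | mid | hi;
}

/* Same thing for b2 and b3: shift them down and reuse the routine above.
   This stands in for the "do something similar" step in the quoted asm. */
static inline uint32_t expand_high_pair(uint32_t in)
{
    return expand_low_pair(in >> 16);
}
```

For example, the input word 0x44332211 yields 0x22221111 for the low pair and 0x44443333 for the high pair.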

>This is off the top of my head, so you might come up with something that
>uses fewer shifts and masks, and doesn't use EBP as a scratch register
>(however, I left ECX open for counting the loops), but the advantage here
>is that it utilizes the pipelines better (I'm assuming a Pentium or
>better here).  So even though it is twice the size, it can do several of
>these operations simultaneously, and when you're done, it has extended
>four pixels instead of one, using all 32 bits of the CPU.

Maybe MMX would help out in this situation. Too bad I don't know
anything about those instructions.

>On second thought, I think it would be better to load 16 bits at a time
>from the input, so it wouldn't waste the bx register.  Oh well.

Agreed. That could save a lot of memory accesses.
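A rough C sketch of the 16-bits-in, 32-bits-out idea, for anyone who wants to try it. The function name expand_pairs and the byte-buffer interface are assumptions for illustration, not anything from SDL; dest must have room for 2 * count bytes. Whether this is actually faster than other variants would need measuring on the CPU in question.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Expand `count` 8-bit values to 16 bits each (both bytes equal),
   handling two per iteration: one 16-bit read, one 32-bit write.
   Hypothetical helper name; dest needs 2 * count bytes. */
static void expand_pairs(uint8_t *dest, const uint8_t *src, size_t count)
{
    size_t i = 0;
    for (; i + 2 <= count; i += 2) {
        uint16_t in;
        memcpy(&in, src + i, sizeof in);         /* 16-bit load  */
        uint32_t w = in;
        uint32_t out = (w & 0x000000FFu)
                     | ((w << 8)  & 0x00FFFF00u)
                     | ((w << 16) & 0xFF000000u);
        memcpy(dest + 2 * i, &out, sizeof out);  /* 32-bit store */
    }
    if (i < count) {                             /* odd value left over */
        dest[2 * i]     = src[i];
        dest[2 * i + 1] = src[i];
    }
}
```

The memcpy calls are there so the loads and stores are valid C regardless of alignment; compilers typically reduce them to plain moves.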

>(I apologize, I've been itching to write some asm for a while...)

No problem. I've been wishing I had the time to learn Pentium
optimizations and MMX for a while now. Especially after writing some
stuff that could really use it (like a lovely little 2D bumpmapper).
