[SDL] single pixel drawing

Robert Wohleb rwohleb at r2tech.com
Thu Jul 18 10:34:01 PDT 2002

If you have a multiplier that never changes, it can be optimized with
shifting. Even it isn't a power of two. Shift to the number that's closest,
but less, than the target and add the difference. This will really help with
hard coded multipliers.

As with any optimization attempt, it's hard to determine the benefit. Just
looking at the program run won't help. Your going to need to time execution
of various code segments on as detailed level as possible. A few
microseconds here and a few more gained there can add a few more frames per
second. It isn't necessarily about huge benefits. The trick to optimizing
code is really guessing which optimized code will give the best benefit
overall. It also helps to use #ifdef when writing optimization routines so
that it is easy to compare code speeds.


-----Original Message-----
From: Jacek Wojdel [mailto:wojdel at kbs.twi.tudelft.nl]
Sent: Thursday, July 18, 2002 3:20 AM
To: sdl at libsdl.org
Subject: Re: [SDL] single pixel drawing

On Wed, Jul 17, 2002 at 02:12:13PM -0700, Neil Bradley wrote:
> I can't find any references to anything stating that a multiply is any
> more than 1 instruction on a PII/PIII/P4. Not only that, you haven't
> considered that you'll stall one of the pipelines if the add relies on the
> multiply (which it will in this case).

Well, check out
First go to page 22 (2-15) ad lookup the table 2-3. It clearly states that
multiplication has latency of 4 cycles while adds and shifts have only 1
cycle latency.
Secondly, check the page 47 (3-23) for information on which instructions can
be paired (thus executed in paralel).
Those are the only two bits of information that I based my message on. I
don't refer here to any specific code example. You're right that in case of
y*w+x, the addition must wait for the mul to finish, just as it needs to
wait in case of y<<p+x. However, in a tight and unrolled loop, the compiler 
might optimize it in such a way that the shift will take place at the same
time as the addition from previous iteration.
Of course it does not make sense to design a generic approach to substitute
each mul with a lots of branching, shifts and additions. If you do however
have a previous knowledge about your code, and you know that a given mul
will always be by power of 2, just use it (even though nonimmediate shift is
not pairable just as mul, it has lower latency). Shifts are cheaper by 
definition and this way you give your compiler a chance to optimize things 
better (even if current compiler does provide any difference, the future one

> FWIW, I compiled my emulator that does lots of raster graphics, removed
> all throttling, and changed shifts to multiplies. It made zero difference
> (and yes, I shut off optimization and did a disassembly of the code to
> ensure it wasn't turning my multiplies into shifts) between multiplies and
> shifts. This is on a 1Ghz PIII, and the graphics processing takes 60% of
> overall execution time.

I agree with your results, especially as you removed optimizations. That
means in neither case did compiler try to take advantage of OoO execution,
pairing of instructions etc. For such a code there really is no difference
between shift and mul.
It is just a matter of providing the optimization routines with the
knowledge about your code that only you as a programmer possess. Writing:
You say: "this is an operation that can be done by shifting value of x"
If you write:
You say: "this is a multiplication of x by any possible number p"
If however in your code n would always be a power of 2 (it is let's say the
GL texture size), you essentially withold this information from your
compiler. It may not make a difference on your machine, with your compiler
and on your operating system. If however sometime later somebody decides to
compile your code on some embeded system that must put a systemwide trap on
each multiplication (don't ask me why, just imagine), this small piece of
information may provide to be crucial for the speed of the code. 

> There are more factors than just the instructions themselves. Depending
> upon the shift, it may require reloading the cl register which can do who
> knows what to the optimization capabilities of the compiler. At this point
> it's a wash.

And isn't reloading a register a simple matter of renaming now ? I thought
it is since P4. So different part of the code occupying the pipeline may
even see different versions of CL and be happy with it. But I agree with
you saying that there is no way of telling what would optimizations do with
your code. That means for me that you should program things in such a way
that the optimizer will have the most available knowledge about your code.
If it thinks that flushing CL is a bad idea, it can always express any shift
with appropriate multiplication (I know, it's additional xor, and setbit,
but you suggested a disaster because of grabbing CL, so it will be worth
it). It is not possible to efficiently change mul into shift without
And anyway, we're way out of topic here ;)

|from: J.C.Wojdel                     |
|      J.C.Wojdel at cs.tudelft.nl       |

SDL mailing list
SDL at libsdl.org

More information about the SDL mailing list