Code timing, pipelines and stalls.

Started by Steve Hutchesson, February 07, 2010, 12:25:18 PM



Steve Hutchesson

There is still a tendency to address assembler instruction timings the same way as in the pre-i486 days, when the processor chomped one instruction at a time and each instruction took a known number of cycles to be processed. Those days are long gone, and the reason why accounts for much of the speed increase in computer hardware over the last 20 years.

The difference, among many other factors, is the concept of a pipeline, and this is best understood as something like a factory production line. The path of an instruction from entry to exit is in fact far longer than on the old hardware; in some instances a pipeline can have a throughput time of 30 to 40 cycles, with some instructions taking a lot longer.

What makes the pipelining construction technique viable in terms of hardware is that instructions are not fed through the pipeline one at a time but are scheduled sequentially, so that at any point in time you may have 20 or 30 instructions being processed. The actual hardware is a lot more complex than that again, with various factors like out of order execution, branch prediction buffers and many other variations, most of which you don't need to know, but the factor worth knowing is the abstract of how a pipeline is constructed.

Now stalls have been with us for a long time, and on the old timers it just did not matter, but with a notoriously cantankerous processor family like the Pentium III and Pentium 4, this factor jumped up and bit you many times until you learnt how and why it works. If you fed an instruction sequence down a multiple-pipeline processor and one instruction depended on the result of another instruction in a different pipeline, the rest came to a grinding halt until that dependency was resolved.
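As a sketch of what a dependency stall looks like (register choices are illustrative only), each instruction in the first sequence below must wait for the result of the one before it, while the second sequence interleaves an independent chain so the processor has other work to schedule while it waits:

```asm
; dependent chain - every instruction waits on the previous result
mov eax, [esi]
add eax, [edi]        ; needs EAX from the load
shl eax, 2            ; needs the ADD result
mov [ebx], eax        ; needs the SHL result

; two independent chains interleaved - the ECX chain can proceed
; in another pipeline while the EAX chain is still completing
mov eax, [esi]
mov ecx, [esi+4]
add eax, [edi]
add ecx, [edi+4]
shl eax, 2
shl ecx, 2
mov [ebx], eax
mov [ebx+4], ecx
```

The second form does twice the work but rarely takes twice the time, because the two chains do not depend on each other.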

The most successful way to avoid most of it was to use the advice in the Intel manuals and select instructions from the "preferred" set that are easily "paired" in multiple pipelines. In practical terms this means using MOV ADD SUB TEST CMP and similar, and avoiding the more complex instructions, as they tend to live in what is called "microcode", where their functionality is emulated internally rather than executed directly, at the price of performance.

A good example of a complex instruction best avoided is JECXZ, which internally compares ECX with zero and branches to a label if it is zero. By using the "preferred" instruction set you instead place a conditional jump, JZ or JNZ, after the ADD or SUB that changes the register value. The speed difference is substantial and the size difference simply does not matter. You can find a number of avoidable instructions in Agner Fog's famous optimisation manuals.
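A sketch of the two loop forms (labels are illustrative only) shows why the preferred form wins: SUB already sets the zero flag, so a plain conditional jump does the same job as the microcoded compare-and-branch:

```asm
; slower form built around the complex JECXZ instruction
lp1:
    ; ... loop body ...
    sub ecx, 1
    jecxz done        ; internal test of ECX, then branch
    jmp lp1
done:

; faster "preferred" form - SUB sets the zero flag,
; so a simple JNZ carries the loop and falls through at zero
lp2:
    ; ... loop body ...
    sub ecx, 1
    jnz lp2
```

The preferred form is also one instruction shorter, so there is no trade-off to weigh up.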

The purpose of this short tech note is to help let go of an out of date concept of individual instruction timings. The approach on hardware over the last 15 years is instruction scheduling, so that the sequence travels through the multiple pipelines with the least number of obstructions and your average instruction timings become shorter.

You always test this by timing a working algorithm for long enough to see how fast it is, if it in fact matters. If your app is waiting for a mouse click, it's a case of who cares, but if your graphics code is lagging in terms of frame rate, or your audio lags, breaks up or drops bits, you will make the effort to get the code faster, as it will affect the performance of the application.
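A minimal sketch of such a timing run, assuming the Windows API function GetTickCount and an illustrative iteration count, wraps the algorithm in a counted loop long enough to give a readable millisecond figure:

```asm
; time a working algorithm over enough iterations to matter
invoke GetTickCount
mov timer_start, eax

mov esi, 1000000          ; iteration count chosen so the run lasts a second or more
align 4
tst:
    ; call the algorithm under test here
    sub esi, 1
    jnz tst

invoke GetTickCount
sub eax, timer_start      ; elapsed time in milliseconds
```

Run it a few times and take the best figure, as other processes on the machine add noise to any single run.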

For an evening of light reading (or two), the optimisation manual from the Intel set is good value, and after your head stops hurting from information overload it actually gets better.

Theo Gottwald

Quote: "help let go of an out of date concept of individual instruction timings."

That's a good point, Steve. I also find myself thinking about CPU cycles. But sometimes I still want to.