(Optimization) FOR .. NEXT and DO.. LOOP - whats more efficient?

Started by Theo Gottwald, January 02, 2007, 09:34:45 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

Theo Gottwald

Let's take a look, what the compiler makes out of our code.
In Visual C, you can see it in the Debugger, in PB you need to use DisASM.

If we construct a FOR-Loop, later we look at the DO ... LOOP.
And we compare a LONG - version to a SINGLE - version.


Local y as SINGLE
FOR y = 1 TO LoopCounter '
INCR z '
NEXT y '
'----------------------------------
' becomes:
'----------------------------------
4023C8 DB8550FFFFFF           FILD LONG PTR [EBP+FFFFFF50]
4023CE D99D44FFFFFF           FSTP SINGLE PTR [EBP+FFFFFF44]
4023D4 D9E8                   FLD1
4023D6 D99D48FFFFFF           FSTP SINGLE PTR [EBP+FFFFFF48]
4023DC D9E8                   FLD1
4023DE D99D70FFFFFF           FSTP SINGLE PTR [EBP+FFFFFF70]
4023E4 E918000000             JMP L402401
4023E9 FF855CFFFFFF           INC DWORD PTR [EBP+FFFFFF5C]
4023EF D98548FFFFFF           FLD SINGLE PTR [EBP+FFFFFF48]
4023F5 D885                   FADDST, ST(5)
4023F7 70FF                   JO  SHORT L4023F8
4023F9 FF                     ??? ' <---
4023FA FFD99D70FF             CALL D9FF : FFFF709D


while DisASM seems not to understand the code perfectly and seems to get "out of sync" here (<--).
What we can see is that the compiler need Floating-Point Instructions to use the "SINGLE" Variable.
And FP-Instructions are most often slower then Integer Instructions.

Lets take a look on the LONG Version:

                   
                              MOV EAX, DWORD PTR [EBP+FFFFFF50]
40241C 89853CFFFFFF           MOV DWORD PTR [EBP+FFFFFF3C], EAX
402422 C7C601000000           MOV ESI, DWORD 00000001
402428 E908000000             JMP L402435
40242D FF855CFFFFFF           INC DWORD PTR [EBP+FFFFFF5C]
402433 FFC6                   INC ESI
402435 8BC6                   MOV EAX, ESI
402437 3B853CFFFFFF           CMP EAX, DWORD PTR [EBP+FFFFFF3C]
40243D 7EEE                   JLE SHORT L40242D

This one is nearly as good as hond optimized Assembler - isn't ?
Ok, you would maybe replace the

402435 8BC6                   MOV EAX, ESI
402437 3B853CFFFFFF           CMP EAX, DWORD PTR [EBP+FFFFFF3C]

and directly

402437 3B853CFFFFFF           CMP ESI, DWORD PTR [EBP+FFFFFF3C]

But thats not a big diffrence in cycles.

Now let me just say, that in this case the DWORD Version is as good as the LONG-Version.

Now lets take a look at the DO-LOOP construct.



LOCAL y AS LONG
DO UNTIL z > loopcounter '
INCR y '
LOOP '           

'becomes
402536 8B8550FFFFFF           MOV EAX, DWORD PTR [EBP+FFFFFF50]
40253C 3B855CFFFFFF           CMP EAX, DWORD PTR [EBP+FFFFFF5C]
402542 0F8C08000000           JL  L402550
402548 FF855CFFFFFF           INC DWORD PTR [EBP+FFFFFF5C]
40254E EBE6                   JMP SHORT L402536
402550


True, this one looks really fast. No chance for quick further optimization.

But again, if we change the variable from LONG to SINGLE, things look like this:

402536 8B8550FFFFFF           MOV EAX, DWORD PTR [EBP+FFFFFF50]
40253C 3B855CFFFFFF           CMP EAX, DWORD PTR [EBP+FFFFFF5C]
402542 0F8C12000000           JL  L40255A
402548 D9E8                   FLD1
40254A D98570FFFFFF           FLD SINGLE PTR [EBP+FFFFFF70]
402550 DEC1                   FADDP ST(1), ST
402552 D99D70FFFFFF           FSTP SINGLE PTR [EBP+FFFFFF70]
402558 EBDC                   JMP SHORT L402536


Even if you don't know much about ASM, you can see that this version is slower.

Finally let me note something:

A FOR-LOOP cotains more then a DO-LOOP.
the Initialization is included.

FOR z=1 TO 10

NEXT z


In a DO - LOOP you would write this normaly extra:


LOCAL y AS LONG

y=1 ' This Line is needed
DO UNTIL z > loopcounter '
INCR y '
LOOP '   
'-------------------------
' becomes
'-------------------------
402536 C7C601000000           MOV ESI, DWORD 00000001
40253C 39B54CFFFFFF           CMP DWORD PTR [EBP+FFFFFF4C], ESI
402542 0F8C04000000           JL  L40254C
402548 FFC7                   INC EDI
40254A EBF0                   JMP SHORT L40253C   

'--------------------------
' using #REGISTER NONE
'--------------------------
402534 C78550FFFFFF01000000   MOV DWORD PTR [EBP+FFFFFF50], DWORD 00000001
40253E 8B854CFFFFFF           MOV EAX, DWORD PTR [EBP+FFFFFF4C]
402544 3B8550FFFFFF           CMP EAX, DWORD PTR [EBP+FFFFFF50]
40254A 0F8C08000000           JL  L402558
402550 FF856CFFFFFF           INC DWORD PTR [EBP+FFFFFF6C]
402556 EBE6                   JMP SHORT L40253E


which is in both cases perfectly optimized.

Please note that ESI and EDI are the two CPU-Registers which are used for Register Variables.

FAZIT:
If you really need every CPU Cyle, the DO LOOP is faster then the FOR-LOOP.
The reason is, that the FOR LOOP has more options (STEP - ...).
We see here, that using SINGLE is a much bigger diffrence then the diffrence between these two LOOP-contructs.
'
For those cases, where you use a FOR-NEXT Loop,
but don't really need the Features of the FOR-LOOP,
you could use this Macro:


' Laufanweisung als MACRO
'
MACRO GFOR(P1,P2,P3)
P1=P2
DO UNTIL (P1>P3)
END MACRO
'
MACRO GNEX(P1)
INCR P1
LOOP
END MACRO 
'
' Example:
' Instead of
' FOR i=1 TO 10
' ...
' NEXT i
' you would write in your Program:
'
' GFOR(i,1,10)
' ...
' GNEX(i)
'
which will do just the same, saving you few cycles, which may be intresting in very big loops.