(Optimization) ASM - Optimizations - Step-by-Step

Started by Theo Gottwald, January 02, 2007, 09:32:27 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

Theo Gottwald

Legend:
First a short comment about the subroutine we are going to ASM-optimize.
Second we go through the Steps.

About the Subroutine:
Sometimes we use

REDIM PRESERVE

to increase an Array. instead of Redim'ing every time its more efficient to increase in bigger steps.
And keep a separate Counter on "dimmed Elements" and "last used Element".

For this we can use an universal Subroutine, which is excellent to demonstrate some optimization issues.
Here is it:

FUNCTION A_AY(BYVAL a AS LONG,BYVAL b AS LONG) AS DWORD
LOCAL e AS DWORD
e=((INT(a/b)+1)*b)
!NOP
FUNCTION=e
END FUNCTION   


Looks quite short and efficient. Anyway we can not really be shure how efficient the Subroutine is unless we have taken a Look into
DisASM. Here is it:

40241C DB450C                 FILD LONG PTR [EBP+0C]
40241F D9E8                   FLD1
402421 DB450C                 FILD LONG PTR [EBP+0C]
402424 DA7D08                 FIDIVR LONG PTR [EBP+08]
402427 E808110000             CALL INTSUB
40242C DEC1                   FADDP ST(1), ST
40242E DEC9                   FMULP ST(1), ST
402430 DF7DA4                 FISTP QUAD PTR [EBP-5C]
402433 8B75A4                 MOV ESI, DWORD PTR [EBP-5C]
402436 90                     NOP
402437 89B578FFFFFF           MOV DWORD PTR [EBP+FFFFFF78], ESI


INTSUB:
           FLDCW WORD PTR [0040287E]
                   FRNDINT
           FLDCW WORD PTR [00402876]
                     RET NEAR



What we see is that the compiler uses a lot of floating-point Mnemonics.
I see two reasons:
First reason is maybe that he doesn't know what we exactly want and to be shure to prevent an overflow, thats what he is doing.
Second reason is that this way there would be no problem to do the same using QUAD variables.

Now whats the next step in optimization?

STEP 1: Replace DWORD's where possible with LONG's
In this case it does not help us anything. Thats why we go to

STEP 2: Make it easier for the compiler to understand what you want.
Make it "easier" means "make it les complex".

FUNCTION A_AY(BYVAL a AS LONG,BYVAL b AS LONG) AS DWORD
REGISTER e AS LONG,f AS LONG
f=INT(a/b)
e=(f+1)*b
FUNCTION=e
!NOP
END FUNCTION 
'
' Here is what we get:
'
40241E DB450C                 FILD LONG PTR [EBP+0C]
402421 DA7D08                 FIDIVR LONG PTR [EBP+08]
402424 E81B110000             CALL L403544
402429 DF7DA4                 FISTP QUAD PTR [EBP-5C]
40242C 8B7DA4                 MOV EDI, DWORD PTR [EBP-5C]
40242F 8B450C                 MOV EAX, DWORD PTR [EBP+0C]
402432 B901000000             MOV ECX, DWORD 00000001
402437 03CF                   ADD ECX, EDI
402439 91                     XCHG  EAX, ECX
40243A F7E9                   IMUL ECX
40243C 89C6                   MOV ESI, EAX
40243E 89B578FFFFFF           MOV DWORD PTR [EBP+FFFFFF78], ESI


It doesn't look shorter, but if you take a close look, we have just
saved 4 Floating-Point Operations.
Instead we use fast Integer Operations.

We have on our List:
- The INT() is still Floating-point.
- The (a/b) is still a FIDIVR (Floating-point Division).
What we do here is, we calculate a division to higher precision just to make a INT later.
Lets try something else.

First we need to define this MACRO:

MACRO A_DIV(P1,P2,P3)
! MOV EAX,P2
! CDQ
! DIV P3
! MOV P1,EAX
END MACRO
'
' Then we can use this:
'
FUNCTION A_AY(BYVAL a AS LONG,BYVAL b AS LONG) AS DWORD
REGISTER e AS LONG,f AS LONG
A_DIV(f,a,b)
e=(f+1)*b
FUNCTION=e
END FUNCTION   
'
' And we get this:
'
40241E 8B4508                 MOV EAX, DWORD PTR [EBP+08]
402421 99                     CDQ EDX:EAX
402422 F7750C                 DIV  DWORD PTR [EBP+0C]
402425 8BF8                   MOV EDI, EAX
402427 8B450C                 MOV EAX, DWORD PTR [EBP+0C]
40242A B901000000             MOV ECX, DWORD 00000001
40242F 03CF                   ADD ECX, EDI
402431 91                     XCHG  EAX, ECX
402432 F7E9                   IMUL ECX
402434 89C6                   MOV ESI, EAX
402436 89B578FFFFFF           MOV DWORD PTR [EBP+FFFFFF78], ESI


No more F's in the code. Looks Ok now. What comes now is just a bit of cosmetics.

Here are some more optimization Macros. Use them with caution, as they do not have the overflow protection like the original PB-Code.

' P1,P2,P3 sind Variablen oder Registernamen
MACRO A_MUL(P1,P2,P3)
! MOV EAX,P2
! MUL P3
! MOV P1,EAX
END MACRO
'
' Intel Developers manuals say that ADD is faster then INC
' because it does not need to change some CPU flags.
'
MACRO A_INC(P1)
! ADD P1,1
END MACRO
'
'
FUNCTION A_AY(BYVAL a AS LONG,BYVAL b AS LONG) AS DWORD
REGISTER e AS DWORD,f AS DWORD
A_DIV(f,a,b)
A_INC(f)
A_MUL(e,f,b)
FUNCTION=e
END FUNCTION 
'
' And here is what we get:
'
4023CF 8B4508                 MOV EAX, DWORD PTR [EBP+08]
4023D2 99                     CDQ EDX:EAX
4023D3 F7750C                 DIV  DWORD PTR [EBP+0C]
4023D6 8BF8                   MOV EDI, EAX
4023D8 83C701                 ADD EDI, BYTE 01
4023DB 8BC7                   MOV EAX, EDI
4023DD F7650C                 MUL  DWORD PTR [EBP+0C]
4023E0 8BF0                   MOV ESI, EAX
4023E2 89B578FFFFFF           MOV DWORD PTR [EBP+FFFFFF78], ESI



Which is still good enough for any program. Just to show the final result, we go one step ahead:

FUNCTION A_AY(BYVAL a AS LONG,BYVAL b AS LONG) AS DWORD
REGISTER e AS DWORD,f AS DWORD
A_DIV(f,a,b)
A_INC(f)
!MOV EAX,f
!MUL b
!MOV function,eax
END FUNCTION
'
' Will result in:
'
4023CF 8B4508                 MOV EAX, DWORD PTR [EBP+08]
4023D2 99                     CDQ EDX:EAX
4023D3 F7750C                 DIV  DWORD PTR [EBP+0C]
4023D6 8BF8                   MOV EDI, EAX
4023D8 83C701                 ADD EDI, BYTE 01
4023DB 8BC7                   MOV EAX, EDI
4023DD F7650C                 MUL  DWORD PTR [EBP+0C]
4023E0 898578FFFFFF           MOV DWORD PTR [EBP+FFFFFF78], EAX



Which is the final "Step-by-Step" optimized version.

Lets finally make a raw assumption on how much time we saved with this optimization:


-9-       FILD LONG PTR [EBP+0C]
-1-       FLD1
-9-       FILD LONG PTR [EBP+0C]
-22-      FIDIVR LONG PTR [EBP+08]
-46-      CALL L403534
-4-       FADDP ST(1), ST
-4-       FMULP ST(1), ST
-7-       FISTP QUAD PTR [EBP-5C]
-1-       MOV ESI, DWORD PTR [EBP-5C]
-1-       MOV DWORD PTR [EBP+FFFFFF78], ESI
'
' compared with:
'
-1-       MOV EAX, DWORD PTR [EBP+08]
-1-       CDQ EDX:EAX
-41-      DIV  DWORD PTR [EBP+0C]
-1-       MOV EDI, EAX
-1-       ADD EDI, BYTE 01
-1-       MOV EAX, EDI
-3-       MUL  DWORD PTR [EBP+0C]
-1-       MOV DWORD PTR [EBP+FFFFFF78], EAX



taking a look on the final Table, we can close with an additional rule:
Avoid divisions where possible :-).

Donald Darden

> FUNCTION A_AY(BYVAL a AS LONG,BYVAL b AS LONG) AS DWORD
> REGISTER e AS LONG,f AS LONG
> f=INT(a/b)
> e=(f+1)*b
> FUNCTION=e
> !NOP
> END FUNCTION 

Instead of the above, you could do this:

FUNCTION A_AY(BYVAL a AS LONG,BYVAL b AS LONG) AS DWORD
FUNCTION = (a\b+1)*b
END FUNCTION 

By using the hierarcy of operators and an integer divide, we can reduce the
process to a single line.  Further, because it is so straightforward, we could
replace the function with a Macro and get rid of some more overhead.

Theo Gottwald

#2
@Donad:

It seems to be that easy. But you know - not everything with legs is a horse.
In this case we get a snail.

My Speed-Test shows a difference of ten (the ASM-version is ten times faster).
Experience says that the shortest code is often the fastest. But there are exceptions.
Finally a look into DisASM helps you to get them. This is one.



'ASM-Version - Number 1:
4023D0 8B4508                 MOV EAX, DWORD PTR [EBP+08]
4023D3 99                     CDQ EDX:EAX
4023D4 F7750C                 DIV  DWORD PTR [EBP+0C]
4023D7 8BF8                   MOV EDI, EAX
4023D9 83C701                 ADD EDI, BYTE 01
4023DC 8BC7                   MOV EAX, EDI
4023DE F7650C                 MUL  DWORD PTR [EBP+0C]
4023E1 898578FFFFFF           MOV DWORD PTR [EBP+FFFFFF78], EAX
4023E7 8B8578FFFFFF           MOV EAX, DWORD PTR [EBP+FFFFFF78]
4023ED 8D65F4                 LEA ESP, DWORD PTR [EBP-0C]

' Short Basic-Code with "Integer divide" - Number 2:
402419 DB450C                 FILD LONG PTR [EBP+0C]
40241C DB450C                 FILD LONG PTR [EBP+0C]
40241F DB4508                 FILD LONG PTR [EBP+08]
402422 E840230000             CALL L404767
402427 DE0520674000           FIADD INTEGER PTR [00406720]
40242D DEC9                   FMULP ST(1), ST
40242F E8F4100000             CALL L403528
402434 898578FFFFFF           MOV DWORD PTR [EBP+FFFFFF78], EAX
40243A 8B8578FFFFFF           MOV EAX, DWORD PTR [EBP+FFFFFF78]
402440 8D65F4                 LEA ESP, DWORD PTR [EBP-0C]

404767 D9FC                   FRNDINT
404769 D9C9                   FXCHST, ST(1)
40476B D9FC                   FRNDINT
40476D DEF9                   FDIVP ST(1), ST ' Here is our "Integer Divide" but you see that "F" infront of it :-))
40476F D92DAC2B4000           FLDCW WORD PTR [00402BAC]
404775 D9FC                   FRNDINT
404777 D92DA62B4000           FLDCW WORD PTR [00402BA6]
40477D C3                     RET NEAR