Let's Talk Assembler

Started by Donald Darden, April 16, 2007, 08:52:27 AM

Charles Pegge

PDF based manuals are readily available from the AMD and Intel websites - quite chunky documents in several volumes that tell you everything. No detail is spared, but I don't know how accessible the legacy documentation might be. I still have an MS Assembler pocket reference, which is quite handy.

These devices carry the layers of their own evolution. One of my first projects was designing an 8088 based board, to fit into a GEC Multibus system based on the 8080.  But my best assembler experience was with the ARM processor which was the heart of the Archimedes microcomputers.  Now the ARM is used in many devices including printers, PDAs, games machines and mobile phones, because of its high performance and low power consumption.
The ARM has a Reduced Instruction Set (RISC) and sixteen registers, most of which are general purpose. The instruction set is very regular and permutational. This makes it very easy to learn and also to write efficient code.

Here is an example:

int gcd (int i, int j)
{
   while (i != j)
      if (i > j)
          i -= j;
      else
          j -= i;
   return i;
}

In ARM assembly, the loop is:
loop   CMP    Ri, Rj       ; set condition "NE" if (i != j)
                           ;               "GT" if (i > j),
                           ;           or  "LT" if (i < j)           
       SUBGT  Ri, Ri, Rj   ; if "GT", i = i-j; 
       SUBLT  Rj, Rj, Ri   ; if "LT", j = j-i;
       BNE    loop         ; if "NE", then loop


Note the SUBGT instruction, a conditional subtraction, which saves a conditional jump.

When you make a call with the ARM, the return address is placed in register 14
instead of being pushed onto the stack. Stacking is a separate operation, but this allows very efficient single level calls.

That the x86 architecture has come to dominate PCs seems to be an accident of history. If there was an opportunity to repeat the PC revolution, I know which CPU I would choose.

PS. Theo used to have an Archimedes, so I am sure he is also familiar with the ARM. The assembler was embedded in Archimedes BASIC.

Donald Darden

I agree that the x86 architecture is not all that it could be, but the adoption of the 8088 and 8086 CPUs by IBM, then the computer giant of the age, for its first PCs really put Intel on the inside track. Intel's continued success has come from the fact that its ramped up family of CPUs can run legacy code by sustaining the old architecture with very few modifications, just adding various extensions that do not affect the existing registers and instructions.

The original justifications for the x86 are probably all gone, but what locks us into this antiquated design is the operating system, first DOS, then Windows.  Linux has also focused on the x86 platform because it is the de facto standard.  And of course the hardware and OS together define the environment where your existing applications and new development must live and work.

You could break away and find a new architecture and OS, new applications, new development tools, and start over if you like.  Really, the only thing that is holding you back is what's available, what you are willing to put up with and do for yourself, and the very limited market space that you would be entering at that point.

I'm going to assume that most of you realize that is too great a journey to embark on, so like it or not, you are going to stay with the prevalent hardware and software combinations currently available.  Which justifies the continuance of this discussion.

There are separate conventions for handling two types of data:  Numeric and String, as well as another convention for processing the contents of memory.

With numeric data, we read most significant bit or byte to least significant bit or byte from left to right.  We do this even with decimal numbers.  Thus, 1057 is read as one thousand and fifty-seven, not seven ones, five tens, and one thousand.  That is a matter of convention.  With words, any combination of letters and digits, and text in general, we follow two rules:  First we attempt to read left-to-right, then we attempt to read from top-to-bottom.  With column data, we read left-to-right, top-to-bottom, then left-to-right again.  In moving through pages of text, we turn the pages from right to left.  These are conventions adopted for most western languages, but they do vary in other languages.

According to these conventions, we would look at a 32-bit register as having its most significant bit, representing 2^31, situated at the left side of the register, and the least significant bit, representing 2^0, situated at the right side of the register.  Numbering the corresponding bit positions across, we would see:

3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1                              \ Powers of
1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0    /      2

Bytes are 8-bit representations, and to represent the four possible bytes that
could be loaded into this register of 32 bits, we would see them organized like this:

|3 3 2 2 2 2 2 2|2 2 2 2 1 1 1 1|1 1 1 1 1 1    |               |   \ Powers of
|1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0|   /      2
      Byte 4         Byte 3         Byte 2           Byte 1           Char. Rep.

If we were to express the first four letters of the alphabet in these four bytes,
we would have to show them this way, in order to be consistent with the numeric or byte representation:

|        D          |        C        |          B          |         A         |

Now this would seem backwards from the ABCD order of the left-to-right rule.
It is, but it is consistent with the major to minor rule, if the first byte is considered the minor byte.  And that goes along with the idea that the least
significant byte occupies a lower address in memory than the next most significant byte.

To put this another way, the original 8088 chip design read memory one byte at a time, and advanced through memory from a lower byte address to the next higher byte address.  For a 16-bit value, it read the low order byte first, so the low order byte always had the lower address.  Reading the low order byte first simplified the process of performing arithmetic operations, and also made it easy to increment and decrement register or memory contents.  So if you had
the whole range of capital letters in memory, it would appear in this order:

  low address  -->  ABCDEFGHIJKLMNOPQRSTUVWXYZ  <--  high address

If you then read the first four bytes into EAX, the second four into EBX, the third four into ECX, and the fourth four into EDX, this is the byte arrangement in those four registers:

     EAX: |D|C|B|A|    EBX: |H|G|F|E|    ECX: |L|K|J|I|    EDX: |P|O|N|M|

It is still in the same sequence, but now looks backward in each register, because of the convention of the low order byte appearing on the right.  If you
stored these registers back into memory, then you would see this:

       low memory  -->  ABCDEFGHIJKLMNOP  <--  high memory
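The round trip described above can be modelled outside of assembly. The following is only a sketch in Python, not x86 code: the helper names load_dword and store_dword are made up for illustration, and "little" byte order stands in for the way the x86 places the lowest address in the least significant byte.

```python
# Toy model of loading four consecutive memory bytes into a 32-bit register.
memory = b"ABCDEFGHIJKLMNOP"  # lowest address first

def load_dword(mem, offset):
    """Model MOV reg, [offset]: the byte at the lowest address becomes the low byte."""
    return int.from_bytes(mem[offset:offset + 4], "little")

def store_dword(value):
    """Model MOV [mem], reg: the low byte is written to the lowest address."""
    return value.to_bytes(4, "little")

eax, ebx, ecx, edx = (load_dword(memory, i) for i in (0, 4, 8, 12))
print(f"EAX = {eax:08X}")  # EAX = 44434241, which reads as "DCBA" left to right

# Storing the registers back restores the original memory order.
restored = b"".join(store_dword(r) for r in (eax, ebx, ecx, edx))
print(restored)  # b'ABCDEFGHIJKLMNOP'
```

Written as a hex number, EAX shows 44 43 42 41 (D C B A), yet memory still holds A first.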

Now let's examine the EAX register briefly.  The EBX, ECX, and EDX registers would be arranged the same way:

|       Upper 16-bit word        |   AH (8 bits)    |  AL ( 8 bits)   |
|      "D"       |      "C"      |       "B"        |       "A"       |

As explained earlier, getting to the bytes that represent D and C requires rotating the register 16 bits to the left or right, then treating them as AH and
AL respectively.
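To see the rotation at work without a debugger, here is a small Python sketch. The functions rol32, ah and al are illustrative stand-ins for ROL EAX,16 and for reading the AH and AL sub-registers; they are not part of any library.

```python
MASK32 = 0xFFFFFFFF

def rol32(value, count):
    """Rotate a 32-bit value left by count bits, like ROL reg, count."""
    count %= 32
    return ((value << count) | (value >> (32 - count))) & MASK32

def ah(reg):
    """Bits 8..15 of the register (the AH part of EAX)."""
    return (reg >> 8) & 0xFF

def al(reg):
    """Bits 0..7 of the register (the AL part of EAX)."""
    return reg & 0xFF

eax = 0x44434241                   # "DCBA" as loaded from memory bytes A, B, C, D
print(chr(ah(eax)), chr(al(eax)))  # B A

eax = rol32(eax, 16)               # swap the upper and lower 16-bit halves
print(chr(ah(eax)), chr(al(eax)))  # D C
```

Rotating by 16 in either direction gives the same result, since it simply swaps the two halves.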

The Carry flag performs another important function with regard to shift operations.  First, when you perform a shift or rotate operation, the last bit moved out to the left or right is copied into the Carry bit in Flags.  You have the option then to retain that bit and include it in some other operation, such as testing it with a JC or JNC branch instruction, or combining it in an add or subtract operation using ADC or SBB (Add with Carry or Subtract with Borrow).

Second, using rotate or shift (or even ADC) with other registers or memory, you can take the carry bit and merge it with the contents of that register or memory, effectively creating a long shift function that affects two or more registers or memory addresses.  Thus, it is possible to perform quad (64-bit) operations within either a 32-bit or 16-bit processor.  You can extend this basic capability to handle much larger integer types as well.
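As a rough illustration of chaining through the carry bit, the Python sketch below models a 64-bit shift left as SHL on the low dword followed by RCL on the high dword, and a 64-bit add as ADD followed by ADC. The function names are invented for this example; each one returns the carry alongside the result, since Python has no flags register.

```python
MASK32 = 0xFFFFFFFF

def shl32(value):
    """Model SHL reg,1: return (result, carry), carry being the bit shifted out."""
    return (value << 1) & MASK32, (value >> 31) & 1

def rcl32(value, carry_in):
    """Model RCL reg,1: the old carry enters at bit 0, bit 31 becomes the new carry."""
    return ((value << 1) | carry_in) & MASK32, (value >> 31) & 1

def shl64(high, low):
    """Shift a 64-bit value held in two 32-bit halves left by one bit."""
    low, carry = shl32(low)
    high, carry = rcl32(high, carry)
    return high, low, carry

def add64(a_hi, a_lo, b_hi, b_lo):
    """ADD the low halves, then ADC the high halves with the carry from the first add."""
    lo_sum = a_lo + b_lo
    carry = lo_sum >> 32
    hi_sum = a_hi + b_hi + carry
    return hi_sum & MASK32, lo_sum & MASK32, hi_sum >> 32

print(shl64(0x00000001, 0x80000000))   # (3, 0, 0)
print(add64(0, 0xFFFFFFFF, 0, 1))      # (1, 0, 0)
```

The same pattern extends to three or more dwords by continuing the RCL or ADC chain upward.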

To use the carry bit effectively, you have to be aware of which operations change the state of the carry flag.  There are times when you have to preserve the state of the carry flag before carrying out further operations.  A JC or JNC
branch serves the purpose of remembering a prior state by the branch taken,
or you can use the ADC or SBB instructions to preserve the contents into a register or memory location, or you can attempt to save all the flag states before continuing what you are doing.

Handling Flags is simplified by two instructions:  LAHF, which stands for Load AH register from Flags, and SAHF, which stands for Store AH into Flags.  These two only transfer the low byte of the Flags register, which holds five of the flags: SF, ZF, AF, PF and CF.
You also have the option to save the flags onto the stack with PUSHF, and to return the saved flags from the stack with POPF.

One of the things you might want is an extensive help file on the Assembly instruction set.  You can look for a file named ASM.HLP, which I find quite useful.  I'm not sure where I originally found mine, but I have it associated with the PureBasic product, so it might be on that web site (www.purebasic.com).

My previous post, where I identified the flag bits in the Flags register, is somewhat expanded on by the information in the ASM.HLP file.  There, the following breakdown is available:

        |  | | | | | | | | | | | | | | | | '---  CF Carry Flag
        |  | | | | | | | | | | | | | | | '---  1
        |  | | | | | | | | | | | | | | '---  PF Parity Flag
        |  | | | | | | | | | | | | | '---  0
        |  | | | | | | | | | | | | '---  AF Auxiliary Flag
        |  | | | | | | | | | | | '---  0
        |  | | | | | | | | | | '---  ZF Zero Flag
        |  | | | | | | | | | '---  SF Sign Flag
        |  | | | | | | | | '---  TF Trap Flag  (Single Step)
        |  | | | | | | | '---  IF Interrupt Flag
        |  | | | | | | '---  DF Direction Flag
        |  | | | | | '---  OF Overflow flag
        |  | | | '-----  IOPL I/O Privilege Level  (286+ only)
        |  | | '-----  NT Nested Task Flag  (286+ only)
        |  | '-----  0
        |  '-----  RF Resume Flag (386+ only)
        '------  VM  Virtual Mode Flag (386+ only)
        - see   PUSHF  POPF  STI  CLI  STD  CLD
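Reading the breakdown from the right-hand end, CF sits at bit 0 and each line to the left is one bit higher, with IOPL taking two bits. The Python sketch below turns that into a lookup; the table and the names decode_flags, iopl and lahf are my own illustration, and the sample value 0x0246 is just an example flags word.

```python
# Bit position of each single-bit flag, counted from bit 0 at the right.
FLAG_BITS = {
    "CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7,
    "TF": 8, "IF": 9, "DF": 10, "OF": 11,
    "NT": 14, "RF": 16, "VM": 17,
}

def decode_flags(flags):
    """Return the names of the flags that are set in a flags value."""
    return [name for name, bit in FLAG_BITS.items() if (flags >> bit) & 1]

def iopl(flags):
    """I/O Privilege Level is a two-bit field at bits 12 and 13."""
    return (flags >> 12) & 3

def lahf(flags):
    """Model LAHF: AH receives the low byte of the flags (SF, ZF, AF, PF, CF)."""
    return flags & 0xFF

sample = 0x0246
print(decode_flags(sample))   # ['PF', 'ZF', 'IF']
print(hex(lahf(sample)))      # 0x46
```

Bit 1 reads as 1 in the sample because that position is fixed at 1, as the breakdown shows.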

One of the properties of the 286+ architecture is what is known as Protected mode.  A certain instruction has to be executed to switch the processor into it, and the mode takes its name from the memory protection and privilege checks it enforces between programs.  On the 386 and later it is also the mode in which 32-bit addressing and registers are normally used.  In the 286 design, they forgot to include an instruction to turn Protected mode off.  Once turned on, the only way to turn it off was to power off or reset the computer.  Some people referred to the 286 as having half a brain, or even being brain dead.  This is an exaggeration, and the 386 was introduced to correct this deficiency and add some improvements, primarily the 32-bit registers and addressing, along with an updated FPU (Floating Point Unit) companion chip, which was slightly different than the original FPU.  The 286 has since been largely ignored.

The 486 then came out, integrating the CPU and FPU together.  However, the programming features introduced with the 386 have remained essentially the same in later designs.  The key differences have been in graphical extensions, high speed instruction caches, and execution pipelines that make the present design more efficient and much faster.  Multiple processing cores are the present vogue, but it is the OS that decides how your program will be processed internally.

While the PowerBasic compilers put a few restrictions on you with regard to
programming in assembly language, they alleviate much of the headache that goes with writing assembly code from scratch.  And since PowerBasic also creates a sandbox (a reasonably safe place) for your assembly code to run in, some of the limitations involved are reasonably nonintrusive, insignificant, and immaterial.  You just need to adapt your coding style accordingly.  If that does not satisfy you, you can use a tool like MASM32, which is capable of generating DLLs of assembly routines that can be called from PowerBasic (or another programming language of choice).  PowerBasic also automatically provides you with Protected mode access to 32-bit registers, memory, and extended instructions.

Charles Pegge

Here is a little piece of PB assembler which shows how to use hex op codes mixed in with the assembly code itself. It uses the RDTSC - Read Time Stamp Counter which PB assembler does not recognise.

The Time Stamp Counter is a free running clock cycle counter on the CPU and is very useful for measuring the performance of your system/program very accurately with a resolution of a few nanoseconds and a time span of over 100 years, being a 64 bit counter.

Note how the quad value in the edx:eax registers is passed back into a PB quad variable.

The chunk in the middle is the code being tested. (As you can see, I am rather partial to hexadecimal).


LOCAL TimeStart AS QUAD, TimeEnd AS QUAD ' for time stamp, measuring cpu clock cycles
LOCAL st AS QUAD PTR , en AS QUAD PTR: st=VARPTR(TimeStart):en=VARPTR(TimeEnd)

! push ebx                  '
'                           ' approx because it is not a serialised instruction
'                           ' it may execute before or after other instructions
'                           ' in the pipeline.
! mov ebx,st                ' var address where count is to be stored.
! db  &h0f,&h31             ' RDTSC read time-stamp counter into edx:eax hi lo.
! mov [ebx],eax             ' save low order 4 bytes.
! mov [ebx+4],edx           ' save high order 4 bytes.

! db &h53,&h55                 ' 2 ' push: ebx ebp
! db &hb9                      ' 1 ' mov ecx, ...
! dd 10000                     ' 4 ' number of loops dword
'! db &h90,&h90,&h90            ' x ' NOPs for alignment padding tests
repeats:                       ' loop target for the jg below
! db &hb8,&h00,&h00,&h00,&h00  ' mov eax,0
! db &hba,&h00,&h00,&h00,&h00  ' mov edx,0
'! db &hb2,&h00                 ' mov dl,0

! db &hbb,&h00,&h00,&h00,&h00  ' mov ebx,0
! db &hbd,&h00,&h00,&h00,&h00  ' mov ebp,0

! db &hbe,&h00,&h00,&h00,&h00  ' mov esi,0
! db &hbf,&h00,&h00,&h00,&h00  ' mov edi,0
! db &h49                      ' dec ecx
! jg repeats           ' 3     ' jg repeats
! db &h5d,&h5b                 ' pop: ebp ebx

! mov ebx,[esp]             ' restore ebx value without popping the stack
'                           ' approx because it is not a serialised instruction
'                           ' it may execute before or after other instructions
'                           ' in the pipeline.
! mov ebx,en                ' var address where count is to be stored.
! db  &h0f,&h31             ' RDTSC read time-stamp counter into edx:eax hi lo.
! mov [ebx],eax             ' save low order 4 bytes.
! mov [ebx+4],edx           ' save high order 4 bytes.
! pop ebx                   '

MSGBOX "That took "+STR$(TimeEnd-TimeStart)+" clocks."

Donald Darden

Nice piece of code, Charles.  There are generally three ways to try and optimize code:  Perform a byte count and strive to reduce the size of the code; count the instruction cycles used by all the instructions and total them up; or use a timing loop to see how much time is involved during execution.   For the last, you normally execute the code many times and average out the time needed for one pass, by taking the total time and dividing it by the number of repeats.
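The timing-loop approach is not tied to RDTSC. As a sketch of the same idea in Python, the helper below (average_ns is a made-up name, and sample is just a placeholder workload) uses the standard time.perf_counter_ns clock, runs the code a number of times, and divides the total by the repeat count. The figure it reports includes the loop overhead and will vary from run to run.

```python
import time

def average_ns(func, repeats=10_000):
    """Run func repeatedly and return the average time per call in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(repeats):
        func()
    elapsed = time.perf_counter_ns() - start
    return elapsed / repeats

def sample():
    """Placeholder for the code being measured."""
    return sum(range(100))

print(f"about {average_ns(sample):.0f} ns per call")
```

Raising the repeat count smooths out interruptions from the OS, at the cost of a longer test.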

I like the fact that you used hex coding, then followed it with the corresponding ASM statement as a comment.  It shows an example of coding in hex, and your
use of a comment to explain what you are doing would certainly help others understand what is going on.  Comments can also be used to explain why you are performing certain operations as well.  Commenting code is even more important in assembly coding than it is in BASIC, because with assembly, you are taking baby steps rather than giant strides, and the context of what you are doing is often harder to grasp.  The focus is more towards interactions involving registers rather than directly with memory, so working with assembly, you have to remember the relationship between the registers and the memory locations where the contents came from, or where the results are eventually stored back into.  Since your memory is often limited to your immediate understanding (you will have forgotten this when you come back to it a year from now), you can again use the comments to signify important relationships, such as 'EBX holds INT PTR -> a' or 'EBX = @a (LONG)'.

Donald Darden

Looking at an assembly statement, you may see places where a set of square brackets are used.  They have a special meaning, which is to represent the act of indirect addressing.  Actually, I kind of think that a better term would be to call it redirected addressing.

If I were to enter an assembly instruction like this:
    a = 12345
    b = VARPTR(a)
    ! mov eax, a
    ! mov ebx, b

Then it should not surprise you that the register EAX holds the hexadecimal equivalent of the value 12345 in decimal, while the content of EBX is the
pointer to where a is in memory.

As it happens, my knowledge of how the PowerBasic Compilers write code into the executable file is extremely limited.  Most of my experience was obtained with the earlier PB/DOS compilers.  Some things work pretty much the same,
but other things are quite different.  I found that some things need to be
verified as to how they work and if they still work before I can comment on them to any real extent here.

Here is an example that I just ran up against:  In earlier incarnations, the VARPTR(some dynamic string) pointed to a reference for that variable in memory, and that reference was composed of two parts:  A string pointer, immediately followed by the string length.  It would look like this, more or less:

VARPTR(stringname) -> STRPTR(stringname) -> First byte of string in memory
                           LEN(stringname) ----------------^

Now if a BYREF to a string was passed as a parameter to a Sub or Function, you would find the VARPTR() value in that parameter's position on the stack, and
when you did something like !MOV EBX,aa, and the variable was named aa, then
the value that was moved into the EBX register was where the string reference was in memory.  At the Assembly level, you never see the variable name, because that only has meaning to the compiler.  The assembler only knows about addresses in memory or offsets from the stack pointer or base pointer.

Okay, you have the VARPTR() value for variable aa in the EBX register, and if you did a !MOV EBX,[EBX], you would move the four bytes beginning at that address into the EBX register, and this was (and is) the STRPTR() for the variable aa, which is the address of the first byte of the actual string contents.  So far so good.  We know where the string is in memory now.  But we also have to know the length of the string.
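The two-step lookup can be pictured with a toy memory model. This Python sketch is only an illustration: the bytearray stands in for RAM, the addresses HANDLE and STRING_DATA are arbitrary made-up offsets, and nothing here claims to match the layout the PowerBasic compilers actually use beyond the VARPTR -> STRPTR -> first byte chain described above.

```python
import struct

memory = bytearray(64)   # toy flat memory; an "address" is just an offset

HANDLE = 8               # hypothetical VARPTR(aa): where the string reference lives
STRING_DATA = 32         # hypothetical STRPTR(aa): where the characters live

text = b"Hello"
memory[STRING_DATA:STRING_DATA + len(text)] = text
struct.pack_into("<I", memory, HANDLE, STRING_DATA)  # the reference holds the data pointer

def read_dword(address):
    """Model MOV reg, [address]: fetch four bytes, low byte first."""
    return struct.unpack_from("<I", memory, address)[0]

ebx = HANDLE             # ! MOV EBX, aa      EBX = VARPTR(aa)
ebx = read_dword(ebx)    # ! MOV EBX, [EBX]   EBX = STRPTR(aa)
first_byte = memory[ebx] # the byte EBX now points at

print(ebx, chr(first_byte))  # 32 H
```

The square brackets are the whole difference between the two MOV lines: without them the register gets the address, with them it gets what is stored at that address.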

Now we know how to use LEN(aa) to get the length of the string, but we are not permitted to do this:  !MOV EAX,LEN(aa).  Here you are trying to mix an assembly operation and a BASIC function together, and the Assembler is not into BASIC at all, and PowerBasic is not able to supply a constant value for the length of aa at compile time, because in use, aa could have any length you want it to have.  This is a dynamic string, remember?  So what do we do?  Well, if this worked the way I thought it did with the new compilers, you could have used a !MOV EAX,[EBX+4] and got the length into EAX before you did the !MOV EBX,[EBX] above.  But this doesn't work for me, and I'm still studying the resulting code, trying to figure out what PowerBasic is now doing instead.

Well, not to despair, there is usually a way.  Here, all you have to do is have another local or static variable in your procedure, and set it equal to the length of the passed string variable before you then pass that value to a register using assembly language.  Here is an example of how all that could work:

SUB Example (aa AS STRING)
  STATIC a AS LONG               'local working variable
  a = LEN(aa)                    'use it to hold the length of aa string   
  ! MOV EAX, a                   'now pass the length to the EAX register
  ! MOV EBX, aa                  'get the VARPTR(aa) into EBX
  ! MOV EBX, [EBX]               'use that to get the STRPTR(aa) indirectly
END SUB

The new PowerBasic compilers seem to make an effort to mask where they put the Data segment, breaking it up into separate pieces.  It also seems to put a number of operations into appended code elements that are called with CALL statements, for which there are usually RETN (Return Near) exits.

I'm about of the mind that the method given here, for taking particulars about strings and passing them to the registers via temporary variables, would give you less grief overall, than trying to master PowerBasic's method of finding and returning the length of a string.  Don't forget, PowerBasic can determine the length of any string, whether dynamic or fixed, and whether terminated by a $NULL or not, so it can do more than you need it to do, and possibly take longer getting it done than you think strictly necessary.

As mentioned before, it is not uncommon to use the EBP register as an offset into the passed Stack, because the normal stack pointer (ESP) will continue to
reflect any additional pushes and pops, along with Calls and Returns, and the
EBP can be used to anchor the point from which to reference the stack.  But
depending upon when the !MOV EBP,ESP instruction took place, there may be
things on the stack BELOW the point where the EBP register is set to point to.

Well, that could be awkward, right?  Suppose there were things on the stack that were below where the EBP pointed to, how would you reference them?  The answer involves the fact that if you add a negative number to a positive one, it is exactly the same as subtracting it.  Suppose I wrote an expression of n = n + (-1), then simplified it.  I would get n = n - 1, even though I did specify an add operation.  In the computer, I could have an instruction like this:

                   MOV EAX, DWORD [EBP+0FFFFFF78h]

That long value that starts with "0", has the zero in front to specify that this is a number, and the "FFFFFF78h" specifies a NEGATIVE value to be added to the
EBP register's contents during this operation.  The DWORD tells the assembler
to read a 32-bit value from that location. 

Now you can bet that the ESP, whatever its value is, is set below the point being referenced here, because this is the only assurance that the contents of the stack at that point would be stable and valid.

Note that the MOV EAX, DWORD [EBP+0FFFFFF78h] operation does not change
the contents of EBP.  That's just a computation done to determine the source.
The contents of EAX is what changes, to duplicate what is found at the source.
The source is only copied, so it is not destroyed either.  Only when the computed address is to the left of the comma, are you changing the contents of memory, because that is when it becomes a destination.
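To check the arithmetic behind that displacement, here is a short Python sketch. The helper names and the sample EBP value are assumptions made for the example; the point is only that adding 0FFFFFF78h in 32-bit wraparound arithmetic lands at the same address as subtracting 88h (136).

```python
MASK32 = 0xFFFFFFFF

def to_signed32(value):
    """Interpret a 32-bit pattern as a signed (two's complement) number."""
    value &= MASK32
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value

def effective_address(base, displacement):
    """Model [base + displacement] with 32-bit wraparound; base itself is unchanged."""
    return (base + displacement) & MASK32

disp = 0xFFFFFF78
print(to_signed32(disp))                      # -136

ebp = 0x0012FF80                              # hypothetical frame pointer value
print(hex(effective_address(ebp, disp)))      # 0x12fef8
print(hex(ebp - 0x88))                        # 0x12fef8
```

A disassembler may show the same operand as [EBP-88h]; both spellings encode the identical instruction.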

Charles Pegge

This is Intel's definitive reference on all the op codes and their precise actions. It's a PDF. Keep it on your desktop but don't try to print it out  ;D

Fortunately most of the common instructions are easy to remember and the tables in the appendices provide a good quick reference. But if you ever need chapter and verse then there is a little essay on each instruction.

Beyond the main x86 and x87 codings, things start to deviate. There are 3 producers I know of, and they are diverging from each other: Intel, AMD and VIA. So to use advanced features, you will need to consult their own manuals.

Instruction set reference:
Intel Architecture Software Developer's Manual Vol 2: Instruction Set Reference

Theo Gottwald

Nice introductions, Donald. I did some Layout changes as my contribution to your nice ASM-Intros.

It's nothing more than a [ code] at the start, then here comes the code, and a [/code ] at the end of the code.
Leave out the spaces inside the tags; I just put them in here to show how it's done.

Your writing style is easy to follow, that's why I like your postings.

Donald Darden

Aside from the number of bytes allocated to a numeric type, which can be from
1 to 4 in 32-bit architecture, or 1 to 8 in 64-bit architecture, you can break number types into three forms for use with assembly code.  The first is the signed integer, the second is the unsigned integer, and the third is floating point, which is always signed.

Computationally the signed and unsigned integers are processed identically.  So how do they differ?  It is the way the results are tested.  With signed integers, the setting of the most significant bit always signals a negative number, and negative numbers are always deemed smaller than any positive number, where the sign bit is clear.  Jump instructions that check the results of signed computations or compares are:  JG, JGE, JE, JL, JLE, and JNE.

Unsigned integers would include ASCII code characters in bytes, or even Unicode
characters in words.  In unsigned integers, setting the top bit merely shows that the value is in the upper range of that integer type, not in the negative range as with signed integers.  Thus, a different set of jump instructions is available for testing the results of operations involving unsigned integers:  JA, JAE, JE, JB, JBE, and JNE.

If you can remember that Above and Below are checks for unsigned results, and Greater and Less are checks for signed results, you should have no problem in this area.
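The difference shows up as soon as the top bit is set. In this Python sketch (jump_taken and as_signed32 are illustrative names, not real instructions or library calls), the same two 32-bit patterns are compared once as unsigned values and once as signed values, mimicking which jump would be taken after a CMP.

```python
MASK32 = 0xFFFFFFFF

def as_signed32(bits):
    """Interpret a 32-bit pattern as a two's complement signed number."""
    bits &= MASK32
    return bits - (1 << 32) if bits & 0x80000000 else bits

def jump_taken(mnemonic, left, right):
    """Would this jump be taken after CMP left, right on 32-bit values?"""
    if mnemonic in ("JA", "JAE", "JB", "JBE"):
        a, b = left & MASK32, right & MASK32          # unsigned view
    else:
        a, b = as_signed32(left), as_signed32(right)  # signed view
    return {
        "JA": a > b, "JAE": a >= b, "JB": a < b, "JBE": a <= b,
        "JG": a > b, "JGE": a >= b, "JL": a < b, "JLE": a <= b,
    }[mnemonic]

# 0xFFFFFFFF is 4294967295 unsigned, but -1 signed.
print(jump_taken("JA", 0xFFFFFFFF, 1))  # True:  4294967295 is above 1
print(jump_taken("JG", 0xFFFFFFFF, 1))  # False: -1 is not greater than 1
```

The register contents never change; only the choice of jump decides which reading applies.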

Floating point values have their own internal format:  a single sign bit for the number as a whole, an exponent that is stored with a fixed bias added rather than with a sign bit of its own, and the mantissa.  Here are the 32-bit and 64-bit floating point formats defined by the IEEE 754 standard:

  S Exponent       Mantissa
  | |      | |                     |
  | |      | |                     3
  0 1      8 9                     1

S  Exponent                    Mantissa
| |         | |                                                  |
| |         1 1                                                  6
0 1         1 2                                                  3

Note that the IEEE format counts bit positions from left to right, but that in PC
coding, we read bits in their major to minor order as respective powers of 2.  Thus, we would tend to regard the above layout in this manner:

S Exponent         Mantissa
| |      | |                     |
3 3      2 2                     |
1 0      3 2                     0

S  Exponent                          Mantissa
| |         | |                    |                             |
6 6         5 5  <- 2nd DWord ->   |   <- 1st DWord ->           |
3 2         2 1               32-> | <-31                        0
Every number is presumed to either be an integer, a fraction, or a combination of
an integer part and a fractional part.  We use a decimal point (a period usually) to mark the separation between the integer on the left, and the fraction on the
right.  When we have only an integer part, we usually forego use of a decimal
point.  So $5 and $5.00 mean the same amount.  In scientific notation, we can indicate how many trailing or leading zeros are needed in order to position a numerical value correctly with respect to the decimal point.  A positive exponent means add additional trailing zeros as necessary.  A negative exponent
means add additional leading zeros as necessary.

The same general idea holds with Floating Point numbers, but here we are dealing with powers of two, not powers of ten.  We still mean add leading or
trailing zeros, but now each zero means to double or halve the value, not to scale it by a
factor of ten as in decimal arithmetic.  So the exponent tells us how
big or small the number really is, the mantissa holds the numerical value bits themselves, and the sign bit tells us whether the number is positive or negative with respect to zero.
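For anyone who wants to pull a value apart and look, here is a Python sketch using the standard struct module. It relies on the IEEE 754 single precision layout: 1 sign bit, an 8-bit exponent stored with a bias of 127, and 23 mantissa (fraction) bits with an implied leading 1. The function names are my own, and rebuild only handles ordinary (normal) numbers, not zero, denormals, infinities or NaNs.

```python
import struct

def decode_single(x):
    """Split a 32-bit float into (sign, biased exponent, fraction bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction

def rebuild(sign, exponent, fraction):
    """Reassemble a normal number: (-1)^sign * 1.fraction * 2^(exponent - 127)."""
    return (-1) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)

sign, exponent, fraction = decode_single(-6.5)
print(sign, exponent, hex(fraction))       # 1 129 0x500000
print(rebuild(sign, exponent, fraction))   # -6.5
```

Here -6.5 is -1.625 times 2 to the power 2, so the stored exponent is 2 + 127 = 129.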

Floating Point gives us the ability to represent extremely large or extremely small values with a fair amount of accuracy and precision.  The more bits used in the
Floating Point form, the greater the range, accuracy, and precision.  But
computations that involve floating point numbers have traditionally been slow by computer terms,
and are best avoided unless really needed.  You can generally recognize any instruction that involves the Floating Point Unit (FPU) in Assembly because it
will begin with "F".  And that is all the discussion about floating point numbers for the present.

It's been suggested that we look at strings as well.  This is a good time for
that.  In general, we recognize four types of strings here:  One is the fixed
length string.  We know the size of the string, so all subsequent operations on
that string are against a fixed length.  No mystery and no muss.

The second type of string can be of any length, so we call it variable length,
but its end is marked by the use of a zero value byte, or what we call a NULL
byte.  We use $NULL in PowerBasic to represent this null byte.  This is the
most common string type dealt with in C, C++, and some other languages.

A third type of string is a mix of the first two.  That is, it is defined to have a
maximum length, but the actual end can vary and will be marked by the
presence of a $NULL byte.  This type is often used with calls to the Windows
APIs.  In that context, it is also called a buffer.

A fourth type is the dynamic, variable length string, and is the default string
type found in many BASICs, including PowerBasic.  A dynamic, variable length string has a separate parameter associated with it called LENGTH, which tracks how many bytes are currently assigned to that string.  This string type has several advantages, from not being limited in length, but adaptive, to being able to contain zero value bytes, which the $NULL terminated string types cannot hold.

There is a fifth type of string structure, but it is really just a fixed type string that is associated with a UDT, the User Defined Type.  Whether your UDT is
made up of string elements, pointers, integers, bytes, or other fixed length
strings, the whole of the UDT can be handled as though a fixed length string in its own right.  And it can contain zero value bytes with no problem.

Handling strings usually involves two things:  The first is where does the string
start?  This is a pointer value, and normally points to the first byte's address.
The second question is, how long is the string?  For this, you need to know what type of string you are dealing with.  Sometimes it is necessary to convert a string from one type, say a dynamic variable length string, to another, such as an ASCIIZ string.  This is easily done in PowerBasic.

So now you presumably know the type of string you are going to handle, and
you need a plan for doing this.  You have the pointer value, so where should you
put it?  Among the typical places would be the EBX register, or the EDI or ESI
registers.  EBX is very good, because it works well with enhanced instructions for indirect addressing.  In 16-bit architecture, the segment:register pairs most often associated with handling strings are DS:SI and ES:DI.  The DS:SI pair was most commonly used for reading data from memory, and the ES:DI pair was most commonly used for saving data back to memory.  The SI and DI registers were specially designed to automatically increment by some count if the direction flag was set to UP (DF clear, as after CLD), or decrement by some count if the direction flag was set to DN (DF set, as after STD), for certain instructions.  The "some count" was determined by the size of the operand involved - by 1 for 8 bits, by 2 for 16 bits, and by 4 for 32 bits.

So, should you choose EBX, ESI, or EDI?  Well, much depends on what you are trying to do.  You can't really go wrong with any of these, but often, you will find that certain advantages may favor the use of one over another.  Automatic
incrementing or decrementing can be beneficial when processing strings.  There
is a REP (repeat) instruction that is designed to work with string instructions to
get the fastest possible execution done on certain types of operations involving
strings.  Some of the string instructions include CMPS, CMPSB, CMPSW, CMPSD, LODS, LODSB, LODSW, LODSD, MOVS, MOVSB, MOVSW, MOVSD, SCAS, SCASB, SCASW, SCASD, STOS, STOSB, STOSW, and STOSD.

Note that the REP and LOOP instructions involve the use of the (E)CX register as a counter that counts down to zero.  So if you have a fixed length string, you
put the maximum size of the string into the CX or ECX register and use REP to
fast downcount just one string operation, or LOOP to terminate a series of string instructions.  If you are using a zero (NULL) terminated string, you set the contents of CX or ECX to the maximum possible string length, but then your test would be modified to test for a zero or non-zero content, such as LOOPNZ.
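
As a rough illustration of the zero-terminated case, here is a small C model of what a REPNE SCASB scan for a zero byte does.  C is used only because it is easy to run; the function name scan_for_nul is made up for this sketch, the direction flag is assumed to be UP, and the real instruction leaves its results in (E)CX and (E)DI rather than returning a value.

```c
#include <assert.h>
#include <stddef.h>

/* Rough model of REPNE SCASB with AL = 0: 'count' plays the role of (E)CX
   and 'p' the role of (E)DI, with the direction flag assumed to be UP.
   Returns the string length, or max_len if no zero byte was found. */
size_t scan_for_nul(const char *s, size_t max_len)
{
    assert(s != NULL);
    const char *p = s;
    size_t count = max_len;
    while (count != 0) {                 /* REP stops when the counter hits zero */
        count--;                         /* counter goes down once per byte      */
        if (*p++ == 0)                   /* compare, then auto-increment pointer */
            return (size_t)(p - s) - 1;  /* stopped on the terminator            */
    }
    return max_len;                      /* ran out of count before finding zero */
}
```

Calling scan_for_nul("hello", 100) gives 5, while a maximum count of 3 stops early and gives 3, which is why the counter has to be preloaded with the largest length you are prepared to accept.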

The use of REP with some string operations means very fast processing, but it is not very adaptive.  A key case would be if you were looking for any letter A to Z, or
any digit 0 to 9.  Nor does it help in a case that involves looking for either an upper
or lower case letter.  If you are looking for an exact match, then the REP
works very well to find a given byte, word, or dword.  However, since the
increment is determined by the register size, trying to test for four consecutive
bytes by setting them in a dword would force an increment by four, and that
would mean only checking every group of four, not every byte sequence of four.
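
The stride problem is easy to see in a small C model.  This is only a sketch: the two function names are invented for the illustration, and the first one imitates only the "compare four bytes, then step by four" behaviour of a dword scan, not the exact register effects of the instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of a dword-wide scan: compares 4 bytes, then steps by 4,
   the way a dword string instruction advances its pointer.
   Returns the byte offset of the match, or -1 if none. */
long scan_dword_stride4(const uint8_t *buf, size_t len, const uint8_t pat[4])
{
    assert(buf != NULL && pat != NULL);
    for (size_t i = 0; i + 4 <= len; i += 4)
        if (memcmp(buf + i, pat, 4) == 0)
            return (long)i;
    return -1;
}

/* Hand-coded alternative: still compares 4 bytes, but steps by 1,
   so every byte position is tried. */
long scan_dword_stride1(const uint8_t *buf, size_t len, const uint8_t pat[4])
{
    assert(buf != NULL && pat != NULL);
    for (size_t i = 0; i + 4 <= len; i++)
        if (memcmp(buf + i, pat, 4) == 0)
            return (long)i;
    return -1;
}
```

With the eight bytes "xxABCDyy", the stride-4 version looks only at "xxAB" and "CDyy" and misses the pattern, while the stride-1 version finds "ABCD" at offset 2.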

In an idealized architecture, these shortcomings would be addressed.  So instead, programmers struggle to find optimum solutions for their needs within the scope of what is available within the existing architecture.[/code]

Charles Pegge

Good Morning Donald,

Your discussion on the floating point processor is very timely for me, as I am researching fpu op codes today. The problem with the FPU is that it is rather loosely integrated with the CPU. In fact they started out as two separate chips sharing the same bus with a sync protocol for passing data between them. To ensure correct operation, every maths operation had to be preceded by a WAIT (9B). Although the two chips became one with the 486, they still behave as separate devices in many respects.

Not only do they have totally separate registers, but the FPU registers are arranged as a stack of 8, and when you load a variable, it goes onto the top of the stack and the other registers are pushed down. That means that when you have finished computing a floating point expression you have to leave the stack as you found it, and ensure that when you store values, they are popped from the stack if they are no longer required.

Here is a function for adding 2 numbers together


function adds(byval a as double, byval b as double) as double
!  FLD  a                 ; loads and pushes value onto stack
!  FADD b                 ; add to the value in the top of the stack
!  FSTP function          ; store the result and pop the stack
end function


function adds(byval a as double, byval b as double) as double
asm
FLD qword ptr [a]                 ' loads and pushes value onto stack
FADD qword ptr [b]                ' add to the value in the top of the stack
FSTP qword ptr [function]         ' store the result and pop the stack
end asm
end function

Charles Pegge

On the subject of string loops:  LoopNZ Rep etc

It seems that these clever loopy instructions available on the x86 are not as efficient as the elemental instructions. Probably it's because they are microcoded rather than hard coded instructions, and require interpretation before being streamed into the execution pipeline.

So the good news is we don't have to learn them anymore to write the most efficient code. On contemporary CPUs the fundamentals do it better.

Paul Dixon:

I haven't read through the whole of your code but if you want it to be a little faster..
The LOOP opcode is slow compared to coding the same thing yourself. You should try to avoid using it. ...


Charles Pegge:

I confirm that LOOP takes a lot longer than DEC ECX: JNZ short ..
In an empty loop with 2gig repeats, the LOOP instruction took
3 seconds instead of 2 seconds (Athlon 3200).


Donald Darden

The REP and LOOP instructions, together with the string instructions they are
used with, perform several operations each time around, and consequently involve
quite a bit of overhead: the detection of the direction flag, the automatic increment
or decrement of the (E)SI and (E)DI registers, the test of the (E)CX register for a zero value, the automatic decrement of the (E)CX register, and then the branch (jump) to some other location if the condition being tested for is met.

Yes, your "fundamental" instructions are faster, but then you need more instructions to do as much.  So it is not a clear case of one or the other, but just using what works best under the circumstances.

LOOP instructions act like upside-down FOR ... NEXT statements, where the FOR is set up initially (here you would precondition the (E)CX, and possibly the (E)SI and (E)DI registers in assembly), then you perform the LOOP, which acts like the NEXT, in that it performs the test and the necessary increment or decrement.

While you can set the LOOP instruction at the top of a loop range, or somewhere within, the area defined by the loop itself is generally governed by the branch address included with the LOOP instruction and any additional jumps that return
you to some part of the loop range, or take you out of the loop.  Thus, trying to analyze LOOP logic in assembler can be much more complicated than looking at a BASIC statement with its nice, neat FOR ... NEXT structure.

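Additionally, there is no STEP size involved with a LOOP instruction; it is always a decrement of one (1).  Note that LOOP decrements (E)CX first and only then tests it for zero, so entering a LOOP with (E)CX already at zero wraps the counter around and runs the loop the maximum number of times.  REP, by contrast, tests (E)CX for zero BEFORE each repetition.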

Speaking of steps, that is another topic that needs to be understood.  One of the original chips of the X86 family, the 8088, addressed memory in terms of bytes, just 8 bits.  In reading memory for 16 bits, it read the lower 8 bits, then it read the upper 8 bits from the next address.  For compatibility, the 8086 and
later chips still look at memory as though it were organized by bytes, when actually it is usually by word (16 bits) for most 16-bit CPUs, and dwords (32-bits) for most 32-bit CPUs.  But the convention is still to support access to memory by bytes, so offsets from pointers use increments of 1 for bytes, of 2 for words, and of 4 for dwords.  Naturally, with 64-bit architecture, or support for one of the PowerBasic data types, you also have increments of 8 for a quad.

Now the stack is a form of memory - in fact, it actually is part of your main memory, just set to work from someplace high in memory and work backwards down through memory addresses, rather than starting near the lower range and working up as with other memory addressing modes.  Because the stack is a part of memory, it is not as fast to access as the registers are.  It also requires a register pair of its own (SS:(E)SP) to manage it, by pointing at the current bottom of the stack.  Every program generally requires its own stack space, and you can have multiple programs and processes running at once, so the effort to keep the single SS:(E)SP register pair pointing to the right stack when performing any instructions in the corresponding program or process is a challenge that the Operating System handles transparently for you.

But an oddity of the stack is that while it also recognizes memory as being organized in bytes, it really can only push and pop its contents based on word
sizing.  That is, you cannot push just AL or AH, or any other 8-bit byte; you have to push or pop at least 16 bits at once.  So the information on the stack will always be found in increments of two, and any push or pop will be in increments of two as well.

So why should you care about this?  Well, it does tell you that if you put the
current address found in (E)SP into another register to reference anything on
the stack, or even use the stack pointer itself with an offset, that offset will always be 0, or 2, or 4, or any other multiple of two when finding the leading
byte of any item placed on the stack.  It will never be odd.  It also tells you
that if you put the current flags on the stack with a PUSHF, the flags will occupy two bytes: the upper byte, with any bits that have no corresponding flag forced to zero, appears on the stack first, above and before the second byte, which holds the rest of the flag bits.  In fact, even though the stack works downwards through memory rather than up, it respects the low-byte then high-byte, or low-word then high-word order used with other references to memory, by decrementing the stack pointer, storing the highest byte first,
decrementing the stack pointer, storing the next highest byte, and so on until
the item is fully copied to the stack.
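
The byte order described above can be made literal with a toy stack in C.  This is a sketch only: ToyStack and its functions are invented names, the 64-byte array stands in for the stack segment, and a real CPU writes the whole word in one operation rather than byte by byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stack: 'mem' stands in for the stack segment and 'sp' for (E)SP.
   The stack grows toward lower addresses. */
typedef struct {
    uint8_t mem[64];
    size_t  sp;          /* offset of the most recently pushed byte */
} ToyStack;

void toy_init(ToyStack *s)
{
    assert(s != NULL);
    s->sp = sizeof s->mem;               /* empty stack starts at the high end */
}

/* Push a 16-bit value: high byte goes in first (higher address),
   low byte last (lower address), so memory reads low byte then high byte. */
void toy_push16(ToyStack *s, uint16_t v)
{
    assert(s != NULL && s->sp >= 2);
    s->mem[--s->sp] = (uint8_t)(v >> 8);
    s->mem[--s->sp] = (uint8_t)(v & 0xFF);
}

/* Pop a 16-bit value: low byte comes off first, then the high byte. */
uint16_t toy_pop16(ToyStack *s)
{
    assert(s != NULL && s->sp + 2 <= sizeof s->mem);
    uint16_t lo = s->mem[s->sp++];
    uint16_t hi = s->mem[s->sp++];
    return (uint16_t)(lo | (hi << 8));
}
```

After pushing 0x1234 onto an empty toy stack, sp is 62, mem[62] holds 0x34 and mem[63] holds 0x12, and sp stays even no matter how many words are pushed or popped.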

By the way, if you decide to store individual registers to the stack, as discussed
before, you generally pop them in reverse order from the sequence in which you
pushed them.  If you push them all with a PUSHA or PUSHAD instruction, then
pop them with a POPA or POPAD instruction, the architecture performs this reverse sequencing for you.  I've not yet found a write up that describes the sequence used with POPA or POPAD, but by knowing the contents of each
register beforehand, it would be possible to examine the stack frame and figure
this out.

Should you ever consider pushing all the registers onto the stack?  If you are not going to use the stack, and limit your use of the registers, it may not be necessary to save any of the contents.  If only a couple of registers need to be saved and restored, some people prefer to save these in local variables, and others may decide individual PUSH and POP instructions will suffice.  It is often a matter of programmer's preference.  PUSHA and POPA are
easy to do, and cut down on mistakes, and the stack memory is immediately returned for further use.  But with all that pushing and popping going on, there
would be a small performance hit each time you do this.

Let's be clear about something else.  The move (MOV) statement always just
copies information TO a destination FROM a source.  The information still remains at the source; it is not destroyed in the process.  But when you POP something off the stack, it is gone from the stack.  It may actually be in memory below the place pointed to by the stack pointer for awhile, but there is nothing to protect it there, and it will be overwritten by subsequent pushes or calls.

I believe we are getting to a point where most of the general observations about assembly programming have been more or less covered.  If you have been reading along, some of the mystery may have gone out of the topic by now.  The next stage would be to consider specific cases and see how it is done, then take
the code and tweak it some more yourself.

I am going to propose several small exercises here.  The first is to take a
dynamic, variable length string, and go through it, converting any lower case
characters to upper case.  Then do the same thing for an ASCIIZ string.

The second exercise is to take a dynamic, variable length string, and switch the
first byte with the last, the second with the next to last, and so on.  PowerBasic already has a command for doing this called STRREVERSE$(), but you can write
your own version.  And again, adapt it to work with an ASCIIZ string instead.

Post your work as replies to this post, and let's see what works best and why. 

Charles Pegge

From the CPU's point of view, being aligned to 32 bit words helps to maximise the performance, and memory is absurdly cheap compared to how it was thirty years ago. The main byte bottleneck is networking bandwidth, for which data compression provides a solution.

I once worked with a Chinese IT administrator. He was trying out a Mandarin version of Windows 3.5, but they were using DOS based systems in which 2 letters could be keyed to get one Chinese character.  So yes, standard keyboards are always going to be less convenient for special characters, but Unicode could be useful in a variety of specialised keyboards and other input devices. Perhaps APL could be revived with a keyboard of the right sort, something that could share the regular keyboard's USB socket.

Now personally I would like to see a Welsh keyboard, since the Welsh language does not use K Q V X Z these keys could be freed up to do more useful things,  ;D

Donald Darden

Alignment on 32-bit boundaries is good for indirect addressing purposes, as it means all 32 bits can be read in one read cycle, not in two.  But the instructions of the x86 CPU range in length from one byte to many, and consequently, you cannot have a program that is guaranteed to always be read one instruction per cycle - some memory reads will bring in two or more instructions, some instructions will require only one read cycle, and some will still require two.

Efforts to speed up the CPU's performance with pipes, prefetch, caches, and multicores have greatly helped, but blur the distinction as to what works best.  Obviously, there are things that can be done to improve performance, but it is rarely a case of do this or don't do that anymore.  If clients want more speed, then they may need faster computers with more memory.  It beats killing yourself trying to max out an old box.

When I learned the five-bit teletype code, it was a limited character set.  Just 26 upper case letters, a shift up and shift down key, carriage return, line feed,
and a break key.  That took 31 unique codes.  The number keys were placed on top of the letter keys, along with a number of punctuation symbols.  You shifted
up, you had numbers and symbols.  Shift down, and you had letters.  I wrote
routines to convert old teletype tapes to ASCII and EBCDIC code at one time.

If I were mapping teletype code onto bytes, I would only need five bits of the eight, and could thereby store eight characters in the space of five normal 8-bit characters.  If I went to a six-bit code, which supports 63 characters aside from a null, I could include my lower case characters as well, and have two keys represent a function shift option to extend the code to do more.
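
The eight-into-five arithmetic (8 x 5 bits = 40 bits = 5 bytes) can be sketched in C as below.  The function names pack5 and unpack5 are made up, and the choice to put the first code in the most significant bits is an assumption for the illustration; any consistent bit order would do.

```c
#include <assert.h>
#include <stdint.h>

/* Pack eight 5-bit codes (values 0..31) into five bytes, first code in the
   most significant bits.  The bit order is an assumption for illustration. */
void pack5(const uint8_t codes[8], uint8_t out[5])
{
    uint64_t acc = 0;
    for (int i = 0; i < 8; i++) {
        assert(codes[i] < 32);
        acc = (acc << 5) | codes[i];      /* 8 * 5 = 40 bits in total */
    }
    for (int i = 4; i >= 0; i--) {
        out[i] = (uint8_t)(acc & 0xFF);   /* 40 bits = 5 bytes */
        acc >>= 8;
    }
}

/* Reverse of pack5: recover the eight 5-bit codes from five bytes. */
void unpack5(const uint8_t in[5], uint8_t codes[8])
{
    uint64_t acc = 0;
    for (int i = 0; i < 5; i++)
        acc = (acc << 8) | in[i];
    for (int i = 7; i >= 0; i--) {
        codes[i] = (uint8_t)(acc & 0x1F);
        acc >>= 5;
    }
}
```

The saving is three bytes in every eight, at the cost of shifting and masking on every read and write.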

If I were smart about the present ASCII code set, I would limit myself to just
using 127 key codes, plus the null.  I would use my upmost bit to flag that my
code was not using the values 128 - 255, but that the code required a second
byte.  By setting the upper bit, I tell my system that I need two bytes, not just one.  If the upper bit of the second byte is also set, then I need a third, and so on.  That way I could potentially "grow" my character set by adding any
additional code sets that I might need, and at the same time, commit myself to
supporting the previous sets as they become defined.
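
A scheme like this can be sketched in a few lines of C.  The details here are assumptions, since the description above only fixes the meaning of the top bit: the names encode_code and decode_code are invented, each byte carries seven payload bits, the most significant group is sent first, and codes are capped at 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode 'code' as 7-bit groups, most significant group first.
   Every byte except the last has its top bit set to say "another byte follows".
   Returns the number of bytes written (at most 5 for a 32-bit value). */
size_t encode_code(uint32_t code, uint8_t out[5])
{
    uint8_t groups[5];
    size_t n = 0;
    assert(out != NULL);
    do {
        groups[n++] = (uint8_t)(code & 0x7F);   /* collect low group first */
        code >>= 7;
    } while (code != 0);
    for (size_t i = 0; i < n; i++) {
        uint8_t g = groups[n - 1 - i];          /* emit high group first   */
        out[i] = (i + 1 < n) ? (uint8_t)(g | 0x80) : g;
    }
    return n;
}

/* Decode one code from 'in'; stores it in *code and returns bytes consumed,
   or 0 if the input ends mid-code or runs longer than 5 bytes. */
size_t decode_code(const uint8_t *in, size_t len, uint32_t *code)
{
    assert(in != NULL && code != NULL);
    uint32_t v = 0;
    for (size_t i = 0; i < len && i < 5; i++) {
        v = (v << 7) | (in[i] & 0x7Fu);
        if ((in[i] & 0x80) == 0) {              /* top bit clear: last byte */
            *code = v;
            return i + 1;
        }
    }
    return 0;
}
```

Plain 7-bit text is unchanged (the letter A is still the single byte 0x41), while 200 becomes the two bytes 0x81 0x48 and 50000 becomes the three bytes 0x83 0x86 0x50.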

Unicode allows for code combinations from 256 - 65,535.  Big deal.  So who is
managing that growth, and what if someone decides that they want 50,000 symbols to represent Chinese?  What does that leave everyone else?  On top of that,
we will all have to adapt from our existing one-byte character set to support
the two-byte code set, with no gain for us in the process.  My method just says
forget the special codes above 127, which no two fonts support exactly alike anyway.

As to mention of using color codes and 32-bit pixels to represent characters, that is for only one pixel, not for a whole character.  Think about the immensity
of supporting any text of any size and color, and trying to interpret characters and words for a text search.  Pictures are said to be worth a thousand words, but no two people would see or describe a picture the same way.  Words, properly used, can convey meanings as precisely and skillfully as the language will allow. 


Charles Pegge

Digressing into the delights of Unicode:


Chinese and Indic scripts are very well covered by Unicode. Ancient scripts like Cuneiform too. There are even proposals for Egyptian Hieroglyphics, but since there are over 700 of those, that takes up quite a lot of space for an obsolete script.

Different languages need  to be symbolically represented with a uniform system, as this facilitates multilingual translation as well as displaying script efficiently.

But yes, I think you are right. For the computer-human interface, the Anglocentric 7 bit ASCII is going to remain the standard for a long time to come. And it is easy for the computer to lex into names, numbers, punctuation etc.

Ideally we would have single symbols to represent single abstract concepts, as is done in pure mathematics, but these are not really part of our linguistic heritage, and we would have an excessively large symbol set if we tried to invent an individual symbol say for each intrinsic function within BASIC or an Assembler instruction set, though this is more a topic for the Computer Languages thread.

Donald Darden

There is a marked difference between phonetic written languages, which try essentially to have symbols for each common sound that can then be used to represent any word by grouping symbols together, and languages that attempt to represent each and every concept with its own unique symbol.

Attempting to use Unicode as the common vehicle to combine both approaches into one shows a lack of regard for the essential simplicity of phonetics.  You create a complex representation that really does nothing for anyone.

It does not make any sense that someone wants to classify a dead language as a suitable target for Unicode, because they can codify that language or any other in its own best-suited manner.  It's not like anybody is ever going to study Unicode and begin using it for everyday conversation, trying to pick up the nuances of some unknown turn of symbols.  The real future is going to be in fast and accurate language-to-language translators and the adoption of common sets of languages for business needs.

I've heard over the years that the Chinese have had a real struggle with adopting the use of a keyboard because of the difficulty with creating pictorial representations of their written language, and that a number of different methods have been tried and gained limited acceptance.  The goal has mostly been to try and reduce the necessary set of symbols to a much smaller, more manageable set, or the use of a special keyboard where different key combinations would serve to mark different portions of one symbol.  Right.  Just what everyone wants.  A new keyboard where you design your own symbol by striking several keys at once.  I'd like to see anyone learn speed typing on that.

Unicode will never gain widespread acceptance.  I do expect some limited gains regarding a few choice languages, but most people will not use the extensions because they will not have all those languages in common.  It's like trying to make the traffic laws of every nation on earth uniform, where we all drive on the same side of the road.  Why should the local populace be discomforted and forced to adapt, just because some foreign visitor finds it odd that we stick to doing things our own way rather than conform to their standard or desire for uniformity?

Along similar lines, it has been the thought of some that there should be a universal computer language, a universal human tongue, and a universal written language.  Ever hear of Esperanto or Interlingua?  How about Energy Systems Language?  Artificial languages are always being devised and touted as great advances, even adopted by some, but all they do is create more choice; they do not supplant what we already have.  English is the most common language used in the world of commerce, not because it was the language of choice, but because it followed the influence of leading nations that used it as their primary tongue.  There will be a Unicode, but it will not supplant existing 8-bit codes in existing contexts.  There will be no driving force to change this, because there is no real need to do so.