M6811DIS Disassembler

Fri Jul 16 20:05:18 GMT 1999

>
>Date: Thu, 15 Jul 1999 12:37:18 -0800
>From: Ludis Langens <ludis at cruzers.com>
>Subject: Re: M6811DIS Disassembler
>
>Are you comparing or ignoring operands?  It is very likely that the same
>function in two different PROMs will reference different RAM/PROM
>locations.  The purpose of those locations hasn't changed, just the
>absolute address has moved around between different builds.
>
>Years ago I wanted to compare disassembled 680x0 code.  I had a linker
>xref from which to obtain the "public" function names.  Local labels
>(sequentially numbered) were used as much as possible within functions. 
>The remaining generated (non-local) labels were replaced with a hash
>derived from the previous public label.  This way all absolute code
>addresses were hidden.
>
>Because the above was done in text files, a simple text file compare
>program could find differences between code revisions.  Overall, this
>didn't work too well.  Usually, a one or two line bug fix would have
>effects throughout a function.  The result is that I got buried in
>output from the file diff utility.

My idea for the signatures would be a multi-level signature...  You
first compare code fragments by opcode only, so that, as you say,
absolute addresses won't cause problems...  But you also need to
compare length -- suppose two code fragments are identical except
that, for some reason, one or two extra instructions were added to
one of them...  So relative similarity needs to be checked as well...
Also, what about functional similarity -- would it be possible in the
comparison process to analyze what is being done and, if the same
thing is being done in two different ways, treat them as similar
code? ...  An example would be code written in a high level language:
compiled with one particular version of a compiler and a certain set
of optimizations, it might be slightly different from the identical
code compiled with a different version of the compiler (or a
different compiler altogether)...  A simple example:

	ldx  #data_addr     (3 bytes)
	ldd  0,x            (2 bytes)

vs.

	ldd  data_addr      (3 bytes)

Obviously the second one is more optimized, unless of course you
are loading many different offsets from a structure at data_addr,
such as:

       ldx  #data_addr     (3 bytes)
       ldd  0,x            (2 bytes)
       pshd                (2 bytes)   ; pshd is a macro for psha pshb
       ldd  2,x            (2 bytes)
       pshd                (2 bytes)
       ldd  4,x            (2 bytes)
       pshd                (2 bytes)
       ldd  6,x            (2 bytes)
       pshd                (2 bytes)

             19 bytes total, 13 instructions (psha and pshb counted separately)

vs.

       ldd  data_addr      (3 bytes)
       pshd                (2 bytes)
       ldd  data_addr+2    (3 bytes)
       pshd                (2 bytes)
       ldd  data_addr+4    (3 bytes)
       pshd                (2 bytes)
       ldd  data_addr+6    (3 bytes)
       pshd                (2 bytes)

             20 bytes total, 12 instructions

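In this case the first is more optimized in terms of bytes (though
probably not clock cycles: on the HC11 an indexed load off X takes the
same 5 cycles as an extended one, so the ldx is pure overhead)...  But
the functionality is the same -- except that the X register is
destroyed in the first...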
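
As a rough check of the totals above, here is a small Python sketch that
tallies bytes, instructions and cycles for both sequences.  The byte and
cycle figures are the ones listed in the M68HC11 instruction set tables
(worth double-checking against the reference manual); the names COST,
PSHD and tally are made up for this illustration.  By these figures the
ldx version saves one byte but costs three more cycles.

```python
# (bytes, cycles) per (mnemonic, addressing mode), as listed in the
# M68HC11 instruction set tables.  Only the instructions used here.
COST = {
    ("ldx", "imm"): (3, 3),
    ("ldd", "ind_x"): (2, 5),
    ("ldd", "ext"): (3, 5),
    ("psha", "inh"): (1, 3),
    ("pshb", "inh"): (1, 3),
}

# The pshd macro expanded into its two real instructions.
PSHD = [("psha", "inh"), ("pshb", "inh")]


def tally(seq):
    """Return (bytes, instructions, cycles) for a list of (mnemonic, mode)."""
    total_bytes = sum(COST[insn][0] for insn in seq)
    total_cycles = sum(COST[insn][1] for insn in seq)
    return total_bytes, len(seq), total_cycles


# First version: load X once, then four indexed loads.
indexed = [("ldx", "imm")] + ([("ldd", "ind_x")] + PSHD) * 4
# Second version: four extended loads, X untouched.
extended = ([("ldd", "ext")] + PSHD) * 4

print(tally(indexed))   # (19, 13, 47)
print(tally(extended))  # (20, 12, 44)
```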

So either the signature or the comparator (or both) needs to take
functionality into account as well...  And that is a difficult
task -- especially when trying to keep it totally generic at the
same time...
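
The first two levels (opcode-only signature, then relative similarity)
could be sketched roughly like this in Python.  This is only an
illustration under simplifying assumptions: each input line is
"mnemonic operands" with no label field, and the function names
opcode_signature and similarity are invented here.  It uses the standard
difflib module for the fuzzy comparison and makes no attempt at the
functional-similarity level.

```python
import difflib


def opcode_signature(lines):
    """Level 1: keep only the mnemonic of each line, dropping operands."""
    sig = []
    for line in lines:
        text = line.split(";", 1)[0].strip()  # strip comments and blanks
        if text:
            sig.append(text.split()[0].lower())
    return sig


def similarity(a, b):
    """Level 2: score from 0.0 to 1.0 that tolerates added/removed opcodes."""
    matcher = difflib.SequenceMatcher(
        None, opcode_signature(a), opcode_signature(b)
    )
    return matcher.ratio()


# Same routine from two hypothetical PROMs: addresses moved, one nop added.
old = ["ldx #$1234", "ldaa 0,x", "ldab 1,x", "adda 2,x", "rts"]
moved = ["ldx #$2468", "ldaa 0,x", "ldab 1,x", "adda 2,x", "rts"]
patched = ["ldx #$2468", "ldaa 0,x", "ldab 1,x", "nop", "adda 2,x", "rts"]

print(opcode_signature(old) == opcode_signature(moved))  # True
print(similarity(old, patched))  # a little under 1.0
```

A comparator could treat an exact signature match as "same function" and
fall back to the similarity score, with some threshold, for fragments
that differ by an instruction or two.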

>All this leads to a question on a change I might make to my
>disassemblies.  Because my HC11 assembler is really just a set of macros
>running in a different assembler, I can change how opcodes are
>assembled.  So, as ECM code gets more complex, GM is increasingly doing this:
>
>Foo ...
>Bar ...
>Zot ...
>    ...
>    LDX #Foo
>    LDAA 0,X
>    LDAB 1,X
>    ADDA 2,X
>
>When fully EQUated, the code should be:
>
>    LDX #Foo
>    LDAA Foo-Foo,X
>    LDAB Bar-Foo,X
>    ADDA Zot-Foo,X
>
>This gets hard to read!  At least it is easy to search for all
>references to Zot.  I could have the code promise the assembler that X
>will contain a certain value.  Any extended mode operand which is within
>255 bytes of the index value could then get assembled to use the indexed
>addressing mode.  The code would be something like this:
>
>    LDX #Foo
>    DIRECTX Foo
>    LDAA Foo,X
>    LDAB Bar,X
>    ADDA Zot,X
>    DIRECTX
>
>Would this be a good feature?  Or would it be too hard to run through
>other assemblers?
>
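
The mode selection that a DIRECTX-style promise implies could be
sketched like this in Python.  It is only a guess at the rule described
above, not how any real assembler does it: choose_mode and the example
addresses are hypothetical, and direct (zero page) addressing is left
out to keep it short.

```python
def choose_mode(operand, x_value=None):
    """Pick an addressing mode for a 16-bit operand address.

    x_value is the address the code has promised X contains, or None
    when no such promise is in effect.
    """
    if x_value is not None:
        offset = operand - x_value
        # HC11 indexed offsets are unsigned 8-bit values.
        if 0 <= offset <= 0xFF:
            return ("ind_x", offset)
    return ("ext", operand)


# Hypothetical addresses for the three labels.
FOO, BAR, ZOT = 0x2000, 0x2001, 0x2002

print(choose_mode(ZOT, FOO))     # ('ind_x', 2)
print(choose_mode(0x2100, FOO))  # out of range, stays extended
print(choose_mode(ZOT))          # no promise active, stays extended
```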

Again, this ties to the intelligent disassembler/decompiler concept...
Usually, when a compiler loads a base address for data like that and
addresses additional data with indexes from that address, what you are
looking at is a structure -- i.e. the data is related -- though not
always, as this breaks down with stack/frame relationships and
variables local to a function...  A smart disassembler or decompiler
program would be able to see these indexed accesses and start grouping
the data, effectively creating your structure for you...
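
A very rough Python sketch of that grouping idea follows.  It is nowhere
near a real decompiler pass: it assumes label-free "mnemonic operands"
lines with decimal offsets, forgets the base whenever an instruction
that changes X is seen, and ignores branches and subroutine calls
entirely.  The names group_structs and CLOBBERS_X are made up for the
example.

```python
import re
from collections import defaultdict

LDX_IMM = re.compile(r"^ldx\s+#(\$?\w+)$", re.I)   # ldx #base
IND_X = re.compile(r"^\w+\s+(\d+),x$", re.I)       # op offset,x
# Instructions that change X, after which the base is unknown.
CLOBBERS_X = ("ldx", "inx", "dex", "abx", "pulx", "xgdx", "tsx",
              "idiv", "fdiv")


def group_structs(lines):
    """Map each base loaded via 'ldx #base' to the offsets used off X."""
    structs = defaultdict(set)
    base = None
    for line in lines:
        text = line.split(";", 1)[0].strip()
        if not text:
            continue
        m = LDX_IMM.match(text)
        if m:
            base = m.group(1)          # start tracking a new base
            continue
        m = IND_X.match(text)
        if m and base is not None:
            structs[base].add(int(m.group(1)))
        if text.split()[0].lower() in CLOBBERS_X:
            base = None                # X no longer points at the base
    return {name: sorted(offs) for name, offs in structs.items()}


code = ["LDX #Foo", "LDAA 0,X", "LDAB 1,X", "ADDA 2,X", "INX", "LDAA 0,X"]
print(group_structs(code))  # {'Foo': [0, 1, 2]}
```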

Here, why not create additional labels that are indexes into the data?
Like:

Foo  ...
Bar  ...
Zot  ...

FBZStruct  EQU  Foo
pFoo       EQU  0
pBar       EQU  1
pZot       EQU  2

Then in the code use:

    LDX #FBZStruct
    LDAA pFoo,X
    LDAB pBar,X
    ADDA pZot,X

This is very readable and would effectively be similar to a C struct:

struct {
    char Foo;
    char Bar;
    char Zot;
} FBZStruct;

This way, it works with any assembler and can be more easily
decompiled into a high level language such as C...
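
If the field list is known, the offset labels can be generated rather
than typed by hand.  Here is a small Python sketch, assuming each field
is described by a name and a size in bytes; equ_lines is a hypothetical
helper, not part of any existing tool.

```python
def equ_lines(struct_name, base_label, fields):
    """Build EQU lines for a structure.

    fields is a list of (name, size_in_bytes) in memory order; each
    field gets a 'p'-prefixed label holding its offset from the base.
    """
    out = [f"{struct_name:<10} EQU  {base_label}"]
    offset = 0
    for name, size in fields:
        out.append(f"{'p' + name:<10} EQU  {offset}")
        offset += size
    return out


for line in equ_lines("FBZStruct", "Foo",
                      [("Foo", 1), ("Bar", 1), ("Zot", 1)]):
    print(line)
```

With three one-byte fields this reproduces the four EQU lines shown
above; a two-byte field would simply push the following offsets up by
two.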

Just an idea...  I do something similar with the HC11 registers... You'll
often see GM use offset/index with the HC11 control registers (usually
located at 0x1000)...  This is why I have things like SCDR and pSCDR
in my HC11 header files that I distribute with my HC11 disassembler...

Donald Whisnant
dewhisna at ix.netcom.com