Reading Disassemble Output
The Common Lisp function DISASSEMBLE
produces an
implementation-specific description of the internal representation of
a function. Its output is very useful when you want to improve the
performance of your code, since it gives you an idea of what the
machine will do.
Disassembly of Byte-Compiled Code
For example, if you use CMUCL's byte-code compiler, the DISASSEMBLE
function will show you the bytecodes generated for your function:
CL-USER> (defun foo (x y) (logand (logior x y) (logxor x y))) CL-USER> (let ((c::*byte-compile* t)) (compile 'foo)) Byte Compiling lambda (x y): CL-USER> (disassemble 'foo) 0: 00 Entry point, frame-size=0 1: 20 push-const #<FDEFINITION object for kernel:two-arg-and> 2: 21 push-const #<FDEFINITION object for kernel:two-arg-ior> 3: 11 push-arg 1 4: 10 push-arg 0 5: 8A named-call, 2 args 6: 22 push-const #<FDEFINITION object for kernel:two-arg-xor> 7: 11 push-arg 1 8: 10 push-arg 0 9: 8A named-call, 2 args 10: 9A named-tail-call, 2 args 11: 00 push-local 0 12: 00 push-local 0 13: 00 push-local 0 14: 00 push-local 0 15: 00 push-local 0
This is fairly easy to interpret, once you know that the byte-code
interpreter is a stack machine. The references of the internal
TWO-ARG-AND
and TWO-ARG-IOR
functions are pushed onto the
stack, the arguments Y
then X
are pushed onto the stack,
the TWO-ARG-IOR
call is executed (leaving the result on the
stack), the reference of the function TWO-ARG-XOR
is pushed onto
the stack, the arguments are pushed again, the TWO-ARG-XOR
is
executed, and finally the TWO-ARG-AND
is executed in a
tail-call.
Disassembly of Native-Compiled Code
You are more likely to be interested in disassembling functions that have been compiled to native code. Understanding disassembly of native code is easier if you understand what certain common sequences of instructions are doing.
The following function illustrates CMUCL's ability to use unboxed
arithmetic on values of type (unsigned-byte 32)
and
(signed-byte 32)
. It can also do similar things with floating point values. This
significantly improves performance by reducing consing, and by
allowing the compiler to emit instructions that operate directly on
machine word sized values.
This optimization is easiest to implement inside a single function. It
is more difficult to apply when making a function call, since it is
necessary to ensure that the called function is expecting to receive
unboxed arguments, rather than the normal calling convention. CMUCL is
able to perform this type of optimization across function call
boundaries when functions are declared to be inline, with locally
defined functions (using FLET
or LABELS
), and when block
compilation is used. See the CMUCL User's Manual for more information
on this.
(defun foo (x y) (declare (optimize (speed 3) (space 0) (safety 0) (debug 0)) (type (unsigned-byte 32) x y)) (logand (logior x y) (logxor x y))) USER> (compile 'foo) Compiling LAMBDA (X Y): In: LAMBDA (X Y) #'(LAMBDA (X Y) (DECLARE (OPTIMIZE # # # #) (TYPE # X Y)) (BLOCK FOO (LOGAND # #))) Note: Doing unsigned word to integer coercion (cost 20) to "<return value>". Compilation unit finished. 1 note
The x86 dissassembly of the function follows below. (You can also look
at the SparcDisassemblyExample which is somewhat easier to read.) The first line is
the entry point of the function, at address 0x481E0770. This is
followed by a standard function prologue, which checks the number of
arguments and their types, and unboxes them (the SAR operations at
labels L0 and L2). Starting at label L3 are the logical operations
(OR, XOR, AND), which operate directly on the 32 bit values. Since we
must box/tag the return value for functions that follow the normal
calling convention, we test whether any of the upper 3 bits of the
result are set (the TEST ECX, 3758096384
operation, where the magic
number is just #xE0000000). If that's true, we create a boxed bignum
representation (sequence at label L5). Otherwise we can just tag the
result as a fixnum with a two-bit 0 tag (and a leading 0 sign bit, of
course), using LEA EDX, [ECX*4].
481E0770: .ENTRY FOO() ; FUNCTION 788: POP DWORD PTR [EBP-8] 78B: LEA ESP, [EBP-32] 78E: MOV EAX, EDX 790: TEST AL, 3 792: JEQ L0 794: MOV ECX, [EAX-3] 797: JMP L1 799: L0: SAR EAX, 2 79C: MOV ECX, EAX 79E: L1: MOV EAX, EDI 7A0: TEST AL, 3 7A2: JEQ L2 7A4: MOV EAX, [EAX-3] 7A7: JMP L3 7A9: L2: SAR EAX, 2 7AC: L3: MOV EDX, ECX ; No-arg-parsing entry point 7AE: OR EDX, EAX 7B0: MOV EBX, ECX 7B2: XOR EBX, EAX 7B4: MOV ECX, EDX 7B6: AND ECX, EBX 7B8: TEST ECX, 3758096384 7BE: JNE L5 7C0: LEA EDX, [ECX*4] 7C7: L4: MOV ECX, [EBP-8] 7CA: MOV EAX, [EBP-4] 7CD: ADD ECX, 2 7D0: MOV ESP, EBP 7D2: MOV EBP, EAX 7D4: JMP ECX 7D6: NOP 7D7: NOP 7D8: L5: JNS L6 7DA: MOV EDX, 522 7DF: JMP L7 7E1: L6: MOV EDX, 266 7E6: L7: MOV BYTE PTR [#x280001D4], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-INTERRUPTED* 7ED: MOV BYTE PTR [#x280001BC], 4 ; COMMON-LISP::*PSEUDO-ATOMIC-ATOMIC* 7F4: MOV EAX, 16 7F9: ADD EAX, [#x8069724] ; current_region_free_pointer 7FF: CMP EAX, [#x80696F8] ; current_region_end_addr 805: JBE L8 807: CALL #x80536F8 ; alloc_overflow_eax 80C: L8: XCHG EAX, [#x8069724] ; current_region_free_pointer 812: MOV [EAX], EDX 814: LEA EDX, [EAX+7] 817: MOV [EDX-3], ECX 81A: MOV BYTE PTR [#x280001BC], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-ATOMIC* 821: CMP BYTE PTR [#x280001D4], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-INTERRUPTED* 828: JEQ L9 82A: BREAK 9 ; Pending interrupt trap 82C: L9: JMP L4
The code at label L4 is the normal function epilogue. FIXME explain.
The code at label L5 checks first whether the most-significant bit is set. If it is, we have to alloc two words for the bignum, since a full 32 bits, plust the now required sign bit will not fit inside a single machine word. If we only have 31 bits worth of integer, we can fit it and the leading sign bit inside a single machine word. Since we also need a word for the header, we thus allocate either 2 or 4 words of storage, (the heap is 8 byte aligned).
The code at labels L7 to L9 is for the gencgc allocator to allocate the necessary space for the result, taking care to make sure the allocation is done atomically. The code is not truly atomic, but care is taken so that nothing else can interrupt these operations until the memory has been allocated.
While this example doesn't show it, CMUCL can handle unboxed values on the stack, so even when intermediate or argument values spill from registers to the stack, they can remain unboxed. On x86 platforms, this is handled using a conversative garbage collector. On other platforms, this is done by using separate unboxed stacks, on platforms that segregate unboxed and boxed values in stacks and the register file.
Also in some situations the compiler can generate fast calls to the no-arg-parsing entry point, that bypass the checking of the number of arguments and their types, assuming that code has been compiled with low safety and high speed. We can see this if we compile and load the a file containing
(declaim (ext:start-block foo bar)) (defun foo (x y) (declare (optimize (speed 3) (space 0) (safety 0) (debug 0)) (type (unsigned-byte 32) x y)) (logand (logior x y) (logxor x y))) (defun bar (x y) (declare (optimize (speed 3) (space 0) (safety 0) (debug 0)) (type (unsigned-byte 32) x y)) (logand x (foo x y)))
The important part is the ext:start-block
which tells the
compiler to block-compile the following code. In essence this
compiles foo
and bar
as if they were labels
functions. (We refer the reader to the CMUCL User's Manual for more
information on this.) If the disassembly of bar
and foo
are examined, you can see that bar
branches directly to the
no-arg-entry point of foo
.
Acknowledgment: This description is based on a USENET article
<m2fzxapiug.fsf@ibook.bln.pmsf.net>
on comp.lang.lisp
,
dated 2002-08-20, by Pierre Mai.