C/C++

A Simple Decompiler

By William May, June 01, 1988

Bill May returns to DDJ with a simple algorithm for recreating source code from tokens. The example source code is in C.

JUN88: A SIMPLE DECOMPILER

A SIMPLE DECOMPILER

Recreating source code without token resistance.

William May

William May works for MassMicro Systems. He is involved with Macintosh II video products and can be reached at 303A Ridgefield Circle, Clinton, MA 01510.

On the surface a decompiler sounds like a hackers' tool, a utility for snooping through code. Although programs such as Steve Jasik's MacNosy for the Macintosh are certainly useful, the decompiler I describe in this article has more prosaic, less imaginative uses. In my case its purpose was to speed up a product that included a spreadsheet-like component.

Fast recalculation speed is an important objective of a spreadsheet, and I felt this would best be achieved by converting the user's expressions into a form that could be computed efficiently---that is, a compiled form. Although the compIlation process is straightforward, a problem arises when the user wants to see and/or change an expression: The application must be able to recreate the original expression, or source code."

A simple way to satisfy this need would be to store both the source and the compiled expressions. This method is used in Smalltalk-80, which also provides interactive viewing and modification of compiled expressions. I felt, however, that with the limited syntax of spread sheet expressions, I should be able to recreate the source code from the compiled code itself.

The problem turned out to have been solved already, and the algorithm was described in an article by PJ. Brown.<fn1> As you will note from the publication date (1976), the algorithm was devised before the introduction of spreadsheets or of microcomputers. Brown developed the algorithm as a component of an interpreted Basic system for minicomputers, where it was used to decompile Basic statements into source form. Given its heritage, it is clear that the algorithm is capable of handling a far wider range of expressions than my spreadsheet is.

The algorithm has a few points that should make it of interest to DDJ readers. First, as I said, it can handle an extremely wide range of expressions, which allows it to find use in many applications. Second, the algorithm is driven by a table that is easy to set up and maintain. With its flexibility and simple setup, the algorithm is one of those tools that, once available, seems to find many unexpected uses. Finally, the subject of decompilers is worth exploring.

The remainder of this article covers my implementation of Brown's algorithm. The code, developed on a Macintosh in MPW C, should be portable to virtually any standard C compiler. In the remainder of this article I assume that readers are familiar with reverse Polish notation (RPN) and have some slight familiarity with compilers.

Brown's Algorithm

Example 1, page 51, shows some examples of using Brown's algorithm, with the object code on the left and the recreated source on the right. For presentation purposes, the object code uses printable characters and standard symbols, which would not usually be the case in an application. Besides handling the normal arithmetic and logical operators, Brown's algorithm can handle Lotus 1-2-3's @ functions, BASIC's Let...=...; and if...then...else... statements, and many others.

One important characteristic of Brown's algorithm is that it is forward scanning, meaning that it starts at the first byte of the RPN expression and works its way toward the end. The alternative, as you might guess, is backward scanning. Forward scanning has two advantages over backward scanning: the scanning algorithm can identity variable-length tokens, and the decompiler can emit (print) its results as it proceeds.

Backward scanning has the advantages of greater speed and better worst-case performance. A backward-scanning algorithm, however, produces its results in reverse, so a second pass is needed to unwind the results and this eliminates some of the speed advantage.

A more critical drawback of backward scanning is the difficulty in handling variable-length tokens. Because variable-length tokens are the norm, this is a serious problem. A spreadsheet, for example, will typically have tokens for operators (1 or 2 bytes) cell references (varying formats for absolute addresses, relative addresses, and so on), integers (4 bytes on a Macintosh), floating-point numbers (412 bytes depending on type and format), strings (many formats), and so on.

At some point programmers using a source-recreation algorithm must decide how faithful the recreation will be to the original text. A normal part of the compiling process will remove all "nonessential" elements, such as spaces, tabs, line feeds, comments, and so on.

There are two solutions to the question of recreation fidelity. The first option is to attempt to achieve strict fidelity, and the second is to recreate expressions in a canonical or standard, format. Note that the approaches are not mutually exclusive. Idiosyncratic formatting can be kept for some elements and ignored for others.

In my application---a spreadsheet---the scope for idiosyncratic formatting is limited. There are no lines, tabs, or comments; however, there is a problem with parentheses. The placement of parentheses in an expression can vary enormously from user to user. Yet this information is lost during the conversion from infix to postfix notation. Users would certainly be dissatisfied if no parentheses were recreated and would undoubtedly prefer that their original placement be maintained.

An approach to handling this difficulty is to compile information on the location of parentheses into the object code. An expression evaluator would ignore the parentheses, and Brown's algorithm would use the information to recreate their exact placement. By making a very small increase in object code size, the process of compiling and decompiling expressions becomes almost invisible to the spreadsheet user.

The Transformation Table

The core of Brown's algorithm is a transformation table that describes how operators in the input stream are to be converted back into source code. Example 2, this page, is an example of such a table. The example shows no particular "language" but illustrates several types of transformations that the algorithm can handle.

Column 1 of the table is an opcode, or identifier, symbol generated by the compiler-that is, the token the algorithm will see in the input stream. In the example I have used printable characters that correspond to their common usage-for example, + indicates addition. Generally the opcodes are encoded as 1- or 2-byte integers. Column 2 indicates the number of operands associated with the opcode. The examples show both unary and binary operators. The table could also include nonary operators (no operands) as well as more complex operators, such as if...then...else or @pv(pmt, int, term).

The following columns, 3 through 5, describe the strings that are to be emitted by the decompiler and their position relative to the operands. For example, + is a binary operator with no prefix__1, a prefix__2 (+), and no suffix. This means that + will apply to two operands, that there is no prefix to the first operand but there is a prefix to the second (+), and that there is no suffix. The expression op1 op2 + will thus be transformed into the string op1 + op2, as expected. Another example is =, which is a binary operator with a prefix__1 (let), a prefix__2 (=), and a suffix (;). In this case, op1 op2 = is transformed into let op1 = op2;, the classic Basic let statement.

Example 1: Source code recreation

            
                        Source Recreation                
                                                         
              Object code        Result                  
              abc++              a+b+c                   
              bcdfe+*/=          let b=c/d*f+e           
              qabc+(cd,,$        let g=sum(a*(b+c),c,d

Example 2: The transformation table

      
      typedef struct decomp-row  {                         
          char      ident                                  
          short      lex_type;                             
          char      *prefix_1;                             
          char      *prefix_2;                             
          char      *suffix;                               
      }  decomp_row;                                       
      static decomp_row table[] = {                        
          {'!', UNARY_OP,      "-",      NIL,      NIL},   
          {'+', BINARY_OP,     NIL,      "+",      NIL},   
          {'-', BINARY_OP,     NIL,      "-",      NIL},   
          {'/', BINARY_OP,     NIL,      "/",      NIL},   
          {'*', BINARY_OP,     NIL,      "*",      NIL},   
          {'^', BINARY_OP,     NIL,      "^",      NIL},   
          {'(', UNARY_OP,      "(",      NIL,      ")"},   
          {'$', UNARY_OP,      "sum(",   NIL,      "}"},   
          {'=', BINARY_OP,     "let ",   "=",      ";"},   
          {',', BINARY_OP,     NIL,      ",",      NIL},   
      };

Algorithm Overview

Probably the best way to get a feel for Brown's algorithm is to step through an example. The example 1 will follow is the expression xy!z/=, which will be transformed into the string let x = -y/z;. My example will use the table in Example 2 as its basis.

The first token found in the scan is the operand x. Upon seeing the operand, the algorithm scans forward to find any operators that give rise to a prefix for the operand. In this case the next token is y, which is also an operand, not an operator. A level count is incremented (it is now 1).

The following token, !, is a unary operator. According to a general rule, for an unary operator the level is decremented by n-1. Because ! is a unary operator, the level is decremented by 0 (no change). Another rule is then applied, which states that if the current level is n, push prefix__(-n + 1) for the operator. Because the level is 1, the algorithm should push prefix__0, which does not exist. Therefore nothing is pushed on the stack for this operator.

The scan continues, finding the operand z, which increases the level (now 2), and then the binary operator /. Applying the decrementing rule, the level is decremented to 1. Using the prefix rule, the algorithm is once again called on to push prefix__0, which still does not exist. Finally, the operator = is seen. This is defined as a binary operator, so the level is decremented once again, to become 0. This indicates that the algorithm should push the prefix__1 for the = operator, which is the string let.

Finally, upon reaching the end of the input string, the stack elements are popped and the associated strings emitted and then the operand itself-x. At this point the output string is let x

The scanning flow you have just observed is the basic pattern to Brown's algorithm. For each operand the algorithm scans ahead for operators that give rise to prefixes for the operand, pushing these prefixes onto the stack. When the forward scan is complete, the prefixes are popped from the stack and emitted and the scan proceeds to the next token.

The scan is now looking at the next operand-y. The same pattern is repeated. First, while the level is 0, the operator ! is seen. This results in pushing prefix__1 for !, which is a -. At the end of the string, the = is seen once again, and the prefix__2 (=) is pushed on the stack. Now the forward scan terminates, the stack elements are popped and emitted, and finally the operand itself---y---is emitted. The output string is now let x = -y. Note that the use of the stack results in printing prefixes in the reverse of the order in which they are seen.

Once again the scan restarts, now at the operator !. When an operator is seen, the algorithm prints its suffix. In this case there isn't one, so you proceed. Next in line is the operand z. The usual routine is followed, resulting in the prefix__2 for /being printed. The output string is now let x=-y/z. Because z is followed by the operator /, which has no suffix, you proceed. Finally, you encounter the operator =, which has the suffix ;, which is emitted. The input string has been entirely scanned, so the algorithm is complete, with the output string being let x=-y/z;.

Simple as this example is, it shows that the algorithm is inefficient for large volumes of data. The tail of the string may be scanned once for each operand in the string. This inefficiency, however, is not evident in an application such as a spreadsheet, where expressions are limited in size, or in a line-by-line inter prefer such as some forms of Basic.

Support Functions

The decompiler (see Listing One) uses several tools to provide supporting mechanics. These tools fall into three classes: input scanning, table handling, and stack handling.

Input scanning is done by the function advance__to_next__elem(). Scanning is much easier for a decompiler than it is for a compiler. Because it is scanning object code, not source, there is no need to handle white space, ends of lines, numeric conversions, and so on. The format of the input stream is much more predictable. Even so, a real scanner can get much more complicated than that shown here because it will need to identify a variety of operand formats, including integers, floating point, cell references, and so on.

Accessing the transformation table is the next important set of functions. The table-handling functions are responsible for searching the table and returning pointers to associated strings. I have used a very simple if inefficient design based on linear searches. A slightly more sophisticated scheme might sort the table and use binary searches.

Finally, a stack is needed to store the results of the forward scans performed for each operator. These results are then unwound when the scan is complete. The stack functions shown can push and pop pointer-size objects. Although simple, this stack implementation is quite sufficient for this algorithm.

Usage

An application program sees only one externally visible function: deparse(instr, outstr). The input string and output string cannot be the same. Some simple enhancements can be added if needed, such as a length check on the output string. Very little error checking should be needed in the decompiler. In theory at least, incorrectly formed expressions should have been caught by a compiler before they get to this stage. Bad things can happen, however, if an incorrectly formed expression is submitted for decompiling. In particular, if the RPN expression does not end with an operator, the algorithm will not terminate correctly, if it terminates at all.

In my implementation the transformation table is compiled into the program. Others, who desire more flexibility, could modify the code to pass a pointer to the table as one of the parameters, allowing use of multiple tables or making changes at run time.

Note

<fn1>P.J. Brown, "More on the Re-creation of Source Code from Reverse Polish," Software---Practice and Experience 7 (1976): 545-551.

[LISTING ONE]

<a name="011d_000d">
A Simple Decompiler
by William May



  1 /************************************************************************
  2     deparse.c
  3
  4     Decompiles an RPN expression into the original source code (or
  5     a reasonable facsimile). Based on an algorithm by P.J. Brown in
  6     'More on the Re-creation of Source Code from Reverse Polish'
  7     Software - Practice and Experience, vol 7 pgs 545-551
  8
  9     William May
 10     303 Ridgefield Circle
 11     Clinton, MA 01510
 12
 13  *************************************************************************/
 14
 15 #include <types.h>
 16 #include <ctype.h>
 17
 18 #define BUFSIZE     80
 19 #define NIL         (char *)0
 20 #define TRUE        1
 21 #define FALSE       0
 22
 23 enum {
 24     NONARY_OP   = 1,
 25     UNARY_OP    = 2,
 26     BINARY_OP   = 4
 27 } lex_types;
 28
 29 /*-------------------------------------------------------------------------
 30     This table describes the operators that the code re-creating
 31     algorithm will handle.  The identifier is the internal representation
 32     of an operator, the lexical type indicates the number of arguments to
 33     the operator, and the prefixes and suffix supply transformations
 34     of the operators.
 35   -------------------------------------------------------------------------*/
 36 typedef struct decomp_row {
 37     char    ident;
 38     short   lex_type;
 39     char    *prefix_1;
 40     char    *prefix_2;
 41     char    *suffix;
 42 } decomp_row;
 43
 44 /*--------------------------------------------------------------------------
 45     An example of the source recreation table setup.
 46     Note that in real life the idents need not be printable
 47     However, printable characters make the program much easier to test.
 48   --------------------------------------------------------------------------*/
 49 static decomp_row table[] ={
 50     {'!', UNARY_OP,  "-",    NIL,   NIL},
 51     {'+', BINARY_OP, NIL,    "+",   NIL},
 52     {'-', BINARY_OP, NIL,    "-",   NIL},
 53     {'/', BINARY_OP, NIL,    "/",   NIL},
 54     {'*', BINARY_OP, NIL,    "*",   NIL},
 55     {'^', BINARY_OP, NIL,    "^",   NIL},
 56     {'(', UNARY_OP,  "(",    NIL,   ")"},
 57     {'$', UNARY_OP,  "sum(", NIL,   ")"},
 58     {'=', BINARY_OP, "let ", "=",   NIL},
 59     {',', BINARY_OP, NIL,    ",",   NIL}
 60 };
 61
 62 #define TABLENUM (sizeof(table)/sizeof(decomp_row))
 63
 64 #ifdef DEBUG
 65 #include <stdio.h>
 66
 67 /*--------------------------------------------------------------------
 68     This is a simple test stub for the deparse function. It
 69     reads a line from stdin, displays the line and the decompiled
 70     result on stdout. Leave DEBUG undefined when compiling deparse as
 71     a library.
 72   --------------------------------------------------------------------*/
 73 main(argc, argv)
 74 int     argc;
 75 char    *argv[];
 76 {
 77     char    s[BUFSIZE]; /* input string     */
 78     char    t[BUFSIZE]; /* output string    */
 79     int     n;
 80     void    deparse();
 81
 82     printf("\n\nDecompiling test expressions:\n\n");
 83
 84     /* get a line. Note that buffer s still has linefeed in it */
 85     while (fgets(s, BUFSIZE, stdin)) {
 86         if (n = strlen(s)) {
 87             s[n-1] = '\0';
 88             printf("%s --> ", s);
 89             deparse(s, t);
 90             printf("%s\n", t);
 91         }
 92     }
 93
 94     exit(0);
 95 }
 96 #endif
 97
 98 /*------------------------------------------------------------------------
 99     deparse():  the only externally visible function, carries out deparsing
100                 an RPN expression.
101  ------------------------------------------------------------------------*/
102
103 void deparse(instr, outstr)
104 char *instr;
105 char *outstr;
106 {
107     char    *restart;   /* restart position after a forward scan            */
108     int     level;      /* number of operands passed during a forward scan  */
109     char    *p,         /* pointer to string to emit                        */
110             *prefix_1_for_elem(),
111             *prefix_2_for_elem(),
112             *suffix(),
113             *pop();
114     void    init_stack(),
115             push();
116
117     init_stack();
118
119     /* make sure outstr is terminated */
120     *outstr = '\0';
121
122     /* scan the input string */
123     while (*instr) {
124
125         /* look for the next operand */
126         while (elem_is_operator(*instr)) {
127             if ((p = suffix(*instr)) != NIL)
128                 strcat(outstr, p);
129
130             advance_to_next_elem(&instr);
131
132             if (!(*instr))
133                 return;             /* no more input, so quit */
134         }
135
136         /* found an operand, so scan forward for prefixes to this operand. */
137         level   = 0;
138         restart = instr;
139
140         while (level >= 0) {
141             /* get the next token */
142             advance_to_next_elem(&instr);
143
144             /* have we reached the end of the input? */
145             if (!*instr)
146                 break;
147
148             /*
149                 is the next token an operator or operand? If an operand then
150                 update count of intervening operands (the level). If an operator
151                 figure out if it results in a prefix to our operand (based on the level)
152             */
153             if (elem_is_operand(*instr))
154                 level++;
155             else {
156                 if (elem_is_binary_op(*instr))
157                     --level;
158
159                 if (level == 0)
160                     if ((p = prefix_1_for_elem(*instr)) != NIL)
161                         push(p);
162
163                 if (level == -1)
164                     if ((p = prefix_2_for_elem(*instr)) != NIL)
165                         push(p);
166
167                 /* ... can be extended for more complex operators */
168             }
169         }
170
171         /*
172             unwind results of the forward scan by popping them off
173             the stack.
174         */
175         while (p = pop())
176             strcat(outstr, p);
177
178         /*
179             forward scan complete, reset our scan point for another round
180         */
181         instr = restart;
182
183         /*
184             now emit the operand itself. Note that operands will often
185             need conversions and formatting in real life, i.e. integer ->
186             string or floating point -> string conversions,
187             currency/date/time formatting, etc. Here we just append the
188             raw operand to the output string.
189         */
190         strncat(outstr, instr, 1);
191
192         advance_to_next_elem(&instr);
193     }
194 }
195
196 /************************************************************************
197     table handling utilities
198  ************************************************************************/
199
200 /*------------------------------------------------------------------------
201     elem_is_operator(): figure out if code is an operator
202                         if so, then return true
203  ------------------------------------------------------------------------*/
204 static elem_is_operator(code)
205 char code;
206 {
207     decomp_row  *row, *find_op();
208
209     if (row = find_op(code))
210         return TRUE;
211     else
212         return FALSE;
213 }
214
215 /*------------------------------------------------------------------------
216     elem_is_operand():  figure out if code is an operand
217                         if so, then return true
218
219     In this example all operands are alphabetic or numeric characters.
220  ------------------------------------------------------------------------*/
221 static elem_is_operand(code)
222 char code;
223 {
224     if (isalnum(code))
225         return TRUE;
226     else
227         return FALSE;
228 }
229
230 /*------------------------------------------------------------------------
231     elem_is_binary_op():    figure out if code is a binary operator
232                         if so, then return true
233  ------------------------------------------------------------------------*/
234 static elem_is_binary_op(code)
235 char code;
236 {
237     decomp_row  *r, *find_op();
238
239     if (r = find_op(code))
240         return (r->lex_type & BINARY_OP);
241     else
242         return false;
243 }
244
245 /*------------------------------------------------------------------------
246     prefix_1_for_elem():    get prefix 1 for the operator
247  ------------------------------------------------------------------------*/
248 static char *prefix_1_for_elem(op)
249 char op;
250 {
251     decomp_row  *r, *find_op();
252
253     if (r = find_op(op))
254         return (r->prefix_1);
255     else
256         return NIL;
257 }
258
259 /*------------------------------------------------------------------------
260     prefix_2_for_elem():    get prefix 2 for the operator
261  ------------------------------------------------------------------------*/
262 static char *prefix_2_for_elem(op)
263 char op;
264 {
265     decomp_row  *r, *find_op();
266
267     if (r = find_op(op))
268         return (r->prefix_2);
269     else
270         return NIL;
271 }
272
273 /*------------------------------------------------------------------------
274     suffix():   get the suffix of given code
275  ------------------------------------------------------------------------*/
276 static char *suffix(code)
277 char code;
278 {
279     decomp_row *row, *find_op();
280
281     if (row = find_op(code))
282         return row->suffix;
283     else
284         return NIL;
285 }
286
287 /*------------------------------------------------------------------------
288     find_op():  finds the operator "op" in the decompiler table
289                 if found, it returns a pointer to the row
290                 if not, it returns NIL
291  ------------------------------------------------------------------------*/
292 decomp_row *find_op(op)
293 char op;
294 {
295     int         i;
296     decomp_row  *row;
297
298     row = table;
299     for (i = 0; i < TABLENUM; i++, row++)
300         if (op == row->ident)
301             return row;
302
303     return NIL; /* no hit */
304 }
305
306 /*------------------------------------------------------------------------
307     advance_to_next_elem(): advance to next element
308                             bump scan pointer as we move
309
310     the function assumes that all tokens are a singe byte.
311  ------------------------------------------------------------------------*/
312 static advance_to_next_elem(p)
313 char    **p;
314 {
315     (*p)++;
316 }
317
318 /*=======================================================================
319     a simple stack implementation
320   =======================================================================*/
321
322 /* stack size in bytes */
323 #define STACKSIZE   40
324
325 /*** a couple of globals for the stack ***/
326 static  char    **sp, **top;
327 static  char    stack[STACKSIZE];
328
329 /*------------------------------------------------------------------------
330     init_stack(): prepare the stack for use
331  ------------------------------------------------------------------------*/
332 static void init_stack()
333 {
334     top  = stack + STACKSIZE - 1;
335     sp   = top + 1;
336 }
337
338 /*------------------------------------------------------------------------
339     pop(): pop a pointer sized value from the stack
340  ------------------------------------------------------------------------*/
341 static char *pop()
342 {
343     if (sp <= top)
344         return(*sp++);
345     else
346         return NIL;
347 }
348
349 /*------------------------------------------------------------------------
350     push(): push a pointer sized value onto the stack
351  ------------------------------------------------------------------------*/
352 static void push(p)
353 char    *p;
354 {
355     *(--sp) = p;
356 }
357



  1 #   File:       deparse.make
  2 #   Target:     deparse
  3 #   Sources:    deparse.c Stubs.c
  4
  5 # set coptions for C compiler
  6 COptions    = -d DEBUG -g
  7
  8 deparse.c.o _ deparse.make deparse.c
  9     C {COptions} deparse.c
 10 Stubs.c.o _ deparse.make Stubs.c
 11     C {COptions} Stubs.c
 12 deparse __ deparse.make deparse.c.o Stubs.c.o
 13     Link -d -t MPST -c 'MPS ' _
 14         deparse.c.o _
 15         Stubs.c.o _
 16         "{CLibraries}"CRuntime.o _
 17         "{CLibraries}"StdCLib.o _
 18         "{CLibraries}"CInterface.o _
 19         -o deparse
 20




Example 1: Source Code Recreation

     Source Recreation

Object code      Result
abc++            a+b+c
bcdfe+*/=        let b=c/d*f+e
gabc+(*cd,,$     let g=sum(a*(b+c),c,d)


Example 2: Transformation Table


typedef struct decomp-row  {
   char    ident
   short    lex_type;
   char    *prefix_1;
   char    *prefix_2;
   char    *suffix;
} decomp_row;
static decomp_row table[] ={
   {`!', UNARY_OP,   "-",   NIL,  NIL},
   {`+', BINARY_OP,  NIL,   "+",  NIL},
   {`-', BINARY_OP,  NIL,   "-",  NIL},
   {`/', BINARY_OP,  NIL,   "/",  NIL},
   {`*', BINARY_OP,  NIL,   "*",  NIL},
   {`^'  BINARY_OP,  NIL,   "^",  NIL],
   {`(', UNARY-OP,   "(",   NIL,  ")"},
   {`$', UNARY_OP,   "sum(",NIL,  ")"},
   {`=', BINARY_OP,  "let ","=",  ";"},
   {`,', BINARY_OP,  NIL,   ",",  NIL}
};

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

C/C++

A Simple Decompiler

A SIMPLE DECOMPILER

Recreating source code without token resistance.

William May

Brown's Algorithm

The Transformation Table

Example 1: Source code recreation

Example 2: The transformation table

Algorithm Overview

Support Functions

Usage

Note

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

C/C++ Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

C/C++

A Simple Decompiler

A SIMPLE DECOMPILER

William May

Related Reading

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

C/C++ Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content