A SIMPLE DECOMPILER
Recreating source code without token resistance.
William May
William May works for MassMicro Systems. He is involved with Macintosh II video products and can be reached at 303A Ridgefield Circle, Clinton, MA 01510.
On the surface a decompiler sounds like a hackers' tool, a utility for snooping through code. Although programs such as Steve Jasik's MacNosy for the Macintosh are certainly useful, the decompiler I describe in this article has more prosaic, less imaginative uses. In my case its purpose was to speed up a product that included a spreadsheet-like component.
Fast recalculation speed is an important objective of a spreadsheet, and I felt this would best be achieved by converting the user's expressions into a form that could be computed efficiently---that is, a compiled form. Although the compIlation process is straightforward, a problem arises when the user wants to see and/or change an expression: The application must be able to recreate the original expression, or source code."
A simple way to satisfy this need would be to store both the source and the compiled expressions. This method is used in Smalltalk-80, which also provides interactive viewing and modification of compiled expressions. I felt, however, that with the limited syntax of spread sheet expressions, I should be able to recreate the source code from the compiled code itself.
The problem turned out to have been solved already, and the algorithm was described in an article by PJ. Brown.<fn1> As you will note from the publication date (1976), the algorithm was devised before the introduction of spreadsheets or of microcomputers. Brown developed the algorithm as a component of an interpreted Basic system for minicomputers, where it was used to decompile Basic statements into source form. Given its heritage, it is clear that the algorithm is capable of handling a far wider range of expressions than my spreadsheet is.
The algorithm has a few points that should make it of interest to DDJ readers. First, as I said, it can handle an extremely wide range of expressions, which allows it to find use in many applications. Second, the algorithm is driven by a table that is easy to set up and maintain. With its flexibility and simple setup, the algorithm is one of those tools that, once available, seems to find many unexpected uses. Finally, the subject of decompilers is worth exploring.
The remainder of this article covers my implementation of Brown's algorithm. The code, developed on a Macintosh in MPW C, should be portable to virtually any standard C compiler. In the remainder of this article I assume that readers are familiar with reverse Polish notation (RPN) and have some slight familiarity with compilers.
Brown's Algorithm
Example 1, page 51, shows some examples of using Brown's algorithm, with the object code on the left and the recreated source on the right. For presentation purposes, the object code uses printable characters and standard symbols, which would not usually be the case in an application. Besides handling the normal arithmetic and logical operators, Brown's algorithm can handle Lotus 1-2-3's @ functions, BASIC's Let...=...; and if...then...else... statements, and many others.
One important characteristic of Brown's algorithm is that it is forward scanning, meaning that it starts at the first byte of the RPN expression and works its way toward the end. The alternative, as you might guess, is backward scanning. Forward scanning has two advantages over backward scanning: the scanning algorithm can identity variable-length tokens, and the decompiler can emit (print) its results as it proceeds.
Backward scanning has the advantages of greater speed and better worst-case performance. A backward-scanning algorithm, however, produces its results in reverse, so a second pass is needed to unwind the results and this eliminates some of the speed advantage.
A more critical drawback of backward scanning is the difficulty in handling variable-length tokens. Because variable-length tokens are the norm, this is a serious problem. A spreadsheet, for example, will typically have tokens for operators (1 or 2 bytes) cell references (varying formats for absolute addresses, relative addresses, and so on), integers (4 bytes on a Macintosh), floating-point numbers (412 bytes depending on type and format), strings (many formats), and so on.
At some point programmers using a source-recreation algorithm must decide how faithful the recreation will be to the original text. A normal part of the compiling process will remove all "nonessential" elements, such as spaces, tabs, line feeds, comments, and so on.
There are two solutions to the question of recreation fidelity. The first option is to attempt to achieve strict fidelity, and the second is to recreate expressions in a canonical or standard, format. Note that the approaches are not mutually exclusive. Idiosyncratic formatting can be kept for some elements and ignored for others.
In my application---a spreadsheet---the scope for idiosyncratic formatting is limited. There are no lines, tabs, or comments; however, there is a problem with parentheses. The placement of parentheses in an expression can vary enormously from user to user. Yet this information is lost during the conversion from infix to postfix notation. Users would certainly be dissatisfied if no parentheses were recreated and would undoubtedly prefer that their original placement be maintained.
An approach to handling this difficulty is to compile information on the location of parentheses into the object code. An expression evaluator would ignore the parentheses, and Brown's algorithm would use the information to recreate their exact placement. By making a very small increase in object code size, the process of compiling and decompiling expressions becomes almost invisible to the spreadsheet user.
The Transformation Table
The core of Brown's algorithm is a transformation table that describes how operators in the input stream are to be converted back into source code. Example 2, this page, is an example of such a table. The example shows no particular "language" but illustrates several types of transformations that the algorithm can handle.
Column 1 of the table is an opcode, or identifier, symbol generated by the compiler-that is, the token the algorithm will see in the input stream. In the example I have used printable characters that correspond to their common usage-for example, + indicates addition. Generally the opcodes are encoded as 1- or 2-byte integers. Column 2 indicates the number of operands associated with the opcode. The examples show both unary and binary operators. The table could also include nonary operators (no operands) as well as more complex operators, such as if...then...else or @pv(pmt, int, term).
The following columns, 3 through 5, describe the strings that are to be emitted by the decompiler and their position relative to the operands. For example, + is a binary operator with no prefix__1, a prefix__2 (+), and no suffix. This means that + will apply to two operands, that there is no prefix to the first operand but there is a prefix to the second (+), and that there is no suffix. The expression op1 op2 + will thus be transformed into the string op1 + op2, as expected. Another example is =, which is a binary operator with a prefix__1 (let), a prefix__2 (=), and a suffix (;). In this case, op1 op2 = is transformed into let op1 = op2;, the classic Basic let statement.
Example 1: Source code recreation
Source Recreation Object code Result abc++ a+b+c bcdfe+*/= let b=c/d*f+e qabc+(cd,,$ let g=sum(a*(b+c),c,d
Example 2: The transformation table
typedef struct decomp-row { char ident short lex_type; char *prefix_1; char *prefix_2; char *suffix; } decomp_row; static decomp_row table[] = { {'!', UNARY_OP, "-", NIL, NIL}, {'+', BINARY_OP, NIL, "+", NIL}, {'-', BINARY_OP, NIL, "-", NIL}, {'/', BINARY_OP, NIL, "/", NIL}, {'*', BINARY_OP, NIL, "*", NIL}, {'^', BINARY_OP, NIL, "^", NIL}, {'(', UNARY_OP, "(", NIL, ")"}, {'$', UNARY_OP, "sum(", NIL, "}"}, {'=', BINARY_OP, "let ", "=", ";"}, {',', BINARY_OP, NIL, ",", NIL}, };
Algorithm Overview
Probably the best way to get a feel for Brown's algorithm is to step through an example. The example 1 will follow is the expression xy!z/=, which will be transformed into the string let x = -y/z;. My example will use the table in Example 2 as its basis.
The first token found in the scan is the operand x. Upon seeing the operand, the algorithm scans forward to find any operators that give rise to a prefix for the operand. In this case the next token is y, which is also an operand, not an operator. A level count is incremented (it is now 1).
The following token, !, is a unary operator. According to a general rule, for an unary operator the level is decremented by n-1. Because ! is a unary operator, the level is decremented by 0 (no change). Another rule is then applied, which states that if the current level is n, push prefix__(-n + 1) for the operator. Because the level is 1, the algorithm should push prefix__0, which does not exist. Therefore nothing is pushed on the stack for this operator.
The scan continues, finding the operand z, which increases the level (now 2), and then the binary operator /. Applying the decrementing rule, the level is decremented to 1. Using the prefix rule, the algorithm is once again called on to push prefix__0, which still does not exist. Finally, the operator = is seen. This is defined as a binary operator, so the level is decremented once again, to become 0. This indicates that the algorithm should push the prefix__1 for the = operator, which is the string let.
Finally, upon reaching the end of the input string, the stack elements are popped and the associated strings emitted and then the operand itself-x. At this point the output string is let x
The scanning flow you have just observed is the basic pattern to Brown's algorithm. For each operand the algorithm scans ahead for operators that give rise to prefixes for the operand, pushing these prefixes onto the stack. When the forward scan is complete, the prefixes are popped from the stack and emitted and the scan proceeds to the next token.
The scan is now looking at the next operand-y. The same pattern is repeated. First, while the level is 0, the operator ! is seen. This results in pushing prefix__1 for !, which is a -. At the end of the string, the = is seen once again, and the prefix__2 (=) is pushed on the stack. Now the forward scan terminates, the stack elements are popped and emitted, and finally the operand itself---y---is emitted. The output string is now let x = -y. Note that the use of the stack results in printing prefixes in the reverse of the order in which they are seen.
Once again the scan restarts, now at the operator !. When an operator is seen, the algorithm prints its suffix. In this case there isn't one, so you proceed. Next in line is the operand z. The usual routine is followed, resulting in the prefix__2 for /being printed. The output string is now let x=-y/z. Because z is followed by the operator /, which has no suffix, you proceed. Finally, you encounter the operator =, which has the suffix ;, which is emitted. The input string has been entirely scanned, so the algorithm is complete, with the output string being let x=-y/z;.
Simple as this example is, it shows that the algorithm is inefficient for large volumes of data. The tail of the string may be scanned once for each operand in the string. This inefficiency, however, is not evident in an application such as a spreadsheet, where expressions are limited in size, or in a line-by-line inter prefer such as some forms of Basic.
Support Functions
The decompiler (see Listing One) uses several tools to provide supporting mechanics. These tools fall into three classes: input scanning, table handling, and stack handling.
Input scanning is done by the function advance__to_next__elem(). Scanning is much easier for a decompiler than it is for a compiler. Because it is scanning object code, not source, there is no need to handle white space, ends of lines, numeric conversions, and so on. The format of the input stream is much more predictable. Even so, a real scanner can get much more complicated than that shown here because it will need to identify a variety of operand formats, including integers, floating point, cell references, and so on.
Accessing the transformation table is the next important set of functions. The table-handling functions are responsible for searching the table and returning pointers to associated strings. I have used a very simple if inefficient design based on linear searches. A slightly more sophisticated scheme might sort the table and use binary searches.
Finally, a stack is needed to store the results of the forward scans performed for each operator. These results are then unwound when the scan is complete. The stack functions shown can push and pop pointer-size objects. Although simple, this stack implementation is quite sufficient for this algorithm.
Usage
An application program sees only one externally visible function: deparse(instr, outstr). The input string and output string cannot be the same. Some simple enhancements can be added if needed, such as a length check on the output string. Very little error checking should be needed in the decompiler. In theory at least, incorrectly formed expressions should have been caught by a compiler before they get to this stage. Bad things can happen, however, if an incorrectly formed expression is submitted for decompiling. In particular, if the RPN expression does not end with an operator, the algorithm will not terminate correctly, if it terminates at all.
In my implementation the transformation table is compiled into the program. Others, who desire more flexibility, could modify the code to pass a pointer to the table as one of the parameters, allowing use of multiple tables or making changes at run time.
Note
<fn1>P.J. Brown, "More on the Re-creation of Source Code from Reverse Polish," Software---Practice and Experience 7 (1976): 545-551.
[LISTING ONE]<a name="011d_000d">
A Simple Decompiler
by William May
1 /************************************************************************
2 deparse.c
3
4 Decompiles an RPN expression into the original source code (or
5 a reasonable facsimile). Based on an algorithm by P.J. Brown in
6 'More on the Re-creation of Source Code from Reverse Polish'
7 Software - Practice and Experience, vol 7 pgs 545-551
8
9 William May
10 303 Ridgefield Circle
11 Clinton, MA 01510
12
13 *************************************************************************/
14
15 #include <types.h>
16 #include <ctype.h>
17
18 #define BUFSIZE 80
19 #define NIL (char *)0
20 #define TRUE 1
21 #define FALSE 0
22
23 enum {
24 NONARY_OP = 1,
25 UNARY_OP = 2,
26 BINARY_OP = 4
27 } lex_types;
28
29 /*-------------------------------------------------------------------------
30 This table describes the operators that the code re-creating
31 algorithm will handle. The identifier is the internal representation
32 of an operator, the lexical type indicates the number of arguments to
33 the operator, and the prefixes and suffix supply transformations
34 of the operators.
35 -------------------------------------------------------------------------*/
36 typedef struct decomp_row {
37 char ident;
38 short lex_type;
39 char *prefix_1;
40 char *prefix_2;
41 char *suffix;
42 } decomp_row;
43
44 /*--------------------------------------------------------------------------
45 An example of the source recreation table setup.
46 Note that in real life the idents need not be printable
47 However, printable characters make the program much easier to test.
48 --------------------------------------------------------------------------*/
49 static decomp_row table[] ={
50 {'!', UNARY_OP, "-", NIL, NIL},
51 {'+', BINARY_OP, NIL, "+", NIL},
52 {'-', BINARY_OP, NIL, "-", NIL},
53 {'/', BINARY_OP, NIL, "/", NIL},
54 {'*', BINARY_OP, NIL, "*", NIL},
55 {'^', BINARY_OP, NIL, "^", NIL},
56 {'(', UNARY_OP, "(", NIL, ")"},
57 {'$', UNARY_OP, "sum(", NIL, ")"},
58 {'=', BINARY_OP, "let ", "=", NIL},
59 {',', BINARY_OP, NIL, ",", NIL}
60 };
61
62 #define TABLENUM (sizeof(table)/sizeof(decomp_row))
63
64 #ifdef DEBUG
65 #include <stdio.h>
66
67 /*--------------------------------------------------------------------
68 This is a simple test stub for the deparse function. It
69 reads a line from stdin, displays the line and the decompiled
70 result on stdout. Leave DEBUG undefined when compiling deparse as
71 a library.
72 --------------------------------------------------------------------*/
73 main(argc, argv)
74 int argc;
75 char *argv[];
76 {
77 char s[BUFSIZE]; /* input string */
78 char t[BUFSIZE]; /* output string */
79 int n;
80 void deparse();
81
82 printf("\n\nDecompiling test expressions:\n\n");
83
84 /* get a line. Note that buffer s still has linefeed in it */
85 while (fgets(s, BUFSIZE, stdin)) {
86 if (n = strlen(s)) {
87 s[n-1] = '\0';
88 printf("%s --> ", s);
89 deparse(s, t);
90 printf("%s\n", t);
91 }
92 }
93
94 exit(0);
95 }
96 #endif
97
98 /*------------------------------------------------------------------------
99 deparse(): the only externally visible function, carries out deparsing
100 an RPN expression.
101 ------------------------------------------------------------------------*/
102
103 void deparse(instr, outstr)
104 char *instr;
105 char *outstr;
106 {
107 char *restart; /* restart position after a forward scan */
108 int level; /* number of operands passed during a forward scan */
109 char *p, /* pointer to string to emit */
110 *prefix_1_for_elem(),
111 *prefix_2_for_elem(),
112 *suffix(),
113 *pop();
114 void init_stack(),
115 push();
116
117 init_stack();
118
119 /* make sure outstr is terminated */
120 *outstr = '\0';
121
122 /* scan the input string */
123 while (*instr) {
124
125 /* look for the next operand */
126 while (elem_is_operator(*instr)) {
127 if ((p = suffix(*instr)) != NIL)
128 strcat(outstr, p);
129
130 advance_to_next_elem(&instr);
131
132 if (!(*instr))
133 return; /* no more input, so quit */
134 }
135
136 /* found an operand, so scan forward for prefixes to this operand. */
137 level = 0;
138 restart = instr;
139
140 while (level >= 0) {
141 /* get the next token */
142 advance_to_next_elem(&instr);
143
144 /* have we reached the end of the input? */
145 if (!*instr)
146 break;
147
148 /*
149 is the next token an operator or operand? If an operand then
150 update count of intervening operands (the level). If an operator
151 figure out if it results in a prefix to our operand (based on the level)
152 */
153 if (elem_is_operand(*instr))
154 level++;
155 else {
156 if (elem_is_binary_op(*instr))
157 --level;
158
159 if (level == 0)
160 if ((p = prefix_1_for_elem(*instr)) != NIL)
161 push(p);
162
163 if (level == -1)
164 if ((p = prefix_2_for_elem(*instr)) != NIL)
165 push(p);
166
167 /* ... can be extended for more complex operators */
168 }
169 }
170
171 /*
172 unwind results of the forward scan by popping them off
173 the stack.
174 */
175 while (p = pop())
176 strcat(outstr, p);
177
178 /*
179 forward scan complete, reset our scan point for another round
180 */
181 instr = restart;
182
183 /*
184 now emit the operand itself. Note that operands will often
185 need conversions and formatting in real life, i.e. integer ->
186 string or floating point -> string conversions,
187 currency/date/time formatting, etc. Here we just append the
188 raw operand to the output string.
189 */
190 strncat(outstr, instr, 1);
191
192 advance_to_next_elem(&instr);
193 }
194 }
195
196 /************************************************************************
197 table handling utilities
198 ************************************************************************/
199
200 /*------------------------------------------------------------------------
201 elem_is_operator(): figure out if code is an operator
202 if so, then return true
203 ------------------------------------------------------------------------*/
204 static elem_is_operator(code)
205 char code;
206 {
207 decomp_row *row, *find_op();
208
209 if (row = find_op(code))
210 return TRUE;
211 else
212 return FALSE;
213 }
214
215 /*------------------------------------------------------------------------
216 elem_is_operand(): figure out if code is an operand
217 if so, then return true
218
219 In this example all operands are alphabetic or numeric characters.
220 ------------------------------------------------------------------------*/
221 static elem_is_operand(code)
222 char code;
223 {
224 if (isalnum(code))
225 return TRUE;
226 else
227 return FALSE;
228 }
229
230 /*------------------------------------------------------------------------
231 elem_is_binary_op(): figure out if code is a binary operator
232 if so, then return true
233 ------------------------------------------------------------------------*/
234 static elem_is_binary_op(code)
235 char code;
236 {
237 decomp_row *r, *find_op();
238
239 if (r = find_op(code))
240 return (r->lex_type & BINARY_OP);
241 else
242 return false;
243 }
244
245 /*------------------------------------------------------------------------
246 prefix_1_for_elem(): get prefix 1 for the operator
247 ------------------------------------------------------------------------*/
248 static char *prefix_1_for_elem(op)
249 char op;
250 {
251 decomp_row *r, *find_op();
252
253 if (r = find_op(op))
254 return (r->prefix_1);
255 else
256 return NIL;
257 }
258
259 /*------------------------------------------------------------------------
260 prefix_2_for_elem(): get prefix 2 for the operator
261 ------------------------------------------------------------------------*/
262 static char *prefix_2_for_elem(op)
263 char op;
264 {
265 decomp_row *r, *find_op();
266
267 if (r = find_op(op))
268 return (r->prefix_2);
269 else
270 return NIL;
271 }
272
273 /*------------------------------------------------------------------------
274 suffix(): get the suffix of given code
275 ------------------------------------------------------------------------*/
276 static char *suffix(code)
277 char code;
278 {
279 decomp_row *row, *find_op();
280
281 if (row = find_op(code))
282 return row->suffix;
283 else
284 return NIL;
285 }
286
287 /*------------------------------------------------------------------------
288 find_op(): finds the operator "op" in the decompiler table
289 if found, it returns a pointer to the row
290 if not, it returns NIL
291 ------------------------------------------------------------------------*/
292 decomp_row *find_op(op)
293 char op;
294 {
295 int i;
296 decomp_row *row;
297
298 row = table;
299 for (i = 0; i < TABLENUM; i++, row++)
300 if (op == row->ident)
301 return row;
302
303 return NIL; /* no hit */
304 }
305
306 /*------------------------------------------------------------------------
307 advance_to_next_elem(): advance to next element
308 bump scan pointer as we move
309
310 the function assumes that all tokens are a singe byte.
311 ------------------------------------------------------------------------*/
312 static advance_to_next_elem(p)
313 char **p;
314 {
315 (*p)++;
316 }
317
318 /*=======================================================================
319 a simple stack implementation
320 =======================================================================*/
321
322 /* stack size in bytes */
323 #define STACKSIZE 40
324
325 /*** a couple of globals for the stack ***/
326 static char **sp, **top;
327 static char stack[STACKSIZE];
328
329 /*------------------------------------------------------------------------
330 init_stack(): prepare the stack for use
331 ------------------------------------------------------------------------*/
332 static void init_stack()
333 {
334 top = stack + STACKSIZE - 1;
335 sp = top + 1;
336 }
337
338 /*------------------------------------------------------------------------
339 pop(): pop a pointer sized value from the stack
340 ------------------------------------------------------------------------*/
341 static char *pop()
342 {
343 if (sp <= top)
344 return(*sp++);
345 else
346 return NIL;
347 }
348
349 /*------------------------------------------------------------------------
350 push(): push a pointer sized value onto the stack
351 ------------------------------------------------------------------------*/
352 static void push(p)
353 char *p;
354 {
355 *(--sp) = p;
356 }
357
1 # File: deparse.make
2 # Target: deparse
3 # Sources: deparse.c Stubs.c
4
5 # set coptions for C compiler
6 COptions = -d DEBUG -g
7
8 deparse.c.o _ deparse.make deparse.c
9 C {COptions} deparse.c
10 Stubs.c.o _ deparse.make Stubs.c
11 C {COptions} Stubs.c
12 deparse __ deparse.make deparse.c.o Stubs.c.o
13 Link -d -t MPST -c 'MPS ' _
14 deparse.c.o _
15 Stubs.c.o _
16 "{CLibraries}"CRuntime.o _
17 "{CLibraries}"StdCLib.o _
18 "{CLibraries}"CInterface.o _
19 -o deparse
20
Example 1: Source Code Recreation
Source Recreation
Object code Result
abc++ a+b+c
bcdfe+*/= let b=c/d*f+e
gabc+(*cd,,$ let g=sum(a*(b+c),c,d)
Example 2: Transformation Table
typedef struct decomp-row {
char ident
short lex_type;
char *prefix_1;
char *prefix_2;
char *suffix;
} decomp_row;
static decomp_row table[] ={
{`!', UNARY_OP, "-", NIL, NIL},
{`+', BINARY_OP, NIL, "+", NIL},
{`-', BINARY_OP, NIL, "-", NIL},
{`/', BINARY_OP, NIL, "/", NIL},
{`*', BINARY_OP, NIL, "*", NIL},
{`^' BINARY_OP, NIL, "^", NIL],
{`(', UNARY-OP, "(", NIL, ")"},
{`$', UNARY_OP, "sum(",NIL, ")"},
{`=', BINARY_OP, "let ","=", ";"},
{`,', BINARY_OP, NIL, ",", NIL}
};