Andrew is a Software Litigation Consultant who works on the technical aspects of cases involving copyright, patents, trade secrets, antitrust, and privacy. He can be contacted at [email protected] or http://www.undoc.com/.
A well-manicured finger was placed in a bowl of chili. That the finger was only lightly cooked indicated that it was placed rather than found there. To find whose finger it was, the best technique was to ask around, but the FBI also has a database of about 45 million sets of fingerprints. To use that database to find whose finger it was, officials couldn't look it up by, say, a name or social security number; that was precisely the information they didn't have and wanted to find. The one thing they did have, the fingerprint itself, could be used as an index (so to speak) into the database. Because of the vagaries of taking fingerprints, similar rather than identical matching must be used.
What if you had a photograph of a building and wanted to find the building's name or address? It's straightforward to implement a database of photographs of buildings, indexed by the address or geographical location of the building depicted. Without suggesting that implementing such sites is trivial, the Paris photographic yellow pages (http://photos.pagesjaunes.fr/), the satellite-photos feature of Google Maps (http://maps.google.com/), and the "Block View" yellow pages at Amazon (http://yp.a9.com/), all show the feasibility, given an address, of retrieving photos from large databases. But to go in the other direction, from the photo to the address, all possible different photos of the Eiffel Tower, for example, should map to the same database index; that requires some way of determining each photograph's "essence," its salient features, and using those features as an index into the database (http://cs.gmu.edu/~zduric/WebPages/Papers/MIS-1999.pdf).
Or consider music: Sure, you can index a database by the so-called artist's name or the composition's title. That, in essence, is Napster or Kazaa or Morpheus or Grokster. But what if, like the recording industry, you want Napster to identify music, based not on a name (which users will happily misspell to evade name-based copyright-infringement detection), but on the music itself, using something like an MD5 signature of the MP3 file? The well-known Gracenote CDDB system uses the number and lengths of tracks on the CD to form the database index (U.S. patent 6,061,680).
One problem is that the same song may generate many different MD5 signatures; Metallica listed over 470,000 distinct checksums for fewer than 100 works. As with the database indexed by photographs, you want to extract only the music's salient features (if any), such that submitting an audio clip, or some representation thereof, yields the music's identity. The extensive literature on this subject (such as http://ismir2002.ismir.net/proceedings/02-FP04-2.pdf) calls this "audio fingerprinting," "perceptual hashing," "query by humming," or simply "name that tune."
Finding Salient Binary Code Features
This is not an article about photographic or audio or fingerprint databases. It's about a database of functions: pieces of binary code. But creating a database of every function in Windows XP, for example, raises the same basic issue as those other types of content-based retrieval: indexing a database, not by the name of a thing or the file in which it happens to reside, but by the thing itself, or some reasonable representation thereof. Even though Microsoft provides PDB debug symbol files for most of Windows (an under-appreciated form of semisource code), and even though those PDB files contain names for internal functions, most uses of the function database require comparing actual code, not just names. We want the code itself, its insides, not a representation of its outward appearance, such as a C++ mangled/decorated name. This is like the Napster problem.
As I explained last month, the goal here is a database of Win32 code, such that submitting a piece of code yields a list of every other occurrence of that code in the system, hopefully with one or more names for the code. As with fingerprints and photographs and audio, the code representations must be insensitive to small changes. As was shown at the end of Part I of this series, such code changes include the address at which the code resides; they also include the code's register usage. We need a database, not of raw chunks of bytes, but of something cooked down: the code's salient features, its shape or edges, as it were. Just as a fingerprint database is indexed, not by actual fingerprints, but generally by minutiae, likewise a database of function "fingerprints" requires some minutiae that uniquely describe a piece of code, without being that piece of code itself.
If code were pornography, you just want the parts it would fall open to. These salient code features will be placed into what I'm calling an "opstring." From this opstring, you'll generate an MD5 digest. With these MD5 digests of opstrings, you can do large-scale comparisons of binary code. This is reverse engineering, but at a higher level than poring over disassembly listings.
Part I talked about some uses for this method, like measuring the uniqueness or self-similarity of a software product, possibly to detect copyright infringement (or to defend against an infringement claim), and for finding candidates for refactoring. Code that is cloned many times should perhaps be moved into a shared library such as a Windows DLL. Another use is code compression, especially in embedded systems (according to Microsoft, XP Embedded uses the same binaries as desktop XP, so code duplication inside XP might concern an XP Embedded developer).
The Good Parts
Figures 1 and 2 are disassembly listings produced with DumpPE (http://www.tbcnet.com/~clive/dumppe.htm). The second column shows the actual bytes of code for a function. The exact same source code lies behind the code in Figures 1 and 2, but as the bold portions show, the code bytes differ slightly; this is because this same piece of code is located at different addresses in two different files. Any function database needs to treat Figures 1 and 2 as the same piece of code, despite the small variation.
Figures 1 and 2 here are identical to Figures 7 and 8 in Part I, but with one exception: a vertical stripe sliced through the code. What this slice represents is the function's opcode bytes. All operand bytes have been ignored. For example, from the 6A01 instruction (push 1), the 6A (the opcode for a push-immediate-byte instruction) is included in the slice, but 01 (the immediate value itself) is excluded. Likewise, the 2-byte opcode FF15 (call dword ptr) is part of the slice, but the dword ptr itself (94404000 in Figure 1 and 24A24000 in Figure 2) has been excluded.
Now, take the vertical slice and turn it 90 degrees on its side:
55,8B,51,6A,68,8B,50,6A,FF15,89,83,75,8B,5D,C3
The UNIX "cut" utility could extract these bytes from a disassembly listing. But turning the vertical cut into a horizontal string requires a "horizontalize" or "stringify" or "cut2string" program that takes specifications of file delimiters and of fields or columns or interest, and adds those fields or columns from each line onto a string, until the next file delimiter is reached, at which point the string is output. The "opstring" program I describe here and in Part III is an implementation of this idea, specific to producing opstrings from DumpPE disassembly listings.
If strings are constructed from only the vertical stripes of highlighted invariant opcode bytes in Figures 1 and 2, the resulting opstrings are portable; that is, independent of the location at which the function happens to reside inside a file, and hence comparable between files. Digests generated from them match, reflecting the identical nature of the C source code used to create them (together with the identical nature of the compiler and compiler settings, an important point to which I'll return).
Some operand bytes discarded here are position-independent: as noted, 6A has been used, instead of 6A01, even though the corresponding code, push 1, would not change regardless of where it appeared in a program. By ignoring operands, an opstring program can be implemented without full knowledge of the instruction set. More important, ignoring most (not all) operands makes opstrings mildly resistant to minor code variations, so that two functions whose only difference is push 2 versus push 1 generate the same opstring.
Here, "same" means not "whose source code is completely identical," but instead something like: "Any differences could easily be parameterized, without otherwise affecting the code." For example, strupr and strlwr in the C Runtime Library would be considered the same, if they differ only in the character range they modify. The same thing would happen with some implementations of memcpy and memmove. "Procedural abstraction" is a rich area of academic research (for example, see Johnson and Mycroft on "Pattern and Approximate-Pattern Matching for Program Compaction" at http://www.milton.arachsys.com/nj71/research/ papers/patternsurvey.pdf).
The final implementation of opstrings is more sophisticated than what has just been described, but a fair question at this point is whether the scant opcodes, sans operand, of a function are really sufficient to characterize a function. How do you know you aren't losing some essential aspect of a function in the course of boiling it down to a string of opcodes, such that trying to match against each other these reduced opstrings (or their MD5 digests) results in unacceptably frequent false positives? For now, simply note that a search through all of the Windows XP code (in this particular configuration, 304 MB residing in over 1900 DLLs and other Win32 PE files) did not locate any functions with this opstring.
We are, in effect, "normalizing" the code. There are other ways to do such noise reduction. A disassembly listing can be postprocessed so that all matches for the regular expressions /(e?(([abcd][x])|([sd]i)|([bs]p)))|([abcd][lh])/ are replaced with "REG," /([0-9A-F]{2,}h)|[0-9]{1}/ are replaced with "NUM," and so on. Brenda Baker's "dup" tool, based on her concept of parameterized pattern matching, finds clones from Java bytecodes in part by replacing identifiers with offsets "to remove the identity of the identifier while preserving the distinctness of different identifiers" (http://www.usenix.org/publications/library/proceedings/usenix98/full_papers/baker/baker.pdf).
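The two substitutions above can be applied in a few lines of Python. The patterns are taken verbatim from the text; note they are deliberately loose (for instance, the single-digit rule turns "10" into "NUMNUM"), which is in the spirit of noise reduction rather than exact parsing.

```python
import re

# Operand normalization: registers become REG, numeric literals become NUM.
# Patterns are the ones given in the text, applied case-sensitively to
# lowercase mnemonics/registers and uppercase hex, as in the listings.
REG = re.compile(r"(e?(([abcd][x])|([sd]i)|([bs]p)))|([abcd][lh])")
NUM = re.compile(r"([0-9A-F]{2,}h)|[0-9]")

def normalize(line):
    line = REG.sub("REG", line)
    line = NUM.sub("NUM", line)
    return line

print(normalize("mov eax, 1"))    # mov REG, NUM
print(normalize("add esp, 0Ch"))  # add REG, NUM
```

Two instructions that differ only in register choice or literal value now normalize to the same text, which is exactly the insensitivity to minor code variation that opstrings are after.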
Those are examples of operand normalization. Another possibility is opcode normalization. Anyone who has superficially glanced at a lot of graphics code, perhaps by paging rapidly in an editor to see if any repeated patterns leap out, knows that functions sometimes differ only in the direction of math operations. One function might be sprinkled with + and * and the other with - and /. To locate such matches, it might be useful to simply group together all basic math operations as "MATH," all bit operations as "BIT," and so on. However, this is likely naïve if applied to binary code, because instructions are often used in ways that don't correspond with their stated intent (see the discussion of LEA in Michael Abrash's Zen of Code Optimization).
Loosening and Tightening the Opstring
In essence, we are treating the bulk of the binary code as a "wildcard" or "don't care" value, corresponding to * in a regular expression. This idea can be extended, resulting in somewhat looser matching. Multiple opcode bytes map to the same mnemonic. Both 51 and 52, for example, are push instructions (51 is push ecx and 52 is push edx). If an opstring is based, not on the opcode bytes such as 51 or 52, but on the instruction mnemonic, such as "push," the resulting digest is not only position independent, but insensitive to additional changes in the source code upon which the binary function is based, and to some compiler/linker register allocation and instruction reordering. (Because a single opcode can be displayed as multiple mnemonics, 74h might be je or jz, consistent use of the same underlying disassembler is crucial.) The opstring for Figures 1 and 2 now looks like this:
push,mov,push,push,push,mov,push,push,
call,mov,cmp,jnz,mov,pop,ret
(Yes, this makes it appear that the code does little more than move and push. You might suspect that most code gives off this same appearance of frenetic busy-work and housekeeping. You would be right.)
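Going from a mnemonic list to a database key is then a one-liner with MD5. A minimal sketch (the real opstring program parses DumpPE output; here the mnemonic list is supplied directly):

```python
import hashlib

def opstring_digest(mnemonics):
    """Join mnemonics into an opstring; return it with its MD5 hex digest."""
    opstring = ",".join(mnemonics)
    return opstring, hashlib.md5(opstring.encode("ascii")).hexdigest()

# The function from Figures 1 and 2, reduced to mnemonics. Both copies,
# despite residing at different addresses with different operand bytes,
# reduce to this same list, so their digests match.
ops = ["push", "mov", "push", "push", "push", "mov", "push", "push",
       "call", "mov", "cmp", "jnz", "mov", "pop", "ret"]
opstring, digest = opstring_digest(ops)
print(opstring)
print(digest)
```

The fixed-length digest is what makes large-scale comparison cheap: half a million functions become half a million 128-bit keys that can simply be sorted.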
Figure 3 shows fragments of two functions from \windows\system32\gdi32.dll. The similarity between these two functions was found through this mechanical automated process. We happen to know the functions' names (which appear in the DLL's export table), and they happen to be adjacent in the DLL, but the similarity was not found this way. The two, separated at birth, wound up next to each other in a sorted list of half a million function digests.
It is clear (using the eyeball algorithm) that these are essentially the same piece of code, but the "W" (wide Unicode) version uses push ecx (a 1-byte instruction, 51h), whereas the "A" (ASCII) version uses push 1 (a 2-byte instruction whose opcode is 6Ah), before calling the same unnamed function, which the disassembler has designated as fn_77F2263B. (If the disassembler had gdi32.pdb to look at, it would have seen that this function is named IcmEnumColorProfile. I'm generally avoiding PDBs here to show the ability to match functions without any information about them besides the binary code itself. The final function database will happily use any names it can from PDB files.)
By using instruction mnemonics (push) rather than opcode bytes (51 or 6A), opstrings characterize functions in a way that is still sufficiently specific to reduce false positives, while sufficiently general to reduce false negatives (we'll quantify this later). Figure 4 shows opstrings based on the two functions in Figure 3. MD5 digests of these opstrings match. The two different types of push have been reduced to the same "push" element in the opstring. Notice that the difference between the 1- and 2-byte instructions (push ecx versus push 1) doesn't affect the matching of code that follows it.
As another refinement, one which tightens the matching, the underlined elements in Figure 4 show that a pseudo operation has been introducedloc. This corresponds to the targets of jumps that were located by the DumpPE disassembler. DumpPE makes a first pass through the code, looking for calls and branches (including indirectly, via tables of function pointers such as C++ vtables), and can create a pseudosymbol table of the targets of such calls and branches, and then in its second pass use this table to generate labels at the addresses that are the targets of these calls and branches.
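The two-pass idea behind loc can be sketched as follows. The instruction representation (address, mnemonic, optional branch target) is invented for illustration; DumpPE performs this bookkeeping internally.

```python
def with_loc(instrs):
    """instrs: list of (addr, mnemonic, branch_target_or_None) tuples.
    Pass 1 collects all branch/call targets; pass 2 emits a 'loc'
    pseudo-op before any instruction whose address is such a target."""
    targets = {t for _, _, t in instrs if t is not None}
    ops = []
    for addr, mnemonic, _ in instrs:
        if addr in targets:
            ops.append("loc")
        ops.append(mnemonic)
    return ops

# Toy function with one forward branch (addresses are invented).
instrs = [
    (0x10, "push", None),
    (0x11, "jnz", 0x14),   # branches forward to 0x14
    (0x13, "mov", None),
    (0x14, "ret", None),   # branch target: preceded by "loc"
]
print(with_loc(instrs))  # ['push', 'jnz', 'mov', 'loc', 'ret']
```

Because loc lands in the opstring at the position of the target, it captures some of the function's control-flow shape without recording the (position-dependent) target address itself.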
The loc pseudo operation helps characterize the shape of a function. This idea could be extended. It's easy to detect whether a given branch target is reached from earlier in the code, later, or both. Slightly more difficult to detect (it requires a second pass over the disassembly listing to first locate the ends of functions as well as their beginnings), branch targets could be characterized as coming from inside the enclosing function or (more frequent than you might think) from its outside. Such modifications start to take this method in the direction of clone-detection algorithms that rely on graph or relational aspects of the code, in addition to, or entirely instead of, the instructions themselves. See Todd Sabin on "Comparing Binaries with Graph Isomorphisms" (http://www.bindview.com/Services/Razor/Papers/2004/comparing_binaries.cfm) and Halvar Flake on "Structural Comparison of Executable Objects" (http://www.sabre-security.com/files/dimva_paper2.pdf).
Introducing the loc pseudo operation is just one way in which opstrings can better reflect the functions from which they are boiled down. Some operands (such as the magic number 5A827999h in code related to the secure hash algorithm or CAFEBABEh in code that manipulates Java .class files) are position-independent on the one hand, and highly indicative of the function's higher level purpose, on the other, and work as "fingerprints" for the function as a whole.
I've seen developers deliberately add such things to help later track how their code is used. When added to code in this way, similar to the way that telephone companies salt or seed phone books with fictitious names to catch infringing derivative works (see Feist Publications v. Rural Telephone Services), the magic number acts as a kind of "watermark," in contrast to a fingerprint, which is generally regarded as native, rather than added, to the code. Automating the extraction of magic numbers, which usually look like variant data or addresses to be ignored, rather than like invariant magic numbers to be incorporated into the opstring, probably requires a database of known magic numbers. The UNIX "file" utility employs /etc/magic, but this database is limited to file signatures, such as 'MZ' (0x5A4D) for the start of a Microsoft executable or 0xE0FFD8FF at the start of a JPG file. Such numbers are good candidates for inclusion in opstrings, but so are many nonfile magic numbers.
One method to create a magic database is to extract all 32-bit operands that appear in a large set of disassembly listings for unrelated code, and use all those numbers that appear in more than some minimum number of different files, but in fewer than some maximum number. There's no point loading down opstrings with 0xFFFFFFFF, 0x7FFFFFFF, 0x80000000, and the like. On the other hand, the value 0x80000002 (found in about one-third of Windows files) probably shows that the enclosing function either manipulates the HKLM portion of the Windows registry, or generates a Windows exception code, and the value 0x8D2A4C8A is surely sufficient to characterize the enclosing function as related to the MD5 algorithm (just Google for this number: Are there any hits unrelated to MD5?).
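This counting method is easy to sketch, assuming you already have per-file sets of 32-bit operands extracted from disassembly listings. The thresholds and sample values below are illustrative only; real thresholds would be tuned against a corpus like the 1900-odd XP files.

```python
from collections import Counter

def magic_candidates(operands_by_file, min_files=2, max_files=4):
    """Keep operand values seen in at least min_files distinct files
    (so they recur across unrelated code) but in at most max_files
    (so ubiquitous values like 0xFFFFFFFF are excluded)."""
    counts = Counter()
    for ops in operands_by_file.values():
        counts.update(set(ops))  # count each value at most once per file
    return {v for v, n in counts.items() if min_files <= n <= max_files}

# Invented sample data: file names and operand sets are hypothetical.
files = {
    "a.dll": {0xFFFFFFFF, 0x5A827999},
    "b.dll": {0xFFFFFFFF, 0x5A827999},
    "c.dll": {0xFFFFFFFF, 0x12345678},
    "d.dll": {0xFFFFFFFF},
    "e.dll": {0xFFFFFFFF},
}
# 0x5A827999 (the SHA constant, in 2 files) survives; 0xFFFFFFFF (in
# all 5 files) and 0x12345678 (in only 1) are both rejected.
print({hex(v) for v in magic_candidates(files)})
```
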
Far easier to extract than magic operands, obscure assembly-language instructions act as "magic" opcodes. For example, addps, fldz, pshufhw, rdtsc, and cmovnle are just a few of the many x86 instructions that are almost never used (I'll avoid any discussion here of RISC and complex instruction sets), but that still appear frequently enough to act as function fingerprints.
Security researchers have tried to identify small indicators of possibly malicious code to avoid inspecting entire programs. In its search for such "tell-tale signs," part of the "malicious code filter" idea (http://seclab.cs.ucdavis.edu/papers/llo95.ps) focuses on UNIX system calls such as chmod, chown, setuid, and setgid. You can generalize from this same basic idea. As noted in Part I, Windows API calls help characterize a function. Looking back at Figures 1 or 2, note the call to the Windows API, MessageBoxA. In an opstring, the API name can be used in place of a more generic call or jmp. This tightens the string: Now, opstrings for the code in Figures 1 or 2 match other opstrings that call not just any function, but only MessageBox from this particular location. To slightly loosen the string, the trailing A or W is removed:
push,mov,push,loc,push,push,mov,push,
push,[MessageBox],mov,cmp,jnz,
mov,pop,ret
Figure 5 shows the current state of our slice through the code originally shown in Figures 1 and 2. Sometimes, instead of a call or jump to a Windows API, code instead moves the API's address into a register, and only sometime later calls (possibly multiple times) through that register. An opstring represents this special move as mov_[API].
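A sketch of this API-name substitution, with the import-table lookup simulated by a dictionary (real code would resolve the operand address through the PE import directory; the addresses and table below are invented):

```python
def strip_aw(name):
    """Loosen the string: drop a trailing A (ANSI) or W (wide) suffix."""
    return name[:-1] if name and name[-1] in ("A", "W") else name

def api_element(mnemonic, target, imports):
    """Return the opstring element for one instruction: [Name] for a
    call/jmp through the import table, mov_[Name] for moving an API's
    address into a register, else the plain mnemonic."""
    name = imports.get(target)
    if name is None:
        return mnemonic
    name = strip_aw(name)
    return f"mov_[{name}]" if mnemonic == "mov" else f"[{name}]"

imports = {0x40A294: "MessageBoxA"}            # hypothetical import entry

print(api_element("call", 0x40A294, imports))  # [MessageBox]
print(api_element("mov", 0x40A294, imports))   # mov_[MessageBox]
print(api_element("call", 0x123456, imports))  # call (not an import)
```
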
Ignoring Common Operations
The refinements just described make an opstring more specific, and thereby reduce false positives. To reduce false negatives, consider that some mnemonics will appear with very high frequency. Despite the embarrassing richness of the Intel instruction set, a tiny handful of instructions are used far more than hundreds of others put together. As we've seen, mov and push occupy the bulk of any opstring (pop does not appear as frequently as push, because add esp is often used instead). It is said that computer software does little more than move or push things from one place to another, and both static and dynamic studies of instruction-set usage confirm this (see the measurements of instruction-set usage in Hennessy and Patterson's Computer Architecture: A Quantitative Approach).
If the most frequently occurring mnemonics such as mov and push on the x86 are simply ignored, as their presence is so likely anyway as to be uninformative, the resulting smaller string of less frequently occurring instruction mnemonics will be further insensitive to minor code changes. Given a minimum opstring length (to be arrived at in Part III), an MD5 digest based on this compressed opstring will find matches reflecting similar but not identical pieces of source code, with a low probability of false positives.
The opstring for the function in Figures 1 and 2 is now nothing more than:
loc,[MessageBox],cmp,jnz,ret
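The filtering step itself is trivial to sketch. The COMMON set here (mov, push, pop) is illustrative, chosen to reproduce the reduction shown above; the real choice of which high-frequency ops to drop is taken up in Part III.

```python
COMMON = {"mov", "push", "pop"}  # illustrative high-frequency ops

def compress(ops):
    """Drop the most common ops, keeping the rarer, more informative ones."""
    return [op for op in ops if op not in COMMON]

# The Figures 1/2 opstring, with loc and [MessageBox] already in place.
full = ["push", "mov", "push", "loc", "push", "push", "mov", "push",
        "push", "[MessageBox]", "mov", "cmp", "jnz", "mov", "pop", "ret"]
print(",".join(compress(full)))  # loc,[MessageBox],cmp,jnz,ret
```
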
Figure 6 depicts the almost laughable slice this represents of the original code.
When fingerprints were first introduced as evidence in capital cases, one response was that a single sweaty thumbprint was not a strong enough branch from which to hang a person. Here, a string of five "ops" (they're not even all true instruction mnemonics any more) seems too narrow a platform for function matching: loc, [MessageBox], cmp, jnz, ret; that's it? Clearly, we've removed the flat boring parts of this picture, but is what's left over truly its essence? More important, is it also the essence of other functions that are not really this one's clones?
As noted earlier, this code's original opstring, which still included mov and push, didn't find any clones in all of Windows XP, and this despite its inclusion of a generic call instead of the far more specific MessageBox. Now, having chopped out mov and push but added in MessageBox, a search of Windows XP still turns up no match. (Actually, this is a fairly unlikely piece of code: How often does one branch immediately to a MessageBox call and then, as a result of that call, compare and branch again?)
One nice example is obviously insufficient. With even an unacceptably high false positive rate of 10 percent, say, most randomly chosen examples would work. Part III takes up in detail the issues of false positives and negatives, and the minimum opstring length (which depends on the number of unique opstrings necessary, and on the effective instruction-set size, which in turn depends in part on the distribution of opcode usage).
Help from Optimizing Compilers
Figure 7 depicts the process of comparing two functions by first generating disassemblies of their respective compiled binary code, then extracting opstrings from the disassemblies, and finally generating MD5 digests from the opstrings.
The top of Figure 7 shows two different-looking pieces of source code. A source-level comparison of these functions finds only the "}" lines in common. Even normalizing the comparison by replacing all variable names with generics, such as v1 for the first variable, v2 for the second, and so on, would still not yield matches, because the control structures appear different (test1 uses nested for loops, whereas test2 uses nested while loops).
The second tier of Figure 7 shows that the resulting assembly-language code varies far less than its overlying source code. This code was compiled with optimizations turned on (cl -Ox -G5r). There isn't such a close match without optimizations, nor between the optimized and unoptimized versions of the same source code. Indeed, one good way to do plagiarism detection is simply to compile with optimizations, and then compare disassemblies of the binaries. That the compiler can smoosh out some differences at the source-code level suggests that non-source analysis can on occasion be superior to using the source.
Even so, the resulting assembly code is different starting on the third line. Some of the differences are highlighted. Comparing the sequence of literal bytes of course fails to reveal the similarities between these two pieces of code, but so does comparing the initial opcode byte alone, or even comparing all mnemonics.
This is where the "no common" method comes in. Minor differences between two similar pieces of code often occur in the high-frequency operations such as mov, push, and pop. Filtering those out (along with add esp, which works like pop), and selecting only the less-frequent operations, such as call, test, jnz, and so on, produces the opstrings toward the bottom of the diagram. As can be seen, these are identical (down to relative location of ops and loc pseudo operations). Naturally, the resulting MD5s at the very bottom are as well.
Figure 7 should not be taken for more than it is. For pedagogical purposes, it reflects a certain amount of cheating. The source code in test1 on the left side was modified until it became test2 on the right side, but I stopped making changes when they began to produce disassemblies that were too different for the opstring program to detect as similar to test1. Obviously, there are numerous additional tweaks that could be made to test2 that would bring it further out of whack with test1.
Besides heavy editing at the source-code level, different compilers can obviously take the same source code and produce fairly different binary code. The same compiler, with different optimization settings, can do the same thing.
But despite these issues to be worked out (see Part III), at bottom it is feasible to extract salient features from binary code, and then compare these salient features to find clones. This is "edge detection" on functions, pulling their basic quality out of the overall image of a program, such that their shapes can be compared against each other, and against basic templates.
DDJ