The Microsoft .NET CLR (Common Language Runtime) and the .NET Framework class libraries provide a lot of new high-level functionality for Windows software developers. Features such as a single object model and extensive type system that all .NET programming languages use, a rich class library common to all languages, automatic lifetime management of allocated objects, and many others make the .NET CLR the clear choice for future Windows development efforts.
One feature of the CLR that is new to many developers is the guaranteed availability of metadata for all running code. All managed language compilers store metadata within an assembly. This metadata describes practically everything the compiler knew about your code. For example, the metadata describes the name and types of all your classes, fields, methods, properties, and events. In fact, there are really only two things missing in the compiled version of your managed application that were present in the original source code: the names of your local variables and your comments.
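To get a sense of how much of this information survives compilation, consider the following C# sketch (the Customer type and its members are invented for illustration), which uses Reflection to enumerate a class's metadata at run time:

    using System;
    using System.Reflection;

    class Customer
    {
        private string name;                  // a private field
        public void Rename(string newName)    // a public method
        {
            name = newName;
        }
    }

    class MetadataDump
    {
        static void Main()
        {
            // Reflection simply reads the metadata the compiler emitted into
            // the assembly; type, field, and method names are all present.
            Type t = typeof(Customer);
            foreach (MemberInfo m in t.GetMembers(
                BindingFlags.Public | BindingFlags.NonPublic |
                BindingFlags.Instance | BindingFlags.Static))
            {
                Console.WriteLine("{0}: {1}", m.MemberType, m.Name);
            }
        }
    }

Even in a Release build, the private field name and the method Rename appear exactly as they were written in the source.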
The .NET CLR uses this metadata to provide valuable services to your code. The Just-In-Time (JIT) compiler references the metadata to lay out your objects in memory. The garbage collector uses the metadata to track object references and lifetimes. The remoting infrastructure uses the metadata to provide marshal-by-value and marshal-by-reference semantics automatically to your objects, as appropriate. The serialization layer uses the metadata to read and write the contents of your objects to a stream. The Web Services functionality uses the metadata to provide a Web Service Description Language (WSDL) description of your web services. Use of metadata is pervasive throughout the .NET Runtime and Framework.
However, I talk to hundreds, sometimes thousands, of developers each year, and every time I introduce the .NET Runtime to them, the same question arises: "What about my intellectual property? Using the metadata, can't someone reverse engineer my code and obtain something similar to the original source?"
The answer is "Yes." The easiest way to reverse engineer a .NET assembly is to use the ILDASM utility provided with the .NET Framework SDK. Example 1 shows an Intermediate Language (IL) disassembly of a private method, GetNewString, in the System.Text.StringBuilder class from the .NET Framework class library.
It's not completely understandable at first glance, but it's much easier to read than x86 assembly language. However, using all the metadata that's available, it's possible to decompile this method into C# source code. Using the Anakrino decompiler from Jay Freeman at www.saurik.com, I decompiled GetNewString. Example 2 shows the decompiled version. (Note: This works best when the original language was C#.) You'll notice that the name of the local variable wasn't known, so the decompiler synthesized a new name. Based upon its usage, the local variable represents the value-constrained capacity of the new string.
However (he writes hastily), I must point out that this is not a problem unique to the .NET Runtime and class libraries. When someone has access to the binary code for your application, given enough time and resources, they can always reverse engineer it! This is true for Java applications. This is true for Intel x86 applications.
When you absolutely require that your application not be reverse engineered, put the only copy on a floppy disk, lock it in a safe, and never run the program. Of course, then it's not a terribly useful application. If you're willing to risk some chance of reverse engineering, put the application on a physically secured server and write a client application that makes network requests to the secured application. Of course, the application is then only as secure as the server itself.
Generally though, developers want to give their end users an application and also prevent anyone, for example, a competitor, from reverse engineering the application. This cannot be done. (Anyone still in doubt about this should reread the prior two paragraphs. Repeat until you reach a deep Zen state of acceptance.) All you can do is increase the time and resources that a competitor must supply in order to reverse engineer your application.
So, a more realistic question to ask is, "What can I do to make reverse engineering my application more difficult and how much will it cost me?" (Actually, I might first question the assumption that anyone really wants to reverse engineer your application. I've seen the source code for many large software development efforts. It's not a pretty sight.)
Encrypting your application is another method of defeating decompilation. However, unless the decryption occurs in hardware, it is possible to intercept the decrypted code. Most consumer personal computers won't have dedicated decryption hardware, so this solution doesn't work except in very specialized scenarios.
Another thought is to install only an x86 native-code application. The NGEN utility from Microsoft (Native code GENerator) compiles a .NET assembly into x86 machine code at program installation time. However, the Runtime still requires the metadata and IL of the original assembly to be present, so this technique doesn't provide any additional protection from reverse engineering. The NGEN utility's purpose is to "pre-JIT" compile all the methods in an assembly so that each method doesn't have to be JIT compiled the first time it is called.
A managed program is easier to reverse engineer than an unmanaged application (in other words, your legacy Windows application) because the managed application typically contains nearly complete symbolic and type information (that is, metadata), while the unmanaged application contains none. I think it's worth saying again that the lack of symbol and type information does not by itself preclude reverse engineering. With experience, it's not extremely difficult to translate an unmanaged program back into something quite similar to its original C/C++ source. One reason is that high-level language compilers typically translate flow-control structures into unmanaged code with regular, recognizable patterns. It's also easy to recognize code that accesses a local variable or a function argument. Reverse engineering such programs is mainly time consuming, and therefore expensive. There are also some semiautomated tools that can assist during the process.
However, lack of symbol and type information does make reverse engineering more difficult, so removing it from a .NET assembly would help in this regard. There is a class of .NET applications, called "obfuscators," that does exactly this. The purpose of an obfuscator is to transform an application, by applying obfuscating transformations, so that it is functionally identical to the original but is difficult to reverse engineer (understand). There are four types of obfuscating transformations that a typical obfuscator might use.
Data Obfuscations
Data obfuscations operate on the data structures used in the program. Data storage obfuscations change the type of storage used for variables. One example is converting a local variable into a global variable; the obfuscator would ensure that the different methods use the variable at different times and never simultaneously.
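As a minimal C# sketch of this idea (the names and the method are invented for illustration), a value that was a local variable in the original source is promoted to a static field, which the obfuscator could also let other methods reuse as long as their uses never overlap:

    using System;

    class StorageObfuscationDemo
    {
        // In the original source, "total" was a local variable inside Sum.
        // The obfuscator promotes it to a static field that other methods
        // could reuse at different, nonoverlapping times.
        static int s_scratch;

        static int Sum(int[] values)
        {
            s_scratch = 0;
            for (int i = 0; i < values.Length; i++)
                s_scratch += values[i];
            return s_scratch;
        }

        static void Main()
        {
            Console.WriteLine(Sum(new int[] { 1, 2, 3 }));   // prints 6
        }
    }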
A data encoding obfuscation changes the way a program interprets stored data. For example, the obfuscator can replace every value stored in an index variable i with the encoded expression 8*i+3. When the code needs to use the index, the obfuscator inserts the decoding expression (i-3)/8. Finally, instead of incrementing the variable by one, the code adds eight to the stored value; see Example 3. Basically, the obfuscation scales and offsets the index from the desired value and only computes the real index when it's about to be used.
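A minimal C# sketch of this encoding follows (the loop and buffer are invented; Example 3 shows the transformation applied to real code):

    using System;

    class DataEncodingDemo
    {
        static void Main()
        {
            int[] buffer = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };

            // Original loop: the index i runs 0..9.
            for (int i = 0; i < 10; i++)
                Console.Write(buffer[i]);
            Console.WriteLine();

            // Encoded loop: the stored value is always 8*i + 3, the real
            // index is recovered as (i - 3) / 8, and the increment becomes
            // += 8 instead of ++. The output is identical.
            for (int i = 3; (i - 3) / 8 < 10; i += 8)
                Console.Write(buffer[(i - 3) / 8]);
            Console.WriteLine();
        }
    }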
A data aggregation obfuscation alters how data is grouped together in memory. An example is turning a 2D array into a 1D array or vice versa. The basic idea is to change the familiar conceptual mapping to a less common, in-memory representation so that it's more difficult for a person to understand your algorithms. For example, a chessboard is often modeled in a program as a matrix, but changing it to a one-dimensional array works just as well for the CPU.
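A minimal C# sketch of the chessboard example (the accessor names are invented for illustration):

    class BoardAggregationDemo
    {
        // Conventional model: ranks and files as an 8x8 matrix.
        static char[,] board2D = new char[8, 8];

        // Aggregated model: the same 64 squares flattened into one array.
        static char[] board1D = new char[64];

        static char Square2D(int rank, int file)
        {
            return board2D[rank, file];
        }

        static char Square1D(int rank, int file)
        {
            // The CPU is equally happy computing rank * 8 + file, but the
            // familiar "board" shape is gone for a person reading the code.
            return board1D[rank * 8 + file];
        }
    }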
A data ordering obfuscation changes how data is ordered. In C-based languages, it is common to see the ith element of a collection of data accessed by indexing to position i in an array. A data ordering obfuscation would determine the index in the array of the data by calling some function f(i). Again, this simply rearranges the storage of information in a way that less closely models the normal conceptual model.
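A minimal C# sketch of the same idea, where f(i) is an arbitrary invertible permutation chosen purely for illustration:

    class DataOrderingDemo
    {
        static int[] data = new int[16];

        // f(i): XOR-ing the index with a constant permutes the slots 0..15.
        // The constant 5 is an arbitrary illustrative choice.
        static int F(int i)
        {
            return i ^ 5;
        }

        static void Store(int i, int value)
        {
            data[F(i)] = value;    // element "i" no longer lives at slot i
        }

        static int Load(int i)
        {
            return data[F(i)];     // but it always comes back correctly
        }
    }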
Control Flow Obfuscations
Control flow obfuscations affect the control flow of your program. Again, the intent here is to change the flow of control in such a way that the obfuscated program maintains its original semantics but is harder to understand.
A control aggregation obfuscation changes how method statements are grouped. One example is taking a method and "inlining" it. Inlining replaces a call to a method with the method's actual body. Taken to an extreme, you could eliminate the inlined method completely by replacing every call to the method with the method body. One side effect of this obfuscation is increased program size.
Of course, another control aggregation obfuscation would be "outlining." Take an arbitrary section of code out of a method and create a new method containing the instructions. Replace the original code with a call to this new method. Now there is additional structure in the code that is meaningless to the program's conceptual model.
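A minimal C# sketch of outlining (the Report method and the synthesized name M1 are invented for illustration; inlining is simply the reverse transformation):

    using System;

    class OutliningDemo
    {
        // Original method before obfuscation:
        //
        //   static void Report(int[] values)
        //   {
        //       int total = 0;
        //       foreach (int v in values) total += v;
        //       Console.WriteLine(total);
        //   }

        // After outlining, the summation lives in a synthesized method whose
        // name adds structure that means nothing in the program's design.
        static void Report(int[] values)
        {
            Console.WriteLine(M1(values));
        }

        static int M1(int[] values)
        {
            int total = 0;
            foreach (int v in values)
                total += v;
            return total;
        }
    }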
A control ordering obfuscation alters normal statement execution order to something less expected. For example, sometimes it's possible to make a loop iterate backwards instead of forwards. It's possible to add fake data dependencies into the loop to prevent a decompiler from undoing the effects of this obfuscation. Another example would be to insert a jump into the middle of a while loop. This creates a nonreducible control flow graph that a decompiler cannot typically transform back into a while loop.
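A minimal C# sketch of the loop-reversal case (the method names are invented; the fake-data-dependency and mid-loop-jump tricks are not shown):

    class LoopOrderingDemo
    {
        static int SumForward(int[] values)
        {
            int total = 0;
            for (int i = 0; i < values.Length; i++)       // the expected order
                total += values[i];
            return total;
        }

        static int SumBackward(int[] values)
        {
            int total = 0;
            for (int i = values.Length - 1; i >= 0; i--)  // same result, less expected
                total += values[i];
            return total;
        }
    }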
A control computation obfuscation tries to hide the real control flow of your application by inserting additional control flow statements that have no real effect on the program's actual flow of control. For example, the obfuscator could insert an if statement in the middle of a block of code that, when true, entirely bypasses a necessary section of code. In this case, the obfuscator needs to ensure that the condition (predicate) always evaluates to false. Inserting simplistic predicates, such as if (true == false), doesn't provide much obfuscation benefit because it's easy for a decompiler (and a human, hopefully) to determine that the condition is always false and discard the code that is never executed, basically undoing the obfuscation. A predicate that cannot be easily evaluated by static analysis of the code is called an opaque predicate.
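A minimal C# sketch of an opaque predicate (the particular predicate here, which relies on the fact that a square is never congruent to 2 modulo 4, is just one of many possible choices):

    using System;

    class OpaquePredicateDemo
    {
        static void Process(int x)
        {
            // x*x is always congruent to 0 or 1 modulo 4 (even allowing for
            // integer overflow), so this branch can never be taken, but that
            // isn't obvious from a purely local inspection of the code.
            if ((x * x) % 4 == 2)
            {
                return;                  // would bypass the necessary work below
            }

            Console.WriteLine(x * 2);    // the method's real work
        }

        static void Main()
        {
            Process(int.Parse(Console.ReadLine()));
        }
    }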
Preventive Transformations
Preventive obfuscations try to stop decompilers from working properly. A targeted preventive transformation exploits a weakness in a known decompiler. A well-known example in the Java world is the HoseMocha obfuscator, which inserts extra bytecodes after a return instruction. The Mocha decompiler didn't originally handle this case and would crash when it encountered the superfluous bytecodes.
One .NET obfuscator vendor also sells its own .NET decompiler. (I guess one part of the company is trying to stimulate business for the other half.) One of the company's "selling points" is that its decompiler can't (won't) successfully decompile an assembly that has been obfuscated with its obfuscator. Of course, you can use the Anakrino decompiler instead, so it's not a very good selling point. I think preventive obfuscations give a false sense of security. Chances are good that the next version of the tool your obfuscator targets will have the bug fixed.
Layout Obfuscations
Layout obfuscations are typically trivial to perform and greatly reduce the amount of information available to a human reader. Examples include discarding unnecessary identifier names and debugging information.
A .NET obfuscator reads an assembly and alters the metadata in various ways, all of which make the job of subsequently reverse engineering the assembly more difficult. One typical layout obfuscation, called "symbol obfuscation," locates the names of all purely assembly-internal types and members and changes the original names to something less meaningful.
In most cases, the .NET Runtime does not need the symbolic names of internal, private, and family-and-assembly accessible types and members. They were required to compile the assembly but, except for cases using Reflection or late binding, aren't needed during normal execution. However, a well-named type, method, or field provides a great deal of semantic information that assists a reverse engineer in understanding your code. In my opinion, symbol obfuscation provides the greatest benefit-to-cost ratio of all the obfuscation techniques.
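A before-and-after C# sketch of symbol obfuscation (the class and its members are invented for illustration):

    using System;

    // Before obfuscation: the names alone tell a reader a great deal.
    internal class DiscountCalculator
    {
        private decimal maximumDiscount;

        internal DiscountCalculator(decimal maximumDiscount)
        {
            this.maximumDiscount = maximumDiscount;
        }

        internal decimal ComputeDiscount(decimal orderTotal)
        {
            return Math.Min(orderTotal * 0.1m, maximumDiscount);
        }
    }

    // After symbol obfuscation: identical behavior, but the purely
    // assembly-internal names no longer carry any meaning.
    internal class a
    {
        private decimal b;

        internal a(decimal e)
        {
            b = e;
        }

        internal decimal c(decimal d)
        {
            return Math.Min(d * 0.1m, b);
        }
    }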
However, there is a subtle point that most developers don't understand and even many obfuscator vendors get wrong. The value of the obfuscation technique comes from discarding the original symbolic information. There is no additional obfuscation benefit that results from a particular choice of replacement names.
Some early obfuscators algorithmically generated a replacement symbol from the original name in a way that permits the original name to be retrieved. On the surface, it appears that names have been discarded but they really have only been, well, obfuscated, but not very well. A decompiler can target that obfuscator's output and readily determine the original symbolic name.
There is another symbol obfuscation technique that is quite popular and works well, but not for the reasons the vendors typically claim. The approach overloads the same name as frequently as possible within a scope. For example, the .NET Runtime allows you to name a double field 'a' and a uint field 'a', both in the same scope, without ambiguity. The Runtime considers the type and the name together as the unique identifier. Therefore, a 'double a' is clearly different from a 'uint a'.
C# doesn't allow these member definitions, so simplistic decompilers could produce "C# code" that won't compile due to multiple identifiers having the same name in a scope. And it's probably true that having multiple 'things' all called 'a' easily confuses the typical programmer. But, once again, a decompiler needs only to notice the overloaded names and synthesize new unique names for everything. Example 4 shows the output from decompiling an obfuscated version of Example 2.
There are many other possibilities for replacement names. Some common choices are C# or VB.NET keywords and high Unicode characters. No particular choice provides more obfuscation benefit than any other choice. However, there is a nonobfuscation-related reason that encourages short identifiers.
Every unique symbol in the metadata requires space in a symbol table. When you have 26 fields, each named with a different character, you need to store the 26 different symbols in the metadata symbol table. When you overload the names of the 26 fields and give them the same name, only one symbol needs to be stored in the metadata symbol table. When you choose a name where the characters occupy as few bytes as possible, you save even more space.
Therefore, while overloading type and member names as heavily as possible provides no obfuscation benefit, it has the side effect of reducing the size of the metadata. Demeanor for .NET, an obfuscator produced by my company, Wise Owl (www.wiseowl.com), overloads names using a more complex technique than the common class-inheritance/method-signature approach. I ran Demeanor for .NET on one version of Microsoft's ~2-MB System.Windows.Forms.dll assembly.
An early version discarded 18,352 of the 35,241 original symbolic names and reduced the size of the symbols from 277,964 bytes to 124,232 bytes (-56 percent), roughly a 150-KB savings in metadata string space. This results in a smaller disk image and memory footprint, both useful side effects, though not ones that make your code harder to reverse engineer. In fact, for an embedded application, you might want to consider obfuscation purely for the reduction in image size.
The .NET Runtime doesn't need most of the symbolic information in the metadata at run time, except in a few special cases. Public symbols in a library assembly must be preserved because external code binds to those types and members by name. Public symbols in an executable assembly (an .exe file), on the other hand, usually aren't required, because it's unusual for external code to bind to symbols exposed by an executable. Also, code that uses Reflection to bind to a type or member by name requires that the name be the same at run time as it was at compile time.
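A minimal C# sketch of the Reflection case (the plugin type and method names are invented): if an obfuscator renamed ReportPlugin or Run, the string-based lookups below would fail at run time, which is why such names must be excluded from symbol obfuscation.

    using System;
    using System.Reflection;

    public class ReportPlugin
    {
        public void Run()
        {
            Console.WriteLine("report running");
        }
    }

    class LateBindingDemo
    {
        static void Main()
        {
            // Both lookups use the metadata names as strings at run time,
            // so those names must survive obfuscation unchanged.
            Type t = Type.GetType("ReportPlugin");
            object plugin = Activator.CreateInstance(t);
            MethodInfo run = t.GetMethod("Run");
            run.Invoke(plugin, null);
        }
    }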
Summary
The rich, integrated Microsoft .NET platform provides so many compelling features to developers that I believe it's only a matter of time until most Windows software developers write the majority of their code for the .NET environment. I fully expect future versions of the Windows operating system itself to provide new APIs that are directly accessible only by managed code. Once you write code for the managed world, it's hard to go back to the Win32 API and COM. For better or worse, metadata, along with all of its good and bad implications, is here to stay.
Many programs won't need obfuscation because the potential loss from reverse engineering is negligible. Numerous obfuscators are already available for the .NET platform, ranging from basic renaming obfuscators to fully functional obfuscators that handle mixed IL/native-code assemblies created in any managed language, including Microsoft's C++ with Managed Extensions.
Remember, though, that an obfuscator simply makes your application harder to reverse engineer; it does not prevent reverse engineering. The cost of obfuscation, however, is insignificant compared to the cost of a typical software development project. If you feel that an obfuscator provides you any benefit at all, it's probably worth the price.
Brent Rector is president and founder of Wise Owl Inc. and has over two decades of experience in software development. Brent has designed and implemented operating systems as well as new computer programming languages and their compilers. Brent started developing Windows applications in 1985 and has been involved in Windows development ever since. He is the author and coauthor of numerous Windows programming books, including ATL Internals and Win32 Programming. Brent is also the author of Demeanor for .NET, a code obfuscator for .NET. He can be contacted through www.wiseowl.com.