September 10, 2024

When coding style survives compilation: De-anonymizing programmers from executable binaries

In a recent paper, we showed that coding style is present in source code and can be used to de-anonymize programmers. But what if only compiled binaries are available, rather than source code? Today we are releasing a new paper showing that coding style can survive compilation. Consequently, we can utilize these stylistic fingerprints via machine learning and de-anonymize programmers of executable binaries with high accuracy. This finding is of concern for privacy-aware programmers who would like to remain anonymous.

 

Update: Video of the talk at the Chaos Communication Congress

 

How to represent the coding style in an executable binary.

Executable binaries of compiled source code on their own are difficult to analyze because they lack human readable information. Nevertheless reverse engineering methods make it possible to disassemble and decompile executable binaries. After applying such reverse engineering methods to executable binaries, we can generate numeric representations of authorial style from features preserved in binaries.

We use a dataset consisting of source code samples of 600 programmers which are available on the website of the annual programming competition Google Code Jam (GCJ). This dataset contains information such as how many rounds the contestants were able to advance, inferring their programming skill, while all contestants were implementing algorithmic solutions to the same programming tasks. Since all the contestants implement the same functionality, the main difference between their samples is their coding style. Such ground truth provides a controlled environment to analyze coding style.

We compile the source code samples of GCJ programmers to executable binaries. We disassemble these binaries to obtain their assembly instructions. We also decompile the binaries to generate approximations of the original source code and the respective control flow graphs. We subsequently parse the decompiled code using a fuzzy parser to obtain abstract syntax trees from which structural properties of code can be derived.

These data sources provide a combination of high level program flow information as well as low level assembly instructions and structural syntactic properties. We convert these properties to numeric values to represent the prevalent coding style in each executable binary.

We cast programmer de-anonymization as a machine learning problem.

We apply the machine learning workflow depicted in the figure below to de-anonymize programmers. There are three main stages in our programmer de-anonymization method. First, we extract a large variety of features from each executable binary to represent coding style with feature vectors, which is possible after applying state-of-the-art reverse engineering methods. Second, we train a random forest classifier on these vectors to generate accurate author models. Third, the random forest attributes authorship to the vectorial representations of previously unseen executable binaries.

The contribution of our work is that we show coding style is embedded in executable binaries as a fingerprint, making de-anonymization possible. The novelty of our method is that our feature set incorporates reverse engineering methods to generate a rich and accurate representation of properties hidden in binaries.

 

 

Results.

We are able to de-anonymize executable binaries of 20 programmers with 96% correct classification accuracy. In the de-anonymization process, the machine learning classifier trains on 8 executable binaries for each programmer to generate numeric representations of their coding styles. Such a high accuracy with this small amount of training data has not been reached in previous attempts. After scaling up the approach by increasing the dataset size, we de-anonymize 600 programmers with 52% accuracy. There has been no previous attempt to de-anonymize such a large binary dataset. The abovementioned executable binaries are compiled without any compiler optimizations, which are options to make binaries smaller and faster while transforming the source code more than plain compilation. As a result, compiler optimizations further normalize authorial style. For the first time in programmer de-anonymization, we show that we can still identify programmers of optimized executable binaries. While we can de-anonymize 100 programmers from unoptimized executable binaries with 78% accuracy, we can de-anonymize them from optimized executable binaries with 64% accuracy. We also show that stripping and removing symbol information from the executable binaries reduces the accuracy to 66%, which is a surprisingly small drop. This suggests that coding style survives complicated transformations.

Other interesting contributions of the paper.

  • By comparing advanced and less advanced programmers’, we found that more advanced programmers are easier to de-anonymize and they have a more distinct coding style.
  • We also de-anonymize GitHub users in the wild, which we explain in detail in the paper. These promising results are encouraging us to extend our method to large real world datasets of various natures for future work.
  • Why does de-anonymization work so well? It’s not because the decompiled source code looks anything like the original. Rather, the feature vector obtained from disassembly and decompilation can be used to predict, using machine learning, the features in the original source code — with over 80% accuracy. This shows that executable binaries preserve transformed versions of the original source code features. 

I would like to thank Arvind Narayanan for his valuable comments.

 

Comments

  1. Alexander Benikowski says

    Is this an error?
    “we can de-anonymize them from optimized executable binaries with 64% accuracy. We also show that stripping and removing symbol information from the executable binaries reduces the accuracy to 66%, which is a surprisingly small drop.”

    How can you “drop” from 64% to 66%? 😉
    Otherwhise: very interesting topic.

    • Hi Alexander,

      It is not an error, but the wording might be confusing. Binaries that are stripped are not optimized. Accordingly, I wanted to compare the accuracy of stripped-binary authorship attribution (66% for 100 programmers) to the accuracy of unoptimized-binary authorship attribution (78% for 100 programmers), not to optimized-binary authorship attribution accuracy (64% for 100 programmers).

  2. Peter Dunne says

    I quickly scanned through your 2 papers
    An interesting approach to the problem and strong results for such early work
    I noticed the mention of programming language and the fact the research centered on C++ so I am wondering if you used programming language as an identification parameter in any tests so far?
    As programmers have preferences for specific tools, not just specific languages, identifying the tool chain would perhaps provide an additional vector
    In the wild, we work with certain libraries, I myself prefer Zeos database components & Indy internet access library for example
    Due to the code signatures of these libraries it is possible to identify which libraries are used in a given program so the selection of libraries used to implement solutions may also reduce error rates
    I refer you to the comment by Jacob Gajek and the point he raised about libraries masking the personal signature of the author
    Identification of libraries and flagging of code blocks would eliminate the entire library block from the code pattern analysis for the specific author and should thus reduce the observed error rate by an appreciable margin
    I would be very easy to recognize from the libraries as I use specific modifications to both data access controls and language support which do not exist in the standard Lazarus/Freepascal distributions, another factor which could possibly be used to identify the author

    • Hi Peter,

      This is a very good point. We can investigate the effects of libraries by creating a dataset that has consistent usage of a variety of libraries and eliminate them to see how it effects de-anonymization accuracy.

      I have not used the programming language as an identification feature so far. It suspect it would make de-anonymization easier but we wanted to focus on coding style as a fingerprint and ignored all features that are not properties of coding style. On the other hand, choice of programming language might be a property of personal style. Incorporating such features can aid in improving de-anonymization. Also, investigating different programming languages from a stylometric perspective can be another research project by itself.

  3. Jacob Gajek says

    Very interesting research. Wouldn’t a large amount of statically-linked library code within the binary swamp any features that are due to the “author” of the binary? Perhaps that could also explain while lesser-skilled programmers tend to be more difficult to de-anonymize: They tend to glue third-party code together rather than implement larger chunks of code themselves.

  4. Hey!

    Impressive research. My question is – coud it be used in pure clustering regime – having a set of sample binaries just group them by probable common author?

    • Yes, you could do that, especially in a setting where you do not have ground truth on authorship. An extra step of feature analysis would make the clustering more accurate though, where you make sure that you are clustering according to authorial fingerprints instead of code properties.

  5. Thank you for this interesting research.
    From what I understood in your paper, your experiment has been made by using only one compiler.
    I would be interested in the result of your classification algorithms if you use, with -O3, three quite different compilers: gcc 3.x, gcc 4.x, and clang-llvm. Their optimisation strategies are quite different, especially when building libraries (PIC code) and I expect that the clustering will work less efficiently.

  6. Would be very interesting to see this applied to malware, to aid in determining authorship of malware variants and families.

    • Please get in touch with me if you have a malware dataset with ground truth.

      • Paul Ellenbogen says

        Do you need actual malware with ground truth? Could you use the GCJ dataset, but use unsupervised machine learning, trying to get the algorithm to group together entries written by the same author?

        • We can start clustering with GCJ. Nevertheless, it is important to see how the method works on real world datasets in the “wild”, including a malware dataset that contains obfuscated code.

    • Paul Ellenbogen says

      Using some sort of unsupervised learning technique to discover which pieces of malware have the same author in dataset would be very interesting.

      • Yes, if the malware dataset does not have ground truth information on authorship, unsupervised learning would be the main approach.

  7. Daniel Lorch says

    Can you impersonate another programmer? i.e. would it be possible to transform the binaries structurally from style x to x’ ?

    • If you imitate the identifying features of x’, you should be able to transform the binaries.

  8. Michael Heyman says

    Back around 2007-08 I think, I worked on code attribution for malware (sometimes obfuscated) that used similar techniques including those based on “Finding Binary Clones with Opstrings & Function Digests” by Andrew Schulman. Parts 1, 2, and 3 are in Dr. Dobbs July, August, and September 2005 issues (http://www.drdobbs.com/finding-binary-clones-with-opstrings-fu/184406152). The malware tended to be polymorphically encrypted so teasing out the actual code was a challenge in itself. The results were not great but we were attempting to do unsupervised learning (we did not know who the coders were and what code a particular coder wrote even within an executable). Our results were improving when our funding ended but we would have had a long way to go to get to 96%.

  9. Shawn Zernik says

    It would be possible to identify the programmer with “obfuscate” code. The style if “if(problem) then return” to prevent nested if would still be present in the logic flow of the “disassembled code”. You’d be looking at flow control patterns, variable declaration locations, and not the “names”. “obfuscate” the code, and the flow control and structure still remains.

  10. This brings back memories – I worked on code attribution for malware that used similar techniques including those based on “Finding Binary Clones with Opstrings & Function Digests” by Andrew Schulman. Parts 1, 2, and 3 are in Dr. Dobbs July, August, and September 2005 issues (http://www.drdobbs.com/finding-binary-clones-with-opstrings-fu/184406152). This was extra fun because the malware tended to be polymorphically encrypted so teasing out the actual code was a challenge in itself. The results were not great but we were attempting to do unsupervised learning (we did not know who the coders were and what code a particular coder wrote even within an executable). Also our funding ended before we had time to use more sophisticated techniques.

  11. If it is made by a human, it will have a pattern imminent that identifies attributes of its authors. If the author tries to obscure such patterns, that is too part of the pattern. This doesn’t necessarily mean the author is immediately identifiable…

    As fingerprints, DNA, etc. have been used and abused, so too will this.

  12. Avraham Bernstein says

    I work in an industry that requires that we obfuscate our C/C++ source code in order to protect against reverse engineering – similar to the techniques described in Collberg and Nagra’s book “Surreptitious Software”. Of course we use randomized obfuscation templates, and we sanitize symbol names – so no information could be gleaned from a naming convention. Your attached article mentioned that your tool uses the top-of-the-line HEX-RAYS disassembler. We use it too, but to verify that we are outsmarting HEX-RAYS. I would be amazed if your tool could fingerprint such code (other than identify that the code is obfuscated).

    • That you would be amazed is as irrelevant as symbol names.

    • Avraham Bernstein says

      Obfuscation techniques rely primarily upon obfuscating the flow, *and not simply symbol name obfuscation*. For example inserting ‘opaque’ conditions that we know logically from the application are guaranteed never to occur, but that fool the optimizing compiler because it cannot tell a priori what the value of the condition will be. In many cases loops can be efficiently traversed out-of-order without any adverse effects, as long as every index is visited exactly once. System function calls can be efficiently cloaked, by first using dlopen() and dlsym() with lightly encrypted string constants, that generate a structure that contains (1) a random value, and (2) the real function address masked with this random value – so that an efficient indirect call can be made by simply masking these 2 values at call time. Application function calls can be cloaked by declaring every single function in a module as static, and then export only a single variable with an obfuscated jump table. HEX-RAYS is blinded. It can no longer take advantage of the object file symbol table to provide fixed labels for any function calls. String constants can be lightly encrypted using a stream cipher based upon a hyper-fast non-crypto PRNG. 16-bit numeric values can be pseudo randomized and cloaked inside a 32-bit variable by using the modulus of a large prime number (greater than 16 bits), where obfuscate(x,rnd) = prime * rnd + x, and uobfuscate(x) = x%prime. It is possible to intentionally jump into the middle of a hand crafted object code, especially in the x86 and x64 instruction sets, to execute a bona fide instruction – again blinding HEX-RAYS. By using clang, one can compile C code into a virtual machine instruction set of your own design, and then execute the virtual machine at run-time. I prefer obfuscated FORTH VMs with a performance hit of only a factor of 2. Clearly in order to efficiently implement the above techniques one needs a powerful preprocessor. We use a customized Jinja2 preprocessor (which is much more powerful and easier to use than the C++ template mechanism which anyway cannot be used for heterogeneous C/C++ projects), and we pay lots of bucks for industrial strength C/C++ obfuscating compilers. We use Cryptanium. This is the type of obfuscation which HEX-RAYS cannot digest.

      • Bosh CPU emulation can be used to debug/decrypt/crack such programs, at turtle speed, but it is possible… If executable protection really worked, triple AAA games with a production budget over 100 000 000 $ would not get cracked, yet they always are. Such tools gives buy you time between the release and a successful crack.They work for boring work oriented, non graphical, software, only because cracker are not motivated by that kind of softwares and that kind of RE take’s a dedication that only passion can buy

        • Avraham Bernstein says

          Nicolas you hit the nail on the economic head. We in the obfuscation business realize that when we release an application that sits on your personal device, and does not require that it be tethered to the Internet, an attacker has all the time in the world to try to reverse engineer (RE) it. And we realize that a skilled attacker with enough time and effort will eventually be successful. Therefore we need to release new versions frequently, and to create versions that are personalized to the user’s specific hardware features, e.g. their network device MAC address or CPU/BIOS/ROM/FLASH ID, or software features such as their O/S UUID, or their application subscriber UUID. So if and when the attacker eventually cracks his specific version of the S/W, it will soon become obsolete, and the technique that he used will not be applicable to other devices and other users – so publicly disseminating his hack is of no value. And besides the fixed protection in the client playback software, perhaps each game or movie could be associated with its own content specific security plug-in too. These techniques are known as a “zero knowledge” defense. And regarding AAA games. Every week that the above techniques can delay a successful attack is worth tens of millions (or even a 100 million) dollars marginal revenue to the content providers. And regarding the well known FOSS emulators, we “know” their fingerprints and foibles. For example the standard ones don’t bother emulating the CPU cache. Therefore a successful attacker will have to invest considerable effort in writing his own emulator, or at a minimum refactoring and re-engineering the existing ones. Unlike the popular cryptographic algorithms which are theoretically “perfect” (i.e. when the only line of attack is to brute force the key), obfuscation is not theoretically perfect, but it can provide economically “good enough” security.

          • Jacob Gajek says

            Cryptographic algorithms are also not theoretically perfect, and there is usually no proof that brute forcing the key is the best possible attack. So just like in obfuscation, there are economic considerations involved in evaluations of cryptographic security.

    • Yep, We need more obfuscating software/hardware for obfuscating apps to obfuscate actual obfuscating software for obfuscate our privacy from our obfuscated clients …

      http://abstrusegoose.com/536

  13. What if all programmers started to use only one style, can’t find the maker if it all looks the same.

  14. What if all programmers started to use only one style, can’t find the maer if it all looks the same.

    • What if 2 became 1? Then all the calculations break down.

    • Tistou Blomberg says

      If the style where documented enough to everyone following the exact style. Then we probably could just make a program that programmes that way. The same element in humans that creates the style also creates the solutions to problems needed to be solved by programming.

  15. Are you going to open source your work?

  16. With that you might be able to find who Satoshi Nakamoto is. I mean, who Nick Szabo is.

    • Ivan Montilla says

      In the case of bitcoin, you just have to grab the very first version of the git repo, it’d be easier because you have the source code right there.