In a recent post, I talked about our paper showing how to identify anonymous programmers from their coding styles. We used a combination of lexical features (e.g., variable name choices), layout features (e.g., spacing), and syntactic features (i.e., grammatical structure of source code) to represent programmers’ coding styles. The previous post focused on the overall results and techniques we used. Today I’ll talk about applications and explain how source code authorship attribution can be used in software forensics, plagiarism detection, copyright or copyleft investigations, and other domains.
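To make the three feature categories concrete, here is a minimal Python sketch of lexical and layout feature extraction. The feature names and the regular expression are illustrative assumptions, not the paper's actual feature set, which also includes syntactic features computed from the abstract syntax tree:

```python
import re

def style_features(source: str) -> dict:
    """Toy lexical/layout style features for a piece of source code.

    Illustrative only: real code stylometry uses a much richer set,
    including syntactic features derived from the parse tree.
    """
    lines = source.splitlines()
    nonblank = [ln for ln in lines if ln.strip()]
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    return {
        # layout: how the author indents and fills lines
        "tab_indent_ratio": sum(ln.startswith("\t") for ln in nonblank) / max(len(nonblank), 1),
        "avg_line_length": sum(len(ln) for ln in nonblank) / max(len(nonblank), 1),
        # lexical: naming habits
        "avg_identifier_length": sum(map(len, identifiers)) / max(len(identifiers), 1),
        "underscore_name_ratio": sum("_" in t for t in identifiers) / max(len(identifiers), 1),
    }
```

Features like these, computed per author over many files, form the numeric "style vector" that a classifier learns from.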
Security vs. privacy. Identifying the authors of source code is a security-enhancing method that has applications in software forensics. Most of this post will focus on those applications. But before getting to that, I should mention that it is a double-edged sword: security-enhancing techniques are often also privacy-infringing, depending on how they are used. For example, Iranian citizen Saeed Malekpour was sentenced to death after he was identified as the developer of an adult entertainment website. Stylometry would be equally applicable in cases like this.
Increased awareness of such security-enhancing but privacy-infringing methods leads to demand for counteracting privacy-enhancing methods, such as programmer de-anonymization evasion tools. The same techniques that identify the properties of an individual's coding style could be used in reverse, to modify those properties and anonymize code with respect to a set of programmers. Such a tool could give suggestions to completely anonymize a programmer within a set, or to imitate another programmer's coding style. It could aid programmers such as Bitcoin's Satoshi, who would like to remain anonymous while contributing to open source projects.
This won’t be easy, though. We wondered whether running source code through an existing code obfuscation tool would be sufficient to anonymize coding style. It was not: Stunnix, the off-the-shelf commercial obfuscator we used, rewrites code without changing its functionality, and in doing so preserves its syntactic structure. Since our method relies heavily on syntactic features, it is resilient to obfuscators that do not modify the structure of source code.
Detecting ghostwritten code. Ghostwriting is a type of plagiarism. Say a freshman’s performance on programming assignments suddenly improves, and we suspect that someone else is writing his code, perhaps a sophomore who took the class the previous year. There are many plagiarism detection tools, such as Moss, that measure code similarity. But in our example the assignments are all different this year, so code similarity comparison is of no help in detecting ghostwriting. Stylometry is still relevant, though: we could find the owner of the stylistically most similar code from the previous year and call in that student, as well as the freshman, for gentle questioning.
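A sketch of that last step, assuming numeric style vectors have already been extracted from each previous-year student's code; the names, the vectors, and the choice of cosine similarity are illustrative assumptions, not our exact pipeline:

```python
import math

def cosine(u, v):
    """Cosine similarity between two style feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar_author(suspect_vec, author_vecs):
    """Return the previous-year author whose style vector is closest
    to the style vector of the suspect assignment."""
    return max(author_vecs, key=lambda name: cosine(suspect_vec, author_vecs[name]))

# hypothetical style vectors for two former students
author_vecs = {"alice": [1.0, 0.0, 0.2], "bob": [0.1, 1.0, 0.9]}
```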
Another example of ghostwriting is the strange case of an employee found to be outsourcing his tasks. Suppose an employee’s performance or coding skill changes all of a sudden. We might suspect that he is outsourcing his code, even if we have no idea to whom. We could take this employee’s code from before and after we notice the change and check whether there is a sharp difference in coding style. If there is, a deeper investigation could be carried out.
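This before-and-after check can be sketched in a few lines, assuming average style feature vectors for the two periods; the Euclidean distance and the threshold here are illustrative and would need to be calibrated against the normal variation in a single programmer's style:

```python
import math

def style_drift(before, after):
    """Euclidean distance between the average style vectors of the
    periods before and after the suspected change."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(before, after)))

def flag_outsourcing(before, after, threshold=0.5):
    # threshold is an illustrative knob, not a calibrated value
    return style_drift(before, after) > threshold
```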
Disputed code authorship. Source code author identification could automatically deal with code copyright disputes without requiring manual analysis by a code investigator. A copyright dispute on code ownership can be resolved by comparing the styles of both parties claiming to have generated the code. Style comparison along with copyright information can be extended to automatically detect copyright conflicts. New source code releases can be compared to a code repository that has copyright and author information to automatically detect potential infringements.
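A toy sketch of such an automatic scan, assuming a repository mapping authors to style vectors; the cosine measure and the threshold are illustrative assumptions, not part of our published method:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def flag_copyright_conflicts(release_vec, repo, claimed_author, threshold=0.9):
    """Report repository authors whose style matches a new release more
    strongly than the claimed author's does -- candidates for a closer
    manual look, not an automatic verdict."""
    claimed = cosine(release_vec, repo[claimed_author])
    return [name for name, vec in repo.items()
            if name != claimed_author
            and cosine(release_vec, vec) > max(claimed, threshold)]
```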
Identifying intruders and malware authors. The applications discussed so far can be achieved using our techniques today, but there are also more speculative applications if the tools continue to improve. Consider the forensic task of examining the artifacts on a system after an intrusion to obtain evidence for a criminal prosecution. Often, the attacker leaves behind code after an intrusion, either a backdoor or a payload. If we are able to identify the code’s author — for example, by a stylistic comparison of the code with various authors in online code repositories — it may give us clues about the adversary’s identity. A careful adversary may only leave binaries, but a less careful one may leave behind source code or code written in a scripting language. Even more speculatively, there is the possibility that elements of coding style may be preserved in compiled binaries. This would enhance our ability to track the origins of malware.
Application to software engineering. Code stylometry can also provide insights for software engineering. We took source code from different Google Code Jam rounds to investigate how coding style varies with the difficulty of the programming task. We found that programmers’ coding styles became more distinct as they implemented more challenging functionality. Further, advanced programmers have a more distinctive coding style than less advanced ones; we found this by comparing contestants who were able to complete the difficult rounds with those who could not get past the easier rounds. We also discovered that coding style is preserved to some degree over the six years spanned by our dataset. On the software engineering side, identifying a programmer’s coding style could aid companies in automating parts of recruitment, for example by screening for coding styles they consider superior.
Software engineering researchers could use code stylometry to analyze stylistic properties of code that has a higher rate of bugs. We could create a dataset with code that is known to include bugs and code without any incidence of bugs to differentiate between their stylometric features. This would aid in creating a classifier that automatically predicts how buggy a piece of code is likely to be.
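As a sketch of what such a classifier might look like, here is a nearest-centroid stand-in in plain Python; the real experiment would use a richer stylometric feature set and a stronger learner such as a random forest:

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def bug_style_classifier(buggy_vecs, clean_vecs):
    """Label new code 'buggy' if its style vector lies closer to the
    centroid of known-buggy code than to the centroid of clean code."""
    cb, cc = centroid(buggy_vecs), centroid(clean_vecs)
    def predict(vec):
        db = sum((a - b) ** 2 for a, b in zip(vec, cb))
        dc = sum((a - b) ** 2 for a, b in zip(vec, cc))
        return "buggy" if db < dc else "clean"
    return predict
```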
Summary of results. While the applications discussed above are conceptually similar, they correspond to slightly different machine learning problems, so the level of accuracy achieved with our techniques will be different. The following table summarizes our results on some settings we examined on our Google Code Jam dataset.
| Application | Accuracy achieved | Type of classification task |
| --- | --- | --- |
| Ghostwriting | 95% | Comparison of 250 programmers |
| Copyright investigation | 99% | Comparison of two programmers |
| Authorship verification | 93% | Comparison of one programmer to a random/unknown programmer |
| Identifying intruders/malware authors | Future work | Does the source code belong to a programmer in the training set? If so, which one? |
I am grateful for Arvind Narayanan’s useful feedback on this blog post.