Every programmer learns to code in a unique way, which results in distinguishing “fingerprints” in coding style. These fingerprints can be used to compare the source code of known programmers with an anonymous piece of source code and determine which of the known programmers wrote it. This method can aid in finding malware programmers or detecting cases of plagiarism. In a recent paper, we studied this question, which we call source-code authorship attribution. We introduced a principled method with a robust feature set and achieved a breakthrough in accuracy.
Our results. We used a dataset with 250 programmers that had an average of 630 lines of code per programmer. We used a combination of lexical features (e.g., variable name choices), layout features (e.g., spacing), and syntactic features (i.e., grammatical structure of source code), resulting in a 95% accuracy at attributing an anonymous piece of code to one of 250 programmers. This is significantly better than prior work because of the larger number of candidate programmers and greater accuracy. The largest dataset used in previous work, in terms of number of programmers, had 46 programmers (they don’t state the number of lines of code). The accuracy was 55%. In another study, with a smaller dataset of 30 programmers and an average of 1,910 lines of code per programmer, 97% accuracy was reached.
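To make the lexical and layout feature categories concrete, here is a minimal sketch of how a few such measurements could be computed from raw source text. It is not the feature set from the paper: the function names, the specific ratios, and the partial keyword list are all assumptions chosen for illustration, and the tokenizer is a plain regular expression that does not skip comments or string literals.

```python
import re

# Partial, illustrative keyword list (an assumption, not the paper's).
CPP_KEYWORDS = {"int", "for", "while", "if", "else", "return", "void",
                "double", "char", "using", "namespace", "include"}


def layout_features(source: str) -> dict:
    """Whitespace and brace-placement statistics (hypothetical features)."""
    lines = source.splitlines()
    n_lines = max(len(lines), 1)
    n_chars = max(len(source), 1)
    open_braces = [ln for ln in lines if "{" in ln]
    return {
        "tab_ratio": source.count("\t") / n_chars,
        "space_ratio": source.count(" ") / n_chars,
        "blank_line_ratio": sum(1 for ln in lines if not ln.strip()) / n_lines,
        # Fraction of lines containing "{" where the brace stands alone.
        "brace_own_line_ratio": (
            sum(1 for ln in open_braces if ln.strip() == "{") / len(open_braces)
            if open_braces else 0.0
        ),
    }


def lexical_features(source: str) -> dict:
    """Identifier-naming statistics, with listed keywords excluded."""
    names = [t for t in re.findall(r"[A-Za-z_]\w*", source)
             if t not in CPP_KEYWORDS]
    n = max(len(names), 1)
    return {
        "avg_name_length": sum(map(len, names)) / n,
        "underscore_ratio": sum(1 for t in names if "_" in t) / n,
        "camel_case_ratio": sum(
            1 for t in names if re.search(r"[a-z][A-Z]", t)) / n,
    }


def style_vector(source: str) -> list:
    """Merge both feature groups into one fixed-order numeric vector."""
    feats = {**layout_features(source), **lexical_features(source)}
    return [feats[k] for k in sorted(feats)]
```

Two programs that behave identically, one using tabs with `snake_case` names and the other using spaces with `camelCase` names, produce different vectors, which is the kind of signal these feature categories rely on.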
Dataset. Google Code Jam is an annual international programming competition. It has thousands of participants from different backgrounds such as professional programmers, students, and hobbyists. The solution files of the programming tasks submitted by the contestants have been published on the website since 2008. We collected the C++ source code of more than 100,000 contestants along with their usernames from 2008 to 2014. We wanted to avoid the risk of picking up the properties of a particular problem’s possible solutions instead of a programmer’s coding style. Fortunately, in Google Code Jam, contestants try to solve the same sequence of problems to advance to more difficult rounds. This allowed us to construct experimental datasets in such a way that the training sets for each of the 250 programmers were solutions to the same task. The test set was a source code file not seen in any of the training sets.
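The dataset construction described above can be sketched roughly as follows. This is only an illustration under assumed data structures: the `submissions` mapping, the problem identifiers, and the function name `split_by_problem` are hypothetical, and the sketch assumes several training problems per programmer plus one held-out problem.

```python
def split_by_problem(submissions, train_problems, test_problem):
    """Build train/test sets where every author solved the same problems.

    submissions: dict mapping (author, problem_id) -> source text
                 (a hypothetical layout, not the paper's storage format).
    """
    needed = set(train_problems) | {test_problem}

    # Record which problems each author has a solution for.
    solved = {}
    for author, problem in submissions:
        solved.setdefault(author, set()).add(problem)

    # Keep only authors who solved every required problem, so problem
    # content is held constant across authors and only style varies.
    authors = sorted(a for a, probs in solved.items() if needed <= probs)

    train = [(a, submissions[(a, p)]) for a in authors for p in train_problems]
    test = [(a, submissions[(a, test_problem)]) for a in authors]
    return train, test
```

Authors missing any of the required problems are dropped rather than padded, so every remaining author contributes the same number of training files.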
Abstract syntax trees. Our work is an application of machine learning. Broadly, there are two steps: turning each input file into a vector of numerical features, followed by using a classifier that learns the patterns in each programmer’s feature vectors to classify a new, previously unseen vector. The key advance in our work is the use of a deeper set of structural features to represent coding style. In particular, we used syntactic features extracted from “abstract syntax trees” along with lexical and layout features directly extracted from source code. Abstract syntax trees in source code are analogous to “parse trees” of prose sentences. Prose authorship attribution that utilizes parse trees has been able to identify an anonymous text from 100,000 candidate authors 20% of the time.
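The second of the two steps, learning from labeled feature vectors and classifying a new one, can be sketched in a few lines. The classifier here is a deliberately simple nearest-centroid rule swapped in for brevity; it is not presented as the classifier used in the paper, and the two-dimensional toy vectors and author names are made up.

```python
import math
from collections import defaultdict


def train_centroids(samples):
    """samples: iterable of (author, feature_vector) pairs.

    Returns a dict mapping each author to the mean of their vectors.
    """
    sums = {}
    counts = defaultdict(int)
    for author, vec in samples:
        if author not in sums:
            sums[author] = [0.0] * len(vec)
        sums[author] = [s + v for s, v in zip(sums[author], vec)]
        counts[author] += 1
    return {a: [s / counts[a] for s in total] for a, total in sums.items()}


def attribute(centroids, vec):
    """Return the author whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda a: math.dist(centroids[a], vec))


# Toy data: two invented authors with two training vectors each.
training = [
    ("alice", [0.9, 0.1]), ("alice", [0.8, 0.2]),
    ("bob", [0.1, 0.9]), ("bob", [0.2, 0.7]),
]
model = train_centroids(training)
print(attribute(model, [0.85, 0.15]))  # prints "alice"
```

This is a closed-world classifier in the sense used in this post: it always names one of the known authors, even if the true author is not among them.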
The figures below show a code snippet and the corresponding abstract syntax tree.
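As a rough, text-only illustration of what an abstract syntax tree exposes, the sketch below uses Python’s standard `ast` module as a stand-in (the study itself worked with C++ code). The two snippets, the helper name `ast_features`, and the choice of node-type counts and tree depth as features are assumptions made for this example.

```python
import ast
from collections import Counter


def ast_features(source: str):
    """Return (node-type counts, maximum tree depth) for Python source."""
    tree = ast.parse(source)
    node_counts = Counter(type(node).__name__ for node in ast.walk(tree))

    def depth(node):
        # Depth of a node is 1 plus the depth of its deepest child.
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(child) for child in children), default=0)

    return node_counts, depth(tree)


# Two ways of writing the same computation.
loop_style = "total = 0\nfor x in data:\n    total += x\n"
builtin_style = "total = sum(data)\n"

loop_counts, loop_depth = ast_features(loop_style)
builtin_counts, builtin_depth = ast_features(builtin_style)

print(loop_counts["For"], builtin_counts["For"])    # prints "1 0"
print(loop_counts["Call"], builtin_counts["Call"])  # prints "0 1"
```

Both snippets compute the same sum, yet their trees contain different node types. Structural choices like this survive renaming variables or reformatting whitespace, which is why syntactic features go deeper than lexical and layout ones.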
What’s next. Despite the leap in source code authorship attribution accuracy, we believe that this is only a first step in code stylometry and this line of attack will yield many improvements. Just as linguistic stylometry has seen huge leaps in the last few years, a rigorous machine learning based approach can transform code stylometry. For example, adding control flow graph features could further boost accuracy.
Code stylometry has applications in security, privacy, software forensics, and software engineering. In a follow-up blog post, I’ll discuss how it can be used for various problems in different areas. The results I presented above pertain to the general case of a “closed world setting” with multiple programmers. I will conclude with one practical example of where this can be useful. If we have a set of programmers who we think might be Satoshi, and samples of source code from each of these programmers, we could use the initial versions of Bitcoin’s source code to try to determine Satoshi’s identity. Of course, this assumes that Satoshi didn’t make any attempt to obfuscate his or her coding style.
Most platforms have utilities like “cb” (C Beautifier) or “indent”. The same results could be reached with Emacs or a scripting language. They could easily be added to the build environment to produce re-formatted sources automatically. Releasing the re-formatted sources, not the “originals”, solves the problem. As for variable names, there are standards for composing them:
http://en.wikipedia.org/wiki/Hungarian_notation
If you follow them, no variable profiling is possible.
Interesting article. Having applied for many jobs last year (still no offers); it came to my attention that many such jobs wanted samples of my code. I wonder what types of things they run to determine if I am a good fit for their company or not; based on some samples. Do they look at my coding styles? Do they try to find out if I am guilty of plagiarism or some such (I am not, when I do take code from others, I give credit where credit is due).
I do wonder, though, about the 630 lines of code; that seems insufficient to really get a good sample of code. When I am coding (something I don’t get to do all that often at my current position), 630 lines is a single day or less of code. Yet my code spans 15 years.
I look at today’s code and it is SO MUCH different in all of those features (spacing, syntax, grouping, variable naming, etc); than it was even six months ago; let alone 15 years ago.
Seems to me some meta-data on when the code was written would need to be included if you were to pinpoint the author, because I doubt I am really the exception to the rule. I really doubt that most coders have one specific style that stays with them forever.
Not to mention; changes in languages. My styles in PHP are very different than in JavaScript, Java, Delphi, or Pascal (just to name a few). I would hope that my styles are getting “better” as time goes on. The only thing that I have tried to maintain would be two spaces (no tabs) for indenting. But even then depending on the tool I have used sometimes that isn’t even true.
This reminds me of the old adage… Those who can, do; those who can’t, teach or become ‘academics’.
Do tools like gofmt mess this up?
It’s really easy to identify my code by syntactic feature. None of my coworkers ever use docstrings or comments. 🙁
…but have you found Satoshi?
I think this will be a nice addition for git blame. I hope to see automatic author determination offered as a feature on GitHub.
You could adopt Python, where everybody uses similar styles.
Just like John Varley predicted in “Press Enter”…
It’s very easy to obfuscate your source code if you want to. For example, the easiest way is to minify it all. Just like it’s done with CSS/JS/HTML.
I wonder what happens if you compile to bytecode and then disassemble it back to code. Would that “anonymize” your code?
It would certainly strip most of the features that the researchers analyzed. But there would still be clues in the disassembly: class and method names, method parameter counts, functional vs procedural style, method length/complexity.
I participated in Code Jam multiple years, and I constructed all of my solutions from the same “skeleton” that included boilerplate for reading text files, parsing integers, writing out results, and so on. It’s utterly unsurprising that a model trained on some of my Code Jam solutions can identify other of my Code Jam solutions, and I’m skeptical that you’d get nearly as high accuracy if you looked at my non-Code-Jam code.
Completely misread the headline. Mentally inserted quotes around the word anonymous and thought the NSA was targeting programmers with the group Anonymous. Then I realized they probably are doing this, but that’s not what the article is about. 🙂
Anyway, as an ex-programmer, I certainly agree. Tab length, hard or soft tabs, text width*, editor cruft, source code filenames, brace style, variable names, white spacing, comment style, order of operations, function structure, code complexity, density versus verbosity, math style, etc, etc, all certainly mark a programmer.
Now I know why companies insist on common coding style guidelines. It prevents the NSA from spying. 🙂
* Anything more than 80 characters is crazy!
I can still identify programmers I worked with a decade ago without a doubt by seeing their code. Still, it’s amazing that we can automate that type of analysis and build programs to do it as well or better than humans.
Don’t worry, your food or coffee habits will be next. You order a Tim’s and you get a drone strike for dessert. 🙁
(Sorry, I love America! (Even though we kicked your ass 200 years ago!))
See?! Nested braces, do you let them touch or space them out? I am betraying myself!
I hope you meant “we and the French fleet”.
How diverse was the training set in terms of code snippets spread over extended periods of a coder’s career? I feel like my coding style changes often, with my mood or what I’ve been working with most recently. Can it identify code written by a coder 2 years ago based on code written today? I would think the noise-to-signal ratio would be too large over those sorts of time periods to be useful. It would be useful to look at what Hacker-B wrote in college and connect it with the new virus code we just discovered 5 years later, but I doubt that could work.
Please refer to section V.K in the paper: “Consistency of programming style throughout years”
I am unconvinced by Section V.K (disclaimer, I am not a scientist and this isn’t science class).
“We took a set of 25 authors from 2012 that were also contestants in 2014’s competition … In section V-J, the experiment of 25 authors with 9 files within 2014 had a correct classification accuracy of 92.36%. These results indicate that coding style is reserved up to some degree throughout years.”
Same contest, though different years. I have never participated in that specific contest, and only once in high school participated in a coding contest. However, based on my experience, some important variables stand out: in that contest everyone was coding 1) in the same language, 2) with the same tool-set (everyone worked on the same brand of computers with the same compiler), 3) after receiving the standard instruction in high school AP programming class (e.g. pretty much the same text-books), and 4) with a supposition that the basic sequence of problems will likely be similar from contest to contest. (On the fourth supposition: when presented with Problem A, I may come at it from Solution A three times out of five perhaps, but sometimes may think of Solution B or Solution C [depends literally on what my brain is doing any given day].)
All four of those variables affect my personal coding style. The language, the tools in use, what I have learned (most these days I learn on my own by studying, so it depends very much on whom I am learning from), and of course what the problem at hand is.
As I review my own code back any given time I sit down to code (anywhere from just weeks to months between coding sessions); I am often changing styling considerably (depending on what I have learned, or experienced in the intervening time). And, if I go back to code I did years ago; I usually say to myself “UGH why did I do that?!” I would be willing to bet that should you increase your sample data set; increase the amount of time between coding (more than two years); you would find less accuracy.
For instance, did you account for whether those coders (the Code Jam participants) did more coding and/or completed more coding education between Code Jam competitions? The result would be more believable if you had. How am I to know that the 25 coders picked weren’t people who just do it as a hobby and didn’t really do much coding or learning between competitions, and so kept the same style? And vice versa: perhaps the more coding and education done, the greater the consistency? As I said, I am not a scientist, but it seems to me these variables were not controlled for.
I would be willing to bet if I submitted 10 samples 5 of which were from myself; and 5 from random people over the Internet… you wouldn’t be able to accurately tell me which 5 were myself (without going to the internet to search for the code snippets I happened to pull). Why? Because I can see MAJOR changes to my coding style over the 15 years; and across 10 or more languages; across 2 to 3 compilers/editors each language; etc. etc.
Darn, I should have read the comments before commenting. Gregory, you hit the same idea I was thinking. I see my styles change every time I sit down to code (which because of my current job position [computer-do-it-all], may be months between any coding, as compared to typesetting, graphic design, troubleshooting, and many general office tasks on top of that).
“Mentally inserted quotes around the word anonymous”
Hahaha. I did almost the same thing but not thinking quotes; I saw the word was capitalized; and thought to myself “proper noun?” “or just beginning of sentence?” And then decided I better read to find out which. But I too had passing thoughts of “Anonymous” and the implications of tracking them down via code.
As a disclaimer, I strongly considered joining Anonymous; I agree with much of their positions. I disagreed with only one point and that was that they “never forgive” … Forgiveness is something I find very sacred; so for that and only that I didn’t join.