In a recent post, I talked about our paper showing how to identify anonymous programmers from their coding styles. We used a combination of lexical features (e.g., variable name choices), layout features (e.g., spacing), and syntactic features (i.e., grammatical structure of source code) to represent programmers’ coding styles. The previous post focused on the overall results and techniques we used. Today I’ll talk about applications and explain how source code authorship attribution can be used in software forensics, plagiarism detection, copyright or copyleft investigations, and other domains.
Every programmer learns to code in a unique way which results in distinguishing “fingerprints” in coding style. These fingerprints can be used to compare the source code of known programmers with an anonymous piece of source code to find out which one of the known programmers authored the anonymous code. This method can aid in finding malware programmers or detecting cases of plagiarism. In a recent paper, we studied this question, which we call source-code authorship attribution. We introduced a principled method with a robust feature set and achieved a breakthrough in accuracy.
Paul Ohm’s 2009 article Broken Promises of Privacy spurred a debate in legal and policy circles on the appropriate response to computer science research on re-identification techniques. In this debate, the empirical research has often been misunderstood or misrepresented. A new report by Ann Cavoukian and Daniel Castro is full of such inaccuracies, despite its claims of “setting the record straight.”
In a response to this piece, Ed Felten and I point out eight of our most serious points of disagreement with Cavoukian and Castro. The thrust of our arguments is that (i) there is no evidence that de-identification works either in theory or in practice and (ii) attempts to quantify its efficacy are unscientific and promote a false sense of security by assuming unrealistic, artificially constrained models of what an adversary might do. [Read more…]