Kevin Walsh and Emin Gün Sirer, of Cornell University, have a new paper on Credence, a system for detecting unwanted files in P2P networks. It’s a kind of reputation system for files, designed to detect in advance that certain files are not what they claim to be. One use of this technology is to detect spoofed files inserted into P2P nets by copyright owners.
Credence is really a reputation system for files. Users cast votes, which are simple thumbs-up or thumbs-down verdicts on particular files, saying whether a file is what it claims to be. Every vote is digitally signed by the user who cast it, so that recipients can verify the authenticity of votes they are given. If you’re not sure whether a file is genuine, you can ask people to send you votes, and then you can combine the votes in a special way to yield a quality score for the file.
P2P systems are open, and they generally don’t register their users (or at least they don’t prevent fraudulent or repeated registrations) so users cannot reliably be identified by their real names. Instead, users make up pseudonyms for themselves. Suppose Alice wants to join the system. She makes a pseudonym for herself, by generating a cryptographic key-pair, consisting of a private key and a public key. Alice keeps the private key secret, and uses it to put digital signatures on the votes she casts. Along with each vote, she distributes a copy of the public key, which anybody can use to verify the digital signature. The public key also serves as a “name” for Alice’s pseudonym. The key attribute of this system is that if Bob wants to forge a vote, that is, to create a vote that appears to have come from Alice’s pseudonym, he must somehow determine Alice’s private key, which is essentially impossible if Alice does her cryptography correctly. In short, the cryptography ensures that anybody can make a pseudonym, but only the creator of a pseudonym can cast votes on its behalf.
This only solves half of the problem, because an adversary can create as many pseudonyms as he likes, and have them cast false votes (i.e, votes in favor of the validity of files that are actually invalid, or against the validity of files that are actually valid). So you can’t just add up all of the votes you receive; you need some way to tell whose votes to trust and whose to ignore. Here Credence uses a simple rule – trust people who tend to vote the same way that you do. Suppose Alice knows that files A, B, and C are valid, and that files X, Y, and Z are not valid. If some pseudonym “Bob” has votes in favor of A, B, and C, and against X, Y, and Z, then Alice concludes that “Bob” tends to vote accurately. If another pseudonym “Charlie” votes the opposite way on those six files, then Alice concludes that votes from “Charlie” tend to be the opposite of the truth. So if she sees some new file that “Bob” says is valid and “Charlie” says is invalid, Alice will conclude that the file is valid. Each party’s vote on the new file gets a weight, equal to the correlation between that party’s votes and Alice’s votes on other files. (The paper hints at further mechanisms that assign trust to people whose votes correlate with those of other people Alice trusts.)
This scheme presents would-be adversaries with a dilemma. If Alice’s votes are truthful, then if you want to mislead Alice about one file, you have to earn her trust by telling her the truth about other files. You can tell occasional lies, but on the whole you have to be a truth-teller. (You can achieve the same effect by lying about almost everything, and telling the truth about just one file. Then Alice will conclude that you are a habitual liar, and will count your votes with negative weight, giving credence to the opposite of what you say. Again, you have to provide Alice with many useful-to-her votes in order to trick her once.)
It looks like this method will work, if it can be implemented efficiently in a real network. The real question, I think, is whether it will scale up to enormous P2P networks containing huge numbers of files. Here I have serious doubts. The paper’s authors don’t claim that users have to know about all of the votes cast in the system. But they’re not entirely clear on how individual users can efficiently get the votes they need to make good decisions, if the network is very large. In the long run, I don’t think this scaling problem is insurmountable; but more research is required to solve it.
There should be a time parameter to the trust. Make the system slowly give more and more credence to known trustworthy voters as time passes. That would at least make it considerably harder to create widely visible false positives.
Then the attackers will start using botnets. 😛
It seems that you could prevent millions of fake identities by requiring a large amount (e.g. 1 modern-CPU-hour) of hashcash on each identity. (I assume that the attackers are not using botnets.)
They’ll vote correctly on the millions of jpegs and wrongly on the thousands of mp3z. Additional levels of reputation management for identities (rather than just files) are needed, but they were in fact “hinted at” in the original paper.
Well, the question here is what is your threat model? What attack(s) are you trying to prevent?
For an individual user that wants to distribute some specific files (e.g. Ben’s brother’s garage band) in a fraudulent way (e.g. looking like a 50 cent song) then this sort of credibility system fails to protect against the specific targeted attacks.
However, I didn’t get the impression that the researchers were trying to prevent that sort of attack. It seemed to me they were trying to prevent large scale, wholesale spoofing of a large portion of the available works. Against this attack, it seems this method could be resonably successful.
Specifically, the RIAA or some owner of a large number of copyrights would be able to effectively spoof a few works by correctly identifying a large number others as valid files. That might cause trouble for downloading of a few items, but not actually inhibit the general effectiveness of the P2P network.
You guys have caught on to one of the weaknesses of Credence, which I didn’t sufficiently recognize when I wrote the original post. As you have observed, an adversary can commit plenty of harm even if he mostly tells the truth.
It’s hard, in a system like this, to regulate the creation of new identities, or to get reliable timestamps on anything. Without a central authority, you can allow creation of new identities (as new public/private key pairs), but you can’t do much else reliably.
Maybe the system could block people from opening up new accounts all the time. Sure, in a decentral system this is harder than for a web forum…
Another way is to rate users’ quality by how many files they already voted on, how regularly they vote, how long they are registered, if they are online a decent amount of time every month.
Ben,
I think you can wreak this havoc by creating many fewer users. When your bad file becomes available, many people will download it (on the basis of your bad users’s recommendation) before others start to vote against it.
– PB
PB —
I thought of this, too, and I’m curious if there’s a technical point I’m missing. Though creating just one trusted person will only fool two people, at which point the spoof file will be voted bad by 2/3 voters and people will probably stop downloading it.
What is a problem that I think I see, though, is where I create a million fake voters.
So I’m Joe Spoofer, and I want to create a spoof file, “50.cent.candy.shop.mpg” (call it bad.file) that pretends to be the current top-of-the-charts song but is actually a bad audio reproduction of my brother’s garage band at the local prom.
I create 1×10^6 fake users. I have them all *accurately* evaluate a couple hundred files out there, AND inaccurately vote that bad.file is actually Candy Shop.
Now Real_User (you) wants to download Candy Shop. You pull all the votes on bad.file and find a million votes that it’s the good file. You check out those voters against your own knowledge base and they agree as to the validity of every file you check them against. So you believe the million votes and download My Brother’s Cover of Stairway to Heaven.
Even if you now vote that it’s a bad file, you’re just one vote against a million. The next real user will probably think *you* were the faker. The million fakes could probably fool a pretty big number of users.
They could also fool a number of users w/r/t a pretty big number of files — there’s probably a math problem tucked away in here. I mean, If you, frustrated by your failure to successfully download the first song, now want to download “50.cent.disco.inferno” that I’ve also spoofed (with, for irony’s sake, My Brother’s Cover of Disco Inferno), and you check the votes, you’ll have 1 million voters saying it’s a good file, and those 1 million voters will be 99.999% accurate — they’ve only made one error among thousands that they voted on. So you’d probably believe them.
And so on. Part of the math problem is when do you stop believing a voter.
EVENTUALLY, enough real people will download the spoofs to tip the voting balance the other way, but I don’t think the answer lies that way.
Probably, there’s going to be a PGP-like levels-of-trust thing going on. Not only will voters be ranked with their accuracy as to votes when compared to your own votes, but there will be some sort of heightened trust level you can assign certain voters, and then a step-down level you can assign the voters that *they* trust, etc. If you add weight to the votes like that, you’ll solve the million-fakers spoof problem more quickly.
ALSO, while it wouldn’t solve this particular problem, there will probably evolve a mechanism to rank individuals not only against the votes you have personal knowledge of but against the world-of-votes.
Anybody with a better understanding of the theory or math should probably pipe up at this point.
–Ben Manevitz
I’d be concerned about contributory infringement with regards to that, thinking back to, say, the Intellectual Reserve case.
But if this is used to counter spoofing, chances are that the users are infringing directly to begin with. It’s difficult to imagine that people are spoofing files lawfully when the files themselves are lawfully available on these nets. And you can’t verify the files without downloading them (can’t trust hashes run remotely) and that’s capable of infringing, all else being equal. So in for a penny, in for a pound.
Under Betamax, building the system — which seems like it’s really just a kind of user rating, here being applied in a clever way, but not terribly different from a more nuanced 1-5 star sort of thing — could probably be okay. Whether it’ll still be okay soon remains to be seen.
I wonder if it would be legal to publish your votes about files. It seems that it should be; all you are doing is signing a bunch of file hashes, reporting your observation that such hashes are songs that were good quality. If so, then this is a way that people could support the operation of a P2P network without putting themselves into legal jeopardy.