[Today we kick off a series of three guest posts by Mitch Golden. Mitch was a professor of physics when, in 1995, he was bitten by the Internet bug and came to New York to become an entrepreneur and consultant. He has worked on a variety of Internet enterprises, including one in the filesharing space. As usual, the opinions expressed in these posts are Mitch’s alone. — Ed]
The battle between the record labels and filesharers has been somewhat out of the news of late, but it rages on still. In the ongoing court case Arista Records v. LimeWire, a group of record labels is suing to have LimeWire held accountable for the copyright infringement committed by its users. Though this case has attracted less attention than similar cases before it, it may raise interesting issues not addressed in previous cases. Although I am a technologist, not a lawyer, this series of posts will advocate a way of looking at the issues, including the legal ones, from a freedom-of-speech perspective, which leads to some unusual conclusions.
Let’s start by reviewing some salient features of filesharing.
Filesharing is a way for a group of people – who generally do not know one another – to allow one another to see what files they collectively have on their machines, and to exchange desired files with each other. There are at least two components to a filesharing system: one that allows a user who is looking for a particular file to see if someone has it, and another that allows the file to be transferred from one machine to the other.
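To make the two components concrete, here is a toy sketch in Python – purely illustrative, not any real protocol, with all names invented:

```python
# Toy model of a filesharing peer: component 1 is search (who has the
# file?), component 2 is transfer (copy the bytes over). Illustrative
# only; no real protocol works this simply.

class Peer:
    def __init__(self, name, files):
        self.name = name
        self.files = dict(files)   # filename -> file contents
        self.neighbors = []        # other Peer objects this peer knows

    def search(self, filename):
        """Component 1: find out which known peers have the file."""
        return [p for p in self.neighbors if filename in p.files]

    def download(self, filename):
        """Component 2: transfer the file from whoever has it."""
        for source in self.search(filename):
            self.files[filename] = source.files[filename]
            return source.name
        return None

alice = Peer("alice", {"song.mp3": b"..."})
bob = Peer("bob", {})
bob.neighbors.append(alice)
print(bob.download("song.mp3"))    # -> "alice"
```

Real systems differ in almost every detail, but the division of labor – first discover, then transfer – is the same.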
One of the most popular filesharing programs in current use is LimeWire, which uses a protocol called gnutella. Gnutella is decentralized, in the sense that neither the search nor the exchange of files requires any central server. It is possible, therefore, for people to exchange copyrighted files – in violation of the law – without creating any log of the search or exchange in a central repository.
The gnutella protocol was originally created by developers from Nullsoft, the company that developed the popular music player WinAmp, shortly after Nullsoft was acquired by AOL. AOL was at that time merging with Time Warner, a huge media company, and so the idea that it would be distributing a filesharing client was quite unamusing to management. Work was immediately discontinued; however, the source for the client and the implementation of the protocol had already been released under the GPL, and so development continued elsewhere. LimeWire made improvements to both the protocol and the interface, and its client became quite popular.
The decentralized structure of filesharing does not serve a technical purpose. In general, centralized searching is simpler, quicker and more efficient, and so, for example, to search the web we use Google or Yahoo, which are gigantic repositories. In filesharing, the decentralized search structure instead serves a legal purpose: to diffuse responsibility so that no particular individual or organization can be held accountable for promoting the illegal copying of copyrighted materials. At the time the original development was going on, the Napster case – in which the first successful filesharing service was being sued by the record labels – was in the news. That case ended a few months later with Napster being shut down, as the US courts held it (a centralized search repository) responsible for the copyright-infringing file sharing its users were doing.
Whatever their legal or technical advantages, decentralized networks, by virtue of their openness, are vulnerable to a common problem: spam. For example, because anyone may send anyone else an e-mail, we are all subject to a deluge of messages trying to sell us penny stocks and weight-loss remedies. Filesharing too is subject to this sort of cheating. If someone is looking for, say, Rihanna’s recording Disturbia, and downloads an mp3 file that purports to be such, what’s to stop a spammer from instead serving a file with an audio ad for a Canadian pharmacy?
Spammers on the filesharing networks, however, have more than just the usual commercial motivations in mind. In general, there are four categories of fake files that find their way onto the network.
- Commercial spam
- Pornography and Ads for Pornography
- Viruses and trojans
- Spoof files
The last of these has no real analogue to anything people receive in e-mail. It works as follows: if, for example, Rihanna’s record label wants to prevent you from downloading Disturbia, they might hire a company called MediaDefender. MediaDefender’s business is to put as many spoof files as possible on gnutella that purport to be Disturbia, but instead contain useless noise. If MediaDefender can succeed in flooding the network so that the real Disturbia is a needle in a haystack, then the record label has thwarted gnutella’s users from violating their copyright.
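Some back-of-envelope arithmetic shows why flooding works. If each genuine copy of a file is mixed in with N spoofs, and a user picks search results at random, only one download in N + 1 is real (the ratios below are invented for illustration):

```python
# Illustrative only -- the spoof-to-real ratios here are made up.
for spoofs in (9, 99, 999):
    p_real = 1 / (spoofs + 1)
    print(f"{spoofs} spoofs per real copy -> "
          f"{p_real:.1%} chance a random download is genuine")
```

At a thousand spoofs per real copy, almost every download attempt wastes the user’s time, which is the whole point of the tactic.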
Since people are still using filesharing, clearly a workable solution has been found to the problem of spoof files. In tomorrow’s post, I discuss this solution, and in the following post, I suggest its legal ramifications.
I suggest that the best way of limiting it would be to abolish copyright, as then the availability of all files would increase no end, and the bandwidth taken up by file-sharing would reduce considerably.
Yes. It’s called “abolition of copyright”, and the eventual end of centralization on the net.
Domain names, DNS, and registrars will go the way of the dodo, because search will find content (and even interactive services) by IP address. Many interactive services can use functional programming to be distributed, with needed data existing on the client machine. Some will need some notion of centralized data, but BitTorrent-like technology can create a vast distributed filesystem, parts encrypted, whose content is updatable, but also versioned and backed up out the wazoo.
Much will become obsolete in the next 30 years. Within ten, copyright and all forms of artificial scarcity of information. Within 20, as it’s increasingly cheap to “fab” fairly complex devices and tools in the home, patents land in the crosshairs of Napsterization. Within 30, patents have also met the fate of the dinosaurs, and scarcity is largely restricted to raw materials, bandwidth, and human expertise. Mining, bandwidth-provision, and knowledge work will thrive, along with services. Japanese automakers will have met the same fate by then as their American brethren face now.
After that the crystal ball kind of fogs up because of strong AI.
Mitch: What Tom suggests is that spammers are deliberately sending legit-looking spam in order to reduce the effectiveness of statistical spam filtering. I find it more plausible that using particularly plausible and interest-grabbing subject lines is meant to fool human readers – these days a lot of spam uses subjects related to job insecurity and workplace impositions (e.g. subjects suggesting demands from management). Some time ago there were waves of spam with subjects masquerading as news soundbites or celebrity gossip. But certainly this will degrade statistical filtering, at least as a side effect. And even my counterpoint to his thesis falls into the same email-related category that he highlights.
The way spam filtering works, e.g. in Mozilla-family email clients, is by deriving feature vectors from email content and metadata (header fields, presence of media types such as embedded images or HTML content, etc.), which are matched against fuzzy sets of positive/negative samples. The samples are seeded and adapted by the user marking unfiltered messages as spam, and/or reviewing filtered false positives and marking them as non-spam. Without going into much technical detail, it should be immediately plausible that loading up messages with plausibly legit content/features will dilute the value of the user annotations and introduce overlap between positives and negatives, reducing discrimination effectiveness. The important central concept is that the tool does not know which parts of the messages are the spam markers (and that’s a subjective criterion to begin with) – it figures out how positives and (presumed) negatives differ, and builds a set of discriminating factors from that. Injecting pseudo-legit “noise” and cloaking spam content, for example by hiding it in images or obscure HTML encodings, will impede this analysis. It may also dissuade busy human readers who receive similar “legit” messages from marking the spam as spam, or lead them to mark legit messages as spam accidentally, all of which will degrade the sample database.
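For readers who want the flavor of this, here is a stripped-down sketch of the word-frequency (“Bayesian”) scoring described above – real clients use far richer feature vectors and header metadata, and every token and count below is made up:

```python
import math
from collections import Counter

# Token counts accumulated from user-labeled messages.
spam_counts, ham_counts = Counter(), Counter()

def train(tokens, is_spam):
    (spam_counts if is_spam else ham_counts).update(tokens)

def spamminess(tokens):
    # Compare each token's likelihood under the two classes, with +1
    # smoothing so unseen tokens don't zero anything out.
    s_total = sum(spam_counts.values()) + 1
    h_total = sum(ham_counts.values()) + 1
    score = 0.0
    for t in tokens:
        p_spam = (spam_counts[t] + 1) / s_total
        p_ham = (ham_counts[t] + 1) / h_total
        score += math.log(p_spam / p_ham)
    return score  # > 0 leans spam, < 0 leans ham

train("cheap pills wire transfer".split(), is_spam=True)
train("meeting agenda attached thanks".split(), is_spam=False)

print(spamminess("cheap pills".split()))                 # clearly positive
print(spamminess("cheap pills meeting agenda".split()))  # diluted toward 0
```

Padding the spam with office-sounding tokens pulls its score back toward neutral – exactly the dilution effect described above.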
In other words, there is an email spam equivalent of fluffing up the channel with noise, only it’s not “fake spam” but “fake legitimacy”.
I agree that Bayesian filter poisoning is a different kind of e-mail spam, but it is not analogous to a spoof file on gnutella, at least in the sense relevant to the discussion here. The spoof file is actually the legitimate copyright holder protecting his/her copyrighted file by trying to bury the real file in a pile of spoofs. The spoofer is not trying (as all e-mail spammers are) to get you to read (or hear) something. He is trying to *prevent* you from finding something on gnutella that you don’t have the legal right to get.
The filter-poisoning e-mail serves no legitimate purpose. The spoof file *does*, since it is the copyright holder’s attempt to protect a copyright.
Sure. I was arguing under the premise that we are discussing a technical analogy, not a legal one.
With your claim that MediaDefender’s spoofed files serve a “legitimate purpose”, you’ve just lost all credibility.
Besides their overzealous spoofing, there’s the simple fact that instead of taking legal actions, they are in essence engaging in vandalism as a form of “self help”. Imagine an anti-drug crusader sending agents to every street corner to sell severely poisonous substances as “drugs” to deter drug use, or having goons jackhammer to pieces every sidewalk in every city thereby rendering the streets completely useless to pedestrians, not just drug dealers/users. And remember, these are private organizations perpetrating thuggery on the part of big, profit-focused corporations — hardly “good Samaritans”. The behavior of the entire recording industry is disturbingly Mafia-like. Why not just try to compete legitimately? Studies have shown that filesharing has no impact on album sales anyway! The whole mess is ludicrous.
Actually Mozilla calls it “Junk filtering”, a subtly different notion from spam, denoting generalized categories of undesired correspondence.
The decentralized structure of filesharing does not serve a technical purpose.
Let me add my disagreement to those above. Decentralization certainly can and does serve a technical purpose in many file sharing applications. I have spoken to people who ran popular centralized P2P file sharing index servers (OpenNap, etc.), and they uniformly tell me that decentralization was crucial to the long-term success of a global P2P network. In fact, Napster itself moved toward decentralizing its indices because it could not manage the number of connections it was seeing.
In short, the development of decentralized solutions cannot be said to have “no technical purpose.” Whether it may also have had a legal purpose is another matter, but there’s certainly nothing wrong with trying to design your systems so that they comply with the law.
I don’t know what Napster was doing, but making a single repository (or a set of repositories run by a central authority) scale vertically is much easier than constructing the decentralized search methodology implemented by gnutella. In a decentralized network like gnutella, a significant fraction of the network traffic is the search messages being passed around. If you are looking for a file that is on only one host, somehow the search message has to reach that very host or you won’t find what you’re looking for. It is rather a technical tour-de-force that they got gnutella to scale as well as it does.
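To see why, here is a toy model of flood search with a TTL (far simpler than real gnutella, which tracks message IDs to suppress duplicates and has since added ultrapeers; the network below is invented):

```python
# peer -> (files it holds, peers it knows about); entirely made up.
network = {
    "a": ({"x.mp3"}, ["b"]),
    "b": (set(), ["a", "c"]),
    "c": (set(), ["b", "d"]),
    "d": ({"rare.mp3"}, ["c"]),
}

def flood_search(start, filename, ttl):
    """Breadth-first flood: ask peers hop by hop until the TTL runs out."""
    seen, frontier, hits = {start}, [start], []
    for _ in range(ttl + 1):          # hop 0 is the starting peer itself
        next_frontier = []
        for peer in frontier:
            files, neighbors = network[peer]
            if filename in files:
                hits.append(peer)
            for n in neighbors:
                if n not in seen:
                    seen.add(n)
                    next_frontier.append(n)
        frontier = next_frontier
    return hits

print(flood_search("a", "rare.mp3", ttl=2))   # [] -- the one host is too far
print(flood_search("a", "rare.mp3", ttl=3))   # ['d']
```

A file held by a single distant host is simply missed once the TTL runs out, and raising the TTL multiplies the traffic – which is why getting this to scale was hard.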
Another technical reason for decentralizing anything, besides spreading costs, is to reduce or eliminate single points of failure and various kinds of bottlenecks in the system, making it more robust and scalable.
I’m not quite sure how to take this comment.
“The decentralized structure of filesharing does not serve a technical purpose.”
Are you saying there is no technical reason for decentralized filesharing? In other words, you don’t think the development of filesharing was motivated by technical issues? (Truthfully, I’m not sure there is any such thing as “technical purpose”. Everything we do is for human purposes, but at any rate…)
I haven’t studied filesharing technologies much, but I thought one technical problem they were trying to overcome was the need for expensive central servers with copious available bandwidth when trying to provide popular content. You already have many clients at the edges of the network. These clients can actually act as servers, but for various reasons that I won’t attempt to get into, most of these machines act only as clients or “consumers” (to use a much-abused term) that request services or content from central servers. Filesharing, among other things, was an attempt to change that, so that any client could act as a server (and vice versa) and the bandwidth load could be distributed across the internet by localizing most of the traffic for popular content (i.e. your neighbors serve you content that you would otherwise have to stream from a central server).
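As a rough sketch of that bandwidth-spreading idea (invented, and vastly simpler than BitTorrent or any real swarm protocol): split the file into chunks, and let any peer that already holds a chunk serve it onward:

```python
import random
from collections import Counter

CHUNKS = set(range(8))          # the file, as 8 chunk IDs (made up)
peers = {"origin": set(CHUNKS), **{f"p{i}": set() for i in range(5)}}
served = Counter()              # uploads performed by each peer

random.seed(0)
while any(have != CHUNKS for have in peers.values()):
    for name, have in peers.items():
        missing = CHUNKS - have
        if not missing:
            continue
        want = random.choice(sorted(missing))
        # download from any peer that already holds the chunk
        sources = [p for p, h in peers.items() if p != name and want in h]
        served[random.choice(sources)] += 1
        have.add(want)

print(served)   # uploads end up spread across peers, not just "origin"
```

After the first few exchanges, most uploads come from peers rather than the origin server, so the central host no longer needs copious bandwidth.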
As an example, aren’t some PC game developers using these techniques for distributing their often massive patches? That seems like a technical purpose to me.
(Of course, I don’t mean to imply that filesharing isn’t being used for copyright infringement. I merely wish to point out that the motivations for the development of these technologies did involve some technical issues.)
Let me answer this and the surrounding two comments as follows:
The particular point I was making was with regard to *search* under filesharing. I hope this is clear when the sentence is read in context. The point is that Napster had a better and quicker search engine than gnutella. The problem it had was legal, not technical, and the development of gnutella proceeded from that point.
Bullshit. For one, decentralization distributes costs, resource utilization and authority, thus making it trivial for communities to form communication networks. Having to get a group to coordinate to rent a server poses several almost certainly fatal problems: How do you organize people to pay money? Will you have enough people to cover minimum costs? How do you send and collect the money safely? Who will be the legal “owner” of the server? How can you trust them? Why should they have more power than anyone else? Etc. It is just too much goddamn trouble. For proof, note the relative absence of communal, ruler-free networks. Joining a decentralized network, on the other hand, is trivial – essentially frictionless, requiring essentially no human coordination – and, once there, no one lords over anyone else.
I would never have thought of spam being involved in filesharing, although I would expect viruses to be involved, which is why I would never allow anyone to access my computer like that. I’ve used Limewire before and did have problems with advertisements in music downloads.
I think you may have meant that they are repositories of indices. True, Google does cache the results of its web crawling (and so becomes a repository of content), but it’s only a side-effect of its technical implementation of searching.
For the most part, once a Google search has found a relevant link, people use that link to navigate to the site hosting the content. It is only in unusual circumstances that people use the cached Google content (to retrieve a document that may have been altered on the original site, or to retrieve a document that is no longer available at all from the original site).
–Bob.
You are correct that when you view the actual content you don’t go to Google directly, and so it is a repository of indices. The comparison is with gnutella, where when you search you aren’t going to anything centralized. You’re broadcasting a request to the other clients on the network, asking all of them whether they have the file you are looking for. That is centralized vs. decentralized.
Given file-sharing is precisely about freedom-of-speech, then surely it is a contradiction in terms to pursue a means of limiting it?
I suggest that the best way of limiting it would be to abolish copyright, as then the availability of all files would increase no end, and the bandwidth taken up by file-sharing would reduce considerably.
The discussion here is not intended to be so radical as to cover the merits or lack thereof of copyright. I take as an assumption that the current regime is going to continue. In that case, it will remain illegal to exchange files that are under copyright (without the owner’s permission). Given that the courts are going to try to limit filesharing, the question is: how should they do it? If, in the process, we all lose our freedom of speech, I think (and I am sure you’ll agree) that is a real problem. Given some of the things the Content Industry is asking for, this is not an unlikely outcome. (It’s much more likely than copyright being abolished!)
Seems to me that the “porn and ads for porn” is just a subset of “commercial spam”. That is, “commercial spam” is simply advertising, and any fake content that turns out to be porn is really just some kind of commercial advertising for the most part. (Of course, some of the content that users intend to share over the network may in fact also be porn, but that wouldn’t be a “fake file”).
I suppose the one complicating factor is that porn is also used to lure people into downloading malware (“viruses and trojans”). In that respect, the “porn” category actually overlaps (intersects) the two other categories. But IMHO that’s just that much more of a reason it’s not really a category unto itself.
I see three categories: fake files for the purpose of advertising (“commercial spam”), for the purpose of distributing malware (“viruses and trojans”), and for the purpose of littering the network with fake content (“spoof files”).
Interestingly, all of those categories can be demonstrated as furthering some commercial interest. That is, they all exist for the purpose of making money. So at some level, there’s just one category. 🙂
Sorry if this was not clear. What I had in mind was a distinction between people trying to sell the usual junk – diets, penny stocks, Canadian drugs, etc. – vs. pornographic files. As you point out, the latter are often trying to sell subscriptions to porn sites.
At any rate, it is the fourth category, spoof files, that is relevant to the discussion as it will develop. There is nothing analogous to them in e-mail spam.
There is nothing analogous to them in e-mail spam.
Is Bayesian poisoning analogous enough? Sending spam solely for the purpose of degrading a filtering technique seems very similar to seeding files solely for the purpose of degrading a searching technique.
What I meant is that the people who are spoofing files on filesharing networks are trying to keep you from finding the real file of the same name. When an e-mail spammer is sending you something, he is trying to get you to read his message, not keep you from finding another one. That is the sense in which spoofing is distinct from anything in e-mail.