July 3, 2022

Archives for January 2010

Census of Files Available via BitTorrent

BitTorrent is popular because it lets anyone distribute large files at low cost. Which kinds of files are available on BitTorrent? Sauhard Sahi, a Princeton senior, decided to find out. Sauhard’s independent work last semester, under my supervision, set out to measure what was available on BitTorrent. This post, summarizing his results, was co-written by Sauhard and me.

Sauhard chose a (uniform) random sample of files available via the trackerless variant of BitTorrent, using the Mainline DHT. The sample comprised 1021 files. He classified the files in the sample by file type, language, and apparent copyright status.

Before describing the results, we need to offer two caveats. First, the results apply only to the Mainline trackerless BitTorrent system that we surveyed. Other parts of the BitTorrent ecosystem might be different. Second, all files that were available were equally likely to appear in the sample — the sample was not weighted by number of downloads, and it probably contains files that were never downloaded at all. So we can’t say anything about the characteristics of BitTorrent downloads, or even of files that are downloaded via BitTorrent, only about files that are available on BitTorrent.

With that out of the way, here’s what Sauhard found.

File types

46% movies and shows (non-pornographic)
14% games and software
14% pornography
10% music
1% books and guides
1% images
14% could not classify


For the movies and shows category, the predominant file format was AVI, and other formats included RMVB (a proprietary format for RealPlayer), MPEG, raw DVD, and some multi-part RAR archives. Interestingly, this section was heavily biased towards recent movies, instead of being spread out evenly over a number of years. In descending order of frequency, we found that 60% of the randomly selected movies and shows were in English, 8% were in Spanish, 7% were in Russian, 5% were in Polish, 5% were in Japanese, 4% were in Chinese, 4% could not be determined, 3% were in French, 1% were in Italian, and other infrequent languages accounted for 2% of the distribution.


For the games and software category, there was no clearly dominant file type, but common file types for software included ISO disc images, multi-part RAR archives, and EXE (Windows executables). The games were targeted for running on different architectures, such as the XBOX 360, Nintendo Wii, and Windows PC’s. In descending order, we found that 74% of games and software in the sample were in English, 12% were in Japanese, 5% were in Spanish, 4% were in Chinese, 2% were in Polish, and 1% were in Russian and French each.


For the pornography category, the predominant encoding format was AVI, similar to the movies category. However, there were significantly more MPG and WMV (Windows Media Video) files available. Also, most pornography torrents included the full pornographic video, a sample of the video (a 1-5 minute extract of the video), as well as posters or images of the porn stars in JPEG format. Also, as these videos are not typically dated like movies are, it is difficult to make any remarks regarding the recency bias for pornographic torrents. Our assumption would be that demand for pornography is not as time-sensitive as demand for movies, so it is likely that these pornographic videos constitute a broader spectrum of time than the movies do. In descending order, we found that 53% of pornography in our sample was in English, 16% was in Chinese, 15% was in Japanese, 6% was in Russian, 3% was in German, 2% was in French, 2% was unclassifiable, and Italian, Hindi, and Spanish appeared infrequently (1% each).


For the music category, the predominant encoding format for music was MP3, there were some albums ripped to WMA (Windows Media Audio, a Microsoft codec), and there were also ISO images and multi-part RAR archives. There is still a bias towards recent albums and songs, but it is not as strongly evident as it is for movies—perhaps because people are more willing to continue seeding music even after it is no longer new, so these torrents are able to stay alive longer in the DHT. In descending order, we found that 78% of music torrents in our sample were in English, 6% were in Russian, 4% were in Spanish, 2% were in Japanese and Chinese each, and other infrequent languages appeared 1% each.


The books/guides and images categories were fairly minor. We classified 15 torrents under books and guides—13 were in English, 1 was in French, and 1 was in Russian. We classified 3 image torrents—one was a set of national park wallpapers, one was a set of pictures of BMW cars (both of these are English), and one was a Japanese comic strip.

Apparent Copyright Infringement

Our final assessment involved determining whether or not each file seemed likely to be copyright-infringing. We classified a file as likely non-infringing if it appeared to be (1) in the public domain, (2) freely available through legitimate channels, or (3) user-generated content. These were judgment calls on our part, based on the contents of the files, together with some external research.

By this definition, all of the 476 movies or TV shows in the sample were found to be likely infringing. We found seven of the 148 files in the games and software category to be likely non-infringing—including two Linux distributions, free plug-in packs for games, as well as free and beta software. In the pornography category, one of the 145 files claimed to be an amateur video, and we gave it the benefit of the doubt as likely non-infringing. All of the 98 music torrents were likely infringing. Two of the fifteen files in the books/guides category seemed to be likely non-infringing.

Overall, we classified ten of the 1021 files, or approximately 1%, as likely non-infringing, This result should be interpreted with caution, as we may have missed some non-infringing files, and our sample is of files available, not files actually downloaded. Still, the result suggests strongly that copyright infringement is widespread among BitTorrent users.

A Free Internet, If We Can Keep It

“We stand for a single internet where all of humanity has equal access to knowledge and ideas. And we recognize that the world’s information infrastructure will become what we and others make of it. ”

These two sentences, from Secretary of State Clinton’s groundbreaking speech on Internet freedom, sum up beautifully the challenge facing our Internet policy. An open Internet can advance our values and support our interests; but we will only get there if we make some difficult choices now.

One of these choices relates to anonymity. Will it be easy to speak anonymously on the Internet, or not? This was the subject of the first question in the post-speech Q&A:

QUESTION: You talked about anonymity on line and how we have to prevent that. But you also talk about censorship by governments. And I’m struck by – having a veil of anonymity in certain situations is actually quite beneficial. So are you looking to strike a balance between that and this emphasis on censorship?

SECRETARY CLINTON: Absolutely. I mean, this is one of the challenges we face. On the one hand, anonymity protects the exploitation of children. And on the other hand, anonymity protects the free expression of opposition to repressive governments. Anonymity allows the theft of intellectual property, but anonymity also permits people to come together in settings that gives them some basis for free expression without identifying themselves.

None of this will be easy. I think that’s a fair statement. I think, as I said, we all have varying needs and rights and responsibilities. But I think these overriding principles should be our guiding light. We should err on the side of openness and do everything possible to create that, recognizing, as with any rule or any statement of principle, there are going to be exceptions.

So how we go after this, I think, is now what we’re requesting many of you who are experts in this area to lend your help to us in doing. We need the guidance of technology experts. In my experience, most of them are younger than 40, but not all are younger than 40. And we need the companies that do this, and we need the dissident voices who have actually lived on the front lines so that we can try to work through the best way to make that balance you referred to.

Secretary Clinton’s answer is trying to balance competing interests, which is what good politicians do. If we want A, and we want B, and A is in tension with B, can we have some A and some B together? Is there some way to give up a little A in exchange for a lot of B? That’s a useful way to start the discussion.

But sometimes you have to choose — sometimes A and B are profoundly incompatible. That seems to be the case here. Consider the position of a repressive government that wants to spy on a citizen’s political speech, as compared to the position of the U.S. government when it wants to eavesdrop on a suspect’s conversations under a valid search warrant. The two positions are very different morally, but they are pretty much the same technologically. Which means that either both governments can eavesdrop, or neither can. We have to choose.

Secretary Clinton saw this tension, and, being a lawyer, she saw that law could not resolve it. So she expressed the hope that technology, the aspect she understood least, would offer a solution. This is a common pattern: Given a difficult technology policy problem, lawyers will tend to seek technology solutions and technologists will tend to seek legal solutions. (Paul Ohm calls this “Felten’s Third Law”.) It’s easy to reject non-solutions in your own area because you have the knowledge to recognize why they will fail; but there must be a solution lurking somewhere in the unexplored wilderness of the other area.

If we’re forced to choose — and we will be — what kind of Internet will we have? In Secretary Clinton’s words, “the world’s information infrastructure will become what we and others make of it.” We’ll have a free Internet, if we can keep it.

No Warrant Necessary to Seize Your Laptop

The U.S. Customs may search your laptop and copy your hard drive when you cross the border, according to their policy. They may do this even if they have no particularized suspicion of wrongdoing on your part. They claim that the Fourth Amendment protection against warrantless search and seizure does not apply. The Customs justifies this policy on the grounds that “examinations of documents and electronic devices are a crucial tool for detecting information concerning” all sorts of bad things, including terrorism, drug smuggling, contraband, and so on.

Historically the job of Customs was to control the flow of physical goods into the country, and their authority to search you for physical goods is well established. I am certainly not a constitutional lawyer, but to me a Customs exemption from Fourth Amendment restrictions is more clearly justified for physical contraband than for generalized searches of information.

The American Civil Liberties Union is gathering data about how this Customs enforcement policy works in practice, and they request your help. If you’ve had your laptop searched, or if you have altered your own practices to protect your data when crossing the border, staff attorney Catherine Crump would be interested in hearing about it.

Meanwhile, the ACLU has released a stack of documents they got by FOIA request.
The documents are here, and their spreadsheets analyzing the data are here. They would be quite interested to know what F-to-T readers make of these documents.

ACLU Queries for F-to-T readers:
If the answer to any of the questions below is yes, please briefly describe your experience and e-mail your response to laptopsearch at aclu.org. The ACLU promises confidentiality to anyone responding to this request.
(1) When entering or leaving the United States, has a U.S. official ever examined or browsed the contents of your laptop, PDA, cell phone, or other electronic device?

(2) When entering or leaving the United States, has a U.S. official ever detained your laptop, PDA, cell phone, or other electronic device?

(3) In light of the U.S. government’s policy of conducting suspicionless searches of laptops and other electronic devices, have you taken extra steps to safeguard your electronic information when traveling internationally, such as using encryption software or shipping a hard drive ahead to your destination?

(4) Has the U.S. government’s policy of conducting suspicionless searches of laptops and other electronic devices affected the frequency with which you travel internationally or your willingness to travel with information stored on electronic devices?