May 30, 2024

Archives for January 2010

Census of Files Available via BitTorrent

BitTorrent is popular because it lets anyone distribute large files at low cost. Which kinds of files are available on BitTorrent? Sauhard Sahi, a Princeton senior, decided to find out. Sauhard’s independent work last semester, under my supervision, set out to measure what was available on BitTorrent. This post, summarizing his results, was co-written by Sauhard and me.

Sauhard chose a (uniform) random sample of files available via the trackerless variant of BitTorrent, using the Mainline DHT. The sample comprised 1021 files. He classified the files in the sample by file type, language, and apparent copyright status.

Before describing the results, we need to offer two caveats. First, the results apply only to the Mainline trackerless BitTorrent system that we surveyed. Other parts of the BitTorrent ecosystem might be different. Second, all files that were available were equally likely to appear in the sample — the sample was not weighted by number of downloads, and it probably contains files that were never downloaded at all. So we can’t say anything about the characteristics of BitTorrent downloads, or even of files that are downloaded via BitTorrent, only about files that are available on BitTorrent.

With that out of the way, here’s what Sauhard found.

File types

46% movies and shows (non-pornographic)
14% games and software
14% pornography
10% music
1% books and guides
1% images
14% could not classify


For the movies and shows category, the predominant file format was AVI, and other formats included RMVB (a proprietary format for RealPlayer), MPEG, raw DVD, and some multi-part RAR archives. Interestingly, this section was heavily biased towards recent movies, instead of being spread out evenly over a number of years. In descending order of frequency, we found that 60% of the randomly selected movies and shows were in English, 8% were in Spanish, 7% were in Russian, 5% were in Polish, 5% were in Japanese, 4% were in Chinese, 4% could not be determined, 3% were in French, 1% were in Italian, and other infrequent languages accounted for 2% of the distribution.


For the games and software category, there was no clearly dominant file type, but common file types for software included ISO disc images, multi-part RAR archives, and EXE (Windows executables). The games targeted a range of platforms, such as the Xbox 360, Nintendo Wii, and Windows PCs. In descending order, we found that 74% of games and software in the sample were in English, 12% were in Japanese, 5% were in Spanish, 4% were in Chinese, 2% were in Polish, and 1% each were in Russian and French.


For the pornography category, the predominant encoding format was AVI, similar to the movies category. However, there were significantly more MPG and WMV (Windows Media Video) files available. Most pornography torrents included the full pornographic video, a sample of the video (a 1-5 minute extract), and posters or images of the porn stars in JPEG format. Because these videos are not typically dated the way movies are, it is difficult to say anything about recency bias for pornographic torrents. Our assumption would be that demand for pornography is not as time-sensitive as demand for movies, so these videos likely span a broader range of release dates than the movies do. In descending order, we found that 53% of pornography in our sample was in English, 16% was in Chinese, 15% was in Japanese, 6% was in Russian, 3% was in German, 2% was in French, 2% was unclassifiable, and Italian, Hindi, and Spanish appeared infrequently (1% each).


For the music category, the predominant encoding format was MP3. There were some albums ripped to WMA (Windows Media Audio, a Microsoft codec), and there were also ISO images and multi-part RAR archives. There is still a bias towards recent albums and songs, but it is not as strongly evident as it is for movies, perhaps because people are more willing to continue seeding music even after it is no longer new, so these torrents are able to stay alive longer in the DHT. In descending order, we found that 78% of music torrents in our sample were in English, 6% were in Russian, 4% were in Spanish, 2% each were in Japanese and Chinese, and other infrequent languages appeared at 1% each.


The books/guides and images categories were fairly minor. We classified 15 torrents under books and guides—13 were in English, 1 was in French, and 1 was in Russian. We classified 3 image torrents—one was a set of national park wallpapers, one was a set of pictures of BMW cars (both of these are English), and one was a Japanese comic strip.

Apparent Copyright Infringement

Our final assessment involved determining whether or not each file seemed likely to be copyright-infringing. We classified a file as likely non-infringing if it appeared to be (1) in the public domain, (2) freely available through legitimate channels, or (3) user-generated content. These were judgment calls on our part, based on the contents of the files, together with some external research.

By this definition, all of the 476 movies or TV shows in the sample were found to be likely infringing. We found seven of the 148 files in the games and software category to be likely non-infringing—including two Linux distributions, free plug-in packs for games, as well as free and beta software. In the pornography category, one of the 145 files claimed to be an amateur video, and we gave it the benefit of the doubt as likely non-infringing. All of the 98 music torrents were likely infringing. Two of the fifteen files in the books/guides category seemed to be likely non-infringing.

Overall, we classified ten of the 1021 files, or approximately 1%, as likely non-infringing. This result should be interpreted with caution, as we may have missed some non-infringing files, and our sample is of files available, not files actually downloaded. Still, the result suggests strongly that copyright infringement is widespread among BitTorrent users.
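For readers who want to check the arithmetic, here is a minimal sketch that tallies the per-category counts reported above and computes the overall share. The Wilson score interval at the end is not part of Sauhard's analysis; it is added here only to give a rough sense of sampling error, and it assumes the sample really was uniform and that each classification was correct.

```python
from math import sqrt

# Likely non-infringing files per category, as reported in the post.
# The post gives no separate figure for images or unclassified files.
non_infringing = {
    "movies and shows": 0,
    "games and software": 7,
    "pornography": 1,
    "music": 0,
    "books and guides": 2,
}
SAMPLE_SIZE = 1021  # total files in the sample


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


total = sum(non_infringing.values())
share = total / SAMPLE_SIZE
low, high = wilson_interval(total, SAMPLE_SIZE)

print(f"{total} of {SAMPLE_SIZE} = {share:.2%}")
print(f"approximate 95% interval: {low:.2%} to {high:.2%}")
```

The interval only captures sampling error. It says nothing about the judgment calls in classification or about what is actually downloaded, which are the larger caveats.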

A Free Internet, If We Can Keep It

“We stand for a single internet where all of humanity has equal access to knowledge and ideas. And we recognize that the world’s information infrastructure will become what we and others make of it.”

These two sentences, from Secretary of State Clinton’s groundbreaking speech on Internet freedom, sum up beautifully the challenge facing our Internet policy. An open Internet can advance our values and support our interests; but we will only get there if we make some difficult choices now.

One of these choices relates to anonymity. Will it be easy to speak anonymously on the Internet, or not? This was the subject of the first question in the post-speech Q&A:

QUESTION: You talked about anonymity on line and how we have to prevent that. But you also talk about censorship by governments. And I’m struck by – having a veil of anonymity in certain situations is actually quite beneficial. So are you looking to strike a balance between that and this emphasis on censorship?

SECRETARY CLINTON: Absolutely. I mean, this is one of the challenges we face. On the one hand, anonymity protects the exploitation of children. And on the other hand, anonymity protects the free expression of opposition to repressive governments. Anonymity allows the theft of intellectual property, but anonymity also permits people to come together in settings that gives them some basis for free expression without identifying themselves.

None of this will be easy. I think that’s a fair statement. I think, as I said, we all have varying needs and rights and responsibilities. But I think these overriding principles should be our guiding light. We should err on the side of openness and do everything possible to create that, recognizing, as with any rule or any statement of principle, there are going to be exceptions.

So how we go after this, I think, is now what we’re requesting many of you who are experts in this area to lend your help to us in doing. We need the guidance of technology experts. In my experience, most of them are younger than 40, but not all are younger than 40. And we need the companies that do this, and we need the dissident voices who have actually lived on the front lines so that we can try to work through the best way to make that balance you referred to.

Secretary Clinton’s answer is trying to balance competing interests, which is what good politicians do. If we want A, and we want B, and A is in tension with B, can we have some A and some B together? Is there some way to give up a little A in exchange for a lot of B? That’s a useful way to start the discussion.

But sometimes you have to choose — sometimes A and B are profoundly incompatible. That seems to be the case here. Consider the position of a repressive government that wants to spy on a citizen’s political speech, as compared to the position of the U.S. government when it wants to eavesdrop on a suspect’s conversations under a valid search warrant. The two positions are very different morally, but they are pretty much the same technologically. Which means that either both governments can eavesdrop, or neither can. We have to choose.

Secretary Clinton saw this tension, and, being a lawyer, she saw that law could not resolve it. So she expressed the hope that technology, the aspect she understood least, would offer a solution. This is a common pattern: Given a difficult technology policy problem, lawyers will tend to seek technology solutions and technologists will tend to seek legal solutions. (Paul Ohm calls this “Felten’s Third Law”.) It’s easy to reject non-solutions in your own area because you have the knowledge to recognize why they will fail; but there must be a solution lurking somewhere in the unexplored wilderness of the other area.

If we’re forced to choose — and we will be — what kind of Internet will we have? In Secretary Clinton’s words, “the world’s information infrastructure will become what we and others make of it.” We’ll have a free Internet, if we can keep it.

No Warrant Necessary to Seize Your Laptop

The U.S. Customs may search your laptop and copy your hard drive when you cross the border, according to their policy. They may do this even if they have no particularized suspicion of wrongdoing on your part. They claim that the Fourth Amendment protection against warrantless search and seizure does not apply. The Customs justifies this policy on the grounds that “examinations of documents and electronic devices are a crucial tool for detecting information concerning” all sorts of bad things, including terrorism, drug smuggling, contraband, and so on.

Historically the job of Customs was to control the flow of physical goods into the country, and their authority to search you for physical goods is well established. I am certainly not a constitutional lawyer, but to me a Customs exemption from Fourth Amendment restrictions is more clearly justified for physical contraband than for generalized searches of information.

The American Civil Liberties Union is gathering data about how this Customs enforcement policy works in practice, and they request your help. If you’ve had your laptop searched, or if you have altered your own practices to protect your data when crossing the border, staff attorney Catherine Crump would be interested in hearing about it.

Meanwhile, the ACLU has released a stack of documents they got by FOIA request. The documents are here, and their spreadsheets analyzing the data are here. They would be quite interested to know what F-to-T readers make of these documents.

ACLU Queries for F-to-T readers:
If the answer to any of the questions below is yes, please briefly describe your experience and e-mail your response to laptopsearch at The ACLU promises confidentiality to anyone responding to this request.
(1) When entering or leaving the United States, has a U.S. official ever examined or browsed the contents of your laptop, PDA, cell phone, or other electronic device?

(2) When entering or leaving the United States, has a U.S. official ever detained your laptop, PDA, cell phone, or other electronic device?

(3) In light of the U.S. government’s policy of conducting suspicionless searches of laptops and other electronic devices, have you taken extra steps to safeguard your electronic information when traveling internationally, such as using encryption software or shipping a hard drive ahead to your destination?

(4) Has the U.S. government’s policy of conducting suspicionless searches of laptops and other electronic devices affected the frequency with which you travel internationally or your willingness to travel with information stored on electronic devices?

Information Technology Policy in the Obama Administration, One Year In

[Last year, I wrote an essay for Princeton’s Woodrow Wilson School, summarizing the technology policy challenges facing the incoming Obama Administration. This week they published my follow-up essay, looking back on the Administration’s first year. Here it is.]

Last year I identified four information technology policy challenges facing the incoming Obama Administration: improving cybersecurity, making government more transparent, bringing the benefits of technology to all, and bridging the culture gap between techies and policymakers. On these issues, the Administration’s first-year record has been mixed. Hopes were high that the most tech-savvy presidential campaign in history would lead to an equally transformational approach to governing, but bold plans were ground down by the friction of Washington.

Cybersecurity: The Administration created a new national cybersecurity coordinator (or “czar”) position but then struggled to fill it. Infighting over the job description, reflecting differences over how to reconcile security with other economic goals, left the czar relatively powerless. Cyberattacks on U.S. interests increased as the Administration struggled to get its policy off the ground.

Government transparency: This has been a bright spot. The White House pushed executive branch agencies to publish more data about their operations, and created rules for detailed public reporting of stimulus spending. Progress has been slow — transparency requires not just technology but also cultural changes within government — but the ship of state is moving in the right direction, as the public gets more and better data about government, and finds new ways to use that data to improve public life.

Bringing technology to all: On the goal of universal access to technology, it’s too early to tell. The FCC is developing a national broadband plan, in hopes of bringing high-speed Internet to more Americans, but this has proven to be a long and politically difficult process. Obama’s hand-picked FCC chair, Julius Genachowski, inherited a troubled organization but has done much to stabilize it. The broadband plan will be his greatest challenge, with lobbyists on all sides angling for advantage as our national network expands.

Closing the culture gap: The culture gap between techies and policymakers persists. In economic policy debates, health care and the economic crisis have understandably taken center stage, but there seems to be little room even at the periphery for the innovation agenda that many techies had hoped for. The tech policy discussion seems to be dominated by lawyers and management consultants, as in past Administrations. Too often, policymakers still see techies as irrelevant, and techies still see policymakers as clueless.

In recent days, creative thinking on technology has emerged from an unlikely source: the State Department. On the heels of Google’s surprising decision to back away from the Chinese market, Secretary of State Clinton made a rousing speech declaring Internet freedom and universal access to information as important goals of U.S. foreign policy. This will lead to friction with the Chinese and other authoritarian governments, but our principles are worth defending. The Internet can be a powerful force for transparency and democratization, around the world and at home.

Software in dangerous places

Software increasingly manages the world around us, in subtle ways that are often hard to see. Software helps fly our airplanes (in some cases, particularly military fighter aircraft, software is the only thing keeping them in the air). Software manages our cars (fuel/air mixture, among other things). Software manages our electrical grid. And, closer to home for me, software runs our voting machines and manages our elections.

Sunday’s NY Times Magazine has an extended piece about faulty radiation delivery for cancer treatment. The article details two particular fault modes: procedural screwups and software bugs.

The procedural screwups (e.g., treating a patient with stomach cancer with a radiation plan intended for somebody else’s breast cancer) are heartbreaking because they’re something that could be completely eliminated through fairly simple mechanisms. How about putting barcodes on patient armbands that are read by the radiation machine? “Oops, you’re patient #103 and this radiation plan is loaded for patient #319.”
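To make the armband idea concrete, here is a minimal sketch of the check I have in mind. Everything in it is hypothetical: the names `PatientMismatchError`, `check_patient`, and `start_treatment` are made up for illustration, and I have no idea how any real radiation machine represents patients or plans.

```python
class PatientMismatchError(Exception):
    """Raised when the scanned armband does not match the loaded plan."""


def check_patient(armband_id: int, plan_patient_id: int) -> None:
    """Refuse to continue unless the armband and the plan agree."""
    if armband_id != plan_patient_id:
        raise PatientMismatchError(
            f"Oops, you're patient #{armband_id} and this radiation plan "
            f"is loaded for patient #{plan_patient_id}."
        )


def start_treatment(armband_id: int, plan_patient_id: int) -> str:
    """Hypothetical entry point: the identity check runs before anything else."""
    check_patient(armband_id, plan_patient_id)
    return "treatment may proceed"


if __name__ == "__main__":
    try:
        start_treatment(armband_id=103, plan_patient_id=319)
    except PatientMismatchError as err:
        print(err)
```

The point of the design is that the check fails closed: a mismatch raises an error that the operator cannot ignore, instead of logging a warning and carrying on.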

The software bugs are another matter entirely. Supposedly, medical device manufacturers, and software correctness people, have all been thoroughly indoctrinated in the history of Therac-25, a radiation machine from the mid-1980s whose poor software engineering (and user interface design) directly led to several deaths. This article seems to indicate that those lessons were never properly absorbed.

What’s perhaps even more disturbing is that nobody seems to have been deeply bothered when the radiation planning software crashed on them! Did it save their work? Maybe you should double check? Ultimately, the radiation machine just does what it’s told, and the software that plans out the precise dosing pattern is responsible for getting it right. Well, if that software is unreliable (which the article clearly indicates), you shouldn’t use it again until it’s fixed!

What I’d like to know more about, and which the article didn’t discuss at all, is what engineering processes, third-party review processes, and certification processes were used. If there’s anything we’ve learned about voting systems, it’s that the federal and state certification processes were not up to the task of identifying security vulnerabilities, and that the vendors had demonstrably never intended their software to resist the sorts of attacks that you would expect on an election system. Instead, we’re told that we can rely on poll workers following procedures correctly. Which, of course, is exactly what the article indicates is standard practice for these medical devices. We’re relying on the device operators to do the right thing, even when the software is crashing on them, and that’s clearly inappropriate.

Writing “correct” software, and further ensuring that it’s usable, is a daunting problem. In the voting case, we can at least come up with procedures based on auditing paper ballots, or using various cryptographic techniques, that allow us to detect and correct flaws in the software (although getting such procedures adopted is a daunting problem in its own right, but that’s a story for another day). In the aviation case, which I admit to not knowing much about, I do know they put in sanity-checking software that will detect when the more detailed algorithms are asking for something insane and will override it. For medical devices like radiation machines, we clearly need a similar combination of mechanisms, both to ensure that operators don’t make avoidable mistakes, and to ensure that the software they’re using is engineered properly.
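As a rough illustration of what a sanity-checking layer might look like, here is a small sketch: a simple, independent check that sits between the complicated planning software and the machine, and blocks any request that falls outside a fixed envelope. All names (`BeamRequest`, `sanity_check`, `guarded_deliver`) and all numeric limits are invented for this example. They are not clinical values, and this is not a description of how any real aviation or medical system works.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class BeamRequest:
    dose_gray: float            # dose requested by the planning software
    field_open_fraction: float  # 0.0 = fully shielded, 1.0 = fully open


# Illustrative limits only; real limits would have to come from
# clinical physicists, not from a blog post.
MAX_DOSE_GRAY = 3.0
MAX_OPEN_FRACTION = 0.8


def sanity_check(request: BeamRequest) -> List[str]:
    """Return a list of problems; an empty list means the request looks sane."""
    problems = []
    # Written so that NaN fails the comparison and is flagged too.
    if not (0 < request.dose_gray <= MAX_DOSE_GRAY):
        problems.append(f"dose {request.dose_gray} Gy outside (0, {MAX_DOSE_GRAY}]")
    if not (0 <= request.field_open_fraction <= MAX_OPEN_FRACTION):
        problems.append(
            f"field opening {request.field_open_fraction} outside [0, {MAX_OPEN_FRACTION}]"
        )
    return problems


def guarded_deliver(request: BeamRequest, deliver: Callable[[BeamRequest], None]) -> str:
    """Call deliver() only if the independent check passes."""
    problems = sanity_check(request)
    if problems:
        return "BLOCKED: " + "; ".join(problems)
    deliver(request)
    return "delivered"


if __name__ == "__main__":
    delivered = []
    print(guarded_deliver(BeamRequest(2.0, 0.5), delivered.append))
    print(guarded_deliver(BeamRequest(40.0, 1.0), delivered.append))
```

The checker is deliberately much dumber than the software it guards. It knows nothing about treatment planning, which is what makes it small enough to review thoroughly.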