March 26, 2017

Census of Files Available via BitTorrent

BitTorrent is popular because it lets anyone distribute large files at low cost. Which kinds of files are available on BitTorrent? Sauhard Sahi, a Princeton senior, decided to find out. Sauhard’s independent work last semester, under my supervision, set out to measure what was available on BitTorrent. This post, summarizing his results, was co-written by Sauhard and me.

Sauhard chose a (uniform) random sample of files available via the trackerless variant of BitTorrent, using the Mainline DHT. The sample comprised 1021 files. He classified the files in the sample by file type, language, and apparent copyright status.

Before describing the results, we need to offer two caveats. First, the results apply only to the Mainline trackerless BitTorrent system that we surveyed. Other parts of the BitTorrent ecosystem might be different. Second, all files that were available were equally likely to appear in the sample — the sample was not weighted by number of downloads, and it probably contains files that were never downloaded at all. So we can’t say anything about the characteristics of BitTorrent downloads, or even of files that are downloaded via BitTorrent, only about files that are available on BitTorrent.

With that out of the way, here’s what Sauhard found.

File types

46% movies and shows (non-pornographic)
14% games and software
14% pornography
10% music
1% books and guides
1% images
14% could not classify


For the movies and shows category, the predominant file format was AVI, and other formats included RMVB (a proprietary format for RealPlayer), MPEG, raw DVD, and some multi-part RAR archives. Interestingly, this section was heavily biased towards recent movies, instead of being spread out evenly over a number of years. In descending order of frequency, we found that 60% of the randomly selected movies and shows were in English, 8% were in Spanish, 7% were in Russian, 5% were in Polish, 5% were in Japanese, 4% were in Chinese, 4% could not be determined, 3% were in French, 1% were in Italian, and other infrequent languages accounted for 2% of the distribution.


For the games and software category, there was no clearly dominant file type, but common file types for software included ISO disc images, multi-part RAR archives, and EXE (Windows executables). The games were targeted for running on different architectures, such as the XBOX 360, Nintendo Wii, and Windows PC’s. In descending order, we found that 74% of games and software in the sample were in English, 12% were in Japanese, 5% were in Spanish, 4% were in Chinese, 2% were in Polish, and 1% were in Russian and French each.


For the pornography category, the predominant encoding format was AVI, similar to the movies category. However, there were significantly more MPG and WMV (Windows Media Video) files available. Also, most pornography torrents included the full pornographic video, a sample of the video (a 1-5 minute extract of the video), as well as posters or images of the porn stars in JPEG format. Also, as these videos are not typically dated like movies are, it is difficult to make any remarks regarding the recency bias for pornographic torrents. Our assumption would be that demand for pornography is not as time-sensitive as demand for movies, so it is likely that these pornographic videos constitute a broader spectrum of time than the movies do. In descending order, we found that 53% of pornography in our sample was in English, 16% was in Chinese, 15% was in Japanese, 6% was in Russian, 3% was in German, 2% was in French, 2% was unclassifiable, and Italian, Hindi, and Spanish appeared infrequently (1% each).


For the music category, the predominant encoding format for music was MP3, there were some albums ripped to WMA (Windows Media Audio, a Microsoft codec), and there were also ISO images and multi-part RAR archives. There is still a bias towards recent albums and songs, but it is not as strongly evident as it is for movies—perhaps because people are more willing to continue seeding music even after it is no longer new, so these torrents are able to stay alive longer in the DHT. In descending order, we found that 78% of music torrents in our sample were in English, 6% were in Russian, 4% were in Spanish, 2% were in Japanese and Chinese each, and other infrequent languages appeared 1% each.


The books/guides and images categories were fairly minor. We classified 15 torrents under books and guides—13 were in English, 1 was in French, and 1 was in Russian. We classified 3 image torrents—one was a set of national park wallpapers, one was a set of pictures of BMW cars (both of these are English), and one was a Japanese comic strip.

Apparent Copyright Infringement

Our final assessment involved determining whether or not each file seemed likely to be copyright-infringing. We classified a file as likely non-infringing if it appeared to be (1) in the public domain, (2) freely available through legitimate channels, or (3) user-generated content. These were judgment calls on our part, based on the contents of the files, together with some external research.

By this definition, all of the 476 movies or TV shows in the sample were found to be likely infringing. We found seven of the 148 files in the games and software category to be likely non-infringing—including two Linux distributions, free plug-in packs for games, as well as free and beta software. In the pornography category, one of the 145 files claimed to be an amateur video, and we gave it the benefit of the doubt as likely non-infringing. All of the 98 music torrents were likely infringing. Two of the fifteen files in the books/guides category seemed to be likely non-infringing.

Overall, we classified ten of the 1021 files, or approximately 1%, as likely non-infringing, This result should be interpreted with caution, as we may have missed some non-infringing files, and our sample is of files available, not files actually downloaded. Still, the result suggests strongly that copyright infringement is widespread among BitTorrent users.