December 15, 2024

Census of Files Available via BitTorrent

BitTorrent is popular because it lets anyone distribute large files at low cost. Which kinds of files are available on BitTorrent? Sauhard Sahi, a Princeton senior, decided to find out. Sauhard’s independent work last semester, under my supervision, set out to measure what was available on BitTorrent. This post, summarizing his results, was co-written by Sauhard and me.

Sauhard chose a (uniform) random sample of files available via the trackerless variant of BitTorrent, using the Mainline DHT. The sample comprised 1021 files. He classified the files in the sample by file type, language, and apparent copyright status.

Before describing the results, we need to offer two caveats. First, the results apply only to the Mainline trackerless BitTorrent system that we surveyed. Other parts of the BitTorrent ecosystem might be different. Second, all files that were available were equally likely to appear in the sample — the sample was not weighted by number of downloads, and it probably contains files that were never downloaded at all. So we can’t say anything about the characteristics of BitTorrent downloads, or even of files that are downloaded via BitTorrent, only about files that are available on BitTorrent.

With that out of the way, here’s what Sauhard found.

File types

46% movies and shows (non-pornographic)
14% games and software
14% pornography
10% music
1% books and guides
1% images
14% could not classify

Movies/Shows

For the movies and shows category, the predominant file format was AVI, and other formats included RMVB (a proprietary format for RealPlayer), MPEG, raw DVD, and some multi-part RAR archives. Interestingly, this section was heavily biased towards recent movies, instead of being spread out evenly over a number of years. In descending order of frequency, we found that 60% of the randomly selected movies and shows were in English, 8% were in Spanish, 7% were in Russian, 5% were in Polish, 5% were in Japanese, 4% were in Chinese, 4% could not be determined, 3% were in French, 1% were in Italian, and other infrequent languages accounted for 2% of the distribution.

Games/Software

For the games and software category, there was no clearly dominant file type, but common file types for software included ISO disc images, multi-part RAR archives, and EXE (Windows executables). The games targeted different platforms, such as the Xbox 360, Nintendo Wii, and Windows PCs. In descending order, we found that 74% of games and software in the sample were in English, 12% were in Japanese, 5% were in Spanish, 4% were in Chinese, 2% were in Polish, and 1% were in Russian and French each.

Pornography

For the pornography category, the predominant encoding format was AVI, similar to the movies category. However, there were significantly more MPG and WMV (Windows Media Video) files available. Most pornography torrents included the full video, a sample of the video (a 1-5 minute extract), and posters or images of the performers in JPEG format. Because these videos are not typically dated the way movies are, it is difficult to assess any recency bias for pornographic torrents. Our assumption would be that demand for pornography is not as time-sensitive as demand for movies, so it is likely that these videos span a broader range of release dates than the movies do. In descending order, we found that 53% of pornography in our sample was in English, 16% was in Chinese, 15% was in Japanese, 6% was in Russian, 3% was in German, 2% was in French, 2% was unclassifiable, and Italian, Hindi, and Spanish appeared infrequently (1% each).

Music

For the music category, the predominant encoding format was MP3; some albums were ripped to WMA (Windows Media Audio, a Microsoft codec), and there were also ISO images and multi-part RAR archives. There is still a bias towards recent albums and songs, but it is less pronounced than for movies, perhaps because people are more willing to continue seeding music even after it is no longer new, so these torrents are able to stay alive longer in the DHT. In descending order, we found that 78% of music torrents in our sample were in English, 6% were in Russian, 4% were in Spanish, 2% were in Japanese and Chinese each, and other infrequent languages appeared 1% each.

Books/Guides

The books/guides and images categories were fairly minor. We classified 15 torrents under books and guides—13 were in English, 1 was in French, and 1 was in Russian. We classified 3 image torrents—one was a set of national park wallpapers, one was a set of pictures of BMW cars (both of these are English), and one was a Japanese comic strip.

Apparent Copyright Infringement

Our final assessment involved determining whether or not each file seemed likely to be copyright-infringing. We classified a file as likely non-infringing if it appeared to be (1) in the public domain, (2) freely available through legitimate channels, or (3) user-generated content. These were judgment calls on our part, based on the contents of the files, together with some external research.

By this definition, all of the 476 movies or TV shows in the sample were found to be likely infringing. We found seven of the 148 files in the games and software category to be likely non-infringing—including two Linux distributions, free plug-in packs for games, as well as free and beta software. In the pornography category, one of the 145 files claimed to be an amateur video, and we gave it the benefit of the doubt as likely non-infringing. All of the 98 music torrents were likely infringing. Two of the fifteen files in the books/guides category seemed to be likely non-infringing.

Overall, we classified ten of the 1021 files, or approximately 1%, as likely non-infringing. This result should be interpreted with caution, as we may have missed some non-infringing files, and our sample is of files available, not files actually downloaded. Still, the result suggests strongly that copyright infringement is widespread among BitTorrent users.
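
The totals above can be cross-checked from the per-category counts given in the text. The following is a minimal sketch for illustration, not part of the original analysis. It assumes that every file falls into exactly one category, so the unclassified files are simply the remainder of the sample, and it shows how the likely-infringing share moves depending on how those unclassified files are treated. The names `REPORTED`, `tally`, and `infringing_share_bounds` are ours.

```python
SAMPLE_SIZE = 1021

# category -> (files in category, files judged likely non-infringing),
# as reported in the text above.
REPORTED = {
    "movies and shows": (476, 0),
    "games and software": (148, 7),
    "pornography": (145, 1),
    "music": (98, 0),
    "books and guides": (15, 2),
}
IMAGE_TORRENTS = 3  # no separate non-infringing count is given for images


def tally():
    """Return (classified, unclassified, non_infringing) file counts."""
    classified = sum(n for n, _ in REPORTED.values()) + IMAGE_TORRENTS
    # Assumption: every file falls into exactly one category.
    unclassified = SAMPLE_SIZE - classified
    non_infringing = sum(k for _, k in REPORTED.values())
    return classified, unclassified, non_infringing


def infringing_share_bounds():
    """Bounds on the likely-infringing share of the whole sample.

    Upper bound: every unclassified file counts as infringing.
    Lower bound: every unclassified file counts as non-infringing.
    Image torrents are counted as infringing in both cases (an assumption).
    """
    _, unclassified, non_infringing = tally()
    upper = (SAMPLE_SIZE - non_infringing) / SAMPLE_SIZE
    lower = (SAMPLE_SIZE - non_infringing - unclassified) / SAMPLE_SIZE
    return lower, upper


if __name__ == "__main__":
    classified, unclassified, non_infringing = tally()
    print(f"classified: {classified}, unclassified: {unclassified}")
    print(f"likely non-infringing: {non_infringing / SAMPLE_SIZE:.2%} of sample")
    lower, upper = infringing_share_bounds()
    print(f"likely infringing: between {lower:.1%} and {upper:.1%}")
```

The gap between the two bounds comes entirely from the files that could not be classified, which is why the caveats matter when quoting the 1% figure.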

Comments

  1. “Our assumption would be that demand for pornography is not as time-sensitive as demand for movies”
    ie: you don’t care if your porn is old, why should the layman

  2. AVI is not an encoding format – it’s just a container format that can contain virtually anything (DivX or XviD, for example).

  3. So some unknown university student took about 1000 files, was unable to classify over 140 as to what they were, used his own undisclosed judgment to decide if they infringed copyright (yes, even though 140 of them he could not decide what they were he still classified them), and this is now taken as fact, considered scientific and quoted?

    Most BitTorrent content might be illegal copyrighted material, but quoting some guy who uses an unknown judgement and makes statements on copyright even though he doesn’t know how to classify it is as scientific as saying your nephew told you so…

  4. It’s an interesting idea, but I don’t think you have a large enough sample for a significant x-bar correlation, and the size of the population is excluded.

    Your data is biased by the geography & language of your sample. I read recently that there are about 100 million illegal copies of Windows XP in use in China…

    There is quite a bit of content available through these resources in the movie and book formats that is not infringing. I challenge this assumption.

    What department did you produce this for? Probably goddamn pre-lawyers. If not, I’m so sorry, otherwise bite me!

  5. I’d be more interested to know what the breakdown by bandwidth used was, i.e. file sizes * number of downloads. Might be a more useful metric. Just a thought.

    • That’d require much more effort and would have provided questionable results, since the DHT doesn’t store download numbers (nor does it store any statistics at all). Researchers would have to search for a tracker that tracks a given torrent and try to get download numbers from it. Even if they succeed with the first step, that doesn’t mean the tracker will return actual stats (if any), and if the torrent turns out to be tracked by several trackers, they’ll most likely get different stats from different trackers, making things even more complicated.

  6. This study is biased in favor of the Industry, and against the users, right from the start! See, users don’t have to provide evidence that something is free, the industry has to provide evidence that it is copyrighted! So, Sahi should have checked the 99% of the files for including copyrighted content. That would have been much more work, of course, but that would have been the only fair way, representing reality.

    By being lazy, and simply ASSUMING that everything that isn’t easily determined to be in the “public domain” is copyrighted, he of course unavoidably overstates the amount of copyrighted material in the stream. Just take the 14% of “unclassified” material. This doesn’t make any sense: what can that stuff be when it is digital, but not audio, video, software, documents or images? There is no other significant category besides these! So, obviously, this is simply stuff that is encrypted in some way. Sahi can’t ID that, he doesn’t know what it is, but he still counts it as “copyrighted”! This is dishonest, this is unfair, and it is a blatant mistake in an allegedly scientific work of a Princeton senior! I would give it a D, if at all, and I would question the student if he has any ties to the content industry.

    • I’m not sure what you’re suggesting in the first paragraph. You seem to be saying that he should have checked the files that he labeled as likely-infringing. He did that.

      What basis do you have for accusing him of bias, other than that you wish his results were different?

      • …that the constraint selection was biased towards an industry-favorable outcome.

        – Legitimate uploaders are less likely to use trackerless BitTorrent.

        – Lack of download/seeder/leecher information does not provide a useful data set to draw conclusions from regarding typical user behavior, especially if you are going to make claims about relative prominence of trading in certain file types. But such analysis isn’t necessary to provide a positive industry outcome, simply finding many “warez” files available would be a useful point for them.

        – A methodology counting number of torrents uploaded only tells you what people are offering to share, not what people are downloading. This is initially properly noted, but ignored in the conclusion where you say “Still, the result suggests strongly that copyright infringement is widespread among BitTorrent users.” The fact that you extrapolate to users with no data tying the actual users to the files being surveyed lends a real biased feel to the reporting. “Uploaders” would of course be directly supportable by the data, and not open to accusations of bias.

        I’m not saying any bias was intentional, and I have no doubt that by most methodologies, a majority of BitTorrent traffic is in unauthorized copyrighted goods.

  7. Did you just assume file types based on file names? I can rename a .JPG to a .RAR and the file is still a JPEG image.

    Classifying file types by name alone, without analyzing the actual file contents, is like the Census establishing gender counts based on if your name sounds like a boy’s or a girl’s.
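
The post does not say how file types were determined, so the following is only an illustrative sketch of the content-based check this comment asks for: identifying a file from its leading "magic" bytes rather than from its extension. The signature table is deliberately small, and the helper names `sniff_type` and `sniff_file` are hypothetical.

```python
# Leading byte signatures for a few of the formats mentioned in the post.
MAGIC_SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"Rar!\x1a\x07", "rar"),
    (b"MZ", "windows executable"),
]


def sniff_type(header: bytes) -> str:
    """Guess a file type from the first bytes of its contents."""
    for magic, name in MAGIC_SIGNATURES:
        if header.startswith(magic):
            return name
    # AVI is a RIFF container: "RIFF", a 4-byte size, then the form type "AVI ".
    if header[:4] == b"RIFF" and header[8:12] == b"AVI ":
        return "avi"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 16 bytes of a file and classify them."""
    with open(path, "rb") as f:
        return sniff_type(f.read(16))


if __name__ == "__main__":
    # A JPEG header is recognised as JPEG whatever the file is called.
    print(sniff_type(b"\xff\xd8\xff\xe0" + b"\x00" * 12))
```

A check like this only identifies the container. As another commenter points out, an AVI file can hold almost any codec, so deciding what a file actually contains still requires opening it.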

  8. “Sauhard chose a (uniform) random sample of files available via the trackerless variant of BitTorrent, using the Mainline DHT”
    I’d be curious to know how that was done, as I don’t believe it’s possible to do so. Would be nice if it was explained how. Without that basic methodological “HOW”, the entire conclusion is dubious.

    • Trackerless BitTorrent assigns each torrent an ID which is a cryptographic hash of some information about the torrent. In effect, each torrent has a pseudorandom ID. If you pick an arbitrary range in the ID space, and enumerate all of the torrents with IDs in that range, you have a (pseudo)random sample of all of the torrents in the system. You can get all of the torrents in a range by joining the DHT at a randomly chosen ID, and making a list of all the torrents that you’re asked to store information about. That’s more or less what Sauhard did.
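
The sampling idea in the reply above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the code used in the study: torrents are represented by made-up byte strings standing in for the data that gets hashed, the "range" is taken to be all IDs sharing a fixed number of leading bits with a randomly chosen node ID, and no actual DHT traffic is involved. The names `torrent_id`, `sample_by_id_range`, and `simulate` are hypothetical.

```python
import hashlib
import random

ID_BITS = 160  # size of a SHA-1 digest, which is the ID length in the Mainline DHT


def torrent_id(info: bytes) -> int:
    """Map a torrent's identifying data to a pseudorandom 160-bit ID."""
    return int.from_bytes(hashlib.sha1(info).digest(), "big")


def sample_by_id_range(ids, node_id, prefix_bits):
    """Keep the IDs that share their first `prefix_bits` bits with `node_id`."""
    shift = ID_BITS - prefix_bits
    return [t for t in ids if (t >> shift) == (node_id >> shift)]


def simulate(population=100_000, prefix_bits=8, seed=0):
    """Return (observed sample size, expected sample size) for a fake population."""
    rng = random.Random(seed)
    # Stand-in torrents; a real crawler would see IDs announced by peers.
    ids = [torrent_id(f"torrent-{i}".encode()) for i in range(population)]
    node_id = rng.getrandbits(ID_BITS)  # where we "join" the ID space
    sample = sample_by_id_range(ids, node_id, prefix_bits)
    return len(sample), population / 2 ** prefix_bits


if __name__ == "__main__":
    observed, expected = simulate()
    print(f"observed {observed} torrents in range, expected about {expected:.0f}")
```

Because the hash spreads IDs evenly over the ID space, the slice contains roughly a fixed fraction of the population no matter where it is placed, which is what makes it behave like a uniform random sample of the torrents present at that time.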

  9. I run Transmission on my MacBook Pro and use it to help distribute things their creators want to be distributed.

    Examples: Some OS distributions (SUSE, Haiku), classes (12 SIPC lectures), NeoOffice, Star Trek New Voyages.

    I wonder how these would show up?

  10. Call me naive or old-fashioned if you want, but I believe that we should publish our legitimate research results regardless of who might find them inconvenient. That’s part of our ethical code as academics.

    • Seth Finkelstein says

      > “we should publish our legitimate research results regardless of who might find them inconvenient.”

      Of course. No offense meant.

    • Of course one should question the ethical code of academics. After all they need money to conduct research and many of the results are actually those that “bill payers” hope to see and are looking for.

    • I would not want to imply however that something should not be published, or kept secret just because it is inconvenient. Rather the opposite, precisely because it is inconvenient, it should be published.

      However, when a conclusion is presented from a data set that has all kinds of limitations, the choice then becomes how that data is summarized and how that summarization is presented. When the title is “Census of Files Available via BitTorrent”

      And the conclusion is: “Overall, we classified ten of the 1021 files, or approximately 1%, as likely non-infringing, This result should be interpreted with caution, as we may have missed some non-infringing files, and our sample is of files available, not files actually downloaded. Still, the result suggests strongly that copyright infringement is widespread among BitTorrent users.”

      I wonder why the title wasn’t something like: “Sample of Files available from Trackerless Bit Torrent Sites” and the conclusion did not have some of the caveats embedded within it, rather than excluded.

      Our freedoms are being assaulted every day, and that is the inconvenient truth. Every day our leaders, some of whom have pledged themselves to high-sounding ideals of transparency and openness, are using the specter of copyright infringement to push secret deals like the A.C.T.A. past the normal process which (should) involve democratic review, and instead are examples of capitulation of fundamental rights to the interests of a few corporations. That is the world we live in, and unless we speak out our freedoms will continue to be eroded, and the world will not be a better place for that erosion.

      So I take no exception to your claim of ethics, but neither should you claim that I wish to censor any truth; I just desire to have all facts accurately portrayed. The fact that there are those who misuse any inaccuracies to bend all of our rules to their mean wants suggests we should welcome any criticism which strives for accuracy.

  11. Seth Finkelstein says

    > “It cannot be the case that a site like Freedom to Tinker can publish this conclusion …
    > without being at all cognizant of the larger context in which that conclusion might be used …”

    I completely agree, and would have loved to have been a fly on the wall when these issues were discussed. My guess is that this was decided as a positioning for “reasonableness” in the overall debate. I can’t argue with that as a tactic here, seems like a good strategy. If you can gain policy-credibility by stating the utterly obvious, go for it.

  13. Oh, I am sure this guy had a lot of fun conducting this survey, specially the “pornography” part……

  14. First, it should be fairly obvious that torrents that have trackers would be preferred by those seeking to distribute files that they want to or have an obligation to distribute. Nearly every Linux distribution, for example, uses BitTorrent to distribute files, and they all use trackers.

    For example:
    (Fedora) http://torrent.fedoraproject.org/
    (OpenSuSE) http://tracker.opensuse.org/

    But how many times were the files downloaded? Here’s what Fedora’s website (http://fedoraproject.org/wiki/Statistics) indicates:

    “The following table shows the number of downloads that have been made over BitTorrent. This table shows downloads only through trackers connected to the official torrent server.
    Downloads from bittorrent (as of 2009-01-18)
    Sulphur (F9): 443,932
    Cambridge (F10): 515,051
    Leonidas (F11): 349,889
    Constantine (F12) 137,716 ”

    So this would total 1,446,588 downloads, just for the Fedora releases that would have been current on Jan 18, 2009. There is no reason to think that the figures for the OpenSuSE, Mandriva or Mint GNU/Linux distributions would have been much lower. This all adds up to a very substantial non-infringing use (see Betamax).

    It cannot be the case that a site like Freedom to Tinker can publish this conclusion “Overall, we classified ten of the 1021 files, or approximately 1%, as likely non-infringing,” without being at all cognizant of the larger context in which that conclusion might be used, (or more likely, mis-used) caveats notwithstanding. The study itself was interesting, but to publish the conclusion, even with caveats, I do not feel was completely responsible.

    eee_eff

  15. Not sure I get all the emphasis on copyright. I create game content which I self-publish and use the creative commons non-commercial, attribution, share-alike license, so my work is indeed covered by copyright, but there’s no infringement if someone makes it available for download by torrent or any other means.

  16. Curious about the copyright status of the items they could not classify. The study indicated percentages of the categories they could classify, but not these. So is it more accurate to say that 99% of 86% of 1021 files were classified as infringing?

    That would result in a number that is more in line with other informal estimates that I have seen, between 80-90% of files shared via bittorrent are infringing.

    • Could you not assume though that, as 99% of the classified material was infringing, 99% of the unclassified material, were it to be classified, would probably be infringing too? Just because it’s unclassified shouldn’t mean it’s more or less likely to be copyrighted, should it?

      • “Just because it’s unclassified shouldn’t mean it’s more or less likely to be copyrighted, should it?”

        Then why the hell are you suggesting that “were it to be classified, [it] would probably be infringing too”?

        I’d suggest that it would be more likely to be non-infringing, as it’s more difficult to demonstrate non-copyrighted status than copyrighted status, because non-copyrighted material does not assert its status, whereas copyrighted material almost invariably does.

  17. I knew without a doubt that ALL of these goddamn comments were going to be ‘Nuh uh, you made a mistake. You should’ve done this or that. Your title is misleading.’

    Stop fucking nitpicking. You all download copyrighted material. All of you do it, all of the goddamn time. You’re breaking the law, you know it, and you keep trying to justify it. FUCK. OFF.

    • Of course we download copyrighted material, though I’d be very cautious about tarring everyone with the same brush if I were you. I very much doubt your accusation applies to everyone who has commented. As somebody pointed out in detail above, it would be shocking if a significant portion was *not* infringing since the automatic copyright rule means pretty much everything out there is infringing. That’s not what’s interesting about the study; we already know that BitTorrent is widely used for piracy.

      You should pay attention to the fact that “we can’t say anything about the characteristics of BitTorrent downloads, or even of files that are downloaded via BitTorrent, only about files that are available on BitTorrent.” So what does data about files that are *available* on BitTorrent tell us? It tells us what proportion of our culture (i.e. things that people think are worth sharing on BitTorrent) is “owned”. It tells us that, as yet, open culture doesn’t seem to have had much impact on pop culture, since the majority of files *available* come from copyrighted sources. Of course, given that copyright has been a part of our history since long before the existence of the internet, it shouldn’t surprise us that the number of cultural artifacts under copyright vastly outnumber the number outside of it.

      I don’t need a study that says noninfringing content is common in Bittorrent to justify my downloading habits as you imply. Even if that study existed and was true, I don’t see why it would be relevant to the copyrighted content that I download. I am perfectly aware that I break the law when I download copyrighted material. I’m ok with that. I’m not ok with the fact that the act of sharing culture via BitTorrent (or any form of online sharing) is illegal. I’m not ok with the fact that 99% of the culture that people want to share is somehow “owned” by someone. That’s not how culture works, and trying to force it to work that way simply ends up destroying that culture. Culture is fundamentally a shared experience. Take away our ability to share it freely and what you have is at best commerce, not culture.

      Don’t get me wrong. I’m not anti-copyright, and I make my living in the film industry. But we shouldn’t be using copyright to control culture (it won’t work). We should be using it to regulate commerce, where it is well suited to solving very valid business conflicts. But, once those business conflicts have been resolved, copyright needs to recognize that people are interested in *sharing* copyrighted works because those works are culturally interesting, and copyright should not be standing in the way of doing so. After all, the reason many (most?) copyrighted works are created is in hope that the creation might, in some way, be culturally interesting. Copyright only serves as an incentive for creation when it *improves* the chances of a creation having cultural worth; if it outlaws allowing a creation to be used culturally, I (and many other creators) have no use for copyright.

  18. I’m surprised you didn’t find any MKVs under video – that format has got to be way more common than RMVB, given that it’s the de facto standard for HD videos and anime (since it can contain multiple audio/subtitle tracks).

  19. Seth Finkelstein says

    > “Still, the result suggests strongly that copyright infringement is widespread among BitTorrent users.”

    Umm … I should probably praise the political courage necessary to write that, given the way Bittorrent has been used as a catspaw. You know what you’re doing.

    I remember how difficult it was to voice similar sentiments with regard to Napster.

  20. Like Kisar earlier, we, too, would like to see the raw data, but also the coding sheets, the variables, and how those variables were operationalized.

    We also have a few comments and questions about the study:

    1. What was the time frame/time span in which this study took place? The post says Mr. Sahi performed this study during the summer, but it is unclear whether the study spanned the entire summer, a month, a week, or a day.

    2. Hopefully the paper will identify why the trackerless BitTorrent was chosen for study as opposed to other parts of the BitTorrent ecosystem.

    3. Your operationalization of copyright infringement is interesting, but it seems like it may skew the study in one way (i.e. towards a finding of infringement), especially given the actual definition of infringement. The actual definition of copyright infringement in the Copyright Act of 1976 (Section 501(a), http://www.copyright.gov/title17/92chap5.html#501) defines infringement as “Anyone who violates any of the exclusive rights of the copyright owner as provided by sections 106 through 122 or of the author as provided in section 106A(a) … is an infringer of the copyright or right of the author, as the case may be.” Effectively, this means that any time any person other than the copyright owner or his authorized agent invokes the rights of reproduction, derivative work/adaptation, distribution, public performance or public display, that person is infringing per Section 501(a). This finding of infringement, of course, is subject to a raft of limitations or compulsory licenses in Sections 107 through 122.

    Since copyright infringement is a strict liability issue (i.e., roughly meaning liability without fault), this essentially means that anytime anyone posts a file on a BitTorrent system, even a digital movie or music file ripped from their own collections, there is, arguably, an infringement because

    (a) the person who owns the source disc from which the movie or music file was ripped is likely not the person that owns the rights to the disc (per Section 202, http://www.copyright.gov/title17/92chap2.html#202);
    (b) therefore arguably has no authority to distribute that file on a digital network. (The first sale limitation in Section 109, http://www.copyright.gov/title17/92chap1.html#109, may or may not apply. We are presuming for the sake of this argument that it is inapplicable.)

    This means that from a legal standpoint, it is possible that any file on such a distributed network technically is an infringement under Section 501(a). The ultimate finding of liability, however, gets made subject to the limitations and compulsory licenses in Sections 107 through 122 of the current Act.

    How could all this legal mumbo affect the study? Well, it could affect the study in a significant way if it does not take into account a variable for actual ownership of the source material from which the traded digital file was ripped. This matters, for one, because the first sale doctrine may be an applicable limitation. (Again, more analysis would need to be done, but it’s worth an investigation.)

    Second, if you can get a way to determine and operationalize source ownership, then the study can probe deeper into whether or not the digital files on the network are, indeed, technical infringements (i.e. people posting stuff they own in disc form, but mistakenly are trading in digital form, not knowing what they are doing is “illegal” — which, in turn, gets into norms vs. law arguments) or infringements based upon rogue behavior (i.e. the person never had the file, never bought the source material, never intends to buy the source material and merely wants to get stuff for free).

    Ultimately, though, it should be pointed out that the way the statutes are written, it would be shocking if anything significantly less than 100% of the files on BitTorrent were technical infringements.

    Granted, all this may be far outside the scope of this study — one more reason why we’d like to see the data and review the full paper instead of drawing conclusions from this summary.

    • Uh, well, you’re obviously going to get paid for usage of the raw data and software that don’t belong to you in first place, so I fail to see the reason why you should get sources for free. Why don’t you play fair and hire them or someone else to make similar sniffer for you? Also, don’t overgeneralize: U.S. laws apply to U.S. territory only, so what you perceive to be an infringement isn’t necessarily an infringement elsewhere.

      • just a thought says

        1. You don’t read books, or? Weren’t you in a library lately. Did you pay anything to get a book or CD/DVD from a library? Did you pay anything if you copy (some sites of) the book at home?

        Is there any advertisement in the borrowed book? No? Well at any torrent tracker there is, so this is “paid” enough for.

        The only comfort and difference between a library and a “torrent” is that you must not give your files back or have to pay if it wasn’t just in time.

        2. You don’t go swimming, or? Weren’t you at a lake or sea to swim the crawl? Did you pay anything for it (but your health if others have polluted the water)? You even can collect some sand/stones or minerals if you like.

        Is there any advertisement?
        The only comfort and difference between a lake and the “torrent” is that you have to leave home.

        3. You don’t breath, or? Weren’t you breathing right now? Did you pay anything for it (but your health if others have polluted the air)? You can breath freely, highly, low and deep as you wish.

        Is there any advertisement?
        There is no difference between breathing and a “torrent”.

        It is just a thought and a “torrent” is just a global information.

  21. kisar_sosae says

    As a stats instructor is it possible to get a copy of the raw data? I’d love to give this to my students to recreate.

  22. You probably made mistakes in the software’s status. Lots of downloadable software has a free trial period, after which it can be registered online. DVDfab also has a free mode you can select at install.

  23. I don’t know much about the BitTorrent ecosystem, but I assume that trackerless BitTorrent is particularly attractive to those wishing to distribute copyright-infringing material, while those distributing non-infringing material are perfectly happy to use good old-fashioned trackers. As such, it seems inappropriate to me that these results be labeled as applying to BitTorrent in general. Your caveat about this being limited to the Mainline DHT probably belongs in the lede, if not the title, of this post.

    • Short answer: this assumption is baseless. Torrent doesn’t have to be trackerless to be shared via DHT. In fact, an overwhelming majority of DHT-tracked torrents use trackers as well.

      Long answer: I highly doubt any trackerless torrents have been sighted at all, since they are extremely rare “in the wild” (I’ve seen maybe a couple of them in ~5 years of torrenting). All torrents are either “private” or “public”; the former are generally used on private trackers and are never available via DHT, while the latter are always available via DHT and may or may not use trackers. If you want to share something that may get you in trouble, you generally do it on private trackers because they give you a (false) sense of safety (and disable DHT on your torrent if you haven’t done it yourself). If you, for whatever reason, don’t care about copyright and want your release to be widely available, you post it on a public tracker or, better yet, several of them (those don’t mark your torrent “private”, and making it private yourself – prohibiting the use of DHT – doesn’t make any sense). So it’s safer to assume that private trackers (which can’t be estimated using this method) have a higher percentage of copyrighted stuff, because sharing non-copyrighted stuff on private trackers is, well, dumb (it’s always available somewhere else where you don’t have to worry about your upload/download ratio). So, that 1% share of legit torrents is actually even lower (if not nonexistent) on private trackers.

  24. Thank you for putting some science and discipline towards an issue that is largely measured in theory and “message points,” although I think you buried the lede a bit with respect to the copyright status of all the video in this sample.

  25. Kevin R. Guidry says

    Nit picking: If you selected a sample from the population of available files then you didn’t perform a census (as indicated in the title of your post).

  26. what about the temporal and geographical consequences of the sample?