November 29, 2020

Online Porn and Bad Science

Declan McCullagh reports
on yesterday’s House Government Reform Committee hearings on porn and
peer-to-peer systems. (I’m sure there is some porn on these systems,
as there is in every place where large groups of people gather.)
There’s plenty to chew on in the story; Frank Field says it “sounds
like a nasty meeting.”

But I want to focus on the factual claims made by one witness. Declan writes:

Randy Saaf, president of P2P-tracking firm MediaDefender, said his
investigations of child pornography on P2P networks found over 321,000
files “that appeared to be child pornography by their names and file
types,” and said that “over 800 universities had files on their
networks that appeared to be child pornography.”

But MediaDefender, and one of the government studies released on
Thursday, reviewed only the file names and not the actual contents of
the image files. A similar approach used in a 1995 article [i.e., the
now-notorious Rimm study – EWF] that appeared in the Georgetown
University law journal drew strong criticism from academics for having
a flawed methodology that led to incorrect estimates of the amount of
pornography on the Internet.

Characterizing a file as porn based on its name alone is obviously
lame, if your goal is to make an accurate estimate of how much porn is
out there. (And that is the goal, isn’t it?)

It’s no excuse to say that it’s infeasible to examine 321,000 files
by hand to see if they are really porn. If you actually care
whether 321,000 is even close to correct, you can examine a small
random sample of the files. If you sample, say, ten randomly chosen
files and only five of them are really porn, then you can be pretty
sure that 321,000 is far wrong. There’s no reason not to do this,
if your goal is to give the most accurate testimony to Congress.
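
To make the sampling argument concrete, here is a minimal sketch of the arithmetic. It assumes the ten files are a simple random sample of the 321,000, and it uses a Wilson score interval, which is my choice of a standard method for small samples, not anything the witnesses or the studies used. The function names wilson_interval and plausible_range are made up for illustration.

```python
import math

CLAIMED_TOTAL = 321_000  # files flagged by name and file type, per the testimony


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    hits: sampled files that turned out to be what their names suggested
    n:    number of files sampled
    z:    normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p_hat = hits / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def plausible_range(hits, n, total=CLAIMED_TOTAL):
    """Scale the interval for the true fraction up to a count of files."""
    low, high = wilson_interval(hits, n)
    return round(low * total), round(high * total)


if __name__ == "__main__":
    low, high = plausible_range(hits=5, n=10)
    print(f"5 of 10 confirmed: roughly {low:,} to {high:,} of {CLAIMED_TOTAL:,} files")
```

Under these assumptions, five confirmed files out of ten puts the plausible count somewhere around 76,000 to 245,000. That is a wide range, but even its upper end is well short of 321,000. A sample of a few hundred files would narrow it considerably and would still be easy to check by hand.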

UPDATE (8:30 AM, March 18): According to a Dawn Chmielewski story at the San Jose Mercury News, a government study found that 42% of files found on Kazaa via “search terms known to be associated with child porn” were actually child porn.