In my research on privacy problems in PACER, I spent a lot of time examining PACER documents. In addition to researching the problem of “bad” redactions, I was also interested in learning about the pattern of redactions generally. To this end, my software looked for two redaction styles. One is the “black rectangle” redaction method I described in my previous post. This method sometimes fails, but most of these redactions were done successfully. The more common method (around two-thirds of all redactions) involves replacing sensitive information with strings of XXs.
Out of the 1.8 million documents it scanned, my software identified around 11,000 documents that appeared to have redactions. Many of them could be classified automatically (for example “123-45-xxxx” is clearly a redacted Social Security number, and “Exxon” is a false positive) but I examined several thousand by hand.
Here is the distribution of the redacted documents I found.
|Type of Sensitive Information||No. of Documents|
|Social Security number||4315|
|Bank or other account number||675|
|Date of birth||290|
|Unique identifier other than SSN||216|
|Name of person||129|
|Phone, email, IP address||60|
|National security related||26|
To reiterate the point I made in my last post, I didn’t have access to a random sample of the PACER corpus, so we should be cautious about drawing any precise conclusions about the distribution of redacted information in the entire PACER corpus.
Still, I think we can draw some interesting conclusions from these statistics. It’s reasonable to assume that the distribution of redacted sensitive information is similar to the distribution of sensitive information in general. That is, assuming that parties who redact documents do a decent job, this list gives us a (very rough) idea of what kinds of sensitive information can be found in PACER documents.
The most obvious lesson from these statistics is that Social Security numbers are by far the most common type of redacted information in PACER. This is good news, since it’s relatively easy to build software to automatically detect and redact Social Security numbers.
Another interesting case is the “address” category. Almost all of the redacted items in this category—393 out of 449—appear in the District of Columbia District. Many of the documents relate to search warrants and police reports, often in connection with drug cases. I don’t know if the high rate of redaction reflects the different mix of cases in the DC District, or an idiosyncratic redaction policy voluntarily pursued by the courts and/or the DC police but not by officials in other districts. It’s worth noting that the redaction of addresses doesn’t appear to be required by the federal redaction rules.
Finally, there’s the category of “trade secrets,” which is a catch-all term I used for documents whose redactions appear to be confidential business information. Private businesses may have a strong interest in keeping this information confidential, but the public interest in such secrecy here is less clear.
To summarize, out of 6208 redacted documents, there are 4315 Social Security that can be redacted automatically by machine, 449 addresses whose redaction doesn’t seem to be required by the rules of procedure, and 419 “trade secrets” whose release will typically only harm the party who fails to redact it.
That leaves around 1000 documents that would expose risky confidential information if not properly redacted, or about 0.05 percent of the 1.8 million documents I started with. A thousand documents is worth taking seriously (especially given that there are likely to be tens of thousands in the full PACER corpus). The courts should take additional steps to monitor compliance with the redaction rules and sanction parties who fail to comply with them, and they should explore techniques to automate the detection of redaction failures in these categories.
But at the same time, a sense of perspective is important. This tiny fraction of PACER documents with confidential information in them is a cause for concern, but it probably isn’t a good reason to limit public access to the roughly 99.9 percent of documents that contain no sensitive information and may be of significant benefit to the public.
Thanks again to Carl Malamud and Public.Resource.Org for their support of my research.