Since we launched RECAP a couple of years ago, one of our top concerns has been privacy. The federal judiciary’s PACER system offers the public online access to hundreds of millions of court records. The judiciary’s rules require each party in a case to redact certain types of information from documents they submit, but unfortunately litigants and their counsel don’t always comply with these rules. Three years ago, Carl Malamud did a groundbreaking audit of PACER documents and found more than 1600 cases in which litigants submitted documents with unredacted Social Security numbers. My recent research has focused on a different problem: cases where parties tried to redact sensitive information but the redactions failed for technical reasons. This problem occasionally pops up in news stories, but as far as I know, no one has conducted a systematic study.
To understand the problem, it helps to know a little bit about how computers represent graphics. The simplest image formats are bitmap or raster formats. These represent an image as an array of pixels, with each pixel having a color represented by a numeric value. The PDF format uses a different approach, known as vector graphics, which represents an image as a series of drawing commands: lines, rectangles, lines of text, and so forth.
Vector graphics have important advantages. Vector-based formats “scale up” gracefully, in contrast to the raster images that look “blocky” at high resolutions. Vector graphics also do a better job of preserving a document’s structure. For example, text in a PDF is represented by a sequence of explicit text-drawing commands, which is why you can cut and paste text from a PDF document, but not from a raster format like PNG.
But vector-based formats also have an important disadvantage: they may contain more information than is visible to the naked eye. Raster images have a “what you see is what you get” quality—changing all the pixels in a particular region to black destroys the information that was previously in that part of the image. But a vector-based image can have multiple “layers.” There might be a command to draw some text followed by a command to draw a black rectangle over the text. The image might look like it’s been redacted, but the text is still “under” the box. And often extracting that information is a simple matter of cutting and pasting.
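A toy sketch in Python makes the point concrete. It is only an illustration under simplifying assumptions, not real PDF parsing: the DrawText and DrawRect command types and the visible_text and extractable_text helpers are hypothetical names invented for this example, and the geometry test looks only at each text run's anchor point.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class DrawText:
    """Hypothetical 'draw this string at (x, y)' command."""
    x: float
    y: float
    text: str


@dataclass
class DrawRect:
    """Hypothetical 'fill a rectangle' command."""
    x: float
    y: float
    width: float
    height: float
    color: str


Command = Union[DrawText, DrawRect]


def covers(rect: DrawRect, cmd: DrawText) -> bool:
    """True if the text's anchor point lies inside the rectangle (toy geometry)."""
    return (rect.x <= cmd.x <= rect.x + rect.width
            and rect.y <= cmd.y <= rect.y + rect.height)


def visible_text(page: List[Command]) -> List[str]:
    """Text a reader sees: drawn, and not painted over by a later rectangle."""
    shown = []
    for i, cmd in enumerate(page):
        if isinstance(cmd, DrawText):
            hidden = any(isinstance(later, DrawRect) and covers(later, cmd)
                         for later in page[i + 1:])
            if not hidden:
                shown.append(cmd.text)
    return shown


def extractable_text(page: List[Command]) -> List[str]:
    """Text a copy-and-paste sees: every text command, covered or not."""
    return [cmd.text for cmd in page if isinstance(cmd, DrawText)]


# A "redacted" page: the rectangle is drawn after the text it hides.
page = [
    DrawText(72, 700, "Plaintiff's SSN is"),
    DrawText(200, 700, "000-00-0000"),      # placeholder value
    DrawRect(195, 695, 90, 14, "black"),
]

print(visible_text(page))      # what the page appears to say
print(extractable_text(page))  # what is still stored in the file
```

The rectangle changes what is displayed but removes nothing from the command list, which is why the covered string still comes out when the text is extracted.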
So how many PACER documents have this problem? We’re in a good position to study this question because we have a large collection of PACER documents: 1.8 million of them when I started my research last year. I wrote software to detect redaction rectangles, which turn out to be relatively easy to recognize based on their color, shape, and the specific commands used to draw them. Out of 1.8 million PACER documents, there were approximately 2000 documents with redaction rectangles. (There were also about 3500 documents that were redacted by replacing text with strings of Xes. I also excluded documents that were redacted by Carl Malamud before he donated them to our archive.)
Next, my software checked to see if these redaction rectangles overlapped with text. My software identified a few hundred documents that appeared to have text under redaction rectangles, and examining them by hand revealed 194 documents with failed redactions. The majority of the documents (about 130) appear to be from commercial litigation, in which parties have unsuccessfully attempted to redact trade secrets such as sales figures and confidential product information. Other improperly redacted documents contain sensitive medical information, addresses, and dates of birth. Still others contain the names of witnesses, jurors, plaintiffs, and one minor.
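The overlap test at the heart of this step can be sketched as follows. This is a minimal Python illustration, not the released Perl code: it assumes the rectangles and text runs have already been extracted from the PDF as bounding boxes, and the Box type, the function names, and the 50% coverage threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def area(box: Box) -> float:
    return max(0.0, box.x1 - box.x0) * max(0.0, box.y1 - box.y0)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes, or 0.0 if they do not intersect."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    return width * height if width > 0 and height > 0 else 0.0


def suspicious_text(rects: List[Box],
                    text_runs: List[Tuple[str, Box]],
                    min_fraction: float = 0.5) -> List[str]:
    """Return text runs mostly covered by some candidate redaction rectangle.

    min_fraction is an assumed threshold: the share of the text's own
    bounding box that must lie under a rectangle before it is flagged.
    """
    flagged = []
    for text, box in text_runs:
        text_area = area(box)
        if text_area == 0:
            continue
        for rect in rects:
            if overlap_area(rect, box) / text_area >= min_fraction:
                flagged.append(text)
                break
    return flagged


# One candidate redaction rectangle and three text runs on the same line.
rects = [Box(100, 500, 220, 515)]
text_runs = [
    ("Net sales: ", Box(40, 502, 98, 512)),      # left of the rectangle
    ("$4.2 million", Box(104, 502, 180, 512)),   # fully underneath it
    ("per quarter", Box(215, 502, 280, 512)),    # barely touches its edge
]

print(suspicious_text(rects, text_runs))
```

A check like this only produces candidates. As described above, the flagged documents still had to be examined by hand to confirm that a redaction had actually failed.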
Implications
PACER reportedly contains about 500 million documents. We don’t have a random sample of PACER documents, so we should be careful about trying to extrapolate to the entire PACER corpus. Still, it’s safe to say there are thousands, and probably tens of thousands, of documents in PACER whose authors made unsuccessful attempts to conceal information.
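For a sense of scale, here is the naive proportional calculation in Python, using only the figures quoted above. It assumes the 1.8 million documents are representative of PACER as a whole, which, as just noted, they are not, so the result is an order-of-magnitude illustration rather than an estimate.

```python
failed_redactions = 194        # failed redactions found in the sample
sample_size = 1_800_000        # documents examined
pacer_total = 500_000_000      # reported size of PACER

# Failure rate observed in the sample.
rate = failed_redactions / sample_size

# Naive scaling to all of PACER (assumes a representative sample).
naive_estimate = rate * pacer_total

print(f"failure rate in sample: {rate:.4%}")
print(f"naive scaling to all of PACER: {naive_estimate:,.0f} documents")
```

The scaled figure comes out at roughly 54,000 documents, which fits the "tens of thousands" range mentioned above.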
It’s also important to note that my software may not be detecting every instance of redaction failures. If a PDF was created by scanning in a paper document (as opposed to generated directly from a word processor), then it probably won’t have a “text layer.” My software doesn’t detect redaction failures in this type of document. This means that there may be more than 194 failed redactions among the 1.8 million documents I studied.
A few weeks ago I wrote a letter to Judge Lee Rosenthal, chair of the federal judiciary’s Committee on Rules of Practice and Procedure, explaining this problem. In that letter I recommend that the courts themselves use software like mine to automatically scan PACER documents for this type of problem. In addition to scanning the documents they already have, the courts should make it a standard part of the process for filing new documents with the courts. This would allow the courts to catch these problems before the documents are made available to the public on the PACER website.
My code is available here. It’s experimental research code, not a finished product. We’re releasing it into the public domain using the CC0 license; this should make it easy for federal and state officials to adapt it for their own use. Court administrators who are interested in adapting the code for their own use are especially encouraged to contact me for advice and assistance. The code relies heavily on the CAM::PDF Perl library, and I’m indebted to Chris Dolan for his patient answers to my many dumb questions.
Getting Redaction Right
So what should litigants do to avoid this problem? The National Security Agency has a good primer on secure redaction. The safest approach is the one they recommend: completely delete the sensitive information in the original word processing document, replace it with innocuous filler (such as strings of Xes) as needed, and then convert the result to a PDF document. The NSA primer also explains how to check for other potentially sensitive information that might be hidden in a document’s metadata.
Of course, there may be cases where this approach isn’t feasible because a litigant doesn’t have the original word processing document or doesn’t want the document’s layout to be changed by the redaction process. Adobe Acrobat’s redaction tool has worked correctly when we’ve used it, and Adobe probably has the expertise to do it correctly. There may be other tools that work correctly, but we haven’t had an opportunity to experiment with them so we can’t say which ones they might be.
Regardless of the tool used, it’s a good idea to take the redacted document and double-check that the information was removed. An easy way to do this is to simply cut and paste the “redacted” content into another document. If the redaction succeeded, no text should be transferred. This method will catch most, but not all, redaction failures. A more rigorous check is to remove the redaction rectangles from the document and manually observe what’s underneath them. One of the scripts I’m releasing today, called remove_rectangles.pl, does just that. In its current form, it’s probably not user-friendly enough for non-programmers to use, but it would be relatively straightforward for someone (perhaps Adobe or the courts) to build a user-friendly version that ordinary users could use to verify that the document they just attempted to redact actually got redacted.
One approach we don’t endorse is printing the document out, redacting it with a black marker, and then re-scanning it to PDF format. Although this may succeed in removing the sensitive information, we don’t recommend this approach because it effectively converts the document into a raster-based image, destroying useful information in the process. For example, it will no longer be possible to cut and paste (non-redacted) text from a document that has been redacted in this way.
Bad redactions are not a new problem, but they are taking on a new urgency as PACER documents become increasingly available on the web. Correct redaction is not difficult, but it does require both knowledge and care by those who are submitting the documents. The courts have several important roles they should play: educating attorneys about their redaction responsibilities, providing them with software tools that make it easy for them to comply, and monitoring submitted documents to verify that the rules are being followed.
This research was made possible with the financial support of Carl Malamud’s organization, Public.Resource.Org.
The federal courts conducted a study last fall of 10 million PACER documents filed nationwide during a two-month period, looking for instances of unredacted social security numbers. The study found approximately 2,900 documents with one or more unredacted social security numbers, including 71 documents in which the effort to redact the SSN failed in the way described in this article. See Memorandum to Hon. Reena Raggi, Chair, Privacy Protection Subcommittee, from George Cort and Joe Cecil, Federal Judicial Center, Social Security Numbers in Federal Court Documents (April 5, 2010), available at http://www.fjc.gov/library/fjc_catalog.nsf.
Several federal courts also warn about problems of redacting metadata in electronically-filed court documents; see Guidance on Redacting Personal Data Identifiers in Electronically-Filed Documents (http://www.cadc.uscourts.gov/internet/home.nsf/Content/Guidance%20on%20Redacting%20Personal%20Data%20Identifiers%20in%20Electronically%20Filed%20Documents/$FILE/ECF%20Redaction%20Guide.pdf) and Effective Personal-Identity and Metadata Redaction Techniques for E-Filing (http://www.njd.uscourts.gov/cm-ecf/RedactTips.pdf).
There are tools in the marketplace to do proper redaction of image/Office/PDF documents. In the case of PDF documents, it is possible to do proper redaction and produce a searchable redacted PDF as the output, thereby retaining the search capabilities for the non-sensitive text. Please check out our various desktop and server-side redaction products at http://www.ExtractSystems.com. Our high-end system has been used to redact almost 2 billion images across our customer base, using fully automated redaction or selective human-verification-based redaction processes.
It may not be clear to the naked eye, but with scanned digital images you have a good shot at separating the slight color difference between the marker pen and the printer ink.
By running the image through a gamma curve adjustment and/or a psychedelic color filter you can amplify the difference in chromaticity [color] between the marker pen (which is typically a really dark blue, not actually black) and the toner.
Older scanners or grayscale scans (e.g. fax machines) don’t have this problem, since they don’t have sufficient sensitivity to tell the difference, but with most modern scanners this will work.
NSA has an updated guide for redacting Office 2007 documents: http://www.nsa.gov/ia/_files/support/I733-028R-2008.pdf
The new Office 2007 guide is essentially the same as the Office 2003 (and earlier) guide: make changes in Word, then convert to PDF.
Many redaction failures in the well-publicized SCO vs. The Rational World cases can best be explained by strategic decisions by the various counsel. The lawyers, while implementing the “letter” of the protection orders, sought to release the underlying information to the initiates studying the case.
Great article! For the purposes of this conversation, what really matters is that PDF is likely to be vector when created from electronic documents, as most documents filed with PACER are. Yes, any embedded graphics or scanned documents may be raster and yes, PDF allows for that, but the vector text is what gets people when it comes to redaction.
I spend a lot of time helping law offices do redaction so I want to point out that there are several software applications out there that do proper electronic redaction. Adobe Acrobat itself has quite a few redaction features but it can be pricey. There is also a product called Redact-It that is less expensive and pretty useful. And both help folks check the document before converting to the final PDF for filing. Can’t stress that step enough!
One thing that should be easy to do is check how many of your 1.8 million documents have no text at all (so they were probably scanned), and ignore such documents in the rest of your analysis.
Another idea would be to render the documents without the redaction rectangles, and check whether something that “looks like” text was underneath. You might be able to approximate “looks like text” by something as simple as looking at the pixel intensity histogram–text will probably have a mostly bimodal distribution, with most of the pixels clumping at or near “white” and a much smaller number at “black”.
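The histogram idea could be approximated along these lines. This is a minimal Python sketch, assuming the region under a removed rectangle has already been rendered to 8-bit grayscale pixel values (0 is black, 255 is white) by some other tool. The function name looks_like_text and every threshold in it are illustrative guesses, not tuned values.

```python
from typing import Sequence


def looks_like_text(pixels: Sequence[int],
                    dark_max: int = 64,
                    light_min: int = 192,
                    min_dark: float = 0.02,
                    max_dark: float = 0.5,
                    max_mid: float = 0.2) -> bool:
    """Heuristic: text renders as mostly light pixels plus a smaller dark cluster.

    All thresholds are assumptions chosen for illustration:
      dark_max / light_min  cut-offs for counting a pixel as dark or light
      min_dark / max_dark   allowed share of dark pixels
      max_mid               allowed share of in-between gray pixels
    """
    total = len(pixels)
    if total == 0:
        return False
    dark = sum(1 for p in pixels if p <= dark_max)
    light = sum(1 for p in pixels if p >= light_min)
    mid = total - dark - light
    dark_fraction = dark / total
    return (min_dark <= dark_fraction <= max_dark
            and mid / total <= max_mid
            and light > dark)


# Synthetic regions standing in for rendered pixels.
text_like = [255] * 850 + [0] * 120 + [128] * 30   # white page, some ink
blank = [255] * 1000                               # nothing under the box
solid_black = [0] * 1000                           # rectangle still present
photo_like = list(range(256)) * 4                  # smooth spread of grays

for name, region in [("text_like", text_like), ("blank", blank),
                     ("solid_black", solid_black), ("photo_like", photo_like)]:
    print(name, looks_like_text(region))
```

A heuristic like this would misfire on line drawings, signatures, or tables, so it is better suited to ranking documents for manual review than to making a final call.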
While it doesn’t solve the “PDF is hell” issue, adding a 6-hour-before-release ‘holding period’ would give people time to actually proofread and reupload if they missed a spot.
This would not be nearly as fun as holding everyone accountable for perfection with no sane/non-liability-increasing way to correct known/discovered errors, though.
PDF allows any mix of vector and bitmap images. So the statements “The simplest image formats are bitmap or raster formats…. The PDF format uses a different approach, known as vector graphics…” are inaccurate.
PDF is a vector format; while it may contain bitmaps (as mentioned in the article in the note about printing-and-then-scanning), the PDF itself is always a vector graphic.