December 8, 2024

Is Insurance Regulation the Next Frontier in Open Government Data?

My friend Ray Lehman points to an intriguing opportunity to expand public access to government data: insurance regulation. The United States has a decentralized, state-based system for regulating the insurance industry. Insurance companies are required to disclose data on their premiums, claims, assets, and many other topics to state regulators for each state in which they do business. These data are then shared with the National Association of Insurance Commissioners, a private, non-profit organization that aggregates them and sells access to the combined database. Ray tells the story:

The major clients for the NAIC’s insurance data are market analytics firms like Charlottesville, Va.-based SNL Financial and insurance rating agency A.M. Best (Full disclosure: I have been, at different times, an employee at both firms) who repackage the information in a lucrative secondary market populated by banks, broker-dealers, asset managers and private investment funds. While big financial institutions make good use of the data, the rates charged by firms like Best and SNL tend to be well out of the price range of media and academic outlets who might do likewise.

And where a private stockholder interested in reading the financials of a company whose shares he owns can easily look up the company’s SEC filings, a private policyholder interested in, say, the reserves held by the insurer he has entrusted to protect his financial future…has essentially nowhere to turn.

However, Ray points out that the recently-enacted Dodd-Frank legislation may change that, as it creates a new Federal Insurance Office. That office will collect data from state regulators and likely has the option to disclose that data to the general public. Indeed, Ray argues, the Freedom of Information Act may even require that the data be disclosed to anyone who asks. The statute is ambiguous enough that in practice it’s likely to be up to FIO director Michael McRaith to decide what to do with the data.

I agree with Ray that McRaith should make the data public. As several CITP scholars have argued, free bulk access to government data has the potential to create significant value for the public. These data could be of substantial value for journalists covering the insurance industry and academics studying insurance markets. And with some clever hacking, it could likely be made useful for consumers, who would have more information with which to evaluate the insurance companies in their state.

What Gets Redacted in PACER?

In my research on privacy problems in PACER, I spent a lot of time examining PACER documents. In addition to researching the problem of “bad” redactions, I was also interested in learning about the pattern of redactions generally. To this end, my software looked for two redaction styles. One is the “black rectangle” redaction method I described in my previous post. This method sometimes fails, but most of these redactions were done successfully. The more common method (around two-thirds of all redactions) involves replacing sensitive information with strings of XXs.

Out of the 1.8 million documents it scanned, my software identified around 11,000 documents that appeared to have redactions. Many of them could be classified automatically (for example, “123-45-xxxx” is clearly a redacted Social Security number, while “Exxon” is a false positive), but I examined several thousand by hand.
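
To give a sense of what that triage involves, here is a minimal sketch in Perl (the rest of my scripts are Perl as well). The function name, heuristics, and sample strings are purely illustrative; this is not my actual research code:

    #!/usr/bin/perl
    # Illustrative triage of strings flagged as possible redactions.
    use strict;
    use warnings;

    sub classify_marker {
        my ($s) = @_;
        # "123-45-xxxx"-style strings are almost certainly redacted SSNs.
        return 'redacted SSN'     if $s =~ /\b\d{3}-\d{2}-[xX]{4}\b/;
        # Known false positives: ordinary words that happen to contain Xes.
        return 'false positive'   if $s =~ /\bExxon\b/i;
        # A long run of Xes is a likely generic redaction...
        return 'likely redaction' if $s =~ /\b[xX]{3,}\b/;
        # ...and anything else gets set aside for manual review.
        return 'needs manual review';
    }

    print classify_marker('123-45-xxxx'), "\n";         # redacted SSN
    print classify_marker('Exxon'), "\n";               # false positive
    print classify_marker('Account No. XXXXXX'), "\n";  # likely redaction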

Here is the distribution of the redacted documents I found.

Type of Sensitive Information       No. of Documents
Social Security number              4315
Bank or other account number        675
Address                             449
Trade secret                        419
Date of birth                       290
Unique identifier other than SSN    216
Name of person                      129
Phone, email, IP address            60
National security related           26
Health information                  24
Miscellaneous                       68
Total                               6208

To reiterate the point I made in my last post, I didn’t have access to a random sample of the PACER corpus, so we should be cautious about drawing any precise conclusions about the distribution of redacted information in the entire PACER corpus.

Still, I think we can draw some interesting conclusions from these statistics. It’s reasonable to assume that the distribution of redacted sensitive information is similar to the distribution of sensitive information in general. That is, assuming that parties who redact documents do a decent job, this list gives us a (very rough) idea of what kinds of sensitive information can be found in PACER documents.

The most obvious lesson from these statistics is that Social Security numbers are by far the most common type of redacted information in PACER. This is good news, since it’s relatively easy to build software to automatically detect and redact Social Security numbers.
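
For what it’s worth, the core of such a tool can be a few lines of pattern matching. The sketch below handles only text (not scanned images) and only the standard nnn-nn-nnnn format; a production tool would need to deal with other formats and with false positives:

    #!/usr/bin/perl
    # Sketch: replace anything that looks like an SSN with a fixed placeholder.
    use strict;
    use warnings;

    while (my $line = <STDIN>) {
        $line =~ s/\b\d{3}-\d{2}-\d{4}\b/xxx-xx-xxxx/g;
        print $line;
    }

Run as a filter (for example, perl redact_ssn.pl < filing.txt > filing_redacted.txt, with hypothetical file names), it leaves everything else in the document untouched.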

Another interesting case is the “address” category. Almost all of the redacted items in this category—393 out of 449—appear in the District of Columbia District. Many of the documents relate to search warrants and police reports, often in connection with drug cases. I don’t know if the high rate of redaction reflects the different mix of cases in the DC District, or an idiosyncratic redaction policy voluntarily pursued by the courts and/or the DC police but not by officials in other districts. It’s worth noting that the redaction of addresses doesn’t appear to be required by the federal redaction rules.

Finally, there’s the category of “trade secrets,” which is a catch-all term I used for documents whose redactions appear to conceal confidential business information. Private businesses may have a strong interest in keeping this information confidential, but the public’s interest in such secrecy is less clear.

To summarize, out of 6208 redacted documents, there are 4315 Social Security numbers that could be redacted automatically by machine, 449 addresses whose redaction doesn’t seem to be required by the rules of procedure, and 419 “trade secrets” whose release would typically only harm the party that failed to redact them.

That leaves around 1000 documents that would expose risky confidential information if not properly redacted, or about 0.05 percent of the 1.8 million documents I started with. A thousand documents is worth taking seriously (especially given that there are likely to be tens of thousands in the full PACER corpus). The courts should take additional steps to monitor compliance with the redaction rules and sanction parties who fail to comply with them, and they should explore techniques to automate the detection of redaction failures in these categories.

But at the same time, a sense of perspective is important. This tiny fraction of PACER documents with confidential information in them is a cause for concern, but it probably isn’t a good reason to limit public access to the roughly 99.9 percent of documents that contain no sensitive information and may be of significant benefit to the public.

Thanks again to Carl Malamud and Public.Resource.Org for their support of my research.

Studying the Frequency of Redaction Failures in PACER

Since we launched RECAP a couple of years ago, one of our top concerns has been privacy. The federal judiciary’s PACER system offers the public online access to hundreds of millions of court records. The judiciary’s rules require each party in a case to redact certain types of information from documents they submit, but unfortunately litigants and their counsel don’t always comply with these rules. Three years ago, Carl Malamud did a groundbreaking audit of PACER documents and found more than 1600 cases in which litigants submitted documents with unredacted Social Security numbers. My recent research has focused on a different problem: cases where parties tried to redact sensitive information but the redactions failed for technical reasons. This problem occasionally pops up in news stories, but as far as I know, no one has conducted a systematic study.

To understand the problem, it helps to know a little bit about how computers represent graphics. The simplest image formats are bitmap or raster formats. These represent an image as an array of pixels, with each pixel having a color represented by a numeric value. The PDF format uses a different approach, known as vector graphics, which represents an image as a series of drawing commands: lines, rectangles, lines of text, and so forth.

Vector graphics have important advantages. Vector-based formats “scale up” gracefully, in contrast to raster images, which look “blocky” when enlarged. Vector graphics also do a better job of preserving a document’s structure. For example, text in a PDF is represented by a sequence of explicit text-drawing commands, which is why you can cut and paste text from a PDF document but not from a raster format like PNG.

But vector-based formats also have an important disadvantage: they may contain more information than is visible to the naked eye. Raster images have a “what you see is what you get” quality—changing all the pixels in a particular region to black destroys the information that was previously in that part of the image. But a vector-based image can have multiple “layers.” There might be a command to draw some text followed by a command to draw a black rectangle over the text. The image might look like it’s been redacted, but the text is still “under” the box. And often extracting that information is a simple matter of cutting and pasting.
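
To make this concrete, here is a toy example. The content stream below is hypothetical and hand-written (real PDFs usually compress their streams), but the structure is the same: the text is drawn first and the black box is painted on top of it, so a couple of lines of pattern matching can recover the “hidden” text:

    #!/usr/bin/perl
    # A hypothetical, uncompressed PDF content stream: text drawn, then covered.
    use strict;
    use warnings;

    my $content_stream = join "\n",
        'BT',
        '  /F1 12 Tf',
        '  72 700 Td',
        '  (SSN: 123-45-6789) Tj',
        'ET',
        '0 0 0 rg            % set the fill color to black',
        '70 695 160 16 re    % a rectangle covering the text',
        'f                   % fill it: the page now shows a black box';

    # The "redacted" text is still in the file and trivial to pull back out:
    while ($content_stream =~ /\(([^)]*)\)\s*Tj/g) {
        print "recovered: $1\n";
    }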

So how many PACER documents have this problem? We’re in a good position to study this question because we have a large collection of PACER documents—1.8 million of them when I started my research last year. I wrote software to detect redaction rectangles—it turns out these are relatively easy to recognize based on their color, shape, and the specific commands used to draw them. Out of 1.8 million PACER documents, there were approximately 2000 documents with redaction rectangles. (There were also about 3500 documents that were redacted by replacing text with strings of Xes. I also excluded documents that had been redacted by Carl Malamud before he donated them to our archive.)

Next, my software checked to see if these redaction rectangles overlapped with text. It identified a few hundred documents that appeared to have text under redaction rectangles, and examining them by hand revealed 194 documents with failed redactions. The majority of the documents (about 130) appear to be from commercial litigation, in which parties have unsuccessfully attempted to redact trade secrets such as sales figures and confidential product information. Other improperly redacted documents contain sensitive medical information, addresses, and dates of birth. Still others contain the names of witnesses, jurors, plaintiffs, and one minor.
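
For the curious, here is a very rough sketch of what that overlap test looks like. It is not my actual code: the real analysis, built on the CAM::PDF Perl library, has to decompress content streams, track the graphics state and coordinate transforms, and measure how far each text run extends. The version below just compares each text run’s starting point against each filled black rectangle in an already-uncompressed content stream:

    #!/usr/bin/perl
    # Rough sketch: flag text whose starting point lies inside a black rectangle.
    use strict;
    use warnings;

    my $stream = do { local $/; <STDIN> };   # an uncompressed content stream

    # Collect filled black rectangles: "0 0 0 rg ... x y w h re ... f"
    my @boxes;
    while ($stream =~ /0 0 0 rg\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+re\s+f/g) {
        push @boxes, [$1, $2, $3, $4];
    }

    # Text runs and the positions where they start: "x y Td ... (text) Tj"
    while ($stream =~ /([\d.]+)\s+([\d.]+)\s+Td.*?\(([^)]*)\)\s*Tj/gs) {
        my ($x, $y, $text) = ($1, $2, $3);
        for my $b (@boxes) {
            my ($bx, $by, $bw, $bh) = @$b;
            if ($x >= $bx && $x <= $bx + $bw && $y >= $by && $y <= $by + $bh) {
                print "possible failed redaction: \"$text\"\n";
            }
        }
    }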

Implications

PACER reportedly contains about 500 million documents. We don’t have a random sample of PACER documents, so we should be careful about trying to extrapolate to the entire PACER corpus. Still, it’s safe to say there are thousands, and probably tens of thousands, of documents in PACER whose authors made unsuccessful attempts to conceal information.

It’s also important to note that my software may not catch every redaction failure. If a PDF was created by scanning a paper document (as opposed to being generated directly from a word processor), then it probably won’t have a “text layer.” My software doesn’t detect redaction failures in this type of document. This means that there may be more than 194 failed redactions among the 1.8 million documents I studied.

A few weeks ago I wrote a letter to Judge Lee Rosenthal, chair of the federal judiciary’s Committee on Rules of Practice and Procedure, explaining this problem. In that letter I recommended that the courts themselves use software like mine to automatically scan PACER documents for this type of problem. In addition to scanning the documents they already have, the courts should make such scanning a standard part of the process for filing new documents. This would allow them to catch these problems before the documents are made available to the public on the PACER website.

My code is available here. It’s experimental research code, not a finished product. We’re releasing it into the public domain under the CC0 public domain dedication; this should make it easy for federal and state officials to adapt it for their own use. Court administrators who are interested in adapting the code are especially encouraged to contact me for advice and assistance. The code relies heavily on the CAM::PDF Perl library, and I’m indebted to Chris Dolan for his patient answers to my many dumb questions.

Getting Redaction Right

So what should litigants do to avoid this problem? The National Security Agency has a good primer on secure redaction. The approach it recommends is the safest: completely delete the sensitive information from the original word processing document, replace it with innocuous filler (such as strings of XXes) as needed, and only then convert the document to PDF. The NSA primer also explains how to check for other potentially sensitive information that might be hidden in a document’s metadata.

Of course, there may be cases where this approach isn’t feasible because a litigant doesn’t have the original word processing document or doesn’t want the document’s layout to be changed by the redaction process. Adobe Acrobat’s redaction tool has worked correctly when we’ve used it, and Adobe probably has the expertise to get redaction right. There may be other tools that work correctly, but we haven’t had an opportunity to experiment with them, so we can’t say which ones they might be.

Regardless of the tool used, it’s a good idea to take the redacted document and double-check that the information was removed. An easy way to do this is to simply cut and paste the “redacted” content into another document. If the redaction succeeded, no text should be transferred. This method will catch most, but not all, redaction failures. A more rigorous check is to remove the redaction rectangles from the document and manually observe what’s underneath them. One of the scripts I’m releasing today, called remove_rectangles.pl, does just that. In its current form, it’s probably not user-friendly enough for non-programmers to use, but it would be relatively straightforward for someone (perhaps Adobe or the courts) to build a user-friendly version that ordinary users could use to verify that the document they just attempted to redact actually got redacted.
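
If you want to script the cut-and-paste check rather than do it by hand, something along these lines works: dump the text layer of the redacted PDF and search it for strings that should no longer be there. This assumes the pdftotext utility (part of poppler-utils) is installed; the file name and the list of strings are placeholders:

    #!/usr/bin/perl
    # Check a redacted PDF's text layer for strings that should have been removed.
    use strict;
    use warnings;

    my $pdf = 'redacted_filing.pdf';                  # placeholder file name
    my @must_not_appear = ('123-45-6789', 'John Q. Witness');

    my $text = `pdftotext "$pdf" -`;                  # "-" writes the extracted text to stdout
    die "pdftotext failed\n" if $? != 0;

    for my $s (@must_not_appear) {
        print "WARNING: \"$s\" still appears in $pdf\n" if index($text, $s) >= 0;
    }

Note that this catches only text-layer leaks; like my own scanning software, it won’t help with a document that was scanned from paper, and it says nothing about metadata.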

One approach we don’t endorse is printing the document out, redacting it with a black marker, and then re-scanning it to PDF format. Although this may succeed in removing the sensitive information, we don’t recommend this approach because it effectively converts the document into a raster-based image, destroying useful information in the process. For example, it will no longer be possible to cut and paste (non-redacted) text from a document that has been redacted in this way.

Bad redactions are not a new problem, but they are taking on a new urgency as PACER documents become increasingly available on the web. Correct redaction is not difficult, but it does require both knowledge and care by those who are submitting the documents. The courts have several important roles they should play: educating attorneys about their redaction responsibilities, providing them with software tools that make it easy for them to comply, and monitoring submitted documents to verify that the rules are being followed.

This research was made possible with the financial support of Carl Malamud’s organization, Public.Resource.Org.

Google Should Stand up for Fair Use in Books Fight

On Tuesday Judge Denny Chin rejected a proposed settlement in the Google Book Search case. My write-up for Ars Technica is here.

The question everyone is asking is what comes next. The conventional wisdom seems to be that the parties will go back to the bargaining table and hammer out a third iteration of the settlement. It’s also possible that the parties will try to appeal the rejection of the current settlement. Still, in case anyone at Google is reading this, I’d like to make a pitch for the third option: litigate!

Google has long been at the forefront of efforts to shape copyright law in ways that encourage innovation. When the authors and publishers first sued Google back in 2005, I was quick to defend the scanning of books under copyright’s fair use doctrine. And I still think that position is correct.

Unfortunately, in 2008 Google saw an opportunity to strike a separate truce with the publishing industry that placed Google at the center of the book business and left everyone else out in the cold. Because of the peculiarities of class action law, the settlement would have given Google the legal right to use hundreds of thousands of “orphan” works without actually getting permission from their copyright holders. Competitors who wanted the same deal would have had no realistic way of getting one. Googlers are a smart bunch, and so they took what was obviously a good deal for them even though it was bad for fair use and online innovation.

Now the deal is no longer on the table, and it’s not clear if it can be salvaged. Judge Chin suggested that he might approve a new, “opt-in” settlement. But switching to an opt-in rule would undermine the very thing that made the deal so appealing to Google in the first place: the freedom to incorporate works whose copyright status was unclear. Take that away, and it’s not clear that Google Book Search can exist at all.

Moreover, I think the failure of the settlement may strengthen Google’s fair use argument. Fair use exists as a kind of safety valve for the copyright system, to ensure that it does not damage free speech, innovation, and other values. Although formally speaking judges are supposed to run through the famous four factor test to determine what counts as a fair use, in practice an important factor is whether the judge perceives the defendant as having acted in good faith. Google has now spent three years looking for a way to build its Book Search project using something other than fair use, and come up empty. This underscores the stakes of the fair use fight: if Judge Chin ruled against Google’s fair use argument, it would mean that it was effectively impossible to build a book search engine as comprehensive as the one Google has built. That outcome doesn’t seem consistent with the constitution’s command that copyright promote the progress of science and the useful arts.

In any event, Google may not have much choice. If it signs an “opt-in” settlement with the Authors Guild and the Association of American Publishers, it’s likely to face a fresh round of lawsuits from other copyright holders who aren’t members of those organizations — and they might not be as willing to settle for a token sum. So if Google thinks its fair use argument is a winner, it might as well test it now, before it has paid out any settlement money. And if it’s not, then this business might be too expensive for Google to be in at all.

Predictions for 2011

As promised, the official Freedom to Tinker predictions for 2011. These predictions are the result of discussions that included me, Joe Hall, Steve Schultze, Wendy Seltzer, Dan Wallach, and Harlan Yu, but note that we don’t individually agree with every prediction.

  1. DRM technology will still fail to prevent widespread infringement. In a related development, pigs will still fail to fly.
  2. Copyright and patent issues will continue to be stalemated in Congress, with no major legislation on either subject.
  3. Momentum will grow for HTTPS by default, with several major websites adding HTTPS support. Work will begin on adding HTTPS-by-default support to Apache.
  4. Despite substantial attention by Congress to online privacy, the FTC won’t be granted authority to mandate Do Not Track compliance.
  5. Some advertising networks and third-party Web services will begin to voluntarily respect the Do Not Track header, which will be supported by all the major browsers. However, sites will have varying interpretations of what the DNT header requires, leading to accusations that some purportedly DNT-respecting sites are not fully DNT-compliant.
  6. Congress will pass an electronic privacy bill along the lines of the principles set out by the Digital Due Process Coalition.
  7. The seemingly N^2 patent lawsuits among all the major smartphone players will be resolved through a grand cross-licensing bargain, cut in a dark, smoky room, whose terms will only be revealed through some congratulatory emails that leak to the press. None of these lawsuits will get anywhere near a courtroom.
  8. Android smartphones will continue gaining market share, mostly at the expense of BlackBerry and Windows Mobile phones. However, Android’s gains will mostly be at the low end of the market; the iPhone will continue to outsell any single Android smartphone model by a wide margin.
  9. 2011 will see the outbreak of the first massive botnet/malware that attacks smartphones, most likely iPhone or Android models running older software than the latest and greatest. If Android is the target, it will lead to aggressive finger-pointing, particularly given how many users are presently running Android software that’s a year or more behind Google’s latest—a trend that will continue in 2011.
  10. Mainstream media outlets will continue building custom “apps” to present their content on mobile devices. They’ll fall short of expectations and fail to reverse the decline of any magazines or newspapers.
  11. At year’s end, the district court will still not have issued a final judgment on the Google Book Search settlement.
  12. The market for Internet set-top boxes like Google TV and Apple TV will continue to be chaotic throughout 2011, with no single device taking a decisive market share lead. The big winners will be online services like Netflix, Hulu, and Pandora that work with a wide variety of hardware devices.
  13. Online sellers with device-specific consumer stores (Amazon for Kindle books, Apple for iPhone/iPad apps, Microsoft for Xbox Live, etc.) will come under antitrust scrutiny, and perhaps even be dragged into court. Nothing will be resolved before the end of 2011.
  14. With electronic voting machines beginning to wear out but budgets tight, there will be much heated discussion of electronic voting, including antitrust concern over the e-voting technology vendors. But there will be no fundamental changes in policy. The incumbent vendors will continue to charge thousands of dollars for products that cost them a tiny fraction of that to manufacture.
  15. Pressure will continue to mount on election authorities to make it easier for overseas and military voters to cast votes remotely, despite all the obvious-to-everybody-else security concerns. While counties with large military populations will continue to conduct “pilot” studies with Internet voting, with grandiose claims of how they’ve been “proven” secure because nobody bothered to attack them, very few military voters will cast actual ballots over the Internet in 2011.
  16. In contrast, where domestic absentee voters are permitted to use remote voting systems (e.g., systems that transmit blank ballots that the voter returns by mail) voters will do so in large numbers, increasing the pressure to make remote voting easier for domestic voters and further exacerbating security concerns.
  17. At least one candidate for the Republican presidential nomination will express concern about the security of electronic voting machines.
  18. Multiple Wikileaks alternatives will pop up, and pundits will start to realize that mass leaks are enabled by technology trends, not just by one freaky Australian dude.
  19. The RIAA and/or MPAA will be sued over their role in the government’s actions to reassign DNS names owned by allegedly unlawful web sites. Even if the lawsuit manages to get all the way to trial, there won’t be a significant ruling against them.
  20. Copyright claims will be asserted against players even further removed from underlying infringement than Internet/online Service Providers: domain name system participants, ad and payment networks, and upstream hosts. Some of these claims will win at the district court level, mostly on default judgments, but appeals will still be pending at year’s end.
  21. A distributed naming system for Web/broadcast content will gain substantial mindshare and measurable US usage after the trifecta of attacks on Wikileaks DNS, COICA, and further attacks on privacy-preserving or anonymous registration in the ICANN-sponsored DNS. It will go even further in another country.
  22. ICANN still will not have introduced new generic TLDs.
  23. The FCC’s recently-announced network neutrality rules will continue to attract criticism from both ends of the political spectrum, and will be the subject of critical hearings in the Republican House, but neither Congress nor the courts will overturn the rules.
  24. The tech policy world will continue debating the Comcast/Level 3 dispute, but Level 3 will continue paying Comcast to deliver Netflix content, and the FCC won’t take any meaningful actions to help Level 3 or punish Comcast.
  25. Comcast and other cable companies will treat the Comcast/Level 3 dispute as a template for future negotiations, demanding payments to terminate streaming video content. As a result, the network neutrality debate will increasingly focus on streaming high-definition video, and legal academia will become a lot more interested in the economics of Internet interconnection.