November 26, 2024

Cloud(s), Hype, and Freedom

Richard Stallman’s recent description of ‘the cloud’ as ‘hype’ and a ‘trap’ seems to have stirred up a lot of commentary, but not a lot of clear discussion of the problems Stallman raised. This isn’t surprising- the term ‘the cloud’ has always been vague. (It was hard to resist saying ‘cloudy.’ 😉) When people say ‘the cloud’ they are really lumping at least four ‘cloud types’ together.

traditional applications, hosted elsewhere

Probably the most common type of ‘cloud’ is a service that takes traditional software functionality and moves it onto remotely hosted, (typically) web-delivered servers. Gmail and salesforce.com are like this- fairly traditional email and CRM applications, ‘just’ moved to the web.

If Stallman’s ‘hype’ claim is valid anywhere, it is here. Administration and maintenance costs are definitely lower when an expert like Google funds and runs the servers, and reliability may improve as well. But the core functionality of these apps, and the ability to access data over a network, have been present since the dawn of networked computing. On average, this is undoubtedly a significant change in degree, but only rarely a change in kind- making the buzz much harder to justify.

Stallman’s ‘trap’ charge is more complex. Computer users have long compromised on personal control by storing data remotely but accessing it via standardized protocols. This introduced risks- you had to trust the data host and couldn’t tinker with the server- but kept some controls- you could switch clients, and typically you could export the data. Some web apps still strike that balance- for example, most gmail features are accessible via good old POP and IMAP. But others don’t.
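
To make the ‘standard protocols’ point concrete, here is a minimal sketch of reading a Gmail inbox over plain IMAP with Python’s standard library- assuming IMAP access is enabled on the account, and using placeholder credentials. Any client that speaks the protocol can get at the same data; the service holds your mail, but the protocol keeps the exit door open.

    import imaplib

    # Connect to Gmail's standard IMAP endpoint over SSL.
    with imaplib.IMAP4_SSL("imap.gmail.com") as conn:
        conn.login("you@example.com", "app-password")   # placeholder credentials
        conn.select("INBOX", readonly=True)             # read-only: touch nothing
        status, data = conn.search(None, "ALL")         # list message ids
        print("Messages in INBOX:", len(data[0].split()))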

Getting your data out of a service like salesforce can be a ‘hidden cost’ of an apparently free service, and even with a relatively standards-based service like gmail you have no freedom to make changes to the server. These risks are what Stallman means when he talks about a ‘trap’, and regardless of your conclusion about them, understanding them is important.

services involving data that can’t (yet) be managed locally

Google Maps and Google Search are the canonical examples of this type of cloud service- heaps of data so large that you would need a large data center to host your own copy and a very, very fat pipe to keep it up-to-date.

Hype-wise, these are a mixed bag. These services definitely bring radical new functionality that couldn’t exist under the traditional, locally hosted model- I can’t store all of google maps on my phone. That hype is justified. At the same time, our personal ability to store and process data is still growing quickly, so the claims that this type of cloud service will always ‘require’ remote servers may be overblown.

‘Trap’-wise? Dependence on these services reminds me of ‘dependence’ on a library before the internet- you can work to make sure your library respects your privacy, prefer public libraries to private ones, or establish a personal library if your reading interests are narrow, but in the end eschewing large libraries is likely to be a case of cutting off your nose to spite your face. We’re in the same state with this type of cloud service. You can avoid them, but those concerned with freedom might be better off understanding and fixing them than condemning them altogether.

services that make creation of new data technically or economically feasible

Facebook and wikipedia are the canonical examples here. Unlike the first two types of cloud, where data was available but inconvenient before it ended up in the cloud, this class of cloud applications creates information that wasn’t previously feasible to collect at all.

There may well not be enough hype around this type of cloud. Replicating web-scale collaborative facilities like these will be very difficult to do in a p2p fashion, and the impact of the creation of new information (even when it is as mundane as facebook’s data often is) is hard to overstate.

Like the previous type of cloud, it is hard to call these a trap per se- they do make it hard to leave, but they do so by providing new functionality that is very hard to get with any traditional software model.

services offering computing and storage, rather than data

The most recent type of cloud service is remotely provisioned computing and storage, like Amazon’s EC2/S3 and Google’s App Engine. This is perhaps the most purely generative type of cloud, allowing individuals to create new services and scale them out to service millions of people without having to invest in their own physical infrastructure. It is hard to see any way in which this can reasonably be called ‘hype,’ given the reach it allows individuals and small or transient groups to have which might otherwise cost them many thousands of dollars.
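
To make the ‘reach’ point concrete, here is a minimal sketch of writing an object to S3, assuming the boto3 library and a hypothetical bucket name- no servers of your own, just an authenticated API call against someone else’s infrastructure.

    import boto3

    # Create an S3 client from whatever AWS credentials are configured locally.
    s3 = boto3.client("s3")

    # Store a small object in a hypothetical bucket; Amazon supplies the storage,
    # replication, and serving infrastructure.
    s3.put_object(Bucket="example-bucket",
                  Key="hello.txt",
                  Body=b"Hello from rented infrastructure")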

From a freedom perspective, these can be both the best and worst of the cloud types. On the plus side, these services can be incredibly transparent- developers who use them directly have access to their own source code, and end users may not know they are using them at all. On the down side, especially for proprietary platforms like App Engine, these can have very deep lock-in- it is complicated, expensive, and risky to switch deployment platforms after achieving success. And they replace traditional, very open platforms- a tradeoff that isn’t always appreciated.

takeaways

‘The cloud’ isn’t going away, but hopefully we can clarify our thinking about it by talking about the different types of clouds. I hope this post is a useful step in that direction.

[This post is an extension of some ideas I’ve been playing around with on my own blog and at the autonomo.us group blog; readers curious about these issues may want to read further in those places. I also recommend reading this piece, which set me on the (very long) road to this particular post.]

Why is printing so hard?

Recently I bought a mildly used laser printer and wanted to set it up on my home network. In a better world, this would be a trivial exercise — just connect the printer to the network and let the computers discover it. In the actual world, it was a forty-five minute project that only a reasonably handy network jockey could have hoped to complete. (If you care about what exactly I had to do, see below.)

John Hartman says, “Printing is the hardest problem in computer science.” It often seems that way. But why?

Plug-and-play printing seems pretty simple, compared to many of the things that computers do routinely without trouble. Granted, it’s not trivial to get the full variety of printers to work with the full variety of computers, but our collective failure to do so is — or should be — surprising.

There must be some lesson here about engineering, or human nature, or something. Lately I’ve gone around asking people why printing is so hard. I’ve gotten some interesting answers, but I don’t think I really understand the issue yet.

What do you think? Why is printing so hard?

[For the record, here’s what I had to do to get our newly acquired HP LaserJet 2200DN printer working on our home network: I plugged the printer in to our network, but the Windows PCs couldn’t auto-discover the printer. I Googled the printer’s user manual, which said the printer had a built-in webserver. But I didn’t know the printer’s IP address, so I had to log in to our router and look at its DHCP tables. Knowing the IP address, I could connect to the printer’s webserver, which had a page telling me what URL to use for IPP printing. (I had to know what IPP was.) After that, I assigned the printer a static IP address, so the IPP URL (containing an IP address) would keep working across reboots. Now that I had a stable IPP URL, I could set up the PCs for printing. Finally, I had to guess which driver to use on Windows — two drivers were offered, with no advice about which one to use, but only one of them supports duplex printing. Total elapsed time: about 45 minutes.]
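
A rough sketch of how that discovery step could be automated- assuming a typical 192.168.1.0/24 home subnet and probing for hosts that answer on the IPP port- looks something like this:

    import socket

    IPP_PORT = 631  # Internet Printing Protocol

    def find_ipp_hosts(prefix="192.168.1.", timeout=0.2):
        """Return addresses on the /24 that accept connections on the IPP port."""
        hosts = []
        for i in range(1, 255):
            addr = prefix + str(i)
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.settimeout(timeout)
                if s.connect_ex((addr, IPP_PORT)) == 0:   # 0 means the connection was accepted
                    hosts.append(addr)
        return hosts

    if __name__ == "__main__":
        for host in find_ipp_hosts():
            print("Possible IPP printer at", host)

Of course, the fact that this is the kind of thing a user would have to write (or know how to do by hand) is exactly the problem.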

California Issues Emergency Election Audit Regulations

The Office of the California Secretary of State has issued a set of proposed emergency regulations for post-election manual tallying of paper election records. In this post, my first at FTT, I’ll try to explain and contextualize this development.

Since her election to office, California Secretary of State (CA SoS) Debra Bowen has methodically studied the shortcomings in California’s election equipment. She first initiated a Top-To-Bottom Review (TTBR) of California’s voting systems that found them to be of poor technical quality and plagued by myriad security vulnerabilities, accessibility flaws, reliability issues and inadequate documentation and testing (a number of FTT regulars participated in the TTBR). For this year’s presidential primary in California, Bowen worked to mitigate these problems by decertifying this equipment and then recertifying it subject to a list of about 40 different conditions. One such condition is that the usual 1% manual tally under California law — counties must randomly choose and hand tally ballots cast in 1% of precincts — would be modified to include escalation that would mandate increased tallying for close races (where even small amounts of possible fraud and/or error could make a difference in the outcome of a contest).

Bowen issued these additional requirements (the “PEMT Requirements”) under her authority as CA SoS to regulate election technologies (here are the original PEMT Requirements). Unfortunately, the Registrar in San Diego County sued Bowen, arguing 1) that she didn’t have such broad authority and 2) that, even if she did, she could only issue the PEMT Requirements through the California regulatory procedure (specified by the CA Administrative Procedure Act). A state Superior Court found in favor of the CA SoS, but a Court of Appeal found that the PEMT Requirements did indeed bear the characteristics of regulations and should therefore have gone through the regulatory procedure (for the legal eagles out there, see: County of San Diego v. Debra Bowen (2008) 166 Cal.App.4th 501).

By the time the Court of Appeal had made its decision on August 29, there was no time to follow the normal regulatory process, which takes about four months. Instead, the CA SoS had to follow the process for adopting an emergency regulation which applies when a regulation “is necessary for the immediate preservation of the public peace, health and safety, or general welfare.”

What is so special about these emergency manual tally provisions? First, they reflect the increasing relevance and importance of adversarial considerations in the design of an election audit process. As we describe in the NYU Brennan Center / UC Berkeley Samuelson Clinic report on post-election audits (“Post-Election Audits: Restoring Trust In Elections”), fixed-percentage audits of election records are really only useful for detecting large, widespread anomalies in vote counts. Methods that “tune” the number of records audited depending on the margin in contests on the ballot do a much better job of ensuring that they’ll find evidence of possible error or fraud. Per the emergency PEMT Regulations, any contest with a margin (difference between the winning and losing choice in a contest) of 0.5% or lower is subject to a 10% manual tally, an order of magnitude more scrutiny than the statutory default.
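
In code terms, the escalation is a simple two-tier rule; the sketch below is only an illustration of the rule described above, not language from the regulations:

    def tally_fraction(margin):
        """Fraction of precincts to hand-tally for a contest.

        margin -- winning share minus losing share, as a fraction
                  (e.g. 0.004 means a 0.4% margin).
        """
        if margin <= 0.005:      # 0.5% or closer: escalate to a 10% tally
            return 0.10
        return 0.01              # otherwise the statutory 1% default

    print(tally_fraction(0.004))   # close contest    -> 0.1
    print(tally_fraction(0.08))    # comfortable win  -> 0.01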

Second, the CA SoS’ emergency PEMT Regulations reflect many best practices from audit theory and research: precincts to audit must be chosen randomly; the precincts to audit are only chosen after the semi-official vote tallies are arrived at; tally activities must be announced publicly and available for public observation; tallies must be conducted under “blind count” rules where the talliers do not know the totals in the precincts they’re tallying; differences between machine and hand counts must be explained or investigated.
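
One of those practices- random selection of precincts, made reproducible by a publicly announced seed so observers can verify the draw- is easy to illustrate. The sketch below is an illustration only; the regulations do not prescribe any particular software:

    import random

    def choose_precincts(precinct_ids, fraction, public_seed):
        """Choose a reproducible random sample of precincts to hand-tally."""
        rng = random.Random(public_seed)          # reproducible from the announced seed
        count = max(1, round(len(precinct_ids) * fraction))
        return sorted(rng.sample(precinct_ids, count))

    # Hypothetical county with 500 precincts, audited at the 1% statutory rate.
    precincts = ["P%04d" % n for n in range(1, 501)]
    print(choose_precincts(precincts, 0.01, public_seed="publicly announced dice rolls"))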

The elephant in the room is always Los Angeles County; LA is so amazingly enormous for an election jurisdiction that some things simply aren’t possible. (For example, they frequently pick up ballot materials from precincts in helicopters; that is, traffic in LA is so bad and there are so many polling places (roughly 5,000) that the most reliable form of ballot transmission is via helicopter.) These rules are going to be exceedingly difficult for LA to comply with. I expect they will hire an army of tally managers and talliers to perform their tally and that it will be a race against the clock, counting 24 hours a day, seven days a week, to try to get it all done in the 28-calendar-day canvass period.

Counting Electronic Votes in Secret

Things are not looking good for open government when it comes to observing poll workers on Election Night. Our state election laws, written for the old lever machines, now apply to Sequoia electronic voting machines. Andrew Appel and I have been asking a straightforward question: Can ordinary members of the public watch the procedures used by poll workers to count the votes?

I submitted a formal request to the Board of Elections of Mercer County (where Princeton University is located), seeking permission to watch the poll workers when they close the polls (on Sequoia AVC Advantage voting computers) and announce the results. They said no!

The Election Board said this election is “too important” to permit extra people in the polling place.

They even went so far as to suggest that my written application was fraudulent. I applied on behalf of five people: two Princeton University students, two professors, and myself. Out of an abundance of caution, I requested authorization in the form of “challenger badges,” which the Board of Elections can issue at its discretion. By phone, I explained our interest in merely watching the poll workers.

Of course we understand that they might not want extra people getting in the way on Election Night — that’s why we took measures to get special authorization. To ensure that we could be lawfully present, we asked for challenger badges as non-partisan proponents and opponents of two Public Questions on the ballot, as permitted by NJSA 19:7-2. My request was entirely in compliance with state law, as all the prospective challengers are registered to vote in Mercer County.

In spite of this, the Board expressed reluctance, based on the identities of the prospective challengers. In particular, they cited Andrew’s status as an expert on Sequoia voting machines as a “concern,” and provided assurances that Sequoia had fixed all the problems he identified in past elections.

Other counties in New Jersey permit members of the public to watch the poll workers “read” the election results. Combined with Judge Feinberg’s decision to suppress Andrew’s report on the security of the Sequoia machines, this refusal conveys the unfortunate impression that Mercer County does not welcome scrutiny of its electronic voting process.

Piracy Statistics and the Importance of Journalistic Skepticism

If you’ve paid attention to copyright debates in recent years, you’ve probably seen advocates for more restrictive copyright laws claim that “counterfeiting and piracy” cost the US economy as much as $250 billion. When pressed, those who make these kinds of claims are inevitably vague about exactly where these figures come from. For example, I contacted Thomas Sydnor, the author of the paper I linked above, and he was able to point me to a 2002 press release from the FBI, which claims that “losses to counterfeiting are estimated at $200-250 billion a year in U.S. business losses.”

There are a couple of things that are notable about this. In the first place, notice that the press release says counterfeiting, which is an entirely different issue from copyright infringement. Passing stronger copyright legislation in order to stop counterfeiting is a non-sequitur.

But the more serious issue is that the FBI can’t actually explain how it arrived at these figures. And indeed, it appears that nobody knows who came up with these figures or how they were computed. Julian Sanchez has done some sleuthing and found that these figures have been floating around inside the beltway for decades. Julian contacted the FBI, which wasn’t able to point to any specific source. Further investigation led him to a 1993 Forbes article:

Ars eagerly hunted down that issue and found a short article on counterfeiting, in which the reader is informed that “counterfeit merchandise” is “a $200 billion enterprise worldwide and growing faster than many of the industries it’s preying on.” No further source is given.

Quite possibly, the authors of the article called up an industry group like the IACC and got a ballpark guess. At any rate, there is nothing to indicate that Forbes itself had produced the estimate, Mr. Conyers’ assertion notwithstanding. What is very clear, however, is that even assuming the figure is accurate, it is not an estimate of the cost to the U.S. economy of IP piracy. It’s an estimate of the size of the entire global market in counterfeit goods. Despite the efforts of several witnesses to equate them, it is plainly not on par with the earlier calculation by the ITC that many had also cited.

It’s not surprising that no one is able to cite a credible source, because the figure is plainly absurd. For example, the Institute for Policy Innovation, a group that pushes for more restrictive copyright law, has claimed that copyright infringement costs the economy $58 billion. As I’ve written before, these estimates vastly overstate losses because IPI used a dubious methodology that double- and triple-counts each lost sale. The actual figure, even accepting some of the dubious assumptions in the IPI estimate, is almost certainly less than $20 billion. But whether it’s $10, $20, or $58 billion, it’s certainly not $250 billion.
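
To see why double- and triple-counting matters, consider a toy calculation with hypothetical numbers (chosen only to show the mechanism, not IPI’s actual inputs): if the same forgone retail sale is booked as a separate loss to the retailer, the distributor, and the producer, the headline figure triples without any additional harm occurring.

    # Hypothetical inputs, for illustration only.
    lost_sales = 1_000_000        # forgone retail sales
    retail_price = 10.00          # dollars per unit

    direct_loss = lost_sales * retail_price     # counted once
    # Counting the same dollars again for the distributor and the producer,
    # as if each independently lost the full retail value:
    inflated_loss = direct_loss * 3

    print(f"Counted once:   ${direct_loss:,.0f}")     # $10,000,000
    print(f"Triple-counted: ${inflated_loss:,.0f}")   # $30,000,000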

There are a couple of important lessons here. One concerns the importance of careful scholarship. Before citing any statistic, you should have a clear understanding of what that figure is measuring, who calculated it, and how. The fact that this figure has gotten repeated so many times inside the beltway suggests that the people using the figure have not been doing their homework. It’s not surprising that lobbyists cite the largest figures they can find, but public servants have a duty to be more skeptical.

The more important lesson is for the journalistic profession. Far too many reporters at reputable media outlets credulously repeat these figures in news stories without paying enough attention to where they come from. If a statistic is provided by a party with a vested interest in the subject of a story—if, say, a content industry group provides a statistic on the costs of piracy—reporters should double-check that figure against more reputable sources. And, sadly, a government agency isn’t always a reliable source. Agencies like the BLS and the BEA, which are in the business of collecting official statistics, are generally reliable. But it’s not safe to assume that other agencies have done their homework. The FBI, for example, has made little effort to correct the record on the $250 billion figure, despite the fact that it is regularly cited as the source of the figure and despite the fact that it has admitted that it can’t explain where the figure comes from.

Julian gives all the gory details on the origins of the $250 billion figure. He also digs into the oft-repeated claim that piracy costs 750,000 jobs, which dates back even further (to 1986) and is no more credible. And he offers some interesting theoretical reasons to think that the costs of copyright infringement are much, much less than $250 billion.