April 25, 2014

Wikipedia Quality Check

There’s been an interesting debate lately about the quality of Wikipedia, the free online encyclopedia that anyone can edit.

Critics say that Wikipedia can’t be trusted because any fool can edit it, and because nobody is being paid to do quality control. Advocates say that Wikipedia allows domain experts to write entries, and that quality control is good because anybody who spots an error can correct it.

The whole flap was started by a minor newspaper column. The column, like much of the debate, ignores the best evidence in the Wikipedia-quality debate: the content of Wikipedia. Rather than debating, in the abstract, whether Wikipedia would be accurate, why don’t we look at Wikipedia and see?

I decided to take a look and see how accurate Wikipedia is. I looked at its entries on things I know very well: Princeton University, Princeton Township, myself, virtual memory (a standard but hard-to-explain bit of operating-system technology), public-key cryptography, and the Microsoft antitrust case.

The entries for Princeton University and Princeton Township were excellent.

The entry on me was accurate, but might be criticized for its choice of what to emphasize. When I first encountered the entry, my year of birth was listed as “[1964 ?]”. I replaced it with the correct year (1963). It felt a bit odd to be editing an encyclopedia entry on myself, but I managed to limit myself to a strictly factual correction.

The technical entries, on virtual memory and public-key cryptography, were certainly accurate, which is a real achievement. Both are backed by detailed technical information that probably would not be available at all in a conventional encyclopedia. My only criticism of these entries is that they could do more to make the concepts accessible to non-experts. But that’s a quibble; these entries are certainly up to the standard of typical encyclopedia writing about technical topics.

So far, so good. But now we come to the entry on the Microsoft case, which was riddled with errors. For starters, it got the formal name of the case (U.S. v. Microsoft) wrong. It badly mischaracterized my testimony, it got the timeline of Judge Jackson’s rulings wrong, and it made terminological errors such as referring to the DOJ as “the prosecution” rather than “the plaintiff”. I corrected two of these errors (the name of the case, and the description of my testimony), but fixing the whole thing was too big an effort.

Until I read the Microsoft-case page, I was ready to declare Wikipedia a clear success. Now I’m not so sure. Yes, that page will improve over time; but new pages will be added. If the present state of Wikipedia is any indication, most of them will be very good; but a few will lead high-school report writers astray.

Comments

  1. Dr. Bonzo says:

    I wonder, from your description, whether there were any heuristic indicators that would allow your high-school report writer to estimate the likely accuracy of an entry? For instance, it might be that an entry that has been around a long time is more accurate (because there’s been more time to correct it); or perhaps the number of individuals who have contributed; or perhaps the product of the two; or some other measure.

    Of course, heuristics are heuristics, and not absolute measures. But it would be interesting to investigate whether any heuristics exist for Wikipedia entry accuracy.

  2. Michael S. says:

    I’ve been very impressed with the quality of Wikipedia too; recently I’ve been going there first for things that are difficult or awkward to Google for, and also to collect additional terms (often technical terms) to feed to Google. It all seems to be a little too good to be true, and I wouldn’t have guessed that Wikipedia could work, but work it does.

    I wonder if there’s a case for adding a feature that allows authors to comment on the quality of an entry, though. At the moment it’s only possible to “patch” entries; perhaps there needs to be some way to issue “bug reports” (or “is this a bug?” reports) for entries as well.

  3. C. Scott Ananian says:

    I agree with you on your discomfort with Wikipedia’s emphasis in its article on yourself, but the reasons for the peculiar emphasis seem clear: I’m sure there was great community desire for articles on the DMCA and the Microsoft case, both of which contain a reference to ‘Ed Felten’ which the authors undoubtedly tagged as a potential Wikipedia entry. Prospective authors following those links to create the ‘Ed Felten’ entry would undoubtedly feel compelled to concentrate on those details which would buttress the reference; as opposed to, say, concentrating on your SHRIMP work. [It may also well be that the SDMI and Microsoft work is more accessible to a general audience, and thus more suitable for Wikipedia's readers.]

    I feel vaguely that there is some higher principle at work here which I cannot quite identify. Something like a ‘gravitational pull’ of content, which tends to make new entries relate more, rather than less, to existing entries. There is currently no Wikipedia entry for ‘Java security’, and no mention of security flaws which have been found in Java in Wikipedia’s entry for the ‘Java Programming Language’, and so there is no compelling pull for your work on Java security flaws to be first referenced in a source entry, and secondly expanded on in the ‘Edward Felten’ entry.

    This could, perhaps, be taught as a ‘principle of trust’ for information researchers (read, ‘high school students’) relying on Wikipedia for a source: be aware that errors of omission tend to clump. If you are looking for information on Y’s relationship to X, but X is not written up in Wikipedia, then the entry for Y is not likely to mention X, even if in fact they are related. And conversely, the presence of a detailed write up for X makes it more likely that the content relating to X in article Y will be present (and accurate?).

    This is probably an issue in traditional encyclopedias as well, but magnified in Wikipedia because articles are more frequently written ‘to scratch an itch’, rather than by centralized assignment, and because no author feels compelled to create the ‘full story’ at first draft, because they are confident that weaknesses in their article will be corrected by others.

    The trickier issue of fact is that statements such as ‘the most important work by X’ or ‘the most important thing about X’ tend to relate to the *existing* Wikipedia content, not the sum of all possible content. Again, if information about Y is lacking from Wikipedia, then it is not likely to be mentioned as ‘the most important thing’. So possible categorical omissions have to be kept always in mind when evaluating subjective rankings.

    On the other hand, I would suppose that if I were to edit the ‘Edward Felten’ entry to say, “Edward Felten is best known for his groundbreaking work on Java Security” (say) or “the SHRIMP project”, that articles on said topics would soon grow out of the dangling references. And they would probably neglect others’ contributions (at least at first!).

  4. blog.kennypearce.net says:

    How to Use Wikipedia Properly

    Ed Felten is blogging today about questions on the quality of content on the free online encyclopedia, Wikipedia. According to Felten’s quick survey of topics on which he is an expert (Princeton University, Princeton Township, himself, virtual memory, …

  5. FA Hayek says:

    Regarding your quibble on the Microsoft case: non-lawyers always get the details of litigation wrong. My guess is the Wikipedia entry was pieced together from inaccurate press reports that might very well find their way into more “traditional” encyclopedias.

  6. Emergent Chaos says:

    Wikipedia

    Over at Freedom To Tinker, Ed Felten writes about the Wikipedia quality debate. He takes a sampling of six entries where he’s competent to judge their quality, and assesses them. Two were excellent, one was slightly inaccurate, two were…

  7. Aaron Swartz says:

    I disagree with C. and suggest the reason for the focus was how well-known certain work is; the SDMI stuff got a lot of press attention, for example. Similarly, I think the reason the Microsoft case article is so poor is because there are few people who had both the legal and technical skills and the desire to follow the case in detail.

  8. Cypherpunk says:

    In my experience any issue that is controversial and where there is a net-community consensus will be biased on Wikipedia. Look at DRM or Trusted Computing: they are very much spun in the same direction you hear over and over on the net. Look at the RIAA or MPAA or file sharing. Wikipedia is part of net culture and can’t help being influenced by its cultural norms. That’s the real reason the Microsoft entry is so bad: it is unacceptable not to hate Microsoft in online culture.

  9. Karl-Friedrich Lenz says:

    Even the “bad” Microsoft case article now is a little better after your editing.

    The point of Wikipedia is that the sum of thousands of people editing out small mistakes and adding incremental improvements leads to great articles over time.

    That obviously is not incompatible with the observation that there are some articles that are still in urgent need of improvement. However, the answer to that is not ignoring, but improving the article in question.

    While that might be “too big an effort” for any single user, it is not for the community of all users.

  10. dbs says:

    I have to agree with Karl above. The issue to me is not whether Wikipedia is accurate or not. For the most part, it is. The errors that were found in the Microsoft-case article sounded more like small factual errors, easily remedied, as was pointed out.

    In Wikipedia’s case, the advantage of a community-maintained and audited system is you can point out flaws, and queue them for updates. To me, this is the full magic of Wikipedia and the net. Bad information gets revised and updated, as opposed to the traditional ‘print media’, where once something is published, it cannot be changed, no matter how inaccurate.

  11. Ed Felten says:

    Cypherpunk:

    My complaints about the Microsoft-case article were based on the article making many purely factual errors, such as getting the sequence of Judge Jackson’s rulings wrong. The description of what I said in my testimony was totally wrong, and inconsistent with those facts agreed upon by both sides in the case. I didn’t even touch the issue of bias.

  12. Ed Felten says:

    K-F and dbs:

    Wikipedia entries do seem to improve with time, but convergence to truth in the long run is not the only thing that matters. That’s what I was trying to say at the end of my posting.

    It may be that there will always be some fraction of Wikipedia articles that are immature and unreliable, just as some fraction of Internet users are always newbies. It’s not that the newbies never learn; it’s just that they’re replaced by newer newbies. Similarly, it may be that although individual Wikipedia entries improve, the quality of the average Wikipedia entry does not.

  13. C. Enrique Ortiz says:

    The main reported concern was accountability. Yes, accuracy (and completeness) is a big issue. But that is the nature of Wikipedia, which is maintained by volunteers.

    For example, as an author/writer of wireless topics, I’ve started contributing to certain Wikipedia sections (see the J2ME sections) so that my info is accurate and complete, but unfortunately I don’t have a lot of free time to update Wikipedia… so the nature of Wikipedia is that it will get better over time.

    Going back to accountability, I do agree with that, and I believe that Wikipedia should make contributors register formally (they register, and an email is sent to a real email account, etc.), so that a good audit trail can be created.

    ceo

    C. Enrique Ortiz
    J2MEDeveloper.com
    Web Page: http://www.j2medeveloper.com
    Web Log: http://www.j2medeveloper.com/blog
    RSS: http://www.j2medeveloper.com/blog/rss

  14. Dan S. says:

    Regarding an “Is this a bug?” facility, check out the “Discussion” tab above every article. Great for coordinating article structure and settling minor disagreements.

    There seems to be an extremely high sensitivity to changes with ill intent. My first Wikipedia contribution was a small, innocuous, and, I thought, rather obvious clarification, but someone else must have considered it an attempt to rewrite history and reverted my change within an hour. I changed it back and posted to the Discussion page that the next person to revert it should post an explanation. I’m confident the person who did the revert had the best of intentions, but just didn’t understand the minor statement I was correcting.

    Regular contributors to wikis tend to consider maintaining the integrity of the data the highest community value. The tricky part for Wikipedia will be responding to malicious changes made by passersby (not that all changes made by passersby are malicious), but that usually works itself out because of the quality of the reversion tools and the tendency for passersby to lose interest. The tools and practices around controversial articles, where people with strong opinions or motives are likely to hang around, are especially interesting. (See, for example, articles on political candidates.)

  15. joe says:

    Where the errors are too numerous to correct, knowledgeable parties should place “This page has many errors… I [insert email address] will gladly review the page after the next iteration.”

    That is, it seems like they need an “under construction”-like warning for some subjects. (This is ignoring the fact that a consensus on some pages may never be reached.)

  16. Slowking Man says:

    joe:
    There are numerous templates that can be placed on problematic articles, such as {{disputed}} (for factual accuracy disputes) and {{npov}} (for neutrality disputes). The entire list can be found at http://en.wikipedia.org/wiki/Wikipedia:Template_messages. Conflicts can be worked out on the Talk pages of articles, and there is a process for resolving especially severe conflicts (see http://en.wikipedia.org/wiki/Wikipedia:Dispute_resolution).

    Ortiz:
    The problem with requiring registration is that it turns away many potential users. Many people wish to “try out” editing Wikipedia before they actually join; I myself was one of these people. Also, some people may not want to contribute regularly, but just fix an error or a typo if they see one in an article. And there is a small group of people who do not wish to register, for privacy or other concerns.

    Disclaimer:
    I am a regular Wikipedia contributor. Someone linked this entry on the Wikipedia-l mailing list.

  17. Brian Kendig says:

    I’ve done a lot of work on the Microsoft antitrust case article in the past.

    I based my information on my own memory of the case, and I supported it with relevant references from news web sites. I am not a lawyer, and I admit I may have gotten as many details wrong as I got right. (Incidentally, at least one news story referenced in the article uses the incorrect term “prosecutors” rather than “plaintiffs.”) There have been times when people disagreed with my edits, so we discussed them on the article’s Talk page and came to an agreement about the article’s content.

    That’s the way Wikipedia is supposed to work. Nobody claims it contains completely perfect information, but given enough eyes and enough editors, it should approximate truth. Personally, I see Wikipedia as a sort of “Cliff’s Notes” for news stories: it’s not as good as finding original sources of information, but it’s valuable as a way to get quickly up-to-speed on generalities and viewpoints. Someone basing his knowledge of the Microsoft antitrust case on that article alone may be mistaken on some specifics (until an editor fixes them), but will come away with a terrific overview of what it’s all about and why it’s important.

    I appreciate that you made some corrections to the antitrust case article to improve it. As for your other concerns which you feel will take too much effort to fix: would you please visit the article’s Talk page and explain what problems you see, so that other people can take up the work on fixing them? (And, I have a specific question: I remember that Microsoft did something to thwart your demonstration that IE could be removed from Windows; I thought this was a Windows Update they released which moved around some function calls in DLLs but I suppose I was wrong; do you remember what it was Microsoft did?)

    A large problem facing Wikipedia is that sometimes the people who find errors in its content will write off the whole effort as flawed and pointless, and therefore won’t contribute their effort to improving the project. Do you have any suggestions on how to address this, or to raise the quality of new articles?

  18. Brian Kendig says:

    Slowking Man: It’s important to point out that {{disputed}} and {{npov}} are for *disputes*. If someone sees a factual inaccuracy or a non-neutral point of view, he can fix it! It’s only when there are specific disagreements over facts or neutrality that one of these tags should be used until the article is sorted out fairly.

  19. Jason Scott says:

    Mr. Felten, the problem you state regarding the Microsoft case is endemic throughout pretty much all reporting of complicated issues, especially when using non-primary sources. Others who have commented in this thread mention this, but I want to expand it and say that I encounter this problem in actual historical retelling, where I have found almost diametrically opposite explanations of the outcome and procedure of cases (a lot of EFF-type cases, too, like the E911 affair and the Tcimpidis seizure).

    The gift we have here is that a principal in a case (like yourself) can come along and comment “they completely got it wrong” and then people can jump in and fix it, where before we were powerless to do anything, as the book sank into Books In Print and a thousand high school reports.

  20. Freedom to Tinker says:

    Wikipedia vs. Britannica Smackdown

    On Friday I wrote about my spot-check of the accuracy of Wikipedia, in which I checked Wikipedia’s entries for six topics I knew well. I was generally impressed, except for one entry that went badly wrong. Adam Shostack pointed out, correctly, that I h…

  21. Scott Preece says:

    The core problem with Wikipedia is that you have no idea of the quality of any particular article. A conventional encyclopedia hires subject-area experts to write its articles, has professional editors and fact checkers to review them, and has an overall reputation that lets you decide how much to trust the set as a whole.

    I think requiring author/editor registration is a basic step to improving the reader’s ability to trust the content. I tend to think that as the web flattens out the cost of publishing, editors will grow in importance as people trust individuals to be good guides.

    Beyond reader trust, however, author registration is central to accountability if IP issues arise. If somebody starts copying Britannica articles into Wikipedia verbatim, it would be nice to be able to find all of that person’s “contributions” when the lawyers come to call.

    A signature may not be able to prove much about a person’s true identity, but it should be the minimal requirement for participation.

  22. Jeremy Leader says:

    Scott, I think you’re presenting some valid arguments for the benefits of requiring registration, but you need to recognize that there are costs to requiring registration, as well (such as slightly discouraging people like Prof. Felten from casually correcting things they know to be wrong). My impression (as an outsider) is that the Wikipedia maintainers have considered both the costs and the benefits, and have decided that to them, the costs outweigh the benefits.

    I suspect that in the event of an IP issue, there are other ways to find and remove a batch of related changes. They might track contributors’ IP addresses, or they might just revert all changes that modify more than X% of any article during a window of time around the attack.

    Remember that wiki articles tend to end up having a host of authors, so it’s hard to figure out how to evaluate trust, short of trusting the community. Also, when reverting a change is so easy, the cost of malicious changes is low enough that specific countermeasures against them probably aren’t worth the trouble.

    Also, keep in mind that the Wikipedia software (MediaWiki?) is freely available as Free Software, and can be configured to require registration before editing. I believe the Wikipedia data is also freely available. Thus, there’s nothing stopping someone (such as yourself) from setting up your own fork of Wikipedia with whatever access control mechanism you want. The absence of such an alternative suggests (but doesn’t prove) that Wikipedia’s approach is for some reason superior.