An essay in the new Wired, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete,” argues that we won’t need scientific theories any more, now that we have so much stored information and such great tools for analyzing it. Wired has never been the best source for accurate technology information, but this has to be a new low point.
Here’s the core of the essay’s argument:
[…] The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.
Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.
But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture — but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the “beautiful story” phase of a discipline starved of data) is that we don’t know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on.
There are several errors here, but the biggest one is about correlation and causation. It’s true that correlation does not imply causation. But the reason is not that the correlation might have arisen by chance – that possibility can be eliminated given enough data. The problem is that we need to know what kind of causation is operating.
To take a simple example, suppose we discover a correlation between eating spinach and having strong muscles. Does this mean that eating spinach will make you stronger? Not necessarily; this will only be true if spinach causes strength. But maybe people in poor health, who tend to have weaker muscles, have an aversion to spinach. Maybe this aversion is a good thing because spinach is actually harmful to people in poor health. If that is true, then telling everybody to eat more spinach would be harmful. Maybe some common syndrome causes both weak muscles and aversion to spinach. In that case, the next step would be to study that syndrome. I could go on, but the point should be clear. Correlations are interesting, but if we want a guide to action – even if all we want to know is what question to ask next – we need models and experimentation. We need the scientific method.
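For what it’s worth, here is a toy simulation of the spinach scenario (a minimal sketch in Python; the “health” factor and all numbers are invented for illustration). A hidden health variable drives both spinach eating and muscle strength, so the observational data show a solid correlation even though, by construction, spinach does nothing:

```python
import random
import statistics  # statistics.correlation requires Python 3.10+

random.seed(1)

# Hypothetical setup: underlying health both strengthens muscles and
# makes people more likely to eat spinach. Spinach itself has no effect.
n = 10_000
health = [random.gauss(0, 1) for _ in range(n)]
spinach = [h + random.gauss(0, 1) for h in health]   # healthier people eat more spinach
strength = [h + random.gauss(0, 1) for h in health]  # healthier people are stronger

r = statistics.correlation(spinach, strength)
print(f"spinach-strength correlation: {r:.2f}")  # ~0.5, with zero causal effect
```

The data alone cannot distinguish this world from one in which spinach really works; only an experiment that assigns spinach independently of health can.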
Indeed, in a world with more and more data, and better and better tools for finding correlations, we need the scientific method more than ever. This is confirmed by the essay’s physics story, in which physics theory (supposedly) went off the rails due to a lack of experimental data. Physics theory would be more useful if there were more data. And the same is true of scientific theory in general: theory and experiment advance in tandem, with advances in one creating opportunities for the other. In the coming age, theory will not wither away. Instead, it will be the greatest era ever for theory, and for experiment.
The conflation of correlation and causation is serious enough, but to me the more fundamental problem is the article’s premise that more data = more knowledge, which is mostly false.
This is of course true at a very superficial level, but real science has always been about theory wresting insight from a corpus of experimental data. Scientific progress has always been stimulated either by theory trying to explain experimental data, or by experimental data confirming or contradicting theory.
Another major problem is the tendency to (subconsciously) massage data (and theoretical analysis) to agree with an accepted view.
There was once a physics conference at which the experimentalists presented data showing that a certain cross section should go like 1/p^4, while the theorists predicted that it should go like 1/p^8. Everybody went away, and when they returned the following year, theory predicted 1/p^4 and experiment showed 1/p^8.
I must disagree with Seth’s statement that the Wired article is one step above word salad. He is far too kind.
As noted above, undirected or misdirected data collation leads to superstition. Even inadequately controlled data lead to disastrous errors. It is one of the reasons that “more than half of all medical studies are ‘wrong’, presumably including the studies that quantify that statement.” It is the reason for the enormous fallibility of retrospective studies in epidemiology.
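A back-of-the-envelope calculation shows how a literature can end up more than half “wrong” without any fraud at all (a sketch; the prior, power, and threshold below are illustrative assumptions in the spirit of that quoted claim, not measurements):

```python
# Fraction of "significant" findings that reflect a real effect,
# given how many tested hypotheses are true to begin with.
def positive_predictive_value(prior, power, alpha):
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# Assume 1 in 10 tested hypotheses is true, studies have 50% power,
# and the usual 0.05 significance threshold.
ppv = positive_predictive_value(prior=0.10, power=0.50, alpha=0.05)
print(f"share of positive findings that are real: {ppv:.0%}")  # ~53%
```

Add any bias at all in design or reporting and the share drops below half.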
Poor project design in generating the target population is probably the single biggest cause of blunders in the sciences. You cannot succeed unless you know what you want to measure, how to isolate it, how to measure it, and how to recognize aberrations. None of this is possible in a random population of undefined data points. Consider the number of failed metastudies, which at worst (and best) use sets of poorly defined data.
It is not that such studies have no value. They provide the data that lead to an “isn’t that odd?” moment. But such studies cannot be taken as knowledge.
No amount of non-experimental data will suffice to scientifically validate a theory. Increasing the quantity of data helps to quash chance correlations, but no amount of data can eliminate the possibility of a common cause.
Suppose every object which has property X is observed to have property Y, while every object which lacks property X also lacks property Y. Suppose further that there is no plausible way that property Y could cause property X. Does that mean that property X causes property Y?
No. It would be entirely possible and consistent with the evidence that there is some unknown property Z which causes both property X and property Y. Only by taking an item and adding or removing property X can one determine whether property X has any effect on property Y.
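A minimal simulation of exactly this situation (hypothetical, in Python): a hidden Z determines both X and Y, so they agree perfectly in observational data, yet setting X by hand, the “adding or removing property X” step, has no effect on Y:

```python
import random

random.seed(0)

# Hidden common cause: Z determines both X and Y; X has no effect on Y.
def observe():
    z = random.random() < 0.5
    return z, z  # X = Z and Y = Z: perfect observational agreement

observed = [observe() for _ in range(10_000)]
agree = sum(x == y for x, y in observed) / len(observed)
print(f"observational agreement of X and Y: {agree:.0%}")  # 100%

# Intervention: we choose X ourselves; Y still follows the hidden Z.
def intervene(x):
    z = random.random() < 0.5
    return x, z

intervened = [intervene(random.random() < 0.5) for _ in range(10_000)]
agree_do = sum(x == y for x, y in intervened) / len(intervened)
print(f"agreement after intervening on X: {agree_do:.0%}")  # ~50%, i.e. none
```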
To flesh out another biological example, it’s all the rage these days to do -omic studies: genome-based studies run with, for example, all genome-identified proteins (proteomics). A couple of examples: a group might seek to identify all proteins that bind to a protein of interest, using mass spectrometry after immunoprecipitation or using yeast two-hybrid screens; or it might examine a particular biological or biochemical response after individually reducing the expression of every gene in the genome (by RNA interference) to see which elements are important for the pathways leading to the response. These are all powerful and useful techniques, but they have serious drawbacks and limitations.
Those of us who do things the old-fashioned way, one protein at a time, also do all the required controls to show that the technique is doing what we expect, and we do follow-up secondary tests to verify the interactions and pathways we identify. Those controls are impossible at the larger -omic scale. So we end up with databases full of artefactual interactions and pathways. Predicted proteins derived from predicted genes in the genome have interaction partners and pathways built around them and are implicated in diseases, despite the fact that they may not even exist as real expressed genes.
Rather than an end to the scientific method, we need a return to it. This invasion of -omic scale research in biology sucks funding from groups who are still doing things the proper way. It is not science but a modern cargo cult: “if you compile enough data, understanding will come.” But it won’t.
I think the Venter example as cited in the text shows exactly why the data-only approach doesn’t get you very far. We get all those “new species,” yet we have no idea how to recognize them out in the world, how they differ from other species in their niches, how they are related to other species or to each other, or anything else. They’re just notches.
Sure, we have a lot of data. But, as the physics folks have pointed out, it is only a tiny fraction of what we’d need to get truly better answers. (Weather models laugh at petabytes, for example. Try another five or ten orders of magnitude.)
Great. The author can make all kinds of non-scientific-method “discoveries” based on massive data sets where correlation is never distinguished from causation. We could go back to the days of witchcraft, superstition, and tradition (no offense to earth-based religion). I would highly recommend the movie Idiocracy: crude as it is, the scene where Joe Bauers tries to explain that a Gatorade-like substance is not good for watering crops shows how the loss of the scientific method could lead a civilization to certain doom. Without the scientific method, one cannot separate causality from correlation. And with enough studies, you can statistically “prove” a correlation between anything and everything.
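That last point is easy to demonstrate: run enough “studies” on pure noise and a predictable share will clear the usual significance bar (a sketch; the 2/√n cutoff below is a rough stand-in for p < 0.05):

```python
import random
import statistics  # statistics.correlation requires Python 3.10+

random.seed(3)

n_subjects, n_studies = 100, 1_000
threshold = 2 / n_subjects ** 0.5  # |r| beyond ~2 standard errors under the null

hits = 0
for _ in range(n_studies):
    x = [random.gauss(0, 1) for _ in range(n_subjects)]  # pure noise
    y = [random.gauss(0, 1) for _ in range(n_subjects)]  # pure noise
    if abs(statistics.correlation(x, y)) > threshold:
        hits += 1

print(f"'significant' correlations found in noise: {hits}/{n_studies}")  # ~50
```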
I don’t think your spinach example is too good. In the first place, spurious correlations like this are usually identified not by theory, but by statistical analysis based on a larger volume of data. You control for greater activity levels and all other relevant variables. You do factor analysis. These techniques aim to identify meaningless correlations and weed them out.
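For what it’s worth, here is a minimal sketch of what “controlling for” a variable buys you, reusing the invented spinach numbers from above and assuming the confounder is actually measured: regress both variables on the confounder and correlate the residuals (a partial correlation), and the spurious association vanishes.

```python
import random
import statistics  # covariance/correlation require Python 3.10+

random.seed(2)

n = 10_000
health = [random.gauss(0, 1) for _ in range(n)]       # measured confounder
spinach = [h + random.gauss(0, 1) for h in health]
strength = [h + random.gauss(0, 1) for h in health]

def residuals(y, x):
    """Residuals of a simple least-squares regression of y on x."""
    b = statistics.covariance(x, y) / statistics.variance(x)
    a = statistics.mean(y) - b * statistics.mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

raw = statistics.correlation(spinach, strength)
adjusted = statistics.correlation(residuals(spinach, health),
                                  residuals(strength, health))
print(f"raw r: {raw:.2f}, controlled for health: {adjusted:.2f}")  # ~0.5 -> ~0.0
```

The catch, of course, is that this only removes confounders you thought to measure.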
Second, when dealing with questions of health and nutrition like the spinach example, many of our ideas really do come from observations not much more sophisticated than your example. Antioxidants are supposed to be good for you because people who eat food high in antioxidants have been found to be healthier. But when trials have actually given people more antioxidants, they got worse, not better. The problem is that our theories in this area are very poor, given the complexity of nutrition and how little is known.
And third, even when we have a theory, in this area it has often turned out to be wrong. Many studies in the past few years have overturned conventional wisdom based on medical and nutritional theories. One example is the failure of tight control of blood glucose to benefit diabetics, despite good theoretical reasons to expect improvement.
All in all it seems that health and nutrition is a field which has depended very heavily on “dumb” data analysis, and where progress will probably require much more data. Maybe in the future everyone will record everything they eat and do, and we will finally be able to figure out what makes people healthy and what does not.
On the basis of my experience doing a PhD in quantum field theory, I would argue that the problem with the beautiful theories of modern physics is not so much the difficulty of doing experiments as the difficulty of computing testable results from the theory.
The idea of working directly from the data was tried to destruction in the ’60s with S-matrix theory.
The editor-in-chief of Wired wrote that drivel? Ghaaa.
… I disagree: I think Chris Anderson makes one important point, the move from hypothesis-driven research to shotgun-method-driven research. In biology this is clearly a trend (though not a very new one; Craig Venter comes to mind as one of its forefathers), and it does make sense to ask where else we could be confronted with this “move.” Of course, shotgun-method-driven research still relies on a conscious mind to interpret the correlations ex post, but even so it could change “the scientific method” greatly.
I’m with Seth. It’s a colorful way to document mental laziness. That’s it.
I wouldn’t even bother to treat it as meaningful. It’s just one step above what’s called “word salad.” The real point is to HYPE HYPE HYPE in order to sell techno-utopianism. That’s what this guy does for a living.
The story is also inconsistent: in one quoted paragraph, it says we are so buried in data that the scientific method is obsolete; in the next, it describes modern physics as data-starved.