Sunday’s New York Times ran a piece by Geoffrey Nunberg complaining about (among other things) the relative absence of major-press articles from the top ranks of Google search results. This has triggered online discussion of why the Times itself doesn’t get much Googlejuice. Speculation has centered on the fact that Times articles get moved to a pay-for-access archive.
The real explanation is simpler: the Times forbids Google to index its site.
There’s a web standard that allows sites to declare a web-crawler program persona non grata. A file called “robots.txt” gives a set of rules, written in a standardized language, saying which automated programs have permission to access which parts of the site. The Times’ robots.txt file forbids all web-crawler programs to visit the parts of the Times site where the articles are. Google’s policy is to honor the requests in robots.txt files; that’s why Times stories don’t show up on Google.
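To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page. It uses Python's standard urllib.robotparser module. The rules, the /articles/ path, and the example.com URLs are hypothetical stand-ins chosen for illustration; they are not the contents of the Times' actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the Times' actual robots.txt.
# "User-agent: *" addresses every crawler, and "Disallow: /articles/" asks
# them to stay out of everything under that (assumed) path.
HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: *
Disallow: /articles/
"""


def crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's contents as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.example.com/articles/2003/story.html",
        "https://www.example.com/index.html",
    ):
        allowed = crawler_may_fetch(HYPOTHETICAL_ROBOTS_TXT, "Googlebot", url)
        print(url, "->", "allowed" if allowed else "disallowed")
```

Under these assumed rules, a crawler that honors robots.txt would skip the article page and would still be free to fetch the front page. Nothing enforces the file technically; it works only because crawlers such as Google's choose to obey it.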