April 26, 2024

Separating Search from File Transfer

Earlier this week, Grokster and StreamCast filed their main brief with the Supreme Court. The brief’s arguments are mostly predictable (but well argued).

There’s an interesting observation buried in the Factual Background (on pp. 2-3):

What software like respondents’ adds to [a basic file transfer] capability is, at bottom, a mechanism for efficiently finding other computer users who have files a user is seeking….

Software to search for information on line … is itself hardly new. Yahoo, Google, and others enable searching. Those “search engines,” however, focus on the always-on “servers” on the World Wide Web…. The software at issue here extends the reach of searches beyond centralized Web servers to the computers of ordinary users who are on line….

It’s often useful to think of a file sharing system as a search facility married to a file transfer facility. Some systems only try to innovate in one of the two areas; for example, BitTorrent was a major improvement in file transfer but didn’t really have a search facility at all.

Indeed, one wonders why the search and file transfer capabilities aren’t more often separated as a matter of engineering. Why doesn’t someone build a distributed Web searching system that can cope with many unreliable servers? Such a system would let ordinary users find files shared from the machines of other ordinary users, assuming that the users ran little web servers. (Running a small, simple web server can be made easy enough for any user to do.)
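
To make the idea concrete, here is a minimal sketch of such a little web server, assuming Python and only its standard library. It serves a shared folder over ordinary HTTP and answers a /list request with a machine-readable list of file names that a separately built search crawler could index; the folder name, port, and /list path are inventions for illustration, not any existing system.

import json
import os
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

SHARED_DIR = os.path.expanduser("~/shared")   # hypothetical shared folder

class SharingHandler(SimpleHTTPRequestHandler):
    """Serves files from SHARED_DIR and advertises their names at /list."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, directory=SHARED_DIR, **kwargs)

    def do_GET(self):
        if self.path == "/list":
            # Expose the shared file names as JSON so an external crawler
            # can index them; everything else is plain HTTP file transfer.
            body = json.dumps(sorted(os.listdir(SHARED_DIR))).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            super().do_GET()

if __name__ == "__main__":
    os.makedirs(SHARED_DIR, exist_ok=True)
    ThreadingHTTPServer(("", 8000), SharingHandler).serve_forever()

A search engine, centralized or not, could then crawl these lists from whatever user machines it found, much as Web crawlers fetch pages from servers, and hand searchers plain HTTP URLs for the transfer itself.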

On the Web, file transfer and search are separated, and this has been good for users. Files are transferred via a standard protocol, HTTP, but there is vigorous competition among search engines. The same thing could happen in the file sharing world, where the search engines would presumably be decentralized. But then again, big Web search engines are decentralized in the sense that they consist of very large numbers of machines scattered around the world – physically decentralized, but under centralized control.

Why haven’t file sharing systems been built using separate products for search and file transfer? That’s an interesting question to think about. I haven’t figured out the answer yet.

Comments

  1. “Why haven’t file sharing systems been built using separate products for search and file transfer?”

    Well, what was Archie + AnonFTP but a file sharing system in which the searching and file transfer components were separated?

    Everything old is new again.

  2. Fred von Lohmann says

    Distribution of torrents via gnutella may yet catch on — it was just added as a feature to the most recent version of Morpheus.

    The legal liability of web search engines like Google is reasonably well defined (thanks to the safe harbors created by the DMCA), but there are still lots of unexplored nooks. For example, one web-based search engine that included the gnutella network in its index, MP3Board.com, was sued by the recording industry. In the end, the lawsuit settled, so we don’t know whether centralized indexing of P2P networks presents any special legal problems.

  3. I see two main reasons distributing .torrent files via gnutella hasn’t caught on.

    First, the existence of web-based torrent portals has alleviated the need for this. The most annoying part of torrents is that many of them aren’t tracked any longer. Most web-based torrent portals automatically delist stale torrents. No gnutella client (at least that I know of) has any similar mechanism.

    More importantly, though, the only part of a torrent that gnutella lets you search on is the filename. Most people don’t care what the torrent file itself is named; they care about the names of the files the torrent describes, along with the corresponding hashes. I think this feature alone would lead to a two- or three-fold increase in gnutella-distributed torrents.

    What I imagine is something like the way Windows’ built-in search can look inside .zip files; a sketch of pulling those inner names and hashes out of a .torrent file follows below.
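
    Here is a minimal sketch of that extraction, assuming a client written in Python and using only the standard library. The function names, and the shortcut used to locate the info dictionary, are my own illustration rather than part of any gnutella client or BitTorrent library.

    import hashlib

    def bdecode(data, i=0):
        """Decode one bencoded value starting at offset i; return (value, next offset)."""
        c = data[i:i + 1]
        if c == b"i":                               # integer: i<digits>e
            end = data.index(b"e", i)
            return int(data[i + 1:end]), end + 1
        if c == b"l":                               # list: l<items>e
            i, items = i + 1, []
            while data[i:i + 1] != b"e":
                item, i = bdecode(data, i)
                items.append(item)
            return items, i + 1
        if c == b"d":                               # dictionary: d<key><value>...e
            i, d = i + 1, {}
            while data[i:i + 1] != b"e":
                key, i = bdecode(data, i)
                d[key], i = bdecode(data, i)
            return d, i + 1
        colon = data.index(b":", i)                 # string: <length>:<bytes>
        start = colon + 1
        length = int(data[i:colon])
        return data[start:start + length], start + length

    def torrent_names_and_hash(path):
        """Return (info hash, inner file names) for a .torrent file."""
        with open(path, "rb") as f:
            raw = f.read()
        meta, _ = bdecode(raw)
        info = meta[b"info"]

        # The info hash is the SHA-1 of the bencoded "info" dictionary.
        # Locating it by searching for the literal key is a simplification;
        # a real client would track byte offsets while decoding.
        start = raw.index(b"4:info") + len(b"4:info")
        _, end = bdecode(raw, start)
        info_hash = hashlib.sha1(raw[start:end]).hexdigest()

        if b"files" in info:                        # multi-file torrent
            names = [b"/".join(entry[b"path"]).decode("utf-8", "replace")
                     for entry in info[b"files"]]
        else:                                       # single-file torrent
            names = [info[b"name"].decode("utf-8", "replace")]
        return info_hash, names

    A gnutella client could then index and answer queries against those names and hashes just as it already does for shared filenames.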

  4. Nato Welch says

    The file transfer portion of the Gnutella protocol is HTTP, with a small tweak for “push” transfers. Gnutella’s big innovation was peer search, not distribution. Originally, you could use Gnet response URLs with your web browser to download, and you could answer search queries with URLs served from plain web servers (I don’t know whether this has changed).

    I, too, have always wondered why distributing .torrent files over Gnutella never really caught on. Some people do it, and that’s great – it’s a perfect combination of the innovations in both search and distribution.

  5. I don’t have the exact answer, but I think the reason that search and transfer are combined in P2P is related to the fact that passive indexes (like Google) are vulnerable to legal attack.

    Consider the retail market 20 years ago. Let’s say you need to buy a table. What do you do? You go to the yellow pages, look for furniture stores, and then go to the stores (or call them up) until you find the table you want. Now, let’s say you want to buy some pot to smoke. Do you go to the yellow pages? No, you have to know someone who knows someone. And once you’ve found that second “someone”, that turns out to be where you buy your pot.

    In other words, the legal environment makes it impossible for anyone to be a third-party index. And therefore, it doesn’t make sense to create one link from A (the searcher) to B (the provider) for searching and then a whole separate link from A to B for the transfer. Once you have the link, you might as well do the search and the transfer in one fell swoop.

  6. Ed,

    To a large extent, that’s what StreamCast has done with NEOnet, their search system based on a distributed hash table.

  7. Songmaster says

    Surely the best way to distribute searching through a file-sharing network would be to share, through the same network, one or more index files that provide pointers and metadata about the files available on it. That way the searching part can be done purely locally by the person doing the search. The problem then shifts from how to search to who creates and updates the index files, and how.

    Index creation is a different problem, though, and one that should be much easier to solve in a distributed fashion, either manually or automatically. If groups of index files are themselves indexed in a meta-index, and so on, the network becomes tree-structured, but it might still be vulnerable to losing the root index maintainer. Alternatively, a more amorphous structure arises if each machine generates its own index file of the files it has available, with that list also including the indexes of all the peer systems it has contacted – the file sharing code would automatically pull in the index file of every peer it talks to and then list that file in its own index.

    There are various details to work out – how long to keep each peer’s index file, for instance – but this does seem like a reasonable model; has it been suggested before and rejected? A rough sketch of one possible index format follows.
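
    For concreteness, here is what such an index file, and a purely local search over a pile of downloaded indexes, might look like, assuming Python and a JSON layout of my own invention; the field names, URLs, and staleness cutoff are illustrative only.

    import json
    import time

    def build_index(node_url, shared_files, peer_index_urls):
        """Build this node's index: local file entries plus pointers to peer indexes."""
        return {
            "node": node_url,
            "generated": int(time.time()),
            "files": [{"name": name, "size": size, "url": node_url + "/" + name}
                      for name, size in shared_files],
            # Pointers that let a searcher crawl outward to other nodes' indexes.
            "peer_indexes": list(peer_index_urls),
        }

    def search_indexes(indexes, query, max_age=7 * 24 * 3600):
        """Search locally across downloaded indexes, skipping stale ones."""
        now = time.time()
        hits = []
        for idx in indexes:
            if now - idx["generated"] > max_age:    # expire old peer indexes
                continue
            hits.extend(entry for entry in idx["files"]
                        if query.lower() in entry["name"].lower())
        return hits

    # Example: a node publishes its index (say, as index.json on its own little
    # web server), and a searcher who has fetched some indexes queries them.
    mine = build_index("http://node-a.example:8000",
                       [("my-first-song.ogg", 4200000)],
                       ["http://node-b.example:8000/index.json"])
    print(json.dumps(mine, indent=2))
    print(search_indexes([mine], "song"))

    The expiry cutoff is one crude answer to the question of how long to keep each peer’s index file; better answers are exactly the details that would need working out.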