By Liz Garone
Remember the old saying, "I know you think you heard what I said, but what I said is not what I meant"? That's how a lot of Web users feel when they enter a name or word into a search engine text window and get back a list of Web sites bearing only a passing semantic resemblance to their topic of interest.
With more than 1 billion different Web pages out there, it's important that sites that offer Web searches--especially those that aspire to be portals--have a search facility that is both robust and accurate. "The whole point is to try and create the experience the user is looking for, which is: 'I want the information I'm looking for to come back in that first page of results,'" says Dennis McEvoy, senior vice president of Inktomi, maker of a search engine used by many of the leading search portals, including HotBot, Yahoo, Snap, GoTo, and Excite.
To search the Web efficiently, most portals combine one or more Web search engines with an index search. Yahoo, for example, first goes to its own index, compiled and maintained in-house by semioticians, then uses the Inktomi engine to search the Web. Other sites use the index maintained by the Open Directory Project, which, like Yahoo, uses humans to catalog the Web. The difference is that the ODP, originally called Newhoo, is staffed by a growing number of volunteer editors--21,500 at last count--who have indexed more than 1.4 million sites in more than 200,000 categories, according to an ODP spokesperson. America Online and Netscape, which own the ODP, use it on their sites. So do Lycos, AltaVista, Metacrawler, HotBot, and more than 100 other sites.
Under the Hood
The basic technology used to search the Web is similar across the board and hasn't changed much since it was first used as part of an internal IBM system in the '70s during an anti-trust suit.
"Search engines basically all function on the same premise--they go out, they crawl, they index, they execute tasks," says Jupiter Communications analyst Lydia Loizides. The differences come in the strength, frequency, and depth of their crawls, the size of their indexes, and the criteria used to determine which results they bring back to users.
"The advancements really come in the algorithms that are being developed that enable this very speedy and complex processing," explains Loizides. "Instead of saying, 'If you find this, then execute that,' it's now saying, 'If you find this, it must be within a parameter that has these three factors surrounding it. Only then is it a good result.' They've just become more complex in their execution."
Although search engines have improved by leaps and bounds in relevancy and scope over the last three decades, they still look pretty much the same on the inside. Every Web search engine comprises the same three components: the spider, the database, and the search tool.
First, the spider (also referred to as a robot or crawler) traverses the Web, gathering information as it goes and depositing it into the second component, the database. The third piece of the engine, the search tool, then sifts through the millions of pages the spider has recorded in the database. The tool finds matches and then ranks them according to the particular engine's programmed criteria, which differ from engine to engine and are further customized for each site.
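The three components can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the in-memory TOY_WEB, its example URLs, and the function names crawl, build_index, and search are all assumptions made up for this example, and a real spider would fetch pages over the network.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("search engines crawl the web", ["b.example", "c.example"]),
    "b.example": ("a spider gathers pages for the index", ["a.example"]),
    "c.example": ("the index maps words to pages", []),
}

def crawl(start, web):
    """Spider: follow links breadth-first and return {url: text}."""
    seen, queue, pages = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        text, links = web[url]
        pages[url] = text
        for link in links:
            # Skip links that leave the toy Web or were already queued.
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_index(pages):
    """Database: inverted index mapping each word to the URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Search tool: return URLs containing every query word, sorted by URL."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

pages = crawl("a.example", TOY_WEB)
index = build_index(pages)
print(search(index, "the index"))
```

This search tool only finds matches; it does not rank them. Ranking is where real engines differ.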
Ranking the Results
These criteria are what distinguish search engines from each other, according to Loizides. How does an engine choose and rank its results? Does it use simple keyword matches, or does it combine those matches with contextual queries and offer more advanced and specialized results?
Today, you would be hard-pressed to find a search engine that doesn't take a number of factors into consideration when tallying results. AltaVista, for example, weighs several factors in sorting pages, including how many times a search word appears, whether it is in the title or meta tags, and its proximity to other search words in the document. Direct Hit ranks sites by popularity, looking at the ones that other users who made similar queries have already selected. The longer a user stays at a site, the higher its ranking from Direct Hit. Inktomi uses a combination of word, link, and click analysis.
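To make the on-page factors concrete, here is a rough Python sketch that combines word frequency, title matches, and proximity into a single score. The weights, the formula, and the names min_gap and score are assumptions chosen for illustration; none of the engines named here has published its actual formula.

```python
def min_gap(words, terms):
    """Smallest distance, in words, between two different query terms (or None)."""
    positions = [(i, w) for i, w in enumerate(words) if w in terms]
    best = None
    # The closest pair of different terms is always adjacent in this list.
    for (i, a), (j, b) in zip(positions, positions[1:]):
        if a != b and (best is None or j - i < best):
            best = j - i
    return best

def score(title, body, query):
    """Combine three on-page factors into one number (weights are made up)."""
    terms = set(query.lower().split())
    words = body.lower().split()
    title_words = set(title.lower().split())

    frequency = sum(words.count(t) for t in terms)   # how often terms appear
    in_title = len(terms & title_words)              # terms found in the title
    gap = min_gap(words, terms)
    proximity = 1.0 / gap if gap else 0.0            # closer terms score higher

    return frequency + 2 * in_title + proximity

docs = {
    "dealer": ("Jaguar cars", "jaguar cars for sale jaguar dealers"),
    "zoo": ("Big cats", "the jaguar is a big cat unlike most cars"),
}
ranked = sorted(docs, key=lambda d: score(*docs[d], "jaguar cars"), reverse=True)
print(ranked)
```

A popularity-based engine would add signals from outside the page, such as which results earlier users clicked and how long they stayed.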
According to MediaMetrix, HotBot, Yahoo, Lycos, Excite, and AltaVista have the lion's share of user queries. But that doesn't mean that everyone is satisfied with the results they're currently getting. A number of second-generation search tools have sprouted up in the last year and capitalized on discontent with the big search sites by promising more relevant results.
Oingo, which launched last October, is billing itself as "the first meaning-based search" site.
"Our technology is different because we index not on the actual words but on the meaning of the words," explains Oingo's president and co-founder Gil Elbaz. The company uses its own search technology, along with the Oingo Lexicon, a proprietary database of more than 300,000 interconnected terms, to perform searches. The Oingo Lexicon is updated by a team of eight linguists and by automated procedures, according to Elbaz.
Rather than just looking at how many times particular words appear in a document, Oingo identifies the meaning of each document by an automated "sensing" process that analyzes text in the document. Results are then listed in order of assumed relevance and are based on all possible meanings of the search terms entered by the user. Users then have the option of specifying which meanings they are interested in through use of a pull-down menu.
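The general idea of searching by meaning can be sketched with a toy lexicon in Python. This is not Oingo's technology, which is proprietary. The two-sense LEXICON, the sample documents, and the overlap-counting rule in sense_of are all assumptions invented for this example.

```python
# Hypothetical lexicon: word -> {sense name: related context words}.
LEXICON = {
    "jaguar": {
        "animal": {"cat", "jungle", "prey", "wildlife"},
        "car": {"engine", "dealer", "sedan", "driving"},
    },
}

# Hypothetical documents to search.
DOCS = {
    "doc1": "the jaguar stalks its prey in the jungle",
    "doc2": "visit a jaguar dealer to test the new sedan",
}

def sense_of(word, text, lexicon):
    """Pick the sense whose related words overlap most with the text, or None."""
    context = set(text.lower().split())
    if word not in context:
        return None
    best, best_overlap = None, 0
    for sense, related in lexicon.get(word, {}).items():
        overlap = len(context & related)
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

def search_by_meaning(word, docs, lexicon, wanted_sense=None):
    """Return {sense: [doc ids]}; optionally keep only one chosen sense."""
    results = {}
    for doc_id, text in sorted(docs.items()):
        sense = sense_of(word, text, lexicon)
        if sense is None:
            continue
        if wanted_sense in (None, sense):
            results.setdefault(sense, []).append(doc_id)
    return results

# All meanings first, then the one a user might pick from a menu.
print(search_by_meaning("jaguar", DOCS, LEXICON))
print(search_by_meaning("jaguar", DOCS, LEXICON, wanted_sense="car"))
```

The second call plays the role of the pull-down menu: the user narrows the results to a single meaning.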
What Oingo has in common with the granddaddy of search engines, Yahoo, is the human touch. "The pendulum had moved away from the directory [approach] but now is moving back towards it," says Elbaz. "We're spanning the gap. We're not trying to do everything by hand, and we're not trying to automate everything. We're taking a middle-of-the-road path."
Forrester Research analyst Paul Hagen, for one, likes the concept. "I'm intrigued with the notion of more interactive dialogue between user and content engine," he says. "I think search engines fail because they set up artificial constraints, in that a user can only enter a couple words and the search engine has to do its best based on those ambiguous words."
Covering the World
As search filters become more refined, however, the challenge of searching a rapidly expanding universe of pages in seconds remains. "It's important to remember that relevancy doesn't help if you don't even have the sites to rank in the first place," says Tom Wilde, senior product manager at FAST Search and Transfer. "The Web is growing extremely rapidly, and it's getting extremely large," he says. "For a search to be meaningful, you need to capture as much of that as possible." Released a few months ago, FAST already is helping to power Lycos searches as well as those on a number of European and Latin American portals.
FAST claims to search more than 300 million documents in less than half a second. FAST soon will be able to blanket the whole Web, according to Wilde. Already, it claims to have a database of more than 700 million visited Web documents.
The future promises a consolidation of technologies as sites try to improve their search functions, both in scope and in relevance. The use of "natural language" is likely to become more popular as sites such as Ask Jeeves, which lets a user ask questions in his or her own words, take off. In fact, Ask Jeeves recently acquired Direct Hit, which together with Inktomi provides the search engine for one of the most popular search sites on the Web, HotBot. Still on the horizon are personalized "agents" that search the Web on behalf of a single user, remembering user preferences as they go.
One way or another, both casual and professional Web searchers will have more horsepower at their fingertips to peruse the ever-growing world of the Web, which leads to another old saying: On the Internet, as in life, be careful what you wish for.