Search Engines
The advent of search engines revolutionised the Web. Before search engines, the methods
for finding information you didn't already know the location of were rudimentary to say
the least.
Think of a search engine as a very stupid librarian, but with a terrific memory. A
real-life librarian can use their experience and intelligence to help you look for what
you need, even if you just ask for books about 'archeology' when you really want a book
about Mayan hieroglyphics. A search engine needs much more from you: the more you
give, the better it can help you. And although search engines might look similar,
behind the Web interface each one runs code of a very distinctive flavour.
Understanding Search Engines
Search engines use robot programs called 'spiders' or 'crawlers' to follow links around
the Web and compile an index of the full text of every page they visit. One cycle of
crawling can take weeks to complete. When you type a query into a search engine, it is
translated into the syntax understood by that engine. Results which match your query are
then ranked statistically, so that the most relevant ones should appear near the top.
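The index-then-rank idea can be illustrated with a toy example. This is only a minimal sketch, not how any particular engine works: the pages and URLs are made up, and the ranking simply counts how often the query words occur on each page, which is an assumption chosen for simplicity.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to {url: number of occurrences on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word][url] = index[word].get(url, 0) + 1
    return index

def search(index, query):
    """Rank pages by how many times the query words occur on them."""
    scores = defaultdict(int)
    for word in tokenize(query):
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical pages standing in for what a crawler might have fetched.
pages = {
    "http://example.org/a": "Plant biotechnology department: plant research",
    "http://example.org/b": "Gardening tips for every plant lover",
    "http://example.org/c": "Mayan hieroglyphics and archaeology",
}
index = build_index(pages)
results = search(index, "plant biotechnology")
```

Here page "a" ranks above page "b" because it contains more of the query words, and page "c" is not returned at all. Real engines use far more elaborate statistics, but the principle of looking words up in a pre-built index is the same.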
Some common misconceptions about search engines:
- one is pretty much the same as another
- the best ones have an index of every page on the Web
The truth is, as always, not so simple! In fact, the key facts to know about
search engines are:
- Each search engine's index is unique
- No search engine covers more than a third of what is on the Web
- Different search engines use different methods of indexing and ranking resources
Different search engines should be used for different tasks. There are two main
types of search engines: statistical search engines and meta-search engines.
Statistical search engines: Altavista, AlltheWeb, Northern Light, Lycos, Infoseek,
HotBot, Excite, WebCrawler (and more)
Metasearch engines: Metacrawler, Dogpile, Profusion, Mamma
Engines of the first type each have their own unique index and offer a selection of
search strategies. Engines of the second type let you search a collection of the other
engines' indices: they take your query and submit it to Altavista, HotBot, Northern
Light etc.
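A rough sketch of what a metasearch engine does with your query is given below. The two 'engines' here are hypothetical stand-ins that return fixed lists (a real metasearch service would send requests over the Web), and the merging rule, which favours pages found by more engines, is just one plausible assumption.

```python
def metasearch(query, engines):
    """Send one query to several engines and merge the ranked results."""
    merged = {}
    for name, engine in engines.items():
        for rank, url in enumerate(engine(query)):
            entry = merged.setdefault(url, {"best_rank": rank, "sources": []})
            entry["best_rank"] = min(entry["best_rank"], rank)
            entry["sources"].append(name)
    # Pages found by more engines come first; ties go to the better rank.
    return sorted(
        merged,
        key=lambda url: (-len(merged[url]["sources"]), merged[url]["best_rank"], url),
    )

# Hypothetical stand-ins for real search engines.
def engine_a(query):
    return ["http://example.org/1", "http://example.org/2"]

def engine_b(query):
    return ["http://example.org/2", "http://example.org/3"]

combined = metasearch("plant biotechnology", {"A": engine_a, "B": engine_b})
```

In this example the page returned by both engines rises to the top of the combined list, and duplicates are listed only once.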
Once you learn to appreciate the strengths of each search engine and use it properly,
search engines can be powerful tools.
But if you want to really search as much of the Web as possible in one sweep, you need
to METASEARCH.
When you write a search query, you write something like this:
- "department of plant biotechnology" (a whole phrase query)
- department AND plant AND biotechnology AND NOT gardening (Boolean query)
- +department +plant +biotechnology -gardening (this is known variously as
'standard syntax', 'Search Engine Math', or 'Fuzzy Boolean')
- A University department of plant biotechnology (natural language search)
A search engine will take this query and translate it into syntax which the
machine understands. Let's have a look at how Altavista translates each of the
different queries (in order). If you click through to the search results page,
you'll see that Altavista returns a different number of results, of differing quality,
for each query.
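To give a feel for what 'translating' a query might involve, here is a small sketch which rewrites the plus/minus syntax as an explicit Boolean expression. It is not Altavista's actual translation, which is not documented here; the function names and the treatment of unmarked words as optional are assumptions made for illustration.

```python
def parse_plus_minus(query):
    """Split a '+word -word' query into required, excluded and optional terms."""
    required, excluded, optional = [], [], []
    for token in query.split():
        if len(token) > 1 and token[0] == "+":
            required.append(token[1:])
        elif len(token) > 1 and token[0] == "-":
            excluded.append(token[1:])
        else:
            # Assumption: words with no prefix are merely optional.
            optional.append(token)
    return required, excluded, optional

def to_boolean(required, excluded):
    """Render required and excluded terms as a Boolean query string."""
    terms = required + ["NOT " + word for word in excluded]
    return " AND ".join(terms)

required, excluded, optional = parse_plus_minus(
    "+department +plant +biotechnology -gardening"
)
boolean_query = to_boolean(required, excluded)
```

For the example query above, the result is the same Boolean expression as in the list of query styles: department AND plant AND biotechnology AND NOT gardening.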
A human would have understood that we meant pretty much the same thing, however
we had phrased it. A search engine cannot understand this: a search engine has little
empathy; you must say just what you mean!
N.B. Altavista does have a feature which allows it to mimic human intelligence:
notice how for every query it will suggest some questions which might help find the
information you need. This is from the AskJeeves search service, which has a
cross-referenced database of questions written by people, together with their answers. In
this way, Altavista appears to suggest things which are relevant to your query.
Search engines and information security
Notice that the original query is always visible within the query URL. Searches
on the Web are usually NOT secure, which means that in theory they could be observed by a
party outside your company's firewall. Traffic on the Internet is visible to whoever
operates the connections it passes over. So all your searches and page requests can be
seen by your System Administrator, and many of them are also visible to people outside
the company.
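The way a query ends up inside the URL can be shown with Python's standard library. The host name and the parameter name 'q' below are hypothetical; each search engine uses its own. The point is only that anyone who can see the URL can read the query straight back out of it.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical search address; real engines use their own paths and parameters.
base = "http://search.example.com/query"
query = '"department of plant biotechnology"'

# The browser encodes the query and appends it to the address.
url = base + "?" + urlencode({"q": query})

# Anyone who observes the URL can decode the original query again.
recovered = parse_qs(urlsplit(url).query)["q"][0]
```

The encoded URL replaces spaces and quotation marks with safe characters, but this is not encryption: decoding it recovers the exact text that was typed.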