Introduction to search engines

Search Engines

The advent of search engines revolutionised the Web. Before search engines, the methods for finding information you didn't already know the location of were rudimentary to say the least.

Think of a search engine as a very stupid librarian, but with a terrific memory. A real-life librarian can use their experience and intelligence to help you look for what you need, even if you just ask for books about 'archeology' when you really want a book about Mayan hieroglyphics. A search engine needs much more from you: the more you give, the better it can help you. But although search engines might look similar, behind the Web interface they run code of very distinctive flavours.

Understanding Search Engines

Search engines use robot programs called 'spiders' or 'crawlers' to follow links around the Web and compile an index of the full text of every page they hit.  They can take weeks to complete one cycle of crawling.  When you type a query into a search engine, it is translated into the syntax understood by that engine.  Results which match your query are ranked statistically, attempting to present the most relevant results near the top.
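To make the idea of an index concrete, here is a minimal sketch in Python. It is a toy, not how any real engine works: the three pages and their addresses are invented for illustration, and the "ranking" simply counts how many query words a page contains, which is an assumption standing in for the far more elaborate statistics real engines use.

```python
from collections import defaultdict

# Hypothetical mini "Web": page address -> full text of the page.
PAGES = {
    "a.example/plants": "department of plant biotechnology",
    "b.example/garden": "gardening tips for every plant lover",
    "c.example/biotech": "biotechnology news and plant research",
}

def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Rank pages by how many of the query words they contain."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url in index.get(word, ()):
            scores[url] += 1
    # Highest score first; ties broken alphabetically by address.
    return sorted(scores, key=lambda url: (-scores[url], url))

index = build_index(PAGES)
print(search(index, "plant biotechnology"))
```

The two pages that contain both words come out ahead of the gardening page, which contains only one of them. Notice that the search never looks at the pages themselves, only at the index built beforehand.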

Some common misconceptions about search engines:

  • one is pretty much the same as another
  • the best ones have an index of every page on the Web

The truth is, as always, not so simple!  In fact, the key facts to know about search engines are:

  • Each search engine's index is unique
  • No search engine covers more than a third of what is on the Web
  • Different search engines use different methods of indexing and ranking resources

Different search engines should be used for different tasks.  There are two main types of search engines: statistical search engines and meta-search engines.

Statistical search engines: Altavista, AlltheWeb, Northern Light, Lycos, Infoseek, HotBot, Excite, WebCrawler (and more)
Metasearch engines: Metacrawler, Dogpile, Profusion, Mamma

Engines of the first type each have their own unique index and offer a selection of search strategies. Engines of the second type let you search a collection of the other engines' indices: they take your query and submit it to Altavista, HotBot, Northern Light and others.
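The way a metasearch engine combines other engines' answers can be sketched in a few lines of Python. Everything here is assumed for illustration: the engine names and result lists are made up (a real service would fetch them over the network), and the scoring rule, where a page earns 1 for a first place, 1/2 for a second place and so on, is just one plausible way to merge rankings, not the method any named service uses.

```python
# Hypothetical ranked result lists, one per engine.
ENGINE_RESULTS = {
    "engine_a": ["u1.example", "u2.example", "u3.example"],
    "engine_b": ["u2.example", "u4.example"],
    "engine_c": ["u3.example", "u2.example", "u5.example"],
}

def metasearch(results_by_engine):
    """Merge several ranked lists into one list without duplicates."""
    scores = {}
    for results in results_by_engine.values():
        for position, url in enumerate(results):
            # Assumed scoring rule: top hit earns 1, the next 1/2, then 1/3...
            scores[url] = scores.get(url, 0.0) + 1.0 / (position + 1)
    # Highest combined score first; ties broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))

print(metasearch(ENGINE_RESULTS))
```

A page that several engines agree on rises to the top of the merged list, while a page found by only one engine still appears, which is why metasearching reaches further than any single index.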

Once you learn to appreciate the strengths of each search engine and use it properly, it can be a powerful tool.

But if you want to really search as much of the Web as possible in one sweep, you need to METASEARCH.

How a search engine understands YOU: search syntax

When you write a search query, you write something like this:

  • "department of plant biotechnology" (a whole phrase query)
  • department AND plant AND biotechnology AND NOT gardening (Boolean query)
  • +department +plant +biotechnology -gardening (this is known variously as 'standard syntax', 'Search Engine Math', or 'Fuzzy Boolean')
  • A University department of plant biotechnology (natural language search)

A search engine will take this query and translate it into syntax which the machine understands. Let's have a look at how Altavista translates each of the different queries (in order). If you click through to the search results page, you'll see that Altavista returns a different number of results, of differing quality, for each query.

A human would have understood that we meant pretty much the same thing, however we had phrased it. A search engine cannot understand this: a search engine has little empathy; you must say just what you mean!
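As a rough illustration of what "translating" a query might involve, the Python sketch below turns the Boolean query and the standard-syntax query from the list above into the same pair of word sets: words a page must contain and words it must not. This is a simplified assumption, not Altavista's actual translation, and real engines may well treat the two forms differently, as the differing result counts suggest. The function names are invented for this example.

```python
def parse_standard(query):
    """Parse '+word -word' syntax into (required, excluded) word sets."""
    required, excluded = set(), set()
    for token in query.lower().split():
        if token.startswith("-"):
            excluded.add(token[1:])
        else:
            # Simplification: words without '+' are also treated as required.
            required.add(token.lstrip("+"))
    return required, excluded

def parse_boolean(query):
    """Parse 'a AND b AND NOT c' into (required, excluded) word sets."""
    required, excluded = set(), set()
    for part in query.split(" AND "):
        part = part.strip()
        if part.startswith("NOT "):
            excluded.add(part[4:].strip().lower())
        else:
            required.add(part.lower())
    return required, excluded

def matches(text, required, excluded):
    """True if the text has every required word and no excluded word."""
    words = set(text.lower().split())
    return required <= words and not (excluded & words)

standard = parse_standard("+department +plant +biotechnology -gardening")
boolean = parse_boolean("department AND plant AND biotechnology AND NOT gardening")
print(standard == boolean)
print(matches("department of plant biotechnology", *standard))
```

The machine only ever sees the two word sets. A phrase query or a natural language question would need quite different handling, which is one reason the same idea phrased four ways can return four different sets of results.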

N.B.  Altavista does have a feature which allows it to mimic human intelligence: notice how for every query it will suggest some questions which might help find the information you need.  This is from the AskJeeves search service, which has a cross-referenced database of questions written by people, together with their answers. In this way, Altavista appears to suggest things which are relevant to your query.

Search engines and information security

Notice that the original query is always visible within the query URL.  Searches on the Web are usually NOT secure, which means that in theory they could be observed by a party who is outside your company's firewall.  All the traffic on the Internet is visible to the gatekeeper of an Internet line.  So, all your searches and page requests can be seen by your System Administrator and many of them are also visible to people outside the company. 
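To see why the query is so easy to observe, here is a short Python sketch using the standard `urllib.parse` module. The search address and the parameter name `q` are assumptions made up for this example; each real engine uses its own address and parameter names. The point is only that anyone who can see the URL can turn it straight back into the words you typed.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical search address; the parameter name "q" is an assumption.
BASE = "http://search.example.com/query"

def build_search_url(query):
    """Build the address a browser would request for this query."""
    return BASE + "?" + urlencode({"q": query})

def recover_query(url):
    """Read the original query back out of a search address."""
    return parse_qs(urlsplit(url).query)["q"][0]

url = build_search_url("+department +plant -gardening")
print(url)
print(recover_query(url))
```

The encoding in the URL (for example `%2B` in place of `+`) is there to keep the address valid, not to hide anything, and it is reversed in a single step.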

 

Exercise Number 4: Using Different Search Syntax

Suggested time for exercise = 10 min

In this exercise, we will look at how you can improve a query by appropriate use of search syntax.

For this exercise we will use only Northern Light (www.northernlight.com).

Using at least three different types of search syntax, search for academics who are interested in research on artificially intelligent robots. (For definitions, see notes on syntax above). Build your query up, from a simple search on "artificially intelligent", to a more complex search using Boolean operators and/or standard syntax.

If possible, find a list of such academics.

Look only at the first page of your results and record the number of entirely relevant (direct) hits for each. For this exercise, a direct hit is a search result which appears to include names of academic researchers. As you refine and improve your search, you should aim to get more and more relevant results (i.e. from academic web sites) and fewer totally irrelevant results in the top ten results.

Which type of syntax returns the highest number of useful results?
