Searching Situations 3: Finding Files in non-HTML Formats

By Pita Enriquez Harris

You’ve heard tell of the Invisible Web. As well as files found behind password-protected pages or in databases, it used to include many files in non-HTML formats, which were rarely included in the indices of the major search engines.

This began to change with the advent of the specialised image search, then the MP3 or audio file search. These searches are aimed primarily at the consumer market; however, image searching can be especially useful to people putting together presentations or product literature.

Document formats commonly used by businesses and government organisations tend to be those employed by offices: MS Word, Excel, PowerPoint and PDF, not to mention Lotus 1-2-3, Lotus WordPro and others. It isn’t unusual for people simply to place such documents on their Web sites, linking to them from HTML pages. These documents are therefore available, in principle, on the ‘Visible Web’, so long as search engines develop the technology to index their content.

What kind of content might one find in such documents?

Broadly, the answer would be ‘serious’ content: for example, case studies, surveys, thesis summaries, clinical studies, investor presentations and financial results. Some of the real ‘gold’ you may be looking for on the Web is probably hiding behind a .pdf or .doc extension.

Until recently, however, business users of the Web still could not readily find PDF, Word or Excel documents via search engines.

Thanks to Google and Adobe, this has changed.

Adobe (which makes the software used to create PDF, or Portable Document Format, files) has a searchable index of over a million summaries of PDF documents, available at http://searchpdf.adobe.com/

Since the full text of each PDF document is NOT indexed, this is of limited use. It is still a valuable resource, however, not only because of the additional information presented about each document, but because it is one of only two ways to search for PDFs. For each document, there is a page which contains the document title, a summary (a few hundred words), the date of the document, the author, the number of pages, the estimated download time and the document size.

The biggest accolade must go, however, to Google, who once again prove themselves to be on the cutting edge of search engine technology. Recent innovations allow Google users to search for the following impressive list of file formats:


PDF formatted files are the most popular after HTML files. PostScript and Microsoft Word files are also fairly common. The other file types are relatively uncommon by comparison.

When searching on Google, these non-HTML files will appear amongst the list of search results. The full text of each document is searched, because each document is converted into a format (such as HTML or plain text) that is readily indexable by Google. You can either read the text-only or HTML version online or download the actual document.

The advantage of reading the text-only/HTML version is that Google highlights the keywords you searched for, making it easier to quickly locate the relevant part of what may well be a long document.

Should you choose to actively seek a document in one of these formats, Google’s field-searching can help you out. To search for an MS Word document containing the term "clinical trial", use the search term "clinical trial" filetype:doc; for a PDF file, use "clinical trial" filetype:pdf
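For readers who want to generate such searches in bulk, the pattern above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `build_filetype_query` and `build_search_url` are invented for this example, and the base address `https://www.google.com/search` with a `q` parameter is an assumption about how the query is passed to the search engine, not something stated in this article.

```python
from urllib.parse import urlencode


def build_filetype_query(phrase: str, extension: str) -> str:
    """Return a query such as '"clinical trial" filetype:doc'.

    The phrase is wrapped in double quotes so it is searched as an
    exact phrase, and the extension is normalised (".PDF" -> "pdf").
    """
    ext = extension.lower().lstrip(".")
    return f'"{phrase}" filetype:{ext}'


def build_search_url(query: str,
                     base_url: str = "https://www.google.com/search") -> str:
    """URL-encode the query. The base URL and the 'q' parameter name
    are assumptions made for illustration."""
    return f"{base_url}?{urlencode({'q': query})}"


if __name__ == "__main__":
    # The two examples from the text: MS Word, then PDF.
    for ext in ("doc", "pdf"):
        query = build_filetype_query("clinical trial", ext)
        print(query)
        print(build_search_url(query))
```

The query string itself is what you would type into the search box; the encoded URL is only needed if you want to bookmark or script the search.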

Image searching, once a specialty of AltaVista, is now available on two more of the largest Web search engines: Google and FAST. FAST (http://www.alltheweb.com) has always had a multimedia search at http://multimedia.alltheweb.com/ which allows users to trawl through image, audio and video files. Google’s Image search, launched recently, offers another good option for finding images. AltaVista’s image search, however, may still be the best on offer, since it includes material from the portfolios of partner sites including Corbis and Rolling Stone. This partner content may be charged for, but it does give you a quick, easy way in to what such companies have to offer.

As the Web diversifies and grows, it is inevitable that more non-HTML file types are going to appear. Having rapid, easy access to them is just another weapon in the searcher’s arsenal.