The Invisible Web, By: Young, Jr., Terrence E., Book Report, May/June 2002, Vol. 21, Issue 1.
Looking back over the past century, history tells us that no other technology has grown at such a phenomenal rate as the Internet. Certainly, young people’s comprehensive use of the Internet is one of the primary factors of that growth. We’d all be wealthy if we had a dollar for each time a student utters the phrase, “I need some information on (blank), or a picture of (blank). I need to get on the Internet.” In students’ minds, the Web is the be-all and end-all of information. And the Web certainly offers a plethora of resources. But those sources of information aren’t “catalogued” or “indexed” as are traditional resources.
To complicate matters further, students’ search skills are often rudimentary. The phrase “Enter Your Term in the Search Box” is the gateway to information on the Web, but many students have little idea of how to conduct a search beyond the basics of typing key words or phrases into that box. They don’t know how to use such features as quotation marks, parentheses, Boolean operators, or the advanced or power search functions to refine their searches. As a result, thousands of Web sites are retrieved, making the task of identifying a key resource a difficult and daunting one.
On the other hand, when school library media specialists search the Web for information, the returned results often identify a relevant Web page in the top 10 to 20 results. What’s the difference? Library media specialists subconsciously search the “invisible Web,” which is usually unfamiliar to students and therefore goes unsearched by them.
On beyond Google
Media specialists know that there’s a difference between searching Google, for instance, for government information, and searching Google for a site that enables you to search a remote database for government information. I wouldn’t use Google to search for the United States budget or primary source materials on U.S. culture and history; I would use a site such as FirstGov, the official site for U.S. Government Information (http://www.firstgov.gov/), or The Library of Congress American Memory Collection (http://memory.loc.gov/ammem/).
The invisible parts of the “invisible Web” aren’t the sites themselves, but the contents of the databases that the sites are connected to. One of the major weaknesses of the popular search engines is that they don’t index an extremely large portion of the Web (often for technical reasons, such as a document not existing in HTML format). Consequently, current Web searching techniques miss a huge amount of information. Even more frustrating is the fact that the invisible Web’s resources are often of a higher quality than those of the visible Web.
Looking at the “invisible Web”
So what is the “invisible Web” (also called the “deep Web” or the “hidden Web”)? The invisible Web is made up of all kinds of specialized materials that are loaded with information. The vast majority of information available in digital form doesn’t reside directly on the Web, i.e. it doesn’t reside on the surface, or visible, Web. The huge amount of authoritative and current information contained in the invisible Web is accessible to you, but you have to know where to find it, because you can’t locate it using generic search engines, such as Vivisimo or Google. So how do you know whether there’s a database available to address your specific information need? Web indexes and directories remain the overarching search tools for finding databases on the Web.
The problem with that is, information in invisible Web databases (whether bibliographic, full-text, directory, products, statistics, or raw data) is generally inaccessible to the software “spiders” and “crawlers” that compile search engine indexes. This is because traditional search engines can only point to the front doors of databases, so to speak; they can’t easily index the information inside them. To make matters worse, PDF files, Flash files, and data stored in back-end databases also aren’t indexed. That’s crucial, because only information that’s been indexed can be searched for using general-purpose search engines. Data in un-indexed files and databases are searchable only from the individual Web site hosting the database, PDF, or Flash content. So, unless you can find that site, you can’t access that database, PDF, or Flash information.
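For readers who want to see the "front door" problem concretely, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the sample page, its paths, and the class name LinkAndFormFinder are hypothetical, and real search engine crawlers are far more elaborate. The sketch shows that a program reading a page can collect ordinary links, but for a search form it learns only where the form sends its query, not what records the database behind it contains.

```python
from html.parser import HTMLParser


class LinkAndFormFinder(HTMLParser):
    """Collects what a simple link-following crawler can see on one page."""

    def __init__(self):
        super().__init__()
        self.links = []  # ordinary hyperlinks the crawler can follow
        self.forms = []  # search forms: only the "front door" is visible

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "form":
            # The crawler sees where the form submits, but not the records
            # that a typed-in query would return from the database.
            self.forms.append(attrs.get("action", ""))


# Hypothetical front page of a database-backed site (made up for this sketch).
SAMPLE_PAGE = """
<html><body>
  <a href="/about.html">About the collection</a>
  <a href="/help.html">Search tips</a>
  <form action="/search" method="get">
    <input type="text" name="q">
    <input type="submit" value="Search">
  </form>
</body></html>
"""

finder = LinkAndFormFinder()
finder.feed(SAMPLE_PAGE)
print(finder.links)  # ['/about.html', '/help.html']
print(finder.forms)  # ['/search']
```

The two linked pages can be fetched and indexed; everything reachable only by typing a term into the form stays out of the index.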
The invisible Web is enormous, and its information is not only potentially valuable, it’s also multiplying faster than data on the surface, or visible, Web. How big is the surface, or visible, Web? The Web Characterization Project (http://wcp.oclc.org/) of the Online Computer Library Center (OCLC) Office of Research conducts an annual Web sample to analyze trends in the size and content of the Web: The 32-bit Internet Protocol (IP) address space consists of 4,294,967,296 unique IP addresses.
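The address-space figure follows directly from the 32-bit length of an IP address: each bit doubles the number of possible addresses, so the total is 2 raised to the 32nd power. A short Python check of that arithmetic (the variable names are only illustrative):

```python
ADDRESS_BITS = 32                    # length of an IP address in bits
address_space = 2 ** ADDRESS_BITS    # every bit doubles the count

print(f"{address_space:,}")  # prints 4,294,967,296
```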
To put this in some perspective, let’s discuss Google for a moment. It provides access to the world’s largest and most comprehensive collection of online information – more than two billion Web pages, images, and newsgroup messages. Google was the first search engine to translate Adobe PDF (portable document format) files into HTML and index them across the Web. In the fall of 2001 Google began indexing several additional file formats: Microsoft Word, Excel and PowerPoint formats, as well as Rich Text Format and PostScript files. A bracket label to the left of the document title identifies the file type of your search hits: [doc] for Word documents, [xls] for Excel spreadsheets, [ppt] for PowerPoint presentations, [rtf] for Rich Text Format, and [ps] for PostScript documents. You can focus your search to a specific file type either by adding “filetype:(extension)” to your search strategy or by using the Advanced Search function to select your preferred file type.
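As a small illustration of the “filetype:(extension)” technique, the sketch below assembles a search string in Python. The function name build_query and its parameters are hypothetical, invented for this example; only the filetype: operator and the use of quotation marks for an exact phrase come from the search features described in this article. The resulting string is what you would type into the search box.

```python
def build_query(keywords, exact_phrase=None, filetype=None):
    """Assemble a search string from keywords, an optional quoted phrase,
    and an optional file-type restriction (hypothetical helper)."""
    parts = list(keywords)
    if exact_phrase:
        # Quotation marks ask the engine to match the words as a phrase.
        parts.append(f'"{exact_phrase}"')
    if filetype:
        # The filetype: operator narrows results to one document format.
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


query = build_query(["budget"], exact_phrase="United States", filetype="pdf")
print(query)  # budget "United States" filetype:pdf
```

Swapping "pdf" for "xls" or "ppt" would target Excel spreadsheets or PowerPoint presentations instead.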
Yet, for all of this, Google offers access to fewer than half of the more than four billion existing IP addresses. While search engines continue to improve the number of sites they index, there’s a huge piece of the Web that simply isn’t accessible to the “robots” and “spiders” these engines use. Given this information, can you fathom the wealth of information contained in the invisible Web?
Learn more about the Invisible Web
Want to know more about the fascinating world of the Invisible Web? The Invisible Web: Uncovering Information Sources Search Engines Can’t See, by Chris Sherman and Gary Price, provides a detailed look at the nature of the hidden Web. It also offers pathfinders for accessing the valuable information contained in the deep Web and includes more than 1,000 invisible Web resources that the authors consider to be among the best.
You may also want to sign up for a free trial of BrightPlanet’s LexiBot (version 2) client search tool (http://www.brightplanet.com/). LexiBot simultaneously searches 2,200 databases and search engines that are considered to be part of the Invisible Web. IntelliSeek’s BullsEye software (http://www.intelliseek.com/prod/bullseye/bullseye.htm) searches more than 1,000 professional search engines and speciality databases organised into more than 150 categories.
Information resources on the invisible web
Information within databases such as the American Memory primary source collections (http://www.learning.loc.gov/ammem/ammemhome.html), the Geography Network’s statistical data (http://www.geographynetwork.com/), or the Social Science Data on the Internet (http://odwin.ucsd.edu/idata/) out of the University of California, San Diego, provides us with accurate, reliable information. Besides supplying basic keyword search results, these databases also present prompts for further searches based on morphological or conceptual similarities. Users can choose search prompts to perform follow-up searches that are even more precisely defined, or to use in information retrieval with multiple search engines.
Ultimately, the invisible Web is another tool to meet the information needs of our users. To be most effective, however, it should be used in conjunction with our other information resources. One of the most important of those resources is the media specialist – you.
Think of it this way: Your library media center collection is similar to the invisible Web. While your Online Public Access Catalog (OPAC) serves as a general-purpose search engine, the collection contains a wealth of information that isn’t searchable through the OPAC. As a media specialist, you’re the “invisible” gateway to that information; you function as the ultimate search engine into your collection. In the same way, effective use of the invisible Web requires school library media specialists to become familiar with the contents of the sites that grant entry into it, just as we’re familiar with the content of traditional reference resources and our library collections.
By Terrence E. Young Jr.
Terrence E. Young Jr. is a school library Media Specialist in New Orleans, Louisiana, an Adjunct Instructor of Library Science at the University of New Orleans, and editor of the NetWorth column in Knowledge Quest.
Specialized search tools
A lot of helpful information is locked away in databases that presently aren’t indexed by search engines. Several Web sites attempt to lead school library media specialists to the “invisible” content on the Web that’s often ignored by traditional search engines. Tap into this elusive section of the Web by using some of these special tools for accessing and searching:
Academic Info: Your Gateway to Quality Educational Resources http://www.academicinfo.net/
Academic Info is an educational gateway to online college- and research-level Internet resources. Its target audience is the college and university community, but the subject guides are useful to high school students.
Argus Clearinghouse http://www.clearinghouse.net/
The Argus Clearinghouse provides a central access point for value-added topical guides that identify, describe, and evaluate Internet-based information resources.
Complete Planet http://www.completeplanet.com/
This site offers access to 90,000 searchable databases and speciality search engines organized into more than 7,000 subject headings.
Direct Search http://gwis2.circ.gwu.edu/~gprice/direct.htm
Direct Search is a growing compilation of links to the search interfaces of resources that contain data not easily or entirely searchable/accessible from general search tools such as Alta Vista, Google, or Hotbot. It provides annotated links to more than 1,000 searchable, interactive databases.
Geniusfind is a directory of thousands of search engines, databases, and archives organized into convenient categories and subcategories. These resources are specific to a certain topic and will greatly reduce the time it takes you to find exactly what you’re looking for.
IncyWincy: The Invisible Web Search Engine http://incywincy.com/
This site constructs a database of more than 250,000 searches and functions in a directory of more than 2.7 million sites and 350,000 categories. A 45-day trial licence is available.
This site searches thousands of categories of information and makes it available in a directory that users can navigate, as well as search, to find answers from the targeted databases. The hit list consists of relevant sites that might have hidden information of use to the searcher.
Librarians’ Index to the Internet (LII) http://www.lii.org/
LII is a searchable, annotated subject directory of more than 8,600 Internet resources selected and evaluated by librarians for their usefulness. LII is a reliable and efficient guide to Internet resources.
Pinakes: A Subject Launchpad http://www.hw.ac.uk/libWWW/im/pinakes/pinakes2.html
This site provides links to major subject gateways using a drop-down menu.
Search IQ http://www.zdnet.com/searchiq/
Billing itself as “the smartest tool for finding info on the Net,” Search IQ is a must-try site. The wealth of information is staggering and the search capabilities startling.