Living in the library world: Introduction to reference: Invisible web

Invisible web content is information that isn’t readily displayed or searchable on the Internet through general-purpose search engines. This includes photos, other images and sound files. These files may not appear to be “there”, but they are. Directories have catalogued some items on the invisible web, but otherwise this information is hard to find. The invisible web is also known by any of the following terms:

deep web
dark matter
grey web
opaque web
private web
proprietary web

Proprietary web describes the invisible web. The private web consists of Internet pages that are inaccessible to the public, such as intranets, or sites that require a company’s username.

What is the invisible web?

Material that general-purpose search engines either cannot or, perhaps more importantly, will not include in their collections of Web pages.

Sherman, Chris. “Navigating the invisible web.” Search Engine Watch 2001. http://www.searchengineweb.com

Why care about the invisible web?

The invisible web is estimated to be 500 times larger than the visible web.

The invisible web is the largest category of new information on the Internet, and the commercial web is the fastest-growing area of the Internet. More paid Web sites, databases and indexes are coming online. Most of this information is also available in databases.

The invisible web contains highly relevant content.

Facts and figures – sizes

Visible or surface web

4 billion individual documents
19 terabytes (TB)
100% publicly available
quality – well.... the user must decide that one.

Invisible or deep web

550 billion individual documents
7,500 terabytes (TB)
200,000 web sites
95% publicly available
quality – 1000 to 2000 times greater

Bergman, Michael. “The Deep Web: Surfacing Hidden Value,” 2001. http://www.brightplanet.com/technology/deepweb.asp#HigherQuality

Considering that these figures are eight years old, the reader could multiply these numbers tenfold and still not arrive at an estimate of how much of the Internet one can and cannot easily access today.

Why isn’t all content picked up?

Restricted access

Undiscovered sites

No links to or from site

Dynamic pages only produce results in response to a specific search request
A robots.txt file is present (instructions telling a spider not to index a site)
The search engine may not be able to handle the URL used to retrieve the document.
Most dynamic delivery mechanisms use the ? symbol.
Most search engines will not read past the ? in that URL.
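
To illustrate the point about the ? symbol, here is a minimal Python sketch, using only the standard library, of how the address of a dynamic page splits into the part before the ? and the query after it. The example URL and the helper name split_dynamic_url are made up for illustration, and real spiders differ in how they treat query strings.

```python
from urllib.parse import urlsplit

def split_dynamic_url(url):
    """Return (part before the '?', query string, whether a query is present)."""
    parts = urlsplit(url)
    # Rebuild everything before the '?': scheme, host and path.
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return base, parts.query, bool(parts.query)

# Hypothetical address of a page generated from a database search.
url = "http://www.example.com/catalog/search?subject=jazz&year=1959"
base, query, is_dynamic = split_dynamic_url(url)

print(base)        # what a spider keeps if it stops reading at the '?'
print(query)       # the search request that actually produces the page
print(is_dynamic)
```

A spider that drops the query sees only the search form's address, not the results page that the query would have generated.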

Types of invisible web content

Newly added Web pages

not yet found by spiders

Sites use exclusionary tags

Robots can be told to skip a page with a robots meta tag in the page’s HTML. Site owners can also create a file called robots.txt, which contains a set of rules telling the robot what not to index.
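
As a rough sketch of how both exclusion mechanisms look from a spider's side, the Python below checks a robots.txt rule set and looks for a robots meta tag, using only the standard library. The robots.txt contents, the page, the example.com addresses and the robot name ExampleSpider are all invented for illustration. Actual search engine spiders may apply these rules differently.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every robot is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

def allowed(url, robot="ExampleSpider"):
    """True if the robots.txt rules above let this robot fetch the URL."""
    return rules.can_fetch(robot, url)

class RobotsMetaFinder(HTMLParser):
    """Collects the directives of a <meta name="robots" ...> tag, if any."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",") if d.strip()}

def page_opts_out(html):
    """True if the page carries a robots meta tag containing 'noindex'."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives

# Hypothetical page that asks robots not to index it.
PAGE = ('<html><head><meta name="robots" content="noindex, nofollow"></head>'
        '<body>Staff only</body></html>')

print(allowed("http://www.example.com/index.html"))           # public page
print(allowed("http://www.example.com/private/report.html"))  # excluded folder
print(page_opts_out(PAGE))
```

Pages excluded either way stay reachable by anyone who knows the address, but a well-behaved spider leaves them out of the search engine's index.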

Intranets (private networks)
Web sites which generate dynamic pages based on user inquiry

After the session is over, the info may “disappear”, e.g.

MapQuest http://www.mapquest.com
The Sock Calculator http://www.panix.com/~~llaine/socks.html

Sites requiring special software or hardware to access content

e.g. Flash, Shockwave, Java applets, video and sound. Examples:

Mapping History http://www.uoregon.edu/~atlas/ requires Shockwave
America’s Jazz Heritage http://www.sl.edu/ajazzh/audio.htm requires RealAudio G2 player

Word, PowerPoint, PDF, Excel, PostScript and Rich Text Format files have been indexed by Google since at least 2003, but it still won’t pick up everything.

Web sites that are free to the public but require the user to search within the site’s database(s) to find info

e.g. Electronic Journal Miner http://ejournal.coalliance.org

Sites requiring registration or login

e.g. New York Times http://www.nytimes.com

Hybrid sites, some content free, some restricted

e.g. Big Chalk http://www.bigchalk.com

Web sites requiring a subscription

e.g. EBSCOhost, Encyclopedia Americana

Databases

“When an indexing spider comes across a database, it’s as if it has run smack into the entrance of a massive library with securely bolted doors. Spiders can record the library’s address, but can tell you nothing about the books, magazines or other documents it contains.” http://www.freepint.com/issues/080600.htm#feature

Databases form the largest part of the invisible web.

[Figure: Distribution of Deep Web Sites by Content. From Bergman, http://www.brightplanet.com/deepcontent/tutorial/DeepWeb/index.asp]

Finding info on the invisible web

Use search engines, meta-search engines and Web directories

To find info in databases that is hidden from search engines, add the word “database” to your subject term to find database gateways.

Strategies for success

Keep current

discussion list
newsletters
use a monitoring service
use C.I. (current information) resources

Search smart

use meta search engines first
look at first page of results only
use bookmarks, pathfinders, other libraries

Know what is available

subject-specific discussion groups

Search tools for the invisible web

Virtual libraries

Librarian’s Index to the Internet http://lii.org
Digital Librarian http://www.digital-librarian.com
WWW Virtual Library http://www.vlib.org
Infomine (directory) http://infomine.ucr.edu

Invisible search engines and directories

ProFusion http://www.profusion.com
Complete Planet http://www.completeplanet.com
Invisible Web Directory http://invisible-web.net
Direct Search http://www.freepint.com/gary/direct.htm

Subject-specific databases

Use a search engine to find databases

e.g. +"civil engineering database" -library
"civil engineering" NEAR database
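
The two query patterns above can be generated for any subject term. The short Python sketch below does so. The function name database_queries is invented for illustration, and whether a given search engine honors the +, - or NEAR operators varies from engine to engine, so treat the output as a starting point to adapt.

```python
def database_queries(subject):
    """Build the two query styles shown above for a subject term."""
    # Require the exact phrase "<subject> database" and exclude pages about libraries.
    phrase_query = f'+"{subject} database" -library'
    # Ask for the subject phrase appearing close to the word "database".
    near_query = f'"{subject}" NEAR database'
    return [phrase_query, near_query]

for query in database_queries("civil engineering"):
    print(query)
```

Swapping in another subject, such as "marine biology", gives the same two patterns for that field.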

Use a directory of specialized databases

Fossick http://www.fossick.com/index.htm

Selected subject-specific databases

Business: Global Edge http://globaledge.msu.edu/ibrd/ibrd.asp

Multimedia: Singing Fish http://www.singingfish.com

Science: Scirus http://www.scirus.com

Other Speciality Search Engines and Invisible Web Resources http://lib.nmsu.edu/instruction/specialitysearch.htm

Invisible web tutorial

Invisible web: what it is, why it exists, how to find it, and its... finding information on the Internet: a tutorial http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html

Monday, January 4, 2010
