Monday, January 4, 2010

Introduction to reference: Invisible web

Invisible web content is information that is not readily displayed or searchable on the Internet. It includes photos, images and sound files. These files do not appear to be “there”, but they are. Directories have catalogued some items on the invisible web, but otherwise this information is hard to find.

The invisible web can be known by any of the following terms:
• deep web
• dark matter
• grey web
• opaque web
• private web
• proprietary web

Proprietary web is another term for the invisible web. The private web consists of Internet pages that are inaccessible to the public, such as intranets or sites that require a company username.

What is the invisible web?
Material that general-purpose search engines either cannot or, perhaps more importantly, will not include in their collections of webpages.
Sherman, Chris. “Navigating the invisible web.” Search Engine Watch 2001. www.searchengineweb.com


Why care about the invisible web?
• The invisible web is estimated to be 500 times larger than the visible web.
• The invisible web is the largest category of new information on the Internet. The commercial web is the fastest-growing area of the Internet. More paid Web sites and databases are coming online, as well as indexes. Most of this information is also available in databases.
• Invisible web content is highly relevant.
• The total quality content of the invisible web is many times greater than that of the conventional Web.
Bergman, Michael. “The Deep Web: Surfacing Hidden Value,” 2001. http://www.brightplanet.com/technology/deepweb.asp#HigherQuality

Facts and figures – sizes
Visible or surface web
• >4 billion individual documents
• 19 terabytes (TB)
• 100% publicly available
• quality – well.... the user must decide that one.

Invisible or deep web
• >550 billion individual documents
• 750 terabytes (TB)
• 200,000 web sites
• 95% publicly available
• quality – 1000 to 2000 times greater

Bergman, Michael. “The Deep Web: Surfacing Hidden Value,” 2001. http://www.brightplanet.com/technology/deepweb.asp#HigherQuality

Considering that these figures are 8 years old, the reader could multiply these numbers ten times over and still not arrive at an estimate of how much of the Internet one can and cannot easily access today.
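As a rough check on scale, the ratios implied by the figures listed above can be worked out directly. This is only arithmetic on the numbers as quoted in this post; the variable names are made up for illustration, and the result depends entirely on which baseline figures are used, which is why other estimates of the gap differ.

```python
# Figures exactly as listed in this post (attributed to Bergman, 2001).
surface_docs = 4_000_000_000      # >4 billion documents on the visible web
deep_docs = 550_000_000_000       # >550 billion documents on the invisible web
surface_tb = 19                   # terabytes on the visible web
deep_tb = 750                     # terabytes on the invisible web

# How many times larger the invisible web is, by each measure.
doc_ratio = deep_docs / surface_docs
size_ratio = deep_tb / surface_tb

print(f"Documents: about {doc_ratio:.1f} times more")   # about 137.5 times more
print(f"Storage:   about {size_ratio:.1f} times more")  # about 39.5 times more
```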

Why isn’t all content picked up?
• Restricted access
o Login, password protected
• Undiscovered sites
o No links to or from site
• Dynamic pages only produce results in response to a specific search request
• robots.txt file present (instructions to a spider not to index a site)
• The search engine may not accept the URL used to retrieve the document. Most dynamic delivery mechanisms use the ? symbol, and most search engines will not read past the ? in the URL.
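Two of the reasons above, robots.txt rules and URLs containing a ?, can be illustrated with a small sketch using Python's standard library. The robots.txt rules, the spider name "ExampleSpider" and the example.com URLs are all invented for illustration, and real search engines apply their own, more involved policies; this only shows the general idea of a spider deciding what to skip.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string instead of fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_parser(robots_text):
    """Build a robots.txt parser from the text of the file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def crawl_decision(parser, url, agent="ExampleSpider"):
    """Return 'index' or a short reason why a cautious spider would skip the URL."""
    if not parser.can_fetch(agent, url):
        return "skip: excluded by robots.txt"
    if urlsplit(url).query:
        # Everything after the ? is the query string of a dynamic page.
        return "skip: dynamic URL (contains ?)"
    return "index"


parser = make_parser(ROBOTS_TXT)
for url in (
    "http://www.example.com/about.html",
    "http://www.example.com/private/report.html",
    "http://www.example.com/catalog/search?q=maps",
):
    print(url, "->", crawl_decision(parser, url))
```

Only the first URL would be indexed here; the other two end up on the invisible web from this spider's point of view.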

Types of invisible web content
• Newly added Web pages
o not yet found by spiders
• Sites that use exclusionary tags
o Robots can be instructed to ignore pages with a meta tag. Site owners can also create a file called robots.txt, which contains a set of rules telling the robot what not to index.
• Intranets (private networks)
• Web sites which generate dynamic pages based on user inquiry
o After the session is over the info may “disappear”, e.g.
▪ MapQuest http://www.mapquest.com
▪ The Sock Calculator http://www.panix.com/~~llaine/socks.html
• Sites requiring special software or hardware to access content
o e.g. Flash, Shockwave, Java applets, video, sound
▪ Mapping History http://www.uoregon.edu/~atlas/ requires Shockwave
▪ America’s Jazz Heritage http://www.sl.edu/ajazzh/audio.htm requires RealAudio G2 player
o Word, PowerPoint, PDF, Excel, PostScript and Rich Text Format files have been indexed by Google since at least 2003, but it still won’t pick up everything.
• Web sites that are free to the public but require the user to search within the site’s database(s) to find info
o e.g. Electronic Journal Miner http://ejournal.coalliance.org
• Sites requiring registration or login
o e.g. New York Times http://www.nytimes.com
• Hybrid sites, with some content free and some restricted
o e.g. Big Chalk http://www.bigchalk.com
• Web sites requiring a subscription
o e.g. EBSCOhost, Encyclopedia Americana
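The exclusionary meta tag mentioned in the list above can be detected with a few lines of Python. This is a minimal sketch, assuming the common `<meta name="robots" content="noindex">` convention; the class and function names are invented for illustration, and a real spider would handle many more cases.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when an attribute has no value.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for part in attr_map.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def may_index(html_text):
    """Return False if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "noindex" not in parser.directives


hidden_page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
open_page = "<html><head><title>Catalogue</title></head></html>"
print(may_index(hidden_page))  # False
print(may_index(open_page))    # True
```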

Databases
“When an indexing spider comes across a database, it’s as if it has run smack into the entrance of a massive library with securely bolted doors. Spiders can record the library’s address, but can tell you nothing about the books, magazines or other documents it contains.”
http://www.freepint.com/issues/080600.htm#feature

• Databases form the largest part of the invisible web.
Distribution of Deep Web Sites by Content – From Bergman @ http://www.brightplanet.com/deepcontent/tutorial/DeepWeb/index.asp

Finding info on the invisible web
• Use search engines, meta-search engines and Web directories
o To find info in databases that is hidden from search engines, add the word “database” to the subject term to locate database gateways

Strategies for success
• Keep current
o discussion list
o newsletters
o use a monitoring service
o use C.I. (current information) resources
• Search smart
o use meta search engines first
o look at first page of results only
o use bookmarks, pathfinders, other libraries
• Know what is available
o subject-specific discussion groups

Search tools for the invisible web
Virtual libraries

• Librarian’s Index to the Internet http://lii.org
• Digital Librarian http://www.digital-librarian.com
• WWW Virtual Library http://www.vlib.org
• Infomine (directory) http://infomine.ucr.edu

Invisible search engines and directories
• ProFusion http://www.profusion.com
• Complete Planet http://www.completeplanet.com
• Invisible Web Directory http://invisible-web.net
• Direct Search http://www.freepint.com/gary/direct.htm

Subject-specific databases
• Use a search engine to find databases
o e.g. +"civil engineering database" -library
o "civil engineering" NEAR database
• Use a directory to specialized databases
o Fossick http://www.fossick.com/index.htm
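The query patterns above are easy to generate for any subject. The helper below is a hypothetical sketch (the function name and parameters are invented), and it only builds the query strings; whether a given search engine honours +, - or NEAR varies, so the operators should be checked against that engine's help pages.

```python
def database_queries(subject, exclude=()):
    """Build search strings aimed at finding database gateways for a subject."""
    # Require the exact phrase "<subject> database".
    phrase = f'+"{subject} database"'
    # Drop results containing unwanted words, e.g. library catalogue pages.
    exclusions = " ".join(f"-{word}" for word in exclude)
    required = f"{phrase} {exclusions}".strip()
    # Alternative form: the subject phrase appearing near the word "database".
    proximity = f'"{subject}" NEAR database'
    return [required, proximity]


for query in database_queries("civil engineering", exclude=["library"]):
    print(query)
# +"civil engineering database" -library
# "civil engineering" NEAR database
```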

Selected subject-specific databases
Business
• Global Edge http://globaledge.msu.edu/ibrd/ibrd.asp
Multimedia
• Singing Fish http://www.singingfish.com
Science
• Scirus http://www.scirus.com
Other
• Speciality Search Engines & Invisible Web Resources http://lib.nmsu.edu/instruction/specialitysearch.htm

Invisible web tutorial
Invisible web: what it is, why it exists, how to find it, and its... finding information on the Internet: a tutorial http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html
