Monday, January 4, 2010

Introduction to reference: Invisible web

Invisible web content is information that isn’t readily displayed or searchable on the Internet, including photos, images and sound files. These files don’t appear to be “there”, but they really are. Some directories have catalogued items from the invisible web, but otherwise this information is hard to find. The invisible web is also known by any of the following terms:
  • deep web
  • dark matter
  • grey web
  • opaque web
  • private web
  • proprietary web
The proprietary web is one part of the invisible web. The private web consists of Internet pages inaccessible to the public, such as intranets or sites that require a company username.
What is the invisible web?
Material that general-purpose search engines either cannot, or perhaps more importantly, will not include in their collections of webpages.
Sherman, Chris. “Navigating the invisible web.” Search Engine Watch 2001. http://www.searchengineweb.com
Why care about the invisible web?
  • The invisible web is estimated to be 500 times larger than the visible web.
  • The invisible web is the largest category of new information on the Internet, and the commercial web is its fastest-growing area: more paid websites, databases and indexes are coming online all the time, and much of this information lives in databases.
  • The invisible web contains highly relevant content.
Facts and figures – sizes 

  • Visible or surface web 
    • 4 billion individual documents 
    • 19 terabytes (TB) 
    • 100% publicly available 
    • quality – well.... the user must decide that one.
  • Invisible or deep web 
    • 550 billion individual documents 
    • 750 terabytes (TB) 
    • 200,000 web sites 
    • 95% publicly available 
    • quality – 1000 to 2000 times greater 

Bergman, Michael. “The Deep Web: Surfacing Hidden Value,” 2001. http://www.brightplanet.com/technology/deepweb.asp#HigherQuality


Considering that these figures are eight years old, the reader could multiply them ten times over and still not have a realistic estimate of how much of the Internet one can and cannot easily access today. 

Why isn’t all content picked up? 

  • Restricted access
    • Login, password protected 
  • Undiscovered sites 
    • No links to or from site 
  • Dynamic pages only produce results in response to a specific search request 
  • robots.txt file attached (instructions telling a spider not to index a site) 
  • Search engine may not like the URL used to retrieve the document. 
    • Most dynamic delivery mechanisms use the ? symbol. 
    • Most search engines will not read past the ? in that URL. 
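The “?” heuristic described above is easy to check programmatically. This is a minimal sketch using Python’s standard library; the example URLs are hypothetical:

```python
from urllib.parse import urlparse

def looks_dynamic(url):
    """Return True if the URL carries a query string (the part after '?'),
    which many early crawlers refused to follow."""
    return urlparse(url).query != ""

print(looks_dynamic("http://example.com/catalog?isbn=12345"))  # True
print(looks_dynamic("http://example.com/about.html"))          # False
```

A crawler applying this rule would record the database’s front page but never see any of the pages it generates.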
Types of invisible web content
  • Newly added Web pages 
    • not yet found by spiders 
  • Sites use exclusionary tags 
    • Robots can be instructed to ignore pages via a meta tag. Site owners can also create a file called robots.txt containing a set of rules telling the robot what not to index. 
  • Intranets (private networks) 
  • Web sites which generate dynamic pages based on user inquiry
    • After the session is over, the info may “disappear”. 
  • Sites requiring special software or hardware to access content 
    • e.g. Flash, Shockwave, Java applets, video, sound: 
      • Mapping History http://www.uoregon.edu/~atlas/ requires shockwave 
      • America’s Jazz Heritage http://www.sl.edu/ajazzh/audio.htm requires RealAudio G2 player 
    • Word, PowerPoint, PDF, Excel, PostScript and Rich Text Format have been indexed by Google since at least 2003, but it still won’t pick up everything. 
  • Web sites free to the public but which require user to search within the site’s database(s) to find info 
    • e.g. Electronic Journal Miner http://ejournal.coalliance.org 
  • Sites requiring registration or login 
    • e.g. New York Times http://www.nytimes.com 
  • Hybrid sites, some content free, some restricted 
    • e.g. Big Chalk http://www.bigchalk.com 
  • Web sites requiring a subscription 
    • e.g. EBSCOhost, Encyclopedia Americana 
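The exclusionary robots.txt rules mentioned above can be exercised directly with Python’s standard `urllib.robotparser` module. The rules below are hypothetical, written for illustration:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse a made-up rule set directly instead of fetching one from a server.
rp.parse("""
User-agent: *
Disallow: /private/
Disallow: /search
""".splitlines())

# A well-behaved spider consults these rules before indexing a page.
print(rp.can_fetch("*", "http://example.com/private/report.html"))  # False
print(rp.can_fetch("*", "http://example.com/index.html"))           # True
```

Anything a site disallows this way stays on the invisible web, even though the pages themselves are publicly reachable.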
Databases
“When an indexing spider comes across a database, it’s as if it has run smack into the entrance of a massive library with securely bolted doors. Spiders can record the library’s address, but can tell you nothing about the books, magazines or other documents it contains.” http://www.freepint.com/issues/080600.htm#feature 
Finding info on the invisible web 
  • Use search engines, meta-search engines and Web directories 
    • to find info hidden from search engines in databases, add the word “database” to your subject term to locate database gateways 
Strategies for success 
  • Keep current 
    • discussion list 
    • newsletters 
    • use a monitoring service
    • use C.I. (current information) resources 
  • Search smart
    • use meta search engines first
    • look at first page of results only
    • use bookmarks, pathfinders, other libraries 
  • Know what is available 
    • subject-specific discussion groups 
Search tools for the invisible web 
Virtual libraries 
Invisible search engines and directories 
Subject-specific databases 
  • Use a search engine to find databases 
    • e.g. +"civil engineering database" -library 
    • "civil engineering" NEAR database 
  • Use a directory of specialized databases 
Selected subject-specific databases 
Multimedia Singing Fish http://www.singingfish.com 
Science Scirus http://www.scirus.com 

Other Speciality Search Engines and Invisible Web Resources http://lib.nmsu.edu/instruction/specialitysearch.htm

Invisible web tutorial 
Invisible web: what it is, why it exists, how to find it, and its... finding information on the Internet: a tutorial http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html
