- deep web
- dark matter
- grey web
- opaque web
- private web
- proprietary web
What is the invisible web?
Material that general-purpose search engines either cannot or, perhaps more importantly, will not include in their collections of web pages.
Sherman, Chris. “Navigating the invisible web.” Search Engine Watch 2001. http://www.searchengineweb.com
Why care about the invisible web?
- The invisible web is estimated to be 500 times larger than the visible web.
- The invisible web is the largest category of new information on the Internet. The commercial web is the fastest-growing area of the Internet: more paid websites, databases and indexes are coming online. Most of this information is also available in databases.
- Invisible web content is highly relevant.
http://www.brightplanet.com/technology/deepweb.asp#HigherQuality
Facts and figures – sizes
- Visible or surface web
- 4 billion individual documents
- 19 terabytes (TB)
- 100% publicly available
- quality – well.... the user must decide that one.
- Invisible or deep web
- 550 billion individual documents
- 750 terabytes (TB)
- 200,000 web sites
- 95% publicly available
- quality – 1000 to 2000 times greater
Bergman, Michael. “The Deep Web: Surfacing Hidden Value,” 2001. http://www.brightplanet.com/technology/deepweb.asp#HigherQuality
Considering that these figures are 8 years old, the reader can multiply these numbers ten times over and still not get an estimate of how much of the Internet one can and cannot easily access today.
Why isn’t all content picked up?
- Restricted access
- Login, password protected
- Undiscovered sites
- No links to or from site
- Dynamic pages only produce results in response to a specific search request
- robots.txt file present (instructions to a spider not to index a site)
- Search engine may not like the URL used to retrieve the document.
- Most dynamic delivery mechanisms use the ? symbol.
- Most search engines will not read past the ? in that URL.
- Newly added Web pages
- not yet found by spiders
- Sites use exclusionary tags
- Robots can be told to ignore a page with a meta tag. Site owners can also create a file called robots.txt, which contains a set of rules telling the robot what not to index.
- Intranets (private networks)
- Web sites which generate dynamic pages based on user inquiry
- After the session is over the info may “disappear”, e.g.
- MapQuest http://www.mapquest.com
- The Sock Calculator http://www.panix.com/~~llaine/socks.html
- Sites requiring special software or hardware to access content
- e.g. Flash, Shockwave, Java applets, video, sound
- Mapping History http://www.uoregon.edu/~atlas/ requires Shockwave
- America’s Jazz Heritage http://www.sl.edu/ajazzh/audio.htm requires RealAudio G2 player
- Word, PowerPoint, PDF, Excel, PostScript and Rich Text Format files have been indexed by Google since at least 2003, but it still won’t pick up everything.
- Web sites free to the public but which require user to search within the site’s database(s) to find info
- e.g. Electronic Journal Miner http://ejournal.coalliance.org
- Sites requiring registration or login
- e.g. New York Times http://www.nytimes.com
- Hybrid sites, some content free, some restricted
- e.g. Big Chalk http://www.bigchalk.com
- Web sites requiring a subscription
- e.g. EBSCOhost, Encyclopedia Americana
“When an indexing spider comes across a database, it’s as if it has run smack into the entrance of a massive library with securely bolted doors. Spiders can record the library’s address, but can tell you nothing about the books, magazines or other documents it contains.” http://www.freepint.com/issues/080600.htm#feature
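Two of the exclusion mechanisms listed above, robots.txt rules and URLs containing a ?, can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search engine works: the site www.example.com, the spider name ExampleSpider, the sample rules and the function skip_reason are all made up for the example, and the rule "skip every URL with a query string" simply mirrors the behaviour described in the list.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for an imaginary site (assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def skip_reason(url, robots_lines, agent="ExampleSpider"):
    """Return why a simple spider would skip url, or None if it would index it."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse() takes the robots.txt content as a list of lines
    if not parser.can_fetch(agent, url):
        return "excluded by robots.txt"
    # Assumption: this toy spider refuses every URL that carries a query string.
    if urlsplit(url).query:
        return "dynamic URL (contains a ? query string)"
    return None


rules = ROBOTS_TXT.splitlines()
for url in (
    "http://www.example.com/about.html",
    "http://www.example.com/private/report.html",
    "http://www.example.com/search?q=socks",
):
    print(url, "->", skip_reason(url, rules) or "indexable")
```

Only the first URL is treated as indexable here; the second is blocked by the Disallow rule and the third by its query string.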
- Databases form the largest part of the invisible web. See “Distribution of Deep Web Sites by Content” in Bergman, http://www.brightplanet.com/deepcontent/tutorial/DeepWeb/index.asp
- Use search engines, meta-search engines and Web directories
- to find info in databases that is hidden from search engines, add the word “database” to the subject term to find database gateways
- Keep current
- discussion list
- newsletters
- use a monitoring service
- use C.I. (current information) resources
- Search smart
- use meta search engines first
- look at first page of results only
- use bookmarks, pathfinders, other libraries
- Know what is available
- subject-specific discussion groups
Virtual libraries 
- Librarian’s Index to the Internet http://lii.org
- Digital Librarian http://www.digital-librarian.com
- WWW Virtual Library http://www.vlib.org
- Infomine (directory) http://infomine.ucr.edu
- ProFusion http://www.profusion.com
- Complete Planet http://www.completeplanet.com
- Invisible Web Directory http://invisible-web.net
- Direct Search http://www.freepint.com/gary/direct.htm
- Use a search engine to find databases
- e.g. +"civil engineering database" -library
- "civil engineering" NEAR database
- Use a directory to specialized databases
- Fossick http://www.fossick.com/index.htm
Other Speciality Search Engines and Invisible Web Resources http://lib.nmsu.edu/instruction/specialitysearch.htm
 Invisible web tutorial 
Invisible web: what it is, why it exists, how to find it, and its... finding information on the Internet: a tutorial http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html