Friday, May 8, 2009

Introduction to the Internet: Internet searching

About web searching
  • The WWW is DAUNTING
    o It’s over 3.5 billion pages – we can really only guess
  • You cannot search the Web directly
    o you search a database of selected sites
  • Success depends on
    o choosing a database right for your needs
    o understanding how to search the database

Parts of a search engine

  • Spider/robot/crawler
    o A program that finds and downloads Web pages (new pages, updated pages)
    o A complete crawl takes 2 weeks to 6 months
  • Index
    o Database holding a copy of each Web page gathered by the spider
  • Search engine software
    o Software that allows users to ask for information

How a search engine handles a query

  • Search terms input
  • Index file searched for matches
  • Matching page entries gathered and ranked by relevance
  • Results formatted
  • Results page returned to searcher’s web browser
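
The steps above can be sketched in a few lines of Python. This is a toy illustration only, not how any real engine is built: the sample pages, the function names build_index and search, and the ranking rule (total count of query terms on the page) are all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical pages a spider might have gathered (URL -> page text).
PAGES = {
    "a.example/prisons": "prison privatization and private prison costs",
    "b.example/courts": "juvenile courts and prison sentences",
    "c.example/garden": "growing tomatoes at home",
}

def build_index(pages):
    """Map each word to the pages containing it, with occurrence counts."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def search(index, query):
    """Return URLs containing ALL query terms, highest total count first."""
    terms = query.lower().split()          # search terms input
    if not terms:
        return []
    # Look the terms up in the index file, not on the live Web.
    matches = set(index.get(terms[0], {}))
    for term in terms[1:]:
        matches &= set(index.get(term, {}))
    # Rank the matching entries by a crude relevance score.
    def score(url):
        return sum(index[t][url] for t in terms)
    return sorted(matches, key=lambda u: (-score(u), u))

INDEX = build_index(PAGES)
print(search(INDEX, "prison"))
```

A real engine would then format these URLs into a results page and return it to the browser; the sketch stops at the ranked list.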

Typical criteria for matching

  • Title
    o Keyword in title?
  • Domain/url
    o Keyword in domain/url?
  • Style
    o Is keyword used in bold, italic, or in an <h1> tag?
  • Density
    o How many times does keyword appear on page?
  • Outbound links
    o Who does the page link to?
    o What are the keywords in the link?
  • Inbound links
    o Who else has linked to this site?
  • Insite links
    o What other pages in the site itself does the page point to?
  • Meta tags
    o Allow page owner to specify key words and concepts for indexing
    o Subject to abuse so spiders should match meta tags against page content and reject those that don’t match
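
To make the criteria above concrete, here is a rough sketch of how they might be combined into one score. Everything specific in it is an assumption for illustration: the page fields, the function name relevance, and above all the weights, which are invented. Real engines keep their formulas private.

```python
def relevance(page, keyword, weights=None):
    """Score one page for one keyword using a few of the criteria listed above."""
    # Hypothetical weights, chosen only to make the example readable.
    w = weights or {"title": 3.0, "url": 2.0, "style": 1.5,
                    "density": 10.0, "inbound": 0.5, "meta": 1.0}
    kw = keyword.lower()
    body_words = page["body"].lower().split()
    score = 0.0
    if kw in page["title"].lower():            # keyword in title?
        score += w["title"]
    if kw in page["url"].lower():              # keyword in domain/url?
        score += w["url"]
    if any(kw in text.lower() for text in page["emphasized"]):
        score += w["style"]                    # keyword in bold, italic, or heading?
    if body_words:                             # density: share of words that match
        score += w["density"] * body_words.count(kw) / len(body_words)
    score += w["inbound"] * len(page["inbound_links"])
    # Meta keywords count only when the page content backs them up.
    if kw in page["meta_keywords"] and kw in body_words:
        score += w["meta"]
    return score

honest = {
    "title": "Prison privatization",
    "url": "example.org/prison-report",
    "emphasized": ["Private prison costs"],
    "body": "a report on prison costs and prison policy",
    "inbound_links": ["x.example", "y.example"],
    "meta_keywords": ["prison", "policy"],
}
spam = {
    "title": "Cheap watches",
    "url": "example.com/watches",
    "emphasized": [],
    "body": "buy cheap watches today",
    "inbound_links": [],
    "meta_keywords": ["prison", "watches"],   # meta tag not supported by content
}
print(relevance(honest, "prison"), relevance(spam, "prison"))
```

The second page lists "prison" in its meta keywords but never uses the word, so the meta tag earns it nothing.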

Paid placement

  • Some directories and search engines load the top of their results pages with paid listings. These are usually listings of sites whose owners pay for high placement. In other words, they are essentially advertisements.
  • Not all search services do this, and some are more clear than others about what has been paid for and what has not.
  • For more information, read “The Straight Story on Search Engines” from PC World at http://www.pcworld.com/resource/printable/article/0,aid,97431,00.asp

Search engines

  • Each search engine creates its own proprietary database and will differ in how it searches and what it returns
  • When you use a search engine you are not searching the live web but rather those pages gathered by the search engine’s spiders.

Web disadvantages

  • Not all web sites are indexed
  • Technical problems are a fact of life
    o Downed networks, busy sites, poor connections, downloading wait time
  • Lack of quality of information
  • Explosion in growth of web increasing difficulty in locating relevant data
  • Search engines undergoing rapid change in search features and stability (here today, gone or radically changed tomorrow)

Search engines

  • ABSOLUTELY ESSENTIAL for locating information on the Web.
  • Many provide search options such as Boolean, proximity operators, nesting, truncation, and more

Searching the Web effectively

  • Know what search engines will and will not do
  • Have a toolbox of resources to start with and to help narrow down your search
  • Focus on how search engines can help find the resource, not necessarily the exact item you want

Domains

  • Up until 2000, there were six major domains used in the United States:
    o COM – Commercial
    o GOV – Governmental
    o MIL – Military
    o ORG – Organization
    o NET – Network
    o EDU – Education
    This has now changed
  • In November 2000, the Internet Corporation for Assigned Names and Numbers (ICANN) added its first new batch of international domain names, which have started to appear:
    o BIZ – business
    o NAME – individuals
    o PRO – professionals
    o MUSEUM – museums
    o COOP – business cooperatives
    o AERO – aviation industry
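
As a small illustration, the domain at the end of a site's address can be looked up in a table built from the two lists above. The function name classify is made up for this sketch, and the table covers only the domains named in these notes, so anything else (including country codes such as .ca) comes back as "unknown".

```python
from urllib.parse import urlparse

# Only the domains listed in these notes; not a complete registry.
DOMAIN_MEANINGS = {
    "com": "Commercial", "gov": "Governmental", "mil": "Military",
    "org": "Organization", "net": "Network", "edu": "Education",
    "biz": "Business", "name": "Individuals", "pro": "Professionals",
    "museum": "Museums", "coop": "Business cooperatives",
    "aero": "Aviation industry",
}

def classify(url):
    """Return the meaning of the last label of the URL's host name."""
    if "//" not in url:
        url = "//" + url          # lets urlparse treat a bare host as a host
    host = urlparse(url).hostname or ""
    last_label = host.rsplit(".", 1)[-1]
    return DOMAIN_MEANINGS.get(last_label, "unknown")

print(classify("http://www.pcworld.com/resource/printable/article/0,aid,97431,00.asp"))
print(classify("AltaVista.ca"))
```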

Search tips

  • MOST IMPORTANT: identify exactly what you are looking for. The Web can’t guess that when you typed “information” you were really looking for “information on literacy efforts in high schools”.
  • Formulate searches carefully, taking full advantage of the advanced search options available on your selected search engine
  • Suppose you found a Web site that completely answered your question. Try to picture what this “perfect” web site would contain, and formulate your search based on those criteria

Sample search strategies

  • Juvenile offenders prosecuted as adults
    o “juvenile offenders” prosecuted adults
  • Privatization of prisons
    o Title: prison privatization
  • Mentally ill in criminal justice system
    o “mental illness” “criminal justice system”
  • Women’s rights issues in Third World nations
    o “women’s rights” “third world”

Search engines are best for

  • Distinctive word, name, title, or “word phrase”
  • A specific site
  • Interests people “put on the Web”
    o To share, promote, persuade, entice, sell
    o political, social
    o organizational, movement, controversy
    o businesses, investments
    o consumer viewpoints
    o activity, hobby, event, conference
    o ideas, theories, points of view, expert(?)ise, etc. from the “collective human mind”

Best general search engines

  • Google (3+ billion pages)
    o Most popular results first
    o Default AND (ALL your terms)
    o No truncation
    o Capitalize OR if needed, e.g. water bottled OR tap
    o - excludes
    o “ ” quotes make a phrase
  • AllTheWeb (2+ billion pages)
    o “Should include” alters ranking
    o Default AND (ALL your terms)
    o No truncation
    o Parentheses to OR, e.g. water (bottled tap)
    o - excludes
    o “ ” quotes make a phrase
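
The Google-style rules listed above (default AND, capitalized OR, - to exclude, quotes for a phrase, no truncation) can be mimicked with a short parser. This is only a sketch of the syntax as described in these notes, not Google's actual query handling; the names parse_query and matches are invented, the code uses straight quotes, and AllTheWeb's parentheses are not covered.

```python
import re

def parse_query(query):
    """Split a query into required OR-groups and excluded terms."""
    tokens = re.findall(r'-?"[^"]+"|\S+', query)
    required, excluded = [], []
    join_next = False                      # True right after a capitalized OR
    for tok in tokens:
        if tok == "OR":
            join_next = bool(required)
            continue
        if tok.startswith("-") and len(tok) > 1:
            excluded.append(tok[1:].strip('"').lower())
        else:
            term = tok.strip('"').lower()
            if join_next:
                required[-1].append(term)  # alternative to the previous term
            else:
                required.append([term])    # default AND: a new required group
        join_next = False
    return required, excluded

def term_in(term, text):
    # Whole words only, so "prison" does not match "prisons" (no truncation).
    return re.search(r"\b" + re.escape(term) + r"\b", text) is not None

def matches(text, query):
    """True if the text satisfies every group and contains no excluded term."""
    required, excluded = parse_query(query)
    text = text.lower()
    if any(term_in(t, text) for t in excluded):
        return False
    return all(any(term_in(t, text) for t in group) for group in required)

print(parse_query("water bottled OR tap"))
print(matches("Tap water is safe to drink", "water bottled OR tap"))
```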

Google

  • Officially launched in September 1999, it is currently the largest at over 3 billion pages indexed
  • Advanced features: language, date, domains, field searching. It also now includes an image search.
  • Ranks sites based on “PageRank”, which gives a page a higher ranking if it is linked from many high-quality sites
  • Google offers a cached copy of each result. The cached copy can be especially helpful if the site’s server is down or the web page is no longer available.
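
The idea behind link-based ranking can be sketched with a simplified version of the published PageRank calculation. This is not Google's production system: the tiny link graph, the damping value of 0.85, and the fixed number of iterations are all assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outbound links spreads its rank evenly.
                for t in pages:
                    new[t] += damping * rank[page] / n
        rank = new
    return rank

# Hypothetical link graph: "popular" is linked from three pages,
# "lonely" is linked from none.
GRAPH = {
    "hub": ["a", "b"],
    "a": ["popular"],
    "b": ["popular"],
    "popular": ["hub"],
    "lonely": ["popular"],
}
RANKS = pagerank(GRAPH)
print(sorted(RANKS, key=RANKS.get, reverse=True))
```

Note that "hub" ends up ranked well even though only one page links to it, because that one page is itself highly ranked. That is the "linked from high-quality sites" part of the idea.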

AllTheWeb

  • Launched in August 1999 under the name Fast
  • Second largest at 2+ billion
  • Now has advanced features, such as language, filters, domain and field searching

Teoma

  • A crawler-based search engine officially launched in April 2002; it indexes 1 million web pages and is now owned by Ask Jeeves
  • Three responses to each search
    o Results (relevant Web pages)
    o Refine (suggestions to narrow search)
    o Resources (link collections from experts and enthusiasts)
  • Ranks by “Subject-Specific Popularity”

AltaVista

  • Started in 1995 by Digital Equipment Corp.
  • Currently third largest at 1.6 billion pages
  • Offers proximity searching, truncation, link searches
  • Has country-specific interfaces, e.g. AltaVista.ca, AltaVista.de, etc.
  • Has undergone numerous redesigns

Meta search engines

  • Meta search engine: a server which passes queries on to many search engines and/or directories and then summarizes all the results, e.g. SurfWax, Ixquick, Dogpile, ProFusion, etc.
  • Alternative names: parallel search engine, multithreaded search engine, multiple search engine
  • Do not
    o have own databases
    o classify or review web sites
    o collect web pages
    o accept URL additions
  • Do
    o send queries simultaneously to Web search engines and/or directories

Types

  • Separate retrieval e.g. Dogpile
  • Collated retrieval e.g. SurfWax, Ixquick
  • Utilities (must download to your hard drive) e.g. Copernic, Web Ferret
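
The difference between separate and collated retrieval can be shown with two stand-in engines. Nothing here talks to a real search service: engine_a and engine_b are hypothetical stubs returning fixed lists, they are called one after another rather than simultaneously, and the merge rule (add 1/position for each engine that returns a page) is just one plausible way to collate, not the method of any named product.

```python
def engine_a(query):
    # Stub standing in for one search engine's ranked results.
    return ["x.example", "y.example", "z.example"]

def engine_b(query):
    # Stub standing in for a second engine.
    return ["y.example", "w.example"]

ENGINES = {"A": engine_a, "B": engine_b}

def separate_retrieval(query, engines):
    """One result list per engine, duplicates and all."""
    return {name: fn(query) for name, fn in engines.items()}

def collated_retrieval(query, engines):
    """One merged list; pages found by several engines move upward."""
    scores = {}
    for fn in engines.values():
        for position, url in enumerate(fn(query), start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / position
    return sorted(scores, key=lambda u: (-scores[u], u))

print(separate_retrieval("prison privatization", ENGINES))
print(collated_retrieval("prison privatization", ENGINES))
```

In the separate form y.example shows up twice, once per engine. In the collated form it appears once, at the top, because both engines returned it.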

Pros

  • useful for retrieving a relatively small number of relevant results
  • good for obscure topics
  • a good option when you can’t find results after using one or two “major” search engines
  • can search a variety of sources on same topic at one time
  • can see top hits from several databases at once
  • can search with one interface and method
  • gives an overall picture of what is available on your topic on the Web

Cons

  • use is limited primarily to simple queries
  • little or no field searching available
  • most return a limited number of results from each search engine
  • search may be slow and may have a timeout period eliminating a search engine which does not respond within limits set
  • results from noncollated engines can be redundant and overwhelming
  • some of the largest search engines, e.g. Google, are often not covered
  • may not process Boolean searches correctly

Directories

  • Collections of Web pages
    o gathered for a target purpose or audience
    o organized into subject categories
  • Built by human editors – often experts
    o sizes range from hundreds to 2 million+
  • Subject categories vary with the scope
    o some try to cover all kinds of information
    o some analyze a subject area or discipline
  • NEVER contain the full text of the web pages they link to

What are directories best for?

  • when search engines don’t work
    o no specific terms or phrases
    o too much on the Web matching your own words
  • finding more specialized directories
  • finding specialized databases
  • expertise, guidance, overview of a topic

Good general directories

  • Librarians’ Index to the Internet
  • Infomine
  • AcademicInfo
  • About.com
  • Yahoo

Useful Websites

Search Engine Tutorials
