How AltaVista Works
AltaVista has an index that is built by sending out a crawler (a robot program) that captures text and brings it back.
The main crawler is called "Scooter." Scooter sends out thousands of threads simultaneously. 24 hours a day, 7 days a week, Scooter and its cousins access thousands of pages at a time, like thousands of blind users grabbing text, pulling it back, throwing it into the indexing machines so the next day that text can be in the index. And at the same time, they pull off, from all those pages, every hyperlink that they find, to put in a list of where to go to next.
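The two jobs described above (capturing a page's text for the index and harvesting its hyperlinks for the to-visit list) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Scooter's actual code: the names PageParser and process_page, the example.com address, and the sample HTML are all hypothetical, and the page is supplied as a string rather than fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects visible text and the href of every <a> tag on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def process_page(url, html, frontier, seen):
    """Return the page text for indexing and queue any links not seen before."""
    parser = PageParser(url)
    parser.feed(html)
    for link in parser.links:
        if link not in seen:
            seen.add(link)
            frontier.append(link)
    return " ".join(parser.text_parts)


# Hypothetical sample page; a real crawler would download it first.
frontier = deque()
seen = {"http://example.com/"}
sample = '<html><title>Home</title><body>Welcome <a href="/about.html">About</a></body></html>'
text = process_page("http://example.com/", sample, frontier, seen)
```

Here text holds the words that would go to the indexing machines, and frontier holds the list of where to go next.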
In a typical day Scooter and its cousins visit over 10 million pages. If there are a lot of hyperlinks from other pages to yours, that increases your chances of being found. But if this is your own personal site, or if this is a brand new Web page, that's not too likely.
AltaVista has an incredibly large database of Web sites, such that searches often return hundreds of thousands of Web site matches. AltaVista's spider goes down about three pages into your site. This is important to remember if you have different topical pages that won't be found within three clicks of the main page. You will have to submit them for indexing separately.
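To find out which of your pages sit more than three clicks from the main page, you can walk your own link structure breadth-first. The sketch below assumes you already have a map of which page links to which; the site dictionary, the file names, and the function names are hypothetical, and the three-click limit is simply the rule of thumb quoted above.

```python
from collections import deque


def click_depths(links, start):
    """Breadth-first search: fewest clicks needed to reach each page from start."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def pages_beyond(links, start, max_clicks=3):
    """Pages that are unreachable, or reachable only with more than max_clicks."""
    depths = click_depths(links, start)
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    return sorted(p for p in all_pages if depths.get(p, max_clicks + 1) > max_clicks)


# Hypothetical site: each key is a page, each value the pages it links to.
site = {
    "index.html": ["products.html"],
    "products.html": ["widgets.html"],
    "widgets.html": ["blue.html"],
    "blue.html": ["blue-specs.html"],
}
too_deep = pages_beyond(site, "index.html")
```

Any page that ends up in too_deep is a candidate for separate submission, or for a more direct link from the main page.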
You cannot tell AltaVista how to index your site; that is all done by its spider. You can, however, go to AltaVista's site and give the spider a nudge by submitting specific pages. That way, the spider knows to visit each page and index it. Once you have done that, it's all up to your META tags and your page's content! AltaVista's spider may revisit your site each month after its initial visit.
AltaVista's ranking algorithms reward keywords in the <TITLE> tag. If a keyword is not in the title tag, the page will likely not appear anywhere near the top of the search results! AltaVista also rewards keywords near one another, and keywords near the beginning of a page.
Add a Page
Adding a page through AltaVista’s Add URL form doesn’t guarantee that the page will be listed. It usually takes around 4 to 6 weeks for a page to show up. You don't need any special authority to "add a page." This is not a directory, like Yahoo!, where the information provider has to submit information and prove they are who they say they are. You do not have to do this with AltaVista. It will go and check and bring back whatever text it finds at that address.
If you give it a URL for a page that doesn't exist, it will come back with Error 404, which means there is no such page. If that page was in the index, it will remove that page from the index the next day.
This is very important from several perspectives. Say you have changed the directory structure at your Web site. First, you should go to AltaVista and Add a Page for all the old addresses to remove the old information from the index. Then you should Add a Page for all the new addresses. Also, if you made an embarrassing typo or posted a document that you shouldn't have, and removed that page from the Web, you can Add URL for that page at AltaVista to make sure the information is not perpetuated in the index.
What AltaVista doesn’t Index
AltaVista doesn't index everything. In fact, features that Web designers may add to sites at great expense may block crawlers, meaning that those pages will never be indexed and never be found through search engines. As a result, those sites may end up spending far more on promotion than they would have had to otherwise.
Here are some pages AltaVista doesn’t index. This only highlights the importance of using plain text for your web pages.
First, sites that require any kind of registration or password lock out AltaVista. Keep in mind that a web crawler cannot fill out a form of any kind. If you need to fill out a form to get to the next page, the crawler halts right there. If you would like to gather information about your users/members but would also like your pages to be indexed, make the registration optional.
Similarly, the AltaVista crawler cannot get content from a database, because it cannot fill out a form. If the content of your database is largely text, you might consider creating plain text static HTML pages with that same content, so it can be indexed and found.
Dynamic pages also block AltaVista spiders. While it's great to give visitors to your site unique experiences, tailored to their needs, the techniques you use to do that could stop most search engines, including AltaVista, from indexing your content and hence could greatly reduce your potential traffic. Dynamically generated pages are created on the fly from a variety of elements held in databases. When the AltaVista crawler arrives at such a page, it captures the content but halts immediately and will not follow the links, because it sees ahead of it an infinite number of pages -- a black hole that would bring it to a crash.
Active Server Pages (.asp) with question marks in their URLs (indicating that the page is a script for the construction of a page, rather than just static content) fall into this category.
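A quick way to spot addresses of this kind on your own site is to look for a query string, that is, anything after a question mark. The following is a minimal sketch; the function name looks_dynamic and the example.com addresses are hypothetical, and a question mark is only the signal described above, not a complete test for dynamically generated content.

```python
from urllib.parse import urlsplit


def looks_dynamic(url):
    """True if the URL carries a query string (a '?' followed by parameters)."""
    return bool(urlsplit(url).query)


# Hypothetical addresses for illustration.
urls = [
    "http://example.com/catalog.asp?id=42",
    "http://example.com/catalog/widgets.html",
]
flagged = [u for u in urls if looks_dynamic(u)]
```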
If you have information inside frames, that will probably prove to be a hindrance, but it is not an absolute barrier. AltaVista indexes the outside of the frame as a distinct page. It will also index each pane of the frame window as a separate page. That means that if the content matching a query is in a pane, visitors clicking on those links will see the pane and only the pane -- not the full page as it was designed. So if you want visitors from search engines to experience your pages the way they were intended to be seen, you should have non-frames as well as frames versions of those pages, and submit the non-frames versions with Add URL.
AltaVista also can't index text that is embedded in graphics. Search engines simply cannot "see" the text unless the Webmaster put ALT text behind the picture, describing it and listing those important words. But pictures, as pictures, can be indexed for Image search at AltaVista.
Text that appears in multi-media files (audio and video) cannot be indexed. But those same files can be indexed at AltaVista for Multimedia search.
Information that is generated by Java applets or in XML coding cannot be indexed. Acrobat files cannot be indexed either. But technology exists that will enable AltaVista to convert those files to indexable form.
Exceptionally large pages also present a problem at AltaVista. As a pragmatic compromise, intended to help optimize the performance of AltaVista, they fully index the first 64 Kbytes of text on any single page. They will harvest the hyperlinks from the whole document for following up later, but they will only index the first 64 Kbytes. So if you want to post an entire book, it's best to break it up into chapters, and then all the text can be indexed.
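If you are unsure whether a long document crosses the 64 Kbytes mark, you can measure it and, if necessary, split it at paragraph boundaries. The sketch below is one possible approach, not an AltaVista tool: the function names are hypothetical, the size is measured on the text as UTF-8 bytes (an assumption; the source does not say how the 64 Kbytes are counted), and a single paragraph larger than the limit is still emitted as its own oversized page.

```python
INDEX_LIMIT = 64 * 1024  # the 64 Kbytes figure quoted above, in bytes


def unindexed_bytes(text, limit=INDEX_LIMIT, encoding="utf-8"):
    """Number of bytes of page text that fall beyond the limit."""
    return max(0, len(text.encode(encoding)) - limit)


def split_into_chunks(paragraphs, limit=INDEX_LIMIT, encoding="utf-8"):
    """Greedily group paragraphs into pages that each stay within the limit."""
    pages, current, size = [], [], 0
    for para in paragraphs:
        para_size = len(para.encode(encoding)) + 1  # +1 for the joining newline
        if current and size + para_size > limit:
            # Adding this paragraph would overflow, so start a new page.
            pages.append("\n".join(current))
            current, size = [], 0
        current.append(para)
        size += para_size
    if current:
        pages.append("\n".join(current))
    return pages


# Hypothetical "book" of three 40,000-character paragraphs.
book = ["x" * 40000] * 3
pages = split_into_chunks(book)
```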
Comments, such as <!--change this every Friday-->, aren't indexed at all. Those are intended as private communications, not viewable by Web site visitors, except by using View/Page Source.
Also, consider technical factors. If a site has a slow connection, the crawler's request might time out. Very complex pages, too, may time out before the crawler can harvest the text. If you have a hierarchy of directories at your site, put the most important information high, not deep. AltaVista will presume that the higher you placed the information, the more important it is. And crawlers may not venture deeper than three or four or five directory levels.
Above all remember the obvious - full-text search engines such as AltaVista index text. You may well be tempted to use fancy and expensive design techniques that either block search engine crawlers or leave your pages with very little plain text that can be indexed.
Ranking Rules
The simple rule of thumb is that content counts, and that content near the top of a page counts for more than content at the end. In particular, the HTML title and the first couple lines of text are the most important part of your pages. If the words and phrases that match a query happen to appear in the HTML title or first couple lines of text of one of your pages, chances are very good that that page will appear high in the list of search results.
AltaVista bases its ranking on both static factors (a computation of the value of a page independent of any particular query) and query-dependent factors.
It values:
• Long pages, which are rich in meaningful text (not randomly generated letters and words).
• Pages that serve as good hubs, with lots of links to pages that have related content (topic similarity, rather than random meaningless links, such as those generated by link exchange programs or intended to generate a false impression of "popularity").
• The connectivity of pages, including not just how many links there are to a page but where the links come from: the number of distinct domains and the "quality" ranking of those particular sites. This is calculated for the site and also for individual pages. A site or a page is "good" if many pages at many different sites point to it, and especially if many "good" sites point to it.
• The level of the directory in which the page is found. Higher is considered more important. If a page is buried too deep, the crawler simply won't go that far and will never find it.
These static factors are recomputed about once a week, and new good pages slowly percolate upward in the rankings. Note that there are advantages to having a simple address and sticking to it, so others can build links to it, and so you know that it's in the index.
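The connectivity factor in the list above depends on how many different sites link to a page, not just on the raw link count. If you have a list of pages that link to you, a rough tally of distinct sources might look like the sketch below. It is an illustration only: the function name and the example addresses are hypothetical, hostnames are used as a stand-in for "domains", and nothing here reproduces AltaVista's own "quality" calculation.

```python
from urllib.parse import urlsplit


def distinct_linking_hosts(inbound_links, own_host):
    """Set of external hostnames that link to the page."""
    hosts = set()
    for url in inbound_links:
        host = urlsplit(url).hostname
        # Ignore links from your own site and malformed addresses.
        if host and host != own_host:
            hosts.add(host)
    return hosts


# Hypothetical list of pages that link to www.example.com.
inbound = [
    "http://www.example.com/sitemap.html",
    "http://reviews.example.org/widgets.html",
    "http://reviews.example.org/gadgets.html",
    "http://news.example.net/story.html",
]
hosts = distinct_linking_hosts(inbound, "www.example.com")
```

Four links come in, but only two distinct outside sites are behind them.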
Query-dependent factors include:
• The HTML title.
• The first lines of text.
• Query words and phrases appearing early in a page rather than late.
• Meta tags, which are treated as ordinary words in the text, but like words that appear early in the text (unless the meta tags are patently unrelated to the content on the page itself, in which case the page will be penalized).
• Words mentioned in the "anchor" text associated with hyperlinks to your pages.
(E.g., if lots of good sites link to your site with anchor text "breast cancer" and the query is "breast cancer," chances are good that you will appear high in the list of matches.)
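To make the idea of "title and early text count for more" concrete, here is a toy scoring function. It is emphatically not AltaVista's algorithm, which is not published in this guide: the function name, the title weight of 3.0, and the position formula are all invented for illustration, and punctuation, phrases, meta tags, and anchor text are ignored.

```python
def toy_score(query, title, body, title_weight=3.0):
    """Toy relevance score: title hits plus position-weighted body hits."""
    terms = query.lower().split()
    title_words = title.lower().split()
    body_words = body.lower().split()
    score = 0.0
    for term in terms:
        if term in title_words:
            score += title_weight  # invented bonus for a title match
        if term in body_words:
            first = body_words.index(term)
            # An earlier first occurrence earns a value closer to 1.0.
            score += 1.0 - first / len(body_words)
    return score


# Two hypothetical pages competing for the query "blue widgets".
score_a = toy_score("blue widgets", "Blue Widgets", "blue widgets for sale today")
score_b = toy_score("blue widgets", "Our Shop",
                    "today we sell many things including blue widgets")
```

The first page, with the query words in its title and opening text, scores well above the second, where they appear only at the end.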
AltaVista's policy on doorway pages and cloaking
AltaVista is opposed to doorway pages and cloaking. It considers doorway and cloaked pages to be spam and encourages people to use other avenues to increase the relevancy of their pages. A description of doorway pages and cloaking is given later on in this guide.
Meta tags
Though it indexes Meta tags, considering them to be regular text, AltaVista claims it doesn't give them priority over HTML titles and other text. Though you should use meta tags in all your pages, some webmasters claim their doorway pages for AltaVista rank better when they don't use them.
If you do use Meta tags, make your description tag no more than 150 characters and your keywords tag no more than 1,024 characters long.
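A small script can check these two lengths across your pages. The sketch below uses Python's standard html.parser module; the class and function names are hypothetical, and the limits are simply the 150 and 1,024 character figures recommended above.

```python
from html.parser import HTMLParser

# Recommended maximum lengths, in characters, as given above.
LIMITS = {"description": 150, "keywords": 1024}


class MetaLengthChecker(HTMLParser):
    """Records description/keywords meta tags whose content is too long."""

    def __init__(self):
        super().__init__()
        self.too_long = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)  # the parser lowercases tag and attribute names
        name = (attr.get("name") or "").lower()
        content = attr.get("content") or ""
        if name in LIMITS and len(content) > LIMITS[name]:
            self.too_long[name] = len(content)


def check_meta_lengths(html):
    """Return {tag name: actual length} for every meta tag over its limit."""
    checker = MetaLengthChecker()
    checker.feed(html)
    return checker.too_long


# Hypothetical page head with an overlong description.
head = ('<head><META NAME="description" CONTENT="' + "a" * 200 + '">'
        '<META NAME="keywords" CONTENT="widgets, blue widgets"></head>')
problems = check_meta_lengths(head)
```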
Keywords in the URL and file names
It's generally believed that AltaVista gives some weight to keywords in filenames and URL names. If you're creating a file, try to name it with keywords.
Keywords in the ALT tags
AltaVista indexes ALT tags, so if you use images on your site, make sure to add them. ALT tags should contain more than the image's description. They should include keywords, especially if the image is at the top of the page. ALT tags are explained later.
Page Length
There's been some debate about how long doorway pages for AltaVista should be. Some webmasters say short pages rank higher, while others argue that long pages are the way to go. According to AltaVista's help section, it prefers long and informative pages. We've found that pages with 600-900 words are most likely to rank well.
Frame support
AltaVista has the ability to index frames, but it sometimes indexes and links to pages intended only as navigation. To keep this from happening to you, submit a frame-free site map containing the pages that you want indexed. You may also want to include a "robots.txt" file to prohibit AltaVista from indexing certain pages.
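Before relying on a "robots.txt" file, it is worth confirming that its rules say what you intend. Python's standard urllib.robotparser module can evaluate a rule set offline. In the sketch below the file contents, the /nav/ directory, and the example.com addresses are hypothetical, and it is an assumption that the crawler identifies itself with the token "Scooter" in the User-agent line.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep the crawler out of navigation-only pages.
robots_txt = """\
User-agent: Scooter
Disallow: /nav/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Navigation frames are blocked; ordinary content pages are not.
nav_allowed = parser.can_fetch("Scooter", "http://example.com/nav/menu.html")
content_allowed = parser.can_fetch("Scooter", "http://example.com/articles/widgets.html")
```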
AltaVista’s Search Features
AltaVista offers a wide range of search features. Most of these options are available in its "Advanced Search" section.
• Boolean search - Full Boolean Search support. AND (+), OR, AND NOT (-) (instead of NOT) Search terms can be nested.
• Phrase - Available. Put quotes around the phrase, such as "New York Times"
• Proximity - Available. NEAR operator means within ten words of one another. Can be nested with other tags.
• Word Stemming - Available. You can use the wild card (*) at the end or in the middle of a word.
• Capitalization - If you search in upper case, AltaVista searches in upper case only. Lower case words and phrases search for upper and lower case, and will therefore yield more results.
• Field Search - The following options are available:
o Applet: searches for the name of an applet
o Domain: specifies the domain extension, such as .com
o Host: searches for pages within a particular site
o Image: searches for an image name
o Link: searches for pages that link to the specified site
o Object: searches for the name of an object
o Text: excludes Meta tags information
o Title: search in the HTML title only
o URL: searches for sites that have a specified word in the URL
• Date Searching - Available under Advanced Search section.
• Search within results - Available. This option is offered after each search.
• Media Type searching - Available for Images, Music/MP3, and Video.
• Language Searching - AltaVista has very extensive language support. It supports around 30 languages.
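The proximity rule in the list above (NEAR means within ten words of one another) is easy to picture with a short function. This is only a sketch of the idea, not AltaVista's implementation: the function name near is hypothetical, words are split on whitespace with basic punctuation stripped, and the comparison is case-insensitive, which may differ from the capitalization behavior described above.

```python
def near(text, first, second, window=10):
    """True if the two words occur within `window` words of each other."""
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= window
               for i in first_positions
               for j in second_positions)


# Hypothetical sentence: "index" and "relevance" are seven words apart.
sentence = "Search engines index text and rank each page by relevance"
within_ten = near(sentence, "index", "relevance")
within_three = near(sentence, "index", "relevance", window=3)
```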
2.3 How Teoma Works
Teoma adds a new dimension and level of authority to search results through its approach, known as Subject-Specific Popularity(SM).
Instead of ranking results based upon the sites with the most links leading to them, Teoma analyzes the Web as it is organically organized—in naturally-occurring communities that are about or related to the same subject—to determine which sites are most relevant. Teoma’s search technology can locate communities on the Web within their specific subject areas, as they actually exist.
To determine the authority of a site's content, and thus its overall quality and relevance, Teoma uses Subject-Specific Popularity(SM). Subject-Specific Popularity ranks a site based on the number of same-subject pages that reference it, not just general popularity. In a recent test performed by Search Engine Watch, Teoma's relevance grade was raised to an "A" following the integration of Teoma 2.0.
Teoma 2.0: Evolution and Growth
In early 2003, Teoma 2.0 was launched. The enhanced version represents a major evolution in terms of improvements to relevance and an expansion of the overall advanced search functionalities. Below are detailed explanations for the improvements made in this version:
More Communities
Like social networks in the real world, the Web is clustered into local communities. Communities are groups of Web pages that are about or are closely related to the same subject. Teoma is the only search technology that can view these communities as they naturally occur on the Web (displayed under the heading "Refine" on Teoma.com). This method allows Teoma to generate more finely tuned search results. In other words, Teoma's community-based approach reveals a 3-D image of the Web, providing it with more information about a particular Web page than other search engines, which have only a one-dimensional view of the Web.
Web-Based Spell Check
Teoma's proprietary Spell Check technology identifies query misspellings and offers corrections that help improve the relevance and precision of search results. The Spell Check technology, developed by Teoma's team of scientists, leverages the real-time content of the Web to determine the correct spelling of a word.
Dynamic Descriptions(SM)
Dynamic Descriptions enhance search results by showing the context of search terms as they actually appear on referring Web pages. This feature provides searchers with information that helps them to determine the relevance of a given Web page in association with their query.
Advanced Search Tools
Teoma's Advanced Search tools allow searchers to search using specific criteria, such as exact phrase, page location, geographic region, domain and site, date, and other word filters. Users can also search using 10 Western languages, including Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Spanish and Swedish. A link to Teoma's Advanced Search tools can be found next to the search box on Teoma.com.
The Teoma Algorithm
In addition to utilizing existing search techniques, Teoma applies what it calls authority, a new measure of relevance, to deliver search results. For this purpose, Teoma employs three proprietary techniques: Refine, Results and Resources.
Refine
First, Teoma organizes sites into naturally occurring communities that are about the subject of each search query. These communities are presented under the heading "Refine" on the Teoma.com results page. This tool allows a user to further focus his or her specific search.
For example, a search for "Soprano" would present a user with a set of refinement suggestions such as "Marie-Adele McArther" (a renowned soprano), "Three Sopranos" (the operatic trio), "The Sopranos" (the wildly-popular HBO television show) as well as several other choices. No other technology can dynamically cluster search results into the actual communities as they exist on the Web.
Results
Next, after identifying these communities, Teoma employs a technique called Subject-Specific Popularity(SM). Subject-Specific Popularity analyzes the relationship of sites within a community, ranking a site based on the number of same-subject pages that reference it, among hundreds of other criteria. In other words, Teoma determines the best answer for a search by asking experts within a specific subject community about who they believe is the best resource for that subject. By assessing the opinions of a site's peers, Teoma establishes authority for the search result. Relevant search results ranked by Subject-Specific Popularity are presented under the heading "Results" on the Teoma.com results page.
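The difference between general popularity and same-subject popularity can be shown with a toy count. Teoma's techniques are proprietary, so the sketch below is not its algorithm: the function name, the page names, and the hand-assigned subject labels are all hypothetical, and a real system would have to discover the communities itself rather than be told them.

```python
def link_counts(links, subject_of, target):
    """Return (all inbound links, inbound links from same-subject pages)."""
    subject = subject_of[target]
    inbound = [src for src, dst in links if dst == target and src != target]
    same_subject = [src for src in inbound if subject_of.get(src) == subject]
    return len(inbound), len(same_subject)


# Hypothetical pages with hand-assigned subjects.
subject_of = {
    "soprano-bio": "opera",
    "opera-guide": "opera",
    "aria-fans": "opera",
    "tv-news": "television",
    "gossip": "television",
}
# Each pair is (linking page, linked-to page).
links = [
    ("opera-guide", "soprano-bio"),
    ("aria-fans", "soprano-bio"),
    ("tv-news", "soprano-bio"),
    ("gossip", "soprano-bio"),
]
total, from_peers = link_counts(links, subject_of, "soprano-bio")
```

Four pages link to the biography, but only two of them belong to the same subject community; under the approach described above, it is the peer links that establish authority.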
In some instances companies pay to have their Web sites included within Teoma's dataset, otherwise known as the Teoma Index. Like all Web sites, these sites are processed through Teoma's search algorithms and are not guaranteed placement in the results. This ensures that relevancy is the primary driver of results.
Resources
Finally, by dividing the Web into local subject communities, Teoma is able to find and identify expert resources about a particular subject. These sites feature lists of other authoritative sites and links relating to the search topic.
For example, a professor of Middle Eastern history may have created a page devoted to his collection of sites that explain the geography and topography of the Persian Gulf. This site would appear under the heading "Resources" in response to a Persian Gulf-related query. No previous search technology has been able to find and rank these sites.
Sponsored Links
Search results appearing under the heading "Sponsored Links" are provided by Google®, a third party provider of pay for performance search listings. Google generates highly relevant sponsored results by allowing advertisers to bid for placement in this area based on relevant keywords. These results, which are powered by Google's advanced algorithms, are then distributed across the Internet to some of the world's most popular and well-known Web sites, including Teoma.com and Ask Jeeves.
Other factors
Boolean Searching
Limited Boolean searching is available. Teoma defaults to an AND between search terms and supports the use of - for NOT. Either OR or ORR can be used for an OR operation, but the operator must be in all upper case. Unfortunately, no nesting is available.
Proximity Searching
Phrase searching is available by using “double quotes” around a phrase or by checking the "Phrase Match" box. Teoma also supports phrase searching when a dash is used between words with no spaces. Until Nov. 2002, Teoma's help page stated that "Teoma returns results which exactly or closely matches the given phrase," which meant that not all phrase matches would necessarily be accurate. As of Nov. 2002, that appears to have been corrected and phrase searching now works properly.
Truncation
No truncation is currently available.
Case Sensitivity
Searches are not case sensitive. Search terms entered in lowercase, uppercase, or mixed case all get the same number of hits.
Stop Words
Teoma does ignore frequently-occurring words such as 'the,' 'of', 'and', and 'or'. However, like at Google, these stop words can be searched by putting a + in front of them or by including them within a phrase search.
Sorting
By default, sites are sorted in order of perceived relevance. Teoma also uses site collapsing (showing only two pages per site, with the rest linked via a “More Results” message). There is no option for sorting alphabetically, by site, or by date.
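Site collapsing of the kind described above can be sketched as a single pass over the ranked list. This is an illustration of the general idea rather than Teoma's code: the function name and the example addresses are hypothetical, and results are grouped by hostname, which is an assumption about what counts as a "site".

```python
from collections import defaultdict
from urllib.parse import urlsplit


def collapse_by_site(ranked_urls, per_site=2):
    """Keep the first `per_site` results from each host, preserving rank order."""
    shown = []
    hidden = defaultdict(list)  # what a "More Results" link would reveal
    counts = defaultdict(int)
    for url in ranked_urls:
        host = urlsplit(url).hostname
        if counts[host] < per_site:
            shown.append(url)
            counts[host] += 1
        else:
            hidden[host].append(url)
    return shown, dict(hidden)


# Hypothetical ranked results, best first.
results = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://a.example/3",
]
shown, hidden = collapse_by_site(results)
```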
Display
Teoma displays the title (roughly the first 60 characters), a two-line keyword-in-context extract from the page, and the beginning of the URL for each hit. Some will also have a link to "Related Pages," which finds related records based on identifying Web communities by analyzing link patterns. Two other sections displayed are the "Refine" section (formerly folders), which suggests other related searches based on words that Teoma uses to identify communities on the Web, and the "Resources: Link collections from experts and enthusiasts" section (formerly "Experts' Links"), which lists Web pages that include numerous links to external resources -- metasites or Internet resource guides. Some "Sponsored Links" may show up at the top. These are ads from the Google AdWords program.
Teoma will only display 10 Web page records at a time; however, up to 100 at a time can be displayed through a change in the preferences and on the advanced search page. Teoma may also display up to 10 metasites under the "Resources" heading and up to 6 Refine suggestions.
2.4 How Inktomi Works
Inktomi is one of the most popular crawler-based search engines. However, unlike other crawler-based search engines such as Lycos or Alltheweb, it does not make its index available to the public through its own site. Instead, Inktomi licenses other companies to use its search index. These companies are then able to provide search services to their visitors without having to build their own index.
It uses a robot named Slurp to crawl and index web pages.
Slurp – The Inktomi Robot
Slurp collects documents from the web to build a searchable index for search services using the Inktomi search engine, including Microsoft and HotBot. Some of the characteristics of Slurp are given below:
Frequency of accesses
Slurp accesses a given website no more than once every five seconds on average. Since network delays are involved, it is possible that over short periods the rate will appear to be slightly higher, but the average frequency generally remains below once per minute.
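A crawler can keep to a five-second spacing by remembering, for each host, the earliest moment it may send its next request. The class below is a minimal sketch of that bookkeeping, not Slurp's actual scheduler: the class name, method name, and example addresses are hypothetical, and the current time is passed in as a plain number so the logic can be followed without real waiting.

```python
from urllib.parse import urlsplit


class PolitenessScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_interval=5.0):
        self.min_interval = min_interval  # seconds between hits on one host
        self.next_allowed = {}

    def wait_time(self, url, now):
        """Seconds to wait before fetching url; also books the fetch."""
        host = urlsplit(url).hostname
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_interval
        return start - now


scheduler = PolitenessScheduler()
first = scheduler.wait_time("http://example.com/a", now=100.0)
second = scheduler.wait_time("http://example.com/b", now=101.0)
other = scheduler.wait_time("http://other.example/", now=101.0)
```

The second request to the same host has to wait four seconds, while a request to a different host can go out at once.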
robots.txt
Slurp obeys the Robots Exclusion Standard (RES). Specifically, it adheres to the 1994 version of the standard. Where the 1996 proposed standard disambiguates the 1994 standard, the proposed standard is followed.
Slurp will obey the first record in the robots.txt file with a User-Agent containing "Slurp". If there is no such record, it will obey the first entry with a User-Agent of "*".
This is discussed in detail later in this book.
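The record-selection rule just described (the first record whose User-Agent contains "Slurp", otherwise the first record for "*") can be written out as follows. This is a simplified sketch for checking your own file, not Inktomi's parser: the function names and the sample file are hypothetical, records are assumed to be separated by blank lines, and the match on "Slurp" is made case-insensitive, which is an assumption.

```python
def parse_records(robots_txt):
    """Split robots.txt text into records of (user_agents, rule_lines)."""
    records, agents, rules = [], [], []
    for raw in robots_txt.splitlines():
        if not raw.strip():
            # A blank line ends the current record.
            if agents:
                records.append((agents, rules))
            agents, rules = [], []
            continue
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # comment-only line, not a record boundary
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value)
        elif agents:
            rules.append((field, value))
    if agents:
        records.append((agents, rules))
    return records


def select_record(records, robot_name="Slurp"):
    """First record naming the robot, else the first '*' record, else None."""
    for agents, rules in records:
        if any(robot_name.lower() in agent.lower() for agent in agents):
            return rules
    for agents, rules in records:
        if "*" in agents:
            return rules
    return None


# Hypothetical robots.txt with a general record and a Slurp record.
sample = "User-agent: *\nDisallow: /private/\n\nUser-agent: Slurp\nDisallow: /cgi-bin/\n"
records = parse_records(sample)
slurp_rules = select_record(records)
fallback_rules = select_record(records, "Scooter")
```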
NOINDEX meta-tag
Slurp obeys the NOINDEX meta-tag. If you place
<META NAME="robots" CONTENT="noindex">
in the head of your web document, Slurp will retrieve the document, but it will not index the document or place it in the search engine's database.
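To confirm that a page really carries the tag, you can scan its HTML for a robots meta tag. The sketch below uses Python's standard html.parser module; the class and function names are hypothetical, and it checks only for the presence of the directive, not for how any particular crawler will act on it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives listed in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)  # the parser lowercases tag and attribute names
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # The content may list several comma-separated directives.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def is_noindex(html):
    """True if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


blocked = is_noindex('<head><META NAME="robots" CONTENT="noindex"></head>')
open_page = is_noindex('<head><meta name="robots" content="index, follow"></head>')
```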
Repeat downloads
In general, Slurp downloads only one copy of each file from your site during a given crawl. Occasionally the crawler is stopped and restarted, and it re-crawls pages it has recently retrieved. These re-crawls happen infrequently and should not be any cause for alarm.
Searching the results
Documents that Slurp crawls are passed from websites to the Inktomi search engines immediately. There they are indexed and entered into the search database quickly.
Following links
Slurp follows HREF links. It does not follow SRC links. This means that Slurp does not retrieve or index individual frames referred to by SRC links.
Dynamic links
Slurp has the ability to crawl dynamic links or dynamically generated documents. It will not, however, crawl them by default. There are a number of good reasons for this. A couple of reasons are that dynamically generated documents can make up infinite URL spaces, and that dynamically generated links and documents can be different for every retrieval so there is no use in indexing them.
Content guidelines for Inktomi
Given here are the content guidelines and policies for Inktomi. In other words, listed below is the content Inktomi indexes and the content it avoids.
Inktomi indexes:
• Original and unique content of genuine value
• Pages designed primarily for humans, with search engine considerations secondary
• Hyperlinks intended to help people find interesting, related content, when applicable
• Metadata (including title and description) that accurately describes the contents of a Web page
• Good Web design in general
Inktomi avoids:
• Pages that harm accuracy, diversity or relevance of search results
• Pages dedicated to directing the user to another page
• Pages that have substantially the same content as other pages
• Sites with numerous, unnecessary virtual hostnames
• Pages in great quantity, automatically generated or of little value
• Pages using methods to artificially inflate search engine ranking
• The use of text that is hidden from the user
• Pages that give the search engine different content than what the end-user sees
• Excessively cross-linking sites to inflate a site's apparent popularity
• Pages built primarily for the search engines
• Misuse of competitor names
• Multiple sites offering the same content
• Pages that use excessive pop-ups, interfering with user navigation
• Pages that seem deceptive, fraudulent or provide a poor user experience
Inktomi's policies are designed to ensure that poor-quality pages do not degrade the user experience in any way. As with Inktomi's other guidelines, Inktomi reserves the right, at its sole discretion, to take any and all action it deems appropriate to ensure the quality of its index.
Inktomi encourages Web designers to focus most of their energy on the content of the pages themselves. They like to see truly original text content, intended to be of value to the public. The search engine algorithm is sophisticated and is designed to match the regular text in Web pages to search queries. Therefore, no special treatment needs to be done to the text in the pages.
They do not guarantee that your web page will appear at the top of the search results for any particular keyword.
How does Inktomi rank web pages?
Inktomi search results are ranked based on a combination of how well the page contents match the search query and on how "important" the page is, based on its appearance as a reference in other web pages.
The quality of match to the query terms is not just a simple text string match, but a text analysis that examines the relationships and context of the words in the document. The query match considers the full text content of the page and the content of the pages that link to it when determining how well the page matches a query.
Here are a few tips that can make sure your page can be found by a focused search on the Internet:
• Think carefully about key terms that your users will search on, and use those terms to construct your page.
• Documents are ranked higher if the matching search terms are in the title. Users are also more likely to click a link if the title matches what they're looking for. Choose terms for the title that match the concept of your document.
• Use a "description" meta-tag and write your description carefully. After a title, users click on a link because the description draws them in. Placing high in search results does little good if the document title and description do not attract interest.
• Use a "keyword" meta-tag to list key words for the document. Use a distinct list of keywords for each page on your site instead of using one broad set of keywords on every page. (Keywords do not have much effect on ranking, but they do have an effect.)
• Keep relevant text and links in HTML. Placing them in graphics or image maps means search engines can't search for the text and the crawler can't follow links to your site's other pages. An HTML site map, with a link from your welcome page, can help make sure all your pages are crawled.
• Use ALT text for graphics. It's good page design to accommodate text browsers or visually impaired visitors, and it helps improve the text content of your page for search purposes.
• Correspond with webmasters and other content providers and build rich linkages between related pages. Note: "Link farms" create links between unrelated pages for no reason except to increase page link counts. Using link farms violates Inktomi content guidelines, and will not improve your page ranking.
Inktomi’s Spamming Policies
Sites that violate the Inktomi content guidelines are considered spam and may be removed from the index. Inktomi treats techniques such as tiny text, invisible text, keyword stuffing, doorway pages, and fake links as spam.
Pages with no unique text or no text at all may drop out of the index or may never be indexed. If you want a page to appear in web search results, be sure that page includes some unique text content to be indexed.
Inktomi does, however, index dynamic pages. For page discovery, Inktomi mostly follows static links, so it is recommended that you avoid dynamically generated href links except in directories disallowed by a /robots.txt exclusion rule.
Spamming includes:
• Embedding deceptive text in the body of web documents.
• Creating metadata that does not accurately describe the content of web documents.
• Fabricating URLs that redirect to other URLs for no legitimate purpose.
• Web documents with intentionally misleading links.
• Cloaking/doorway pages that feed Inktomi crawlers content that is not reflective of the actual page.
• Creating inbound links for the sole purpose of boosting the popularity score of the URL.
• The misuse of third party affiliate or referral programs.
Click popularity measurement
As mentioned earlier, Inktomi measures the click popularity of web pages when deciding their rank. Click popularity reflects the number of times surfers click on your web page's listing and how long they stay on your site.
The number of clicks on your site's listing can be improved by making good use of the title and the Meta tags. These two tags not only help you attain a high rank in the search engines, but can also be used to write good marketing text about your site. The text in the title and Meta description tag appears in the hyperlink listings on the search engine results page. If the text is attractive to net surfers, the chances of getting more clicks are greater.
Another factor which decides the click popularity factor of your web site is the time that the visitors spend in your site. The secret behind retaining visitors in your web site is the content of your site. Informative and useful content relevant to the search terms will help to retain visitors to your site and make them come back again.
Inktomi’s Partner sites
Inktomi provides search results to many search sites. The different search portals may also use results from other information sources, so not all of their results come from the Inktomi search database. These search portals also apply different selection or ranking constraints to their search requests, so Inktomi results at different portals may not be the same.
Following are Inktomi’s partner sites:
http://www.about.com/
http://www.bbc.co.uk/
http://www.bluewin.ch/
http://www.blueyonder.co.uk/
http://www.espotting.com/
http://www.goo.ne.jp/
http://www.hotbot.com
http://www.hotbot.co.uk/
http://www.looksmart.com
http://search.msn.com/
http://www.overture.com/
http://www.soneraplaza.fi/
http://www.tocc.co.jp/search/
http://www.wp.pl/
