A Short Tutorial on Website Submissions & Backward Linking
Submitting your site to the search engines is a pain, make no mistake. But with a top twenty listing on Webcrawler alone capable of bringing 50 visitors a day to your pages, it's a pain that is worth suffering.
There are a great number of search engines around, but only the top 15 or so really generate serious traffic for most sites. Many of the lesser search engines are 'meta search' types, meaning they actually get their results from the bigger engines anyway.
Of the main engines, there are two categories: Directory based and Spider based. Sites in the Directory based engines (which include Yahoo, Looksmart and the Open Directory Project) are generally added by hand. This means an editor for the chosen category looks at your site and awards it a position (or not) based on how he or she rates it. The Spider based engines (including Lycos, Inktomi and AltaVista) use robot browsers to check and index sites based on pre-programmed criteria.
In general, getting a good listing in the Directory engines means you must impress the editor with the look, ease of navigation and content of your site. Make sure your first page loads very quickly (empty your browser cache before checking) as this is an important factor. If your 'front' page doesn't open within 10 - 15 seconds you will not get a high placement. If need be, create a simple 'Welcome' page that opens quickly and use it as a front door to your site, with the more graphic laden pages within it.
With Directory engines, if you have midis on your page be sure you include an "off" control. You will be penalized if not. Don't overdo images or clutter - the editors read hundreds of these sites a day and are likely to have a lot in the queue behind you. They want to judge your site rather than your taste in art, cartoons and music selections.
Spider based engines are more predictable. Spiders scan your pages looking for your keywords, count the number of times each keyword occurs throughout the page, and measure that count against the overall length of your text to calculate how relevant your site is to the keyword.
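As a rough illustration of the counting described above, the Python sketch below counts how often a keyword phrase occurs in a page's text and divides by the total number of words. The function names and the formula itself are assumptions made for this example; the engines do not publish their actual scoring rules.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_density(text, keyword):
    """Return (occurrences, occurrences / total words) for a keyword phrase.

    This ratio is only an assumed stand-in for whatever a real spider computes.
    """
    words = tokenize(text)
    phrase = tokenize(keyword)
    if not words or not phrase:
        return 0, 0.0
    n = len(phrase)
    # Slide a window of the phrase length over the page's words.
    hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == phrase)
    return hits, hits / len(words)


hits, density = keyword_density(
    "Cheap flights to Rome. Book cheap flights today.", "cheap flights"
)
print(hits, density)
```

A long page that mentions the phrase only once scores lower under this measure than a short page that repeats it, which is the effect the paragraph above describes.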
Your keywords need to be set in the keywords meta tag, should also be included in your description, and should occur AT LEAST once in the first 200 characters of text in your page. For this reason it is wise not to target too many keywords on a single page; try to pick simple word-pairs. Make sure you pick phrases or words that you will repeat several times in the actual text of your page, and that they describe your site to a viewer, not to a robot.
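The three placement checks above can be tried out with a short Python sketch using the standard html.parser module. The class name, function name and sample page are hypothetical, and the check only looks for the keyword as a plain substring, which is an assumption rather than how any particular spider works.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collects the keywords/description meta tags and the visible text."""

    SKIPPED = ("script", "style", "title")

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attributes.get("content") or ""
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore text inside <script>, <style> and <title>.
        if not self._skip_depth:
            self.text_parts.append(data)


def check_keyword(html, keyword, limit=200):
    """Report where the keyword appears: meta tags and the first `limit` characters."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    text = " ".join(" ".join(scanner.text_parts).split())
    kw = keyword.lower()
    return {
        "in_keywords_meta": kw in scanner.meta.get("keywords", "").lower(),
        "in_description": kw in scanner.meta.get("description", "").lower(),
        "in_first_chars": kw in text[:limit].lower(),
    }


SAMPLE_PAGE = """<html><head><title>Rome Travel</title>
<meta name="keywords" content="cheap flights, rome hotels">
<meta name="description" content="Find cheap flights to Rome.">
</head><body><h1>Welcome</h1><p>We list cheap flights daily.</p></body></html>"""

print(check_keyword(SAMPLE_PAGE, "cheap flights"))
```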
For the same reasons of relevance to keywords, try to stick to one specific topic per page. If you deal with two different topics then you risk only 50% of the page being deemed relevant to either topic.
Assuming your site is now optimized to be rated by the engines you need to start submitting. There are some tricks to this too. Firstly, submitting to FFA (Free For All) Links pages is a good idea. Most spiders will place your site far more highly if they have found lots of other sites linking to you. Once you submit to any Spidered Search Engine you should add your site to as many FFA pages as you can find, once per day, for about two weeks. You will get a lot of junk mail in response to these submissions - one from each site generally, but most have exactly the same text in them so you can set your mail program to automatically delete them by using filter settings.
The best value listings of all are the Inktomi database, which is used by Yahoo, Hotbot, ICQit and many others, and the ODP (Open Directory Project), which is used by a list of engines too long to even consider listing. Getting into the ODP database is now the only way to get listed in the AOL net search. AOL uses spiders to index sites found in the ODP directory, so you can see why it is so important to be listed.
Top Five Directory Databases
(1) Open Directory
(2) Snap
(3) Yahoo! Web Sites (Inktomi)
(4) Yahoo! Directory
(5) LookSmart
Normally, you will submit to these directories one time for a given web site. It is extremely important that you submit to them correctly the first time and choose the very best category. Be aware that it's often difficult to get a directory to change your listing later unless you send them a letter of explanation.
Here's where it gets complicated:
(6) HotBot: Submitting to HotBot (an engine) will get you listed in all Inktomi based engines.
(7) Netscape: Netscape draws its results from Open Directory first. If no matches are found there, it then searches (8) Google. Therefore, we submit to Open Directory (a directory) and Google (an engine) to become listed on Netscape.
(9) AOL Search: AOL will search Open Directory listings first. After those matches are displayed, it will then draw results from Inktomi. Again, we submit to Open Directory using our directory guide, and to HotBot to be fully listed in AOL Search.
(10) Magellan: Submitting to Excite will get you listed here.
(11) Excite: Submitting to Excite will also get you listed in Magellan. They pull from each other's database.
(12) MSN: MSN draws results from LookSmart.com first, then displays further matches from (13) AltaVista. Therefore, we submit to LookSmart (www.looksmart.com, a directory) and to AltaVista to be found in MSN.
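The relationships above can be written down as a small lookup table, sketched here in Python. The table only restates what this tutorial says about who feeds whom, so treat it as a snapshot that may well be out of date; the names FEEDS and submissions_needed are made up for the example.

```python
# Where to submit in order to appear in each engine, as described in this tutorial.
FEEDS = {
    "HotBot": ["HotBot"],
    "Netscape": ["Open Directory", "Google"],
    "AOL Search": ["Open Directory", "HotBot"],
    "Magellan": ["Excite"],
    "Excite": ["Excite"],
    "MSN": ["LookSmart", "AltaVista"],
}


def submissions_needed(targets):
    """Return the de-duplicated list of places to submit to, in first-seen order."""
    needed = []
    for target in targets:
        # Engines not in the table are assumed to take direct submissions.
        for source in FEEDS.get(target, [target]):
            if source not in needed:
                needed.append(source)
    return needed


print(submissions_needed(["Netscape", "AOL Search", "MSN"]))
```

Note how Open Directory appears only once in the result even though both Netscape and AOL Search draw from it.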
Additionally, there are very popular search engines to submit to that can be found on many web sites using search boxes.
(14) Infoseek/Go Network
(15) Lycos
(16) Planet Search
(17) WebCrawler
(18) What-U-Seek
(19) WhatsNu
(20) Northern Light
I hope the information given will help you get your sites placed in the search engines. So give some thought to your keywords, check your load times, make sure your site is easy to navigate, and get submitting. To learn more about how search engines rate sites, and for other tips, I would recommend visiting http://www.searchenginewatch.com
----------------------------
Resubmissions are just as important
-----------------------------
After all the hard work you've done to submit your site, don't neglect it by hoping you will stay on top - YOU WON'T! You must resubmit your site at least every 4 weeks. There are some great programs and sites out there that can submit your site for you, so be sure to bookmark at least 3 sites if you don't have submitting software.
I use a calendar here to remind me when I am due to resubmit. But for beginners the easiest thing to do is submit on the 1st of every month. Your site will remain fresh in the databases and show up as a current submission. With the information above you are now armed, so you know which sites will give you the best results.
Without question you NEED to get listed in the ODP site. The evaluator may be your competitor and may not register your site in his section, but this may not be the case with other link categories that follow the same theme. You want to be listed in as MANY categories as you can (what's the worst that can happen? They say no?), so submit to each available listing in the main category.
----------------------------
Linking to other sites
-----------------------------
Who should you try to get to link to you? In simple terms, anyone and everyone. In more practical terms, however, it depends on the effort involved and what you have to give up in return. Exchanging a banner ad on your site for a link from a high quality busy site may be extremely valuable for attracting new visitors. The banner may, however, redirect your hard sought visitor to the banner site and away from yours.
Banners also tend to clutter a web page, increase the load time and distract your visitors. You can only put up a limited number and then what do you exchange?
When was the last time you cruised a link exchange site looking for somewhere interesting to go? The fact is people look for sites with search engines or by search lists and recommendations at related sites. Link exchanges will produce few if any visitors, and those they do produce will probably not be prime candidates for your product or service. That is, unless you run a link exchange site yourself.
Link exchanges are not necessarily worthless, however. There is actually a growing value to having these links to your site. As previously mentioned, many search engines, especially second generation engines, are starting to use site popularity as a way to rank your site. In other words, a site with 100 links to it will rank higher in the search results than a site with only 5 or 6. It is consequently important to establish links to your site, both for the traffic they can generate directly and for the indirect value of improving your search engine ranking.
Although Link Exchanges may be of value, it is better to put your initial efforts into exchanging links with other related sites. Not only will they generate more visitors, the visitors they do generate will be better qualified. What about competitors? Sometimes competitors attract viewers who may not find exactly what they need. If you allow your competition to also have a link, what's the harm? The idea is to attract the viewer. Use link pages if you have a lot of links, and put any Link Exchange banners on your site at the bottom of the page if you do decide to use them. All too often these banners will hang while loading and cause the page itself to hang. Most viewers will not wait long, so play it safe and put the banner at the bottom. This way the viewer can still see your site first and the banner later.
Joining Web Rings related to your topic, if one is available, or creating one if not, is another way to attract viewers. With e-commerce now on the rise, shoppers want to find the best deals, and having all the related competition in one ring makes it easy to find what and where the best deals are. The key to using rings is in the registration of the page: you must use the full URL address of the page that will host the banner for that ring. This is especially important if your site uses frames.
How to Set Up a robots.txt to Control Search Engine Spiders
by Christopher Heng, thesitewizard.com
When I first started writing my first website, I did not really think that I would ever have any reason to create a robots.txt file. After all, did I not want search engine robots to spider and thus index every document in my site? Yet today, all my sites, including thesitewizard.com, have a robots.txt file in their root directory. This article explains why you might also want to include a robots.txt file on your sites, how you can do so, and notes some common mistakes made by new webmasters with regard to the robots.txt file.
For those new to the robots.txt file, it is merely a text file implementing what is known as the Standard for Robot Exclusion. The file is placed in the main directory of a website and advises spiders and other robots which directories or files they should not access. The file is purely advisory - not all spiders bother to read it, let alone heed it. However, most, if not all, of the spiders sent by the major search engines to index your site will read it and take cognizance of the rules contained within the file.
Why is a Robots.txt File Important?
1. It Can Avoid Wastage of Server Resources
At the date of this writing, as far as I know, many of the search engine spiders do not bother to index the scripts on your site (such as your CGI or PHP scripts). However, there are those that do, including one of the major players, Google.
Robots or spiders that index scripts will actually call your scripts just as a browser would, complete with all the special characters. If your site is like mine, where the scripts are solely meant for the use of humans and serve no practical use for a search engine (why should a search engine need to invoke my site-navigation script? - it can just crawl the direct links), you may want to block spiders from the directories that contain your scripts. For example, I block spiders from my CGI-BIN directory. Hopefully, this will reduce the load on the web server by removing unnecessary script executions.
Of course there are the occasional ill-behaved robots that hit your server at high speed. Such spiders can actually bring down your server or at the very least slow it down for the real users who are trying to access it. If you know of any such spiders, you might want to exclude them too. You can do this with a robots.txt file. Unfortunately though, ill-behaved spiders often ignore robots.txt files as well.
2. It Can Save Your Bandwidth
If you look at your website's web logs, you will undoubtedly find many requests for the robots.txt file by various search engine spiders. If, like me, you have a customized 404 document (which loads each time a visitor tries to retrieve a page that does not exist on your site), you will find that the robot will wind up requesting that document instead, if you don't have an existing robots.txt file. My site has a fairly large 404 document, with the result that the spiders wind up loading it repeatedly throughout the day, adding to my already large bandwidth problems. In such a case, having a small robots.txt file may save you some bandwidth (yeah, I know, it's not that much).
Some spiders may also request files which you feel they should not. For example, one search engine requests graphic files (".gif" files) on my sites. Since I see little reason why I should let it index the graphics on my site, waste my bandwidth, and possibly infringe my copyright, I ban it (and in fact all spiders) from my graphic files directory in my robots.txt file.
3. It Removes Clutter from your Web Statistics
I don't know about you, but one of the things I check from my web statistics is the list of URLs that visitors tried to access, but met with a 404 File Not Found Error. Often this tells me if I made a spelling error in one of the internal links on one of my sites (yes, I know - I should have checked all links in the first place, but mistakes do happen).
If you don't have a robots.txt file, you can be sure that /robots.txt is going to feature in your web statistics 404 report, adding clutter and perhaps unnecessarily distracting your attention from the real bad URLs that need your attention.
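To see how much of your 404 report is taken up by /robots.txt, you could tally the 404 responses in your access log. The Python sketch below assumes the log is in the Common Log Format; the sample lines, addresses and names are invented for illustration and your server's format may differ.

```python
import re
from collections import Counter

# Matches the quoted request and the status code of a Common Log Format line.
REQUEST = re.compile(r'"[A-Z]+ (\S+)[^"]*" (\d{3})\b')

# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [01/Jan/2002:00:00:01 +0000] "GET /robots.txt HTTP/1.0" 404 2048',
    '192.0.2.2 - - [01/Jan/2002:00:05:09 +0000] "GET /robots.txt HTTP/1.0" 404 2048',
    '192.0.2.3 - - [01/Jan/2002:00:07:30 +0000] "GET /abuot.html HTTP/1.0" 404 2048',
    '192.0.2.3 - - [01/Jan/2002:00:07:41 +0000] "GET /index.html HTTP/1.0" 200 5120',
]


def not_found_counts(log_lines):
    """Count how many times each requested path was answered with a 404."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST.search(line)
        if match and match.group(2) == "404":
            counts[match.group(1)] += 1
    return counts


counts = not_found_counts(SAMPLE_LOG)
for path, total in counts.most_common():
    print(total, path)
```

In this made-up log the misspelled /abuot.html is the link that really needs fixing, yet /robots.txt tops the list.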
4. Refusing a Robot for Copyright Reasons
Sometimes you don't want a particular spider to index your site because you feel that its search engine infringes on your copyright, or for some other reason. For example, Picsearch (found at http://www.picsearch.com/ ) will download your images and create thumbnail versions of them for people to search. Those thumbnail images will be saved on their web server. If, as a webmaster, you do not want this done, you can actually exclude their spider from indexing your site with a robots.txt directive (the spider apparently obeys the rules in that file).
How to Set Up a Robots.txt File
Writing a robots.txt file could not be easier. It's just an ASCII text file that you place at the root of your domain. For example, if your domain is www.yourdomain.com, you will place the file at www.yourdomain.com/robots.txt.
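Since the file always lives at the root of the host, its address can be worked out from any page URL on the same site. Here is a minimal Python sketch using the standard urllib.parse module; the function name and the sample page path are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop any query or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.yourdomain.com/articles/page.html?id=1"))
```

A file placed anywhere else, such as in a subdirectory, is not where spiders look for it.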
The file basically lists the names of spiders on one line, followed by the list of directories or files they are not allowed to access on subsequent lines, with each directory or file on a separate line. It is possible to use the wildcard character "*" instead of naming specific spiders. When you do so, all spiders are assumed to be named. Note that the robots.txt file is a robots exclusion file (with emphasis on the "exclusion") - there is no way to tell spiders to include any file or directory.
Take the following robots.txt file for example:
User-agent: *
Disallow: /cgi-bin/
The above two lines, when inserted into a robots.txt file, inform all robots (since the wildcard asterisk "*" character was used) that they are not allowed to access anything in the cgi-bin directory and its descendants. That is, they are not allowed to access /cgi-bin/whatever.cgi or even a file or script in a subdirectory of cgi-bin, such as /cgi-bin/anything/whichever.cgi.
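If you want to check how a rule-abiding robot would read those two lines, Python's standard urllib.robotparser module can evaluate them without touching the network. This is only a sketch: the helper name allowed and the robot name "ExampleBot" are made up, and real spiders may interpret the file slightly differently.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /cgi-bin/
"""


def allowed(rules_text, user_agent, path):
    """Return True if the rules permit user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, path)


for path in ("/cgi-bin/whatever.cgi",
             "/cgi-bin/anything/whichever.cgi",
             "/index.html"):
    print(path, allowed(RULES, "ExampleBot", path))
```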
If you have a particular robot in mind, such as the picsearch robot, you may have lines like the following:
User-agent: psbot
Disallow: /
This means that the picsearch robot, "psbot", should not try to access any file in the root directory "/" and all its subdirectories. This effectively means that psbot is banned from the entirety of your website.
You can have multiple Disallow lines for each user agent (i.e., for each spider). Here is an example of a longer robots.txt file:
User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

User-agent: psbot
Disallow: /
The first block of text disallows all spiders from the images directory and the cgi-bin directory. The second block of code disallows the psbot spider from every directory.
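The effect of the two blocks can be checked with Python's standard urllib.robotparser module. This is a sketch of how one parser reads the file, not a guarantee of how every spider behaves; "ExampleBot" and the file paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

User-agent: psbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def report(agent, path):
    """Describe whether the parsed rules let agent fetch path."""
    verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
    return f"{agent} {path}: {verdict}"


print(report("ExampleBot", "/index.html"))
print(report("ExampleBot", "/images/logo.gif"))
print(report("psbot", "/index.html"))
```

An ordinary robot is kept out of the two listed directories only, while psbot is turned away from every page.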
It is possible to exclude a spider from indexing a particular file. For example, if you don't want Google's image search robot to index a particular picture, say, mymugshot.jpg, you can add the following:
User-agent: Googlebot-Image
Disallow: /images/mymugshot.jpg
Remember to add the trailing slash ("/") if you are indicating a directory. If you simply add
User-agent: *
Disallow: /privatedata
the robots will be disallowed from accessing privatedata.html as well as privatedataandstuff.html as well as the directory tree beginning from /privatedata/ (and so on). In other words, there is an implied wildcard character following whatever you list in the Disallow line.
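The implied wildcard amounts to a simple "starts with" test. The Python sketch below shows the difference the trailing slash makes; the function name is made up, and it models only the plain prefix matching described here, not any extensions individual engines may support.

```python
def is_disallowed(path, rule):
    """Return True if path begins with the Disallow value (an empty rule blocks nothing)."""
    return bool(rule) and path.startswith(rule)


paths = ["/privatedata.html", "/privatedataandstuff.html", "/privatedata/notes.html"]

for rule in ("/privatedata", "/privatedata/"):
    blocked = [p for p in paths if is_disallowed(p, rule)]
    print(rule, "blocks", blocked)
```

Without the trailing slash all three paths are blocked; with it, only the file inside the directory is.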
Where Do You Get the Name of the Robots?
If you have a particular spider in mind which you want to block, you have to find out its name. To do this, the best way is to check out the website of the search engine. Respectable engines will usually have a page somewhere that gives you details on how you can prevent their spiders from accessing certain files or directories.
Common Mistakes in Robots.txt
1. It's Not Guaranteed to Work
As mentioned earlier, although the robots.txt format is listed in a document called "A Standard for Robots Exclusion", not all spiders and robots actually bother to heed it. Listing something in your robots.txt is no guarantee that it will be excluded. If you really need to protect something, you should use a .htaccess file (if you are running your site on an Apache server).
2. Don't List Your Secret Directories
Anyone can access your robots file, not just robots. For example, typing http://www.google.com/robots.txt will get you Google's own robots.txt file. I notice that some new webmasters seem to think that they can list their secret directories in their robots.txt file to prevent that directory from being accessed. Far from it. Listing a directory in a robots.txt file often attracts attention to the directory! In fact, some spiders (like certain spammers' email harvesting robots) make it a point to check the robots.txt for excluded directories to spider.
3. Only One Directory/File per Disallow line
Don't try to be smart and put multiple directories on your Disallow line. This will probably not work the way you think, since the Robots Exclusion Standard only provides for one directory per Disallow statement.
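If you keep your excluded paths in a list, a few lines of Python can write them out one per Disallow line so the mistake cannot creep in. The function name is hypothetical and the sketch does no validation of the paths you give it.

```python
def build_record(user_agent, paths):
    """Return one robots.txt record with a separate Disallow line per path."""
    lines = [f"User-agent: {user_agent}"]
    if not paths:
        # An empty Disallow value excludes nothing.
        lines.append("Disallow:")
    lines.extend(f"Disallow: {path}" for path in paths)
    return "\n".join(lines) + "\n"


print(build_record("*", ["/images/", "/cgi-bin/"]))
```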
It's Worth It
Even if you want all your directories to be accessed by spiders, a simple robots file with the following may be useful:
User-agent: *
Disallow:
With no file or directory listed in the Disallow line, you're implying that every directory on your site may be accessed. At the very least, this file will save you a few bytes of bandwidth each time a spider visits your site (or more if your 404 file is large); and it will also remove robots.txt from the bad links report in your web statistics.
Copyright 2001-2002 by Christopher Heng. All rights reserved.
Get more free tips and articles like this, on web design, promotion, revenue and scripting, from http://www.thesitewizard.com/index.htm or subscribe to the FREE newsletter by sending an email to subscribe@thesitewizard.com.