Search Engine Crawlers and Dynamic Web Pages
by © Jerry Yu
There is a good deal of misunderstanding and confusion
in the Search Engine Optimization (SEO) world about how search engines
index dynamic web pages.
It has been claimed that search engine
spiders don't index or crawl dynamic web pages well. This statement is only
half true. A more accurate statement is: search engines don't index or crawl
dynamic web pages well if the page URL contains a "?" character.
Search engines index dynamic web pages very well if the page URL contains
no "?" character.
URLs that contain "?" are called dynamic
URLs.
What web pages are dynamic?
If you know HTML, you know that
the web pages you create normally have a .htm or .html file extension.
These files are static because the HTML code doesn't change on the fly when
requested; the web server sends the file as is, without running any script.
They can be viewed without using a web server.
A web page is said to be dynamic if it
is created with server-side scripting technologies such as PHP, ASP, JSP,
Perl, CGI and so on. These work much like normal programming languages
such as C++ and Java. The major difference is that the page's scripting
code is typically not compiled into a finished page beforehand. It is
processed by the web server on the fly when the page is requested by a
visitor. Dynamic pages can't be viewed without a web server.
When a dynamic page is requested, the web
server first looks at the page's source code. If any server-side scripting
code exists, the server processes it and generates a static HTML result.
When processing of the full page is complete, the web server sends only pure
HTML code to the visitor's browser.
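As a rough illustration of that flow, here is a minimal sketch written in Python rather than PHP or ASP. The function name render_product_page and the PRODUCTS table are hypothetical stand-ins, not part of any real site. The only point is that the code runs on the server and the browser receives plain HTML.

```python
from html import escape

# Hypothetical stand-in for a database table keyed by product id.
PRODUCTS = {12345: "Blue Widget"}


def render_product_page(product_id):
    """Run on the server for each request and return plain HTML."""
    name = PRODUCTS.get(product_id, "Unknown product")
    # The visitor's browser only ever receives this generated HTML,
    # never the script that produced it.
    return f"<html><body><h1>{escape(name)}</h1></body></html>"


if __name__ == "__main__":
    print(render_product_page(12345))
```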
Using scripting languages to create web
pages gives you the power to do nearly anything you want. If the dynamic
page has no "?" character in its URL, search engine spiders treat the page
the same as a normal HTML static page.
Query string parameters
When a "?" character is used, the page's
full URL changes whenever the values after the "?" change. The portion after
the "?" is called the page's query string, and the name=value pairs in it are
the query string parameters, or simply query parameters.
Whenever a parameter value changes, the resulting page can be different.
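To make the terminology concrete, the following sketch splits a URL into its query parameters with Python's standard urllib.parse module. The helper names is_dynamic_url and query_parameters are illustrative choices made for this example, and the "contains a ? character" test simply mirrors the definition used in this article.

```python
from urllib.parse import parse_qsl, urlsplit


def is_dynamic_url(url):
    """Treat a URL as dynamic if it contains a "?" character."""
    return "?" in url


def query_parameters(url):
    """Return the (name, value) pairs found after the "?" in a URL."""
    return parse_qsl(urlsplit(url).query, keep_blank_values=True)


if __name__ == "__main__":
    url = "http://www.examplesite.com/product.asp?id=12345&category=23&page=3"
    print(is_dynamic_url(url))    # True
    print(query_parameters(url))  # [('id', '12345'), ('category', '23'), ('page', '3')]
```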
A page URL can contain more than one query
parameter, with the parameters separated by "&" characters. When this
happens, search engine spiders have a harder time indexing the resulting
page. If the URL has only one parameter, major search engine spiders can
crawl that page well. For example, Google
can index and store a page's URL as http://www.examplesite.com/product.asp?id=12345.
But if the same page's URL is
http://www.examplesite.com/product.asp?id=12345&category=23&page=3
most search engines will not be able to
index it well, even though Googlebot and Yahoo! Slurp may be able to index
it.
(Note: Googlebot is Google's web-crawling
robot. Yahoo! Slurp is Yahoo's web-crawling robot. Search engine robots
collect documents from the web to build a searchable index.)
Yahoo! help says:
"Yahoo! does index dynamic pages, but for
page discovery, our crawler mostly follows static links. We recommend you
avoid using dynamically generated links except in directories that are
not intended to be crawled/indexed (e.g., those should have a /robots.txt
exclusion)."
Google's Webmaster Guidelines:
"If you decide to use dynamic pages (i.e.
the URL contains a "?" character), be aware that not every search engine
spider crawls dynamic pages as well as static pages. It helps to keep the
parameters short and the number of them small."
Let's analyze what Google has stated above.
1. the URL contains a "?" character: this
means Google defines dynamic pages as those whose URLs contain a "?"
character.
2. keep the parameters short: this means
the number of characters in each individual parameter should be small.
Google gives no quantitative limit, but we can check some
web forums for examples. My search engine friendly article (http://www.webactionguide/action-guide/build-site/se-friendly.php)
referenced a black hat SEO discussion thread on Cre8ASiteForums. Its URL
is http://www.cre8asiteforums.com/viewtopic.php?t=8386
This page was crawled by Google. The value
of its query parameter is 4 characters long. There are many other examples on
the internet that have more characters and were crawled successfully. The
maximum number of characters that Google accepts is unknown.
3. keep the number of them small: this
means we should keep the number of parameters in each URL as small as possible.
The above Cre8ASiteForums example has one parameter.
At the least, we can now say that Googlebot is able
to crawl dynamic pages that have one query parameter whose value is
4 characters long.
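Because Google gives no numeric limits, the two quantities it mentions can only be measured, not judged. The sketch below reports the number of parameters and the length of the longest value for a URL. The function name parameter_stats is hypothetical, and the code deliberately applies no threshold because none is published.

```python
from urllib.parse import parse_qsl, urlsplit


def parameter_stats(url):
    """Return (number of parameters, length of the longest value)."""
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    longest = max((len(value) for _, value in pairs), default=0)
    return len(pairs), longest


if __name__ == "__main__":
    # The forum thread discussed above: one parameter, 4-character value.
    print(parameter_stats("http://www.cre8asiteforums.com/viewtopic.php?t=8386"))  # (1, 4)
```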
How do you get your pages crawled if
query parameters are unavoidable?
Query parameters are often used for database
calls that retrieve stored information by primary key from one or more
tables. A Database Management System (DBMS) makes some tedious work easy
to manage. When query parameters must be used on your site, consider building
a site map page and hard coding each page's URL. For example, the previous URL
can be hard coded as
http://www.examplesite.com/product12345-23-3.asp
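If a site map lists many such pages, the hard-coded names can be generated rather than typed by hand. The sketch below is one possible way to do that, assuming the naming scheme shown in the example (parameter values joined with "-" and inserted before the file extension). The function name static_style_url is hypothetical, and the web server still has to be set up to serve a page at the generated address.

```python
import posixpath
from urllib.parse import parse_qsl, urlsplit, urlunsplit


def static_style_url(url):
    """Fold query parameter values into the file name, dropping the "?"."""
    parts = urlsplit(url)
    values = [value for _, value in parse_qsl(parts.query)]
    if not values:
        return url  # Nothing to fold in; leave the URL unchanged.
    stem, ext = posixpath.splitext(parts.path)  # e.g. "/product", ".asp"
    new_path = stem + "-".join(values) + ext
    return urlunsplit((parts.scheme, parts.netloc, new_path, "", ""))


if __name__ == "__main__":
    print(static_style_url(
        "http://www.examplesite.com/product.asp?id=12345&category=23&page=3"
    ))  # http://www.examplesite.com/product12345-23-3.asp
```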
Hand coding every dynamic page is time-consuming.
If you use the Apache web server, the Apache mod_rewrite module
(http://httpd.apache.org/docs/mod/mod_rewrite.html) can help.
It lets your pages be requested through URLs with no "?" character embedded,
rewriting them on the fly to the underlying dynamic URLs.
Another mod_rewrite resource site is http://www.modrewrite.com.
An interesting article on weberblog.com
describes a practical example of how Google successfully indexed a dynamic
page after the mod_rewrite module was applied. The page's query parameter
value originally had 17 characters.
Before rewrite: http://www.weberblog.com/article.php?sroty=20040419170030157
After rewrite: http://www.weberblog.com/article.php/20040419170030157
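The mapping behind such a rewrite can be sketched in a few lines. The code below is Python, not mod_rewrite configuration syntax, and it only illustrates the idea: a request for the clean path is translated internally to the query-string form that the script expects. The function name internal_path and the pattern are assumptions made for this example. The parameter name "sroty" is copied from the URL above.

```python
import re

# Assumed public form of the URL: /article.php/<digits>, with no "?".
CLEAN_PATH = re.compile(r"/article\.php/(\d+)")


def internal_path(request_path):
    """Map a clean request path to the dynamic path served internally."""
    match = CLEAN_PATH.fullmatch(request_path)
    if match is None:
        return request_path  # Not a rewritten article URL; pass through.
    return f"/article.php?sroty={match.group(1)}"


if __name__ == "__main__":
    print(internal_path("/article.php/20040419170030157"))
    # /article.php?sroty=20040419170030157
```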
So, if your site is experiencing the same
problem, hurry up and implement mod_rewrite now.
Author Resource Box:
------------------------------
Copyright © Jerry Yu
Jerry Yu is an experienced internet marketer
and web developer. Visit his site http://www.WebActionGuide.com
for FREE "how-to" step-by-step action guide, tips, knowledge base articles,
and more.