Apache .htaccess Examples :: Rewriting and Redirecting with mod_rewrite (and mod_alias)
PURPOSE AND BACKGROUND
I heavily rely on Apache's mod_rewrite module in my site's
.htaccess file to do a number of helpful things. In order to learn enough
about mod_rewrite, I studied
a number of examples on the web. Some of those examples were very hard to
find. As the official mod_rewrite documentation will tell you, the use of
these modules can be a black magic art. I'm hoping that these examples will
help someone else acquire a little bit of that black magic.
First, be sure to see the official Apache documentation, which is mirrored
all over the web. In particular, review these sites:
Apache 1.3 URL Rewriting Guide
There are a LOT of good examples here. You should always start here for
textbook examples of helpful, and in some cases complicated, rewriting code.
Module mod_rewrite
This is an extremely important document. It contains a lot of nuances
that are easily overlooked because the author spends only a quick sentence
on something very important. Pay attention to every detail of this document.
Module mod_alias
mod_alias is hardly as complex as mod_rewrite, but it's equally important.
Much of my .htaccess file could be rewritten much more simply using mod_alias. The
only reason I lean so much on mod_rewrite is that mod_alias recurses down every
subdirectory of mine, which includes my subdomains. Thus, if I use mod_alias directives,
redirections I want on my main site show up on all of my subdomains as well. This is not desirable.
I solve this problem by rewriting all of my mod_alias statements with mod_rewrite
directives; mod_rewrite directives do not recurse down subdirectories and subdomains.
If it weren't for my subdomains, I'd use mod_alias much more.
Everyone should have a solid understanding of this module.
Module mod_asis
This is an honorable mention. My .htaccess sends the 403 Forbidden for a number of
very specific reasons. I should have used mod_asis to send those custom 403
error messages. Combining mod_asis with mod_alias and/or mod_rewrite gives the ability to
build CONDITIONAL ERROR DOCUMENTS. (I leave this as an exercise.)
Finally, note that this web page only scratches the surface. With
directives for chaining, passing through, and skipping, mod_rewrite can
turn an .htaccess configuration file into a powerful scripting language.
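For a taste of that scripting flavor, here is a minimal sketch of the C (chain) and S (skip) flags, using entirely hypothetical paths and script names:
# [C] chains this rule to the next: if this rule fails to match,
# the chained rule below is skipped as well
RewriteRule ^catalog/(.*)$ shop/$1 [C]
RewriteRule ^shop/(.*)$ store.php?item=$1 [L]
# [S=1] skips the next rule whenever this one matches; a crude "else" branch
RewriteRule ^api/ - [S=1]
RewriteRule ^page/(.*)$ index.php?page=$1 [L]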
Some Examples Similar to Lines in My .htaccess File
Order Matters to mod_rewrite Directives
Note that the relative order of the mod_rewrite directives matters.
For example, if you are having a problem with a redirect rule
that keeps putting information about the real filesystem location in
the target URL, try moving that redirect rule earlier in the file.
In most cases, if there is no other easy way to determine ordering,
it is best to order redirect rules to URLs with explicit
hostnames FIRST. This sort of ordering is reflected in the examples given
below.
The examples below are meant to be taken in order. If I were to put
these into an .htaccess file, I would leave them in the same
order as they appear on this page.
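As a sketch of that hostname-first ordering (using a hypothetical example.com and made-up paths):
# The external redirect with an explicit hostname comes FIRST...
RewriteRule ^oldname($|/.*$) http://www.example.com/newname$1 [R=permanent,L]
# ...and the purely internal rewrite comes after, so its filesystem-side
# result never leaks into a redirect target
RewriteRule ^newname/(.*)$ show.php?page=$1 [L]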
Spelling of Referrer is REFERER
Remember that it's HTTP_REFERER. This is NOT the
correct spelling of the word referrer, but it IS the correct spelling
of the server variable.
Difference Between Redirecting with mod_rewrite and mod_alias
These next two blocks may appear to be equivalent, but they have at least
one major difference.
RewriteRule ^servo\.php$ http://www.tedpavlic.com/post_servo.php [R=permanent,L]
RewriteRule ^images($|/.*$) http://links.tedpavlic.com/images$1 [R=permanent,L]
Redirect permanent /servo.php http://www.tedpavlic.com/post_servo.php
RedirectMatch permanent ^/images($|/.*$) http://links.tedpavlic.com/images$1
The first block is implemented with mod_rewrite directives.
Thus, the first block is
NOT inherited by other .htaccess files that
live in child directories underneath the main directory.
The second block is implemented with mod_alias directives.
Thus, the second block IS
INHERITED by other .htaccess files that
live in child directories underneath the main directory.
In other words, suppose links.tedpavlic.com is a subdomain that is
hosted out of a links folder that resides within the main
www.tedpavlic.com document root. Suppose that links folder contains
its own .htaccess file that makes no mention of either servo.php
or images.
When accessing http://links.tedpavlic.com/servo.php, the SECOND
block will redirect this request back to http://www.tedpavlic.com/post_servo.php.
However, the FIRST block will return a 404 File Not Found.
When accessing http://links.tedpavlic.com/images, the SECOND
block will redirect this request back to http://links.tedpavlic.com/images,
which results in a redirect loop.
However, the FIRST block will return a 404 File Not Found.
mod_alias rules ride along the top of the directory structure, regardless
of the public structure of the web site and its subdomains. mod_rewrite rules are
completely forgotten when a new .htaccess is found in a subdirectory.
For my site, because of my subdomains, mod_rewrite was best for
me. This may not be the case with your site.
Important Options
Options -Indexes +Includes +FollowSymLinks
-Indexes: I include this here to remind you that you are in control
of your web site. If you don't like the way the webserver displays your
very important content, then change it. Rewrite it. Change how the webserver
interprets requests. -Indexes to me is a symbol of control.
+Includes: This is more of a reminder to use .shtml files
for your error documents (if you don't want to use error scripts). This will help
you return good information to your users, which they can report back to you
in case they find a bug in your rules.
+FollowSymLinks: This is the important one. When rewriting with mod_rewrite
from an .htaccess file, this option is required.
Turn the Engine On
RewriteEngine On
This is just a simple reminder that mod_rewrite needs to be turned on.
Redirect to Most Desirable Hostname or Subdomain
#### Now, before doing any rewriting, make sure everyone is
#### pointed at the right web host
RewriteCond %{HTTP_HOST} !^www\.tedpavlic\.com$ [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteCond %{REQUEST_URI} ^($|/.*$)
RewriteRule ^.* http://www.tedpavlic.com%1 [R=permanent,L]
My websites often have many aliases. These aliases are provided so I have some
flexibility when I want to develop new content. These aliases are also provided so
users have an easy way to remember my sites. However, regardless of user preference,
I really want them to end up at one particular site. I also want search engines
to only index ONE of those sites.
Note the use of the %1 rather than the typical $1.
A %1 will match a group found in one of the RewriteCond statements. In
this case, I'm picking off the whole REQUEST_URI so I can resubmit it to the
subdomain. Note that I could have gotten rid of that third RewriteCond
and done the match entirely in the RewriteRule line and used $1 instead.
However, to keep consistency with my subdomains, I show it like this. This also avoids
confusion with how the match works when the actual domain is found in the target. See
mod_rewrite documentation and further information below for more details.
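As a minimal sketch, the $1 alternative would look like the following. (Remember that the per-directory match strips the leading slash, so the target must supply it.)
RewriteCond %{HTTP_HOST} !^www\.tedpavlic\.com$ [NC]
RewriteCond %{HTTP_HOST} !^$
# $1 holds the path without its leading slash in this per-directory context
RewriteRule ^(.*)$ http://www.tedpavlic.com/$1 [R=permanent,L]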
Notice the R=permanent. Not only does this rule rewrite the URL,
but it issues a 301 permanent redirection. This should convince webbots
to update their records to point to the central site.
Notice the L rewrite flag indicating that this is the last rule
to be processed on this pass. Wait for the browser to continue the redirect.
Then continue processing on the NEW URL. This simplifies rewriting rules later.
This is the reason why I have this rule so early in my .htaccess file!!
Notice that the second line of this rule makes sure it does NOT apply when
there is an empty HTTP_HOST variable. Requests made with older versions of the
HTTP protocol (HTTP/1.0 requests without a Host header) can leave HTTP_HOST empty. Let
these users through without the redirect. Otherwise, you will put them
in a deadly redirect loop. That's bad.
Note that when an explicit site hostname is given
in the target URL, the RewriteRule is interpreted differently and matches against a
slightly different string. See mod_rewrite documentation for more information about
this. This distinction is not important in this rule because I chose to match on REQUEST_URI
instead. I only chose to do this because it is necessary for me to do this within subdirectories
that host my subdomains. (see below)
The following is the very similar RewriteRule block I use on
each of my subdomains that lie inside subdirectories of my main site. Depending on what
sort of redirect you are trying to do, this may be a better choice for you.
#### Now, before doing any rewriting, make sure everyone is
#### pointed at the right web host
RewriteCond %{HTTP_HOST} !^links\.tedpavlic\.com$ [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteCond %{REQUEST_URI} ^/links($|/.*$)
RewriteRule ^.* http://links.tedpavlic.com%1 [R=permanent,L]
Note the similarities and differences between these
lines and the lines that I use in my main website. The purpose of this rule is to redirect any request to
http://www.tedpavlic.com/links/.* to go directly to
http://links.tedpavlic.com/.*.
One major difference is that in this case I'm only grabbing a portion of the REQUEST_URI
to pass on to the subdomain. Note again that I use %1 here rather than $1.
Here, it is important that I match against the REQUEST_URI with a RewriteCond line
because a request to http://www.tedpavlic.com/links/ will cause the RewriteRule line
to match against the ABSOLUTE FILENAME from the FILE SYSTEM rather than just the relative filename from
the document root. The RELATIVE FILENAME is ONLY USED WHEN the TARGET URL INCLUDES the web site host name.
The final important point to make here is that this rule COULD NOT have been placed in the
main site's .htaccess file. This is because (UNLIKE mod_alias directives) the mod_rewrite rules do not
recurse into subdomain subdirectories because each of my subdomains has its own special .htaccess
file. As a consequence, if anyone requests a file from those directories directly under the main site,
she will be redirected to the actual subdomain. Because of the existence of the subdomain's
.htaccess file, any rules I make in the main .htaccess file to attempt to do
the same redirections are disregarded. Thus, the rules must exist in the subdomain .htaccess
file.
Forbid Access to Bad Webbots (and others?)
### Forbid access from certain known-malicious browsers/bots
RewriteCond %{HTTP_USER_AGENT} nhnbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} naver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NHN.Corp [NC,OR]
RewriteCond %{HTTP_USER_AGENT} naverbot [NC,OR]
# Korean addresses that Naverbot might use
RewriteCond %{REMOTE_ADDR} ^61\.(7[89]|8[0-5])\. [OR]
# Korean addresses that Naverbot might use
RewriteCond %{REMOTE_ADDR} ^218\.(14[4-9]|15[0-9])\. [OR]
RewriteCond %{HTTP_USER_AGENT} Sleipnir [NC]
# Allow access to robots.txt and forbidden message
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{REQUEST_URI} !^/403\.shtml$
RewriteRule ^.* - [F,L]
Note the chaining of rewrite conditions. These condition lines
implicitly have an AND between each pair of lines unless an OR flag is used. They
also apply, ANDed together, to the actual rewriting rule.
Note that this rule primarily targets robots, so it would not have been
a bad idea to also check that HTTP_REFERER was empty.
Most robots enter the site without any referrer. If you have
HTTP_USER_AGENT checks that may accidentally catch real users,
a second check making sure the referrer is empty wouldn't be a bad idea
(a sketch follows these notes).
Note the explicit check for robots.txt and 403.shtml.
Without this check, the robots will be forbidden from seeing your custom built
403 message and your robots.txt which tells the robot where it should
and should not be.
Note the use of the F option on the rewrite rule. This
instructs the web server to respond with a 403 Forbidden.
Note the use of regular expressions to pick out IP address
ranges. A strong grasp of regular expressions will be very helpful when
writing these rules and conditions.
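As suggested above, a sketch of that extra safety check simply adds an empty-referrer condition to the chain (shown with a single hypothetical user agent):
RewriteCond %{HTTP_USER_AGENT} somebadbot [NC]
# Robots almost never send a referrer; this spares real users whose
# browsers happen to match the user-agent check
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{REQUEST_URI} !^/403\.shtml$
RewriteRule ^.* - [F,L]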
Forbid Access to Only Certain Types of Files from Certain Agents
### Forbid Yahoo's MMCrawler from accessing multimedia (anything non-text)
RewriteCond %{HTTP_USER_AGENT} MMCrawler [NC]
RewriteCond %{REQUEST_URI} !^/.*\.(txt|tex|ps|pdf|php|htm|html|shtm|shtml)$ [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{REQUEST_URI} !^/403\.shtml$
RewriteRule ^.* - [F,L]
This rule keeps MMCrawler from grabbing anything but true CONTENT.
Again, notice the explicit exclusion of robots.txt and 403.shtml from this rule.
Here, that is NOT strictly NECESSARY since both files are already excluded
by the extension check in the rest of the rule.
Note the rule expression could be more compact (a sketch follows), but it is easier to read this way.
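For instance, the extension alternation could collapse into something like this sketch:
RewriteCond %{REQUEST_URI} !^/.*\.(txt|tex|ps|pdf|php|s?html?)$ [NC]
Here s?html? covers htm, html, shtm, and shtml in one stroke.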
Forbid Access to Certain Documents
### Forbid access to sensitive documents
RewriteCond %{REQUEST_URI} (^|/)\.htaccess$ [NC,OR]
RewriteCond %{REQUEST_URI} ^/guestbook\.csv$ [OR]
RewriteCond %{REQUEST_URI} ^/post_template\.php$ [OR]
RewriteCond %{REQUEST_URI} ^/page_template\.php$ [OR]
RewriteCond %{REQUEST_URI} ^/hitlist\.db$ [OR]
RewriteCond %{REQUEST_URI} ^/includes($|/.*$) [OR]
RewriteCond %{REQUEST_URI} ^/ban($|/.*$) [OR]
RewriteCond %{REQUEST_URI} ^/removed($|/.*$)
RewriteRule ^.* - [F,L]
This is a simple rule. It shows more regular expressions and how
to block access to important non-content files that just happen to live
in the same directory (or close) as web content.
NOTE that the last three of these rules block entire directories
AND ALL OF THEIR CHILDREN.
Prevent Good Spiders from Entering Traps for Bad Spiders
### Forbid access from known good spiders to spam traps and other nasty spots
### (this PROTECTS the good guys!!)
RewriteCond %{REMOTE_ADDR} ^61\.(7[89]|8[0-5])\. [OR]
# Googlebot
RewriteCond %{REMOTE_ADDR} ^64\.68\.82\. [OR]
RewriteCond %{REMOTE_ADDR} ^216\.239\.39\.5$ [OR]
RewriteCond %{REMOTE_ADDR} ^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\. [OR]
# Yahoo Slurp
RewriteCond %{REMOTE_ADDR} ^66\.196\.(6[4-9]|(7|8|9|10|11)[0-9]|12[0-7])\. [OR]
RewriteCond %{REMOTE_ADDR} ^68\.142\.(19[2-9]|2[0-4][0-9]|25[0-5])\. [OR]
# msnbot
RewriteCond %{REMOTE_ADDR} ^207\.46\. [OR]
# psbot
RewriteCond %{REMOTE_ADDR} ^62\.119\.133\.([0-9]|[1-5][0-9]|6[0-3])$ [OR]
# Cyveillance
RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.2(2[4-9]|[34][0-9]|5[0-5])$
# Bots don't come from referrers
RewriteCond %{HTTP_REFERER} ^$
# Prohibit suckerdir and trapme access
RewriteCond %{REQUEST_URI} ^/(suckerdir|trapme)(/|$)
RewriteRule ^.* - [F,L]
There are a number of methods to trick spambots into areas
that record their presence, submit their IP to authorities, and
block them from further access to the site.
These methods often "poison" the spambot as well by providing
fictitious e-mail addresses and (perhaps not obviously) recursive
links. Clearly, it would be bad if a GOOD bot ever found its way
into such traps. Doing so would waste the resources of the good bot,
possibly submit bad content to a search engine,
and might even get a legitimate bot banned from your site (and others).
This rule tries to prevent good bots from wandering into bad
traps.
Notice the restriction is on a class of requests
that start with a particular string.
Prevent Real People from Entering Traps for Bad Spiders
### Forbid access from unknowing web browsers who happened upon the traps
### (this PROTECTS the little people!!)
# Real people often do come from referrers. Protect them.
RewriteCond %{HTTP_REFERER} !^$
# Prohibit suckerdir and trapme access
RewriteCond %{REQUEST_URI} ^/(suckerdir|trapme)(/|$)
RewriteRule ^.* - [F,L]
It would also be bad if real people came upon these requests.
Most likely, if they come across these URLs, it will be from a link
that some jerk has put on a page somewhere.
Again, remember that bots usually carry no HTTP_REFERER.
Since these traps are designed for bots, forbid access from
links. Make sure the HTTP_REFERER is empty.
Setup an Environment for Bad Spider Traps
### Allow access into the suckerdir and trapme traps for all others
## Setup the sand traps, suckerdir and trapme
# This RedirectMatch makes sure there's a trailing / on "directories"
RedirectMatch /(suckerdir|trapme)$ http://www.tedpavlic.com/$1/
# This RewriteRule makes sure there's a trailing / on "directories"
RewriteCond %{REQUEST_URI} (suckerdir|trapme)/(.+)$
RewriteCond %{REQUEST_URI} !(suckerdir|trapme)/(.+)(\.(html?|php)|/)$
RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1/ [R,L]
# This RewriteRule makes index.html the "DirectoryIndex"
RewriteRule ^(suckerdir|trapme)(/|/(.*)/)$ $1$2index.html
# This RewriteRule actually generates the content for each query
RewriteRule ^(suckerdir|trapme)/.+$ $1/$1.php
This set of rules helps to "create" the traps mentioned above.
The actual traps also involve some scripts that generate the
bad content, but these rules make those scripts more believable.
As you can see from the last line, the scripts that make all the
magic of this trap are suckerdir.php and trapme.php.
Note that the index.html file does not exist and the
second-to-last line really isn't needed. It's just there for
my own amusement in the wonder and power of mod_rewrite.
Notice the use of HTTP_HOST. If and when the
site name changes in the future, this makes it easy to transport
this rule to the new site name. REMEMBER that one of the
FIRST RULES redirected to the desired site name, so
HTTP_HOST at this point is a known quantity.
Remember that you can only use %{HTTP_HOST} with the
mod_rewrite directives. This DOES NOT EXIST with the
mod_alias directives.
The end effect of this rule is that any request
to a DIRECTORY, html FILE, or php SCRIPT that BEGINS with
suckerdir or trapme ACTUALLY gets EXECUTED by the
suckerdir.php and trapme.php scripts WITHOUT THE
AGENT EVER KNOWING.
These rules also make sure that requests that look like requests
to directories without the trailing slash get redirected to the version
that does have the trailing slash before actually getting processed.
This helps convince a bot that it's looking at real content.
This shows a way to make dynamic content LOOK STATIC.
It also shows how one script can operate AN ENTIRE SITE and the user
will PERCEIVE that the site is MANY pages with an entire DIRECTORY STRUCTURE.
Strip the Query Strings from Requests from Bots
### If we detect a bot at all, set an environment variable
# NOTE: It is okay to match bad bots here too. We just don't want to match
# real human people.
# To match most bots, check out User-Agent and look for empty referrer
RewriteCond %{HTTP_USER_AGENT} (google|other_bots|wisenutbot) [NC]
RewriteCond %{HTTP_REFERER} ^$
RewriteRule ^.* - [E=HTTP_CLIENT_IS_BOT:1]
# Certain bots actually do have referrers. Catch them too.
RewriteCond %{HTTP_USER_AGENT} (becomebot) [NC]
RewriteRule ^.* - [E=HTTP_CLIENT_IS_BOT:1]
### If we match a bot, strip query strings from requests
# Match a bot
RewriteCond %{ENV:HTTP_CLIENT_IS_BOT} ^1$
# Look for non-empty query string
RewriteCond %{QUERY_STRING} !^$
# Force it empty and tell the bot that it's a permanent change
RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1? [R=permanent,L]
I've done both setting and checking environment variables here.
This isn't necessary. Those lines could be combined (a sketch of the
combined form follows below), but I thought this was a good place to
show an example of using environment variables in this way.
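For comparison, a sketch of the combined form (without the environment variable) might read:
RewriteCond %{HTTP_USER_AGENT} (google|other_bots|wisenutbot) [NC]
RewriteCond %{HTTP_REFERER} ^$
# Only act when there is actually a query string to strip
RewriteCond %{QUERY_STRING} !^$
RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1? [R=permanent,L]
Note that this form loses the HTTP_CLIENT_IS_BOT variable my scripts check, and the becomebot case would need its own copy of the stripping rule.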
One advantage of using environment variables here
is that it passes useful information back to my web
scripts. In this case, checking the HTTP_CLIENT_IS_BOT
environment variable lets me know that it has met my "bot criteria" setup
here in the .htaccess file. I can then tailor my content for bots.
The first half of these rules identifies probable web bots.
Since nearly all bots have empty referrers, it's easy to
reduce false positives by checking for an empty HTTP_REFERER.
The second half strips the query string from
all queries identified as being from bots by the first half.
This is useful to me since many of my pages use query
strings to change the display format, but the content stays the
same. These rules prevent redundant indexing of content.
Notice the use of the ? at the end of the rewriting rule. A
single ? at the end of the rule REMOVES THE QUERY STRING. This is
one of those documented features that is OFTEN OVERLOOKED.
NOTICE that this REDIRECTION does not occur if the
QUERY_STRING is EMPTY. This prevents REDIRECT LOOPS!!
Some Cheap and Simple Redirects
# Redirect requests that should be going to subdomains directly
RewriteRule ^osufirst($|/.*$) http://osufirst.tedpavlic.com$1 [R=permanent,L]
RewriteRule ^schedule($|/.*$) http://schedule.tedpavlic.com$1 [R=permanent,L]
RewriteRule ^europa/schedule($|/.*$) http://schedule.tedpavlic.com$1 [R=permanent,L]
RewriteRule ^blocko($|/.*$) http://blocko.tedpavlic.com$1 [R=permanent,L]
RewriteRule ^europa/blocko($|/.*$) http://blocko.tedpavlic.com$1 [R=permanent,L]
# Redirect some renames that may still be linked elsewhere under old names
RewriteRule ^servo\.php$ http://www.tedpavlic.com/post_servo.php [R=permanent,L]
RewriteRule ^riley\.(jpg|gif|png)$ http://links.tedpavlic.com/riley.$1 [R=permanent,L]
Notice the permanent keyword. Using this keyword is
identical to using R=301. If no keyword is given, R=302 is implied,
which performs a temporary redirect.
Other status codes may be used as long as they correspond to REDIRECT
operations. In other words, other status codes may be used as long as they
are greater than 299 and less than 400.
To indicate that a page is gone or forbidden, use the
G or F flags, respectively, and replace the target URL with
a - (hyphen). See the next section for an example.
Redirects with Other Status Messages
# Now pages that have just been removed and replaced
RewriteRule ^opinions\.php$ http://www.tedpavlic.com/general_posts.php [R=seeother,L]
# Now pages that have just been removed entirely
RewriteRule ^analog_ee\.php$ - [G,L]
RewriteRule ^teaching\.php$ - [G,L]
RewriteRule ^wav($|/.*$) - [G,L]
RewriteRule ^toys_.*\.s?html?$ - [G,L]
# Now forbid some pages that I want to keep around but don't want people to see
Redirect 403 /phpinfo.php
seeother provides a good alternative to a permanent redirect.
It does not imply that the target page is a replacement; rather, it states
that the desired page is gone but a similar page is available.
gone does not take a target parameter with mod_alias directives and takes
a simple hyphen (-) as a parameter with mod_rewrite directives. Use this status when a page
has been permanently removed. Note that using R=gone instead of G is
NOT possible with mod_rewrite's RewriteRule, since a - is given in place of a redirection target.
The final mod_alias rule uses 403 to indicate that all requests
to /phpinfo.php should be given the 403 Forbidden status. There is no mod_alias
keyword for forbidding pages; however, mod_rewrite provides the F parameter.
NOTE that the final rule uses a mod_alias keyword, so it applies
to all subdirectories. This includes subdomains that happen to be hosted beneath this directory.
So this single rule prohibits phpinfo.php access on ANY of my subdomains.
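For comparison, the mod_rewrite form of that last rule would be the single line below, though (unlike the mod_alias version) it would NOT be inherited by the subdomain directories:
RewriteRule ^phpinfo\.php$ - [F,L]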
Being smart about these very technical issues helps to make sure search engines (and users in
general) stay up to date with the layout of your site.
Case Insensitive robots.txt
# Any request for robots.txt, regardless of case, should go through
RewriteRule ^robots\.txt$ robots.txt [NC]
This rule allows robots.txt to be fetched under any capitalization of its name.
It makes the robots.txt file completely case insensitive.
With the addition of this rule, it is a good idea to make every
other robots.txt condition case insensitive with the NC parameter.
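For example, the robots.txt exclusion used in the earlier blocking rules would pick up the NC flag like so:
RewriteCond %{REQUEST_URI} !^/robots\.txt$ [NC]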