Robots, Agents and Spiders - Identifying Search Engine Crawlers
By Michael Bloch
Crawlers, Agents, Bots, Robots and Spiders
These five terms all describe basically the same thing; in this article they'll be referred to collectively as spiders or "agents".
A search engine spider is
an automated software program
used to locate and collect
data from web pages for
inclusion in a search engine's
database and to follow links
to find new pages on the
World Wide Web. The term
"agent" is more
commonly applied to web
browsers and mirroring software.
If
you've ever examined your
server logs or web site
traffic reports, you've
probably come across some
weird and wonderful names
for search engine spiders,
including "Fluffy the
Spider" and Slurp.
Depending upon the type
of web traffic reports you
receive, you may find spiders
listed in the "Agents"
section of your statistics.
Not all spiders are good
Who actually owns these spiders? It pays to be able to tell the beneficial from the bad. Some agents are generated
by software such as Teleport
Pro, an application that
allows people to download
a full "mirror"
of your site onto their
hard drives for viewing
later on, or sometimes for
more insidious purposes
such as plagiarism. If you have a large or image-heavy site, this practice, known as web site stripping, could also have a serious impact on your bandwidth usage each month.
Banning spiders and agents
If you notice entries like
Teleport Pro and WebStripper
in your traffic reports,
someone's been busy attempting
to download your web site.
You don't have to just sit
back and let this happen.
If you are commercially
hosted, you'll be able to
add a couple of lines to
your robots.txt file to
prevent repeat offenders
from stripping your site.
The
robots.txt file gives search
engine spiders and agents
direction by informing them
what directories and files
they are allowed to examine
and retrieve. This convention is known as the Robots Exclusion Standard.
To
prevent certain agents and
spiders from accessing any
part of your web site, simply
enter the following lines
into the robots.txt file:
User-agent: NameOfAgent
Disallow: /
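Enter the name of the agent as it appeared in your reports or logs, and give each agent its own separate entry. If the logged name carries a version number, e.g. Teleport Pro/1.29, it is generally safer to list only the name (Teleport Pro), since the standard recommends that robots match on the name without version information. Skip a line between entries.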
You could do the same to
exclude search engine spiders,
but somehow I don't think
you'll really want to do
this :0). The "/"
in the above example means
disallow access to any directory.
You can also disallow access
by spiders and agents to
certain directories e.g.
User-agent: *
Disallow: /cgi-bin/
In this example the asterisk (wildcard) in the User-agent line means "all spiders and agents". Don't use an asterisk in the Disallow statement to indicate "all"; use the forward slash instead.
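If you'd like to check your entries before uploading them, a short script can help. The following is only a minimal sketch in Python, using the standard library's urllib.robotparser module; the file contents, the is_allowed helper and the page paths are made up for illustration. Python's parser compares agent names without the version number, so the entry here lists Teleport Pro rather than Teleport Pro/1.29. Real spiders may interpret the file slightly differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt combining the two kinds of entry described above:
# a site-wide ban for one named agent and a directory ban for everyone.
ROBOTS_TXT = """\
User-agent: Teleport Pro
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""


def is_allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    checks = [
        ("Teleport Pro/1.29", "/index.html"),
        ("Googlebot/2.1", "/index.html"),
        ("Googlebot/2.1", "/cgi-bin/search.pl"),
    ]
    for agent, path in checks:
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, path) else "blocked"
        print(f"{agent} -> {path}: {verdict}")
```

Running a check like this won't tell you how a particular agent will behave, only how one well-behaved parser reads your file.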
If you don't have a robots.txt file, create one in a plain text editor such as Notepad and upload it to the docs directory (or the root of whichever directory your web pages are stored in).
Never use a blank robots.txt
file as some search engines
may see this as an indication
that you don't want your
site spidered at all! Have
at least one entry in the
file.
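If you have several agents to ban, typing the entries by hand gets tedious. Here is a rough sketch, in Python, of generating the file instead. The build_robots_txt function and its parameters are my own invention for illustration, and the default /cgi-bin/ directory is just the one used in the example earlier. It writes one entry per banned agent, skips a line between entries, and always finishes with a catch-all entry so the file is never blank.

```python
from pathlib import Path


def build_robots_txt(banned_agents, blocked_dirs=("/cgi-bin/",)):
    """Build robots.txt text: one entry per banned agent, then a catch-all."""
    # Each banned agent gets its own entry that disallows the whole site.
    entries = [f"User-agent: {name}\nDisallow: /" for name in banned_agents]

    # Catch-all entry for every other spider. An empty "Disallow:" line
    # means nothing is off limits, which keeps the file from being blank.
    catch_all = ["User-agent: *"]
    if blocked_dirs:
        catch_all += [f"Disallow: {directory}" for directory in blocked_dirs]
    else:
        catch_all.append("Disallow:")
    entries.append("\n".join(catch_all))

    # Entries are separated by a blank line.
    return "\n\n".join(entries) + "\n"


if __name__ == "__main__":
    text = build_robots_txt(["Teleport Pro", "WebStripper"])
    Path("robots.txt").write_text(text, encoding="ascii")
    print(text)
```

Upload the resulting robots.txt to the same place you would put a hand-written one.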
Unfortunately,
defining web stripper agents
and spiders in your robots.txt
file won't work in all cases
as some mirroring software
applications have the ability
to mimic web browser identifiers;
but at least it's some protection
that may save you some valuable
bandwidth.
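To see why a disguised agent slips through, remember that the rules are matched against whatever name the agent chooses to report. The sketch below, again using Python's urllib.robotparser, is only an illustration: the may_fetch helper, the WebStripper version number and the browser-style identifier are all made up.

```python
from urllib.robotparser import RobotFileParser

# A single entry banning one mirroring tool by name.
RULES = ["User-agent: WebStripper", "Disallow: /"]


def may_fetch(claimed_agent: str, path: str = "/") -> bool:
    """Check the rules against the name an agent claims to have."""
    parser = RobotFileParser()
    parser.parse(RULES)
    return parser.can_fetch(claimed_agent, path)


if __name__ == "__main__":
    # The tool announcing itself honestly is covered by the ban...
    print(may_fetch("WebStripper/2.0"))
    # ...but the same tool posing as a web browser matches no entry.
    print(may_fetch("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
```

The file is also only a request: an agent that never reads robots.txt, or ignores it, isn't affected by anything in it.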
If
you're not able to create
a robots.txt file, which
is usually the case if you
are hosted by a free hosting
service, this
tool may be useful:
Search engine spider identification
The following is a basic
listing of search engine
spider names and their "owners".
This is by no means complete,
as there are many thousands
of search engines on the
Internet, but it covers
the more common beneficial
spiders. Look for these
in your traffic reports
or search for the names
through your server logs
to discover which pages
they have been spidering.
You'll find that many of the entries will also have accompanying numbers or letters, e.g. Googlebot/2.1 or Slurp.so/1.0.
| Spider name | Spider owner |
| Googlebot | Google.com |
| TeomaAgent | Teoma.com |
| Zyborg | Wisenut.com |
| Gulliver | NorthernLight.com |
| Architext spider | Excite.com |
| FAST-WebCrawler | FAST (AllTheWeb.com) |
| Slurp | Inktomi.com |
| Ask Jeeves | AskJeeves.com |
| ia_archiver | Alexa.com |
| Scooter | AltaVista.com |
| Mercator | AltaVista.com |
| crawler@fast | FAST (AllTheWeb.com) |
| Crawler | Crawler.de |
| InfoSeek sidewinder | InfoSeek.com |
| Lycos_Spider_(T-Rex) | Lycos.com |
| Fluffy the Spider | SearchHippo.com |
| Ultraseek | InfoSeek.com |
| MantraAgent | LookSmart.com |
| Moget | Goo.jp |
| T-H-U-N-D-E-R-S-T-O-N-E | Thunderstone.com |
| MuscatFerret | Euroferret.com |
| VoilaBot | Voila.fr |
| Sleek Spider | Search-info.com |
| KIT_Fireball | FireBall.de |
| WebCrawler | Webcrawler.com |
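Searching your server logs for the names in the table above is easy to automate. The following is a minimal sketch in Python that assumes your logs are in the common "combined" format, where the agent name is the last quoted field on each line. The pages_by_spider function, the short KNOWN_SPIDERS list and the sample log lines are all made up for illustration, so adjust them to suit your own logs.

```python
import re
from collections import defaultdict

# Pulls the requested path and the agent name out of a combined-format line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# A few names from the table above; extend the list as needed.
KNOWN_SPIDERS = ["Googlebot", "Slurp", "Scooter", "ia_archiver", "FAST-WebCrawler"]

# Invented sample lines, for demonstration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2002:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326 "-" "Googlebot/2.1"',
    '192.0.2.7 - - [10/Oct/2002:13:56:01 +0000] "GET /about.html HTTP/1.0" 200 1500 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '192.0.2.9 - - [10/Oct/2002:14:02:11 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "Slurp.so/1.0"',
]


def pages_by_spider(log_lines, spiders=KNOWN_SPIDERS):
    """Map each spider name found in the log to the pages it requested."""
    seen = defaultdict(list)
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # not a combined-format request line
        agent = match.group("agent").lower()
        for name in spiders:
            # Substring match, so "Googlebot/2.1" still counts as Googlebot.
            if name.lower() in agent:
                seen[name].append(match.group("path"))
                break
    return dict(seen)


if __name__ == "__main__":
    for spider, pages in pages_by_spider(SAMPLE_LOG).items():
        print(spider, pages)
```

To run it against a real log, pass the open file object in place of SAMPLE_LOG.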
If
you have spotted any significant
activity from these spiders
in your reports or logs,
there's a good chance that
you'll be listed on that
particular search engine.
But you'll need to be patient; some search engines take up to six months to refresh their databases!
========================================
Article by Michael Bloch.
Michael Bloch is
owner of http://www.tamingthebeast.net.