What should an SEO know about robots.txt? And what is a robots.txt file?
A robots.txt file is a small piece of technology that every SEO should understand, because it steers the search engine crawlers through your own website. Or, better said: in the robots.txt you can ask the search engine bots to leave certain areas of the website alone. Google, at least, sticks to it.
However, there are a few details and peculiarities of robots.txt worth noting.
About robots.txt in general
If you want to read up on the Robots Exclusion Standard protocol underlying robots.txt, Wikipedia has plenty of information. I will confine myself here to the rules and tips that matter most for SEOs. In general, the following applies:
- The robots.txt is always found at the top level of the domain; its URL is always: http://www.domain.com/robots.txt
- Paths to directories and files always start directly after the domain, i.e. always with a slash "/" followed by the path.
- There are the directives "Disallow" and "Allow". Since "Allow" is the default, it normally does not need to be stated. In addition, only the "big" robots honor "Allow".
- Upper and lower case matters; paths are case-sensitive.
- An asterisk (*) serves as a wildcard. If you mean all directories starting with "private", for example, it looks like this: /private* (example below). Again, not all bots support this; you can only expect the major ones to honor it.
- Since every character, including the dot, is interpreted literally, there has to be a way to mark the end of a URL, for example after a file extension. This is done with the dollar sign ($) (example below).
- Multiple rule blocks are separated by a blank line.
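To make the two wildcard characters concrete, here is a minimal sketch; the /private* path and the .pdf extension are purely hypothetical examples, not taken from the rules above:
User-agent: *
Disallow: /private*
Disallow: /*.pdf$
The first Disallow would block every path beginning with /private, the second every URL ending in .pdf.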
There are also many tips and tricks for locking out unwanted bots via the robots.txt: crawlers that are a nuisance, eat up the site's performance or even steal content. However, even a long list of "evil" bots in the robots.txt will not bring peace to the server. Whoever scrapes content will hardly be stopped by a Disallow statement in a small text file. For pages that really must be locked, only an entry in the .htaccess helps.
The structure of robots.txt
An entry in the robots.txt always consists of two parts: first you specify which user agent (robot) the statement applies to, then the instruction itself follows:
User-agent: Googlebot
Disallow: /admin/
This means that Google's web crawler should not crawl the /admin/ directory.
If no robot at all should touch this directory, the statement looks like this; the asterisk stands in for all robots:
User-agent: *
Disallow: /admin/
If the entire site should not be crawled at all (for example a test installation before the launch), the instruction reads as follows. The lone slash means that no directory may be crawled, i.e. everything is blocked:
User-agent: *
Disallow: /
If a single page or a single image should not be crawled, the entry in the robots.txt looks like this:
User-agent: Googlebot
Disallow: /get.html
Disallow: /images/seo.jpg
Special tasks in the robots.txt
Often it is not about keeping entire sites out of the index, but about finer-grained tasks. For these there are separate instructions for the different robots. If, for example, no images at all should be crawled, you address Google's image bot directly:
User-agent: Googlebot-Image
Disallow: /
It is also possible to block, for example, all images of a particular format:
User-agent: Googlebot-Image
Disallow: /*.jpg$
A special case is the serving of AdWords ads. These may only appear on pages that can be crawled. But if those pages should still stay out of the organic index, you need two entries: one saying that the pages should *not* be crawled, and then an exception for the AdWords crawler:
User-agent: Mediapartners-Google
Allow: /
User-agent: *
Disallow: /
If an entire directory should be blocked but certain subdirectories within it allowed, the statement looks like this. The "general" rule should come before the "specific" one:
User-agent: *
Disallow: /shop/
Allow: /shop/Magazine/
You can also block all pages with a parameter from being crawled. It is worth clarifying here whether the canonical tag would not be the better choice for these pages, since the two signals would interfere with each other. If it fits, however, this statement is correct and ensures that no URL containing a question mark is crawled:
User-agent: Googlebot
Disallow: /*?
With a special directive you can influence the crawl rate. This only works with Yahoo! Slurp and MSNBot. In this example, a page may only be fetched every 120 seconds:
Crawl-delay: 120
Important: if you give instructions for specific bots, all previous, more general instructions have to be repeated within that bot's block.
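A short sketch of what that repetition means, assuming the general rules block a hypothetical /admin/ directory and the crawl delay is aimed at MSNBot:
User-agent: *
Disallow: /admin/

User-agent: msnbot
Disallow: /admin/
Crawl-delay: 120
Without the repeated Disallow line, MSNBot would only follow its own block, see just the Crawl-delay, and could crawl /admin/ again.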
Special Case "Mobile Site"
With the robots.txt you can "separate" a mobile site from a desktop site, namely by blocking the mobile bots on the desktop version and the "normal" bots on the mobile site. However, I would always combine this with appropriate sitemaps.
On the desktop site (to keep the mobile bot out):
User-agent: Googlebot-Mobile
Disallow: /
On the mobile site (to keep the desktop bot out):
User-agent: Googlebot
Disallow: /
Here the Allow statements could additionally be added in each case, but that is probably not really necessary. Of course, further mobile bots can be listed as well (see table below).
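As a combined sketch, the desktop site's robots.txt could also carry the sitemap reference mentioned above; the sitemap URL and the idea of mirroring the file on the mobile host are assumptions for illustration:
User-agent: Googlebot-Mobile
Disallow: /

Sitemap: http://www.domain.com/sitemap.xml
The mobile site's robots.txt would look the same, only with Googlebot blocked instead of Googlebot-Mobile and its own sitemap URL.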
More information on robots.txt
- If the crawler cannot crawl a page or a directory, that does not mean the page will never end up in the Google index. If links point to it, Google will include the page in the index, because it knows about it. But it will not crawl it, and so it will also not see, for example, a "noindex" statement on it.
- You can test the robots.txt in Google Webmaster Tools. There, under "Status" and "Blocked URLs", a statement can be checked directly against example URLs.
- With the Firefox add-on Roboxt you can see the robots.txt status of a page at any time (https://addons.mozilla.org/en/firefox/addon/roboxt/).
- In addition, you can reference one or more sitemaps in the robots.txt. The entry looks like this:
Sitemap: http://www.domain.com/sitemap.xml
- The robots.txt may also contain comment lines, which begin with "#".
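As a small illustrative snippet, assuming the placeholder domain used above, comments and a sitemap reference can sit alongside the rules:
# Keep all bots out of the admin area
User-agent: *
Disallow: /admin/

Sitemap: http://www.domain.com/sitemap.xml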
Please: everything in moderation and with calm!
Whenever we come across a really detailed and long robots.txt at a client, I get suspicious. I then ask: is there someone who can really explain every statement in it to me? If not, it may well be that perfectly good landing pages are being excluded from the crawl. The best strategy then is to start with the following robots.txt and discuss each statement calmly:
User-agent: *
Disallow:
Because this statement simply says: welcome, robots, take a look around my whole site ...