
What an SEO should know about robots.txt

What should an SEO know about robots.txt? And what is a robots.txt file in the first place?
The "robots.txt" is a small text file that every SEO should understand, because it controls how search engine crawlers move through a website. Or, better put: in the robots.txt you can ask the search engine bots to leave certain areas of the website alone. Google, at least, complies with it.

However, there are a few details and peculiarities of robots.txt worth noting.

About robots.txt in general

If you want to dig into the Robots Exclusion Standard protocol that underlies robots.txt, Wikipedia has plenty of information. Here I will confine myself to the rules and tips that matter most for SEOs. In general, this is how it works:
  • The robots.txt always sits at the top level of the domain; its URL is always: http://[www.domain.com]/robots.txt.
  • Paths to directories and files always start directly after the domain, i.e. always with a slash "/" followed by the path.
  • There are two instructions, "Disallow" and "Allow". Since "Allow" is the default, it normally does not have to be stated explicitly. In addition, only the "big" robots respect "Allow".
  • Upper and lower case matter.
  • An asterisk (*) serves as a wildcard. If, for example, you mean all directories that start with "private", it looks like this: /private* (see the sketch after this list). However, not all bots support this; you can only rely on the big ones to handle it.
  • Because every character, and thus also the dot, is interpreted literally, there has to be a way to mark the end of a file extension. This is done with the dollar sign ($) (example below).
  • Several rule blocks are separated by a blank line.
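
A minimal sketch of the wildcard rule just described, using directories that start with "private" as in the bullet above; the ".pdf" extension is only a made-up illustration of the dollar sign:

User-agent: *
Disallow: /private*
Disallow: /*.pdf$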

There are also many tips and tricks for locking unwanted bots out via robots.txt: crawlers that get on your nerves, eat up the site's performance, or even steal content. However, even a long list of "evil" bots in the robots.txt will not bring peace to the server. Anyone who scrapes content will hardly be stopped by a Disallow statement in a small text file. For pages that really need to be locked, only an entry in the .htaccess helps.
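
For such cases, a minimal sketch of what an .htaccess entry could look like (assuming Apache with mod_rewrite enabled; the user-agent string "ScraperBot" is only a placeholder):

# Return 403 Forbidden for any request whose user agent contains "ScraperBot"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "ScraperBot" [NC]
RewriteRule .* - [F]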

The structure of robots.txt

An entry in the robots.txt always consists of two parts. First you specify which user agent (robot) the statement applies to, then the instruction itself follows:

User-agent: Googlebot
Disallow: /admin/

This means that Google's web crawler should not crawl the /admin/ directory.

If no robot at all is supposed to touch this directory, the statement uses the asterisk as a stand-in for all robots:

User-agent: *
Disallow: /admin/

If the entire site must not be crawled (for example a test installation before a prelaunch), the instruction reads as follows. The single slash means that no directory may be crawled, i.e. all directories are blocked:

User-agent: *
Disallow: /

If a single page or a single image should not be crawled, the entry in the robots.txt looks like this:

User-agent: Googlebot
Disallow: /get.html
Disallow: /images/seo.jpg

Special tasks in the robots.txt

Often it is not a matter of keeping complete sites out of the index, but of finer-grained tasks. For these there are separate instructions for the different robots. To exclude all images from Google image search, for example:

User-agent: Googlebot-Image
Disallow: /

It is also possible, for example, to block all images of a particular format:

User-agent: Googlebot-Image
Disallow: /*.jpg$

A special case is the serving of AdWords ads. These may only appear on pages that can be crawled. If those pages should nevertheless stay out of the organic index, you need two entries: one saying that the pages must *not* be crawled, and then an exception for the AdWords crawler:

User-agent: Mediapartners-Google
Allow: /
User-agent: *
Disallow: /

If an entire directory is to be blocked but some directories inside it are allowed, the statement looks like this. The "general" rule should come before the "specific" one:

User-agent: *
Disallow: /shop/
Allow: /shop/Magazine/

You can also block all pages with a parameter from being crawled. It is worth clarifying here whether these pages would not be better handled with a canonical tag, since the two pieces of information would interfere with each other. If it fits, though, this statement is correct. It ensures that no URL containing a question mark is crawled:

User-Agent: Googlebot
Disallow: /*?

With a special directive you can influence the crawl rate. This only works with Yahoo! Slurp and MSNBot. In the example, a page may only be fetched every 120 seconds:

User-agent: Slurp
Crawl-delay: 120

Important: if you give instructions for a specific bot, all previous, more general instructions have to be repeated in that bot's block, because a bot only follows the most specific block that matches it.
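
A small sketch to illustrate this (the paths are made up): as soon as a block for Googlebot exists, Googlebot ignores the general block, so the /intern/ rule has to be repeated there:

User-agent: *
Disallow: /intern/

User-agent: Googlebot
Disallow: /intern/
Disallow: /alt/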

Special Case "Mobile Site"

With robots.txt you can "detach" a mobile site from a desktop site, namely by blocking the mobile bots on the desktop version and the 'normal' bots on the mobile site. However, I would always combine this with appropriate sitemaps.

Mobile site:
User-Agent: Googlebot-Mobile
Disallow: /

Desktop Site:
User-Agent: Googlebot
Disallow: /

Here you could additionally add the corresponding Allow statements, but that is probably not really necessary. And of course further mobile bots can be listed (see table below).

More information on robots.txt
  • If the crawler cannot crawl a page or a directory, that does not mean those pages can never end up in the Google index. If links point to them, Google will put the page into the index because it knows about it; it just will not crawl it. So it will also, for example, not see a "noindex" statement on the page (a short example follows after this list).
  • You can test the robots.txt in Google Webmaster Tools. Under "Status" and "Blocked URLs" a statement can be checked directly against example URLs.
  • With the Firefox add-on Roboxt you can see the robots.txt status of the current page at any time (https://addons.mozilla.org/en/firefox/addon/roboxt/).
  • In addition, you can reference one or more sitemaps in the robots.txt. The entry looks like this:
  • Sitemap: http://[www.domain.com]/sitemap.xml
  • The robots.txt may also contain comment lines, which begin with "#".
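
To the first point in this list: a page can only be deindexed via "noindex" if the crawler is allowed to fetch it and can actually see the tag, for example in the page's head:

<meta name="robots" content="noindex">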

Please: everything in moderation, and take it easy!

Whenever we come across a really detailed, long robots.txt at a client, I get suspicious. I then ask: is there anyone who can really explain every statement in it to me? If not, it may well be that really good landing pages are being excluded from the crawl. The best strategy then is to start with the following robots.txt and discuss each additional statement calmly:

User-agent: *
Disallow:

Because this statement says: welcome, robots, have a look around my site ...
