Prevent Pages from Being Crawled with Robots.txt

If any pages on your website host exactly the same content as other pages, those pages may not rank as highly on Google as they could. This is often called the “duplicate content penalty.” Additionally, if you use a PHP jump script for affiliate links, Google may index the jump links and display them as if they were part of your website. You can prevent both problems by creating a robots.txt file. Robots.txt gives instructions to the crawlers used by Google and other search engines, and those instructions can include directions not to crawl certain areas of your website.

Open Notepad (or any plain-text editor) and begin with a blank file. Add the following as the top line of the file:

User-agent: *

This line indicates that the instructions in the file apply to all robots, regardless of search engine.
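If you wanted to address only one crawler instead, you could name it explicitly. For example, Google’s web crawler identifies itself as “Googlebot,” so a rule block beginning with the line below would apply to Google alone (the “/private/” path is just a placeholder for illustration):

```
User-agent: Googlebot
Disallow: /private/
```

Other crawlers would ignore this block and fall back to any “User-agent: *” rules in the file.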

Type the following at the start of the second line of the file:

Disallow: /

Then complete the line with the path that you do not want search engine robots to crawl. (Be careful: a bare “Disallow: /” with nothing after it blocks your entire site.) For example, WordPress automatically creates “tag” pages for each tag that you add to your posts, giving readers a way to search your site and locate the content that interests them. If you use a “blog-style” WordPress theme, however, you may find that each “tag” page contains a complete copy of every post carrying that tag. This can lead to poor search engine placement due to duplicate content. To prevent search engines from crawling your “tag” pages, add this line to the robots.txt file:

Disallow: /tag/

To prevent search engines from crawling your PHP affiliate jump links, add this line:

Disallow: /jump.php
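Putting the steps above together, the complete file for this example would read:

```
User-agent: *
Disallow: /tag/
Disallow: /jump.php
```

Each Disallow line blocks one path, and all of them apply to every crawler because of the “User-agent: *” line at the top.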

Save the file with the exact name “robots.txt” (all lowercase) and upload it to the root directory of your website (usually “public_html”), so that it is reachable at yourdomain.com/robots.txt.
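Before uploading, you can sanity-check your rules with Python’s standard urllib.robotparser module. This is a quick sketch; the domain and paths are placeholders matching the examples above:

```python
from urllib import robotparser

# The rules written in the steps above
rules = """\
User-agent: *
Disallow: /tag/
Disallow: /jump.php
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Tag pages and jump links are blocked for all crawlers
print(rp.can_fetch("*", "https://example.com/tag/seo/"))       # False
print(rp.can_fetch("*", "https://example.com/jump.php?id=3"))  # False

# Ordinary pages remain crawlable
print(rp.can_fetch("*", "https://example.com/about/"))         # True
```

If a page you expect to be crawlable comes back False, a Disallow rule is broader than you intended.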
