What is a Robots.txt File? How to Create It?


By default, search engine crawlers try to crawl every page of a website and submit it for indexing according to the search engine's guidelines. But a website may contain confidential content that should not be shown to searchers, and that content can be exposed if the search engine indexes and ranks those pages. Obviously, no webmaster wants to be in that situation. So how do you prevent search engines from crawling and indexing such pages? A robots.txt file will help you.

This article will guide you through:

  • What a robots.txt file is
  • When to use a robots.txt file
  • How to use a robots.txt file
  • The main robots.txt directives

What is a Robots.txt file?

A robots.txt file tells search engine crawlers which pages or directories of your website they should not visit. The file must be saved in plain text (.txt) format. Keep in mind that robots.txt controls crawling rather than indexing: a disallowed URL can still appear in search results if other websites link to it.
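
For example, a minimal robots.txt file (the directory name below is only a placeholder) could look like this:

User-agent: *
Disallow: /private/

This tells every crawler not to visit any URL under the /private/ directory.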

When to use a Robots.txt file?

A robots.txt file is commonly used to disallow pages such as:

  • Pages with confidential content.
  • The admin login page.
  • Duplicate content.
  • Thin or low-quality content.
  • Pages that are unavailable due to maintenance.
  • Pages that return 404 errors.

Any web page that is not useful to searchers can be disallowed using the robots.txt file, as the example below shows.
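
For instance, a site might block its admin login area and a temporary maintenance page like this (the paths are only placeholders for your own URLs):

User-agent: *
Disallow: /admin/
Disallow: /maintenance-page.html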

 

How to use a Robots.txt file?

As mentioned earlier, the robots.txt file should be saved in plain text (.txt) format. You can use Notepad (or any text editor) to create the file and then upload it to the top-level (root) directory of your website, so that it is reachable at a URL like https://www.example.com/robots.txt.

A robots.txt file has the following basic syntax:

User-agent: *
Disallow: /page-url/

The syntax above tells all search engine crawlers not to crawl the specified page.

Let’s understand this using the following scenarios:

  • To disallow all pages:
    User-agent: *
    Disallow: /
  • To allow all pages:
    User-agent: *
    Disallow:
    or
    User-agent: *
    Allow: /
  • To disallow a single page:
    User-agent: *
    Disallow: /directory/page.html
  • To disallow multiple pages:
    User-agent: *
    Disallow: /directory/page1.html
    Disallow: /directory/page2.html
  • To disallow all pages in specified directories:
    User-agent: *
    Disallow: /directory1/
    Disallow: /directory2/
  • To allow only a few pages from disallowed directories:
    User-agent: *
    Disallow: /directory1/
    Allow: /directory1/allowed-page1.html
    Allow: /directory1/allowed-page2.html
    Disallow: /directory2/
    Allow: /directory2/allowed-page1.html
    Allow: /directory2/allowed-page2.html
  • To allow or disallow specific bots:
    User-agent: googlebot
    Disallow: /directory1/

    User-agent: bingbot
    Disallow: /directory2/
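
Putting several of these rules together, a complete robots.txt file for a hypothetical site might look like this (all directory and file names are placeholders):

User-agent: *
Disallow: /admin/
Disallow: /duplicate-content/
Allow: /duplicate-content/original-page.html

User-agent: bingbot
Disallow: /test-directory/

Each User-agent line starts a new group of rules, and a crawler generally follows the group that matches it most specifically.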

Robots.txt directives

  • Allow
  • Disallow
  • Sitemap
  • Crawl-delay
  • Host

Allow

We have already covered examples of the Allow and Disallow directives earlier in this article.
In short:

The Allow directive tells search engine crawlers that they may crawl the specified pages or directories. It is generally used to allow a specific page inside an otherwise disallowed directory.

Example:

User-agent: *
Disallow: /make-money/
Allow: /make-money/adsense.html

Disallow

The Disallow directive tells search engine crawlers not to crawl the specified pages or directories.

Example:

User-agent: *
Disallow: /seo/black-hat-techniques.html

Sitemap

Robots.txt allows you to include the location of your XML sitemap. Many search engine crawlers can discover the sitemap from the robots.txt file. The sitemap should be referenced by its full URL.

Example:

User-agent: *
Disallow: /make-money/
Allow: /make-money/adsense.html
Sitemap: https://seowebsiteblog.com/sitemap.xml

Crawl-delay

The Crawl-delay directive is generally used by frequently crawled websites. It tells search engine crawlers to wait the specified number of seconds between successive requests so that they do not visit the site too often. Note that not all search engines honor this directive.

Example:

User-agent: *
Crawl-delay: 1
Disallow: /make-money/
Allow: /make-money/adsense.html

More about Crawl-delay.

Host

The Host directive is used to specify the preferred domain of the website (for example, the www or non-www version).

Example:

User-agent: *
Host: seowebsiteblog.com
Crawl-delay: 1
Disallow: /make-money/
Allow: /make-money/adsense.html

or

User-agent: *
Host: www.seowebsiteblog.com
Crawl-delay: 1
Disallow: /make-money/
Allow: /make-money/adsense.html
