Configuring robots.txt for WordPress

In this article, I provide an example of optimal robots.txt code for WordPress that you can use on your websites.

First, let's remember why robots.txt is needed: the robots.txt file exists exclusively for search engine robots, to "tell" them which sections/pages of the site to visit and which not to. Pages that are closed from crawling will not be indexed by search engines (Yandex, Google, etc.).

You can also block a page from a robot with the robots meta-tag or the X-Robots-Tag HTTP response header. The advantage of the robots.txt file is that when a robot visits the site, it first loads all the rules from robots.txt and then crawls the site's pages based on them, skipping the URLs that the rules disallow.

Thus, if we have closed a page in robots.txt, the robot will simply skip it without making any requests to the server. And if we have closed a page in the X-Robots-Tag header or meta-tag, the robot needs to first make a request to the server, receive a response, check what is in the header or meta-tag, and only then decide whether to index the page or not.
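
For reference, here is roughly what the page-level alternatives look like. The robots meta-tag is placed in the page's <head>:

<meta name="robots" content="noindex, nofollow">

and the same instruction can be sent as an HTTP response header:

X-Robots-Tag: noindex, nofollow

(the exact directives, noindex/nofollow, depend on what you want to forbid).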

Therefore, the robots.txt file tells the robot which pages (URLs) of the site should simply be skipped without making any requests. This saves the robot's crawl time and saves server resources.

Let's consider an example. Say a site has 10,000 pages in total (real pages, not 404 URLs). Only 3000 of them are useful pages with unique content; the rest are date archives, author archives, pagination pages, and other pages with duplicated content (for example, filters with GET parameters). Suppose we want to prevent indexing of those 7000 non-unique pages:

  1. if we do this through robots.txt, the robot will only need to visit 3000 pages to index the entire site; the rest are filtered out immediately at the URL level.
  2. if we do this through the robots meta-tag, the robot will need to visit all 10,000 pages of the site to index it, because it has to fetch the page content to find out what is in the meta-tag (which says that the page should not be indexed).

It's not hard to guess that in this case the first option is preferable: the robot spends far less time crawling the site, and the server generates far fewer pages.

Optimal code for robots.txt for WordPress

It is important to understand that the code below is a universal example for the robots.txt file. For each specific site, it needs to be extended or modified. And it's better not to change anything if you don't understand what you are doing - seek help from knowledgeable people.

Version 1 (less strict)

This version is perhaps more preferable compared to the second one because there is no danger of preventing the indexing of any files within the WordPress core or the wp-content folder.

User-agent: *                   # Create a section of rules for robots. * means for all
								# robots. To specify a section of rules for a specific
								# robot, instead of *, specify its name: GoogleBot, Yandex.
Disallow: /cgi-bin              # Standard folder on hosting.
Disallow: /wp-admin/            # Close the admin area.
Allow: /wp-admin/admin-ajax.php # Allow ajax.
Disallow: /?                    # All query parameters on the main page.
Disallow: *?s=                  # Search.
Disallow: *&s=                  # Search.
Disallow: /search               # Search.
Disallow: /author/              # Author archive.
Disallow: */embed$              # All embeds.
Disallow: */xmlrpc.php          # WordPress API file
Disallow: *utm*=                # Links with utm tags
Disallow: *openstat=            # Links with openstat tags

# One or more links to the site map (Sitemap file). This is an independent
# directive and there is no need to duplicate it for each User-agent. For example,
# Google XML Sitemap creates 2 site maps:
Sitemap: http://example.com/sitemap.xml
Sitemap: http://example.com/sitemap.xml.gz

# Code version: 2.0
# Don't forget to change `example.com` to your site.

Version 2 (strict)

In this variant, we control all accesses. First, we globally deny access to almost everything from WP (Disallow: /wp-), and then allow where necessary.

I probably wouldn't recommend this code because it closes everything from wp- and you will need to describe everything that is allowed. So in the future, when WP introduces something new, this new thing may become unavailable to robots. For example, this happened with the WP site map.

User-agent: *                  # Create a section of rules for robots. * means for all
							   # robots. To specify a section of rules for a specific
							   # robot, instead of *, specify its name: GoogleBot, Yandex.
Disallow: /cgi-bin             # Standard folder on hosting.
Disallow: /wp-                 # Everything related to WP - this is: /wp-content /wp-admin
							   # /wp-includes /wp-json wp-login.php wp-register.php.
Disallow: /wp/                 # Directory where the WP core is installed (if the core is installed
							   # in a subdirectory). If WP is installed standardly, then
							   # the rule can be removed.
Disallow: /?                   # All query parameters on the main page.
Disallow: *?s=                 # Search.
Disallow: *&s=                 # Search.
Disallow: /search              # Search.
Disallow: /author/             # Author archive.
Disallow: */embed$             # All embeds.
Disallow: */xmlrpc.php         # WordPress API file
Disallow: *utm*=               # Links with utm tags
Disallow: *openstat=           # Links with openstat tags
Allow:    */wp-*/*ajax*.php    # AJAX requests: */admin-ajax.php */front-ajaxs.php
Allow:    */wp-sitemap         # site map (main and nested)
Allow:    */uploads            # open uploads
Allow:    */wp-*/*.js          # inside /wp- (/*/ - for priority)
Allow:    */wp-*/*.css         # inside /wp- (/*/ - for priority)
Allow:    */wp-*/*.png         # images in plugins, cache folder, etc.
Allow:    */wp-*/*.jpg         # images in plugins, cache folder, etc.
Allow:    */wp-*/*.jpeg        # images in plugins, cache folder, etc.
Allow:    */wp-*/*.gif         # images in plugins, cache folder, etc.
Allow:    */wp-*/*.svg         # images in plugins, cache folder, etc.
Allow:    */wp-*/*.webp        # files in plugins, cache folder, etc.
Allow:    */wp-*/*.swf         # files in plugins, cache folder, etc.
Allow:    */wp-*/*.pdf         # files in plugins, cache folder, etc.
							   # Rules section ends

# One or more links to the site map (Sitemap file). This is an independent
# directive and there is no need to duplicate it for each User-agent. For example,
# Google XML Sitemap creates 2 site maps:
Sitemap: http://example.com/wp-sitemap.xml
Sitemap: http://example.com/wp-sitemap.xml.gz

# Code version: 2.0
# Don't forget to change `example.com` to your site.

In the Allow: rules you can see extra, seemingly unnecessary asterisks - they are there to increase the priority of the rule. Why this matters is explained in the rules sorting section below.

Directives (code analysis)

User-agent:

Determines for which robot the block of rules written after this line will work. There are two possible options:

  1. User-agent: * — specifies that the rules after this line will apply to all search robots.

  2. User-agent: ROBOT_NAME — specifies a specific robot to which the block of rules applies. For example: User-agent: Yandex, User-agent: Googlebot (see the short example just after this list).
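
For example, a robots.txt with a general section and a separate section for one robot could look like this (a robot that finds a section addressed to it uses that section and ignores User-agent: *):

User-agent: *            # rules for all robots
Disallow: /cgi-bin
Disallow: /wp-admin/

User-agent: Googlebot    # rules that only Googlebot will use
Disallow: /cgi-bin
Disallow: /wp-admin/
Disallow: /search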

Possible Yandex robots:

The Yandex robot looks for records starting with User-agent: that contain the substring Yandex (case-insensitive) or *. If a User-agent: Yandex line is found, the User-agent: * line is ignored. If neither User-agent: Yandex nor User-agent: * is present, the robot is assumed to have unrestricted access.

  • Yandex — any Yandex robot.
  • YandexImages - Indexes images for display on Yandex Images.
  • YandexMedia - Indexes multimedia data.
  • YandexDirect - Downloads information about the content of Yandex Advertising Network partner sites to refine their themes for relevant advertising.
  • YandexDirectDyn - Generates dynamic banners.
  • YandexBot - Main indexing robot.
  • YandexAccessibilityBot - Downloads pages to check their accessibility to users. Its maximum crawl rate is 3 requests per second. The robot ignores settings in the Yandex.Webmaster interface.
  • YandexAdNet - Yandex Advertising Network robot.
  • YandexBlogs - Blog search robot, indexing post comments.
  • YandexCalendar - Yandex Calendar robot. Downloads calendar files at the initiative of users, which are often located in directories prohibited for indexing.
  • YandexDialogs - Sends requests to Alice skills.
  • YaDirectFetcher - Downloads target pages of advertising announcements to check their accessibility and refine their themes. This is necessary for placing ads in search results and on partner sites. The robot does not use the robots.txt file, so it ignores directives set for it.
  • YandexForDomain - Domain mail robot, used to verify domain ownership rights.
  • YandexImageResizer - Mobile services robot.
  • YandexMobileBot - Identifies pages with layouts suitable for mobile devices.
  • YandexMarket - Yandex Market robot.
  • YandexMetrika - Yandex Metrica robot. Downloads site pages to check their accessibility, including checking target pages of Yandex.Direct ads. The robot does not use the robots.txt file, so it ignores directives set for it.
  • YandexMobileScreenShotBot - Takes a screenshot of a mobile page.
  • YandexNews - Yandex News robot.
  • YandexOntoDB - Object response robot.
  • YandexOntoDBAPI - Object response robot, downloading dynamic data.
  • YandexPagechecker - Accesses the page when validating microdata through the Microdata Validator form.
  • YandexPartner - Downloads information about the content of Yandex partner sites.
  • YandexRCA - Gathers data to create previews. For example, for extended display of a site in search results.
  • YandexSearchShop - Downloads YML files of product catalogs (at the initiative of users) that are often located in directories prohibited for indexing.
  • YandexSitelinks - Checks the availability of pages used as quick links.
  • YandexSpravBot - Yandex Business robot.
  • YandexTracker - Yandex Tracker robot.
  • YandexTurbo - Crawls an RSS feed created to generate Turbo pages. Its maximum crawl rate is 3 requests per second. The robot ignores settings in the Yandex.Webmaster interface and the Crawl-delay directive.
  • YandexUserproxy - User action proxy for Yandex services: sends requests in response to button clicks, downloads pages for online translation, and so on.
  • YandexVertis - Search verticals robot.
  • YandexVerticals - Yandex Verticals robot: Auto.ru, Yandex Realty, Yandex Jobs, Yandex Reviews.
  • YandexVideo - Indexes videos for display in Yandex video search results.
  • YandexVideoParser - Indexes videos for display in Yandex video search results.
  • YandexWebmaster - Yandex Webmaster robot.

Possible Google robots:

  • Googlebot — main indexing robot.
  • Googlebot-Image — indexes images.
  • Mediapartners-Google — robot responsible for placing ads on the site. Important for those using AdSense advertising. With this user-agent, you can control the placement of ads by allowing or disallowing them on certain pages.
  • Full list of Google robots.

Disallow:

Prevents robots from "crawling" links that match the specified pattern:

  • Disallow: /cgi-bin — closes the scripts directory on the server.
  • Disallow: *?s= — closes search pages.
  • Disallow: */page/ — closes all types of pagination.
  • Disallow: */embed$ — closes all URLs ending with /embed.

Example of adding a new rule. Let's say we need to prevent indexing of all posts in the news category. To do this, we add the rule:

Disallow: /news

It will prevent robots from crawling links like this:

  • http://example.com/news
  • http://example.com/news/another-title/

If you need to close any occurrences of /news, then write:

Disallow: */news

This will close:

  • http://example.com/news
  • http://example.com/my/news/another-title/
  • http://example.com/category/newsletter-title.html

You can learn more about the robots.txt directives on the Yandex help page. Keep in mind that not all rules described there work for Google.

IMPORTANT about Cyrillic: robots do not understand Cyrillic characters, so Cyrillic in URLs must be specified in percent-encoded form. For example:

Disallow: /каталог                                    # incorrect.
Disallow: /%D0%BA%D0%B0%D1%82%D0%B0%D0%BB%D0%BE%D0%B3 # correct.
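
If you need to get the encoded form, any URL-encoding tool will do; as a quick sketch, PHP's rawurlencode() produces the same result:

<?php
// Percent-encode a Cyrillic path segment for use in robots.txt.
echo 'Disallow: /' . rawurlencode( 'каталог' );
// Prints: Disallow: /%D0%BA%D0%B0%D1%82%D0%B0%D0%BB%D0%BE%D0%B3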

Allow:

In the line Allow: */uploads, we intentionally allow indexing of pages whose URLs contain /uploads. This rule is necessary because we block pages starting with /wp-, and /wp- is part of /wp-content/uploads. Therefore, to override the Disallow: /wp- rule, we need the line Allow: */uploads: links like /wp-content/uploads/... may lead to images that should be indexed, and there may also be other uploaded files that there is no need to hide.

Allow: can be placed "before" or "after" Disallow:. When reading the rules, robots first sort them and then process the last applicable rule, so the placement of Allow: and Disallow: does not matter. For more details on sorting, see below.

Sitemap:

The rule Sitemap: http://example.com/sitemap.xml points the robot to a sitemap file in XML format. If you have such a file on your site, specify the full path to it. There can be several such files; in that case, specify the path to each file separately.

IMPORTANT: Rules Sorting

Yandex and Google process the Allow and Disallow directives not in the order they are specified, but by sorting them from short to long, and then processing the last applicable rule:

User-agent: *
Allow: */uploads
Disallow: /wp-

will be read as:

User-agent: *
Disallow: /wp-
Allow: */uploads

Thus, if a link like /wp-content/uploads/file.jpg is being checked, the Disallow: /wp- rule will block it, but the subsequent Allow: */uploads rule will allow it, making the link available for crawling.

To quickly understand and apply the sorting feature, remember this rule: "the longer the rule, the higher its priority. If the rules are of the same length, priority is given to the Allow directive."
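
For example, if an Allow and a Disallow rule of the same length both match a URL, the Allow rule wins (a contrived illustration):

Disallow: /wp-login.php
Allow: /wp-login.php       # same length, so Allow wins and the URL stays open to robots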

Checking robots.txt and Documentation

You can check whether the rules work as intended in Yandex Webmaster (the robots.txt analysis tool) and in Google Search Console (the robots.txt report).

robots.txt in WordPress

It is IMPORTANT that there is NO robots.txt file in the root of your site! If it is there, then everything described below simply will not work, because your server will serve the content of this static file.

In WordPress, the request for /robots.txt is handled in a non-standard way: the content of the robots.txt file is generated on the fly (via PHP).

Dynamic creation of content for /robots.txt allows for convenient modification through the admin panel, hooks, or SEO plugins.

You can modify the content of robots.txt through two hooks: the robots_txt filter and the do_robotstxt action.

Let's consider both hooks: how they differ and how to use them.

robots_txt

By default, WP 5.5 creates the following content for the /robots.txt page:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: http://example.com/wp-sitemap.xml

See do_robots() — how dynamic creation of the robots.txt file works.
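
Schematically, do_robots() does something like this (a simplified sketch for illustration, not the exact core source; the name do_robots_simplified() is hypothetical, used to avoid clashing with the real function):

<?php
// Simplified sketch of how WordPress builds the /robots.txt response.
function do_robots_simplified() {

	header( 'Content-Type: text/plain; charset=utf-8' );

	// The do_robotstxt action fires first: echo + die here replaces the output entirely.
	do_action( 'do_robotstxt' );

	$output  = "User-agent: *\n";
	$output .= "Disallow: /wp-admin/\n";
	$output .= "Allow: /wp-admin/admin-ajax.php\n";

	// The robots_txt filter lets plugins and themes modify or append rules
	// (WP 5.5+ adds the Sitemap: line for wp-sitemap.xml through this filter).
	echo apply_filters( 'robots_txt', $output, get_option( 'blog_public' ) );
}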

This filter allows you to append to the robots.txt content that WordPress generates. The code can be placed in the theme's functions.php file.

// Append to the basic robots.txt content.
// Priority -1 makes our rules appear before the Sitemap: line for wp-sitemap.xml.
add_filter( 'robots_txt', 'wp_kama_robots_txt_append', -1 );

function wp_kama_robots_txt_append( $output ){

	$str = '
	Disallow: /cgi-bin             # Standard hosting folder.
	Disallow: /?                   # All query parameters on the main page.
	Disallow: *?s=                 # Search.
	Disallow: *&s=                 # Search.
	Disallow: /search              # Search.
	Disallow: /author/             # Author archive.
	Disallow: */embed              # All embeddings.
	Disallow: */page/              # All types of pagination.
	Disallow: */xmlrpc.php         # WordPress API file
	Disallow: *utm*=               # Links with utm tags
	Disallow: *openstat=           # Links with openstat tags
	';

	$str = trim( $str );
	$str = preg_replace( '/^[\t ]+(?!#)/mU', '', $str );
	$output .= "$str\n";

	return $output;
}

As a result, when we visit the page /robots.txt, we see:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /cgi-bin             # Standard hosting folder.
Disallow: /?                   # All query parameters on the main page.
Disallow: *?s=                 # Search.
Disallow: *&s=                 # Search.
Disallow: /search              # Search.
Disallow: /author/             # Author archive.
Disallow: */embed              # All embeddings.
Disallow: */page/              # All types of pagination.
Disallow: */xmlrpc.php         # WordPress API file
Disallow: *utm*=               # Links with utm tags
Disallow: *openstat=           # Links with openstat tags

Sitemap: http://example.com/wp-sitemap.xml

Note that we have added to the native WP data, not replaced it.

do_robotstxt

This hook allows you to completely replace the content of the /robots.txt page.

add_action( 'do_robotstxt', 'wp_kama_robots_txt' );

function wp_kama_robots_txt(){

	$lines = [
		'User-agent: *',
		'Disallow: /wp-admin/',
		'Disallow: /wp-includes/',
		'',
	];

	echo implode( "\r\n", $lines );

	die; // stop execution so WP does not output its default rules
}

Now, by visiting the link http://site.com/robots.txt, we will see:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

Recommendations

Erroneous Recommendations

  • Adding Sitemap after each User-agent
    This is unnecessary. A sitemap should be specified only once anywhere in the robots.txt file.

  • Closing folders wp-content, wp-includes, cache, plugins, themes
    These are outdated requirements. However, I have found such advice even in an article with a grandiose title "The Most Correct Robots for WordPress 2018"! It is better not to close them for Yandex and Google. Or to close them "smartly," as described above (Version 2).

  • Closing tag and category pages
    If your site indeed has a structure where content is duplicated on these pages and they have no special value, it's better to close them. However, often the promotion of the site is also carried out through category and tagging pages. In this case, you may lose some traffic.

  • Specifying Crawl-Delay
    A trendy rule. However, it should only be specified when there is a real need to limit the visits of robots to your site. If the site is small and visits do not create a significant load on the server, then limiting the time "just because" would not be the most sensible idea.

  • Blunders
    Some rules I can only categorize as "blogger didn't think." For example: Disallow: /20 — with such a rule, not only will you close all archives, but also all articles about 20 ways or 200 tips to make the world a better place 🙂

Controversial Recommendations

  • Closing Pagination Pages /page/
    There is no need to do this. The rel="canonical" tag is set up for such pages, so the robot still visits them, and the products/articles listed on them, as well as the internal linking structure, are taken into account.

  • Comments
    Some people advise disallowing the indexing of comments with rules like Disallow: /comments and Disallow: */comment-*.

  • Open the uploads folder only for Googlebot-Image and YandexImages

    User-agent: Googlebot-Image
    Allow: /wp-content/uploads/
    User-agent: YandexImages
    Allow: /wp-content/uploads/

    This advice is rather dubious, because to rank a page a search engine needs information about the images and files placed on it.

Do Not Disallow /wp-admin/admin-ajax.php

/wp-admin/admin-ajax.php is the default WordPress file to which AJAX requests are sent.

Robots analyze the site structure, including CSS files, JS, and AJAX requests.

admin-ajax.php should be allowed for indexing because it may be used by plugins and themes to load content, and robots should be able to load and index such content if necessary.

The correct way is:

Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Do Not Disallow /wp-includes/

Disallow: /wp-includes/

With the arrival of the Panda 4 algorithm, Google started to see sites the same way as users, including CSS and JavaScript.

Many sites still use old rules that block /wp-includes/, even though style and script files loaded on the front end are often located there. For example, files like:

/wp-includes/css/dist/block-library/style.min.css
/wp-includes/js/wp-embed.min.js

These files are required for the site to render correctly. If they are blocked, Google will see the site differently from how visitors see it.

Do Not Disallow Feeds: */feed

Disallow: */feed

Open feeds are needed, for example, by Yandex Zen when connecting a site to a channel, and they may be needed by other services as well.

Feeds are served with their own format declared in the response headers, which allows search engines to understand that this is not an HTML page but a feed, and to process it accordingly.
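
For example, the standard WordPress RSS feed is typically served with a header like:

Content-Type: application/rss+xml; charset=UTF-8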

Non-Standard Directives

Clean-param

Google does not understand this directive. It tells the robot that page URLs contain GET parameters that should not be taken into account when indexing. These can be session identifiers, user IDs, UTM tags - anything that does not affect the content of the page.

Fill in the Clean-param directive as fully as possible and keep it up to date. A new parameter that does not affect the page's content can lead to the appearance of duplicate pages, which should not appear in the search results. Because of the large number of such pages, the robot will crawl the site more slowly. This means that important changes will take longer to appear in search results. By using this directive, the Yandex robot will not repeatedly reload duplicate information. This will increase the efficiency of crawling your site and reduce the server load.

For example, if there are pages on the site where the ref parameter is used only to track which resource the request came from and does not change the content, all three addresses will show the same page:

example.com/dir/bookname?ref=site_1
example.com/dir/bookname?ref=site_2
example.com/dir/bookname?ref=site_3

If you specify the directive as follows:

User-agent: Yandex
Clean-param: ref /dir/bookname

then the Yandex robot will consolidate all page addresses into one:

example.com/dir/bookname

An example of clearing multiple parameters at once: ref and sort:

Clean-param: ref&sort /dir/bookname

Clean-param is a cross-section directive, meaning it can be specified anywhere in the robots.txt file regardless of the User-agent sections. If several such directives are specified, the robot will take all of them into account.

Crawl-delay (deprecated)

User-agent: Yandex
Disallow: /wp-admin
Disallow: /wp-includes
Crawl-delay: 1.5

User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Allow: /wp-*.gif

Google does not understand this directive. Timeout for its robots can be set in the webmaster panel.

Yandex has stopped considering Crawl-delay

More details from Yandex's announcement that it has stopped considering Crawl-delay:

Analyzing the letters received over the past two years to our support regarding indexing, we found out that one of the main reasons for slow document downloading is incorrectly configured Crawl-delay directive in robots.txt [...] In order to relieve site owners from worrying about this and to ensure that all really necessary pages of sites appear and are updated in search results quickly, we decided to stop considering the Crawl-delay directive.

Why the Crawl-delay directive was needed

When a robot crawls a site too aggressively and creates excessive load on the server, you can ask it to "slow down". The Crawl-delay directive was used for this: it specifies the time in seconds that the robot should wait before crawling each subsequent page of the site.

Host (deprecated)

Google has never supported the Host directive, and Yandex has dropped it completely. Host can safely be removed from robots.txt. Instead of Host, set up a 301 redirect from all site mirrors to the main site, as in the sketch below.
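
For example, a minimal redirect from the www mirror to the main domain on Apache might look like this (a sketch assuming example.com is the main mirror and the site runs over https; on nginx the same is done with a separate server block and return 301):

RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ https://example.com/$1 [R=301,L]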

Read more on Yandex's site.

Supported directives from Google.

Conclusion

It is important to remember that changes to robots.txt on an already active site will only be noticeable after several months (2-3 months).

There are rumors that Google may sometimes ignore the rules in robots.txt and index a page if it considers it very unique and useful and decides it simply must be in the index. Other rumors refute this hypothesis, blaming incorrect robots.txt code instead. I am more inclined towards the latter.