
Creating robots.txt for WordPress. Correct setup for the WooCommerce plugin

Hello everyone! Today's article is about what a correct robots.txt file for WordPress should look like. We covered its functions and purpose a few days ago, and now we will analyze a specific example for WordPress.

With this file, we can set the basic indexing rules for various search engines, as well as assign access rights to individual search bots. Using an example, I'll show you how to write a correct robots.txt for WordPress, based on the two main search engines: Yandex and Google.

In narrow circles of webmasters, you may come across the opinion that a separate section should be created for Yandex, addressed via User-agent: Yandex. Let's figure out together what this belief is based on.

Yandex supports the Clean-param and Host directives, which Google does not know about and does not use when crawling.

It is reasonable to use them only for Yandex, but there is a nuance: these are cross-sectional directives that can be placed anywhere in the file, and Google will simply ignore them. So if the indexing rules are the same for both search engines, it is quite enough to use User-agent: * for all search robots.

When addressing robots by User-agent, it is important to remember that the file is read and processed from top to bottom; therefore, if you use User-agent: Yandex or User-agent: Googlebot, those sections should be placed at the beginning of the file.
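
For illustration, a minimal sketch (the rules themselves are placeholders, not a recommendation) with a Yandex-specific section placed first and a general section for all other robots:

User-agent: Yandex
Disallow: /wp-admin/
Clean-param: utm_source&utm_medium

User-agent: *
Disallow: /wp-admin/

Sitemap: https://example.com/sitemap.xml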

Robots.txt example for WordPress

I want to warn you right away: there is no perfect file that will suit absolutely all sites running on WordPress! Don't blindly copy the contents of a file without analyzing whether it fits your particular case! Much depends on the chosen permalink settings, the structure of the site, and even the installed plugins. I'm looking at an example where pretty permalinks (human-readable URLs) of the form /%postname%/ are used.

Like any content management system, WordPress has its own administrative resources, service directories, and so on that should not end up in the search engine index. To protect such pages from access, we prohibit their indexing in this file with the following lines:

Disallow: /cgi-bin
Disallow: /wp-

The directive on the second line will block access to all directories starting with /wp-, including:

  • wp-admin
  • wp-content
  • wp-includes

But images are uploaded by default to the uploads folder, which sits inside the wp-content directory. Let's allow it to be indexed with the line:

Allow: */uploads

We have closed off the service files; now let's eliminate duplicates of the main content, which dilute uniqueness within the domain and increase the likelihood that search engines will apply a filter to the site. Duplicates include category pages, author pages, tags, RSS feeds, as well as pagination, trackbacks, and individual comment pages. Be sure to disable their indexing:

Disallow: /category/
Disallow: /author/
Disallow: /page/
Disallow: /tag/
Disallow: */feed/
Disallow: */trackback
Disallow: */comments

Further, I would like to pay special attention to URL parameters. If you use pretty permalinks, then pages containing question marks in the URL are often "superfluous" and again duplicate the main content. Such parameter pages should be disabled in the same way:

Disallow: */?

This rule applies to plain permalinks like ?p=1, search query pages like ?s=, and other parameters. Another problem can be archive pages that contain the year and month in the URL. They are very easy to close using the mask 20*, thereby prohibiting the indexing of archives by year:

Disallow: /20*

To speed up and complete indexing, add the path to the XML sitemap. The robot will process the file, and the next time it visits the site it will use it to crawl pages with priority.

Sitemap: https://site/sitemap.xml

In the robots.txt file, you can place additional information for robots that improves the quality of indexing. Among it is the Host directive, which points Yandex to the main site mirror:

Host: webliberty.ru

If the site runs over HTTPS, the protocol must be specified:

Host: https://webliberty.ru

Since March 20, 2018, Yandex has officially stopped supporting the Host directive. It can be removed from robots.txt, and if it is left in, the robot simply ignores it.

To summarize, I combined all of the above and got the contents of a robots.txt file for WordPress that I have been using for several years, and there are no duplicates in the index:

User-agent: *
Disallow: /cgi-bin
Disallow: /wp-
Disallow: /category/
Disallow: /author/
Disallow: /page/
Disallow: /tag/
Disallow: */feed/
Disallow: /20*
Disallow: */trackback
Disallow: */comments
Disallow: */?
Allow: */uploads

Sitemap: https://site/sitemap.xml

Constantly monitor indexing progress and correct the file in time if duplicates appear.

A lot depends on whether the file is composed correctly, so pay special attention to it so that search engines index the site quickly and efficiently. If you have any questions, ask; I will be happy to answer!

Hello, dear readers! This is the Anatomy of Business project and webmaster Alexander. We continue the series of articles in the manual "How to create a site on WordPress and make money on it", and today we will talk about how to create a robots.txt file for WordPress and why this file is needed.

In the past 16 lessons, we have covered a huge amount of material. Our site is almost ready to start filling it with interesting content and SEO optimization.

So, let's get down to business!

Why does a site need a robots.txt file?

The main value on our site will be the content, but in addition to it, the site has a whole bunch of technical sections or pages that are not valuable for a search robot.

These sections include:
- the admin panel
- the site search
- comments, which you may want to close from indexing
- duplicate pages that have the same parameters in their URLs

In general, robots.txt is designed to prevent the search robot from indexing certain pages.
At one time, this picture helped me a lot in understanding how robots.txt works:

As we can see, the first thing a crawler does when it comes to a site is look for this particular file! After analyzing it, it understands which directories it may enter and which it may not.

Many novice webmasters neglect this file, but in vain! Your site's position in the search engine depends on how "clean" its indexing is.

An example of writing a robots.txt file for WordPress

Let's now figure out how to write this file. There is nothing complicated here: to write it, we just need to open the plain text editor Notepad, or you can use a professional editor like Notepad++.
Enter the following data into the editor:

User-agent: Yandex
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-comments
Disallow: /wp-content/plugins
Disallow: /wp-content/themes
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: */trackback
Disallow: */feed
Disallow: /cgi-bin
Disallow: *?s=
Host: site.ru

User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-comments
Disallow: /wp-content/plugins
Disallow: /wp-content/themes
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: */trackback
Disallow: */feed
Disallow: /cgi-bin
Disallow: *?s=

Sitemap: http://site.ru/sitemap.xml

Now let's deal with all this.

The first thing to notice is that the file is divided into two large blocks.
And at the beginning of each block is the "User-agent" directive, which indicates which robot the block is meant for.
Our first block was made for Yandex robots, as indicated by the line "User-agent: Yandex".

The second block says that it is for all other robots. This is indicated by the asterisk "User-agent: *".

The "Disallow" directory sets which sections are prohibited from being indexed.

Now let's break it down into sections:

/wp-admin - blocks indexing of the admin panel

/wp-includes - blocks indexing of the WordPress engine's system folders

/wp-comments - blocks indexing of comments

/wp-content/plugins - blocks indexing of the folder with WordPress plugins

/wp-content/themes - blocks indexing of the folder with WordPress themes

/wp-login.php - blocks the site login form from indexing

/wp-register.php - closes the registration form from the robot

*/trackback - blocks indexing of trackback URLs

*/feed - blocks indexing of the blog's RSS feed

/cgi-bin - blocks indexing of the script directory on the server

*?s= - blocks indexing of all URLs that contain ?s=

And at the very end of robots.txt we show the robot where the sitemap.xml file is located:

Sitemap: http://site.ru/sitemap.xml

After the file is ready, save it in the root directory of the site.
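
For orientation, a rough sketch of where the file ends up (the public_html folder name is just an assumption; your hosting may call the web root differently):

public_html/
    wp-admin/
    wp-content/
    wp-includes/
    wp-config.php
    robots.txt    <- the file goes here, next to wp-config.php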

How to close some categories from indexing?

For example, you may not want to show a certain section of your site to search robots. The reasons for this can vary: for example, you may want your Diary section to be read only by regular visitors to the site.

The first thing we need to do is find out the URL of this category. Most likely it will be /my-dnevnik.

In order to close this section, we just need to add the following line to the file: Disallow: /my-dnevnik
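
For clarity, an abbreviated sketch (keeping the /my-dnevnik slug from the example above) of how the line sits inside the block for all robots:

User-agent: *
Disallow: /wp-admin
Disallow: /my-dnevnik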

Robots.txt - when to expect the effect?

I can say from personal experience that you should not expect all the sections you closed to leave the index with the very next update. Sometimes this process takes up to two months. Just be patient.

It should also be taken into account that Google's robots may simply ignore this file if they consider a page very unique and interesting.

Something to remember ALWAYS!

Of course, the technical component is not unimportant, but first of all, you need to focus on useful and interesting content, for which regular readers of your project will return! It is the emphasis on quality that will make your resource in demand and popular.

Good luck with your internet business

Robots.txt was created to regulate the behavior of search robots on sites: where they may go and take pages into the search index, and where they may not. About 10 years ago the power of this file was great, and all search engines worked according to its rules, but now it is more of a recommendation than a rule.

But until it is retired, webmasters must create it and configure it correctly based on the structure and hierarchy of their sites. WordPress is a separate topic, because the CMS contains many elements that do not need to be crawled and sent to the index. Let's figure out how to compose robots.txt correctly.

Where is the robots.txt file in WordPress

On any resource, robots.txt must be in the root folder; in the case of WordPress, that is the directory containing wp-admin and the like.

Server location

If it was not created and uploaded by the site administrator, it will not be found on the server by default: a standard WordPress installation does not ship with such a file.

How to create a correct robots.txt

Creating a correct robots.txt is not a difficult task; it is harder to write the correct directives in it. First, create the document: open Notepad and click Save As.


Saving the document

In the next window, set the name to robots, leave the txt extension and ANSI encoding, and click Save. The object will appear in the folder where it was saved. For now the document is empty; let's see what directives it can contain.

If you wish, you can immediately upload it to the site root on the server via FTP.


Saving the robot

Command settings

I will highlight four main commands:

  • User-agent: shows which crawlers the rules apply to, either all or a specific one
  • Disallow: denies access
  • Allow: allows access
  • Sitemap: the address of the XML sitemap

Outdated and unnecessary configurations:

  1. Host: indicates the main mirror; it became unnecessary because the search engine itself determines the correct option
  2. Crawl-delay: sets a delay between the robot's requests; servers are powerful now, so you don't need to worry about performance
  3. Clean-param: limits the crawling of duplicate parameter URLs; you can specify it, but there is little point, since the search engine will still index everything on the site and take the maximum number of pages (see the syntax sketch after this list)
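
For reference, a minimal sketch of the Clean-param syntax mentioned above (the parameter names and path are hypothetical):

Clean-param: utm_source&utm_medium /blog/

This tells Yandex to treat URLs under /blog/ that differ only in these GET parameters as the same page.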

A working example for WordPress

The fact is that the search robot does not like prohibitive directives and will still take into account what it needs. Only objects that definitely should not appear in the search results and in the Yandex and Google databases should be blocked from indexing. We place this working example in robots.txt:

User-agent: *
Disallow: /wp-
Disallow: /tag/
Disallow: */trackback
Disallow: */page
Disallow: /author/*
Disallow: /template.html
Disallow: /readme.html
Disallow: *?replytocom
Allow: */uploads
Allow: *.js
Allow: *.css
Allow: *.png
Allow: *.gif
Allow: *.jpg

Sitemap: https://yourdomain/sitemap.xml

Let's go through the text and see what exactly we allowed and what we forbade:

  • User-agent: we put the * sign, thereby indicating that all search engines must obey the rules
  • The block with Disallow prohibits all technical pages and duplicates from being indexed. Note that I have blocked folders starting with /wp-
  • The Allow block permits crawling of scripts, images, and CSS files; this is necessary for the correct presentation of the project in search, otherwise the search engine will see an unstyled wall of text
  • Sitemap: shows the path to the XML sitemap; you definitely need to create it and replace "yourdomain" with your own domain

I recommend not adding any other directives. After saving the changes, upload the standard robots.txt to the WordPress root. To check its availability, open https://your-domain/robots.txt, replacing the domain with your own; it should be displayed like this.


Address in query string

How to check if robots.txt works

The standard way is to check through the Yandex.Webmaster service. For better analysis, you need to register and add the site to the service. At the top we see the loaded robots.txt; click Check.


Checking the document in Yandex

A block with errors will appear below. If there are none, go to the next step; if a command is flagged as incorrect, fix it and check again.


No errors in the validator

Let's check that Yandex processes the commands correctly: go down a little lower, enter two addresses, one forbidden and one allowed, and don't forget to click Check. In the picture, we see that the instructions worked: the red mark indicates that access is prohibited, and the green checkmark indicates that indexing of posts is allowed.


Checking folders and pages in Yandex

We checked, everything works, let's move on to the next method - setting up robots using plugins. If the process is not clear, then watch our video.

The Virtual Robots.txt generator plugin

If you don't want to deal with an FTP connection, a great WordPress generator plugin called Virtual Robots.txt comes to the rescue. Install it in the standard way from the WordPress admin panel, by search or by uploading the archive; it looks like this.


What does Virtual Robots.txt look like?

Go to Settings > Virtual Robots.txt; we see a familiar configuration, but we need to replace it with ours from this article. Copy, paste, and don't forget to save.


Configuring Virtual Robots.txt

The robots.txt will be created automatically and be available at the same address. If you look for it in the WordPress files, you won't see anything, because the document is virtual and can only be edited from the plugin, but it is visible to Yandex and Google.

Add with Yoast SEO

The famous Yoast SEO plugin provides the ability to add and modify robots.txt from the WordPress dashboard. Moreover, the created file appears on the server (not virtually) and is located in the root of the site; that is, it remains even after the plugin is deleted or deactivated. Go to Tools > Editor.


Yoast SEO File Editor

If robots.txt already exists, it will be displayed on the page; if not, there is a "create" button: click it.


Create robots button

A text area will appear; paste in the universal configuration from above and save it. You can check over an FTP connection: the document will appear.

Editing via a module in All in One SEO

The old All in One SEO plugin can change robots.txt. To activate this ability, go to the modules section, find the item of the same name, and click Activate.


Modules in All In one Seo

A new section will appear in the All in One SEO menu; go into it and you will see the builder's functionality.


Working in the AIOS module
  1. Write down the name of the agent, in our case *, or leave it empty
  2. Allow or disallow indexing
  3. The directory or page the rule applies to
  4. The result

The module is not convenient; it is difficult to create a valid and correct robots.txt this way. Better to use other tools.

Correct setup for WooCommerce plugin

To set up robots.txt correctly for the WooCommerce e-commerce plugin, add these lines to the rest:

Disallow: /cart/
Disallow: /checkout/
Disallow: /*add-to-cart=*

We do the same steps as before and upload the file to the server via FTP or a plugin.
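
For clarity, a sketch of what the combined file might look like, using the working example above as the base (replace yourdomain with your own):

User-agent: *
Disallow: /wp-
Disallow: /tag/
Disallow: */trackback
Disallow: */page
Disallow: /author/*
Disallow: /template.html
Disallow: /readme.html
Disallow: *?replytocom
Disallow: /cart/
Disallow: /checkout/
Disallow: /*add-to-cart=*
Allow: */uploads
Allow: *.js
Allow: *.css
Allow: *.png
Allow: *.gif
Allow: *.jpg

Sitemap: https://yourdomain/sitemap.xml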

Outcome

Let's summarize what needs to be done so that a WordPress site has the correct file for search engines:

  • Create the file manually or using a plugin
  • Write the directives from this article into it
  • Upload it to the server
  • Check it in the Yandex validator
  • Do not use online robots.txt generators; do the work by hand

Improve your WordPress blogs, keep moving forward and set all the parameters correctly, and we will help you with this. Good luck!

From the author: one of the files that search engines use when indexing your site is robots.txt. It is not hard to tell from the name that the file is meant for robots. Indeed, this file allows you to tell the search robot what may be indexed on your site and what you do not want to see in the search index. So let's see how to set up robots.txt for a WordPress site.

There are many articles on the web on this topic. In almost every one of them, you can find a version of the robots.txt file that you can take and use with almost no edits on your WordPress site. I will not rewrite yet another such option here, since there is not much point: you can easily find them all on the net. Instead, we will simply analyze how to create robots.txt for WordPress and what minimum set of rules should be in it.

Let's start with where the robots.txt file should be located and what to write in it. This file, like the sitemap.xml file, must be located at the root of your site, i.e. it should be available at http://site/robots.txt

Try opening this address, replacing the word site with your site's address. You may see something like this:

Or you may see this picture instead:

A strange situation, you say. Indeed, the address is the same, but in the first case the file is available and in the second it is not. Meanwhile, if you look in the root of the site, you will not find any robots.txt file there. How so, and where is robots.txt in WordPress?

It all comes down to one simple setting: pretty permalinks. If pretty permalinks are enabled on your site, you will see a robots.txt dynamically generated by the engine. Otherwise, a 404 error is returned.

Turn on pretty permalinks in the Settings > Permalinks menu by selecting the Post name option. Save the changes; now the robots.txt file will be dynamically generated by the engine.

As you can see in the first figure, this file uses directives that set certain rules, namely allowing or prohibiting the indexing of something at a given address. As you might guess, the Disallow directive prohibits indexing; in this case, the entire contents of the wp-admin folder. The Allow directive, in turn, allows indexing; in my case, indexing of the admin-ajax.php file from the wp-admin folder forbidden above is allowed.
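
For reference, the engine-generated file typically looks something like this (recent WordPress versions may also append a Sitemap line pointing to wp-sitemap.xml):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php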

In general, this file is of course of no use to search engines, and I can't even imagine why WordPress added this rule. Well, it doesn't hurt anything, in principle.

By the way, I specifically added the phrase "in my case" above, since in your case the contents of robots.txt may differ. For example, indexing of the wp-includes folder may be prohibited.

In addition to the Disallow and Allow directives, we see the User-agent directive in robots.txt, with an asterisk specified as its value. The asterisk means that the following set of rules applies to all search robots. You can also specify the names of specific search robots instead of the asterisk. The robots.txt file supports other directives as well; I will not dwell on them, as they are all described with examples in the Google and Yandex webmaster documentation. You can also read the information on this site.

How to create robots.txt for WordPress

So, we have a file for search robots, but it is likely that it will not suit you in its current form. How do you create your own? There are several options. Let's start with the first: creating the file manually. Create an ordinary text document in Notepad and save it as robots with the .txt extension. Write the necessary set of rules in this file and simply upload it to the root of your WordPress site, next to the wp-config.php configuration file.

Just in case, check that the file has been uploaded and is available by opening it in the browser. That was the first way. The second way is the same dynamic file generation, only now a plugin will do it. If you use the popular All in One SEO plugin, you can use one of its modules.

This article presents what is, in my opinion, the best code for the WordPress robots.txt file, which you can use on your sites.

To begin with, let's remember why robots.txt is needed: the robots.txt file is needed exclusively by search robots, to "tell" them which sections and pages of the site to visit and which they should not visit. Pages that are closed to visits will not be indexed by search engines (Yandex, Google, etc.).

Option 1: Optimal robots.txt code for WordPress

User-agent: *
Disallow: /cgi-bin          # classic...
Disallow: /?                # all query parameters on the main page
Disallow: /wp-              # all WP files: /wp-json/, /wp-includes, /wp-content/plugins
Disallow: *?s=              # search
Disallow: *&s=              # search
Disallow: /search           # search
Disallow: /author/          # author archive
Disallow: */embed           # all embeds
Disallow: */page/           # all types of pagination
Allow: */uploads            # open uploads
Allow: /*/*.js              # inside /wp- (/*/ - for priority)
Allow: /*/*.css             # inside /wp- (/*/ - for priority)
Allow: /wp-*.png            # images in plugins, cache folder, etc.
Allow: /wp-*.jpg            # images in plugins, cache folder, etc.
Allow: /wp-*.jpeg           # images in plugins, cache folder, etc.
Allow: /wp-*.gif            # images in plugins, cache folder, etc.
Allow: /wp-*.svg            # images in plugins, cache folder, etc.
Allow: /wp-*.pdf            # files in plugins, cache folder, etc.
Allow: /wp-admin/admin-ajax.php
#Disallow: /wp/             # when WP is installed in the wp subdirectory

Sitemap: http://example.com/sitemap.xml
Sitemap: http://example.com/sitemap2.xml      # another file
#Sitemap: http://example.com/sitemap.xml.gz   # compressed version (.gz)

# Code version: 1.1
# Don't forget to change `example.com` to your site.

Code parsing:

    In the User-agent: * line, we indicate that all of the following rules will apply to all crawlers. If you want the rules to apply only to one specific robot, then instead of * specify the robot's name (User-agent: Yandex, User-agent: Googlebot).

    In the Allow: */uploads line, we intentionally allow pages that contain /uploads to be indexed. This rule is mandatory because above we forbid indexing of pages starting with /wp-, and /wp- is part of /wp-content/uploads. Therefore, to override the Disallow: /wp- rule you need the line Allow: */uploads, since links like /wp-content/uploads/... may lead to images that should be indexed, as well as uploaded files that there is no need to hide. Allow: can come either "before" or "after" Disallow:.

    The rest of the lines prevent robots from "walking" on links that start with:

    • Disallow: /cgi-bin - closes the script directory on the server
    • Disallow: /feed - closes the blog's RSS feed
    • Disallow: /trackback - closes trackback notifications
    • Disallow: ?s= or Disallow: *?s= - close search pages
    • Disallow: */page/ - closes all types of pagination
    The Sitemap: http://example.com/sitemap.xml rule points the robot to an XML sitemap file. If you have such a file on your site, write the full path to it. There can be several such files; in that case we specify the path to each one separately.

    In the Host: site.ru line, we indicate the main mirror of the site. If the site has mirrors (copies of the site on other domains), then for Yandex to index them all equally, you need to specify the main mirror. Only Yandex understands the Host: directive; Google does not! If the site works over the https protocol, then it must be specified in Host: Host: https://example.com

    From the Yandex documentation: "Host is an independent directive and works anywhere in the file (cross-sectional)". Therefore, we put it at the top or at the very end of the file, separated by an empty line.

Note that open feeds are required, for example, for Yandex Zen, when you need to connect the site to the channel (thanks to the commenter Digital). Open feeds may be needed elsewhere as well.

At the same time, feeds have their own format in the response headers, thanks to which search engines understand that it is not an HTML page but a feed, and obviously handle it differently.

The Host directive is no longer needed for Yandex

Yandex has completely abandoned the Host directive; it has been replaced by 301 redirects. Host can be safely removed from robots.txt. However, it is important that all site mirrors have a 301 redirect to the main site (the main mirror).

This is important: sorting rules before processing

Yandex and Google do not process the Allow and Disallow directives in the order in which they are specified; they first sort them from the shortest rule to the longest, and then process the last matching rule:

User-agent: *
Allow: */uploads
Disallow: /wp-

will be read as:

User-agent: *
Disallow: /wp-
Allow: */uploads

To quickly understand and apply the sorting feature, remember this rule: “the longer the rule in robots.txt, the more priority it has. If the length of the rules is the same, then the Allow directive takes precedence.”
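
A worked example with a hypothetical URL /wp-content/uploads/photo.png and the rules above:

Disallow: /wp-      # 4 characters long, matches the URL
Allow: */uploads    # 9 characters long, also matches

The Allow rule is longer, so it takes precedence and the image stays open for indexing.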

Option 2: Standard robots.txt for WordPress

I don't know about anyone else, but I am for the first option! It is more logical: there is no need to completely duplicate a section just to specify the Host directive for Yandex, which is cross-sectional (the robot understands it anywhere in the file, without specifying which robot it refers to). As for the non-standard Allow directive, it works for Yandex and Google, and if it fails to open the uploads folder for other robots that do not understand it, in 99% of cases this will not cause anything dangerous. I haven't yet seen the first robots.txt fail to work as it should.

The above code is slightly incorrect. Thanks to the commenter "" for pointing out the inaccuracy, though I had to figure out what it was myself. Here is what I came up with (I could be wrong):

  1. Some robots (not Yandex and Google) do not understand more than 2 directives: User-agent: and Disallow:

  2. The Yandex directive Host: should be placed after Disallow:, because some robots (not Yandex and Google) may not understand it and may reject the robots.txt altogether. Judging by the documentation, Yandex itself does not care where and how Host: is used, even if you create a robots.txt with only the single line Host: www.site.ru in order to glue all the site's mirrors together.

  3. Sitemap: is a cross-sectional directive for Yandex and Google, and apparently for many other robots too, so we write it at the end, separated by an empty line, and it will work for all robots at once.

Based on these amendments, the correct code should look like this:

User-agent: Yandex
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-json/
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: */embed
Disallow: */page/
Disallow: /cgi-bin
Disallow: *?s=
Allow: /wp-admin/admin-ajax.php
Host: site.ru

User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-json/
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: */embed
Disallow: */page/
Disallow: /cgi-bin
Disallow: *?s=
Allow: /wp-admin/admin-ajax.php

Sitemap: http://example.com/sitemap.xml

Adding rules of our own

If you need to prohibit more pages or groups of pages, you can add a Disallow: rule (directive) below. For example, if we need to close all posts in the news category from indexing, then before Sitemap: we add the rule:

Disallow: /news

It prevents robots from following links like this:

  • http://example.com/news
  • http://example.com/news/drugoe-name/

If you need to close any occurrence of /news, then we write:

Disallow: */news

This rule blocks links such as:

  • http://example.com/news
  • http://example.com/my/news/drugoe-name/
  • http://example.com/category/newsletter-name.html

You can learn more about the robots.txt directives on the Yandex help page (but keep in mind that not all the rules that are described there work for Google).

Robots.txt check and documentation

You can check if the prescribed rules are working correctly at the following links:

  • Yandex: http://webmaster.yandex.ru/robots.xml .
  • At Google, this is done in search console. You need authorization and the presence of the site in the webmaster panel...
  • Service for creating a robots.txt file: http://pr-cy.ru/robots/
  • Service for generating and checking robots.txt: https://seolib.ru/tools/generate/robots/

I asked Yandex...

I asked a question in Yandex technical support about the cross-sectional use of the Host and Sitemap directives:

Question:

Hello!
I am writing an article about robots.txt on my blog. I would like to get an answer to such a question (I did not find an unambiguous "yes" in the documentation):

If I need to glue all the mirrors, and for this I use the Host directive at the very beginning of the robots.txt file:

Host: site.ru

User-agent: *
Disallow: /asd

Will Host: site.ru work correctly in this example? Will it indicate to robots that site.ru is the main mirror? That is, I am using this directive not inside a section but separately (at the beginning of the file), without specifying which User-agent it refers to.

I also wanted to know whether the Sitemap directive must be used inside a section, or whether it can be used outside one: for example, after a section, separated by an empty line?

User-agent: Yandex
Disallow: /asd

User-agent: *
Disallow: /asd

Sitemap: http://example.com/sitemap.xml

Will the robot understand the Sitemap directive in this example?

I hope to get an answer from you that will put an end to my doubts.

Answer:

Hello!

The Host and Sitemap directives are cross-sectional, so they will be used by the robot regardless of where they are specified in the robots.txt file.

--
Sincerely, Platon Schukin
Yandex Support

Conclusion

It is important to remember that changes to robots.txt on an already working site become noticeable only after a few months (2-3 months).

Rumor has it that Google can sometimes ignore the rules in robots.txt and take a page into the index if it considers the page very unique and useful and decides it simply has to be in the index. However, other rumors refute this hypothesis, saying that inexperienced optimizers may specify the rules in robots.txt incorrectly and thereby close the desired pages from indexing while leaving unnecessary ones open. I lean more towards the second suggestion...

Dynamic robots.txt

In WordPress, a request for the robots.txt file is handled separately, and it is not necessary to physically create a robots.txt file in the root of the site. Moreover, it is not recommended, because with that approach it becomes very difficult for plugins to change this file, and this is sometimes necessary.

Read how the dynamic creation of the robots.txt file works in the description of the function, and below I will give an example of how you can change the content of this file on the fly, through a hook.

To do this, add the following code to your functions.php file:

Add_action("do_robotstxt", "my_robotstxt"); function my_robotstxt()( $lines = [ "User-agent: *", "Disallow: /wp-admin/", "Disallow: /wp-includes/", "", ]; echo implode("\r\n ", $lines); die; // terminate PHP )

As a result, the /robots.txt page will show:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
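
As an alternative (a sketch of mine, not from the original article), the same virtual file can also be modified through the robots_txt filter; the /secret-dir/ rule here is a hypothetical example:

add_filter( 'robots_txt', 'my_robots_txt_extra', 10, 2 );

function my_robots_txt_extra( $output, $public ) {
    // Append an extra rule only when the site is open to search engines.
    if ( $public ) {
        $output .= "Disallow: /secret-dir/\r\n";
    }
    return $output;
}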

Crawl-delay - timeout for crazy robots (not taken into account since 2018)

Yandex

After analyzing letters to our indexing support over the last two years, we found out that one of the main reasons for slow document downloading is an incorrectly configured Crawl-delay directive in robots.txt [...] So that site owners no longer have to worry about this, and so that all the really necessary pages of sites appear in the search and are updated quickly, we decided to stop taking the Crawl-delay directive into account.

When the Yandex robot crawls the site like crazy, it creates an unnecessary load on the server. The robot can be asked to "slow down".

To do this, you need to use the Crawl-delay directive. It specifies the time in seconds that the robot should idle (wait) to crawl each next page of the site.

For compatibility with robots that do not fully follow the robots.txt standard, Crawl-delay must be specified in the group (in the User-agent section) immediately after the Disallow and Allow directives.

The Yandex robot understands fractional values, for example, 0.5 (half a second). This does not guarantee that the crawler will visit your site every half a second, but it allows you to speed up the crawl of the site.

User-agent: Yandex
Disallow: /wp-admin
Disallow: /wp-includes
Crawl-delay: 1.5    # 1.5 second timeout

User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Allow: /wp-*.gif
Crawl-delay: 2      # 2 second timeout

Google

Googlebot does not understand the Crawl-delay directive. The timeout for its robots can be specified in the webmaster's panel.

