Information updated on February 14, 2023
Blocking page indexing in Google and other search engines is needed when a certain page or site should not appear in a search engine's index. This may be necessary for several reasons.
Below, I will present several ways to disable page indexing and describe how to implement them. You will also learn how to check whether a page on your own site is blocked from indexing, and how to check pages on external sites, for example, whether your backlinks are indexable.
The noindex directive is a special rule that prevents Google and other search engines from indexing a page. This rule does not prevent the page from being crawled (read more about the difference between indexing and crawling in the dictionary).
The noindex rule can be implemented in two ways: with a special noindex meta tag (#1.1), which must be placed in the <head>, or with an HTTP response header (#1.2).
#1.1 Using the meta tag is the easier option. The code looks like this:
a) Disable indexing for all search engine bots:
<meta name="robots" content="noindex" />
b) Disable indexing for Googlebot:
<meta name="googlebot" content="noindex" />
You can disable indexing for any bot; you just need to find the bot's name in the help documentation of that search engine or crawler.
In most CMSs, the noindex tag can be enabled through settings in the admin interface or with plugins. For example, in WordPress with Yoast enabled, open the post edit page, scroll to the Yoast SEO Premium block, open the Advanced section, and select "No" for the question "Allow search engines to show the Page in search results?".
#1.2 HTTP response header
The noindex rule can also be sent as part of the HTTP response, in a header. It looks like this:
HTTP/1.1 200 OK
(...)
X-Robots-Tag: noindex
(...)
This method suits those who know how to work with server configuration, since that is where the response headers are set.
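If your site is served by a Python application rather than configured directly on Apache or nginx, the same header can be added in the application code. Below is a minimal sketch assuming a Flask app (Flask and the route name are my assumptions for illustration, not part of any particular setup); on a classic web server the header is set in the server configuration instead.

# Minimal sketch, assuming a Flask-served site: send "X-Robots-Tag: noindex"
# with every response so that search engines do not index these pages.
from flask import Flask

app = Flask(__name__)

@app.route("/members-only")
def members_only():
    # Hypothetical page that should stay out of the index.
    return "Content for registered users only"

@app.after_request
def add_noindex_header(response):
    # Attach the noindex rule as an HTTP response header.
    response.headers["X-Robots-Tag"] = "noindex"
    return response

if __name__ == "__main__":
    app.run()

You can verify the result by checking the page's response headers, for example in the browser's developer tools.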
The noindex method is ideal when you need to permanently block indexing of a technical page, or keep pages intended only for registered, logged-in users out of the index. Read more about the noindex directive in Google's documentation.
With special rules on the server or on the CDN, you can prevent Googlebot or any other bot from crawling and indexing a site or page. If the rule is written correctly, bots will receive a 403 (access denied) response from the server.
For example, this is what a crawl ban (and, as a result, an indexing ban) for Googlebot looks like on Cloudflare in the Firewall section. A similar rule can be created in the server configuration.
When you check a URL covered by such a ban, you will see a 403 server response (that is, access to the page is denied).
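A quick way to test such a rule is to request the page with a Googlebot-like User-Agent and look at the status code. Here is a sketch using Python and the requests library (the URL is a placeholder). Keep in mind that a rule matching Cloudflare's verified "Known Bots" cannot be reproduced with a spoofed User-Agent; in that case, use the Test Live URL tool in Search Console instead.

# Sketch: request a URL with a Googlebot-like User-Agent and print the status.
# A correctly configured user-agent block should answer with 403.
import requests

URL = "https://example.com/private-page"  # placeholder URL
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
}

response = requests.get(URL, headers=HEADERS, timeout=10)
print(response.status_code)  # expect 403 if the block works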
This way of disabling indexing is suitable for new sites that are still in development and should not yet be crawled or indexed. Note! Denying bots access prevents the page not only from getting into the index, but also from being crawled at all. A crawl ban in the robots.txt file works in a similar way, with the difference that URLs blocked in robots.txt can still end up in search results if, for example, many links from other sources point to them. With new sites completely closed to bots on the server or CDN, this usually does not happen.
There is a difference between crawling and indexing, but the first process always precedes the second. Therefore, in theory, a crawl ban rules out getting into search results, because the bot cannot crawl the page and understand what it is about.
A crawl ban in robots.txt is set by adding rules to the robots.txt file that block crawling of a page, site, subfolder (directory), or file type, like this:
User-agent: *
Disallow: /user/*
Disallow: /news/*
This rule prevents all bots from crawling the /user/ and /news/ directories (subfolders) and any files inside them.
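To check which URLs fall under such rules, you can use the urllib.robotparser module from Python's standard library. A small caveat: the standard parser follows the original robots.txt specification and does not expand the "*" wildcard inside paths, so the sketch below uses plain directory prefixes; the domain and URLs are placeholders.

# Sketch: test URLs against robots.txt rules with the standard library.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse the rules directly (plain prefixes, since this parser ignores "*" in paths).
rp.parse([
    "User-agent: *",
    "Disallow: /user/",
    "Disallow: /news/",
])

print(rp.can_fetch("*", "https://example.com/news/article-1"))   # False: blocked
print(rp.can_fetch("Googlebot", "https://example.com/contact"))  # True: not blocked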
In fact, this method does not work as a way to block indexing. "The robots.txt file is not intended to prevent your content from appearing in Google search results" is a direct quote from Google's help. In other words, if a page closed from crawling is discovered in other ways, for example because external links point to it, it can still end up in the index (and this has happened more than once in my practice).
One more quote for confirmation:
If access to a page is denied in the robots.txt file, it can still be indexed through links from other sites. Google will not directly crawl and index content that is blocked in the robots.txt file. However, if other sites link to such a URL, it can still be found and added to the index. After that, the page may appear in search results (in many cases, along with the anchor text of the link that leads to it). If this does not suit you, we recommend that you protect the files on the server with a password or use the noindex directive in the meta tag or HTTP response header. An alternative solution is to delete the page completely.
In short, I would recommend this method only as a last resort, if you cannot edit the robots meta tag in the head, change the server response, or configure a firewall.
This method is similar to method #2: bots will receive a 403 (access denied) server response. It is ideal for pages intended only for users or subscribers.
To password-protect pages, you can use a content restriction plugin (for WordPress this is, for example, Password Protected). Install and activate it, then go to Settings > Password Protected and enable Password Protected Status. The plugin also gives you finer control by allowing you to whitelist certain IP addresses.
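If the site is not on WordPress, the same idea can be implemented in the application itself. Below is a minimal sketch of HTTP Basic Auth in a Python (Flask) app; the framework, route, and credentials are assumptions for illustration only. Any visitor without valid credentials, including a search engine bot, is denied access, so the content cannot be read or indexed.

# Sketch: password-protect pages with HTTP Basic Auth in a Flask app.
# Unauthenticated visitors (including bots) get a 401 response and never
# see the content, so it cannot end up in the index.
from flask import Flask, Response, request

app = Flask(__name__)

USERNAME = "subscriber"   # hypothetical credentials, for illustration only
PASSWORD = "change-me"

@app.before_request
def require_password():
    auth = request.authorization
    if not auth or auth.username != USERNAME or auth.password != PASSWORD:
        return Response(
            "Access denied",
            status=401,
            headers={"WWW-Authenticate": 'Basic realm="Members only"'},
        )

@app.route("/members-only")
def members_only():
    return "Content for subscribers"

if __name__ == "__main__":
    app.run()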
This is a way to urgently (within about an hour) remove a page from the index. For a site verified in Search Console, the removal tool is located at https://search.google.com/u/0/search-console/removals. There you can request the removal of an entire site or directory.
This method does not disable indexing, but simply temporarily removes pages from the index.
To remove indexed content from a site that is not yours, use https://support.google.com/websearch/answer/6349986 (the tool is linked there). To prevent manipulation by SEOs, requests to remove content from someone else's site are reviewed against certain criteria and are most often not approved. Also, in most cases it is not the page itself that is removed from the index but its outdated content (the cached copy is dropped).
#1 You can test your site's pages in Search Console using the Test Live URL tool. It shows:
#1.1 whether the page is blocked from indexing by a noindex tag or server response:
#1.2 whether Googlebot cannot access the page due to password protection or a firewall (403 server response):
#1.3 whether the page cannot be crawled due to restrictions in the robots.txt file (however, I repeat: such a page can still be indexed):
#2 You can check pages for indexability in bulk (for example, check your backlinks) in dedicated services such as Linkbox.Pro, as well as with the ScreamingFrog SEO Spider (preferably with a license).
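As a rough do-it-yourself alternative, the sketch below (Python with the requests library; the URLs are placeholders) flags the most common blockers for a list of URLs: a non-200 status code, an X-Robots-Tag header containing noindex, and a noindex robots meta tag. It is a simplified scan, not a replacement for the tools above.

# Rough sketch of a bulk indexability check: flags non-200 responses,
# "X-Robots-Tag: noindex" headers, and noindex robots meta tags.
# The HTML scan is deliberately simplified and may miss unusual markup.
import re
import requests

URLS = [
    "https://example.com/",               # placeholder URLs
    "https://example.com/news/article-1",
]

NOINDEX_META = re.compile(
    r'<meta[^>]+name=["\'](?:robots|googlebot)["\'][^>]*noindex', re.I
)

for url in URLS:
    try:
        r = requests.get(url, timeout=10)
    except requests.RequestException as exc:
        print(f"{url} -> request failed: {exc}")
        continue

    reasons = []
    if r.status_code != 200:
        reasons.append(f"status {r.status_code}")
    if "noindex" in r.headers.get("X-Robots-Tag", "").lower():
        reasons.append("X-Robots-Tag: noindex")
    if NOINDEX_META.search(r.text):
        reasons.append("noindex meta tag")

    print(f"{url} -> {'blocked: ' + ', '.join(reasons) if reasons else 'looks indexable'}")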
The easiest and most correct (recommended by Google) way is to add the noindex meta tag in the head section:
<meta name="robots" content="noindex" />
The easiest way to check indexability in bulk is to use the LinkboxPro service.
Disallow in robots.txt does not disable page indexing. If Google discovers the page through other sources (external links), it may still be indexed.
No. In page processing, crawling comes first and indexing second. Blocking crawling in robots.txt will prevent Google from seeing that the page has a noindex rule, so it can end up in the index anyway. For noindex to work, the page must be crawlable.