How to block page indexing


Information updated on February 14, 2023

Blocking page indexing in Google and other search engines is used when a certain page or site should not be indexed by a search engine. This may be necessary for several reasons:

  1. The page was created as a technical page, has no search engine value and should not rank.
  2. The page/site is temporarily not ready and should not be indexed yet.
  3. The page is a duplicate or of low quality, with poor content.
  4. The page is intended for registered and authorized users and should not be indexed or crawled by a bot.

Below, I will present several ways to block page indexing and describe how to implement them. You will also learn how to check whether a page is blocked from indexing on your own site or on an external site, for example when checking whether your backlinks are indexable.

Ways to block indexing

#1 - noindex

The noindex directive is a special rule that prevents Google and other search engines from indexing a page. This rule does not prevent the page from being crawled (read more about the difference between indexing and crawling in the glossary).

The noindex rule can be implemented in two ways: with a special noindex meta tag (#1.1), which must be placed in the <head>, or with an HTTP response header (#1.2).

#1.1 Using a tag is the easier option. The code looks like this:

a) Disable indexing for all search engine bots:

  <meta name="robots" content="noindex" />

b) Disable indexing for Googlebot:

  <meta name="googlebot" content="noindex" />

You can disable indexing for any bot; you just need to look up the bot's name in the help documentation of the search engine or crawler.

In most CMSs, the noindex tag can be output through settings in the system interface or via plugins. For example, in WordPress with Yoast SEO enabled, go to the post's admin page, scroll to the Yoast SEO Premium block, open the Advanced section, and select "No" for "Allow search engines to show the Page in search results?".

Noindex in WordPress

#1.2. HTTP response header

The noindex rule blocking page indexing can also be sent in the HTTP response header. It looks like this:

   HTTP/1.1 200 OK
   (...)
   X-Robots-Tag: noindex
   (...)

This method suits those who know how to work with server configurations, since that is where the server response can be edited.
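For example, on nginx the header can be added for a whole section of the site. A minimal sketch, assuming an nginx server and a hypothetical /private/ section:

```nginx
# Hypothetical nginx sketch: send "X-Robots-Tag: noindex" with every
# response under /private/; "always" also attaches it to error responses.
location /private/ {
    add_header X-Robots-Tag "noindex" always;
}
```

After reloading the configuration, you can confirm the header with `curl -I https://example.com/private/page`.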

The noindex method is ideal when you need to permanently block indexing of a technical page, or to keep pages intended only for registered, authorized users out of the index. Read more about the noindex directive in the source.

#2 Deny access to bots (403 server response)

With special rules on the server or on the CDN, you can prevent Googlebot or any other bot from crawling and indexing a site/page. If the rule is written correctly, bots will receive a 403 server response (access denied).

For example, this is what a crawl block (and, as a result, an indexing block) looks like in Cloudflare's Firewall section for Googlebot. A similar rule can be created in the server configuration.

Block in Cloudflare in the Firewall section for Googlebot

When you check a URL with such a block in place, you will see a 403 server response (that is, access to the page is denied).

no access to page view for googlebot

This way of blocking indexing suits new sites that are still in development and should not be crawled or indexed yet. Note! Denying access to bots prevents the page not only from getting into the index but also from being crawled. A crawl block in the robots.txt file works similarly, but with one difference: URLs blocked in robots.txt can still end up in search results if, for example, many links from other sources point to them. With new sites completely closed to bots on the server or CDN, this usually does not happen.
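As an alternative to a CDN firewall rule, the same block can be sketched in the server configuration. A minimal nginx sketch, assuming you want to turn away the major search engine bots by User-Agent (the bot list is illustrative, and User-Agent strings can be spoofed, so this is not real access control):

```nginx
# Hypothetical nginx sketch: return 403 (access denied) to common
# search engine crawlers, matched case-insensitively by User-Agent.
# Place inside the relevant server {} block.
if ($http_user_agent ~* "(Googlebot|bingbot|YandexBot)") {
    return 403;
}
```

The result is the same 403 response shown above: the bot can neither crawl nor index the page.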

#3 Disable crawling in robots.txt (not recommended!)

There is a difference between crawling and indexing, but crawling always comes first. Therefore, in theory, blocking crawling should rule out appearing in search results, because the bot cannot crawl the page and understand what it is about.

Crawling is blocked by adding a rule to the robots.txt file for a page, site, subfolder (directory), or file type, like this:

   User-agent: *
   Disallow: /user/*
   Disallow: /news/*

This rule prevents all bots from crawling the /user/ and /news/ directories (subfolders) and any files in them.
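The effect of such rules on a well-behaved crawler can be simulated with Python's standard robots.txt parser. A small sketch, assuming a hypothetical example.com; note that the stdlib parser does simple prefix matching, so the rules are written without the trailing *:

```python
from urllib import robotparser

# Hypothetical robots.txt equivalent to the rules above (prefix form).
rules = """\
User-agent: *
Disallow: /user/
Disallow: /news/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Anything under /user/ or /news/ is off-limits to compliant bots...
print(rp.can_fetch("Googlebot", "https://example.com/user/profile"))  # False
# ...while the rest of the site remains crawlable.
print(rp.can_fetch("Googlebot", "https://example.com/about"))         # True
```

Remember that this only models crawling: as explained below, a disallowed URL can still be indexed.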

In practice, this method does not work as an indexing block. "The robots.txt file is not intended to prevent your content from appearing in Google search results." is a direct quote from Google's help. In other words, if a page closed to crawling is discovered by other means, for example via external links leading to it, it can still get into the index (and this has happened more than once in my practice).

One more quote for confirmation:

If access to a page is denied in the robots.txt file, it can still be indexed by links from other sites. Google will not directly crawl and index content that is blocked in the robots.txt file. However, if such a URL is referenced by other sites, then it can still be found and added to the index. After that, the page may appear in the search results (in many cases, along with the text of the link that leads to it). If this does not suit you, we recommend that you protect the files on the server with a password or use the noindex directive in the meta tag or HTTP response header. An alternative solution is to delete the page completely.

In short, I would recommend this method as a last resort only if you cannot edit the robots tag in the head, write the server response or configure the firewall.

#4 Password protect files

The method is similar to method #2: bots receive an access-denied response instead of the content. It is ideal for pages intended only for users/subscribers.

To password protect pages, you can use a content restriction plugin (for WordPress, this is, for example, Password Protected). Install and activate it, then go to Settings > Password Protected and turn on Password Protected Status. This gives you finer control by allowing you to whitelist certain IP addresses.
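Outside WordPress, the same protection can be configured at the server level, as Google's quote above suggests ("protect the files on the server with a password"). A minimal nginx sketch, assuming the password file has already been created with htpasswd (the location and file path are hypothetical):

```nginx
# Hypothetical nginx sketch: require HTTP basic auth for /members/.
# Unauthenticated visitors, including bots, get an access-denied
# response (401) instead of the content.
location /members/ {
    auth_basic "Members only";
    auth_basic_user_file /etc/nginx/.htpasswd;
}
```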

#5 Manual removal from the index

This is a way to urgently (within about an hour) remove a page from the index. For a site verified in Search Console, the removal tool is located at https://search.google.com/u/0/search-console/removals. There you can request removal of an entire site or directory.

This method does not block indexing; it only temporarily removes pages from the index.

To remove indexed content from a site you do not own, use https://support.google.com/websearch/answer/6349986 (it links to the tool). To prevent manipulation by SEOs, requests to remove content from someone else's site are reviewed against certain criteria and are most often declined. Also, in most cases it is not the page that is removed from the index but its outdated content (the cached copy is dropped).

How to check if page indexing is allowed

#1 You can test your site's pages in Search Console using the Test Live URL tool. It shows:

#1.1 whether the page is blocked from indexing by a noindex tag or server response:

the page is closed from indexing using the noindex tag or server response

#1.2 whether Googlebot cannot access the page due to password protection or a firewall (403 server response):

no access to page view for googlebot

#1.3 whether the page cannot be crawled due to restrictions in the robots.txt file (however, I repeat: such a page can still be indexed):

Blocked by robots.txt

#2 You can check pages for indexability in bulk (for example, check your backlinks) in special services such as Linkbox.Pro, as well as with the ScreamingFrog SEO Spider (preferably with a license).

Bulk check of indexing capability in Linkbox.Pro
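If you prefer to script such checks yourself, the two noindex signals from method #1 (the meta tag and the X-Robots-Tag header) can be detected with a few lines of Python. A minimal sketch using only the standard library; fetching the page and its response headers is left out:

```python
from html.parser import HTMLParser

class NoindexMetaParser(HTMLParser):
    """Detects noindex in <meta name="robots"> or <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name in ("robots", "googlebot") and "noindex" in content:
            self.noindex = True

def is_noindex(html_text, headers=None):
    """True if the page carries a noindex rule in its headers or markup."""
    # #1.2: the X-Robots-Tag response header.
    for key, value in (headers or {}).items():
        if key.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    # #1.1: the meta robots tag in the HTML.
    parser = NoindexMetaParser()
    parser.feed(html_text)
    return parser.noindex

print(is_noindex('<head><meta name="robots" content="noindex" /></head>'))  # True
print(is_noindex('<head><title>Open page</title></head>'))                  # False
```

Note that this does not cover methods #2-#4: an access-denied response has to be judged by the HTTP status code, not the page body.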

Quick Replies

What is the easiest way to stop a page from being indexed?

The easiest and most correct way (recommended by Google) is to add the noindex tag in the <head> section.

What is the syntax of the Noindex tag?

<meta name="robots" content="noindex" />

What's the easiest way to check page indexability in bulk?

The easiest way to check indexability in bulk is to use the Linkbox.Pro service.

Why doesn't disallow in Robots.txt help prevent indexing?

Disallow in robots.txt does not block page indexing. If Google finds the page in other sources (external links), it may still be indexed.

Can we use noindex and disallow in robots.txt for better effect?

No. In page evaluation, crawling comes first, then indexing. Blocking crawling in robots.txt will prevent Google from seeing that the page is noindex, and it can get into the index anyway. For noindex to work, the page must be crawlable.
