
How to use Google Webmaster Tools to check the dark underbelly of your site architecture

Every so often, Google Webmaster Tools adds a new wrinkle that makes life measurably easier for SEO leadership at large organizations. Index Status is one of those wrinkles.

This nifty little feature – which was released this past summer but has gotten relatively little fanfare considering the profound insights and leverage it can provide – does two things that I find crucial:

  1. It helps quickly diagnose very serious issues with duplicate and/or non-canonical URLs that could be wasting your daily allotment of crawl budget while simultaneously confusing the hell out of search engine crawlers.
  2. It provides a very intuitive data visualization that can be used to tell a compelling story to cross-functional stakeholders, which, in turn, can lead to the site- and server-level changes needed to overcome duplicate and canonical issues.

Now there are a few things to keep in mind that will help ensure that you get the most value possible out of this feature. For starters, in order to get to the really interesting data you need to click on the “Advanced” tab. That’s how you gain access to “Not Selected” data, which essentially refers to URLs that have been crawled by Google but not indexed, whether because of duplicate content issues, because of server-side redirects that are in place, or because of other issues. This is an important piece of data because it can give you a fairly accurate sense of whether or not you have major issues with rogue (often auto-generated) URLs.

In my experience, it’s normal to see a fairly high ratio of “Not Selected” to “Indexed” pages. In other words, more often than not, a significant portion of your site’s URLs won’t be selected for Google’s index, particularly if you’ve had to implement a lot of server-side redirects (which is often the case for enterprise-caliber websites). However, it’s likely time to invest significant resources in really opening up the hood if any of the following is true: you have a ridiculous ratio of “Not Selected” to “Indexed” URLs; you have a ridiculous overall number of “Not Selected” URLs (I’ve seen sites with tens of millions of them); or the graph shows a major spike in either the ratio or the absolute number of “Not Selected” URLs. The goal is to understand why, how, and how often Google is finding all of these URLs, and why it is choosing to exclude them from the index.
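As a rough illustration of the monitoring described above, here is a minimal Python sketch. It assumes you have copied the “Indexed” and “Not Selected” counts off the Index Status chart by hand at regular intervals. The `Snapshot` class, the function names, and the 50% growth threshold are all hypothetical choices for illustration, not anything Google Webmaster Tools provides, and the right threshold will depend on the size and nature of your site.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One hand-recorded reading from the Index Status chart (hypothetical structure)."""
    date: str
    indexed: int
    not_selected: int


def not_selected_ratio(snapshot: Snapshot) -> float:
    """Ratio of "Not Selected" to "Indexed" URLs for a single reading."""
    if snapshot.indexed == 0:
        return float("inf")
    return snapshot.not_selected / snapshot.indexed


def flag_spikes(snapshots: list[Snapshot], max_growth: float = 0.5) -> list[str]:
    """Return the dates where "Not Selected" grew by more than max_growth
    (0.5 means 50%) compared with the previous reading.

    The threshold is an assumption; tune it to what is normal for your site.
    """
    flagged = []
    for previous, current in zip(snapshots, snapshots[1:]):
        if previous.not_selected == 0:
            continue  # growth is undefined without a non-zero baseline
        growth = (current.not_selected - previous.not_selected) / previous.not_selected
        if growth > max_growth:
            flagged.append(current.date)
    return flagged
```

A flagged date is only a prompt to dig deeper, since a high or rising number can be perfectly healthy when it comes from redirects that were set up on purpose.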

The fact of the matter is that this Google Webmaster Tools feature won’t dive into granular details (for example, it won’t provide the actual URLs being labeled as “Not Selected”). You will need to employ advanced server log file analysis techniques, like the one in the article I linked to above, as well as other data-driven operations to figure out which URLs are being excluded, why they’re being excluded, how they’re being generated in the first place, and what you can do to remedy the situation in a scalable and technically feasible manner.
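To give a flavor of what server log file analysis can look like, here is a minimal Python sketch. It is not the technique from the linked article. It assumes your access logs use the common Apache/Nginx “combined” format, and it identifies Googlebot with a naive user-agent substring check, which can be spoofed. The function name and the grouping key are hypothetical choices for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumes the "combined" access log format:
# host ident user [time] "METHOD url PROTOCOL" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Count requests whose user agent mentions Googlebot, grouped by
    (path, has_query_string, status_code).

    Paths with many query-string variants or many non-200 responses are
    candidates for duplicate or redirected URLs worth a closer look.
    """
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue  # skip unparseable lines and other visitors
        parts = urlsplit(match.group("url"))
        key = (parts.path, bool(parts.query), int(match.group("status")))
        hits[key] += 1
    return hits


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python googlebot_hits.py access.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for key, count in googlebot_hits(log_file).most_common(20):
            print(count, key)
```

Grouping by path with the query string stripped makes parameter-driven duplicates stand out, and keeping the status code separates deliberate 301 redirects from pages that return 200 but may still be near-duplicates.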

But even though this feature isn’t capable of any sort of deep dive, I still find it really valuable: it gives you a simple way to monitor this key metric and to share visualizations with key organizational stakeholders in order to persuade them to make important technical and architectural changes.

Managing your site’s crawl budget is key, and that makes Index Status a nifty little tool indeed.


Comments on this entry are closed.

  • http://twitter.com/khushbaht Khushbaht (Kush)

    Thanks Hugo, what would you consider to be “a ridiculous ratio of “Not Selected” to “Indexed” URLs” for an eCommerce site? Would 3 to 1 be too high?

    • hugoguzman

      Hi Kush! Good question. I don’t think 3 to 1 is high at all. It really just depends on the size and nature of the site. Ultimately, your best bet is to use server log file analysis to get a sense of how Google’s bots are getting to these pages, what these pages are exactly, and why they might not be making the index. So, for example, if they are mostly 301 redirects that were set up on purpose, then a relatively high ratio is no problem at all. What you really want to do is weed out duplicate and near-duplicate content.

  • http://www.brickmarketing.com/ Nick Stamoulis

    This is another good metric to use for SEO reporting purposes. Clients often get hung up on their rankings (where they show up in Google), which can vary based on location, browser, and information saved in a user’s Google account. Index Status helps demonstrate how content marketing efforts are working to add more pages to the site and build more search engine trust.

  • http://jfwhite.org/ James

    Fantastic tip, Hugo! I didn’t notice the rollout this summer, evidently, but this is such a painless method to get a broad overview of your site’s indexation.

    • hugoguzman

      Thanks for the props, James! I totally agree. I’m surprised that this feature hasn’t gotten more attention. It’s such a painless way to get a bird’s-eye view of indexation and crawl budget allocation.

  • http://www.ydeveloper.com/e-smart-ecommerce-suite.html eCommerce

    There is no doubt that this tool can have a huge impact on on-site work, and it’s important to figure out. Doing so requires great insight into the link structure.