Every so often, Google Webmaster Tools adds a new wrinkle that makes life measurably easier for SEO leadership at large organizations. Index Status is one of those wrinkles.
This nifty little feature, which was released this past summer but has gotten relatively little fanfare considering the profound insights and leverage it can provide, does two things that I find crucial:
- It helps quickly diagnose very serious issues with duplicate and/or non-canonical URLs that could be wasting your daily allotment of crawl budget while simultaneously confusing the hell out of search engine crawlers.
- It provides an intuitive data visualization that can be used to tell a very compelling story to cross-functional stakeholders, which in turn can lead to the site- and server-level changes needed to overcome duplicate and canonicalization issues.
Now there are a few things to keep in mind that will help ensure you get the most value possible out of this feature. For starters, in order to get to the really interesting data, you need to click on the “Advanced” tab. That’s how you gain access to “Not Selected” data, which essentially refers to URLs that have been crawled by Google but not indexed because of duplicate content issues, server-side redirects that are in place, or other issues. This is an important piece of data because it can give you a fairly accurate sense of whether or not you have major issues with rogue (often auto-generated) URLs.
In my experience, it’s normal to see a fairly high ratio of “Not Selected” to “Indexed” pages. In other words, more often than not you’ll have a significant portion of your site’s URLs not selected for Google’s index, particularly if you’ve had to implement a lot of server-side redirects (which is often the case for enterprise-caliber websites). However, if the ratio of “Not Selected” to “Indexed” URLs is extreme, if the absolute number of “Not Selected” URLs is enormous (I’ve seen sites with tens of millions of them), or if the graph shows a major spike in either the ratio or the absolute number, then it’s likely time to invest some significant resources into really opening up the hood to understand why, how, and how often Google is finding all of these URLs, and why it is choosing to exclude them from the index.
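Since Webmaster Tools only shows this as a graph, one practical approach is to jot the weekly counts down (or export them by hand) and check the ratio programmatically. Here's a minimal sketch of that kind of monitoring; the data, field names, and the 1.5x spike threshold are all my own illustrative assumptions, not anything Google exposes:

```python
# Hypothetical weekly Index Status counts, transcribed by hand from the
# Webmaster Tools graph. The field names here are my own, not Google's.
history = [
    {"week": "2012-09-02", "indexed": 100_000, "not_selected": 250_000},
    {"week": "2012-09-09", "indexed": 105_000, "not_selected": 260_000},
    {"week": "2012-09-16", "indexed": 110_000, "not_selected": 900_000},
]

def not_selected_ratio(point):
    """Ratio of 'Not Selected' to 'Indexed' URLs for one data point."""
    return point["not_selected"] / point["indexed"]

def flag_spikes(history, threshold=1.5):
    """Flag any week where the ratio jumps by more than `threshold`x
    versus the previous week -- the kind of spike worth investigating."""
    flags = []
    for prev, curr in zip(history, history[1:]):
        if not_selected_ratio(curr) > threshold * not_selected_ratio(prev):
            flags.append(curr["week"])
    return flags
```

With the sample numbers above, the third week's ratio jumps from roughly 2.5 to over 8, so `flag_spikes(history)` would surface it for investigation.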
The fact of the matter is that this Google Webmaster Tools feature won’t dive into granular details (e.g., provide the actual URLs being labeled as “Not Selected”), so you will need to employ advanced server log file analysis techniques like the one in the article I linked to above, as well as other data-driven operations, in order to figure out which URLs are being excluded, why they’re being excluded, how they’re being generated in the first place, and what you can do to remedy the situation in a scalable and technically feasible manner.
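To give a flavor of what that log file analysis can look like, here's a small sketch that pulls Googlebot requests out of access logs and counts how many distinct raw URLs collapse onto each path once query strings are stripped. Paths with many variants are prime suspects for the auto-generated duplicates that end up "Not Selected." This assumes a standard combined log format and simple user-agent matching; your logs (and a production-grade Googlebot verification) will differ:

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Matches the request path and user agent in a combined-format access log
# line. Real logs may vary, so treat this pattern as an assumption.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+".*"(?P<agent>[^"]*)"$'
)

def duplicate_url_report(log_lines):
    """Group Googlebot-crawled URLs by path (ignoring query strings) and
    count how many distinct raw URLs map onto each path, most-bloated
    paths first."""
    variants = defaultdict(set)
    for line in log_lines:
        m = LINE_RE.search(line)
        if not m or "Googlebot" not in m.group("agent"):
            continue
        raw = m.group("path")
        path = urlsplit(raw).path  # strips ?sessionid=..., ?sort=..., etc.
        variants[path].add(raw)
    return sorted(
        ((path, len(urls)) for path, urls in variants.items()),
        key=lambda item: item[1],
        reverse=True,
    )
```

A path that shows up with dozens or hundreds of query-string variants in this report is exactly the kind of crawl-budget sink worth escalating, and the raw URL sets tell you which parameters are generating the duplicates.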
But even though this feature isn’t capable of any sort of deep dive, I still find that the simple ability to monitor this key metric, and to share visualizations with key organizational stakeholders in order to persuade them to make important technical and architectural changes, makes this feature really valuable.
Managing your site’s crawl budget is key, and that makes Index Status a nifty little tool indeed.