How To Find The True Size Of Your Site Using GSC’s Index Coverage Reporting (And Why It’s Important For SEO)

December 10, 2018 By Glenn Gabe


How large is your site SEO-wise?

So, how large is your site? No, how large is it really??

When speaking with companies about SEO, it's not long before I ask that important question. And I often get some confused responses based on the "really" tag at the end. That's because many site owners go by how many pages are indexed, e.g. "We have 20K, 100K, or 1M pages indexed." Well, that's obviously important, but what lies below the surface is also important. For example, Googlebot needs to crawl and process all of your crawlable urls, and not just what's indexed.

Site owners can make it easy, or very hard, for Google to crawl, process, and index their pages based on a number of factors. For example, if you have one thousand pages indexed, but Google needs to crawl and process 600K pages, then you might have issues to smooth over from a technical SEO perspective. And when I say "process", I mean understand what to do with your urls after crawling. For example, Google needs to process the meta robots tag and rel canonical, understand duplicate content, detect soft 404s, and more.

By the way, I'll cover the reason I said you might need to smooth things over later in this post. There are times a site has many urls that need to be processed, but they are being handled properly. For example, noindexing many pages that are fine for users, but shouldn't be indexed. That could be totally ok based on what the site is trying to accomplish. But on the flipside, there may be sites using a ton of parameters, session IDs, dynamically changing urls, or redirects that can cause all sorts of issues (like creating infinite spaces). Again, I'll cover more about this soon.

And you definitely want to make Google's job easier, not harder. Google's John Mueller has mentioned this a number of times over the past several years. Actually, here's a clip of John explaining this from last week's webmaster hangout. John explained that you don't want to make Google's job harder by having it churn through many urls. Google needs to crawl, and then process, all of those urls, even when rel canonical is used properly (since Google needs to first crawl each page to see rel canonical).

Here’s the clip from John (at 4:16 in the video):

John has also explained that you should make sure Google can focus on your highest quality content versus churning through many lower quality urls. That’s why it’s important to fully understand what’s indexed on your site. Google takes all pages into account when evaluating quality, so if you find a lot of urls that shouldn’t be indexed, take action on them.

Here’s another video from John about that (at 14:54 in the video):

Crawl Budget – Most don’t need to worry about this, but…
In addition, crawl budget might also be a consideration based on the true size of your site. For example, imagine you have 73K pages indexed and believe you don't need to worry about crawl budget (remember, typically only very large sites with millions of pages need to worry about crawl budget). But what if your 73K-page site actually contains 29.5M urls that need to be crawled and processed? If that's the case, then you actually do need to worry about crawl budget. This is just another reason to understand the true size of your site. See the screenshot below.

Potential crawl budget issues.

How Do You Find The True Size Of Your Site?
One of the easiest ways to determine the true size of your site is to use Google’s index coverage reporting, including the incredibly important Excluded category. Note, GSC’s index coverage reporting is by property… so make sure you check each property for your site (www, non-www, http, https, etc.)

Google’s Index Coverage
I wrote a post recently about juicing up your index coverage reporting using subdirectories where I explained the power of GSC's new reporting. It replaces the index status report in the old GSC and contains extremely actionable data. The reporting contains multiple categories for each property in GSC, including:

  1. Errors
  2. Valid with warnings
  3. Valid and indexed
  4. Excluded

By reviewing ALL of the categories, you can get a stronger feel for the true size of your site. That true size might make total sense to you, but there are other times where the true size might scare you to death (and leave you scratching your head about why there are so many urls Google is crawling).
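
To make that concrete, here's a quick sketch in Python that simply adds up the counts from each category (the counts below are hypothetical, loosely matching the 73K example from the crawl budget section above). The sum of all four categories is the "true size" I'm referring to, i.e. everything Google has to crawl and process, not just what's indexed:

```python
# A minimal sketch: add up the counts from each index coverage category
# to estimate the true size of a site. The counts below are hypothetical --
# replace them with the numbers from your own GSC property.
coverage = {
    "Errors": 1_200,
    "Valid with warnings": 350,
    "Valid (indexed)": 73_000,
    "Excluded": 29_400_000,
}

true_size = sum(coverage.values())
indexed = coverage["Valid (indexed)"]

print(f"Indexed urls:        {indexed:,}")
print(f"True crawlable size: {true_size:,}")
print(f"Urls crawled/processed per indexed url: {true_size / indexed:.1f}")
```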

Below, I'll run through some quick examples of what you can find in the index coverage reporting that could be increasing the size of your site from a crawling and processing standpoint. Note, there are many reasons you could be forcing Google to crawl and process many more pages than it needs to, and I'll provide just some quick examples in this post. I highly recommend going through your own reporting extensively to better understand your own situation.

Remember, you should make it easier on Googlebot, not harder. And you definitely want to have Google focus on your most important pages versus churning through loads of unimportant urls.

Errors
There are a number of errors that can show up in the reporting, including server errors (500s), redirect errors, and urls submitted in sitemaps that are soft 404s, have crawl issues, are blocked by robots.txt, and more. Since this post is about increasing the crawlable size of your site, I would definitely watch out for situations where Google is surfacing and crawling urls that should never be crawled (or urls that should resolve with 200s, but don't for some reason).

There are also several reports in this category that flag urls being submitted in xml sitemaps that don’t resolve correctly. Definitely dig in there to see what’s going on.

Errors in GSC's index coverage reporting.
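
If you want to spot-check your submitted urls outside of GSC, here's a rough sketch in Python that fetches the urls listed in an xml sitemap and flags anything that doesn't return a 200. The sitemap url is a placeholder, and the script assumes a standard (non-index) sitemap using the sitemaps.org namespace:

```python
# A rough sketch: fetch the urls listed in an xml sitemap and flag any
# that don't resolve with a 200. The sitemap url is a placeholder.
import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://www.example.com/sitemap.xml"  # hypothetical
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

resp = requests.get(SITEMAP_URL, timeout=10)
root = ET.fromstring(resp.content)
urls = [loc.text for loc in root.findall(".//sm:loc", NS)]

for url in urls:
    r = requests.get(url, allow_redirects=False, timeout=10)
    if r.status_code != 200:
        # Submitted urls that 404, redirect, or 5xx show up here.
        print(f"{r.status_code}  {url}")
```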

Indexed and Valid
Although this report shows all urls that are properly indexed, you definitely want to dig into the report titled, “Indexed, not submitted in sitemap”. This report contains all urls that Google has indexed that weren’t submitted in xml sitemaps (so Google is coming across the urls via standard crawling without being supplied a list of urls). I’ve found this report can yield some very interesting findings.

For example, if you have 200K pages indexed, but only 4K are submitted in xml sitemaps, then what are the other 196K urls? How is Google discovering them? Are they canonical urls that you know about, and if so, should they be submitted in sitemaps? Or are they urls you didn’t even know existed, that contain low-quality content, or autogenerated content?

I've surfaced some nasty situations by checking this report. For example, hundreds of thousands of pages (or more) indexed that the site owner didn't even know were being published on the site. Note, you can also find some urls there that are fine, like pagination, but I would definitely analyze the reporting to gain a deeper understanding of all the valid and indexed urls that Google has found. Again, you should know all pages indexed on your site, especially ones that you haven't added to xml sitemaps.

Indexed not submitted in GSC's index coverage reporting.
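
If you export that report, a quick way to see where those unknown urls live is to group them by top-level directory. Here's a rough sketch in Python; the file name and the "URL" column are assumptions based on a standard CSV export of the report:

```python
# A minimal sketch: group the "Indexed, not submitted in sitemap" urls by
# top-level directory to see which sections of the site they come from.
# The file name and "URL" column are assumptions about the exported CSV.
import csv
from collections import Counter
from urllib.parse import urlparse

with open("indexed_not_submitted.csv", newline="", encoding="utf-8") as f:
    urls = [row["URL"] for row in csv.DictReader(f)]

sections = Counter(
    "/" + urlparse(u).path.strip("/").split("/")[0] for u in urls
)

for section, count in sections.most_common(15):
    print(f"{count:>8,}  {section}")
```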

Excluded – The Mother Lode
I’ve mentioned the power of the Excluded category in GSC’s index coverage reporting before, and I’m not kidding. This category contains all of the urls that Google has crawled, but decided to exclude from indexing. You can often find glaring problems when digging into the various reports in this category.

I can’t cover everything you can find there, but I’ll provide a few examples of situations that could force Google to crawl and process many more urls than it should. And like John Mueller said in the videos I provided above, you should try and make it easy for Google, focus on your highest quality content, etc.

Infinite spaces
When I first fire up the index coverage reporting, I usually quickly check the total number of excluded urls. And there are times that number blows me away (especially knowing how large a site is supposed to be).

When checking the reporting, you definitely want to make sure there’s not an infinite spaces problem, which can cause an unlimited number of urls to be crawled by Google. I wrote a case study about an infinite spaces problem that was impacting up to ten million urls on a site. Yes, ten million.

Infinite spaces based on site search problems.

Sometimes a site has functionality that can create an unlimited number of urls with content that’s not unique or doesn’t provide any value. And when that happens, Google can crawl forever… By analyzing the Excluded category, you can potentially find problems like this in the various reports.

For example, you might see many urls being excluded in a specific section of the site that contain parameters used by certain functionality. The classic example is a calendar script, which if left in its raw form (and unblocked), can cause Google to crawl infinitely. You also might find new urls being created on the fly based on users conducting on-site searches (that was the problem I surfaced in my case study about infinite spaces). There are many ways infinite spaces can be created and you want to make sure that’s not a problem on your site.
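
One quick way to hunt for suspects like these is to count which query parameters show up most often across the excluded urls. Here's a rough sketch in Python, assuming you've pulled the excluded urls into a plain text file (one url per line):

```python
# A minimal sketch: count which query parameters appear most often across
# excluded urls. Calendar scripts, on-site search urls, session IDs, and
# other infinite-space suspects tend to float to the top.
from collections import Counter
from urllib.parse import urlparse, parse_qs

with open("excluded_urls.txt", encoding="utf-8") as f:  # hypothetical file
    urls = [line.strip() for line in f if line.strip()]

param_counts = Counter()
for url in urls:
    for param in parse_qs(urlparse(url).query):
        param_counts[param] += 1

for param, count in param_counts.most_common(20):
    print(f"{count:>8,}  ?{param}=")
```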

Parameters, parameters, parameters
Some sites employ a large number of url parameters, even when those parameters might not be needed at all. And to make the situation worse, some sites might not be using rel canonical properly to canonicalize the urls with parameters to the correct urls (which might be the urls without parameters). And as John explained in the video from earlier, even if you are using rel canonical properly, it's not optimal to force Google to crawl and process many unnecessary urls.

Or worse, maybe each of those urls with parameters (that are producing duplicate content) mistakenly contain self-referencing canonical tags. Then you are forcing Google to figure out that the urls are duplicates and then handle those urls properly. John explained that can take a long time when many urls are involved. Again, review your own reporting. You never know what you’re going to find.

When analyzing the Excluded reporting, you can find urls that fit this situation in a number of reports. For example, “Duplicate without user-selected canonical”, “Google chose a different canonical than user”, and others.

Parameters in GSC's index coverage reporting.
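
To spot-check how your parameterized urls are being canonicalized, you can fetch a sample and compare each page's rel canonical against the url stripped of its query string. Here's a rough sketch in Python using requests and BeautifulSoup; the sample url is a placeholder, and the assumption that the parameter-less url is the correct canonical won't hold for every site, so adjust accordingly:

```python
# A rough sketch: compare the rel canonical on a sample of parameterized
# urls against the same url without its query string. The sample url is
# hypothetical, and stripping all parameters may not match your intended
# canonicals -- adjust for your own url structure.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlsplit, urlunsplit

sample_urls = [
    "https://www.example.com/category/widgets?sort=price&page=2",  # hypothetical
]

for url in sample_urls:
    html = requests.get(url, timeout=10).text
    tag = BeautifulSoup(html, "html.parser").find("link", rel="canonical")
    canonical = tag.get("href") if tag else None
    expected = urlunsplit(urlsplit(url)._replace(query=""))
    status = "OK" if canonical == expected else "CHECK"
    print(f"{status}  {url}")
    print(f"      canonical: {canonical}")
```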

Excessive redirects
There are times you might find a massive number of redirects that you weren't even aware were on the site. If you see a large number of redirects, then analyze those urls to see where they are being triggered on the site, why the redirects are there, and determine if they should be there at all. I wouldn't force both users and Googlebot through an excessive number of redirects, if possible. Make it easier for Google, not harder.

Redirects in GSC's index coverage reporting.

And some of those redirects might look like this… a chain of never-ending redirects where the page never resolves. So what looks like one redirect might actually be 20 or more hops before the request finally fails…

Redirects chain for url that never resolves.
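
Here's a rough sketch in Python for following a redirect chain hop by hop (the starting url is a placeholder). A never-ending chain like the one above will simply run into the hop limit:

```python
# A minimal sketch: follow a redirect chain one hop at a time and print
# each status code and url. The starting url is a placeholder.
import requests
from urllib.parse import urljoin

url = "https://www.example.com/old-page"  # hypothetical
max_hops = 20

for hop in range(1, max_hops + 1):
    r = requests.get(url, allow_redirects=False, timeout=10)
    print(f"{hop:>2}. {r.status_code}  {url}")
    if r.status_code not in (301, 302, 303, 307, 308):
        break  # resolved (200), 404'd, or errored -- end of the chain
    url = urljoin(url, r.headers["Location"])  # Location can be relative
else:
    print(f"Still redirecting after {max_hops} hops -- likely a chain or loop problem")
```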

Malformed URLs
There have been some audits where I've found an excessive number of malformed urls when reviewing the Excluded reporting. For example, a coding glitch that combines urls by accident, leaves out parts of the url, or worse. And on a large-scale site, that can cause many, many malformed urls to be crawled by Google.

Here is an example of a malformed url pattern I surfaced in a recent audit:
https://www.domain.com/blog/category/some-blog-post-title/www.domain.com/blog/category/some-blog-post-title/www.domain.com/blog/category/some-blog-post-title/www.domain.com/blog/category/some-blog-post-title/www.domain.com/blog/category/some-blog-post-title/www.domain.com/blog/category/some-blog-post-title/www.domain.com/blog/category/some-blog-post-title/www.domain.com/blog/category/some-blog-post-title/www.domain.com/blog/category/some-blog-post-title/www.domain.com/blog/category/some-blog-post-title/
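
A pattern like that is easy to flag at scale once you know what it looks like. Here's a rough sketch in Python that checks a list of urls for a hostname repeating inside the path (the hostname and file name are placeholders):

```python
# A rough sketch: flag malformed urls where the hostname appears inside
# the path (like the pattern above). The hostname and file are placeholders.
import re

HOST = re.escape("www.domain.com")
pattern = re.compile(rf"/{HOST}/")

with open("excluded_urls.txt", encoding="utf-8") as f:  # hypothetical file
    for line in f:
        url = line.strip()
        if url and pattern.search(url):
            print(url)
```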

Soft 404s
You might find many soft 404s on a site, which are urls that return 200s, but Google sees them as 404s. There are a number of reasons this could be happening, including having pages that should contain products or listings, but simply return “No products found”, or “No listings found”. These are thin pages without any valuable content for users, and Google is basically seeing them as 404s (Page Not Found).

Note, soft 404s are treated as hard 404s, but why have Google crawl many of them when it doesn’t need to? And if this is due to some glitch, then you could continually force Google to crawl many of these urls.

Example of a soft 404.
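
If you suspect a template is generating soft 404s, you can spot-check a sample of urls for pages that return a 200 but contain "no results" style content. Here's a rough sketch in Python; the phrases and file name are placeholders you would tune to your own templates:

```python
# A rough sketch: flag likely soft 404s -- urls that return a 200 but whose
# content looks like an empty results page. Phrases and file are placeholders.
import requests

THIN_PHRASES = ["no products found", "no listings found", "0 results"]

with open("suspect_urls.txt", encoding="utf-8") as f:  # hypothetical file
    for line in f:
        url = line.strip()
        if not url:
            continue
        r = requests.get(url, timeout=10)
        if r.status_code == 200 and any(p in r.text.lower() for p in THIN_PHRASES):
            print(f"Possible soft 404: {url}")
```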

Like I said, The Mother Lode…
I can keep going here, but this post would be huge. Again, there are many examples of problems you can find by analyzing the Excluded category in the new index coverage reporting. These were just a few examples I have come across when helping clients. The main point I'm trying to make is that you should heavily analyze your reporting to see the true size of your site. And if you find problems that are causing many more urls to be crawled and processed than necessary, then you should root out those problems and make crawling and processing easier for Google.

Make it easy for Google.

Sometimes Excluded can be fine: What you don’t need to worry about…
Not everything listed in the index coverage reporting is bad. There are several categories that are fine as long as you expect that behavior. For example, you might be noindexing many urls across the site since they are fine for users traversing the site, but you don’t want them indexed. That’s totally fine.

Or, you might be blocking 60K pages via robots.txt since there’s no reason for Googlebot to crawl them. That’s totally fine as well.

And how about 404s? There are some site owners who believe that 404s can hurt them SEO-wise. That's not true. 404s (Page Not Found) are totally normal to have on a site. And for larger-scale sites, you might have tens of thousands, hundreds of thousands, or even a million-plus 404s on your site. If those urls should 404, then that's fine.

Here's an example from a site I've helped that has many 404s at any given time. They do extremely well SEO-wise, but based on the nature of their site, they 404 a lot of urls over time. Again, this is totally fine if the urls should 404:

Massive number of 404s on site doing well in organic search.

There have been several large-scale clients I’ve helped that receive over a million clicks per day from Google that have hundreds of thousands of 404s at any given time. Actually, a few of those clients sometimes have over one million 404s showing in GSC at any given time. And they are doing extremely well SEO-wise. Therefore, don’t worry about 404s if those pages should 404. Google’s John Mueller has covered this a number of times in the past.

Here’s a video of John explaining this (at 39:50 in the video):

There are other categories that might be fine too, based on how your own site works. So, dig into the reporting, identify anything out of the ordinary, and then analyze those situations to ensure that you’re making Google’s life easier, and not harder.

Index Coverage Limits And Running Crawls To Gain More Data
After reading this post, you might be excited to jump into GSC's index coverage reporting to uncover the true size of your site. That's great, but there's something you need to know. Unfortunately, there's a maximum of one thousand rows per report that can be exported. So, you will be severely limited with the data you can export once you drill into a problem. Sure, you can (and should) identify patterns of issues across your site so you can tackle those problems at scale. But it still would be amazing to export all of the urls per report.

I know Google has mentioned the possibility of providing API access to the index coverage reporting, which would be amazing. Then you could export all of your data, no matter how many rows. In the short-term, you can read my post about how to add subdirectories to GSC in order to get more data in your reporting. It works well.

And beyond that, you can use any of the top crawling tools to launch surgical crawls into problematic areas. Or, you could just launch enterprise crawls that tackle most of the site in question. Then you can filter by problematic url type, parameter, etc. to surface more urls per category.

As I’ve mentioned many times in my posts, my three favorite crawling tools are:

  • DeepCrawl (where I’m on the customer advisory board). I’ve been using DeepCrawl for a very long time and it’s especially powerful for larger-scale crawls.
  • Screaming Frog – Another powerful tool that many in the industry use for crawling sites. Dan and his crew of amphibious SEOs do an excellent job at providing a boatload of functionality in a local crawler. I often use both DeepCrawl and Screaming Frog together. 1+1=3
  • Sitebulb – The newest kid on the block of the three, but Patrick and Gareth have created something special with Sitebulb. It's an excellent local crawling tool that some have described as a beautiful mix of DeepCrawl and Screaming Frog. You should definitely check it out. I use all three extensively.

Summary – The importance of understanding the true size of your site
The next time someone asks you how large your site is, try to avoid responding with just the number of pages indexed. If you've gone through the process I've documented in this post, the number might be much larger.

It’s definitely a nuanced answer, since Google might be churning through many additional urls based on your site structure, coding glitches, or other SEO problems. But you should have a solid grasp on what’s going on and how to address those issues if you’ve dug in heavily. I recommend analyzing GSC’s index coverage reporting today, including the powerful Excluded category. There may be some hidden treasures waiting for you. Go find them!

GG

 


Filed Under: google, seo, tools
