
Avoiding Dirty Sitemaps – How to Download and Crawl XML Sitemaps Using Screaming Frog

July 10, 2013 By Glenn Gabe


Indexed to submitted ratio in GSC

SEO Audits are a core service I provide, including both comprehensive audits and laser-focused audits tied to algorithm updates.  There are times during those audits that I come across strange pages that are indexed, or I see crawl errors for pages not readily apparent on the site itself.  As part of the investigation, it’s smart to analyze and crawl a website’s xml sitemap(s) to determine whether they could be part of the problem.  It’s not uncommon for a sitemap to contain old pages, pages leading to 404s, application errors, redirects, etc.  And you definitely don’t want to submit “dirty sitemaps” to the engines.

What’s a Dirty Sitemap?
A dirty sitemap is an xml sitemap that contains 404s, 302s, 500s, etc.  Note, those are header response codes.  A 200 code is OK, while the others signal various errors or redirects.  Since the engines will retrieve your sitemap and crawl your urls, you definitely don’t want to feed them errors.  Instead, you want your xml sitemaps to contain only canonical urls on your site, and urls that resolve with a 200 code.  Duane Forrester from Bing was on record saying that Bing has very little tolerance for “dirt in a sitemap”.  And Google has explained that xml sitemaps are the second most important source for url discovery.  Therefore, you should avoid dirty sitemaps so the engines can build trust in those sitemaps (versus encountering 404s, 302s, 500s, etc.).
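To make those header response codes concrete, here’s a minimal sketch that checks the response code for a single url.  It assumes Python 3 with the requests library installed, and the url shown is just a placeholder:

```python
# Minimal sketch: check the header response code for one url.
# Assumes Python 3 with the requests library (pip install requests);
# the url below is just a placeholder.
import requests

response = requests.head(
    "https://www.example.com/some-page/",
    allow_redirects=False,  # report 301s/302s directly instead of following them
    timeout=10,
)

# 200 is what you want in a sitemap; 3xx, 4xx and 5xx are "dirt".
print(response.status_code)
```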

Indexed to Submitted Ratio
One metric that can help you determine whether your xml sitemaps are problematic (or dirty) is the indexed to submitted ratio, which you can find in Google Search Console.  When you access the Sitemaps section under the Indexing menu in GSC, you will see the number of pages submitted in the sitemap, along with the number indexed.  That ratio should be close(r) to 1:1.  If you see a low indexed-to-submitted ratio, that could flag an issue with the urls you are submitting in your sitemap.  For example, if you see 12,000 pages submitted, but only 6,500 indexed, then only 54% of the submitted pages are indexed.
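To put numbers on that example, here’s a quick sketch that computes the ratio.  The 90% threshold is just an illustration, not an official cutoff:

```python
# Quick arithmetic on the example above: 6,500 indexed out of
# 12,000 submitted is roughly a 54% indexed-to-submitted ratio.
submitted = 12000
indexed = 6500

ratio = indexed / submitted
print(f"Indexed to submitted ratio: {ratio:.0%}")  # -> 54%

# A low ratio is a signal worth investigating, not proof of a
# problem. The 90% threshold here is just an illustration.
if ratio < 0.90:
    print("Low ratio - check the sitemap for dirty urls.")
```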

Here’s a screenshot showing 157K urls that were submitted, but are not being indexed for some reason:

Submitted URLs in Google Search Console

Pages “Linked From” in Google Search Console
In addition to the indexed to submitted ratio explained above, you might find crawl errors in Google Search Console for urls that don’t look familiar.  To help you track down the problematic urls, Search Console will show you how it found the urls in question.

If you inspect the url in GSC, you will see the error details.  You will also see the referring urls where Google found links to the url, along with any sitemaps that contain the url.  That information can reveal whether the urls are contained in a specific sitemap, and whether they are being linked to from other files on your site.  This is a great way to hunt down problems, and as you can guess, you might find that your xml sitemap is causing them.

Inspecting a url in GSC and viewing xml sitemaps

Crawling XML Sitemaps
If you do see a problematic indexed-to-submitted ratio, what can you do?  Well, the beautiful part about xml sitemaps is that they are public.  As long as you know where they reside, you can download and crawl them using a tool like Screaming Frog.  I’ve written about Screaming Frog in the past, and it’s a phenomenal tool for crawling websites, flagging errors, analyzing optimization, etc.  I highly recommend using it.

Screaming Frog provides functionality for crawling text files (containing a list of urls), but historically not xml files (which is the format of xml sitemaps submitted to the engines).  That’s a problem if you simply download the xml file to your computer.  In order to get that sitemap file into a format that can be crawled by Screaming Frog, you’ll need to first import the file into Excel, and then copy the urls to a text file.  Then you can crawl that file.  (See the update below for Screaming Frog’s newer, direct option.)

And that’s exactly what I’m going to show you in this tutorial.  Once you crawl the xml sitemap, you might find a boatload of issues that can be quickly resolved.  And when you are hunting down problems SEO-wise, any problem you can identify and fix quickly is a win.  Let’s begin.

Update: Screaming Frog now lets you crawl XML sitemaps directly from the tool.  So if you are simply trying to crawl a sitemap, enter “List Mode”, select the option to download an xml sitemap, and enter the url for the sitemap.  If you want to download the sitemap yourself, follow the instructions below.
Download xml sitemap in Screaming Frog

Quick Note: If you control the creation of your xml sitemaps, then you obviously don’t need to download them from the site.  That said, the sitemaps residing on your website are what the engines crawl.  If your CMS is generating your sitemaps on the fly, then it’s valuable to use the exact sitemaps sitting on your servers.  So even though you might have them locally, I would still go through the process of downloading them from your website via the tutorial below.
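Side note: if you’re comfortable with a little scripting, you can fetch the live sitemap and build the crawlable text file in one step, as an alternative to steps 1 through 3 of the tutorial below.  Here’s a minimal sketch, assuming Python 3 with the requests library; the sitemap url and output filename are placeholders:

```python
# Minimal sketch: download the live xml sitemap and write its urls
# to a text file that Screaming Frog's list mode can crawl.
# Assumes Python 3 with the requests library; the sitemap url and
# output filename are placeholders.
import xml.etree.ElementTree as ET

import requests

SITEMAP_URL = "https://www.example.com/sitemap.xml"  # hypothetical
OUTPUT_FILE = "sitemap-urls.txt"

# The standard namespace used by xml sitemaps (sitemaps.org).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

response = requests.get(SITEMAP_URL, timeout=30)
response.raise_for_status()
root = ET.fromstring(response.content)

# A <urlset> holds page urls; a <sitemapindex> holds child sitemaps,
# each of which you would download and process the same way.
urls = [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]

with open(OUTPUT_FILE, "w") as f:
    f.write("\n".join(urls))

print(f"Wrote {len(urls)} urls to {OUTPUT_FILE}")
```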

How To Download and Crawl Your XML Sitemaps

  1. Download the XML Sitemap(s)
    In your browser, enter the URL of your xml sitemap, or of your sitemap index file.  A sitemap index file contains the urls of all of your xml sitemaps (sitemaps are limited to 50,000 urls each, so larger sites need more than one).  If you are using a sitemap index file, you will need to download each xml sitemap separately.  Then you can either crawl each one separately or combine the urls into one master text file.  After the sitemap loads in your browser, save it to your hard drive.  In Firefox, click “File” and then “Save As”.  In Chrome, right click in the window containing your sitemap, select “Save As”, and save the xml sitemap to your computer.
    Download XML Sitemaps
  2. Import the Sitemap into Excel
    Next, you’ll need to get a straight list of urls to crawl from the sitemap.  To do this, I recommend using the “Import XML” functionality in the “Developer” tab in Excel.  Click “Import” and select the sitemap file you just downloaded.  Excel will present a dialog box about the xml schema; just click “OK”.  Then Excel will ask you where to place the data; leave the default option and click “OK”.  You should now see a table containing the urls from your xml sitemap.  And yes, you might already see some problems in the list.  :)
    Import XML Sitemaps into Excel
  3. Copy the URLs to a Text File
    I mentioned earlier that Screaming Frog’s list mode crawls text files containing a list of urls.  To create one, copy all of the urls from column A in your spreadsheet.  Then fire up your text editor of choice (mine is Textpad), and paste the urls.  Make sure you delete the first row, which contains the heading for the column.  Then save that file to your computer.
    Copy URLs to a Text File
  4. Unleash the Frog
    Next, we’re ready to crawl the urls in the text file you just created.  Fire up Screaming Frog and open the “Mode” menu.  Select “List”, which enables you to load a text file containing a series of urls.  And again, there is now an option to crawl a sitemap directly by entering the url of that sitemap.  If you don’t need to download your sitemap, that’s a great way to go.
    List Mode in Screaming Frog
  5. Load The Text File and Start The Crawl
    Once you select “List” mode, click the “Upload List” button and choose “From a File”.  Then select the text file you created.  Screaming Frog will load the urls and display them in a window.  Once you click “OK”, the crawl will begin.
    Load text file in Screaming Frog.
  6. Analyze the Crawl
    When the crawl is done, you will have a boatload of data about each url listed in the xml sitemap.  The first place I would start is the “Response Codes” tab, which will display the header response codes for each url that was crawled.  You can also use the filter dropdown to isolate 404s, 500s, 302s, etc.  You might be surprised with what you find.  (If you prefer a scripted check, see the sketch after this list.)
    Analyze a Sitemap Crawl in Screaming Frog
  7. Fix The Problems!
    Once you analyze the crawl, work with your developer or development team to rectify the problems you identified.  The fix can sometimes be handled quickly (in less than a day or two).
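As a scripted complement to steps 4 through 6 (the sketch mentioned above), here’s a minimal example that checks the header response code for every url in the text file and tallies the results.  It assumes Python 3 with the requests library, and reads the placeholder filename from the earlier sketch:

```python
# Minimal sketch: check the header response code for every url in
# the text file and tally the results - a scripted version of
# reviewing the "Response Codes" tab in Screaming Frog.
# Assumes Python 3 with the requests library; the input filename
# matches the earlier sketch and is a placeholder.
import time
from collections import Counter

import requests

INPUT_FILE = "sitemap-urls.txt"
codes = Counter()

with open(INPUT_FILE) as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    try:
        response = requests.head(url, allow_redirects=False, timeout=10)
        codes[response.status_code] += 1
        if response.status_code != 200:
            print(f"{response.status_code}  {url}")
    except requests.RequestException as exc:
        codes["error"] += 1
        print(f"ERROR  {url}  ({exc})")
    time.sleep(0.5)  # be polite to the server you are crawling

# Anything other than 200s is "dirt" worth investigating.
print(dict(codes))
```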

Summary – Cleaning Up Dirty Sitemaps
Although XML sitemaps provide an easy way to submit all of your canonical urls to the engines, that ease of setup sometimes leads to serious errors.  If you are seeing strange urls getting indexed, or if you are seeing crawl errors for weird or unfamiliar urls, then you might want to check your own sitemaps to see if they are causing a problem.  Using this tutorial, you can download and crawl your sitemaps quickly, and then flag any errors you find along the way.

Let’s face it, quick and easy wins are sometimes hard to come by in SEO.  But finding xml sitemap errors can be a quick and easy win.  And now you know how to find them.  Happy crawling.

GG

 


Filed Under: google, seo, tools
