After publishing my last post about dangerous rel canonical problems, I started receiving a lot of questions about other areas of technical SEO. One topic in particular that seemed to generate many questions was how to best use and set up xml sitemaps for larger and more complex websites.
Sure, in its most basic form, webmasters can provide a list of urls that they want the search engines to crawl and index. Sounds easy, right? Well, for larger and more complex sites, the situation is often not so easy. And if the xml sitemap situation spirals out of control, you can end up feeding Google and Bing thousands, hundreds of thousands, or millions of bad urls. And that’s never a good thing.
While helping clients, it’s not uncommon for me to audit a site and surface serious errors with regard to xml sitemaps. And when that’s the case, websites can send Google and Bing mixed signals, urls might not get indexed properly, and both engines can end up losing trust in your sitemaps. And as Bing’s Duane Forrester once said in this interview with Eric Enge:
“Your Sitemaps need to be clean. We have a 1% allowance for dirt in a Sitemap. If we see more than a 1% level of dirt, we begin losing trust in the Sitemap.”
Clearly that’s not what you want happening…
So, based on the technical SEO work I perform for clients, including conducting many audits, I decided to list some important facts, tips, and answers for those looking to maximize their xml sitemaps. My hope is that you can learn something new from the bullets listed below, and implement changes quickly.
1. Use RSS/Atom and XML For Maximum Coverage
This past fall, Google published a post on the webmaster central blog about best practices for xml sitemaps. In that post, they explained that sites should use a combination of xml sitemaps and RSS/Atom feeds for maximum coverage.
Xml sitemaps should contain all canonical urls on your site, while RSS/Atom feeds should contain the latest additions or recently updated urls. XML sitemaps will contain many urls, where RSS/Atom feeds will only contain a limited set of new or recently changed urls.
So, if you have new urls (or recently updated urls) that you want Google to prioritize, then use both xml sitemaps and RSS/Atom feeds. Google says by using RSS, it can help them “keep your content fresher in its index”. I don’t know about you, but I like the idea of Google keeping my content fresher. :)
Also, it’s worth noting that Google recommends maximizing the number of urls per xml sitemap. For example, don’t cut up your xml sitemaps into many smaller files (if possible). Instead, use the space you have in each sitemap to include all of your urls. If you don’t Google explains that, “it can impact the speed and efficiency of crawling your urls.” I recommend reading Google’s post to learn how to best use xml sitemaps and RSS/Atom feeds to maximize your efforts. By the way, you can include 50K urls per sitemap and each sitemap must be less than 10MB uncompressed.
2. XML Sitemaps By Protocol and Subdomain
I find a lot of webmasters are confused by protocol and subdomains, and both can end up impacting how urls in sitemaps get crawled and indexed.
URLs included in xml sitemaps must use the same protocol and subdomain as the sitemap itself. This means that https urls located in an http sitemap should not be included in the sitemap. This also means that urls on sample.domain.com cannot be located in the sitemap on www.domain.com. So on and so forth.
This is a common problem when sites employ multiple subdomains or they have sections using https and http (like ecommerce retailers). And then of course we have many sites starting to switch to https for all urls, but haven’t changed their xml sitemaps to reflect the changes. My recommendation is to check your xml sitemaps reporting today, while also manually checking the sitemaps. You might just find issues that you can fix quickly.
3. Dirty Sitemaps – Hate Them, Avoid Them
When auditing sites, I often crawl the xml sitemaps myself to see what I find. And it’s not uncommon to find many urls that resolve with non-200 header response codes. For example, urls that 404, 302, 301, return 500s, etc.
You should only provide canonical urls in your xml sitemaps. You should not provide non-200 header response code urls (or non-canonical urls that point to other urls). The engines do not like “dirty sitemaps” since they can send Google and Bing on a wild goose chase throughout your site. For example, imagine driving Google and Bing to 50K urls that end up 404ing, redirecting, or not resolving. Not good, to say the least.
Remember Duane’s comment from earlier about “dirt” in sitemaps. The engines can lose trust in your sitemaps, which is never a good thing SEO-wise. More about crawling your sitemaps later in this post.
4. View Trending in Google Webmaster Tools
Many SEOs are familiar with xml sitemaps reporting in Google Webmaster Tools, which can help surface various problems, while also providing important indexation statistics. Well there’s a hidden visual gem in the report that’s easy to miss. The default view will show the number of pages submitted in your xml sitemaps and the number indexed. But if you click the “sitemaps content” box for each category, you can view trending over the past 30 days. This can help you identify bumps in the road, or surges, as you make changes.
For example, check out the trending below. You can see the number of images submitted and indexed drop significantly over a period of time, only to climb back up. You would definitely want to know why that happened, so you can avoid problems down the line. Sending this to your dev team can help them identify potential problems that can build over time.
5. Using Rel Alternate in Sitemaps for Mobile URLs
When using mobile urls (like m.), it’s incredibly important to ensure you have the proper technical SEO setup. For example, you should be using rel alternate on the desktop pages pointing to the mobile pages, and then rel canonical on the mobile pages pointing back to the desktop pages.
Although not an approach I often push for, you can provide rel alternate annotations in your xml sitemaps. The annotations look like this:
It’s worth noting that you should still add rel canonical to the source code of your mobile pages pointing to your desktop pages.
6. Using hreflang in Sitemaps for Multi-Language Pages
If you have pages that target different languages, then you are probably already familiar with hreflang. Using hreflang, you can tell Google which pages should target which languages. Then Google can surface the correct pages in the SERPs based on the language/country of the person searching Google.
Similar to rel alternate, you can either provide the hreflang code in a page’s html code (page by page), or you can use xml sitemaps to provide the hreflang code. For example, you could provide the following hreflang attributes when you have the same content targeting different languages:
Just be sure to include a separate <loc> element for each url that contains alternative language content (i.e. all of the sister urls should be listed in the sitemap via a <loc> element).
7. Testing XML Sitemaps in Google Webmaster Tools
Last, but not least, you can test your xml sitemaps or other feeds in Google Webmaster Tools. Although easy to miss, there is a red “Add/Test Sitemap” button in the upper right-hand corner of the Sitemaps reporting page in Google Webmaster Tools.
When you click that button, you can add the url of your sitemap or feed. Once you click “Test Sitemap”, Google will provide results based on analyzing the sitemap/feed. Then you can rectify those issues before submitting the sitemap. I think too many webmasters use a “set it and forget it” approach to xml sitemaps. Using the test functionality in GWT, you can nip some problems in the bud. And it’s simple to use.
8. Bonus: Crawl Your XML Sitemap Via Screaming Frog
In SEO, you can either test and know, or read and believe. As you can probably guess, I’m a big fan of the former… For xml sitemaps, you should test them thoroughly to ensure all is ok. One way to do this is to crawl your own sitemaps. By doing so, you can identify problematic tags, non-200 header response codes, and other little gremlins that can cause sitemap issues.
One of my favorite tools for crawling sitemaps is Screaming Frog (which I have mentioned many times in my previous posts). By setting the crawl mode to “list mode”, you can crawl your sitemaps directly. Screaming Frog natively handles xml sitemaps, meaning you don’t need to convert your xml sitemaps into another format before crawling (which is awesome).
Screaming Frog will then load your sitemap and begin crawling the urls it contains. In real-time, you can view the results of the crawl. And if you have Graph View up and running during the crawl, you can visually graph the results as the crawler collects data. I love that feature. Then it’s up to you to rectify any problems that are surfaced.
Summary – Maximize and Optimize Your XML Sitemaps
As I’ve covered throughout this post, there are many ways to use xml sitemaps to maximize your SEO efforts. Clean xml sitemaps can help you inform the engines about all of the urls on your site, including the most recent additions and updates. It’s a direct feed to the engines, so it’s important to get it right (and especially for larger and more complex websites).
I hope my post provided some helpful nuggets of sitemap information that enable you to enhance your own efforts. I recommend setting some time aside soon to review, crawl, audit, and then refine your xml sitemaps. There may be some low-hanging fruit changes that can yield nice wins. Now excuse me while I review the latest sitemap crawl. :)