The Internet Marketing Driver

Archives for July 2013

Robots.txt Case Study – How To Block and Destroy SEO with 5K+ Directives

July 22, 2013 By Glenn Gabe 5 Comments


If you’ve read my previous posts, then you already know that SEO audits are a core service of mine.  I still believe they are the most powerful deliverable in all of SEO.   So when prospective clients call me for help, I’m quick to start checking some of their stats while we’re on the phone.  As crazy as it sounds, there are times that even a few SEO checks can reveal large SEO problems.   And that’s exactly what happened recently while on the phone with a small business owner who was wondering why SEO wasn’t working for them.

While performing audits, I’ve found some really scary problems.  Problems that spark nightmares for SEOs.  But what I witnessed recently had me falling out of my seat (literally).  It was a robots.txt issue causing serious problems SEO-wise, and I can tell you I’ve never seen anything like it before that call.  It was so bad, and so over the top, that I’m adding it as an example to my SEO Bootcamp training (so attendees never implement something like it on their own websites).

Although robots.txt is a simple file that sits at the root of your website, it can still cause serious SEO problems.  The result of the scary-as-heck robots.txt file I mentioned earlier is a small business website with only one page indexed, and all other pages blocked (including its core services pages).  In addition, the robots.txt file was so riddled with problems that I wonder if Googlebot and Bingbot are so offended by the directives that they don’t even crawl the site.  Yes, it was that bad.

Screenshot: a site: command revealing only one page indexed.

The SMB SEO Problem
Many small to medium sized businesses are skeptical of SEO companies.  And I don’t blame them.  There are some crazy stories out there, from SEO scams, to SEOs not delivering, to all SEO work being outsourced to less-experienced third parties.  That leads to many SMBs handling SEO for themselves (which is ok if they know what they are doing).  But in situations where business owners are not familiar with SEO best practices, advanced-level SEO work, etc., they can get themselves in trouble.  And that’s exactly what happened in this situation.

5K Lines of SEO Hell
As you can imagine, the site I mentioned earlier is not performing well in search.  The business owner didn’t know where to begin with SEO and was asking for my assistance (in setting a strong foundation they could build upon).  That’s a smart move, but I picked up the robots.txt issue within 5 minutes of being on our call.  I did a quick site: command in Google to see how many pages were indexed, and only one page showed up.  Then I quickly checked the site’s robots.txt file and saw the problems.  And there were really big problems.  Like 5K+ lines of problems.  That’s right, over 5K lines of directives were included in this small business website’s robots.txt file.

Keep in mind that most small businesses might have five or six lines of directives (at most).  Actually, there are some with just two or three lines.  Even the most complex websites I’ve worked on had fewer than 100 directives.  So to see an SMB website with 5K+ lines shocked me, to say the least.

Screenshot: excessive and redundant directives in a robots.txt file.

Yes, the SMB Owner Was Shocked Too
The business owner was not happy to hear what was going on, but still had no idea what I was talking about.  I sent them a link to their robots.txt file so they could see what I was referring to.  I also explained that their web designer or developer should change that immediately.  As of now, it’s still there.  Again, this is one of the core problems when small businesses handle their own SEO.  Serious problems like this can remain (and sometimes for a long time).

What’s Wrong with the File?
So, if you’re familiar with SEO and robots.txt files, you are probably wondering what exactly was included in this file, and why a small business would need 5K+ lines of directives.  Well, they obviously don’t need 5K+ lines, and this was something added by either the CMS they are using or the hosting provider.  It’s hard to tell, since I wasn’t involved when the site was developed.

The file contains a boatload of disallow directives for almost every single directory on the site.  Those directives are replicated for a number of specific bots as well.  Then the file ends with a global disallow directive (blocking all engines).  So the file goes to great lengths to disallow every bot from hundreds of directories, but then issues a “disallow all” (which would cover every bot anyway).
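
To give you a sense of the pattern, here is a condensed, hypothetical illustration (not the actual file, and the directory names are made up).  Picture this repeated for hundreds of directories and bot after bot:

    User-agent: Googlebot
    Disallow: /services/
    Disallow: /about/
    Disallow: /blog/
    # ...hundreds more Disallow lines...

    User-agent: Bingbot
    Disallow: /services/
    Disallow: /about/
    Disallow: /blog/
    # ...the same directories all over again...

    User-agent: *
    Disallow: /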

Screenshot: errors in a robots.txt file.

Sitemap Index File Problems
But it gets worse.  The final line is a sitemap directive!  Yes, the file blocks everything, and then tries to feed the engines an xml sitemap that should contain all urls on the site.  But the site is actually using a sitemap index file, which is typically used to reference multiple sitemap files (if you need more than one).  Remember, this is a small business website… so it really shouldn’t need more than one xml sitemap.  When I checked the sitemap index file, it contained only one xml sitemap file (which makes no sense)!  If you only have one sitemap file, then why use a sitemap index file??  And that one xml sitemap only contains one URL!!  Again, this underscores the point that the site is creating an overly complex situation for something that should be simple (for most small business websites).

Screenshot: wrong use of a sitemap index file.
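
For reference, here is roughly what that setup looks like (a hypothetical sketch using example.com, not the actual files): a sitemap index that references a single xml sitemap, and a sitemap that lists a single url.

    sitemap_index.xml:
    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>http://www.example.com/sitemap1.xml</loc>
      </sitemap>
    </sitemapindex>

    sitemap1.xml:
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/</loc>
      </url>
    </urlset>

A single sitemap.xml listing the site’s canonical urls would accomplish the same thing with far less room for error.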

So let me sum this up for you.  The robots.txt file:

  • Blocks all search engines from crawling all content on the site.
  • Is overly complex by blocking each directory for each bot under the sun.
  • Contains malformed directives (close to a thousand of them).
  • Provides autodiscovery for a sitemap index file that contains only one xml sitemap, with only one URL listed!  And that url is blocked by robots.txt anyway!

 

What To Do –  Get the basics down, then scale.
Based on the audits I’ve performed and the businesses I’ve helped, here is what I think small businesses should do with their robots.txt files:

  • Keep it simple.  Don’t just take it from me – listen to Google.  Only add directives when absolutely needed.  If you are unsure what a directive will do, don’t add it.
  • Test your robots.txt file. You can use Google Webmaster Tools to test your robots.txt file to ensure it blocks what you need it to block (and that it doesn’t block what you want crawled).  You can also use some online tools to test your robots.txt file.  You can read another one of my posts to learn about how one of those tools saved a client’s robots.txt file.
  • Add autodiscovery to make sure your clean xml sitemap can be found automatically by the engines.  And use a sitemap index file only if you actually need more than one xml sitemap.  See the simple example below this list.
  • And if you’re a small business owner and have no idea what I’m talking about in the previous bullets, have an SEO audit completed.  It is one of the most powerful deliverables in all of SEO and can definitely help you get things in order.
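
For most small business websites, a robots.txt file along these lines is all you need.  This is a hypothetical example using example.com – adjust (or remove) the directives based on your own site:

    # Let all compliant bots crawl everything:
    User-agent: *
    Disallow:

    # Autodiscovery for your xml sitemap:
    Sitemap: http://www.example.com/sitemap.xml

Every additional directive should earn its place.  If you can’t explain exactly what a line blocks and why, leave it out.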

 

Summary – Avoid 5K Line Robots.txt Files
If there’s one thing you should take away from this post, it’s that the basics are really important SEO-wise.  Unfortunately, the small business with the 5K+ line robots.txt file could write blog posts until the cows come home and it wouldn’t matter.  They could earn a thousand likes and tweets, and it would only impact them in the short term.  That’s because they are blocking every file on their website from being crawled.  Instead of doing that, you should develop a solid foundation SEO-wise and then build upon it.

Nobody needs a 5K line robots.txt file, not even the most complex sites on the web.  Remember, keep it simple and then scale.

GG

Filed Under: google, seo

Avoiding Dirty Sitemaps – How to Download and Crawl XML Sitemaps Using Screaming Frog

July 10, 2013 By Glenn Gabe 4 Comments


SEO Audits are a core service I provide, including both comprehensive audits and laser-focused audits tied to algorithm updates.  There are times during those audits that I come across strange pages that are indexed, or I see crawl errors for pages not readily apparent on the site itself.  As part of the investigation, it’s smart to analyze and crawl a website’s xml sitemap(s) to determine if that could be part of the problem.  It’s not uncommon for a sitemap to contain old pages, pages leading to 404s, application errors, redirects, etc.  And you definitely don’t want to submit “dirty sitemaps” to the engines.

What’s a Dirty Sitemap?
A dirty sitemap is an xml sitemap that contains urls returning 404s, 302s, 500s, etc.  Note: those are HTTP header response codes.  A 200 code is ok, while the others signal various errors or redirects.  Since the engines will retrieve your sitemap and crawl your urls, you definitely don’t want to feed them errors.  Instead, you want your xml sitemaps to contain canonical urls on your site, and urls that resolve with a 200 code.  Duane Forrester from Bing went on record saying that Bing has very little tolerance for “dirt in a sitemap”, and Google feels the same way.  Therefore, you should avoid dirty sitemaps so the engines can build trust in your sitemaps (versus encountering 404s, 302s, 500s, etc.).
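
If you want to spot-check the header response code for a single url before running a full crawl, a few lines of Python will do it.  This is just a minimal sketch (the url is a placeholder), and note that it follows redirects, so a 301/302 will report the final destination’s code:

    import urllib.request
    import urllib.error

    url = "http://www.example.com/some-old-page/"  # placeholder url

    try:
        # Issue a HEAD request so we only fetch the headers, not the page itself.
        req = urllib.request.Request(url, method="HEAD")
        resp = urllib.request.urlopen(req)
        print(url, resp.getcode())   # 200 is what you want for sitemap urls
    except urllib.error.HTTPError as e:
        print(url, e.code)           # 404, 500, etc. signal a dirty sitemap entry

Screaming Frog will do this at scale for you, which is exactly what the tutorial below covers.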

Indexed to Submitted Ratio
One metric that can help you understand whether your xml sitemaps are problematic (or dirty) is the indexed to submitted ratio in Google Webmaster Tools.  When you access the “Sitemaps” section of webmaster tools (under the “Crawl” tab), you will see the number of pages submitted in the sitemap, but also the number indexed.  That ratio should be close(r) to 1:1.  If you see a low indexed to submitted ratio, that could flag an issue with the urls you are submitting in your sitemap.  For example, if you see 12K pages submitted but only 6,500 indexed, that’s only about 54% of the pages submitted.

Screenshot: a very low indexed to submitted ratio in Google Webmaster Tools.

Pages “Linked From” in Google Webmaster Tools
In addition to what I explained above about the indexed to submitted ratio, you might find crawl errors in Google Webmaster Tools for urls that don’t look familiar.  In order to help track down the problematic urls, webmaster tools will show you how it found the urls in question.

If you click the url in the crawl errors report, you will see the error details as the default view.  But you will also see two additional tabs, “In Sitemaps” and “Linked From”.  These tabs will reveal whether the urls are contained in a specific sitemap, and whether the urls are being linked to from other files on your site.  This is a great way to hunt down problems, and as you can guess, you might find that your xml sitemap is causing problems.

Crawling XML Sitemaps
If you do see a problematic indexed to submitted ratio, what can you do?  Well, the beautiful part about xml sitemaps is that they are public.  As long as you know where they reside, you can download and crawl them using a tool like Screaming Frog.  I’ve written about Screaming Frog in the past, and it’s a phenomenal tool for crawling websites, flagging errors, analyzing optimization, etc.  I highly recommend using it.

Screaming Frog provides functionality for crawling text files (containing a list of urls), but not an xml file (which is the format of xml sitemaps submitted to the engines).  That’s a problem if you simply download the xml file to your computer.  In order to get that sitemap file into a format that can be crawled by Screaming Frog, you’ll need to first import that file into Excel, and then copy the urls to a text file.  Then you can crawl the file.

And that’s exactly what I’m going to show you in this tutorial.  Once you crawl the xml sitemap, you might find a boatload of issues that can be quickly resolved.  And when you are hunting down problems SEO-wise, any problem you can identify and fix quickly is a win.  Let’s begin.

Quick Note: If you control the creation of your xml sitemaps, then you obviously don’t need to download them from the site.  That said, the sitemaps residing on your website are what the engines crawl.  If your CMS is generating your sitemaps on the fly, then it’s valuable to use the exact sitemaps sitting on your servers.  So even though you might have them locally, I would still go through the process of downloading them from your website via the tutorial below.

How To Download and Crawl Your XML Sitemaps

  1. Download the XML Sitemap(s)
    In your browser, enter the URL of your xml sitemap (or your sitemap index file).  A sitemap index file contains the urls of all of your xml sitemaps (if you need to use more than one due to sitemap size limitations).  If you are using a sitemap index file, then you will need to download each xml sitemap separately.  Then you can either crawl each one separately or combine the urls into one master text file.  After the sitemap loads in your browser, click “File”, and then “Save As”.  Then save the file to your hard drive.
  2. Import the Sitemap into Excel
    Next, you’ll need to get a straight list of urls to crawl from the sitemap.  In order to do this, I recommend using the “Import XML” functionality in the “Developer” tab in Excel.  Click “Import” and then select the sitemap file you just downloaded.  Excel will provide a dialog box about the xml schema; just click “OK”.  Then Excel will ask you where to place the data.  Leave the default option and click “OK”.  You should now see a table containing the urls from your xml sitemap.  And yes, you might already see some problems in the list.  :)  If you’d rather script this part of the process, see the sketch after this list.
  3. Copy the URLs to a Text File
    I mentioned earlier that Screaming Frog will only crawl text files with a list of urls in them.  In order to achieve this, you should copy all of the urls from column A in your spreadsheet.  Then fire up your text editor of choice (mine is Textpad), and paste the urls.  Make sure you delete the first row, which contains the heading for the column.  Save that file to your computer.
  4. Unleash the Frog
    Next, we’re ready to crawl the urls in the text file you just created.  Fire up Screaming Frog and click the “Mode” tab.  Select “List”, which enables you to load a text file containing a series of urls.
  5. Load The Text File and Start The Crawl
    Once you select “List Mode”, then click the Upload List button and select “From a file”. Then select the text file you created. Screaming Frog will load the urls and display them in a window. Once you click OK, the crawl will begin.
  6. Analyze the Crawl
    When the crawl is done, you now have a boatload of data about each url listed in the xml sitemap.  The first place I would start is the “Response Codes” tab, which will display the header response codes for each url that was crawled.  You can also use the filter dropdown to isolate 404s, 500s, 302s, etc.  You might be surprised with what you find.
  7. Fix The Problems!
    Once you analyze the crawl, work with your developer or development team to rectify the problems you identified.  The fix sometimes can be handled quickly (in less than a day or two).
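
If you’d rather skip the Excel and text file steps (or you have a lot of sitemaps to process), a short script can handle steps 1 through 3 for you.  This is a minimal sketch, assuming a standard sitemap that uses the sitemaps.org namespace; the sitemap url and output filename are placeholders:

    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "http://www.example.com/sitemap.xml"  # placeholder
    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    # Download the sitemap from the live site (the version the engines actually see).
    xml_data = urllib.request.urlopen(SITEMAP_URL).read()
    root = ET.fromstring(xml_data)

    # Grab every <loc> value.  For a regular sitemap these are page urls; if you
    # point this at a sitemap index file, they will be child sitemap urls that
    # you would then download and parse the same way.
    urls = [loc.text.strip() for loc in root.iter(NS + "loc")]

    # Write a plain list of urls that Screaming Frog's list mode can load.
    with open("sitemap-urls.txt", "w") as f:
        f.write("\n".join(urls))

    print("Wrote %d urls to sitemap-urls.txt" % len(urls))

Then load sitemap-urls.txt in list mode (steps 4 and 5 above) and analyze the crawl as usual.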

Summary – Cleaning Up Dirty Sitemaps
Although XML sitemaps provide an easy way to submit all of your canonical urls to the engines, that ease of setup sometimes leads to serious errors.  If you are seeing strange urls getting indexed, or if you are seeing crawl errors for weird or unfamiliar urls, then you might want to check your own sitemaps to see if they are causing a problem.  Using this tutorial, you can download and crawl your sitemaps quickly, and then flag any errors you find along the way.

Let’s face it, quick and easy wins are sometimes hard to come by in SEO.  But finding xml sitemap errors can be a quick and easy win.  And now you know how to find them.  Happy crawling.

GG

 

Filed Under: google, seo, tools
