The Internet Marketing Driver


Sitebulb Server – Technical Tips And Tricks For Setting Up A Powerful DIY Enterprise Crawler (On A Budget)

September 26, 2022 By Glenn Gabe


When performing SEO audits, crawling is ultra-important. For SEOs and site owners, there are several options available, from local crawlers to enterprise crawlers (SaaS services). I’ve been extremely vocal over the years about my favorite crawling tools, which are Screaming Frog, Sitebulb, DeepCrawl, and more recently, JetOctopus. Screaming Frog and Sitebulb are typically local crawlers, while DeepCrawl and JetOctopus are excellent and powerful enterprise crawlers.

When referring to Screaming Frog and Sitebulb, I said “typically local crawlers” since there are ways to hack a local crawler into a SaaS crawler. For example, I have spun up several AWS servers and installed Screaming Frog and Sitebulb on them so I can crawl remotely. The core benefit is that I free up my local resources to focus on other things while my AWS servers do the heavy lifting crawling-wise.

That has worked pretty well, but there’s a new solution for you Do-It-Yourself’ers. It’s called Sitebulb Server, and it’s now out of beta. I’ve been using it for several months and wanted to cover some tips and tricks in a blog post. I think it’s a powerful solution that can take you from local to enterprise on a budget.

Note, I won’t be covering everything you need to know in this post. Instead, I wanted to cover how it works, some technical tips and tricks, and some watchouts. I’m sure the team over at Sitebulb can answer any other questions you have (they have been super helpful over the years and while I was testing the beta version). And you can always ping me on Twitter if you run into any issues. If I can answer those questions quickly, I will.

What is Sitebulb Server Exactly?
Sitebulb Server is a way for you to set up a special version of Sitebulb on a separate server, which can run crawls without bogging down your local resources. With the standard version of Sitebulb, most users run it on their local computers. That’s fine, but it can definitely bog down your system and take up bandwidth. With Sitebulb Server, that all happens on a separate server. Then you can use a special version of Sitebulb on your desktop to connect to your server. And when you do that, you can access the audits as if you had run them on your local machine. It’s awesome to be able to do that.

I mentioned earlier that you could always set up a separate remote server and run Sitebulb (or Screaming Frog). I have done this for years and it works pretty well (although you couldn’t run multiple crawls at the same time). Well, Sitebulb Server is a remote crawling server, but on steroids. It’s built to run multiple crawls at the same time while enabling you to connect to any of those crawls from your own desktop app. In addition, multiple team members can access those crawls from Sitebulb Server. So if you have a team of SEOs working on an audit, then Sitebulb Server can be a strong DIY solution for accessing crawl data across those team members.

The ability to crawl sites concurrently on a remote server is amazing:

Crawl multiple sites using Sitebulb Server

You can access your server from anywhere in order to audit the crawl data like it was sitting on your local machine:

Access crawl data from anywhere via Sitebulb Server

The Biggest Obstacle IMO – The scary, confusing, cryptic, but often easy, server setup.
This all sounds great, right? But what’s the biggest obstacle or hoop that you need to jump through? Undoubtedly, it’s the server setup. I ran into this when first setting up AWS instances to run their own versions of Screaming Frog and Sitebulb. It’s a cryptic process that many SEOs and site owners aren’t familiar with. It’s not necessarily hard, but definitely an obstacle in my opinion. I find many SEOs have not set up separate servers for crawling and I know a number that ran into snags while trying to set them up.

Well, Sitebulb to the rescue. Patrick and Gareth from Sitebulb have created excellent documentation covering how to set up Sitebulb Server, how to set up remote servers (including AWS and Google Cloud Compute), and more. You can read more in their help documentation, which also includes video clips (which are amazing when you are trying to set up remote servers). Sometimes a picture is worth a thousand words.

For example, here is a video clip Sitebulb put together for setting up Sitebulb Server via AWS:

Note, I personally use AWS, and that has worked well, but you can use whatever setup you want. You can use a dedicated server, AWS, Google Cloud Compute, a spare computer on your local network, etc. Once you set up a server, which typically doesn’t take long, you can move forward with setting up Sitebulb Server and the special desktop version of Sitebulb that connects to your server.

Disk space and vCPUs: Some important points about your server.
When setting up your server, it’s important to make sure you have enough disk space and enough vCPUs (or virtual CPUs). They impact how much crawl data you can store and how many threads you can use when crawling.

First, crawls take up a lot of space. And enterprise crawls take up a ton of space. Make sure you select enough disk space based on the types of crawls you typically run. Below is a screenshot from AWS for configuring storage.

Configuring disk storage when setting up Sitebulb Server on AWS

Next up is vCPUs (or virtual CPUs). It’s important to understand that each vCPU is a thread. So if your crawl will take up 5 threads, then you’ll need 5 vCPUs. In addition, when you connect to the server, you are also taking up a thread. And if you want to run multiple crawls at the same time, you need to take that into account as well (even more threads). Below, you can see the AWS instance has 8 vCPUs (or 8 threads for Sitebulb Server).

Selecting the number of vCPUs when setting up Sitebulb Server on AWS

For example, if you run two crawls using 5 threads each, and you are connecting to the server, then you’ll need 11 threads (5 + 5 + 1). I had some questions about this, and Patrick was awesome with getting back to me with more information. The team over at Sitebulb has a wealth of knowledge and they are incredible with helping customers. So, first check their documentation. If you still don’t have an answer, I’m sure they can help you figure out the best solution.
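
To make the thread math above concrete, here is a minimal Python sketch. It is only a back-of-the-envelope calculator and not part of Sitebulb. The function names `threads_needed` and `fits_on_server` are hypothetical, and it assumes the simple rule described above: one vCPU per crawl thread, plus one per person connected to the server.

```python
def threads_needed(crawl_threads, connected_users=1):
    """Total threads (vCPUs) needed for a set of concurrent crawls.

    crawl_threads: list with the thread count of each concurrent crawl.
    connected_users: people connected to the server at the same time
                     (assumed to take one thread each).
    """
    if any(t < 1 for t in crawl_threads):
        raise ValueError("each crawl needs at least one thread")
    if connected_users < 0:
        raise ValueError("connected_users cannot be negative")
    return sum(crawl_threads) + connected_users


def fits_on_server(crawl_threads, connected_users, vcpus):
    """True if the server has enough vCPUs for the crawls and connections."""
    return threads_needed(crawl_threads, connected_users) <= vcpus


# Two crawls at 5 threads each, plus my own connection to the server.
print(threads_needed([5, 5], connected_users=1))   # 11
# The same workload does not fit on an instance with 8 vCPUs.
print(fits_on_server([5, 5], 1, vcpus=8))          # False
```

Running the numbers like this before picking an instance size can help you avoid paying for vCPUs you won’t use, or ending up with crawls that have to wait.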

Notes about running crawls concurrently versus queuing them.
Another point of confusion is about running concurrent crawls. In other words, this is how you will run multiple crawls at the same time. This is something typically reserved for enterprise crawlers, but you can do this now via Sitebulb Server.

First, when setting up your server make sure you check the option for running concurrent crawls. That’s in the server settings section.

Checking concurrent audits in Sitebulb Server

Next, make sure you have the right setting for “Concurrent queue type”. That should be set to “Next based on available threads” and not “First in, first out”. If you have it set to “First in, first out”, then each crawl will run separately (and in order). By using “Next based on available threads”, the crawls can run at the same time as long as there are enough threads (see my comments earlier about that).

Setting concurrent queue type in Sitebulb Server

And for “Reserved threads”, the number you set is based on the number of team members accessing the server at the same time. If you’re a solo consultant, then you can just set one. If you have two other teammates that will be accessing the server at the same time, then you should have that set to three (you and two teammates).

Setting reserved threads in Sitebulb Server
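
If the difference between the two queue types is hard to picture, the toy Python simulation below might help. To be clear, this is my own simplified model and not Sitebulb’s actual scheduler. The function name `crawls_started`, the mode strings, and the rule that a waiting crawl can be skipped when a smaller one fits are all assumptions made for illustration.

```python
def crawls_started(queue, vcpus, reserved_threads, queue_type):
    """Return the names of queued crawls that can start right away.

    queue: list of (name, threads) tuples in the order they were queued.
    vcpus: total threads on the server.
    reserved_threads: threads held back for people connecting to the server.
    queue_type: "fifo" (first in, first out) or "available_threads".
    """
    if queue_type not in ("fifo", "available_threads"):
        raise ValueError("unknown queue_type")

    available = vcpus - reserved_threads
    started = []
    for name, threads in queue:
        if queue_type == "fifo":
            # Assumed model: crawls run one at a time, in order.
            if threads <= available:
                started.append(name)
            break
        # Assumed model: start every queued crawl that still fits.
        if threads <= available:
            started.append(name)
            available -= threads
    return started


queue = [("site-a", 5), ("site-b", 5), ("site-c", 2)]

# 8 vCPUs with 1 reserved thread leaves 7 threads for crawling.
print(crawls_started(queue, 8, 1, "available_threads"))  # ['site-a', 'site-c']
print(crawls_started(queue, 8, 1, "fifo"))               # ['site-a']
```

In this model, the second 5-thread crawl has to wait in both modes, since only 7 threads are free for crawling on an 8 vCPU server with one reserved thread.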

IP Address Changes When You Stop and Restart AWS
Another confusing topic is related to IP addresses and your AWS instances. Since you are paying when the server is in use, you will typically want to stop that instance when it’s not in use. If not, your costs can start to skyrocket. But here’s the rub. When you stop and then restart your AWS instance, the server gets a new public IP address. And that IP address is what you use when connecting your Sitebulb desktop app to your Sitebulb Server. It’s also what you use when connecting to that server via Remote Desktop (for managing the server remotely).

Therefore, you will need to quickly go into your settings on Sitebulb desktop and change the IP address for your server. It doesn’t take long and it’s not hard to do, but it can cause confusion if you don’t know you have to do that. You basically won’t be able to connect to your Sitebulb Server unless the correct IP address is used.

Changing IP address after stopping and restarting an AWS server

And also remember you will need to change that IP address when connecting via Remote Desktop. If not, your connection will fail. You use Remote Desktop to manage your server remotely (like installing software).

Adding a new IP address via Remote Desktop
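
If you would rather not dig through the AWS console for the new IP address after each restart, one option is to read it from the JSON that the AWS CLI command `aws ec2 describe-instances` prints. Below is a minimal Python sketch for that. The function name `public_ips`, the instance ID, and the IP address in the sample are made up for illustration, and the sample JSON is trimmed down to only the fields the function reads.

```python
import json


def public_ips(describe_instances_json):
    """Map each instance ID to its public IP (None if it has none).

    describe_instances_json: the JSON text printed by
    `aws ec2 describe-instances`.
    """
    data = json.loads(describe_instances_json)
    ips = {}
    for reservation in data.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # Stopped instances typically have no PublicIpAddress field.
            ips[instance["InstanceId"]] = instance.get("PublicIpAddress")
    return ips


# Trimmed, made-up sample of the CLI output (hypothetical ID and IP).
sample = """
{
  "Reservations": [
    {
      "Instances": [
        {
          "InstanceId": "i-0123456789abcdef0",
          "State": {"Name": "running"},
          "PublicIpAddress": "203.0.113.10"
        }
      ]
    }
  ]
}
"""

print(public_ips(sample))  # {'i-0123456789abcdef0': '203.0.113.10'}
```

You could save the CLI output to a file after restarting the instance and pass its contents to the function. The address it returns is the one to enter in both the Sitebulb desktop app and Remote Desktop.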

Connect to multiple Sitebulb servers from one desktop Sitebulb setup.
Another cool feature of Sitebulb Server is that you can connect to multiple servers from one desktop setup. So, if you need multiple Sitebulb Servers since you need to run many crawls at the same time, you can do that. Just spin up multiple AWS servers or dedicated servers, set up Sitebulb Server on them, and then connect to those servers from your desktop app. Sitebulb Server is extremely scalable on that front.

Add multiple servers in Sitebulb Server
Registering a new server in Sitebulb Server

Important: Open up a network port on your server.
OK, I ran into this issue when setting up Sitebulb Server, so I’m sure others will too. Sitebulb also has this in their documentation, so hopefully you won’t miss it when setting up your own server. But, I wanted to cover it here anyway, since it’s important.

You will probably need to open a network port on your server firewall in order to properly run Sitebulb Server. Network ports are typically closed by default, so you’ll need to create a firewall policy to open port 10401 on your server. It’s easy to do once you know where to go and how to do it, but I think many could miss setting it up. Sitebulb’s video tutorials cover this step in detail, so I won’t reinvent the wheel here. But again, it’s important to do.

Opening a network port when setting up Sitebulb Server via AWS
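
Once the firewall rule is in place, a quick way to sanity check it from your own machine is to try opening a connection to the port. The Python sketch below does that with the standard library. The function name `port_is_open` is my own, the IP address in the usage comment is a placeholder, and I am assuming port 10401 is a TCP port. Also note that the check can only succeed if Sitebulb Server is actually running and listening on the server.

```python
import socket


def port_is_open(host, port=10401, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError if the port is closed,
        # filtered by a firewall, or the host cannot be reached in time.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Usage (replace the placeholder with your server's current IP address):
# print(port_is_open("203.0.113.10"))
```

If this returns False right after a restart, double check the IP address first, and then the firewall policy for port 10401.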

Sitebulb Server – A strong option for running enterprise crawls without bogging down your local setup.
Again, I didn’t want to try and cover everything about Sitebulb Server in this post. Instead, I wanted to cover some technical tips and tricks that SEOs and site owners might run into while setting up and running Sitebulb Server (based on using Sitebulb Server over the past several months). Personally, I have found Sitebulb Server to be a strong solution for running enterprise crawls on a budget. And I think you will too. I recommend reaching out to Patrick and Gareth at Sitebulb to learn more about the options available for trying out Sitebulb Server.

GG


Filed Under: google, seo, tools
