Sitebulb Server – Technical Tips And Tricks For Setting Up A Powerful DIY Enterprise Crawler (On A Budget)

Glenn Gabe


Sitebulb Server

When performing SEO audits, crawling is ultra-important. For SEOs and site owners, there are several options available, from local crawlers to enterprise crawlers (SaaS services). I’ve been extremely vocal over the years about my favorite crawling tools, which are Screaming Frog, Sitebulb, DeepCrawl, and more recently, JetOctopus. Screaming Frog and Sitebulb are typically local crawlers, while DeepCrawl and JetOctopus are excellent and powerful enterprise crawlers.

When referring to Screaming Frog and Sitebulb, I said “typically local crawlers” since there are ways to hack a local crawler into a SaaS crawler. For example, I have spun up several AWS servers and installed Screaming Frog and Sitebulb on them so I can crawl remotely. The core benefit is that I free up my local resources to focus on other things while my AWS servers do the heavy lifting crawling-wise.

That has worked pretty well, but there’s a new solution for you Do-It-Yourself’ers. It’s called Sitebulb Server, and it’s now out of beta. I’ve been using it for several months and wanted to cover some tips and tricks in a blog post. I think it’s a powerful solution that can take you from local to enterprise on a budget.

Note, I won’t be covering everything you need to know in this post. Instead, I wanted to cover how it works, some technical tips and tricks, and some watchouts. I’m sure the team over at Sitebulb can answer any other questions you have (they have been super helpful over the years and while I was testing the beta version). And you can always ping me on Twitter if you run into any issues. If I can answer those questions quickly, I will.

What is Sitebulb Server Exactly?
Sitebulb Server is a way for you to set up a special version of Sitebulb on a separate server, which can run crawls while not bogging down your local resources. With the standard version of Sitebulb, most users run it on their local computers. That’s fine, but it can definitely bog down your system and take up bandwidth. With Sitebulb Server, that all happens on a separate server. Then you can use a special version of Sitebulb on your desktop to connect to your server. And when you do that, you can access the audits like you had run them on your local machine. It’s awesome to be able to do that.

I mentioned earlier that you could always set up a separate remote server and run Sitebulb (or Screaming Frog). I have done this for years and it works pretty well (although you couldn’t run multiple crawls at the same time). Well, Sitebulb Server is a remote crawling server, but on steroids. It’s built to run multiple crawls at the same time while enabling you to connect to any of those crawls from your own desktop app. In addition, multiple team members can access those crawls from Sitebulb Server. So if you have a team of SEOs working on an audit, then Sitebulb Server can be a strong DIY solution for accessing crawl data across those team members.

The ability to crawl sites concurrently on a remote server is amazing:

Crawl multiple sites using Sitebulb Server

You can access your server from anywhere in order to audit the crawl data like it was sitting on your local machine:

Access crawl data from anywhere via Sitebulb Server

The Biggest Obstacle IMO – The scary, confusing, cryptic, but often easy, server setup.
This all sounds great, right? But what’s the biggest obstacle or hoop that you need to jump through? Undoubtedly, it’s the server setup. I ran into this when first setting up AWS instances to run their own versions of Screaming Frog and Sitebulb. It’s a cryptic process that many SEOs and site owners aren’t familiar with. It’s not necessarily hard, but definitely an obstacle in my opinion. I find many SEOs have not set up separate servers for crawling and I know a number that ran into snags while trying to set them up.

Well, Sitebulb to the rescue. Patrick and Gareth from Sitebulb have created excellent documentation for setting up Sitebulb Server, setting up remote servers (including AWS and Google Cloud Compute), and more. You can read more in their help documentation, which also includes video clips (which are amazing when you are trying to set up remote servers). Sometimes a picture is worth a thousand words.

For example, here is a video clip Sitebulb put together for setting up Sitebulb Server via AWS:

YouTube video

Note, I personally use AWS, and that has worked well, but you can use whatever setup you want. You can use a dedicated server, AWS, Google Cloud Compute, a spare computer on your local network, etc. Once you set up a server, which typically doesn’t take long, then you can move forward with setting up Sitebulb Server and the special desktop version of Sitebulb that connects to your server.

Disk space and vCPUs: Some important points about your server.
When setting up your server, it’s important to make sure you have enough disk space and enough vCPUs (or virtual CPUs). They impact how much crawl data you can store and how many threads you can use when crawling.

First, crawls take up a lot of space. And enterprise crawls take up a ton of space. Make sure you select enough disk space based on the types of crawls you typically run. Below is a screenshot from AWS for configuring storage.

Configuring disk storage when setting up Sitebulb Server on AWS

Next up is vCPUs (or virtual CPUs). It’s important to understand that each vCPU is a thread. So if your crawl will take up 5 threads, then you’ll need 5 vCPUs. In addition, when you connect to the server, you are also taking up a thread. And if you want to run multiple crawls at the same time, you need to take that into account as well (even more threads). Below, you can see the AWS instance has 8 vCPUs (or 8 threads for Sitebulb Server).

Selecting the number of vCPUs when setting up Sitebulb Server on AWS

For example, if you run two crawls using 5 threads each, and you are connecting to the server, then you’ll need 11 threads (5 + 5 + 1). I had some questions about this, and Patrick was awesome with getting back to me with more information. The team over at Sitebulb has a wealth of knowledge and they are incredible with helping customers. So, first check their documentation. If you still don’t have an answer, I’m sure they can help you figure out the best solution.
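To make the thread math concrete, here is a minimal Python sketch of the calculation described above. The function name and parameters are my own (hypothetical) and not part of Sitebulb. It simply assumes one vCPU per crawl thread plus one per desktop connection, as covered earlier.

```python
def required_vcpus(crawl_threads, connected_users=1):
    """Estimate the vCPUs needed for crawls running at the same time.

    crawl_threads: list with the thread count of each concurrent crawl.
    connected_users: desktop connections, each assumed to take one thread.
    """
    if any(threads < 1 for threads in crawl_threads):
        raise ValueError("each crawl needs at least one thread")
    if connected_users < 0:
        raise ValueError("connected_users cannot be negative")
    # One vCPU per crawl thread, plus one per connected desktop app.
    return sum(crawl_threads) + connected_users


if __name__ == "__main__":
    # Two crawls at 5 threads each, plus my own connection to the server.
    print(required_vcpus([5, 5], connected_users=1))  # 11
```

If the number you get is higher than the vCPUs on your instance, you can either lower the threads per crawl or pick a bigger instance.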

Notes about running crawls concurrently versus queuing them.
Another point of confusion is running concurrent crawls, which means running multiple crawls at the same time. This is typically reserved for enterprise crawlers, but you can do this now via Sitebulb Server.

First, when setting up your server, make sure you check the option for running concurrent crawls. That’s in the server settings section.

Checking concurrent audits in Sitebulb Server

Next, make sure you have the right setting for “Concurrent queue type”. That should be set to “Next based on available threads” and not “First in, first out”. If you have it set to “First in, first out”, then each crawl will run separately (and in order). By using “Next based on available threads”, the crawls can run at the same time as long as there are enough threads (see my comments earlier about that).

Setting concurrent queue type in Sitebulb Server

And for “Reserved threads”, the number you set is based on the number of team members accessing the server at the same time. If you’re a solo consultant, then you can just set one. If you have two other teammates who will be accessing the server at the same time, then you should have that set to three (you and two teammates).

Setting reserved threads in Sitebulb Server
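To illustrate how these settings interact, below is a rough Python model of the two queue types. This is only my reading of the behavior described above, not Sitebulb's actual scheduler. The function name, the "fifo" and "available" labels, and the sample crawl names are all hypothetical.

```python
def crawls_started_now(queue, total_threads, reserved_threads, queue_type):
    """Return the names of queued crawls that could start on an idle server.

    queue: list of (crawl_name, threads_needed) in the order they were queued.
    queue_type: "fifo" models "First in, first out" (one crawl at a time);
                "available" models "Next based on available threads".
    """
    if queue_type not in ("fifo", "available"):
        raise ValueError("queue_type must be 'fifo' or 'available'")

    # Reserved threads are held back for team members connecting to the server.
    free = total_threads - reserved_threads
    started = []
    for name, threads in queue:
        if threads > free:
            if queue_type == "fifo":
                break  # the first crawl in line blocks everything behind it
            continue  # assumption: later crawls that fit may still start
        started.append(name)
        free -= threads
        if queue_type == "fifo":
            break  # crawls run separately, in order
    return started


if __name__ == "__main__":
    queue = [("site-a", 5), ("site-b", 5), ("site-c", 2)]
    print(crawls_started_now(queue, 8, 1, "fifo"))
    print(crawls_started_now(queue, 8, 1, "available"))
```

With 8 vCPUs and one reserved thread, the model has 7 threads to hand out. The "fifo" run starts only the first crawl, while the "available" run also starts any later crawl that fits in the threads left over.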

IP Address Changes When You Stop and Restart AWS
Another confusing topic is related to IP addresses and your AWS instances. Since you are paying when the server is in use, you will typically want to stop that instance when it’s not in use. If not, your costs can start to skyrocket. But here’s the rub. When you stop and then restart your AWS instance, the server gets a new public IP address. And that IP address is what you use when connecting your Sitebulb desktop app to your Sitebulb Server. It’s also what you use when connecting to that server via Remote Desktop (for managing the server remotely).

Therefore, you will need to quickly go into your settings on Sitebulb desktop and change the IP address for your server. It doesn’t take long, it’s not hard to do, but it can cause confusion if you don’t know you have to do that. You basically won’t be able to connect to your Sitebulb Server unless the correct IP address is used.

Changing IP address after stopping and restarting an AWS server

And also remember you will need to change that IP address when connecting via Remote Desktop. If not, your connection will fail. You use Remote Desktop to manage your server remotely (like installing software).

Adding a new IP address via Remote Desktop
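If you don't want to dig through the AWS console after every restart, you can look up the new address with a short script. Here is a minimal sketch using boto3 (the AWS SDK for Python), assuming it is installed and your AWS credentials are already configured. The helper names and the instance ID in the comment are hypothetical.

```python
def public_ip_from_response(response):
    """Pull the first public IPv4 address out of a describe_instances response."""
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            ip = instance.get("PublicIpAddress")
            if ip:
                return ip
    return None  # a stopped instance has no public IP address


def current_public_ip(instance_id, region_name=None):
    """Ask EC2 for the current public IP address of one instance."""
    # boto3 is a third-party package, imported here so the parser above
    # can be used (and tested) without it.
    import boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    response = ec2.describe_instances(InstanceIds=[instance_id])
    return public_ip_from_response(response)


# Example call (hypothetical instance ID, replace with your own):
# print(current_public_ip("i-0123456789abcdef0"))
```

Whatever address comes back is the one to enter in both the Sitebulb desktop settings and Remote Desktop.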

Connect to multiple Sitebulb servers from one desktop Sitebulb setup.
Another cool feature of Sitebulb Server is that you can connect to multiple servers from one desktop setup. So, if you need multiple Sitebulb Servers since you need to run many crawls at the same time, you can do that. Just spin up multiple AWS servers or dedicated servers, set up Sitebulb Server on them, and then connect to those servers from your desktop app. Sitebulb Server is extremely scalable on that front.

Add multiple servers in Sitebulb Server
Registering a new server in Sitebulb Server

Important: Open up a network port on your server.
OK, I ran into this issue when setting up Sitebulb Server, so I’m sure others will too. Sitebulb also has this in their documentation, so hopefully you won’t miss it when setting up your own server. But, I wanted to cover it here anyway, since it’s important.

You will probably need to open a network port on your server firewall in order to properly run Sitebulb Server. Network ports are typically closed by default, so you’ll need to create a firewall policy to open port 10401 on your server. It’s easy to do once you know where to go and how to do it, but I think many could miss setting it up. Sitebulb’s video tutorials cover this step in detail, so I won’t reinvent the wheel here. But again, it’s important to do.

Opening a network port when setting up Sitebulb Server via AWS
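Once the firewall rule is in place, a quick way to confirm the port is reachable from your own machine is a simple connection test. Below is a minimal Python sketch. It assumes the port accepts TCP connections (the port number comes from above, the TCP part is my assumption), and the function name and the IP address in the comment are hypothetical.

```python
import socket


def port_is_open(host, port=10401, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call (placeholder IP address, replace with your server's address):
# print(port_is_open("203.0.113.10"))
```

If this returns False while the server is running, the firewall rule on the server (or the inbound rules on your cloud provider) is the first place I would check.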

Sitebulb Server – A strong option for running enterprise crawls without bogging down your local setup.
Again, I didn’t want to try and cover everything about Sitebulb Server in this post. Instead, I wanted to cover some technical tips and tricks that SEOs and site owners might run into while setting up and running Sitebulb Server (based on using Sitebulb Server over the past several months). Personally, I have found Sitebulb Server to be a strong solution for running enterprise crawls on a budget. And I think you will too. I recommend reaching out to Patrick and Gareth at Sitebulb to learn more about the options available for trying out Sitebulb Server.