

Nadine Wolff, published on 11.01.2019
Crawling – The spider on the move on your website
This article provides you with an overview of what this "crawling" actually is and what the difference is compared to indexing on Google. Additionally, you will get to know a small selection of web crawlers and receive a brief insight into their main focuses.
You will also learn how the Google crawler works and how it can be controlled, since crawling can be managed with a few simple tricks.
The term "crawling" is a fundamental technical term in search engine optimization.
The two terms "crawling" and "indexing" are often confused or mixed up.
In essence, these two terms are so relevant that the entire web world depends on them.

What types of crawlers are there?
A web crawler (also known as an ant, bot, web spider, or web robot) is an automated program or script that systematically searches websites for specific information. This process is referred to as web crawling or spidering.
There are various uses for web crawlers. Essentially, they are used to collect and retrieve data from the internet. Most search engines use them to provide up-to-date data and find the latest information online (e.g., for indexing on Google's search result pages). Analytics companies and market researchers use web crawlers to identify customer and market trends. Below, we introduce some well-known web crawlers specifically for the SEO sector:
Ahrefs - a well-known SEO tool that provides very specific data on backlinks and keywords.
Semrush - an all-in-one marketing suite covering SEO, social media, traffic, and content research.
Screaming Frog - an SEO spider tool available as downloadable software for macOS, Windows, and Ubuntu, in both a free and a paid version.
Crawling vs. Indexing
Crawling and indexing are two different things, which is often misunderstood in the SEO field. Crawling means that a bot (e.g., the Googlebot) views and analyzes all the content on a page (text, images, CSS files, and so on). Indexing means that the page can be displayed in Google's search results. The two are closely related, but as the following analogy shows, one does not automatically imply the other.
Imagine walking down a long hotel corridor, with closed doors to your left and right. Accompanying you is someone, for example, a tour guide, who in this case is the Googlebot.
If Google is allowed to browse a page (a room), it can open the door and actually see what's inside (crawling).
There may be a sign on a door indicating that the Googlebot can enter the room and is allowed to show it to others (you) (indexing possible, the page is displayed in search results).
The sign on the door could also say that the room must not be shown to people ("noindex"). The page was crawled, because Google could peek inside, but it is not displayed in search results because it is instructed not to show the room.
If a page is blocked for the crawler (e.g., a sign on the door saying "Google is not allowed here"), it will not go in and look around. It never sees the inside of the room, yet it may still point people (you) to the room (index it) and tell them they may enter if they wish.
Even if there is an instruction inside the room not to let people in (a "noindex" meta tag), Google will never see it, because it was not allowed in.
Blocking a page via robots.txt therefore does not prevent indexing: regardless of whether the page carries an "index" or "noindex" meta robots tag, Google cannot see the tag because it is not allowed to crawl the page, so the URL is treated as indexable by default. Of course, the ranking potential of such a page is reduced, because its actual content cannot be analyzed. If you have ever seen a search result with the description "A description for this result is not available because of this site's robots.txt," this is the reason.
Figure: Google search results page with a blocked description due to robots.txt
Google Crawler – He came, he saw, he indexed
The Googlebot, also known as a spider, is Google's search bot that crawls the web and builds the index. The bot crawls every page it has access to and adds it to the index, from where it can be retrieved and returned in response to users' searches.
In the SEO field, a distinction is made between classical search engine optimization and Googlebot optimization. The Googlebot spends more time crawling websites with significant PageRank. PageRank is an algorithm by Google that essentially analyzes and weighs a domain's link structure. The time the Googlebot dedicates to your website is referred to as the "crawling budget." The greater a page's "authority," the more "crawling budget" the website receives.
In a Googlebot article, Google states: "In most cases, the Googlebot accesses your website on average only once every few seconds. However, due to network delays, the frequency might seem higher over short periods." In other words: your website is crawled continuously, provided it accepts crawlers.
In the SEO world, there is much discussion about the "crawl rate" and how to get Google to recrawl a site for optimal ranking. The more up-to-date the content and the more backlinks, comments, and so on a page has, the more likely it is to appear in search results. Note, however, that the Googlebot does not constantly crawl every page of your website. In this context, we want to stress the importance of good, current content: fresh, consistent content attracts the crawler's attention and increases the chance of top placements.
The Googlebot first accesses a website's robots.txt file to query the rules for crawling the site. Pages disallowed there are typically not crawled by Google and, as described above, at best end up in the index without their content having been analyzed.
Google's crawler uses the sitemap.xml to determine all areas of the website that should be crawled and included in the Google index. Because websites are created and organized in very different ways, the crawler might not automatically reach every page or section. Dynamic content, low-ranked pages, or extensive content archives with little internal linking benefit from a well-crafted sitemap. Sitemaps are also useful for informing Google about the metadata behind videos, images, or PDF files, provided the sitemap includes these optional annotations. If you want to learn more about creating a sitemap, read our blog article on "the perfect sitemap."
Controlling the Googlebot so that it indexes your website is no secret: a lot can be achieved with simple means, such as a good robots.txt file and clean internal linking, to influence crawling.
Are only a few of your pages indexed by Google? Contact us. We support you with strategy and technical implementation.
What can we do for you?
Do you want to make sure your website is crawled correctly? We are happy to advise you on search engine optimization!
We look forward to your inquiry.

Nadine Wolff
As a long-time expert in SEO (and web analytics), Nadine Wolff has been working with internetwarriors since 2015. She leads the SEO & Web Analytics team and is passionate about all the (sometimes quirky) innovations from Google and the other major search engines. In the SEO field, Nadine has published articles in Website Boosting and enjoys professional workshops and sustainable organic exchange.