Nadine Wolff, published on: 11.01.2019

Crawling – The spider on the move on your website


This article provides an overview of what this "crawling" actually is and what the difference is between crawling and indexing on Google. Additionally, you'll get to know a selection of web crawlers and gain a brief insight into their focuses.

You will also learn how the Google crawler works and how to manage it, since crawling can be controlled with a few simple tricks.

The term "crawling" is a fundamental technical term in search engine optimization. It is often confused or conflated with "indexing", yet both processes are so fundamental that web search as we know it depends on them.

What crawlers are there?

A web crawler (also known as an ant, bot, web spider, or web robot) is an automated program or script that automatically searches websites for specific information. This process is known as web crawling or spidering.
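To make this concrete, here is a minimal sketch of such a crawler in Python. This is purely illustrative and not how the Googlebot actually works; the start URL is a placeholder, and the third-party packages requests and beautifulsoup4 are assumed to be installed:

    # A minimal sketch of a web crawler: fetch a page, extract its links,
    # and queue them for the next visit.
    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(start_url, max_pages=10):
        queue = deque([start_url])      # URLs waiting to be visited
        visited = set()                 # never fetch the same page twice
        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)
            response = requests.get(url, timeout=10)
            soup = BeautifulSoup(response.text, "html.parser")
            print(url, "->", soup.title.string if soup.title else "(no title)")
            # Follow every <a href="..."> link, just as a spider would.
            for anchor in soup.find_all("a", href=True):
                queue.append(urljoin(url, anchor["href"]))

    crawl("https://www.example.com/")

Real crawlers add politeness on top of this basic loop: they respect robots.txt, throttle their request rate, and deduplicate URLs far more robustly.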

There are various uses for web crawlers, but essentially they gather and retrieve data from the internet. Most search engines use them to find and index the latest content on the web (e.g., for Google's search results pages). Analytics companies and market researchers use web crawlers to identify customer and market trends. Below, we introduce some well-known web crawlers specifically from the SEO sector:

  • Ahrefs – a well-known SEO tool that provides very detailed data on backlinks and keywords.

  • Semrush – an all-in-one marketing suite for SEO, social media, traffic, and content research.

  • Screaming Frog – an SEO spider tool available as desktop software for macOS, Windows, and Ubuntu, in both a free and a paid version.

Crawling vs. Indexing

Crawling and indexing are two different things, and the two are often confused in the SEO field. Crawling means that a bot (e.g., the Googlebot) fetches and analyzes all the content of a page (text, images, CSS files, and so on). Indexing means that the page can be displayed in Google's search results. The one does not automatically entail the other.

Imagine you are walking down a large hotel corridor, with closed doors to your left and right. You have someone with you, for instance, a travel companion, who in this case is the Googlebot.

  • If Google is allowed to search a page (a room), it can open the door and actually see what is inside (crawling).

  • There may be a sign on the door saying that the Googlebot may enter the room and show it to other people, i.e. you (indexing possible; the page is displayed in search results).

  • The sign on the door could instead say that the room must not be shown to people ("noindex"). The page is crawled, because the bot was allowed to look inside, but it does not appear in search results, because the sign tells the bot not to show the room to anyone.

  • If a page is blocked for the crawler (e.g., a sign on the door saying "Google is not allowed here"), the bot will not enter and look around. It never sees the inside of the room, yet it may still show people (you) where the room is (the URL can appear in the index) and tell them they may go in if they want.

    • Even if there is an instruction inside the room saying that no one should be let in ("noindex" meta tag), the bot will never see it, because it was never allowed into the room.

Blocking a page via robots.txt therefore means that it remains eligible for indexing, regardless of whether the page itself carries a meta-robots tag "index" or "noindex": Google cannot see the tag, because it is not allowed to fetch the page, so the URL is treated as indexable by default. Such a page's ranking potential is naturally reduced, since its content cannot really be analyzed. If you have ever seen a search result whose description reads something like "A description for this result is not available because of this site's robots.txt", this is the reason.
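In practice, the "sign inside the room" is a meta-robots tag in the page's <head>. A minimal sketch of such a page (the content is a placeholder):

    <!DOCTYPE html>
    <html>
    <head>
      <!-- The "sign inside the room": this page may be crawled,
           but must not be shown in search results. -->
      <meta name="robots" content="noindex">
      <title>Internal page</title>
    </head>
    <body>...</body>
    </html>

And this is exactly the trap described above: if the same URL is also disallowed in robots.txt, the crawler never fetches this HTML and therefore never sees the noindex instruction.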

[Image: Google search results page with a blocked description due to robots.txt]

Google Crawler – It came, it saw, and it indexed

The Googlebot is Google's search bot that crawls the web and builds the index. It is also known as a spider. The bot crawls every page it is allowed to access and adds it to the index, from which it can be retrieved and returned in response to user queries.

In the SEO field, a distinction is made between classic search engine optimization and Googlebot optimization. The Googlebot spends more time crawling websites with significant PageRank. PageRank is a Google algorithm that essentially analyzes and weights the link structure of a domain. The time the Googlebot dedicates to your website is called the "crawl budget": the greater a page's "authority", the more crawl budget the website receives.
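For orientation: the classic PageRank formula from the original paper by Brin and Page is a simplification (Google's production systems have long since moved beyond it), but it shows the idea of weighting the link structure:

    PR(A) = (1 - d) + d \cdot \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)

Here T_1 ... T_n are the pages linking to A, C(T) is the number of outbound links on page T, and d is a damping factor, typically set around 0.85. Intuitively, a page inherits authority from the pages linking to it, diluted by how many other links those pages carry.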

In a Googlebot article, Google states: "In most cases, the Googlebot accesses your website only once every few seconds on average. Due to network delays, the frequency may appear higher over short periods." In other words, your website is crawled continuously, as long as it accepts crawlers. In the SEO world there is much discussion about the "crawl rate" and how to get Google to re-crawl a website for optimal ranking.

The Googlebot crawls your website constantly, but it does not constantly crawl all of its pages. The more fresh content, backlinks, comments, and so on are present, the more likely your website is to appear in search results. This underlines the importance of good, up-to-date content: fresh, consistent content keeps attracting the crawler's attention and increases the likelihood of top placements.

The Googlebot first requests a website's robots.txt file to learn the rules for crawling the site. Pages disallowed there are typically not crawled by Google (although, as described above, their URLs can still end up in the index).
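These rules live in a plain text file at the root of the domain (e.g., https://www.example.com/robots.txt). A minimal sketch, with placeholder paths:

    # robots.txt - the first file the Googlebot requests
    User-agent: Googlebot
    Disallow: /internal/

    User-agent: *
    Disallow: /checkout/

    Sitemap: https://www.example.com/sitemap.xml

The Disallow lines are the "signs on the door" from the analogy above; the optional Sitemap line points crawlers to the sitemap discussed next.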

Google's crawler uses the sitemap.xml to discover all areas of the website that should be crawled and indexed. Because websites are built and organized in very different ways, the crawler may not automatically find every page or section on its own. Dynamic content, poorly linked pages, or extensive content archives with few internal links benefit particularly from a carefully created sitemap. Sitemaps are also useful for telling Google about the metadata behind videos, images, or PDF files, provided the sitemap uses the corresponding (partly optional) annotations. If you want to learn more about how to structure a sitemap, read our blog article on "the perfect sitemap".
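For illustration, a minimal sitemap.xml following the sitemaps.org protocol might look like this (URLs and dates are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/</loc>
        <lastmod>2019-01-11</lastmod>
      </url>
      <url>
        <loc>https://www.example.com/blog/crawling/</loc>
        <lastmod>2019-01-05</lastmod>
      </url>
    </urlset>

The optional <lastmod> element gives the crawler a hint about which pages carry fresh content.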

Steering the Googlebot as it crawls and indexes your website is no secret. Simple measures such as a clean robots.txt, a well-maintained sitemap, and sensible internal linking achieve a great deal and directly influence crawling.

Do only a few of your pages make it into Google's index? Contact us. We support you in both strategy and technical implementation.

What can we do for you?

Do you want to ensure your website is crawled correctly? We are happy to advise you on search engine optimization!

We look forward to your inquiry.

Nadine Wolff

As a long-time expert in SEO and web analytics, Nadine Wolff has been working at internetwarriors since 2015. She leads the SEO & Web Analytics team and is passionate about all the (sometimes quirky) innovations from Google and the other major search engines. Nadine has published SEO articles in Website Boosting and looks forward to professional workshops and a sustainable exchange about organic search.
