Blog Post
SEO
Nadine Wolff
published on: 23.06.2016
The Most Common Causes of Duplicate Content
Search engines want to deliver the most relevant information to users and organize their data efficiently. One phenomenon that works against this goal is duplicate content, sometimes also called "double content". But how does duplicate content arise? We have compiled the most common causes of duplicate content for you and show you how to avoid it.
The most common cause of duplicate content is the conscious or unconscious copying of content. Copying product descriptions, definitions, press releases, and other content from the web and publishing it on your own site presents a significant problem. Depending on the extent to which website operators pursue this strategy, in the worst case, search engines might penalize the relevant pages. Consequently, the keywords gradually disappear from the rankings.
Online shops are also affected. Each product with its own URL requires an individual product description. Another option would be to exclude the product from indexing, which is rather counterproductive for an online shop. After all, the goal of a shop is to sell goods. If potential buyers do not find the product pages in the search engines, your shop goes undiscovered and the goods remain unsold.
Problem: Domains are accessible with and without www
Many websites are accessible both with and without www. This is problematic because, from the search engine's perspective, all URLs are present twice.
https://www.domain.de/
https://domain.de/
Decide on a main variant. The majority of website operators choose the www variant. Make sure all internal links also point to this variant to make optimal use of the internal link juice. You have two options at your disposal to set your main variant.
.htaccess file: Using this file, you specify that the variant without www permanently redirects (301) to the variant with www; see the sketch after this list.
Google Search Console: Google is aware of the issue. You have the option to specify which variant you prefer in Google Search Console.
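For the .htaccess route, a minimal sketch could look like the following; it assumes an Apache server with mod_rewrite enabled and uses domain.de as a placeholder for your own domain:

RewriteEngine On
# Redirect every request for the non-www host permanently (301) to the www variant
RewriteCond %{HTTP_HOST} ^domain\.de$ [NC]
RewriteRule ^(.*)$ https://www.domain.de/$1 [R=301,L]

The permanent (301) redirect signals to search engines that the www variant is the only valid one, so rankings and link juice are consolidated there.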
Are trailing slashes problematic?
Anyone browsing the web notices different ways of writing URLs. There are URLs with and without a slash at the end. For example:
https://www.domain.de/
https://www.domain.de
Strictly speaking, these are two different URLs and thus two different documents, which, if they show the same content, can cause duplicate content. Even though Google states that automatic canonicalization usually succeeds in this case, it is advisable to settle on a uniform scheme.
Homepage with or without index.html – what's the difference?
The homepage of a domain should never be accessible via multiple URLs such as:
https://www.domain.de
https://www.domain.de/
https://www.domain.de/index.html
As described, the trailing slash is not a problem for the homepage, as browsers automatically request the root path "/" anyway. It is different with "index.html". Set a canonical tag on the https://www.internetwarriors.de/index.html page. This tag prevents duplicate content and directs the entire link power to the correct URL. It tells the search engine that the homepage is always defined as https://www.internetwarriors.de/.
This tag looks like this:
<link rel="canonical" href="https://www.domain.de/">
How to handle staging and development servers
When website operators carry out larger work on their sites, they usually create a copy of the site. On this copy, they test new design elements or programming without affecting the public website. Depending on the size of the website, several people may work on this test system, which is typically accessible via the web. URLs for such test systems typically look like this:
https://test.domain.de
https://www.domain.de/test/
https://www.test-domain.de
If you forget to protect the subdomain or the path from search engines, duplicate content arises because search engines index both the test page and the live page.
To prevent this, there are the following options:
Protect the test page with a password via the .htaccess file.
Block access to all web crawlers through robots.txt.
This ensures that your test page does not end up in the search engines' index and no duplicate content is created.
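Both measures are quick to set up. A minimal sketch, assuming an Apache server and a placeholder path for the password file:

# robots.txt on the test system: ask all crawlers to stay out
User-agent: *
Disallow: /

# .htaccess on the test system: require a login (the AuthUserFile path is a placeholder)
AuthType Basic
AuthName "Staging"
AuthUserFile /path/to/.htpasswd
Require valid-user

Note that robots.txt is only a request to crawlers, while the password protection actually blocks access; combining both is therefore the safer choice.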
If you add only a noindex meta tag to the test page, make sure to remove it when going live. Otherwise, you risk search engines removing the public website from the index. However, this is only a danger if the test system later replaces the current live system.
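The noindex meta tag belongs in the head section of every page on the test system and looks like this:

<meta name="robots" content="noindex">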
How to correctly implement print views
Websites often offer a print view. There are two variants to implement it. You can style the page differently for printing via the media control of CSS. This is done with a single line in the head section of the page's source code, for example a stylesheet link with the media attribute (print.css is a placeholder file name). It can look like this:
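<link rel="stylesheet" href="print.css" media="print">

The media="print" attribute tells the browser to apply this stylesheet only when printing; the URL of the document itself does not change.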
Since this is the same document, this variant is safe and free of duplicate content.
Another variant is controlling the print view through a separate URL or a parameter. These can look as follows:
https://www.domain.de/blog.html?print=1
https://www.domain.de/blog-print-view.html
Since these are two different URLs with the same content, the likelihood of duplicate content is high. It is recommended to exclude the print view from indexing. Additionally, you should add a nofollow attribute to the link to the print view.
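A minimal sketch of both measures, using the print URL from above as a placeholder:

<!-- In the head section of the print view: exclude it from indexing -->
<meta name="robots" content="noindex">

<!-- The link to the print view: do not pass on link juice -->
<a href="https://www.domain.de/blog-print-view.html" rel="nofollow">Print view</a>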
Handling functional parameters
In shop systems and content management systems, functional parameters often exist to control views. Typically, these are parameters in product categories that, for example, sort by brand name or price. These may look like this:
Sorting by brand: https://www.domain.de/category.html?sort=brand
Sorting by price: https://www.domain.de/category.html?sort=price
Note that these parameters change only the sorting, not the content, so duplicate content can arise. You can prevent this by blocking the parameters via robots.txt or Google Search Console, or by setting a noindex meta tag. Keep this in mind for session IDs, pagination, the internal search, and product variants such as size or color.
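For the robots.txt route, a minimal sketch could look like this; the sort parameter is taken from the example URLs above:

# Do not crawl any URL that contains the sort parameter
User-agent: *
Disallow: /*?sort=

Google understands the * wildcard in robots.txt. Alternatively, a canonical tag on the sorted views pointing to https://www.domain.de/category.html consolidates the signals on the unsorted category page.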
What we can do for you
Are you concerned about having duplicate content on your website or want to learn more about the topic? Contact us and we will help you identify and resolve duplicate content issues on your site.
Nadine Wolff
As a long-time expert in SEO (and web analytics), Nadine Wolff has been working with internetwarriors since 2015. She leads the SEO & Web Analytics team and is passionate about all the (sometimes quirky) innovations from Google and the other major search engines. In the SEO field, Nadine has published articles in Website Boosting and looks forward to professional workshops and a lasting exchange about organic search.