You’ve most likely heard of “Duplicate Content”, perhaps from internal SEO teams, content marketers or partner agencies. You may have also listened to an explanation and feel you understand what it involves.
In this guide, we’ve highlighted the most common causes of duplicate content and how to resolve them. We’ll explain what duplicate content is (and isn’t), how it occurs, and how to manage it.
The dreaded ‘Google Duplicate Content Penalty’ is a common misconception. We’re here to dispel that myth, as it’s simply untrue. You’ll never find a Google Penalty related to duplicate content in Google Search Console, but that’s not to say that unmanaged duplicate content won’t harm your website’s visibility.
What duplicate content is (and isn’t)
“Your website seems to contain large amounts of duplicate content.”
“But we wrote all the content ourselves!?”
The first hurdle to get past is language; more often than not, people associate duplicate content with plagiarism. This is not the case.
There are two categories of duplicate content: on-site and off-site. Parallels can be drawn between off-site duplicate content and plagiarism, although that isn’t typically a technical issue you can control. The causes, impacts, and solutions for each type of duplicate content are entirely different.
“On-site duplicate content” is a technical SEO problem caused by how a website is engineered. It occurs when a specific webpage renders at multiple different URLs. It is not content that has been stolen, reused, or taken from other places on the web or your website.
So you know, almost every CMS-driven website produces duplicate content – the question is whether or not it’s being managed properly.
The simplest example is your homepage. A homepage might show up when you type example.com or www.example.com. In this case, the same content is being rendered at two different URLs, meaning that one of them is a duplicate.
Now, it’s only a problem if search engines can crawl the duplicates. That said, never underestimate Googlebot’s ability to find stuff. It usually has a helping hand, like an incorrectly configured sitemap or a CMS-generated link. When Google is sending you over 50% of your online customers, it’s worth taking precautions.
So why worry about it?
Don’t worry, but do be aware of it. Google’s index is based entirely on URLs. When the same page renders at two different URLs, there’s no clear indication of the correct page. As a result, neither page ranks as well as it should.
Signs of duplicate content
Duplicate content can crop up in several ways, but it most commonly appears following the launch of a new website, or during development changes where duplicate content management has been implemented incorrectly (or not at all). You’ll see rankings and traffic start to slide, although the impact will depend on the severity of the problem.
If you’ve got a solid grasp of duplicate content, you’ll be able to find it by carrying out manual checks on a site, but for a quick spot-check, you can carry out a site search on Google (site:yourdomain.com). If the last page of the search results mentions that entries very similar to those already displayed have been omitted, there’s a chance that duplicate content is afoot. You’ll need to investigate further to be certain.
How duplicate content occurs
Homepage duplicates
One of the most common instances of duplicate content on every website is duplication between the www subdomain and the non-www root domain.
For example:
- www.example.com
- example.com
Depending on your server, you’ll find that the homepage could also render at:
- example.com/index.php (Linux servers)
- www.example.com/index.php (Linux servers)
- example.com/home.aspx (Windows servers)
- www.example.com/home.aspx (Windows servers)
This is the simplest, most noticeable instance of duplicate content, and for the most part, people are aware of it.
This type of duplication usually occurs throughout a website, so if your site renders at www.example.com and example.com, it probably renders at www.example.com/category and example.com/category too. This means that the duplicates are sitewide and will significantly impact organic performance.
Solutions
- 301 (permanent) redirect (sketch below)
- Canonical link element
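For instance, you could enforce the www hostname with a single 301. Here’s a minimal sketch, assuming an Apache server with mod_rewrite and www as your preferred hostname (other servers have their own equivalents):

# .htaccess: redirect every non-www request to the www hostname
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]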
Sub-folders, sub-categories, and child pages
Most websites use some form of categories and sub-categories to help users find information. Categories are often the most important areas of an e-commerce site, as they intuitively target refined, specific search terms.
For example, if you sell a Widget at Widgets.com and a potential customer wants to buy “Blue Widgets”, more often than not it will be a category page for “Blue Widgets” that’s returned as a result. The same applies to any site categorising content into sub-folders and child pages.
Let’s say you have the category structure as follows:
example.com/category/sub-category
Here the user has probably navigated to the first category and then into one of its sub-categories. Many systems will also allow this sub-category to render at example.com/sub-category, without the parent category included in the URL. The same sub-category content now renders at multiple URLs: one that includes the parent category and one that doesn’t.
The same applies to child pages which could render at example.com/category/product and example.com/product. This might occur on a non-e-commerce site as example.com/services/service-name and example.com/service-name.
Solution
- 301 (permanent) redirect
- Canonical link element (example below)
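For the sub-category example above, the short URL would carry a canonical link element pointing at the full-path version. A sketch, assuming example.com/category/sub-category is the version you want to rank:

<!-- In the <head> of example.com/sub-category (the duplicate) -->
<link rel="canonical" href="https://example.com/category/sub-category">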
Pagination
In some cases, the contents of a category page may be broken into several pages – 1, 2, and 3, for example. We refer to this as a ‘paginated series’.
Using the previous example, here’s what page 1 will normally look like:
example.com/category
Page 2 might then be accessed at:
example.com/category/?p=2
Precisely how the pagination is reflected in the URL will depend on the site’s setup. In this instance, we’re still in the same category, but on the second page. Search engines may well interpret the subsequent pages as duplicates of page 1.
Solution
- rel="next" and rel="prev" link elements (example below)
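Assuming the ?p= parameter setup above and a third page in the series, page 2’s head might contain:

<!-- In the <head> of page 2 of the paginated series -->
<link rel="prev" href="https://example.com/category">
<link rel="next" href="https://example.com/category/?p=3">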
Parameters
Most websites affix a parameter to a URL based on certain conditions, such as the use of a filter, a ‘sort by’ function, or a variety of other features. A common cause is “breadcrumbs”, which help users navigate a site. Breadcrumbs represent the user’s path to a specific page and are usually clickable for navigation purposes.
Breadcrumbs are specific to the user and are driven by session parameters which are sometimes visible in the page URL.
For example:
example.com/category/sub-category/product/?Path=312&214
Here “Path” refers to the user’s route, and the numbers represent specific categories. In this example, the user has accessed category 312, followed by category 214. This might generate breadcrumbs that look like this:
home -> category -> sub-category -> product
We’re still on the same product page, as identified in the URL, except with URL parameters appended to create the breadcrumbs.
The same content renders on this page, but it can be accessed using various URLs. This problem is exacerbated by the number of different routes a user could take, increasing the number of duplicates considerably.
Solution
- Canonical link element (example below)
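For the breadcrumb example above, every parameterised variant would carry the same canonical link element, pointing to the clean product URL:

<!-- In the <head> of /product/?Path=312&214 (and every other Path variant) -->
<link rel="canonical" href="https://example.com/category/sub-category/product">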
Capitalisation & trailing slashes
Some platforms ignore letter case in URLs, allowing a page to render irrespective of capitalisation. If the same page is accessible at URLs containing upper-case letters as well as at all-lower-case URLs, you’re probably going to have some problems. For example:
example.com/category
example.com/Category
The same applies to trailing slashes (/) in URLs:
example.com/category
example.com/category/
Solution
- 301 (permanent) redirect (sketch below)
- Canonical link element
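Here’s a sketch for the trailing-slash case, again assuming Apache with mod_rewrite (forcing lower case generally needs a server-level RewriteMap or application logic, so it isn’t shown here):

# .htaccess: append a missing trailing slash with a single 301
RewriteEngine On
# Don't rewrite requests for real files (e.g. /style.css)
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*[^/])$ /$1/ [R=301,L]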
Random CMS Junk
Obviously, this is not a technical term. Not all websites operate on a modern, well-maintained CMS platform. Many are outdated, bespoke, and, quite frankly, not in good condition for SEO purposes.
The quality of a bespoke CMS, for example, is directly related to the knowledge and ability of the development team that built it. A slight lack of technical SEO knowledge can result in a site that outputs a great deal of dynamically generated duplicate content.
Spotting this is quite simple: conduct a site search in Google using “site:example.com” and look for indexed URLs containing “?”, path parameters, and “index.php/?”. Assuming you have SEO-friendly URLs, these are most likely unmanaged duplicates of canonical pages.
Solution
- Canonical link element
Localisation & Translation
There are two ways to tailor content for an audience. Localisation is when content is provided in the same language, but the information is tweaked for each audience to account for linguistic differences. These variants might exist on a subdomain (us.example.com) or a subfolder (example.com/us).
Where equivalent pages exist for another locale (such as uk.example.com or example.com/uk), content should be localised for two reasons:
- to ensure the right content ranks for the right audience
- to ensure that similar content is not considered a duplicate
The same applies to translation, except the difference is in the language, for example en.example.com or example.com/en. What’s important is that search engines don’t perceive these pages as unmanaged duplicates; they are the same page, tailored for a different audience, and should be flagged as such.
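The mechanism for flagging locale variants is the alternate link element with an hreflang attribute. Each variant lists every variant, including itself. A sketch, assuming subfolder variants for the UK and US:

<!-- In the <head> of every locale variant -->
<link rel="alternate" hreflang="en-gb" href="https://example.com/uk/">
<link rel="alternate" hreflang="en-us" href="https://example.com/us/">
<link rel="alternate" hreflang="x-default" href="https://example.com/">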
Other instances of duplicate content
Duplicate content can arise in several other ways. Once you understand what it is, you can identify and resolve duplicate issues. Remember, “duplicate content occurs when the same page renders at multiple URLs”.
How to manage duplicate content
First of all, duplicate content is not a bad thing – almost every website outputs duplicate content. The problem is when this duplicate content is not managed using 301 redirects, robot directives, canonical link elements, or alternate link elements.
301 (permanent) redirects
Until the canonical link element was introduced, 301 redirects were the best way to manage duplicate content. However, redirects and link elements work differently.
Once a 301 redirect is applied to a duplicate, a user can no longer access it and will be redirected (all being well) to the correct, canonical version. The problem is that duplicates often exist precisely to serve users. To use the example of path parameters, breadcrumbs provide great usability for visitors. If the URLs containing path parameters are redirected, breadcrumbs will no longer work correctly, detracting from the website’s navigation.
A 301 should only be applied to pages that offer no extra value to a user, such as a root domain and subdomain (www.example.com and example.com). In doing so, roughly 90% of the authority of the donor page passes to the target page (provided the redirect is maintained), consolidating your link equity.
Canonical link elements
The canonical link element deals with duplicate content in the same way as a 301 redirect, with one exception: users can still access the page. This makes it the most effective way to manage duplicates without running the risk of detracting from the user experience.
A canonical link element looks like this:
<link rel="canonical" href="http://example.com">
It points to the canonical (correct) version of the web page on which it is found. The beauty of the canonical link element is that it can be applied site-wide, ensuring protection against duplicate content issues, irrespective of whether there’s a problem or not.
The canonical version of the page should have a self-referring canonical link element – one that points to itself. Duplicates of this page, meanwhile, will have a canonical link element pointing to the canonical version.
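Tying this to the example above: the homepage itself would carry that same tag (self-referring), and a duplicate such as example.com/index.php would carry an identical tag pointing back to it:

<!-- On http://example.com (the canonical version, self-referring) -->
<link rel="canonical" href="http://example.com">

<!-- On http://example.com/index.php (a duplicate) -->
<link rel="canonical" href="http://example.com">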
Like a 301 redirect, the canonical link element passes roughly 90-95% of link equity to the target page. Canonical link elements work across domains too. So, if, for some reason, your site is rendering on a second domain, the canonical link elements will still point back to the original, preventing duplicate issues.
A Final Tip
There are some nuances to getting the most out of a canonical link element and choosing the canonical version. The version set as the canonical will rank in search engines; therefore, we want to use the one with the best possible chance of ranking well.
For example, you might have a product page that renders at example.com/mens-shoes/black-shoes and also at example.com/black-shoes. If someone was to search for “men’s black shoes”, which do you think has the best chance of ranking?
Where the category or subcategory contains valuable search terms, it may be worth setting the canonical version to include them in the URL.
You may have noticed the appearance of “structured breadcrumbs” sometime in 2013, or maybe not. Traditionally, when a webpage appears in the SERPs, the page URL is displayed below the page title.
With the right code in place, it’s now possible to show the actual site architecture based on breadcrumbs.
Referring to my previous example of categories, sub-categories, and child pages: for these beautifully structured elements to show, the sub-category’s canonical version MUST include the parent category in the URL, so that the correct breadcrumbs are generated.
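The “right code” here is breadcrumb structured data. One common way to provide it is schema.org’s BreadcrumbList in JSON-LD – a minimal sketch for the example structure (the names and URLs are illustrative):

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BreadcrumbList",
  "itemListElement": [
    { "@type": "ListItem", "position": 1, "name": "Category", "item": "https://example.com/category" },
    { "@type": "ListItem", "position": 2, "name": "Sub-category", "item": "https://example.com/category/sub-category" }
  ]
}
</script>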
Robots.txt (Kidding!)
Neither duplicate content nor indexation should be managed using the robots.txt file. A disallow rule in robots.txt blocks crawling at the root domain level – it does not control indexing. As such, it’s common for pages disallowed in robots.txt to remain indexed when Googlebot or another crawler finds links pointing to them. Once a disallowed page is indexed, it will stay in the index irrespective of the content of your robots.txt file, and because the page can’t be crawled, crawlers will never pick up any canonical link elements on it.
If you insist on trying to manage duplicate content by controlling indexation, you’re better off using the “noindex” meta directive at the page level – a much more reliable solution, provided the page isn’t also disallowed in robots.txt (crawlers need to access the page to see the directive). However, this will not pass link authority to canonical pages the way a canonical link element or 301 redirect does.
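The page-level directive is a single tag in the head of each page you want kept out of the index:

<!-- Allows crawling, but asks search engines not to index the page -->
<meta name="robots" content="noindex">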
Right…any questions?
Our Manchester SEO team is on hand to answer any questions about duplicate content. Contact us today to find out more about our digital marketing services.