Hi all! In the last article, we touched on an important topic: finding duplicate website pages. As the comments and several letters sent to me showed, this topic is relevant. Duplicate content on our blogs, technical flaws in the CMS and various template defects keep our resources from performing freely in the search engines. So we have to fight duplicates seriously. In this article we will learn how to remove duplicate pages from any website; the examples in this guide will show how to get rid of them in a simple way. All that is required of us is to apply the knowledge gained and monitor subsequent changes in the search engine indexes.

My story of fighting duplicates

Before we look at ways to eliminate duplicates, let me tell you my own story of dealing with them.

Two years ago (May 25, 2012) I received a training blog as part of an SEO specialist course. It was given to me so that I could practice the knowledge acquired during my studies. As a result, in two months of practice I managed to produce a couple of pages, a dozen posts, a bunch of tags and a carload of duplicates. Over the next six months, when the training blog became my personal website, further duplicates were added to this collection in the Google index. This happened because of the replytocom parameter, as the number of comments grew. In the Yandex database, by contrast, the number of indexed pages grew gradually.

At the beginning of 2013, I noticed a distinct decline in my blog's positions in Google. Then I started wondering why this was happening. In the end, I discovered a large number of duplicates in this search engine's index. Of course, I began to look for ways to eliminate them. But my search for information did not lead to anything: I could not find any sensible manuals on removing duplicate pages. I did, however, come across a note on one blog about how duplicates can be removed from the index using the robots.txt file.

First of all, I wrote a set of prohibiting directives for Yandex and Google to prevent the crawling of certain duplicate pages. Then, in the middle of summer 2013, I used one method for removing duplicates from the Google index (you will learn about it in this article). By that time, more than 6,000 duplicates had accumulated in this search engine's index! And that was with only five pages and a little more than 120 posts on my blog...

After I implemented my method of removing duplicates, their number began to decrease rapidly. Earlier this year, I used another option for removing duplicates to speed up the process (you will learn about it too). And now the number of my blog's pages in the Google index is approaching the ideal - today there are about 600 pages in the database. This is 10 times fewer than before!

How to remove duplicate pages - basic methods

There are several different ways to deal with duplicates. Some options prevent new duplicates from appearing, while others get rid of old ones. Of course, the best option is the manual one, but to implement it you need to have a good understanding of your website's CMS and know how search engine algorithms work. The other methods are also good and do not require specialized knowledge. We'll talk about them now.

301 redirect

This method is considered the most effective, but also the most demanding in terms of programming knowledge. The point is that the necessary rules are written in the .htaccess file (located in the root of the site directory). If they are written with an error, you may not only fail to solve the problem of removing duplicates, but also take the entire site offline altogether.

How does a 301 redirect solve the problem of removing duplicates? It is based on redirecting search robots from one page (the duplicate) to another (the original). That is, the robot arrives at a duplicate of some page and, thanks to the redirect, ends up on the original document we need. It then starts studying that document, while the duplicate drops out of its field of view.

Over time, after all the variants of this redirect have been written out, identical pages are merged and the duplicates eventually fall out of the index. This option therefore cleans up previously indexed duplicate pages very well. If you decide to use this method, be sure to study the syntax for creating redirects before adding rules to the .htaccess file. For example, I recommend the guide on 301 redirects by Sasha Alaev.
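To give an idea of what such rules look like, here is a minimal sketch for .htaccess (the addresses are hypothetical examples, not taken from my blog; adapt the patterns to the duplicates you actually found):

  RewriteEngine On
  # send one specific duplicate address to its original (hypothetical URLs)
  Redirect 301 /old-duplicate-page/ http://yoursite.ru/original-page/
  # strip the replytocom parameter and redirect to the clean URL
  RewriteCond %{QUERY_STRING} ^replytocom= [NC]
  RewriteRule ^(.*)$ /$1? [R=301,L]

If your CMS already writes its own block of rules into .htaccess (WordPress does), such custom rules are usually placed above it, and it is wise to keep a backup of the previous version of the file before uploading.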

Creating a canonical page

This method is used to point the search engine to the one document, out of the entire set of duplicates, that should be in the main index. That page is considered the original and participates in search results.

To set it up, you need to place a piece of code containing the URL of the original document on all duplicate pages:
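In HTML it looks roughly like this (a sketch with a placeholder address; the href must point to the original page, and the tag goes inside the head section of every duplicate):

  <link rel="canonical" href="http://yoursite.ru/original-page/" />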

Of course, it is cumbersome to write all this manually, and there are various plugins for the job. For example, on my blog, which runs on the WordPress engine, I added this code using the "All in One SEO Pack" plugin. This is done very simply - just check the appropriate box in the plugin settings:

Unfortunately, the canonical page option does not remove duplicate pages, but only prevents their further appearance. In order to get rid of already indexed duplicates, you can use the following method.

Disallow directive in robots.txt

The robots.txt file is an instruction for search engines that tells them how to index our site. Without this file, a search robot can reach almost all documents on our resource. But we do not want to give the search spider such freedom - we do not want every page to end up in the index. This is especially true for duplicates that appear due to flaws in the site template or our own mistakes.

That is why this file was created, in which various directives for prohibiting and allowing indexing are written for search engines. You can prevent the crawling of duplicate pages using the Disallow directive:
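A minimal sketch of such a rule (the path here is a made-up example - substitute the address patterns of your own duplicates):

  User-agent: *
  # any address beginning with /duplicate-page/ will not be crawled
  Disallow: /duplicate-page/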

When creating a directive, you also need to draft the prohibition correctly. If you make a mistake when writing the rules, the result may be that completely different pages are blocked: we can cut off access to pages we actually need and let other duplicates slip into the index. Still, errors here are not as dangerous as errors in the redirect rules in .htaccess.

The Disallow ban on indexing applies to all robots. However, not every search engine removes already indexed pages from its index because of such a ban. Yandex, for example, eventually drops duplicate pages blocked in robots.txt.

Google, however, will not clear its index of the unnecessary junk pointed out by the webmaster. In addition, the Disallow directive does not guarantee the blocking: if there are external links to pages prohibited in the instructions, they will eventually appear in the Google database.

Getting rid of duplicates indexed in Yandex and Google

So, now that we have figured out the various methods, it is time to go through a step-by-step plan for removing duplicates in Yandex and Google. Before cleaning, you need to find all the duplicate pages - I wrote about this in the previous article. You need to see clearly which elements of the page addresses identify the duplicates. For example, if these are pages with threaded comments or pagination, we note the words "replytocom" and "page" in their addresses:

Note that in the case of replytocom you can use not this word but simply a question mark, since it is always present in the address of threaded-comment pages. But then you need to remember that the URLs of original new pages must not contain the "?" character, otherwise those pages will be blocked as well.
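For these two patterns, the blocking rules would look roughly like this (a sketch - check the masks against the duplicate addresses you actually found before using them):

  User-agent: *
  # threaded-comment duplicates such as /post-name/?replytocom=123
  Disallow: /*?replytocom
  # pagination duplicates such as /page/2/
  Disallow: */page/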

Cleaning Yandex

To remove duplicates in Yandex, we create blocking rules using the Disallow directive. To do this, we perform the following actions:

  1. Open the special "Robots.txt analysis" tool in Yandex Webmaster.
  2. Add the new rules for blocking duplicate pages to the directives field.
  3. In the "URL list" field, enter examples of duplicate addresses for the new directives (sample addresses are given right after this list).
  4. Click the “Check” button and analyze the results.
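For example, with the rules from the sketch above, the "URL list" field could contain test addresses like these (hypothetical URLs):

  http://yoursite.ru/some-post/?replytocom=570
  http://yoursite.ru/page/3/

Both should be reported as blocked if the directives are written correctly.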

If we did everything correctly, the tool will show that the new rules do block these addresses. In the special "URL check results" field we should see a red notice about the ban:

After the check, we need to add the new directives to the real robots.txt file and overwrite it in the root directory of our site. And then we just have to wait until Yandex automatically removes our duplicates from its index.

Cleaning Google

It's not that simple with Google. Prohibiting directives in robots.txt do not remove duplicates from this search engine's index. Therefore, we will have to do everything ourselves. Fortunately, there is the excellent Google Webmaster Tools service for this. Specifically, we are interested in its "URL Parameters" tool.

Thanks to this tool, Google allows the site owner to tell the search engine how it should handle certain parameters in the URL. We are interested in the ability to point Google to those address parameters whose pages are duplicates - the very ones we want removed from the index. Here is what we need to do (as an example, let's add a parameter to deal with the replytocom duplicates):

  1. Open the "URL Parameters" tool in the Google Webmaster Tools service, in the "Crawl" menu section.
  2. Click the “Add Parameter” button, fill out the form and save the new parameter:
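The exact wording of the form fields depends on the interface version, but for replytocom the filled-in form looks roughly like this (a sketch, not a screenshot from my panel):

  Parameter: replytocom
  Does this parameter change page content seen by the user? Yes
  How should Googlebot crawl URLs with this parameter? No URLs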

As a result, we get a rule telling Google to review its index for such duplicate pages. In the same way, we then specify parameters for the other duplicates we want to get rid of. For example, this is what part of my list of rules for Google looks like:

This concludes our work on cleaning up Google, and my post has come to an end. I hope this article brings you practical benefit and helps you get rid of duplicate pages on your resources.

Sincerely, Your Maxim Dovzhenko

P.S. Friends, if you would like a video on this topic, write to me in the comments to this article.


Fighting duplicate pages

The owner may not even suspect that some pages on his site have copies - and most often this is the case. The pages open, their content is fine, but if you pay attention to the URLs, you will notice that different addresses show the same content. What does this mean? For live users, absolutely nothing, since they are interested in the information on the pages; but soulless search engines perceive this phenomenon completely differently: for them these are entirely different pages with the same content.

Are duplicate pages harmful? Even if an ordinary user cannot notice the duplicates on your site, search engines will detect them immediately. What reaction should you expect from them? Since the copies are essentially seen as different pages, their content ceases to be unique. And this already has a negative impact on rankings.

Also, duplicates blur the link weight that the optimizer tried to concentrate on the landing page. Because of duplicates, that weight may end up on a completely different page than the one it was meant to go to. That is, the effect of internal linking and external links may decrease many times over.

In the vast majority of cases, the CMS is to blame for duplicates - clear copies are generated because of incorrect settings and a lack of proper attention from the optimizer. Many CMSs, for example Joomla, have this problem. It is difficult to find a universal recipe for solving it, but you can try using one of the plugins for deleting copies.

Fuzzy duplicates, in which the content is not completely identical, usually appear through the fault of the webmaster. Such pages are often found on online store sites, where product card pages differ only in a few sentences of description, while all the rest of the content, consisting of end-to-end blocks and other elements, is the same.

Many experts argue that a small number of duplicates will not harm a site, but if they make up more than 40-50% of it, the resource may face serious difficulties in promotion. In any case, even if there are not many copies, it is worth eliminating them - that way you are guaranteed to avoid problems with duplicates.

Finding Copy Pages

There are several ways to find duplicate pages, but first you should ask the search engines themselves how they see your site - you just need to compare the number of pages in the index of each. This is quite simple to do without resorting to any additional tools: in Yandex or Google, just enter site:yoursite.ru in the search bar and look at the number of results.




If, after such a simple check, the numbers differ greatly - by 10-20 times - then, with some degree of probability, this may indicate duplicates in one of the indexes. Copy pages may not be to blame for this difference, but it gives grounds for a further, more thorough search. If the site is small, you can manually count the number of real pages and then compare them with the figures from the search engines.

You can also search for duplicate pages by URL in the search results. If the site uses human-readable URLs (CNC), then pages with URLs containing incomprehensible characters, like "index.php?s=0f6b2903d", will immediately stand out from the general list.

Another way to detect duplicates using search engines is to search by text fragments. The procedure is simple: enter a text fragment of 10-15 words from each page into the search bar and analyze the result. If two or more pages appear in the search results, there are copies; if there is only one result, the page has no duplicates and you don't have to worry.

It is logical that if the site consists of a large number of pages, such a check can become an impossible task for the optimizer. To minimize the time spent, you can use special programs. One of these tools, probably familiar to experienced specialists, is Xenu's Link Sleuth.


To check the site, you need to open a new project by selecting “Check URL” from the “File” menu, enter the address and click “OK”. After this, the program will begin processing all site URLs. At the end of the check, you need to export the received data to any convenient editor and start searching for duplicates.

In addition to the above methods, the Yandex.Webmaster and Google Webmaster Tools panels have tools for checking page indexing that can be used to search for duplicates.

Methods for solving the problem

After all the duplicates have been found, they need to be eliminated. This can also be done in several ways, and each specific case calls for its own method; it is quite possible that you will have to use all of them.

  • Copy pages can be deleted manually, but this method is most likely only suitable for those duplicates that were created manually due to the negligence of the webmaster.
  • The 301 redirect is great for merging copy pages whose URLs differ only in the presence or absence of www (a sketch of such a redirect is given right after this list).
  • The problem of duplicates can also be solved with the canonical tag, which works for fuzzy copies - for example, for product categories in an online store whose duplicates differ only in sorting by various parameters. Canonical is also suitable for print versions of pages and other similar cases. It is applied quite simply: the rel="canonical" attribute is specified for all copies, but not for the main, most relevant page. The code should look something like this: <link rel="canonical" href="http://yoursite.ru/stranica-kopiya"/>, and it must be placed within the head tag.
  • Setting up the robots.txt file can help in the fight against duplicates. The Disallow directive will block access to duplicates for search robots. You can read more about the syntax of this file in our newsletter.
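For the www case from the 301 redirect point above, the .htaccess rules might look roughly like this (a sketch that assumes the main mirror is the version without www; yoursite.ru is a placeholder domain):

  RewriteEngine On
  # send all requests for www.yoursite.ru to yoursite.ru with a 301 status
  RewriteCond %{HTTP_HOST} ^www\.yoursite\.ru$ [NC]
  RewriteRule ^(.*)$ http://yoursite.ru/$1 [R=301,L]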

Quite often there are copies of pages on a site, and its owner may not even be aware of it. When you open them, everything is displayed correctly, but if you glance at the address bar, you will notice that different addresses can correspond to the same content.

What does this mean? For ordinary users, nothing, since they came to your site not to look at page addresses but because they were interested in the content. But the same cannot be said about search engines, which perceive this state of affairs in a completely different light: they see distinct pages with identical content.

If ordinary users may not notice duplicate pages on a site, this will definitely not escape the attention of search engines. What could this lead to? Search robots will identify the copies as different pages, and as a result will no longer perceive their content as unique. If you are interested in website promotion, know that this will certainly affect the ranking. In addition, duplicates dilute the link juice that the optimizer worked so hard to direct to the landing page: duplicate pages can result in a completely different page being strengthened than the one intended. And this can significantly reduce the effectiveness of external links and internal linking.

Can duplicate pages cause harm?

Often the culprit for the appearance of duplicates is the CMS: incorrect settings or a lack of attention from the optimizer can lead to the generation of clear copies. Content management systems such as Joomla often suffer from this. Let us note right away that there is simply no universal tool for combating this phenomenon, but you can install one of the plugins designed to find and delete copies.

However, fuzzy duplicates, whose contents do not completely match, may also appear. This most often happens due to shortcomings of the webmaster. Such pages are often found in online stores, where product cards differ only in a few sentences of description, while the rest of the content, which consists of various elements and end-to-end blocks, is the same.

Experts often agree that a certain number of duplicates will not hurt a site, but if they make up about half of it or more, promoting the resource will cause many problems. Even when there are only a few copies on the site, it is better to find and eliminate them - this way you will certainly get rid of duplicate problems on your resource.

Finding duplicate pages

There are several ways to find duplicate pages. But before the search itself, it is worth looking at your site through the eyes of the search engines: how do they see it? To do this, simply compare the number of your pages with the number in their index: enter the phrase site:yoursite.ru into the Google or Yandex search bar and evaluate the results.

If such a simple check produces figures that differ by a factor of 10 or more, there is reason to believe that your resource contains duplicates. This may not always be caused by duplicate pages, but the check gives a good starting point for finding them. If your site is small, you can count the number of real pages yourself and then compare the result with the search engines' figures. You can also look for duplicates among the URLs offered in the search results: if you use human-readable URLs (CNC), pages with strange characters in their URLs, such as "index.php?с=0f6b3953d", will immediately catch your attention.

Another method for detecting duplicates is searching by text fragments. To perform such a check, enter a few words of text from each page into the search bar, then analyze the result. If two or more pages appear in the search results, there are copies; if there is only one page, it has no duplicates. Of course, this technique is only suitable for a small site consisting of a few pages. When a site contains hundreds of them, the optimizer can turn to special programs, for example Xenu's Link Sleuth.

To check the site, open a new project, go to the "File" menu, select "Check URL", enter the address of the site you are interested in and click "OK". The program will then begin processing all the URLs of the specified resource. When the work is completed, the received information needs to be opened in any convenient editor and searched for duplicates. The methods for finding duplicate pages do not end there: the Google Webmaster Tools and Yandex.Webmaster panels include tools for checking page indexing, and they can also be used to find duplicates.

On the way to solving the problem

When you have found all the duplicates, you will face the task of eliminating them. There are several ways to solve this problem and eliminate duplicate pages.

Copy pages can be merged using a 301 redirect. This is effective in cases where the URLs differ only in the absence or presence of www. You can also delete duplicate pages manually, but this method works only for duplicates that were created manually.

You can solve the problem of duplicates using the canonical tag, which is used for fuzzy copies. For example, it can be used in an online store for product categories whose duplicates differ only in sorting by various parameters. The canonical tag is also suitable for print versions of pages and similar situations. Using it is not difficult: for each copy, the rel="canonical" attribute is set; the promoted page with the most relevant content does not get this attribute. The approximate form of the code is <link rel="canonical" href="http://site.ru/stranica-kopiya"/>, and it should be located within the head tag.
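For the sorting example above, a sorted category page might carry markup roughly like this (a sketch with hypothetical URLs):

  <!-- on http://site.ru/catalog/?sort=price, inside the head tag -->
  <link rel="canonical" href="http://site.ru/catalog/" />

The sorted variants then point search engines to the single main category page.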

A properly configured robots.txt file will also allow you to achieve success in the fight against duplicates. Using the Disallow directive, you can block search robots from accessing all duplicate pages.

Even professional website development will not bring a site to the TOP if the resource contains duplicate pages. Today, copy pages are one of the most common pitfalls that newcomers suffer from. A large number of them on your site will create significant difficulties in reaching the TOP, or even make it impossible.

Duplicates of site pages, their impact on search engine optimization. Manual and automated methods for detecting and eliminating duplicate pages.

The influence of duplicates on website promotion

The presence of duplicates negatively affects the ranking of the site. As stated above, search engines see the original page and its duplicate as two separate pages, so content duplicated on another page ceases to be unique. In addition, the duplicated page loses link weight, since a link may pass weight not to the target page but to its duplicate. This applies to both internal linking and external links.

According to some webmasters, a small number of duplicate pages in general will not cause serious harm to the site, but if their number is close to 40-50% of the total site volume, serious difficulties in promotion are inevitable.

Reasons for duplicates

Most often, duplicates appear as a result of incorrect settings of individual CMSs. The engine's internal scripts begin to work incorrectly and generate copies of site pages.

The phenomenon of fuzzy duplicates is also known - pages whose content is only partially identical. Such duplicates arise, most often, through the fault of the webmaster himself. This phenomenon is typical for online stores, where product card pages are built according to the same template, and ultimately differ from each other by only a few lines of text.

Methods for finding duplicate pages

There are several ways to detect duplicate pages. You can turn to search engines: to do this in Google or Yandex, enter a command like “site:sitename.ru” into the search bar, where sitename.ru is the domain of your site. The search engine will return all indexed pages of the site, and your task will be to detect duplicate ones.

There is another equally simple way: searching by text fragments. To search in this way, you need to add a small piece of text from your website, 10-15 characters, to the search bar. If the search results for the searched text contain two or more pages of your site, it will not be difficult to detect duplicates.

However, these methods are suitable for sites consisting of a small number of pages. If the site has several hundred or even thousands of pages, then manually searching for duplicates and optimizing the site as a whole becomes an impossible task. There are special programs for such purposes; one of the most common is Xenu's Link Sleuth.

In addition, the Google Webmaster Tools and Yandex.Webmaster panels have special tools for checking indexing status. They can also be used to detect duplicates.

Methods for eliminating duplicate pages

There are also several ways to eliminate unnecessary pages. Each specific case has its own method, but most often, when optimizing a website, they are used in combination:

  • removing duplicates manually – suitable if all unnecessary ones were also detected manually;
  • merging pages using a 301 redirect – suitable if duplicates differ only in the absence and presence of “www” in the URL;
  • using the "canonical" tag - suitable in the case of fuzzy duplicates (for example, the situation mentioned above with product cards in an online store) and implemented by adding code like <link rel="canonical" href="http://sitename.ru/stranica-kopiya"/> within the head block of the duplicate pages;
  • correct configuration of the robots.txt file - using the “Disallow” directive, you can prohibit duplicate pages from being indexed by search engines.

Conclusion

The occurrence of duplicate pages can become a serious obstacle to optimizing the site and bringing it to the top position, therefore this problem must be addressed at the initial stage of its occurrence.

The reason for writing this article was yet another panicked call from an accountant just before submitting the VAT reports. Last quarter I spent a lot of time cleaning up duplicate counterparties. And now here they are again - the same ones plus new ones. Where from?

I decided to spend some time and deal with the cause rather than the effect. The situation is mainly relevant when automatic uploads are configured through exchange plans from the management application (in my case UT 10.3, Trade Management) to the accounting application (in my case BP 2.0, Enterprise Accounting).

Several years ago these configurations were installed and automatic exchange was set up between them. Then we ran into a problem with the creative way the sales department maintained the counterparty directory: for one reason or another they began creating duplicate counterparties (with the same INN/KPP/Name), scattering the same counterparty across different groups. The accounting department voiced its objections, and it was decided: it does not matter what they have over there - merge the cards into one when loading. I had to intervene in the object transfer process through the exchange rules. For counterparties, we removed the search by internal identifier and left the search by INN + KPP + Name. However, even here pitfalls surfaced, in the form of people who like to rename counterparties (as a result, the rules themselves created duplicates in the BP). We all got together, discussed it, agreed that duplicates are unacceptable in the UT, removed them, and returned to the standard rules.

The trouble is that after "combing out" the duplicates in the UT and the BP, the internal identifiers of many counterparties no longer matched. And since the standard exchange rules search for objects exclusively by internal identifier, each new portion of documents brought a new duplicate of the counterparty into the BP (whenever those identifiers differed). But the universal XML data exchange would not be universal if this problem could not be worked around. Since the identifier of an existing object cannot be changed by regular means, the situation can be handled through the special information register "Correspondence of objects for exchange", which is available in all standard configurations from 1C.

In order to avoid new duplicates, the duplicate removal algorithm became as follows:

1. In the BP, using the "Search and replace duplicate elements" processing (it is standard: it can be taken from the Trade Management configuration or from the ITS disk, or you can pick the most suitable of the many variations on Infostart), I find a duplicate, determine the correct element, and click "Execute replacement".

2. I get the internal identifier of the single (post-replacement) object of our duplicate (I threw together a simple processing for this so that the internal identifier is copied to the clipboard automatically).

3. I open the "Correspondence of objects for exchange" register in the UT and filter it by the reference I need.