Having decided to revive an abandoned blog on Blogger and run a personal experiment in increasing traffic, I was faced with the need to make the blog visible to search engine robots.

The sitemap.xml file is very important for this: it is a map of your site that informs robots about new pages and speeds up their indexing.

You can find your sitemap on Blogger by taking the blog's address and adding sitemap.xml at the end. Here is an example for my blog: http://1000experiments.blogspot.ru/sitemap.xml. Substitute your blog's address to get your own sitemap.
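
If you want to make sure the sitemap is actually being served, a minimal Python sketch like the one below can download it and print what it lists (assuming Python 3; the blog address is my example above, so substitute your own). Note that for larger blogs Blogger may serve a sitemap index that points to per-page sitemaps rather than a flat list of post URLs; the <loc> and <lastmod> tags appear in both cases.

import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "http://1000experiments.blogspot.ru/sitemap.xml"  # substitute your blog

with urllib.request.urlopen(SITEMAP_URL) as response:
    tree = ET.parse(response)

# Both a flat urlset and a sitemap index use the same namespace
# and the same <loc>/<lastmod> tags.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
for element in tree.iter():
    if element.tag == NS + "loc":
        print("listed:", element.text)
    elif element.tag == NS + "lastmod":
        print("  last modified:", element.text)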

You can check the sitemap file for errors in Yandex.Webmaster. To do this, follow the link, enter the URL of the sitemap file, and click the "Check" button:


If there are no errors, you will get the following result:


Now we need to insert http://1000experiments.blogspot.ru/sitemap.xml into the second file that matters to search robots: robots.txt.

This file allows or prohibits search engine robots from visiting pages on your site and indexing the site.
First, the robot looks for the line containing the words User-agent. This line can address all robots (in that case it contains an asterisk: User-agent: *) or one specific robot, for example User-agent: Yandex. If Yandex (or another robot) finds a line with its name, it follows the commands specified for it and ignores what is specified for all robots. We will not cover all the directives here (you can read about them, for example, on the Yandex website), only the most important ones: Allow and Disallow.

The Allow directive in robots.txt tells the robot it may visit and index pages of your site; Disallow prohibits indexing. You can allow or deny access to the entire site or to individual pages. To prohibit visiting all pages, use Disallow: / (a single slash). To prohibit only a specific group of pages, allow everything with Allow: / and disallow that group, for example a photo folder, with Disallow: /photo. For sites hosted on Blogger all of this is not very important; it is given here for a better understanding of the individual commands.
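
To see how these rules behave in practice, here is a small sketch using Python's standard robots.txt parser (an illustration on made-up paths; the /photo folder is just an example). It shows both points at once: a robot with its own User-agent group ignores the general group, and Disallow/Allow decide access path by path. Note that this parser applies the rules inside a group in file order, first match wins, so the Disallow line goes before the blanket Allow, the same order Blogger itself uses:

from urllib.robotparser import RobotFileParser

# The rules described above: a personal group for Yandex, a catch-all group.
rules = [
    "User-agent: Yandex",
    "Disallow: /photo",
    "Allow: /",
    "",
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# Yandex follows its own group: /photo is closed, everything else is open.
print(parser.can_fetch("Yandex", "/photo/1.jpg"))   # False
print(parser.can_fetch("Yandex", "/post.html"))     # True
# Any other robot falls back to the * group, which here closes the whole site.
print(parser.can_fetch("Googlebot", "/post.html"))  # False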
Where is robots.txt located?

Where is robots.txt located on Blogger? From the admin panel, go to Settings - Search Settings. Click "Edit" to the right of "Custom robots.txt file", click "Yes", and an empty field appears where we enter the text of our robots.txt; then click the "Save changes" button:


If you have a full-fledged website, then the robots.txt file should be uploaded to the root directory.
The simplest example of a file that permits all robots to index all pages of the site is shown below. If you copy it, be sure to change the site address in the last line. Example robots.txt:

User-agent: *
Allow: /

Sitemap: http://1000experiments.blogspot.ru/sitemap.xml

Checking robots.txt
You can also analyze robots.txt in the same place - Yandex.Webmaster. The link takes you straight to the robots.txt verification page, or you can find it there like this:
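
The same check can be run locally with a few lines of Python (a sketch, assuming the blog is reachable and serves robots.txt at the standard address; the blog below is my example):

from urllib.robotparser import RobotFileParser

# Download and parse the live robots.txt straight from the blog.
rp = RobotFileParser("http://1000experiments.blogspot.ru/robots.txt")
rp.read()

# True means an ordinary crawler is allowed to fetch the page.
print(rp.can_fetch("*", "http://1000experiments.blogspot.ru/sitemap.xml"))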




The Blogger (Blogspot) blogging platform is considered professional. Since Google recently added the "Description for search engines" option, the ability to insert target="_blank" and rel="nofollow" attributes into link code, and the features discussed below, there is not a single good reason left not to use this platform. As the saying goes, "If you don't like cats, you just don't know how to cook them!"

So, in order.

A sitemap is needed - an axiom that requires no proof. If you want your blog to be indexed better, you need both a sitemap and a robots.txt. Otherwise, the search robot faces a dilemma, like the hero in the picture.

And if on WordPress everything is relatively simple with creating a map, then with Blogger (Blogspot) until recently it was not so simple.

There are sitemaps for search robots and sitemaps for people.

So, as for the sitemap for robots, everything turns out to be ridiculously simple. Just add /sitemap.xml to your blog's address, so that the link to your blog map looks like this: http://rsolovyov.blogspot.com/sitemap.xml - and your sitemap is already generated! Blogger has used an auto-generated sitemap.xml for a long time, and it lists all posts (URLs) of the blog with the date of last modification, so you no longer need to rack your brains over which link to feed the search engines. I won't even describe the tricks I had to resort to before. In short, wherever you need to provide a link to the map, just write it in the form given above, and the problem is solved.

Well, if we talk about a sitemap for people, then you will have to do a little manual work. It's not scary. But you need to be careful.

Go to the blog's admin panel and open the "Pages" tab => "Create". We create a new page and call it, for example, Blog Map.

Then go to HTML mode and paste this code:





var accToc=true;

and be careful: in the code from the table, my domain name rsolovyov needs to be replaced with yours!
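
By the way, the idea behind any such blog-map script is simple: it reads the blog's post feed and renders a list of links. Here is a rough Python sketch of that idea (a sketch only, not the script from the table; it assumes Blogger's standard post feed, and rsolovyov again needs to be replaced with your blog's name):

import json
import urllib.request

# Blogger's post feed in JSON form; Blogger caps one request at 500 entries.
FEED_URL = ("http://rsolovyov.blogspot.com/feeds/posts/default"
            "?alt=json&max-results=500")

with urllib.request.urlopen(FEED_URL) as response:
    feed = json.load(response)["feed"]

# Print each post as "title: url" - the raw material of a blog map page.
for entry in feed.get("entry", []):
    title = entry["title"]["$t"]
    # Each entry carries several links; rel="alternate" is the post's own page.
    url = next(link["href"] for link in entry["link"]
               if link["rel"] == "alternate")
    print(f"{title}: {url}")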


Well, now that we have created sitemaps for both robots and readers, we need to let search engines know about them. First of all, you need to tell the search robot what to index via the robots.txt file. But how, when you cannot create and upload such a file via FTP - the domain is third-level and the platform is free? Google solved this problem too!

Go to the blog admin panel in SETTINGS => Search Settings => and enable “Custom robots.txt file”


Attention! Incorrect use of these functions may result in your blog not being indexed by search engines.

Then you need to specify what exactly you want to allow or deny for indexing.

And here I want to make a small lyrical digression, so to speak. Almost no blog author on this platform has changed anything here! In vain! Here is what this field usually contains by default:

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /
Sitemap: http://rsolovyov.blogspot.com/feeds/posts/default?orderby=UPDATED

And here is what it should look like instead (copy and paste this code, but again replace my rsolovyov with yours):

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /
Sitemap: http://rsolovyov.blogspot.com/sitemap.xml
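
Before pasting, you can sanity-check what the crawl rules in this file permit. A short sketch with Python's standard parser (the page URLs are made-up examples):

from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: Mediapartners-Google",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /search",
    "Allow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# Label/search pages are closed to ordinary crawlers...
print(rp.can_fetch("*", "http://rsolovyov.blogspot.com/search/label/seo"))   # False
# ...regular posts stay open...
print(rp.can_fetch("*", "http://rsolovyov.blogspot.com/2013/01/post.html"))  # True
# ...and the AdSense robot may fetch everything
# (an empty Disallow means "nothing is forbidden").
print(rp.can_fetch("Mediapartners-Google",
                   "http://rsolovyov.blogspot.com/search/label/seo"))        # True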

Well, now we need to announce that the blog has sitemaps and a robots.txt file to the main webmaster tools: Yandex, Google, Bing, and Mail.ru. But perhaps that is a topic for a separate post.

The robots.txt file in a Blogger blog plays a vital role in search engine optimization (SEO). If composed properly, this file will definitely improve your blog's SEO.

Robots.txt can be customized in the new Blogger interface. Many bloggers use the Robots.txt file to hide certain parts of their blogs from search robots.

The Robots.txt file tells the search bot which parts of the blog should be accessible or blocked for indexing. Whenever a robot crawls your blog, it first checks the Robots.txt file and follows all the instructions given in this file.

Login to Blogger and go to Settings >> Search Settings.

In the Search Robots and Indexing subsection, find the Custom robots.txt file option and click Edit.

Click the Yes button, and you will see a window like the picture below

Now copy the following code below and paste it into the box:

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /
Sitemap: http://site/feeds/posts/default?orderby=UPDATED
Replace the blog address in the Sitemap line with the address of your own blog.

Let's examine some parts of the robots.txt file:

User-agent: *
Disallow: /
The User-agent: Mediapartners-Google directive addresses Google's AdSense robot, while User-agent: * addresses all search robots (Google, Yahoo, Bing, Yandex, etc.).

The Disallow directive prevents search robots from crawling specific pages or directories on your blog. For example, the code below denies search robots access to the images.html page:

User-agent: *
Disallow: /images.html
You can read more about using the robots.txt file in the official documentation.

Hello, my dear readers. Lately I have been asked more and more about the Google and Yandex webmaster tools, namely about the robots.txt file and about pages prohibited from indexing (blocked) by it.

It is strange: first we look for information on how to do something and follow all the recommendations, and then we start asking why our pages are blocked and how to unblock them.
That is why I decided to look at the optimization of Blogger/Blogspot blogs from the point of view of the robots.txt file. I will start in order, with what a robots.txt file actually is.

The robots.txt file is essentially an ordinary text file located in the site's root folder:

http://site.ru/robots.txt

The file is ordinary, but its contents are very important: robots.txt was designed to manage site indexing, telling the search robot what may be indexed and what may not.

Naturally, the question arises: why prohibit anything at all? Let the robot index everything.

The first and most obvious situation: with the growth of the Internet, more and more sites support registration and personal user accounts with information that the users themselves would not want to share. The same applies to sites that have sections accessible to everyone and sections accessible only to registered users. I think this is clear; such content is deliberately excluded from indexing.

But there is another situation that we will consider in more detail.

All modern websites are dynamic. Many users naively believe that a dynamic site is one with scrolling tickers, pictures replacing one another, and so on, what is usually called a Flash site. In fact, a dynamic site has nothing to do with that, and the word "dynamic" arose for a completely different reason.

I am not a professional, so some of my wording may not be entirely precise, but I hope to convey the essence. Imagine an online store. The site has a product search form with various criteria, and you can reach the same product through different filters. For example, a filter by manufacturer can lead to a product that can also be found by filtering on price and dimensions. Each combination of filters creates a different path to the product in the page URL, so the same product can live at two, three, or four different URLs.

This is where the confusion begins: which of all these pages is the correct and most important one? Which page should be shown in search results? Here a file like robots.txt comes to the rescue: it states that URLs produced by applying filters must not be indexed.

A distinctive feature of all URLs formed while filtering products is the presence of special characters or words. Let's get back to our blogs. I suggest we take apart one special case. It is not frequent, but not rare either, especially at the initial stage of blogging, when we do not yet understand everything. Treat it as a hypothetical example: it will not necessarily happen to you, but take it seriously, because such cases are still common enough.
Condition

  • You show the full text of articles on the main page, without hiding part of the article under a cut.
  • You have assigned a label to this article, under which you do not yet have any other articles except this one.
  • Let's go to our imaginary article; it has the address

    http://my_blog/date/my_article

    Remember, you gave this article a label that no other article has yet: you have just decided to write on this topic and have no other articles about it. Let's go to this label's page. It has the URL

    http://my_blog/search/label/label_name

    And what do we see? On this page is our article, in full, because we do not hide it under a cut and there are no other articles with this label at all.

    As a result, it turns out that the same article is present at two different addresses at once. Which of these two pages is correct? Which one is more important? The search robot cannot determine the difference between these pages and considers them almost the same.

    Search robots take a very dim view of this kind of content. Even when we start hiding articles under a cut, and even when we have several articles under one label, the search engine still dislikes the very existence of such pages. This situation is called content duplication.

    Therefore, so that search engines do not object and our blog ranks better, the robots.txt file contains the entry:

    User-agent: *
    Disallow: /search

    It means that no search engine's robots should index pages whose path contains /search. The platform developers did this for our benefit. So if you find a warning in the webmaster tools that some pages are blocked (prohibited) by the robots.txt file, there is no need to panic or worry that something on your site is not being indexed.

    A similar situation arises with archives. Suppose you have 10 articles displayed on your blog's home page, whose address is simply http://my_blog/.

    Now suppose all 10 of these articles were written in November. Many people use the Archive widget. Select November in the archive: we will see the same 10 articles that are now on the main page of the blog, but in the browser's address bar there is a completely different URL

    http://my_blog/2010_11_01_archive.html

    The same content at different addresses. These are the archive pages that we deliberately prohibit from indexing through meta tags.
    Something similar happens because of the standard paging of the blog, when you leaf through the main page rather than open individual articles. Paging through the main page produces addresses like

    http://my_blog/search?updated-max=2010-06-17T16%3A17%3A00%2B03%3A00&max-results=7

    It would seem that this page's URL contains the /search path, yet I noticed that Google constantly indexed such pages anyway. That is why I do not have page-by-page navigation.
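
    The explanation, as far as I can tell, is that robots.txt stops crawling, not indexing: a blocked URL can still end up in the index if other pages link to it. The rule itself does match these pagination addresses, as a quick Python check shows (a sketch using the standard Blogger rules quoted above):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: /search",
        "Allow: /",
    ])

    # The pagination URL begins with /search, so crawling it is disallowed...
    print(rp.can_fetch(
        "*", "/search?updated-max=2010-06-17T16:17:00+03:00&max-results=7"
    ))  # False
    # ...but indexing of a URL that is already known from links is a separate matter.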

    I simply removed it so that such pages would not accumulate. At the same time, I manually delete everything of this kind that turns up, in the webmaster tools under the Site Configuration tab - Crawler access - Remove URL.

    Often the Google index (I have never encountered this in Yandex) also includes label pages that are prohibited by the robots.txt file. I delete all such URLs in Google Webmaster Tools as well.

    UPD from 05/14/2015: previously, the inability to edit the robots.txt file was a huge problem. Now the Blogger developers have provided this option. You can read more about the robots.txt file for Blogger in the article.

    This is the message I received when I decided to see how my Google AdSense was doing:


    How are pages indexed?
    In the root directory there is a robots.txt file, and it contains instructions for search robots.
    These instructions are used to index the site's pages.
    Therefore, if something is wrong with indexing, you urgently need to edit the robots.txt file.

    How to do this and where is this root directory?
    For example, if the URL of my site is http://www.poliushka.blogspot.ru/, then the URL of the robots.txt file
    will be http://www.poliushka.blogspot.ru/robots.txt

    Or, and this is much easier, you need to go to the blog settings. This is what it looked like on my blog.


    Search settings.
    This is what we need. This is where you tell the search robots where and what to look for on your blog.
    And here it is, the robots.txt file (in this frame; at least that is how I understood it, and you will not find it anywhere else anyway).

    And now a little theory and terminology.
    User-agent - a client identifier used by search engines and browsers.
    User-agent: * - the asterisk means "any user agent".
    Disallow - do not allow indexing.
    Allow - allow indexing.
    Mediapartners-Google - the user agent of the AdSense robot.
    / - the site root.
    Sitemap - the XML sitemap (a list of the site's main links in "raw" form).

    That is, in the correct robots.txt of Blogger:

    • everything must be allowed for the contextual advertising robot;
    • blog search pages are disallowed for all agents (the blog's search results must not be indexed, otherwise duplicates will appear in the index);
    • indexing of the entire blog itself must be allowed.
    In short, always check the status of your robots.txt.

    A site can have only one "/robots.txt" file.
    In the robots.txt file you write, for example:

    User-agent: *
    Disallow: /search
    Disallow: /p/search.html
    Disallow: /tags/

    This prohibits the indexing of three directories.
    Note that each directory is specified on a separate line, arranged in a column.

    You can:

    prohibit robots from indexing the entire site; to do this, put a / (slash) after the word Disallow:
    User-agent: *
    Disallow: /

    allow robots to index the site; to do this, simply remove that slash:
    User-agent: *
    Disallow:
    Or simply create an empty file “/robots.txt”.

    allow one robot to index the site and prohibit all the others:
    User-agent: Yandex - a specific robot is named instead of the asterisk
    Disallow: - the slash is removed

    User-agent: * - indexing is prohibited for all other robots
    Disallow: /


    Well, now to practice. Open Blog Settings and:

    Settings - Search settings - Search robots indexing - Custom robots.txt file - Edit


    Select YES and paste the following:

    User-agent: Mediapartners-Google
    Disallow:

    User-agent: *
    Disallow: /search
    Disallow: /p/search.html
    Allow: /
    Sitemap: http://your_site_name/feeds/posts/default?orderby=updated

    User-agent: Yandex
    Disallow: /search
    Disallow: /p/search.html
    Allow: /

    Replace your_site_name with your blog's address. With this you have given the task to the Google and Yandex search robots.

    But pay attention!

    We paste the task for the robots into the frame exactly in this form: in a column, not on one line.
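
    If you want to double-check the result before saving, here is a small Python sketch (an illustration only, not part of the Blogger interface) that feeds these same rules to the standard library's parser and confirms that each robot receives its own task; the page paths are made-up examples:

    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: Mediapartners-Google",
        "Disallow:",
        "",
        "User-agent: *",
        "Disallow: /search",
        "Disallow: /p/search.html",
        "Allow: /",
        "",
        "User-agent: Yandex",
        "Disallow: /search",
        "Disallow: /p/search.html",
        "Allow: /",
    ]

    rp = RobotFileParser()
    rp.parse(rules)

    # Yandex and the general group both skip the search pages but index posts.
    print(rp.can_fetch("Yandex", "/p/search.html"))            # False
    print(rp.can_fetch("Yandex", "/2015/01/post.html"))        # True
    print(rp.can_fetch("Googlebot", "/search/label/recipes"))  # False
    # The AdSense robot is allowed everywhere.
    print(rp.can_fetch("Mediapartners-Google", "/p/search.html"))  # True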

    Now let's work with both of them at the same time.