Search on the Internet.

The convenience of the Internet is that we can find almost any information on it, even when we do not know exactly where it is located. If the address of the page with the material we are interested in is unknown and there is no page with suitable links, we have to look for the material across the whole Internet. For this purpose, Internet search engines are used – special web sites that allow you to find the desired document.

Types of search engines.

There are two main methods of searching the Internet. In the first case, you search for web pages related to a specific topic. The search is carried out by selecting a thematic category and gradually narrowing it down. Such search engines are called search directories. They are convenient when you need to get acquainted with a new topic or reach the well-known "classic" resources on a given topic. The second search method is used when the topic is narrow or specific, or when rare, little-known resources are needed. In this case, you should imagine what keywords would appear in a document on the topic of interest. These words should be chosen so that they are likely to appear in the relevant documents and unlikely to appear in documents unrelated to the chosen topic. Systems that allow such searches are called search indexes. Search directories differ from search indexes not only in the search method but also in the way they are built. Any Internet search engine consists of two parts: a specialized web page, accessible to everyone and allowing search, and a large, constantly expanded and updated database behind it that contains information about Internet resources.

The way this database is replenished depends on the type of search engine. For search directories, the most important thing is the accuracy of selection: every resource found should be useful. The subject of each page is determined or checked manually. Because of this, the volume of search directories is relatively small. When the volume approaches a million pages, the amount of manual labor becomes so great that further growth of the directory stops.

Search indexes, on the contrary, are focused on breadth of coverage. Automation is quite capable of identifying words on a web page; search index data can cover many millions of web pages. However, searching in an index is more difficult than searching in a catalog because the same keywords can appear on web pages on different topics.

Principles of searching for information on the Internet.

By becoming a full-fledged Internet user, you gain access to a huge number of information resources. For example, the number of HTML documents available on the Internet is no longer measured in tens but in hundreds of millions. And on the Internet you can find not only text, but also programs, images, sound and video files, and so on. On the one hand, this sea of information will almost certainly contain something that interests you, even if your area of interest is very specific. On the other hand, finding exactly the pages that interest you among hundreds of millions of web pages is not an easy task. Search engines are meant to make it easier for Internet users to find the information they need.

Information retrieval systems are hosted on public servers on the Internet. They are based on so-called search engines, or automatic indexes. Special robot programs (also known as spiders) periodically and automatically survey the Internet according to certain algorithms, indexing the documents they find. The resulting index databases are used by search engines to give the user access to the information posted on the nodes of the Network. Within the appropriate interface, the user formulates a query, which is processed by the system, after which the results are displayed in the browser window. Query-processing mechanisms are constantly being improved, and modern search engines do not simply sift through a huge number of documents: the search is carried out using original and very complex algorithms, and its results are analyzed and sorted so that the information presented to the user matches the user's expectations as closely as possible.

Currently, in the development of search engines, there is a tendency to combine automatic index search engines and manually compiled catalogs of Internet resources. The resources of these systems successfully complement each other, and combining their capabilities is quite logical.

Nevertheless, studies of the capabilities of search engines, even the most powerful of them, such as AltaVista or HotBot, show that the actual coverage of World Wide Web resources by a single such system does not exceed 30%. Therefore, you should not limit yourself to using any one of them. If you are unable to find the information you are interested in using one system, try using another.

Each search engine has its own characteristics, and the quality of the results obtained depends on the subject of the search and on how precisely the query is worded. Therefore, when starting to search for information, you first of all need to understand clearly what exactly you want to find and where. For example, foreign systems are impressive in the number of documents they index. For searches in the field of professional knowledge, especially for information in a foreign language, systems such as AltaVista, HotBot or Northern Light are best suited.

However, Russian search engines are better suited for searching for information in Russian, especially on the Russian part of the Internet. Firstly, they are specifically focused on Russian-language resources of the Network and, as a rule, are distinguished by greater coverage and depth of study of these resources. Secondly, Russian systems work taking into account the morphology of the Russian language, that is, all forms of the searched words are included in the search. Russian systems better take into account such a historically established feature of Russian Internet resources as the coexistence of several Cyrillic encodings.

The interface of all search engines is built in approximately the same way. The user is prompted to enter a query into a special field and then to initiate the search by clicking a button. The system performs the search and displays the results in the browser window. In addition, many search engines let the user specify additional search criteria: for example, searching only within a certain thematic category or only on certain servers. (15, pp. 523-525)

Search engine architecture typically includes a search robot (crawler), an indexer, and the search engine proper, which answers user queries from the index.

History

Chronology (year – system – event):

1993 – W3Catalog (launch), Aliweb (launch), JumpStation (launch)
1994 – WebCrawler (launch), Infoseek (launch), Lycos (launch)
1995 – AltaVista (launch), Daum (founded), Open Text Web Index (launch), Magellan (launch), Excite (launch), SAPO (launch), Yahoo! (launch)
1996 – Dogpile (launch), Inktomi (founded), Rambler (founded), HotBot (founded), Ask Jeeves (founded)
1997 – Northern Light (launch), Yandex (launch)
1998 – Google (launch)
1999 – AlltheWeb (launch), GenieKnows (founded), Naver (launch), Teoma (founded), Vivisimo (founded)
2000 – Baidu (founded), Exalead (founded)
2003 – Info.com (launch)
2004 – Yahoo! Search (final launch), A9.com (launch), Sogou (launch)
2005 – MSN Search (final launch), Ask.com (launch), Nigma (launch), GoodSearch (launch), SearchMe (founded)
2006 – wikiseek (founded), Quaero (founded), Live Search (launch), ChaCha (beta launch), Guruji.com (beta launch)
2007 – wikiseek (launch), Sproose (launch), Wikia Search (launch), Blackle.com (launch)
2008 – DuckDuckGo (launch), Tooby (launch), Picollator (launch), Viewzi (launch), Cuil (launch), Boogami (launch), LeapFish (beta launch), Forestle (launch), VADLO (launch), Powerset (launch)
2009 – Bing (launch), KAZ.KZ (launch), Yebol (beta launch), Mugurdy (closed), Scout (launch)
2010 – Cuil (closed), Blekko (beta launch), Viewzi (closed)
2012 – WAZZUB (launch)
2014 – Sputnik (beta launch)

Early in the development of the Internet, Tim Berners-Lee maintained a list of web servers on the CERN website. As the number of sites grew, maintaining such a list manually became more and more difficult. The NCSA website had a special "What's New!" section where links to new sites were published.

The first program for searching the Internet was Archie (the word "archive" without the "v"). It was created in 1990 by Alan Emtage, Bill Heelan, and J. Peter Deutsch, computer science students at McGill University in Montreal. The program downloaded lists of all files from all available anonymous FTP servers and built a database that could be searched by file name. However, Archie did not index the contents of these files, since the amount of data was so small that everything could easily be found by hand.

The development and spread of the Gopher network protocol, invented in 1991 by Mark McCahill at the University of Minnesota, led to the creation of two new search programs, Veronica and Jughead. Like Archie, they searched for file names and titles stored in Gopher index systems. Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) allowed keyword searches across most Gopher menu titles in all Gopher listings. Jughead (Jonzy's Universal Gopher Hierarchy Excavation And Display) retrieved menu information from specific Gopher servers. Although the name of the Archie search engine had nothing to do with the "Archie" comic book series, Veronica and Jughead are characters from those comics.

By the summer of 1993, there was still no unified system for searching the Internet, although numerous specialized directories were maintained manually. Oscar Nierstrasz at the University of Geneva wrote a series of Perl scripts that periodically copied these pages and rewrote them in a standard format. This became the basis for W3Catalog, the web's first primitive search engine, launched on September 2, 1993.

Probably the first web crawler written in Perl was the "World Wide Web Wanderer" bot by Matthew Gray in June 1993. This robot created the search index "Wandex". Wanderer's goal was to measure the size of the World Wide Web and find all web pages containing the words from the query. In 1993, the second search engine “Aliweb” appeared. Aliweb did not use a crawler, but instead expected notifications from website administrators about the presence of an index file in a certain format on their sites.

JumpStation, created in December 1993 by Jonathan Fletcher, searched for and indexed web pages using a web crawler and used a web form as the interface for formulating search queries. It was the first Internet search tool to combine the three most important functions of a search engine (crawling, indexing, and the search itself). Because of the limited computer resources of the time, indexing, and therefore searching, was restricted to the titles and headings of the web pages found by the crawler.

Search engines took part in the "dot-com bubble" of the late 1990s. Several companies entered the market in spectacular fashion, generating record profits during their initial public offerings. Some later abandoned the public search engine market and began working only with the corporate sector, for example Northern Light.

The idea of selling keyword-based placement was adopted in 1998 by goto.com, at that time a small company running its own search engine. The move marked a shift for search engines from competing with each other to becoming one of the most profitable business ventures on the Internet. Search engines began to sell the top places in search results to individual companies.

The Google search engine has been prominent since the early 2000s. The company achieved its high position thanks to good search results produced by the PageRank algorithm. The algorithm was presented to the public in the article "The Anatomy of a Large-Scale Hypertextual Web Search Engine", written by Sergey Brin and Larry Page, the founders of Google. This iterative algorithm ranks web pages based on the number of hyperlinks pointing to a page, under the assumption that "good" and "important" pages attract more links than others. Google's interface is designed in a spartan style with nothing superfluous, unlike many of its competitors, who built their search engines into web portals. The Google search engine became so popular that imitators appeared, for example Mystery Seeker (the secret search engine).
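
To make the iterative idea concrete, here is a minimal, simplified sketch of PageRank-style rank propagation in Python. It illustrates the published idea only, not Google's actual implementation; the damping factor 0.85 is the value suggested in the original paper, and the toy link graph is invented for the example.

```python
# Simplified PageRank sketch: rank flows along hyperlinks and is damped.
# Real systems handle dangling pages, personalization and scale differently.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping a page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if not targets:
                continue  # dangling page: its rank is simply dropped here
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Toy link graph (invented for illustration): page "d" links only to "c", etc.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(pagerank(toy_web).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))
```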

Searching for information in Russian

In 1996, search taking Russian morphology into account was implemented in the AltaVista search engine, and the original Russian search engines Rambler and Aport were launched. On September 23, 1997, the Yandex search engine opened. On May 22, 2014, Rostelecom launched the national search engine Sputnik, which as of 2015 was in beta testing. On April 22, 2015, a new service, Sputnik.Children, designed especially for children with increased safety, was opened.

Methods of cluster analysis and metasearch have become very popular. Of the international engines of this type, the best known is Clusty by Vivisimo. In 2005, the Nigma search engine, which supports automatic clustering, was launched in Russia with the support of Moscow State University. In 2006, the Russian metasearch engine Quintura opened, offering visual clustering in the form of a tag cloud. Nigma also experimented with visual clustering.

How does a search engine work?

The main components of a search system are the search robot (crawler), the indexer, and the search engine itself.

Typically, systems operate in stages. First, the crawler retrieves the content, then the indexer generates a searchable index, and finally, the search engine provides the functionality to search the indexed data. To update the search engine, this indexing cycle is repeated.

Search engines work by storing information about many web pages, which they retrieve from HTML pages. A search robot, or "crawler", is a program that automatically follows all the links found on a page and extracts them. Starting from links or from a predefined list of addresses, the crawler finds new documents not yet known to the search engine. The site owner can exclude certain pages using robots.txt, which can be used to prevent the indexing of files, pages or directories on the site.
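
As a rough illustration of this step, the sketch below fetches one page, checks robots.txt, and extracts the links to follow later. It uses only the Python standard library; the start address is a placeholder, and a real crawler would add politeness delays, deduplication and error handling.

```python
# Minimal crawler sketch: honor robots.txt, download a page, extract links.
from html.parser import HTMLParser
from urllib import request, robotparser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags, resolved against the base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(url, user_agent="toy-crawler"):
    rp = robotparser.RobotFileParser(urljoin(url, "/robots.txt"))
    rp.read()                                  # fetch and parse robots.txt
    if not rp.can_fetch(user_agent, url):
        return []                              # the owner excluded this page
    with request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = LinkExtractor(url)
    parser.feed(html)
    return parser.links                        # addresses for later passes

# links = crawl("https://example.com/")        # placeholder start address
```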

The search engine analyzes the content of each page for further indexing. Words can be extracted from the title, the page text, or special fields called meta tags. The indexer is a module that analyzes a page, first breaking it into parts using its own lexical and morphological algorithms. Every element of a web page is isolated and analyzed separately. The web page data is stored in an index database for use in subsequent queries. The index allows information to be found quickly in response to a user's request. A number of search engines, such as Google, store the source page, in whole or in part (the so-called cache), as well as various information about the web page. Other systems, such as AltaVista, store every word of every page found. Using a cache helps speed up the retrieval of information from already visited pages. Cached pages contain the text as it was when the page was indexed, which can be useful when a web page has been updated and no longer contains the text of the user's query while the cached copy is still the old one. This situation is related to link rot and to Google's usability-oriented approach, which involves returning short text fragments from the cache that contain the query text. The principle of least surprise applies: the user usually expects to see the searched words in the texts of the pages returned. Besides speeding up searches, cached pages may contain information that is no longer available anywhere else.
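
The following toy sketch shows what the indexer's core data structure can look like: an inverted index mapping each word to the set of pages that contain it. The pages and the tokenization rule are invented for the example; real indexers also apply morphology, store word positions, and keep the index on disk.

```python
# Toy indexer: build an inverted index (word -> set of page identifiers).
import re
from collections import defaultdict

def tokenize(text):
    # Crude tokenizer: lowercase runs of Latin/Cyrillic letters or digits.
    return re.findall(r"[a-zа-яё0-9]+", text.lower())

def build_index(pages):
    """pages: dict mapping a page id (for example a URL) to its extracted text."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index

pages = {
    "url1": "Search engines index web pages",
    "url2": "A crawler downloads web pages for the indexer",
}
index = build_index(pages)
print(sorted(index["pages"]))   # ['url1', 'url2']
```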

The search engine works with the output files received from the indexer. The search engine accepts user queries, processes them using an index and returns search results.

When a user enters a query into a search engine (usually using keywords), the system checks its index and returns a list of the most relevant web pages (sorted by some criterion), usually with a short summary containing the document title and sometimes fragments of the text. The search index is built using a special technique based on information extracted from web pages. Since 2007, the Google search engine has allowed searching by the time the documents you are looking for were created (by calling the "Search Tools" menu and specifying a time range). Most search engines support the Boolean operators AND, OR, and NOT in queries, which allows the list of searched keywords to be refined or expanded; words or phrases in quotation marks are searched exactly as entered. Some search engines offer proximity (approximate) search, in which users expand the search area by specifying an allowed distance between keywords. There is also conceptual search, which uses statistical analysis of how the searched words and phrases are used in the texts of web pages. Such systems allow queries to be formulated in natural language; an example of such a search engine is ask.com.
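
Continuing the toy example above, the sketch below shows how the Boolean operators AND, OR and NOT can be evaluated as set operations on the posting lists of an inverted index. This is a textbook simplification; real query processors parse full expressions and optimize the evaluation order.

```python
# Boolean retrieval sketch over an inverted index (word -> set of page ids).
def search_and(index, words, all_pages):
    result = set(all_pages)
    for word in words:
        result &= index.get(word, set())        # AND: intersect posting lists
    return result

def search_or(index, words):
    result = set()
    for word in words:
        result |= index.get(word, set())        # OR: union of posting lists
    return result

def search_not(index, word, all_pages):
    return set(all_pages) - index.get(word, set())  # NOT: complement

# With the toy index built earlier:
# search_and(index, ["web", "pages"], pages)      -> {'url1', 'url2'}
# search_and(index, ["crawler", "pages"], pages)  -> {'url2'}
```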

The usefulness of a search engine depends on the relevance of the pages it finds. While millions of web pages may include a given word or phrase, some may be more relevant, popular, or authoritative than others. Most search engines use ranking methods to bring the “best” results to the top of the list. Search engines decide which pages are more relevant and in what order results should be shown in different ways. Search methods, like the Internet itself, change over time. This is how two main types of search engines emerged: systems of predefined and hierarchically ordered keywords and systems in which an inverted index is generated based on text analysis.
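
One classic, openly documented way to order such results is TF-IDF weighting: a page scores higher when it contains the query words often and those words are rare across the collection. The sketch below illustrates that textbook technique only; it does not describe the proprietary ranking of any particular engine, and the sample documents are invented.

```python
# TF-IDF ranking sketch: order pages by term frequency weighted by rarity.
import math
from collections import Counter

def rank(pages, query_words):
    """pages: dict mapping page id -> list of tokens; returns ids, best first."""
    n = len(pages)
    doc_freq = Counter()                  # in how many pages each word occurs
    for tokens in pages.values():
        doc_freq.update(set(tokens))
    scores = {}
    for page_id, tokens in pages.items():
        tf = Counter(tokens)
        score = 0.0
        for word in query_words:
            if word in tf:
                idf = math.log(n / doc_freq[word])      # rare words weigh more
                score += (tf[word] / len(tokens)) * idf
        scores[page_id] = score
    return sorted(scores, key=scores.get, reverse=True)

docs = {"p1": ["web", "search", "engine"], "p2": ["web", "page", "design"]}
print(rank(docs, ["search", "engine"]))   # ['p1', 'p2']
```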

Most search engines are commercial enterprises that make a profit from advertising; in some search engines you can buy top places in the search results for given keywords for a fee. Search engines that do not charge money for the order of results earn from contextual advertising, in which the advertising messages match the user's query. Such advertising is displayed on the page with the list of search results, and the search engine earns money every time a user clicks on an advertising message.

Types of Search Engines

There are four types of search engines: robotic, human-powered, hybrid, and meta.

  • systems using search robots
They consist of three parts: a crawler ("bot", "robot" or "spider"), an index, and the search engine software. The crawler is needed to traverse the web and compile lists of web pages. The index is a large archive of copies of web pages. The purpose of the software is to evaluate search results. Because the search robot in this scheme constantly explores the network, the information is more up to date. Most modern search engines are systems of this type.
  • human-managed systems (resource directories)
These search engines obtain their lists of web pages from people. The directory contains the address, title and a brief description of each site. A resource directory searches only among the page descriptions submitted to it by webmasters. The advantage of directories is that all resources are checked manually, so the quality of the content will be better than results obtained automatically by systems of the first type. There is also a drawback: catalog data is updated manually and can lag significantly behind the real state of affairs, and page rankings cannot change instantly. Examples of such systems are the Yahoo directory, dmoz, and Galaxy.
  • hybrid systems
Search engines such as Yahoo, Google, MSN, combine the functions of systems using search robots and systems operated by humans.
  • meta-systems
Metasearch engines combine and rank the results of several search engines at once. These search engines were useful when each search engine had a unique index and search engines were less "smart". Since search has improved so much now, the need for them has decreased. Examples: MetaCrawler and MSN Search.

Search Engine Market

Google is the most popular search engine in the world with a market share of 68.69%. Bing ranks second with a 12.26% share.

The most popular search engines in the world:

Search engine    Market share, July 2014    October 2014    September 2015
Google           68.69 %                    58.01 %         69.24 %
Baidu            17.17 %                    29.06 %          6.48 %
Bing              6.22 %                     8.01 %         12.26 %
Yahoo!            6.74 %                     4.01 %          9.19 %
AOL               0.13 %                     0.21 %          1.11 %
Excite            0.22 %                     0.00 %          0.00 %
Ask               0.13 %                     0.10 %          0.24 %

Asia

In East Asian countries and in Russia, Google is not the most popular search engine. In China, for example, the search engine Soso is more popular.

In South Korea, the domestically developed search portal Naver is used by about 70% of users, while Yahoo! Japan and Yahoo! Taiwan are the most popular search engines in Japan and Taiwan, respectively.

Russia and Russian-language search engines

According to LiveInternet data in June 2015 on the coverage of Russian-language search queries:

  • All-language:
    • Yahoo! (0.1%) and search engines owned by this company: Inktomi, AltaVista, AlltheWeb
  • English-language and international:
    • AskJeeves (Teoma engine)
  • Russian-language: most "Russian-language" search engines index and search texts in many languages (Ukrainian, Belarusian, English, Tatar and others). They differ from "all-language" systems, which index all documents in a row, in that they mainly index resources located in domain zones where the Russian language dominates, or otherwise limit their robots to Russian-language sites.

Some of the search engines use external search algorithms.

Quantitative data from Google search engine

The number of Internet users, of search engines, and users' demands on these systems are constantly growing. To speed up the search for the necessary information, major search engines maintain a large number of servers. Servers are usually grouped into server centers (data centers). Popular search engines have server centers scattered around the world.

In October 2012, Google launched the "Where the Internet Lives" project, where users are given the opportunity to explore the company's data centers.

The following is known about the data centers of the Google search engine:

  • The total capacity of all Google data centers, as of 2011, was estimated at 220 MW.
  • When Google planned a new complex in Oregon in 2008, consisting of three buildings with a total area of 6.5 million square meters, Harper's Magazine estimated that such a large complex would consume more than 100 megawatts of electricity, comparable to the energy consumption of a city of 300,000 people.
  • The estimated number of Google servers in 2012 is 1,000,000.
  • Google's expenses on data centers amounted to $1.9 billion in 2006, and $2.4 billion in 2007.

The size of the World Wide Web as indexed by Google as of December 2014 is approximately 4.36 billion pages.

Search engines that take into account religious prohibitions

The global spread of the Internet and the growing popularity of electronic devices in the Arab and Muslim world, in particular in the countries of the Middle East and the Indian subcontinent, contributed to the development of local search engines that take Islamic traditions into account. Such search engines contain special filters that help users avoid visiting prohibited sites, such as sites with pornography, and allow them to use only sites whose content does not contradict the Islamic faith. Just before the Muslim month of Ramadan, in July 2013, Halalgoogling was introduced to the world – a system that provides users only with halal ("correct") links by filtering search results received from other search engines such as Google and Bing. Two years earlier, in September 2011, the I'mHalal search engine had been launched to serve users in the Middle East. However, this search service soon had to be closed, according to the owner, due to a lack of funding.

A lack of investment and the slow pace of technology adoption in the Muslim world have hampered progress and prevented the success of a serious Islamic search engine. Huge investments in Muslim lifestyle web projects have failed; one of them was Muxlim, which raised millions of dollars from investors such as Rite Internet Ventures and which, according to I'mHalal's last post before it shut down, promoted the dubious idea that "the next Facebook or Google can only come from the Middle East if you support our brilliant youth". Nevertheless, Islamic Internet experts have for many years been in the business of determining what does or does not comply with Sharia and classifying websites as "halal" or "haram". All past and present Islamic search engines are simply specially indexed sets of data, or major search engines such as Google, Yahoo and Bing with an added filtering system used to prevent users from accessing haram sites, such as sites about nudity, LGBT, gambling and any other topics considered anti-Islamic.

Other religiously oriented search engines include Jewogle, a Jewish version of Google, and SeekFind.org, a Christian site that includes filters to protect users from content that might undermine or weaken their faith.

Personal results and filter bubbles

Many search engines, such as Google and Bing, use algorithms to selectively guess what information a user would like to see based on their past browsing activity. As a result, websites only show information that is consistent with the user's past interests. This effect is called the "filter bubble".

All this leads to the fact that users receive much less information that contradicts their point of view and become intellectually isolated in their own “information bubble”. Thus, the "bubble effect" can have negative consequences for the formation of civic opinion.

Search Engine Bias

Although search engines are programmed to rank websites based on some combination of popularity and relevance, in reality experimental research indicates that various political, economic and social factors influence search results.

This bias may be a direct result of economic and commercial processes: companies that advertise on a search engine may become more popular in organic search results on the engine. Removing search results that do not comply with local laws is an example of the influence of political processes. For example, Google will not display some neo-Nazi websites in France and Germany, where Holocaust denial is illegal.

Bias can also be a consequence of social processes, as search engine algorithms are often designed to exclude non-mainstream viewpoints in favor of more "popular" results. The indexing algorithms of the major search engines give priority to American sites.

Search bombing is one example of an attempt to manipulate search results for political, social or commercial reasons.

See also

  • Qwika
  • Electronic library#Lists of libraries and search engines
  • Web Developer Toolbar



In order to successfully maintain and develop our blog, we first of all need to know how search engines work and what algorithms they use. A clear understanding of the answers to these questions will allow us to successfully solve the problems of promoting a website in search engines. The conversation about search engine optimization of websites is still ahead; for now, a little theory about search engines.

What are Internet search engines?

If we turn to Wikipedia, this is what we find out:

“A search engine is a software and hardware complex with a web interface that provides the ability to search for information on the Internet.”

And now in a language we understand. Let's say we urgently need information on a certain topic. So that we can quickly find it, search engines have been created - sites where, by entering a search query in the search form, we will be given a list of sites on which, with a high degree of probability, we will find what we are looking for. This list is called search results. It can consist of millions of pages with 10 sites on each. The main task of a webmaster is to get into at least the top ten.

Remember that when you search for something on the Internet, you usually find it on the first page of the search results, rarely moving on to the second, much less to subsequent ones. This means that the higher a site ranks, the more visitors will come to its pages. And high traffic (the number of visitors per day) is, among other things, an opportunity to earn well.

How do Internet search engines find information on the Internet and on what basis do they distribute places in search results?

In a few words, an Internet search engine is a vast web in which spider robots constantly scan the network and store all the texts that appear on the Internet. By analyzing the collected data, search engines select the documents that best match the search query, that is, the relevant ones, from which the search results are formed.

The most interesting thing is that search engines cannot read. So how do they find information? Search engine algorithms boil down to a few basic principles. First of all, they pay attention to the title and description of an article, paragraph headings, semantic highlights in the text, and the density of keywords, which must correspond to the topic of the article. The more accurate this match, the higher the site will appear in search results. In addition, the volume of information and many other factors are taken into account, for example the authority of a web resource, which depends on the number and authority of the sites linking to it. The greater the authority, the higher the ranking.
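
As a purely illustrative sketch of how such on-page signals might be combined, the function below scores a page by title matches and keyword density. The weights and the sample text are arbitrary values invented for the example, not figures used by any real search engine.

```python
# On-page relevance sketch: combine title matches with keyword density.
def on_page_score(title, body_words, query_words):
    title_words = set(title.lower().split())
    title_hits = sum(1 for w in query_words if w in title_words)
    density = sum(body_words.count(w) for w in query_words) / max(len(body_words), 1)
    return 3.0 * title_hits + 1.0 * density   # 3.0 and 1.0 are arbitrary weights

body = "how search engines rank pages by keywords and links".split()
print(on_page_score("How search engines work", body, ["search", "engines"]))
```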

A set of measures aimed at raising a site's position in search results for certain queries is called search engine optimization (SEO). It has grown into a whole science, but more on that later.

At the moment there are many search engines in the world. I'll name the most popular ones. In the West these are Google, Bing and Yahoo. In RuNet – Yandex, Mail.ru, Rambler and Nigma. Overall, users give preference to the world leader, while Yandex has become the most popular system on the Russian-language Internet.

A little history. Google was created in 1997 by a native of Moscow, Sergey Brin, and his American friend Larry Page, during their studies at Stanford University.

Google's peculiarity was that it brought the most relevant search results in a logical sequence to the first positions in search results, while other search engines simply compared the words in the query with the words on the web page.

On September 23 of the same year, the Yandex search engine was announced; in 2000 it began to exist as the separate company Yandex.

I won't bore you any longer; I hope it is now a little clearer what Internet search engines are. It is worth saying that search engine algorithms are constantly evolving. Every day, search engines get better at identifying users' needs and showing them the most relevant information in the search results, based on many factors (the region, what queries the user has already made, what sites he visited during the search, where he went from them, and so on).

Soon Google and Yandex will know better than us what we need and what we think about!

Topic 3.1.1 Searching for information on the Internet

The Internet is growing at a very fast pace, so finding the information you need among hundreds of billions of Web pages and hundreds of millions of files is becoming increasingly difficult. To search for information, special search engines are used, which contain constantly updated information about the location of Web pages and files on hundreds of millions of Internet servers.

When searching for information, it is necessary to answer three questions: what to look for, that is, what sources of information, where to look (locations of these sources) and how to look (what tools to use for this).

What are the main sources of information available on the Internet? These are WWW documents, articles in newsgroups and mailing lists, files in file libraries, directories with address information about organizations and people (e-mail, address, telephone), articles in thematic databases, and encyclopedias.

Where are these information sources located? These are such popular Internet resources as WWW, news groups, mailing lists and FTP servers.

Of course, you can search for the necessary sources of information manually, find out addresses from specialized magazines on computer science and the Internet, and use special paper directories with addresses classified into categories.

However, for such a changing space as the Internet, it is necessary to learn to use special tools whose purpose is to collect data about information resources and to provide users with a quick search service.

An IRS (information retrieval system) is a system that provides search and selection of the necessary data in a special database containing descriptions of information sources (the index), on the basis of an information retrieval language and the corresponding search rules.

The main task of any information retrieval system is to find information relevant to the user's information needs. It is very important not to lose anything in the search, that is, to find all the documents related to the query, and not to find anything superfluous. Therefore, a qualitative characteristic of the search procedure is introduced: relevance.

Relevance is the correspondence of search results to the formulated query.
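
The two requirements in the previous paragraph, "find everything related" and "find nothing superfluous", are usually quantified as recall and precision. A small sketch, assuming the sets of relevant and retrieved documents are known for a test query:

```python
# Precision and recall: the two sides of "find everything, and nothing extra".
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0   # nothing superfluous
    recall = len(hits) / len(relevant) if relevant else 0.0        # nothing lost
    return precision, recall

print(precision_recall({"d1", "d2", "d3"}, {"d2", "d3", "d4"}))    # (0.666..., 0.666...)
```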

Internet search servers can be divided into two groups:

– general-purpose search engines;

– specialized search engines.

General purpose search engines

The general-purpose search engine interface contains a search field and a list of directory sections. The following WWW search tools are distinguished: directories, search engines, and metasearch engines.


Catalog

A catalog (directory) is a search system with a list of annotations, classified by topic, with links to web resources. The classification is usually done by people.


Searching the catalog is very convenient and is carried out by successively refining the topic. In addition, directories support quick keyword search for a specific category or page using a local search engine. The directory's link database (index) usually has a limited volume and is filled manually by the directory staff; some directories use automatic index updating.

A search result in the catalog is presented as a list of brief descriptions (annotations) of documents with hypertext links to the original sources.

Popular directory addresses:

1 Foreign catalogues:

a) Yahoo – www.yahoo.com;

b) Look Smart – www.looksmart.com;

c) Magellan – www.mckinley.com;

d) eiNET – www.einet.net.

2 Russian catalogues:

a) Aport (Constellation Internet) – www.aport.ru;

b) AU – www.au.ru;

c) Weblist – www.weblist.ru;

d) Snail – www.ulitka.ru.

In a search engine database, Web sites are grouped into hierarchical subject directories, which are analogous to a subject directory in a library.

Top-level thematic sections, for example Internet, Computers, Science and Education, and so on, contain subdirectories. For example, the Internet directory may contain the subdirectories Search, Mail and others.

Searching for information in the catalog is reduced to selecting a specific catalog, after which the user will be presented with a list of links to the Internet addresses of the most visited and informative Web sites. Each link is usually annotated, that is, it contains a short commentary on the contents of the document.

The most complete multi-level hierarchical thematic catalog of Russian-language Internet resources is available in the Aport search system (www.aport.ru). The catalog contains a detailed summary of the content of Web sites and an indication of their geographical location.

Search engine

A search engine is a search system whose database, containing information about information resources, is generated by robots.

A distinctive feature of search engines is that their database, containing information about web pages, Usenet articles and so on, is generated by a robot program.

A search in such a system is carried out using a query composed by the user, consisting of a set of keywords or a phrase enclosed in quotation marks. The index is generated and kept up to date by indexing robots. For example, to search for Internet search engines themselves, you could enter the keywords "Russian systems for searching information on the Internet".
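
A phrase enclosed in quotation marks requires the words to occur next to each other and in order, which a plain inverted index cannot check by itself. The sketch below shows one simple way to support this with a positional index; the sample pages are invented for the illustration.

```python
# Phrase search sketch: a positional index (word -> page -> positions) lets a
# quoted phrase match only where its words appear adjacently and in order.
from collections import defaultdict

def build_positional_index(pages):
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word][page_id].append(position)
    return index

def phrase_search(index, phrase):
    words = phrase.lower().split()
    if not words:
        return set()
    results = set()
    for page_id in index.get(words[0], {}):
        for start in index[words[0]][page_id]:
            if all(start + offset in index.get(word, {}).get(page_id, [])
                   for offset, word in enumerate(words)):
                results.add(page_id)
                break
    return results

pages = {"p1": "information retrieval on the Internet",
         "p2": "retrieval of information"}
index = build_positional_index(pages)
print(phrase_search(index, "information retrieval"))   # {'p1'}
```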

Some time after the query is sent, the search engine returns a list of Internet addresses of documents in which the specified keywords were found. A document's description most often contains its first few sentences or excerpts from its text with the keywords highlighted. As a rule, the date when the document was updated (checked) and its size in kilobytes are indicated; some systems also determine the language of the document and its encoding (for Russian-language documents).

To view this document in a browser, simply activate the link pointing to it.

If the keywords were chosen poorly, then the list of document addresses may be too large (may contain tens or even hundreds of thousands of links). In order to reduce the list, you can enter additional keywords in the search field or use the search engine directory.

Many search engines allow you to search within the documents already found, refining your query by introducing additional terms. If the intelligence of the system is high, you may be offered a search for similar documents: you select a document you particularly like and point the system to it as a model to follow, although this function often does not work as expected. Some search engines allow you to re-sort the results. To save time, you can save the search results as a file on a local disk for later offline study.

Addresses of the most popular search engines abroad and in Russia:

1 Foreign search engines:

a) Google – www.google.com;

b) Alta Vista – www.altavista.com;

c) Excite – www.excite.com;

d) HotBot – www.hotbot.com;

e) Northern Light – www.northernlight.com;

f) Go (Infoseek) – www.go.com (infoseek.com);

g) Lycos – www.lycos.com;

h) Fast – www.alltheweb.com.

2 Russian search engines:

a) Yandex – www.yandex.ru (or www.ya.ru);

b) Rambler – www.rambler.ru;

c) Aport – www.aport.ru.

One of the most complete and powerful search engines is Google (www.google.ru), whose database stores 8 billion web pages, and every month robot programs add 5 million new pages to it. In Runet (the Russian part of the Internet), the search engines Yandex (www.yandex.ru) and Rambler (www.rambler.ru) have extensive databases containing about 200 million documents each.

Metasearch engine

Please note that different search engines describe different numbers of information sources on the Internet, so you should not limit your search to just one of them. Now let's get acquainted with search tools that do not create their own index but can use the capabilities of other search engines. These are metasearch engines (search services) – systems capable of sending a user's query simultaneously to several search servers, then combining the results and presenting them to the user as a single document with links.

Metasearch engines do not have their own database. They are programs that accept a user's query, process it using artificial intelligence algorithms, and then pass it on to search engines; in other words, they are search engines of search engines. The advantage of these systems is their ability to work with the intent behind a search rather than just the literal wording of the query, so the results are clearer to the user and correspond more closely to what he is looking for. Metasearch sites offer a huge number of options, aiming to be useful to any user. There are versions of metasearch engines that constantly crawl the Internet for information matching your search criteria.

When the system finds new information, it alerts you or automatically downloads it. If you want to find sites dedicated to general issues, travel, and so on, then metasearch engines will allow you to quickly gain access to the necessary information. They also offer direct access to sites with specific information, such as telephone directories, travel guides and government sites. Metasearch engines usually have a slightly longer run time because they query other search engines. It makes sense to turn to them when conventional search engines have not yielded results.
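
A sketch of the metasearch idea: the same query goes to several engines, and the ranked lists are merged, keeping the best position each address achieved. The engine functions here are placeholders standing in for real services, not actual APIs.

```python
# Metasearch sketch: query several engines and merge their ranked URL lists.
def merge_results(result_lists):
    best_position = {}
    for results in result_lists:                  # one ranked list per engine
        for position, url in enumerate(results):
            if url not in best_position or position < best_position[url]:
                best_position[url] = position
    return sorted(best_position, key=best_position.get)

def metasearch(query, engines):
    """engines: list of callables, each returning a ranked list of URLs."""
    return merge_results([engine(query) for engine in engines])

# Placeholder engines standing in for real search services:
engine_a = lambda q: ["http://a.example/1", "http://b.example/2"]
engine_b = lambda q: ["http://b.example/2", "http://c.example/3"]
print(metasearch("search engines", [engine_a, engine_b]))
# ['http://a.example/1', 'http://b.example/2', 'http://c.example/3']
```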

Addresses of well-known metasearch engines:

– MetaCrawler – www.metacrawler.com;

– SavvySearch – www.savvysearch.com

Many users need the Internet in order to get answers to the queries (questions) they enter.

If there were no search engines, users would have to search for the sites they need on their own, remember them and write them down. In many cases, finding something suitable "by hand" would be very difficult, and often simply impossible.

Search engines do all this routine work of searching, storing and sorting information on websites for us.

Let's start with the famous Runet search engines.

Internet search engines in Russian

1) Let's start with the domestic search engine Yandex. It works not only in Russia, but also in Belarus, Kazakhstan, Ukraine and Turkey. There is also an English-language version of Yandex.

2) The Google search engine came to us from America and has a Russian-language localization.

3) The domestic search engine Mail.ru, which also runs the social networks VKontakte and Odnoklassniki, as well as My World, the well-known Answers Mail.ru and other projects.

4) The "intelligent" search engine Nigma (http://www.nigma.ru/).

Since September 19, 2017, Nigma has not worked. It ceased to be of financial interest to its creators, who switched to another search engine called CocCoc.

5) The well-known company Rostelecom has created the Sputnik search engine.

There is also a Sputnik search engine designed specifically for children, which I have written about.

6) Rambler was one of the first domestic search engines.

There are other well-known search engines in the world:

  • Bing
  • Yahoo!
  • Baidu
  • Ecosia

Let's try to figure out how a search engine works: how sites are indexed, how the indexing results are analyzed, and how the search results are generated. The principles of operation of search engines are approximately the same: finding information on the Internet, storing it, and sorting it for delivery in response to user queries. But the algorithms that search engines use can differ greatly. These algorithms are kept secret, and their disclosure is prohibited.

By entering the same query into the search boxes of different search engines, you can get different answers. The reason is that each search engine uses its own algorithms.

The purpose of search engines

First of all, you need to know that search engines are commercial organizations. Their goal is to make a profit. Profit can be made from contextual advertising, other types of advertising, from promoting the necessary sites to the top of the search results. In general, there are many ways.

Advertising revenue depends on the size of the audience, that is, on how many people use the search engine. The larger the audience, the more people the advertising will be shown to, and accordingly, the more this advertising will cost. Search engines can increase their audience through their own advertising, as well as by attracting users by improving the quality of their services, algorithms and search convenience.

The most important and difficult thing here is the development of a fully functioning search algorithm that would provide relevant results for the majority of user queries.

The work of a search engine and the actions of webmasters

Each search engine has its own algorithm, which must take into account a huge number of different factors when analyzing information and compiling the results in response to a user's query:

  • the age of a particular site,
  • website domain characteristics,
  • quality of content on the site and its types,
  • features of navigation and site structure,
  • usability (convenience for users),
  • behavioral factors (the search engine can determine whether the user found what he was looking for on the site, or whether he returned to the search engine and searched again for the same query),
  • etc.

All this is needed precisely so that the results returned for the user's query are as relevant as possible and satisfy the user's needs. At the same time, search engine algorithms are constantly changing and being refined. As they say, there is no limit to perfection.

On the other hand, webmasters and optimizers are constantly inventing new ways to promote their sites, which are not always honest. The task of the developers of the search engine algorithm is to make changes to it that would not allow “bad” sites of dishonest optimizers to appear in the TOP.

How does a search engine work?

Now let's talk about how the search engine actually works. It consists of at least three stages:

  • scanning,
  • indexing,
  • ranking.

The number of sites on the Internet is simply astronomical. And every site is information, information content that is created for readers (living people).

Scanning

This is a search engine wandering around the Internet to collect new information, analyze links and search for new content that can be used to return to the user in response to his requests. For scanning, search engines have special robots called search robots or spiders.

Search robots are programs that automatically visit websites and collect information from them. The crawl can be primary (the robot visits a new site for the first time). After the initial collection of information from the site and entering it into the search engine database, the robot begins to visit its pages with some regularity. If any changes have occurred (new content has been added, old content has been deleted), then all these changes will be recorded by the search engine.

The main task of a search spider is to find new information and send it to the search engine for the next stage of processing, that is, for indexing.

Indexing

A search engine can search for information only among those sites that are already included in its database (indexed by it). If crawling is the process of searching and collecting information that is available on a particular site, then indexing is the process of entering this information into the search engine database. At this stage, the search engine automatically decides whether to enter this or that information into its database and where to enter it, in which section of the database. For example, Google indexes almost all the information found by its robots on the Internet, while Yandex is more picky and does not index everything.

For new sites, the indexing stage can take a long time, so a new site may have to wait a long while for visitors from search engines. New information that appears on old, well-promoted sites, by contrast, can be indexed almost instantly and almost immediately ends up in the "index", that is, in the search engine's database.

Ranking

Ranking is the ordering of the information that was previously indexed and entered into the database of a particular search engine: which information the search engine will show its users first, and which will be placed lower in the results. Ranking can be regarded as the stage at which the search engine serves its client – the user.

On the search engine's servers, the collected information is processed and results are generated for an enormous range of possible queries. This is where the search engine's algorithms come into play. All sites in the database are classified by topic, and topics are divided into groups of queries. For each group of queries, a preliminary results list can be compiled, which is subsequently adjusted.