Good afternoon, my dear readers. Today we will touch on an extremely interesting and important topic: information retrieval systems. Knowing how to work with them, and understanding their basic concepts and operating principles, can help novice users learn to search the Internet quickly and efficiently, obtain the data they need, and grow their online business.

In this article I will talk about the history of search systems, the principles of their operation, and their structure. In addition, I will dwell on some very important features you must know when working with an IRS.

So, let's study in more detail what an IRS is and what components it consists of.

Information retrieval systems (IRS) and their types

This concept arose back in the late 1980s and early 1990s, when the first prototypes appeared both in Russia and abroad. By definition, an IRS is a system that searches, processes, and selects data matching a request in its own special database, which contains descriptions of various sources of information as well as rules for using them.

Its main task is to find the information the user needs. To measure how well it does this, the concept of relevance is used: how accurately the search results match a particular query.

The main types of IRS include the following:

  • Catalogs. Indexing in a catalog can be done either manually or automatically with index updating. The result of a catalog query is a list containing a hyperlink to each relevant resource and a description of the corresponding document on the Internet. The best-known catalogs include Yahoo and Magellan among foreign ones, and Weblist, Snail, and @Rus among Russian ones.

  • Search engines. The most common foreign information retrieval systems are Google, AltaVista, and Excite; the Russian ones are Yandex and Rambler.

  • Metasearch engines. There are a huge number of information systems in the world, covering many sources of information, and even the most modern and powerful server cannot satisfy the needs of millions of users. That is why special metasearch engines were created. They simultaneously forward a user's request to several search servers and, by generalizing the answers, return a document with links to the required resources. Examples include MetaCrawler and SavvySearch.
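To make the idea concrete, here is a minimal sketch of the fan-out-and-merge step a metasearch engine performs. The engine functions and URLs are invented for illustration; real metasearch engines query remote servers and use far more elaborate result merging.

```python
# Hypothetical stand-ins for two underlying search engines.
# Real metasearch engines would send the query over the network.
def fake_engine_a(query):
    return ["http://example.com/a1", "http://example.com/a2"]

def fake_engine_b(query):
    return ["http://example.com/a1", "http://example.com/b1"]

def metasearch(query, engines):
    """Forward the query to every engine and merge the answers,
    keeping the order of first appearance and dropping duplicates."""
    seen, merged = set(), []
    for engine in engines:
        for url in engine(query):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

results = metasearch("information retrieval", [fake_engine_a, fake_engine_b])
print(results)
```

The deduplication step matters because the same popular page tends to appear in several engines' answers.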

History of the creation of the IRS

The very first IRS appeared in the mid-1990s. They were very reminiscent of the ordinary indexes found in books and reference works. Their databases contained keywords collected from numerous sites in various ways. Since Internet technologies were still immature, searching was performed only by keywords.

Much later, full-text search was developed to make it easier for users to find the information they needed: instead of recording only keywords, the system indexed the text of pages, so users could make queries for specific words and whole phrases.

One of the first was Wandex, developed by the programmer Matthew Gray in 1993. In the same year a new search engine, Aliweb, appeared (and, by the way, it still works to this day). However, these early systems had a rather complex structure and lacked modern technologies.

One of the most successful was WebCrawler, first launched in 1994. Its distinctive feature, and the main advantage that set it apart from other search engines, was that it could find any keyword on a given page. It later became a kind of standard for the IRS developed after it.

Later, other search engines emerged and competed with one another: Excite, AltaVista, InfoSeek, Inktomi, and many others. From 1996, Russian users began working with Rambler and Aport. But the real triumph of the Russian Internet was Yandex, created in 1997.

This Russian analogue of Google has become the real pride of Russian programmers. Today it is confidently squeezing out its competitor on the RuNet and is one of the leaders in search queries among information retrieval systems in Russia.

Today there are numerous specialized "search engines" created to solve specific problems. For example, the information retrieval system "Patron" was designed to store and search data on cartridges for various weapons and is now used both by the Ministry of Internal Affairs and intelligence services, and by professional and amateur hunters.

There are others designed for notaries, doctors, engineers, military, car enthusiasts, etc.

How does an IRS work?

The work of an information retrieval system is quite complex, but with a little effort its structure can be understood. The first thing to note is a special program called a search robot, or spider, which systematically visits pages and indexes them.

The web server receives the user's request for information and passes it to the search engine. The search engine consults its database, compiles a list of matching pages, and returns it to the web server, which formats the results into "readable" form and sends them to the user's computer.
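The request flow described above can be illustrated with a toy pipeline. The index contents and function names here are invented for the example; a real engine's database, matching, and formatting are vastly more complex.

```python
# A made-up inverted "database" mapping query terms to page lists.
INDEX = {
    "python": ["pages/tutorial.html", "pages/faq.html"],
    "search": ["pages/engines.html"],
}

def search_engine(query):
    """Look the query up in the database and return matching pages."""
    return INDEX.get(query.lower(), [])

def web_server(query):
    """Turn the raw page list into a 'readable' response for the user."""
    pages = search_engine(query)
    if not pages:
        return f"No results for '{query}'"
    return "\n".join(f"{i}. {p}" for i, p in enumerate(pages, 1))

print(web_server("Python"))
```

Note the division of labor: the search engine only matches against the database; presentation is the web server's job.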

An IRS is intended for the following purposes:

  • Store significant amounts of data;
  • Conduct a quick search for the necessary information;
  • Add and remove various data;
  • Display information in a simple and convenient form.
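A minimal sketch of these four capabilities, using an ordinary Python dictionary as the "database"; the class and document contents are invented for illustration and do not represent a real IRS.

```python
class TinyIRS:
    """Toy information retrieval system: store, search, add/remove, display."""

    def __init__(self):
        self.docs = {}                      # store data

    def add(self, doc_id, text):            # add data
        self.docs[doc_id] = text

    def remove(self, doc_id):               # remove data
        self.docs.pop(doc_id, None)

    def search(self, word):                 # quick search by substring
        return [d for d, t in self.docs.items() if word.lower() in t.lower()]

    def display(self, doc_id):              # simple, convenient display
        return f"[{doc_id}] {self.docs.get(doc_id, '<missing>')}"

irs = TinyIRS()
irs.add("doc1", "Search engines index the Web")
irs.add("doc2", "Catalogs are compiled by hand")
print(irs.search("index"))
```

A real system replaces the linear substring scan with an inverted index so that search time does not grow with the size of the collection.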

There are several main types of IRS:

  • Automated
  • Bibliographic
  • Conversational
  • Documentary

What search engines are the most popular today?

In first place, without any doubt, is the undisputed leader, Google. Today about 80 percent of the world's queries in a wide variety of areas go to it. Second place is occupied by the American eBay.

In third place is our domestic, Russian Yandex. In fourth place is Yahoo, and in fifth, MSN. Another domestic search engine, though it occupies only 10th place in the European ranking, is the Russian Rambler.

Google

This search engine is known to a huge number of users; today it is the most popular system in the world. It processes more than 41 billion queries monthly and indexes 25 billion pages.

As for the history of Google, back in 1996 a pair of Stanford University students, Larry Page and Sergey Brin, developed a search engine based on new search methods. They named it as simply and concisely as they designed it. The name Google itself is a distorted spelling of googol, the number ten to the hundredth power.

It is based on a special search robot called Googlebot, which scans pages and indexes them. The authority algorithm of this search engine is PageRank; it is what determines the order in which pages are shown to the visitor in search results.

This company was among the first to offer its interface in many languages, which greatly simplifies entering queries into the system. And, finally, it gave rise to the verb "to google", which appears more and more often in teenagers' slang.

Yahoo

Yahoo is the second most popular search engine in the USA. It was founded in 1994 by two Stanford graduate students, David Filo and Jerry Yang. In the late 1990s they acquired the RocketMail portal and built the free Yahoo mail service on top of it. Today you can store any number of emails on its servers. In 2010, a Russian-language mail resource, Yahoo! Mail, appeared.

Yandex

One of the best Russian search engines, without a doubt, is Yandex. Today it ranks fourth in the total number of queries worldwide and first in popularity in the Russian Federation. It handles more than 250 million queries every day.

It was introduced in September 1997, and in May 2011 the company went public, raising more in its IPO than any other Internet company's offering of recent years.

Today, Yandex has 50 services, some of them unique: Yandex.Search, Yandex.Maps, Yandex.Market. Russian users are also very interested in services such as Blog Search and Yandex.Traffic. Most queries come from the following countries: Russia, Belarus, Turkey, and Kazakhstan.

Historically, the company was founded by the businessman and programmer Arkady Volozh in 1989. The name itself was invented by Ilya Segalovich, a director of Yandex. Through cooperation with the Institute for Information Transmission Problems, a searchable reference dictionary was created.

Unlike other search engines, it also takes into account the morphology of the Russian language. The system is thus designed specifically for the Russian-language segment of the Internet.

Since 2010, in addition to Yandex.ru, there has been another search engine, Yandex.com, which is used to search foreign portals.

Search system "eBay"

eBay is a US Internet company that specializes in conducting online auctions. It manages the eBay.com portal as well as national versions in other countries around the world. In addition, the company owns a subsidiary, eBay Enterprise.

The founder of the company is the American programmer Pierre Omidyar, who in the mid-1990s built an online auction into his personal portal. eBay acts as a kind of intermediary between buyers and sellers: sellers pay a certain fee to list items, while buyers use the site for free.

The general principles of its operation are as follows:

  • Basically all people are decent
  • Everyone can contribute
  • In open communication, people show their best qualities

Already in 1995, millions of items were being sold in thousands of online auctions. Today it is a powerful platform for buying and selling, used by individuals and legal entities alike.

In 2010, a Russian-language version of the popular resource appeared under the name "International Trade Center eBay". Payments at the auction are made through the PayPal payment system.

To sell an item on this portal, you specify its description, its starting price, when the auction will begin, and how long it will last. As in a regular auction, the highest bidder gets the item.
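The auction rule just described (the highest bid at or above the starting price wins) can be sketched like this. The data and function are invented for illustration and do not reflect eBay's actual bidding mechanics, which include proxy bidding, bid increments, and reserve prices.

```python
def run_auction(starting_price, bids):
    """Return (winner, price) for the highest qualifying bid,
    or (None, starting_price) if nobody met the starting price.
    `bids` is a list of (bidder, amount) pairs."""
    valid = [(amount, bidder) for bidder, amount in bids if amount >= starting_price]
    if not valid:
        return None, starting_price
    amount, bidder = max(valid)   # tuples compare by amount first
    return bidder, amount

winner, price = run_auction(10.0, [("alice", 12.5), ("bob", 15.0), ("carol", 9.0)])
print(winner, price)  # bob 15.0
```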

One of the advantages of such an auction is that the seller and buyer can be located anywhere in the world, while local branches and flexible time frames give a huge number of sellers and buyers the opportunity to participate.

MSN

This search engine was developed by Microsoft and appeared simultaneously with the release of Windows 95. Later, the Hotmail email service, as well as various Microsoft websites, began to use the MSN name. At the beginning of 2002, MSN was one of the largest Internet providers in the United States, with 9 million subscribers.

Search system Rambler

The second major Russian search engine is the Internet portal Rambler. Together with Yandex, it is a founder of the RuNet and a major player in the media services market.

Its founder is Sergei Lysakov, who developed the search system in 1994; the domain www.rambler.ru was registered in 1996. Since 2012, Rambler has operated as a news portal.

Today it ranks 11th in popularity among Russian sites. Rambler also developed the special Rambler Top-100 classifier, essentially the first of its kind in Russia, and today runs a convenient catalog of real-estate listings, "Rambler – Real Estate".

Search engine Mail.ru

One of the largest mail services, Mail.ru, was created in 1998. Today it is an e-mail service, a catalog of Internet resources, and a set of information sections. Besides its very convenient mail, it runs a number of special projects that are popular with subscribers: Auto Mail.ru, Poster Mail.ru, Children Mail.ru, Health Mail.ru, Lady Mail.ru, News Mail.ru, and Real Estate Mail.ru.

For fans of sports and Hi-Tech there are corresponding sections.

This concludes my material. If you liked it, then please subscribe to my blog and invite your family, friends and acquaintances.


In one of his interviews, Gary Flake (director of Yahoo! Research Labs) said: "If Web search were perfect, it would return an answer to every query, as if the question had been answered by the smartest person in the world, with all the world's reference information at their fingertips, in less than an instant." For now, modern systems provide a visual interface for analyzing the selection of documents they have "prepared".

  1. Network navigation.
  2. An alternative search method is to search for objects and their relationships, which are automatically extracted from the text of documents during the ETL phase. This method allows connections between objects to be explored without specifying a contextual criterion for filtering documents. For example, you can search for relationships between the object "Cheney" and other objects (Fig. 1), and use them to navigate to the desired objects or to obtain and analyze documents about the relationships of those objects. Further development of methods for analyzing connections between objects involves typing those connections; progress here is limited by the quality of Russian-language parsers and thesauruses.

    The method of navigating a collection of documents using OLAP technology is very useful. The system builds, on the fly, a multidimensional representation of the resulting collection of documents, with dimensions taken from the card fields: headings, authors, publication date, sources. The analyst can drill into elements of different dimensions (for example, into the regions of a federal district), view the documents in cells with the required frequency values, and so on. General methods of data analysis and forecasting can also be applied. Fig. 3 shows a diagram for obtaining a list of publications from a cell of the two-dimensional distribution of publications by region and by subcategory of the "Politics" heading. This method is used to analyze the dynamics of publications and the factors determining them.

  3. Automatic annotation.
Open sources of information make a huge number of publications available and thereby raise the problem of working efficiently with large volumes of documents. Providing the condensed meaning of primary sources in the form of annotations speeds up analysis several times over. However, our experience shows that an annotation is a static result, used when analyzing "paper" documents; for collections of electronic documents, a more visual and structured representation of content is provided by an interactive semantic map of the relationships between document topics. Modern systems for analytical processing of text information have means for automatically compiling annotations. There are two approaches to solving this problem.
  1. In the first approach, the annotator program extracts from the original source a small number of fragments in which the content of the document is most fully presented. These can be sentences containing query terms; fragments of sentences with terms surrounded by several words, etc. In more developed systems, sentences are identified that directly contain the key topics of the document (but not coreferent links to them).
  2. In the second approach, the abstract is a synthesized document in the form of a summary. An abstract generated by the first approach is qualitatively inferior to one obtained by synthesis. To improve annotation quality, the problem of resolving coreference links in Russian must be solved. Another problem in synthesizing annotations is the lack of tools for semantic analysis and synthesis of Russian text, so annotation services are either focused on a narrow subject area or require human participation.

Most annotation programs are built on the principle of highlighting text fragments. Thus, the eXtragon research system is focused on annotating Web documents. For each sentence in the document, a weight is calculated based on information about keywords, significant phrases, their place in the text and presence in the query, after which the sentences are ranked, and an abstract is compiled from several phrases with the maximum weight. In the Analytical Courier system, a document abstract is automatically formed from its fragments, and its volume depends on the main topics of the document and settings. The annotation on objects or problems may include anaphoric sentences of the document. In addition, there is a component for creating a general annotation based on the relationships of topics in the semantic network of this collection of documents.
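The fragment-extraction principle can be illustrated with a deliberately simplified sentence-weighting sketch: score each sentence by keyword occurrences, take the top-scoring sentences, and reassemble them in document order. The scoring scheme is invented for this example and is far cruder than what systems like eXtragon or Analytical Courier actually do.

```python
def annotate(text, keywords, max_sentences=2):
    """Build an extractive abstract from the highest-weight sentences."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]

    def weight(sentence):
        # Weight = number of keyword occurrences in the sentence.
        words = sentence.lower().split()
        return sum(words.count(k.lower()) for k in keywords)

    # Rank sentence indices by weight, keep the best, restore document order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: weight(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return ". ".join(sentences[i] for i in chosen) + "."

doc = ("Search engines index the web. The weather was fine. "
       "An index maps keywords to documents. Annotation compresses text.")
print(annotate(doc, ["index", "keywords"]))
```

Restoring document order in the last step matters: an abstract whose sentences appear in their original sequence reads far more naturally.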

Search engines consist of five separate software components, namely:

  • Spider: downloads Web pages; a program that works much like a Web browser.
  • Crawler: a "traveling" spider; it automatically follows every link found on a page.
  • Indexer: a "blind" program whose task is to analyze the Web pages the spiders have downloaded.
  • Database: a repository of the pages that have been downloaded and then processed.
  • Search engine results engine: the system that retrieves search results from the database.

Learn more about each search engine component

Spider: its task is simple: to download Web pages. It works no differently from your browser when you connect to a site and load a page, except that the spider has no visual component. You can see what a spider "sees" by choosing "view HTML code" in your Web browser.

Crawler: like the spider, it downloads pages, but its functions also include "stripping" a page and finding all its links. Its task is to determine where the spider should go next, based either on the links it finds or on a predefined list of addresses.
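A minimal sketch of the crawler's link-finding step, using only Python's standard library; the HTML page is a made-up example.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page,
    i.e. the addresses the spider should visit next."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/about">About</a> <a href="/news">News</a></body></html>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['/about', '/news']
```

A real crawler would also resolve relative links against the page's base URL and filter out links it has already visited.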

Indexer: the indexer parses a page into its different parts and analyzes them. The title, page headings, text, links, BOLD and ITALIC elements, structural elements, and other style parts of the page are isolated and analyzed.
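The indexer's page-parsing step can be sketched like this: text content is sorted into buckets by the tag it appears in, so that titles, headings, and bold text can later be weighted differently. The bucket scheme and the page are invented for illustration.

```python
from html.parser import HTMLParser

class PageIndexer(HTMLParser):
    """Sort a page's text content into buckets by the tag it appears in."""
    TRACKED = {"title", "h1", "h2", "b", "strong", "i", "em"}

    def __init__(self):
        super().__init__()
        self.parts = {}
        self._stack = []          # currently open tracked tags

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if data.strip():
            bucket = self._stack[-1] if self._stack else "text"
            self.parts.setdefault(bucket, []).append(data.strip())

page = "<html><title>IR basics</title><body><h1>Indexing</h1>Plain <b>bold</b> text</body></html>"
idx = PageIndexer()
idx.feed(page)
print(idx.parts)
```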

Database: the repository of all the data the search engine downloads and analyzes. In most cases this requires enormous resources.

Results delivery system: the search engine results engine is the heart of the search engine. It is this system that decides which pages will satisfy each user query; this is the part of the engine that actually carries out the search.

When a user enters a keyword and starts a search, the search engine selects results according to constantly changing criteria. The method by which a search engine makes its decisions is called an algorithm ("algos", as professionals sometimes say).

Search criteria when generating results by search engines

Even though search engines have changed a lot, most of them still select search results based on the following criteria:

  • Heading (Title): is the keyword in the page title?
  • Domain/address (Domain/URL): is the keyword in the page address or domain name?
  • Style (Style): is there a place on the page where the keyword appears in italic (I or EM), bold (B or STRONG), or heading (H1, H2, ...) text?
  • Density (Density): how often is the keyword used on the page? Keyword density is the number of occurrences of the keyword relative to the total text of the page.
  • Meta data (Meta information): although many deny it, some search engines still read the meta description and meta keywords tags.
  • Outbound links (Outbound Links): where do the links on the page lead, and is the keyword in the link text?
  • Inbound links (Inbound Links): who else on the Internet links to this site, and what is in the link text? The page author cannot always control this criterion, which is why it is called "off-page".
  • Links within the site (Insite Links): does the page link to other pages of the same site?
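For illustration, here is a toy scoring function that combines three of the criteria above: keyword in the title, keyword in the URL, and keyword density in the body. The weights are invented for this sketch; real ranking algorithms are proprietary and far more sophisticated.

```python
def score_page(keyword, title, url, body):
    """Combine title, URL, and density criteria into a single toy score."""
    kw = keyword.lower()
    words = body.lower().split()
    density = words.count(kw) / len(words) if words else 0.0

    score = 0.0
    if kw in title.lower():
        score += 3.0          # Title criterion (invented weight)
    if kw in url.lower():
        score += 2.0          # Domain/URL criterion (invented weight)
    score += 10.0 * density   # Density criterion (invented weight)
    return score

s = score_page("python",
               title="Python tutorial",
               url="https://example.com/python/intro",
               body="Learn python basics. This python guide covers syntax.")
print(round(s, 2))  # 7.5
```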

As a result, we see that the search engine must be able to run many clarifying checks of this kind over the entire downloaded page.

This article is only a short description of the functioning of search engines.