Question-answering systems

A question-answering system (QA system) is a special type of information system, a hybrid of search, reference, and intelligent systems (they are often regarded as intelligent search engines). A QA system must be able to accept questions in natural language, that is, it is a system with a natural language interface. Information is provided based on documents from the Internet or from local storage. Modern QA systems can process many kinds of requests: for facts, lists, definitions, How and Why questions, hypothetical, complex, and cross-lingual questions.

  • Highly specialized (closed-domain) QA systems work in specific areas (for example, medicine or car maintenance). Building such systems is a comparatively easy task.
  • General (open-domain) QA systems work with information on all areas of knowledge, making it possible to search in related areas as well.

Architecture

The first QA systems were developed in the 1960s and were natural language shells for expert systems focused on specific domains. Modern systems are intended for searching for answers to questions in the documents provided, using natural language processing (NLP) technologies.

Modern QA systems usually include a special module, the question classifier, which determines the type of question and, accordingly, the expected answer. After this analysis, the system applies progressively more complex and subtle NLP methods to the provided documents, discarding unnecessary information. The coarsest method, document retrieval, uses an information retrieval system to select parts of the text that potentially contain an answer. Then a filter highlights phrases similar to the expected answer (for example, for a "Who ..." question the filter returns pieces of text containing people's names). Finally, an answer extraction module finds the correct answer among these phrases.
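Schematically, such a pipeline can be sketched as follows (a toy illustration: the classifier, the two-sentence corpus, and the name-matching rule are stand-ins, not components of any real system):

```python
# A minimal sketch of the pipeline described above: question classifier ->
# document retrieval -> answer-type filter -> answer extraction.
import re

CORPUS = [
    "Samuel Morse invented the telegraph in 1837.",
    "The telegraph transformed long-distance communication.",
]

def classify_question(question):
    # Very rough classifier: "who" questions expect a person's name.
    return "PERSON" if question.lower().startswith("who") else "OTHER"

def retrieve(question):
    # Coarsest step: keyword overlap between the question and sentences.
    keywords = set(re.findall(r"\w+", question.lower())) - {"who", "what", "the"}
    return [s for s in CORPUS if keywords & set(re.findall(r"\w+", s.lower()))]

def extract_answer(candidates, expected_type):
    for sentence in candidates:
        if expected_type == "PERSON":
            # Filter: keep phrases that look like capitalized person names.
            m = re.search(r"([A-Z][a-z]+ [A-Z][a-z]+)", sentence)
            if m:
                return m.group(1)
    return None

q = "Who invented the telegraph?"
print(extract_answer(retrieve(q), classify_question(q)))  # -> Samuel Morse
```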

Scheme of work

The performance of a question-answering system depends on the quality of the text base: if it contains no answers to the questions, the QA system will not be able to find anything. The larger the base, the better, but only if it contains the necessary information. Large repositories (such as the Internet) contain a lot of redundant information. This redundancy has two positive effects:

  1. Since the information is presented in different forms, the QA system will find an appropriate answer more quickly, without resorting to complex text processing techniques.
  2. Correct information is repeated more often, so errors in individual documents are weeded out.

Surface search

The most common search method is by keywords. Phrases found in this way are filtered according to question type and then ranked based on syntactic features, such as word order.

Advanced Search

Problems

In 2002, a group of researchers wrote a research plan in the field of question-answering systems. It was proposed to consider the following questions.

  • Types of questions. Different questions require different methods of finding answers. Therefore, it is necessary to create or improve methodological lists of the types of possible questions.
  • Question processing. The same information can be requested in different ways. Effective methods are required for understanding and processing the semantics (meaning) of a sentence. It is important that the program recognize questions that are equivalent in meaning, regardless of the style, words, syntactic relationships, and idioms used. It would be desirable for a QA system to divide complex questions into several simple ones and correctly interpret context-sensitive phrases, perhaps clarifying them with the user during a dialogue.
  • Contextual questions. Questions are asked in a specific context. Context can clarify a query, resolve ambiguity, or follow the user's thinking through a series of questions.
  • Sources of knowledge for a QA system. Before answering a question, it would be wise to take stock of the available text databases. No matter what text processing methods are used, we will not find the correct answer if it is not in the databases.
  • Answer selection. The correct execution of this procedure depends on the complexity of the question, its type, the context, the quality of the available texts, the search method, and so on: a huge number of factors. Therefore, the study of text processing methods must be approached with great caution, and this problem deserves special attention.
  • Formulation of the answer. The answer should be as natural as possible. In some cases, simply extracting it from the text is sufficient. For example, if a name is required (a person's name, the name of an instrument, a disease), a quantity (currency rate, length, size), or a date ("When was Ivan the Terrible born?"), a direct answer is enough. But sometimes one has to deal with complex queries, and special algorithms are needed to merge answers from different documents.
  • Answering questions in real time. A system is needed that finds answers in repositories within a few seconds, regardless of the complexity and ambiguity of the question and the size and vastness of the document base.
  • Multilingual queries. Development of systems for working and searching in other languages (including automatic translation).
  • Interactivity. Often the information offered by a QA system as an answer is incomplete. The system may have determined the question type incorrectly or "understood" the question incorrectly. In this case, the user may want not only to reformulate the request but also to "explain" it to the program through dialogue.
  • Reasoning (inference) mechanism. Some users would like to receive an answer that goes beyond the available texts. To do this, knowledge common to most areas must be added to the QA system (see General ontologies in computer science), as well as means for automatically inferring new knowledge.
  • User profiles of QA systems. Information about the user, such as their area of interest, manner of speech and reasoning, and default facts, could significantly increase the performance of the system.

Links

  • Dialogus is a search engine that automatically selects answers to user questions.
  • Otvety@Mail.ru: human search for answers to any questions.


Introduction

1. Problems

2. Domain overview

2.1 The task of question analysis

3. Question analysis methods

3.1 Question character patterns

3.2 Syntactic question patterns

3.3 Statistics on the use of words in questions

4. Evaluation of question analysis methods

4.1 Creating a test collection of questions

4.2 Metrics

4.3 Results of a simple experiment

Conclusion

Bibliography

Introduction

Due to the rapid development of information technology and the continuous increase in the volume of information available on the global Internet, the issues of effective search and access to data are becoming increasingly relevant. Often a standard search using keywords does not give the desired result, due to the fact that this approach does not take into account the linguistic and semantic relationships between the query words. Therefore, natural language processing (NLP) technologies and question-answering systems (QAS) based on them are now actively developing.

A question-answering system is an information system that is a hybrid of search, reference and intelligent systems that uses a natural language interface. The input to such a system is a request formulated in natural language, after which it is processed using NLP methods and a natural language response is generated. As a basic approach to the task of finding an answer to a question, the following scheme is usually used: first, the system in one way or another (for example, by searching by keywords) selects documents containing information related to the question posed, then filters them, highlighting individual text fragments, potentially containing the answer, after which the generating module synthesizes the answer to the question from the selected fragments.

As a source of information, a QA system uses either local storage, or the global network, or both at the same time. Despite the obvious advantages of using the Internet, such as access to huge, ever-growing information resources, there is a significant problem associated with this approach: information on the Internet is unstructured, and for its correct retrieval it is necessary to create so-called "wrappers", that is, subroutines that provide unified access to various information resources.

Modern QA systems are divided into general (open-domain) and specialized (closed-domain). General systems, that is, systems focused on processing arbitrary questions, have a fairly complex architecture, but nevertheless in practice they give rather weak results and low answer accuracy. As a rule, though, for such systems the degree of knowledge coverage is more important than the accuracy of the answers. In specialized systems, which answer questions related to a specific subject area, the accuracy of answers is, on the contrary, often the critical indicator (it is better not to answer a question at all than to give a wrong answer).

1. Problems

However, today question-answering systems show far from impressive results. Thus, the best system on the GikiCLEF 2009 track demonstrated an accuracy of 47% (note that this is the result of running systems on a multilingual collection). Separately, we note that today very few Russian-language question-answering systems participate in open independent quality evaluations. The publications record only one case that makes it possible to compare at least two systems: the participation of the Stockon system (today AskNet.ru) and Exactus.ru in the ROMIP 2006 seminar (2, p. 23). Both systems use semantic indexing, which is just one of many methods used by researchers around the world today (3, 4). According to the authors, it is necessary to study the other popular methods on a Russian-language corpus as well.

An analysis of existing works showed that an independent assessment, on Russian-language corpora, of the entire range of methods used in question-answering systems requires the creation of a research software platform in accordance with the so-called Common Architecture for Question Answering (3). It is proposed to use the open-source OpenEphyra system as a basis; it has already been used by other researchers to work with English, German, and Dutch (5). The architecture of the OpenEphyra system follows this standard architecture.

The main task of the work is the implementation of almost all modules of the system pipeline for the Russian language. The authors propose to use the following existing software libraries for processing Russian: the lexical, morphological, and syntactic analysis libraries from aot.ru (6), the mystem module for morphological analysis of sentences (7), the question classification of the AskNet.ru system for Russian (8, p. 34), and the RussNet thesaurus of the Russian language (9). A number of missing modules must be developed independently: syntactic templates for questions and answers, a question categorization module, and a named-entity recognition module.

Fig. 1. OpenEphyra system architecture (10, p. 1)

The goal of the work is to prepare a basic research system for presentation at the ROMIP, CLEF, and TREC seminars. Without such a system, the authors consider it impossible to conduct experimental studies of methods for automatically answering questions in Russian. Considering the results of a similar project for Dutch, where work (5) achieved an accuracy of 3.5%, the authors expect the basic implementation of the system to demonstrate accuracy of the same order on the ROMIP tracks of previous years. A separate problem is the impossibility of reusing the ROMIP question-answering tracks in automatic mode (2). To solve this problem, the authors plan to create a reusable test collection based on a subset of ROMIP tasks, using regular expressions to compare answers, as proposed by TREC in work (11).

The remainder of the article discusses only the first stage of a question-answering system: the question analysis module. The following are considered: the statement of the question analysis problem, question analysis methods, and the available apparatus for experimental study of the methods on a test collection of questions.

2. Domain overview

Question-answering search systems, unlike traditional search engines, receive a question sentence in natural language (English, Russian, etc.) rather than a set of keywords, and return a short answer rather than a list of documents and links. Modern information retrieval systems produce a list of whole documents that may contain the information of interest, leaving to the user the work of extracting the necessary data from documents ordered by relevance to the query. For example, a user enters the question "Who is the President of Russia?" and receives the person's name as the response, rather than a list of relevant links to documents. Thus, finding the answer to a question by extracting from a document a small passage of text that directly contains the answer is a task quite different from information retrieval.

Most of the existing projects in the field of question-answering search are intended for English. Comparing several works in this research area, one arrives at a standard design for question-answering systems. As a rule, the operation of a typical question-answering system consists of several stages:

1. Analysis of the question entered by the user;

2. Information search;

3. Answer extraction.

At the first stage, the question is entered in natural language; the sentence undergoes primary processing and formalization by various analyzers (syntactic, morphological, semantic), and its attributes are determined for further use. Next, at the second stage, documents are searched and analyzed: documents and fragments that may contain the answer to the original question are selected. At the third stage, the answer is extracted: receiving text documents or fragments thereof, the system extracts from them the words, sentences, or passages that can become the answer.

It should be noted that various thesauri play an important role in the results and development of such systems. These dictionaries solve the problem of determining entity types when identifying answers and of finding the initial forms of words for use in search queries; they are also used to find synonyms.

2.1 The task of question analysis

The first stage of the work is the creation of the question analysis module (Question Analysis in Fig. 1). The module is given the following task: for a question in natural language, select the question focus and the question support, and determine the semantic tag of the answer (Fig. 2).

Fig. 2. Top-level IDEF0 diagram of the question analysis process.

Question focus (English: question focus) is the part of the question that carries information about what the user expects to find in the answer (4).

Question support (English: question support) is the rest of the question (after "subtracting" the focus), which carries information that supports the choice of a specific answer.

Semantic answer tag (English: answer tag, answer type) is the class of information requested by the user, according to some previously defined taxonomy.

Below are examples of question analysis from the ROMIP 2009 tasks, performed manually (Table 2.1; the spelling of the real queries is preserved).

Table 2.1.

Examples of analysis of questions from the ROMIP 2009 tasks (3, p. 12)

No. | Question | Semantic tag
nqa2009_6368 | how to disable keyboard interception? | Recipe
nqa2009_7185 | how much does it cost to fix a socket on a Sony Ericsson phone? | Money
nqa2009_6425 | in which religions is karma considered? | Definition
nqa2009_3123 | Patriotic war who is with whom? | Country
nqa2009_8557 | Are attics a fire hazard? | Yes/No
nqa2009_7801 | What is the number of read/write cycles provided by Fujifilm for LTO 4 standard cartridges? | Cardinal
nqa2009_8763 | when will the mega sale start? | Date
nqa2009_9150 | what time is sunset on February 27? | Time
nqa2009_8754 | when can you bring cats? | Age
nqa2009_6797 | what recording studios are there in Tambov? | Organization

A taxonomy of semantic tags is usually chosen by the system developers so as to cover most of the questions put to the system. The following taxonomy was borrowed from (3) and extended by the authors with several tags to better cover the test collection of ROMIP 2009 questions: Age, Disease, Ordinal, Recipe, Animal, Duration, Organ, Salutation, Areas, Event, Organization, Substance, Attraction, Geological objects, People, Term (reverse definition), Cardinal, Law, Percent, Time, Company-roles, Location, Person, Title-of-work, Country, Manner, Phrase (NNP), URL, Date, Measure, Plant, Weather, Date-Reference, Money, Product, Yes/No, Definition, Occupation, Reason.

3. Question analysis methods

This section gives a short review of existing question analysis methods.

3.1 Question character patterns

The simplest way to determine the tag or focus in a question is to prepare patterns (regular expressions) to recognize common interrogative phrases. Below are some rules used in the OpenEphyra system for English (Table 3.1.).

Table 3.1.

Character question templates from the OpenEphyra system (10)

Semantic tag | Regular question expression
NEaward | (what|which|name|give|tell) (.*)?(accolade|award|certification|decoration|honoring|honouring|medal|prize|reward)
NEbird | (what|which|name|give|tell) (.*)?bird
NEbirthstone | (what|which|name|give|tell) (.*)?birthstone
NEcolor | (what|which|name|give|tell) (.*)?(color|colour)
NEconflict | (what|which|name|give|tell) (.*)?(battle|conflict|conquest|crisis|crusade|liberation|massacre|rebellion|revolt|revolution|uprising|war)
NEdate | (when|what|which|name|give|tell) (.*)?(birthday|date|day)
NEdate-century | (when|what|which|name|give|tell) (.*)?century
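For illustration, rules of this kind map directly onto a small pattern matcher; the sketch below (with an abridged rule list taken from Table 3.1) assumes only Python's standard re module:

```python
# A minimal sketch of character-pattern question classification in the
# spirit of Table 3.1; the rule list is abridged.
import re

RULES = [
    ("NEdate",  r"(when|what|which|name|give|tell) (.*)?(birthday|date|day)"),
    ("NEcolor", r"(what|which|name|give|tell) (.*)?(color|colour)"),
    ("NEbird",  r"(what|which|name|give|tell) (.*)?bird"),
]

def classify(question):
    q = question.lower().rstrip("?")
    for tag, pattern in RULES:
        if re.search(pattern, q):
            return tag
    return None  # no rule fired

print(classify("What color is the sky?"))      # -> NEcolor
print(classify("When is Mozart's birthday?"))  # -> NEdate
```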

To highlight the focus, work (3) used the following templates, which also include morphological information (Table 3.2, for English):

Table 3.2.

Examples of templates for highlighting the focus of a question in English (3)

Question word | Pattern
What, which, name, list, identify | question word + headword of first noun cluster
Who, why, whom, when | question word
Where | question word + main verb
How | question word plus the next word if it seeks a count attribute, + headword of first noun cluster; question word plus the next word if it seeks an attribute; if the question seeks a methodology, then just the question word

The obvious disadvantages of this approach are:

1. The practical impossibility of covering a significant portion of real user questions. The set of patterns is selected to handle a specific set of test tasks, and it is quite easy to step outside this coverage with an "inconvenient" question.

2. After a series of experiments it becomes clear that the relationship between question words and semantic tags is not so straightforward. Thus, the word "who" can signal a person, an organization, a country, or a people (for example, in the question "Who won the war?").

3. Pattern-based focus selection also works only in very limited cases.

The template method was successfully used in systems participating in TREC-8 (1999), for which the organizers prepared the QA track questions manually. However, already in TREC-9 (2000) the tasks were based on real user requests, and systems that did not apply other question analysis methods lagged noticeably behind the leaders.

3.2 Syntactic question patterns

The next step after character patterns was the method of syntactic question patterns for selecting the question focus. The method is based on the assumption that the focus of a question often stands in a certain syntactic relationship to the question word; there may be more than one such relationship, but the range of options is limited. Parsing a sentence yields a syntactic tree (Fig. 3). This example clearly demonstrates that, to work on a collection of real user questions, the system must among other things cope with typos and spelling errors.

Here is an example of a syntax pattern for focus recognition used in the OpenEphyra system:

(ROOT (SBARQ (WHNP (WP What)) (SQ (VP (VBZ is) (NP (NP (DT the) (NN name)) (PP (IN of) (*NP xx)))))))

Here, a syntactic tree is specified in bracket notation, with words or their syntactic/morphological labels at the nodes. Such a tree template is compared against the tree of the real question and, if there is a match, the sentence constituents corresponding to position xx in the template are taken to be the focus.
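For illustration, a simplified matcher over NLTK's Tree class is sketched below; the wildcard convention (a node labeled *NP captures the focus subtree) mimics the pattern above, but the code is an assumption of this sketch, not OpenEphyra's implementation:

```python
from nltk import Tree

# Pattern from the text: "*NP" marks the node whose subtree is the focus.
TEMPLATE = Tree.fromstring(
    "(ROOT (SBARQ (WHNP (WP What)) (SQ (VP (VBZ is) "
    "(NP (NP (DT the) (NN name)) (PP (IN of) (*NP xx)))))))")

def match(template, tree, captures):
    """Recursively match a template tree against a parse tree; a template
    node labeled '*X' captures the whole parse subtree labeled X."""
    if isinstance(template, str):          # leaf: a concrete word
        return template == tree
    if template.label().startswith("*"):   # wildcard: capture the subtree
        if isinstance(tree, Tree) and tree.label() == template.label()[1:]:
            captures.append(tree)
            return True
        return False
    if (not isinstance(tree, Tree) or tree.label() != template.label()
            or len(tree) != len(template)):
        return False
    return all(match(t, s, captures) for t, s in zip(template, tree))

question = Tree.fromstring(
    "(ROOT (SBARQ (WHNP (WP What)) (SQ (VP (VBZ is) "
    "(NP (NP (DT the) (NN name)) (PP (IN of) (NP (DT the) (NN capital))))))))")
focus = []
if match(TEMPLATE, question, focus):
    print("focus:", " ".join(focus[0].leaves()))  # -> focus: the capital
```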

3.3 Statistics on the use of words in questions

Work (3) proposed a method for automatically training a statistical model that assigns the semantic tag. For each question from the training set, three "feature streams" are extracted:

1. All words as they are, with additional marks for some of them (for example, the mark bqw means that the question word stands at the beginning of the sentence);

2. Part-of-speech labels and the ordinal numbers of the words in the sentence;

3. Focus words together with their hypernyms, according to a lexical thesaurus.

Below are the features for one question in English (Table 3.3).

Table 3.3.

Features for the question "Which European city hosted the 1992 Olympics?" (3)

Words as they are | Which which_bqw which_JJ European city host 1992 olympics
Parts of speech | WDT_0 which_WDT JJ_0 european_JJ NN_1 city_NN VBD_2 hosted_VBD DT_3 CD_4 1992_CD NNS_5 olympics_NNS
Hypernyms | European city metropolis urban_center municipality urban_area geographical_area geographic_area geographical_region geographic_region region location entity metropolis urban_center city_center central_city financial_center hub civic_center municipal_center down_town inner_city

Having manually tagged a collection of more than 4 thousand questions, the authors of (3) calculated which features most often indicate each semantic tag. For this purpose, the mathematical apparatus of maximum entropy was used. In total, 36 thousand features were generated from the collection of 4 thousand questions. Below are the weights used to decide on placing a particular tag based on the identified features (Table 3.4).

The disadvantage of the statistical method is the need to create a large training collection of questions manually. Thus, the authors of work (3) are not satisfied with the size of their collection of 4 thousand TREC-9 questions.

Table 3.4.

Features for placing a semantic tag (3)

Feature | Semantic tag | Weight
many|COUNT0 | CARDINAL | 6.87
why_WRB | REASON | 33.04
Region | LOCATION | 5.75
who_V | PERSON | 4.09
when_V|DEFN0 | DATE | 17.31
Period | DURATION | 7.66
Government | LOCATION | 9.56
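For illustration, a classifier of the same family can be trained with off-the-shelf tools; the sketch below uses scikit-learn's logistic regression (the maximum-entropy model family) on a toy training set that stands in for the tagged TREC-9 collection, with plain bag-of-words features standing in for the three feature streams:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training set: questions paired with manually assigned semantic tags.
questions = [
    "who invented the telegraph",
    "who is the president of russia",
    "when did the war end",
    "when was ivan the terrible born",
    "why is the sky blue",
    "why do leaves fall",
]
tags = ["Person", "Person", "Date", "Date", "Reason", "Reason"]

# Bag-of-words features stand in for the word/POS/hypernym streams.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(questions, tags)

print(model.predict(["who wrote war and peace"]))  # expected: ['Person']
```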

4. Evaluation of question analysis methods

Let us consider the procedure for an experimental study of question analysis methods.

4.1 Creating a test collection of questions

As in other information retrieval tasks, it is proposed to create a test collection of questions and perform the analysis manually using an assessor tool. As the test collection, the authors use the tasks of the question-answering track of the ROMIP 2009 seminar: 9617 Russian-language questions formulated by users on the Internet.

4.2 Metrics

It is proposed to use as the main metric the error of semantic tag placement: Et = (N − M)/N, where N is the number of questions processed by the assessor, and M is the number of questions for which the question analysis module assigned the same semantic tag as the assessor (3).

The second metric evaluates whether the focus of the question was selected correctly. The authors did not find an existing metric in the literature, so they propose their own: precision P and recall R of focus selection for a given question:

P = |A ∩ B| / |A|, R = |A ∩ B| / |B|,

where A is the set of word positions selected as the focus by the question analysis module, and B is the set of word positions selected by the assessor.

In both sets, insignificant words are ignored: question words, prepositions, conjunctions. The elements of both sets are not words as lexical units but the positions of words in the sentence, i.e., a set may contain several instances of one word if it was repeated in the question sentence. The average precision and recall over the entire collection of questions should be taken as the metrics.
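A minimal sketch of these metrics in Python (the inputs are sets of word positions, following the convention above; the example numbers are illustrative):

```python
def tag_error(n_processed, n_matching):
    """E_t = (N - M) / N: the share of questions where the module's tag
    differs from the assessor's."""
    return (n_processed - n_matching) / n_processed

def focus_precision_recall(module_positions, assessor_positions):
    """P and R over sets of word positions selected as the focus."""
    overlap = len(module_positions & assessor_positions)
    p = overlap / len(module_positions) if module_positions else 0.0
    r = overlap / len(assessor_positions) if assessor_positions else 0.0
    return p, r

# N = 9617 questions; M here is an illustrative value giving the 67% error
# reported in section 4.3.
print(tag_error(9617, 3174))                      # -> ~0.67
print(focus_precision_recall({6, 7}, {6, 7, 8}))  # -> (1.0, 0.666...)
```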

4.3 Results of a simple experiment

An experiment was carried out on the collection of Russian-language questions to study a trivial implementation of the semantic tag assignment module. The module used a word-lookup table to select one or another semantic tag based on the words found in the question. All the rules of the module are listed below (Table 4.1).

Table 4.1.

Rules for the operation of a trivial module for analyzing questions in Russian.

Word | Tag
Download | URL
donate, gift | Product
Who | Person
Whether | Yes/No
How | Recipe
Definition, what is | Definition
Where | Location
price, cost, how much is | Money
when, in what year | Date
age, how old is | Age

The experiment showed that this implementation of the question analysis module gives an error of 67%. At the time of writing, the authors had not conducted focus selection experiments.

Conclusion

In the task of automatically answering a question in natural language, the first stage of the system's operation is the analysis of the question. The quality of the issue analysis module significantly affects the quality of the system as a whole (3). Foreign researchers conducted experiments on analyzing questions in English, and different research groups used different methods for solving this first problem.

In this work, a review of existing methods for the English language was carried out, a procedure for evaluating methods was developed, a test collection of Russian-language questions was manually processed, and an experiment was carried out to study some trivial implementation of the module. The authors plan to assemble a complete pipeline of a typical question-answer system from trivially implemented modules, which will become an experimental platform for researching more effective methods.


Bibliography

1. Carol Peters. What happened in CLEF 2009: Introduction to the Working Notes // Proceedings of CLEF 2009.

2. Russian seminar on Evaluation of Information Retrieval Methods. Proceedings of the fourth Russian seminar ROMIP'2006. St. Petersburg: NU CSI, 2006. 274 p.

3. Abraham Ittycheriah. A Statistical Approach For Open Domain Question Answering // Advances in Open Domain Question Answering. Springer Netherlands, 2006. Part 1. Vol. 32.

4. Burger J. et al. Issues, tasks and program structures to roadmap research in question & answering (Q&A). NIST DUC Vision and Roadmap Documents, 2001.

6. Segalovich I. A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine. MLMTA, 2003.

7. Search engine AskNet.ru [Electronic resource]: List of questions supported by the AskNet system for conducting semantic searches.

8. Azarova I. V. et al. Development of a computer thesaurus of the Russian language like WordNet // Reports of the scientific conference "Corpus linguistics and linguistic databases" / Ed. A. S. Gerd. St. Petersburg, 2002. Pp. 6-18.

9. Semantic Analyzer group blog [Electronic resource]. URL: http://semanticanalyzer.info/

Question-answering systems

Anatoly Nikitin, Pavel Raikov

1. Introduction

1.1 Problems

2. QA system Start

2.1 Ternary expressions

2.2 S-rules

2.3 Lexicon

2.4 WordNet

2.5 Omnibase

2.6 Natural language annotations

2.7 Conclusion

3. Statistical techniques for natural language analysis

3.1 Introduction

3.2 Determining parts of speech for words in sentences

3.3 Creating parse trees from sentences

3.4 Creating your own parsing rules based on PCFG. Treebank grammars. "Markov grammars"

3.5 Lexical parsers

1. Introduction

Due to the rapid development of information technology and the continuous increase in the volume of information available on the global Internet, the issues of effective search and access to data are becoming increasingly relevant. Often, a standard search using keywords does not give the desired result, due to the fact that this approach does not take into account the linguistic and semantic relationships between the query words. Therefore, natural language processing (NLP) technologies and question-answering systems (QAS) based on them are now actively developing.

A question-answering system is an information system that is a hybrid of search, reference and intelligent systems that uses a natural language interface. The input to such a system is a request formulated in natural language, after which it is processed using NLP methods and a natural language response is generated. As a basic approach to the task of finding an answer to a question, the following scheme is usually used: first, the system in one way or another (for example, by searching by keywords) selects documents containing information related to the question posed, then filters them, highlighting individual text fragments, potentially containing the answer, after which the generating module synthesizes the answer to the question from the selected fragments.

As a source of information, the QA system uses either local storage, or the global network, or both at the same time. Despite the obvious advantages of using the Internet, such as access to huge, ever-growing information resources, there is a significant problem associated with this approach - information on the Internet is unstructured and for its correct retrieval it is necessary to create so-called “wrappers”, that is, subroutines that provide unified access to various information resources.

Modern QA systems are divided into general (open-domain) and specialized (closed-domain). General systems, that is, systems focused on processing arbitrary questions, have a fairly complex architecture, but nevertheless, in practice they give rather weak results and low accuracy of answers. But, as a rule, for such systems the degree of knowledge coverage is more important than the accuracy of the answers. In specialized systems that answer questions related to a specific subject area, on the contrary, the accuracy of the answers is often a critical indicator (it is better not to answer the question at all than to give the wrong answer).

1.1 Problems

In 2002, a group of researchers wrote a research plan in the field of question-answering systems. It was proposed to consider the following questions:

  • Types of questions. Different questions require different methods of finding answers. Therefore, it is necessary to create or improve methodological lists of the types of possible questions.
  • Question processing. The same information can be requested in different ways. Effective methods are required for understanding and processing the semantics (meaning) of a sentence. It is important that the program recognize questions that are equivalent in meaning, regardless of the style, words, syntactic relationships, and idioms used. It would be desirable for a QA system to divide complex questions into several simple ones and correctly interpret context-sensitive phrases, perhaps clarifying them with the user during a dialogue.
  • Contextual questions. Questions are asked in a specific context. Context can clarify a query, resolve ambiguity, or follow the user's thinking through a series of questions.
  • Sources of knowledge for the QA system. Before answering a question, it would be wise to take stock of the available text databases. No matter what text processing methods are used, we will not find the correct answer if it is not in the databases.
  • Answer selection. The correct execution of this procedure depends on the complexity of the question, its type, the context, the quality of the available texts, the search method, and so on: a huge number of factors. Therefore, the study of text processing methods must be approached with great caution, and this problem deserves special attention.
  • Formulation of the answer. The answer should be as natural as possible. In some cases, simply extracting it from the text is sufficient. For example, if a name is required (a person's name, the name of an instrument, a disease), a quantity (currency rate, length, size), or a date ("When was Ivan the Terrible born?"), a direct answer is enough. But sometimes one has to deal with complex queries, and special algorithms are needed to merge answers from different documents.
  • Answers to questions in real time. A system is needed that finds answers in repositories within a few seconds, regardless of the complexity and ambiguity of the question and the size and vastness of the document base.
  • Multilingual queries. Development of systems for working and searching in other languages (including automatic translation).
  • Interactivity. Often the information offered by a QA system as an answer is incomplete. The system may have determined the question type incorrectly or "understood" the question incorrectly. In this case, the user may want not only to reformulate the request but also to "explain" it to the program through dialogue.
  • Mechanism of reasoning (inference). Some users would like to receive an answer that goes beyond the available texts. To do this, knowledge common to most areas must be added to the QA system, as well as means for automatically inferring new knowledge.
  • User profiles of QA systems. Information about the user, such as their area of interest, manner of speech and reasoning, and default facts, could significantly improve system performance.

2. QA system Start

The Start QA system is an example of a general question-answering system that answers arbitrary queries formulated in English. It is being developed at the MIT Artificial Intelligence Laboratory under the direction of Boris Katz. The system first appeared on the Internet in 1993 and is now available at http://start.csail.mit.edu. When searching for an answer to a question, the system uses both a local knowledge base and a number of information resources on the Internet.

The system can answer various types of questions, which can be divided into the following categories:

Questions about definitions (What is a fractal?)

Factual questions (Who invented the telegraph?)

Questions about relationships (What country is bigger, Russia or USA?)

List queries (Show me some poems by Alexander Pushkin)

The core of the system is the Knowledge Base. Two modules, the Parser and the Generator, can respectively convert English texts into the special form (T-expressions) in which they are stored in the Knowledge Base and, conversely, generate English texts from a set of T-expressions.

2.1 Ternary expressions

A ternary expression (T-expression) is an expression of the form <object relation subject>. Other T-expressions can themselves act as the objects or subjects of a T-expression. Adjectives, possessive pronouns, prepositions, and other parts of the sentence are used to create additional T-expressions. The remaining attributes of the sentence (articles, verb tenses, adverbs, auxiliary verbs, punctuation marks, etc.) are stored in a special History structure associated with the T-expression.

For example, the sentence "Bill surprised Hillary with his answer", after passing through the Parser, will be converted into two ternary expressions: <<Bill surprise Hillary> with answer> and <answer related-to Bill>. Information about the tense of the verb surprise will be stored in the History structure.

Suppose the system, whose Knowledge Base contains the two T-expressions described above, is asked the question "Whom did Bill surprise with his answer?". The question will be processed in the following order:

1. The question analyzer converts the question to a template, undoing the inversion used when formulating questions in English: "Bill surprised whom with his answer?".

2. The Parser translates the sentence into two T-expressions: <<Bill surprise whom> with answer> and <answer related-to Bill>.

3. The resulting template is checked against the T-expressions in the Knowledge Base. A match is found with whom = Hillary.

4. The Generator converts the T-expressions <<Bill surprise Hillary> with answer> and <answer related-to Bill> into a sentence and returns it as the answer.

The search for answers to questions like "Did Bill surprise Hillary with his answer?" is carried out similarly; only in this case an exact match with expressions in the Base is sought, rather than a match against a template.
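For illustration, the matching in steps 2-3 can be sketched with nested tuples standing in for T-expressions; the representation and the single ?whom variable are simplifications of this sketch, not Start's internal format:

```python
WHOM = "?whom"  # variable produced by the question analyzer

# Knowledge Base after parsing "Bill surprised Hillary with his answer".
kb = {
    (("Bill", "surprise", "Hillary"), "with", "answer"),
    ("answer", "related-to", "Bill"),
}

def unify(pattern, fact, bindings):
    """Match a template T-expression against a stored one, binding ?whom."""
    if pattern == WHOM:
        bindings[WHOM] = fact
        return True
    if isinstance(pattern, tuple) and isinstance(fact, tuple) \
            and len(pattern) == len(fact):
        return all(unify(p, f, bindings) for p, f in zip(pattern, fact))
    return pattern == fact

# "Whom did Bill surprise with his answer?" -> <<Bill surprise ?whom> with answer>
query = (("Bill", "surprise", WHOM), "with", "answer")
for fact in kb:
    bindings = {}
    if unify(query, fact, bindings):
        print(bindings[WHOM])  # -> Hillary
```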

Thus, T-expressions preserve, to some extent, information about the semantic relationships between words. In 2002, a series of experiments was carried out to evaluate the effectiveness of search based on T-expressions compared to keyword search. After the Parser had processed an encyclopedia with descriptions of various animal species, the system was asked: "What do frogs eat?" The search method described above returned 6 answers, 3 of which were correct. A keyword-based search over the source documents returned 33 results, including the same 3 correct answers, but the rest were random co-occurrences of the words frogs and eat (for example, answers to the question "Who eats frogs?"). Thus, the search based on T-expressions produced 10 times fewer incorrect answers.

2.2 S-rules

In addition to T-expressions, the Knowledge Base also stores a list of S-rules: rules for converting T-expressions into equivalent forms. The point is that the same idea can be expressed in natural language in different ways. For example, the sentences "Bill's answer surprised Hillary" and "Bill surprised Hillary with his answer" are equivalent, but the T-expressions obtained by passing them through the Parser are different: <answer surprise Hillary>, <answer related-to Bill> versus <<Bill surprise Hillary> with answer>, <answer related-to Bill>. Therefore, the S-rule Surprise is introduced:

<<n1 surprise n2> with n3>, <n3 related-to n1> = <n3 surprise n2>, <n3 related-to n1>,

where ni ∈ Nouns.

Using such rules, we can describe the so-called linguistic variations, that is, equivalent transformations of language constructs:

Lexical (synonyms)

Morphological (same root words)

Syntactic (inversions, active/passive voice, ...)

In addition, S-rules can describe logical implications. For example:

<<A sell B> to C> = <<C buy B> from A>

2.3 Lexicon

Many S-rules apply to whole groups of words. For example, the S-rule Surprise described earlier holds not only for the verb surprise but for any verb from the so-called group of emotional-reaction verbs. To avoid multiplying S-rules, a Lexicon was created that stores all the words of the English language; each word is associated with a list of the groups to which it belongs. Now the S-rule Surprise can be made even more abstract:

<<n1 v n2> with n3>, <n3 related-to n1> = <n3 v n2>, <n3 related-to n1>,

where ni ∈ Nouns, v ∈ emotional-reaction-verbs.

2.4 WordNet

In addition to the Lexicon, which stores words grouped by various syntactic and semantic characteristics, the Start system uses another powerful tool for processing word semantics: the WordNet dictionary. The basic unit in this dictionary is the synset. A synset is a certain meaning, a sense. Different words can have the same meaning (synonyms) and therefore belong to one synset; conversely, one word can have several meanings, that is, belong to several synsets. In addition, the WordNet dictionary introduces relations between synsets. For example, the following relations exist between nouns:

- Hypernyms: Y is a hypernym of X if X is a kind of Y (fruit is a hypernym of peach)

- Hyponyms: Y is a hyponym of X if Y is a kind of X (peach is a hyponym of fruit)

- Coordinate terms: X and Y are coordinate terms if they have a common hypernym (peach and apple are coordinate terms)

- Holonyms: Y is a holonym of X if X is a part of Y (peach is a holonym of pit)

- Meronyms: Y is a meronym of X if Y is a part of X (peel is a meronym of peach)

Thus, the WordNet dictionary describes general-particular and part-whole relations between meanings.

WordNet is used when searching for matches in the Knowledge Base. For example, suppose the Base stores the T-expression <bird can fly> and the WordNet dictionary establishes that canary is a hyponym of bird. Let the question "Can a canary fly?" be asked. The Parser converts this question into the expression <canary can fly>. If it does not find a match in the Base, Start will use WordNet and try to find the answer to a more general question: "Can a bird fly?" This question will be answered "Yes", from which, given that a canary is a kind of bird, Start will conclude that a canary can fly.
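This fallback through hypernyms is easy to illustrate with NLTK's WordNet interface (a minimal sketch assuming the nltk package and its WordNet corpus are installed; the one-fact knowledge base is a stand-in):

```python
from nltk.corpus import wordnet as wn  # requires the WordNet corpus

def is_kind_of(word, ancestor):
    """True if some noun sense of `word` has `ancestor` among its hypernyms."""
    target = wn.synset(ancestor + ".n.01")
    for synset in wn.synsets(word, pos=wn.NOUN):
        # closure() walks the hypernym chain up to the root of the hierarchy
        if target in synset.closure(lambda s: s.hypernyms()):
            return True
    return False

# The Knowledge Base stores only the general fact <bird can fly>.
kb = {("bird", "can", "fly")}

def answer(subject, relation, obj):
    if (subject, relation, obj) in kb:
        return "Yes"
    # Fallback: generalize the subject via WordNet, as Start does.
    for s, r, o in kb:
        if (r, o) == (relation, obj) and is_kind_of(subject, s):
            return "Yes (a %s is a kind of %s)" % (subject, s)
    return "Unknown"

print(answer("canary", "can", "fly"))  # -> Yes (a canary is a kind of bird)
```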

2.5 Omnibase

To find answers to factual questions like "When did Beethoven die?" or "What is the capital of England?", Start uses the Omnibase database. This database uses a different information storage model: "object-property-value". For example, the information "Federico Fellini is the director of La Strada" will be stored in Omnibase as La Strada – director – Federico Fellini. Here La Strada is the object, director the property, and Federico Fellini the value of this property. With this data model, the search for the necessary information is quite fast and efficient.

To search for information, Omnibase uses a large number of external data sources from the Internet: Wikipedia, Google, the Internet Movie Database, etc. Data is extracted from an external source through a so-called wrapper: a module that provides access to the external base via queries of the "object-property" type. To determine the source in which information about a particular object is stored, Omnibase uses a Catalog of Objects, in which each object is associated with a data source. For example, the object La Strada corresponds to the imdb-movie base (Internet Movie Database). Having determined the base in which to search, Omnibase sends a request to that base's wrapper, (La Strada, director), and receives the answer Federico Fellini.
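Schematically, the object-property-value model, the wrapper, and the Catalog of Objects might look as follows (class names and stored data are illustrative stand-ins, not Omnibase's actual interfaces):

```python
class ImdbMovieWrapper:
    """Wrapper: answers (object, property) queries against one external source."""
    _data = {("La Strada", "director"): "Federico Fellini"}  # stand-in data

    def get(self, obj, prop):
        return self._data.get((obj, prop))

class Omnibase:
    def __init__(self):
        # Catalog of Objects: each object is associated with its data source.
        self.catalog = {"La Strada": ImdbMovieWrapper()}

    def query(self, obj, prop):
        wrapper = self.catalog.get(obj)  # choose the source for this object
        return wrapper.get(obj, prop) if wrapper else None

print(Omnibase().query("La Strada", "director"))  # -> Federico Fellini
```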

2.6 Natural language annotations

The problem of machine analysis of natural speech is very complex. Therefore, the developers of question-answering systems propose attacking this problem from two sides: on the one hand, improving natural language processing methods, teaching the computer to "understand" language; on the other hand, trying to make texts more understandable to computers. Namely, it is proposed to supply information resources with annotations in natural language.

In this case, it is possible to effectively organize a search not only over text but also over various multimedia information: images, video, and audio recordings. In the Start system, annotations are used as follows: when information is added to the Knowledge Base, the Parser processes only its annotation and attaches a link to the source resource to the generated T-expressions.

Annotations are implemented through RDF (Resource Description Framework) descriptions attached to each resource. The RDF language is based on the XML format. The description of this language is quite extensive, so we limit ourselves to an example of an RDF description of a certain database storing geographic information. The parameterized annotations "Many people live in ?s" and "population of ?s" are attached to this base, along with the response template "The population of ?s is ?o", where ?o denotes accessing the base and retrieving the population property of the object ?s. When processing such an annotation, the Parser saves the two question templates and a link to the answer template. If, when executing a user request, Start finds a match with a question template in the Knowledge Base, it contacts the external resource from which the annotation was taken, and an appropriate response is generated.

In addition, parameterized annotations can describe the search pattern for answering an entire class of questions. For example, questions like "What is the country in Africa with the largest area?" or "What country in Europe has the lowest infant mortality rate?" fall under one template: "What country in $region has the largest $attribute?". The annotation then describes a general algorithm for finding answers to such questions.

Some questions are a composition of several questions. For example, to answer the question "Is Canada's coastline longer than Russia's coastline?", one must first compute the lengths of the coastlines of Canada and Russia, and then compare the obtained values and generate an answer. Therefore, for this kind of question, a plan for finding the answer can be described, within which auxiliary questions are asked.

2.7 Conclusion

The Start question-answer system uses a differentiated approach to finding answers depending on the type of question. This gives a relatively good result for a large number of general questions.

The Knowledge Base and the ternary expressions underlying it are a successful model for representing information: on the one hand, it preserves to some extent the semantic connections between words; on the other, it is simple enough for efficient search and editing of the Base.

Annotations can be used to organize programmatic access to Internet information resources using a universal natural language interface. And the use of additional structures, such as Omnibase, makes it possible to increase the efficiency of finding answers to some specific types of questions.

Finally, various dictionaries and linguistic modules can, to some extent, model the semantic features of natural language and handle more complex queries. The task of compiling such dictionaries, like other problems associated with the development of question-answering systems, inevitably requires the involvement of specialists not only in computer science but also in linguistics and philology.

3. Statistical techniques for natural language analysis

3.1 Introduction

Let us consider the process of analyzing sentences. Our task will be to build a parse tree for each sentence. Due to the relative complexity of the Russian language and the lack of literature and scientific works on the subject, the examples below are drawn from English. An example of such an analysis is shown below.

Fig. 1 Parse tree for the phrase “The dog ate”

In Fig. 1, the vertices (det, noun, np, etc.) represent logical combinations of parts of a sentence. For example, np (noun phrase) means that this tree node is responsible for a part of the sentence that plays the role of a noun. Note that for any phrase, even such a simple one, there may be several parse trees, which differ in that they give different meanings to the same phrase. For example, one can say: "I ate meat with dogs." From this sentence you can get two completely different parse trees: in one case it turns out that I ate meat in the company of dogs, and in the other that I ate some kind of meat mixed with dog. The most amazing thing is that such "wonderful" examples are found everywhere in the English-language literature, so we have to be content with them. To avoid such absurdities, a separate analyzer would be needed to help our parser to the best of its ability. In this work, we will build a parser that itself takes syntactic connections into account when constructing a parse tree.

3.2 Determining parts of speech for words in sentences

In English, the task of this part is called Part-of-Speech tagging and is one of the many subtasks of a field of modern science called NLP (Natural Language Processing). In general, NLP aims to enable a computer to understand texts in natural language. These problems are now widely encountered, and effective solutions for them are in great demand. It would, of course, be great if a program, having "read" a physics textbook, independently answered questions like "What is the reason for the heating of the semiconductor in such-and-such an experiment?" Here another difficulty is immediately visible: even after reading the textbook, the program must still understand the user's questions and also, preferably, be able to generate its own questions (the dream of some lazy teachers).

Let's return to the question already posed: “How to determine the part of speech for a word in a sentence?”

Antonyms" href="/text/category/antonimi/" rel="bookmark">antonyms, etc. Since we are looking at a statistical approach, for each word we will consider the probability that it will be a noun, adjective, etc. e. We can construct such a table of probabilities on the basis of test texts that have already been manually analyzed. In Fig. 2, those parts of speech that are determined for words using this approach are highlighted in bold. One of the possible problems– although “can” is in most cases a modal verb, sometimes it can be a noun. It turns out that this method will always treat “can” as a modal verb. Despite its simplicity and obvious disadvantages, this method shows good results and, on average, recognizes 90% of words correctly. Formalizing the results obtained, we will write the product that must be maximized during of this algorithm:

The following notations are introduced here:

- t – a tag (det, noun, …);
- w – a word in the text (can, will, …);
- p(t | w) – the probability that tag t corresponds to word w.

Taking into account the shortcomings of the previous model, a new one was created that takes into account the fact that, for example, according to statistics, an adjective is followed by another adjective or a noun. It is worth noting that these, like all other statistics, are obtained from some sample, and the case where there are no initial statistics will not be considered. Based on this idea, the following formula was derived:

P = ∏i p(wi | ti) · p(ti | ti−1) → max

where:

- p(w | t) – the probability that word w corresponds to tag t;
- p(t1 | t2) – the probability that t1 comes after t2.

As can be seen from the proposed formula, we try to select tags so that the word matches the tag and the tag matches the previous tag. This method shows better results than the previous one, which is quite natural: for example, it can recognize "can" as a noun rather than always as a modal verb.

The constructed model for calculating the probability that a set of tags will correspond to a sentence, as it turns out, can be interpreted as a “Hidden Markov Model”.

We get something like a finite state machine; let us describe how it is obtained. The vertices are parts of speech. A pair (word, probability) at a vertex gives the probability that a word assigned to this part of speech is exactly this word; for example, for the vertex "det" and the word "a", it is the probability that a randomly taken article in the test text is "a". The transitions show how likely one part of speech is to follow another. For example, the probability that two articles appear in a row, given that an article has been encountered, is 0.0016.

Our task is to find a path in this model such that the product of the numbers on the edges and at the vertices is maximal. A solution to this problem exists, but we will not dwell on it, since the issue is beyond the scope of this work; we only note that there are algorithms that solve it in time linear in the number of vertices. Let us add that, according to the existing classification, we have obtained a "canonical statistical tagger".
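The linear-time search mentioned above is the Viterbi algorithm; below is a compact sketch (all probabilities are made-up illustrations):

```python
def viterbi(words, tags, p_start, p_trans, p_word):
    """Find the tag path maximizing the product of vertex and edge weights,
    i.e. prod p(w_i | t_i) * p(t_i | t_prev); time is linear in len(words)."""
    eps = 1e-12  # stand-in probability for unseen pairs
    best = {t: (p_start.get(t, eps) * p_word.get((words[0], t), eps), [t])
            for t in tags}
    for w in words[1:]:
        prev = best
        best = {}
        for t in tags:
            score, path = max(
                ((s * p_trans.get((q, t), eps) * p_word.get((w, t), eps), p + [t])
                 for q, (s, p) in prev.items()),
                key=lambda x: x[0])
            best[t] = (score, path)
    return max(best.values(), key=lambda x: x[0])[1]

tags = ["det", "noun", "verb"]
p_start = {"det": 0.8, "noun": 0.1, "verb": 0.1}
p_trans = {("det", "noun"): 0.6, ("noun", "verb"): 0.4}
p_word = {("the", "det"): 0.5, ("dog", "noun"): 0.01, ("ate", "verb"): 0.02}

print(viterbi(["the", "dog", "ate"], tags, p_start, p_trans, p_word))
# -> ['det', 'noun', 'verb']
```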

Let us now consider another approach to assigning tags, called the transformational scheme. The idea is that a trivial algorithm is first applied to the test sentences, and then the system considers all rules of the form "change a word's tag X to tag Y if the tag of the previous word is Z". The total number of such rules is the number of tags cubed, which is relatively small. At each step we try to apply such a rule, and if afterwards the number of correctly identified parts of speech increases, the rule becomes a candidate for the title of best rule of that step. The best rule is then selected and added to the list of "good" rules. We repeat this several times and obtain N rules that "well" improve tagging on the sentences of the test set. Then, when parsing an arbitrary sentence, after applying the trivial algorithm we use the already-prepared rules. One of the main advantages of this algorithm is speed: 11,000 words/sec versus 1,200 words/sec for the HMM-based algorithm.

In conclusion, we note that so far we have assumed the presence of a voluminous initial base. If there is none, HMM training does not lead to significant improvements (effectiveness is 90%), while the transformational scheme (TS) reaches 97%. Recall that effectiveness is measured as the share of correctly identified tags on test texts.

3.3 Creating parse trees from sentences

Fig. 4. Parse tree for the sentence "The stranger ate the donut with a fork."

The task of this section is the construction of parse trees similar to the one shown in Fig. 4. Note at once that the Internet offers a rich collection of already-built trees for the corresponding sentences from the initial database; more about this system can be learned on its website. Let us immediately discuss the question of checking parsers: we simply feed them sentences from this collection as input and check the resulting trees for matches. This can be done in several ways, but in this work we will use one of the approaches already proposed. In the tree space we introduce two metrics: precision and recall. Precision is defined as the number of correctly identified non-terminal vertices divided by their total number. Recall equals the number of correctly found vertices divided by the number of non-terminals of the same sentence in the database. It is stated that the simplest approach to building a tree immediately gives 75% effectiveness by both metrics; modern parsers, however, reach 87-88% (hereinafter, unless specifically stated otherwise, effectiveness refers to both metrics).

Let's divide our task into 3 main stages:

    1. Finding the rules to apply;
    2. Assigning probabilities to the rules;
    3. Finding the most probable parse tree.

One of the simplest mechanisms for solving these tasks is "Probabilistic Context-Free Grammars" (PCFG). Let's look at an example grammar that makes the concept easier to understand:

    sp → np vp (1.0)
    vp → verb np (0.8)
    vp → verb np np (0.2)
    np → det noun (0.5)
    np → noun (0.3)
    np → det noun noun (0.15)
    np → np np (0.05)

Rules are written here for parsing the corresponding vertices, and for each rule there is a probability of its application. Thus, we can calculate the probability of the tree “π” matching its sentence “s”:

margin-top:0cm" type="disc"> s – initial sentence π – the tree we obtained c – runs through the internal vertices of the tree r(c) – probability of using r for c

We will not give exact algorithms; let us only say that iterating over the parse trees of a sentence of length N under a PCFG (for example, by CKY-style dynamic programming) takes time cubic in N. Unfortunately, PCFGs by themselves do not produce “good” statistical parsers, which is why they are not widely used in their pure form.
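For concreteness, here is a Python sketch that scores a parse tree under the toy grammar above, following equation (1). The (label, [children]) tree encoding and the treatment of lexical rules are assumptions of the sketch.

    # A minimal sketch: the probability of a tree is the product of the
    # probabilities of the rules used at its internal vertices (equation (1)).
    PCFG = {
        ("sp", ("np", "vp")): 1.0,
        ("vp", ("verb", "np")): 0.8,
        ("np", ("det", "noun")): 0.5,
        ("np", ("noun",)): 0.3,
    }

    def tree_prob(node):
        """node: (label, [children]); a plain string is a word."""
        if isinstance(node, str):
            return 1.0
        label, children = node
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        p = PCFG.get((label, rhs), 1.0)   # lexical rules default to 1.0 here
        for c in children:
            p *= tree_prob(c)
        return p

    tree = ("sp", [("np", [("det", ["the"]), ("noun", ["stranger"])]),
                   ("vp", [("verb", ["ate"]), ("np", [("noun", ["donut"])])])])
    print(tree_prob(tree))  # 1.0 * 0.5 * 0.8 * 0.3 = 0.12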

3.4 Creating your own parsing rules based on PCFG. Treebank grammars. “Markov grammars”

Let us consider the main tasks that need to be solved in order to parse a sentence:

    1. Constructing our own grammar in PCFG form (desirably, our sentence should have at least one derivation in this grammar).
    2. A parser that applies the given rules to a sentence and obtains some or all of the possible parse trees.
    3. The ability to find the trees that are optimal with respect to equation (1).

An overview of the last two problems was given in the previous part, so let us now dwell on the first point. To begin with, we offer a simple solution. Suppose we already have a ready-made collection of parse trees. Then, processing each of these trees, we simply make a rule out of every non-terminal vertex, based on how it is expanded in the particular tree. If such a rule already exists, we increase its count by 1; if it does not, we add the new rule to our grammar with a count of 1. After processing all the training trees, we normalize the counts so that, for each non-terminal, the probabilities of its rules sum to 1. The effectiveness of such models is about 75%. They are called “Treebank grammars”.
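A Python sketch of this counting-and-normalizing procedure, using the same assumed tree encoding as in the sketches above:

    # A minimal sketch of reading off a treebank grammar: count every rule
    # occurrence, then normalize counts per left-hand side into probabilities.
    from collections import Counter, defaultdict

    def treebank_grammar(trees):
        counts = Counter()
        def collect(node):
            if isinstance(node, str):
                return
            label, children = node
            rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
            counts[(label, rhs)] += 1
            for c in children:
                collect(c)
        for t in trees:
            collect(t)
        totals = defaultdict(int)
        for (lhs, _), n in counts.items():
            totals[lhs] += n
        # rules with the same left-hand side now sum to 1
        return {rule: n / totals[rule[0]] for rule, n in counts.items()}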

Now let us talk a little about the approach that allows inventing new rules on the fly (“Markov grammars”). To do this, based on the training trees, we collect statistics for the following value: p(t1 | f, t2), the probability that tag “t1” will occur after tag “t2” when expanding the form “f”. For example, p(adj | np, det) is the probability that, while expanding a noun phrase (a loose translation of np), an article will be followed by an adjective. Based on this, the probability of applying any rule, even one never seen in the training collection, can be composed from such bigrams, roughly as follows:

    p(f → t1 t2 … tn) ≈ p(t1 | f, BEGIN) · p(t2 | f, t1) · … · p(STOP | f, tn)

where BEGIN and STOP are boundary markers for the beginning and end of the right-hand side.
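A Python sketch of collecting these statistics and scoring an arbitrary (possibly unseen) rule; the BEGIN/STOP markers and the tree encoding are assumptions carried over from the sketches above.

    # A minimal sketch of a "Markov grammar": rule probabilities are composed
    # from tag-bigram statistics p(t1 | f, t2) gathered from training trees.
    from collections import Counter

    def bigram_stats(trees):
        pair, ctx = Counter(), Counter()
        def collect(node):
            if isinstance(node, str):
                return
            f, children = node
            tags = ["BEGIN"] + [c if isinstance(c, str) else c[0]
                                for c in children] + ["STOP"]
            for prev, cur in zip(tags, tags[1:]):
                pair[(f, prev, cur)] += 1
                ctx[(f, prev)] += 1
            for c in children:
                collect(c)
        for t in trees:
            collect(t)
        def p(t1, f, t2):   # p(t1 | f, t2)
            return pair[(f, t2, t1)] / ctx[(f, t2)] if ctx[(f, t2)] else 0.0
        return p

    def rule_prob(p, f, rhs):
        # probability of the rule f -> rhs as a product of tag bigrams
        tags = ["BEGIN"] + list(rhs) + ["STOP"]
        prob = 1.0
        for prev, cur in zip(tags, tags[1:]):
            prob *= p(cur, f, prev)
        return prob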

3.5 Lexical parsers

The main idea of this part is to change the structure of the tree in order to improve the effectiveness of our model. Now we will not just build a parse tree, as above, but will additionally assign to each vertex a word that best characterizes it as a lexical unit. For a vertex “c” we denote this word by head(c). Head(c) is defined by applying a certain function to the children of “c” and to the rule by which “c” was expanded. The point is that when constructing heads we take into account that some words frequently occur together; having such statistics, we can improve the probability estimate of the correct parse for some sentences. For example, the phrase “the August merchandise trade deficit” contains four nouns in a row, so the previous models would assign a very low probability to its correct parse. But the fact that “deficit” is the head of this “np”, and that the training texts contained expressions combining “deficit” with the other words, will help us build the parse tree correctly. Let us now formalize the above with a formula:

    p(π) = ∏ p(r(c) | h(c)) · p(h(c) | m(c), t(c)) over the vertices c of the tree

where:
    p(r | h) – the probability that rule r will be applied at a vertex with the given head h;
    p(h | m, t) – the probability that h heads a child vertex with tag t whose parent has head m.

To make the form of the formula clearer, consider the vertex with h(c) = “deficit” from the example above: the table in the source compares the probabilities of its children's heads with and without conditioning on this head.

The concept of conditional probability is actively used here: the probability that the word at a child of vertex “c” is “August” turns out to be higher if we assume that head(c) = “deficit”. In effect, we make our cases more specific, so that even very rare rules, such as np → det propernoun noun noun noun, can receive a reasonably good probability, and we can then process very complex texts. In this case it does not matter to us that the rule we would like to apply may be absent from the initial collection of rules.
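A Python sketch of scoring such a lexicalized tree; the (label, head, [children]) encoding, the back-off constant for unseen events, and the probability tables are all hypothetical assumptions of the sketch.

    # A minimal sketch of a lexicalized tree score: rule probabilities are
    # conditioned on the head word, and child heads on the parent head.
    P_RULE = {   # p(rule | head): hypothetical values
        (("np", ("det", "propernoun", "noun", "noun", "noun")), "deficit"): 0.01,
    }
    P_HEAD = {   # p(head | parent_head, tag): hypothetical values
        ("August", "deficit", "propernoun"): 0.2,
    }
    BACKOFF = 1e-6   # assumed floor for unseen rules and heads

    def lex_tree_prob(node, parent_head=None):
        """node: (label, head, [children]); a plain string is a word."""
        if isinstance(node, str):
            return 1.0
        label, head, children = node
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        p = P_RULE.get(((label, rhs), head), BACKOFF)
        if parent_head is not None:
            p *= P_HEAD.get((head, parent_head, label), BACKOFF)
        for c in children:
            p *= lex_tree_prob(c, head)
        return p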

3.6 Conclusion

The statistical approach makes it possible to solve many NLP problems and is one of the fairly new and rapidly developing areas of mathematical linguistics. This work has covered only the basic concepts and terms, leaving the reader freedom of choice when reading specific studies on the topic. Unfortunately for Russian-speaking readers, the number of studies and works on this topic in Russia is small, and all the material had to be taken from English-language sources. Perhaps you are the very person who can change this situation and pick up the initiative of two Russian projects. One of them is non-commercial and is being developed at the PM-PU faculty of St. Petersburg State University; the other is a commercial product from RCO, and those interested can read the scientific works of this company on its website. All examples and pictures used in this article were taken from the sources listed in the links below.

4. Links

CLEF. http://clef-qa.itc.it/
WordNet. http://wordnet.princeton.edu/
Penn Treebank. http://www.cis.upenn.edu/~treebank/
Start. http://start.csail.mit.edu/
TREC. http://trec.nist.gov/
Eugene Charniak. “Statistical Techniques for Natural Language Parsing”.
Gary C. Borchardt. “Causal Reconstruction”.
Boris Katz, Beth Levin. “Exploiting Lexical Regularities in Designing Natural Language Systems”.
SEMLP. http:///
RCO. http://www.*****/

New information technologies

Lecture No. 2.2. Basic classes of natural language systems. Intelligent question and answer systems

1.1. Main classes of natural language systems

    Functional components of natural language systems
    Comparative characteristics of the main classes of NL systems

    1. Intelligent question and answer systems
        1. Information retrieval systems
        2. Database communication systems
        3. Expert systems
        4. Dialogue problem solving systems
        5. Smart storage and digital libraries
    2. Speech recognition systems
        1. Systems for recognizing isolated spoken commands
        2. Systems for recognizing keywords in a stream of continuous speech
        3. Continuous speech recognition systems
        4. The analysis-by-synthesis approach
        5. Lip reading systems
    3. Connected text processing systems
        1. Text summarization systems
        2. Text comparison and classification systems
        3. Text clustering systems
    4. Synthesis systems
        1. Speech synthesis systems
        2. Text-based video synthesis systems
    5. Machine translation and speech (text) understanding systems
        1. Phrase translation systems
        2. Contextual translation systems
        3. Speech (text) understanding systems
    6. Ontologies and thesauri
    7. Speech and text databases
    8. Components of intelligent systems

    Comparative characteristics of natural language systems
    Intelligent question and answer systems

Currently, the most popular products falling under the category of intelligent question-answering systems are information retrieval systems.

2.2.1.1. Information retrieval systems

The most well-known information retrieval systems, GOOGLE, Yandex and Rambler, have approximately the same capabilities and functionality. The only difference of GOOGLE from the rest is rather technical in nature: this system is implemented as a parallel distributed system using a large number of processors with their own memory. Perhaps it was this difference that played a decisive role in the undoubted superiority of this system over all the others, even though some of them had more intellectual functions. Natural language processing does not play a very big role in this or the other information retrieval systems, but the scale of their use in human-machine communication is very large.

Fig. 2.2. A typical information retrieval system.

The main functions of an information retrieval system are reduced to parsing sources, indexing the texts extracted from the sources, processing a user request, comparing the indexed database texts with the user request, and outputting the results. Recently, speech input has appeared in the GOOGLE system, which allows a limited-volume request to be entered by voice. Another function used in information retrieval systems is the representation of the structure of the system's world model, which serves as a means of navigating through the system's resources.

Thus, a standard information retrieval system contains seven main components (see Fig. 2.2): an information input block, a parsing block, a source indexing block, a user request processing block, a block for comparing source texts with the user request presented in natural language, a block for outputting results, and a block for structuring subject areas and navigation.

The main task of the input stage is to present the original set of texts and the user's request in a form convenient for the computer. Because of the large volume of information processed by information retrieval systems, the texts of the processed documents are usually not stored in the system; only their representations are stored. The texts are taken from the repository(ies) and processed from time to time (usually cyclically).

Such a representation of a text could be, for example, a list of keywords extracted from the text (represented by vector-space or n-gram models), but it may also be a network of co-occurrences of words in text fragments.

The main idea of the vector-space model is simple: the text is described by a lexical vector in Euclidean space; each vector component corresponds to some object contained in the text (a word, a phrase, a company name, a position, a personal name, etc.), which is called a term. Each term used in the text is assigned a weight (significance) determined on the basis of statistical information about its occurrence in the individual text. The dimension of the vector is the number of terms that occur in the texts.
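A minimal Python sketch of such a lexical vector, with the weight of a term taken as its relative frequency in the text; the tokenization is deliberately naive.

    # A minimal sketch of the vector-space representation: term -> weight,
    # with the weight based on the term's frequency of occurrence in the text.
    from collections import Counter

    def lexical_vector(text):
        tokens = text.lower().split()
        counts = Counter(tokens)
        total = len(tokens)
        return {term: n / total for term, n in counts.items()}

    print(lexical_vector("the trade deficit grew while the trade gap shrank"))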

In the polygram (n-gram) model, a text is represented as a vector whose elements correspond to all character combinations of length n over an alphabet of M characters (for the Russian language, M = 33). Each element of the vector is associated with the frequency of occurrence of the corresponding n-gram in the text. The dimension of the vector is strictly fixed for any text and, for n = 3, amounts to 33^3 = 35,937 elements. However, as practice shows, no more than 25-30 percent of the permissible n-grams actually occur in real texts, i.e., for the Russian language no more than about 7,000.
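A Python sketch of this character-trigram representation; naive lowercasing and the absence of alphabet filtering are simplifying assumptions.

    # A minimal sketch of the polygram (character n-gram) model: frequencies
    # of all length-n character combinations found in the text.
    from collections import Counter

    def ngram_vector(text, n=3):
        text = text.lower()
        grams = [text[i:i + n] for i in range(len(text) - n + 1)]
        total = len(grams)
        return {g: c / total for g, c in Counter(grams).items()}

    vec = ngram_vector("information retrieval systems")
    print(len(vec), "distinct trigrams out of the fixed M**n dimension")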

Network of co-occurrences of words in text fragments. Here the text is represented by a set of concepts and their relationships; both the concepts and the connections are assessed by weights.
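A Python sketch of building such a network, with concepts as words and connections as co-occurrences within a sliding window; the window size and tokenization are assumptions of the sketch.

    # A minimal sketch of a co-occurrence network: nodes (concepts) and
    # edges (connections) are both weighted by their observed counts.
    from collections import Counter

    def cooccurrence_network(text, window=5):
        words = text.lower().split()
        nodes, edges = Counter(words), Counter()
        for i, w in enumerate(words):
            for v in words[i + 1:i + window]:
                if v != w:
                    edges[tuple(sorted((w, v)))] += 1
        return nodes, edges

    nodes, edges = cooccurrence_network("trade deficit grew as trade gap widened")
    print(edges.most_common(3))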

A user's query, presented in natural language, is processed in a manner similar to the processing of information when indexing the source texts, in order to simplify the comparison of these natural-language texts. It is at the comparison stage that the search strategies proper are implemented.

Thus, in addition to the methods of internal representation of texts, the method of classifying (comparing) texts plays a significant role in information retrieval systems. Currently, the following types of classifiers are in practical use:

1. Statistical classifiers based on probabilistic methods. The most famous in this group is the family of Bayesian classifiers. Their common feature is a classification procedure based on Bayes' formula for conditional probability.

The classical method of text classification makes very strong assumptions about the independence of the events involved (the appearance of words in documents), but practice shows that the naive Bayes classifier nevertheless turns out to be very effective.
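A minimal Python sketch of such a naive Bayes text classifier with add-one smoothing; the corpus format is an assumption of the sketch.

    # A minimal sketch of a naive Bayes classifier: Bayes' formula plus the
    # "naive" assumption that words appear independently given the class.
    import math
    from collections import Counter, defaultdict

    def train(docs):
        """docs: list of (tokens, label) pairs."""
        priors, words, totals, vocab = Counter(), defaultdict(Counter), Counter(), set()
        for tokens, label in docs:
            priors[label] += 1
            for t in tokens:
                words[label][t] += 1
                totals[label] += 1
                vocab.add(t)
        return priors, words, totals, vocab

    def classify(tokens, model):
        priors, words, totals, vocab = model
        n = sum(priors.values())
        def score(label):
            # log p(label) + sum of log p(word | label), add-one smoothed
            s = math.log(priors[label] / n)
            for t in tokens:
                s += math.log((words[label][t] + 1) / (totals[label] + len(vocab)))
            return s
        return max(priors, key=score)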

2. Classifiers based on similarity functions. The most characteristic feature of such classifiers is the use of the lexical vectors of the term-document model, which are also used in neural classifiers. The similarity measure is usually the cosine of the angle between the vectors, calculated through their scalar product.
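A Python sketch of such a similarity-function classifier over lexical vectors like the ones built above; the reference class vectors are an assumption of the sketch.

    # A minimal sketch of cosine similarity through the scalar product,
    # and classification by the nearest reference vector.
    import math

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def classify(doc_vec, class_vectors):
        # assign the class whose reference vector is closest by cosine
        return max(class_vectors, key=lambda c: cosine(doc_vec, class_vectors[c]))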

In light of the above, the following search strategies are used in information retrieval systems.

1. Search based on keywords. Keywords are usually supplied with weight characteristics that determine the weight of the word in the text. The numerical characteristic is based on the frequency of occurrence of words in the text. However, the semantic weight of a word differs from the frequency of its appearance in the text.

2. Information about the order of keywords in text fragments is also very important. To increase search efficiency in this case, n-grams of key concepts are used instead of individual key concepts.

3. The frequency of co-occurrence of keywords in text fragments is also used in search. The internal structure of the text in terms of keywords and their relationships (the semantic portrait of the text) is the basis for text representation in information retrieval systems. The semantic portrait makes it possible to reveal the logical structure of the text (and of the entire text corpus), which improves the quality of the search and speeds it up.

4. Recently, so-called fuzzy comparison has also come into use in search.

To improve search results, the user can change the query; this is what feedback is for. Information processing in an information retrieval system also includes structuring the information for subsequent navigation, including its clustering.

The output of results should be understood as the links to the source text(s) that the system gives to the user. This could be a citation system or the numbers of the documents storing the analyzed texts.

Searching for the necessary information on the web using a search engine usually proceeds as follows. The user enters one or more search terms into the search engine's dialog box. The search engine returns search results that match these terms; for example, it may return a list of web addresses (URLs) pointing to documents that contain one or more of the search words of the request.

Shortly after the advent of list processing in computer science, the BASEBALL program was written to illustrate how the new methods could be applied to question-answering systems (Green, Wolf, Chomsky, & Laughery, 1961; Green, 1963). This program was designed to answer questions about the 1959 American Baseball League games, hence its name. Although the social value of this application is questionable, it provided a good testbed for programming principles that have since found widespread use. Messages to the program were composed in a simple subset of English, on which we will practically not dwell; much more interesting is the data structure used.

The data in the BASEBALL program were organized into a hierarchical system; this data structure can equivalently be represented as a tree. The highest level was YEAR (only data for 1959 was used, but the program provided for the possibility of several years), followed, in order of subordination, by MONTH and PLACE. Once YEAR, MONTH and PLACE were determined, the game number, the day, and the score (the points won by each team) were indicated sequentially.

In general terms, the format of the data structure was as follows:

    YEAR
        MONTH
            PLACE
                GAME SERIAL NUMBER
                    DAY
                    (TEAM, SCORE), (TEAM, SCORE)

Clearly, this form of data structure is not unique to baseball, and the data-processing routines in the BASEBALL program were written with the intention of working with any hierarchical data structure, regardless of the interpretation of its various levels and branches.

The workings of the BASEBALL program can be understood by considering two concepts: the data path and the specification list.

A data path is a sequence of branches that must be followed to obtain information about a specific game.

For example, a path of the form, say, YEAR = 1959, MONTH = July, PLACE = Boston, GAME = 96 defines a particular game and, along the way, establishes some information about it. Each game has a single data path associated with it, and the entries on the path define, as in this example, the characteristics of the game. To generate all possible data paths, any simple tree-search algorithm can be used, since the data tree is obviously finite.
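A Python sketch of such a hierarchy and of enumerating its data paths; the nested-dict encoding and the sample game (teams, scores, game number) are illustrative assumptions.

    # A minimal sketch of a BASEBALL-like hierarchical store and of
    # generating every data path by a simple traversal of the finite tree.
    DATA = {
        1959: {                      # YEAR
            "July": {                # MONTH
                "Boston": {          # PLACE
                    96: {"DAY": 7, "TEAM": "Red Sox", "OPPONENT": "Yankees",
                         "SCORE": "5-3"},
                },
            },
        },
    }

    def data_paths(data):
        """Yield one list of (attribute, value) pairs per game."""
        for year, months in data.items():
            for month, places in months.items():
                for place, games in places.items():
                    for game_no, attrs in games.items():
                        path = [("YEAR", year), ("MONTH", month),
                                ("PLACE", place), ("GAME", game_no)]
                        yield path + list(attrs.items())

    for path in data_paths(DATA):
        print(path)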

A specification list is a list of attributes that a data path must have in order to be a valid answer to a question. For example, the specification list for the question

Where (in what places) did the Red Sox team play in July? (1)

would contain, roughly, the pairs TEAM = Red Sox, MONTH = July, and PLACE = ? (a blank value to be filled in).

Suppose the language processor has generated a specification list for a question. The hierarchical data processor takes the specification list and systematically generates all data paths that match it. A path matches the specification list if

(a) a feature-value pair (for example, MONTH = July) is contained both in the specification list and on the path, or

(b) a feature-value pair in the specification list has a blank value (for example, PLACE = ?); in this case the corresponding value found on the data path is registered as a possible answer (in example (1), the list of such PLACE values would be the answer).

If a feature-value pair in the specification list has a blank value, it is consistent with any value of that feature on the data path; this kind of agreement imposes no constraint on the match.
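A Python sketch of this matching procedure, continuing the path representation above; the "?" marker for blank values is an assumption of the sketch.

    # A minimal sketch of matching a data path against a specification list:
    # exact pairs must agree, and blank values ("?") collect possible answers.
    def match(path, spec):
        """path: list of (attribute, value); spec: dict attribute -> value or '?'.
        Returns the values found for blank attributes, or None on mismatch."""
        attrs = dict(path)
        found = {}
        for attr, want in spec.items():
            if attr not in attrs:
                return None
            if want == "?":
                found[attr] = attrs[attr]    # register as a possible answer
            elif attrs[attr] != want:
                return None                  # constraint violated
        return found

    spec = {"TEAM": "Red Sox", "MONTH": "July", "PLACE": "?"}   # question (1)
    path = [("YEAR", 1959), ("MONTH", "July"), ("PLACE", "Boston"),
            ("GAME", 96), ("TEAM", "Red Sox")]
    print(match(path, spec))   # {'PLACE': 'Boston'}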

As already mentioned, the process of generating data paths and matching them against specification lists contains nothing specific to the baseball theme. Derived attribute-value pairs can also be matched, but these are application-specific. For example, consider the question:

How many games did teams win on their home fields in July? (2)

To answer it, the program must find all data paths defining games in which the PLACE value matches the home field of one of the teams and that team has the higher score. Clearly, the subroutine that compiles the corresponding specification list relies on knowledge of the game of baseball.

When the data paths satisfying the specification list are found, they are combined into a master list, which is also represented as a tree. For example, the paths answering question (1) can be summarized in a tree whose branches list the PLACE values that were found.

The answer to the question is compiled by reviewing the master list. In case (1), the answer is obtained by simply listing the PLACE values in the master list.

Fig. 14.1. Stages of answering questions in the BASEBALL program.

The answer to a slightly more complex question

How many places did the Red Sox play in July? (3)

can be obtained by counting the values in the master list.

A diagram of this question-answering procedure is shown in Fig. 14.1. The analyzer of the natural-language subset perceives a question in natural language, recognizes the type of question asked, and produces a specification list. This part of the BASEBALL program is necessarily tied to the field of application in two respects. Obviously, it must have access to the lexicon of the game. Less obviously, it must contain routines that transform natural-language expressions such as “how many” or “in what” into suitable specification lists. Thus, although Green et al. did not restrict the user to asking “indexed” questions, as is done in libraries, they predetermined the types of questions the system could receive.

In step (B), the program generates the master data list from the specification list. As already noted, large sections of this step are independent of the application, although individual routines may need derived test features. In the last step, the answer is derived from the master list; here again the programmer must anticipate the types of questions being asked and enter into the system a suitable answer-compilation routine for each type of question.

As can be seen from the blocks in Fig. 14.1, the BASEBALL program is not limited to questions that can be answered in a single pass through the data. Consider the question:

How many teams played in 8 places in July? (4)

The initial specification list for it would contain, roughly, TEAM = ?, MONTH = July, and a derived condition requiring the number of places to equal 8. The question defined by this specification list cannot be answered immediately. Instead, the processor must work through a chain of auxiliary questions. The number of places for each team cannot be read off the data directly, so the auxiliary question about it is remembered and a simpler question is generated, essentially “Where did each team play in July?”; this one can be answered by the question processor, giving a master list of places for each team. By counting the place names for each team, one obtains the answer to the remembered auxiliary question; converting that into a list of teams that played in 8 places immediately yields the answer to question (4).

The development of the BASEBALL program did not go beyond the scope of the initial project, the usual fate of artificial-intelligence systems. In fact, the idea of a hierarchical data structure seems to have since disappeared from programming practice. This is somewhat surprising, since hierarchical structures allow efficient data management, especially when large amounts of information must be kept partly in primary memory and partly in relatively slow, inexpensive memory devices (for details, see Sussenguth, 1963). In addition, hierarchical structures can be implemented by data-management techniques that are compatible with more traditional information-processing systems (Hunt and Kildall, 1971; Lefkowitz, 1969). Without a doubt, anyone who intends to use “understanding” programs must at some stage face the inevitable practical issues of cost and system compatibility. Perhaps in the future it would be worth returning to the principles implemented in this rather old program.