What is Data Mining

Classification of Data Mining tasks

Association rules search problem

Clustering problem

Features of Data Miner in Statistica 8

Analysis tools of STATISTICA Data Miner

Example of working in Data Miner

Creating reports and summaries

Sorting information

Analysis of prices of residential plots

Analysis of predictors of survival

Conclusion


What is Data Mining

The modern computer term Data Mining is translated as "extraction of information" or "mining of data". Along with Data Mining, the terms Knowledge Discovery and Data Warehouse are often used. The emergence of these terms, which are an integral part of Data Mining, is associated with a new stage in the development of tools and methods for processing and storing data. Thus, the goal of Data Mining is to identify hidden rules and patterns in large (very large) volumes of data.

The fact is that the human mind is not itself adapted to perceiving huge amounts of heterogeneous information. The average person, with rare exceptions, is not able to grasp more than two or three relationships, even in small samples. But traditional statistics, which has long claimed to be the main tool of data analysis, also often fails when solving real-life problems. It operates with average characteristics of the sample, which are often fictitious values (the average solvency of a client, when, depending on the risk or loss function, you need to be able to predict the solvency and intentions of a particular client; the average signal intensity, when you are interested in the characteristic features and preconditions of signal peaks; and so on).

Therefore, methods of mathematical statistics turn out to be useful mainly for testing pre-formulated hypotheses, while formulating a hypothesis is sometimes a rather complex and time-consuming task. Modern Data Mining technologies process information in order to automatically search for templates (patterns) characteristic of fragments of heterogeneous multidimensional data. Unlike online analytical processing (OLAP), Data Mining shifts the burden of formulating hypotheses and identifying unexpected patterns from humans to computers. Data Mining is not a single method but a combination of a large number of different knowledge discovery methods. The choice of method often depends on the type of data available and on what information you are trying to obtain. Some examples of methods: association (union), classification, clustering, time series analysis and forecasting, neural networks, etc.

Let us consider the properties of the discovered knowledge given in the definition in more detail.

The knowledge must be new, previously unknown. The effort spent on discovering knowledge that is already known to the user does not pay off. Therefore, it is new, previously unknown knowledge that is valuable.

Knowledge must be non-trivial. The results of the analysis should reflect non-obvious, unexpected patterns in the data, which constitute so-called hidden knowledge. Results that could have been obtained by simpler means (for example, visual inspection) do not justify the use of powerful Data Mining methods.

Knowledge must be practically useful. The knowledge found must be applicable, including to new data, with a sufficiently high degree of reliability. Its usefulness lies in the fact that this knowledge can bring certain benefits when applied.

Knowledge must be accessible to human understanding. The patterns found must be logically explainable, otherwise there is a possibility that they are random. In addition, the discovered knowledge must be presented in a form that is understandable to humans.

In Data Mining, models are used to represent the acquired knowledge. The types of models depend on the methods used to create them. The most common are: rules, decision trees, clusters and mathematical functions.

The scope of Data Mining is not limited in any way: Data Mining is needed wherever there is data. The experience of many enterprises shows that the return on data mining can reach 1000%. For example, there are reports of an economic effect 10 to 70 times higher than initial costs of 350 to 750 thousand dollars, and of a $20 million project that paid for itself in just 4 months. Another example is annual savings of $700 thousand achieved through the implementation of Data Mining in a supermarket chain in the UK. Data Mining is of great value to managers and analysts in their daily activities. Business people have realized that with the help of Data Mining methods they can gain tangible competitive advantages.

Classification of Data Mining tasks

Data Mining methods allow you to solve many of the problems that an analyst faces. The main ones are classification, regression, search for association rules, and clustering. Below is a short description of the main data analysis tasks.

1) The classification task comes down to determining the class of an object based on its characteristics. It should be noted that in this problem the set of classes to which an object can be classified is known in advance.

2) The regression problem, like the classification problem, allows you to determine the value of some parameter of an object based on its known characteristics. Unlike the classification problem, the range of the parameter is not a finite set of classes but the set of real numbers.

3) Association task. When searching for association rules, the goal is to find frequent dependencies (or associations) between objects or events. The found dependencies are presented in the form of rules and can be used both to better understand the nature of the analyzed data and to predict the occurrence of events.

4) The task of clustering is to search for independent groups (clusters) and their characteristics in the entire set of analyzed data. Solving this problem helps you understand the data better. In addition, grouping homogeneous objects makes it possible to reduce their number and, therefore, facilitate analysis.

5) Sequential patterns - establishing patterns between events related in time, i.e., detecting a dependency of the form: if event X occurs, then after a specified time event Y will occur.

6) Analysis of deviations - identifying the most uncharacteristic patterns.

The listed tasks are divided into descriptive and predictive according to their purpose.

Descriptive tasks focus on improving understanding of the data being analyzed. The key point in such models is the ease and transparency of the results for human perception. It is possible that the patterns discovered will be a specific feature of the particular data being studied and will not be found anywhere else, but it can still be useful and therefore should be known. This type of task includes clustering and searching for association rules.

Solving predictive problems is divided into two stages. At the first stage, a model is built based on a data set with known results. In the second stage, it is used to predict results based on new data sets. In this case, it is naturally required that the constructed models work as accurately as possible. This type of task includes classification and regression problems. This can also include the problem of searching for association rules, if the results of its solution can be used to predict the occurrence of certain events.

Based on the methods used to solve them, problems are divided into supervised learning (learning with a teacher) and unsupervised learning (learning without a teacher). This terminology comes from Machine Learning, a term often used in the English-language literature to denote all Data Mining technologies.

In the case of supervised learning, the problem of data analysis is solved in several stages. First, using some Data Mining algorithm, a model of the analyzed data – a classifier – is built. The classifier is then trained. In other words, the quality of its work is checked and, if it is unsatisfactory, additional training of the classifier occurs. This continues until the required level of quality is achieved or it becomes clear that the selected algorithm does not work correctly with the data, or the data itself does not have a structure that can be identified. This type of task includes classification and regression problems.

Unsupervised learning combines tasks that identify descriptive patterns, such as patterns in purchases made by customers at a large store. Obviously, if these patterns exist, then the model should represent them and it is inappropriate to talk about its training. Hence the name - unsupervised learning. The advantage of such problems is the possibility of solving them without any prior knowledge about the analyzed data. These include clustering and searching for association rules.

Classification and Regression Problem

When analyzing, it is often necessary to determine which of the known classes the objects under study belong to, i.e., to classify them. For example, when a person approaches a bank for a loan, the bank employee must decide whether the potential client is creditworthy or not. Obviously, such a decision is made on the basis of data about the object under study (in this case, a person): his place of work, salary, age, family composition, etc. As a result of analyzing this information, the bank employee must assign the person to one of two well-known classes: "creditworthy" and "uncreditworthy".

Another example of a classification task is email filtering. In this case, the filtering program must classify an incoming message either as spam (junk email) or as a legitimate letter. The decision is made based on the frequency of occurrence of certain words in the message (for example, the recipient's name, an impersonal salutation, words and phrases such as "acquire", "earn", "profitable offer", and so on).
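As an illustration of the classification task described above, here is a minimal sketch in Python, assuming the scikit-learn library is available; the applicant features and class labels are invented purely for illustration, and a decision tree is used as one possible classifier.

```python
# A minimal sketch of the credit-scoring classification task described above,
# assuming scikit-learn is available; the toy data below is invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# Each row: [monthly salary (thousands), age, number of dependents]
X_train = [
    [20, 25, 0], [55, 40, 2], [15, 22, 1], [80, 35, 1],
    [30, 50, 3], [65, 45, 0], [10, 19, 0], [45, 33, 2],
]
# Known outcomes for the training set: 1 = "creditworthy", 0 = "uncreditworthy"
y_train = [0, 1, 0, 1, 0, 1, 0, 1]

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Classify a new applicant based on the learned model
new_applicant = [[50, 28, 1]]
print(model.predict(new_applicant))  # prints the predicted class, 0 or 1
```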

OLAP systems provide the analyst with a means of testing hypotheses when analyzing data; that is, the main task of the analyst is to generate hypotheses, which he does based on his knowledge and experience. However, knowledge is contained not only in the person but also in the accumulated data being analyzed. Such knowledge is contained in a huge amount of information that a person cannot examine on his own. Because of this, there is a risk of missing hypotheses that could provide significant benefits.

To detect such "hidden" knowledge, special methods of automatic analysis are used, with the help of which knowledge has to be practically extracted from the "rubble" of information. The term "data mining" has been assigned to this area.

There are many definitions of Data Mining that complement each other. Here are some of them.

Data Mining is the process of discovering non-trivial and practically useful patterns in databases. (BaseGroup)

Data Mining is the process of extracting, exploring and modeling large volumes of data to discover previously unknown patterns in order to achieve business advantages. (SAS Institute)

Data Mining is a process that aims to discover new significant correlations, patterns and trends by sifting through large amounts of stored data using pattern recognition techniques plus the application of statistical and mathematical techniques. (Gartner Group)

Data Mining is the research and discovery by a "machine" (algorithms, artificial intelligence tools) of hidden knowledge in raw data: knowledge that was previously unknown, non-trivial, practically useful, and accessible to human interpretation. (A. Bargesyan, "Data Analysis Technologies")

Data Mining is the process of discovering useful knowledge about business. (N.M. Abdikeev, "KBA")

Properties of discovered knowledge

Let's consider the properties of the discovered knowledge.

  • The knowledge must be new, previously unknown. The effort spent on discovering knowledge that is already known to the user does not pay off. Therefore, it is new, previously unknown knowledge that is valuable.
  • Knowledge must be non-trivial. The results of the analysis should reflect non-obvious, unexpected patterns in the data, which constitute so-called hidden knowledge. Results that could be obtained by simpler methods (for example, visual inspection) do not justify the use of powerful Data Mining methods.
  • Knowledge must be practically useful. The knowledge found must be applicable, including on new data, with a sufficiently high degree of reliability. Usefulness lies in the fact that this knowledge can bring certain benefits when applied.
  • Knowledge must be accessible to human understanding. The patterns found must be logically explainable, otherwise there is a possibility that they are random. In addition, the discovered knowledge must be presented in a form that is understandable to humans.

In Data Mining, models are used to represent the acquired knowledge. The types of models depend on the methods used to create them. The most common are: rules, decision trees, clusters and mathematical functions.

Data Mining Tasks

Let us recall that Data Mining technology is based on the concept of templates, that is, patterns. It is by discovering these patterns, hidden from the naked eye, that Data Mining problems are solved. Different types of patterns that can be expressed in a human-readable form correspond to specific Data Mining tasks.

There is no consensus on which tasks should be classified as Data Mining. Most authoritative sources list the following: classification, clustering, prediction, association, visualization, deviation analysis and detection, estimation, link analysis, and summarization.

The purpose of the description that follows is to give a general idea of Data Mining problems, to compare some of them, and to present some of the methods by which these problems are solved. The most common Data Mining tasks are classification, clustering, association, forecasting and visualization. Thus, tasks are divided according to the type of information produced; this is the most general classification of Data Mining tasks.

Classification

This is the task of dividing a set of objects or observations into a priori specified groups called classes, within each of which the objects are assumed to be similar to one another, having approximately the same properties and characteristics. The solution is obtained on the basis of an analysis of the values of attributes (features).

Classification is one of the most important Data Mining tasks. It is used in marketing, in assessing the creditworthiness of borrowers, in determining customer loyalty, in pattern recognition, in medical diagnostics and in many other applications. If the analyst knows the properties of the objects of each class, then when a new observation is assigned to a certain class, these properties are automatically extended to it.

If the number of classes is limited to two, we speak of binary classification, to which many more complex problems can be reduced. For example, instead of defining degrees of credit risk such as "High", "Medium" or "Low", you can use only two: "Issue" or "Refuse".

Data Mining uses many different classification models: neural networks, decision trees, support vector machines, the k-nearest neighbors method, covering algorithms, etc. Supervised learning is used to build them, where the output variable (class label) is specified for each observation. Formally, classification is based on partitioning the feature space into regions within each of which the multidimensional vectors are considered identical. In other words, if an object falls into a region of the space associated with a certain class, it belongs to that class.

Clustering

Short description. Clustering is a logical continuation of the idea of classification. It is a more complex task; the peculiarity of clustering is that the object classes are not predefined. The result of clustering is the division of objects into groups.

An example of a method for solving a clustering problem: “unsupervised” training of a special type of neural networks - self-organizing Kohonen maps.

Associations

Short description. When solving the problem of searching for association rules, patterns are found between related events in a data set.

The difference between association and the two previous Data Mining tasks: the search for patterns is carried out not on the basis of the properties of the analyzed object but among several events that occur together. The best-known algorithm for finding association rules is the Apriori algorithm.
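To make the idea behind Apriori-style search more concrete, here is a minimal sketch in pure Python of its core counting step: finding item pairs whose joint support exceeds a threshold. The transactions and threshold are invented; a full Apriori implementation would also grow larger itemsets iteratively and derive rules from them.

```python
# A minimal sketch of the first step of frequent-itemset search (the idea behind Apriori):
# count how often pairs of items occur together and keep those above a support threshold.
# The transactions below are invented for illustration.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
]
min_support = 0.4  # an item set must appear in at least 40% of transactions

pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

n = len(transactions)
frequent_pairs = {p: c / n for p, c in pair_counts.items() if c / n >= min_support}
print(frequent_pairs)  # e.g. {('bread', 'milk'): 0.6, ('bread', 'butter'): 0.6, ...}
```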

Sequence or sequential association

Short description. Sequence analysis allows you to find temporal patterns between transactions. The sequence task is similar to association, but its goal is to establish patterns not between simultaneously occurring events but between events related in time (i.e., occurring at some specific interval in time). In other words, a sequence is determined by a high probability of a chain of events related in time. In fact, an association is a special case of a sequence with a time lag of zero. This Data Mining task is also called the sequential pattern finding task.

Sequence rule: after event X, event Y will occur after a certain time.

Example. After purchasing an apartment, residents purchase a refrigerator within two weeks in 60% of cases, and a TV within two months in 50% of cases. The solution to this problem is widely used in marketing and management, for example in customer lifecycle management (Customer Lifecycle Management).
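A minimal sketch of how such a sequential pattern could be counted, assuming a simple in-memory event log; the customers, events and dates are invented for illustration.

```python
# A minimal sketch of searching for a sequential pattern "X, then Y within a time window",
# following the apartment -> refrigerator example above; the event log is invented.
from datetime import date, timedelta

# (customer_id, event, date)
events = [
    (1, "apartment", date(2023, 3, 1)), (1, "refrigerator", date(2023, 3, 10)),
    (2, "apartment", date(2023, 4, 5)), (2, "tv", date(2023, 5, 20)),
    (3, "apartment", date(2023, 5, 1)), (3, "refrigerator", date(2023, 5, 12)),
]

def sequence_support(events, x, y, window_days):
    """Share of customers who, after event x, also had event y within the window."""
    buyers_x, buyers_xy = set(), set()
    for cid, ev, d in events:
        if ev == x:
            buyers_x.add(cid)
            for cid2, ev2, d2 in events:
                if cid2 == cid and ev2 == y and timedelta(0) < d2 - d <= timedelta(days=window_days):
                    buyers_xy.add(cid)
    return len(buyers_xy) / len(buyers_x) if buyers_x else 0.0

print(sequence_support(events, "apartment", "refrigerator", 14))  # -> 0.666...
```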

Regression, forecasting (Forecasting)

Short description. As a result of solving the forecasting problem, missing or future values of target numerical indicators are estimated on the basis of the characteristics of historical data.

To solve such problems, methods of mathematical statistics, neural networks, etc. are widely used.
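As a minimal illustration of the forecasting task, the sketch below fits a linear trend to an invented series of historical values with NumPy and extrapolates it one step ahead; real forecasting would of course involve more careful model selection and validation.

```python
# A minimal sketch of the forecasting task: fitting a linear trend to historical values
# with NumPy and extrapolating it one step ahead. The series below is invented.
import numpy as np

sales = np.array([102.0, 108.0, 113.0, 121.0, 127.0, 134.0])  # historical target values
t = np.arange(len(sales))                                      # time index 0..5

slope, intercept = np.polyfit(t, sales, deg=1)                 # least-squares line
next_value = slope * len(sales) + intercept                    # forecast for t = 6
print(round(next_value, 1))
```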

Additional tasks

Deviation Detection (deviation or outlier analysis)

Short description. The goal of solving this problem is to detect and analyze data that is most different from the general set of data, identifying so-called uncharacteristic patterns.

Estimation

The estimation task comes down to predicting continuous values of a feature.

Link Analysis

The task of finding dependencies in a data set.

Visualization (GraphMining)

As a result of visualization, a graphic image of the analyzed data is created. To solve the visualization problem, graphical methods are used to show the presence of patterns in the data.

An example of visualization techniques is presenting data in two- and three-dimensional form.

Summarization

A task whose goal is to describe specific groups of objects from the analyzed data set.

Quite close to the above classification is the division of Data Mining tasks into the following: research and discovery, forecasting and classification, explanation and description.

Automatic exploration and discovery (free search)

Example task: discovering new market segments.

To solve this class of problems, cluster analysis methods are used.

Prediction and classification

Example problem: predicting sales growth based on current values.

Methods: regression, neural networks, genetic algorithms, decision trees.

Classification and forecasting tasks constitute a group of so-called inductive modeling, which results in the study of the analyzed object or system. In the process of solving these problems, a general model or hypothesis is developed based on a set of data.

Explanation and Description

Sample problem: Characterizing customers based on demographics and purchasing history.

Methods: decision trees, rule systems, association rules, connection analysis.

An example of such a rule: if the client's income is more than 50 conventional units and his age is more than 30 years, then the client belongs to the first class.

Comparison of clustering and classification

Characteristic | Classification | Clustering
Controllability of training | Controlled | Uncontrolled
Strategy | Supervised learning | Unsupervised learning
Availability of a class label | Each observation of the training set is accompanied by a label indicating the class to which it belongs | The class labels of the training set are unknown
Basis for classification | New data are classified on the basis of the training set | A data set is given in order to establish the existence of classes or clusters of data

Areas of application of DataMining

It should be noted that today Data Mining technology is most widely used in solving business problems. Perhaps the reason is that it is in this direction that the return on the use of Data Mining tools can reach, according to some sources, up to 1000%, and the implementation costs can quickly pay off.

We will look at four main areas of application of Data Mining technology in detail: science, business, government research and the Web.

Application of Data Mining for solving business tasks. Main areas: banking, finance, insurance, CRM, manufacturing, telecommunications, e-commerce, marketing, the stock market and others.

    Should I issue a loan to the client?

    Market segmentation

    Attraction of new clients

    Credit card fraud

Application of Data Mining for solving problems at the state level. Main directions: searching for tax evaders and tools for combating terrorism.

Application of Data Mining in scientific research. Main areas: medicine, biology, molecular genetics and genetic engineering, bioinformatics, astronomy, applied chemistry, research related to drug addiction, and others.

Using Data Mining to solve Web tasks. Main areas: search engines, counters and others.

E-commerce

In the field of e-commerce, Data Mining is used, in particular, to classify customers. This classification allows companies to identify specific customer groups and conduct marketing policy in accordance with the identified interests and needs of customers. Data Mining technology for e-commerce is closely related to WebMining technology.

The main tasks of Data Mining in industrial production:

· comprehensive system analysis of production situations;

· short-term and long-term forecasting of the development of production situations;

· development of options for optimization solutions;

· forecasting product quality depending on certain parameters of the technological process;

· detection of hidden trends and patterns in the development of production processes;

· forecasting patterns of development of production processes;

· detection of hidden influencing factors;

· detection and identification of previously unknown relationships between production parameters and influencing factors;

· analysis of the interaction environment of production processes and forecasting of changes in its characteristics;

· visualization of analysis results, preparation of preliminary reports and drafts of feasible solutions with assessments of the reliability and effectiveness of possible implementations.

Marketing

In the field of marketing, Data Mining is widely used.

Basic marketing questions: "What is sold?", "How is it sold?", "Who is the consumer?"

The lecture on classification and clustering problems describes in detail the use of cluster analysis for solving marketing problems such as consumer segmentation.

Another common set of methods for solving marketing problems is methods and algorithms for searching for association rules.

The search for temporal patterns is also used successfully here.

Retail

In retail trade, as in marketing, the following are used:

· algorithms for searching for association rules (to determine frequently occurring sets of goods that buyers purchase at the same time); identifying such rules helps to place goods on store shelves, to develop strategies for purchasing goods and for placing them in warehouses, etc.;

· time sequences, for example, to determine the required volumes of goods in the warehouse;

· classification and clustering methods to identify groups or categories of customers, knowledge of which contributes to the successful promotion of goods.

Stock market

Here is a list of stock market problems that can be solved using Data Mining technology:

· forecasting future values of financial instruments and their indicators from past values;

· forecasting the trend (the future direction of movement: growth, decline, flat) of a financial instrument and its strength (strong, moderately strong, etc.);

· identifying the cluster structure of the market, industry or sector according to a certain set of characteristics;

· dynamic portfolio management;

· volatility forecasting;

· risk assessment;

· predicting the onset of a crisis and forecasting its development;

· selection of assets, etc.

In addition to the areas of activity described above, DataMining technology can be used in a wide variety of business areas where there is a need for data analysis and a certain amount of retrospective information has been accumulated.

Application of Data Mining in CRM

One of the most promising areas of Data Mining is the use of this technology in analytical CRM.

CRM (Customer Relationship Management) is customer relationship management.

When these technologies are used together, knowledge extraction is combined with "money extraction" from customer data.

An important aspect of the work of the marketing and sales departments is the compilation of a holistic view of clients: information about their characteristics and specific features, and about the structure of the client base. CRM uses so-called customer profiling, which gives a complete picture of all the necessary information about clients.

Customer profiling includes the following components: customer segmentation, customer profitability, customer retention, customer response analysis. Each of these components can be examined using DataMining, and analyzing them together as profiling components can ultimately provide knowledge that is impossible to obtain from each individual characteristic.

WebMining

WebMining can be translated as "data mining on the Web". Web Intelligence is ready to "open a new chapter" in the rapid development of electronic business. The ability to determine the interests and preferences of each visitor by observing his behavior is a serious and critical competitive advantage in the e-commerce market.

WebMining systems can answer many questions, for example, which of the visitors is a potential client of the Web store, which group of Web store customers brings the most income, what are the interests of a particular visitor or group of visitors.

Methods

Classification of methods

There are two groups of methods:

  • statistical methods based on the use of average accumulated experience, which is reflected in retrospective data;
  • cybernetic methods, including many heterogeneous mathematical approaches.

The disadvantage of this classification is that both statistical and cybernetic algorithms rely in one way or another on a comparison of statistical experience with the results of monitoring the current situation.

The advantage of this classification is its ease of interpretation - it is used to describe the mathematical means of a modern approach to extracting knowledge from arrays of initial observations (operative and retrospective), i.e. in Data Mining tasks.

Let's take a closer look at the groups presented above.

Statistical methods of Data Mining

These methods comprise four interrelated sections:

  • preliminary analysis of the nature of statistical data (testing hypotheses of stationarity, normality, independence, homogeneity, assessing the type of distribution function, its parameters, etc.);
  • identification of connections and patterns (linear and nonlinear regression analysis, correlation analysis, etc.);
  • multivariate statistical analysis (linear and nonlinear discriminant analysis, cluster analysis, component analysis, factor analysis, etc.);
  • dynamic models and forecast based on time series.

The arsenal of statistical methods for Data Mining is classified into four groups of methods:

  1. Descriptive analysis and description of source data.
  2. Relationship analysis (correlation and regression analysis, factor analysis, analysis of variance).
  3. Multivariate statistical analysis (component analysis, discriminant analysis, multivariate regression analysis, canonical correlations, etc.).
  4. Time series analysis (dynamic models and forecasting).

Cybernetic Data Mining Methods

The second direction of Data Mining is a variety of approaches united by the idea of computer mathematics and the use of artificial intelligence theory.

This group includes the following methods:

  • artificial neural networks (recognition, clustering, forecast);
  • evolutionary programming (including algorithms of the group method of data handling, GMDH);
  • genetic algorithms (optimization);
  • associative memory (search for analogues, prototypes);
  • fuzzy logic;
  • decision trees;
  • expert knowledge processing systems.

Cluster analysis

The purpose of clustering is to search for existing structures.

Clustering is a descriptive procedure, it does not make any statistical inferences, but it does provide an opportunity to conduct exploratory analysis and study the “structure of the data.”

The very concept of "cluster" is defined ambiguously: each study has its own "clusters". The term "cluster" means a concentration or bunch. A cluster can be characterized as a group of objects that have common properties.

A cluster can be characterized by two properties:

  • internal homogeneity;
  • external isolation.

A question that analysts ask when solving many problems is how to organize data into visual structures, i.e., how to build taxonomies.

Clustering was initially most widely used in sciences such as biology, anthropology, and psychology. Clustering has been little used for solving economic problems for a long time due to the specific nature of economic data and phenomena.

Clusters can be disjoint, or exclusive (non-overlapping, exclusive), and overlapping.

It should be noted that as a result of applying various methods of cluster analysis, clusters of various shapes can be obtained. For example, “chain” type clusters are possible, when the clusters are represented by long “chains”, elongated clusters, etc., and some methods can create clusters of arbitrary shape.

Various methods may strive to create clusters of specific sizes (e.g., small or large) or assume that there are clusters of different sizes in the data set. Some cluster analysis methods are particularly sensitive to noise or outliers, others less so. As a result of using different clustering methods, different results may be obtained; this is normal and is a feature of the operation of a particular algorithm. These features should be taken into account when choosing a clustering method.

Let us give a brief description of approaches to clustering.

Algorithms based on data partitioning (partitioning algorithms), including iterative ones:

  • division of objects into k clusters;
  • iterative redistribution of objects to improve the clustering.

Hierarchical algorithms:

  • agglomerative: each object is initially a separate cluster; clusters are then merged with one another, forming ever larger clusters, and so on.

Density-based methods:

  • based on the ability to connect objects;
  • ignore noise and find clusters of arbitrary shape.

Grid-based methods:

  • quantization of objects into grid structures.

Model-based methods:

  • using the model to find clusters that best fit the data.

Cluster analysis methods. Iterative methods.

With a large number of observations, hierarchical methods of cluster analysis are not suitable. In such cases, non-hierarchical methods based on partitioning are used; these are iterative methods of dividing the original population. During the division process, new clusters are formed until the stopping rule is satisfied.

Such non-hierarchical clustering consists of dividing a data set into a certain number of individual clusters. There are two approaches. The first is to define the boundaries of the clusters as the densest areas in the multidimensional space of the source data, i.e., to define a cluster where there is a large "concentration of points". The second approach is to minimize a measure of difference between objects.

k-means algorithm

The most common non-hierarchical method is the k-means algorithm, also called fast cluster analysis. A complete description of the algorithm can be found in Hartigan and Wong (1979). Unlike hierarchical methods, which do not require preliminary assumptions about the number of clusters, this method requires a hypothesis about the most likely number of clusters.

The k-means algorithm constructs k clusters located at the greatest possible distances from each other. The main type of problem that the k-means algorithm solves presumes the existence of assumptions (hypotheses) about the number of clusters, which should be as different from each other as possible. The choice of k may be based on previous research, theoretical considerations, or intuition.

The general idea of the algorithm: given a fixed number k of clusters, observations are assigned to clusters so that the means in the clusters (over all variables) differ from one another as much as possible.

Description of the algorithm

1. Initial distribution of objects into clusters.

  • The number k is chosen and k initial points are selected; at the first step these points are considered the "centers" of the clusters.
  • Each cluster corresponds to one center.

The selection of initial centroids can be done as follows:

  • selection of k observations so as to maximize the initial distance between them;
  • random selection of k observations;
  • selection of the first k observations.

As a result, each object is assigned to a specific cluster.

2. Iterative process.

The coordinate-wise means of the clusters are calculated and become the new cluster centers. The objects are then redistributed among the clusters.

The process of calculating centers and redistributing objects continues until one of the conditions is met:

  • the cluster centers have stabilized, i.e., all observations belong to the cluster to which they belonged before the current iteration;
  • the number of iterations has reached the maximum number of iterations.

Figure: an example of the k-means algorithm for k = 2.

Choosing the number of clusters is a complex issue. If there are no assumptions regarding this number, it is recommended to create 2 clusters, then 3, 4, 5, etc., comparing the results obtained.

Checking the quality of clustering

After receiving the results of the k-means cluster analysis, you should check the correctness of the clustering (i.e., assess how different the clusters are from each other).

To do this, average values for each cluster are calculated. Good clustering should produce very different means for all dimensions, or at least most of them.
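A minimal sketch of this procedure, assuming scikit-learn and NumPy are available: k-means is run for several candidate values of k, the within-cluster sums of squares are compared, and the per-cluster means are inspected for the chosen k. The data is randomly generated for illustration.

```python
# A minimal sketch of the procedure described above, assuming scikit-learn and NumPy:
# run k-means for several candidate k, compare the results, and inspect the per-cluster means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two artificial "condensations of points" in a 2-D feature space
data = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(3.0, 3.0), scale=0.5, size=(100, 2)),
])

for k in (2, 3, 4, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    print(f"k={k}: within-cluster sum of squares = {model.inertia_:.1f}")

# Quality check for the chosen k: the cluster means should differ noticeably
best = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
for label in range(2):
    print(f"cluster {label} mean:", data[best.labels_ == label].mean(axis=0))
```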

Advantages of the k-means algorithm:

  • ease of use;
  • speed of use;
  • understandability and transparency of the algorithm.

Disadvantages of the k-means algorithm:

  • the algorithm is too sensitive to outliers, which can distort the mean; a possible solution to this problem is to use a modification of the algorithm, the k-medians algorithm;
  • the algorithm can be slow on large databases; a possible solution to this problem is to use a sample of the data.

Bayesian networks

In probability theory, the concept of information dependence is modeled through conditional dependence (or strictly: the absence of conditional independence), which describes how our confidence in the outcome of some event changes when we gain new knowledge about facts, provided that we already knew some set of other facts.

It is convenient and intuitive to represent dependencies between elements through a directed path connecting these elements in a graph. If the relationship between elements x and y is not direct but is mediated by a third element z, it is logical to expect that z will lie on the path between x and y. Such intermediary nodes "cut off" the dependence between x and y, i.e., they model a situation of conditional independence between them given known values of the directly influencing factors. Such a modeling language is provided by Bayesian networks, which serve to describe conditional dependencies between the concepts of a certain subject area.

Bayesian networks are graphical structures for representing probabilistic relationships between a large number of variables and for performing probabilistic inference based on those variables. "Naive" (Bayesian) classification is a fairly transparent and understandable classification method; it is called "naive" because it is based on the assumption of the mutual independence of the features.

Classification properties:

1. Using all variables and determining all dependencies between them.

2. Having two assumptions about the variables:

  • all variables are equally important;
  • all variables are statistically independent, i.e. the value of one variable says nothing about the value of another.
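A minimal sketch of "naive" Bayesian classification, assuming scikit-learn is available; the Gaussian variant is used here, and the word-frequency features are invented, echoing the spam-filtering example given earlier.

```python
# A minimal sketch of "naive" Bayesian classification, assuming scikit-learn is available.
# Each feature is treated as independent of the others given the class, which is exactly
# the "naive" assumption discussed above. The toy word-frequency data is invented.
from sklearn.naive_bayes import GaussianNB

# Features: [frequency of "earn", frequency of "acquire", message length in words]
X_train = [
    [0.00, 0.00, 120], [0.05, 0.04, 40], [0.00, 0.01, 300],
    [0.07, 0.06, 35],  [0.00, 0.00, 220], [0.06, 0.05, 50],
]
y_train = ["letter", "spam", "letter", "spam", "letter", "spam"]

clf = GaussianNB()
clf.fit(X_train, y_train)

print(clf.predict([[0.06, 0.03, 45]]))        # most likely class for a new message
print(clf.predict_proba([[0.06, 0.03, 45]]))  # estimated class probabilities
```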

There are two main scenarios for using Bayesian networks:

1. Descriptive analysis. The subject area is displayed as a graph, the nodes of which represent concepts, and the directed arcs, displayed by arrows, illustrate the direct dependencies between these concepts. The relationship between x and y means: knowing the value of x helps you make a better guess about the value of y. The absence of a direct connection between concepts models the conditional independence between them with known values ​​of a certain set of “separating” concepts. For example, a child's shoe size is obviously related to a child's reading ability through age. Thus, a larger shoe size gives greater confidence that the child is already reading, but if we already know the age, then knowing the shoe size will no longer give us additional information about the child's ability to read.


As another, opposite example, consider such initially unrelated factors as smoking and colds. But if we know a symptom, for example, that a person suffers from a cough in the morning, then knowing that the person does not smoke increases our confidence that the person has a cold.

2. Classification and forecasting. By allowing the conditional independence of a number of concepts, a Bayesian network makes it possible to reduce the number of parameters of the joint distribution, so that they can be estimated reliably on the available volumes of data. Thus, with 10 variables, each of which can take 10 values, the number of parameters of the joint distribution is 10^10 - 1 (about 10 billion). If we assume that among these variables only 2 depend on each other, the number of parameters becomes 8 * (10 - 1) + (10 * 10 - 1) = 171. Having a joint distribution model that is realistic in terms of computational resources, we can predict the unknown value of a concept as, for example, the most probable value of this concept given the known values of other concepts.
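A quick check of the parameter-count arithmetic above, under the stated assumptions (10 variables with 10 values each; in the factored case, 8 independent variables plus one dependent pair):

```python
# A quick check of the parameter-count arithmetic above, under the stated assumptions.
full_joint = 10 ** 10 - 1                   # unrestricted joint distribution
factored = 8 * (10 - 1) + (10 * 10 - 1)     # 8 independent variables + one dependent pair
print(full_joint)  # 9999999999
print(factored)    # 171
```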

The following advantages of Bayesian networks as a Data Mining method are noted:

The model defines the dependencies between all variables; this makes it easy to handle situations in which the values of some variables are unknown.

Bayesian networks are quite easy to interpret and, in predictive modeling, make it easy to conduct what-if scenario analysis.

The Bayesian approach allows patterns inferred from the data to be combined naturally with, for example, expert knowledge obtained explicitly.

Using Bayesian networks avoids the problem of overfitting, that is, excessive complication of the model, which is a weakness of many methods (for example, decision trees and neural networks).

The Naive Bayes approach has the following disadvantages:

Multiplying conditional probabilities is correct only when all input variables are truly statistically independent; although this method often shows quite good results even when the condition of statistical independence is not met, in theory such situations should be handled by more complex methods based on training Bayesian networks.

Direct processing of continuous variables is not possible: they must be converted to an interval scale so that the attributes are discrete; however, such transformations can sometimes lead to the loss of significant patterns.

The classification result in the Naive Bayes approach is influenced only by the individual values of the input variables; the combined influence of pairs or triples of values of different attributes is not taken into account. Taking it into account could improve the quality of the classification model in terms of its predictive accuracy, but it would increase the number of options to be tested.

Artificial neural networks

Artificial neural networks (hereinafter, neural networks) can be synchronous or asynchronous. In synchronous neural networks, only one neuron changes its state at each moment of time; in asynchronous networks, the state changes in a whole group of neurons at once, as a rule, in an entire layer. Two basic architectures can be distinguished: layered and fully connected (mesh) networks.

The key concept in layered networks is the layer. A layer is one or more neurons whose inputs receive the same common signal. Layered neural networks are networks in which the neurons are divided into separate groups (layers) so that information is processed layer by layer. In layered networks, the neurons of the i-th layer receive input signals, transform them, and pass them through branching points to the neurons of layer (i + 1), and so on up to the k-th layer, which produces the output signals for the interpreter and the user. The number of neurons in each layer is not related to the number of neurons in the other layers and can be arbitrary. Within a single layer, data is processed in parallel; across the network as a whole, processing proceeds sequentially from layer to layer. Layered neural networks include, for example, multilayer perceptrons, radial basis function networks, the cognitron, the neocognitron, and associative memory networks. However, the signal is not always fed to all neurons of a layer: in a cognitron, for example, each neuron of the current layer receives signals only from the neurons close to it in the previous layer.

Layered networks, in turn, can be single-layer or multi-layer.

A single-layer network is a network consisting of one layer.

A multilayer network is a network with several layers.

In a multilayer network, the first layer is called the input layer, the subsequent layers are called internal or hidden, and the last layer is the output layer. Thus, the intermediate layers are all the layers of a multilayer neural network except the input and output layers. The input layer of the network is connected with the input data, and the output layer with the output. Neurons can therefore be input, output, or hidden. The input layer is made up of input neurons, which receive data and distribute it to the inputs of the neurons in the hidden layer of the network. A hidden neuron is a neuron located in a hidden layer of the neural network. The output neurons, which make up the output layer of the network, produce the results of the neural network's work.

In fully connected (mesh) networks, each neuron transmits its output signal to the other neurons, including itself. The output signals of the network can be all or some of the output signals of the neurons after several cycles of network operation. All input signals are fed to all neurons.

Training neural networks

Before a neural network can be used, it must be trained. The process of training a neural network consists of adjusting its internal parameters to a specific task. The neural network algorithm is iterative; its steps are called epochs or cycles. An epoch is one iteration of the learning process, including the presentation of all examples from the training set and, possibly, a check of the quality of learning on a test set. The learning process is carried out on the training sample. The training sample includes the input values of the data set and the corresponding output values. During training, the neural network finds certain dependencies between the output fields and the input fields. Thus we face the question of which input fields (features) should be used. Initially, the choice is made heuristically; the number of inputs can then be changed.

A problem that may arise is the number of observations in the data set. Although there are certain rules describing the relationship between the required number of observations and the size of the network, their correctness has not been proven. The number of required observations depends on the complexity of the problem being solved. As the number of features grows, the number of observations required grows nonlinearly; this problem is called the "curse of dimensionality". If there is not enough data, it is recommended to use a linear model.

The analyst must determine the number of layers in the network and the number of neurons in each layer. Next, values of the weights and biases must be assigned that minimize the decision error. The weights and biases are adjusted automatically so as to minimize the difference between the desired and obtained output signals, which is called the training error. The training error of the constructed neural network is calculated by comparing the output and target (desired) values. The error function is formed from the resulting differences.

The error function is the objective function that must be minimized in the process of supervised learning of a neural network. The error function can be used to evaluate the quality of the neural network during training; for example, the sum of squared errors is often used. The quality of training of a neural network determines its ability to solve the assigned tasks.
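A minimal sketch of supervised training with a sum-of-squared-errors objective, assuming NumPy: a single linear neuron whose weights and bias are adjusted by gradient descent over several epochs. The training data is invented for illustration.

```python
# A minimal sketch of supervised training with a sum-of-squared-errors objective:
# a single linear neuron whose weights and bias are adjusted by gradient descent.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))                  # 50 training examples, 2 input fields
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5       # target output values

w = np.zeros(2)                               # weights
b = 0.0                                       # bias (offset)
lr = 0.05                                     # learning rate

for epoch in range(20):                       # each pass over the data is one epoch
    out = X @ w + b                           # network output
    err = out - y
    sse = float(np.sum(err ** 2))             # sum of squared errors (training error)
    # gradient step using the squared errors averaged over the examples (SSE scaled by 1/N)
    w -= lr * 2 * (X.T @ err) / len(X)
    b -= lr * 2 * err.mean()
    if epoch % 5 == 0:
        print(f"epoch {epoch}: SSE = {sse:.3f}")
```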

Overtraining (overfitting) of a neural network

When training neural networks, a serious difficulty often arises called the problem of overfitting. Overfitting, or overtraining, is the excessive fitting of the neural network to a specific set of training examples, in which the network loses its ability to generalize. Overtraining occurs when training goes on for too long, when there are not enough training examples, or when the neural network structure is overcomplicated.

Overtraining is related to the fact that the choice of the training set is random. From the first steps of learning, the error decreases. At subsequent steps, in order to reduce the error (the objective function), the parameters adapt to the characteristics of the training set. However, this is an "adjustment" not to the general patterns of the series but to the peculiarities of its part, the training subset. At the same time, the accuracy of the forecast decreases.

One way to combat network overtraining is to divide the training sample into two sets (training and test). The neural network is trained on the training set, and the constructed model is checked on the test set. These sets must not intersect. With each step the model parameters change, but the steady decrease in the value of the objective function occurs precisely on the training set. By splitting the set into two, we can observe the change in the forecast error on the test set in parallel with observations on the training set. For some number of steps, the forecast error decreases on both sets. At a certain step, however, the error on the test set begins to grow while the error on the training set continues to decrease. This moment is considered the beginning of overtraining.
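A minimal sketch of the train/test splitting described above, again assuming NumPy: the model is fitted on the training set while the error on a held-out test set is only monitored, and training stops as soon as the test error starts to grow. The data and model are invented for illustration, and on such a simple linear model the test error may in fact never rise.

```python
# A minimal sketch of monitoring a held-out test set during training and stopping
# when the test error begins to grow (the moment treated above as the start of overtraining).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.5, size=200)  # noisy targets

# Non-intersecting training and test sets
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

w = np.zeros(3)
lr = 0.05
best_test_err = float("inf")

for epoch in range(200):
    err_train = X_train @ w - y_train
    w -= lr * 2 * (X_train.T @ err_train) / len(X_train)   # gradient step on the training set

    test_err = float(np.mean((X_test @ w - y_test) ** 2))  # monitored, never used for fitting
    if test_err > best_test_err:
        print(f"test error started to grow at epoch {epoch}; stopping")
        break
    best_test_err = test_err
else:
    print(f"no increase in test error within 200 epochs; final test MSE = {best_test_err:.3f}")
```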

Data Mining Tools

Both world-famous leaders and new, developing companies are involved in developing the Data Mining sector of the global software market. Data Mining tools can be offered either as a stand-alone application or as add-ons to a main product. The latter option is implemented by many software market leaders. Thus, it has already become a tradition for developers of universal statistical packages to include in the package, in addition to traditional methods of statistical analysis, a certain set of Data Mining methods. These are packages such as SPSS (SPSS, Clementine), Statistica (StatSoft), and SAS Enterprise Miner (SAS Institute). Some OLAP solution providers also offer a set of Data Mining methods, for example the Cognos family of products. There are suppliers that include Data Mining solutions in the DBMS functionality: Microsoft (Microsoft SQL Server), Oracle, and IBM (IBM Intelligent Miner for Data).

Bibliography

  1. Abdikeev N.M., Danko T.P., Ildemenov S.V., Kiselev A.D. Business Process Reengineering. MBA Course. M.: Eksmo, 2005. 592 p. (MBA)
  2. Abdikeev N.M., Kiselev A.D. Knowledge Management in a Corporation and Business Reengineering. M.: Infra-M, 2011. 382 p. ISBN 978-5-16-004300-5
  3. Barseghyan A.A., Kupriyanov M.S., Stepanenko V.V., Kholod I.I. Methods and Models of Data Analysis: OLAP and Data Mining. St. Petersburg: BHV-Petersburg, 2004. 336 p. ISBN 5-94157-522-X
  4. Dyuk V., Samoilenko A. Data Mining: A Training Course. St. Petersburg: Piter, 2001. 386 p.
  5. Chubukova I.A. Data Mining course, http://www.intuit.ru/department/database/datamining/
  6. Ian H. Witten, Eibe Frank, Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques (Third Edition). Morgan Kaufmann. ISBN 978-0-12-374856-0
  7. Petrushin V.A., Khan L. Multimedia Data Mining and Knowledge Discovery

Data Mining methods are divided into two large groups according to the principle of working with the initial training data. In this classification, the top level is determined by whether the data is kept after Data Mining or is distilled for later use.

1. Direct use of data, or data storage.

In this case, the source data is stored in explicit, detailed form and is used directly at the stages of predictive modeling and/or exception analysis. The problem with this group of methods is that it may be difficult to use them to analyze very large databases.

Methods in this group: cluster analysis, nearest neighbor method, k-nearest neighbor method, analogical reasoning.

2. Identification and use of formalized patterns, or template distillation.

With template (pattern) distillation technology, a sample of information is extracted from the source data and converted into certain formal structures, whose form depends on the Data Mining method used. This process is carried out at the free search stage; the first group of methods has no such stage at all. At the stages of predictive modeling and exception analysis, the results of the free search stage are used; they are much more compact than the databases themselves. Recall that the constructions of these models may be interpretable by an analyst or non-interpretable ("black boxes").

Methods of this group: logical methods; visualization methods; cross-tabulation methods; methods based on equations.

Logical methods, or methods of logical induction, include: fuzzy queries and analyses; symbolic rules; decision trees; genetic algorithms.

The methods of this group are perhaps the most interpretable: they formalize the patterns found, in most cases, in a form that is fairly transparent from the user's point of view. The resulting rules can include both continuous and discrete variables. It should be noted that decision trees can easily be converted into sets of symbolic rules by generating one rule for each path from the root of the tree to a terminal node. Decision trees and rules are in fact different ways of solving the same problem and differ only in their capabilities. In addition, rule induction is implemented by slower algorithms than the induction of decision trees.
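A minimal sketch of converting a decision tree into symbolic rules, assuming scikit-learn: export_text prints one branch per path from the root to a leaf. The toy client data (income and age) is invented.

```python
# A minimal sketch of turning a decision tree into readable symbolic rules,
# assuming scikit-learn: export_text prints one "if ... then ..." path per leaf.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[20, 25], [60, 35], [55, 28], [80, 45], [30, 52], [70, 31]]  # [income, age]
y = ["second", "first", "second", "first", "second", "first"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["income", "age"]))
# Each branch of the printed tree corresponds to a symbolic rule,
# e.g. "income > 57.5 -> class: first" (the exact thresholds depend on the data).
```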

Cross-tabulation methods: agents, Bayesian (belief) networks, cross-tabulation visualization. The last method does not quite satisfy one of the properties of Data Mining, the independent search for patterns by the analytical system. However, providing information in the form of cross-tabs ensures the implementation of the main task of Data Mining, finding patterns, so this method can also be considered one of the Data Mining methods.

Equation-based methods.

The methods of this group express the identified patterns in the form of mathematical expressions: equations. Therefore they can work only with numeric variables, and variables of other types must be encoded accordingly. This somewhat limits the use of the methods of this group; nevertheless, they are widely used in solving various problems, especially forecasting problems.

The main methods of this group are statistical methods and neural networks.

Statistical methods are most often used to solve forecasting problems. There are many methods of statistical data analysis, among them, for example, correlation and regression analysis, correlation of time series, identifying trends in time series, and harmonic analysis.

Another classification divides the variety of Data Mining methods into two groups: statistical and cybernetic methods. This separation scheme is based on different approaches to training mathematical models.

It should be noted that there are two approaches to classifying statistical methods as Data Mining. The first of them contrasts statistical methods and Data Mining; its supporters consider classical statistical methods to be a separate area of ​​data analysis. According to the second approach, statistical methods of analysis are part of the mathematical tools of Data Mining. Most authorities take the second approach.

In this classification, two groups of methods are distinguished:

  • statistical methods based on the use of average accumulated experience, which is reflected in retrospective data;
  • cybernetic methods, including many heterogeneous mathematical approaches.

The disadvantage of this classification is that both statistical and cybernetic algorithms rely in one way or another on a comparison of statistical experience with the results of monitoring the current situation.

The advantage of this classification is its ease of interpretation - it is used to describe the mathematical means of the modern approach to knowledge extraction from arrays of initial observations (operative and retrospective), i.e. in Data Mining tasks.

The development of methods for recording and storing data has led to a rapid growth in the volume of information collected and analyzed. The volumes of data are so impressive that it is simply impossible for a person to analyze them on their own, although the need for such an analysis is quite obvious, because this “raw” data contains knowledge that can be used in decision-making. In order to carry out automatic data analysis, Data Mining is used.

Data Mining is the process of discovering in “raw” data previously unknown, non-trivial, practically useful and interpretable knowledge necessary for decision-making in various areas of human activity. Data Mining is one of the steps of Knowledge Discovery in Databases.

The information found by applying Data Mining methods must be non-trivial and previously unknown; average sales figures, for example, are not such knowledge. Knowledge should describe new connections between properties, predict the values of some features based on others, and so on. The knowledge found must be applicable to new data with some degree of reliability. Its usefulness lies in the fact that this knowledge can bring certain benefits when applied. Knowledge must be in a non-mathematical form that is understandable to the user. For example, logical constructions of the form "if ... then ..." are most easily perceived by humans. Moreover, such rules can be used in various DBMSs as SQL queries. In cases where the extracted knowledge is not transparent to the user, there must be post-processing methods to bring it into an interpretable form.

The algorithms used in Data Mining are computationally intensive. This used to be a deterrent to the widespread practical application of Data Mining, but the growth in performance of modern processors has alleviated the problem: a thorough analysis of hundreds of thousands or millions of records can now be carried out in acceptable time.

Problems solved by Data Mining methods:

  1. Classification - the assignment of objects (observations, events) to one of previously known classes.
  2. Regression, including forecasting tasks: establishing the dependence of continuous output variables on the input variables.
  3. Clustering is a grouping of objects (observations, events) based on data (properties) that describe the essence of these objects. Objects within a cluster must be “similar” to each other and different from objects included in other clusters. The more similar the objects within a cluster and the more differences between clusters, the more accurate the clustering.
  4. Association - identifying patterns between related events. An example of such a pattern is a rule indicating that event Y follows from event X. Such rules are called associative. This problem was first posed for finding typical purchasing patterns in supermarkets, so it is sometimes also called market basket analysis.
  5. Sequential patterns– establishing patterns between events related in time, i.e. detection of the dependence that if event X occurs, then after a given time event Y will occur.
  6. Deviation Analysis– identification of the most uncharacteristic patterns.

Business analysis problems are formulated differently, but the solution to most of them comes down to one or another Data Mining problem or a combination of them. For example, risk assessment is a solution to a regression or classification problem, market segmentation is clustering, demand stimulation is association rules. In fact, Data Mining tasks are the elements from which a solution to the vast majority of real business problems can be assembled.

To solve the above problems, various Data Mining methods and algorithms are used. Due to the fact that Data Mining has developed and is developing at the intersection of such disciplines as statistics, information theory, machine learning, and database theory, it is quite natural that most Data Mining algorithms and methods were developed based on various methods from these disciplines. For example, the k-means clustering procedure was simply borrowed from statistics. The following Data Mining methods have become very popular: neural networks, decision trees, clustering algorithms, including scalable ones, algorithms for detecting associative connections between events, etc.

Deductor is an analytical platform that includes a full set of tools for solving Data Mining problems: linear regression, supervised and unsupervised neural networks, decision trees, search for association rules and many others. For many mechanisms, specialized visualizers are provided, which greatly facilitate the use of the resulting model and the interpretation of results. The strength of the platform lies not only in the implementation of modern analysis algorithms, but also in the ability to combine various analysis mechanisms arbitrarily.

Ministry of Education and Science of the Russian Federation

Federal State Budgetary Educational Institution of Higher Professional Education

"NATIONAL RESEARCH TOMSK POLYTECHNIC UNIVERSITY"

Institute of Cybernetics

Field of study: Informatics and Computer Engineering

Department of VT

Test

in the discipline of informatics and computer engineering

Topic: Data Mining Methods

Introduction

Data Mining. Basic concepts and definitions

1 Steps in the data mining process

2 Components of intelligent analysis systems

3 Data mining methods

Data Mining Methods

1 Derivation of association rules

2 Neural network algorithms

3 Nearest neighbor and k-nearest neighbors methods

4 Decision trees

5 Clustering algorithms

6 Genetic algorithms

Areas of application

Manufacturers of Data Mining tools

Criticism of methods

Conclusion

Bibliography

Introduction

The development of information technologies has produced a colossal amount of data accumulated in electronic form, and this volume continues to grow at a rapid pace. As a rule, the data have a heterogeneous structure (texts, images, audio, video, hypertext documents, relational databases). Data accumulated over a long period can contain patterns, trends and relationships that constitute valuable information for planning, forecasting, decision-making and process control. However, humans are physically unable to analyze such volumes of heterogeneous data effectively. Methods of traditional mathematical statistics have long claimed the role of the main tool for data analysis, but they do not allow new hypotheses to be synthesized; they can only be used to confirm pre-formulated hypotheses and for "rough" exploratory analysis, which forms the basis of online analytical processing (OLAP). Often it is the formulation of a hypothesis that turns out to be the most difficult part of analysis for subsequent decision-making, since not all patterns in the data are obvious at first glance. Therefore, Data Mining technologies are considered one of the most important and promising topics for research and application in the information technology industry. Data Mining here refers to the process of identifying new, correct and potentially useful knowledge on the basis of large amounts of data. MIT Technology Review described Data Mining as one of the ten emerging technologies that will change the world.

1. Data Mining. Basic concepts and definitions

Data Mining is the process of discovering in “raw” data previously unknown, non-trivial, practically useful and interpretable knowledge necessary for decision-making in various areas of human activity.

The essence and purpose of Data Mining technology can be formulated as follows: it is a technology designed to search large volumes of data for non-obvious, objective and practically useful patterns.

Non-obvious patterns are patterns that cannot be detected by standard methods of information processing or by expert means.

Objective patterns should be understood as patterns that fully correspond to reality, in contrast to expert opinion, which is always subjective.

This data analysis concept assumes that:

§ data may be inaccurate, incomplete (contain omissions), contradictory, heterogeneous, indirect, and at the same time have gigantic volumes; therefore, understanding data in specific applications requires significant intellectual effort;

§ data analysis algorithms themselves may have “elements of intelligence,” in particular, the ability to learn from precedents, that is, to draw general conclusions based on particular observations; the development of such algorithms also requires significant intellectual effort;

§ the processes of processing raw data into information, and information into knowledge, cannot be done manually and require automation.

Data Mining technology is based on the concept of templates (patterns), which reflect fragments of multidimensional relationships in the data. These templates are regularities, inherent in subsamples of the data, that can be expressed compactly in a human-readable form.

The search for patterns is carried out using methods that are not limited by a priori assumptions about the structure of the sample and the type of distributions of the values ​​of the analyzed indicators.

An important feature of Data Mining is the non-standard and non-obvious nature of the patterns being sought. In other words, Data Mining tools differ from statistical data processing tools and OLAP tools in that, instead of checking interdependencies assumed in advance by users, they are able to find such interdependencies in the available data on their own and to build hypotheses about their nature. There are five standard types of patterns identified by Data Mining methods:

· association - a high probability of events being connected with each other. An example of an association is items in a store that are often purchased together;

· sequence - a high probability of a chain of events related in time. An example of a sequence is a situation where, within a certain period after the purchase of one product, another will be purchased with a high degree of probability;

· classification - there are signs that characterize the group to which this or that event or object belongs;

· clustering is a pattern similar to classification and differs from it in that the groups themselves are not specified - they are identified automatically during data processing;

· temporal patterns - the presence of patterns in the dynamics of the behavior of certain data. A typical example of a time pattern is seasonal fluctuations in demand for certain goods or services.

1.1 Steps in the data mining process

Traditionally, the following stages are distinguished in the process of data mining:

1. Study of the subject area, as a result of which the main goals of the analysis are formulated.

2. Data collection.

3. Data preprocessing:

a. Data cleaning - eliminating inconsistencies and random “noise” from the source data

b. Data integration - combining data from several possible sources into one repository.

c. Data transformation. At this stage, the data are converted into a form suitable for analysis. Data aggregation, attribute sampling, data compression and dimensionality reduction are often used.

4. Data analysis. As part of this stage, mining algorithms are used to extract patterns.

5. Interpretation of the found patterns. This stage may involve visualizing the extracted patterns, identifying truly useful patterns based on some utility function.

6. Using the new knowledge.

1.2 Components of intelligent analysis systems

Typically, data mining systems have the following main components:

1. A database, data warehouse, or other repository of information. This could be one or more databases, a data warehouse, spreadsheets, or other types of repositories that can be cleaned and integrated.

2. Database or data warehouse server. The specified server is responsible for retrieving the essential data based on the user request.

3. Knowledge base. This is domain knowledge that guides the search and the evaluation of the usefulness of the resulting patterns.

4. Knowledge mining service. This is an integral part of the data mining system and contains a set of functional modules for tasks such as characterization, association search, classification, cluster analysis and deviation analysis.

5. Pattern evaluation module. This component computes measures of the interestingness or usefulness of patterns.

6. Graphical user interface. This module is responsible for communication between the user and the data mining system and for visualizing the found patterns in various forms.

1.3 Data mining methods

Most of the analytical methods used in Data Mining technology are well-known mathematical algorithms and methods. What is new is the possibility of applying them to specific problems of this kind, which has become feasible thanks to the emerging capabilities of hardware and software. It should be noted that most Data Mining methods were developed within the framework of artificial intelligence theory. Let us look at the most widely used methods:

1. Derivation of association rules.

2. Neural network algorithms, whose idea is based on an analogy with the functioning of nervous tissue: the initial parameters are treated as signals that are transformed in accordance with the existing connections between "neurons", and the response of the entire network to the original data is treated as the result of the analysis.

3. Selection of a close analogue of the original data from the existing historical data, also called the "nearest neighbor" method.

4. Decision trees - a hierarchical structure based on a set of questions that require a "Yes" or "No" answer.

5. Cluster models, used to group similar events into groups based on similar values of several fields in a data set.

In the next chapter we will describe the above methods in more detail.

2. Data Mining Methods

2.1 Inference of association rules

Association rules are rules of the form “if... then...”. Searching for such rules in a data set reveals hidden connections in seemingly unrelated data. One of the most frequently cited examples of finding association rules is the problem of finding stable connections in a shopping cart. This problem is to determine which products are purchased by customers together, so that marketers can appropriately place these products in the store to increase sales.

Association rules are defined as statements of the form (X1, X2, ..., Xn) -> Y, meaning that Y may be present in a transaction provided that X1, X2, ..., Xn are present in the same transaction. The word "may" indicates that the rule is not an identity but holds only with some probability. In addition, Y can be a set of elements rather than a single element. The probability of finding Y in a transaction that contains the elements X1, X2, ..., Xn is called confidence. The share of transactions containing the rule in the total number of transactions is called support. The confidence level that the confidence of a rule must exceed is called interestingness.
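As an illustration of these definitions, here is a minimal Python sketch (standard library only; the transactions and the candidate rule are invented examples) that computes the support and confidence of a rule {X1, ..., Xn} -> Y over a list of market-basket transactions:

def support(transactions, itemset):
    """Share of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(consequent is in a transaction | antecedent is in that transaction)."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    both = sum(1 for t in with_antecedent if consequent <= t)
    return both / len(with_antecedent)

transactions = [
    {"yogurt", "butter", "bread"},
    {"yogurt", "butter"},
    {"bread", "milk"},
    {"yogurt", "bread", "milk"},
]
rule_if, rule_then = {"yogurt"}, {"butter"}
print("support:", support(transactions, rule_if | rule_then))       # 0.5
print("confidence:", confidence(transactions, rule_if, rule_then))  # about 0.67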

There are different types of association rules. In their simplest form, association rules communicate only the presence or absence of an association. Such rules are called Boolean Association Rules. An example of such a rule: “customers who purchase yogurt also purchase low-fat butter.”

Rules that bring several association rules together are called Multilevel or Generalized Association Rules. When constructing such rules, the elements are usually grouped according to a hierarchy, and the search is conducted at the highest conceptual level. For example, "customers who buy milk also buy bread". In this example, milk and bread each sit at the top of a hierarchy of various types and brands, and searching at the lower level would not reveal interesting rules.

A more complex type of rule is Quantitative Association Rules. Rules of this type are searched for using quantitative (e.g. price) or categorical (e.g. gender) attributes and have the form {<attribute_1: value_1>, <attribute_2: value_2>, ..., <attribute_n: value_n>} -> <attribute: value>. For example, "buyers whose age is between 30 and 35, with an income of more than 75,000 per year, buy cars costing more than 20,000".

The above types of rules do not address the fact that transactions are, by their nature, time dependent. For example, searching before a product has been listed for sale or after it has disappeared from the market will adversely affect the support threshold. Taking this into account, the concept of attribute lifetime was introduced in algorithms for searching temporary association rules (Temporal Association Rules).

The problem of searching for association rules can be generally decomposed into two parts: searching for frequently occurring sets of elements, and generating rules based on the found frequently occurring sets. Previous studies have, for the most part, followed these lines and extended them in different directions.

Since the advent of the Apriori algorithm, it has been the one most commonly used for the first step. Many improvements, for example in speed and scalability, are aimed at refining Apriori, in particular at correcting its tendency to generate too many candidates for the most frequently occurring itemsets. Apriori generates itemsets using only the large itemsets found in the previous step, without re-examining the transactions. The modified AprioriTid algorithm improves on Apriori by using the database only on the first pass; in subsequent steps only the data created on the first pass, which is much smaller than the original database, is used. This leads to a large gain in performance. A further improved version of the algorithm, called AprioriHybrid, is obtained by using Apriori on the first few passes and then switching to AprioriTid on later passes, when the k-th candidate sets can already fit entirely in memory.
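To make the first step concrete, the following is a deliberately unoptimized, Apriori-style sketch of frequent itemset search in pure Python (the minimum support value and the transactions are illustrative, and none of the optimizations mentioned above, such as AprioriTid, are included):

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets whose support meets min_support."""
    n = len(transactions)
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count the support of the current candidates in one pass over the data
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(survivors)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets
        k += 1
        prev = list(survivors)
        current = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
    return frequent

transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"bread", "butter"}, {"milk", "butter"}]
for itemset, supp in apriori_frequent_itemsets(transactions, 0.5).items():
    print(set(itemset), round(supp, 2))

Rules are then generated from the frequent itemsets found, keeping only those whose confidence exceeds the chosen threshold.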

Further efforts to improve the Apriori algorithm are related to parallelization of the algorithm (Count Distribution, Data Distribution, Candidate Distribution, etc.), its scaling (Intelligent Data Distribution, Hybrid Distribution), and the introduction of new data structures, such as trees of frequently occurring elements (FP-growth).

The second step mainly concerns the confidence and interestingness of the generated rules. Newer modifications add the quantitative, multilevel and temporal aspects described above to the traditional Boolean rules. An evolutionary algorithm is often used to find the rules.

2.2 Neural network algorithms

Artificial neural networks appeared as a result of applying mathematical apparatus to the study of the functioning of the human nervous system with the aim of reproducing it, namely the nervous system's ability to learn and correct errors, which should make it possible to simulate, albeit rather roughly, the work of the human brain. The main structural and functional element of a neural network is the formal neuron, shown in Fig. 1, where x0, x1, ..., xn are the components of the input signal vector, w0, w1, ..., wn are the weights of the neuron's input signals, and y is the neuron's output signal.

Fig. 1. Formal neuron: synapses (1), adder (2), converter (3).

A formal neuron consists of 3 types of elements: synapses, adder and converter. A synapse characterizes the strength of the connection between two neurons.

The adder adds the input signals, previously multiplied by the corresponding weights. The converter implements the function of one argument - the output of the adder. This function is called the activation function or transfer function of the neuron.
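A minimal Python sketch of such a formal neuron is shown below; the sigmoid activation function and the sample input and weight values are illustrative assumptions rather than anything prescribed by the text:

import math

def formal_neuron(inputs, weights):
    """Weighted adder followed by a converter (activation function)."""
    s = sum(w * x for w, x in zip(weights, inputs))   # adder
    return 1.0 / (1.0 + math.exp(-s))                 # converter: sigmoid

# x0 is conventionally a constant input of 1, so w0 plays the role of a bias term.
x = [1.0, 0.5, -1.2]
w = [0.1, 0.8, 0.3]
print(formal_neuron(x, w))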

The formal neurons described above can be combined in such a way that the output signals of some neurons are the input to others. The resulting set of interconnected neurons is called artificial neural networks (ANN) or, for short, neural networks.

There are the following three general types of neurons, depending on their position in the neural network:

Input neurons (input nodes) to which input signals are supplied. Such neurons usually have one input with unit weight, there is no bias, and the neuron output value is equal to the input signal;

Output nodes, whose output values ​​represent the resulting output signals of the neural network;

Hidden neurons (hidden nodes) that do not have direct connections with input signals, while the values ​​of the output signals of hidden neurons are not the output signals of the ANN.

Based on the structure of interneuron connections, two classes of ANNs are distinguished:

Feedforward ANNs, in which the signal propagates only from the input neurons to the output neurons.

Recurrent ANN - ANN with feedback. In such ANNs, signals can be transmitted between any neurons, regardless of their location in the ANN.

There are two general approaches to training an ANN:

Supervised learning (training with a teacher).

Unsupervised learning (training without a teacher).

Supervised learning involves the use of a pre-generated set of training examples. Each example contains a vector of input signals and a corresponding vector of reference output signals, which depend on the task at hand. This set is called the training sample or training set. Training a neural network is aimed at changing the weights of the ANN connections in such a way that the values ​​of the ANN output signals differ as little as possible from the required values ​​of the output signals for a given vector of input signals.
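As a toy illustration of supervised learning, the sketch below adjusts the weights of a single sigmoid neuron by gradient descent so that its outputs approach the reference outputs of a small training set (the learning rate, the number of epochs and the AND-function training set are arbitrary assumptions; training a real ANN uses backpropagation across many interconnected neurons):

import math, random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_single_neuron(samples, n_inputs, rate=0.5, epochs=1000):
    """Adjust the weights so that the neuron's outputs approach the reference outputs."""
    w = [random.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]  # +1 weight for the bias
    for _ in range(epochs):
        for x, target in samples:
            xb = [1.0] + list(x)                       # prepend the constant bias input
            y = sigmoid(sum(wi * xi for wi, xi in zip(w, xb)))
            error = target - y
            # Gradient step for the squared error of this training example
            w = [wi + rate * error * y * (1 - y) * xi for wi, xi in zip(w, xb)]
    return w

samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]   # logical AND
w = train_single_neuron(samples, n_inputs=2)
for x, target in samples:
    y = sigmoid(sum(wi * xi for wi, xi in zip(w, [1.0] + list(x))))
    print(x, target, round(y, 2))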

In unsupervised learning, the connection weights are adjusted either as a result of competition between neurons, or taking into account the correlation of the output signals of neurons between which there is a connection. In the case of unsupervised learning, no training set is used.

Neural networks are used to solve a wide range of problems, such as payload planning for space shuttles and forecasting exchange rates. However, they are not often used in data mining systems due to the complexity of the model (knowledge recorded as the weights of several hundred interneural connections is completely beyond human analysis and interpretation) and the long training time on a large training set. On the other hand, neural networks have such advantages for use in data analysis tasks as resistance to noisy data and high accuracy.

2.3 Nearest neighbor and k-nearest neighbors methods

The nearest neighbor algorithm and the k-nearest neighbors algorithm (KNN) are based on the similarity of objects. The nearest neighbor algorithm selects, among all known objects, the one that is closest (according to some distance metric between objects, for example the Euclidean distance) to the new, previously unknown object. The main problem with the nearest neighbor method is its sensitivity to outliers in the training data.

The described problem can be avoided by the KNN algorithm, which identifies among all observations the k nearest neighbors of the new object; the decision regarding the new object is then made based on the classes of these neighbors. An important task in this algorithm is the selection of the parameter k, the number of records that will be considered similar. A modification of the algorithm in which a neighbor's contribution is weighted in inverse proportion to its distance from the new object (the weighted k-nearest neighbors method) allows greater classification accuracy to be achieved. The k-nearest neighbors method also makes it possible to estimate the confidence of the prediction: for example, if all k nearest neighbors have the same class, the probability that the object being tested has that class is very high.

Among the features of the algorithm, it is worth noting its resistance to anomalous outliers, since the probability of such a record being included among the k nearest neighbors is low. If this does happen, its impact on the vote (especially a weighted vote, for k > 2) will most likely be insignificant, and therefore so will its impact on the classification result. Further advantages are the simplicity of implementation, the ease of interpreting the result, and the possibility of adapting the algorithm to a specific task by choosing the most suitable combination functions and metrics. The KNN algorithm also has a number of disadvantages. First, the data set used by the algorithm must be representative. Second, the model cannot be separated from the data: all the examples must be used to classify each new example. This feature greatly limits the use of the algorithm.
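A minimal sketch of KNN classification with the Euclidean metric and simple majority voting (the training points, labels and the value of k are hypothetical):

import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training, new_point, k=3):
    """training is a list of (point, label); returns the majority label of the k nearest."""
    neighbors = sorted(training, key=lambda item: euclidean(item[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((4.0, 4.2), "B"), ((4.5, 3.9), "B"), ((0.9, 1.3), "A")]
print(knn_classify(training, (1.1, 1.0), k=3))  # expected: "A"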

2.4 Decision trees

The term “decision trees” refers to a family of algorithms based on the representation of classification rules in a hierarchical, sequential structure. This is the most popular class of algorithms for solving data mining problems.

A family of algorithms for constructing decision trees makes it possible to predict the value of a parameter for a given case based on a large amount of data about other similar cases. Typically, algorithms of this family are used to solve problems that make it possible to divide all the initial data into several discrete groups.

When applying decision tree construction algorithms to a set of initial data, the result is displayed as a tree. Such algorithms make it possible to implement several levels of such division, dividing the resulting groups (tree branches) into smaller ones based on other characteristics. The division continues until the values ​​that are supposed to be predicted become the same (or, in the case of a continuous value of the predicted parameter, close) for all resulting groups (leaves of the tree). It is these values ​​that are used to make predictions based on this model.

The operation of decision tree construction algorithms is based on regression and correlation analysis methods. One of the most popular algorithms of this family is CART (Classification and Regression Trees), which splits the data in a tree branch into two child branches; the further splitting of a particular branch depends on how much of the initial data that branch describes. Some other similar algorithms allow a branch to be split into more child branches. In this case, the split is made on the basis of the highest correlation coefficient, for the data described by the branch, between the attribute on which the split is made and the attribute that must subsequently be predicted.
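As a hedged illustration of how a single binary split can be chosen, the sketch below scores candidate thresholds on one numeric attribute with the Gini impurity criterion used in CART-style trees (the choice of criterion and the toy data are assumptions made for illustration only):

from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((cnt / n) ** 2 for cnt in Counter(labels).values())

def best_binary_split(rows, labels, feature_index):
    """Pick the threshold on one numeric feature that minimizes weighted Gini impurity."""
    best = None
    for threshold in sorted({row[feature_index] for row in rows}):
        left = [lab for row, lab in zip(rows, labels) if row[feature_index] <= threshold]
        right = [lab for row, lab in zip(rows, labels) if row[feature_index] > threshold]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, threshold)
    return best  # (weighted impurity, threshold)

rows = [(23,), (25,), (31,), (35,), (42,), (46,)]        # e.g. customer age
labels = ["no", "no", "no", "yes", "yes", "yes"]         # e.g. bought the product
print(best_binary_split(rows, labels, feature_index=0))  # (0.0, 31)

A full tree builder applies this step recursively to each resulting branch until the leaves become (almost) pure.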

The popularity of the approach is due to its clarity and ease of understanding. However, decision trees are fundamentally incapable of finding the "best" (most complete and accurate) rules in the data. They implement the naive principle of examining features sequentially and in fact find only fragments of real patterns, creating an illusion of logical inference.

2.5 Clustering algorithms

Clustering is the task of dividing a set of objects into groups called clusters. The main difference between clustering and classification is that the list of groups is not clearly defined and is determined during the operation of the algorithm.

The application of cluster analysis in general comes down to the following steps:

· selection of a sample of objects for clustering;

· defining a set of variables by which objects in the sample will be assessed. If necessary, normalize the values ​​of variables;

· calculation of similarity measure values ​​between objects;

· application of the cluster analysis method to create groups of similar objects (clusters);

· presentation of analysis results.

After receiving and analyzing the results, it is possible to adjust the selected metric and clustering method until the optimal result is obtained.

Clustering algorithms include hierarchical and flat groups. Hierarchical algorithms (also called taxonomy algorithms) build not just one partition of the sample into disjoint clusters, but a system of nested partitions. Thus, the output of the algorithm is a tree of clusters, the root of which is the entire sample, and the leaves are the smallest clusters. Flat algorithms construct one partition of objects into clusters that do not intersect with each other.

Another classification of clustering algorithms is into clear and fuzzy algorithms. Clear (or non-overlapping) algorithms assign each sample object a cluster number, that is, each object belongs to only one cluster. Fuzzy (or intersecting) algorithms assign each object a set of real values ​​that show the degree of the object’s relationship to the clusters. Thus, each object belongs to each cluster with a certain probability.

Among hierarchical clustering algorithms, there are two main types: bottom-up and top-down algorithms. Top-down algorithms work on a top-down principle: first, all objects are placed in one cluster, which is then divided into smaller and smaller clusters. More common are bottom-up algorithms, which start by placing each object in a separate cluster and then combining the clusters into larger and larger ones until all the objects in the sample are contained in a single cluster. Thus, a system of nested partitions is constructed. The results of such algorithms are usually presented in the form of a tree.
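A minimal sketch of the bottom-up scheme with single-linkage merging is given below; it stops when a target number of clusters is reached rather than recording the full merge tree, and the stopping rule and sample data are illustrative assumptions:

import math

def agglomerative(points, target_clusters=1):
    """Bottom-up clustering: start with singletons, repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]       # every object in its own cluster
    def linkage(a, b):                                 # single linkage: closest pair of members
        return min(math.dist(points[i], points[j]) for i in a for j in b)
    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest inter-cluster distance
        i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]        # merge the two clusters
        del clusters[j]
    return clusters

points = [(0, 0), (0.3, 0.1), (5, 5), (5.2, 5.1), (5.1, 4.9)]
print(agglomerative(points, target_clusters=2))        # two clusters: {0, 1} and {2, 3, 4}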

A disadvantage of hierarchical algorithms is the system of complete partitions, which may be unnecessary in the context of the problem being solved.

Let us now consider flat algorithms. The simplest in this class are the quadratic error algorithms. The clustering problem for these algorithms can be viewed as constructing an optimal partition of the objects into groups, where optimality is defined as the requirement to minimize the mean squared error of the partition:

e = \sum_{j=1}^{k} \sum_{x_i \in X_j} \lVert x_i - c_j \rVert^2 ,

where c_j is the "center of mass" of cluster j (the point with the average values of the characteristics for the given cluster).

The most common algorithm in this category is the k-means method. This algorithm builds a given number of clusters located as far apart as possible. The work of the algorithm is divided into several stages:

1. Randomly select k points as the initial "centers of mass" of the clusters.

2. Assign each object to the cluster with the nearest "center of mass".

3. Recompute the "center of mass" of each cluster from its current membership.

4. If the stopping criterion of the algorithm is not satisfied, return to step 2.

The minimum change in the mean square error is usually chosen as the criterion for stopping the algorithm. It is also possible to stop the algorithm if at step 2 there were no objects that moved from cluster to cluster. The disadvantages of this algorithm include the need to specify the number of clusters for partitioning.
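A minimal pure-Python sketch of the k-means procedure (random initialization, Euclidean distance, and a stop when the assignment no longer changes or an iteration cap is reached; the sample points are invented):

import math, random

def kmeans(points, k, max_iter=100):
    """Simple k-means: returns (cluster centers, cluster index of every point)."""
    centers = random.sample(points, k)                 # step 1: random initial centers
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every object to the cluster with the nearest center
        new_assignment = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                          for p in points]
        if new_assignment == assignment:               # nothing moved: stop (step 4)
            break
        assignment = new_assignment
        # Step 3: recompute the "center of mass" of every cluster
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, assignment

points = [(1, 1), (1.2, 0.9), (0.8, 1.1), (5, 5), (5.2, 4.8), (4.9, 5.1)]
centers, assignment = kmeans(points, k=2)
print(centers)
print(assignment)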

The most popular fuzzy clustering algorithm is the c-means algorithm. It is a modification of the k-means method. Algorithm steps:

1. Choose an initial fuzzy partition of the n objects into K clusters by selecting a membership matrix U of size n x K.

2. Using the matrix U, compute the value of the fuzzy error criterion:

E^2(X, U) = \sum_{i=1}^{n} \sum_{k=1}^{K} U_{ik} \lVert x_i - c_k \rVert^2 ,

where c_k is the "center of mass" of fuzzy cluster k:

c_k = \sum_{i=1}^{n} U_{ik} x_i .

3. Regroup the objects so as to reduce this value of the fuzzy error criterion.

4. Return to step 2 until the changes in the matrix U become insignificant.

This algorithm may not be suitable if the number of clusters is unknown in advance, or if it is necessary to unambiguously assign each object to one cluster.
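For illustration, here is a sketch of the widely used fuzzy c-means update rules with fuzzifier m = 2 and normalized cluster centers; this is the standard formulation of the method and differs slightly from the simplified criterion written out above, and the sample points are invented:

import math, random

def fuzzy_c_means(points, k, m=2.0, iterations=50):
    """Standard fuzzy c-means; returns (cluster centers, membership matrix U)."""
    n, dim = len(points), len(points[0])
    # Initial fuzzy partition: random memberships, each row normalized to sum to 1
    U = []
    for _ in range(n):
        row = [random.random() for _ in range(k)]
        total = sum(row)
        U.append([u / total for u in row])
    centers = []
    for _ in range(iterations):
        # Cluster centers as weighted "centers of mass" of the fuzzy clusters
        centers = []
        for j in range(k):
            weights = [U[i][j] ** m for i in range(n)]
            wsum = sum(weights)
            centers.append(tuple(sum(w * p[d] for w, p in zip(weights, points)) / wsum
                                 for d in range(dim)))
        # Membership update: the closer a center, the larger the membership value
        for i, p in enumerate(points):
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            for j in range(k):
                U[i][j] = 1.0 / sum((dists[j] / dists[l]) ** (2 / (m - 1)) for l in range(k))
    return centers, U

points = [(1, 1), (1.1, 0.9), (5, 5), (5.1, 4.9)]
centers, U = fuzzy_c_means(points, k=2)
print([tuple(round(c, 2) for c in center) for center in centers])
print([[round(u, 2) for u in row] for row in U])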

The next group of algorithms are algorithms based on graph theory. The essence of such algorithms is that a selection of objects is represented in the form of a graph G=(V, E), whose vertices correspond to objects, and whose edges have a weight equal to the “distance” between objects. The advantages of graph clustering algorithms are clarity, relative ease of implementation, and the ability to introduce various improvements based on geometric considerations. The main algorithms are the algorithm for identifying connected components, the algorithm for constructing a minimum spanning tree and the layer-by-layer clustering algorithm.

To select the parameter R (the distance threshold used by the connected-components algorithm), a histogram of the distribution of pairwise distances is usually constructed. In problems with a well-defined cluster structure, the histogram has two peaks: one corresponds to intra-cluster distances, the other to inter-cluster distances. The parameter R is selected from the minimum zone between these peaks. At the same time, it is quite difficult to control the number of clusters using a distance threshold.

The minimum spanning tree algorithm first constructs a minimum spanning tree of the graph and then successively removes the edges with the largest weight. The layer-by-layer clustering algorithm is based on identifying connected components of the graph at a certain level of distances between objects (vertices). The distance level is set by a distance threshold c; for example, if the distances between objects satisfy 0 <= p(x, x') <= 1, then c lies in [0, 1].

The layer-by-layer clustering algorithm generates a sequence of subgraphs of the graph G that reflect the hierarchical relationships between the clusters:

G^{c_0} \subseteq G^{c_1} \subseteq \dots \subseteq G^{c_m},

where G^{c_t} = (V, E^{c_t}) is the graph at threshold level c_t, with E^{c_t} the set of edges whose length does not exceed c_t; c_t is the t-th distance threshold; m is the number of hierarchy levels; G^{c_0} = (V, \varnothing) is the graph with an empty edge set, obtained at c_0 = 0; and G^{c_m} = G is the graph of the objects without any restriction on the distances (edge lengths), since c_m = 1.

By varying the distance thresholds (c_0, ..., c_m), where 0 = c_0 < c_1 < ... < c_m = 1, it is possible to control the depth of the hierarchy of the resulting clusters. Thus, the layer-by-layer clustering algorithm is capable of creating both a flat and a hierarchical partition of the data.
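A minimal sketch of one layer of this procedure: build the graph whose edges connect objects at distance no greater than a threshold c and return its connected components as clusters (the distance function, the points and the threshold are illustrative):

import math

def threshold_clusters(points, c):
    """Connected components of the graph whose edges join points at distance <= c."""
    n = len(points)
    adj = {i: [j for j in range(n) if j != i and math.dist(points[i], points[j]) <= c]
           for i in range(n)}
    clusters, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        # Depth-first search collects one connected component
        stack, component = [start], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            component.append(v)
            stack.extend(adj[v])
        clusters.append(component)
    return clusters

points = [(0, 0), (0.2, 0.1), (5, 5), (5.1, 5.2), (10, 10)]
print(threshold_clusters(points, c=1.0))  # [[0, 1], [2, 3], [4]]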

Clustering allows you to achieve the following goals:

· improves understanding of data by identifying structural groups. Dividing the sample into groups of similar objects makes it possible to simplify further data processing and decision-making by applying a different method of analysis to each cluster;

· allows you to compactly store data. To do this, instead of storing the entire sample, you can keep one typical observation from each cluster;

· allows the detection of new, atypical objects that do not fall into any cluster.

Typically, clustering is used as an auxiliary method in data analysis.

2.6 Genetic algorithms

Genetic algorithms are among the universal optimization methods that allow solving problems of various types (combinatorial, general problems with and without restrictions) and varying degrees of complexity. At the same time, genetic algorithms are characterized by the possibility of both single-criteria and multi-criteria search in a large space, the landscape of which is not smooth.

This group of methods uses an iterative process of evolution of a sequence of generations of models, including the operations of selection, mutation and crossing. At the beginning of the algorithm, the population is formed randomly. To assess the quality of the encoded solutions, the fitness function is used, which is necessary to calculate the fitness of each individual. Based on the results of assessing the individuals, the most fit of them are selected for crossing. As a result of crossing selected individuals through the application of the genetic crossover operator, offspring are created, the genetic information of which is formed as a result of the exchange of chromosomal information between the parent individuals. The created descendants form a new population, and some of the descendants mutate, which is expressed in a random change in their genotypes. The stage, including the sequence “Population assessment” - “Selection” - “Crossing” - “Mutation”, is called generation. The evolution of a population consists of a sequence of such generations.
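The sketch below runs this generation cycle (evaluation, selection, crossover, mutation) for a simple genetic algorithm that maximizes an illustrative fitness function over bit strings; it uses tournament selection, one of the selection schemes described below, and the population size, mutation rate and fitness function are arbitrary assumptions:

import random

GENES = 20             # chromosome length (bit string)
POP_SIZE = 30
MUTATION_RATE = 0.02

def fitness(chromosome):
    """Illustrative fitness: number of 1-bits (the "OneMax" toy problem)."""
    return sum(chromosome)

def tournament_select(population, k=2):
    """Tournament selection: pick k random individuals, keep the fittest."""
    return max(random.sample(population, k), key=fitness)

def crossover(parent_a, parent_b):
    """Single-point crossover exchanging chromosome segments between parents."""
    point = random.randint(1, GENES - 1)
    return parent_a[:point] + parent_b[point:]

def mutate(chromosome):
    """Randomly flip genes with a small probability."""
    return [1 - g if random.random() < MUTATION_RATE else g for g in chromosome]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]
for _ in range(50):   # 50 generations of evaluation, selection, crossover, mutation
    population = [mutate(crossover(tournament_select(population),
                                   tournament_select(population)))
                  for _ in range(POP_SIZE)]
best = max(population, key=fitness)
print(fitness(best), best)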

The following algorithms for selecting individuals for crossing are distinguished:

· Panmixia. Both individuals that will form a parental pair are randomly selected from the entire population. Any individual can become a member of several pairs. This approach is universal, but the efficiency of the algorithm decreases with increasing population size.

· Selection. Parents can be individuals with at least average fitness. This approach ensures faster convergence of the algorithm.

· Inbreeding. The method is based on forming a pair from closely related individuals. Here, relatedness is understood as the distance between members of the population, both in the sense of the geometric distance between individuals in the parameter space and in the sense of the Hamming distance between genotypes; accordingly, a distinction is made between genotypic and phenotypic inbreeding. The first member of the pair to be crossed is selected at random, and the second is, with greater probability, the individual closest to it. Inbreeding can be characterized by its tendency to concentrate the search in local nodes, which in effect splits the population into separate local groups around regions of the landscape that are suspected to contain extrema.

· Outbreeding. Formation of a pair based on distant kinship, for the most distant individuals. Outbreeding aims to prevent the algorithm from converging on already found solutions, forcing the algorithm to look at new, unexplored areas.

Algorithms for forming a new population:

· Selection with displacement. Of all individuals with the same genotypes, preference is given to those whose fitness is higher. Thus, two goals are achieved: the best solutions found, which have different chromosome sets, are not lost, and sufficient genetic diversity is constantly maintained in the population. Displacement forms a new population of distantly located individuals, instead of individuals grouping around the current solution found. This method is used for multiextremal problems.

· Elite selection. Elite selection methods guarantee that the best members of the population survive: some of the best individuals pass to the next generation without any changes. The rapid convergence caused by elite selection can be compensated by a suitable method of selecting parent pairs, for which outbreeding is often used. It is precisely the combination "outbreeding - elite selection" that is one of the most effective.

· Tournament selection. Tournament selection implements n tournaments to select n individuals. Each tournament is built on selecting k elements from the population and selecting the best individual among them. The most common is tournament selection with k = 2.

One of the most popular applications of genetic algorithms in Data Mining is the search for the optimal model (an algorithm that matches the specifics of a particular field). Genetic algorithms are primarily used to optimize the topology and weights of neural networks, but they can also be used as an independent tool.

3. Applications

Data Mining technology has a truly wide range of applications, being, in fact, a set of universal tools for analyzing data of any type.

Marketing

One of the very first areas where data mining technologies were applied was the field of marketing. The task with which the development of Data Mining methods began is called shopping basket analysis.

This task is to identify products that buyers tend to purchase together. Knowledge of the shopping basket is necessary for conducting advertising campaigns, forming personal recommendations to customers, developing a strategy for creating inventories of goods and ways of laying them out in sales areas.

Also in marketing, tasks such as determining the target audience of a product for more successful promotion are solved; Temporal pattern research that helps businesses make inventory decisions; creation of predictive models, which allows enterprises to recognize the nature of the needs of various categories of customers with certain behavior; predicting customer loyalty, which allows you to identify in advance the moment of customer departure when analyzing his behavior and, possibly, prevent the loss of a valuable customer.

Industry

One of the important areas in this area is monitoring and quality control, where, using analysis tools, it is possible to predict equipment failure, the occurrence of malfunctions, and plan repair work. Forecasting the popularity of certain features and knowing which features are typically ordered together helps optimize production and focus it on the real needs of consumers.

Medicine

In medicine, data analysis is also used quite successfully. Examples of tasks include analyzing examination results, diagnostics, comparing the effectiveness of treatment methods and medications, analyzing diseases and their spread, and identifying side effects. Data Mining technologies such as association rules and sequential patterns have been successfully used to identify connections between medications and side effects.

Molecular genetics and genetic engineering

Perhaps the most acute and at the same time clear task of discovering patterns in experimental data is in molecular genetics and genetic engineering. Here it is formulated as a definition of markers, which are understood as genetic codes that control certain phenotypic characteristics of a living organism. Such codes can contain hundreds, thousands or more related elements. The result of analytical data analysis is also the relationship discovered by genetic scientists between changes in the human DNA sequence and the risk of developing various diseases.

Applied chemistry

Data Mining methods are also used in the field of applied chemistry. Here the question often arises of clarifying the characteristics of the chemical structure of certain compounds that determine their properties. This task is especially relevant when analyzing complex chemical compounds, the description of which includes hundreds and thousands of structural elements and their connections.

Fighting crime

Data Mining tools have been used in the security field relatively recently, but practical results confirming the effectiveness of data mining in this area have already been obtained. Swiss scientists have developed a system for analyzing protest activity in order to predict future incidents, and a system for tracking emerging cyber threats and hacker activity worldwide. The latter system makes it possible to forecast cyber threats and other information-security risks. Data Mining methods are also successfully used to detect credit card fraud: by analyzing past transactions that later turned out to be fraudulent, a bank identifies certain patterns of such fraud.

Other applications

· Risk analysis. For example, by identifying combinations of factors associated with paid claims, insurers can reduce their liability losses. There is a known case in which a large insurance company in the United States discovered that the amounts paid on claims of married people were twice as high as the amounts paid on claims by single people. The company responded to this new knowledge by revising its general policy of offering discounts to family customers.

· Meteorology. Weather prediction using neural network methods; in particular, self-organizing Kohonen maps are used.

· Personnel policy. Analysis tools help HR services select the most successful candidates based on data analysis of their resumes, and model the characteristics of ideal employees for a particular position.

4. Manufacturers of Data Mining tools

Data Mining tools are traditionally expensive software products. Therefore, until recently, the main consumers of this technology were banks, financial and insurance companies, large trading enterprises, and the main tasks requiring the use of Data Mining were considered to be the assessment of credit and insurance risks and the development of marketing policies, tariff plans and other principles of working with clients. In recent years, the situation has undergone certain changes: relatively inexpensive Data Mining tools and even freely distributed systems have appeared on the software market, which has made this technology accessible to small and medium-sized businesses.

Among paid data analysis tools and systems, the leaders are SAS Institute (SAS Enterprise Miner), SPSS (SPSS, Clementine) and StatSoft (STATISTICA Data Miner). Well-known solutions are also offered by Angoss (Angoss KnowledgeSTUDIO), IBM (IBM SPSS Modeler), Microsoft (Microsoft Analysis Services) and Oracle (Oracle Data Mining).

The choice of free software is also varied. There are both universal analysis tools, such as JHepWork, KNIME, Orange and RapidMiner, and specialized tools, for example Carrot2 (a framework for clustering text data and search results), Chemicalize.org (a solution in the field of applied chemistry) and NLTK (Natural Language Toolkit), a natural language processing toolkit.

5. Criticism of methods

The results of Data Mining depend largely on the quality of data preparation, not on the "wonderful capabilities" of some algorithm or set of algorithms. About 75% of the work in Data Mining consists of collecting data, which takes place before the analysis tools are applied. Careless use of the tools leads to a waste of the company's potential, and sometimes of millions of dollars.

Opinion of Herb Edelstein, a world-renowned expert in the field of Data Mining, Data Warehousing and CRM: "A recent study by Two Crows showed that Data Mining is still at an early stage of development. Many organizations are interested in this technology, but only a few are actively implementing such projects. We also managed to establish another important point: the process of implementing Data Mining in practice turns out to be more complex than expected. Teams are carried away by the myth that Data Mining tools are easy to use. It is assumed that it is enough to run such a tool on a terabyte database and useful information will instantly appear. In fact, a successful Data Mining project requires an understanding of the business activity, knowledge of the data and tools, and of the data analysis process." Thus, before using Data Mining technology, it is necessary to carefully analyze the limitations imposed by the methods and the critical issues associated with it, and to soberly evaluate the technology's capabilities. Critical issues include the following:

1. Technology cannot provide answers to questions that have not been asked. It cannot replace the analyst, but only gives him a powerful tool to facilitate and improve his work.

2. The complexity of developing and operating a Data Mining application.

Because this technology is a multidisciplinary field, developing an application that includes Data Mining requires involving specialists from different fields and ensuring their high-quality interaction.

3. User qualifications.

Different Data Mining tools have interfaces with varying degrees of user-friendliness and require a certain level of user qualification. Therefore, the software must correspond to the user's level of training, and the use of Data Mining should be inextricably linked with improving the user's qualifications. However, there are currently few Data Mining specialists who are also well versed in business processes.

4. Extracting useful information is impossible without a good understanding of the essence of the data.

Careful selection of the model and interpretation of the dependencies or patterns that are discovered are required. Therefore, working with such tools requires close collaboration between a domain expert and a Data Mining tool specialist. The models put into operation must be intelligently integrated into business processes so that they can be evaluated and updated. Recently, Data Mining systems have been supplied as part of data warehousing technology.

5. Difficulty in preparing data.

Successful analysis requires high-quality data preprocessing. According to analysts and database users, the preprocessing process can take up to 80% of the entire Data Mining process.

Thus, for the technology to pay off, considerable effort and time have to be invested in preliminary data analysis, model selection and model tuning.

6. A large percentage of false, unreliable or useless results.

Using Data Mining technologies, you can find truly very valuable information, which can provide a significant advantage in further planning, management, and decision-making. However, the results obtained using Data Mining methods quite often contain false and meaningless conclusions. Many experts argue that Data Mining tools can produce a huge number of statistically unreliable results. To reduce the percentage of such results, it is necessary to check the adequacy of the obtained models on test data. However, it is impossible to completely avoid false conclusions.

7. High cost.

High-quality software is the result of significant effort on the part of the developer. Therefore, Data Mining software has traditionally been an expensive software product.

8. Availability of sufficient representative data.

Data Mining tools, unlike statistical ones, theoretically do not require a strictly defined amount of retrospective data. This feature can lead to the detection of unreliable, false models and, as a result, to incorrect decisions based on them. It is necessary to monitor the statistical significance of the discovered knowledge.


Conclusion

A brief description of the areas of application has been given, and criticism of Data Mining technology and the opinions of experts in this field have been presented.

Bibliography

1. Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. 2nd ed. University of Illinois at Urbana-Champaign.

2. Michael J. A. Berry. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. 2nd ed.

3. Siu Nin Lam. Discovering Association Rules in Data Mining. Department of Computer Science, University of Illinois at Urbana-Champaign.