Data Mining
Data mining is the analysis of data for relationships that have not previously been discovered. For example, the sales records for a particular brand of fishing rod might, if sufficiently analyzed and related to other market data, reveal a seasonal correlation with purchases of camping equipment by the same customers.
Data mining results include:
- Associations, in which one event is correlated with another event (e.g., soft drink purchasers buy popcorn a certain percentage of the time)
- Sequences, in which one event leads to another later event (e.g., a dining room table and chairs purchase followed by a purchase of chandeliers)
- Classification, or the recognition of patterns and a resulting new organization of data (e.g., profiles of customers who make purchases over $100)
- Clustering, or finding and visualizing groups of facts not previously known
- Forecasting, or simply discovering patterns in the data that can lead to predictions about the future
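The association result above (soft drink purchasers buying popcorn "a certain percentage of the time") can be illustrated with a minimal sketch. The baskets below are invented for illustration, and the helper names `count`, `support`, and `confidence` are hypothetical; the sketch only shows how such a percentage might be computed from raw purchase records.

```python
# Hypothetical market baskets; each set is one customer's purchase.
transactions = [
    {"soft drink", "popcorn"},
    {"soft drink", "popcorn", "candy"},
    {"soft drink", "chips"},
    {"popcorn"},
    {"soft drink", "popcorn"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)

def support(itemset, baskets):
    """Fraction of all baskets that contain the whole itemset."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Share of baskets with the antecedent that also hold the consequent."""
    both = count(set(antecedent) | set(consequent), baskets)
    return both / count(antecedent, baskets)

# How often do the two items appear together, and how often does a
# soft drink purchase come with popcorn?
print(support({"soft drink", "popcorn"}, transactions))
print(confidence({"soft drink"}, {"popcorn"}, transactions))
```

Support measures how common the combination is overall, while confidence measures how often the second item accompanies the first; both are needed to judge whether an association is worth acting on.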
Data mining gets its name from the similarities between searching for valuable business information in a large database and mining a promising geologic feature for a vein of valuable ore. Both processes require either sifting and sorting through an enormous amount of material or methodically probing it to find exactly where the value resides. The databases involved are often measured in terabytes (a terabyte is a measure of computer storage capacity equal to 2 to the 40th power bytes or, in decimal, approximately a thousand billion bytes).
Assuming databases of sufficient size and quality (that is, containing reliable data), data mining technology can generate new business opportunities by providing these capabilities:
- Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information (information that can help predict future behaviors) in large databases. Questions that traditionally required extensive hands-on analysis can now be answered quickly and directly from the data. Typical examples of predictive problems include targeted marketing, forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events. Several teams from the National Basketball Association (NBA) used data mining in the early 1990s to predict which players on a team would score the most points from which sections of the court against a certain opponent.
- Automated discovery of previously unknown patterns. Data mining tools can sweep through databases in a single pass and identify previously hidden patterns. The retail industry uses pattern discovery to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors or employee theft.
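As a hedged illustration of the anomalous-data problem mentioned above, the following sketch flags values that sit far from the rest of a batch. The article does not name a method; the modified z-score based on the median absolute deviation is swapped in here as one common choice. The function name `flag_anomalies`, the 3.5 cutoff, and the transaction amounts are all assumptions made for the example.

```python
from statistics import median

def flag_anomalies(amounts, threshold=3.5):
    """Return the values whose modified z-score exceeds threshold.

    The score uses the median and the median absolute deviation (MAD),
    so a single extreme value does not distort the baseline.
    """
    med = median(amounts)
    mad = median(abs(x - med) for x in amounts)
    if mad == 0:
        return []  # no spread to measure against
    return [x for x in amounts if 0.6745 * abs(x - med) / mad > threshold]

# Hypothetical transaction amounts; one entry looks like a keying error.
amounts = [42.10, 39.95, 45.00, 41.25, 40.80, 43.60, 980.00, 44.15]
print(flag_anomalies(amounts))
```

A flagged value is only a candidate for review: it may be a keying error, a fraudulent transaction, or a legitimate but unusual purchase.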
When implemented on high-performance parallel processing systems, data mining tools can analyze massive databases in minutes. Faster processing allows users more opportunity to experiment with different models to understand complex data. High speed makes it practical for users to analyze huge quantities of data. Larger databases, in turn, yield higher probabilities for accurate predictions of future events.
Data Mining Techniques
Because no single data-mining algorithm suits every data set, data mining draws on a combination of disciplines and techniques. Hybrid investigative methods are therefore needed, such as the following:
- Query tools. The use of traditional query tools, such as structured query language (SQL), provides a cursory analysis of a data set. As much as 80 percent of data-set relationships can be accounted for by using SQL. To uncover the remaining hidden 20 percent, more advanced techniques are required.
- Visualization Techniques. Visualization techniques are another way to obtain an overall analysis of a data set and show where patterns might be found. One such technique is the scatter diagram, where information is plotted on an X-Y plot. Small sub-clusters in the plot often point to potentially interesting relationships, which more advanced data mining techniques may then extract.
- Likelihood and Distance. Using three-dimensional graphical representations of plotted data sets provides investigators with an idea of relationships between those data sets. Records in proximity to each other are very alike, while those that are far removed from each other represent data with little in common.
- Online Analytical Processing (OLAP) Tools. When the need for information goes beyond two-dimensional plots, tools that can provide a multidimensional view of many data sets are required. Because there is no standard order or number of queries that can be asked of a database, the need for OLAP tools is paramount for representing the multidimensional relationships.
- k-Nearest Neighbor. If records can be considered points in a data space, then records that are close to each other could be conceived as "living" in each other's neighborhood. The letter "k" in k-nearest represents the number of "neighbors" being investigated. For example, 8-nearest neighbor looks at eight neighbors. Simple k-nearest neighbor is more of a search method than a learning technique, and it can be quite costly to run, because each record examined must be compared against every other record. For that reason, the simple k-nearest neighbor technique is used mostly for sub-samples or small data sets.
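The k-nearest neighbor idea described above can be sketched in a few lines. This is a minimal illustration, not a production method: the records, the labels, the choice of Euclidean distance, and the helper names `k_nearest` and `classify` are assumptions made for the example.

```python
import math
from collections import Counter

def k_nearest(query, records, k):
    """Return the k records closest to query by Euclidean distance."""
    return sorted(records, key=lambda r: math.dist(query, r[0]))[:k]

def classify(query, records, k):
    """Label query by majority vote among its k nearest neighbors."""
    votes = Counter(label for _, label in k_nearest(query, records, k))
    return votes.most_common(1)[0][0]

# Hypothetical records: (age, purchase amount) paired with a spending label.
records = [
    ((25, 40), "low"), ((30, 45), "low"), ((28, 50), "low"),
    ((55, 120), "high"), ((60, 130), "high"), ((52, 110), "high"),
]

# Label a new record by looking at its three closest neighbors.
print(classify((27, 48), records, k=3))
```

Each call measures the distance from the query to every stored record before sorting, which is why the simple form of the technique becomes slow as the data set grows.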
These are just a few of the many tools and techniques used to extract meaningful relationships hidden deep within very large databases. Setting up a data-mining system is by no means a trivial task. To be successful, the long-term goal must be to establish a self-learning organization that understands and recognizes the importance of making optimal use of the information it generates and captures.