Data mining can be defined as the process of discovering meaningful and predictive patterns in large data sets. It is the process of extracting useful information from huge amounts of data. It combines traditional data analysis with algorithms for processing large volumes of data. It is an interdisciplinary field that brings together concepts from database systems, statistics, machine learning, computing, information theory, and pattern recognition. This paper discusses the definition of data mining, how it works, the trends in data mining so far, related or similar technologies and concepts, the achievements of its advocates, its reported strengths and weaknesses, its future, and innovations that I can bring or suggest to data mining.
INTRODUCTION TO DATA MINING
In this era, information is vital. Because we believe that information leads to power and success, and thanks to sophisticated technologies such as computers and satellites, we have been collecting tremendous amounts of information. Initially, with the advent of computers and means for mass digital storage, we started collecting and storing all sorts of data, counting on the power of computers to help sort through this amalgam of information. Unfortunately, these massive collections of data stored on disparate structures very rapidly became overwhelming. This initial chaos led to the creation of structured databases and database management systems (DBMS). The database management system has been very important because it manages data efficiently and allows users to perform multiple tasks with ease. It stores, organizes, and manages a large amount of information within a single software application, so that particular information can be retrieved efficiently from a large collection whenever needed. The proliferation of database management systems has also contributed to the recent massive gathering of all sorts of information. Today, we have far more information than we can handle: from business transactions and scientific data to satellite pictures, text reports, and military intelligence. Confronted with huge collections of data, we have now created new needs to help us make better managerial choices. These needs are the automatic summarization of data, the extraction of the “essence” of the information stored, and the discovery of patterns in raw data.
We have been collecting a myriad of data, from simple numerical measurements and text documents to more complex information such as spatial data, multimedia channels, and hypertext documents. The following is a non-exhaustive list of the variety of information collected in digital form in databases and in flat files: business transactions, medical and personal data, scientific data, satellite sensing, text reports, digital media, games, and many more.
WHAT IS DATA MINING?
With the enormous amount of data stored in files, databases, and other repositories, it is increasingly important, if not necessary, to develop powerful means for the analysis and perhaps the interpretation of such data and for the extraction of interesting knowledge that could help in decision-making. This is where data mining comes in. It is also popularly known as Knowledge Discovery in Databases (KDD).
Data mining refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data in databases. “Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems” (Wikipedia, n.d., para. 1).
Data mining is an interdisciplinary subfield of computer science with the overall goal of extracting information (with intelligent methods) from a data set and transforming it into a comprehensible structure for further use. The following figure (Figure 1) shows data mining as an interdisciplinary field (adapted from Han and Kamber, 2006).
Figure 1 Data mining as an interdisciplinary field (Adapted from Han and Kamber, 2006).
While data mining and knowledge discovery in databases (KDD) are frequently treated as synonyms, data mining is actually part of
the knowledge discovery process. The following figure (Figure 2) shows data mining as a step in an iterative knowledge discovery process.
Figure 2 Data mining is the core of the knowledge discovery process.
The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:
• Data cleaning: also known as data cleansing, it is a phase in which noisy data and irrelevant data are removed from the collection.
• Data integration: at this stage, multiple data sources, often heterogeneous, may
be combined in a common source.
• Data selection: at this step, the data relevant to the analysis is decided on and
retrieved from the data collection.
• Data transformation: also known as data consolidation, it is a phase in which the
selected data is transformed into forms appropriate for the mining procedure.
• Data mining: it is the crucial step in which clever techniques are applied to
extract potentially useful patterns.
• Pattern evaluation: in this step, strictly interesting patterns representing
knowledge are identified based on given measures.
• Knowledge representation: this is the final phase, in which the discovered knowledge
is visually represented to the user. This essential step uses visualization
techniques to help users understand and interpret the data mining results.
KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results.
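The steps above can be sketched as a small pipeline. This is only a minimal illustration under stated assumptions, not a definitive implementation: the two record lists, the field names (customer, category, price), the helper names is_clean and kdd, and the min_count measure are all hypothetical, and the "mining" step is reduced to simple frequency counting.

```python
from collections import Counter

# Hypothetical raw records from two sources (assumed data, for illustration only).
store_a = [
    {"customer": "C1", "category": "Comedy", "price": "3.50"},
    {"customer": "C2", "category": None, "price": "4.00"},    # missing category
    {"customer": "C3", "category": "Drama", "price": "3.00"},
]
store_b = [
    {"customer": "C4", "category": "comedy", "price": "3.50"},
    {"customer": "C5", "category": "Comedy", "price": "bad"},  # non-numeric price
]


def is_clean(record):
    """Data cleaning rule: a record needs a category and a numeric price."""
    if record["category"] is None:
        return False
    try:
        float(record["price"])
    except (TypeError, ValueError):
        return False
    return True


def kdd(sources, min_count=2):
    # Data cleaning: drop noisy records from each source.
    cleaned = [[r for r in source if is_clean(r)] for source in sources]
    # Data integration: combine the sources into one collection.
    integrated = [r for source in cleaned for r in source]
    # Data selection: keep only the attribute relevant to this analysis.
    selected = [r["category"] for r in integrated]
    # Data transformation: normalize values into a consistent form.
    transformed = [c.strip().lower() for c in selected]
    # Data mining: here, simply count how often each category occurs.
    patterns = Counter(transformed)
    # Pattern evaluation: keep patterns that meet the given measure.
    interesting = {c: n for c, n in patterns.items() if n >= min_count}
    # Knowledge representation: a small text bar chart for the user.
    return [f"{c}: {'#' * n}" for c, n in sorted(interesting.items())]


report = kdd([store_a, store_b])
for line in report:
    print(line)
```

Calling kdd again with a different min_count mirrors the iterative nature of the process: the evaluation measure is adjusted and the same data yields a different set of reported patterns.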
The term “data mining” is, in fact, a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. The term is also frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as to any application of computer decision support systems, including artificial intelligence (e.g., machine learning) and business intelligence.
The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques such as spatial indices, which spatial databases (databases that store information related to objects in space) use to optimize spatial queries. These patterns can then be seen as a kind of summary of the input data and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which a decision support system can then use to obtain more accurate prediction results.
The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.
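One of the pattern types mentioned in this section, unusual records (anomaly detection), can be illustrated with a very simple sketch. It assumes a z-score rule (flag values that lie more than a chosen number of standard deviations from the mean), which is only one of many possible techniques. The function name unusual_records, the threshold, and the sample figures are hypothetical.

```python
from statistics import mean, pstdev


def unusual_records(values, threshold=2.0):
    """Return the values whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical number of rentals per month for one customer.
rentals_per_month = [4, 5, 3, 6, 4, 5, 4, 40]
outliers = unusual_records(rentals_per_month)
print(outliers)
```

With this sample only the month with 40 rentals is flagged. On very small samples a z-score rule is weak, because a single extreme value also inflates the standard deviation it is measured against.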
WHAT KIND OF DATA CAN BE MINED?
In principle, data mining is not specific to one type of media or data. Data mining should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data, and the challenges presented by different types of data vary significantly. Data mining is being put into use and studied for databases, including relational databases, object-relational databases, and object-oriented databases; data warehouses; transactional databases; unstructured and semi-structured repositories such as the World Wide Web; advanced databases such as spatial databases, multimedia databases, time-series databases, and textual databases; and even flat files. Some of these are described in more detail below:
• Flat files: Flat files are actually the most common data source for data mining
algorithms, especially at the research level. Flat files are simple data files in text
or binary format with a structure known by the data mining algorithm to be
applied. The data in these files can be transactions, time-series data, scientific measurements, etc.
• Relational Databases: A relational database consists of a set of tables containing either values of entity attributes or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples. A tuple in a relational table corresponds to either an object or a relationship between objects and is identified by a set of attribute
values representing a unique key. In Figure 3 we present some relations, Customer, Items, and Borrow, representing business activity in a fictitious video store, OurVideoStore. These relations are just a subset of what could be a
database for the video store and are given as an example.
Figure 3 Fragments of some relations from a relational database for OurVideoStore.
The most commonly used query language for a relational database is SQL, which
allows retrieval and manipulation of the data stored in the tables, as well as the
calculation of aggregate functions such as average, sum, min, max, and count. For
instance, an SQL query to count the videos in each category would be:
SELECT category, count(*) FROM Items WHERE type = 'video' GROUP BY category;
Data mining algorithms using relational databases can be more versatile than data
mining algorithms specifically written for flat files, since they can take advantage
of the structure inherent to relational databases. While data mining can benefit
from SQL for data selection, transformation, and consolidation, it goes beyond
what SQL could provide, such as predicting, comparing, detecting deviations, etc.
• Data Warehouses: A data warehouse, as a storehouse, is a repository of data
collected from multiple data sources and is intended to be used as a whole under the same unified schema. A data warehouse gives the
option to analyze data from different sources under the same roof. Let us suppose
that OurVideoStore becomes a franchise in Ghana. The many video stores belonging to the OurVideoStore company may have different databases and different structures. If an executive of the company wants to access the data from all stores for strategic decision-making, future direction, marketing, etc., it would be more
appropriate to store all the data in one site with a homogeneous structure that
allows interactive analysis. In other words, data from the different stores would be
loaded, cleaned, transformed and integrated together. To facilitate decision making and multi-dimensional views, data warehouses are usually modeled by a multi-dimensional data structure.
• Transaction Databases: A transaction database is a set of records representing
transactions, each with a time stamp, an identifier and a set of items. Associated
with the transaction files could also be descriptive data for the items.
• Multimedia Databases: Multimedia databases include video, images, sound and
text media. They can be stored in extended object-relational or object-oriented
databases, or simply on a file system. Multimedia is characterized by its high
dimensionality, which makes data mining even more challenging. Data mining
from multimedia repositories may require computer vision, computer graphics,
image interpretation, and natural language processing methodologies.
• Spatial Databases: Spatial databases are databases that, in addition to usual data,
store geographical information like maps, and global or regional positioning. Such
spatial databases present new challenges to data mining algorithms.
Figure 4 Spatial OLAP (Online Analytical Processing)
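The SQL aggregate query discussed under relational databases can be tried out with Python's built-in sqlite3 module. The sketch below is only illustrative. The layout of the Items table (itemID, type, title, category) and the sample rows are assumptions, since the actual OurVideoStore relations appear only in Figure 3.

```python
import sqlite3

# In-memory database; the table layout below is an assumed, simplified Items relation.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Items ("
    "itemID INTEGER PRIMARY KEY, type TEXT, title TEXT, category TEXT)"
)

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO Items (type, title, category) VALUES (?, ?, ?)",
    [
        ("video", "Title A", "Comedy"),
        ("video", "Title B", "Comedy"),
        ("video", "Title C", "Drama"),
        ("game", "Title D", "Action"),
    ],
)

# Count the videos in each category (aggregate function plus GROUP BY).
rows = conn.execute(
    "SELECT category, count(*) FROM Items "
    "WHERE type = 'video' GROUP BY category ORDER BY category"
).fetchall()
conn.close()

for category, n in rows:
    print(category, n)
```

The query answers a selection and aggregation question only. Tasks such as prediction or deviation detection go beyond what a plain SQL query of this kind provides.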
WHAT CAN BE DISCOVERED?
The kinds of patterns that can be discovered depend upon the data mining tasks
employed. By and large, there are two types of data mining tasks: descriptive data mining
tasks and predictive data mining tasks. The descriptive data mining tasks describe the general properties of the existing data, and predictive data mining tasks attempt to do predictions based on inference on available data.
The data mining functionalities and the variety of knowledge they discover are briefly presented in the following list:
• Characterization: Data characterization is a summarization of general features of
objects in a target class, and produces what is called characteristic rules. The data
relevant to a user-specified class are normally retrieved by a database query and
run through a summarization module to extract the essence of the data at different
levels of abstractions. For example, one may want to characterize the
OurVideoStore customers who regularly rent more than 30 movies a year. With
concept hierarchies on the attributes describing the target class, the attribute-oriented induction method can be used, for example, to carry out data summarization.
• Discrimination: Data discrimination produces what are called discriminant rules
and is basically the comparison of the general features of objects between two
classes referred to as the target class and the contrasting class. For example, one
may want to compare the general characteristics of the customers who rented
more than 30 movies in the last year with those whose rental count is lower
than 5. The techniques used for data discrimination are very similar to the
techniques used for data characterization with the exception that data
discrimination results include comparative measures.
• Association analysis: Association analysis is the discovery of what are
commonly called association rules. It studies the frequency of items occurring
together in transactional databases, and based on a threshold called support,
identifies the frequent itemsets. Another threshold, confidence, which is the
conditional probability that an item appears in a transaction when another item
appears, is used to pinpoint association rules.
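The two thresholds described above can be made concrete with a small sketch. It assumes the usual definitions: support is the fraction of transactions containing an itemset, and confidence is the fraction of transactions containing the antecedent that also contain the consequent. The transactions, item names, function names, and threshold values are hypothetical.

```python
# Hypothetical transactions, each represented as a set of items.
transactions = [
    {"popcorn", "comedy_dvd"},
    {"popcorn", "comedy_dvd", "soda"},
    {"popcorn", "soda"},
    {"drama_dvd"},
    {"popcorn", "comedy_dvd"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in the itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated conditional probability of the consequent given the antecedent."""
    base = count(antecedent, transactions)
    if base == 0:
        return 0.0
    both = count(set(antecedent) | set(consequent), transactions)
    return both / base


# Candidate rule: popcorn -> comedy_dvd
s = support({"popcorn", "comedy_dvd"}, transactions)
c = confidence({"popcorn"}, {"comedy_dvd"}, transactions)

# Assumed thresholds for this illustration.
min_support, min_confidence = 0.5, 0.7
rule_is_kept = s >= min_support and c >= min_confidence
print(s, c, rule_is_kept)
```

Here the pair appears in 3 of 5 transactions and popcorn appears in 4, so the rule has support 3/5 and confidence 3/4 and passes both assumed thresholds. Practical algorithms avoid enumerating every itemset, but the measures they use are the ones computed here.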
• Classification: Classification analysis is the organization of data in given classes.
Also known as supervised classification, the classification uses given class labels
to order the objects in the data collection. Classification approaches normally use
a training set where all objects are already associated with known class labels.
The classification algorithm learns from the training set and builds a model. The
model is used to classify new objects. For example, after starting a credit policy,
the OurVideoStore managers could analyze the customers’ behaviors vis-à-vis
their credit, and label accordingly the customers who received credits with three
possible labels “safe”, “risky” and “very risky”. The classification analysis would generate a model that could be used to either accept or reject credit requests in the future.
• Prediction: Prediction has attracted considerable attention given the potential
implications of successful forecasting in a business context. There are two major
types of predictions: one can either try to predict some unavailable data values or
pending trends or predict a class label for some data. The latter is tied to
classification. Once a classification model is built based on a training set, the class
label of an object can be foreseen based on the attribute values of the object and
the attribute values of the classes. Prediction, however, more often refers to the
forecasting of missing numerical values or of increase/decrease trends in time-related
data. The major idea is to use a large number of past values to consider probable