
Data Mining

Unit 1

Introduction to Data Mining

Structure
1.1 Introduction
    Objectives
1.2 Meaning and Working of Data Mining
1.3 Data, Information and Knowledge
1.4 Data Warehousing and Data Mining Relation
1.5 Data Mining and Knowledge Discovery
1.6 Data Mining and OLAP
1.7 Data Mining and Statistics
1.8 Data Mining Technologies
1.9 Data Mining Software
1.10 Summary
1.11 Terminal Questions
1.12 Answers

1.1 Introduction
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information, that is, information that can be used to increase revenue, cut costs, or both.

Objectives
At the end of this unit, you should be able to:
explain the basics of data mining
describe the relationship between data mining and various business intelligence tools such as data warehousing, OLAP and statistics.

1.2 Meaning and Working of Data Mining


Data mining is concerned with finding hidden relationships present in business data to allow businesses to make predictions for future use. It is the data-driven extraction of non-obvious but useful information from large databases. Data mining has emerged as a key business intelligence technology. But where can it be useful, and how does it work?

Sikkim Manipal University


We will illustrate the purpose of data mining with a point-of-sale (POS) system. Supermarkets usually employ a POS system that collects data on each item purchased: the brand name, category, size, time and date of the purchase, and the price paid. In addition, the supermarket usually has a customer rewards program, which is also an input to the POS system. This information can directly link the products purchased with an individual. The supermarket stores all this data, for every purchase made over many years, in a database.

Now you have a database with millions of records. What will you do with this huge volume of data? How do you use it to forecast or control your business activities? The solution is data mining. Using data mining techniques or algorithms, you can uncover trends, statistical correlations, relationships and patterns that can help your business become more efficient, effective and streamlined. The supermarket can now figure out which brands sell the most, which time of the day, week, month or year is the busiest, and what products consumers buy along with certain items. For instance, if a person buys white bread, what other items would they be inclined to buy? Typically we find that they are peanut butter and jelly. A supermarket can obtain a great deal of useful information just by mining the data it has already collected.

Several technical bodies have given definitions of data mining. Some of them are listed below.

Data Mining Definitions
Data mining is the efficient discovery of valuable, non-obvious information from a large collection of data.
Knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in the data.
It is the automatic discovery of new facts and relationships in data that are like valuable nuggets of business data.
It is the process of extracting previously unknown, valid, and actionable information from large databases and then using the information to make crucial business decisions.

It is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, visualization, and neural networks.

Data mining streamlines the transformation of masses of information into meaningful knowledge, which is the essence, or bottom line, of business intelligence. Typical techniques for data mining involve decision trees, neural networks, nearest neighbor clustering, fuzzy logic, and genetic algorithms.

How does data mining work
Although data mining is still in its infancy, companies in a wide range of industries, including finance, health care, manufacturing and transportation, are already using data mining tools and techniques to take advantage of historical data. The whole logic of data mining is based on modeling. Modeling is simply the act of building a model (a set of examples or a mathematical relationship) based on data from situations where the answer is known and then applying the model to other situations where the answers are not known. Modeling techniques have been around for centuries, of course, but only recently have the data storage and communication capabilities required to collect and store huge amounts of data, and the computational power to automate modeling techniques to work directly on the data, become available.

As a simple example of building a model, consider the director of marketing for a telecommunications company. He would like to focus his marketing and sales efforts on segments of the population most likely to become big users of long-distance services. He knows a lot about his customers, but it is impossible to discern the common characteristics of his best customers because there are so many variables. From the existing database of customers, which contains information such as age, sex, credit history, income, zip code and occupation, he can use data mining tools, such as neural networks, to identify the characteristics of those customers who make many long-distance calls. For instance, he might learn that his best customers are unmarried females between the ages of 21 and 35 who earn in excess of $60,000 per year. This, then, is his model for high-value customers, and he would budget his marketing efforts accordingly.
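The modeling idea described above, building a model from cases where the answer is known and then applying it where the answer is not known, can be sketched in a few lines of Python. The customer records, the field names and the simple frequency-based "model" below are illustrative assumptions and not data or a method taken from this unit; a real tool such as a neural network would learn far richer patterns.

```python
from collections import defaultdict

# Hypothetical historical customers for whom the answer is known:
# (age_band, marital_status, heavy_long_distance_user)
known_customers = [
    ("21-35", "unmarried", True),
    ("21-35", "unmarried", True),
    ("21-35", "married", False),
    ("36-50", "unmarried", False),
    ("36-50", "married", False),
    ("21-35", "unmarried", False),
]

def build_model(records):
    """Return, per customer segment, the fraction of heavy users observed."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [heavy users, total]
    for age_band, status, heavy in records:
        segment = (age_band, status)
        counts[segment][0] += int(heavy)
        counts[segment][1] += 1
    return {seg: heavy / total for seg, (heavy, total) in counts.items()}

def predict(model, age_band, status, threshold=0.5):
    """Apply the model to a customer whose usage is not yet known."""
    # Segments never seen in the historical data default to "not heavy".
    return model.get((age_band, status), 0.0) > threshold

model = build_model(known_customers)
print(predict(model, "21-35", "unmarried"))  # segment rate is 2 out of 3
print(predict(model, "36-50", "married"))    # segment rate is 0 out of 1
```

The threshold of 0.5 is an arbitrary choice for the sketch; in practice it would be set from the cost of marketing to a customer who turns out not to be a heavy user.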

Remember, data mining is the task of discovering interesting patterns from large amounts of data where the data can be stored in databases, data warehouses or other information repositories.
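The supermarket example earlier in this section (which items are bought together with white bread) can be illustrated with a small co-occurrence count. This is only a sketch: the baskets are invented for illustration, and the helper names pair_counts and confidence are hypothetical, not part of any particular data mining product.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale baskets; each set is one purchase.
baskets = [
    {"white bread", "peanut butter", "jelly"},
    {"white bread", "peanut butter"},
    {"white bread", "milk"},
    {"milk", "eggs"},
    {"white bread", "peanut butter", "milk"},
]

def pair_counts(transactions):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives every pair one canonical (alphabetical) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def confidence(transactions, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in transactions if antecedent in b]
    if not with_antecedent:
        return 0.0
    both = sum(1 for b in with_antecedent if consequent in b)
    return both / len(with_antecedent)

counts = pair_counts(baskets)
print(counts[("peanut butter", "white bread")])
print(confidence(baskets, "white bread", "peanut butter"))
```

Counting every pair is affordable only for small item catalogs; association-rule algorithms exist precisely to prune this search on large databases.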

1.3 Data, Information, and Knowledge


Data are any facts, numbers, or text that can be processed by a computer. Today organizations are accumulating vast and growing amounts of data in different formats and databases. This includes:
Operational or transactional data such as sales, cost, inventory, payroll, and accounting.
Non-operational data such as industry sales, forecast data, and macroeconomic data.
Metadata, that is, data about the data itself, such as logical database design or data dictionary definitions.

Information: The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point-of-sale transaction data can yield information on which products are selling and when.

Knowledge: Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or a retailer could determine those items that are most susceptible to promotional efforts.

Self Assessment Questions
1. Data mining is the task of _________ interesting patterns from large amounts of data.
2. Information can be converted into knowledge about _______ patterns and _____ trends.
3. Data about data is called _____________________.
4. Facts, numbers, or text are called _________________.
5. ____________ and _________________ are the key emerging business intelligence technologies.
6. Data mining is also called ___________________.

1.4 Data Warehousing and Data Mining Relation


The nexus between data warehousing and data mining is indisputable. Many business organizations use these technologies together. This section describes the relation between the data warehouse and data mining.

Data mining is concerned with finding hidden relationships present in business data to allow businesses to make predictions for future use. It is the process of data-driven extraction of not so obvious but useful information from large databases. Data mining has emerged as a key business intelligence technology.

Data mining is a multidisciplinary field drawing on work from statistics, database technology, artificial intelligence, pattern recognition, machine learning, information theory, knowledge acquisition, information retrieval, high-performance computing, and data visualization. The aim of data mining is to extract implicit, previously unknown and potentially useful (or actionable) patterns from data. Data mining consists of many up-to-date techniques such as classification (decision trees, naive Bayes classifier, k-nearest neighbor, and neural networks), clustering (k-means, hierarchical clustering, and density-based clustering), and association (one-dimensional, multidimensional, multilevel and constraint-based association). Many years of practice show that data mining is a process, and its successful application requires data preprocessing (dimensionality reduction, cleaning, noise/outlier removal), post-processing (understandability, summary, presentation), a good understanding of the problem domain, and domain expertise.

Data warehousing is defined as a process of centralized data management and retrieval. Data warehousing, like data mining, is a relatively new term although the concept itself has been around for years. Data warehousing represents an ideal vision of maintaining a central repository of all organizational data.

A data warehouse is an enabled relational database system designed to support very large databases (VLDB) at a significantly higher level of performance and manageability. A data warehouse is an environment, not a product. It is an architectural construct that makes available information that is hard to access or present in traditional operational data stores.


Any organization, or any system in general, is faced with a wealth of data that is maintained and stored, but the inability to discover valuable, often previously unknown information hidden in the data prevents it from transforming these data into knowledge or wisdom. To overcome this, the following steps are to be followed.
1. Capture and integrate both the internal and external data into a comprehensive view.
2. Mine the integrated data for information.
3. Organize and present the information and knowledge in ways that expedite complex decision making.
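Of the clustering techniques named in this section, k-means is perhaps the simplest to sketch. The version below works on one-dimensional numbers only, uses hypothetical monthly spend figures, and starts from centers chosen by hand; these are all simplifying assumptions, and production implementations handle many dimensions and choose starting centers more carefully.

```python
def kmeans_1d(values, centers, iterations=10):
    """Cluster numbers around `centers` by repeated assign-and-update steps."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, clusters

# Hypothetical monthly spend per customer.
spend = [10, 12, 11, 50, 52, 49]
centers, clusters = kmeans_1d(spend, centers=[0, 60])
print(centers)
print(clusters)
```

The two resulting groups can be read as a low-spend and a high-spend customer segment, which is the kind of grouping clustering is meant to surface without labeled examples.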

1.5 Data Mining and Knowledge Discovery Process


Data mining is not specific to any industry. It requires intelligent technologies and the willingness to explore the possibility of hidden knowledge that resides in the data. Data mining is also referred to as knowledge discovery in databases (KDD). See Fig. 1.1.

KDD: the overall process of discovering useful knowledge from data.
Data mining: an application of specific algorithms for extracting patterns from data. Data mining is a step in the KDD process.

Knowledge Discovery process
1. Develop an understanding of the application domain and identify the goal.
2. Create a target dataset
   o Selecting a dataset or focusing on a subset of samples or variables on which to make discoveries
3. Data cleaning and preprocessing (Preprocessing)
   o Removal of noise and outliers
   o Collecting necessary information to model or account for noise
   o Handling of missing data
   o Accounting for time sequence information
4. Data reduction and projection (Preprocessing)
   o Finding useful features to represent the data relative to the goal
   o Dimensionality reduction/transformation ==> reduce number of variables
   o Identification of invariant representations
5. Selection of appropriate data-mining task (Data Mining Task)
   o Summarization, classification, regression, clustering, etc.
6. Selection of data-mining algorithm(s) (Data Mining Task)
   o Methods to search for patterns
   o Decision of which models and parameters may be appropriate
   o Match method to goal of KDD process
7. Data mining
   o Searching for patterns of interest in one or more representational forms
8. Interpretation and visualization
   o Interpretation of mined patterns
   o Visualization of extracted patterns and models
   o Visualization of the data given the extracted models
9. Consolidating discovered knowledge
   o Incorporating the discovered knowledge into another system
   o Documenting and reporting knowledge to interested parties
   o Checking for inconsistencies with other previously extracted or believed knowledge

Fig. 1.1: Steps in Knowledge Discovery process
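Steps 3 and 4 of the process (cleaning, handling missing data, and reducing the number of variables) can be sketched as a small pipeline. The records, the median-based fill, the outlier rule of "more than five times the median" and the function names are all assumptions chosen for illustration; the unit itself does not prescribe particular rules.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
raw = [
    {"age": 25, "income": 40000, "store_id": 7},
    {"age": None, "income": 42000, "store_id": 7},
    {"age": 31, "income": 39000, "store_id": 7},
    {"age": 29, "income": 900000, "store_id": 7},  # likely an outlier
    {"age": 40, "income": 45000, "store_id": 7},
]

def fill_missing(records, field):
    """Step 3: replace missing values of `field` with the median of known values."""
    known = [r[field] for r in records if r[field] is not None]
    fill = median(known)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]

def drop_outliers(records, field, factor=5):
    """Step 3: drop records whose `field` exceeds `factor` times the median."""
    limit = factor * median(r[field] for r in records)
    return [r for r in records if r[field] <= limit]

def drop_constant_fields(records):
    """Step 4: remove variables that take a single value and so carry no signal."""
    keep = [f for f in records[0] if len({r[f] for r in records}) > 1]
    return [{f: r[f] for f in keep} for r in records]

cleaned = drop_constant_fields(drop_outliers(fill_missing(raw, "age"), "income"))
print(cleaned)
```

Each function returns a new list rather than modifying its input, so the stages can be reordered or inspected individually while exploring the data.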


1.6 Data Mining and OLAP


Online Analytical Processing (OLAP) is a technology that is used to create decision support software. OLAP and data mining are used to solve different kinds of analytic problems:
OLAP summarizes data and makes forecasts. For example, OLAP answers questions like "What are the average sales of insurance policies, by region and by year?"
Data mining discovers hidden patterns in data. Data mining operates at a detailed level instead of a summary level. Data mining answers questions like "Who is likely to buy insurance policies in the next six months, and what are the characteristics of these likely buyers?"

OLAP and data mining can complement each other. For example, OLAP might pinpoint problems with sales of mutual funds in a certain region. Data mining could then be used to gain insight about the behavior of individual customers in the region. Finally, after data mining predicts something like a 5% increase in sales, OLAP can be used to track the net income.

OLAP systems also provide the following benefits:
Fast access, calculations, and summaries of an organization's data
Support for multiple user access and multiple queries
The ability to handle multiple hierarchies and levels of data
The ability to pre-summarize and consolidate data for faster query and reporting functions
The ability to expand the number of dimensions and levels of data as a business grows

Self Assessment Questions
7. Online Analytical Processing (OLAP) is a technology that is used to create _______________ software.
8. OLAP and data mining can _______ each other.
9. OLAP provides support for ________ user access and multiple queries.
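The OLAP-style question quoted in this section, average sales of insurance policies by region and by year, is at heart a grouped summary. The sketch below shows that summarization in plain Python; the sales rows and the helper average_by are hypothetical, and a real OLAP server would pre-compute and store such aggregates rather than scan rows on demand.

```python
from collections import defaultdict

# Hypothetical policy sales: (region, year, sale_amount)
sales = [
    ("North", 2004, 1200.0),
    ("North", 2004, 800.0),
    ("North", 2005, 1500.0),
    ("South", 2004, 600.0),
    ("South", 2005, 900.0),
    ("South", 2005, 1100.0),
]

def average_by(records, key_indexes, value_index):
    """Average the value column for every combination of the key columns."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for row in records:
        key = tuple(row[i] for i in key_indexes)
        totals[key][0] += row[value_index]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

# Summary at the (region, year) level of detail.
by_region_year = average_by(sales, key_indexes=(0, 1), value_index=2)
# Rolling up to a coarser level: region only.
by_region = average_by(sales, key_indexes=(0,), value_index=2)
print(by_region_year[("North", 2004)])
print(by_region[("South",)])
```

Note that the output is a summary per group. Predicting which individual customer will buy next, the data mining question in this section, needs the detailed records and a model rather than an aggregate.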

1.7 Data Mining and Statistics


Statistics is a branch of mathematics. Statistical techniques are incorporated into data mining methods. Data mining methods or techniques find the relations between variables or data in a given database and express these relations using statistical nomenclature. Without statistics, there would be no data mining, as statistics is the foundation of most technologies on which data mining is built.

Classical statistics embraces concepts such as regression analysis, standard distribution, standard deviation, standard variance, discriminant analysis, cluster analysis, and confidence intervals, all of which are used to study data and data relationships. These are the building blocks that underpin more advanced statistical analyses. Certainly, at the heart of today's data mining tools and techniques, classical statistical analysis plays a significant role.

Note: Data mining has its roots in statistics, artificial intelligence and machine learning. Statistics, AI and machine learning are outside the scope of this unit, so we do not explore them further here. The details of data mining techniques will be explored in the forthcoming units.
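Several of the classical measures listed in this section (standard deviation, variance and a confidence interval) can be computed with Python's standard library. The sales figures are hypothetical, and the interval uses the normal critical value 1.96 as a stated simplification.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical daily sales figures.
daily_sales = [102.0, 98.0, 110.0, 95.0, 105.0, 100.0, 97.0, 93.0]

def confidence_interval_95(sample):
    """Approximate 95% interval for the mean using the normal critical value 1.96."""
    # Assumption: a normal approximation is acceptable; for a sample this
    # small a t-distribution critical value would give a wider interval.
    m = mean(sample)
    half_width = 1.96 * stdev(sample) / sqrt(len(sample))
    return m - half_width, m + half_width

print(mean(daily_sales))
print(variance(daily_sales))  # sample variance
print(stdev(daily_sales))     # sample standard deviation
print(confidence_interval_95(daily_sales))
```

The standard deviation is the square root of the variance, and the interval narrows as the sample grows, which is one reason large databases make such estimates more useful.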

1.8 Data Mining Technologies


The analytical techniques used in data mining are often well-known mathematical algorithms and techniques. What is new is the application of those techniques to general business problems, made possible by the increased availability of data and by inexpensive storage and processing power. Also, the use of graphical interfaces has led to tools becoming available that business experts can easily use. Some of the techniques are given below.

Artificial neural networks
Nonlinear predictive models that learn through training and resemble biological neural networks in structure.

Decision trees
Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset.

Rule induction
The extraction of useful if-then rules from databases based on statistical significance.

Genetic algorithms
Optimization techniques based on the concepts of genetic combination, mutation, and natural selection.


Nearest neighbor
A classification technique that classifies each record based on the records most similar to it in a historical database.

Data mining has different applications in industry. Some of them are given below.
Identifying new customers
Predicting customer buying habits
Confirming suitable loan applicants
Revealing fraud
Relationship marketing
Managing equity portfolios
Diagnosing medical problems
Inventory management
Conducting certain aspects of marketing
Customer segmentation
Web site design and promotion

Data Mining Industries
Banking
Insurance
Credit marketing
Telecommunications
Pharmaceuticals
Bioinformatics, etc.
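The nearest neighbor technique described in this section can be sketched as follows, using the loan applicant application from the list above as a setting. The applicant records, the two features (income and debt, in thousands) and the choice of k = 3 are assumptions made for illustration; real use would scale the features so that no single one dominates the distance.

```python
from collections import Counter
from math import dist

# Hypothetical historical loan applicants: ((income_k, debt_k), outcome)
history = [
    ((60.0, 5.0), "approve"),
    ((75.0, 10.0), "approve"),
    ((58.0, 8.0), "approve"),
    ((30.0, 25.0), "reject"),
    ((28.0, 30.0), "reject"),
    ((35.0, 28.0), "reject"),
]

def classify(record, labeled, k=3):
    """Label `record` by majority vote among its k most similar past records."""
    # Similarity here is plain Euclidean distance between feature tuples.
    neighbors = sorted(labeled, key=lambda item: dist(record, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(classify((62.0, 7.0), history))
print(classify((32.0, 27.0), history))
```

No model is built in advance: every classification searches the historical database, which is why the technique is simple to explain but can be slow on very large tables.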

1.9 Data Mining Software


A number of data mining software packages are available in the market from popular software vendors such as IBM, Microsoft and Oracle. Some of them are listed below.

MineSet (Silicon Graphics Inc. - SGI)
MineSet provides tools for searching, sorting, filtering and drilling down, enabling previously complex data models to be viewed intuitively through real-time 3-D graphical representation.

Intelligent Miner (IBM Corp)
IBM's data mining capabilities help you detect fraud, segment customers, and simplify market basket analysis. IBM's in-database mining capabilities integrate with the customer's existing systems to provide scalable, high-performing predictive analysis without moving data into proprietary data mining platforms.

Enterprise Miner (SAS Institute Inc.)
It is marketed as a complete data mining solution with a wide range of model development and deployment alternatives and extensive integration opportunities. Delivered as a distributed client-server system, it is especially well suited for data mining in large organizations.

Clementine (SPSS Inc - Integral Solutions)
Clementine is an enterprise data mining workbench that enables you to develop predictive models quickly using business expertise and deploy them into business operations to improve decision making.

DBMiner (DBMiner Technology Inc.)
DBMiner Insight solutions are presented by the vendor as the world's first server applications providing powerful and highly scalable association, sequence and differential mining capabilities for the Microsoft SQL Server Analysis Services platform. They also provide market basket, sequence discovery and profit optimization for Microsoft Accelerator for Business Intelligence.

Weka 3
It is a collection of machine learning algorithms for solving data mining problems. It is written in Java, so it is portable across platforms. For details visit http://www.cs.waikato.ac.nz/weak/

Oracle 10g
Oracle 10g provides software called Darwin, which is a data mining tool. It incorporates cluster analysis, classification, prediction and association rules.

In addition to the above list, the following are also popular: GhostMiner, Mantas, CART and MARS.

1.10 Summary
Data mining is concerned with finding hidden relationships present in business data to allow businesses to make predictions for future use.
Data: Data are any facts, numbers, or text that can be processed by a computer.


Metadata: Data about the data itself, such as logical database design or data dictionary definitions.
Information: The patterns, associations, or relationships among data; processed data is called information.
Data mining is a multidisciplinary field drawing on work from statistics, database technology, artificial intelligence, pattern recognition, machine learning, information theory, knowledge acquisition, information retrieval, high-performance computing, and data visualization.
Data mining consists of many up-to-date techniques such as
   o Classification
   o Clustering
   o Association
Data mining is a process, and its successful application requires data preprocessing (dimensionality reduction, cleaning, noise/outlier removal), post-processing (understandability, summary, presentation), a good understanding of the problem domain, and domain expertise.
Data mining is also referred to as knowledge discovery in databases (KDD).
OLAP stands for Online Analytical Processing.
OLAP and data mining can complement each other.
Data mining is a step in the knowledge discovery in databases (KDD) process.

1.11 Terminal Questions


1. What is data mining? Write data mining applications.
2. What is OLAP? Write the benefits of OLAP.
3. Differentiate between the following:
   Data Mining and Data Warehousing
   OLAP and Data Mining
4. What are the data mining techniques?
5. What is Knowledge Discovery? Explain the whole process involved.
6. Write any three data mining techniques.
7. What is preprocessing?


1.12 Answers
Self Assessment Questions
1. Discovering
2. Historical, future
3. Metadata
4. Data
5. Data warehouse and data mining
6. Knowledge discovery
7. Decision support
8. Complement
9. Multiple

Terminal Questions
1. Data mining is the process of analyzing data from different perspectives and summarizing it into useful information, that is, information that can be used to increase revenue, cut costs, or both. Refer sections 1.2 and 1.8.
2. Online Analytical Processing (OLAP) is a technology that is used to create decision support software. Refer section 1.6.
3. Data mining is a multidisciplinary field drawing on work from statistics, database technology, artificial intelligence, pattern recognition, machine learning, information theory, knowledge acquisition, information retrieval, high-performance computing, and data visualization. Refer sections 1.4 and 1.6.
4. Artificial neural networks, decision trees, rule induction, etc. Refer section 1.8.
5. Data mining is also referred to as knowledge discovery in databases (KDD). Refer section 1.5.
6. i. Classification ii. Clustering iii. Association
7. Data preprocessing involves dimensionality reduction, cleaning, and noise/outlier removal.
