International Journal of Computational Intelligence and Information Security, September 2011 Vol. 2, No. 9

A Survey on the Research Challenges for Data Mining


Ankit Mishra
Lecturer, Dept. of CSE, TCT, Bhopal, India. Email: ankitmishra.1610@gmail.com

Abstract
With the rapid development of computer and information technology over the last few decades, an enormous amount of data in science and engineering has been, and will continue to be, generated on a massive scale, either stored in storage devices or flowing into and out of systems in the form of data streams. Such a tremendous amount of data, on the order of terabytes to petabytes, has fundamentally changed science and engineering. In this paper, we discuss the research challenges in science and engineering from the data mining perspective and try to provide some solutions to these challenges.

Keywords: Data sets, OLAP, End-user business model, Sandwich strategy

1. Introduction
It is widely recognized that data mining, a relatively young and interdisciplinary field of computer science, is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management.[1] Besides the further development of database methods to efficiently store and manage petabytes of data online and to make these archives easily and safely accessible via the Internet, another essential task is to develop powerful data mining tools to analyze such data. It is therefore no wonder that data mining has also stepped onto the center stage in science and engineering.[9] Data mining, as the confluence of multiple intertwined disciplines, including statistics, machine learning, pattern recognition, database systems, information retrieval, the World Wide Web, visualization, and many application domains, has made great progress in the past decade. To ensure that advances in data mining research and technology effectively benefit the progress of science and engineering, it is important to examine the challenges that data-intensive science and engineering pose for data mining and to explore how to further develop the technology to facilitate new discoveries and advances in science and engineering.[2]

1.1 Definition of Data Mining
Data mining is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management.[1] Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses.[4] Data mining is often defined as finding hidden information in a database. Alternatively, it has been called exploratory data analysis or data-driven discovery.[5]

2. Data Mining Architecture

Fig-1: 3-Tier Architecture for Data Mining

The first tier is the database tier, where data and metadata are prepared and stored. The second tier is the Data Mining Application, where the algorithms process the data and store the results in the database. The third tier is the Front-End layer, which supports parameter settings for the Data Mining Application and visualization of the results in an interpretable form.[6]

2.1 Architecture 2

Fig-2

The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact, coupled with external market data about competitor activity. This warehouse can be implemented in a variety of relational database systems (Sybase, Oracle, Redbrick, and so on) and should be optimized for flexible and fast data access.[12] An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The multidimensional structures allow users to analyze the data as they want to view their business, summarizing by product line, region, and other key perspectives. The Data Mining Server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. An advanced, process-centric metadata template defines the data mining objectives for specific business issues such as campaign management, prospecting, and promotion optimization. Integration with the data warehouse enables operational decisions to be directly implemented and tracked.[4]
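The multidimensional summarization described above (viewing the business by product line, region, and so on) can be illustrated with a minimal Python sketch. The fact table, the dimension names, and the `roll_up` helper are hypothetical and chosen only for illustration; a real OLAP server would precompute and index such aggregates rather than scan rows.

```python
from collections import defaultdict

# Hypothetical fact table: (product_line, region, revenue), one row per sale.
SALES = [
    ("software", "north", 120.0),
    ("software", "south", 80.0),
    ("hardware", "north", 200.0),
    ("hardware", "south", 50.0),
    ("software", "north", 30.0),
]

# Names of the dimension columns, in row order (assumed for this sketch).
DIMENSIONS = ("product_line", "region")


def roll_up(rows, by):
    """Sum revenue grouped by the dimensions in `by`; all others are aggregated away."""
    idx = [DIMENSIONS.index(d) for d in by]
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]  # the last column is the revenue measure
    return dict(totals)


by_product = roll_up(SALES, ["product_line"])           # one perspective
by_cell = roll_up(SALES, ["product_line", "region"])    # finer-grained view
grand_total = roll_up(SALES, [])                        # everything aggregated
```

Each call corresponds to one "perspective" on the same warehouse data; grouping by fewer dimensions gives a coarser summary of the same rows.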

3. Advantages of Data Mining


Data mining offers many advantages:[7]

Marketing/Retailing: Data mining can aid direct marketers by providing them with useful and accurate trends in their customers' purchasing behavior. Based on these trends, marketers can direct their marketing attention to their customers with more precision. For example, marketers at a software company may advertise new software to consumers with an extensive history of software purchases. In addition, data mining may help marketers predict which products their customers may be interested in buying. Through this prediction, marketers can surprise their customers and make the customers' shopping experience a pleasant one. Retail stores can benefit from data mining in similar ways. For example, using the trends provided by data mining, store managers can arrange shelves, stock certain items, or offer discounts that will attract their customers.

Banking/Crediting: Data mining can assist financial institutions in areas such as credit reporting and loan information. For example, by examining previous customers with similar attributes, a bank can estimate the level of risk associated with each given loan. In addition, data mining can assist credit card issuers in detecting potentially fraudulent credit card transactions. Although data mining techniques are not 100% accurate in predicting fraudulent charges, they do help credit card issuers reduce their losses.
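As an illustration of estimating loan risk from previous customers with similar attributes, the following minimal Python sketch uses a nearest-neighbour vote. The choice of technique, the two attributes, the scaling, and the sample records are all assumptions made for this example; the source does not prescribe a particular method.

```python
import math

# Hypothetical past loans: (annual income in thousands, debt-to-income ratio, defaulted?)
PAST_LOANS = [
    (30.0, 0.60, True),
    (35.0, 0.55, True),
    (40.0, 0.50, False),
    (80.0, 0.20, False),
    (90.0, 0.15, False),
    (85.0, 0.25, False),
]


def estimate_default_risk(income, dti, history, k=3):
    """Return the fraction of the k most similar past borrowers who defaulted."""
    def distance(loan):
        # Income is divided by 100 so both attributes have a comparable range
        # (an illustrative scaling, not a recommended one).
        return math.hypot((loan[0] - income) / 100.0, loan[1] - dti)

    nearest = sorted(history, key=distance)[:k]
    return sum(1 for loan in nearest if loan[2]) / len(nearest)


low_income_risk = estimate_default_risk(32.0, 0.58, PAST_LOANS)
high_income_risk = estimate_default_risk(88.0, 0.18, PAST_LOANS)
```

With this toy history, the applicant who resembles past defaulters receives a higher risk estimate than the one who resembles borrowers who repaid.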

Law enforcement: Data mining can aid law enforcers in identifying and apprehending criminal suspects by examining trends in location, crime type, habits, and other patterns of behavior.

Researchers: Data mining can assist researchers by speeding up their data analysis, thus allowing them more time to work on other projects.

4. Disadvantages of Data Mining


Data mining also has several disadvantages:[7]

Privacy issues: Personal privacy has always been a major concern in this country. In recent years, with the widespread use of the Internet, concerns about privacy have increased tremendously. Because of privacy issues, some people do not shop on the Internet. They are afraid that somebody may gain access to their personal information and then use it in an unethical way, thus causing them harm. Although it is against the law to sell or trade personal information between different organizations, the selling of personal information has occurred.

The selling of personal information may also bring harm to customers, because they do not know what the other companies plan to do with the personal information they have purchased.

Security issues: Although companies have a lot of personal information about us available online, they do not always have sufficient security systems in place to protect that information.[8] For example, the Ford Motor Credit Company had to inform 13,000 consumers that their personal information, including Social Security numbers, addresses, account numbers, and payment histories, had been accessed by hackers who broke into a database belonging to the Experian credit reporting agency. This incident illustrates that companies are willing to disclose and share personal information but do not always take proper care of it. With so much personal information available, identity theft could become a real problem.

Misuse of information/inaccurate information: Trends obtained through data mining and intended for marketing or other ethical purposes may be misused. Unethical businesses or people may use the information obtained through data mining to take advantage of vulnerable people or to discriminate against a certain group of people. In addition, data mining techniques are not 100 percent accurate; mistakes do happen, and they can have serious consequences.

5. Major Research Challenges


Analysis of huge information networks: With the development of Google and other search engines, the analysis of information networks has become an important research frontier with broad applications, such as computer network analysis, social network analysis, web community discovery, terrorist network mining, and network intrusion detection. Science and engineering involve many large technical and information networks, such as wireless networks and the soldiers and supply lines on a battlefield. In such networks, each node may carry important multidimensional information, such as text and geographical content. These networks can be very dynamic and interdependent. Although links carry valuable information, a single link can sometimes be noisy, unreliable, or misleading. Traditional data mining algorithms such as classification, market basket analysis, and cluster analysis commonly attempt to find patterns in a data set containing independent, identically distributed (IID) samples. A key emerging challenge for data mining is tackling the problem of mining richly structured, heterogeneous data sets. The domains often consist of a variety of object types, and the objects can be linked in a variety of ways. Naively applying traditional statistical inference procedures, which assume that instances are independent, can lead to inappropriate conclusions about the data. In fact, object linkage is knowledge that should be exploited.

Discovery, understanding, and usage of patterns and knowledge: Scientific and engineering applications often deal with massive data of high dimensionality. The main goal of pattern mining is to find itemsets, subsequences, or substructures that appear in a data set with a frequency no less than a user-defined threshold. There is also a need for mechanisms that support deep understanding and interpretation of patterns, e.g., semantic annotation of frequent patterns and contextual analysis of frequent patterns. Research on pattern analysis has so far focused mainly on pattern composition (e.g., the set of items in itemset patterns) and frequency. A deep analysis of frequent patterns over the structural information can help answer questions such as "Why is this pattern frequent?" A deep understanding of frequent patterns is essential to improving their interpretability and usability.
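The frequency-threshold definition of pattern mining given above can be sketched in a few lines of Python. This is a brute-force enumeration for small inputs under stated assumptions, not an optimized algorithm such as Apriori or FP-growth; the transactions, the `frequent_itemsets` name, and the `max_size` cap are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each is the set of items observed together.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: count} for itemsets found in at least `min_support` transactions."""
    counts = Counter()
    for items in transactions:
        # Enumerate every subset of the transaction up to max_size items.
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[frozenset(combo)] += 1
    # Keep only the itemsets that meet the user-defined threshold.
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


patterns = frequent_itemsets(TRANSACTIONS, min_support=3)
```

Here `min_support` plays the role of the user-defined threshold: with a threshold of 3, every single item and every pair qualifies, while the three-item set appears in only two transactions and is dropped.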

Web and unstructured data mining: The Internet is the common place for scientists and engineers to publish their data, share their observations and experiences, and exchange ideas with one another. There is a large amount of scientific and engineering data on the Internet. The Internet has become an enormous platform for information access and processing, offering not only millions of link-accessed web pages on the Web surface, containing textual data, multimedia data, and linkages, but also databases that can be accessed by queries. Text mining and information extraction have been applied not only to Web mining but also to the analysis of other kinds of semi-structured and unstructured information, such as digital libraries, biological information systems, and computer-aided design and instruction. Technologies in the text-mining process include information extraction, topic tracking, summarization, categorization, and clustering. Information extraction is a starting point for computers analyzing unstructured text: it identifies key phrases and relationships within the text by looking for predetermined sequences, a process called pattern matching. Topic tracking keeps user profiles, while text summarization helps users decide whether a lengthy document meets their needs and is worth reading. The key to summarization is reducing the length and detail of a document while retaining its main points and overall meaning. Categorization involves identifying the main themes of a document. When categorizing a document, a computer program often treats it as a "bag of words"; the program does not attempt to process the actual information as information extraction does. Clustering is a technique used to group similar documents, but it differs from categorization in that documents are grouped on the fly rather than on the basis of predefined topics.
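The "bag of words" treatment used in categorization can be illustrated with a small Python sketch. The themes, their keyword lists, and the keyword-count scoring rule are assumptions made for this example only; practical categorizers typically learn such associations from labeled documents.

```python
from collections import Counter
import re

# Hypothetical predefined themes, each described by a few indicative words.
THEME_KEYWORDS = {
    "biology": {"gene", "protein", "cell"},
    "networks": {"router", "packet", "wireless"},
}


def bag_of_words(text):
    """Count lowercase word occurrences, ignoring word order and grammar."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def categorize(text):
    """Pick the theme whose keywords occur most often in the bag of words."""
    bag = bag_of_words(text)
    scores = {
        theme: sum(bag[word] for word in words)
        for theme, words in THEME_KEYWORDS.items()
    }
    return max(scores, key=scores.get)


label = categorize("The protein binds to the cell wall; each cell expresses the gene.")
```

The sketch relies on predefined themes, which is what separates categorization from clustering: a clustering method would instead group documents by the similarity of their bags of words without any such list.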

Visual data mining: A picture is worth a thousand words. Many data visualization tools exist for visualizing various kinds of large data sets. Besides the popular bar charts, pie charts, curves, box plots, and scatter plots, there are also many visualization tools that use geometric (e.g., dimension stacking, parallel coordinates), hierarchical (e.g., tree map), and icon-based techniques. Most data analysts use visualization as part of a "sandwich" strategy of interleaving mining and visualization to reach a goal. Mining tasks usually demand techniques capable of handling large amounts of multidimensional data, often in the form of data tables or relational databases. Interaction mechanisms for filtering, querying, and selecting data are also typically required for handling larger data sets. Many mining techniques involve mathematical steps that require user intervention. Some of these can be quite complex, and visualization can support the decision processes involved in making such interventions. From this perspective, a visual data mining technique is not just a visualization technique applied to explore data in some phases of an analytical mining process, but a data mining algorithm in which visualization plays a major role. Another typical use of visualization in mining is to convey the results of a mining task, such as clustering or classification, visually in order to enhance user interpretation. Visual data mining is very useful to scientists and engineers because they often have a good understanding of their data and can use that knowledge to interpret data and patterns with the help of visualization tools. Tools must be developed for mapping data and knowledge into useful and easy-to-understand visual forms, and for interactive browsing, scrolling, and zooming of data and patterns to facilitate user exploration.

6. Conclusion
Science and engineering are fertile lands for data mining. In the last two decades, science and engineering have evolved to a stage where gigantic amounts of data are constantly being generated and collected, and data mining and knowledge discovery have become an essential part of the scientific discovery process. In this paper, we have examined a few important research challenges in science and engineering data mining. There are still several interesting research issues not covered in this short paper. One such issue is the development of invisible data mining functionality for science and engineering, which builds data mining functions into the system as an invisible process.

References
[1] Information available online: http://en.wikipedia.org/wiki/Data_mining
[2] Jiawei Han and Jing Gao, "Research Challenges for Data Mining in Science and Engineering," University of Illinois at Urbana-Champaign.
[3] Yunhong Gu and Robert L. Grossman, "Sector and Sphere: The Design and Implementation of a High Performance Data Cloud," submitted to Philosophical Transactions A, Special Issue associated with the 2008 UK e-Science All Hands Meeting.
[4] Information available online: http://www.thearling.com/text/dmwhite/dmwhite.htm
[5] Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics," Pearson Education.
[6] Kamlesh Mhashilkar (Head, Execution-MiH Services, Tata Consultancy Services), "BIDS KDD Methodology."
[7] Information available online: http://www.ustudy.in/node/6653
[8] Mark J. Embrechts, Boleslaw Szymanski, and Karsten Sternickel, "Introduction to Scientific Data Mining: Direct Kernel Methods & Applications."
[9] Georges Grinstein and Bhavani Thuraisingham, "Data Mining and Data Visualization," position paper for the Second IEEE Workshop on Database Issues for Data Visualization.
[10] Ben G. Weber and Michael Mateas, "A Data Mining Approach to Strategy Prediction."
[11] R. Balu and T. Devi, "Data, Media and Image Mining."
[12] Bhavani Thuraisingham, Latifur Khan, Chris Clifton, John Maurer, and Marion Ceruti, "Dependable Real-time Data Mining."
[13] Prabhat Kumar, Berkin Ozisikyilmaz, Wei-Keng Liao, Gokhan Memik, and Alok Choudhary, "High Performance Data Mining Using R on Heterogeneous Platforms."
[14] M-Tahar Kechadi and Ilias K. Savvas, "Cooperative Knowledge Discovery & Data Mining."
[15] Hetal Thakkar, Nikolay Laptev, Hamid Mousavi, Barzan Mozafari, Vincenzo Russo, and Carlo Zaniolo, "SMM: a Data Stream Management System for Knowledge Discovery."
[16] Jaideep Srivastava, Prasanna Desikan, and Vipin Kumar, "Web Mining Accomplishments & Future Directions."
