

s increase our ability to access and examine more and more data than ever before, our ability to effectively mine the data available will soon be surpassed by its volume. We can now collect from so many different sources that inevitably some of our data will be unusable, if not useless. The need to mine the data for trends and patterns not previously detected requires some insight into the connections between different types of data and different categories of knowledge. While some kinds of connections and patterns can be machine- or process-driven, the software cannot be programmed to think too far outside the box, or to speculate about how some data might affect other data. Human analysts work at human speed, but when they enter their queries into the software, their data is mined at computer speed.

But what happens to the data after it is mined? It continues to reside in the storage unit that it has always been in, waiting to be mined again for some other parameter or connection. We can't simply delete it, as it may prove useful to us again and again in the future. We can back it up and compress our archives, but meanwhile we are still compiling more and more data across multiple information streams on a daily basis.

The best option is to warehouse our data, and keep or create separate databases for the purposes of mining. As our warehouse databases fill up, the sheer bulk of the data they hold becomes prohibitive for even the most robust processing, and mining the entire database, even with our strongest system, becomes cost- and time-ineffective.

For this reason, it is recommended that when we need to mine data for suspected patterns or to verify established institutional knowledge, we export only the data that we need to mine into smaller, more manageable databases.
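As a rough illustration of exporting only the needed data, the following is a minimal sketch using Python's built-in sqlite3 module. The function name build_mining_db, the table name mining_subset, and the sample sales table are all hypothetical and chosen for the example; a real warehouse would use its own schema and most likely a dedicated ETL tool.

```python
import sqlite3

def build_mining_db(warehouse, mining, query, params=()):
    """Copy only the rows selected by `query` from the warehouse
    connection into a new table in the smaller mining database."""
    cur = warehouse.execute(query, params)
    # Column names of the extracted subset come from the query itself.
    columns = [d[0] for d in cur.description]
    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    mining.execute(f"CREATE TABLE mining_subset ({col_list})")
    mining.executemany(
        f"INSERT INTO mining_subset VALUES ({placeholders})", cur
    )
    mining.commit()
    return mining.execute("SELECT COUNT(*) FROM mining_subset").fetchone()[0]

# Hypothetical warehouse holding several years of sales data.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
warehouse.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 2009, 120.0), ("east", 2010, 150.0),
     ("west", 2010, 90.0), ("west", 2008, 75.0)],
)

# Export only the 2010 rows and only the columns we intend to mine.
mining = sqlite3.connect(":memory:")
copied = build_mining_db(
    warehouse, mining,
    "SELECT region, amount FROM sales WHERE year = ?", (2010,),
)
print(copied, "rows copied into the mining database")
```

In this sketch the extraction query doubles as a record of what the mining database contains, so the same subset could be rebuilt later against more current warehouse data.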
Exporting only the data we need allows our data mining software to perform as designed, and ensures (assuming we made the right choices) that we are not wasting our resources on data that is irrelevant to our purposes. In other words, it ensures that we are mining only the correct or, to be more accurate, the appropriate data. When our mining operations are concluded, these databases could be deleted so they do not weigh down our systems. The model or structure of these ad hoc mining databases could then be remembered, and the same appropriate data can be called up again at a later date, with current and up-to-date information. Alternatively, the time and effort required to extract the data for mining, and to design and build the structure of the mining database, may make it more cost-effective to keep the mining database for later use, as the care and feeding costs will be significantly lower than a ground-up rebuild every time we want to plumb the same mines.

By keeping our mining databases a manageable size, we can stem the overflow of the massive amount of data we are inundated with daily. Of course, the more data we have access to, the greater the chance that we can extract actionable intelligence from it, so there is no good argument for storing less data. The answer is simply to control and manage the data that we do store. As automated ETL (extract, transform, load) technology improves and it becomes easier to extract our ad hoc databases for mining, it becomes more cost-efficient to simply keep them on a temporary basis. This helps us to manage the amount of data on hand, and to keep the data miners from overwhelming the organization.

REFERENCES:

Lohr, S. (2007). Reaping Results: Data-Mining Goes Mainstream. Retrieved March 29, 2011

from html?_r=1

Parkingson, J. (2005). Pack-rat Approach to Data Storage is Drowning IT. Retrieved March 29, 2011 from torage-is-Drowning-IT/

Thearling, K. (1997). Understanding Data Mining: It's All in the Interaction. Retrieved March 29, 2011 from

Two Crows Corporation (2005). Introduction to Data Mining and Knowledge Discovery (Third Edition). Retrieved March 29, 2011 from
