
INTRODUCTION

Data mining (the analysis step of the "Knowledge Discovery in Databases"
process, or KDD), an interdisciplinary subfield of computer science, is the
computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a
data set and transform it into an understandable structure for further use. Aside from the
raw analysis step, it involves database and data management aspects, data pre-
processing, model and inference considerations, interestingness metrics, complexity
considerations, post-processing of discovered structures, visualization, and online
updating. The term is a buzzword, and is frequently misused to mean any form
of large-scale data or information processing (collection, extraction, warehousing,
analysis, and statistics), but it is also generalized to any kind of computer
decision support system, including artificial intelligence, machine learning, and
business intelligence. In the proper use of the word, the key term is discovery,
commonly defined as "detecting something new". Even the popular book "Data Mining:
Practical Machine Learning Tools and Techniques with Java" (which covers mostly machine
learning material) was originally to be named just "Practical Machine Learning", and the
term "data mining" was only added for marketing reasons.

The actual data mining task is the automatic or semi-automatic analysis of
large quantities of data to extract previously unknown interesting patterns such as
groups of data records (cluster analysis), unusual records (anomaly detection) and
dependencies (association rule mining). This usually involves using database
techniques such as spatial indexes. These patterns can then be seen as a kind of summary of
the input data, and may be used in further analysis or, for example, in machine learning
and predictive analytics. For example, the data mining step might identify multiple
groups in the data, which can then be used to obtain more accurate prediction results by
a decision support system.
Background
The manual extraction of patterns from data has occurred for centuries. Early
methods of identifying patterns in data include Bayes' theorem (1700s) and regression
analysis (1800s). The proliferation, ubiquity and increasing power of computer
technology have dramatically increased data collection, storage and manipulation
ability. As data sets have grown in size and complexity, direct hands-on data analysis
has increasingly been augmented with indirect, automated data processing, aided by other
discoveries in information technology and computer science such as neural networks,
cluster analysis and genetic algorithms (1950s).

Researchers later developed decision trees (1960s) and support vector
machines (1990s). Data mining is the process of applying these methods with the
intention of uncovering hidden patterns in large data sets. It bridges the gap from
applied statistics, bioinformatics and artificial intelligence (which usually provide the
mathematical background) to database management: data is stored and indexed in databases
so that the actual learning and discovery algorithms can be executed more efficiently,
allowing such methods to be applied to ever larger data sets.

Problem description
 Finding relevant information in unstructured data is a challenge. The data is
unknown in terms of structure and values. The lifecycle of each piece of data lies in
a specific domain, whereby a domain expert is available for a priori knowledge.
Domain experts can create structures in the data by hand; however, this is a
time-consuming job and it is done for one dataset in one domain.
 An additional challenge is connecting data from different domains.
Collaborative environment
All the data is (geographically) spread over multiple domains: the
environment consists of more than one domain, and the domains have the intention to
collaborate. Users are physically located at different places, exchanging knowledge and
sharing information by interacting. Nowadays collaborative environments are
characterized by a complex infrastructure, multiple organizational sub-domains,
constrained information sharing, heterogeneity, and volatile, dynamic change.

Fig: Collaborative environment

Domain- and privacy restrictions


Each domain can have its own privacy restrictions. A domain has its own
standards in communication and interaction, data storage, data structures, and culture.
In addition, domains have their own privacy policies and restrictions, so not all
private data can be accessed by outsiders.

Retrieving relevant information


The main problem is retrieving relevant information from multiple domain
resources, in an environment of multiple domains with cross-domain information
sharing. Each domain consists of multiple manifests, whereby these manifests can
change in structure over time. Each domain expert tries to create a structure or pattern
by hand with his or her a priori domain knowledge. Each domain administrator does
this for his own domain, to fulfil the goal of retrieving rich information from
manifests in a readable structure. When there is a need for collaboration, connecting
and coupling two domains to create a shared conceptualization, the domain experts
have to perform the job together. By communicating and creating conformity, the noise
of interacting is reduced, and the physically and virtually different worlds are
connected into a holistic environment.

Problem description
During problem analysis, system administrators combine information from
various monitoring tools. Domain administrators try to relate their findings to
administrators from different domains. This process involves manual inspection of log
files. This is a time-consuming job, which sometimes involves the inference of missing
data. The current study aims to find patterns in the observed data, and build models of
the structure of information in the log files. By relating information from multiple
sources (log files) an automated system should be able to infer missing information (1).

The problem in managing grid infrastructures is that local domain information
has no connection to other domains. Manually combining relevant data of two or more
domains is a time-consuming job, especially with grid infrastructures having dynamic
characteristics and considering the domain privacy aspect.

Approach
The following approach is used to address the research questions. First, state-of-
the-art literature research is performed to get an overview of learning mechanisms and
agent technology in distributed systems.
Experiment: Corpus creation using genetic algorithms. The first experiment is using or
creating an intelligent algorithm for Region of Interest (ROI) extraction. ROIs are
possible parts of the solution domain (the retrievable structure/pattern, see figure 1-1).
The algorithm has to be able to read manifests, to interpret them, and to have a
learning function. The result should be a structured set containing regular expressions,
called a corpus. The regular expressions represent the value data of
ROIs. A ROI in a log file can be a sentence, verb, IP address, date or a combination
(see figure 1-3).
The algorithm uses a log file and domain knowledge as input. Domain knowledge
contains descriptive attributes, e.g. {; ,Time = . * date= }. A descriptive attribute is a
data pattern, expressed as a regular expression, that the agent will search for in the
log file. With the manifest and the domain knowledge, the algorithm will
return a corpus, which represents an information pattern in the log file. Information
retrieval is achieved by querying the corpus; these queries return the values of the
ROIs. A minimal sketch follows the figure below.

Fig: general ROI Extraction algorithm
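
To make the corpus idea concrete, the sketch below applies expert-supplied descriptive
attributes (regular expressions) to log lines and collects the matched ROI values into a
queryable corpus. The attribute names and patterns are invented for illustration, and the
genetic-algorithm learning loop that would evolve these patterns is omitted.

import re

# Hypothetical descriptive attributes supplied by a domain expert:
# each maps a ROI name to a regular expression (illustrative only).
DOMAIN_KNOWLEDGE = {
    "time": r"TIME:\s*(.+)",
    "pid": r"PID:\s*(\w+)",
}

def build_corpus(log_lines, knowledge):
    """Scan a manifest (log file) and collect the matches per ROI."""
    corpus = {name: [] for name in knowledge}
    for line in log_lines:
        for name, pattern in knowledge.items():
            match = re.search(pattern, line)
            if match:
                corpus[name].append(match.group(1))
    return corpus

log = ["TIME: Sun Dec 7 04:02:09 2008", "PID: 13048"]
corpus = build_corpus(log, DOMAIN_KNOWLEDGE)
print(corpus["pid"])  # querying the corpus returns the ROI values: ['13048']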


Prototype: Multi-agent system
The prototype of the decentralized system was based on agent technology.
Agents are described in more detail in 0. In the current study agents are used to collect
and retrieve information from manifests. They are equipped with software components
to collect data. A multi agent system (MAS) contains multiple agents which can
communicate which each other. With MAS it is possible to deal with domain- and
privacy restrictions. Agents in MAS can cooperate to achieve cooperative learning and
distributed problem solving. MAS can increase its self-efficiency and communicate
across domains and work autonomously.
The agents use Regions of Interest (ROIs) to analyze manifests. A Region of
Interest is (part of) a pattern or (part of) a model from a manifest. The agents can
communicate with each other via the network and collaborate with other agents to form
an alliance, in order to exchange ROIs. By exchanging their knowledge, the
agents can learn from one another and can achieve higher levels of data abstraction.
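
The following minimal sketch illustrates how two agents might pool their ROIs when
forming an alliance. The Agent class, its fields and the exchange protocol are
assumptions made for illustration; a real MAS framework would add messaging,
autonomy and lifecycle management.

class Agent:
    """Minimal agent holding ROI patterns for one domain (a sketch,
    not a full multi-agent implementation)."""

    def __init__(self, name, rois):
        self.name = name
        self.rois = dict(rois)  # ROI name -> regular expression

    def exchange(self, other):
        # Both allies learn each other's ROIs, raising the level of
        # data abstraction available in each domain.
        merged = {**self.rois, **other.rois}
        self.rois = dict(merged)
        other.rois = dict(merged)

a = Agent("domain-A", {"time": r"TIME:\s*(.+)"})
b = Agent("domain-B", {"ip": r"(\d{1,3}(?:\.\d{1,3}){3})"})
a.exchange(b)
print(sorted(a.rois))  # ['ip', 'time'] - both agents now know both patterns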

SCOPE:
The scope of the current study is shown in figure 1-4; every chapter will zoom in
on a part of the figure. The scope is retrieving data from manifest(s) using domain
knowledge and machine learning techniques. Figure 1-4 describes the various levels of
the current study and their relations. The top level describes the data that was used, in
this case manifests containing log data. The second level involves pre-processing, that
is, retrieving/creating domain knowledge about the manifests. This domain knowledge
was used in the next level by the algorithm to find patterns in the manifest. Finally, at
the level of the multi-agent system, the found patterns - or ROIs - were exchanged
between agents.
Fig: Text mining executed by Multi-agents

What is Data Mining?


Data mining is a highly interdisciplinary subject and can be explained in
many different ways. Arguably, data mining should have been named differently:
"knowledge mining from databases" or "knowledge discovery from data". Many people
treat data mining as a synonym for the popularly used term knowledge discovery from
data (KDD), while others view data mining as simply an essential step in the process of
knowledge discovery.

Uses of Data mining


Games
Data mining has been used in games for which oracles exist, such as tablebases
for chess endgames with any beginning configuration, small-board dots-and-boxes and
small-board hex; with such tablebases, a new area for data mining has been opened.
Business
Data mining in business is the analysis of historical business data, stored in
databases, to reveal hidden patterns and trends. Data mining is very important in
customer relationship management applications. Data clustering is one of the most
dominant and important techniques in data mining; it can also be used to automatically
discover segments within a customer list. Data mining is furthermore of particular
value to human resources departments in identifying the characteristics of their most
successful employees.
Science and engineering
In recent years, data mining has been used in many areas of science and
engineering, such as bioinformatics, medicine, health care, physics and the environment.
In the study of human genetics, sequence mining helps address the important
goal of understanding the mapping between inter-individual variations in human
DNA sequence and variability in disease susceptibility. In
simple terms, it aims to discover how the changes in an individual's DNA sequence
affect the risk of developing common diseases such as cancer. One data mining
method used to perform this task is known as multifactor dimensionality
reduction.

Human rights
Data mining is applied in the human rights area. Data mining of government
records, particularly of the justice system, enables the discovery of systemic
human rights violations.

Objectives:
 To collect different samples from the different areas of the protein and forest
fire data.
 To identify and maintain techniques, and to explore clustering and
classification techniques for this problem.
 To perform a comparative analysis between the different areas (protein and forest fire
data).
 To explore different techniques for the protein and forest fire data.
 To use data mining algorithms.
 To explore the study of environmental data, decision trees, and prediction.
CHAPTER 2
LITERATURE SURVEY

Information Retrieval & Extraction


Information retrieval (IR) and extraction (IE) are concepts that derive from
information science. Information retrieval is searching for documents and for
information in documents. In daily life, humans have the desire to locate and obtain
information. A human tries to retrieve information from an information system by
posing a question or query. Nowadays there is an overload of information available,
while humans need only relevant information depending on their desires. Relevance in
this context means returning a set of documents that meets the information need.
Information extraction (IE) in computing science means obtaining structured
data from an unstructured format. Often the format of structured data is stored in a
dictionary or an ontology that defines the terms in a specific domain with their relation
to other terms. IE processes each document to extract (find) possible meaningful entities
and relationships, to create a corpus. The corpus is a structured format to obtain
structured data.
Information retrieval is an old concept. In a physical library, books are stored on
a shelf in a specific order, e.g. per topic and then alphabetically. When a person needs
information on a specific topic, he or she can go to the shelf and locate the book that
best fits his or her needs. With the advent of computers, this principle can also be
used by information systems. Well-known information-retrieval systems are search
engines on the web. For instance, Google tries to find a set of available documents on
the web, using a search phrase; it tries to find matches for the search phrase or parts of
it. The pre-processing work for the search engines is the information extraction process:
creating order in a chaos of information. Google crawls the web for information,
interprets it, and stores it in a specific structure so that it can be quickly accessed when
users fire search phrases.
Data mining consists of an iterative sequence of the following steps:
1) Data cleaning (to remove noise and inconsistent data)
2) Data integration (where multiple data sources may be combined)
3) Data selection (where data relevant to the analysis task are retrieved from the
database)
4) Data mining (an essential process where intelligent methods are applied in order
to extract data patterns)
5) Pattern evaluation (to identify the truly interesting patterns representing
knowledge based on some interestingness measures)

Fig: Data mining process
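
As a toy illustration of these steps, the sketch below walks a tiny hand-made record set
through cleaning, integration, selection, mining and pattern evaluation. The data and the
"pattern" (grouping ages by city) are invented purely to make the sequence concrete.

from collections import defaultdict

# Invented toy records: (name, age, city).
raw = [("alice", 34, "NY"), ("bob", None, "NY"), ("alice", 34, "NY"),
       ("carol", 29, "LA"), ("dave", 31, "NY")]

cleaned = [r for r in raw if None not in r]               # 1) data cleaning
integrated = list(dict.fromkeys(cleaned))                 # 2) integration: drop duplicates
selected = [(age, city) for _, age, city in integrated]   # 3) selection: relevant columns

groups = defaultdict(list)                                # 4) data mining: group ages by city
for age, city in selected:
    groups[city].append(age)

# 5) pattern evaluation: keep only groups supported by more than one observation
interesting = {city: ages for city, ages in groups.items() if len(ages) > 1}
print(interesting)  # {'NY': [34, 31]}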


Text mining
Text mining uses the same analysis approach and techniques as data mining.
However, data mining requires structured data, while text mining aims to discover
patterns in unstructured data (3). For commercial use, text mining will be the follow-up
of data mining. With the growing number of digitized documents and large text
databases, text mining will become increasingly important. Text mining can be of huge
benefit for finding relevant and desired text data in unstructured data sources.
NACTEM performs research in text mining and applies the resulting methods to the
MEDLINE data source, a huge database containing medical information. With text
mining, the input is a text set which can be unstructured or semi-structured. For
example, a text document can have a few structured parts like title, author, publication
date, and category, while the abstract and content might be unstructured components
with a high potential information value. It is hard to retrieve information from those
parts with conventional data mining.
Text mining uses unstructured documents as input data; in other words,
documents that are hard to interpret in terms of meaning. There are few companies
working on commercial applications for text mining. Because of the challenges involved
in working with text and the differences between languages, it is hard to create a
general solution or application. The research area is currently "too young" to deal with
all of the aspects of text and natural language processing and of linking information to
each other. However, the first results are promising and perform well, e.g. the work
performed by Textkernel. (In light of the current study the author visited Textkernel
to discuss text mining techniques.) Textkernel is a company specialized in mining data,
and is working on text mining with promising results, e.g. parsing and finding structures
in Curriculum Vitae (C.V.) documents. These C.V.s are collected and parsed into a
general format for international staffing & recruitment agencies.
Intelligent algorithms
This paragraph begins with a basic explanation of intelligence and learning in
software components. Intelligence in software is characterized as:
 adaptability to a new environment or to changes in the current environment;
 capacity for knowledge and the ability to acquire it;
 capacity for reason and abstract thought;
 ability to comprehend relationships;
 ability to evaluate and judge;
 capacity for original and productive thought

Learning is the process of obtaining new knowledge. It results in a better reaction
to the same inputs at the next session of operation. It means improvement. It is a step
toward adaptation. Learning is an important characteristic of intelligent systems.
There are three important approaches to learning:

 Learning through examples. This means learning from training data. For
instance, pairs (Xi, Yi), where Xi is a vector from the domain space D and Yi is a
vector from the solution space S, i = 1, 2, ..., n, are used to train a system to
obtain the goal function F : D → S. This type of learning is typical for neural
networks.
 Learning by being told. This is a direct or indirect implementation of a set of
heuristic rules in a system. For example, the heuristic rules to monitor a car can
be directly represented as production rules. Or instructions given to a system in a
text form by an instructor (written text, speech, natural language) can be
transformed into internally represented machine rules.
 Learning by doing. This way of learning means that the system starts with no or
little knowledge. During its functioning it accumulates valuable experience,
profits from it, and performs better over time. This method of learning is typical
for genetic algorithms.
The first two approaches are top-down, because many possible solutions are
available; these approaches can learn from experience or via an instructor (supervised).
The third one is a bottom-up strategy, beginning with little knowledge and trying to find
the best possible solution. The final or optimal solution is not known; only parts of
the solution domain are given.
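
To make "learning through examples" concrete, the sketch below fits a goal function
F : D → S from training pairs (Xi, Yi) by least squares. A one-dimensional linear fit
stands in here for the neural-network case mentioned above; the data points are invented.

# Training pairs (X_i, Y_i): noisy samples of y = 2x (invented data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Least-squares estimate of slope and intercept.
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def F(x):
    """The learned goal function F: D -> S."""
    return slope * x + intercept

print(round(F(5.0), 2))  # 9.85 - close to the true value 10, on unseen input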
In information technology, an agent is an autonomous software component.
Autonomous, because it operates without the direct intervention of humans or others
and has control over its actions and internal states. It perceives its environment through
sensors and acts upon its environment through effectors. Agents communicate with
other agents. By communication, agents can work together which can result in
cooperative problem solving, whereby agents have their own tasks and goals to fulfil in
the environment. See figure 2-2 (multi agent system canonical view), where the agents
interact with each other. An agent perceives a part of the environment and, with its
actions, can (partially) influence the environment.
A multi agent system (MAS) is a set of multiple coupled (intelligent) software
agents, which interact with one another through communication and are capable of
perceiving the environment. Multi agent systems solve problems which are difficult or
impossible for a single-agent system.

Fig: Multi agent system canonical view


Properties of environments
Accessible vs. inaccessible
An environment is accessible when an agent can obtain a complete state of the
environment, detecting all relevant states of the environment on which to base its
decisions. If the environment is highly accessible, it is easier for an agent to make
accurate decisions (accurate as being related to, or appropriate given, the state of the
environment). Examples of environments with high inaccessibility are the internet and
the physical real world.

Deterministic vs. non-deterministic


A deterministic environment is one in which any action has a single guaranteed
effect - there is no uncertainty about the state that will result from performing an action.

Episodic vs. non-episodic


In an episodic environment, agents perform in several episodes, without any
relevance or relations between the episodes. Actions executed by the agent in previous
episodes, after acting and perceiving the environment, have no relevance in the
following (i.e. new) episodes. The agent does not need the ability to reason ahead, since
the quality of performance does not depend on previous episodes. In a non-episodic
environment, however, agents should "think" about possible next steps.

Static vs. dynamic


A static environment remains unchanged even after actions performed by an
agent. Dynamic environments are changing environments and are harder for agents to
handle. If an environment is static, but agents can change characteristics of the
environment, it is called semi-dynamic.

Discrete vs. continuous


Discrete environment: there is a fixed, finite number of actions and percepts in
it, e.g. a chess game.
Review of Literature
In this literature (Agarwal et al., 1993), the authors note that proteins typically
do not act in isolation in a cell but function within complex cellular pathways,
interacting with other proteins either in pairs or as components of larger complexes.
While many protein complexes have been identified by large-scale experimental
studies, a large number of false-positive interactions present in existing protein
complexes make it difficult to gain an accurate understanding of functional modules,
which comprise groups of proteins involved in common elementary biological
functions. In this paper, the authors present a hyperclique pattern discovery approach
for extracting functional modules (hyperclique patterns) from protein complexes.
Kumar et al. (2004) discussed a SAS solution for pharmacovigilance.
(Arthur and Vassilvitskii, 2007) presented k-means++, a widely used
clustering technique. Experiments in this research show that it improves both the
accuracy and the speed of k-means, often quite dramatically.
In this literature (Bashers et al., 2012), the authors presented data mining as
discovering knowledge from large quantities of data. The research focused on a
genetic algorithm-based approach for mining classification rules. The work is based on
the coverage, comprehensibility and accuracy of the rules, and on simplifying the
implementation of the genetic algorithm. The research discussed the design of the
encoding, the fitness function of the genetic algorithm, and the genetic operators.
Experimental results show that the genetic algorithm proposed in this research is
suitable for discovering rules with higher classification performance on unknown data.
(Bauer and Kohavi, 1999) presented an extensive empirical comparison of bagging
and boosting, together with ensemble variants such as voting and stacking.
Bagging works as a method of increasing accuracy. Bagging, also called bootstrap
aggregating, is based on random sampling with replacement and is a machine learning
ensemble algorithm.
(Bifet et al., 2009) presented work on data stream classification; in this
framework the authors presented a data mining framework together with the techniques
applied in it.
(Dietterich, 2002) described three types of problems associated with ensembles
of base learning algorithms: the first is the statistical problem, the second is the
computational problem, and the third is the representational problem.
(Dr. Bhatia and Deepika Khurana, 2013) presented the k-Means
clustering algorithm. This research presented a comparative study of a modified k-
Means clustering algorithm and the original k-Means clustering algorithm. These
algorithms were executed on different datasets, and results were obtained using the
original k-Means and the modified k-Means algorithms; the research used MATLAB
R2009b. The results are evaluated on performance measures such as the number of
misclassified points, accuracy, Silhouette validity index, number of iterations and
execution time.
(Hemalatha and Saranya, 2011) presented a survey of spatial data
mining and its different tasks. The authors focused on the unique features that
distinguish spatial data mining from traditional data mining, and described the
applications, techniques, issues and challenges of spatial data mining. The authors
presented spatial data analysis as a quite distinct task for the research area.
(Dzeroski and Zenko, 2004) worked on ensemble learning techniques. These
techniques create a meta-classifier by combining several classifiers, typically by
voting, built on the same data, and thereby improve performance over the individual
classifiers.
In this literature (Ester et al., 2001), the authors presented a database-oriented
framework for spatial data mining. A small set of basic and primary operations on
graphs and paths was discussed as database primitives for spatial data mining, and
techniques such as implementations on top of a commercial DBMS were presented.
In the research the authors covered the main tasks of spatial data mining: spatial
classification, spatial characterization, spatial clustering and spatial trend detection. For
each of these tasks, the authors presented algorithms and prototypical applications, and
indicated interesting directions for future work. Since the system overhead imposed by
the DBMS is rather large, concepts and techniques for improving the efficiency
should be investigated. For example, techniques for processing sets of objects, which
provide more information to the DBMS, can be used to improve the overall efficiency
of mining algorithms.

FCM is based on fuzzy logic. (Elena, 2013) presented a survey of
fuzzy logic theory as applied in cluster analysis; in this work the author reviewed the
Fuzzy c-means clustering method in MATLAB.

(Ghosh and Dubey, 2013) presented two important clustering algorithms,
K-Means and FCM (Fuzzy C-Means), and carried out a comparative study between
these algorithms.
In this research (Jain and Gajbhiye, 2012), the authors considered the competitive
Neural Network, the K-means clustering algorithm, the Fuzzy C-Means algorithm and
one original method, Quantum Clustering. The main aim of this research was to
compare the four clustering techniques on multivariate data. The authors introduced an
easy-to-use and intelligent tool that compares these clustering methods within the same
framework.
In this research (Kalyankar and Alaspurkar, 2013), the authors presented many
data mining concepts and methods, such as classification, clustering, hyperclique
patterns, and many types of algorithms. The main aim of this research was to study
weather data using data mining techniques and methods such as clustering.
In this research (Kavitha and Sarojamma, 2012), the authors presented how
Disease Management (DM) Programs are beginning to encompass providers across the
healthcare continuum. In this article the authors describe a diabetic monitoring
platform that supports Diabetes (D) diagnosis assessment and offers DM functions
based on the CART method. The work explained a decision tree for diabetes
diagnostics and showed how to use it as a basis for generating knowledge base rules.
(Lavrac and Novak, 2013) first outlined relational data mining approaches for
finding subgroups. The research then described recently developed approaches to
semantic data mining which facilitate the use of domain ontologies as background
knowledge in the analysis of data. The techniques and tools are illustrated on selected
biological applications.
The authors in this literature (Lai and Tsai, 2012) presented the preliminary
results of the validation and risk assessment of landslide events induced by heavy
torrential rains in the Shimen reservoir watershed of Taiwan. The authors used spatial
analysis and data mining algorithms, focusing on eleven factors such as elevation
(Digital Elevation Model, DEM), slope, aspect, curvature, NDVI (Normalized
Difference Vegetation Index), geology, soil, land use, fault, river and road.
Title: Monitoring of Diabetes with Data Mining via CART Method
Author: Kavitha K, Sarojamma R M

Disease Management Programs are beginning to encompass providers across the


healthcare continuum, including home health care. The premise behind disease
management is that coordinated, evidence-based interventions can be applied to the care
of patients with specific high-cost, high-volume chronic conditions, resulting in
improved clinical outcomes and lower overall costs. The paper presents an approach to
designing a platform to enhance effectiveness and efficiency of health monitoring using
DM for early detection of any worsening in a patient’s condition. In this article, we
briefly describe the diabetic monitoring platform we designed and realized which
supports Diabetes (D) diagnosis assessment offering functions of DM based on the
CART method. The work also gives a description of constructing a decision tree for
diabetes diagnostics and shows how to use it as a basis for generating knowledge base
rules. The system developed achieved accuracy and a precision of 96.39% and 100.00%
in detecting Diabetes these preliminary results were achieved on public databases of
signals to improve their reproducibility. Finally, the article also outlines how an
intelligent diagnostic system works. Clinical trials involving local patients are still
running and will require longer experimentation.

Advantages
 It can generate regression trees, where each leaf predicts a real number and not
just a class.
 CART can identify the most significant variables and eliminate non-significant
ones by referring to the Gini index.

Disadvantages
 CART may produce an unstable decision tree: when the learning sample is
modified, there may be an increase or decrease in tree complexity, and changes in
splitting variables and values.
Title: Verification and Risk Assessment for Landslides in the Shimen Reservoir
Watershed of Taiwan Using Spatial Analysis and Data Mining
Author: J-S. Lai, F. Tsai

Spatial information technologies and data can be used effectively to investigate


and monitor natural disasters continuously and to support policy- and decision-making
for hazard prevention, mitigation and reconstruction. However, in addition to the vastly
growing data volume, various spatial data usually come from different sources and with
different formats and characteristics. Therefore, it is necessary to find useful and
valuable information that may not be obvious in the original data sets from numerous
collections. This paper presents the preliminary results of a research in the validation
and risk assessment of landslide events induced by heavy torrential rains in the Shimen
reservoir watershed of Taiwan using spatial analysis and data mining algorithms. In this
study, eleven factors were considered, including elevation (Digital Elevation Model,
DEM), slope, aspect, curvature, NDVI (Normalized Difference Vegetation Index), fault,
geology, soil, land use, river and road. The experimental results indicate that overall
accuracy and kappa coefficient in verification can reach 98.1 % and 0.8829,
respectively. However, the DT model after training is too over-fitted to carry out
prediction. To address this issue, a mechanism was developed to filter uncertain data by
the standard deviation of the data distribution. Experimental results demonstrated that
after filtering the uncertain data, the kappa coefficient in prediction substantially
increased by 29.5%. The results indicate that spatial analysis and data mining algorithms combining
the mechanism developed in this study can produce more reliable results for verification
and forecast of landslides in the study site.
Title: Comparative Analysis of K-Means and Fuzzy C-Means Algorithms
Author: Soumi Ghosh, Sanjay Kumar Dubey

In the arena of software, data mining technology has been considered as useful
means for identifying patterns and trends of large volume of data. This approach is
basically used to extract the unknown pattern from the large set of data for business as
well as real time applications. It is a computational intelligence discipline which has
emerged as a valuable tool for data analysis, new knowledge discovery and autonomous
decision making. The raw, unlabeled data from a large dataset can be
classified initially in an unsupervised fashion by using cluster analysis, i.e. clustering:
the assignment of a set of observations into clusters so that observations in the same
cluster may in some sense be treated as similar. The outcome of the clustering process and
efficiency of its domain application are generally determined through algorithms. There
are various algorithms which are used to solve this problem. In this research work two
important clustering algorithms namely centroid based K-Means and representative
object based FCM (Fuzzy C-Means) clustering algorithms are compared. These
algorithms are applied and performance is evaluated on the basis of the efficiency of
clustering output. The numbers of data points as well as the number of clusters are the
factors upon which the behaviour patterns of both the algorithms are analyzed. FCM
produces close results to K-Means clustering but it still requires more computation time
than K-Means clustering.

Advantages
 KM is relatively efficient in computational time complexity.

Disadvantages
 KM may not be successful in finding overlapping clusters.
 It is not invariant to non-linear transformations of data.
 KM also fails to cluster noisy data and non-linear datasets.
Title: Data Mining Technique to Analyse the Metrological Data
Author: Meghali A. Kalyankar, S. J. Alaspurkar

Data Mining is the process of discovering new patterns from large data sets. This
technology is employed in inferring, from a vast amount of data, useful knowledge that
can be put to use; various data mining techniques such as Classification, Prediction,
Clustering and Outlier analysis can be used for the purpose. Weather data is
meteorological data that is rich in important knowledge.
a form of data mining concerned with finding hidden patterns inside largely available
meteorological data, so that the information retrieved can be transformed into usable
knowledge. We know sometimes Climate affects the human society in all the possible
ways. Knowledge of weather data or climate data in a region is essential for business,
society, agriculture and energy applications. The main aim of this paper is to overview
on Data mining Process for weather data and to study on weather data using data
mining technique like clustering technique. By using this technique we can acquire
Weather data and can find the hidden patterns inside the large dataset so as to transfer
the retrieved information into usable knowledge for classification and prediction of
climate conditions. We discussed how to use a data mining technique to analyze
meteorological data such as weather data.

Advantages
 Data Logger can automatically collect data on a 24-hour basis
CHAPTER 3
PRE-PROCESSING IN INFORMATION RETRIEVAL

Domain administrators
Manually gaining knowledge can be done by asking an expert or administrator
in the specific domain. This person can provide the required information about the
manifest: which knowledge is needed, and a priority order for each item.

In this specific case, gaining knowledge and retrieving it via domain experts
has some disadvantages. For these experts it is a time-consuming job to provide
knowledge to other people or systems. In the case of an expert providing knowledge to
another person, the domain knowledge should be provided in a consumable format, so
that a non-domain expert can work with the provided knowledge to retrieve rich
information from manifests. In practice the domain knowledge can change, which is an
extra challenge for the domain experts. This study assumes that a domain expert
provides knowledge in a machine-readable format as input for finding a corpus
(structure) of the manifest. It is important to have readable human-machine
communication. An ontology is very helpful in creating a compromise and an
agreement in communication and on the meaning of the given domain knowledge.

Extracting algorithms
Delegating these processes and responsibilities raises the need for a machine-
processable form. A machine (extracting algorithm) reads a set of manifests and tries to
extract domain knowledge, which will then be used to find a corpus out of a set of
manifests. An advantage is that the domain experts have a lower workload in providing
knowledge: their task can change from an executing task to a controlling one, checking
the outcome of these algorithms and giving them feedback, so that unseen data will be
handled better in the future. This is only the case in supervised learning. Another
approach can be that domain experts only change the outcome of a domain knowledge
gaining algorithm.
Regular Expressions
The domain knowledge is in this particular case represented by regular
expressions. Regular expressions are very powerful, because they represent a search
pattern in a manifest.
This format of domain knowledge is used as input for the genetic algorithms. In
the GRID case, key-value pairs are used for analyzing log files; see table 3-1, Regular
expressions. Log files mostly have a hierarchical structure containing a lot of key-value
pairs. Using domain knowledge represented by regular expressions as input for corpus-
creating algorithms is very powerful: the corpus consists of multiple regular
expressions, and since regular expressions are search patterns, the corpus can be
queried.

Key    Value (regular expression)        Log file line example

Time   TIME\p{Punct}*\s*([\w+\s*:]*)     TIME: Sun Dec 7 04:02:09 2008
PID    PID\p{Punct}*\s*(\w{1,})          PID: 13048
IP     ([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.
       ([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])
                                         Got connection 128.142.173.156

Table: Regular expressions
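
The sketch below applies the table's patterns to the example log lines. Note that
\p{Punct} is a Java regex class that Python's re module does not support, so it is
approximated here with the character class [^\w\s]; the capture groups are otherwise
unchanged.

import re

# Patterns from Table 3-1, with \p{Punct} approximated as [^\w\s]
# for Python's re module (an adaptation, not the original syntax).
PATTERNS = {
    "Time": r"TIME[^\w\s]*\s*([\w+\s*:]*)",
    "PID": r"PID[^\w\s]*\s*(\w{1,})",
    "IP": r"((?:[01]?\d\d?|2[0-4]\d|25[0-5])(?:\.(?:[01]?\d\d?|2[0-4]\d|25[0-5])){3})",
}

lines = [
    "TIME: Sun Dec 7 04:02:09 2008",
    "PID: 13048",
    "Got connection 128.142.173.156",
]

for line in lines:
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, line)
        if match:
            print(key, "->", match.group(1))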

Applications of Text Mining: Information Retrieval


The concept of information retrieval (IR) has been developed alongside database
systems for many years. Information retrieval is the organization and retrieval of
information from a large number of text-based documents. Information retrieval
systems and database systems each handle different kinds of data; some database system
problems are usually not present in information retrieval systems, such as concurrency
control, recovery and transaction management, since IR deals with unstructured or
semi-structured data sets such as emails, HTML files and full-text documents. Text
mining is used for finding new, previously unknown information in different written
resources.
Structured data is data that resides in a fixed field within a record or file; this
data is contained in relational databases and spreadsheets. Unstructured data usually
refers to information that does not reside in a traditional row-column database; it is
the opposite of structured data. Semi-structured data is data that is neither raw data
nor typed data in a conventional database system. Text mining is a new area of
computer science research that tries to solve the issues that occur in the areas of data
mining, machine learning, information extraction, natural language processing,
information retrieval, knowledge management and classification. The figure below
gives an overview of the text mining process.

Fig: Text Mining Process

Also, some common information retrieval problems are usually not encountered
in conventional database systems, such as unstructured documents, approximate search
based on keywords, and the concept of relevance. Due to the huge quantity of text
information, information retrieval has found many applications. There exist many
information retrieval systems, such as on-line library catalog systems, on-line document
management systems, and the more recently developed Web search engines.

Information Extraction
The information extraction method identifies key words and relationships within
the text. It does this by looking for predefined sequences in the text, a process called
pattern matching. The software infers the relationships between all the identified places,
people, and times to provide the user with meaningful information.
This technology is very useful when dealing with large volumes of text.
Traditional data mining assumes that the information being "mined" is already in the
form of a relational database. Unfortunately, for many applications, electronic
information is only available in the form of free natural language documents rather than
structured databases.

Fig: Process of Text Extraction


Categorization
Categorization involves identifying the main themes of a document by placing
the document into a predefined set of topics. When categorizing a document, a
computer program will often treat the document as a "bag of words". It does not try to
process the actual information as information extraction does; rather, categorization
only counts words that appear and, from the counts, identifies the main topics that the
document covers. Categorization often relies on a glossary for which topics are
predefined, and relationships are identified by looking for broad terms, narrower terms,
synonyms, and related terms.
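
The sketch below shows a bag-of-words categorizer in miniature: it counts glossary-
keyword occurrences instead of interpreting the text. The topic glossary and the sample
sentence are invented for illustration.

# A hypothetical topic glossary: topic -> keywords.
GLOSSARY = {
    "sports": {"game", "player", "team", "runs"},
    "science": {"science", "scientists", "study"},
}

def categorize(document):
    """Treat the document as a bag of words and score each topic
    by counting keyword occurrences."""
    words = document.lower().split()
    scores = {topic: sum(words.count(keyword) for keyword in keywords)
              for topic, keywords in GLOSSARY.items()}
    return max(scores, key=scores.get)

print(categorize("Cricket is a game where a player scores runs"))  # sports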
Pre-processing Method
The pre-processing method plays a very important role in text mining techniques
and applications. It is the first step in the text mining process. Here we discuss three
key steps of pre-processing, namely stop word removal, stemming and TF/IDF.

Fig: Text Mining Pre-Processing Techniques


Extraction
This method is used to tokenize the file content into individual words.

Stop Words Elimination


Stop words are a part of natural language. The reason that stop words
should be removed from a text is that they make the text look heavier and less important
for analysis. Removing stop words reduces the dimensionality of the term space. The
most common words in text documents are articles, prepositions, pronouns, etc., which
do not carry the meaning of the documents. These words are treated as stop words.
Examples of stop words are: the, in, a, an, with, etc. Stop words are removed from
documents because those words are not counted as keywords in text mining
applications.
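
A minimal classic stop word removal sketch follows; the stop list here is a tiny invented
sample, not any standard pre-compiled list.

# A tiny sample stop list (illustrative only; real systems use
# pre-compiled lists with hundreds of entries).
STOP_WORDS = {"the", "in", "a", "an", "with", "is", "of", "and", "or", "to"}

def remove_stop_words(text):
    """Drop stop words, shrinking the dimensionality of the term space."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

print(remove_stop_words("The words in a text with stop words"))
# ['words', 'text', 'stop', 'words']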

C. Stop word removal methods


Four types of stop word removal methods are described below; these methods are
used to remove stop words from the files.
 The Classic Method: The classic method is based on removing stop words
obtained from pre-compiled lists.

 Methods based on Zipf's Law (Z-Methods):

 In addition to the classic stop list, three stop word creation methods
motivated by Zipf's law are used, including: removing the most frequent words
(TF-High), removing words that occur only once, i.e. singleton words (TF1), and
removing words with low inverse document frequency (IDF).

 The Mutual Information Method (MI)


o The mutual information method (MI) is a supervised method that works
by computing the mutual information between a given term and a
document class (e.g., positive, negative), providing a suggestion of how
much information the term can tell about a given class. Low mutual
information suggests that the term has a low discrimination power and
consequently it should be removed.

 Term Based Random Sampling (TBRS)


o This method was first proposed by Lo et al. (2005) to detect
stop words in web documents. The method works by iterating
over separate, randomly selected chunks of data. It then ranks the
terms in each chunk based on their informativeness values, using the
Kullback-Leibler divergence measure as shown in Equation 1:

dx(t) = Px(t) · log2(Px(t) / P(t))     (1)

where Px(t) is the normalized term frequency of a term t within a chunk x, and P(t) is
the normalized term frequency of t in the entire collection. The final stop list is then
constructed by taking the least informative terms in all chunks, removing all possible
duplications.
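
The sketch below ranks the terms of one randomly drawn chunk by the dx(t) measure of
Equation 1; low scores mark stop word candidates. The toy collection is invented, and
this illustrates only the TBRS ranking step, not the authors' full method.

import math
from collections import Counter

def rank_by_informativeness(chunk, collection):
    """Score each term t with dx(t) = Px(t) * log2(Px(t)/P(t))
    and return the chunk's terms sorted least-informative first."""
    chunk_counts, all_counts = Counter(chunk), Counter(collection)
    nx, n = len(chunk), len(collection)
    scores = {}
    for term in chunk_counts:
        px = chunk_counts[term] / nx  # normalized frequency within chunk x
        p = all_counts[term] / n      # normalized frequency in the collection
        scores[term] = px * math.log2(px / p)
    return sorted(scores, key=scores.get)

collection = "the cat sat on the mat the dog sat".split()
chunk = "the cat sat".split()
print(rank_by_informativeness(chunk, collection))
# ['the', 'sat', 'cat'] - 'the' is least informative (stop word candidate)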

D. Stemming
This method is used to identify the root/stem of a word. For example, the words
connect, connected, connecting, connections can all be stemmed to the word "connect"
[6]. The purpose of this method is to remove various suffixes, to reduce the number of
words, to have accurately matching stems, and to save time and memory space. A
sketch follows the figure below.

Fig : Stemming Process
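
Below is a naive suffix-stripping stemmer, shown only to make the root-extraction idea
concrete; production systems typically use dedicated algorithms such as the Porter
stemmer, which the text does not name, so the suffix list here is an invented stand-in.

# Invented suffix list for a naive stemmer (illustration only).
SUFFIXES = ["ions", "ing", "ion", "ed", "s"]

def stem(word):
    """Strip the first matching suffix, keeping a stem of >= 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

for word in ["connect", "connected", "connecting", "connections"]:
    print(word, "->", stem(word))  # every form reduces to 'connect'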

TOKENIZATION
Tokenization is a critical activity in any information retrieval model. It
segregates all the words, numbers, and other characters from a given document; these
identified words, numbers, and other characters are called tokens. Along with token
generation, this process also evaluates the frequency value of each of these tokens in
the input documents. The phases of the tokenization process are as follows. Pre-
processing involves gathering the set of all documents, which are passed to the word
extraction phase in which all words are extracted. In the next phase all the infrequent
words are listed and removed, for example words having a frequency of less than
two. Intermediate results are passed to the stop word removal phase, which removes
those English words that are useless in information retrieval; these English words are
known as stop words. Examples of stop words include "the, as, of, and, or, to", etc.
This phase is very essential in tokenization because of its advantages: it
reduces the size of the indexing file and it also improves the overall efficiency and
effectiveness.
The next phase in tokenization is stemming. The stemming phase is used to extract
the sub-part, i.e. the stem/root, of a given word. For example, the words continue,
continuously, continued can all be rooted to the word continue. The main role of
stemming is to remove various suffixes, resulting in a reduction of the number of words,
to have exactly matching stems, and to minimize storage requirements and maximize
the efficiency of the IR model.

Fig: Stemming Process

On completion of the stemming process, the next step is to count the frequency of
each word. Information retrieval works on the output of this tokenization process to
produce the most relevant results for the given users.
Input:
This is an Information retrieval model and it is widely used in the data mining
application areas. In our project there is team of 2 members. Our project name is IR.
Output:
Words= this<1> is<4> an<1> Information<1> retrieval<1> model<1> and<1>
it<1> widely<1> used<1> in<2> the<1> data<1> mining<1> application<1> areas<1>
our<2> project<2> there<1> team<1> of<1> members<1> name<1> IR<1>

Numbers= 2<1>
If, for example, the document containing the text above is passed to
the tokenization process, the output of the process separates the words this, is, an,
information, etc. If there is a number, it is also separated from the other words, and
finally the tokens are given with their occurrence counts in the given document.

Proposed Algorithm and Example


In the proposed algorithm, tokenization is done based on a set of
training vectors which are initially provided to the algorithm to train the system.
The training documents are from different knowledge domains and are used to create
vectors. The created vectors help the algorithm to process the input documents. The
tokenization of documents is performed with respect to the vectors; the use of vectors
before tokenization helps to make the whole tokenization process more precise and
successful. The effect of the vectors on tokenization is also shown in the results
section, where the number of tokens generated and the time consumed for the process
differ significantly.
Input: (Di)
Output: (Tokens)
Begin
Step 1: Collect input documents (Di), where i = 1, 2, 3, ..., n;
Step 2: For each input Di: Extract Word (EWi) = Di;
// apply the extract-word process to all documents i = 1, 2, 3, ..., n and extract words //
Step 3: For each EWi:
Stop Word (SWi) = EWi;
// apply the stop word elimination process to remove all stop words like is, am, to, as, etc. //
Stemming (Si) = SWi;
// create stems of each word; e.g. "use" is the stem of user, using, usage, etc. //
Step 4: For each Si: Freq_Count (WCi) = Si;
// count the total number of occurrences of each stem Si //
Return (Si);
Step 5: Tokens (Si) will be passed to an IR system.
End
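
A compact Python rendering of Steps 1-5 is sketched below; the stop list and the
stemmer are minimal invented stand-ins for whatever components the system actually
uses, and the training-vector guidance is omitted.

from collections import Counter

# Tiny invented stop list and stemmer (stand-ins, not the real components).
STOP_WORDS = {"is", "am", "to", "as", "a", "an", "the", "for", "and", "it", "of"}

def simple_stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(documents):
    stems = []
    for doc in documents:                                  # Step 1: documents Di
        words = doc.lower().split()                        # Step 2: extract words EWi
        words = [w for w in words if w not in STOP_WORDS]  # Step 3: stop words SWi
        stems += [simple_stem(w) for w in words]           # Step 3: stemming Si
    return Counter(stems)                                  # Step 4: frequency count WCi

docs = ["Military is a good option for a career builder for youngsters"]
print(tokenize(docs))  # Step 5: tokens passed on to the IR system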
Example Phase 1:

Doc1 Military is a good option for a career builder for youngsters. Military is not
covering only defense it also includes IT sector and its various forms are
Army, Navy, and Air force. It satisfies the sacrifice need of youth for their
country.
Doc2 Cricket is the most popular game in India. In cricket a player uses a bat to hit
the ball and scoring runs. It is played between two teams; the team scoring
maximum runs will win the game.
Doc3 Science is the essentiality of education, what we are watching with our eyes
Happening non-happening all include science. Various scientists working on
different topics help us to understand the science in our lives. Science is
continuous evolutionary study, each day something new is determined.
Doc4 Engineering makes the development of any country, engineers are
manufacturing beneficial things day by day, as the professional engineers of
software develops programs which reduces man work, civil engineers gives
their knowledge to construction to form buildings, hospitals etc. Everything
can be controlled by computer systems nowadays.

Table: Documents to the tokenization process

Phase 2:
In this phase, all the words are extracted from these four documents as shown below:

Name: doc1
[Military, is, a, good, option, for, a, career, builder, for, youngsters, Military, is, not,
covering, only, defense, it, also, includes, IT, sector, and, its, various, forms, are, Army,,
Navy,, and, Air, force., It, satisfies, the, sacrifice, need, of, youth, for, their, country.]

Name: doc2
[Cricket, is, the, most, popular, game, in, India., In, cricket, a, player, uses, a,
bat, to, hit, the, ball, and, scoring, runs., It, is, played, between, two, teams;, the, team,
scoring, maximum, runs, will, win, the, game.]

Name: doc3
[Science, is, the, essentiality, of, education,, what, we, are, watching, with, our,
eyes, happening, non-happening, all, include, science., Various, scientists, working, on,
different, topics, help, us, to, understand, the, science, in, our, lives., Science, is,
continuous, evolutionary, study,, each, day, something, new, is, determined.]
Name: doc4
[Engineering, makes, the, development, of, any, country,, engineers, are,
manufacturing, beneficial, things, day, by, day,, as, the, professional, engineers, of,
software, develops, programs, which, reduces, man, work,, civil, engineers, gives, their,
knowledge, to, construction, to, form, buildings,, hospitals, etc., Everything, can, be,
controlled, by, computer, systems, nowadays.]

Phase 3 and Phase 4:


After extracting all the words, the next phases are to remove all stop words
and to perform stemming, as shown below:

Name: doc1
[militari, good, option, for, career, builder, for, youngster, militari, not, cover,
onli, defens, it, also, includ, it, sector, it, variou, form, ar,armi, navi, air, forc, it, satisfi,
sacrific, need, youth, for, their, country]

Name: doc2
[Cricket, most, popular, game, in, India, in, cricket, player, us, bat, to, hit, ball,
score, run, it, plai, between, two, team, team, score, maximum, run, win, game]
Fig: Document Tokenization Graph

Fig: Overall-Time Graph

Tokenization with pre-processing leads to an effective and efficient
approach to processing. As shown in the results, the strategy with pre-processing
processes 100 input documents and generates 200 distinct and accurate tokens in
156 ms, while processing the same set of documents with the other strategy takes
289 ms and generates more than 300 tokens.

Name: doc3
[scienc, essenti, educ, what, we, ar, watch,our, ey, happen, non, happen, all,
includ, scienc,variou, scientist, wor, on, differ, topic, help, to, understand, scienc, in,
our, live, scienc, continu, evolutionari, studi, each, dai, someth, new, determin]

Name: doc4
[engine, make, develop, ani, countri, engin, ar, manufactur, benefici, thing, dai,
by, dai, profession, engin, softwar, develop, program, which, reduc, man, work, civil,
engin, give, their, knowledg, to, construct, to, form, build, hospit, etc, everyth, can, be,
control, by, comput, system, nowadai]
CHAPTER 4
BIODIVERSITY

There is general agreement that biodiversity consists of three main types, or
levels, of diversity, as depicted in the figure below. The first type is Genetic diversity,
the second one is Species diversity, and the last one is Ecosystem diversity, or
Ecological diversity. These are the three levels at which biological variety has been
identified.

Fig: Types of diversity: Genetic (inner), Species (middle), and Ecosystem (outer)

Genetic diversity refers to the genetic variation and heritable traits within
organisms. All species are related to other species through a genetic network, but the
variety of genetic properties and features makes creatures different in their morphologic
characteristics. Genetic diversity applies to all living organisms having inheritance of
genes, including the amount of DNA per cell and chromosome structures. Genetic
diversity is an important factor for the adaptation of populations to changing
environments and the resistance to certain types of diseases. For species, a higher
genetic variation implies less risk. It is also essential for species evolution.
Species diversity refers to the variety of living organisms within an ecosystem,
a habitat or a region. It is evaluated by considering two factors: species richness and
species evenness. The first corresponds to the number of different species present in a
community per unit area, and the second to the relative abundance of each species in the
geographical area. Both factors are evaluated according to the size of populations or
biomass of each species in the area. Recent studies have shown relationships between
diversity within species and diversity among species. Species diversity is the most
visible part of biodiversity.
Ecosystem diversity refers to the variety of ecosystems in each region of the
world. An ecosystem is a combination of communities - associations of species - of
living organisms with the physical environment in which they live (e.g., air, water,
mineral soil, topography, and climate). Ecosystems vary in size, and in every
geographic region there is a complex mosaic of interconnected ecosystems. Ecosystems
are environments with a balanced state of natural elements (water, plants, animals,
fungi, microbes, molecules, climate, etc.). Ecosystem diversity embraces the variety of
habitats and environmental parameters that occur within a region. Ecologists consider
biodiversity according to the following three interdependent primary characteristics:
ecosystem composition, i.e., the variety and richness of inhabiting species; ecosystem
structure, i.e., the physical and three-dimensional patterns of life forms; and ecosystem
function, i.e., biogeochemical cycles and evolving environmental conditions. Even
though many application tools have been developed, evaluating biodiversity still faces
difficulties due to the complexity of precisely evaluating these parameters. Hence,
while the overall number of species around the world is estimated at 5 to 30 million,
only 1.7 to 2 million of them have been measured and officially identified.

Environmental Issues

In nature, biodiversity is the key to keeping the natural balance under changing
environmental conditions. It provides services such as the consumption service, which
supplies natural resources to humans (e.g., food, clothing, housing, medicines); the
industrial production service, which supplies forest productivity to be used either
directly or indirectly (e.g., extracting chemicals from plants in the forest); and other
(non-consumptive) uses, including the maintenance of sustainable ecosystems (e.g.,
soil maintenance, nitrogen supply to the soil, plant photosynthesis, humidity control).
All life on the planet needs nutrients and oxygen, which are the main factors for
survival. In particular, species depend on biodiversity resources produced by ecosystem
services. Ecosystem services can regulate climate changes, dispose of wastes, recycle
nutrients, filter and purify water, purify air, buffer against flooding, and maintain soil
fertility. Changes in environmental factors and ecosystems can thus endanger life
forms, as reported in several scientific studies.
The Global Biodiversity Outlook report of the Convention on Biological
Diversity (CBD) highlights the speech of Ban Ki-moon, United Nations Secretary-
General, on the fact that "the consequences of this collective failure, if it is not quickly
corrected, will be severe for us. Biodiversity underpins the functioning of the
ecosystems on which we depend for food and fresh water, health and recreation, and
protection from natural disasters. Its loss also affects us culturally and spiritually. This
may be more difficult to quantify, but is nonetheless integral to our well-being". The
loss of biodiversity has become a serious issue for the twenty-first century. It has direct
and indirect negative effects on many factors connecting the elements of biodiversity
as well as ecosystems.
Regarding environmental factors, the loss of biodiversity means that the natural
balance between environmental conditions and the different types of diversity cannot
be preserved, which affects the stability of ecosystems. This can lead to climate
changes, such as the global warming reported in several scientific studies, and
consequently to natural disasters (landslides, floods, typhoons, cyclones, hurricanes,
tsunamis, etc.).
Tourism factors are impacted by environmental factors: if the latter are
affected, by natural disasters or pollution for instance, tourism structures, such as
aesthetic natural landscapes and historical places, can be damaged or even destroyed.
For example, the effect of pollution on the structure and environment of the city of
Venice, in Italy, is known to endanger buildings. This can impact tourism, as natural
landscapes and historical places are attraction sites for tourists, which consequently
contribute to the development of economic and social activities.
Human factors are linked to nutrients, oxygen, and other essential needs that
are produced from biodiversity resources. If biodiversity resources are reduced, the
volume of vital products, such as food, water, plants and animals, will also decrease.
This raises concerns about human consumption and survival. For example, an increase
in production costs can make it difficult for human populations to access vital
resources, such as medicine, food, etc.

Fig: Environmental Issues

Society factors are affected by biodiversity loss, as most parts of society's
infrastructures and livelihoods depend on the basic system and structure of nature. One
factor is nature's productivity, which depends on land structures for agriculture and
irrigation, wood materials for building habitats, and natural and energy materials for
other forms of consumption.
For example, a large share of the people living in rural societies depend mainly
on the productivity of agriculture and livestock for their livelihood, while people living
in metropolitan areas need even more biodiversity productivity, as their demands for
food, energy, materials and other resources are higher (transportation, construction,
consumption products, etc.). Scarcity of resources can thus cause an increase in
production costs, leading to a reduction of the part of the population that has access to
these resources.
Economics factors are impacted through the direct benefits and added value of
natural resources (e.g., food, bio-fuels and renewable energies, animals and fibers,
wood materials, bio-medical treatments). These resources contribute to the economic
exchanges between countries around the world through internal and external
commerce. Biodiversity loss, and the resulting scarcity of resources, can affect
populations from an economic viewpoint: for instance, a large share of the human
population (more than 60 percent) relies on bio-medication for primary health care.
Biodiversity loss can also lead to higher production costs, implying more competition,
financial crises, and other economy-related issues.

Methodology
This chapter presents the methodological steps adopted in the research study.
The research procedure followed is described under the following headings:

A. Selection of the Protein database and Forest fire database.
B. Selection of data mining tools.
C. Use of tools in the research study, such as MATLAB and WEKA.
D. Selection of data mining algorithms (an illustrative sketch follows this list).
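As a purely illustrative sketch of step D, the following Python/scikit-learn snippet runs
k-means clustering on stand-in numeric data. The study itself performs such
experiments in MATLAB and WEKA, so the dataset, feature count, and cluster count
here are assumptions rather than the study's actual setup.

```python
# Hypothetical stand-in for step D: clustering numeric records with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic 2-feature records standing in for protein / forest fire attributes:
# three well-separated groups of 50 points each.
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(50, 2)) for m in (0, 3, 6)])

# k-means with k-means++ seeding (scikit-learn's default initialization).
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(model.cluster_centers_)   # one center per discovered cluster
print(model.labels_[:10])       # cluster assignment of the first ten records
```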

Resources and Technologies for Biodiversity

This section is devoted to the presentation of the major information providers,
the different resources available, and the technologies used to represent data and
knowledge related to biodiversity and the environment.
Resources

The most prominent information providers that propose data and knowledge
repositories used for biodiversity and environmental studies are shown in Figure 2.
Each one provides contents related to the different domains and categories depicted by
the edges in the schema. These providers propose contents of different types
(documents, databases, meta-data, spatio-temporal data, etc.), in different categories
(data, knowledge and documentation) and for different domains of application. Two
main categories of resources are considered in this classification diagram. The Data
category corresponds to resources depicting facts about species (animals and plants)
and environmental conditions in specific areas. The Frameworks category corresponds
to resources depicting both tacit and formal knowledge related to biodiversity and
environment analytical application domains. Several providers, such as ESABII (East
and Southeast Asia Biodiversity Information Initiative), IUCN (International Union for
Conservation of Nature), and OECD (Organisation for Economic Co-operation and
Development), give access to both data and framework resources.
In the Biodiversity Policy knowledge domain, information concerns issues of
principles, regulations and agreements on biodiversity. Among these resources, we can
cite BioNET, CBD, ESABII, IUCN and UNEP, which provide information to support
and coordinate the work of botanists, biologists and researchers worldwide. For
example, the Agreement of the United Nations Decade on Biodiversity 2011-2020
aims to support and implement the Strategic Plan for Biodiversity.
The Environment domain refers to research results and repositories in this area,
addressed to scientists or anyone who wants to know the status of the environment on
Earth, especially biologists working on this issue. The major actors in this category are
BHL, BioNET, CBD, ECNC, IUCN, KNEU and UNEP, organizations that regularly
publish reports and results of studies on domains related to biodiversity and the
environment, such as Protected Areas Presented as Natural Solutions to Global
Environmental Challenges at RIO+20, published by the IUCN, or the Global
Environment Facility (GEF), published by the CBD organization.
The Economics domain corresponds to information from scientists about the
status of economic development, based on the effects and values of ecosystem
services. The foremost information providers in this category are CBD, IUCN, OECD
and TEEB. Among the reports and studies published by these organizations, we can
cite, for example, Restoring World's Forests Proven to Boost Local Economies and
Reduce Poverty by IUCN and Green Growth and Sustainable Development by OECD.
In the Health/Society domain, repositories supply knowledge referring to the
natural resources of ecosystem services and their effects. This information has been
produced, and its validity demonstrated, by researchers from world-wide organizations
such as BHL, CBD, IUCN, KNEU and UNEP. These organizations provide summaries
and proposals such as, for example, Human Health and Biodiversity, proposed by
CBD, Towards the Blue Society, proposed by IUCN, and Action for Biodiversity:
Towards a Society in Harmony with Nature, published by CBD.
Fig: Classification of biodiversity information
