
Translational Bioinformatics

ploscollections.org/translationalbioinformatics

'Translational Bioinformatics' is a collection of PLOS Computational Biology Education articles which reads as
a "book" to be used as a reference or tutorial for a graduate level introductory course on the science of
translational bioinformatics.

Translational bioinformatics is an emerging field that addresses the current challenges of integrating
increasingly voluminous amounts of molecular and clinical data. Its aim is to provide a better understanding of
the molecular basis of disease, which in turn will inform clinical practice and ultimately improve human health.

The concept of a translational bioinformatics introductory book was originally conceived in 2009 by Jake Chen
and Maricel Kann. Each chapter was crafted by leading experts who provide a solid introduction to the topics
covered, complete with training exercises and answers. The rapid evolution of this field is expected to lead to
updates and new chapters that will be incorporated into this collection.

Collection editors: Maricel Kann, Guest Editor, and Fran Lewitter, PLOS Computational Biology Education
Editor.

Table of Contents

Introduction to Translational Bioinformatics Collection


Russ B. Altman
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002796

Chapter 1: Biomedical Knowledge Integration


Philip R. O. Payne
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002826

Chapter 2: Data-Driven View of Disease Biology


Casey S. Greene, Olga G. Troyanskaya
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002816

Chapter 3: Small Molecules and Disease


David S. Wishart
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002805

Chapter 4: Protein Interactions and Disease


Mileidy W. Gonzalez, Maricel G. Kann
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002819

Chapter 5: Network Biology Approach to Complex Diseases


Dong-Yeon Cho, Yoo-Ah Kim, Teresa M. Przytycka
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002820

Chapter 6: Structural Variation and Medical Genomics


Benjamin J. Raphael
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002821

Chapter 7: Pharmacogenomics
Konrad J. Karczewski, Roxana Daneshjou, Russ B. Altman
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002817

Chapter 8: Biological Knowledge Assembly and Interpretation


Ju Han Kim
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002858

Chapter 9: Analyses Using Disease Ontologies


Nigam H. Shah, Tyler Cole, Mark A. Musen
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002827

Chapter 10: Mining Genome-Wide Genetic Markers


Xiang Zhang, Shunping Huang, Zhaojun Zhang, Wei Wang
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002828

Chapter 11: Genome-Wide Association Studies


William S. Bush, Jason H. Moore
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002822

Chapter 12: Human Microbiome Analysis


Xochitl C. Morgan, Curtis Huttenhower
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002808

Chapter 13: Mining Electronic Health Records in the Genomics Era


Joshua C. Denny
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002823

Chapter 14: Cancer Genome Analysis


Miguel Vazquez, Victor de la Torre, Alfonso Valencia
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002824

Chapter 15: Disease Gene Prioritization


Yana Bromberg
PLOS Computational Biology: published 25 Apr 2013 | info:doi/10.1371/journal.pcbi.1002902

Chapter 16: Text Mining for Translational Bioinformatics


K. Bretonnel Cohen, Lawrence E. Hunter
PLOS Computational Biology: published 25 Apr 2013 | info:doi/10.1371/journal.pcbi.1003044

Chapter 17: Bioimage Informatics for Systems Pharmacology


Fuhai Li, Zheng Yin, Guangxu Jin, Hong Zhao, Stephen T. C. Wong
PLOS Computational Biology: published 25 Apr 2013 | info:doi/10.1371/journal.pcbi.1003043

Collection page URL: www.ploscollections.org/translationalbioinformatics


Education

Introduction to Translational Bioinformatics Collection


Russ B. Altman*
Department of Genetics, Stanford University, Stanford, California, United States of America

This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

Citation: Altman RB (2012) Introduction to Translational Bioinformatics Collection. PLoS Comput Biol 8(12):
e1002796. doi:10.1371/journal.pcbi.1002796

Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland,
Baltimore County, United States of America

Published December 27, 2012

Copyright: © 2012 Russ B. Altman. This is an open-access article distributed under the terms of the Creative
Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium,
provided the original author and source are credited.

Funding: The author received no specific funding for writing this article.

Competing Interests: The author has declared that no competing interests exist.

* E-mail: russ.altman@stanford.edu

PLOS Computational Biology | www.ploscompbiol.org 1 December 2012 | Volume 8 | Issue 12 | e1002796

How should we define translational bioinformatics? I had to answer this question unambiguously in March 2008
when I was asked to deliver a review of recent progress in translational bioinformatics at the American Medical
Informatics Association's Summit on Translational Bioinformatics. The lecture required me to define papers in
the field, and then highlight exciting progress that occurred over the previous ~12 months. I have repeated
this for the last few years, and the most difficult part of the exercise is limiting my review only to those
papers that are within the field.

I have never worried much about definitions within informatics fields; they tend to overlap, merge and evolve.
"Informatics" seems clear: the study of how to represent, store, search, retrieve and analyze information. The
adjectives in front of informatics vary but also tend to make sense: medical informatics concerns medical
information, bioinformatics concerns basic biological information, clinical informatics focuses on the clinical
delivery part of medical informatics, biomedical informatics merges bioinformatics and medical informatics,
imaging informatics focuses on images, and so on. So what does this adjective "translational" denote?

Translational medical research has emerged as an important theme in the last decade. Starting with top-down
leadership from the National Institutes of Health and its former Director, Dr. Elias Zerhouni, and moving
through academic medical centers, research institutes and industrial research and development efforts, there
has been interest in more effectively moving the discoveries and innovations in the laboratory to the bedside,
leading to improved diagnosis, prognosis, and treatment. Translational research encompasses many activities
including the creation of medical devices, molecular diagnostics, small molecule therapeutics, biological
therapeutics, vaccines, and others. One of the main targets of translation, however, is the revolutionary
explosion of knowledge in molecular biology, genetics, and genomics. Some believe that the tremendous progress
in discovery over the last 50+ years since elucidation of the double helix structure has not translated
(there's that word!) into much practical health benefit. While the accuracy of this claim can be debated, there
can be no debate that our ability to measure (1) DNA sequence (including entire genomes!), (2) RNA sequence and
expression, (3) protein sequence, structure, expression and modification, and (4) small molecule metabolite
structure, presence, and quantity has advanced rapidly and enables us to imagine fantastic new technologies in
pursuit of human health.

There are many barriers to translating our molecular understanding into technologies that impact patients.
These include understanding health market size and forces, the regulatory milieu, how to harden the technology
for routine use, and how to navigate an increasingly complex intellectual property landscape. But before those
activities can begin, we must overcome an even more fundamental barrier: connecting the stuff of molecular
biology to the clinical world. Molecular and cellular biology studies genes, DNA, RNA messengers, microRNAs,
proteins, signaling molecules and their cascades, metabolites, cellular communication processes and cellular
organization. These data are freely available in valuable resources such as Genbank
(http://www.ncbi.nlm.nih.gov/genbank/), Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/), Protein
Data Bank (http://www.wwpdb.org/), KEGG (http://www.genome.jp/kegg/), MetaCyc (http://metacyc.org/), Reactome
(http://www.reactome.org), and many other resources. The clinical world studies diseases, signs, symptoms,
drugs, patients, clinical laboratory measurements, and clinical images. The emergence of clinical and health
information technologies has begun to make these clinical data available for research through biobanks,
electronic medical records, FDA resources about drug labels and adverse events, and claims data. Therefore, a
major challenge for translational medicine is to connect the molecular/cellular world with the clinical world.
The published literature, available in PubMED (http://www.ncbi.nlm.nih.gov/pubmed), does this, as does the
Unified Medical Language System (UMLS) that provides a lingua franca (http://www.nlm.nih.gov/research/umls/).
However, it falls to translational bioinformatics to engineer the tools that link molecular/cellular entities
and clinical entities. Thus, I define translational bioinformatics research as "the development and application
of informatics methods that connect molecular entities to clinical entities."

In this collection, Dr. Kann and colleagues have assembled a wonderful group of authors to introduce the key
threads of translational bioinformatics to those new to the field. The collection first provides a conceptual
overview of the key data and concepts in the field, and then introduces some of the key methods for informatics
discovery and applications. Just by examining the table of contents on the collection page
(http://www.ploscollections.org/translationalbioinformatics), it is clear that many exciting and emerging
health topics are squarely within the scope of translational bioinformatics: cancer, pharmacogenomics, medical
genetics, small molecule drugs, and diseases of protein malfunction. There is an unmistakable flavor of
personalized medicine here as well (genome association studies, mining genetic markers, personal genomic data
analysis, data mining of electronic records): our molecular and clinical data resources are now allowing us to
consider individual variations, and not simply population averages. I congratulate the editors and authors on
creating an important collection of articles, and welcome the reader to an exciting field whose challenges and
promise are unbounded.
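
As one way to picture the working definition given in this introduction (informatics methods that connect
molecular entities to clinical entities), the following is a minimal Python sketch and not a method taken from
the collection. The class names (MolecularEntity, ClinicalEntity, Link, LinkIndex), the relation labels, and
the "illustrative" source tag are all assumptions made for this example. The two links use widely documented
associations (BRCA1 with hereditary breast cancer, CYP2C19 with clopidogrel response) purely as sample data.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List


@dataclass(frozen=True)
class MolecularEntity:
    identifier: str  # e.g. a gene symbol
    kind: str        # e.g. "gene", "protein", "metabolite"


@dataclass(frozen=True)
class ClinicalEntity:
    identifier: str  # e.g. a disease or drug name
    kind: str        # e.g. "disease", "drug", "lab measurement"


@dataclass(frozen=True)
class Link:
    """A typed connection between the molecular and the clinical world."""
    molecular: MolecularEntity
    clinical: ClinicalEntity
    relation: str  # hypothetical label, not a standard vocabulary
    source: str    # where the assertion came from (placeholder here)


class LinkIndex:
    """Hypothetical in-memory index that can be queried from either side."""

    def __init__(self) -> None:
        self._by_molecular: DefaultDict[str, List[Link]] = defaultdict(list)
        self._by_clinical: DefaultDict[str, List[Link]] = defaultdict(list)

    def add(self, link: Link) -> None:
        self._by_molecular[link.molecular.identifier].append(link)
        self._by_clinical[link.clinical.identifier].append(link)

    def clinical_for(self, molecular_id: str) -> List[Link]:
        # Molecular -> clinical direction; unknown ids give an empty list.
        return list(self._by_molecular.get(molecular_id, []))

    def molecular_for(self, clinical_id: str) -> List[Link]:
        # Clinical -> molecular direction.
        return list(self._by_clinical.get(clinical_id, []))


index = LinkIndex()
index.add(Link(MolecularEntity("BRCA1", "gene"),
               ClinicalEntity("hereditary breast cancer", "disease"),
               "associated_with", "illustrative"))
index.add(Link(MolecularEntity("CYP2C19", "gene"),
               ClinicalEntity("clopidogrel", "drug"),
               "affects_response_to", "illustrative"))

for link in index.clinical_for("CYP2C19"):
    print(link.molecular.identifier, link.relation, link.clinical.identifier)
# prints: CYP2C19 affects_response_to clopidogrel
```

Indexing each link from both sides reflects the two-way nature of the problem: a question may start from a
gene and ask for clinical consequences, or start from a disease or drug and ask for molecular correlates. A
real system would likely draw identifiers from resources such as those named above rather than free-text
strings.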



Education

Chapter 1: Biomedical Knowledge Integration


Philip R. O. Payne*
The Ohio State University, Department of Biomedical Informatics, Columbus, Ohio, United States of America

Abstract: The modern biomedical research and healthcare delivery domains have seen an unparalleled increase in
the rate of innovation and novel technologies over the past several decades. Catalyzed by paradigm-shifting
public and private programs focusing upon the formation and delivery of genomic and personalized medicine, the
need for high-throughput and integrative approaches to the collection, management, and analysis of
heterogeneous data sets has become imperative. This need is particularly pressing in the translational
bioinformatics domain, where many fundamental research questions require the integration of large scale,
multi-dimensional clinical phenotype and bio-molecular data sets. Modern biomedical informatics theory and
practice has demonstrated the distinct benefits associated with the use of knowledge-based systems in such
contexts. A knowledge-based system can be defined as an intelligent agent that employs a computationally
tractable knowledge base or repository in order to reason upon data in a targeted domain and reproduce expert
performance relative to such reasoning operations. The ultimate goal of the design and use of such agents is to
increase the reproducibility, scalability, and accessibility of complex reasoning tasks. Examples of the
application of knowledge-based systems in biomedicine span a broad spectrum, from the execution of clinical
decision support, to epidemiologic surveillance of public data sets for the purposes of detecting emerging
infectious diseases, to the discovery of novel hypotheses in large-scale research data sets. In this chapter,
we will review the basic theoretical frameworks that define core knowledge types and reasoning operations with
particular emphasis on the applicability of such conceptual models within the biomedical domain, and then go on
to introduce a number of prototypical data integration requirements and patterns relevant to the conduct of
translational bioinformatics that can be addressed via the design and use of knowledge-based systems.

This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

Citation: Payne PRO (2012) Chapter 1: Biomedical Knowledge Integration. PLoS Comput Biol 8(12): e1002826.
doi:10.1371/journal.pcbi.1002826

Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland,
Baltimore County, United States of America

Published December 27, 2012

Copyright: © 2012 Philip R. O. Payne. This is an open-access article distributed under the terms of the
Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any
medium, provided the original author and source are credited.

Funding: The author received no specific funding for this article.

Competing Interests: The author has declared that no competing interests exist.

* E-mail: philip.payne@osumc.edu

PLOS Computational Biology | www.ploscompbiol.org 1 December 2012 | Volume 8 | Issue 12 | e1002826

1. Introduction

The modern biomedical research domain has experienced a fundamental shift towards integrative and translational
methodologies and frameworks over the past several years. A common thread throughout the translational sciences
is the set of needs related to the collection, management, integration, analysis and dissemination of
large-scale, heterogeneous biomedical data sets. However, well-established and broadly adopted theoretical and
practical frameworks intended to address such needs are still largely developmental [1–3]. Instead, the
development and execution of multi-disciplinary, translational science programs is significantly limited by the
propagation of "silos" of both data and knowledge, and a paucity of reproducible and rigorously validated
methods that may be used to support the satisfaction of motivating and integrative translational bioinformatics
use cases, such as those focusing on the identification of expression motifs spanning bio-molecules and
clinical phenotypes.

In order to provide sufficient context and scope to our ensuing discussion, we will define translational
science and research per the conventions provided by the National Institutes of Health (NIH) as follows:

"Translational research includes two areas of translation. One is the process of applying discoveries generated
during research in the laboratory, and in preclinical studies, to the development of trials and studies in
humans. The second area of translation concerns research aimed at enhancing the adoption of best practices in
the community. Cost-effectiveness of prevention and treatment strategies is also an important part of
translational science." [4]

Several recent publications have defined a translational research cycle, which involves the translation of
knowledge and evidence from the bench (e.g., laboratory-based discoveries) to the bedside (e.g., clinical or
public health interventions informed by basic science and clinical research), and reciprocally from the bedside
back to the bench (e.g., basic science studies informed by observations from the point-of-care) [5]. Within
this translational cycle, Sung and colleagues [5] have defined two critical blockages that exist between basic
science discovery and the design of prospective clinical studies, and subsequently between the knowledge
generated during clinical studies and the provision of such evidence-based care in the clinical or public
health settings. These are known as the T1 and T2 blocks, respectively. Much of the work conducted under the
auspices of the NIH Roadmap initiative and more recently as part of the Clinical and Translational Science
Award (CTSA) program is specifically focused on identifying approaches or policies that can mitigate these T1
and T2 blockages, and thus increase the speed and efficiency by which new biomedical knowledge can be realized
in terms of improved health and patient outcomes.

The positive outcomes afforded by the close coupling of biomedical informatics
with the translational sciences have been described frequently in the published literature [3,5–7]. Broadly,
the critical areas to be addressed by such informatics approaches relative to translational research activities
and programs can be classified as belonging to one or more of the following categories:

The management of multi-dimensional and heterogeneous data sets: The modern healthcare and life sciences
ecosystem is becoming increasingly data centric as a result of the adoption and availability of high-throughput
data sources, such as electronic health records (EHRs), research data management systems (e.g., CTMS, LIMS,
Electronic Data Capture tools), and a wide variety of bio-molecular scale instrumentation platforms. As a
result of this evolution, the size and complexity of data sets that must be managed and analyzed are growing at
an extremely rapid rate [1,2,6,8,9]. At the same time, the data management practices currently used in most
research settings are both labor intensive and rely upon technologies that have not been designed to handle
such multi-dimensional data [9–11]. As a result, there are significant demands from the translational science
community for the creation and delivery of information management platforms capable of adapting to and
supporting heterogeneous workflows and data sources [2,3,12,13]. This need is particularly important when such
research endeavors focus on the identification of linkages between bio-molecular and phenotypic data in order
to inform novel systems-level approaches to understanding disease states.

Relative to the specific topic area of knowledge representation and utilization in the translational sciences,
the ability to address the preceding requirements is largely predicated on the ability to ensure that the
semantics of such data are well understood [10,14,15]. This is a scenario often referred to as semantic
interoperability, and requires the use of informatics-based approaches to map among various data
representations, as well as the application of such mappings to support integrative data integration and
analysis operations [10,15].

The application of knowledge-based systems and intelligent agents to enable high-throughput hypothesis
generation and testing: Modern approaches to hypothesis discovery and testing are primarily based on the
intuition of the individual investigator or his/her team to identify a question that is of interest relative to
their specific scientific aims, and then carry out hypothesis testing operations to validate or refine that
question relative to a targeted data set [6,16]. This approach is feasible when working with data sets
comprised of hundreds of variables, but does not scale to projects involving data sets with magnitudes on the
order of thousands or even millions of variables [10,14]. An emerging and increasingly viable solution to this
challenge is the use of domain knowledge to generate hypotheses relative to the content of such data sets. This
type of domain knowledge can be derived from many different sources, such as public databases, terminologies,
ontologies, and published literature [14]. It is important to note, however, that methods and technologies that
can allow researchers to access and extract domain knowledge from such sources, and apply resulting knowledge
extracts to generate and test hypotheses, are largely developmental at the current time [10,14].

The facilitation of data-analytic pipelines in in-silico research programs: The ability to execute in-silico
research programs, wherein hypotheses are designed, tested, and validated in existing data sets using
computational methods, is highly reliant on the use of data-analytic "pipelining" tools. Such pipelines are
ideally able to support data extraction, integration, and analysis workflows spanning multiple sources, while
capturing intermediate data analysis steps and products, and generating actionable output types [17,18]. Such
pipelines provide a number of benefits, including: 1) they support the design and execution of data analysis
plans that would not be tractable or feasible using manual methods; and 2) they provide for the capture of
meta-data describing the steps and intermediate products generated during such data analyses. In the case of
the latter benefit, the ability to capture systematic meta-data is critical to ensuring that such in-silico
research paradigms generate reproducible and high quality results [17,18]. There are a number of promising
technology platforms capable of supporting such data-analytic "pipelining", such as the caGrid middleware [18].
It is of note, however, that widespread use of such pipeline tools is not robust, largely due to barriers to
adoption related to data ownership/security and socio-technical factors [13,19].

The dissemination of data, information, and knowledge generated during the course of translational science
research programs: It is widely held that the time period required to translate a basic science discovery into
clinical research, and ultimately evidence-based practice or public health intervention, can exceed 15 years
[2,5,7,20]. A number of studies have identified the lack of effective tools for supporting the exchange of
data, information, and knowledge between the basic sciences, clinical research, clinical practice, and public
health practice as one of the major impediments to effective and timely translation of novel biological
discoveries into health benefits [2]. A number of informatics-based approaches have been developed to overcome
such translational impediments, such as web-based collaboration platforms, knowledge representation and
delivery standards, and public data registries and repositories [3,7,9,21]. Unfortunately, the systematic and
regular use of such tools and methods is generally very poor in the translational sciences, as in the preceding
case, due to a combination of governance and socio-technical barriers.

What to Learn in This Chapter

• Understand basic knowledge types and structures that can be applied to biomedical and translational science;
• Gain familiarity with the knowledge engineering cycle, tools and methods that may be used throughout that
  cycle, and the resulting classes of knowledge products generated via such processes;
• Understand the basic methods and techniques that can be used to employ knowledge products in order to
  integrate and reason upon heterogeneous and multi-dimensional data sets; and
• Become conversant in the open research questions/areas related to the ability to develop and apply knowledge
  collections in the translational bioinformatics domain.

At a high level, all of the aforementioned challenges and opportunities correspond to an overarching set of
problem statements, as follows:

• Translational bioinformatics is defined by the presence of complex, heterogeneous, multi-dimensional data
  sets;
• The scope of available biomedical knowledge collections that may be applied to assist in the integration
  and analysis of such data is growing at a rapid pace;
• The ability to apply such knowledge collections to translational bioinformatics analyses requires an
  understanding of the sources of such knowledge, and methods of applying them to reasoning applications; and
• The application of knowledge collections to support integrative analyses in the translational science domain
  introduces multiple areas of complexity that must be understood in order to enable the optimal selection and
  use of such resources and methods, as well as the interpretation of results generated via such applications.

2. Key Definitions

In the remainder of this chapter, we will introduce a set of definitions, frameworks, and methods that serve to
support the foundational knowledge integration requirements incumbent to the efficient and effective conduct of
translational studies. In order to provide a common understanding of key terms and concepts that will be used
in the ensuing discussion, we will define here a number of those entities, using the broad context of Knowledge
Engineering (KE) as a basis for such assertions. The KE process (Figure 1) incorporates multiple steps:

1. Acquisition of knowledge (KA)
2. Representation of that knowledge (KR) in a computable form
3. Implementation or refinement of knowledge-based agents or applications using the knowledge collection
   generated in the preceding stages
4. Verification and validation of the output of those knowledge-based agents or applications against one or
   more reference standards.

Figure 1. Key components of the KE process.
doi:10.1371/journal.pcbi.1002826.g001

In the context of the final phase of the KE cycle, comparative reference standards can include expert
performance measures, requirements acquired before designing the knowledge-based system, or requirements that
were realized upon implementation of the knowledge-based system. In this regard, verification is the process of
ensuring that the knowledge-based system meets the initial requirements of the potential end-user community. In
comparison, validation is the process of ensuring that the knowledge-based system meets the realized
requirements of the end-user community once a knowledge-based system has been implemented [22]. Furthermore,
within the overall KE process, KA can be defined as the sub-process involving the extraction of knowledge from
existent sources (e.g., experts, literature, databases, etc.) for the purpose of representing that knowledge in
a computable format [23–28].

The KE process is intended to target three potential types of knowledge, as defined below:

• Conceptual knowledge is defined in the education literature as a combination of atomic units of information
  and the meaningful relationships between those units. The education literature also describes two other
  types of knowledge, labeled as procedural and strategic;
• Procedural knowledge is a process-oriented understanding of a given problem domain [29–32];
• Strategic knowledge is knowledge that is used to operationalize conceptual knowledge into procedural
  knowledge [31].

The preceding definitions are derived from empirical research on learning and problem-solving in complex
scientific and quantitative domains such as mathematics and engineering [30,32]. The cognitive science
literature provides a similar differentiation of knowledge types, making the distinction between procedural and
declarative knowledge. Declarative knowledge is synonymous with conceptual knowledge as defined above [33].

Conceptual knowledge collections are perhaps the most commonly used knowledge types in biomedicine. Such
knowledge and its representation span a spectrum that includes ontologies, controlled terminologies, semantic
networks and database schemas. A recurring focus throughout discussions of conceptual knowledge collections in
the biomedical informatics domain is the process of representing conceptual knowledge in a computable form. In
contrast, the process of eliciting knowledge has received less attention, and reports on rigorous and
reproducible methods that may be used in this area are rare. It is also important to note that in the
biomedical informatics domain conceptual knowledge collections rarely exist in isolation. Instead, they usually
occur within structures that contain multiple types of knowledge. For example, a knowledge-base used in a
modern clinical decision support system might include: (1) a knowledge collection containing potential
findings, diagnoses, and the relationships between them (conceptual knowledge), (2) a knowledge collection
containing guidelines or algorithms used to logically traverse the previous knowledge structure (procedural
knowledge), and (3) a knowledge structure containing application logic used to apply or operationalize the
preceding knowledge collections (strategic knowledge). Only when these three types of knowledge are combined is
it possible to realize a functional decision support system [34].

3. Underlying Theoretical Frameworks

The theories that support the ability to acquire, represent, and verify or validate conceptual knowledge come
from multiple domains. In the following sub-section, several of those domains will be discussed, including:

knowledge consists of both symbols of reality, and relationships between those symbols. The hypothesis further
argues that intelligence is defined by the ability to appropriately and logically manipulate both symbols and
relationships. A critical component of this theory is the definition of what constitutes a physical symbol
system, which Newell and Simon describe as:

"a set of entities, called symbols, which are physical patterns that can occur as components of another type of
entity called an expression (or symbol structure). Thus, a symbol structure is composed of a number of
instances (or tokens) of symbols related in some physical way (such as one token being next to another). At any
instant of time the system will contain a collection of these symbol structures." [36]

This preceding definition is very similar to that of conceptual knowledge introduced earlier in this chapter,
which leads to the observation that the computational representation of conceptual knowledge collections should
be well supported by computational theory. However, as described earlier, there is not a large body of research
on reproducible methods for eliciting such symbol systems. Consequently, the elicitation of the symbols and
relationships that constitute a physical symbol system, or conceptual knowledge collection, remains a
significant challenge. This challenge, in turn, is an impediment to the widespread use of conceptual
knowledge-based systems.

3.2 Psychological and Cognitive Basis for Knowledge Engineering

At the core of the currently accepted psychological basis for KE is expertise transfer, which is the theory
that humans transfer their expertise to computational systems so that those systems are able to

It has been argued within the KE literature that the constructs used by experts can be used as the basis for
designing or populating conceptual knowledge collections [26]. The details of PCT help to explain how experts
create and use such constructs. Specifically, Kelly's fundamental postulate is that "a person's processes are
psychologically channelized by the way in which he anticipated events." This is complemented by the theory's
first corollary, which is summarized by his statement that:

"Man looks at his world through transparent templates which he creates and then attempts to fit over the
realities of which the world is composed ... Constructs are used for predictions of things to come ... The
construct is a basis for making a distinction ... not a class of objects, or an abstraction of a class, but a
dichotomous reference axis."

Building upon these basic concepts, Kelly goes on to state in his Dichotomy Corollary that "a person's
construction system is composed of a finite number of dichotomous constructs." Finally, the parallel nature of
personal constructs and conceptual knowledge is illustrated in Kelly's Organization Corollary, which states,
"each person characteristically evolves, for his convenience of anticipating events, a construction system
embracing ordinal relationships between constructs" [26,37].

Thus, in an effort to bring together these core pieces of PCT, it can be argued that personal constructs are
essentially templates applied to the creation of knowledge classification schemas used in reasoning. If such
constructs are elicited from experts, atomic units of information can be defined, and the Organization
Corollary can be applied to generate networks of ordinal relationships between those units. Collectively, these
arguments serve to satisfy and reinforce the earlier definition of conceptual knowledge, and
provide insight into the expert knowledge
N Computational science replicate expert human performance.
One theory that helps explain the structures that can be targeted when
N Psychology and cognitive science
process of expertise transfer is Kellys eliciting conceptual knowledge.
N Semiotics Personal Construct Theory (PCT). This There are also a number of cognitive
N Linguistics theory defines humans as anticipatory science theories that have been applied to
systems, where individuals create tem- inform KE methods. Though usually very
plates, or constructs that allow them to similar to the preceding psychological
3.1 Computational Foundations of recognize situations or patterns in the theories, cognitive science theories specif-
Knowledge Engineering information world surrounding them. ically describe KE within a broader
A critical theory that supports the ability These templates are then used to antici- context where humans are anticipatory
to acquire and represent knowledge in a pate the outcome of a potential action systems who engage in frequent transfers
computable format is the physical symbol given knowledge of similar previous expe- of expertise. The cognitive science litera-
hypothesis. First proposed by Newell and riences [37]. Kelly views all people as ture identifies expertise transfer pathways
Simon in 1981 [35], and expanded upon personal scientists who make sense of as an existent medium for the elicitation of
by Compton and Jansen in 1989 [24], the the world around them through the use of knowledge from domain experts. This
physical symbol hypothesis postulates that a hypothetico-deductive reasoning system. conceptual model of expertise transfer is

PLOS Computational Biology | www.ploscompbiol.org 4 December 2012 | Volume 8 | Issue 12 | e1002826


often illustrated using the Hawkins 3.3 Semiotic Basis for Knowledge N Symbol: representational artifact of a
model for expert-client knowledge transfer Engineering unit of knowledge (e.g., text or icons).
[38].
It is also important to note that at a high
Though more frequently associated N Referent: actual unit of knowledge,
with the domains of computer science, which is largely a conceptual construct.
level, cognitive science theories focus upon
the differentiation among knowledge types.
psychology and cognitive science, there
are a few instances where semiotic theory
N Thought or Reference: unit of
knowledge as actually understood by
As described earlier, cognitive scientists has been cited as a theoretical basis for the individual or system utilizing or
make a primary differentiation between KE. Semiotics can be broadly defined as acting upon that knowledge.
procedural knowledge and declarative the study of signs, both individually and grouped
knowledge [31]. While cognitive science in sign systems, and includes the study of how In addition, three primary relationships
theory does not necessarily link declarative meaning is transmitted and understood [47]. As are hypothesized to exist, linking the three
and procedural knowledge, an implicit a discipline, much of its initial theoretical preceding representational formats:
relationship is provided by defining proce- basis is derived from the domain of
dural knowledge as consisting of three linguistics, and thus, has been traditionally N Stands-for imputed relation:
orders, or levels. For each level, the focused on written language. However, the relationship between the symbolic
complexity of declarative knowledge in- scope of contemporary semiotics literature representation of the knowledge and
volved in problem solving increases com- has expanded to incorporate the analysis the actual unit of knowledge
mensurately with the complexity of proce- of meaning in visual presentation systems, N Refers-to causal relation: rela-
dural knowledge being used [28,31,39]. knowledge representation models and tionship between the actual unit of
A key difference between the theories multiple communication mediums. The knowledge, and the unit of knowledge
provided by the cognitive science and basic premise of the semiotic theory of as understood by the individual or
psychology domains is that the cognitive meaning is frequently presented in a system utilizing or acting upon that
science literature emphasizes the impor- schematic format using the Ogden-Rich- knowledge
tance of placing KA studies within appro- ards semiotic triad, as shown in Figure 2 N Symbolizes causal relation:
priate context in order to account for the [48]. relationship between the unit of knowl-
distributed nature of human cognition A core component of semiotic triad is edge as understood by the individual
[25,4046]. In contrast, the psychology the hypothesis that there exist three or system utilizing or acting upon that
literature is less concerned with placing representational formats for knowledge, knowledge, and the symbolic repre-
KE studies in context. specifically: sentation of the knowledge

Figure 2. Ogden-Richards semiotic triad, illustrating the relationships between the three major semiotic-derived types of
meaning.
doi:10.1371/journal.pcbi.1002826.g002
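
As an informal illustration of the semiotic triad, the three representational formats and the three relationships linking them can be sketched as a small data structure. This is only a sketch under stated assumptions: the class names, field names, and the clinical example values are hypothetical and do not come from the chapter or from any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    """Representational artifact of a unit of knowledge (e.g., text or an icon)."""
    form: str


@dataclass(frozen=True)
class Referent:
    """The actual unit of knowledge, largely a conceptual construct."""
    description: str


@dataclass(frozen=True)
class Thought:
    """The unit of knowledge as understood by an individual or system."""
    holder: str
    understanding: str


@dataclass(frozen=True)
class SemioticTriad:
    symbol: Symbol
    referent: Referent
    thought: Thought

    def relations(self):
        """Map each relationship name to the pair of formats it links."""
        return {
            "stands-for": (self.symbol, self.referent),   # imputed relation
            "refers-to": (self.referent, self.thought),   # causal relation
            "symbolizes": (self.thought, self.symbol),    # causal relation
        }


# Hypothetical example values, chosen only for illustration.
triad = SemioticTriad(
    symbol=Symbol(form="MI"),
    referent=Referent(description="myocardial infarction"),
    thought=Thought(holder="clinician", understanding="heart attack requiring urgent care"),
)

for name, (source, target) in triad.relations().items():
    print(f"{name}: {type(source).__name__} -> {type(target).__name__}")
```

Keeping the three formats as separate types makes explicit that the same symbol can be paired with different understandings held by different individuals or systems.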



The strength of these relationships is usually evaluated using heuristic methods or criteria [48].

3.4 Linguistic Basis for Knowledge Engineering

The preceding theories have focused almost exclusively on knowledge that may be elicited from domain experts. In contrast, domain knowledge can also be extracted through the analysis of existing sources, such as collections of narrative text or databases. Sub-language analysis is a commonly described approach to the elicitation of conceptual knowledge from collections of text (e.g., narrative notes, published literature, etc.). The theoretical basis for sub-language analysis, known as sub-language theory, was first described by Zellig Harris in his work concerning the nature of language usage within highly specialized domains [49]. A key argument of his sub-language theory is that language usage in such highly specialized domains is characterized by regular and reproducible structural features and grammars [49,50]. At an application level, these features and grammars can be discovered through the application of manual or automated pattern recognition processes to large corpora of language for a specific domain. Once such patterns have been discovered, templates may be created that describe instances in which concepts and relationships between those concepts are defined. These templates can then be utilized to extract knowledge from sources of language, such as text [51]. The process of applying sub-language analysis to existing knowledge sources has been empirically validated in numerous areas, including the biomedical domain [50,51]. Within the biomedical domain, sub-language analysis techniques have been extended beyond conventional textual language to also include sub-languages that consist of graphical symbols [52].

4. Knowledge Acquisition Tools and Methods

While a comprehensive review of tools and methods that may be used to facilitate knowledge acquisition (KA) is beyond the scope of this chapter, in the following section, we will briefly summarize example cases of such techniques in order to provide a general overview of this important area of informatics research, development, and applications.

As was introduced in the preceding section, KA can be defined as the sub-process involving the extraction of knowledge from existent sources (e.g., experts, literature, databases, etc.) for the purpose of representing that knowledge in a computable format [23–28]. This definition also includes the verification or validation of knowledge-based systems that use the resultant knowledge collections [27]. Beyond this basic definition of KA and its relationships to KE, there are two critical characteristics of contemporary approaches to KA that should be noted, as follows:

• By convention within the biomedical informatics domain, KA usually refers to the process of eliciting knowledge specifically for use in knowledge-bases (KBs) that are integral to expert systems or intelligent agents (e.g., clinical decision support systems). However, a review of the literature concerned with KA beyond this domain shows a broad variety of application areas for KA, such as the construction of shared database models, ontologies and human-computer interaction models [23,53–57].
• Verification and validation methods are often applied to knowledge-based systems only during the final stage of the KE process. However, such techniques are most effective when employed iteratively throughout the entire KE process. As such, they also become necessary components of the KA sub-process.

Given the particular emphasis of this chapter on the use of conceptual knowledge collections for the purpose of complex integrative analysis tasks, it is important to understand that the KA methods and tools available to support the generation of conceptual knowledge collections can be broadly divided into three complementary classes:

• Knowledge unit elicitation: techniques for the elicitation of atomic units of information or knowledge
• Knowledge relationship elicitation: techniques for the elicitation of relationships between atomic units of information or knowledge
• Combined elicitation: techniques that elicit both atomic units of information or knowledge, and the relationships that exist between them

There are a variety of commonly used methods that target one or more of these KA classes, as summarized below:

4.1 Informal and Structured Interviewing

Interviews conducted either individually or in groups can provide investigators with insights into the knowledge used by domain experts. Furthermore, they can be performed either informally (e.g., conversational exchange between the interviewer and subjects) or formally (e.g., structured using a pre-defined series of questions). The advantages of utilizing such interviewing techniques are that they require a minimal level of resources, can be performed in a relatively short time frame, and can yield a significant amount of qualitative knowledge. More detailed descriptions of interviewing techniques are provided in the methodological reviews by Boy [58], Morgan [59], and Wood [60].

4.2 Observational Studies

Ethnographic evaluations, or observational studies, are usually conducted in context, with minimal researcher involvement in the workflow or situation under consideration. These observational methods generally focus on the evaluation of expert performance, and the implicit knowledge used by those experts. Examples of observational studies have been described in many domains, ranging from air traffic control systems to complex healthcare workflows [61,62]. One of the primary benefits of such observational methods is that they are designed to minimize potential biases (e.g., Hawthorne effect [63]), while simultaneously allowing for the collection of information in context. Additional detail concerning specific observational and ethnographic field study methods can be found in the reviews provided by John [62] and Rahat [64].

4.3 Categorical Sorting

There are a number of categorical, or card sorting, techniques, including Q-sorts, hierarchical sorts, all-in-one sorts and repeated single criterion sorts [65]. All of these techniques involve one or more subjects sorting a group of artifacts (e.g., text, pictures, physical objects, etc.) according to criteria either generated by the sorter or provided by the researcher. The objective of such methods is to determine the reproducibility and stability of the groups created by the sorters. In all of these cases, sorters may also be asked to assign names to the groups they create. Categorical sorting methods are ideally suited for the discovery of relationships between atomic units of information or knowledge. In contrast, such methods are less effective for determining the atomic units of information or knowledge. However, when sorters are asked to provide names for their groups, this data may help to define domain-specific units of knowledge or information. Further details concerning the conduct and analysis of categorical sorting studies can be found in the review provided by Rugg and McGeorge [65].

4.4 Repertory Grid Analysis

Repertory grid analysis is a method based on the previously introduced Personal Construct Theory (PCT). Repertory grid analysis involves the construction of a non-symmetric matrix, where each row represents a construct that corresponds to a distinction of interest, and each column represents an element (e.g., unit of information or knowledge) under consideration. For each element in the grid, the expert completing the grid provides a numeric score using a prescribed scale (defined by a left and right pole) for each distinction, indicating the strength of relatedness between the given element-distinction pair. In many instances, the description of the distinction being used in each row of the matrix is stated differently in the left and right poles, providing a frame of reference for the prescribed scoring scale. Greater detail on the techniques used to conduct repertory grid studies can be found in the review provided by Gaines et al. [26].

4.5 Formal Concept Analysis

Formal concept analysis (FCA) has often been described for the purposes of developing and merging ontologies [66,67]. FCA focuses on the discovery of natural clusters of entities and entity-attribute pairings [66], where attributes are similar to the distinctions used in repertory grids. Much like categorical sorting, FCA is almost exclusively used for eliciting the relationships between units of information or knowledge. The conduct of FCA studies involves two phases: (1) elicitation of formal contexts from subjects, and (2) visualization and exploration of resulting concept lattices. It is of interest to note that the concept lattices used in FCA are in many ways analogous to Sowa's Conceptual Graphs [68], which are comprised of both concepts and labeled relationships. The use of Conceptual Graphs has been described in the context of KR [68–70], as well as a number of biomedical KE instances [48,71–73].

Recent literature has described the use of FCA in multi-dimensional formal contexts (i.e., instances where relational structures between conceptual entities cannot be expressed as a single many-valued formal context). One approach to the utilization of multi-dimensional formal contexts is the agreement context model proposed by Cole and Becker [67], which uses logic-based decomposition to partition and aggregate n-ary relations. This algorithmic approach has been implemented in a freely available application named Tupleware [74]. Additionally, formal contexts may be defined from existing data sources, such as databases. These formal contexts are discovered using data mining techniques that incorporate FCA algorithms, such as the open-source TOSCANA or CHIANTI tools. Such algorithmic FCA methods are representative examples of a sub-domain known as Conceptual Knowledge Discovery and Data Analysis (CKDD) [75]. Additional details concerning FCA techniques can be found in the reviews provided by Cimiano et al. [66], Hereth et al. [75], and Priss [76].

4.6 Protocol and Discourse Analysis

The techniques of protocol and discourse analysis are very closely related. Both techniques are concerned with the elicitation of knowledge from individuals while they are engaged in problem-solving or reasoning tasks. Such analyses may be performed to determine the units of information or knowledge, and the relationships between those units, used by individuals performing tasks in the domain under study. During protocol analysis studies, subjects are requested to think out loud (i.e., vocalize internal reasoning and thought processes) while performing a task. Their vocalizations and actions are recorded for later analysis. The recordings are then codified at varying levels of granularity to allow for thematic or statistical analysis [77,78]. Similarly, discourse analysis is a technique by which an individual's intended meaning within a body of text or some other form of narrative discourse (e.g., transcripts of a think out loud protocol analysis study) is ascertained by atomizing that text or narrative into discrete units of thought. These thought units are then subject to analyses of both the context in which they appear, and the quantification and description of the relationships between those units [79,80]. Specific methodological approaches to the conduct of protocol and discourse analysis studies can be found in the reviews provided by Alvarez [79] and Polson et al. [78].

4.7 Sub-Language Analysis

Sub-language analysis is a technique for discovering units of information or knowledge, and the relationships between them, within existing knowledge sources, including published literature or corpora of narrative text. The process of sub-language analysis is based on the sub-language theory initially proposed by Zellig Harris [49]. The process by which concepts and relationships are discovered using sub-language analysis is a two-stage approach. In the first stage, large corpora of domain-specific text are analyzed either manually or using automated pattern recognition techniques, in an attempt to define a number of critical characteristics, which according to Friedman et al. [50] include:

• Semantic categorization of terms used within the sub-language
• Co-occurrence patterns or constraints, and paraphrastic patterns present within the sub-language
• Context-specific omissions of information within the sub-language
• Intermingling of sub-language and general language patterns
• Usage of terminologies and controlled vocabularies (i.e., limited, reoccurring vocabularies) within the sub-language

Once these characteristics have been defined, templates or sets of rules may be established. In the second stage, the templates or rules resulting from the prior step are applied to narrative text in order to discover units of information or knowledge, and the relationships between those units. This is usually enabled by a natural language processing engine or other similar intelligent agent [81–85].

4.8 Laddering

Laddering techniques involve the creation of tree structures that hierarchically organize domain-specific units of information or knowledge. Laddering is another example of a technique that can be used to determine both units of information or knowledge and the relationships between those units. In conventional laddering techniques, a researcher and subject collaboratively create and refine a tree structure that defines hierarchical relationships and units of information or knowledge [86]. Laddering has also been reported upon in the context of structuring relationships between domain-specific processes (e.g., procedural knowledge). Therefore, laddering may also be suited for discovering strategic knowledge in the form of relationships between conceptual and procedural knowledge. Additional information concerning the conduct of laddering studies can be found in the review provided by Corbridge et al. [86].

4.9 Group Techniques

Several group techniques for multi-subject KA studies have been reported, including brainstorming, nominal group studies, Delphi studies, consensus decision-making and computer-aided group sessions. All of these techniques focus on the elicitation of consensus-based knowledge. It has been argued that consensus-based knowledge is superior to the knowledge elicited from a single expert [27]. However, conducting multi-subject KA studies can be difficult due to the need to recruit appropriate experts who are willing to participate, or issues with scheduling mutually agreeable times and locations for such groups to meet. Furthermore, it is possible in multi-subject KA studies for a forceful or coercive minority of experts or a single expert to exert disproportionate influence on the contents of a knowledge collection [25,27,59,87]. Additional detail concerning group techniques can be found in reviews provided by Gaines [26], Liou [27], Morgan [59], Roth [88], and Wood [60].

5. Integrating Knowledge in the Translational Science Domain

Building upon the core concepts introduced in Sections 1–4, in the remainder of this chapter we will synthesize the requirements, challenges, theories, and frameworks discussed in the preceding sections, in order to propose a set of methodological approaches to the data, information, and knowledge integration requirements incumbent to complex translational science projects. We believe that it is necessary to design and execute informatics efforts in such contexts in a manner that incorporates tasks and activities related to: 1) the identification of major categories of information to be collected, managed and disseminated during the course of a project; 2) the determination of the ultimate data and knowledge dissemination requirements of project-related stakeholders; and 3) the systematic modeling and semantic annotation of the data and knowledge resources that will be used to address items (1) and (2).

Based upon prior surveys of the state of biomedical informatics relative to the clinical and translational science domains [3,89], a framework that is informative to the preceding design and execution pattern can be formulated. Central to this framework are five critical information or knowledge types involved in the conduct of translational science projects, as are briefly summarized in Table 1.

The preceding framework of information and knowledge types ultimately informs a conceptual model for knowledge integration in the translational sciences.
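
To make the formal concept analysis technique of Section 4.5 more concrete, the following minimal sketch derives the formal concepts (pairs of an entity cluster and the attributes all of its members share) from a tiny formal context. The context, the entity and attribute names, and the function names are hypothetical and chosen only for illustration. The brute-force enumeration over attribute subsets is a simplification suitable for toy examples; it is not the algorithm used by TOSCANA, CHIANTI, or Tupleware.

```python
from itertools import combinations

# Hypothetical formal context: each entity mapped to the attributes it has.
context = {
    "fever": {"finding"},
    "cough": {"finding", "respiratory"},
    "pneumonia": {"diagnosis", "respiratory"},
    "asthma": {"diagnosis", "respiratory"},
}


def objects_having(attributes, context):
    """Entities that possess every attribute in `attributes`."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}


def common_attributes(objects, context):
    """Attributes shared by every entity in `objects` (all attributes if empty)."""
    shared = set().union(*context.values())
    for obj in objects:
        shared &= context[obj]
    return shared


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every attribute subset."""
    all_attrs = sorted(set().union(*context.values()))
    concepts = set()
    for size in range(len(all_attrs) + 1):
        for subset in combinations(all_attrs, size):
            extent = objects_having(set(subset), context)
            intent = common_attributes(extent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


concepts = formal_concepts(context)
for extent, intent in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), "<->", sorted(intent))
```

Ordering the resulting concepts by inclusion of their extents yields the concept lattice that an FCA study would then visualize and explore.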

Table 1. Overview of information and knowledge types incumbent to the translational sciences.

Information or Knowledge Type: Individual and/or Population Phenotype
Description: This information type involves data elements and metadata that describe characteristics at the individual or population levels that relate to the physiologic and behavioral manifestation of healthy and disease states.
Example Sources or Types:
• Demographics
• Clinical exam findings
• Qualitative characteristics
• Laboratory testing results

Information or Knowledge Type: Individual and/or Population Bio-markers
Description: This information type involves data elements and metadata that describe characteristics at the individual or population levels that relate to the bio-molecular manifestation of healthy and disease states.
Example Sources or Types:
• Genomic, proteomic and metabolomic expression profiles
• Novel bio-molecular assays capable of measuring bio-molecular structure and function

Information or Knowledge Type: Domain Knowledge
Description: This knowledge type is comprised of community-accepted, or otherwise verified and validated [17] sources of biomedical knowledge relevant to a domain of interest. Collectively, these types of domain knowledge may be used to support multiple operations, including: 1) hypothesis development; 2) hypothesis testing; 3) comparative analyses; or 4) augmentation of experimental data sets with statistical or semantic annotations [15,17,125].
Example Sources or Types:
• Literature databases
• Public or private databases containing experimental results or reference standards
• Ontologies
• Terminologies

Information or Knowledge Type: Biological Models and Technologies
Description: This knowledge type typically consists of: 1) empirically validated system or sub-system level models that serve to define the mechanisms by which bio-molecular and phenotypic processes and their markers/indicators interact as a network [6,20,124,126]; and 2) novel technologies that enable the analysis of integrative data sets in light of such models. By their nature these tools include algorithmic or embedded knowledge sources [124,126].
Example Sources or Types:
• Algorithms
• Quantitative Models
• Analytical Pipelines
• Publications

Information or Knowledge Type: Translational Biomedical Knowledge
Description: Translational biomedical knowledge represents a sub-type of general biomedical knowledge that is concerned with a systems-level synthesis (i.e., incorporate quantitative, qualitative, and semantic annotations) of pathophysiologic or biophysical processes or functions of interest (e.g., pharmacokinetics, pharmacodynamics, bionutrition, etc.), and the markers or other indicators that can be used to instrument and evaluate such models.
Example Sources or Types:
• Publications
• Guidelines
• Integrative Data Sets
• Conceptual Knowledge Collections

doi:10.1371/journal.pcbi.1002826.t001



Figure 3. Practical model for the design and execution of translational informatics projects, illustrating major phases and
exemplary input or output resources and data sets.
doi:10.1371/journal.pcbi.1002826.g003

The role of Biomedical Informatics and KE in this framework is to address the four major information management challenges enumerated earlier relative to the ability to generate Translational Biomedical Knowledge, namely: 1) the collection and management of high throughput, multi-dimensional data; 2) the generation and testing of hypotheses relative to such integrative data sets; 3) the provision of data analytic pipelines; and 4) the dissemination of knowledge collections resulting from research activities.

5.1 Design Pattern for Translational Science Knowledge Integration

Informed by the conceptual framework introduced in the preceding section and illustrated in Figure 3, we will now summarize the design and execution pattern used to address such knowledge integration requirements. This design pattern can be broadly divided into four major phases that collectively define a cyclical and iterative process (which we will refer to as a "translational research cycle"). For each phase of the pattern, practitioners must consider both the required inputs and anticipated outputs, and their interrelationships between and across phases.



Phase 1 - Stakeholder engagement and knowledge acquisition: During this initial phase, key stakeholders who will be involved in the collection, management, analysis, and dissemination of project-specific data and knowledge are identified and engaged in both formal and informal knowledge acquisition, with the ultimate goal of defining the essential workflows, processes, and data sources (including their semantics). Such knowledge acquisition usually requires the use of ethnographic, cognitive science, workflow modeling, and formal knowledge acquisition techniques [14]. The results of such activities can be formalized using thematic narratives [90–92] and workflow or process artifacts [92–94]. In some instances, it may be necessary to engage domain-specific subject matter experts (SMEs) who are not involved in a given project in order to augment available SMEs, or to validate the findings generated during such activities [14,92].

Phase 2 - Data identification and modeling: Informed by the artifacts generated in Phase 1, in this phase, we focus upon the identification of specific, pertinent data sources relative to project aims, and the subsequent creation of models that encapsulate the physical and semantic representations of that data. Once pertinent data sources have been identified, we must then model their contents in an implementation-agnostic manner, an approach that is most frequently implemented using model-driven architecture techniques [95–99]. The results of such MDA processes are commonly recorded using the Unified Modeling Language (UML) [16,100–102]. During the modeling process, it is also necessary to identify and record semantic or domain-specific annotation of targeted data structures, using locally relevant conceptual knowledge collections (such as terminologies and ontologies), in order to enable deeper, semantic reasoning concerning such data and information [16,103,104].

sometimes referred to as creating a data mashup [109–115]. Data mashups are often created using a variety of readily available reasoners, such as those associated with the semantic web [109–115], which directly employ both the data models and semantic annotations created in the prior phases of the Translational Informatics Cycle, and enable a knowledge-anchored approach to such operations.

Phase 4 - Analysis and dissemination: In this phase of the Translational Informatics Cycle, the integrated/aggregated data and knowledge created in the preceding phases is subject to analysis. In most if not all cases, these analyses make use of domain- or task-specific applications and algorithms, such as those implemented in a broad number of biological data analysis packages, statistical analysis applications, data mining tools, and intelligent agents. These types of analytical tools are used to address questions pertaining to one or more of the following four basic query or data interrogation patterns: 1) to generate hypotheses concerning relationships or patterns that serve to link variables of interest in a data set [116]; 2) to evaluate the validity of hypotheses and the strength of their related data motifs, often using empirically-validated statistical tests [117,118]; 3) to visualize complex data sets in order to facilitate human-based pattern recognition [119–121]; and 4) to infer and/or verify and validate quantitative models that formalize phenomena of interest identified via the preceding query patterns [122,123].

6. Open Research Questions and Future Direction

As can be ascertained from the preceding review of the theoretical and practical bases for the integration of data and knowledge in the translational science domain, such techniques and frameworks have significant potential to positively

integration of data and knowledge corresponding to a single type, per the definitions set forth in Table 1). However, many translational science problem spaces require reasoning across knowledge-types and data granularities (e.g., multidimensional data and knowledge collections). The ability to integrate and reason upon data in a knowledge-anchored manner that addresses such multi-dimensional context remains an open area of research. Many efforts to address this gap in knowledge and practice rely upon the creation of semantically typed vertical linkages spanning multiple integrative knowledge networks, as is illustrated in Figure 4.

Scalability: Similar to the challenge of dimensionality and granularity, the issue of scalability of knowledge integration methods also remains an open area of research and development. Specifically, a large number of available knowledge integration techniques rely upon semi-automated or human-mediated methods or activities, which significantly curtail the scalability of such approaches to large-scale problems. Much of the research targeting this gap in knowledge and practice has focused on the use of artificial intelligence and semantic-reasoning technologies to enable the extraction, disambiguation, and application of conceptual knowledge collections.

Reasoning and visualization: Once knowledge and data have been aggregated and made available for hypothesis discovery and testing, the ability to reason upon and visualize such mashups is highly desirable. Current efforts to provide reusable methods of doing so, such as the tools and technologies provided by the semantic web community, as well as visualization techniques being explored by the computer science and human-computer interaction communities, hold significant promise in addressing such needs, but are still largely developmental.

Applications of knowledge-based
Phase 3 - Integration and aggre- impact the speed, efficacy, and impact of systems for in-silico science para-
gation: A common approach to the such research programs, and to enable digms: As has been discussed throughout
integration of heterogeneous and multi- novel scientific paradigms that would not this collection, a fundamental challenge in
dimensional data is the use of technology- otherwise be tractable. However, there are Translational Bioinformatics is the ability
agnostic domain or data models (per a number of open and ongoing research to both ask and answer the full spectrum of
Phases 12), incorporating semantic anno- and development questions being ad- questions possible given a large-scale and
tations, in order to execute data federation dressed by the biomedical informatics multi-dimensional data set. This challenge
operations [105] or to transform that data community relative to such approaches is particularly pressing at the confluence of
and load it into an integrative repository, that should be noted: high-throughput bio-molecular measure-
such as a data warehouse [106108]. Dimensionality and granularity: ment methods and the translation of the
Once the mechanisms needed to integrate the majority of knowledge integration findings generated by such approaches to
such disparate data sources are imple- techniques being designed, evaluated, clinical research or practice. Broadly
mented, it is then possible to aggregate the and applied relative to the context of the speaking, overcoming this challenge re-
data for the purposes of hypothesis translational science domain target low- quires a paradigm that can be described as
discovery and testing a process that is order levels of dimensionality (e.g., the in-silico science, in which informatics

PLOS Computational Biology | www.ploscompbiol.org 10 December 2012 | Volume 8 | Issue 12 | e1002826


Figure 4. Conceptual model for the generation of multi-network complexes of markers spanning a spectrum of granularity from
bio-molecules to clinical phenotypes.
doi:10.1371/journal.pcbi.1002826.g004

methods are applied to generate and test integrative hypotheses in a high-throughput manner. Such techniques require the development and use of novel KA and KR methods and structures, as well as the design and verification/validation of knowledge-based systems targeting the aforementioned intersection point. There are several exemplary instances of investigational tools and projects targeting this space, including RiboWeb, BioCyc, and a number of initiatives concerned with the modeling and analysis of complex biological systems [6,113,114,120,124]. In addition, there are a number of large-scale conceptual knowledge collections focusing on this particular area that can be explored as part of the repositories maintained and curated by the National Center for Biomedical Ontology (NCBO). However, broadly accepted methodological approaches and knowledge collections related to this area generally remain developmental.

7. Summary

As was stated at the outset of this chapter, our goals were to review the basic theoretical frameworks that define core knowledge types and reasoning operations, with particular emphasis on the applicability of such conceptual models within the biomedical domain, and to introduce a number of prototypical data integration requirements and patterns relevant to the conduct of translational bioinformatics that can be addressed via the design and use of knowledge-based systems. In doing so, we have provided:

• Definitions of the basic knowledge types and structures that can be applied to biomedical and translational research;
• An overview of the knowledge engineering cycle, and the products generated during that cycle;
• Summaries of basic methods, techniques, and design patterns that can be used to employ knowledge products in order to integrate and reason upon heterogeneous and multi-dimensional data sets; and
• An introduction to the open research questions and areas related to the ability to apply knowledge collections and knowledge-anchored reasoning processes across multiple networks or knowledge collections.

Given that translational bioinformatics is defined by the presence of complex, heterogeneous, multi-dimensional data sets, and in light of the growing volume of biomedical knowledge collections, the ability to apply such knowledge collections to biomedical data sets requires an understanding of the sources of such knowledge, and methods of applying them to reasoning applications. Ultimately,



these approaches introduce significant opportunities to advance the state of translational science, while simultaneously adding areas of complexity to the design of translational bioinformatics studies, including the methods needed to reason in an integrative manner across multiple networks or knowledge constructs. As such, these theories, methods, and frameworks offer significant benefits as well as a number of exciting and ongoing areas of biomedical informatics research and development.

8. Exercises

Instructions: Read the following motivating use case and then perform the tasks described in each question. The objective of this exercise is to demonstrate how available and open-access knowledge discovery and reasoning tools can be used to satisfy the biomedical knowledge integration needs commonly encountered in the clinical and translational research environment.

Use Case: The ability to identify potentially actionable phenotype-to-biomarker relationships is of critical importance in the translational science domain. In the specific context of integrative cancer research, it is regularly the case that structural and functional relationships between genes, gene products, and clinical phenotypes are used to design and evaluate diagnostic and therapeutic approaches to targeted disease states. Large volumes of domain specific conceptual knowledge related to such hypothesis generation processes can be found in publicly available literature corpora and ontologies.

1) Task One: Select a specific cancer and perform a search for a collection of recent literature available with full free text via PubMed Central (the resulting corpus should contain 5 manuscripts published within the last three years, selected based upon their publication dates beginning with the most recent articles/manuscripts). Download the free text for each such article.

2) Task Two: For each full text article in the corpus established during Task One, semantically annotate genes, gene products, and clinical phenotype characteristics as found in the Abstract, Introduction, and Conclusion (or equivalent) sections using applicable Gene Ontology (GO) concepts, using the NCBO annotator (found at: http://bioportal.bioontology.org/annotator).

3) Task Three: Identify the top 10 most frequently occurring Gene Ontology (GO) concepts found in your annotations, per Task Two. For each such concept, perform a search of PubMed Central for articles in which both the appropriate terms describing the cancer selected in Task One and these concepts co-occur. For the top 5 (most recent) articles retrieved via each search, retrieve the associated abstract for subsequent analysis.

4) Task Four: Using the NCBO Ontology Recommender (http://bioportal.bioontology.org/recommender), analyze each of the abstracts retrieved in Task Three to determine the optimal ontology for annotating those abstracts, noting the top recommended ontology for each such textual resource.

5) Task Five: For each abstract identified in Task Three, again using the NCBO annotator (found at: http://bioportal.bioontology.org/annotator), annotate those abstracts using the recommended ontologies identified in Task Four (selecting only those ontologies that are also reflected in the NLM's UMLS). Then, for the top 2–3 phenotypic (e.g., clinically relevant) concepts identified via that annotation process, use the UMLS UTS (https://uts.nlm.nih.gov/) in order to identify potential phenotype-genotype pathways linking such phenotypic concepts and the genes or gene products identified in Task Two. Please note that performing this task will require exploring multiple relationship types reflected in the UMLS metathesaurus (documentation concerning how to perform such a search can be found here: http://www.ncbi.nlm.nih.gov/books/NBK9684/).

Answers to the Exercises can be found in Text S1.

Supporting Information

Text S1. Answers to Exercises. (DOCX)
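As an illustration of the ranking step in Task Three of the exercises, the following is a minimal sketch in Python. It assumes the annotations from Task Two have already been collected into one list of GO concept identifiers per article; that input layout, the function name `top_concepts`, and the example identifiers are assumptions made for illustration and are not part of the NCBO annotator's output format.

```python
from collections import Counter
from typing import Iterable


def top_concepts(
    annotations_per_article: Iterable[Iterable[str]], k: int = 10
) -> list[tuple[str, int]]:
    """Return the k most frequently occurring concept IDs across all articles."""
    counts: Counter = Counter()
    for concept_ids in annotations_per_article:
        # Every occurrence counts, including repeats within one article.
        counts.update(concept_ids)
    # Sort by descending count, then by ID so ties resolve deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical annotations: one list of GO identifiers per article.
example = [
    ["GO:0006915", "GO:0008283", "GO:0006915"],
    ["GO:0008283", "GO:0006281"],
    ["GO:0006915"],
]

ranked = top_concepts(example, k=2)
for concept_id, count in ranked:
    print(concept_id, count)
```

Counting every occurrence, rather than counting each concept once per article, is one reading of "most frequently occurring"; wrapping each inner list in `set(...)` before calling the function would give per-article document frequency instead.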

Further Reading

• Brachman RJ, McGuinness DL (1988) Knowledge representation, connectionism and conceptual retrieval. Proceedings of the 11th annual international ACM SIGIR conference on research and development in information retrieval. Grenoble, France: ACM Press.
• Campbell KE, Oliver DE, Spackman KA, Shortliffe EH (1998) Representing thoughts, words, and things in the UMLS. J Am Med Inform Assoc 5: 421–431.
• Compton P, Jansen R (1990) A philosophical basis for knowledge acquisition. Knowledge Acquisition 2: 241–257.
• Gaines BR (1989) Social and cognitive processes in knowledge acquisition. Knowledge Acquisition 1: 39–58.
• Kelly GA (1955) The psychology of personal constructs. New York: Norton. 2 v. (1218).
• Liou YI (1990) Knowledge acquisition: issues, techniques, and methodology. Orlando, Florida, United States: ACM Press. pp. 212–236.
• McCormick R (1997) Conceptual and procedural knowledge. International Journal of Technology and Design Education 7: 141–159.
• Newell A, Simon HA (1981) Computer science as empirical inquiry: symbols and search. In: Haugeland J, editor. Mind design. Cambridge: MIT Press/Bradford Books. pp. 35–66.
• Patel VL, Arocha JF, Kaufman DR (2001) A primer on aspects of cognition for medical informatics. J Am Med Inform Assoc 8: 324–343.
• Preece A (2001) Evaluating verification and validation methods in knowledge engineering. Micro-Level Knowledge Management: 123–145.
• Zhang J (2002) Representations of health concepts: a cognitive perspective. J Biomed Inform 35: 17–24.



Glossary

• Data: factual information (as measurements or statistics) used as a basis for reasoning, discussion, or calculation [127]
• Information: knowledge obtained from investigation, study, or instruction [127]
• Knowledge: the circumstance or condition of apprehending truth or fact through reasoning [127]
• Knowledge engineering: a branch of artificial intelligence that emphasizes the development and use of expert systems [128]
• Knowledge acquisition: the act of acquiring knowledge
• Knowledge representation: the symbolic formalization of knowledge
• Conceptual knowledge: knowledge that consists of atomic units of information and meaningful relationships that serve to interrelate those units.
• Strategic knowledge: knowledge used to infer procedural knowledge from conceptual knowledge.
• Procedural knowledge: knowledge that is concerned with a problem-oriented understanding of how to address a given task or activity.
• Terminology: the technical or special terms used in a business, art, science, or special subject [128]
• Ontology: a rigorous and exhaustive organization of some knowledge domain that is usually hierarchical and contains all the relevant entities and their relations [128]
• Multi-dimensional data: data spanning multiple levels or contexts of granularity or scope while maintaining one or more common linkages that span such levels
• Motif: a reproducible pattern
• Mashup: a combination of multiple, heterogeneous data or knowledge sources in order to create an aggregate collection of such elements or concepts.
• Intelligent agent: a software agent that employs a formal knowledge-base in order to replicate expert performance relative to problem solving in a targeted domain.
• Clinical phenotype: the observable physical and biochemical characteristics of an individual that serve to define clinical status (e.g., health, disease)
• Biomarker: a bio-molecular trait that can be measured to assess risk, diagnosis, status, or progression of a pathophysiologic or disease state.

References
1. Coopers PW (2008) Research rewired. 48 p.
2. Casey K, Elwell K, Friedman J, Gibbons D, Goggin M, et al. (2008) A broken pipeline? Flat funding of the NIH puts a generation of science at risk. 24 p.
3. Payne PR, Johnson SB, Starren JB, Tilson HH, Dowdy D (2005) Breaking the translational barriers: the value of integrating biomedical informatics and translational research. J Investig Med 53: 192–200.
4. Research NDsPoC (1997) NIH director's panel on clinical research report. Bethesda, MD: National Institutes of Health.
5. Sung NS, Crowley WF Jr, Genel M, Salber P, Sandy L, et al. (2003) Central challenges facing the national clinical research enterprise. JAMA 289: 1278–1287.
6. Butte AJ (2008) Medicine. The ultimate model organism. Science 320: 325–327.
7. Chung TK, Kukafka R, Johnson SB (2006) Reengineering clinical research with informatics. J Investig Med 54: 327–333.
8. Kaiser J (2008) U.S. budget 2009. NIH hopes for more mileage from roadmap. Science 319: 716.
9. Kush RD, Helton E, Rockhold FW, Hardison CD (2008) Electronic health records, medical research, and the Tower of Babel. N Engl J Med 358: 1738–1740.
10. Ruttenberg A, Clark T, Bug W, Samwald M, Bodenreider O, et al. (2007) Advancing translational research with the Semantic Web. BMC Bioinformatics 8 Suppl 3: S2.
11. Fridsma DB, Evans J, Hastak S, Mead CN (2008) The BRIDG project: a technical report. J Am Med Inform Assoc 15: 130–137.
12. Maojo V, García-Remesal M, Billhardt H, Alonso-Calvo R, Perez-Rey D, et al. (2006) Designing new methodologies for integrating biomedical information in clinical trials. Methods Inf Med 45: 180–185.
13. Ash JS, Anderson NR, Tarczy-Hornoch P (2008) People and organizational issues in research systems implementation. J Am Med Inform Assoc 15: 283–289.
14. Payne PR, Mendonca EA, Johnson SB, Starren JB (2007) Conceptual knowledge acquisition in biomedicine: a methodological review. J Biomed Inform 40: 582–602.
15. Richesson RL, Krischer J (2007) Data standards in clinical research: gaps, overlaps, challenges and future directions. J Am Med Inform Assoc 14: 687–696.
16. Erickson J (2008) A decade and more of UML: an overview of UML semantic and structural issues and UML field use. Journal of Database Management 19: I-Vii.
17. van Bemmel JH, van Mulligen EM, Mons B, van Wijk M, Kors JA, et al. (2006) Databases for knowledge discovery. Examples from biomedicine and health care. Int J Med Inform 75: 257–267.
18. Oster S, Langella S, Hastings S, Ervin D, Madduri R, et al. (2008) caGrid 1.0: an enterprise grid infrastructure for biomedical research. J Am Med Inform Assoc 15: 138–149.
19. Kukafka R, Johnson SB, Linfante A, Allegrante JP (2003) Grounding a new information technology implementation framework in behavioral science: a systematic analysis of the literature on IT use. J Biomed Inform 36: 218–227.
20. Zerhouni EA (2005) Translational and clinical science - time for a new vision. N Engl J Med 353: 1621–1623.
21. Sim I (2008) Trial registration for public trust: making the case for medical devices. J Gen Intern Med 23 Suppl 1: 64–68.
22. Preece A (2001) Evaluating verification and validation methods in knowledge engineering. Micro-Level Knowledge Management: 123–145.
23. Brachman RJ, McGuinness DL (1988) Knowledge representation, connectionism and conceptual retrieval. Proceedings of the 11th annual international ACM SIGIR conference on research and development in information retrieval. Grenoble, France: ACM Press.
24. Compton P, Jansen R (1990) A philosophical basis for knowledge acquisition. Knowledge Acquisition 2: 241–257.
25. Gaines BR (1989) Social and cognitive processes in knowledge acquisition. Knowledge Acquisition 1: 39–58.
26. Gaines BR, Shaw MLG (1993) Knowledge acquisition tools based on personal construct psychology.
27. Liou YI (1990) Knowledge acquisition: issues, techniques, and methodology. Orlando, Florida, United States: ACM Press. pp. 212–236.
28. Yihwa Irene L (1990) Knowledge acquisition: issues, techniques, and methodology. Proceedings of the 1990 ACM SIGBDP conference on trends and directions in expert systems. Orlando, Florida, United States: ACM Press.
29. Glaser R (1984) Education and thinking: the role of knowledge. American Psychologist 39: 93–104.
30. Hiebert J (1986) Procedural and conceptual knowledge: the case of mathematics. London: Lawrence Erlbaum Associates.



31. McCormick R (1997) Conceptual and procedural knowledge. International Journal of Technology and Design Education 7: 141–159.
32. Scribner S (1985) Knowledge at work. Anthropology & Education Quarterly 16: 199–206.
33. Barsalou LW, Simmons WK, Barbey AK, Wilson CD (2003) Grounding conceptual knowledge in modality-specific systems. Trends Cogn Sci 7: 84–91.
34. Borlawsky T, Li J, Jalan S, Stern E, Williams R, et al. (2005) Partitioning knowledge bases between advanced notification and clinical decision support systems. AMIA Annu Symp Proc: 901.
35. Newell A, Simon HA (1981) Computer science as empirical inquiry: symbols and search. In: Haugeland J, editor. Mind design. Cambridge: MIT Press/Bradford Books. pp. 35–66.
36. Newell A, Simon HA (1975) Computer science as empirical inquiry: symbols and search. Minneapolis.
37. Kelly GA (1955) The psychology of personal constructs. New York: Norton. 2 v. (1218).
38. Hawkins D (1983) An analysis of expert thinking. Int J Man-Mach Stud 18: 1–47.
39. Zhang J (2002) Representations of health concepts: a cognitive perspective. J Biomed Inform 35: 17–24.
40. Horsky J, Kaufman DR, Oppenheim MI, Patel VL (2003) A framework for analyzing the cognitive complexity of computer-assisted clinical ordering. J Biomed Inform 36: 4–22.
41. Horsky J, Kaufman DR, Patel VL (2003) The cognitive complexity of a provider order entry interface. AMIA Annu Symp Proc: 294–298.
42. Horsky J, Kaufman DR, Patel VL (2004) Computer-based drug ordering: evaluation of interaction with a decision-support system. Medinfo 11: 1063–1067.
43. Horsky J, Kuperman GJ, Patel VL (2005) Comprehensive analysis of a medication dosing error related to CPOE. J Am Med Inform Assoc 12: 377–382.
44. Horsky J, Zhang J, Patel VL (2005) To err is not entirely human: complex technology and user cognition. J Biomed Inform 38: 264–266.
45. Patel VL, Arocha JF, Diermeier M, Greenes RA, Shortliffe EH (2001) Methods of cognitive analysis to support the design and evaluation of biomedical systems: the case of clinical practice guidelines. J Biomed Inform 34: 52–66.
46. Patel VL, Arocha JF, Kaufman DR (2001) A primer on aspects of cognition for medical informatics. J Am Med Inform Assoc 8: 324–343.
47. Wikipedia (2006) Semiotics. Wikipedia.
48. Campbell KE, Oliver DE, Spackman KA, Shortliffe EH (1998) Representing thoughts, words, and things in the UMLS. J Am Med Inform Assoc 5: 421–431.
49. Harris Z (1976) On a theory of language. The Journal of Philosophy 73: 253–276.
50. Friedman C, Kra P, Rzhetsky A (2002) Two biomedical sublanguages: a description based on the theories of Zellig Harris. J Biomed Inform 35: 222–235.
51. Grishman R, Kittredge R (1986) Analyzing language in restricted domains: sublanguage description and processing. Hillsdale, NJ: Lawrence Erlbaum.
52. Starren J (1997) From multimodal sublanguages to medical data presentations. New York: Columbia University.
53. Alan LR, Chris W, Jeremy R, Angus R (2001) Untangling taxonomies and relationships: personal and practical problems in loosely coupled development of large ontologies. Proceedings of the international conference on knowledge capture. Victoria, British Columbia, Canada: ACM Press.
54. Canas AJ, Leake DB, Wilson DC (1999) Managing, mapping, and manipulating conceptual knowledge. Menlo Park: AAAI Press. pp. 10–14.
55. Ian N, Adam P (2001) Towards a standard upper ontology. Proceedings of the international conference on formal ontology in information systems - volume 2001. Ogunquit, Maine, United States: ACM Press.
56. Joachim H, Gerd S, Rudolf W, Uta W (2000) Conceptual knowledge discovery and data analysis. Springer-Verlag. pp. 421–437.
57. Tayar N (1993) A model for developing large shared knowledge bases. Washington, D.C.: ACM Press. pp. 717–719.
58. Boy GA (1997) The group elicitation method for participatory design and usability testing. interactions 4: 27–33.
59. Morgan MS, Martz Jr WB (2004) Group consensus: do we know it when we see it? Proceedings of the 37th annual Hawaii international conference on system sciences (HICSS'04) - track 1 - volume 1: IEEE Computer Society.
60. Wood WC, Roth RM (1990) A workshop approach to acquiring knowledge from single and multiple experts. Orlando, Florida, United States: ACM Press. pp. 275–300.
61. Adria HL, William AS, Stephen DK (2003) GMS: preserving multiple expert voices in scientific knowledge management. San Francisco, California: ACM Press. pp. 1–4.
62. John H, Val K, Tom R, Hans A (1995) The role of ethnography in interactive systems design. interactions 2: 56–65.
63. Wickstrom G, Bendix T (2000) The "Hawthorne effect" - what did the original Hawthorne studies actually show? Scand J Work Environ Health 26: 363–367.
64. Rahat I, Richard G, Anne J (2005) A general approach to ethnographic analysis for systems design. Coventry, United Kingdom: ACM Press. pp. 34–40.
65. Rugg G, McGeorge P (1997) The sorting techniques: a tutorial paper on card sorts, picture sorts and item sorts. Expert Systems 14: 80–93.
66. Cimiano P, Hotho A, Stumme G, Tane J (2004) Conceptual knowledge processing with formal concept analysis and ontologies. 189 p.
67. Cole R, Becker P (2004) Agreement contexts in formal concept analysis. 172 p.
68. Sowa JF (1980) A conceptual schema for knowledge-based systems. Pingree Park, Colorado: ACM Press. pp. 193–195.
69. Salomons OW, van Houten FJAM, Kals HJJ (1995) Conceptual graphs in constraint based re-design. Proceedings of the third ACM symposium on solid modeling and applications. Salt Lake City, Utah, United States: ACM Press.
70. Yang G, Oh J (1993) Knowledge acquisition and retrieval based on conceptual graphs. Proceedings of the 1993 ACM/SIGAPP symposium on applied computing: states of the art and practice. Indianapolis, Indiana, United States: ACM Press.
71. Campbell JR, Carpenter P, Sneiderman C, Cohn S, Chute CG, et al. (1997) Phase II evaluation of clinical coding schemes: completeness, taxonomy, mapping, definitions, and clarity. CPRI Work Group on Codes and Structures. J Am Med Inform Assoc 4: 238–251.
72. Campbell KE, Oliver DE, Shortliffe EH (1998) The Unified Medical Language System: toward a collaborative approach for solving terminologic problems. J Am Med Inform Assoc 5: 12–16.
73. Cimino JJ (2000) From data to knowledge through concept-oriented terminologies: experience with the Medical Entities Dictionary. J Am Med Inform Assoc 7: 288–297.
74. TOCKIT (2006) Tupleware. 0.1 ed: Technische Universitaet Darmstadt and University of Queensland.
75. Hereth J, Stumme G, Wille R, Wille U (2000) Conceptual knowledge discovery and data analysis. Springer-Verlag. pp. 421–437.
76. Priss U (2006) Formal concept analysis in information science. In: Blaise C, editor. Annual review of information science and technology. Medford, NJ: Information Today, Inc.
77. Polson PG (1987) A quantitative theory of human-computer interaction. Interfacing thought: cognitive aspects of human-computer interaction. MIT Press. pp. 184–235.
78. Polson PG, Lewis C, Rieman J, Wharton C (1992) Cognitive walkthroughs: a method for theory-based evaluation of user interfaces. Int J Man-Mach Stud 36: 741–773.
79. Alvarez R (2002) Discourse analysis of requirements and knowledge elicitation interviews. IEEE Computer Society. 255 p.
80. Davidson JE (1977) Topics in discourse analysis. University of British Columbia.
81. Friedman C, Alderson PO, Austin JH, Cimino JJ, Johnson SB (1994) A general natural-language text processor for clinical radiology. J Am Med Inform Assoc 1: 161–174.
82. Friedman C, Hripcsak G (1998) Evaluating natural language processors in the clinical domain. Methods Inf Med 37: 334–344.
83. Friedman C, Hripcsak G (1999) Natural language processing and its future in medicine. Acad Med 74: 890–895.
84. Friedman C, Hripcsak G, Shablinsky I (1998) An evaluation of natural language processing methodologies. Proc AMIA Symp: 855–859.
85. Hripcsak G, Friedman C, Alderson PO, DuMouchel W, Johnson SB, et al. (1995) Unlocking clinical data from narrative reports: a study of natural language processing. Ann Intern Med 122: 681–688.
86. Corbridge C, Rugg G, Major NO, Shadbolt NR, Burton AM (1994) Laddering: technique and tool use in knowledge acquisition. Knowledge Acquisition 6: 315–341.
87. Agostini A, Albolino S, Boselli R, De Michelis G, De Paoli F, et al. (2003) Stimulating knowledge discovery and sharing; 2003; Sanibel Island, Florida, United States: ACM Press. pp. 248–257.
88. Roth RM, Wood WC (1990) A Delphi approach to acquiring knowledge from single and multiple experts; 1990; Orlando, Florida, United States: ACM Press. pp. 301–324.
89. Embi PJ, Payne PR (2009) Clinical research informatics: challenges, opportunities and definition for an emerging domain. J Am Med Inform Assoc 16: 316–327.
90. Crabtree BF, Miller WL (1992) Doing qualitative research. Newbury Park, CA: Sage.
91. Glaser B, Strauss A (1967) The discovery of grounded theory: strategies for qualitative research. Piscataway, NJ: Aldine Transaction. 271 p.
92. Patton MQ (2001) Qualitative research & evaluation methods. New York: Sage Publications. 688 p.
93. Khan SA, Kukafka R, Payne PR, Bigger JT, Johnson SB (2007) A day in the life of a clinical research coordinator: observations from community practice settings. Medinfo 12: 247–251.
94. Khan SA, Payne PR, Johnson SB, Bigger JT, Kukafka R (2006) Modeling clinical trials workflow in community practice settings. AMIA Annual Symposium proceedings/AMIA Symposium: 419–423.
95. Raghupathi W, Umar A (2008) Exploring a model-driven architecture (MDA) approach to health care information systems development. Int J Med Inform 77: 305–314.
96. Aksit M, Kurtev I (2008) Elsevier special issue on foundations and applications of model driven architecture. Science of Computer Programming 73: 1–2.
97. Shurville S (2007) Model driven architecture and ontology development. Interactive Learning Environments 15: 96–99.
98. Uhl A (2003) Model driven architecture is ready for prime time. IEEE Software 20: 70-+.
99. Soley RM (2003) Model driven architecture: the evolution of object-oriented systems? Object-Oriented Information Systems 2817: 2-2.



100. Vanderperren Y, Mueller W, Dehaene W (2008) UML for electronic systems design: a comprehensive overview. Design Automation for Embedded Systems 12: 261–292.
101. Dobing B, Parsons J (2008) Dimensions of UML diagram use: a survey of practitioners. Journal of Database Management 19: 1–18.
102. Batra D (2008) Unified modeling language (UML) topics: the past, the problems, and the prospects. Journal of Database Management 19: I-Vii.
103. Komatsoulis GA, Warzel DB, Hartel FW, Shanbhag K, Chilukuri R, et al. (2008) caCORE version 3: Implementation of a model driven, service-oriented architecture for semantic interoperability. J Biomed Inform 41: 106–123.
104. Kunz I, Lin MC, Frey L (2009) Metadata mapping and reuse in caBIG. BMC Bioinformatics 10 Suppl 2: S4.
105. Chakravarthy S, Whang WK, Navathe SB (1994) A logic-based approach to query-processing in federated databases. Information Sciences 79: 1–28.
106. Ariyachandra T, Watson HJ (2008) Which data warehouse architecture is best? Communications of the ACM 51: 146–147.
107. DeWitt JG, Hampton PM (2005) Development of a data warehouse at an academic health system: knowing a place for the first time. Acad Med 80: 1019–1025.
108. Braa J (2005) A data warehouse approach can manage multiple data sets. Bull World Health Organ 83: 638–639.
109. Yu J, Benatallah B, Casati F, Daniel F (2008) Understanding mashup development. IEEE Internet Computing 12: 44–52.
110. Scotch M, Yip KY, Cheung KH (2008) Development of grid-like applications for public health using web 2.0 mashup techniques. J Am Med Inform Assoc 15: 783–786.
111. Sahoo SS, Bodenreider O, Rutter JL, Skinner KJ, Sheth AP (2008) An ontology-driven semantic mashup of gene and biological pathway information: application to the domain of nicotine dependence. J Biomed Inform 41: 752–765.
112. Cheung KH, Yip KY, Townsend JP, Scotch M (2008) HCLS 2.0/3.0: Health care and life sciences data mashup using Web 2.0/3.0. J Biomed Inform 41: 694–705.
113. Cheung KH, Kashyap V, Luciano JS, Chen HJ, Wang YM, et al. (2008) Semantic mashup of biomedical data. J Biomed Inform 41: 683–686.
114. Belleau F, Nolin MA, Tourigny N, Rigault P, Morissette J (2008) Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. J Biomed Inform 41: 706–716.
115. Marks P (2006) Mashup websites are a dream come true for hackers. New Scientist 190: 28–29.
116. Payne PR, Borlawsky T, Kwok A, Greaves A (2008) Supporting the design of translational clinical studies through the generation and verification of conceptual knowledge-anchored hypotheses. AMIA Annu Symp Proc: 566–570.
117. Mansmann U (2005) Genomic profiling - interplay between clinical epidemiology, bioinformatics and biostatistics. Methods Inf Med 44: 454–460.
118. Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Transactions on Neural Networks 16: 645–678.
119. Ardekani AM, Akhondi MM, Sadeghi MR (2008) Application of genomic and proteomic technologies to early detection of cancer. Archives of Iranian Medicine 11: 427–434.
120. De Fonzo V, Aluffi-Pentini F, Parisi V (2007) Hidden Markov models in bioinformatics. Curr Bioinform 2: 49–61.
121. Feng J, Naiman DQ, Cooper B (2007) Probability-based pattern recognition and statistical framework for randomization: modeling tandem mass spectrum/peptide sequence false match frequencies. Bioinformatics 23: 2210–2217.
122. Oehmen CS, Straatsma TP, Anderson GA, Orr G, Webb-Robertson BJM, et al. (2006) New challenges facing integrative biological science in the post-genomic era. Journal of Biological Systems 14: 275–293.
123. Way JC, Silver PA (2007) Systems engineering without an engineer: why we need systems biology. Complexity 13: 22–29.
124. Knaup P, Ammenwerth E, Brandner R, Brigl B, Fischer G, et al. (2004) Towards clinical bioinformatics: advancing genomic medicine with informatics methods and tools. Methods Inf Med 43: 302–307.
125. Levy D, Dondero R, Veronneau P (2008) Research rewired. Price Waterhouse Coopers. 48 p.
126. Webb CP, Pass HI (2004) Translation research: from accurate diagnosis to appropriate treatment. J Transl Med 2: 35.
127. (2012) Merriam Webster online dictionary. Merriam Webster.
128. (2012) Wordnet. Princeton University.

PLOS Computational Biology | www.ploscompbiol.org 15 December 2012 | Volume 8 | Issue 12 | e1002826


Education

Chapter 2: Data-Driven View of Disease Biology


Casey S. Greene, Olga G. Troyanskaya*
Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America

Abstract: Modern experimental strategies often generate genome-scale measurements of human tissues or cell lines in various physiological states. Investigators often use these datasets individually to help elucidate molecular mechanisms of human diseases. Here we discuss approaches that effectively weight and integrate hundreds of heterogeneous datasets into gene-gene networks that focus on a specific process or disease. Diverse and systematic genome-scale measurements provide such approaches both a great deal of power and a number of challenges. We discuss some such challenges as well as methods to address them. We also raise important considerations for the assessment and evaluation of such approaches. When carefully applied, these integrative data-driven methods can make novel high-quality predictions that can transform our understanding of the molecular basis of human disease.

This article is part of the "Translational Bioinformatics" collection for PLOS Computational Biology.

1. Introduction

Researchers are using genome-scale experimental methods (i.e. approaches that assay hundreds or thousands of genes at a time) to probe the molecular mechanisms of normal biological processes and disease states across systems from cell culture to human tissue samples. Data of this scale can provide a great deal of information about the process or disease of interest, the tissue of origin, and the metabolic state of the organism, among other factors. To understand biological processes on a systems level one must combine data from measurements across different molecular levels (e.g. proteomic, metabolomic, and genomic measurements) while incorporating data from diverse experiments within each individual level. An effective integrative analysis will take advantage of these data to develop a systems level understanding of diseases or tissues.

Human genome-scale experimental data include microarrays [1,2,3], genome-wide association studies [4,5], and RNA interference screens [6,7] among many other experimental designs [8]. These experiments range from those targeted towards tissue specificity [9] to those targeted towards specific diseases such as cancer [10]. The NCBI Gene Expression Omnibus (GEO) [11], a database of microarrays alone, contains over 700 human datasets collected under diverse experimental conditions encompassing more than 8000 individual arrays. The human PeptideAtlas [12], a similar resource for proteomics experiments, currently contains almost 6.7 million MS/MS spectra representing almost 84,000 non-singleton peptides across 220 samples. In addition to these high throughput experiments, there are databases of biochemical pathways [13], gene function [14], pharmacogenomics [15], and protein-protein interactions [16,17,18].

Integrating heterogeneous genome-scale experiments and databases is a challenging task. Beyond the straightforward concern of experimental noise in each individual dataset, integrative approaches also face particular challenges inherent to the process of unifying heterogeneous data types. Specifically we are concerned with biological and computational sources of heterogeneity. Biological heterogeneity among experiments emerges from the measurement of many different processes or the unique probing of biological systems. The source of biological material (e.g. whether experiments measure cells in culture or biopsied tissues) can also lead to systematic biological heterogeneity. Computational heterogeneity (e.g. some datasets have discrete value measurements while others are continuous) comes from the diversity of experimental platforms used to assay biological processes. Integrative approaches that bring together diverse data types and experiments must address the challenge of effectively combining these data for inference.

There are many strategies for combining these diverse and heterogeneous data. These include ridge regression [19,20], Bayesian inference [21,22,23,24,25], expectation maximization [26], and support vector machines [27]. This chapter focuses on the strategy of Bayesian integration, which is capable of both predicting the probability of an interaction between gene pairs and providing information on the contribution of each experiment to that prediction. Bayesian integration allows for datasets to be combined based on the strength of evidence from individual datasets, which can be either learned from the data [28] or expert annotated [29]. Intuitively the Bayesian strategy works by evaluating the accuracy and coverage of each individual dataset and the relevance of each source of data to the disease or tissue of interest and using this information to weight each dataset's impact on resulting predictions. Here we discuss Bayesian methods that infer genome-scale functional relationship networks from high throughput experimental data by building on existing gold standards. We discuss how these methods work, how to develop high quality gold standards, and how to evaluate networks of predicted functional relationships.

Citation: Greene CS, Troyanskaya OG (2012) Chapter 2: Data-Driven View of Disease Biology. PLoS Comput Biol 8(12): e1002816. doi:10.1371/journal.pcbi.1002816

Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America

Published December 27, 2012

Copyright: © 2012 Greene, Troyanskaya. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by the National Science Foundation (NSF) CAREER [award DBI-0546275]; National Institutes of Health (NIH) [R01 GM071966, R01 HG005998 and T32 HG003284]; National Institute of General Medical Sciences (NIGMS) Center of Excellence [P50 GM071508]. The funders had no role in the preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: ogt@genomics.princeton.edu

PLOS Computational Biology | www.ploscompbiol.org 1 December 2012 | Volume 8 | Issue 12 | e1002816


What to Learn in This Chapter

• What a functional relationship network represents.
• The fundamentals of Bayesian inference for genomic data integration.
• How to build a network of functional relationships between genes using examples of functionally related genes and diverse experimental data.
• How computational scientists study disease using data driven approaches, such as integrated networks of protein-protein functional relationships.
• Strategies to assess predictions from a functional relationship network.

2. Combining Diverse Data Using Bayesian Inference

Bayesian inference is a powerful tool that can be used to make predictions based on experimental evidence. If we want to calculate the probability that a gene of unknown function is involved in a disease, we can begin by developing a list of genes known to be involved in the disease (positive examples) and a list of genes not involved in the disease (negative examples). These positive and negative examples are termed a gold standard in the field of machine learning. Figure 1 shows how the measurements for positive genes and negative genes are distributed in datasets measuring three hypothetical conditions. From this, we can observe that genes having a higher (more to the right) score in Condition A and a lower (more to the left) score in Condition C appear to be involved in the disease.

Bayesian inference allows us to use these distributions to quantify the probability that a gene is involved in disease given these data. Table 1 shows experimental results from Condition A where the median has been used to divide the continuous values into discrete bins.

From this contingency table we can calculate the probability that a gene i is involved in disease, $D_i$, given the experimental results for gene i, $E_i$. Mathematically this can be written as $P(D_i \mid E_i)$. Bayes' theorem states that

$$P(D_i \mid E_i) = \frac{P(E_i \mid D_i)\,P(D_i)}{P(E_i)}.$$

The probability that a gene is involved in disease ignoring any evidence, $P(D_i)$, is known as the prior probability. We can conservatively estimate this as, for instance, the ratio of the number of positive examples to the total number of genes. If the organism of interest has 20,000 genes, this would be

$$P(D_i) = \frac{\text{Positive Examples}}{\text{Genes in Organism}} = \frac{200}{20{,}000} = 0.01.$$

This is likely to be too conservative as it assumes that there are no unknown genes that are involved in the disease of interest. In practice, however, as evidence accumulates the impact of the prior probability on individual predictions is diminished.

With knowledge of the state of gene i in Condition A we can calculate $P(E_i \mid D_i)$. In this example, assume that the measurement for gene i is above the median. This probability of observing the experimental result for gene i given that a gene is involved in disease can be calculated as

$$P(E_i \mid D_i) = \frac{\text{Positive Examples Above Median}}{\text{Positive Examples}} = \frac{150}{200} = 0.75.$$

The final component of this formula is the probability of observing the experimental result that was observed for gene i, $P(E_i)$. This value is the ratio of the number of genes from the standard measured above the median to the total number of genes in the standard,

$$P(E_i) = \frac{\text{Above Median}}{\text{Total in Standard}} = \frac{211}{422} = 0.5.$$

It is important to note that, if the prior is adjusted from the proportion observed in the data, $P(E_i)$ must also be adjusted to represent the probability of the evidence under the new prior. With these components we can calculate the probability of disease given the experimental evidence for gene i as

$$P(D_i \mid E_i) = \frac{P(E_i \mid D_i)\,P(D_i)}{P(E_i)} = \frac{0.75 \times 0.01}{0.5} = 0.015.$$

This probability is still small in large part due to our conservative prior, but by assuming that experimental results from different datasets are independent, we can perform this same calculation for gene i in experimental condition B using this probability as the prior, and the calculation for condition C using the probability from condition B as the prior. This procedure exploits Bayes' theorem to bring together diverse evidence sources through the common framework of probabilities.

3. Defining a Functional Relationship Gold Standard

Going beyond gene lists to networks of genes requires a different type of gold standard. While the inference approach described in Section 2 can be used to implicate genes in a disease or process, the specific roles of those genes remain unclear. In the strategy from Section 2, positive and negative genes make up the

Figure 1. Potential distributions of experimental results obtained for datasets collected under three different conditions. The dotted
line indicates the distribution of negative examples and the solid line indicates the distribution of positive examples. In condition A the positive
examples more often occur to the right of the negative examples, in condition B both sets overlap, and in condition C the positive examples occur
more often to the left of the negative examples.
doi:10.1371/journal.pcbi.1002816.g001



Table 1. A contingency table for the experimental results for Condition A.

                     Below Median    Above Median    Total
Positive Examples    50              150             200
Negative Examples    161             61              222
Total                211             211             422

Genes are discretized into values above or below the median. The numbers of positive and negative examples come from the gold standard. These values can be used
to predict the probability that a gene with unknown status is involved in the disease.
doi:10.1371/journal.pcbi.1002816.t001
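
The single-gene calculation from Section 2 can be reproduced from the Table 1 counts. The following is a minimal sketch; the function and variable names are illustrative and do not come from the chapter. It also recomputes P(E_i) by the law of total probability under the 0.01 prior, which is an assumption about one way to make the adjustment that the text says is needed when the prior differs from the proportion observed in the standard.

```python
# Counts from Table 1 (Condition A, discretized at the median).
pos_above, pos_total = 150, 200
neg_above, neg_total = 61, 222


def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(D|E) = P(E|D) * P(D) / P(E)."""
    return likelihood * prior / evidence


prior = 200 / 20_000                 # P(D_i): positive examples / genes in organism
p_e_given_d = pos_above / pos_total  # P(E_i | D_i)

# P(E_i) as used in the text: proportion of the whole standard above the median.
p_e_standard = (pos_above + neg_above) / (pos_total + neg_total)
text_posterior = posterior(p_e_given_d, prior, p_e_standard)

# Assumption of this sketch: P(E_i) recomputed under the 0.01 prior
# using the law of total probability.
p_e_given_not_d = neg_above / neg_total
p_e_adjusted = p_e_given_d * prior + p_e_given_not_d * (1 - prior)
adjusted_posterior = posterior(p_e_given_d, prior, p_e_adjusted)

print(f"calculation as in the text: {text_posterior:.3f}")
print(f"with prior-adjusted P(E_i): {adjusted_posterior:.3f}")
```

Either posterior could then serve as the prior for the next condition, provided the datasets are treated as independent as the text assumes.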

gold standard. By building a gold standard of positive and negative relationships, it becomes possible to predict whether or not a pair of genes interacts.

As with all machine learning strategies, the gold standard determines what type of relationship can be discovered. Here we will describe the process of building a gold standard of functional relationships, but a different standard of only physical or only metabolic interactions could be used to develop a network with those types of connections. Here we define two genes as having a functional relationship if they work together to carry out a biological process (e.g. a KEGG pathway) that can be assayed by definitive experimental follow-up. This definition allows us to capture diverse types of relationships, while discovering relationships suitable for biological follow-up. The Gene Ontology's biological process ontology provides annotations of genes to process, but includes both very broad and very narrow processes. Two examples of broad terms would be "biological regulation" and "response to stimulus." Two examples of narrow terms would be "positive regulation of cell growth involved in cardiac muscle cell development" and "cell-matrix adhesion involved in tangential migration using cell-cell interactions." The broad terms are not specific enough to provide a meaningful gold standard, while the narrow terms have too few annotations to provide sufficient examples of known relationships.

To address this shortcoming, Myers et al. [30] used a panel of experts to select terms from the biological process ontology that were appropriate for confirmation or refutation through laboratory experiments such as "response to DNA damage stimulus" and "aldehyde metabolism." These terms can be downloaded and used to build a positive functional relationship standard. Gene pairs where both genes share one of these terms can be considered to have a functional relationship. Gene pairs which do not share an annotation are of unknown status. For Bayesian inference we must also have a negative standard. One potential way to develop a negative standard would be to randomly select pairs of genes. This assumes that most pairs of genes do not interact.

It is possible to add additional high quality experimentally annotated relationships to these standards from other databases. Databases like KEGG [13], Reactome [31], and HPRD [32] have previously been used to identify additional functional relationships [33]. The positive and negative relationships from the standard determine the type of relationship that will be predicted by the Bayesian integration. Here we use functional relationships, but a gold standard built strictly from physical protein-protein interactions will infer only physical interaction relationships between genes.

4. Building a Network of Functionally Related Genes

Given a gold standard of gene-gene relationships, the probability that two genes of unknown status have a relationship can be calculated from diverse data using Bayesian inference. The process is similar to the integration process described for single-gene prediction, but there are differences. For each dataset, appropriate scores for each gene pair must be calculated. Furthermore, these scores should not require any manual intervention or adjustment that would make an analysis of hundreds or thousands of datasets time consuming. For datasets that are naturally made up of pair-wise scores such as yeast two-hybrid assays, this task is straightforward. For datasets made up of individual gene measurements, such as microarray experiments, a useful measure must be found.

One measure that can provide pair-wise scores across arrays is correlation. Correlation quantifies the amount that two genes vary together and can be a useful indicator of functional relationships. Comparing correlation across datasets in a regular manner is difficult however, because datasets may display more or less correlation based on either true biology (e.g. under some conditions more genes vary together) or experimental error (e.g. systematic biases due to hybridization conditions), and the variance of gene-wise correlations would vary based on these dataset dependent effects. Fisher's z-transform provides a means to convert these correlation coefficients (r) to z-scores by calculating z as

$$z = \frac{1}{2} \ln\!\left(\frac{1 + r}{1 - r}\right).$$

These z-scores provide a familiar framework to work with correlation and allow correlation measures between genes to be compared across datasets. It is then possible to categorize gene pairs as negatively correlated, uncorrelated, or positively correlated based on whether their z-score is less than, approximately equal to, or greater than zero.

These pairs can then be used as evidence in an integration. In the single

Figure 2. An example of querying HEFalMp for the role of APOE across all biological processes (http://hefalmp.princeton.edu/).
doi:10.1371/journal.pcbi.1002816.g002
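
Fisher's z-transform and the three-way binning described in Section 4 can be sketched in a few lines. The cutoff `tol` that decides what counts as "approximately equal to zero" is a hypothetical parameter; the chapter does not give a value.

```python
import math


def fisher_z(r):
    """Fisher's z-transform: z = 0.5 * ln((1 + r) / (1 - r)), for -1 < r < 1."""
    return 0.5 * math.log((1 + r) / (1 - r))


def bin_pair(z, tol=0.25):
    """Assign a gene pair to one of three bins by its z-score.

    `tol` is a hypothetical half-width for "approximately zero".
    """
    if z < -tol:
        return "negatively correlated"
    if z > tol:
        return "positively correlated"
    return "uncorrelated"


for r in (-0.8, 0.05, 0.6):
    z = fisher_z(r)
    print(f"r = {r:+.2f}  z = {z:+.3f}  {bin_pair(z)}")
```

The transform is the inverse hyperbolic tangent, so `math.atanh(r)` gives the same value.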



gene situation, we were interested in $P(D_i \mid E_i)$, or the probability of gene i causing disease given its evidence. Here we are interested in the probability of a functional relationship between genes i and j, $P(FR_{i,j})$, given some pair-wise evidence (e.g. correlation), $E_{i,j}$. As in the single gene situation, this can be calculated with

$$P(FR_{i,j} \mid E_{i,j}) = \frac{P(E_{i,j} \mid FR_{i,j})\,P(FR_{i,j})}{P(E_{i,j})}.$$

Like before, a contingency table is used. The difference in this situation is that the table is based on pair-wise gene measures instead of measurements for individual genes. This process, when used to calculate pair-wise probabilities of functional relationships for all of the genes in the genome of interest, results in a functional relationship network for the organism of interest.

Huttenhower et al. [33] performed Bayesian integration and prediction using human gold standards and datasets. The resulting tool, HEFalMp, allows users to query the network and also displays what datasets contribute to the relationships predicted from the integrated approach. As an example we can query HEFalMp to find out how the APOE protein relates to all genes across all biological processes as shown in Figure 2. The result is shown in Figure 3. The red links indicate that there is a high probability of a functional relationship between the two genes and green links indicate a low probability. Black links indicate a probability of approximately 0.5.

Figure 3. The result of querying HEFalMp for the role of APOE across all biological processes. Red links indicate that there is a high probability of a functional relationship between the two genes.
doi:10.1371/journal.pcbi.1002816.g003

The probability of a functional relationship between any pair of genes is calculated as described previously. As such, this probability is dependent on evidence from each individual dataset. By clicking on a link, the contributions for each dataset towards that gene pair are provided as shown in Figure 4 for APOE and PLTP. This figure indicates the value of including high quality databases such as BioGRID as input data. While the microarray datasets are informative, in this case the three highest weighted datasets were non-microarray data sources.

These functional relationships can then be used to connect genes to diseases through guilt by association approaches. Guilt by association approaches work by finding genes or diseases that are highly connected to query genes. How exactly this is done depends on the underlying network, the size and type of the query sets, and whether or not the task must be done in real time. An example approach would be to consider as positives only relationships with a probability from the inference stage of greater than 0.9. A Fisher's exact test p-value [34] can then be calculated using the counts of genes connected to the query, the number of genes connected to the query and annotated to the disease of interest, as well as the total number of genes in the network and the number of those genes annotated to the disease [34]. The approach used by the HEFalMp online tool is more complicated because the network-specific calculations must be done in real time for the web interface. Figure 5 shows diseases significantly associated with the APOE protein through the HEFalMp online tool, while the procedure used to generate the results for Figure 6 flips the analysis and shows genes significantly associated with Alzheimer disease based on their connectedness to genes annotated to this disease in OMIM [35].

5. Evaluating Functional Relationship Networks

After performing a Bayesian integration it is appropriate to assess the quality of the inference approach. One straightforward way to evaluate the network would be to measure the concordance of the gold standard and predictions from the network. This is easily done by ordering gene pairs by their probabilities in the network from highest to lowest. For each gene pair in the gold standard, the true positive rate (TPR) to that point can be calculated as

$$\mathrm{TPR} = \frac{\text{Positive Pairs Thus Far}}{\text{Total Positives in Standard}}.$$

The false positive rate (FPR) can be calculated with the same values for negative pairs. These values can then be plotted with FPR on the horizontal axis and TPR on the vertical axis. This provides one type of receiver-operator characteristic (ROC) curve which can be used to assess the quality of predictions from the network. The area under this curve (AUC) summarizes to a single number the quality of predictions.

Unfortunately this approach to evaluation uses the same evaluation standard as the gold standard used for learning and therefore it tests the ability of the inference approach to match the gold standard, and not its ability to make new predictions. One way to avoid this circularity is to hold a group of genes out of the gold standard during the integration process. Connections between these held out genes can then be used after the networks are generated to assess the quality of predictions from the network (in this case the concordance between the predictions and



Figure 4. The highest and lowest contributing datasets for the pair of APOE and PLTP are shown (http://hefalmp.princeton.edu/gene/one_specific_gene/18543?argument=21697&context=0). These contributions are based on how well the bin containing the queried gene pair separated known positive functional relationships from known negative functional relationships.
doi:10.1371/journal.pcbi.1002816.g004
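
The ranking procedure from Section 5 (order gold standard pairs by predicted probability, then accumulate TPR and FPR) can be sketched as follows. The scored pairs are hypothetical, the helper names are illustrative, and the sketch assumes the standard contains at least one positive and one negative pair and that there are no tied probabilities.

```python
def roc_points(scored_pairs):
    """Return (FPR, TPR) points from (probability, is_positive) tuples.

    Pairs are ranked from highest to lowest predicted probability; after
    each pair, TPR and FPR are the fractions of gold standard positives
    and negatives seen so far.
    """
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    total_neg = len(ranked) - total_pos
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / total_neg, tp / total_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical predictions: (network probability, in the positive standard?)
pairs = [(0.95, True), (0.90, True), (0.70, False), (0.60, True),
         (0.40, False), (0.20, False)]
curve = roc_points(pairs)
print(f"AUC = {auc(curve):.3f}")
```

An AUC of 0.5 corresponds to random ordering and 1.0 to a ranking that places every positive pair above every negative pair.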

the known relationship status of the held out genes are used). While the holdout approach is effective for large gold standards, when gold standards are small this can result in too few known relationships for assessment of the network. This assessment problem can be alleviated at the cost of computation time by using a cross-validation approach. With cross-validation, the gene sets are divided up into groups. Like the hold-out approach, all but one group is used to train the network

Figure 5. The diseases that are significantly connected to APOE through the guilt by association strategy used in HEFalMp. Alzheimer disease and Macular degeneration are both annotated to the disease in OMIM as noted by the gold bars to the left of the disease (http://hefalmp.princeton.edu/gene/diseases?context=0&name=APOE). The other diseases are implicated by APOE's functional relationships to genes annotated to that disease in OMIM.
doi:10.1371/journal.pcbi.1002816.g005



Figure 6. The genes that are most significantly connected to Alzheimer disease genes using the HEFalMp network and OMIM disease gene annotations (http://hefalmp.princeton.edu/disease/all_genes/55?context=0). The gold bars to the left of APP and APOE indicate that both genes were annotated to Alzheimer disease according to OMIM.
doi:10.1371/journal.pcbi.1002816.g006
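
The cross-validation scheme described in Section 5 can be sketched as below: genes are divided into groups, each group is held out in turn, and only gold standard pairs among held out genes are reserved for evaluation. The gene names, gold standard pairs, and function names are hypothetical, and the actual network training step is left out.

```python
import random


def cross_validation_splits(genes, k, seed=0):
    """Yield (training_genes, held_out_genes) for each of k groups."""
    shuffled = list(genes)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        training = [g for j, fold in enumerate(folds) if j != i for g in fold]
        yield training, held_out


def pairs_within(genes, standard_pairs):
    """Keep only gold standard pairs whose genes are both in `genes`."""
    members = set(genes)
    return [(a, b) for a, b in standard_pairs if a in members and b in members]


# Hypothetical gene names and gold standard pairs.
genes = [f"G{i}" for i in range(1, 10)]
standard = [("G1", "G2"), ("G3", "G4"), ("G5", "G9"), ("G2", "G7")]

for training, held_out in cross_validation_splits(genes, k=3):
    train_pairs = pairs_within(training, standard)  # available for learning
    test_pairs = pairs_within(held_out, standard)   # reserved for evaluation
    print(sorted(held_out), len(train_pairs), len(test_pairs))
```

Pairs that straddle the training and held out groups are used for neither step in this sketch, which is one reason small standards leave few relationships for assessment.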

while the evaluation is performed on the left out group. In contrast to the hold-out approach, the process of training and evaluation is performed iteratively with each group of genes being evaluated, but like the hold-out approach, only the predictions generated on held out genes are used for evaluation.

When standards are incomplete, existing literature can also be used for evaluation. This can be incorporated in a number of ways. One way is to use a blind literature evaluation. Pairs predicted with high probability or genes highly connected to members of the standard can be selected for follow-up. These are combined with randomly selected genes to create a gene list for evaluation. Literature evidence for genes on this list can be assessed, and a comparison can be performed for genes selected from the network and genes selected randomly. If the proportion of literature based positives of genes or pairs selected from the network is substantially higher than those selected randomly, this provides evidence that the network recapitulates true biology.

Fundamentally the goal of this data driven functional genomics strategy is to create a network of predictions useful for designing biological experiments [36]. If these predictions lead to a higher success rate in molecular biology experiments, an integrative analysis can dramatically lower the cost per discovery. Hibbs et al. [37] used a data driven approach to direct experimental biology and found that computational predictions could be experimentally validated at a substantially higher rate than randomly selected genes. Furthermore, those genes that were found by computational methods were more likely to exhibit a subtle phenotype than the genes already known to be involved. This study provides evidence that computational predictions combined with experimental science can lower the cost of experimental discoveries while finding subtle phenotypes that high throughput experimental designs may miss.

Figure 7. The functional relationship network discovered by a data driven integration for the YFG gene in YFO.
doi:10.1371/journal.pcbi.1002816.g007

6. Summary

Data driven functional genomics strategies combine methods from statistics and computer science to integrate diverse experimental data for the purpose of making novel biological predictions. By bringing diverse data together, these methods are capable of discovering patterns of biological relevance not well characterized in individual studies [38]. Furthermore, because these methods rely on existing data, they can be used to efficiently direct definitive low throughput experimental studies in a cost effective manner [37,39].

Integrative data driven approaches are often compared to publicly available databases of knowledge or experiments or to the statistical analysis of results from



Table 2. A contingency table for gene-pairs based on correlation in a gene expression dataset.

                                 Negatively Correlated    Uncorrelated    Positively Correlated
Known Positive Relationships     20                       30              50
Known Negative Relationships     400                      300             200

doi:10.1371/journal.pcbi.1002816.t002
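
A contingency table like Table 2 can be assembled by binning each gold standard pair's z-score and counting pairs per bin. The sketch below uses hypothetical gene pairs and a hypothetical cutoff `tol`; only the final normalization uses numbers from Table 2.

```python
from collections import Counter

BINS = ("negative", "uncorrelated", "positive")


def bin_z(z, tol=0.25):
    """Discretize a z-score; `tol` is a hypothetical cutoff around zero."""
    if z < -tol:
        return "negative"
    if z > tol:
        return "positive"
    return "uncorrelated"


def contingency_table(z_scores, positives, negatives):
    """Count gold standard pairs in each correlation bin.

    z_scores maps a gene pair to its z-score; positives and negatives are
    sets of gene pairs from the gold standard. Pairs of unknown status
    are ignored.
    """
    counts = {"positive standard": Counter(), "negative standard": Counter()}
    for pair, z in z_scores.items():
        if pair in positives:
            counts["positive standard"][bin_z(z)] += 1
        elif pair in negatives:
            counts["negative standard"][bin_z(z)] += 1
    return counts


def likelihoods(row):
    """Convert one row of counts to P(bin | relationship status)."""
    total = sum(row.values())
    return {b: row[b] / total for b in BINS}


# Hypothetical z-scores for five gene pairs.
z_scores = {("A", "B"): 1.2, ("A", "C"): -0.9, ("B", "C"): 0.1,
            ("C", "D"): 0.8, ("A", "D"): -0.1}
positives = {("A", "B"), ("C", "D")}
negatives = {("A", "C"), ("B", "C")}
table = contingency_table(z_scores, positives, negatives)
print(table)

# The same normalization applied to the positive row of Table 2.
table2_pos = Counter({"negative": 20, "uncorrelated": 30, "positive": 50})
print(likelihoods(table2_pos))
```

Each normalized row supplies the term P(E_{i,j} | FR_{i,j}), or its counterpart for unrelated pairs, in the pair-wise form of Bayes' theorem.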

Table 3. A contingency table for gene-pairs based on a database of physical interactions.

                                 Not Physically Interacting    Physically Interacting
Known Positive Relationships     10                            90
Known Negative Relationships     900                           100

doi:10.1371/journal.pcbi.1002816.t003
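
Evidence from several datasets can be combined under the independence assumption stated in Section 2. The following is a minimal sketch in which the function name is illustrative and the example deliberately uses a combination of bins (negatively correlated and physically interacting) that is not one of the chapter's exercises. The 0.2 prior is the proportion of functionally related pairs given in Exercise 3.

```python
def combine(prior, likelihood_pairs):
    """Posterior P(FR | all evidence), assuming independent datasets.

    likelihood_pairs holds one (P(E | FR), P(E | not FR)) tuple per dataset.
    The denominator P(E) is obtained by summing over both hypotheses.
    """
    p_fr, p_not = prior, 1 - prior
    for given_fr, given_not in likelihood_pairs:
        p_fr *= given_fr
        p_not *= given_not
    return p_fr / (p_fr + p_not)


# Likelihoods read from the contingency tables.
neg_corr = (20 / 100, 400 / 900)      # Table 2: negatively correlated bin
interacting = (90 / 100, 100 / 1000)  # Table 3: physically interacting bin

prior = 0.2  # assumed proportion of gene pairs with a functional relationship
both = combine(prior, [neg_corr, interacting])
print(f"posterior with both datasets: {both:.3f}")
```

Multiplying the likelihoods in one pass gives the same result as using each dataset's posterior as the prior for the next, as long as the evidence term is recomputed at every step.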

individual high throughput experiments, but they are distinct from both of these. Databases generated by literature curation are by their nature not well suited to the discovery of new knowledge and databases of experimental results require researchers to know a priori which datasets are relevant to the biological question of interest. Integrative data driven approaches combine high throughput experiments and databases of diverse types and in so doing can make predictions beyond those discovered using single data sources.

The flexibility of the data driven approach also gives rise to its greatest challenge. This strategy relies upon gold standards that are a representation of high quality current knowledge. When these standards are of high quality and appropriate to the biological question of interest, the resulting answers are likely to be useful. If the standards are of lower quality, the utility of the predictions will be lessened. In many cases the gold standard quality is the critical determinant of success for these algorithms. With careful use, these methods can generate predictions capable of efficiently directing experimental biology [37,40].

7. Exercises

1. All proteins connected to the protein Your Favorite Gene (YFG) in the functional relationship network of Your Favorite Organism (YFO) are shown in Figure 7. Three of them are known to be associated with Your Favorite Disease (YFD). These genes are YFDG1, YFDG2, and YFDG3. YFD has six genes annotated to it among the 100 genes present in YFO. Using a Fisher's exact test to evaluate guilt by association, is YFG significantly associated with YFD (α < 0.05)?
2. Does the gene expression dataset described by the contingency table in Table 2 provide any information about whether or not the genes YFG and MFG are likely to have a functional relationship if they are uncorrelated in this dataset? What if they are negatively correlated?
3. Using the contingency tables from Tables 2 and 3 and the knowledge that 20% of gene-pairs in the organism of interest have a functional relationship, what is the probability that genes YFG and MFG have a functional relationship if they are positively correlated in the experiment that Table 2 is derived from and physically interacting in the database from which Table 3 is derived?
4. What is the major difference between databases and integrative data driven approaches?

Answers to the Exercises can be found in Text S1.

Supporting Information

Text S1 Answers to Exercises
(DOCX)

Further Reading

• Kanehisa M, Bork P (2003) Bioinformatics in the post-sequence era. Nat Genet 33 Suppl: 305–310.

Glossary

• Functional Relationship: The type of interaction that two genes have if they participate in the same biological process.
• Gold Standard: A set of genes or gene-pairs with a known status (positive or negative) in the tissue, process, disease, or phenotype of interest.
• Hypergeometric/Fisher's Exact Test: A test of independence appropriate for categorical count data when the number of items in each cell is small.
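
The hypergeometric form of Fisher's exact test, as used for guilt by association in Section 4, can be sketched with the standard library alone. The one-sided (enrichment) tail is an assumption of this sketch, the function name is illustrative, and the counts in the example are hypothetical rather than taken from the exercises.

```python
from math import comb


def guilt_by_association_p(network_size, disease_genes, neighbors, overlap):
    """One-sided Fisher's exact (hypergeometric) p-value.

    Probability of seeing at least `overlap` disease genes among
    `neighbors` genes drawn without replacement from a network of
    `network_size` genes, `disease_genes` of which are annotated.
    """
    max_overlap = min(neighbors, disease_genes)
    total = comb(network_size, neighbors)
    tail = sum(comb(disease_genes, k)
               * comb(network_size - disease_genes, neighbors - k)
               for k in range(overlap, max_overlap + 1))
    return tail / total


# Hypothetical query: a 500-gene network with 20 disease genes; the query
# gene has 15 high-probability neighbors, 4 of which are disease genes.
p = guilt_by_association_p(500, 20, 15, 4)
print(f"p = {p:.2e}")
```

Here `neighbors` would be the genes linked to the query with probability above a chosen threshold, such as the 0.9 cutoff mentioned in the text.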

PLOS Computational Biology | www.ploscompbiol.org 7 December 2012 | Volume 8 | Issue 12 | e1002816


References
1. Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, et al. (2002) Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol Biol Cell 13: 1977–2000.
2. Hegde P, Qi R, Gaspard R, Abernathy K, Dharap S, et al. (2001) Identification of tumor markers in models of human colorectal cancer using a 19,200-element complementary DNA microarray. Cancer Res 61: 7792–7797.
3. Lock C, Hermans G, Pedotti R, Brendolan A, Schadt E, et al. (2002) Gene-microarray analysis of multiple sclerosis lesions yields new targets validated in autoimmune encephalomyelitis. Nat Med 8: 500–508.
4. Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.
5. Schymick JC, Scholz SW, Fung HC, Britton A, Arepalli S, et al. (2007) Genome-wide genotyping in amyotrophic lateral sclerosis and neurologically normal controls: first stage analysis and public release of data. Lancet Neurol 6: 322–328.
6. Kittler R, Pelletier L, Heninger AK, Slabicki M, Theis M, et al. (2007) Genome-scale RNAi profiling of cell division in human tissue culture cells. Nat Cell Biol 9: 1401–1412.
7. Krishnan MN, Ng A, Sukumaran B, Gilfoy FD, Uchil PD, et al. (2008) RNA interference screen for human genes associated with West Nile virus infection. Nature 455: 242–245.
8. Ozsolak F, Song JS, Liu XS, Fisher DE (2007) High-throughput mapping of the chromatin structure of human promoters. Nat Biotechnol 25: 244–248.
9. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, et al. (2004) A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 101: 6062–6067.
10. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, et al. (2000) Molecular portraits of human breast tumours. Nature 406: 747–752.
11. Edgar R, Domrachev M, Lash AE (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30: 207–210.
12. Desiere F, Deutsch EW, King NL, Nesvizhskii AI, Mallick P, et al. (2006) The PeptideAtlas project. Nucleic Acids Res 34: D655–D658.
13. Kanehisa M, Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28: 27–30.
14. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–29.
15. Klein TE, Chang JT, Cho MK, Easton KL, Fergerson R, et al. (2001) Integrating genotype and phenotype information: an overview of the PharmGKB project. Pharmacogenetics Research Network and Knowledge Base. Pharmacogenomics J 1: 167–170.
16. Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, et al. (2002) DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res 30: 303–305.
17. Bader G, Betel D, Hogue C (2003) BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res 31: 248–250.
18. Snel B, Lehmann G, Bork P, Huynen MA (2000) STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res 28: 3442–3444.
19. Mostafavi S, Ray D, Warde-Farley D, Grouios C, Morris Q (2008) GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol 9 Suppl 1: S4.
20. Warde-Farley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, et al. (2010) The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res 38: W214–W220.
21. Lee I, Ambaru B, Thakkar P, Marcotte EM, Rhee SY (2010) Rational association of genes with traits using a genome-scale gene network for Arabidopsis thaliana. Nat Biotechnol 28: 149–156.
22. Lee I, Date SV, Adai AT, Marcotte EM (2004) A probabilistic functional network of yeast genes. Science 306: 1555–1558.
23. Lee I, Blom UM, Wang PI, Shim JE, Marcotte EM (2011) Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res 21: 1109–1121.
24. Kim WK, Krumpelman C, Marcotte EM (2008) Inferring mouse gene functions from genomic-scale data using a combined functional network/classification strategy. Genome Biol 9 Suppl 1: S5.
25. Rhodes DR, Tomlins SA, Varambally S, Mahavisno V, Barrette T, et al. (2005) Probabilistic model of the human protein-protein interaction network. Nat Biotechnol 23: 951–959.
26. Segal E, Wang H, Koller D (2003) Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics 19: i264–i272.
27. Chen X, Lin MZ, Shen XL (2011) PAIR: the predicted Arabidopsis interactome resource. Nucleic Acids Res 39: D1134–D1140.
28. Myers C, Robson D, Wible A, Hibbs M, Chiriac C, et al. (2005) Discovery of biological networks from diverse functional genomic data. Genome Biol 6: R114–R114.
29. Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D (2003) A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci U S A 100: 8348–8353.
30. Myers CL, Barrett DR, Hibbs MA, Huttenhower C, Troyanskaya OG (2006) Finding function: evaluation methods for functional genomic data. BMC Genomics 7: 187.
31. Vastrik I, D'Eustachio P, Schmidt E, Gopinath G, Croft D, et al. (2007) Reactome: a knowledge base of biologic pathways and processes. Genome Biol 8: R39.
32. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, et al. (2003) Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 13: 2363–2371.
33. Huttenhower C, Haley EM, Hibbs MA, Dumeaux V, Barrett DR, et al. (2009) Exploring the human genome with functional maps. Genome Res 19: 1093–1106.
34. Sokal RR, Rohlf FJ (1995) Biometry: the principles and practice of statistics in biological research. New York: W.H. Freeman. xix, 887 p.
35. Hamosh A, Scott AF, Amberger J, Valle D, McKusick VA (2000) Online Mendelian Inheritance in Man (OMIM). Human Mutation 15: 57–61.
36. Greene CS, Troyanskaya OG (2012) Accurate evaluation and analysis of functional genomics data and methods. Ann N Y Acad Sci 1260: 95–100.
37. Hibbs MA, Myers CL, Huttenhower C, Hess DC, Li K, et al. (2009) Directing experimental biology: a case study in mitochondrial biogenesis. PLoS Comput Biol 5: e1000322. doi:10.1371/journal.pcbi.1000322.
38. Huttenhower C, Hibbs M, Myers C, Troyanskaya OG (2006) A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics 22: 2890–2897.
39. Hess DC, Myers CL, Huttenhower C, Hibbs MA, Hayes AP, et al. (2009) Computationally driven, quantitative experiments discover genes required for mitochondrial biogenesis. PLoS Genet 5: e1000407. doi:10.1371/journal.pgen.1000407.
40. Guan Y, Dunham M, Caudy A, Troyanskaya O (2010) Systematic planning of genome-scale experiments in poorly studied species. PLoS Comput Biol 6: e1000698. doi:10.1371/journal.pcbi.1000698.



Education

Chapter 3: Small Molecules and Disease


David S. Wishart1,2,3*
1 Department of Biological Sciences, University of Alberta, Edmonton, Alberta, Canada, 2 Department of Computing Science, University of Alberta, Edmonton, Alberta,
Canada, 3 National Research Council, National Institute for Nanotechnology (NINT), Edmonton, Alberta, Canada

Abstract: "Big" molecules such as proteins and genes still continue to capture the imagination of most biologists, biochemists and bioinformaticians. "Small" molecules, on the other hand, are the molecules that most biologists, biochemists and bioinformaticians prefer to ignore. However, it is becoming increasingly apparent that small molecules such as amino acids, lipids and sugars play a far more important role in all aspects of disease etiology and disease treatment than we realized. This particular chapter focuses on an emerging field of bioinformatics called "chemical bioinformatics", a discipline that has evolved to help address the blended chemical and molecular biological needs of toxicogenomics, pharmacogenomics, metabolomics and systems biology. In the following pages we will cover several topics related to chemical bioinformatics. First, a brief overview of some of the most important or useful chemical bioinformatic resources will be given. Second, a more detailed overview will be given on those particular resources that allow researchers to connect small molecules to diseases. This section will focus on describing a number of recently developed databases or knowledgebases that explicitly relate small molecules, either as the treatment, symptom or cause, to disease. Finally a short discussion will be provided on newly emerging software tools that exploit these databases as a means to discover new biomarkers or even new treatments for disease.

This article is part of the "Translational Bioinformatics" collection for PLOS Computational Biology.

Citation: Wishart DS (2012) Chapter 3: Small Molecules and Disease. PLoS Comput Biol 8(12): e1002805. doi:10.1371/journal.pcbi.1002805
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: © 2012 David S. Wishart. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: Funding to develop the databases described in this article was provided by Genome Canada, Alberta Innovates, and the Canadian Institutes of Health Research. The funders had no role in the preparation of the manuscript.
Competing Interests: The author has declared that no competing interests exist.
* E-mail: david.wishart@ualberta.ca

1. Introduction

For most of the past 100 years, the fields of toxicology, pharmacology and clinical biochemistry have focused on identifying the chemicals that cause (toxins), cure (drugs) or characterize (biomarkers) most human diseases. Historically, this kind of work has been reliant on the slow, careful and sometimes tedious approaches of classical analytical chemistry and classical biochemistry. Nevertheless, it has led to important discoveries and enormous advances in our understanding of the actions of chemicals on genes, proteins and cells. With the recent emergence of high throughput "omics" technologies, our ability to detect, identify, and characterize small molecules along with their large molecule targets has been radically changed [1,2]. Now it is possible to perform as many sequencing experiments, mass spectrometry (MS) experiments or compound identifications in a single day as used to be done in a single year. As a result, traditional fields such as toxicology, pharmacology and biochemistry have been transformed into totally new fields called toxicogenomics, pharmacogenomics and metabolomics. This transformation has changed not only the fundamentals of these disciplines, but also the fundamentals of their data. Rather than trying to manage a few samples, a few sequences or a few compounds in a paper notebook or on an Excel spreadsheet, researchers are confronted with the task of handling hundreds of samples, thousands of compounds, thousands of spectra and thousands of genes or protein sequences. This has led to the development of novel computational tools and entirely new bioinformatic disciplines to facilitate the handling of this data. This particular chapter focuses on an emerging field of bioinformatics called "chemical bioinformatics", a discipline that has evolved to help address the blended chemical and molecular biological needs of toxicogenomics, pharmacogenomics, metabolomics and systems biology.

Chemical bioinformatics combines the sequence-centric tools of bioinformatics with the chemo-centric tools of cheminformatics. The term "cheminformatics", which is an abbreviated form of chemical informatics, was first coined by Frank Brown nearly 15 years ago [3]. Cheminformatics (as it is known in North America) or chemoinformatics (as it is known in Europe and the rest of the world) is actually a close cousin to bioinformatics. Just as bioinformatics is a field of information technology concerned with using computers to analyze molecular biological data, cheminformatics is a field of information technology that uses computers to facilitate the collection, storage, analysis and manipulation of large quantities of chemical data.

However, there are some distinct cultural differences between bioinformatics and cheminformatics. For instance, cheminformatics software is mostly designed for use by chemists, while bioinformatics software is designed for use by molecular biologists. Consequently there is often a terminology gap that makes it difficult for biologists to use cheminformatic software and chemists to use bioinformatics software. Likewise, most cheminformatic software is structure-based or picture-driven while most bioinformatic software is sequence-based or text-driven. As a result, different search and query interfaces have evolved that are quite specific to either cheminformatic or bioinformatic software.

PLOS Computational Biology | www.ploscompbiol.org 1 December 2012 | Volume 8 | Issue 12 | e1002805


What to Learn in This Chapter

• The meaning of chemical bioinformatics
• Strengths and limitations of existing chemical bioinformatic databases
• Using databases to learn about the cause and treatment of diseases
• The Small Molecule Pathway Database (SMPDB)
• The Human Metabolome Database (HMDB)
• DrugBank
• The Toxin and Toxin-Target Database (T3DB)
• PolySearch and Metabolite Set Enrichment Analysis

Further compounding this culture gap is the fact that most cheminformatics software and chemical compound databases were developed without the expectation that this information would ever be biologically or medically relevant. Likewise, most bioinformatics software and bioinformatic databases were developed without the intention of using this data to facilitate small molecule biomarker identification or small molecule drug discovery. Consequently most biological sequence data is not linked in any meaningful way to drug or disease information and vice versa. However, thanks to the emergence of new fields such as pharmacogenomics, toxicogenomics, systems biology and metabolomics, there is now a growing desire to bring bioinformatics and cheminformatics closer together. This has spawned the new field of chemical bioinformatics.

In this chapter we will cover several topics related to chemical bioinformatics. First, a brief overview of some of the most important chemical bioinformatic resources will be given. This will include a discussion of some of the major databases and classes of databases. Second, a more detailed overview will be given on those particular resources that allow researchers to connect small molecules to diseases. This section will focus on describing a number of recently developed databases or knowledgebases that explicitly relate small molecules, either as the treatment, symptom or cause, to disease. Finally a short discussion will be provided on newly emerging software tools that exploit these databases as a means to discover new biomarkers or even new treatments for disease.

2. Databases for Chemical Bioinformatics

Electronic databases lie at the heart of almost any subdiscipline of bioinformatics and chemical bioinformatics is no exception. Indeed, without databases there is essentially no foundational knowledge to the discipline, and consequently, no compelling reason to write software. Programs such as BLAST [4] would be useless without GenBank [5]; likewise, PSIPRED [6] couldn't exist without the Protein Data Bank [7], and Gene Set Enrichment Analysis (GSEA) [8] would be impossible without the GEO and KEGG databases [9,10]. Given their importance, it is perhaps worthwhile to briefly review the different types of chemical-bioinformatic databases that are available and discuss some of their particular strengths and limitations.

Currently there are four major classes of chemical-bioinformatic databases. These include: 1) small molecule (or metabolic) pathway databases; 2) metabolite or metabolomic databases; 3) drug databases; and 4) toxin or toxic substance databases. In an ideal world each of these database classes could/should be useful for relating small molecules to human diseases or disease treatments. For instance, metabolic pathway databases would be expected to be most useful for understanding the "big-picture" relationship between small molecules and disease, either with regard to those small molecule compounds causing disease (i.e. toxins), indicating disease (i.e. biomarkers) or being used in the treatment of disease (i.e. drugs). On the other hand, metabolite or metabolomic databases would be expected to be most useful for associating small molecule biomarkers with specific diseases, such as inborn errors of metabolism or a variety of chronic or infectious diseases characterized by metabolite imbalances. Drug databases would obviously be most relevant for linking small molecules to disease treatments, although they could also be used to identify small molecule drugs causing adverse drug reactions. Finally toxin or toxic compound databases would be expected to be most useful for identifying the compounds causing diseases or causing symptoms associated with certain poisoning or environmental exposure incidents. This could include acute poisonings or more long-term, environmentally influenced conditions such as cancer, allergies or birth defects.

However, as detailed below, not all of the available chemical-bioinformatic databases are particularly suited for these kinds of disease-associated queries. This likely reflects the relatively nascent stage of this field (it is less than five years old) and the fact that disease-related information is much more difficult to gather and codify than either chemical structure or gene sequence information. Certainly all of today's existing chemical-bioinformatic databases contain information about different classes of chemicals (metabolites, drugs or poisons) and most contain some limited information about the corresponding protein and/or genetic targets. However, only a very small number of these databases actually include information on the diseases or physiological effects that may be caused, cured or characterized by these chemicals.

2.1 Metabolic Pathway Databases

Among the four major classes of chemical-bioinformatic databases that are available, metabolic pathway databases are perhaps the best known and most widely used. They include a number of popular web-based resources such as the Kyoto Encyclopedia of Genes and Genomes, also known as KEGG [10], the "Cyc" databases [11,12], the Reactome database [13], WikiPathways [14], the Small Molecule Pathway Database or SMPDB [15] and the Medical Biochemistry Page [http://themedicalbiochemistrypage.org/]. Several commercial pathway databases also exist such as TransPath (from BioBase Inc.), PathArt (from Jubilant Biosys Inc.), MetaBase (from GeneGo Inc.) and Ingenuity Pathways Analysis (Ingenuity Systems Inc.), many of which provide nicely illustrated metabolic pathway diagrams. Most of these pathway databases were designed to facilitate the exploration of metabolism and metabolites across many different species. This broad, multi-organism perspective has been critical to enhancing our basic understanding of metabolism and our appreciation of biological diversity. Metabolic pathway databases also serve as the backbone to facilitate many practical applications in biology including comparative genomics and targeted genome annotation. Table 1 lists the names, web addresses and general features for these and other useful pathway databases.

Those metabolic pathway databases that strive for very broad organism coverage, such as KEGG and Reactome, tend to use pathway diagrams that are very generic and highly schematic, while
those that are organism-specific (i.e. human-only), such as SMPDB and the Medical Biochemistry Page, tend to use diagrams that are very specific and much richer in detail, colour and content. Most pathway databases support interactive image mapping with hyperlinked information content that allows users to view chemical information (if a compound is clicked) or brief summaries of genes and/or proteins (if a protein is clicked). Almost all of the databases support some kind of limited text search and a few, such as Reactome, SMPDB and the "Cyc" databases, support the mapping of gene, protein and/or metabolite expression data onto pathway diagrams. As might be expected, the major focus of most of today's small molecule pathway databases is on basic metabolism. As a result, only one of these databases (SMPDB) actually includes any pathways associated with drug action or disease.

Table 1. Alphabetical List of Popular Metabolic Pathway Databases.

Database Name | URL or Web Address | Comments
HumanCyc (Encyclopedia of Human Metabolic Pathways) | http://humancyc.org/ | MetaCyc adapted to human metabolism; no disease or drug pathways
KEGG (Kyoto Encyclopedia of Genes and Genomes) | http://www.genome.jp/kegg/ | Best known and among the most complete metabolic pathway databases; covers many organisms; a few disease and drug pathways
The Medical Biochemistry Page | http://themedicalbiochemistrypage.org/ | Simple metabolic pathway diagrams with extensive explanations; a few drug and disease pathways
MetaCyc (Encyclopedia of Metabolic Pathways) | http://metacyc.org/ | Similar to KEGG in coverage, but different emphasis; well referenced; no disease or drug pathways
Reactome (A Curated Knowledgebase of Pathways) | http://www.reactome.org/ | Pathway database with more advanced query features; not as complete as KEGG or MetaCyc
Roche Applied Sciences Biochemical Pathways Chart | http://www.expasy.org/cgi-bin/search-biochem-index | The old metabolism standard (online); describes most human metabolism
Small Molecule Pathway Database (SMPDB) | http://www.smpdb.ca/ | Pathway database with disease, drug and metabolic pathways for humans; extensive search, analysis and visualization tools
WikiPathways | http://www.wikipathways.org | Community annotated pathway database for 19 model organisms; contains 175 human pathways; few drug or disease pathways

doi:10.1371/journal.pcbi.1002805.t001

2.2 Metabolomic Databases

The second major class of chemical-bioinformatic databases comprises the metabolomic or metabolite databases. These databases tend to have a major focus on chemicals and chemical descriptors with a lesser (or even absent) focus on biological data. They are primarily used for metabolite identification, especially in metabolomic studies. Some databases are almost exclusively chemical in nature, containing primarily information on the chemical name(s), synonyms, InChI (International Chemical Identifier) identifier, structure, and molecular weight. These include Lipid Maps [16], a comprehensive database of biological lipids; ChEBI [17], a database of biologically interesting compounds; PubChem [18], a collection of most known organic chemicals with links to PubMed articles and more than 500,000 bioassays; ChemSpider [19], a chemical database that is similar in size to PubChem; KNApSAcK [20], a database of plant phytochemicals; and METLIN [21], a database of known and presumptive human metabolites. All of these databases support a variety of text search options and a few (such as PubChem, ChemSpider, Lipid Maps and ChEBI) support structure and structure similarity searches. In addition to these biochemical databases, there are a number of smaller databases that contain spectral (NMR or MS) data of small molecule metabolites. These include the BioMagResBank or BMRB [22], which contains experimental NMR spectra of mammalian metabolites; MassBank [23], which contains MS spectra of a variety of metabolites, drugs and toxic compounds; MMCD [24], which contains experimental and predicted NMR spectra of Arabidopsis metabolites; and the Golm Metabolome database [25], which contains MS spectra of different plant metabolites. These spectral databases are frequently used to facilitate compound identification via spectral comparison. More recently, a much more comprehensive kind of metabolomic database has emerged which attempts to combine chemical data, spectral data, protein target data, biomarker data and disease data into a single resource. Perhaps the best example of this is the Human Metabolome Database (HMDB). The HMDB is a database containing comprehensive data on most of the known or measurable endogenous metabolites in humans [26]. Table 2 presents a summary of the names, web addresses and general features for the major metabolite/metabolomic databases.

2.3 Pharmaceutical Product Databases

The third major class of chemical-bioinformatic databases comprises the drug or pharmaceutical product databases. In particular, two types of electronic drug databases have started to emerge over the past five years: 1) clinically oriented drug databases and 2) chemically oriented drug databases. Examples of some of the better-known clinically oriented drug databases include DailyMed [27] and RxList [28]. These resources typically offer very detailed clinical information (i.e. their formulation, metabolism and indications) about selected drugs derived from their FDA labels. As a result, these kinds of databases are targeted more towards pharmacists, physicians or consumers. Examples of chemically or genetically oriented drug databases include the TTD [29], PharmGKB [30] and SuperTarget [31]. TTD (which stands for Therapeutic Target Database) contains information on 5028 drugs (both approved and experimental) with 1894 identified targets and links to 560 different diseases or indications. PharmGKB (which stands for Pharmacogenomics Knowledge Base) has information on 1587 approved drugs (with descriptions and indications), including pharmacogenomic data on 287 drugs. SuperTarget contains information on more than 2500 target proteins, which are annotated with about 7300 literature-mined relations to 1500 different drugs. All three of these databases provide synoptic data (5–10 data fields per entry) about the nomenclature, structure and/or physical properties of small molecule drugs and, in the case of SuperTarget and TTD, their drug targets. Both TTD and SuperTarget support text, sequence and chemical structure searches, while PharmGKB provides mechanistic, pharmacodynamic and pharmacokinetic pathway information for 68 different drugs or drug classes. As a general rule, chemically oriented drug databases tend to appeal to medicinal chemists, biochemists and molecular biologists. In addition to these somewhat specialized databases, a much more comprehensive hybrid database, known as DrugBank [32], has recently been developed. DrugBank combines the clinical/disease information of the clinically oriented drug databases with the biochemical/chemical information of the chemically oriented drug databases. As a result, a typical DrugBank entry contains 80–100 different data fields, instead of 5–10 as seen with the other kinds of databases. Like TTD and SuperTarget, DrugBank supports very extensive text, sequence and chemical structure searches. It also provides detailed pathway information on the mechanism of action for >200 different drugs or drug classes. Table 3 provides a short summary of the names, descriptions and website addresses of the more popular drug or pharmaceutical product databases.

2.4 Toxic Substance Databases

The final class of chemical-bioinformatic databases we will discuss comprises the toxic substance databases. These include the Animal Toxin Database (ATDB), SuperToxic [33], ACToR [34], the Comparative Toxicogenomics Database [35] and T3DB [36]. Table 4 presents a summary of the names, web addresses and general features for these databases. The Animal Toxin Database (ATDB), with >3800 peptide toxins, provides data on the sequence of many peptide/protein toxins from venomous insects and animals as well as information on the channel targets to which these toxins bind. Both ACToR (which stands for the Aggregated Computational Toxicology Resource) and SuperToxic provide bioassay data and chemical structure information for a very large number of industrial or pharmaceutically interesting chemicals (>60,000 for SuperToxic, >500,000 for ACToR). The Comparative Toxicogenomics Database (CTD), with >5000 chemicals, provides literature-derived information on chemical-gene interactions. This includes microarray information on genes that are up/down-regulated upon contact or exposure to these chemicals. T3DB (which stands for the Toxin and Toxin-Target Database) provides very extensive structural, physiological, mechanistic, medical and biochemical information on about 3100 commonly encountered (i.e. household or environmental) toxins and poisons.

Each of these databases addresses the needs of certain communities such as animal physiologists (ATDB), toxicogenomics or toxicology specialists (CTD and T3DB), environmental or industrial regulators (ACToR) or medicinal chemists interested in toxicity prediction (SuperToxic). However, with the exception of T3DB, most of these online toxin or toxic compound databases are relatively lightly annotated, with fewer than a dozen data fields per compound and essentially no physiological, disease or disease symptom information.

Clearly not all of the chemical-bioinformatic databases we have described in this section are suitable for deriving information about small molecules and disease. Likewise, many of the databases mentioned above are not exactly suitable for translational bioinformatic questions or for applications relating to medicine, medical biochemistry or clinical research. However, there is at least one database in each of the four major chemical-bioinformatic database classes that does generally meet these criteria. In particular: 1) SMPDB is a pathway database that explicitly relates small molecules to disease and disease treatment; 2) HMDB is a metabolomic database that associates metabolites to disease biomarkers or disease diagnosis; 3) DrugBank is a drug database that links drugs and drug targets to symptoms, diseases and disease treatments; and 4) T3DB is a toxic substance database that associates toxins and their biological targets with symptoms, conditions, diseases and disease treatments. A more detailed description of each of these databases is provided below.

3. SMPDB: A Pathway Database for Drugs and Disease

As noted earlier, SMPDB is a pathway database specifically designed to facilitate clinical "omics" studies, with a specific emphasis on clinical biochemistry and clinical pharmacology. Currently SMPDB consists of more than 450 highly detailed, hand-drawn pathways describing small molecule metabolism or small molecule processes that are specific to humans. These pathways can be placed into four different categories: 1) metabolic pathways; 2) small molecule disease pathways; 3) small molecule drug pathways and 4) small molecule signaling pathways. An example of a typical SMPDB pathway (Phenylketonuria) is shown in Figure 1. As seen in this figure, all SMPDB pathways explicitly include the chemical structure of the major chemicals in each pathway. In addition, the cellular locations (membrane, cytoplasm, mitochondrion, nucleus, peroxisome, etc.) of all metabolites and the enzymes involved in their processing are explicitly illustrated. Likewise the quaternary structures (if known) and cofactors associated with each of the pathway proteins are also shown. If some of the metabolic processes occur primarily in one organ or in the intestinal microflora, this information is also illustrated. The inclusion of explicit chemical, cellular and physiological information is one of the more unique and useful features of SMPDB. SMPDB is also unique in its inclusion of significant numbers of metabolic disease pathways (>100) and drug pathways (>200) not found in any other pathway database. Likewise, unlike other pathway databases, SMPDB supports a number of unique database querying and viewing features. These include simplified database browsing, the generation of protein/metabolite lists for each pathway, text querying, chemical structure querying and sequence querying, as well as large-scale pathway mapping via protein, gene or chemical compound lists.

The SMPDB interface is largely modeled after the interface used for DrugBank [32], T3DB [36] and the HMDB [26], with a navigation panel for Browsing, Searching and Downloading the database.

PLOS Computational Biology | www.ploscompbiol.org 4 December 2012 | Volume 8 | Issue 12 | e1002805


Table 2. Alphabetical List of Metabolomic, Chemical or Spectral Databases.

Database Name | URL or Web Address | Comments

BioMagResBank (BMRB Metabolomics) | http://www.bmrb.wisc.edu/metabolomics/ | Emphasis on NMR data, no biological or biochemical data; specific to plants (Arabidopsis)
Chemical Entities of Biological Interest (ChEBI) | http://www.ebi.ac.uk/chebi/ | Covers metabolites and drugs of biological interest; focus on ontology and nomenclature, not biology
ChemSpider | http://www.chemspider.com/ | Meta-database containing chemical data from 100+ other databases; 20+ million compounds; good search utilities
Golm Metabolome Database | http://csbdb.mpimp-golm.mpg.de/csbdb/gmd/gmd.html | Emphasis on MS or GC-MS data only; no biological data; few data fields; specific to plants
Human Metabolome Database | http://www.hmdb.ca | Largest and most completely annotated metabolomic database; specific to humans only
KNApSAcK | http://kanaya.naist.jp/KNApSAcK/ | A phytochemical database containing data for 50,000 compounds
LipidMaps | http://www.lipidmaps.org/ | Contains 22,500 different lipids found in plants & animals; nomenclature standard
METLIN Metabolite Database | http://metlin.scripps.edu/ | Human specific metabolite database; name, structure, ID only
PubChem | http://pubchem.ncbi.nlm.nih.gov/ | Database containing 27 million unique chemicals with links to Bioassays and PubMed abstracts

doi:10.1371/journal.pcbi.1002805.t002

Below the navigation panel is a simple text query box that supports general text queries of the entire textual
content of the database. Mousing over the "Browse" button allows users to choose between two browsing options,
SMP-BROWSE and SMP-TOC. SMP-TOC is a scrollable hyperlinked table of contents that lists all pathways by name
and category. SMP-BROWSE is a more comprehensive browsing tool that provides a tabular synopsis of SMPDB's
content with thumbnail images of the pathway diagrams, textual descriptions of the pathways, as well as lists of
the corresponding chemical components and enzyme/protein components. This browse view allows users to scroll
through the database, select different pathway categories or re-sort its contents. Clicking on a given thumbnail
image or the SMPDB pathway button brings up a full-screen image for the corresponding pathway. Once opened, the
pathway image may be expanded by clicking on the "Zoom" button located at the top and bottom of the image. An
image legend link is also available beside the "Zoom" button.

At the top of each pathway image is a pathway synopsis contained in a yellow

Table 3. Alphabetical List of Pharmaceutical Compound or Drug Databases.

Database Name | URL or Web Address | Comments

DailyMed | http://dailymed.nlm.nih.gov/ | A drug database containing FDA label (package inserts) for most approved drugs
DrugBank | http://www.drugbank.ca/ | Comprehensive database of 1480 drugs with 1700 drug targets; contains chemical, biological & clinical data; extensive search utilities
PharmGKB | http://www.pharmgkb.org/ | Data on 1587 approved drugs including pharmacogenomic data on 287 drugs; provides mechanistic, pathway information for 68 different drugs
SuperTarget | http://bioinf-tomcat.charite.de/supertarget/ | Searchable database of drugs and drug targets; includes 2500 target proteins, which are annotated with about 7300 literature-mined relations to 1500 different drugs
TTD (Therapeutic Target Database) | http://xin.cz3.nus.edu.sg/group/ttd/ttd.asp | Contains data on 1894 drug targets for 5126 drugs; limited chemical data; no clinical or pharmacological data

doi:10.1371/journal.pcbi.1002805.t003



box while at the bottom of each image is a list of references. On the right of each pathway image is a grey-green
Highlight/Analyzer tool with a list of the key metabolites/drugs and enzymes/proteins found in the pathway.
Checking on selected items when in the SMP-Highlight mode will cause the corresponding metabolite or protein in
the pathway image to be highlighted with a red box. Entering concentration or relative expression values
(arbitrary units) beside compound or protein names, when in the SMP-Analyzer mode, will cause the corresponding
metabolites or proteins to be highlighted with differing shades of green or red to illustrate increased or
decreased concentrations. As with most pathway databases, all of the chemical structures and proteins/enzymes
illustrated in SMPDB's diagrams are hyperlinked to other on-line databases or tables. Specifically, all
metabolites, drugs or proteins shown in the SMP-BROWSE tables or in a pathway diagram are linked to HMDB,
DrugBank or UniProt [37] respectively. Therefore, clicking on a chemical or protein image will open a new
browser window with the corresponding DrugCard, MetaboCard or UniProt table being displayed.

The most powerful search option in SMPDB is SMP-MAP, which offers both multi-identifier searches as well as
Omic (transcriptomic, proteomic or metabolomic) mapping. In contrast to SMP-BROWSE, which is used for data
browsing and single entity highlighting, SMP-MAP can be used for multi-entity highlighting and mapping. In
particular SMP-MAP allows users to enter lists of chemical names, gene names, protein names, UniProt IDs,
GenBank IDs, Agilent IDs or Affymetrix IDs (with or without concentration data) and to have a table generated of
pathways containing those components. The resulting table, like the SMP-BROWSE table, displays a thumbnail image
of the matching pathways along with the list of matching components (metabolites, drugs, proteins, etc.). The
table is ordered by the number of matches and a significance score (calculated via a hypergeometric function),
with the pathway having the most matches being placed at the top. Clicking on the thumbnail image or the SMPDB
pathway button brings up a full-screen image for the corresponding pathway with all the matching components
(metabolites, drugs, proteins, etc.) highlighted in red. Concentration data can be displayed using a
red-to-yellow gradient by entering concentration data in a text box located beside the map image.

SMPDB's Search menu also offers users a choice of searching the database by chemical structure (ChemQuery), text
(TextQuery) or sequence (SeqSearch). The ChemQuery option allows users to draw (using the MarvinSketch applet)
or write (using a SMILES string) a chemical compound and to search SMPDB for drugs and metabolites similar or
identical to the query compound. The TextQuery button supports a more sophisticated text search (partial word
matches, data field selection, Boolean text searches, case sensitive, misspellings, etc.) of the text portion of
SMPDB, including the accompanying pathway explanations and reference sections. The SeqSearch button allows users
to conduct BLASTP (protein) sequence searches of the protein sequences contained in SMPDB. SeqSearch supports
both single and multiple sequence BLAST queries.

To summarize, SMPDB allows users to interactively explore, through detailed pathway diagrams, the linkage between
metabolites, genes or proteins and metabolic diseases. It also allows users to investigate the connection between
drugs and their protein or gene targets through comprehensive illustrations of their mechanism of action. Because
of its detailed depictions of both disease and drug pathways and its extensive use of visualization and query
tools, SMPDB can potentially support a variety of translational bioinformatic/cheminformatic questions. For
example, through SMPDB it is possible for users to: 1) identify a metabolic disease or medical condition given a
list of metabolites (via SMP-MAP); 2) use experimental gene expression data to identify which diseases,
conditions or pathways are most affected by a given drug, dietary or chemical treatment (via SMP-MAP); 3) use
metabolomic or metabolite expression data to help understand or rationalize specific metabolic diseases,
conditions or biomarkers (through SMP-MAP); 4) determine the similarity of a newly found/synthesized compound to
an existing drug (via the ChemQuery search); 5) determine the possible mechanism of action or protein targets for
a newly found/synthesized compound (via the ChemQuery search); 6) ascertain whether a certain protein found in
bacteria, fungi or viruses

Table 4. Alphabetical Listing of Toxic Compound Databases.

Database Name | URL or Web Address | Comments

ACToR (Aggregated Computational Toxicology Resource) | http://actor.epa.gov/actor/faces/ACToRHome.jsp | Contains aggregated data on 2,500,000 environmental chemicals; searchable by chemical name and structure; data includes chemical structure, physico-chemical values, in vitro assay data and in vivo toxicology data
ATDB (Animal Toxin Database) | http://protchem.hunnu.edu.cn/toxin/index.jsp | Database with >3800 peptide toxins; provides sequence data on peptide/protein toxins from venomous insects and animals
CTD (Comparative Toxicogenomics Database) | http://ctd.mdibl.org/ | Data on >5000 chemicals with literature-derived information on chemical-gene interactions
SuperToxic | http://bioinformatics.charite.de/supertoxic/ | Contains data on 60,000 toxic compounds and some target data; provides chemical and toxicity information; can predict the toxicity of query compounds
T3DB (Toxin, Toxin-Target Database) | http://www.t3db.org/ | Searchable database of 3100 common toxins and 1400 target proteins; provides extensive structural, physiological, mechanistic, medical and biochemical information

doi:10.1371/journal.pcbi.1002805.t004



Figure 1. A pathway diagram for Phenylketonuria as taken from SMPDB (http://www.smpdb.ca).
doi:10.1371/journal.pcbi.1002805.g001

could be a drug target (via the SeqSearch query); 7) ascertain whether a newly identified human protein, such as
an isoform or paralogue, may be a drug target or a disease indicator (through the SeqSearch query); or 8) use the
pathway visualization and mapping tools to explain or teach others about metabolic diseases, basic metabolism or
drug action.
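
The SMP-MAP ranking described in this section (pathways ordered by the number of matches and by a significance score calculated via a hypergeometric function) might look roughly like the following minimal Python sketch. The text does not give SMPDB's exact formula, so the upper-tail test, the background `population` size, and the names `hypergeom_pvalue` and `rank_pathways` are assumptions made purely for illustration, and the pathway contents in the usage lines are made up.

```python
from math import comb
from typing import Dict, List, Set, Tuple


def hypergeom_pvalue(population: int, pathway_size: int,
                     query_size: int, matches: int) -> float:
    """Upper-tail hypergeometric probability P(X >= matches).

    X is the number of pathway members seen when `query_size` items are
    drawn without replacement from `population` items, of which
    `pathway_size` belong to the pathway. Assumed scoring, not SMPDB's own.
    """
    upper = min(pathway_size, query_size)
    total = comb(population, query_size)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, query_size - k)
        for k in range(matches, upper + 1)
    )
    return tail / total


def rank_pathways(pathways: Dict[str, Set[str]], query: Set[str],
                  population: int) -> List[Tuple[str, int, float]]:
    """Return (pathway, n_matches, p_value) rows, most matches first."""
    rows = []
    for name, members in pathways.items():
        hits = members & query
        if not hits:
            continue  # pathways with no matching component are not listed
        p = hypergeom_pvalue(population, len(members), len(query), len(hits))
        rows.append((name, len(hits), p))
    # Sort by descending match count, then by ascending p-value.
    rows.sort(key=lambda row: (-row[1], row[2]))
    return rows


if __name__ == "__main__":
    # Hypothetical pathway memberships, for illustration only.
    toy_pathways = {
        "Pathway A": {"m1", "m2", "m3"},
        "Pathway B": {"m1", "m7", "m8", "m9"},
    }
    for row in rank_pathways(toy_pathways, {"m1", "m2"}, population=10):
        print(row)
```

Sorting on the match count first and the p-value second mirrors the ordering stated in the text; a real implementation would also need to choose the background set of identifiers against which significance is measured.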



4. HMDB: A Resource for Biomarker Discovery and Disease Diagnosis

The Human Metabolome Database (HMDB) is the by-product of the Human Metabolome Project, a 3-year (2005-2008),
$7.5 million project that was aimed at collating, identifying and annotating all the endogenous metabolites in
the human body [38]. The HMDB is actually the largest and most comprehensive, organism-specific metabolomic
database assembled to date. It contains spectroscopic, quantitative, analytic and molecular-scale information
about human metabolites, their associated enzymes or transporters, their abundance and their disease-related
properties. The HMDB currently contains more than 8000 human metabolite entries that are linked to more than
45,000 different synonyms. These metabolites are further connected to 3360 distinct enzymes, which in turn are
linked to nearly 100 metabolic pathways and more than 150 disease pathways. More than 1000 metabolites have
disease-associated information, including both normal and abnormal metabolite concentration values. These
diagnostic metabolites or metabolite signatures are linked to more than 500 different diseases (genetic and
acquired). The HMDB also contains experimental metabolite concentration data for normal plasma, urine, CSF and/or
other biofluids for more than 5000 compounds. More than 900 compounds are also linked to experimentally acquired
reference 1H and 13C NMR and MS/MS spectra. The entire database, including text, sequence, structure and image
data, occupies nearly 30 Gigabytes, most of which can be freely downloaded.

The HMDB is a fully searchable database with many built-in tools for viewing, sorting and extracting metabolites,
biofluid concentrations, enzymes, genes, NMR or MS spectra and disease information. As with any web-enabled
database, the HMDB supports standard text queries (through the text search box located near the top of each
page). It also offers extensive support for higher-level database search and selection functions through a
navigation bar (located at the top of each page). The navigation bar has six pull-down menu tabs ("Home",
"Browse", "Search", "About", "Download" and "Contact Us"). The "Browse" tab allows users to select from six
browsing options including HMDB Browse, Disease Browse, PathBrowse, Biofluid Browse, HML Browse and ClassBrowse.
HMDB Browse allows users to search through the HMDB compound by compound through a series of hyperlinked,
synoptic summary tables. These metabolite tables can be rapidly browsed, sorted or reformatted in a manner
similar to the way PubMed abstracts may be viewed. Clicking on the MetaboCard button found in the leftmost column
of any given HMDB summary table opens a webpage describing the compound of interest in much greater detail. Each
MetaboCard entry contains more than 100 data fields with half of the information being devoted to chemical or
physico-chemical data and the other half devoted to biological or biomedical data. These data fields include a
comprehensive compound description, names and synonyms, structural information, physico-chemical data, reference
NMR and MS spectra, biofluid concentrations (normal and abnormal), disease associations, pathway information,
enzyme data, gene sequence data, protein sequence data, SNP and mutation data as well as extensive links to
images, references and other public databases such as KEGG [10], BioCyc [12], PubChem [18], ChEBI [17], PubMed,
PDB [7], SwissProt/UniProt [37], GenBank [5], and OMIM [39].

Outside of HMDB Browse, there are five other browsing options that allow users to explore or navigate the
database. Disease Browse allows users to view known metabolic disorders (as well as other diseases) and the
metabolites that are typically associated with these conditions. It also allows users to enter lists of
metabolites and to identify which diseases are characterized by perturbations to these metabolite levels.
PathBrowse allows users to browse through the custom-drawn HMDB pathway images. Each pathway is named and each
image is zoomable and extensively hyperlinked. Users may also search PathBrowse using lists of compounds
(obtained from a metabolomic experiment) and view hyperlinked tables that display all of the pathways that are
potentially affected. Biofluid Browse allows users to browse metabolite entries based on their concentrations and
the biofluids in which they are found. Users may select entries by biofluid type and sort the table by compound
name, HMDB ID, concentration, disease, age, or gender. HML Browse allows users to browse or search through the
Human Metabolome Library (HML). The HML is a library of ~1000 reference metabolites stored in -80 °C freezers at
the Human Metabolome Project Centre in Edmonton, Canada. ClassBrowse is designed to allow users to view compounds
according to their chemical class designation. Each displayed compound name is hyperlinked to an HMDB
MetaboCard. Users may search for compounds (via a text box) or select to view certain compound classes using a
pull-down menu located at the top of the ClassBrowse page.

In addition to the data browsing and sorting features already described, the HMDB also offers a local BLAST
search [4] that supports both single and multiple sequence queries, a Boolean text search based on KinoSearch
(http://www.rectangular.com/kinosearch/), a chemical structure search utility based on ChemAxon's MarvinView, a
relational data extraction tool, an MS spectral matching tool and an NMR spectral search tool (for identifying
compounds via MS or NMR data from other metabolomic studies). These can all be accessed via the database
navigation bar located at the top of every HMDB page.

HMDB's simple text search supports text matching, text match rankings and misspellings (offering suggestions for
incorrectly spelled words), and highlights text where the word is found. In addition to this simple text search,
HMDB's TextQuery function uses the same KinoSearch engine, but also supports more sophisticated text querying
functions (Boolean logic, multi-word matching and parenthetical groupings) as well as data-field-specific queries
such as finding the query word only in the "Compound Source" field.

The HMDB's structure similarity search tool (ChemQuery) is the equivalent to BLAST for chemical structures. Users
may sketch (through MarvinView's chemical sketching applet) or paste a SMILES string [40] of a query compound
into the ChemQuery window. Submitting the query launches a structure similarity search tool that looks for common
substructures from the query compound that match the HMDB's metabolite database. High scoring hits are presented
in a tabular format with hyperlinks to the corresponding MetaboCards (which in turn link to the protein target).
The ChemQuery tool allows users to quickly determine whether their compound of interest is a known metabolite or
chemically related to a known metabolite. In addition to these structure similarity searches, the ChemQuery
utility also supports compound searches on the basis of chemical formula and molecular weight ranges.

HMDB's BLAST search (SeqSearch) allows users to search through the HMDB via sequence similarity as opposed to
chemical similarity. A given gene or



protein sequence may be searched against the HMDB's sequence database of metabolically important enzymes and
transporters by pasting the FASTA formatted sequence (or sequences) into the SeqSearch query box and pressing the
"submit" button. A significant hit reveals, through the associated MetaboCard hyperlink, the name(s) or chemical
structure(s) of metabolites that may act on that query protein. With SeqSearch, metabolite-protein interactions
from model organisms (chimp, rat, mouse, dog, cat, etc.) may be mapped to these organisms via the human data in
the HMDB.

The HMDB's data extraction utility (Data Extractor) employs a simple relational database system that allows users
to select one or more data fields and to search for ranges, occurrences or partial occurrences of words or
numbers. The Data Extractor uses clickable web forms so that users may intuitively construct SQL-like queries.
The data extraction tool allows users to easily construct complex queries such as "find all diseases where the
concentration of homogentisic acid in urine is greater than 1 mM".

The NMR and MS search utilities allow users to upload spectra (for the MS search) or peak lists (for the NMR
search) and to search for matching compounds from the HMDB's collection of MS and NMR spectra. In particular, the
HMDB contains more than 2000 experimentally collected 1H and 13C NMR spectra for 900 pure compounds (most
collected in water at pH 7.0). It also contains approximately 3800 predicted 1H and 13C NMR spectra for 1900
other compounds for which authentic samples could not be acquired. The HMDB's mass spectra library contains 2400
MS/MS (Triple-Quad) spectra collected at 3 different collision energies for more than 800 pure compounds. The
HMDB's spectral search utilities allow both pure compounds and mixtures of compounds to be identified from their
MS or NMR spectra via peak matching algorithms. Compounds may also be identified or searched for by entering
their chemical formula or their mass (either their exact mass or a mass range). Figure 2 provides a screenshot
montage illustrating the types of viewing and searching options available in HMDB.

To summarize, the HMDB allows users to link endogenous metabolites (both their identity and their concentration)
to a variety of disease conditions, including metabolic disorders, genetic diseases, chronic (age-related)
disorders and a variety of infectious diseases. It also provides links between metabolites and their targets both
through descriptions of the compounds and their known biological roles and through the identification of known
pathways or catalyzing enzymes. In addition, the HMDB also supports the direct identification of potential
diagnostic biomarkers based on their mass, mass spectra or NMR spectra. Because of this linkage, the HMDB can
potentially support a variety of translational bioinformatic or cheminformatic queries. For example, through the
HMDB it is possible for users to: 1) identify a novel biomarker for a given condition or disease given an NMR or
GC/MS or MS/MS spectrum of the purified compound (via the MS/NMR search tools); 2) identify metabolites from a
biofluid mixture that has been analyzed by NMR, GC/MS or MS/MS (via the MS/NMR search tools); 3) identify a
disease or condition given a list of metabolites (via Disease Browse); 4) identify a pathway or process that has
been altered/perturbed given a list of metabolites obtained from a metabolomic experiment (via PathBrowse); 5)
determine normal and abnormal concentration ranges for metabolites in different biofluids (via Biofluid Browse);
6) obtain authentic standards of unique metabolites to confirm the diagnosis of a certain disease (via HML
Browse); 7) determine the similarity of a newly found/synthesized compound to an existing metabolite (via the
structure similarity search); 8) determine the possible mechanism of action or protein targets for a newly
discovered/synthesized metabolite or metabolite analogue (via the structure similarity search); 9) diagnose or
determine the cause of illnesses thought to be brought on by metabolite changes (through the text search); 10)
extract detailed information on metabolites, metabolic diseases or metabolic pathways (via the data extractor);
11) extract information on common metabolite classes (via the data extractor or ClassBrowse); 12) ascertain
whether a certain protein or protein homologue may also be involved in a metabolic process or pathway (via the
sequence search).

5. DrugBank: A Resource for Drug Discovery and Disease Treatment

As previously noted, DrugBank [32] is essentially a hybrid clinically and chemically oriented drug database that
links sequence, structure and mechanistic data about drug molecules with sequence, structure and mechanistic data
about their drug targets. DrugBank was one of the first electronic databases to provide the explicit linkage
between drugs and drug targets and this particular feature made DrugBank particularly popular. Another important
innovation in this database was the presentation of drug and drug target data in synoptic DrugCards (in analogy
to library cards or study flash-cards). This concept (which is now used in many other chemical-bioinformatic
databases) helped make DrugBank particularly easy to view and navigate. Currently DrugBank contains detailed
information on 1480 FDA-approved drugs corresponding to 28,447 brand names and synonyms. This collection includes
1281 synthetic small molecule drugs, 128 biotech (mostly peptide or protein) drugs and 71 nutraceutical drugs or
supplements. DrugBank also contains information on the 1669 different targets (protein, lipid or DNA molecules)
and metabolizing enzymes with which these drugs interact. Additionally the database maintains data on 187 illicit
drugs (i.e. those legally banned or selectively banned in most developed nations) and 64 withdrawn drugs (those
removed from the market due to safety concerns). Chemical, pharmaceutical and biological information about these
classes of drugs is extremely important, not only in understanding their adverse reactions, but also in being
able to predict whether a new drug entity may have unexpected chemical or functional similarities to a dangerous
or highly addictive drug.

As with the HMDB, the DrugBank website contains many built-in tools and a variety of customized features for
viewing, sorting, querying and extracting drug or drug target data. These include a number of higher-level
database searching functions such as a local BLAST [4] sequence search (SeqSearch) that supports both single and
multiple protein sequence queries (for drug target searching), a Boolean text search (TextSearch) for
sophisticated text searching and querying, a chemical structure search utility (ChemQuery) for structure matching
and structure-based querying as well as a relational data extraction tool (Data Extractor) for performing complex
queries.

The BLAST search (SeqSearch) is particularly useful for drug discovery applications as it can potentially allow
users to quickly and simply identify drug leads from newly sequenced pathogens. Specifically, a new sequence, a
group of sequences or even an entire proteome can be searched against DrugBank's database of known drug target
sequences by pasting the FASTA formatted sequence (or sequences) into the SeqSearch query box and pressing the
"submit" button. A



significant hit can reveal the name(s) or chemical structure(s) of potential drug leads that may act on that
query protein (or proteome). The structure similarity search tool (ChemQuery) can be used in a similar manner to
SeqSearch. For instance, users may sketch a chemical structure or paste a SMILES string [40] of a possible drug
lead or a drug that appears to be causing an adverse reaction into the ChemQuery window. After submitting the
query, the database launches a structure similarity search that looks for common substructures from the query
compound that match DrugBank's database of known drug or drug-like compounds. High scoring hits are presented in
a tabular format with hyperlinks to the corresponding DrugCards. The ChemQuery tool allows users to quickly
determine whether their compound of interest acts on the desired protein target or whether the compound of
interest may unexpectedly interact with unintended protein targets.

In addition to these search features, DrugBank also provides a number of general browsing tools for exploring the
database as well as several specialized browsing tools such as PharmaBrowse and GenoBrowse for more specific
tasks. For instance, PharmaBrowse is designed to address the needs of pharmacists, physicians and medicinal
chemists who tend to think of drugs in clusters of indications or drug classes. This particular browsing tool
provides navigation hyperlinks to more than 70 drug classes, which in turn list the FDA-approved drugs associated
with each class. Each drug name is then linked to its respective DrugCard. GenoBrowse, on the other hand, is
specifically designed to address the needs of geneticists or those specialists interested in specific Drug-SNP
relationships. This browsing tool provides navigation hyperlinks to more than 60 different drugs, which in turn
list the target genes, SNPs and the physiological effects associated with these drugs.

In addition to its utility as a general drug encyclopedia, DrugBank also contains several tables, data fields or
data types that are particularly useful for pharmacogenomic or pharmacogenetic studies. These include synoptic
descriptions of a given drug's Pharmacology as well as its Mechanism of Action, Contraindications, Toxicity,
Phase I Metabolizing Enzymes (name, protein sequence and SNPs), and associated Drug Targets (names, protein
sequence, DNA sequence, chromosome location, locus number and

of Action, Contraindications and Toxicity fields often includes details about any known adverse reactions. This
may include descriptions of known phase I or phase II enzyme interactions, alternate metabolic routes or the
existence of secondary drug targets. Secondary drug targets represent proteins (or other macromolecules) that are
different than the primary target for which the drug was initially designed or targeted towards. Some drugs may
have five or more targets, of which only one might be relevant to treating the disease. DrugBank uses a
relatively liberal interpretation of drug targets in order to help identify these secondary drug targets. In
particular, for DrugBank a drug target is defined as any macromolecule identified in the literature that binds,
transports or transforms a drug. The binding or transformation of a drug by a secondary drug target or an
"off-target" protein is one of the most common causes for unwanted side effects or adverse drug reactions (ADRs)
[41]. By providing a fairly comprehensive listing of secondary drug targets (along with their SNP information and
other genetic data), DrugBank is potentially able to provide additional insight into the underlying causes of a
patient's response to a given drug.

DrugBank also provides detailed sequence and SNP data on known drug metabolizing enzymes and known drug targets.
In particular DrugBank contains detailed summary tables about each of the SNPs for each of the drug targets or
drug metabolizing enzymes that have been characterized by various SNP typing efforts, such as the SNP Consortium
[42] and HapMap [43]. Currently DrugBank contains information on 26,292 coding (exon) SNPs and 73,328 non-coding
(intron) SNPs derived from known drug targets. It also has data on 1188 coding SNPs and 8931 non-coding SNPs from
known drug metabolizing enzymes. By clicking on the "Show SNPs" hyperlink listed beside either the metabolizing
enzymes or the drug target SNP field, the SNP summary table can be viewed. These tables include: 1) the reference
SNP ID (with a hyperlink to dbSNP); 2) the allele variants; 3) the validation status; 4) the chromosome location
and reference base position; 5) the functional class (synonymous, non-synonymous, untranslated, intron, exon); 6)
mRNA and protein accession links (if applicable); 7) the reading frame (if applicable); 8) the amino acid change
(if existent); 9) the allele frequency

sequence of the gene fragment with the SNP highlighted in a red box.

The purpose of these SNP tables is to allow one to go directly from a drug of interest to a list of potential
SNPs that may contribute to the reaction or response seen in a given patient or in a given population. In
particular, these SNP lists may serve as hypothesis generators that allow SNP or gene characterization studies to
be somewhat more focused or targeted. By comparing the experimentally obtained SNP results to those listed in
DrugBank for that drug (and its drug targets) it may be possible to ascertain which polymorphism for which drug
target or drug metabolizing enzyme may be contributing to an unusual drug response. Obviously these
database-derived SNP suggestions may require additional experimental validation to prove their causal
association.

DrugBank also includes two tables that provide much more explicit information on the relationship between drug
responses/reactions and gene variant or SNP data. The two tables, which are accessible from the GenoBrowse
submenu located on DrugBank's Browse menu bar, are called SNP-FX (short for SNP-associated effects) and SNP-ADR
(short for SNP-associated adverse drug reactions). SNP-FX contains data on the drug, the interacting protein(s),
the "causal" SNPs or genetic variants for that gene/protein, the therapeutic response or effects caused by the
SNP-drug interaction (improved or diminished response, changed dosing requirements, etc.) and the associated
references describing these effects in more detail. SNP-ADR follows a similar format to SNP-FX but the clinical
responses are restricted only to adverse drug reactions (ADR). SNP-FX contains literature-derived data on the
therapeutic effects or therapeutic responses for more than 70 drug-polymorphism combinations, while SNP-ADR
contains data on adverse reactions compiled from more than 50 drug-polymorphism pairings. All of the data in
these tables is hyperlinked to drug entries from DrugBank, protein data from SwissProt, SNP data from dbSNP and
bibliographic data from PubMed. A screenshot of the SNP-ADR table is shown in Figure 3. As can be seen from the
figure, these tables provide consolidated, detailed and easily accessed information that clearly identifies those
SNPs that are known to affect a given drug's efficacy, toxicity or metabolism.

To summarize, DrugBank allows users to link drugs to a variety of disease
SNPs). The information contained in as measured in African, European and conditions or health indications. It also
DrugBanks Pharmacology, Mechanism Asian populations (if available) and 10) the provides links between drugs and their

PLOS Computational Biology | www.ploscompbiol.org 10 December 2012 | Volume 8 | Issue 12 | e1002805


Figure 2. A screenshot montage illustrating the types of viewing and searching options available in HMDB (http://www.hmdb.ca).
doi:10.1371/journal.pcbi.1002805.g002



targets both through descriptions of the mechanism of action and through the identification of known protein (or gene) targets. Because of this kind of extensive data linkage, DrugBank can potentially support a number of translational bioinformatic or cheminformatic questions. For example, through DrugBank it is possible for users to: 1) determine the similarity of a newly found/synthesized compound to an existing drug (via the structure similarity search); 2) determine the possible mechanism of action or protein targets for a newly found/synthesized compound (via the structure similarity search); 3) diagnose or determine the cause of illnesses thought to be brought on by adverse drug reactions (through the text search or SNP-ADR/SNP-FX); 4) treat or find references to the treatment of illnesses based on symptoms or disease diagnosis (via the text search); 5) extract information on common drug targets (via the data extractor or the sequence search); 6) extract information on common drug classes or structures (via the data extractor or the structure search); 7) ascertain whether a certain protein found in bacteria, fungi or viruses could be a drug target (via the sequence search); or 8) ascertain whether a newly identified human protein, such as an isoform or paralogue, may be a drug target (through the sequence search).

6. T3DB: A Resource Linking Small Molecules to Disease & Toxicity

A toxic substance is a small molecule, peptide, or protein that is capable of causing injury, disease, genetic mutations, birth defects or death. Toxins, both natural and man-made, represent an important class of poisonous compounds that are ubiquitous in nature, in homes, and in the workplace. Common toxins include pollutants, pesticides, preservatives, drugs, venoms, food toxins, cosmetic toxins, dyes, and cleaning compounds. Because toxic compounds are essentially disease-causing agents, it has long been recognized that there is a need to associate toxic compound data with molecular toxicology and clinical symptomology. While this has been done in a variety of toxicology textbooks and medical reference manuals, it has only recently been done using electronic databases and the tools associated with bioinformatics and cheminformatics.

T3DB [36] is currently the only chemical-bioinformatic database that provides in-depth, molecular-scale information about toxins, their associated targets, their toxicology, their toxic effects and their potential treatments. T3DB currently contains over 3000 toxic substance entries corresponding to more than 34,000 different synonyms. These toxins are further connected to some 1450 protein targets through almost 35,500 toxin and toxin-target associations. These associations are supported by more than 5400 references. The entire database, including text, sequence, structure and image data, occupies nearly 16 Gigabytes of data, most of which can be freely downloaded.

As with HMDB and DrugBank, the T3DB is designed to be a fully searchable web resource with many built-in tools and features for viewing, sorting and extracting toxin and toxin-target annotation, including structures and gene and protein sequences. A screenshot montage illustrating the types of viewing and searching options available is shown in Figure 4. As with HMDB and DrugBank, the T3DB supports standard text queries through the text search box located on the home page. It also offers general database browsing using the "Browse" button located in the T3DB navigation bar. To facilitate browsing, the T3DB is divided into synoptic summary tables which, in turn, are linked to more detailed "ToxCards", in analogy to the DrugCard concept found in DrugBank [32] or the MetaboCard in HMDB [26]. All of the T3DB's summary tables can be rapidly browsed, sorted or reformatted in a manner similar to the way PubMed abstracts may be viewed. Clicking on the ToxCard button, found in the leftmost column of any given T3DB summary table, opens a webpage describing the toxin of interest in much greater detail. Each ToxCard entry contains over 80 data fields, with ~50 data fields devoted to chemical and toxicological/medical data and ~30 data fields (each) devoted to describing the toxin target(s).

A ToxCard begins with various identifiers and descriptors (names, synonyms, compound description, structure image, related database links and ID numbers), followed by additional structure and physico-chemical property information. The remainder of data on the toxin is devoted to providing detailed toxicity and toxicological data, including route of delivery, mechanism of action, medical information, and toxicity measurements. All of a toxin's targets are also listed within the ToxCard. Each of these targets is described by some 30 data fields that include both chemical and biological (sequence, molecular weight, gene ontology terms, etc.) information, as well as details on their role in the mechanism of action of the toxin. In addition to providing comprehensive numeric, sequence and textual data, each ToxCard also contains hyperlinks to other databases, abstracts, digital images and interactive applets for viewing the molecular structures of each toxic substance.

A key feature that distinguishes the T3DB from other on-line toxin or toxicology resources is its extensive support for higher-level database search and selection functions. In addition to the data viewing and sorting features already mentioned, the T3DB also offers a local BLAST search that supports both single and multiple sequence queries, a boolean text search based on KinoSearch, a chemical structure search utility based on ChemAxon's MarvinView, and a relational data extraction tool similar to that found in DrugBank and the HMDB [26,32]. These can all be accessed via the database navigation bar located at the top of every T3DB page.

T3DB's simple text search box (located at the top of most T3DB pages) supports text matching, text match rankings and misspellings, and highlights text where the word is found. In addition to this simple text search, T3DB's TextQuery function supports more sophisticated text querying functions including "and" and "or" queries, multi-word matching and parenthetical groupings as well as data-field-specific queries such as finding the query word only in the "Compound Source" field. Additional details and examples are provided on the T3DB's TextQuery page.

T3DB's sequence searching utility (SeqSearch) allows users to search through T3DB's collection of 1450 known (human) toxin targets. This service potentially allows users to identify both orthologous and paralogous targets for known toxins or toxin targets. It also facilitates the identification of potential toxin targets from other animal species. With SeqSearch, gene or protein sequences may be searched against the T3DB's sequence database of identified toxin-target sequences by pasting the FASTA formatted sequence (or sequences) into the SeqSearch query box and pressing the "submit" button.

T3DB's structure similarity search tool (ChemQuery) can be used in a similar manner as its SeqSearch tool. Users may sketch a chemical structure (through ChemAxon's freely available chemical sketching applet) or paste a SMILES string of a query compound into the ChemQuery window. Submitting the query launches a structure similarity search that looks for common substructures from the query



Figure 3. A screen shot of DrugBank's SNP-ADR table. This displays information on adverse drug reactions (ADRs) and the SNPs (single nucleotide polymorphisms) associated with certain drugs and drug targets (http://www.drugbank.ca).
doi:10.1371/journal.pcbi.1002805.g003

compound that matches the T3DB's database of known toxic compounds. Users can also select the type of search (exact or Tanimoto score) to be performed. High scoring hits are presented in a tabular format with hyperlinks to the corresponding ToxCards (which, in turn, link to the targets). The ChemQuery tool allows users to quickly determine whether their compound of interest is a known toxin or chemically related to a known toxin and which target(s) it may act upon. In addition to these structure similarity searches, the ChemQuery utility also supports compound searches on the basis of SMILES strings (under the SMILES tab) and molecular weight ranges (under the Molecular Weight tab).

The T3DB's data extraction utility (Data Extractor) employs a simple relational database system that allows users to select one or more data fields and to search for ranges, occurrences or partial occurrences of words, strings, or numbers. The data extractor uses clickable web forms so that users may intuitively construct SQL-like queries. Using a few mouse clicks, it is relatively simple to construct complex queries ("find all toxins that target acetylcholinesterase and are pesticides") or to build a series of highly customized tables. The output from these queries is provided in HTML format with hyperlinks to all associated ToxCards.

To summarize, T3DB allows users to link toxic substances to a variety of disease conditions, including acute toxicity, long-term toxicity, birth defects, cancer and other illnesses. It also provides links between toxic substances and their targets both through descriptions of the mechanism of action and through the identification of



known protein (or gene) targets. Because of this kind of extensive data linkage, T3DB can potentially support a variety of bioinformatic or cheminformatic queries. For example, through T3DB it is possible for users to: 1) determine the similarity of a newly found/synthesized compound to an existing toxin (via the structure similarity search); 2) determine the possible mechanism of action or protein targets for a newly found/synthesized compound (via the structure similarity search); 3) diagnose or determine the cause of illnesses thought to be brought on by exposure to a given toxin (through the text search); 4) treat or find references to the treatment of illnesses brought on by exposure to a given toxin (via the text search); 5) extract information on common toxin targets (via the data extractor); 6) extract information on common toxin classes (via the data extractor); 7) ascertain whether a certain protein or protein homologue may also be a toxin target (via the sequence search); or 8) ascertain whether a newly identified peptide or protein may be a toxin (through the sequence search).

7. Software for Interpreting Small Molecule and Disease Data

With the recent emergence of chemical-bioinformatic databases having a solid translational (i.e. biomedical) functionality, the way has been cleared for the development of software tools that exploit these databases. This is a natural process in both bioinformatics and cheminformatics, as databases typically appear before the software applications that use them are developed. Given that the field of chemical bioinformatics is still quite young and the number of databases with disease and small molecule information is still relatively small, it is not surprising to find that the number of software tools developed to exploit these databases is still quite small. Here we will briefly describe two recently developed software tools, PolySearch and MSEA, that exploit the data in SMPDB, HMDB and DrugBank to perform a number of useful applications.

7.1 Text Mining with PolySearch

PolySearch [44] is a freely available, web-based text-mining tool that allows users to search through large numbers of PubMed abstracts to make large-scale linkages or associations. Examples of large-scale associations are: "Find all genes associated with breast cancer" or "Find all diseases treatable by tamoxifen". In order to conduct the first query using PubMed, one would have to have a list of all known human genes and perform 25,000+ queries with each gene name and the words "breast cancer". To conduct the second query, it would be necessary to have a list of all known diseases (more than 5000 are known) and perform 5000+ queries with the word "tamoxifen" included in each query. Obviously this would take a person a very long time. However, using a computer to perform these repeated queries would be much less tedious and much faster. PolySearch is designed to rapidly perform these types of expansive queries by exploiting the PubMed application programming interface (API) and a special collection of dictionaries and thesauruses compiled from various bioinformatic and chemical-bioinformatic databases. In particular, the typical query supported by PolySearch is "Given X, find all Y's" where X or Y can be diseases, tissues, cell compartments, gene/protein names, SNPs, mutations, drugs and metabolites. The disease names and synonyms in PolySearch are derived from medical dictionaries and MeSH (medical subject headings), gene and protein names/synonyms are derived from UniProt, drug names/synonyms are derived from DrugBank while metabolites and metabolite synonyms are derived from the HMDB. Obviously, without these small molecule dictionaries or thesauruses, many of PolySearch's queries could not be performed.

PolySearch also exploits a variety of techniques in text mining and information retrieval to identify, highlight and rank informative abstracts, paragraphs or sentences. A central premise to PolySearch's search strategy is the assumption that the greater the frequency with which an X and Y association occurs within a collection of abstracts, the more significant the association is likely to be. For instance, if COX2 is mentioned in PubMed as being associated with colon cancer 510 times but thioredoxin is associated with colon cancer only once, then one is more likely to have more confidence in the COX2-colon cancer association. Frequency alone is not always the best way to rate a paper or a website for its relevancy. Therefore, in addition to counting the frequency of apparent associations, PolySearch employs a specially developed text-ranking scheme to score the most relevant sentences and abstracts that associate both the query and match terms with each other.

In summary, PolySearch is able to exploit the name and synonym sets from a number of small-molecule and disease databases (HMDB, DrugBank, MeSH, OMIM) thereby allowing users to perform a range of text mining queries on the PubMed abstract database. In particular, PolySearch allows users to find newly described or previously unknown (to the user, at least) associations between: 1) drugs and disease; 2) metabolites and disease; 3) genes/proteins and disease; 4) drugs and drug targets; 5) metabolites and metabolizing enzymes; 6) SNPs and disease and 7) mutations and disease. In addition, through its other query fields or query options, PolySearch is able to perform a large number (>50) of other text queries that may be relevant to a variety of applications in translational bioinformatics.

7.2 Metabolite Set Enrichment Analysis

The Metabolite Set Enrichment Analysis (MSEA) server [45] is a web-based tool designed to help researchers identify and interpret patterns of human or mammalian metabolite concentration changes in a biologically meaningful context. It is based on the concepts originally developed for gene expression or microarray analysis called Gene Set Enrichment Analysis or GSEA [8]. The central idea behind GSEA is to directly investigate the enrichment of pre-defined groups of functionally related genes (or gene sets) instead of individual genes. This group-based approach does not require pre-selection of genes with an arbitrary threshold. Instead, functionally related genes are evaluated together as gene sets, allowing additional biological information to be incorporated into the analysis process. Key to the development of GSEA has been the compilation of libraries or databases of gene expression changes that are associated with specific conditions, pathways, diseases or perturbations. Therefore in order to develop MSEA, it was necessary to extract a large body of metabolite expression changes (i.e. chemical profiles) and metabolic pathway information from a variety of databases. Fortunately, the existence of SMPDB and HMDB made the compilation of this metabolite expression library relatively easy. By downloading the freely available data in HMDB and SMPDB, the authors of MSEA were able to construct a collection of five metabolite set libraries containing ~1,000 biologically meaningful groups of metabolites. In MSEA, a group of metabolites is considered to constitute a metabolite set if they are known to be: a) involved in the same biological processes (e.g., metabolic pathways, signaling pathways); b) changed significantly under the same pathological conditions (e.g., various metabolic diseases); and c) present in the same locations



Figure 4. A screenshot montage illustrating the types of viewing and searching options available in T3DB (http://www.t3db.org).
doi:10.1371/journal.pcbi.1002805.g004



such as organs, tissues or cellular organelles. The resulting metabolite sets were organized into three categories: pathway-associated, disease-associated, and location-based. MSEA's pathway-associated metabolite library contains 84 entries based on the 84 human metabolic pathways found in SMPDB. MSEA's disease-associated metabolite sets were mainly collected from information in the HMDB, the Metabolic Information Center (MIC), and SMPDB. Using these resources, a total of 851 physiologically informative metabolite sets were created. These disease-associated metabolite sets were further divided into three sub-categories based on the biofluids in which they were measured: 398 metabolite sets in blood, 335 in urine, and 118 in cerebral-spinal fluid (CSF). MSEA's location-based library contains 57 metabolite sets based on the "Cellular Location" and "Tissue Location" listed in the HMDB.

While the exact statistical or enrichment analysis methods used in MSEA are well beyond the scope of this chapter, suffice it to say that MSEA essentially allows one to take lists of metabolites and to identify which pathways, diseases or medical conditions are most likely to be associated with that metabolite set. It is also possible to do the same kind of operation with a list of metabolites and their absolute (or relative) concentrations. While disease/metabolite associations can be made through HMDB and SMPDB, these primitive search tools do not have the same statistical rigor that characterizes a full-fledged enrichment analysis. Furthermore, the MSEA pathway and disease data set is somewhat larger than what is found in the HMDB or SMPDB. This means that MSEA will be far more likely to find a useful (and statistically significant) pathway or disease than what could be done with HMDB or SMPDB.

Overall, MSEA is an example of an analytical software tool that exploits chemical-bioinformatic data to perform robust statistical analyses of metabolomic or clinical chemistry data. Given their close similarity, it is reasonable to expect MSEA could eventually be integrated with GSEA, thereby allowing a comprehensive analysis of both gene and metabolite expression changes on a single integrated program or website. No doubt this kind of integrated "omic" analysis tool is not far away from being developed.

Further Reading

• Villas-Boas SG, Nielson J, Smedsgaard J, Hansen MAE, Roessner-Tunali U, editors (2007) Metabolome analysis: an introduction. New York: John Wiley & Sons.
• Wishart DS (2008) DrugBank and its relevance to pharmacogenomics. Pharmacogenomics 9: 1155–1162.
• Krawetz S, editor (2009) Bioinformatics for systems biology. Totowa: Humana Press.
• Wishart DS (2008) Applications of metabolomics in drug discovery and development. Drugs R D 9: 307–322.
• Baxevanis A (2003) Current protocols in bioinformatics. New York: John Wiley & Sons. See Chapter 14.

8. Summary

With today's focus on genes and proteins as the primary causes or biomarkers of disease, the relationship between small molecules and human disease is often overlooked. However it is important to remember that more than 95% of all diagnostic clinical assays are designed to detect small molecules (e.g., blood glucose, serum creatinine, amino acid analysis, etc.). Likewise nearly 90% of all known drugs are small molecules, 50% of all drugs are derived from pre-existing metabolites and 30% of identified genetic disorders involve diseases of small molecule metabolism. Clearly, small molecules are important and given the rapid growth in metabolomics, pharmacogenomics and systems biology, it is likely that their role in disease diagnosis and disease treatment will continue to grow. Given these exciting growth prospects and given the importance of small molecules in medicine and translational research, scientists are now realizing that there is a critical need to link information about small molecules to their corresponding "big" molecule targets. This has led to the emergence of a new field of bioinformatics called chemical bioinformatics.

This chapter has covered several topics related to chemical bioinformatics and the role that chemical bioinformatics can play in identifying the chemicals that cause (toxins), cure (drugs) or characterize (biomarkers) many human diseases. The first part of the chapter gave a brief overview of some of the most important or widely used chemical bioinformatic resources along with a more detailed discussion of some of the major classes of chemical-bioinformatic databases. In particular four major database classes were described: 1) small molecule (or metabolic) pathway databases; 2) metabolite or metabolomic databases; 3) drug databases; and 4) toxin or toxic substance databases. Examples of each of these databases were given and many of their strengths and limitations were discussed. While most of these chemical-bioinformatic databases provide links between small molecules and their large molecule targets, relatively few provide linkages to clinical, physiological or disease information.

The second part of this chapter focused on describing a number of recently developed databases that explicitly relate small molecules to disease. This included detailed descriptions of four databases: 1) the Small Molecule Pathway Database (SMPDB); 2) the Human Metabolome Database (HMDB); 3) DrugBank and 4) the Toxin, Toxin-Target Database (T3DB). SMPDB is a graphically oriented pathway database that contains ~450 metabolic pathways, disease pathways and drug pathways. The HMDB is a comprehensive metabolomic database that is primarily oriented to answering questions in clinical metabolomics and clinical biochemistry. DrugBank is a comprehensive drug database containing detailed information about drugs, drug targets and clinical pharmacology. The T3DB is a toxicology database containing detailed information about toxins, toxin targets and their corresponding toxicological information. Each of these databases was described in terms of its content, general design and query/search functions. Additionally, explicit examples of various translational or disease-related applications were provided for each database.

The final part of this chapter provided a short discussion of some of the newly emerging software tools that exploit these databases, including PolySearch and MSEA (Metabolite Set Enrichment Analysis). PolySearch is a text-mining tool that exploits the synonym data found in these small molecule databases to allow expansive PubMed queries to be performed. MSEA is a metabolomic analysis tool that exploits the pathway and disease information found in SMPDB and HMDB to perform pathway and disease identification from raw metabolomic data.
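As a rough illustration of the enrichment idea behind MSEA, the sketch below scores how strongly a query list of metabolites overlaps each pre-defined metabolite set using a hypergeometric (over-representation) test. This chapter does not describe MSEA's actual statistics, so the choice of test, the function names `hypergeom_pvalue` and `enrich`, the toy metabolite sets and the background size of 50 are all illustrative assumptions rather than MSEA's implementation or data.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: metabolites in the background, K: members of the metabolite set,
    n: metabolites in the query list, k: query metabolites found in the set.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def enrich(query, metabolite_sets, background_size):
    """Rank metabolite sets by over-representation in the query list."""
    query = set(query)
    results = []
    for name, members in metabolite_sets.items():
        hits = query & set(members)
        p = hypergeom_pvalue(len(hits), len(query), len(members), background_size)
        results.append((name, len(hits), p))
    # Smallest p-value (strongest over-representation) first.
    return sorted(results, key=lambda r: r[2])


# Hypothetical toy library; a real one would be compiled from SMPDB/HMDB downloads.
toy_sets = {
    "set_A": ["m1", "m2", "m3", "m4"],
    "set_B": ["m5", "m6", "m7", "m8", "m9"],
}

ranking = enrich(["m1", "m2", "m3", "m9"], toy_sets, background_size=50)
for name, hits, p in ranking:
    print(f"{name}: {hits} hits, p = {p:.4g}")
```

In this toy example the query shares three of its four metabolites with `set_A`, so `set_A` ranks first. A real analysis would also correct the p-values for the number of metabolite sets tested.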



While the sub-discipline of chemical bioinformatics is still quite young, and the number of tools for translational applications is still relatively small, it should be clear that what is now out there has considerable potential for a wide range of clinical, biomedical, pharmaceutical and toxicological applications. Certainly as more tools are developed and as more databases evolve, it is likely that chemical bioinformatics will soon be able to establish itself as one of the most medically useful sub-disciplines in the entire field of bioinformatics.

Glossary

Cheminformatics - a field of information technology that uses computers to facilitate the collection, storage, analysis and manipulation of large quantities of chemical data.

DrugBank - A database containing chemical and biological data on drugs and drug targets.

GSEA - Gene Set Enrichment Analysis. GSEA is a statistically based bioinformatic method designed to directly investigate the enrichment of pre-defined groups of functionally related genes (or gene sets) from gene expression data.

HMDB - The Human Metabolome Database. A database containing chemical and biological data on human metabolites aimed at clinical metabolomic studies.

MS - Mass Spectrometry. An analytical method that measures molecular weight of compounds based on their mass to charge ratio. Mass spectrometry is one of the standard methods to determine the molecular formula of new compounds and to confirm the identity of synthesized chemicals or natural products.

Metabolome - the collection of all small molecule metabolites found in a given cell, tissue, organ or organism.

Metabolomics - a branch of "omics" research that is primarily concerned with the high-throughput identification and quantification of small molecule (<1500 Da) metabolites in the metabolome.

MSEA - Metabolite Set Enrichment Analysis. MSEA is a statistically based bioinformatic method designed to directly investigate the enrichment of pre-defined groups of functionally related metabolites (or metabolite sets) from metabolomic data.

NMR - Nuclear Magnetic Resonance Spectroscopy. An analytical method that measures nuclear magnetism under very high magnetic fields. NMR is the standard method used by chemists today to identify and characterize small molecules.

Pharmacogenomics - A newly emerging field of pharmacology that integrates genotyping and gene expression data with classical pharmacological and adverse drug reaction studies.

SMPDB - The Small Molecule Pathway Database. A database containing pathway diagrams and interactive viewing tools for small molecules involved in metabolism, drug action and disease.

T3DB - The Toxin, Toxin-Target Database. A database with chemical and biological data on common toxins, poisons, household chemicals, pollutants and other harmful substances.

Toxicogenomics - A newly emerging field of toxicology that integrates genotyping and gene expression data with classical toxicological and toxicity studies.

9. Exercises

1) A compound with a molecular weight of 136.053 daltons has been isolated from the urine of a 3 month-old baby with unusually light coloring of the skin, eczema (an itchy skin rash), and a musty body odor. What compound is it and what disease might this baby have?

2) Your natural product chemist neighbor has just isolated a compound from the Tanzanian periwinkle, a rare plant species found only in the highlands of Eastern Tanzania. Locals use the plant as a treatment for a variety of blood disorders. The structure of the compound is given by the following SMILES string: COC1=CC=C2C(=CC1=O)C(CCC1=CC(OC)=C(OC)C(OC)=C21)NC(CO)
What compound is this similar to, what diseases could it be used to treat and what proteins might it bind?

3) A viral protein with the following sequence has been isolated from a number of dead and dying African Green Monkeys that were housed at a local zoo.
PQVTLYQRPLVTIRVGGQLKEALIDTGADD
TVLENMNLPGRWKPKMIGAIAGFIKVKQYDQI
TVEICGHKGIGTILVGPTPVNIIGRNLLTLIG
CTLNF
The illness seems to be spreading to other monkey colonies in the zoo. What drugs could be used to treat the sick monkeys and to prevent the spread of the disease?

4) A farmer who has just finished harvesting his barley field has come into the clinic complaining of skin irritation, burning and itching, a rash and a series of skin blisters. He also has eye pain, conjunctivitis, burning sensations about the eyes, and blurred vision. Other symptoms have included nausea, vomiting and fatigue. Suspecting that he may have been exposed to some toxin or pesticide, a chemical analysis has been performed of his blood, urine and lacrimal (tear) fluid. MS analysis of all three fluids has identified an unusual compound with a molecular weight of 296.126 daltons. What compound might this be?

5) What kind of drugs can be used to treat breast cancer? Describe your search strategy and your rationale for this search strategy.

Answers to the Exercises can be found in Text S1.

Supporting Information

Text S1 Answers to Exercises.
(DOC)
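Exercises 1 and 4 both start from a measured molecular weight, which in practice means a mass-window search against a compound table of the kind held in HMDB or T3DB. A minimal sketch of such a lookup follows. The function names, the 0.005 Da tolerance and the three-compound table are illustrative assumptions (the table deliberately does not contain the exercise answers), and the atomic masses are standard monoisotopic values; it is assumed here that the quoted weights are monoisotopic masses.

```python
import re

# Standard monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "P": 30.9737620,
    "S": 31.9720707,
}


def monoisotopic_mass(formula):
    """Sum element masses for a simple molecular formula such as 'C6H12O6'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


def mass_search(query_mass, compounds, tol=0.005):
    """Return (name, mass) pairs within +/- tol Da of the query, closest first."""
    hits = []
    for name, formula in compounds.items():
        mass = monoisotopic_mass(formula)
        if abs(mass - query_mass) <= tol:
            hits.append((name, mass))
    return sorted(hits, key=lambda hit: abs(hit[1] - query_mass))


# Hypothetical stand-in for a database table of compound names and formulas.
TOY_TABLE = {
    "glucose": "C6H12O6",
    "creatinine": "C4H7N3O",
    "urea": "CH4N2O",
}

for name, mass in mass_search(180.063, TOY_TABLE):
    print(f"{name}: {mass:.4f} Da")
```

The 0.005 Da window is arbitrary; in practice it would be chosen to match the mass accuracy of the instrument, and any hits would then be weighed against the clinical picture.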



References
1. Trujillo E, Davis C, Milner J (2006) Nutrige- 16. Fahy E, Sud M, Cotter D, Subramaniam S (2007) resources for exploring drug-target relationships.
nomics, proteomics, metabolomics, and the LIPID MAPS online tools for lipid research. Nucleic Acids Res 36: D919922.
practice of dietetics. J Am Diet Assoc 106: 403 Nucleic Acids Res 35: W606612. 32. Wishart DS, Knox C, Guo AC, Shrivastava S,
413. 17. Degtyarenko K, de Matos P, Ennis M, Hastings J, Hassanali M, et al. (2006) DrugBank: a compre-
2. Feng X, Liu X, Luo Q, Liu BF (2008) Mass Zbinden M, et al. (2008) ChEBI: a database and hensive resource for in silico drug discovery and
spectrometry in systems biology: an overview. ontology for chemical entities of biological exploration. Nucleic Acids Res 34: D668672.
Mass Spectrom Rev 27: 635660. interest. Nucleic Acids Res 36: D344350. 33. Schmidt U, Struck S, Gruening B, Hossbach J,
3. Brown FK. (1998) Chemoinformatics: what is it 18. Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, et Jaeger IS, et al. (2009) SuperToxic: a compre-
and how does it impact drug discovery. Annu al. (2009) PubChem: a public information system hensive database of toxic compounds. Nucleic
Rep Med Chem 33: 375384. for analyzing bioactivities of small molecules. Acids Res 37: D295299.
4. Altschul SF, Madden TL, Schaffer AA, Zhang J, Nucleic Acids Res 37: W623633. 34. Judson R, Richard A, Dix D, Houck K, Elloumi
Zhang Z, et al. (1997) Gapped BLAST and PSI- 19. Williams AJ (2008) Public chemical compound F, et al. (2008) ACToRAggregated Computa-
BLAST: a new generation of protein database databases. Curr Opin Drug Discov Devel 11: tional Toxicology Resource. Toxicol Appl Phar-
search programs. Nucleic Acids Res 25: 3389 393404. macol 233: 713.
3402. 20. Shinbo Y, Nakamura Y, Altaf-Ul-Amin M, Asahi 35. Davis AP, Murphy CG, Saraceni-Richards CA,
5. Benson DA, Karsch-Mizrachi I, Lipman DJ, K, Kuokawa M, et al. (2006) KNApSAcK: a Rosenstein MC, Wiegers TC, et al. (2009)
Ostell J, Sayers EW (2010) GenBank. Nucleic comprehensive species-metabolite relationship Comparative Toxicogenomics Database: a
Acids Res 38: D4651. database. Biotech Agri Forestry 57: 165181. knowledgebase and discovery tool for chemical-
6. McGuffin LJ, Bryson K, Jones DT (2000) The 21. Smith CA, OMaille G, Want EJ, Qin C, Traguer gene-disease networks. Nucleic Acids Res 37:
PSIPRED protein structure prediction server. SA, et al. (2005) METLIN: a metabolite mass D786792.
Bioinformatics 16: 404405. spectral database. Ther Drug Monit 27: 747751. 36. Lim E, Pon A, Djoumbou Y, Knox C, Shrivas-
22. Ulrich EL, Akutsu H, Doreleijers JF, Harano Y, tava S, et al. (2010\) T3DB: a comprehensively
7. Westbrook J, Feng Z, Jain S, Bhat TN, Thanki N,
Ioannidis YE, et al. (2008) BioMagResBank. annotated database of common toxins and their
et al. (2002) The Protein Data Bank: unifying the
Nucleic Acids Res 36: D402408. targets. Nucleic Acids Res 38: D781786.
archive. Nucleic Acids Res 30: 245248.
23. Taguchi R, Nishijima M, Shimizu T (2007) Basic
8. Subramanian A, Tamayo P, Mootha VK, 37. Boutet E, Lieberherr D, Tognolli M, Schneider
analytical systems for lipidomics by mass spec-
Mukherjee S, Ebert BL, et al. (2005) Gene set M, Bairoch A (2007) UniProtKB/Swiss-Prot.
trometry in Japan. Methods Enzymol 432: 185
enrichment analysis: a knowledge-based ap- Methods Mol Biol 406: 89112.
211.
proach for interpreting genome-wide expression 38. Wishart DS (2007) Proteomics and the human
24. Cui Q, Lewis IA, Hegeman AD, Anderson ME,
profiles. Proc Natl Acad Sci U S A 102: 15545 metabolome project. Expert Rev Proteomics 4:
Li J, et al. (2008) Metabolite identification via the
15550. 333335.
Madison Metabolomics Consortium Database.
9. Edgar R, Domrachev M, Lash AE (2002) Gene Nat Biotechnol 26: 162164. 39. Hamosh A, Scott AF, Amberger J, Valle D,
Expression Omnibus: NCBI gene expression and 25. Kopka J, Schauer N, Krueger S, Birkemeyer C, McKusick VA (2000) Online Mendelian Inheri-
hybridization array data repository. Nucleic Acids Usadel B, et al. (2005) GMD@CSB.DB: the tance in Man (OMIM). Hum Mutat 15: 5761.
Res 30: 207210. Golm Metabolome Database. Bioinformatics 21: 40. Weininger D (1988) SMILES 1. Introduction and
10. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita 16351638. encoding rules. J Chem Inf Comput Sci 28: 31
KF, Itoh M, et al. (2006) From genomics to 26. Wishart DS, Knox C, Guo AC, Eisner R, Young 38.
chemical genomics: new developments in KEGG. N, et al. (2009) HMDB: a knowledgebase for the 41. Shoshan MC, Linder S (2008) Target specificity
Nucleic Acids Res 34: D354357. human metabolome. Nucleic Acids Res 37: and off-target effects as determinants of cancer
11. Karp PD, Riley M, Paley SM, Pelligrini-Toole A D603610. drug efficacy. Expert Opin Drug Metab Toxicol
(1996) EcoCyc: an encyclopedia of Escherichia 27. Polen HH, Zapantis A, Clauson KA, Jebrock J, 4: 273280.
coli genes and metabolism. Nucleic Acids Res 24: Paris M (2008) Ability of online drug databases to 42. Thorisson GA, Stein LD (2003) The SNP
3239. assist in clinical decision-making with infectious Consortium website: past, present and future.
12. Krummenacker M, Paley S, Mueller L, Yan T, disease therapies. BMC Infect Dis 8: 153. Nucleic Acids Res 31: 124127.
Karp PD (2005) Querying and computing with 28. Hatfield CL, May SK, Markoff JS (1999) Quality 43. International HapMap Consortium (2007) A
BioCyc databases. Bioinformatics 21: 34543455. of consumer drug information provided by four second generation human haplotype map of over
13. Joshi-Tope G, Gillespie M, Vastrik I, DEustachio Web sites. Am J Health Syst Pharm 56: 2308 3.1 million SNPs. Nature 449: 851861.
P, Schmidt E, et al. (2005) Reactome: a knowl- 2311. 44. Cheng D, Knox C, Young N, Stothard P,
edgebase of biological pathways. Nucleic Acids 29. Zhu F, Han B, Kumar P, Liu X, Ma X, et al. Damaraju S, et al. (2008) PolySearch: a web-
Res 33: D428432. (2010) Update of TTD: Therapeutic Target based text mining system for extracting relation-
14. Pico AR, Kelder T, van Iersel MP, Hanspers K, Database. Nucleic Acids Res 38: D787791. ships between human diseases, genes, mutations,
Conklin BR, et al. (2008) WikiPathways: pathway 30. Sangkuhl K, Berlin DS, Altman RB Klein TE drugs and metabolites. Nucleic Acids Res 36:
editing for the people. PLoS Biol 6: e184. (2008) PharmGKB: understanding the effects of W399405.
doi:10.1371/journal.pbio.0060184 individual genetic variants. Drug Metab Rev 40: 45. Xia J, Wishart DS (2010) MSEA: a web-based
15. Frolkis A, Knox C, Lim E, Jewison T, Law V, et al. 539551. tool to identify biologically meaningful patterns in
(2010) SMPDB: The Small Molecule Pathway 31. Gunther S, Kuhn M, Dunkel M, Campillos M, quantitative metabolomic data. Nucleic Acids Res
Database. Nucleic Acids Res 38: D480487. Senger C, et al. (2008) SuperTarget and Matador: 38: W717.



Education

Chapter 4: Protein Interactions and Disease


Mileidy W. Gonzalez¹, Maricel G. Kann²*
¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
² Biological Sciences, University of Maryland, Baltimore County, Baltimore, Maryland, United States of America

Abstract: Proteins do not function in isolation; it is their interactions with one another and also with other molecules (e.g. DNA, RNA) that mediate metabolic and signaling pathways, cellular processes, and organismal systems. Due to their central role in biological function, protein interactions also control the mechanisms leading to healthy and diseased states in organisms. Diseases are often caused by mutations affecting the binding interface or leading to biochemically dysfunctional allosteric changes in proteins. Therefore, protein interaction networks can elucidate the molecular basis of disease, which in turn can inform methods for prevention, diagnosis, and treatment. In this chapter, we will describe the computational approaches to predict and map networks of protein interactions and briefly review the experimental methods to detect protein interactions. We will describe the application of protein interaction networks as a translational approach to the study of human disease and evaluate the challenges faced by these approaches.

This article is part of the "Translational Bioinformatics" collection for PLOS Computational Biology.

Citation: Gonzalez MW, Kann MG (2012) Chapter 4: Protein Interactions and Disease. PLoS Comput Biol 8(12): e1002819. doi:10.1371/journal.pcbi.1002819
Editor: Fran Lewitter, Whitehead Institute, United States of America
Published December 27, 2012
This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Funding: This research was funded by the National Institutes of Health (NIH) 1K22CA143148 to MGK (PI); ACS-IRG grant to MGK (PI), and R01LM009722 to MGK (co-investigator). MWG's research was supported in part by the Intramural Research Program of the NIH, National Library of Medicine. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: mkann@umbc.edu

PLOS Computational Biology | www.ploscompbiol.org 1 December 2012 | Volume 8 | Issue 12 | e1002819

What to Learn in This Chapter
• Experimental and computational methods to detect protein interactions
• Protein networks and disease
• Studying the genetic and molecular basis of disease
• Using protein interactions to understand disease

1. Introduction
Early biological experiments revealed proteins as the main agents of biological function. As such, proteins ultimately determine the phenotype of all organisms. Since the advent of molecular biology we have learned that proteins do not function in isolation; instead, it is their interactions with one another and also with other molecules (e.g. DNA, RNA) that mediate metabolic and signaling pathways, cellular processes, and organismal systems.
The concept of "protein interaction" is generally used to describe the physical contact between proteins and their interacting partners. Proteins associate physically to create macromolecular structures of various complexities and heterogeneities. Proteins interact in pairs to form dimers (e.g. reverse transcriptase), multi-protein complexes (e.g. the proteasome for molecular degradation), or long chains (e.g. actin filaments in muscle fibers). The subunits creating the various complexes can be identical or heterogeneous (e.g. homodimers vs. heterodimers) and the duration of the interaction can be transient (e.g. proteins involved in signal transduction) or permanent (e.g. some ribosomal proteins). However, protein interactions do not always have to be physical [1]. The term "protein interaction" is also used to describe metabolic or genetic correlations, and even co-localizations. Metabolic interactions describe proteins involved in the same pathway (e.g. the Krebs cycle proteins), while genetically identified associations identify co-expressed or co-regulated proteins (e.g. enzymes regulating the glycolytic pathway). As the name implies, protein interactions by co-localization list proteins found in the same cellular compartment.
Whether the association is physical or functional, protein-protein interaction (PPI) data can be used at a larger scale to map networks of interactions [2,3]. In PPI network graphs, the nodes represent the proteins and the lines connecting them represent the interactions between them (Figure 1). Protein interaction networks are useful resources in the abstraction of basic science knowledge and in the development of biomedical applications. By studying protein interaction networks we can learn about the evolution of individual proteins and about the different systems in which they are involved. Likewise, interaction maps obtained from one species can be used, with some limitations, to predict interaction networks in other species. Protein interaction networks can also suggest functions for previously uncharacterized proteins by uncovering their role in pathways or protein complexes [4]. Due to their central role in biological function, protein interactions also control the mechanisms leading to healthy and diseased states in organisms. Diseases are often caused by mutations affecting the binding interface or leading to biochemically dysfunctional allosteric changes in proteins. Therefore, protein interaction networks can elucidate the molecular basis of disease, which in turn can inform methods for prevention, diagnosis, and treatment [5,6].
The study of human disease experienced extensive advancements once the biomedical characterization of proteins shifted to studies taking into account a protein's network at different functional levels (i.e. in pair-wise interactions, in complexes, in pathways, and in whole genomes). For instance, consider how our understanding of Huntington's disease (HD) has evolved from the early Mendelian single-gene studies to the latest HD-specific network-based analyses. HD is an autosomal dominant neurodegenerative disease with features recognized by Huntington in 1872 [7], and whose specific patterns of inheritance were documented in 1908 [8]. After almost a century of genetics studies, the culprit gene in HD was identified; in 1993, we learned that HD was caused by the repeat expansion of a CAG trinucleotide in the Huntingtin (Htt) gene [9]. This expansion causes aggregation of the mutant Htt in insoluble neuronal inclusion bodies, which consequently leads to neuronal degeneration. Yet, even when the key disease-causing protein in HD had been identified, the mechanism for Htt aggregation remained unknown. In 2004, Goehler et al. [10] mapped all the PPIs that take place in HD and discovered that the interaction between Htt and GIT1, a GTPase-activating protein, mediates Htt aggregation. Further validation [11,12] confirmed GIT1's potential as a target for therapeutic strategies against HD.
In this chapter, we will describe the main experimental methods to identify protein interactions and the computational approaches to map their networks and to predict new interactions purely in silico. We will describe the application of protein interaction networks as a translational approach to the study of human disease and evaluate the challenges faced by these approaches.

2. Experimental Identification of PPIs
2.1 Biophysical Methods
Protein interactions are identified through different biochemical, physical, and genetic methods (Figure 2). Historically, the main source of knowledge about protein interactions has come from biophysical methods, particularly from those based on structural information (e.g. X-ray crystallography, NMR spectroscopy, fluorescence, atomic force microscopy). Biophysical methods identify interacting partners and also provide detailed information about the biochemical features of the interactions (e.g. binding mechanism, allosteric changes involved). Yet, since they are time- and resource-consuming, biophysical characterizations only permit the study of a few complexes at a time.

2.2 High-Throughput Methods
To document protein interactions at a larger scale, automated methods have been developed to detect interactions directly or to deduce them through indirect approaches (Figure 2).
2.2.1 Direct high-throughput methods. Yeast two-hybrid (Y2H) is one of the most commonly used direct high-throughput methods. The Y2H system tests the interaction of two given proteins by fusing one of them to the DNA-binding domain and the other to the activation domain of a transcription factor. If the proteins interact, the transcription factor is reconstituted and transcribes a reporter gene whose product can be detected. Since it is an in vivo technique, the Y2H system is highly effective at detecting transient interactions and can be readily applied to screen large genome-wide libraries (e.g. to map an organism's full set of interactions or "interactome"). However, the Y2H system is limited by its bias toward non-specific interactions. Likewise, Y2H cannot identify complexes (i.e. it only reports binary interactions) or interactions of proteins initiating transcription by themselves. Although protein interactions are usually detected and studied in pair-wise form, in reality they often occur in complexes and as part of larger networks of interaction. In vitro direct detection methods (e.g. mass spectrometry, affinity purification) are better suited to detect macromolecular interactions, yet they have their own limitations: interactions occurring in vitro do not necessarily occur in vivo (e.g. when proteins are compartmentalized in different cell locations) and complexes are often difficult to purify, which is a required step in the protocol [13].
2.2.2 Indirect high-throughput methods. Several high-throughput methods deduce protein interactions by looking at characteristics of the genes encoding the putative interacting partners. For instance, gene co-expression is based on the assumption that the genes of interacting proteins must be co-expressed to provide the products for protein interaction. Expression profile similarity is calculated as a correlation coefficient between relative expression levels and subsequently compared against a background distribution for random non-interacting proteins. Synthetic lethality, on the other hand, introduces mutations in two separate genes, which are viable alone but lethal when combined, as a way to deduce physically interacting proteins [14].

3. Computational Predictions of PPIs
As discussed in section 2, experimental approaches provide the means to either empirically characterize protein interactions at a small scale or to detect them at a large scale. Still, experimental detections only generate pair-wise interaction relationships, and with incomplete coverage (because of experimental biases toward certain protein types and cellular localizations). Experimental identification methods also exhibit an unacceptably high fraction of false positive interactions and often show low agreement when generated by different techniques [15–17]. Experimental biophysical methods can complement the high-throughput detections by providing specific interaction details, but they are expensive, extremely laborious, and can only be implemented for a few complexes at a time.
Computational methods for the prediction of PPIs provide a fast and inexpensive alternative to complement experimental efforts. Computational interaction studies can be used to validate experimental data and to help select potential targets for further experimental screening [18]. More importantly, computational methods give us the ability to study proteins within the context of their interaction networks at different functional levels (i.e. at the complex, pathway, cell, or organismal level), thus allowing us to convert lists of pair-wise relationships into complete network maps. Since they are based on different principles, computational techniques can also uncover functional relationships and even provide information about interaction details (e.g. domain interactions), which may elude some experimental methods.

3.1 Computational Methods for PPI Predictions
Computational interaction prediction methods can be classified into two types: methods predicting protein domain interactions from existing empirical data about protein-protein interactions and methods relying entirely on theoretical information to predict protein-protein or domain-domain interactions (Figure 2).
3.1.1 Empirical predictions. The computational techniques based on experimental data use the relative frequency of interacting domains [19], maximum likelihood estimation of domain interaction probability [20,21], co-expression [22], or network properties [23–27] to predict protein and domain interactions. The main disadvantage of empirical computations is that, by relying on an existing protein network to infer new nodes, they propagate the inaccuracies of the experimental methods.
3.1.2 Theoretical predictions. Theoretical techniques to predict PPIs



Figure 1. A PPI network of the proteins encoded by radiation-sensitive genes in mouse, rat, and human, reproduced from [89].
Yellow nodes represent the proteins and blue lines show the interactions between them. The radiation-related genes were text-mined from PubMed
and the protein interaction information was obtained from HPRD.
doi:10.1371/journal.pcbi.1002819.g001
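
A network such as the one in Figure 1 can be held in memory as an undirected graph in which each protein is a node and each pair-wise interaction is an edge. The sketch below is only an illustration under stated assumptions: the protein names and interaction pairs are placeholders rather than data from HPRD or from [89], and `build_network` and `degree` are hypothetical helper names introduced here.

```python
from collections import defaultdict


def build_network(pairs):
    """Build an undirected PPI graph as {protein: set of partners}."""
    network = defaultdict(set)
    for a, b in pairs:
        if a == b:
            continue  # self-interactions are ignored in this sketch
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def degree(network):
    """Return the number of interaction partners of each protein."""
    return {protein: len(partners) for protein, partners in network.items()}


# Placeholder pair-wise interactions (hypothetical names, not real data).
pairs = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5")]

network = build_network(pairs)
degrees = degree(network)
# The most connected protein in this toy network.
hub = max(degrees, key=degrees.get)
```

Storing partners in sets makes the graph symmetric by construction and discards duplicate reports of the same interaction, which is convenient when pair lists from several sources are merged.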



Figure 2. A diagram of the different experimental and computational methods to characterize, detect, and predict PPIs.
doi:10.1371/journal.pcbi.1002819.g002

incorporate a variety of biological considerations; they take advantage of the fact that interacting proteins coevolve to preserve their function (e.g. mirrortree, phylogenetic profiling [28–35]), occur in the same organisms (e.g. [36,37]), conserve gene order (e.g. gene neighbors method [38,39]) or are fused in some organisms (e.g. the Rosetta Stone method [40,41]).

3.2 Theoretical Predictions of PPIs Based on Coevolution
Below, we will expand on two methods generating theoretical PPI predictions through coevolutionary signal detection either at the residue or at the full-sequence level.
3.2.1 Coevolution at the residue level. Pairs of residues within the same protein can coevolve because of three-dimensional proximity or shared function [42]. The intramolecular correlations of interacting protein partners can be used to predict intermolecular coevolution. Residue-based coevolution methods measure the set of correlated pair mutations in each protein. A pair of proteins is assumed to interact if they show enrichment of the same correlated mutations [42].
3.2.2 Coevolution at the full-sequence level. Methods detecting coevolution at the full-sequence level are based on the idea that changes in one protein are compensated by correlated changes in its interacting partner to preserve the interaction [29,30,42–45]. Therefore, as interacting proteins coevolve, they tend to have phylogenetic trees with topologies that are more similar than expected by chance [46]. The coevolution of interacting proteins was first qualitatively observed for polypeptide growth factors, neurotransmitters, and immune system proteins with their respective receptors [47]. Several methodologies have been developed to measure coevolution at the full-sequence level, and among them, the mirrortree method is one of the most intuitive and accurate options. As shown in Figure 3, mirrortree measures coevolution for a given pair of proteins by i) identifying the orthologs of both proteins in common species, ii) creating a multiple sequence alignment (MSA) of each protein and its orthologs, iii) from the MSAs, building distance matrices, and iv) calculating the correlation coefficient between the distance matrices. The mirrortree correlation coefficient is used for measuring tree similarity, thereby allowing the evaluation of whether the proteins in question coevolved [28–35].
The mirrortree method has been successfully implemented to confirm experimental interactions in E. coli [4], S. cerevisiae [48], and H. sapiens [49]. But the degree of similarity between the phylogenetic trees is strongly affected by the sequence divergence driven by the underlying speciation process [4,50]. Therefore, two proteins may have similar phylogenetic trees due only to common speciation events, but they may not necessarily be interacting partners. By subtracting the signal from speciation events, Pazos et al. [4] and Sato et al. [50] showed improvements for the performance of the mirrortree method. One approach creates a speciation vector from the distance matrices derived from



Figure 3. A schema of the mirrortree method for predicting interacting proteins. The orthologs of two proteins (A and B from the same species)
are used to construct two multiple sequence alignments (MSAs). Distance matrices, which implicitly represent evolutionary trees, are constructed from the
MSAs. Each matrix square represents the tree distance between two orthologs and dark colors represent closeness. The two distance matrices are
compared using linear correlation. A high correlation between the distance matrices suggests interaction between proteins A and B.
doi:10.1371/journal.pcbi.1002819.g003
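
The final step of the schema in Figure 3, the linear correlation between two distance matrices, can be sketched in a few lines. This is a minimal illustration rather than a published mirrortree implementation: it assumes that ortholog selection and alignment are already done, that both matrices list the same species in the same order, and that no speciation correction is applied. The function names (`upper_triangle`, `pearson`, `mirrortree_score`) and the distance values are hypothetical.

```python
import math


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / math.sqrt(var_x * var_y)


def mirrortree_score(dist_a, dist_b):
    """Correlate two ortholog distance matrices that share a species order."""
    if len(dist_a) != len(dist_b):
        raise ValueError("matrices must cover the same species")
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


# Made-up ortholog distances for protein A across four species.
dist_a = [
    [0.0, 0.1, 0.4, 0.5],
    [0.1, 0.0, 0.3, 0.6],
    [0.4, 0.3, 0.0, 0.2],
    [0.5, 0.6, 0.2, 0.0],
]
# Protein B: same tree shape, evolving twice as fast (also made up).
dist_b = [[2 * d for d in row] for row in dist_a]

score = mirrortree_score(dist_a, dist_b)
```

Only the upper triangle is used because distance matrices are symmetric with a zero diagonal, so the remaining entries carry no extra information. In this toy case the two matrices differ only by a scale factor, so the score is at its maximum; real protein pairs would need to be compared against a background of non-interacting pairs before a high score is read as evidence of interaction.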

the ribosomal 16S sequences for prokaryotes (18S for eukaryotes), while the other uses the average distance of all proteins in a pair of organisms. Both methods subtract the speciation vector from the original distance matrix constructed for the given protein pair.
In principle, to characterize protein interactions at a systems level, all protein-protein and domain-domain interactions in a given organism must be catalogued. The mirrortree method is a suitable option to complement experimental detections because it is inexpensive and fast. Moreover, mirrortree only requires the proteins' sequences as input and thus can be used to analyze proteins for which no other information is available. Since mirrortree predictions are based on different principles than any other computational or experimental techniques, they can also uncover functional relationships eluding other methods. Still, the mirrortree approach is subject to several limitations. One limitation is the minimum number of orthologs it requires. Selecting orthologs in large families with many paralogs is also a considerable challenge for mirrortree [49]. In addition, coevolution does not necessarily take place uniformly across the sequence; different sites may coevolve at different rates based on functional constraints. Thus, coevolution signals vary when measured across the entire sequence vs. at the domain level [51].

4. Protein Networks and Disease
4.1 Studying the Genetic Basis of Disease
The majority of our current knowledge about the etiology of various diseases comes from approaches aiming to uncover their genetic basis. In the near future, the ability to generate individual genome data using next generation sequencing methods promises to change the field of translational bioinformatics even more.
Since the inception of Mendelian genetics in the 1900s, great effort has gone into cataloguing the genes associated with individual diseases. A gene can be isolated based on its position in the chromosome by a process known as positional cloning [52]. A few examples of human disease-related genes identified by positional cloning include the genes associated with cystic fibrosis [53], HD [9], and breast cancer susceptibility [54,55]. Even in simple Mendelian diseases, however, the correlation between the mutations in the patient's genome and the symptoms is often unclear [56]. Several reasons have been suggested for this apparent lack of correlation between genotype and phenotype, including pleiotropy, influence of other genes, and environmental factors.
Pleiotropy occurs when a single gene produces multiple phenotypes. Pleiotropy complicates disease elucidation because a mutation in a pleiotropic gene may have an effect on some, all, or none of its traits. Therefore, mutations in a single gene may cause multiple syndromes or only cause disease in some of the biological processes the gene mediates. Establishing which genotypes are responsible for the perturbed phenotype of interest is not straightforward.
Genes can influence one another in several ways; genes can interact synergistically (as in epistasis), or they can modify one another (e.g. the expression of one gene might affect the expression of another). Cystic fibrosis and Becker muscular dystrophy, previously considered classical examples of Mendelian patterns of inheritance, are now believed to be caused by a mutation of one gene which is modified by other genes [57,58]. Thus, even simple Mendelian diseases can lead to complex genotype-phenotype associations [59].
Environmental factors (e.g. diet, infection by bacteria) are also major determinants of disease phenotype expression, often acting in combination with other genotype-phenotype association confounders (i.e. pleiotropy and gene modifiers). In fact, most common diseases such as cancer, metabolic, psychiatric and cardio-vascular disorders (e.g. diabetes, schizophrenia and hypertension) are believed to be caused by several genes (multigenic) and are affected by several environmental factors [60].

4.2 Studying the Molecular Basis of Disease
Much can be learned from documenting the genes associated with a particular



disease (e.g. identifying risk factors that might be used for diagnostic purposes). Yet, to understand the biological details of pathogenesis and disease progression and to subsequently develop methods for prevention, treatment and even diagnosis, it is necessary to identify the molecules and the mechanisms triggering, participating in, and controlling the perturbed biological process. Deciphering the molecular mechanisms leading to diseased states is an even bigger challenge than elucidating the genetic basis of complex diseases [61]. Even when the genetic basis of a disease is well understood, not much is known about the molecular details leading to the disorders.
4.2.1 The role of protein interactions in disease. Protein interactions provide a vast source of molecular information; their interactions (with one another, DNA, RNA, or small molecules) are involved in metabolic, signaling, immune, and gene-regulatory networks. Since protein interactions mediate the healthy states in all biological processes, it follows that they should be the key targets of the molecular-based studies of biological diseased states.
Disease-causing mutations affecting protein interactions can lead to disruptions in protein-DNA interactions, protein misfolds, new undesired interactions, or can enable pathogen-host protein interactions.
Protein-DNA interaction disruptions are most clearly illustrated by the p53 tumor suppressor protein and its role in cancer. Mutations in p53's DNA-binding domain destroy its ability to bind to its target DNA sequences, thus preventing transcriptional activation of several anti-cancer mechanisms it mediates (e.g. apoptosis, genetic

they can lead to the loss of vital cellular functions (due to misfolding and aggregation) and can cause cytotoxicity [11].
Pathogen-host protein interactions also play a key role in bacterial and viral infections by facilitating the hijacking of the host's metabolism for microbial need. The interaction between the Human papillomavirus (HPV) and its host provides one of the most striking examples of the centrality of protein interactions in infectious diseases. HPV infection occurs in a large fraction of the population (75–80% of Americans [64]) by generating lesions of the anogenital tract and for some it leads to cancer. Upon infection, the HPV genome is frequently integrated into the host genome, but only two viral genes (E6 and E7) are retained and expressed. Remarkably, the interactions of only two viral proteins with the host's proteins are enough to cause HPV-induced carcinogenesis. E6 and E7 bypass the immune system by interacting with important negative cell regulatory proteins to target them for degradation and, thus, inactivation. These two proteins also inhibit cellular terminal differentiation, induce cellular transformation and immortalization of the host cells, and direct the proliferation of the tumorigenically-transformed cells [65].
4.2.2 Using PPI networks to understand disease. PPI networks can help identify novel pathways to gain basic knowledge of disease. Note that pathways are different from PPI networks. PPI networks map the physical or functional interaction between protein pairs, resulting in a complex grid of connections (Figure 1). Pathways, on the other hand, represent genetic, metabolic, signaling, or neural processes as a series of sequential

interaction subnetworks to yield pathway hypotheses that can be used to understand different aspects of disease progression [67,68]. See Table 1 for useful resources incorporating pathway and PPI information in disease elucidation.
Mapping interactomes provides the opportunity to identify disease pathways by identifying key subnetworks. In 2005, Rual et al. [69] mapped the human protein interactome. Below are some of the findings that have been uncovered when combining PPI and pathway analysis since then.

(i) Over 39,000 protein interactions have been identified in the human cell [70].
(ii) Disease genes are generally non-essential and occupy peripheral positions in the human interactome [71], although, in a few diseases like cancer, disease genes tend to encode highly-connected proteins (hubs) [72,73].
(iii) Disease genes tend to cluster together and co-occur in central network locations [6].
(iv) Proteins involved in similar phenotypes (e.g. all cancer proteins) are highly interconnected [73].
(v) Viral networks differ significantly from cellular networks, which raises the hypothesis that other intracellular pathogens might also have distinguishing topologies [74].
(vi) Etiologically unrelated diseases often present similar symptoms because separate biological processes often use common molecular pathways [75].

PPI networks can be used to explore the
stability, and inhibition of angiogenesis). biochemical reactions where substrates differences between healthy and diseased
Protein misfolding can result in disruptions are changed in a linear fashion. For states. Building interaction networks for
of protein-protein interactions, as occurs instance, the glycolysis pathway maps the systems under different conditions (e.g.
in the Von Hippel-Lindau syndrome conversion of glucose to pyruvate through wild type vs. mutant, presence of environ-
(VHL)VHL is a rare condition in which a linear chain of ten different steps. mental factor vs. its absence) might be the
hemangioblastomas are formed in the Pathway analysis alone cannot uncover key to understanding the differences
cerebellum, spinal cord, kidney, and the molecular basis of disease. When between healthy and pathological states.
retina. A mutation from Tyrosine to performing pathway analysis to study The work by Charlesworth et al. [76] on
Histidine at residue 98 on the binding site disease, differential expression experi- the perturbation of the canonical path-
disrupts binding of the VHL protein to the ments are the main source of protein ways and networks of interactions when
hypoxia-inducible factor (HIF) protein. As candidates. However, most of the gene humans are exposed to cigarette smoke
a result, the VHL protein no longer expression candidates are useless to path- illustrates the potential of such approach-
degrades the HIF protein, which leads to way-based analysis of disease because the es. As one might expect, this study found
the expression of angiogenic growth fac- majority of human genes have not been that the smoking-susceptible genes were
tors and local proliferation of blood vessels assigned to a pathway. Protein interaction overrepresented in pathways involved in
[62,63]. networks can be used to identify novel several aspects of cell death (cell cytotox-
New undesired protein interactions are the pathways. Protein interaction subnetworks icity, cell lysis), cancer (e.g. tumorigenesis),
main causes of several diseases, including tend to group together the proteins that and respiratory functions. A somewhat
Huntingtons disease (see introduction), interact in functional complexes and more unexpected finding, however, con-
cystic fibrosis, and Alzheimers disease. pathways [66]. Thus, new methods are firmed that exposure to the smoke envi-
New interactions alter homeostasis since being developed to accurately extract ronmental factor affected a large subnet-

PLOS Computational Biology | www.ploscompbiol.org 6 December 2012 | Volume 8 | Issue 12 | e1002819


Table 1. Pathway databases with disease information.

Resource  | Featured organisms       | Disease information                         | Website
KEGG      | Yeast, mouse, human      | Comprehensive                               | http://www.genome.jp/kegg/disease/
REACTOME  | Human + 20 other species | Sparse                                      | http://reactome.org/
SMPDB     | Human                    | Small molecules; metabolic disease pathways | http://www.smpdb.ca
PharmGKB  | Human                    | Gene-drug-disease relationships             | http://www.pharmgkb.org/index.jsp
NetPath   | Human                    | 10 immune and 10 cancer pathways            | http://www.netpath.org/index.html

doi:10.1371/journal.pcbi.1002819.t001

work of proteins involved in the immune-inflammatory response. This study gave new insights into how smoke causes disease: the exogenous toxicants in smoke perturb several protein interactions in the healthy cell state, thereby depressing the immune system, while disrupting the inflammation response. The study also explained why smoking cessation has some immediate health benefits; eliminating smoke exposure reverses the alterations at the transcriptomic level and restores the majority of normal protein interactions.

Protein interaction studies play a major role in the prediction of genotype-phenotype associations while also identifying new disease genes. The identification of disease-associated interacting proteins also identifies potentially interesting disease-associated gene candidates (i.e. the genes coding for the interacting proteins are putative disease-causing genes). One of the best ways to identify novel disease genes is to study the interaction partners of known disease-associated proteins [77]. Gandhi et al. [78] found that mutations on the genes of interacting proteins lead to similar disease phenotypes, presumably because of their functional relationship. Therefore, protein interactions can be used to prioritize gene candidates in studies investigating the genetic basis of disease [79]. Others have used the properties of protein interaction networks to differentiate disease from non-disease proteins. Based on this approach, Xu et al. [80] devised a classifier based on several topological features of the human interactome to predict genes related to disease. The classifier was trained on a set of non-disease and a set of disease genes (from OMIM) and applied to a collection of over 5,000 human genes. As a result, 970 disease genes were identified, a fraction of which were experimentally validated.

New diagnostic tools can result from genotype-phenotype associations established through PPIs. The genes of interacting proteins can be studied to identify the mutation(s) leading to the disruption of interactions seen in healthy individuals or to the creation of new interactions only present in the diseased states. For example, Rossin et al. used genome-wide association studies (GWAS) to identify regions with variations that predispose to immune-mediated diseases [81]. The GWAS studies provided a list of proteins found to interact in a preferential manner. The resulting disease single-nucleotide polymorphisms identified by GWAS studies such as that by Rossin et al. can be eventually incorporated into genotyping diagnostic tools.

Identifying disease subnetworks, and in turn pathways that get activated in diseased states, can provide markers to create new prognostic tools. For instance, using a protein-network-based approach, Chuang et al. [66] identified a set of subnetwork markers that accurately classify metastatic vs. non-metastatic tumors in individual patients. Metastasis is the leading cause of death in patients with breast cancer. However, a patient's risk for metastasis cannot be accurately predicted and it is currently only estimated based on other risk factors. When metastasis is deemed likely, breast cancer patients are prescribed aggressive chemotherapy, even when it might be unnecessary. By integrating protein networks with cancer expression profiles, the authors identified relevant pathways that become activated during tumor progression, which discriminate metastasis better than markers previously suggested by studies using differential gene expression alone.

Disease networks can inform drug design by helping to suggest key nodes as potential drug targets. Drug target identification constitutes a good example of the potential of integrating structural data with high-throughput data [82]. The structural details on binding or allosteric sites can be used to design molecules to affect protein function. On the other hand, reconstruction of the different protein networks (signaling, metabolic, regulatory, etc.) in which the potential target is involved can help predict the overall impact of the disruption. If, for example, the target is a hub (a highly connected protein), its inhibition may affect many activities that are essential for the proper function of the cell and might thus be unsuitable as a drug target. On the other hand, less connected nodes (e.g. nodes affecting a single disease pathway) could constitute vulnerable points of the disease-related network, which are better candidates for drug targets. The work by Yildirim and Goh [83] illustrates the advantages of evaluating drugs within the context of cellular and disease networks. This group created a drug-target network to map the relationships between the protein targets of all drugs and all disease-gene products. The topological analysis of the human drug-target network revealed that (i) most drugs target currently known targets; (ii) only a small fraction of disease genes encodes drug-target proteins; (iii) current drugs do not target diseases equally but only address some regions of the human disease network; and (iv) most drugs are palliative: they treat the symptoms, not the cause of the disease, which largely reflects our lack of knowledge regarding the molecular basis of diseases, such that for many pathologies we can only treat the symptoms but not cure them.

5. Summary: Trends in the Translational Characterization of Human Disease

We are still quite far from understanding the etiology of most diseases. Further advances on relevant experimental technology (e.g. genetic linkage, protein interaction prediction), along with integrative computational tools to organize, visualize, and test hypotheses should provide a step forward in that direction. More than ten years after the completion of the human genome project, it is clear that our approach to human disease elucidation needs to change. The $3-billion human book of life and the $138-million effort to catalog the common gene variants relevant to disease have so far failed to deliver the wealth of biological knowledge of human diseases and the subsequent
personalization of medicine the scientific community expected [84].

To date, biomedical research of the etiology of disease has largely focused on identifying disease-associated genes. But the molecular mechanisms of pathogenesis are extremely complex; gene-products interact in different pathways and multiple genes and environmental factors can affect their expression and activity. Likewise, the same proteins may participate in different pathways and mutations on their genes may or may not affect some or all of the biological processes they mediate. Thus, gene-disease associations cannot be straightforwardly deduced and their usefulness alone (in the absence of a molecular context) in elucidating the biology of healthy phenotype disruptions is questionable. Evidence is accumulating to suggest that in the majority of cases illnesses are traceable to a large number of genes affecting a network or pathway. The effects on healthy phenotype disruption may vary from one individual to another based on the person's gene variants and on how disruptive the alterations might be to the network [85].

To achieve a comprehensive genotype-phenotype understanding of disease, translational research should be conducted within a framework integrating methodologies for uncovering the genetics with those investigating the molecular mechanisms of pathogenesis. In fact, the studies yielding the most biological insight into disease to which we alluded in this chapter were those which implemented a combined genotype-phenotype approach; those studies identified the disease-susceptible genes and investigated their network of interactions and affected pathways. As a result, the combined approaches managed to explain known clinical observations while also suggesting new mechanisms of pathology.

PPI analysis provides an effective means to investigate biological processes at the molecular level. Yet, any conclusions obtained based on PPI methods must be validated since these methods are subject to limitations inherent to the nature of data collection and availability. First, one must be aware that the roles of protein interactions are context-specific (tissue, disease stage, and response). Thus, two proteins observed to interact in vitro might not interact in vivo if they are localized in different cell compartments. Even when in common cell compartments, protein abundance or presence of additional interactors might affect whether the interaction occurs at all. Second, most of the PPI methodologies use a simplistic static view of proteins and their networks. In reality, proteins are continuously being synthesized and degraded. The kinetics of processes and network dynamics need to be considered to achieve a complete understanding of how the disruptions of protein interactions lead to disease. Third, human PPIs are often predicted based on homology and from studies investigating the disease in other organisms. The same mechanisms of interaction might or might not exist in the organism of interest or their regulation and phenotypic effects might be different. Ideally, since network and structural approaches are complementary, the combination of network studies with a more detailed structural analysis has the potential to enhance the study of disease mechanisms and rational drug design.

Currently, in the PPI field, a large number of studies focus on the topological characterization of organisms' interactomes. Those studies have yielded valuable information regarding general trends of molecular organization and their differences across genomes. To gain a deeper understanding of individual diseases, however, the trend needs to move from global characterizations to disease-specific interactomes. Phenotype-specific interaction network analyses should help identify the subnetworks mapping to pathways that can be targeted therapeutically and point to key molecules essential to the biological function under study. Since disease inferences are only as good as the modeled PPI networks, the ontologies used by PPI resources need to be expanded to better describe disease phenotypes, cytological changes, and molecular mechanisms.

6. Exercises

Objective: To investigate Epstein-Barr Virus (EBV) pathogenesis using protein-protein interactions

EBV is a member of the herpesvirus family and one of the most common human viruses. According to the CDC, in the United States around 95% of adults have been infected by EBV. Upon infection in adults, EBV replicates in epithelial cells and establishes latency in B lymphocytes, eventually causing infectious mononucleosis 35%–50% of the time and sometimes cancer [86]. In the next four sections, your goal will be to study the interactions among EBV proteins and between the virus and its host (using the EBV-EBV and EBV-human interactomes respectively) as a means to investigate how EBV leads to disease at the molecular level.

Datasets:
The following datasets were adapted with permission from [87]

• Dataset S1: EBV interactome
• Dataset S2: EBV-Human interactome

Software requirements:
Download and install Cytoscape (http://www.cytoscape.org, [88]) locally.

Note:
The instructions below correspond to Cytoscape v. 2.8.0 but should be applicable to future releases.

I. Visualize the EBV interactome using Cytoscape

A. Import Dataset S1 into Cytoscape

• Select File -> Import -> Network (Multiple File Types)
• Click the Select button to browse to Dataset S1's location
• Click Import

B. Change the network layout

• Click on View -> Hide data panel
• Click the 1:1 magnifying glass icon to zoom out to display all elements of the current network
• Select Layout -> Cytoscape Layouts -> Force-directed (unweighted) Layout

C. Format the nodes and edges

• Select View -> Open Vizmapper
• Choose the Default Current Visual Style
• Click on the pair of connected nodes icon in the Defaults box
• Scroll down on the resulting dialog to change the following default visual properties:

NODE_SIZE = 20
NODE_FONT_SIZE = 20
NODE_LABEL_POSITION = (Node Anchor Points) SOUTH

• Note: Feel free to click and drag any nodes with labels that overlap to increase visual clarity.

D. Print the EBV interactome

• Select File -> Export -> Current Network View as Graphics

Answer the following questions:

i. How many nodes and edges are featured in this network?



Table 2. Topological properties of human proteins for exercise III.

Average topological property | ET-HP     | Random human protein
Degree                       | 15 ± 2    | 5.9 ± 0.1
Number of components         | 4         | 12.6 ± 0.25
Nodes in largest component   | 1,112     | 521 ± 5
Distance to other proteins   | 3.2 ± 0.1 | 4.03 ± 0.01

doi:10.1371/journal.pcbi.1002819.t002

ii. How many self interactions does the network have?
iii. How many pairs are not connected to the largest connected component?
iv. Define the following topological parameters and explain how they might be used to characterize a protein-protein interaction network: node degree (or average number of neighbors), network heterogeneity, average clustering coefficient distribution, network centrality.

II. Characterize the EBV-Human interactome

Import Dataset S2 into Cytoscape to create a map of the EBV-Human interactome. Format and output the network according to steps A through D in part I.

Answer the following questions:

i. How many unique proteins were found to interact in each organism?
ii. How many interactions are mapped?
iii. How many human proteins are targeted by multiple (i.e. how many individual human proteins interact with >1) EBV proteins?
iv. How does identifying the multi-targeted human proteins help you understand the pathogenicity of the virus? Hint: Speculate about the role of the multi-targeted human proteins in the virus life cycle.
v. How might you test the predictions you formulated above?

III. Characterize the topological properties of the human proteins that are targeted by EBV

Use the topological information provided for you in Table 2 to investigate whether the EBV-targeted Human Proteins (ET-HPs) differ from the average human protein.

Answer the following questions:

i. Based on the degree property, what can you deduce about the connectedness of ET-HPs? What does this tell you about the kind of proteins (i.e. what type of network component) EBV targets?
ii. What do the number and size of the largest components tell you about the inter-connectedness of the ET-HP subnetwork?
iii. Why is distance relevant to network centrality? What is unusual about the distance of ET-HPs to other proteins and what can you deduce about the importance of these proteins in the Human-Human interactome?
iv. Based on your conclusions from questions i–iii, explain why EBV targets the ET-HP set over the other human proteins and speculate on the advantages to virus survival the protein set might confer.

IV. Integrating knowledge from three different interactomes

Answer the following questions:

i. The Rta protein is a transactivator that is central to viral replication in EBV. When Rta is co-expressed with the LF2 protein, replication attenuates and the virus establishes latency. Solely based on the EBV-EBV network, formulate a hypothesis to explain how LF2 may be driving EBV to latency, suggesting at least one molecular mechanism by which LF2 may inactivate Rta.
ii. Why is establishing latency (as opposed to promoting rapid replication of viral particles) an effective mechanism of virus infection?
iii. Assign putative functions to EBV's SM and EBNA3A proteins based on the function of the human proteins with which they interact. Hint: Locate these proteins in the EBV-Human network. What clinical observation (see the introductory paragraph to section 6. Exercises) might these proteins' subnetworks explain?

Answers to the Exercises can be found in Text S1.

Supporting Information

Text S1 Answers to Exercises. (DOCX)
Dataset S1 EBV Interactome Data. (SIF)
Dataset S2 EBV-Human Interactome Data. (SIF)
Figure S1 EBV Interactome Map. (PDF)
Figure S2 EBV-Human Interactome Map. (PDF)
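The counting questions in parts I and II of the exercises (nodes, edges, self interactions, connected components) and the degree property used in Table 2 can also be cross-checked outside Cytoscape. The following is a minimal Python sketch and not part of the original exercise. It assumes the SIF datasets follow the usual Cytoscape simple interaction format (one `source interaction-type target [target ...]` record per line, tab- or space-delimited), treats interactions as undirected, and collapses duplicate pairs. The function names and the toy network (proteins A to F, interaction type `pp`) are hypothetical placeholders, not EBV data.

```python
from collections import deque

# Toy SIF text (made-up proteins, not EBV data): a chain A-B-C with a
# self interaction on C, a separate pair D-E, and an isolated node F.
TOY_SIF = "A pp B\nB pp C\nC pp C\nD pp E\nF\n"


def parse_sif(text):
    """Return (adjacency, edges, self_loops) for an undirected reading of SIF text."""
    adj = {}            # node -> set of neighbours (self loops excluded)
    edges = set()       # unordered pairs of distinct nodes
    self_loops = set()  # nodes that interact with themselves
    for line in text.splitlines():
        # SIF uses tabs when names contain spaces, otherwise any whitespace.
        fields = line.split("\t") if "\t" in line else line.split()
        fields = [f.strip() for f in fields if f.strip()]
        if not fields:
            continue
        source = fields[0]
        adj.setdefault(source, set())  # keeps isolated nodes
        for target in fields[2:]:      # fields[1] is the interaction type
            adj.setdefault(target, set())
            if source == target:
                self_loops.add(source)
            else:
                adj[source].add(target)
                adj[target].add(source)
                edges.add(frozenset((source, target)))
    return adj, edges, self_loops


def connected_components(adj):
    """Breadth-first search; returns components sorted from largest to smallest."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


def summarize(adj, edges, self_loops):
    """Basic counts; average degree ignores self interactions."""
    components = connected_components(adj)
    n = len(adj)
    return {
        "nodes": n,
        "edges": len(edges) + len(self_loops),
        "self_interactions": len(self_loops),
        "components": len(components),
        "largest_component": len(components[0]) if components else 0,
        "average_degree": sum(len(v) for v in adj.values()) / n if n else 0.0,
    }


if __name__ == "__main__":
    print(summarize(*parse_sif(TOY_SIF)))
```

To try this on the exercise data, the contents of a downloaded SIF file could be read as text and passed to `parse_sif`. Cytoscape may treat duplicated or directed edges differently from this sketch, so small discrepancies with its own counts are possible.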



Further Reading

• Chen JY, Youn E, Mooney SD (2009) Connecting protein interaction data, mutations, and disease using bioinformatics. Methods Mol Biol 541: 449–461.
• Nussinov R, Schreiber G (2009) Computational protein-protein interactions. Boca Raton: CRC Press.
• Ideker T, Sharan R (2008) Protein networks in disease. Genome Res 18: 644–652.
• Juan D, Pazos F, Valencia A (2008) Co-evolution and co-adaptation in protein networks. FEBS Lett 582: 1225–1230.
• Panchenko A, Przytycka T (2008) Protein-protein interactions and networks: identification, computer analysis, and prediction. London: Springer.
• Klussmann E, Scott J, Aandahl EM (2008) Protein-protein interactions as new drug targets. Berlin: Springer.
• Kann MG (2007) Protein interactions and disease: computational approaches to uncover the etiology of diseases. Brief Bioinform 8: 333–346.
• Shoemaker BA, Panchenko AR (2007) Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Comput Biol 3: e43. doi:10.1371/journal.pcbi.0030043

Glossary
Mendelian traits or diseases, named after Gregor Mendel, are the traits inherited and controlled by a single gene.

Positional cloning is a method to find the gene producing a specific phenotype in an area of interest in the genome. The first
step of positional cloning is linkage analysis, in which the gene is mapped using a group of DNA polymorphisms from families
segregating the disease phenotype.

Epistasis refers to the phenomenon in which one gene masks the phenotypic effect of another.

Angiogenesis is the physiological process leading to growth of new blood vessels. Angiogenesis is a normal and vital process
in growth, development, and wound healing; but it is also a fundamental step in the transition of tumors from a dormant to a
malignant state.

Hemangioblastomas are tumors of the central nervous system that originate from the vascular system.

References
1. De Las Rivas J, de Luis A (2004) Interactome data and databases: different types of protein interaction. Comp Funct Genomics 5: 173–178.
2. Barabasi AL, Oltvai ZN (2004) Network biology: understanding the cell's functional organization. Nat Rev Genet 5: 101–113.
3. Grindrod P, Kibble M (2004) Review of uses of network and graph theory concepts within proteomics. Expert Rev Proteomics 1: 229–238.
4. Pazos F, Ranea JA, Juan D, Sternberg MJ (2005) Assessing protein co-evolution in the context of the tree of life assists in the prediction of the interactome. J Mol Biol 352: 1002–1015.
5. Kann MG (2007) Protein interactions and disease: computational approaches to uncover the etiology of diseases. Brief Bioinform 8: 333–346.
6. Ideker T, Sharan R (2008) Protein networks in disease. Genome Res 18: 644–652.
7. Huntington G (1872) On chorea. Med Surg Rep 26: 320–321.
8. Punnett RC (1908) Mendelism in Relation to Disease. Proc R Soc Med 1: 135–168.
9. (1993) A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. The Huntington's Disease Collaborative Research Group. Cell 72: 971–983.
10. Goehler H, Lalowski M, Stelzl U, Waelter S, Stroedicke M, et al. (2004) A protein interaction network links GIT1, an enhancer of huntingtin aggregation, to Huntington's disease. Mol Cell 15: 853–865.
11. Duennwald ML, Jagadish S, Giorgini F, Muchowski PJ, Lindquist S (2006) A network of protein interactions determines polyglutamine toxicity. Proc Natl Acad Sci U S A 103: 11051–11056.
12. Giorgini F, Muchowski PJ (2005) Connecting the dots in Huntington's disease with protein interaction networks. Genome Biol 6: 210.
13. Shoemaker BA, Panchenko AR (2007) Deciphering protein-protein interactions. Part I. Experimental techniques and databases. PLoS Comput Biol 3: e42. doi:10.1371/journal.pcbi.0030042
14. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, et al. (2010) The genetic landscape of a cell. Science 327: 425–431.
15. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, et al. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 98: 4569–4574.
16. Mrowka R, Patzak A, Herzel H (2001) Is there a bias in proteome research? Genome Res 11: 1971–1973.
17. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, et al. (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417: 399–403.
18. Shoemaker BA, Panchenko AR, Bryant SH (2006) Finding biologically relevant protein domain interactions: conserved binding mode analysis. Protein Sci 15: 352–361.
19. Sprinzak E, Margalit H (2001) Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol 311: 681–692.
20. Deng M, Mehta S, Sun F, Chen T (2002) Inferring domain-domain interactions from protein-protein interactions. Genome Res 12: 1540–1548.
21. Nye TM, Berzuini C, Gilks WR, Babu MM, Teichmann SA (2005) Statistical analysis of domains in interacting protein pairs. Bioinformatics 21: 993–1001.
22. Fraser HB, Hirsh AE, Wall DP, Eisen MB (2004) Coevolution of gene expression among interacting proteins. Proc Natl Acad Sci U S A 101: 9033–9038.
23. Kanaan SP, Huang C, Wuchty S, Chen DZ, Izaguirre JA (2009) Inferring protein-protein interactions from multiple protein domain combinations. Methods Mol Biol 541: 43–59.
24. Guimaraes KS, Przytycka TM (2008) Interrogating domain-domain interactions with parsimony based approaches. BMC Bioinformatics 9: 171.
25. Guimaraes KS, Jothi R, Zotenko E, Przytycka TM (2006) Predicting domain-domain interactions using a parsimony approach. Genome Biol 7: R104.
26. Riley R, Lee C, Sabatti C, Eisenberg D (2005) Inferring protein domain interactions from databases of interacting proteins. Genome Biol 6: R89.
27. Izarzugaza JM, Juan D, Pons C, Ranea JA, Valencia A, et al. (2006) TSEMA: interactive prediction of protein pairings between interacting families. Nucleic Acids Res 34: W315–319.
28. Gertz J, Elfond G, Shustrova A, Weisinger M, Pellegrini M, et al. (2003) Inferring protein



interactions from phylogenetic distance matrices. 50. Sato T, Yamanishi Y, Kanehisa M, Toh H (2005) Protein Reference Database2009 update. Nu-
Bioinformatics 19: 20392045. The inference of protein-protein interactions by cleic Acids Res 37: D767772.
29. Goh CS, Bogan AA, Joachimiak M, Walther D, co-evolutionary analysis is improved by excluding 71. Goh KI, Cusick ME, Valle D, Childs B, Vidal M,
Cohen FE (2000) Co-evolution of proteins with the information about the phylogenetic relation- et al. (2007) The human disease network. Proc
their interaction partners. J Mol Biol 299: 283 ships. Bioinformatics 21: 34823489. Natl Acad Sci U S A 104: 86858690.
293. 51. Kann MG, Jothi R, Cherukuri PF, Przytycka TM 72. Wachi S, Yoneda K, Wu R (2005) Interactome-
30. Goh CS, Cohen FE (2002) Co-evolutionary (2007) Predicting protein domain interactions transcriptome analysis reveals the high centrality
analysis reveals insights into protein-protein from coevolution of conserved regions. Proteins of genes differentially expressed in lung cancer
interactions. J Mol Biol 324: 177192. 67: 811820. tissues. Bioinformatics 21: 42054208.
31. Jothi R, Kann MG, Przytycka TM (2005) 52. Botstein D, Risch N (2003) Discovering genotypes 73. Jonsson PF, Bates PA (2006) Global topological
Predicting protein-protein interaction by search- underlying human phenotypes: past successes for features of cancer proteins in the human inter-
ing evolutionary tree automorphism space. Bioin- mendelian disease, future approaches for complex actome. Bioinformatics 22: 22912297.
formatics 21 Suppl 1: i241i250. disease. Nat Genet 33 Suppl: 228237. 74. Uetz P, Dong YA, Zeretzke C, Atzler C, Baiker
32. Pazos F, Helmer-Citterich M, Ausiello G, Valen- 53. Kerem B, Rommens JM, Buchanan JA, Markie-
A, et al. (2006) Herpesviral protein networks and
cia A (1997) Correlated mutations contain wicz D, Cox TK, et al. (1989) Identification of the
their interaction with the human proteome.
information about protein-protein interaction. cystic fibrosis gene: genetic analysis. Science 245:
Science 311: 239242.
J Mol Biol 271: 511523. 10731080.
54. Miki Y, Swensen J, Shattuck-Eidens D, Futreal 75. Lim J, Hao T, Shaw C, Patel AJ, Szabo G, et al.
33. Pazos F, Valencia A (2002) In silico two-hybrid
system for the selection of physically interacting PA, Harshman K, et al. (1994) A strong candidate (2006) A protein-protein interaction network for
protein pairs. Proteins 47: 219227. for the breast and ovarian cancer susceptibility human inherited ataxias and disorders of Purkinje
34. Ramani AK, Marcotte EM (2003) Exploiting the gene BRCA1. Science 266: 6671. cell degeneration. Cell 125: 801814.
co-evolution of interacting proteins to discover 55. Wooster R, Bignell G, Lancaster J, Swift S, Seal 76. Charlesworth JC, Curran JE, Johnson MP,
interaction specificity. J Mol Biol 327: 273284. S, et al. (1995) Identification of the breast cancer Goring HH, Dyer TD, et al. Transcriptomic
35. Jothi R, Cherukuri PF, Tasneem A, Przytycka susceptibility gene BRCA2. Nature 378: 789792. epidemiology of smoking: the effect of smoking on
TM (2006) Co-evolutionary Analysis of Domains 56. Scriver CR, Waters PJ (1999) Monogenic traits gene expression in lymphocytes. BMC Med
in Interacting Proteins Reveals Insights into are not simple: lessons from phenylketonuria. Genomics 3: 29.
Domain-Domain Interactions Mediating Pro- Trends Genet 15: 267272. 77. Oti M, Snel B, Huynen MA, Brunner HG (2006)
tein-Protein Interactions. J Mol Biol 362: 861 57. Groman JD, Meyer ME, Wilmott RW, Zeitlin Predicting disease genes using protein-protein
875. PL, Cutting GR (2002) Variant cystic fibrosis interactions. J Med Genet 43: 691698.
36. Pellegrini M, Marcotte EM, Thompson MJ, phenotypes in the absence of CFTR mutations. 78. Gandhi TK, Zhong J, Mathivanan S, Karthick L,
Eisenberg D, Yeates TO (1999) Assigning protein N Engl J Med 347: 401407. Chandrika KN, et al. (2006) Analysis of the
functions by comparative genome analysis: pro- 58. Sun H, Smallwood PM, Nathans J (2000) human protein interactome and comparison with
tein phylogenetic profiles. Proc Natl Acad Biochemical defects in ABCR protein variants yeast, worm and fly interaction datasets. Nat
Sci U S A 96: 42854288. associated with human retinopathies. Nature Genet 38: 285293.
37. Huynen MA, Bork P (1998) Measuring genome Genet 26: 242246. 79. Chen L, Tai J, Zhang L, Shang Y, Li X, et al.
evolution. Proc Natl Acad Sci U S A 95: 5849 59. Dipple KM, McCabe ERB (2000) Phenotypes of (2011) Global risk transformative prioritization
5856. patients with simple Mendelian disorders are for prostate cancer candidate genes in molecular
38. Dandekar T, Snel B, Huynen M, Bork P (1998) complex traits: thresholds, modifiers, and systems networks. Mol Biosyst 7: 25472553.
Conservation of gene order: a fingerprint of dynamics. Am J Hum Genet 66: 17291735. 80. Xu J, Li Y (2006) Discovering disease-genes by
proteins that physically interact. Trends Biochem 60. Van Heyningen V, Yeyati PL (2004) Mechanisms topological features in human protein-protein
Sci 23: 324328. of non-Mendelian inheritance in genetic disease. interaction network. Bioinformatics 22: 2800
39. Overbeek R, Fonstein M, DSouza M, Pusch GD, Hum Mol Genet 13 Spec No 2: R225233. 2805.
Maltsev N (1999) Use of contiguity on the 61. Mayeux R (2005) Mapping the new frontier: 81. Rossin EJ, Lage K, Raychaudhuri S, Xavier RJ,
chromosome to predict functional coupling. In complex genetic disorders. J Clin Invest 115: Tatar D, et al. (2011) Proteins encoded in
Silico Biol 1: 93108. 14041407. genomic regions associated with immune-mediat-
40. Marcotte EM, Pellegrini M, Ng HL, Rice DW, 62. Brauch H, Kishida T, Glavac D, Chen F, Pausch ed disease physically interact and suggest under-
Yeates TO, et al. (1999) Detecting protein F, et al. (1995) Von Hippel-Lindau (VHL) disease lying biology. PLoS Genet 7: e1001273.
function and protein-protein interactions from with pheochromocytoma in the Black Forest
doi:10.1371/journal.pgen.1001273
genome sequences. Science 285: 751753. region of Germany: evidence for a founder effect.
82. Jiang Z, Zhou Y (2005) Using bioinformatics for
41. Enright AJ, IliopoulosI., Kyrpides NC, Ouzounis Hum Genet 95: 551556.
drug target identification from the genome.
CA (1999) Protein interaction maps for complete 63. Ohh M, Park CW, Ivan M, Hoffman MA, Kim
genomes based on gene fusion events. Nature Am J Pharmacogenomics 5: 387396.
TY, et al. (2000) Ubiquitination of hypoxia-
402: 8690. inducible factor requires direct binding to the 83. Yildirim MA, Goh KI, Cusick ME, Barabasi AL,
42. Juan D, Pazos F, Valencia A (2008) Co-evolution beta-domain of the von Hippel-Lindau protein. Vidal M (2007) Drug-target network. Nat Bio-
and co-adaptation in protein networks. FEBS Nat Cell Biol 2: 423427. technol 25: 11191126.
Lett. 64. Association ASH (2007) HPV Resource Center. 84. Hall SS Revolution postponed. Sci Am 303: 60
43. Valencia A, Pazos F (2003) Prediction of protein- 65. Scheffner M, Whitaker NJ (2003) Human papil- 67.
protein interactions from evolutionary informa- lomavirus-induced carcinogenesis and the ubiq- 85. Nadeau JH (2009) Transgenerational genetic
tion. Methods Biochem Anal 44: 411426. uitin-proteasome system. Semin Cancer Biol 13: effects on phenotypic variation and disease risk.
44. Valencia A, Pazos F (2002) Computational 5967. Hum Mol Genet 18: R202R210.
methods for the prediction of protein interactions. 66. Chuang HY, Lee E, Liu YT, Lee D, Ideker T 86. CDC (2006) Epstein-Barr Virus and Infectious
Curr Opin Struct Biol 12: 368373. (2007) Network-based classification of breast Mononucleosis. Center for Disease Control and
45. Pazos F, Valencia A (2001) Similarity of phylo- cancer metastasis. Mol Syst Biol 3: 140. Prevention/National Center for Infectious Dis-
genetic trees as indicator of protein-protein 67. Ideker T, Ozier O, Schwikowski B, Siegel AF eases.
interaction. Protein Eng 14: 609614. (2002) Discovering regulatory and signalling 87. Calderwood MA, Venkatesan K, Xing L, Chase
46. Pazos F, Juan D, Izarzugaza JM, Leon E, circuits in molecular interaction networks. Bioin- MR, Vazquez A, et al. (2007) Epstein-Barr virus
Valencia A (2008) Prediction of protein interac- formatics 18 Suppl 1: S233240. and virus human protein interaction maps. Proc
tion based on similarity of phylogenetic trees. 68. Hallock P, Thomas MA (2012) Integrating the Natl Acad Sci U S A 104: 76067611.
Methods Mol Biol 484: 523535. Alzheimers disease proteome and transcriptome: 88. Smoot ME, Ono K, Ruscheinski J, Wang PL,
47. Fryxell KJ (1996) The coevolution of gene family a comprehensive network model of a complex Ideker T (2011) Cytoscape 2.8: new features for
trees. Trends Genet 12: 364369. disease. OMICS 16: 3749. data integration and network visualization. Bioin-
48. Hakes L, Lovell SC, Oliver SG, Robertson DL 69. Rual JF, Venkatesan K, Hao T, Hirozane- formatics 27: 431432.
(2007) Specificity in protein interactions and its Kishikawa T, Dricot A, et al. (2005) Towards a 89. Zhang J, Yang Y, Wang Y, Wang Z, Yin M, et al.
relationship with sequence diversity and coevolu- proteome-scale map of the human protein- (2011) Identification of hub genes related to the
tion. Proc Natl Acad Sci U S A 104: 79998004. protein interaction network. Nature 437: 1173 recovery phase of irradiation injury by microarray
49. Tillier ER, Charlebois RL (2009) The human 1178. and integrated gene network analysis. PLoS ONE
protein coevolution network. Genome Res 19: 70. Keshava Prasad TS, Goel R, Kandasamy K, 6: e24680. doi:10.1371/journal.pone.0024680
18611871. Keerthikumar S, Kumar S, et al. (2009) Human

PLOS Computational Biology | www.ploscompbiol.org 11 December 2012 | Volume 8 | Issue 12 | e1002819


Education

Chapter 5: Network Biology Approach to Complex Diseases


Dong-Yeon Cho†, Yoo-Ah Kim†, Teresa M. Przytycka*
National Center for Biotechnology Information, NLM, NIH, Bethesda, Maryland, United States of America

Abstract: Complex diseases are caused by a combination of genetic and environmental factors. Uncovering the molecular pathways through which genetic factors affect a phenotype is always difficult, but in the case of complex diseases this is further complicated since genetic factors in affected individuals might be different. In recent years, systems biology approaches and, more specifically, network based approaches emerged as powerful tools for studying complex diseases. These approaches are often built on the knowledge of physical or functional interactions between molecules which are usually represented as an interaction network. An interaction network not only reports the binary relationships between individual nodes but also encodes hidden higher level organization of cellular communication. Computational biologists were challenged with the task of uncovering this organization and utilizing it for the understanding of disease complexity, which prompted rich and diverse algorithmic approaches to be proposed. We start this chapter with a description of the general characteristics of complex diseases followed by a brief introduction to physical and functional networks. Next we will show how these networks are used to leverage genotype, gene expression, and other types of data to identify dysregulated pathways, infer the relationships between genotype and phenotype, and explain disease heterogeneity. We group the methods by common underlying principles and first provide a high level description of the principles followed by more specific examples. We hope that this chapter will give readers an appreciation for the wealth of algorithmic techniques that have been developed for the purpose of studying complex diseases as well as insight into their strengths and limitations.

This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

Citation: Cho D-Y, Kim Y-A, Przytycka TM (2012) Chapter 5: Network Biology Approach to Complex Diseases. PLoS Comput Biol 8(12): e1002820. doi:10.1371/journal.pcbi.1002820
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Funding: This work is supported by the intramural program of National Library of Medicine, NIH. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: przytyck@ncbi.nlm.nih.gov
† These authors contributed equally to this work.

1. Introduction

Complex diseases are caused, among other factors, by a combination of genetic perturbations. Thus in the case of a complex disease we do not assume that a single genetic mutation can be pinned down as a cause. Many diseases fall in this category including cancer, autism, diabetes, obesity, and coronary artery disease. Even though there are other factors involved in such diseases, this review will focus on genetic causes.

One of the fundamental difficulties in studying genetic causes of complex diseases is that different disease cases might be caused by different genetic perturbations. In addition, if a disease is caused by a combinatorial effect of many mutations, the individual effects of each mutation might be small and thus hard to discover. For example, autism is considered to be one of the most heritable complex disorders, but its underlying genetic causes are still largely unknown [1]. One of the proposed factors that contribute to this difficulty is the role of rare genetic variations in the emergence of the disease [2].

An additional difficulty in studying complex diseases relates to disease heterogeneity. Specifically, in a complex disease, disease phenotypes might vary significantly among patients. The recognition of this fact has led, for example, to renaming autism to autism spectrum disorders (ASDs), referring in this way to a group of conditions characterized by impairments in reciprocal social interaction and communication, and the presence of restricted and repetitive behaviors [1]. Similar heterogeneity is present in other complex diseases including cancer.

Given the above challenges, how can we approach the study of complex diseases? A useful clue is provided by the fact that genes, gene products, and small molecules interact with each other to form a complex interaction network. Thus a perturbation in one gene can be propagated through the interactions, and affect other genes in the network. However, the fact that we observe similar disease phenotypes despite different genetic causes suggests that these different causes are not unrelated but rather dys-regulate the same component of the cellular system [3]. Therefore in studies of complex diseases researchers increasingly focus on groups of related/interconnected genes, referred to as modules or subnetworks.

2. Interactome

Biomolecules in a living organism rarely act individually. Instead, they work together in a cooperative way to provide specific functions. A variety of intermolecular interactions including protein-protein interactions, protein-DNA interactions, and RNA interactions are essential to these cooperative activities. These interactions can be conveniently represented as networks (graphs) with nodes (vertices) which denote molecules, and links (edges) which denote interactions between them. Depending on the type of interaction, the corresponding edge might be directed or

PLOS Computational Biology | www.ploscompbiol.org 1 December 2012 | Volume 8 | Issue 12 | e1002820


undirected. For example, a binding between two proteins is usually represented as an undirected edge while an interaction between a transcription factor and a gene whose expression is regulated by the given transcription factor is usually represented as a directed edge where the direction goes from the transcription factor to the gene.

What to Learn in this Chapter

• Characteristics and challenges of complex diseases
• Physical and functional networks and the methods to construct them
• Different classes of algorithms that use networks to leverage genotype, gene expression, and other types of data to identify dys-regulated pathways in diseases:
    • Scoring, correlation, and set cover based methods for identification of dys-regulated network modules based on various data types including genotype and phenotype
    • Distance and flow based methods for inferring information flow from genotype to phenotype
• Applications to disease classification and treatment

Biological interaction networks have characteristic topological properties [4]. One of the basic properties observed in many biological networks is the scale-free property [5]. A scale-free network is defined as a network whose node degree distribution follows a power law. Formally, the function $P(k)$ indicating the fraction of nodes interacting with $k$ other nodes in the network follows $P(k) \sim a k^{-\gamma}$, where $a$ is a normalization constant and the degree exponent $\gamma$ is usually in the range $2 < \gamma < 3$. Obviously, in biological networks the scale-free property holds only approximately, and practically the most important implication of this observation is the fact that these networks are characterized by a small number of highly connected nodes while most nodes interact with only a few neighbors. These highly connected nodes, called hubs, have been proposed to play important roles in biological processes [6] and shown to be related to the modular structure of the physical and functional interaction networks [7]. Therefore it might be interesting to consider disease related genes in the context of the topological properties of interaction networks such as connectivity or modularity [8,9]. With respect to connectivity, one should note that known disease genes tend to be more studied, which might introduce a bias towards higher connectivity. Importantly, independently of the source of the non-uniformity of node degree distribution, this characteristic property of interaction networks needs to be kept in mind while designing proper null models for conclusions derived using these networks.

In the following subsections, we briefly describe how physical and functional interaction networks are constructed and how they are applied to analyze complex diseases. We then explore the modularity of networks, a widely accepted phenomenon in biological networks that has proven to be helpful in disease studies.

2.1 Physical Interaction Networks

Physical contacts between proteins are critical in many biological functions. In fact much of the molecular machinery responsible for transcription, translation, and degradation is made of stable protein complexes. There are two main approaches for detecting physical protein interactions [10]. The first approach is to detect physical interactions between protein pairs. The most widely used high-throughput technology for detecting pairwise interactions is the yeast two-hybrid (Y2H) method. Alternatively, physical interactions among groups of proteins can be detected without explicit consideration of interacting partners. For this type of approach, interaction data is typically obtained by tandem affinity purification coupled to mass spectrometry (TAP-MS). A more detailed review on experimental methods for the detection and analysis of protein-protein interactions can be found in [11]. It is worth noting that networks obtained with various technologies often have different topological properties [7]. For example, in the case of the yeast TAP-MS network, hub nodes are enriched with essential genes (the genes without which yeast cannot survive in standard growth medium). In contrast, hubs in yeast Y2H networks are enriched with genes that are pleiotropic [12]. Finally, experimental procedures detecting protein-protein interactions have also been complemented by various computational methods using evolutionary-based approaches, statistical analysis, and/or machine learning techniques (for a review, see [13]).

While these physical interaction networks have significantly advanced our understanding of the relationships between molecules, a concern is their level of noise and incompleteness. Indeed, physical interaction networks obtained by high-throughput techniques are found to include numerous non-functional protein-protein interactions [14] and at the same time to miss many true interactions. Therefore physical interactions are often complemented with functional interactions.

2.2 Functional Interaction Networks

While physical interaction networks provide information on how proteins interact with each other, sometimes we may be more interested in how proteins work together to perform a certain function. Functional networks aim to connect genes with similar or related functions even if they do not necessarily physically interact. Similarly, functional regulatory networks are constructed so that the interactions depict direct or indirect regulatory relationships. Consequently, several computational methods have been proposed to derive functional interaction networks.

Since functionally related genes are likely to show mutual dependence in their expression patterns [15], gene expression data has often been used to detect functional relationships. Co-expression networks can be constructed by computing correlation coefficients or mutual information between gene expression profiles of every pair of genes in different experimental settings. To build more comprehensive functional networks, co-expression data is frequently combined with other types of data such as Gene Ontology [16,17], outcome of genetic interaction experiments, and physical interactions. Such integrated networks have been constructed for a variety of organisms including yeast [18], fly [19], mouse [20], and human [21].

Gene regulatory network reconstruction algorithms such as ARACNE [22] and SPACE [23] identify regulatory relationships building on the assumption that changes in the expression level of a transcription factor (TF) should be mirrored in the expression changes of the genes regulated by the transcription factor. Causal relations among genes can also be naturally modeled using Bayesian networks which can represent conditional dependencies between expression levels (for a primer on Bayesian network analysis utilizing expression data see [24]; for a recent review see [25]). Considering the temporal aspects of gene expression profiles, dynamic

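The scale-free property described in Section 2 is a statement about the empirical degree distribution $P(k)$. As a minimal sketch, the following Python code computes $P(k)$ and lists hub nodes for an undirected interaction network given as an edge list. The function names, the toy edge list, and the degree cutoff used to call a node a hub are all hypothetical choices made for illustration; the chapter does not prescribe a cutoff. The sketch assumes no duplicate edges and no self-loops.

```python
from collections import Counter


def degree_distribution(edges):
    """Return (degree per node, P(k)) for an undirected edge list.

    P(k) is the fraction of nodes that interact with exactly k other nodes.
    Assumes each interaction is listed once and there are no self-loops.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_nodes = len(degree)
    counts = Counter(degree.values())
    p_k = {k: c / n_nodes for k, c in sorted(counts.items())}
    return dict(degree), p_k


def hubs(degree, min_degree):
    """Nodes whose degree is at least min_degree (an arbitrary cutoff)."""
    return sorted(node for node, k in degree.items() if k >= min_degree)


# Hypothetical toy network; node labels are placeholders, not real data.
toy_edges = [
    ("n1", "n2"), ("n1", "n3"), ("n1", "n4"), ("n1", "n5"),
    ("n4", "n5"), ("n3", "n6"),
]
degree, p_k = degree_distribution(toy_edges)
hub_nodes = hubs(degree, min_degree=3)
```

On a real interactome one would inspect `p_k` on a log-log scale; estimating the exponent $\gamma$ reliably requires more care than a simple line fit and is outside the scope of this sketch.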

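Section 2.2 notes that co-expression networks can be built by computing correlation coefficients between the expression profiles of every pair of genes. A minimal sketch of that idea follows, assuming Pearson correlation and a fixed absolute-correlation threshold. The threshold value, the function names, and the toy profiles are hypothetical; mutual information, which the text also mentions, is not covered here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def coexpression_edges(profiles, threshold):
    """Connect every gene pair whose |Pearson r| reaches the threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= threshold:
            edges[(g1, g2)] = r
    return edges


# Hypothetical expression profiles across four experimental settings.
toy_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 1.0, 3.0],
}
edges = coexpression_edges(toy_profiles, threshold=0.9)
```

In this toy example geneA, geneB, and geneC end up pairwise connected (geneC through strong anti-correlation), while geneD stays isolated. With thousands of genes the number of pairs grows quadratically, and a significance-based cutoff is usually preferable to a fixed threshold.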

Bayesian networks have been used to model feedback loops as well as gene regulation patterns [26,27]. While expression profiles serve as primary data sources for constructing functional regulatory networks, this data is often complemented with additional information such as experimentally derived transcription factor binding data from ChIP-seq experiments or computationally identified binding motifs.

2.3 Modules and Pathways

It is widely accepted that the cellular system is modular. Hartwell et al. defined a functional module as an entity, composed of many types of interacting molecules, whose function is separable from those of other modules [28]. While the precise meaning of separation is left undefined, this general description provides a good intuition behind the concept of a module. Traditionally, molecular pathways have been delineated by focused studies of particular functions such as cell growth. Typically, these pathways contain not only topological connectivity information but also the roles of molecules such as whether a given molecule is an activator or inhibitor of the activity of another molecule. However, these hand-curated pathways are often incomplete. In addition, while some functions, such as cell growth or differentiation, have been relatively well studied, studies of other pathways are less extensive. Therefore, given the availability of large scale interaction networks, it is natural to attempt to extract meaningful functional modules from such networks. While there is no unique way to mathematically define functional modules, the most common approach is to search for densely connected subgraphs or clusters [29-46]. Additionally, gene expression information can be used alone or in concert with protein interaction data to obtain gene modules by grouping co-expressed genes into one module [47-49].

It is important to keep in mind that modules identified by analysis of high-throughput data are noisy, containing both false negative and false positive edges. In addition they do not usually provide information about the nature of an interaction. Therefore, unlike hand curated pathways, computationally identified network modules typically lack a mechanistic explanation of pathway activities but rather serve as groups of genes that work together to achieve a particular function.

An important advantage of working with modules rather than individual genes relates to the fact that it is often easier to predict the function of a module than the function of a gene. In particular, while the functions of many genes are still unknown, the prediction of the functional role of a module may be possible if the module contains a sufficient number of genes of known functions. Such enrichment analysis builds on the assumption that a fraction of genes can be assigned a functional category such as a Gene Ontology (GO) term [17]. Whether the number of genes with a given functional annotation in a gene module is higher than expected by chance can be determined by statistical tests such as the $\chi^2$ test or Fisher's exact test. A variety of software tools have been developed to perform such an analysis [50].

3. Identifying Modules and Pathways Dys-regulated in Diseases

Since complex diseases are believed to be caused by combinations of genetic alterations affecting a common component of the cellular system, module-centric approaches are particularly promising in their study. How can disease associated modules/subnetworks be identified? Complementing interaction data with additional data related to disease states helps in separating subnetworks perturbed in a disease of interest from the remainder of the network. Both genotypic data (e.g., SNP, copy number alteration) and molecular phenotypic data such as gene expression profiles in disease samples have been used to aid the identification of perturbed network modules and explain the connection between genotypic and phenotypic data (reviewed in [51]). Based on the assumption that complex diseases are caused by a set of mutations which, although they vary strongly among patients, are likely to dys-regulate common pathways, such dys-regulated pathways might be uncovered by mapping genes altered in the diseases onto a PPI (protein-protein interaction) network and then searching for network modules enriched with the altered genes. On the other hand, organismal level phenotypes such as diseases are directly related to molecular level changes such as gene expression. Thus an alternative group of approaches considers modules enriched with abnormally expressed genes. Finally, molecular pathways can also be considered as means of information flow. For example, the activation of the EGFR signaling pathway starts with the activation of the EGFR receptor, which in turn activates a number of signaling proteins downstream which initiate several signal transduction cascades, such as the MAPK, Akt and JNK pathways, and culminate in cell proliferation. Thus the third type of approaches focuses on predicting molecules and modules that mediate such information propagation.

What are the benefits of analyzing phenotypic and genotypic differences in diseases in the context of their molecular interactions? First, the integrative network based approaches can identify subnetworks that include genes that do not necessarily show a significantly different state in disease versus control but still play an important role within a module by mediating a connection between other disease associated genes. For example, in their pioneering approach, Ideker et al. [52] integrated yeast protein-protein and protein-DNA interactions with gene expression changes in response to perturbations of the yeast galactose utilization pathway and identified Active Subnetworks (sets of connected genes with significantly differential expression) which included common transcription factors showing moderate changes in their gene expression level but connecting other dys-regulated genes. Second, a module based approach increases statistical power, allowing the identification of a perturbed module even in the case when the perturbation of each individual gene in the module might not be statistically significant. For example, many cases of genetic diseases such as autism and schizophrenia are affected by rare germline variations which are difficult to distinguish from noise due to their rarity. However, recent studies showed that a significant portion of the altered genes belong to a highly interconnected protein network [53], suggesting the network approach can better detect the causal genes. Third, identified network modules can provide better understanding of the biological underpinning of the diseases and therefore more reliable markers in disease diagnosis and treatments (see Section 4 for more discussion).

3.1 Network Modules Enriched with Genetic Alterations

One way in which differing genetic variations might dys-regulate a common pathway is when the genes containing these alterations belong to the pathway. This potential explanation has led to the idea that the dys-regulated pathways might be uncovered by mapping the genes altered in the diseases to an interaction network and searching for the modules enriched with the altered genes (see Figure 1).

Following this principle, the first step to identify such modules is to select candidate genes whose alterations may have caused a disease of interest. Genes or whole geno-


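The enrichment analysis described in Section 2.3 asks whether a module contains more genes with a given functional annotation than expected by chance. A minimal sketch of the one-sided Fisher's exact test for this question is given below; it is computed as the upper tail of the hypergeometric distribution using only the Python standard library. The function name, argument names, and toy numbers are hypothetical and are not taken from any of the software tools cited in the chapter.

```python
from math import comb


def enrichment_p_value(n_genome, n_annotated, n_module, n_overlap):
    """One-sided Fisher's exact test for over-representation.

    n_genome    : number of genes in the background set
    n_annotated : background genes carrying the annotation (e.g., a GO term)
    n_module    : number of genes in the module
    n_overlap   : module genes carrying the annotation

    Returns P(X >= n_overlap) where X is hypergeometric, i.e. the chance of
    seeing at least this many annotated genes in a random module of this size.
    """
    upper = min(n_annotated, n_module)
    total = comb(n_genome, n_module)
    tail = sum(
        comb(n_annotated, k) * comb(n_genome - n_annotated, n_module - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 background genes, 5 annotated,
# a module of 4 genes of which 3 are annotated.
p_value = enrichment_p_value(n_genome=20, n_annotated=5, n_module=4, n_overlap=3)
```

For these toy numbers the tail contains 155 of the 4845 possible modules of size four, so `p_value` is about 0.032. When many annotation terms are tested for the same module, the resulting p-values would normally be corrected for multiple testing.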

Figure 1. Identification of network modules enriched with genetic alterations. (A) Genomic regions with alterations. (B) Genes in the
altered regions are mapped to the interaction network and modules enriched with such genes are identified.
doi:10.1371/journal.pcbi.1002820.g001
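
The two steps summarized in Figure 1 (mapping altered genes onto an interaction network, then collecting groups of altered genes that lie close to each other together with the non-altered genes that mediate their connections) can be sketched in a few lines of Python. The sketch below is a deliberately simple stand-in, not HOTNET, NETBAG, or any other published method: it assumes an unweighted network, treats a non-altered gene as a "linker" if it touches at least two altered genes, and reports connected groups containing a minimum number of altered genes. All names and the toy data are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def altered_modules(edges, altered, min_altered=2):
    """Connected groups of altered genes plus the linker genes joining them."""
    adj = build_adjacency(edges)
    altered = set(altered) & set(adj)  # keep only altered genes present in the network
    # Non-altered genes adjacent to at least two altered genes may mediate connections.
    linkers = {g for g in adj if g not in altered and len(adj[g] & altered) >= 2}
    allowed = altered | linkers

    modules, seen = [], set()
    for seed in sorted(altered):
        if seed in seen:
            continue
        component, stack = set(), [seed]
        while stack:  # depth-first growth restricted to allowed genes
            gene = stack.pop()
            if gene in component:
                continue
            component.add(gene)
            stack.extend((adj[gene] & allowed) - component)
        seen |= component
        if len(component & altered) >= min_altered:
            modules.append(sorted(component))
    return modules


# Hypothetical toy network and set of altered genes.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("F", "G"), ("G", "H")]
toy_altered = {"A", "C", "E", "F"}
modules = altered_modules(toy_edges, toy_altered)
```

Here the altered genes A, C, and E are joined through the linkers B and D into one module, while F has no altered partner nearby and is dropped. A real analysis would add edge weights and a statistical test of module significance against a suitable null model.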

mic regions that are altered in the disease are first identified, and the genes residing in the altered regions are mapped to an interaction network. Both physical and functional interaction networks can be used, and edges might be weighted based, for example, on the likelihood of having the same phenotypes or influences between genes [54-56]. Next, modules are typically defined as subsets of genetically altered genes that are highly interconnected or within close proximity to each other in the interaction network together with non-altered genes necessary to mediate these connections. Edge weights, if given, can be used to prioritize the modules. In many cases, finding the best subnetwork is computationally expensive and search algorithms such as greedy growth heuristics or more sophisticated approximation algorithms have been proposed. Finally, rigorous statistical tests have been applied to evaluate the significance of selected modules.

Examples. The idea of finding genetically altered network modules has been utilized in various disease studies. Analyzing ovarian cancer TCGA data (The Cancer Genome Atlas), HOTNET identified subnetworks in a protein interaction network in which genes are mutated in a significant number of patients [54]. The identified networks include the NOTCH signaling pathway, which is indeed known to be significantly mutated in cancer samples [57]. The method is based on the set cover approach (see Set cover based approach section below), which is found to be effective in capturing different genetic variations across patients. In the NETBAG (NETwork-Based Analysis of Genetic associations) method, developed by Gilman et al. and applied to identify a biological subnetwork affected by rare de novo copy number variations (CNVs) in autism [58,59], the authors first constructed a gene network where edges were assigned the likelihood odds ratio for contributing to the same genetic phenotype. Subsequently a greedy growth algorithm was used to find clusters in this network. In another approach, Rossin et al. [60] considered the genomic regions found to be associated with Rheumatoid Arthritis (RA) and Crohn's disease (CD) in previous GWAS studies, and connected the genes residing in these regions based on interaction data to obtain network modules. It was also verified that those identified modules exhibited significant differences in expression level in the disease samples.

3.2 Differentially Expressed Network Modules

Another popular and successful approach to find disease associated modules is to search for subnetworks that are significantly enriched with genes whose expression levels are changed in disease samples. Building on the observation that a molecular perturbation typically affects the expression levels of genes in a whole module rather than individual genes, these approaches identify the modules which exhibit different expression patterns in disease states relative to a control. Gene expression data has been widely utilized for identifying dys-regulated modules and drug targets, inferring interactions between genes, and classifying diseases. While these approaches are based on the common idea of finding gene modules enriched with genes that have abnormal expression, several different computational techniques have been used to achieve these tasks, which we discuss briefly below. The methods are also illustrated in Figure 2.

3.2.1 Scoring based methods. Suppose that there is a subset of genes which are differentially expressed in disease samples and they are closely connected to each other in an interaction network. A subnetwork including such genes might be a good candidate for a disease associated network module (Figure 2A). Implementing this idea requires a way to score candidate modules. Various methods have been suggested for measuring the significance of the differential expression of genes in a module and their connectivity (the distances between the genes). In addition, different methods adopt different search algorithms to find high scoring candidate modules. Finally, some approaches additionally require that all genes in a module change in the same direction (either all up-regulated or all down-regulated).

Examples. Chuang et al. defined the activity score for a subnetwork by comparing gene expression profiles from two different types of samples (metastatic or non-metastatic in their study) [61]. More specifically, they first computed how well the expression of a gene discriminates between the two patient groups and then scored candidate subnetworks based on aggregate discriminative power over all genes in the subnetwork. Then they searched for the most discriminative networks in a greedy manner. While the method was used for disease classification



Figure 2. Finding differentially expressed modules. (A) Score based method selects the module with significant expression changes. (B)
Correlation based method selects edges with correlation changes. The red and blue edges are correlated and anti-correlated edges, respectively. (C)
Set cover based method selects a set of genes covering all samples. In this example, each sample has at least 2 differentially expressed genes and the
genes are connected in the network.
doi:10.1371/journal.pcbi.1002820.g002
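
Panel (A) of Figure 2 corresponds to the scoring based strategy of Section 3.2.1: score a candidate module by the differential expression of its genes and search the network greedily for high scoring modules. The Python sketch below illustrates that pattern under loudly stated assumptions. Each gene is assumed to come with a precomputed differential expression z-score, and a module is scored by the sum of its z-scores divided by the square root of its size. This aggregate is an illustrative choice and is not claimed to be the activity score used in [61]. The function names, the size limit, and the toy data are hypothetical.

```python
from math import sqrt


def module_score(genes, z):
    """Aggregate z-score of a gene set: sum of z-scores over sqrt(set size)."""
    return sum(z[g] for g in genes) / sqrt(len(genes))


def greedy_module(seed, adj, z, max_size=10):
    """Grow a connected module from a seed, adding the best neighbor each round.

    Stops when no neighboring gene improves the score or max_size is reached.
    """
    module = {seed}
    best = module_score(module, z)
    while len(module) < max_size:
        frontier = set().union(*(adj[g] for g in module)) - module
        candidates = [(module_score(module | {g}, z), g) for g in sorted(frontier)]
        if not candidates:
            break
        score, gene = max(candidates)
        if score <= best:
            break
        module.add(gene)
        best = score
    return sorted(module), best


# Hypothetical toy network (adjacency sets) and per-gene z-scores.
toy_adj = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"}, "D": {"B"}}
toy_z = {"A": 2.0, "B": 1.5, "C": -0.5, "D": 1.8}
module, score = greedy_module("A", toy_adj, toy_z)
```

Starting from A, the search adds B and then D, and stops because adding C would lower the score. In practice the search is repeated from many seeds and the resulting scores are compared against scores obtained from randomized data.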

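Panel (C) of Figure 2 describes the set cover idea: choose a set of genes such that every disease sample is "covered" by differentially expressed genes in the set. A minimal sketch of the classical greedy heuristic for set cover is shown below. It assumes that each sample only needs to be covered once and it ignores the requirement that the chosen genes be connected in the network, so it illustrates the covering step only and is not the algorithm of any method cited in the chapter. The function name and toy data are hypothetical.

```python
def greedy_set_cover(gene_to_samples, samples):
    """Pick genes one at a time, each time covering the most uncovered samples.

    gene_to_samples maps a gene to the set of samples in which it is
    differentially expressed. Returns (chosen genes, samples left uncovered).
    """
    uncovered = set(samples)
    chosen = []
    while uncovered:
        # Ties are broken alphabetically because max keeps the first maximum.
        gene = max(sorted(gene_to_samples),
                   key=lambda g: len(gene_to_samples[g] & uncovered))
        gain = gene_to_samples[gene] & uncovered
        if not gain:  # remaining samples cannot be covered by any gene
            break
        chosen.append(gene)
        uncovered -= gain
    return chosen, uncovered


# Hypothetical data: which genes are differentially expressed in which samples.
toy_cover = {
    "g1": {"s1", "s2", "s3"},
    "g2": {"s3", "s4"},
    "g3": {"s4", "s5"},
    "g4": {"s1"},
}
chosen, uncovered = greedy_set_cover(toy_cover, ["s1", "s2", "s3", "s4", "s5"])
```

For the toy data the heuristic selects g1 and then g3, which together cover all five samples. Requiring each sample to be covered several times, as in the example of Figure 2C, would amount to tracking a per-sample count instead of a set of uncovered samples.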
(see Section 4), it can readily be applied to leverage the difference between disease and non-disease cohorts.

3.2.2 Correlation based methods. Comparing expression patterns between genes is a basis for constructing a co-expression network, extracting modules exhibiting similar expression patterns, and further understanding molecular changes in diseases. Considering expression correlation of disease cases in the context of interactions can provide additional power in the identification of a disease associated module (Figure 2B). If the expression changes of two neighboring nodes are correlated with each other, this may suggest that the two interacting genes have related functional roles. With this in mind, some approaches look at connected components which show highly correlated and anti-correlated expression patterns. Other approaches search for loss and gain of correlation in disease states to identify dys-regulated edges.

Examples. Aiming to identify regulatory networks defining phenotypic classes of human cell lines, Muller et al. searched for Jointly Active Connected Subnetworks (connected subnetworks with high average internal expression similarity) in a human interaction network [62] and demonstrated the power of combining network and expression data.

The IDEA (Interactome Dysregulation Enrichment Analysis) method [63] focused on the identification of perturbed network edges in a combined interaction network (PPI, transcriptional, signaling, and posttranslational modifications predicted by MINDy [64]), and searched for the edges connecting genes which in a disease state show loss or gain of expression correlation. The utility of the method was demonstrated in the analysis of follicular lymphoma (FL) and other cancer types. In particular, they identified BCL2 as the gene adjacent to the largest number of dys-regulated edges in FL. This analysis also identified the SMAD1 gene, which could not be detected by differential expression analysis alone.

To understand the mechanism of aging, Xue et al. applied a network module approach [65,66]. They utilized a PPI network and overlaid expression data obtained from various stages of aging. Two types of edges, correlated and anti-correlated, were selected. The subnetwork that includes only those edges, called the NP (negative and positive) network, was proposed to be related to the aging mechanism. Further modularizing the network with hierarchical clustering of expression patterns, they obtained a few age related modules and found that some genes connecting different modules through PPIs are more likely to affect aging/longevity, which was also experimentally validated.

3.2.3 Set cover based methods. A group of methods employs a combinatorial approach named set cover. In a set cover, a gene is considered to cover a disease



Figure 3. Finding information propagation modules. (A) Shortest path approach to uncover information propagation. The shortest paths from
a target gene (with hexagon shape) to each of three candidate genes are shown. The closest gene is identified as the most probable disease causing
gene. (B) Flow based approach. The gene receiving the most significant amount of flow is identified as the disease gene. The information flow
methods often follow Kirchhoff's current law (the amount of incoming information equals the amount of outgoing information).
doi:10.1371/journal.pcbi.1002820.g003
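
Panel (A) can be made concrete with a short sketch. The code below ranks candidate genes by their shortest-path distance to a target gene in an unweighted, undirected interaction network, using breadth-first search. The function names (`shortest_path`, `rank_candidates`), the toy edge list, and the node labels are hypothetical and chosen only for illustration; the sketch does not reproduce any of the cited methods, and it shares the limitation of distance based approaches that only one of possibly many paths between two genes is considered.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set, Tuple


def shortest_path(graph: Dict[str, Set[str]], source: str, target: str) -> Optional[List[str]]:
    """Return one shortest path (fewest edges) from source to target, or None."""
    parent: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent pointers back to the source.
            path: List[str] = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):   # sorted for determinism
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


def rank_candidates(graph: Dict[str, Set[str]], target: str,
                    candidates: Iterable[str]) -> List[Tuple[int, str, List[str]]]:
    """Sort candidates by distance to the target; unreachable candidates are dropped."""
    ranked = []
    for candidate in candidates:
        path = shortest_path(graph, target, candidate)
        if path is not None:
            ranked.append((len(path) - 1, candidate, path))
    return sorted(ranked)


# Hypothetical toy network: target gene "T" and three candidate genes C1, C2, C3.
edges = [("T", "a"), ("a", "C1"), ("T", "b"), ("b", "c"), ("c", "C2"),
         ("a", "d"), ("d", "e"), ("e", "C3")]
graph: Dict[str, Set[str]] = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

ranking = rank_candidates(graph, "T", ["C1", "C2", "C3"])
for distance, gene, path in ranking:
    print(gene, distance, " - ".join(path))
```

The first entry of `ranking` is the closest candidate, which under the shortest path assumption would be reported as the most probable causal gene. A flow based method, as in panel (B), would instead accumulate contributions from all paths rather than a single shortest one.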

sample if it is dys-regulated in the sample. For example, whether a gene covers a sample can be decided based on the fold change of the gene expression level in the sample or by using a statistical test such as a z-test. The main principle of the set cover approach is that each disease case has some dys-regulated (thus covering) genes but, in heterogeneous diseases, different cases will typically have different covering genes. Set cover approaches provide a strategy to select a representative set of such covering genes (Figure 2C). This is usually done by defining some optimization criterion and attempting to select a set of genes which is optimal with respect to this criterion. For example, given a set of genes and disease samples along with covering relationships, a subset of genes is selected so that each sample is covered by some minimal number of genes while the total number of selected genes is minimized.

Many observed organism-level phenotypes arise in a heterogeneous way. Diseases such as cancer are now seen as a spectrum of related disorders that manifest themselves in a similar fashion. Since different samples may be covered by different genes and those genes may be connected in an interaction network, set cover approaches can be useful to identify gene modules explaining a heterogeneous set of samples [67-69].

Examples. Aiming to detect dys-regulated pathways in complex diseases, Ulitsky et al. extended the set cover technique by integrating expression data and interaction networks [67]. Their method, named DEGAS (de novo discovery of dys-regulated pathways), searches for a smallest set of genes forming a connected subnetwork so that each disease sample is covered by a certain minimal number of genes. They applied this approach to a Parkinson's disease dataset. Chowdhury et al. [68] developed an alternative network cover based algorithm and used the identified modules for disease classification in a human colorectal cancer dataset.

Set cover approaches have also been applied to data types other than gene expression. For example, Kim et al. proposed a module cover approach to identify gene modules which collectively cover disease samples [70]. At the same time they required that each module is coherent, containing genes with similar genotype-phenotype mappings (see Section 4 for more discussion). The HotNet algorithm discussed in Section 3.1 also utilized a variant of a set cover approach to find genetically altered modules. In their case, a gene is defined to cover a sample if the gene is mutated in the sample, and they looked for a fixed size connected set of genes covering as many samples as possible. The Dendrix (De novo Driver Exclusivity) algorithm was also developed to discover mutated gene modules in cancer and, though it does not utilize interaction data, it aims to find sets of genes, domains, or nucleotides whose mutations exhibit both high coverage and high exclusivity in the disease samples [71].

3.3 Uncovering Information Propagation Modules

The approaches discussed thus far have dealt with modules of genes associated with either phenotypic or genotypic information. While both approaches are helpful for predicting dys-regulated modules, a more effective way to understand disease mechanisms might be to combine both genotypic data (the putative causes of diseases) and phenotypic data (their effects). Expression quantitative trait loci (eQTL) analysis is a useful method to find the relationship between genotype and phenotype [72,73]. eQTL analysis treats the level of gene expression as a quantitative phenotype, which is assumed to be controlled by genotypic information. Loci that putatively control the expression of a given gene are identified by determining the associations between genotype and gene expression. Given an association between a genotypic variation in a locus and the expression level of a gene, the next challenge is to uncover the pathway(s) through which the genetic variation leads to the expression change. Recently, several groundbreaking pathway elucidation methods have emerged, as illustrated in Figure 3 and described below.

3.3.1 Distance based methods. A simple approach to identify a possible pathway from a genetically altered gene (putative cause) to the gene with correlated expression change (target gene) is to test if there is a path in an interaction network connecting the putative causal gene to its target gene. The shortest path connecting a causal gene and its target is often used to explain their causal relationship (Figure 3A). The intermediate nodes on such a shortest path are likely members of an affected pathway/module. Several variations of the shortest path approach have been used in extracting disease associated network modules [74-76]. For example, Carter et al. searched for the shortest directionally consistent paths in molecular interaction networks connecting



seed genes to their targets. The targets were inferred by linear decomposition of gene expression data [76].

When multiple target genes exist, the well-known graph-theoretical concept of a Steiner tree is often used in place of a set of shortest paths. Given a set of nodes to be connected, a Steiner tree is an acyclic subgraph (a tree) connecting all these nodes while using the minimum number of edges. In a Steiner tree, the individual path from the putative causal gene (the root of the tree) to each of the target genes does not need to be the shortest, but the size (i.e., the number of edges) of the whole tree is minimized. The Steiner tree approach has been used to find new functional associations for proteins [77]. Tuncbag et al. extended the approach to the Steiner forest problem (allowing multiple trees), applying it to proteomic data from glioblastoma multiforme (GBM). In their study, each tree was rooted in a different cell surface receptor and represented an independent signaling pathway originating from this receptor [78].

Distance based methods, such as the shortest path approach or the Steiner tree method, have several shortcomings. In particular, they ignore the fact that a pair of genes may have multiple paths connecting them in a network. In addition, they use network topology without considering additional data (e.g., gene expression) and assume that the shortest pathways are the most informative or most likely used paths, which may not always be the case.

3.3.2 Flow-based methods. In the information flow approach, genotypic variations are considered the source of perturbation, while genes with phenotypic changes are considered the targets of a perturbation pathway. Instead of finding single paths connecting source and targets, flow-based methods compute the fraction of flow going through each intermediate node/edge. The fraction of flow indicates the probability of using the given path in information propagation (Figure 3B). In the case of the current flow approach, the network is modeled to mimic the behavior of current in an electronic circuit, where each edge has an associated resistance. The current flow network provides an efficient framework equivalent to a random walk, which is also often used for modeling information flow in biological networks (see discussion below). An important advantage of network flow approaches is their ability to incorporate additional data (such as gene expression, confidence level of interactions, and functional associations of genes) into the probabilistic network models. By incorporating such additional data, network flow approaches can more confidently suggest information propagation pathways.

The information flow of biological networks has been used to predict protein functions, to prioritize candidate disease genes, and to find network centralities [7,79-88]. The flow-based approach is particularly useful for augmenting network information for eQTL analysis. Specifically, it can be used to pinpoint likely causal genes in genomic eQTL regions and to uncover genes involved in the propagation of information signals from such causal genes to their target genes.

There are several mathematical formulations that can be used to capture information propagation. In addition to the aforementioned current flow, other approaches include random walk and network flow. While mathematically different, many information propagation methods share a number of similar assumptions such as flow conservation (Kirchhoff's law). In the random walk method, a number of random walkers repeatedly start from a node. The likelihood of associating a gene in the network to a disease is estimated by the number of random walkers arriving at the gene. Gene expression correlation provides one way to compute the weight of a gene in the network which, in turn, provides the transition probability of the random walker. The network flow methods are closely related to the current flow approach. Unlike current flow, however, the network flow model resembles water finding paths through pipes. Capacities are associated with pipes (edges), providing constraints on how much flow can go through each pipe.

Examples. Tu et al. [79] used the random walk approach to infer causal genes and underlying causal paths over a molecular interaction network for yeast knock-out experimental data. Current flow is an equivalent form of random walk that can be used in a more computationally efficient way [89]. Using this knowledge, Suthram et al. [80] developed the eQED method, which integrates eQTL analysis with molecular interaction information modeled as a current flow network.

Kim et al. further extended the eQED idea to identify causal genes and dys-regulated pathways and applied it to glioma sample analysis [69,90]. One of the challenges of eQTL analysis is a massive multiple testing problem, for which various multiple testing correction methods have been proposed. Without such corrections, eQTL analysis typically finds multiple associated regions for each target gene, many of which arise simply by chance. However, simply applying a more stringent p-value cutoff for multiple testing corrections often eliminates many true causal regions. Moreover, each region may contain dozens of candidate causal genes. Current flow analysis can be applied to complement eQTL analysis and help to identify the genes whose alterations are most likely to cause abnormal expression of the target gene. Using copy number variations and gene expression profiles of the same set of cancer patients, Kim et al. first identified chromosomal regions where copy number variations correlated with gene expression changes. Subsequently, they used the current flow algorithm to identify potential causal genes in the associated regions. By selecting genes receiving significant amounts of current in the network, Kim et al. identified putative causal genes in glioblastoma and uncovered commonly dys-regulated pathways, including insulin receptor signaling pathways and RAS signaling. The identified pathways featured several hub nodes, such as EGFR, that were known to be important players in glioma or more generally in cancer. Compared to simple genome-wide association studies, which only identify putative associations between causal loci and target genes, the current flow based method provides increased power to predict causal disease genes and to uncover dys-regulated pathways.

A variant of the network flow approach, the minimum cost network flow, was used to model the response to increased expression of alpha-synuclein, a protein implicated in several neurodegenerative disorders, including Parkinson's disease [81]. In addition to the edge capacities, the minimum cost network flow approach associates weights with edges representing the cost of sending flow through an edge. These weights were computed based on the probability of the two genes interacting in a response pathway, while capacities were calculated using the transcript levels of target genes.

4. Applications of Network Modules: Disease Diagnosis and Treatment

Can network modules help facilitate a more personalized approach for disease diagnosis and treatment? Traditional approaches of clinical disease classification have been based on pathological analysis of patients and existing knowledge of diseases. However, traditional diagnostic approaches are prone to errors. Alternatively, knowledge about dys-regulated pathways can be used to subtype diseases and to develop relevant treatments for individual disease subgroups. For example, network modules have been used to predict patient survival, metastasis, and drug responses for various types of cancer [61,68,91-94].

4.1 Disease Classification

A supervised approach to disease classification starts with a set of samples with a known partition into disease subtypes (e.g., metastatic or not) and attempts to identify a classifying principle using specific molecular features. The general strategy for supervised disease classification is to search for subnetworks, also called subnetwork markers, whose activities best discriminate the two disease subtypes. As in the case of single-gene disease markers, a network marker will distinguish some but not all disease cases, and multiple subnetworks might be necessary. Among selected candidate network markers, the best markers are selected based on a set of training samples. Some methods take an unsupervised approach, where subclasses and their features are discovered without using a known training set.

Examples. Chuang et al. showed how dys-regulated network modules (described in Section 3.2) provide more robust and accurate predictions than those by single gene based classifications when applied to breast cancer metastasis analysis [61]. Chuang et al.'s work provided the proof of principle for using network modules in disease classification. A number of subsequent extensions and improvements to Chuang et al.'s work were suggested. For example, Lee et al. incorporated curated pathways, and searched for a subset of genes with discriminative features for the disease phenotype [94]. More recently, Dao et al. developed alternative network based approaches for classification of cancer subtypes based on densely connected subnetworks and randomized algorithms [92,93]. Other techniques for best marker identification, such as set cover and bottom-up enumeration techniques, were also proposed [68,91].

Kim et al. identified gene modules using a module cover approach to capture disease heterogeneity in brain cancer samples from Rembrandt and ovarian cancer samples from TCGA [70]. Next, Kim et al. superimposed the selected modules onto the results from an independently proposed classification scheme [57]. As a result, Kim et al. uncovered which disease classes are characterized by which combinations of modules.

4.2 Disease Similarity

Network modules can also be used to explain disease similarity. Overlaps of dys-regulated network modules explain why some complex diseases share similar phenotypic traits. Suthram et al. used a variant of PathBlast [95] to identify dense subnetworks. Analysis of disease similarity was achieved by comparing expression patterns of various diseases in the modules [96]. Several dys-regulated modules were found to be common to many diseases, which explains why some drugs can treat many different diseases.

4.3 Response to Treatment

Modules may help determine whether a given patient will respond to a particular drug, which is valuable for treatment design. In addition, understanding molecular differences between responders and non-responders is likely to help the development of alternative treatments. For example, Chu and Chen used a network approach to discover apoptosis drug targets [97]. Chu and Chen constructed a PPI network for apoptosis in normal cells and applied a nonlinear stochastic model to remove false positive interactions using microarray data. Comparing the resulting subnetworks helped to shed some light on the mechanisms leading to apoptosis and to identify potential drug targets.

5. Summary

Network biology provides powerful tools for the study of complex diseases. Network-based approaches leverage the idea that complex diseases can be better understood from the perspective of dys-regulated modules than at the individual gene level. Modularity is a widely accepted concept in molecular networks, and module-based approaches provide a number of advantages including robustness in the identification of dys-regulated pathways and improved disease classification.

In addition, network based formulations allow using a wealth of methods already developed in graph theory, such as shortest paths, network flow, and Steiner trees. Network-based methods have several limitations, including the lack of mechanistic explanations. Despite the limitations, network analysis has been applied successfully in many disease studies, suggesting testable hypotheses.

6. Exercises

1. Construct coexpression networks following the steps below [98].
   a. Download the three expression datasets from the following page: http://www.geneticsofgeneexpression.org/network/download
   b. Compute 3 population-specific correlations for each pair of 4238 genes with the expression data. (Hint: There are 8,978,203 pairs of genes.)
   c. For gene pairs which have similar correlations in the 3 datasets, calculate the weighted average correlation, weighted by the number of individuals in each population. (Hint: In the Supplemental Table 1 published with [98] (http://genome.cshlp.org/content/suppl/2009/10/02/gr.097600.109.DC1/nayak_supplemental_material.pdf), you can find the list of gene pairs whose correlations differ significantly among the 3 datasets.)
   d. Construct the correlation network by connecting gene pairs whose weighted average correlations are greater than a pre-defined threshold (e.g., 0.5).
   e. Compute specific parameters describing the network topology. (Hint: You can use the NetworkAnalyzer Cytoscape plugin http://med.bioinf.mpi-inf.mpg.de/netanalyzer/)
   f. For different correlation thresholds, compare the networks in terms of topological properties.
2. Suppose that in a co-expression network two genes are identified to have correlated expression patterns. Provide at least two possible biological explanations of this correlation.
3. Some variants of information flow approaches that identify pathways of information flow from a mutated gene to a target gene with correlated expression require the last but one node on such a pathway (the node preceding the target gene) to be a transcription factor. What is a justification for such a requirement? What can be the advantages and disadvantages of such a design?
4. Consider a set cover approach to find a representative set of genes dys-regulated in a given set of cancer patients. The algorithm finds the smallest number of genes so that each disease case is covered at least k times. How does the number of selected genes depend on k? If you suspect that data for 5% of patients might be incorrect, how would you modify the optimization problem?
5. A Steiner tree connecting a set of nodes does not need to be unique. In the graph shown in Figure 4, find two different Steiner trees connecting genes C, T1, T2, T3, and T4.
6. In the graph shown in Figure 4, find the shortest paths connecting C with each of T1, T2, T3, and T4. Do the edges used by these paths correspond to a Steiner tree? Explain why or why not.

Answers to the exercises are provided in Text S1.

Figure 4. A hypothetical interaction network to be used with Exercises 5 and 6.
doi:10.1371/journal.pcbi.1002820.g004

Supporting Information

Text S1 Answers to Exercises.
(PDF)

Acknowledgments

The authors thank Mileidy W. Gonzalez (NIH/NLM/NCBI) and Pawel Przytycki (Princeton University) for their helpful comments on the manuscript.

Further Reading

• Schadt EE (2009) Molecular networks as sensors and drivers of common human diseases. Nature 461(7261): 218-223.
• Barabasi AL, Gulbahce N, Loscalzo J (2011) Network medicine: a network-based approach to human disease. Nat Rev Genet 12(1): 56-68.
• Przytycka TM, Singh M, Slonim DK (2010) Toward the dynamic interactome: it's about time. Brief Bioinform 11(1): 15-29.
• Przytycka TM, Cho DY (2012) Interactome. In: Meyers RA, editor. Encyclopedia of molecular cell biology and molecular medicine. John Wiley and Sons, Inc. doi:10.1002/3527600906.mcb.201100018
• Califano A, Butte AJ, Friend S, Ideker T, Schadt E (2012) Leveraging models of cell regulation and GWAS data in integrative network-based association studies. Nat Genet 44(8): 841-847. doi:10.1038/ng.2355
• Vidal M, Cusick ME, Barabasi AL (2011) Interactome networks and human disease. Cell 144(6): 986-998.
• Kim Y, Przytycka TM (2012) Bridging the gap between genotype and phenotype via network approaches. Frontiers in Genetics special issue on mapping complex disease traits with global gene expression. Front Genet 3: 227. doi:10.3389/fgene.2012.00227

References
1. Veenstra-Vanderweele J, Christian SL, Cook EH, Jr. (2004) Autism as a paradigmatic complex genetic disorder. Annu Rev Genomics Hum Genet 5: 379-405.
2. Pinto D, Pagnamenta AT, Klei L, Anney R, Merico D, et al. (2010) Functional impact of global rare copy number variation in autism spectrum disorders. Nature 466: 368-372.
3. Schadt EE (2009) Molecular networks as sensors and drivers of common human diseases. Nature 461: 218-223.
4. Gursoy A, Keskin O, Nussinov R (2008) Topological properties of protein interaction networks from a structural perspective. Biochem Soc Trans 36: 1398-1403.
5. Albert R (2005) Scale-free networks in cell biology. J Cell Sci 118: 4947-4957.
6. Jeong H, Mason SP, Barabasi AL, Oltvai ZN (2001) Lethality and centrality in protein networks. Nature 411: 41-42.
7. Zotenko E, Mestre J, O'Leary DP, Przytycka TM (2008) Why do hubs in the yeast protein interaction network tend to be essential: reexamining the connection between the network topology and essentiality. PLoS Comput Biol 4: e1000140. doi:10.1371/journal.pcbi.1000140
8. Jonsson PF, Bates PA (2006) Global topological features of cancer proteins in the human interactome. Bioinformatics 22: 2291-2297.
9. Wachi S, Yoneda K, Wu R (2005) Interactome-transcriptome analysis reveals the high centrality of genes differentially expressed in lung cancer tissues. Bioinformatics 21: 4205-4208.
10. De Las Rivas J, Fontanillo C (2010) Protein-protein interactions essentials: key concepts to building and analyzing interactome networks. PLoS Comput Biol 6: e1000807. doi:10.1371/journal.pcbi.1000807
11. Berggard T, Linse S, James P (2007) Methods for the detection and analysis of protein-protein interactions. Proteomics 7: 2833-2842.
12. Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, et al. (2008) High-quality binary protein interaction map of the yeast interactome network. Science 322: 104-110.
13. Shoemaker BA, Panchenko AR (2007) Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Comput Biol 3: e43. doi:10.1371/journal.pcbi.0030043
14. Levy ED, Landry CR, Michnick SW (2009) How perfect can protein interactomes be? Sci Signal 2: pe11.
15. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95: 14863-14868.
16. (2010) The Gene Ontology in 2010: extensions and refinements. Nucleic Acids Res 38: D331-335.
17. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25-29.
18. Lee I, Date SV, Adai AT, Marcotte EM (2004) A probabilistic functional network of yeast genes. Science 306: 1555-1558.
19. Costello JC, Dalkilic MM, Beason SM, Gehlhausen JR, Patwardhan R, et al. (2009) Gene networks in Drosophila melanogaster: integrating experimental data to predict gene function. Genome Biol 10: R97.
20. Guan Y, Myers CL, Lu R, Lemischka IR, Bult CJ, et al. (2008) A genomewide functional network for the laboratory mouse. PLoS Comput Biol 4: e1000165. doi:10.1371/journal.pcbi.1000165
21. Ramani AK, Li Z, Hart GT, Carlson MW, Boutz DR, et al. (2008) A map of human protein interactions derived from co-expression of human mRNAs and their orthologs. Mol Syst Biol 4: 180.
22. Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, et al. (2006) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7 Suppl 1: S7.
23. Peng J, Wang P, Zhou N, Zhu J (2009) Partial Correlation Estimation by Joint Sparse Regression Models. J Am Stat Assoc 104: 735-746.



24. Peer D (2005) Bayesian network analysis of 48. Feng J, Jiang R, Jiang T (2011) A max-flow based 68. Chowdhury SA, Koyuturk M (2010) Identification
signaling networks: a primer. Sci STKE 2005: approach to the identification of protein com- of coordinately dysregulated subnetworks in com-
pl4. plexes using protein interaction and microarray plex phenotypes. Pac Symp Biocomput: 133144.
25. Alterovitz G, Liu J, Afkhami E, Ramoni MF data. IEEE/ACM Trans Comput Biol Bioinform 69. Kim YA, Wuchty S, Przytycka TM (2011)
(2007) Bayesian methods for proteomics. Proteo- 8: 621634. Identifying causal genes and dysregulated path-
mics 7: 28432855. 49. Maraziotis IA, Dimitrakopoulou K, Bezerianos A ways in complex diseases. PLoS Comput Biol 7:
26. Xuan NV, Chetty M, Coppel R, Wangikar PP (2007) Growing functional modules from a seed e1001095. doi:10.1371/journal.pcbi.1001095
(2012) Gene regulatory network modeling via protein via integration of protein interaction and 70. Kim Y, Salari R, Wuchty S, Przytycka TM (2013)
global optimization of high-order dynamic Bayes- gene expression data. BMC Bioinformatics 8: 408. Module Cover a new approach to genotype-
ian network. BMC Bioinformatics 13: 131. 50. Tipney H, Hunter L (2010) An introduction to phenotype studies; Pacyfic Synposium on Bio-
27. Zou M, Conzen SD (2005) A new dynamic effective use of enrichment analysis software. computing 18: 103110.
Bayesian network (DBN) approach for identifying Hum Genomics 4: 202206. 71. Vandin F, Upfal E, Raphael BJ (2012) De novo
gene regulatory networks from time course 51. Kim Y, Przytycka T (2012) Bridging the gap discovery of mutated driver pathways in cancer.
microarray data. Bioinformatics 21: 7179. between genotype and phenotype via network Genome Res 22: 375385.
28. Hartwell LH, Hopfield JJ, Leibler S, Murray AW approaches. Frontiers in Genetics special issue on 72. Stranger BE, Forrest MS, Clark AG, Minichiello
(1999) From molecular to modular cell biology. mapping complex disease traits with global gene MJ, Deutsch S, et al. (2005) Genome-wide
Nature 402: C4752. expression. Front Genet 3: 227. doi:10.3389/ associations of gene expression variation in hu-
29. Adamcsek B, Palla G, Farkas IJ, Derenyi I, Vicsek fgene.2012.00227 mans. PLoS Genet 1: e78. doi:10.1371/journal.
T (2006) CFinder: locating cliques and overlap- 52. Ideker T, Ozier O, Schwikowski B, Siegel AF pgen.0010078
ping modules in biological networks. Bioinfor- (2002) Discovering regulatory and signalling 73. Stranger BE, Nica AC, Forrest MS, Dimas A, Bird
matics 22: 10211023. circuits in molecular interaction networks. Bioin- CP, et al. (2007) Population genomics of human
30. Altaf-Ul-Amin M, Shinbo Y, Mihara K, Kur- formatics 18 Suppl 1: S233240. gene expression. Nat Genet 39: 12171224.
okawa K, Kanaya S (2006) Development and 53. ORoak BJ, Vives L, Girirajan S, Karakoc E, 74. Managbanag JR, Witten TM, Bonchev D, Fox LA,
implementation of an algorithm for detection of Krumm N, et al. (2012) Sporadic autism exomes Tsuchiya M, et al. (2008) Shortest-path network
protein complexes in large interaction networks. reveal a highly interconnected protein network of analysis is a useful approach toward identifying
BMC Bioinformatics 7: 207. de novo mutations. Nature 485: 246250. genetic determinants of longevity. PLoS ONE 3:
31. Arnau V, Mars S, Marn I (2005) Iterative Cluster 54. Vandin F, Upfal E, Raphael BJ (2011) Algorithms e3802. doi:10.1371/journal.pone.0003802
Analysis of Protein Interaction Data. Bioinfor- for detecting significantly mutated pathways in 75. Shih YK, Parthasarathy S (2012) A single source k-
matics 21: 364378. cancer. J Comput Biol 18: 507522. shortest paths algorithm to infer regulatory path-
32. Asthana S, King OD, Gibbons FD, Roth FP 55. Rossin EJ, Lage K, Raychaudhuri S, Xavier RJ, ways in a gene network. Bioinformatics 28: i4958.
(2004) Predicting protein complex membership Tatar D, et al. Proteins encoded in genomic 76. Carter GW, Prinz S, Neou C, Shelby JP, Marzolf
using probabilistic network reliability. Genome regions associated with immune-mediated disease B, et al. (2007) Prediction of phenotype and gene
Res 14: 11701175. physically interact and suggest underlying biolo- expression for combinations of mutations. Mol
33. Bader GD, Hogue CW (2003) An automated gy. PLoS Genet 7: e1001273. Syst Biol 3: 96.
method for finding molecular complexes in large 56. Gilman SR, Iossifov I, Levy D, Ronemus M, 77. Bailly-Bechet M, Borgs C, Braunstein A, Chayes
protein interaction networks. BMC Bioinfor- Wigler M, et al. Rare de novo variants associated J, Dagkessamanskaia A, et al. (2011) Finding
matics 4: 2. with autism implicate a large functional network undetected protein associations in cell signaling
34. Bader JS (2003) Greedily building protein net- of genes involved in formation and function of
by belief propagation. Proc Natl Acad Sci U S A
works with confidence. Bioinformatics 19: 1869 synapses. Neuron 70: 898907.
108: 882887.
1874. 57. The Cancer Genome Atlas Research Network
78. Tuncbag N, McCallum S, Huang SS, Fraenkel E
35. Brun C, Chevenet F, Martin D, Wojcik J, (2011) Integrated genomic analyses of ovarian
(2012) SteinerNet: a web server for integrating omic
Guenoche A, et al. (2003) Functional classifica- carcinoma. Nature 474: 609615.
data to discover hidden components of response
tion of proteins for the prediction of cellular 58. Gilman SR, Iossifov I, Levy D, Ronemus M,
pathways. Nucleic Acids Res 40: W505509.
function from a protein-protein interaction net- Wigler M, et al. (2011) Rare de novo variants
79. Tu Z, Wang L, Arbeitman MN, Chen T, Sun F
work. Genome Biol 5: R6. associated with autism implicate a large function-
(2006) An integrative approach for causal gene
36. Dunn R, Dudbridge F, Sanderson CM (2005) al network of genes involved in formation and
identification and gene regulatory pathway infer-
The use of edge-betweenness clustering to function of synapses. Neuron 70: 898907.
ence. Bioinformatics 22: e489496.
investigate biological function in protein interac- 59. Levy D, Ronemus M, Yamrom B, Lee YH,
tion networks. BMC Bioinformatics 6: 39. 80. Suthram S, Beyer A, Karp RM, Eldar Y, Ideker
Leotta A, et al. (2011) Rare de novo and
37. Jiang P, Singh M (2010) SPICi: a fast clustering transmitted copy-number variation in autistic T (2008) eQED: an efficient method for inter-
algorithm for large biological networks. Bioinfor- spectrum disorders. Neuron 70: 886897. preting eQTL associations using protein net-
matics 26: 11051111. works. Mol Syst Biol 4: 162.
60. Rossin EJ, Lage K, Raychaudhuri S, Xavier RJ,
38. King AD, Przulj N, Jurisica I (2004) Protein Tatar D, et al. (2011) Proteins encoded in 81. Yeger-Lotem E, Riva L, Su LJ, Gitler AD,
complex prediction via cost-based clustering. genomic regions associated with immune-mediat- Cashikar AG, et al. (2009) Bridging high-
Bioinformatics 20: 30133020. ed disease physically interact and suggest under- throughput genetic and transcriptional data
39. Luo F, Yang Y, Chen CF, Chang R, Zhou J, et al. lying biology. PLoS Genet 7: e1001273. reveals cellular responses to alpha-synuclein
(2007) Modular organization of protein interac- doi:10.1371/journal.pgen.1001273 toxicity. Nat Genet 41: 316323.
tion networks. Bioinformatics 23: 207214. 61. Chuang HY, Lee E, Liu YT, Lee D, Ideker T 82. Lee E, Jung H, Radivojac P, Kim JW, Lee D
40. Navlakha S, Schatz MC, Kingsford C (2009) (2007) Network-based classification of breast (2009) Analysis of AML genes in dysregulated
Revealing biological modules via graph summa- cancer metastasis. Mol Syst Biol 3: 140. molecular networks. BMC Bioinformatics 10
rization. J Comput Biol 16: 253264. 62. Muller FJ, Laurent LC, Kostka D, Ulitsky I, Suppl 9: S2.
41. Newman ME (2006) Modularity and community Williams R, et al. (2008) Regulatory networks 83. Kohler S, Bauer S, Horn D, Robinson PN (2008)
structure in networks. Proc Natl Acad Sci U S A define phenotypic classes of human stem cell lines. Walking the interactome for prioritization of
103: 85778582. Nature 455: 401405. candidate disease genes. Am J Hum Genet 82:
42. Pereira-Leal JB, Enright AJ, Ouzounis CA (2004) 63. Mani KM, Lefebvre C, Wang K, Lim WK, Basso 949958.
Detection of functional modules from protein K, et al. (2008) A systems biology approach to 84. Missiuro PV, Liu K, Zou L, Ross BC, Zhao G, et
interaction networks. Proteins 54: 4957. prediction of oncogenes and molecular perturbation al. (2009) Information flow analysis of interac-
43. Qi Y, Balem F, Faloutsos C, Klein-Seetharaman targets in B-cell lymphomas. Mol Syst Biol 4: 169. tome networks. PLoS Comput Biol 5: e1000350.
J, Bar-Joseph Z (2008) Protein complex identifi- 64. Wang K NI, Banerjee N, Margolin AA, Califano A. doi:10.1371/journal.pcbi.1000350
cation by supervised graph local clustering. Genome-wide Discovery of Modulators of Tran- 85. Nabieva E, Jim K, Agarwal A, Chazelle B, Singh
Bioinformatics 24: i250258. scriptional Interactions in Human B Lymphocytes; M (2005) Whole-proteome prediction of protein
44. Rives AW, Galitski T (2003) Modular organiza- 2006; Venice. pp. 348362. function via graph-theoretic analysis of interac-
tion of cellular networks. Proc Natl Acad Sci U S A 65. Xue H, Xian B, Dong D, Xia K, Zhu S, et al. tion maps. Bioinformatics 21 Suppl 1: i302310.
100: 11281133. (2007) A modular network model of aging. Mol 86. Newman M (2005) A measure of betweenness
45. Spirin V, Mirny LA (2003) Protein complexes and Syst Biol 3: 147. centrality based on random walks. Social Net-
functional modules in molecular networks. Proc 66. Xia K, Xue H, Dong D, Zhu S, Wang J, et al. works 27: 3954.
Natl Acad Sci U S A 100: 1212312128. (2006) Identification of the proliferation/differen- 87. Stojmirovic A, Yu YK (2007) Information flow in
46. Wang C, Ding C, Yang Q, Holbrook SR (2007) tiation switch in the cellular network of multicel- interaction networks. J Comput Biol 14: 11151143.
Consistent dissection of the protein interaction lular organisms. PLoS Comput Biol 2: e145. 88. Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan
network by combining global and local metrics. doi:10.1371/journal.pcbi.0020145 R (2010) Associating genes and protein complexes
Genome Biol 8: R271. 67. Ulitsky I, Krishnamurthy A, Karp RM, Shamir R with disease via network propagation. PLoS Comput
47. Chen J, Yuan B (2006) Detecting functional (2010) DEGAS: de novo discovery of dysregulat- Biol 6: e1000641. doi:10.1371/journal.pcbi.1000641
modules in the yeast protein-protein interaction ed pathways in human diseases. PLoS ONE 5: 89. Doyle PG SJ (1984) Random walks and electric
network. Bioinformatics 22: 22832290. e13367. doi:10.1371/journal.pone.0013367 networks

PLOS Computational Biology | www.ploscompbiol.org 10 December 2012 | Volume 8 | Issue 12 | e1002820


90. Kim YA, Przytycki JH, Wuchty S, Przytycka TM markers predict response to chemotherapy. of human disease similarities reveals common
(2011) Modeling information flow in biological Bioinformatics 27: i205213. functional modules enriched for pluripotent drug
networks. Phys Biol 8: 035012. 94. Lee E, Chuang HY, Kim JW, Ideker T, Lee D targets. PLoS Comput Biol 6: e1000662.
91. Chowdhury SA, Nibbe RK, Chance MR, (2008) Inferring pathway activity toward precise doi:10.1371/journal.pcbi.1000662
Koyuturk M (2011) Subnetwork state functions disease classification. PLoS Comput Biol 4: 97. Chu LH, Chen BS (2008) Construction of a
define dysregulated subnetworks in cancer. e1000217. doi: 10.1371/journal.pcbi.1000217 cancer-perturbed protein-protein interaction net-
J Comput Biol 18: 263281. 95. Kelley BP, Sharan R, Karp RM, Sittler T, Root work for discovery of apoptosis drug targets.
92. Dao P, Colak R, Salari R, Moser F, Davicioni E, DE, et al. (2003) Conserved pathways within BMC Syst Biol 2: 56.
et al. (2010) Inferring cancer subnetwork markers bacteria and yeast as revealed by global protein 98. Nayak RR, Kearns M, Spielman RS, Cheung
using density-constrained biclustering. Bioinfor- network alignment. Proc Natl Acad Sci U S A VG (2009) Coexpression network based on
matics 26: i625631. 100: 1139411399. natural variation in human gene expression
93. Dao P, Wang K, Collins C, Ester M, Lapuk A, et 96. Suthram S, Dudley JT, Chiang AP, Chen R, reveals gene interactions and functions. Genome
al. (2011) Optimally discriminative subnetwork Hastie TJ, et al. (2010) Network-based elucidation Res 19: 19531962.

PLOS Computational Biology | www.ploscompbiol.org 11 December 2012 | Volume 8 | Issue 12 | e1002820


Education

Chapter 6: Structural Variation and Medical Genomics


Benjamin J. Raphael*
Department of Computer Science and Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America

Abstract: Differences between individual human genomes, or between human and cancer genomes, range in scale from single nucleotide variants (SNVs) through intermediate and large-scale duplications, deletions, and rearrangements of genomic segments. The latter class, called structural variants (SVs), has received considerable attention in the past several years as a previously underappreciated source of variation in human genomes. Much of this recent attention is the result of the availability of higher-resolution technologies for measuring these variants, including microarray-based techniques and, more recently, high-throughput DNA sequencing. We describe the genomic technologies and computational techniques currently used to measure SVs, focusing on applications in human and cancer genomics.

This article is part of the "Translational Bioinformatics" collection for PLOS Computational Biology.

Citation: Raphael BJ (2012) Chapter 6: Structural Variation and Medical Genomics. PLoS Comput Biol 8(12): e1002821. doi:10.1371/journal.pcbi.1002821
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: © 2012 Benjamin J. Raphael. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work is supported by National Institutes of Health (R01 HG005690). BJR is supported by a National Science Foundation CAREER Award (CCF-1053753), a Career Award at the Scientific Interface from the Burroughs Wellcome Fund and an Alfred P. Sloan Research Fellowship. The funders had no role in the preparation of the manuscript.
Competing Interests: The author has declared that no competing interests exist.
* E-mail: braphael@brown.edu

1. Introduction
The decade since the assembly of the human genome has witnessed dramatic advances in understanding the genetic differences that distinguish individual humans and that are responsible for specific traits. Genome-wide association studies (GWAS) in humans have identified common germline, or inherited, DNA variants that are associated with various common human diseases, including diabetes and heart disease. At the same time, cancer genome sequencing studies have cataloged numerous somatic mutations that arise during the lifetime of an individual and that drive cancer progression. These successes are ushering in the era of personalized medicine, where treatment for a disease is tailored to the genetic characteristics of the individual.

Despite this progress, significant hurdles remain in achieving a comprehensive understanding of the genetic basis of human traits and disease. The germline variants discovered by GWAS thus far explain only a small fraction of the heritability of many traits, and this missing heritability gap [1] is a major bottleneck for future GWAS. The somatic mutations measured in cancer genomes are very heterogeneous, with relatively few mutations that are shared by large numbers of cancer patients, even those with the same (sub)type of cancer. This mutational heterogeneity complicates efforts to distinguish functional mutations that drive cancer development from random passenger mutations [2].

Comprehensive studies of the genetic basis of disease require the measurement of all variants that distinguish individual genomes. Until recently, GWAS focused on the measurement of single nucleotide polymorphisms (SNPs), or single nucleotide differences between individual genomes. In the past few years, it has become clear that germline variants occupy a continuum of scales ranging from SNPs to larger structural variants (SVs): duplications, deletions, inversions, and translocations of large (>100 nucleotides) blocks of DNA sequence. Moreover, until recently GWAS focused attention on common SNPs, those whose frequency in the population was at least 5%. This restriction was part of the "common disease, common variant" hypothesis, which posits that an appreciable fraction of susceptibility to common diseases results from germline variants that are common in the population. However, this restriction was also dictated by technological limitations, as it was not cost effective to measure all genetic variants in the large number of individual genomes that are necessary to perform a GWAS.

In the past five years, next-generation DNA sequencing technologies became commercially available from companies such as 454, Illumina, Life Technologies, and Complete Genomics. These and other sequencing technologies continue to advance at a breathtaking pace, and consequently the cost of DNA sequencing has declined by several orders of magnitude in the past decade. These technologies provide an unprecedented opportunity to measure all variants, germline and somatic, SNPs and SVs, in both normal and cancer genomes.

In this chapter, we discuss the application of these sequencing technologies in medical genomics, focusing specifically on the characterization of structural variation.

2. Germline and Somatic Structural Variation
Structural variants are important contributors to genome variation, and consideration of these variants is necessary for disease association and cancer genetics studies. In this section, we briefly review current knowledge about structural variation in human and cancer genomes.

2.1 Germline Structural Variation
Characterizing the DNA sequence differences that distinguish individuals is a major challenge in human genetics. Until a few years ago, the primary focus was to identify single nucleotide polymorphisms (SNPs), and projects such as HapMap [3] provide catalogs of common SNPs in several human populations. Recent

PLOS Computational Biology | www.ploscompbiol.org 1 December 2012 | Volume 8 | Issue 12 | e1002821


What to Learn in This Chapter

• Current knowledge about the prevalence of structural variation in human and cancer genomes.
• Strategies for using microarray and high-throughput DNA sequencing technologies to measure structural variation.
• Computational techniques to detect structural variants from DNA sequencing data.

whole-genome sequencing and microarray measurements have shown that structural variation, including duplications, deletions, and inversions of large blocks of DNA sequence, is common in the human genome [4]. SVs include both copy number variants (duplications and deletions that change the number of copies of a segment of the genome) and balanced rearrangements (such as inversions and translocations) that do not alter the copy number of the genome. The Database of Genomic Variants [5] currently (winter 2011) lists approximately 66 thousand copy number variants and approximately 900 inversion variants in the human genome, and this number continues to increase. Some of these entries are multiple reports of the same variant due to problems in merging SV predictions across different platforms/technologies (see Section 5 below). Nevertheless, SVs are extensive in human populations.

Germline SVs account for a greater share of the total nucleotide differences between two individual human genomes than SNPs [6]. Copy number variants alone account for approximately 18% of genetic variation in gene expression, having little overlap with variation associated with SNPs [7], and can affect the expression of genes up to 300 kb away from the variant [8]. Both common and rare SVs have recently been linked to several human diseases including autism [9] and schizophrenia [10]. In addition to SVs that cause disease, SVs segregating in a population perturb patterns of linkage disequilibrium and haplotype structure [11]. Thus, it is essential to catalog SVs in order to understand their consequences for human population genetics. Incorrect identification of SVs in samples can lead to spurious genetic associations resulting from undetected SVs, erroneous merging of distinct variants in different samples, and failure to recognize heterozygosity at a locus.

Finally, structural variants are also present in model organisms such as mouse and fruit fly. Identifying these variants is important for animal models of human diseases.

2.2 Somatic Structural Variation and Cancer
Cancer is a disease driven by somatic mutations that accumulate during the lifetime of an individual. The inheritance of mutations by daughter cells during mitosis and selection for advantageous mutations make cancer a "microevolutionary process" [12,13] within a population of cells. Decades of cytogenetic studies have shown that somatic structural variants are a feature of many cancer genomes. These early studies, particularly in leukemias and lymphoma, identified a number of recurrent chromosomal rearrangements that are present in many patients with the same type of cancer. For example, a significant fraction of patients with chronic myelogenous leukemia (CML) exhibit a translocation between chromosomes 9 and 22. The breakpoints of this translocation lie in two genes, BCR and ABL, and the translocation results in the BCR-ABL fusion gene that is directly implicated in the development of this cancer. In addition to fusion genes, somatic SVs can also lead to altered expression of oncogenes and tumor suppressor genes due to both genetic and epigenetic mechanisms [14]. For example, in Burkitt's lymphoma, a translocation activates the MYC oncogene by fusing it with a strong promoter.

In solid tumors, the situation is more complicated. Many solid tumors have genomes that are extensively rearranged compared to the normal healthy genome from which they were derived [14]. These highly rearranged genomes are thought to be a product of genome instability resulting from mutations in the DNA repair machinery. This complex organization of cancer genomes obscures functional driver SVs in a background of passenger mutations. However, with the availability of higher-resolution genomics technologies, recurrent fusion genes are also being found in solid tumors, such as prostate [15] and lung cancers [16]. These results suggest that additional events remain to be discovered [17]. Next-generation DNA sequencing technologies provide the opportunity to reconstruct the organization of cancer genomes at single nucleotide resolution [18,19]. Projects including The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov) and the International Cancer Genome Consortium (ICGC) are using these technologies to measure somatic mutations in thousands of cancer genomes [20].

2.3 Mechanisms of Structural Variation
As additional germline and somatic structural variants are characterized, there is increasing opportunity to characterize the mechanisms that produce these variants. A distinguishing feature of the different mechanisms is the amount of sequence similarity, or homology, at the breakpoints of the structural variant. One extreme is little or no sequence similarity. These variants are thought to result from random (or near random) double-stranded breaks in DNA. These breaks might occur due to exposure to external DNA-damaging agents. For example, ultraviolet radiation or various chemotherapy drugs produce double-stranded breaks. Aberrant repair of these breaks results in structural variants. This mechanism is termed non-homologous end-joining (NHEJ) [21,22]. The opposite extreme is high sequence similarity at the breakpoints. This mechanism is termed non-allelic homologous recombination (NAHR). It is similar to the normal biological process of homologous recombination that occurs during meiosis and exchanges DNA between two homologous chromosomes. But as the name states, NAHR is a rearrangement that occurs between homologous sequences that are not the same allele on homologous chromosomes. Rather, NAHR occurs between repetitive sequences on the genome (Figure 1) [23-25]. The human genome contains numerous repetitive sequences ranging from Alu elements of 300 bp to segmental duplications, also called low copy repeats, of tens to hundreds of kbp [26]. Thus, there are numerous substrates for NAHR in the human genome and, not surprisingly, numerous reported structural variants that result from NAHR. For example, the 1000 Genomes Project, a large NIH project to survey all classes of variation (SNPs through SVs) in 1000 human genomes, recently reported that approximately 23% of deletions were a result of NAHR [27]. Importantly, due to technical limitations in discovering NAHR-mediated SVs (see



Figure 1. An inversion resulting from non-allelic homologous recombination (NAHR) between two nearly identical segmental
duplications (blue boxes) with opposite orientations (arrows). The inversion flips the orientation of the subsequence, or block, B in one
genome relative to the other genome.
doi:10.1371/journal.pcbi.1002821.g001
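
The inversion in Figure 1 can be mimicked on a toy sequence. The sketch below is only an illustration under stated assumptions: the sequences, the block names A and C, and the helper functions are all hypothetical, and the repeat is modeled as two perfectly identical copies in opposite orientation. It inverts block B by replacing it with its reverse complement, and also shows that inverting the whole repeat-flanked region gives the same result, which hints at why breakpoints inside identical repeats are hard to localize.

```python
# Toy model of the inversion in Figure 1. All sequences and names are made up.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def invert(genome, start, end):
    """Return genome with the segment genome[start:end] inverted."""
    return genome[:start] + reverse_complement(genome[start:end]) + genome[end:]


# Hypothetical reference: A, repeat, B, repeat in opposite orientation, C.
repeat = "ACGGT"
block_a, block_b, block_c = "AAAA", "CCGTA", "TTTT"
reference = block_a + repeat + block_b + reverse_complement(repeat) + block_c

# Invert only block B.
b_start = len(block_a) + len(repeat)
b_end = b_start + len(block_b)
inverted = invert(reference, b_start, b_end)

# Invert the whole region from the start of the first repeat copy to the
# end of the second; with identical inverted repeats the result is the same.
wide = invert(reference, len(block_a), b_end + len(repeat))

print(reference)
print(inverted)
print(inverted == wide)
```

Real segmental duplications are only nearly identical, so this equality holds only approximately in practice.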

below), this percentage may be an underestimate.

There are other mechanisms for the formation of SVs. The division between homology-mediated and non-homologous mechanisms may not be so strict. NHEJ events sometimes have some degree of microhomology (e.g. 2-25 bp of similarity) at their breakpoints. Other mechanisms such as fork stalling and template switching (FoSTeS) have also been proposed. Some of these are reviewed in [28]. Finally, the relative contribution of each of these mechanisms in generating germline SVs versus somatic SVs remains an active area of investigation, with conflicting reports about the importance of repetitive sequences in somatic structural variants found in cancer genomes [21,22,24,25,29].

3. Technologies for Measurement of Structural Variation
Structural variants vary widely in size and complexity, ranging from insertions/deletions of hundreds of nucleotides to large-scale chromosomal rearrangements. Large structural variants can be visualized directly on chromosomes through cytogenetic techniques such as chromosome painting, spectral karyotyping (SKY), or fluorescent in situ hybridization (FISH). In fact, Sturtevant and Dobzhansky studied inversion polymorphisms in Drosophila in the 1920s, well before the modern genomics era. However, SVs that are too small to be directly observed on chromosomes are generally more difficult to detect and to characterize than single nucleotide polymorphisms (SNPs). Much of the recent excitement surrounding structural variation stems from improvements in genomics technologies that allow more complete measurements of SVs of all types. These include microarrays and, more recently, next-generation DNA sequencing technologies. In this section, we briefly describe these technologies.

3.1 Microarrays
The first genome-wide surveys of SVs in the human genome in 2004 utilized microarray-based techniques such as array comparative genomic hybridization (aCGH). In aCGH, differentially fluorescently labeled DNA from an individual, or test, genome and a reference genome are hybridized to an array of genomic probes derived from the reference genome. Measurement of the test:reference fluorescence ratio, called the copy number ratio, at each probe identifies locations of the test genome that are present in higher or lower copy number than in the reference genome. Microarrays containing hundreds of thousands of probes are available, and thus one obtains copy number ratios at hundreds of thousands of locations. Since individual copy number ratios are subject to various types of experimental error, computational techniques are needed to analyze aCGH data. For further details about aCGH and aCGH analysis, see [30].

aCGH is equally applicable for measurement of germline SVs in normal genomes and somatic SVs in cancer genomes. In fact, aCGH was originally developed for cancer genomics applications. aCGH is now very affordable, making it possible to detect copy number variants in large numbers of genomes at reasonable cost. However, aCGH has two important limitations. First, because aCGH measures only differences in the number of copies of a genomic region between a test and reference genome, aCGH detects only copy number variants. Thus, aCGH is blind to copy-neutral, or balanced, variants such as inversions or reciprocal translocations. Second, aCGH requires that the genomic probes from the reference genome lie in non-repetitive regions, making it difficult to detect SVs with breakpoints in repetitive regions, such as NAHR events or the insertion/deletion of repetitive sequences.

3.2 Next-generation DNA Sequencing Technologies
DNA sequencing technology has advanced dramatically in recent years, and several "next-generation" DNA sequencing technologies from companies such as Illumina, ABI, and 454 have significantly lowered the cost of sequencing DNA. However, these technologies, and the Sanger sequencing technique they are replacing, are severely limited in the length of a DNA molecule that can be sequenced. Present sequencing technologies produce short sequences of DNA, called reads, that range from 25-1000 nucleotides, or base pairs (bp), with the upper end of this range requiring technologies (e.g. Sanger and 454) that are considerably more expensive. Much of the recent excitement in DNA sequencing has been in short read DNA sequencers (e.g. Illumina Genome Analyzer, Life Technologies SOLiD and Ion Torrent) that yield reads of only 25-150 nucleotides. These reads are much shorter than the one to two hundred million bp of a typical human chromosome. However, the large



Figure 2. Two major approaches to detect structural variants in an individual genome from next-generation sequencing data are de
novo assembly and resequencing. In de novo assembly, the individual genome sequence is constructed by examining overlaps between reads. In
resequencing approaches, reads from the individual genome are aligned to a closely related reference genome. Examination of the resulting
alignments reveals differences between the individual genome and the reference genome.
doi:10.1371/journal.pcbi.1002821.g002
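
The two approaches in Figure 2 can be sketched on toy data. The code below is a deliberately simplified illustration, not how real assemblers or aligners work: it assumes error-free reads, exact suffix-prefix overlaps for the assembly step, and ungapped alignment scored by mismatch count for the resequencing step. All function names and sequences are hypothetical.

```python
# Toy illustration of the two approaches in Figure 2 (hypothetical helpers).


def suffix_prefix_overlap(a, b, min_len=3):
    """De novo assembly building block: length of the longest suffix of
    read a that exactly matches a prefix of read b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def best_alignment(read, reference):
    """Resequencing building block: ungapped alignment of a read to the
    reference, returning (position, mismatches) with the fewest mismatches."""
    best_pos, best_mm = None, None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(1 for x, y in zip(read, window) if x != y)
        if best_mm is None or mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


def differences(read, reference):
    """List (reference position, reference base, read base) for each
    mismatch in the best ungapped alignment of the read."""
    pos, _ = best_alignment(read, reference)
    return [(pos + i, reference[pos + i], base)
            for i, base in enumerate(read) if base != reference[pos + i]]


reference = "ACGTTGCAAGGCTA"
individual = "ACGTTGCATGGCTA"   # differs from the reference at position 8
read_1 = individual[0:7]        # "ACGTTGC"
read_2 = individual[3:10]       # "TTGCATG"
read_3 = individual[5:12]       # "GCATGGC"

print(suffix_prefix_overlap(read_1, read_2))   # overlap used in assembly
print(best_alignment(read_3, reference))       # where the read aligns
print(differences(read_3, reference))          # variant seen in the alignment
```

A single-base difference is used here only to keep the example short; detecting structural variants from alignments needs more than mismatch counting.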

number of reads that are produced (hundreds of millions) results in a cost per nucleotide that is several orders of magnitude lower than Sanger sequencing.

Many DNA sequencing technologies employ a paired end, or mate pair, sequencing protocol to increase the effective read length. In this protocol two reads are generated from opposite ends of a longer DNA fragment, or insert. With earlier Sanger sequencing protocols, the sizes of these DNA fragments were dictated by the cloning vector that was used. Fragment, or insert, sizes of 2 kb-150 kb could be obtained by cloning into bacterial plasmids or bacterial artificial chromosomes (BACs). With next-generation technologies, a variety of techniques have been employed to generate paired reads. At present, the most efficient and effective techniques produce paired reads from fragments of only a few hundred bp, although fragments of 2-3 kb are available. Thus, next-generation sequencing technologies have both limited read lengths and limited insert sizes compared to Sanger sequencing.

There are two approaches to detecting SVs from next-generation DNA sequencing data (Figure 2). The first is de novo assembly. In this approach, sophisticated algorithms are used to reconstruct the genome sequence from overlaps between reads. The assembled genome sequence is then compared to the reference genome, or the assembled genomes of other individuals, to identify all types of variants. If the genome sequence is successfully assembled, this approach is the best for characterization of SVs. Unfortunately, assembling a human genome de novo (i.e. with no prior information) to sufficient quality for structural variation studies remains difficult with limited read lengths. Currently, human genome assemblies are highly fragmented, consisting of tens to hundreds of thousands of contigs, intermediate sized sequences of thousands to tens of thousands of nucleotides. Moreover, the associations between some structural variants and repetitive sequences imply that assemblies of finished (not draft) quality are necessary for comprehensive coverage of structural variation. Improving de novo assembly is a very active research area (see [31]), but human genome assemblies of high enough quality for SV studies remain out of reach for inexpensive short-read technologies.

The second approach to detect SVs in next-generation DNA sequencing data is a resequencing approach that leverages the extensive finishing efforts undertaken in the Human Genome Project. In a resequencing approach, one finds differences between an individual genome and a closely related reference genome whose sequence is known by aligning reads from the individual genome to the reference genome. Differences (variants) between the genomes correspond to differences between the aligned reads and the reference sequence. In the next section, we describe how to predict SVs using a resequencing approach.

3.3 New DNA Sequencing Technologies
Many of the challenges in reliable measurement of SVs described above are related to limitations in sequencing technologies. In particular, SVs with breakpoints in highly-repetitive sequences are beyond the abilities of current technologies. New third-generation and single-molecule technologies promise additional advantages for structural variation discovery. These advantages include longer read lengths, easier sample preparation, lower input DNA requirements, and higher throughput. For example, Pacific Biosciences recently released their Single-Molecule Real Time (SMRT) sequencing, a technology that measures in real time the incorporation of nucleotides by a single DNA polymerase molecule immobilized in a nanoscale well (a zero-mode waveguide) [32].

One application of this technology is strobe sequencing. A strobe read, or strobe, consists of multiple subreads from a single contiguous molecule of DNA. These subreads are separated by a number of "dark" nucleotides (called advances), whose identity is unknown (Figure 3). Thus far, Pacific Biosciences has demonstrated strobes of lengths up to 20 kb with 2-4 subreads, each of 50-400 bp. Additional improvements are expected as the technology matures. Strobes generalize the concept of

Figure 3. A strobe with 3 subreads.
doi:10.1371/journal.pcbi.1002821.g003

promise for a dramatic shift in DNA sequencing where extremely long reads (tens of kb) are generated, making both de novo assembly and variant detection by resequencing straightforward problems.

4. Resequencing Strategies for Structural Variation
A resequencing strategy predicts SVs from alignments of sequence reads to the reference genome. There are two main steps in any resequencing strategy: (1) alignment of reads; (2) prediction of SVs from alignments. Resequencing approaches are straightforward in principle, but in practice sensitive and specific detection of structural variation in human genomes is notoriously difficult [34,35]. While some types of SVs are easy to detect with next-generation sequencing technologies, other complex SVs are refractory to detection. This is due to both technological limitations and biological features of SVs. DNA sequencing technologies produce reads with sequencing errors, have limited read lengths and insert sizes, and have other sampling biases (e.g. in GC-rich regions).

specialized task of aligning millions to billions of individual short reads led to the development of new software programs tailored to this task, such as Maq, BWA, Bowtie/Bowtie2, BFAST, mrsFAST, etc. [38-43]. A key decision in read alignment for SV detection is whether to consider only reads with a single, best alignment to the reference genome, or to also include reads with multiple high-quality alignments. Some read alignment programs will output only a single alignment for each read, in some cases choosing an alignment randomly if there are multiple alignments of equal score. If one uses only reads with a unique alignment, then there is limited power to detect SVs whose breakpoints lie in repetitive regions, such as SVs resulting from NAHR. On the other hand, if one allows reads whose alignment is ambiguous, then the problem of SV prediction requires an algorithm to distinguish among the multiple possible alignments for each read. Many SV prediction algorithms analyze only unique alignments, although several recent algorithms use ambiguous alignments. A few of these are noted below.
paired reads by including more than two Biologically, human SVs are: (i) enriched
reads from a single DNA fragment. for repetitive sequences near their break- 4.2 Split Reads
Strobes provide long-range sequence in- points [23]; (ii) may overlap, have multiple A direct approach to detect structural
formation with low input DNA require- states or complex architectures; and (iii) variants from aligned reads is to identify
ments, a feature missing from current recurrent (but not identical) variants may reads whose alignments to the reference
sequencing technologies. This additional exist at the same locus [36,37]. These genome are in two parts. These so called
information is useful for detection and de properties mean that the alignment of split reads contain the breakpoint of the
novo assembly of complex SV that lie in reads to the reference genome and the structural variant (Figure 4). To reduce
highly repetitive regions, or contain mul- prediction of SVs from these alignments is false positive predictions of structural
tiple breakpoints in a small region. How- not always an easy task. Algorithms are variants, one requires the presence of
ever, the advantages of strobes are reduced required to make highly sensitive and specific multiple split reads sharing the same
by higher single-nucleotide error rates. predictions of SVs. breakpoint. Because the two parts of a
Thus, realizing the advantages of strobes In this section we review the main issues split read align independently to the
requires new algorithms that exploit infor- in predicting SVs using a resequencing reference genome, these alignments must
mation from multiple, spaced subreads to approach. We begin with read alignment. be long enough to be aligned uniquely (or
overcome high single-nucleotide error Then we describe the three major ap- with little ambiguity) to the reference.
rates [33]. proaches that are used to identify struc- Thus, split read analysis is a feasible
Sequencing technologies continues its tural variants from aligned reads: (i) split strategy only when the reads are suffi-
rapid development. Improvements in the reads; (ii) depth of coverage analysis; and ciently long. For example, if one has a
chemistry, imaging, and manufacture of (iii) paired-end mapping. 36 bp read containing the breakpoint of
existing technologies are increasing their an SV at its midpoint, one must align the
read lengths, insert lengths, and through- 4.1 Read Alignment two 18 bp halves of the read to the
put. Additional sequencing technologies Alignment of reads to a reference reference genome. Finding unique align-
are under active development. Nanopore- genome is a special case of sequence ments of an 18 bp sequence is often not
based technologies that directly read the alignment, one of the most researched possible. There are no reports of successful
nucleotides of long molecules of DNA hold problems in bioinformatics. However, the prediction of structural variants from split



Figure 4. Identification of a deletion in an individual genome by split read analysis (middle), and by depth of coverage analysis
(bottom).
doi:10.1371/journal.pcbi.1002821.g004
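The depth-of-coverage signal illustrated at the bottom of Figure 4 can be made concrete with a short calculation. The sketch below is only an illustration under simplifying assumptions: it uses the uniform-sampling coverage formula c = NL/G of the Lander-Waterman model together with a Poisson approximation for read depth, and it assumes a diploid baseline of two copies. The function names and the numbers in the usage lines are hypothetical and are not taken from the chapter.

```python
import math


def expected_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Mean coverage c = N * L / G under uniform sampling of reads."""
    return num_reads * read_length / genome_length


def expected_depth(mean_coverage: float, copies: int, ploidy: int = 2) -> float:
    """Expected depth over an interval present in `copies` copies.

    Assumption of this sketch: depth scales linearly with copy number
    relative to a diploid baseline (ploidy = 2).
    """
    return mean_coverage * copies / ploidy


def poisson_pmf(k: int, mean: float) -> float:
    """P(depth = k) for a Poisson distribution, computed in log space."""
    if mean == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def most_likely_copies(observed_depth: int, mean_coverage: float,
                       max_copies: int = 6) -> int:
    """Copy number in 0..max_copies with the highest Poisson likelihood."""
    return max(
        range(max_copies + 1),
        key=lambda n: poisson_pmf(observed_depth, expected_depth(mean_coverage, n)),
    )


# Hypothetical experiment: 900 million reads of 100 bp on a 3 Gb genome.
c = expected_coverage(num_reads=900_000_000, read_length=100,
                      genome_length=3_000_000_000)
print(c)                            # 30.0
print(expected_depth(c, copies=1))  # 15.0 (heterozygous deletion)
print(most_likely_copies(14, c))    # 1
```

With these assumed numbers, a heterozygous deletion is expected to show a depth of about 15 against a genome-wide mean of 30. Repeats and GC bias, which this sketch ignores, would blur that separation in real data.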

reads alone using next-generation DNA sequencing reads less than 50 bp in length. Instead, split read methods have been proposed that use paired reads, and require that one read in the pair has a full-length alignment to the reference. This alignment of the read from one end of the fragment is used to anchor the search for alignments of the other, split read of the fragment [44–46].

4.3 Depth of Coverage

Depth of coverage (also called read depth) analysis detects differences in the number of reads that align to intervals in the reference genome. Assuming that reads are sampled uniformly from the genome sequence, the number of reads that contain a given nucleotide of the reference is, on average, $c = \frac{NL}{G}$, where N is the number of reads, L is the length of each read, and G is the length of the genome. This is the Lander-Waterman model, and the parameter c, called the coverage, is a key parameter in a sequencing experiment. For example, recent cancer sequencing projects with Illumina technology have used 30X coverage, which means that the number of reads and length of reads are chosen such that $c = 30$.

Now, if the individual genome contained a deletion of a segment of the human reference genome, the coverage of this segment would be reduced by half if the deletion was heterozygous or reduced to zero if the deletion was homozygous (Figure 4). Similarly, if an interval of the reference genome was duplicated, or amplified, in the individual genome, the coverage of this interval would increase in proportion to the number of copies. Thus, the observed coverage of an interval of the reference genome, the depth of coverage, gives an indication of the number of copies of this interval in the individual genome. Of course, there are numerous additional factors to consider beyond this simple analysis. For example, since reads are sampled at random from the genome, coverage is not constant, but rather follows a distribution with mean c. A Poisson distribution is typically used as an approximation to this distribution, although other distributions sometimes provide a better fit to the data. In addition, repetitive sequences in the reference genome and biases in sequencing (e.g. different coverage of GC-rich regions) also affect depth of coverage calculations. Nevertheless, there are several computational methods for depth of coverage analysis [47,48]. Many of these are largely similar to those used to analyze microarray copy number data.

4.4 Paired-end Sequencing and Mapping

The most common approach for resequencing SVs is paired-end mapping (PEM) (Figure 5). Paired-end mapping was used to identify somatic SVs in cancer



Figure 5. Paired end mapping (PEM). Fragments from an individual genome are sequenced from both ends and the resulting paired reads are
aligned to a reference genome. Most paired reads correspond to concordant pairs, where the distance between the alignment of each read agrees
with the distribution of fragment lengths (right). The remaining discordant pairs suggest structural variants (here a deletion) that distinguish the
individual and reference genomes.
doi:10.1371/journal.pcbi.1002821.g005
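The distinction between concordant and discordant pairs in Figure 5 can be sketched in a few lines. This is a minimal illustration, assuming each aligned pair has been reduced to the chromosome and coordinate of its two reads and that `l_min` and `l_max` bound the empirical fragment length distribution. The `ReadPair` type, its field names, and the thresholds in the usage lines are hypothetical; read orientation (strand) is ignored, so inversions are not detected here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    chrom1: str  # chromosome of the read with the lower coordinate
    start: int   # leftmost aligned position of that read
    chrom2: str  # chromosome of the other read
    end: int     # rightmost aligned position of the other read


def classify_pair(pair: ReadPair, l_min: int, l_max: int) -> str:
    """Label a pair by comparing its mapped span with [l_min, l_max]."""
    if pair.chrom1 != pair.chrom2:
        return "discordant: possible translocation"
    span = pair.end - pair.start
    if span > l_max:
        # Reads map farther apart than any sequenced fragment allows.
        return "discordant: possible deletion"
    if span < l_min:
        # Not discussed in the text; reported without interpretation.
        return "discordant: span too short"
    return "concordant"


pairs = [
    ReadPair("chr1", 1000, "chr1", 1300),
    ReadPair("chr1", 1000, "chr1", 6000),
    ReadPair("chr1", 1000, "chr2", 1300),
]
for p in pairs:
    print(p, classify_pair(p, l_min=200, l_max=400))
```

A single discordant pair may equally reflect a sequencing or alignment error, so in practice such labels are only the input to a clustering step that looks for several pairs indicating the same variant.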

genomes [49,50] and the same idea has been applied to identify germline structural variants [51,52]. While the early paired-end mapping studies used older clone-based sequencing, paired-end mapping is now possible using various next-generation sequencing technologies.

In PEM, a paired-end sequencing protocol is used to obtain paired reads from opposite ends of a larger DNA fragment, or clone, from an individual genome. These paired reads are then aligned to a reference genome. Most paired reads are concordant pairs, in which the distance between the aligned reads is consistent with the fragment length. In contrast, discordant pairs have alignments at an abnormal distance or on different chromosomes. These suggest the presence of an SV or a sequencing error. For example, a discordant pair whose distance between alignments is too long suggests a deletion in the individual genome (Figure 5), while a discordant pair whose alignments are on different chromosomes suggests a translocation. Other types of discordant pairs identify inversions, transpositions, or duplications that distinguish the individual genome from the reference genome. Note that in general the length of any particular sequenced fragment is not known. Rather, during the preparation of genomic DNA for sequencing, the DNA is fragmented and fragments are size-selected to an appropriate target length. It is desirable for this size selection to be as strict as possible, so that only fragments near the target length are sequenced. However, in practice the size selection procedure produces fragments whose lengths vary around the target length. Typically, the distribution of fragment lengths is obtained empirically by examining the distances between all aligned paired reads, as most fragments will correspond to a concordant pair (Figure 5).

To distinguish real SVs from sequencing errors, one looks for clusters of discordant pairs that indicate the same SV. Numerous algorithms have been developed to predict SVs by finding clusters of discordant pairs. Early algorithms used only those paired reads whose alignments to the reference genome were non-ambiguous; i.e. there was only a single best alignment [53–55]. More sophisticated algorithms use paired reads with multiple ambiguous alignments to the reference genome and use a variety of combinatorial and statistical techniques to select among these alignments [56–58]. Finally, some approaches model the fact that the human genome is diploid to avoid making inconsistent structural variant predictions [59].

All of the approaches above rely on predicting structural variants that are supported by multiple paired reads. Some, but not all, of them are careful when determining whether a group of paired reads genuinely supports the same variant. We illustrate the issue here using the Geometric Analysis of Structural Variants (GASV) method of [55]. A key feature of GASV is that it records both the information that the paired reads reveal about the boundaries (breakpoints) of the structural variant and the uncertainty associated with this measurement. Most types of SV, including deletions, inversions, and translocations, have two breakpoints a and b where the reference genome is cut. The segments adjacent to these coordinates are then pasted together in a way that is particular to the type of SV. For example, a deletion is defined by coordinates a and b in the reference genome such that the nucleotide at position a is joined to the nucleotide at position b in the individual genome (Figure 6). Note that this is a simplification of the underlying biology, as there are sometimes small insertions or deletions at breakpoints, but these small changes have limited effect on the analysis of larger structural variants.

Now the discordant pairs that indicate an SV have the property that the locations of the read alignments are near the breakpoints a and b. However, a paired read does not give independent information about the breakpoint a and the breakpoint b. Rather, the breakpoints a and b are related by a linear inequality that defines a polygon in 2D genome space called the breakpoint region (Figure 6). For example, suppose that the pair of reads from a single fragment align to the same chromosome of the reference genome such that the read with lower coordinate starts at position x in the reference and the read with higher coordinate ends at position y in the reference. (For simplicity, we ignore the fact that the sequence of a read can align to either strand (forward or reverse) of the reference genome. The strand of an alignment gives additional information about the location of the breakpoint. See [55] for further details.) If the sequenced fragment has length L then the breakpoints a and b satisfy the equation $(a - x) + (y - b) = L$. As described above, the size of any particular fragment is typically unknown. Rather, one defines a minimum size $L_{\min}$ and maximum size $L_{\max}$ of a sequenced fragment, perhaps according to the empirical fragment length distribution. Thus, we have the inequality

$$L_{\min} \le (a - x) + (y - b) \le L_{\max}.$$

This inequality constrains the unknown breakpoints a and b in terms of the known



Figure 6. (Top) A discordant pair (arc) indicates a deletion with unknown breakpoints a and b located in orange blocks. Positions x, y and the
minimum Lmin and maximum Lmax length of end-sequenced fragments constrain breakpoints (a,b) to lie within the indicated orange blocks, and are
governed by the indicated linear inequalities. (Bottom) A polygon in 2D genome space expresses the linear dependency between breakpoints a and
b and records the uncertainty in the location of the breakpoints.
doi:10.1371/journal.pcbi.1002821.g006
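The breakpoint region of Figure 6 can be explored with a small sketch. It encodes the linear inequality relating a and b to x, y, Lmin and Lmax, and tests whether the regions of two discordant pairs share at least one point. This is not the GASV implementation, which relies on computational geometry algorithms for polygon intersection; here the test is reduced to interval arithmetic on d = a − b. The side conditions a ≥ x and b ≤ y, the `DiscordantPair` type, and all coordinates in the usage lines are assumptions made for illustration.

```python
from typing import NamedTuple


class DiscordantPair(NamedTuple):
    x: int  # start of the read with the lower coordinate
    y: int  # end of the read with the higher coordinate


def in_breakpoint_region(a: int, b: int, pair: DiscordantPair,
                         l_min: int, l_max: int) -> bool:
    """True if l_min <= (a - x) + (y - b) <= l_max.

    The side conditions a >= x and b <= y are an assumption of this sketch.
    """
    implied_length = (a - pair.x) + (pair.y - b)
    return a >= pair.x and b <= pair.y and l_min <= implied_length <= l_max


def regions_intersect(p: DiscordantPair, q: DiscordantPair,
                      l_min: int, l_max: int) -> bool:
    """True if some (a, b) lies in the breakpoint regions of both pairs."""
    # Each pair bounds d = a - b: l_min - (y - x) <= d <= l_max - (y - x).
    lo = max(l_min - (p.y - p.x), l_min - (q.y - q.x))
    hi = min(l_max - (p.y - p.x), l_max - (q.y - q.x))
    # The side conditions force a - b >= max(x) - min(y).
    lo = max(lo, max(p.x, q.x) - min(p.y, q.y))
    return lo <= hi


# Hypothetical deletion with breakpoints a = 1500 and b = 6500.
p = DiscordantPair(x=1300, y=6600)
q = DiscordantPair(x=1400, y=6750)
r = DiscordantPair(x=20000, y=26000)  # pair from an unrelated locus

print(in_breakpoint_region(1500, 6500, p, 200, 400))  # True
print(regions_intersect(p, q, 200, 400))              # True
print(regions_intersect(p, r, 200, 400))              # False
```

Pairs whose regions intersect are candidates for supporting the same variant, and the intersection itself narrows the possible location of (a, b).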

coordinates x and y of the aligned reads and the length of sequenced fragments. The pairs of breakpoints (a,b) that satisfy this inequality form a polygon (specifically a trapezoid) in two-dimensional genome space. We define the breakpoint region B of discordant pair (x,y) to be the set of breakpoints (a,b) satisfying the above inequality.

This geometric representation provides a principled way to combine information across multiple paired reads: multiple paired reads indicate the same variant if their corresponding breakpoint regions intersect. The geometric representation also provides precise breakpoint localization by multiple paired reads; separates multiple measurements of the same variant from measurements of nearby or overlapping variants; and facilitates robust comparisons across multiple samples and measurement technologies. Finally, the approach is computationally efficient as it relies on computational geometry algorithms for polygon intersection. These scale to millions of discordant pairs that result from next-generation sequencing platforms.

While the algorithms above consider many of the issues in prediction of structural variants, there remains room for improvement. Most notably, many algorithms still use only one of the possible signals of structural variants: read depth, split reads, or paired reads. Improvements in specificity are likely possible by integrating these multiple signals into a single prediction algorithm [60].

5. Representation of Structural Variants

Next-generation DNA sequencing technologies are dramatically reducing the cost of sequence-based surveys of structural variants, while oligonucleotide aCGH techniques are now used in studies profiling tens of thousands of genomes. Large projects like the 1000 Genomes Project and The Cancer Genome Atlas (TCGA) are performing paired-end sequencing and aCGH of many human genomes, and matched tumor and normal genomes, respectively. At the same time, smaller or single-investigator projects are using a variety of paired-end sequencing approaches and/or microarray-based techniques with different trade-offs in cost per sample vs. measurement resolution. Thus, in the near future there will be an enormous number of measurements of SVs, but using a wide range of technologies of varying resolution, sensitivity, and specificity. This diversity of approaches will likely continue for some time as investigators explore trade-offs between the cost of measuring variants in one



sample with high confidence versus surveying variants in many samples with lower confidence per sample. For example, in cancer genome studies the goal of finding recurrent mutations demands a survey of many genomes, and thus large sample sizes might be preferred over high-coverage sequencing of one sample.

The problem of comparing variants across samples and/or measurement platforms is less studied than the problem of detecting variants in a single sample. Standard practice remains to use heuristics that merge predicted structural variants into the same variant if they overlap by a significant fraction (e.g. 50–70%) on the reference genome. For example, the Database of Genomic Variants (DGV) [5], arguably the most comprehensive repository of measured human structural variants, merges structural variant predictions whose coordinates overlap by 70% on the reference genome. Such heuristics are typically the only approach available to databases of human structural variants because many early studies did not report information on the uncertainty (i.e. error bars) in the boundaries (breakpoints) of the variant. This situation makes it difficult to explicitly separate multiple measurements of the same variant from measurements of nearby variants or overlapping variants. This situation is now improving, and more recent software records both the information that the measurement reveals about the breakpoints of the structural variant and the uncertainty associated with this measurement. Software that uses this uncertainty to classify and compare SVs across samples and measurement platforms is also now available [55]. Such precision provides increased confidence in associations between a structural variant and a disease, helps separate germline from somatic structural variants in cancer genome sequencing projects, and aids in the study of rare recurrent variants that might occur on a variety of genetic backgrounds.

6. Challenges for Cancer Genomics Studies

The study of somatic structural variation in cancer genomes presents additional challenges beyond those described above for generic resequencing approaches. First, most cancer genomes are aneuploid, meaning that the number of copies of regions of the genome is variable, due to duplications and deletions of segments of the normal genome. High-resolution reconstructions of cancer genomes by paired read sequencing showed that many rearrangements were too small to be detected by cytogenetics, and identified highly rearranged genomic loci that encompass a complex intertwining of rearrangement and duplication [21,29,49,50,61–63]. Such highly rearranged loci are hypothesized to result from genome instability caused by defective DNA repair in cancer cells, or from external DNA damage. An extreme example is the phenomenon of chromothripsis that results from massive, simultaneous breakage and aberrant repair of many genomic loci [64]. Identifying all of the SVs and thereby reconstructing the organization of cancer genomes can suggest that certain regions of the genome are selected for their pathogenetic properties, and also lend insight into the mechanisms of genome instability in tumors [14].

A second challenge is that cancer tissues are a heterogeneous mixture of cells with possibly different numbers of mutations. This heterogeneity includes admixture between normal and cancer cells, as well as subpopulations of tumor cells. Some of these subpopulations might contain important driver mutations, or drug resistance mutations. Because of the amount of DNA required for current sequencing technologies, most cancer genome sequencing studies do not sequence single tumor cells but rather sequence a mixture of cells (Figure 7). Since the signal for detecting variants is proportional to the number of cells in the mixture that contain the variant, the presence of normal cells will reduce the power to detect somatic mutations. Further, the ability to detect mutations that are rare in the tumor cell population will be even lower. Eventually, whole genome sequencing of single cells will provide fascinating datasets to study cancer genome evolution, with some recent hints of the discoveries to come in [65].

Figure 7. Mutation, selection, and clonal expansion in tumor development lead to genomic heterogeneity between cells in a tumor. Current DNA sequencing approaches sequence DNA from many cells and thus result in a heterogeneous mixture of mutations, with varying numbers of both passenger mutations (black) and driver mutations (red).
doi:10.1371/journal.pcbi.1002821.g007

7. Future Prospects

This chapter described the challenges in identification and characterization of structural variants. With further improvements in sequencing technologies and algorithms over the next few years, it will be possible to systematically measure nearly all but the most complex variants in an individual genome. The most difficult cases, such as variants mediated by homologous recombination between nearly identical sequences, might remain inaccessible until significantly different types of DNA sequencing technologies become available. Nevertheless, the fact that systematic identification of nearly all germline and somatic structural variants in



an individual genome is now possible will enable further progress in human and cancer genetics.

For genetic association studies, having complete lists of germline variants from many individuals means that unexplained heritability for a trait cannot readily be blamed on lack of measurement of genetic information. Unfortunately, this does not necessarily imply that finding the genetic basis for specific traits will become easy. There remain other challenges, including the possibility that combinations of variants, interactions between genetic and environmental factors, or other epigenetic mechanisms, may contribute to phenotype. See [66] in this collection for further discussion of these issues. Finally, translating genetic information about susceptibility to a disease or efficacy of particular treatments into improved medical outcomes will require additional work.

The opportunities and challenges are similar in cancer genetics. Systematic measurement of all somatic mutations will yield information that might guide treatments, and eventually lead to personalized oncology. Current cancer treatments are limited by the non-specificity of most cancer drugs and by the fact that cancer cells can evolve resistance to single drug treatments. Tailoring of treatment to the particular genetic mutations in a tumor promises to revolutionize cancer therapy. There are already several examples of such personalized treatments, including the drug Gleevec that targets the BCR-ABL fusion gene in chronic myelogenous leukemia (CML) and Iressa that targets the EGFR gene in lung cancer. Discovery of additional cancer-specific drug targets requires not only technologies to globally survey somatic mutations in cancer genomes, but also techniques (experimental and/or computational) to classify the subset of variants that are functional, and then the further subset of these functional variants that are druggable.

The sequencing technologies and algorithms described in this chapter are laying the foundation for personalized medicine, but much work remains to translate the information revealed by genome sequencing into improved clinical practice.

8. Exercises

(1) Consider the chromosomal inversion in Figure 1. What signals in next-generation sequencing data can be used to detect a chromosomal inversion?
(2) The human genome is diploid with two copies, maternal and paternal, of each chromosome. What constraints does this place on prediction of germline structural variants?

Answers to the Exercises can be found in Text S1.

Supporting Information

Text S1 Answers to Exercises.
(PDF)

Further Reading

• Alkan C, Coe BP, Eichler EE (2011) Genome structural variation discovery and genotyping. Nat Rev Genet 12: 363–376.
• Mardis ER (2012) Genome sequencing and cancer. Curr Opin Genet Dev 22: 245–250.
• Meyerson M, Gabriel S, Getz G (2010) Advances in understanding cancer genomes through second-generation sequencing. Nat Rev Genet 11: 685–696.
• Medvedev P, Stanciu M, Brudno M (2009) Computational methods for discovering structural variation with next-generation sequencing. Nat Methods 6: 13–20.
• Sindi S, Helman E, Bashir A, Raphael BJ (2009) A geometric approach for classification and comparison of structural variants. Bioinformatics 25: i222–i230.
• Stratton MR (2011) Exploring the genomes of cancer cells: progress and promise. Science 331: 1553–1558.

References
1. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, et al. (2009) Finding the missing heritability of complex diseases. Nature 461: 747–753.
2. Stratton MR (2011) Exploring the genomes of cancer cells: progress and promise. Science 331: 1553–1558.
3. Frazer K, Ballinger D, Cox D, Hinds D, Stuve L, et al. (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–861.
4. Sharp AJ, Cheng Z, Eichler EE (2006) Structural variation of the human genome. Annu Rev Genomics Hum Genet 7: 407–442.
5. Iafrate A, Feuk L, Rivera M, Listewnik M, Donahoe P, et al. (2004) Detection of large-scale variation in the human genome. Nat Genet 36: 949–951.
6. Redon R, Ishikawa S, Fitch K, Feuk L, Perry G, et al. (2006) Global variation in copy number in the human genome. Nature 444: 444–454.
7. Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, et al. (2007) Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315: 848–853.
8. Lower KM, Hughes JR, De Gobbi M, Henderson S, Viprakasit V, et al. (2009) Adventitious changes in long-range gene expression caused by polymorphic structural variation and promoter competition. Proc Natl Acad Sci USA 106: 21771–21776.
9. Marshall C, Noor A, Vincent J, Lionel A, Feuk L, et al. (2008) Structural variation of chromosomes in autism spectrum disorder. Am J Hum Genet 82: 477–488.
10. Stone JL, O'Donovan MC, Gurling H, Kirov GK, Blackwood DH, et al. (2008) Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 455: 237–241.
11. Sindi SS, Raphael BJ (2009) Identification and frequency estimation of inversion polymorphisms from haplotype data. In: RECOMB. pp. 418–433.
12. Nowell PC (1976) The clonal evolution of tumor cell populations. Science 194: 23–28.
13. Merlo LM, Pepper JW, Reid BJ, Maley CC (2006) Cancer as an evolutionary and ecological process. Nat Rev Cancer 6: 924–935.
14. Albertson DG, Collins C, McCormick F, Gray JW (2003) Chromosome aberrations in solid tumors. Nat Genet 34: 369–76.
15. Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, et al. (2005) Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310: 644–648.
16. Soda M, Choi Y, Enomoto M, Takada S, Yamashita Y, et al. (2007) Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 448: 561–566.
17. Mitelman F, Johansson B, Mertens F (2004) Fusion genes and rearranged genes as a linear function of chromosome aberrations in cancer. Nat Genet 36: 331–334.
18. Meyerson M, Gabriel S, Getz G (2010) Advances in understanding cancer genomes through second-generation sequencing. Nat Rev Genet 11: 685–696.
19. Mardis ER (2012) Genome sequencing and cancer. Curr Opin Genet Dev 22: 245–250.
20. International Cancer Genome Consortium, Hudson TJ, Anderson W, Artez A, Barker AD, et al. (2010) International network of cancer genome projects. Nature 464: 993–998.
21. Bignell GR, Santarius T, Pole JCM, Butler AP, Perry J, et al. (2007) Architectures of somatic genomic rearrangement in human cancer amplicons at sequence-level resolution. Genome Res 17: 1296–1303.
22. Campbell P, Stephens P, Pleasance E, O'Meara S, Li H, et al. (2008) Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet 40: 722–729.
23. Kidd J, Cooper G, Donahue W, Hayden H, Sampas N, et al. (2008) Mapping and sequencing of structural variation from eight human genomes. Nature 453: 56–64.
24. Kolomietz E, Meyn MS, Pandita A, Squire JA (2002) The role of Alu repeat clusters as mediators of recurrent chromosomal aberrations in tumors. Genes Chromosomes Cancer 35: 97–112.



25. Darai-Ramqvist E, Sandlund A, Mller S, Klein of short DNA sequences to the human genome. framework with simulation-based error models
G, Imreh S, et al. (2008) Segmental duplications Genome Biol 10: R25. for inferring genomic structural variants from
and evolutionary plasticity at tumor chromosome 41. Langmead B, Salzberg SL (2012) Fast gapped- mas-sive paired-end sequencing data. Genome
break-prone regions. Genome Res 18: 370379. read alignment with bowtie 2. Nat Methods 9: Biol 10: R23.
26. Bailey J, Eichler E (2006) Primate segmental 357359. 55. Sindi S, Helman E, Bashir A, Raphael BJ (2009)
duplications: crucibles of evolution, diversity and 42. Homer N, Merriman B, Nelson SF (2009) A geometric approach for classification and
disease. Nat Rev Genet 7: 552564. BFAST: an alignment tool for large scale genome comparison of structural variants. Bioinformatics
27. Mills RE, Walter K, Stewart C, Handsaker RE, resequencing. PLoS ONE 4: e7767. doi:10.1371/ 25: i222230.
Chen K, et al. (2011) Mapping copy number journal.pone.0007767. 56. Hormozdiari F, Alkan C, Eichler EE, Sahinalp
variation by population-scale genome sequencing. 43. Hach F, Hormozdiari F, Alkan C, Hormozdiari SC (2009) Combinatorial algorithms for structur-
Nature 470: 5965. F, Birol I, et al. (2010) mrsfast: a cache-oblivious al variation detection in high-throughput se-
28. Stankiewicz P, Lupski JR (2010) Structural algorithm for short-read mapping. Nat Methods quenced genomes. Genome Res 19: 12701278.
variation in the human genome and its role in 7: 576577. 57. Quinlan AR, Clark RA, Sokolova S, Leibowitz
disease. Annu Rev Med 61: 437455. 44. Maher CA, Kumar-Sinha C, Cao X, Kalyana-
ML, Zhang Y, et al. (2010) Genome-wide
29. Raphael B, Volik S, Yu P, Wu C, Huang G, et al. Sundaram S, Han B, et al. (2009) Transcriptome
mapping and assembly of structural variant
(2008) A sequence-based survey of the complex sequencing to detect gene fusions in cancer.
breakpoints in the mouse genome. Genome Res
structural organization of tumor genomes. Ge- Nature 458: 97101.
20: 623635.
nome Biol 9: R59. 45. Mills RE, Luttig CT, Larkins CE, Beauchamp A,
Tsui C, et al. (2006) An initial map of insertion 58. Lee S, Cheran E, Brudno M (2008) A robust
30. Pinkel D, Albertson DG (2005) Array compara- framework for detecting structural variations in a
tive genomic hybridization and its applications in and deletion (INDEL) variation in the human
genome. Genome Res 16: 11821190. genome. Bioinformatics 24: 5967.
cancer. Nat Genet 37 Suppl: S11S7. 59. Hormozdiari F, Hajirasouliha I, Dao P, Hach F,
31. Schatz MC, Delcher AL, Salzberg SL (2010) 46. Ye K, Schulz MH, Long Q, Apweiler R, Ning Z
(2009) Pindel: a pattern growth approach to Yorukoglu D, et al. (2010) Next-generation Varia-
Assembly of large genomes using second-genera- tionHunter: combinatorial algorithms for trans-
tion sequencing. Genome Res 20: 11651173. detect break points of large deletions and medium
sized insertions from paired-end short reads. poson insertion discovery. Bioinformatics 26:
32. Eid J, Fehr A, Gray J, Luong K, Lyle J, et al. i350357.
Bioinformatics 25: 28652871.
(2009) Real-time DNA sequencing from single 60. Sindi SS, Onal S, Peng LC, Wu HT, Raphael BJ
47. Chiang DY, Getz G, Jaffe DB, OKelly MJ, Zhao
polymerase molecules. Science 323: 133138. (2012) An integrative probabilistic model for
X, et al. (2009) High-resolution mapping of copy-
33. Ritz A, Bashir A, Raphael BJ (2010) Structural identification of structural variation in sequencing
number alterations with massively parallel se-
variation analysis with strobe reads. Bioinfor- data. Genome Biol 13: R22.
quencing. Nat Methods 6: 99103.
matics 26: 12911298. 61. Volik S, Raphael B, Huang G, Stratton M, Bignel
48. Yoon S, Xuan Z, Makarov V, Ye K, Sebat J
34. Medvedev P, Stanciu M, Brudno M (2009) (2009) Sensitive and accurate detection of copy G, et al. (2006) Decoding the fine-scale structure
Computational methods for discovering structural number variants using read depth of coverage. of a breast cancer genome and transcriptome.
variation with next-generation sequencing. Nat Genome Res 19: 15861592. Genome Res 16: 394404.
Methods 6: 1320. 49. Volik S, Zhao S, Chin K, Brebner J, Herndon D, 62. Raphael B, Pevzner P (2004) Reconstructing
35. Alkan C, Coe BP, Eichler EE (2011) Genome et al. (2003) End-sequence profiling: sequence- tumor amplisomes. Bioinformatics 20 Suppl 1:
structural variation discovery and genotyping. based analysis of aberrant genomes. Proc Natl i265273.
Nat Rev Genet 12: 363376. Acad Sci USA 100: 76967701. 63. Hampton OA, Den Hollander P, Miller CA,
36. Scherer S, Lee C, Birney E, Altshuler D, Eichler 50. Raphael B, Volik S, Collins C, Pevzner P (2003) Delgado DA, Li J, et al. (2009) A sequence-level
E, et al. (2007) Challenges and standards in Reconstructing tumor genome architectures. map of chromosomal breakpoints in the MCF-7
integrating surveys of structural variation. Nat Bioinformatics 19 Suppl 2: i162171. breast cancer cell line yields insights into the
Genet 39: 715. 51. Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison evolution of a cancer genome. Genome Res 19:
37. Perry G, Ben-Dor A, Tsalenko A, Sampas N, VA, et al. (2005) Fine-scale structural variation of 167177.
Rodriguez-Revenga L, et al. (2008) The fine-scale the human genome. Nat Genet 37: 72732. 64. Stephens PJ, Greenman CD, Fu B, Yang F,
and complex architecture of human copy-number 52. Korbel JO, Urban AE, Affourtit JP, Godwin B, Bignell GR, et al. (2011) Massive genomic
variation. Am J Hum Genet 82: 685695. Grubert F, et al. (2007) Paired-end mapping rearrange- ment acquired in a single catastrophic
38. Li H, Ruan J, Durbin R (2008) Mapping short reveals extensive structural variation in the event during cancer development. Cell 144: 27
DNA sequencing reads and calling variants using human genome. Science 318: 420426.
mapping quality scores. Genome Res 18: 1851 40.
53. Chen K,Wallis JW, McLellan MD, Larson DE,
1858. 65. Navin N, Kendall J, Troge J, Andrews P, Rodgers
Kalicki JM, et al. (2009) BreakDancer: an
39. Li H, Durbin R (2009) Fast and accurate short L, et al. (2011) Tumour evolution inferred by
algorithm for high-resolution mapping of geno-
read alignment with Burrows-Wheeler transform. single-cell sequencing. Nature 472: 9094.
mic structural variation. Nat Methods 6: 677
Bioinformatics 25: 17541760. 681. 66. Moore J, Bush W (2012) Chapter 11: Genome-
40. Langmead B, Trapnell C, Pop M, Salzberg SL 54. Korbel JO, Abyzov A, Mu XJ, Carriero N, wide association studies. PLoS Comput Biol 8:
(2009) Ultrafast and memory-efficient alignment Cayting P, et al. (2009) PEMer: a computational e1002802. doi:10.1371/journal.pcbi.1002802.



Education

Chapter 7: Pharmacogenomics
Konrad J. Karczewski1,2, Roxana Daneshjou2,3, Russ B. Altman2,3*
1 Program in Biomedical Informatics, Stanford University, Stanford, California, United States of America, 2 Department of Genetics, Stanford University, Stanford, California,
United States of America, 3 Department of Medicine, Stanford University, Stanford, California, United States of America

Abstract: There is great variation in drug-response phenotypes, and a "one size fits all" paradigm for drug delivery is flawed. Pharmacogenomics is the study of how human genetic information impacts drug response, and it aims to improve efficacy and reduce side effects. In this article, we provide an overview of pharmacogenetics, including pharmacokinetics (PK), pharmacodynamics (PD), gene and pathway interactions, and off-target effects. We describe methods for discovering genetic factors in drug response, including genome-wide association studies (GWAS), expression analysis, and other methods such as chemoinformatics and natural language processing (NLP). We cover the practical applications of pharmacogenomics both in the pharmaceutical industry and in a clinical setting. In drug discovery, pharmacogenomics can be used to aid lead identification, anticipate adverse events, and assist in drug repurposing efforts. Moreover, pharmacogenomic discoveries show promise as important elements of physician decision support. Finally, we consider the ethical, regulatory, and reimbursement challenges that remain for the clinical implementation of pharmacogenomics.

This article is part of the "Translational Bioinformatics" collection for PLOS Computational Biology.

Citation: Karczewski KJ, Daneshjou R, Altman RB (2012) Chapter 7: Pharmacogenomics. PLoS Comput Biol 8(12): e1002817. doi:10.1371/journal.pcbi.1002817
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: © 2012 Karczewski et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: KJK is supported by NIH/NLM National Library of Medicine training grant "Graduate Training in Biomedical Informatics" T15-LM007033 and the NSF Graduate Research Fellowship Program. RD is supported by Stanford Medical Scholars. RBA is supported by PharmGKB GM61374. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: russ.altman@stanford.edu

PLOS Computational Biology | www.ploscompbiol.org 1 December 2012 | Volume 8 | Issue 12 | e1002817

What to Learn in This Chapter

• Interactions between drugs (small molecules) and genes (proteins)
• Methods for pharmacogenomic discovery
• Association- and expression-based methods
• Cheminformatics and pathway-based methods
• Database resources for pharmacogenomic discovery and application (PharmGKB)
• Applications of pharmacogenomics into a clinical setting

1. Introduction

A child with leukemia goes to the doctor's office to be treated. The oncologist has decided to use mercaptopurine, a drug with a narrow therapeutic range. The efficacy and toxicity of this drug lie in its ability to act as a myelosuppressant, which means it suppresses white and red blood cell production. Despite the dangers this regimen poses, the oncologist is confident in his ability to administer the drug based on his experience with prior patients. However, after the child has undergone treatment, he begins experiencing unexpected bone marrow toxicity, immunosuppression, and life-threatening infections. This type of scenario was encountered after mercaptopurine first came on the market in the 1950s. In the mid-1990s, scientists began to realize that genetics could explain a majority of the cases of life-threatening bone marrow toxicity [1]. Now, many drugs that were once noted to cause so-called "unpredictable" reactions are being re-evaluated for drug-gene interactions.

The history of medicine is full of medications with unintended consequences; the ability to understand some of the underlying causes has been a recent development. In the 1950s, succinylcholine was used by anesthesiologists as a muscle relaxant during operations. However, about 1 in 2500 individuals experienced a horrific reaction: respiratory arrest. Later research revealed that those individuals had defects in both copies of cholinesterase, the enzyme required to metabolize succinylcholine into an inactive form. During the 1980s, a drug used to treat angina, perhexiline, caused neural and liver toxicity in a subset of patients. Scientists later found that this toxicity occurred in individuals with a rare polymorphism of CYP2D6, an enzyme involved in the drug's metabolism.

Genetics not only plays a role in adverse events, but also influences an individual's optimal drug dose. Two antithrombotic drugs, warfarin (an anticoagulant) and clopidogrel (an antiplatelet agent), have different therapeutic doses based on an individual's genetic makeup. Scientists are increasingly learning more about the interaction between drugs and human genetics in order to take modern medicine down a more personalized path.

Modern physicians prescribe medications based on clinical judgment or evidence from clinical trials. In order to select a drug and dosage, physicians take clinical factors such as gender, weight, or organ function into consideration. The personal variation that may affect drug selection or dosing, such as genetics, is not considered in many settings. Thus, while a daily 75 mg dose of clopidogrel for a 70 kg adult would obviously be inappropriate for a 20 kg child, it is less obvious that two adults with identical presentations and clinical backgrounds might require vastly different doses. However, for an increasing number of drugs, this appears to be the case. For instance, two patients with similar clinical presentations could be given the same dose of the anti-platelet drug clopidogrel, and one would be adequately protected against cardiovascular events while the other experiences a myocardial infarction due to inadequate therapeutic protection. What accounts for this difference? Genetics: the patient with the inadequate therapeutic protection likely has a polymorphism of CYP2C19 with decreased activity, so that this key enzyme cannot efficiently metabolize clopidogrel into its active metabolite. The interaction between drugs and genetics has been termed pharmacogenomics.

In general, pharmacogenomics can be defined as the sum of the word's parts: the study and application of genetic factors (often in a high-throughput, genomic fashion) relating to the body's response to drugs, or pharmacology (for the major questions in the field of pharmacogenomics, see Box 1). Once a patient takes a drug, the drug must travel through the body to its target(s), act on its target(s), and then leave the body. The first and last of these processes are facilitated by pharmacokinetic (PK) genes, which may affect a drug in the "ADME" processes: to be absorbed into and distributed through the body, metabolized (either to an active form or broken down into an inactive form), and excreted. The action of a drug on its targets involves pharmacodynamic (PD) genes, which include the direct targets themselves, genes affected downstream, and the genes responsible for the clinical outcome. PK and PD genes can be involved in both intentional on-target effects that produce the desired therapeutic response, as well as unintentional off-target effects that cause adverse events (side effects or other unintended consequences of the drug). Researchers are currently working to tease out genes involved in both the PK and PD pathways that affect drug action in order to improve dosing and avoid adverse drug reactions.

Box 1. Problem Statement

• What are the genes involved in a drug's mechanism of action?
• How are a drug's effects propagated through pathways?
• How can this information be applied to characterize off-target adverse events?
• How can pharmacogenomics information be utilized in prescription and dosing decisions?

The search for genetic factors that relate to pharmacological response begins much like the search for a genetic association of any trait. Standard association study methods (such as GWAS) search for significant associations between a binary or continuous trait and the genetic profiles of case and control sets. In a GWAS, the trait of interest can be a disease state or physical trait. Specifically, in the case of pharmacogenomics, the trait is an actual drug dose, response, or adverse event profile, though the study design should be carefully considered for the specific application (see below: Methods). Additionally, high-throughput expression analysis and cheminformatics have provided investigators with valuable tools for learning about physiological drug responses. Finally, as sequencing technologies become exponentially cheaper and the "$1,000 Genome" becomes an attainable goal, whole-genome or exome sequencing will soon become commonplace in pharmacogenomic studies. As these types of studies become less expensive and more mainstream, pharmacogenomics will transition from simply an interesting research topic to a major player in pharmacological development and clinical application.

The applications of pharmacogenomics are of interest to industry, clinicians, academics, and patients alike. For the biopharmaceutical industry, pharmacogenomics can improve the drug development process through faster and safer drug trials and the early identification of drug responders, non-responders, and those prone to adverse events. For clinicians and patients, pharmacogenomics can aid the decision-making process in prescriptions and determination of the optimal dose of a drug.

Many significant challenges remain in the field of pharmacogenomics, beyond the simple identification of more genetic variants related to drug response. First, the transition to whole-genome sequencing will require newer analysis methods, as well as more extensive annotations, to assign meaning to novel variants. A database of the relation between genes, variants, and drugs, such as PharmGKB, will be instrumental in the aggregation of information curated from the literature. In addition, the characterization of adverse events and their underlying causes is a topic of active research. Finally, the application of pharmacogenomics to a clinical setting will require the education of physicians in the utility of genome sequencing or genotyping for the benefit of their patients.

With the dawn of human genome sequencing, especially the impending widespread availability of personal genotyping to the public, and an expanded knowledge of the clinical impact of genetics and molecular biology, physicians around the world are beginning to use patients' personal genetics in informing prescription decisions. While still in its early phases, pharmacogenomics will undoubtedly lead the way in the development of personalized medicine.

2. Pharmacogenomics in Action

When a physician administers a drug, an intricate cascade of events unfolds as this molecule interacts with the physiological environment. In the simplest scenario, a drug (after interacting with a number of proteins on its way to its target) may act as an agonist or an antagonist against a receptor, which is composed of one or more proteins. At the molecular level, the metabolite can bind to the protein's active site, which can include ligand-binding sites, conformation-altering sites, or catalytic sites. This effect can then be propagated through biochemical pathways to produce a cellular and, finally, systemic physiological effect. Along the way, human genetic variation can affect the way these receptors interact with drugs, leading to consequences in the efficacy of the drug and causing potential adverse events.

2.1. Drug-Receptor Interactions: Agonists and Antagonists

Agonists interact with a receptor in an activating fashion: these small molecules mimic the behavior of the receptor's natural ligand, producing a result that is either weaker than, the same as, or stronger than the natural ligand. For example, sympathomimetic drugs are a clinically important class of agonists that interact with the G-protein-coupled receptors that are endogenously stimulated by catecholamines. These drugs are given to produce responses normally elicited by the sympathetic nervous system. Some examples of sympathomimetic drug action include relaxation of bronchial smooth muscle in asthma, increasing the muscular contractions of the heart in cases of reversible heart failure due to cardiogenic or septic shock, or vasoconstriction of superficial vasculature to reduce nasal congestion. There are several subtypes of adrenoreceptors and different drugs stimulate different receptor subtypes. For instance, a very clinically relevant drug, albuterol, can be inhaled to stimulate β2 receptors (whose natural ligand is norepinephrine) on the smooth muscle of the lungs. Its action leads to the activation of adenylyl cyclase, which ultimately leads to the dilation of bronchial smooth muscle, providing life-saving relief for asthma patients (See Chapter 9 of [2]). However, some studies have identified that the very agonists that provide relief to asthmatics can lead to asthma exacerbation or death in a subset of patients. Research has indicated that at least in some populations, this phenomenon could be related to genetic polymorphisms of the β2 receptors [3].

Antagonists, on the other hand, inhibit the receptor partially or fully, reversibly or irreversibly, so that the cascade caused by normal receptor activation cannot occur. The same adrenergic receptor subclasses mentioned before can also be antagonized. β receptor blockers (beta-blockers) are an antagonist drug class clinically indicated to treat chronic, irreversible heart failure. The mechanism of the beneficial effects of β blockers is not well understood. The prevailing theory is that since the high levels of circulating catecholamines triggered by heart failure lead to detrimental cardiac remodeling, blocking the cardiac catecholamine receptors (β1 and β2 receptors) with a β blocker can slow down additional decompensation. The β blockers for heart failure, bisoprolol, carvedilol, and metoprolol, antagonize (that is, inhibit) β1 and β2 receptors: their action is substantially greater at the β1 receptor, which is the dominant receptor in the heart. However, some patients do not respond as well to this therapy as others, and clinical studies have suggested that this may be due to β1 receptor polymorphisms. More extensive studies of these polymorphisms are underway to definitively identify the pharmacogenetic variables affecting β blocker success [4].

Often in the literature, the discussion of drugs and proteins has involved vague notions of "interactions" without any discussion about the underlying molecular mechanisms. A drug's interaction with any receptor is dependent on how well the molecular conformation of the drug can interact with the structure of the target. Before any discussion of downstream physiological effects, a drug's mechanism of action begins with the specific molecular reaction between the drug and cellular proteins. This interaction itself can provide insight into the effect of drugs on physiology and influence potential pharmacogenomic knowledge.

2.2. Drug-Receptor Interactions: The Details

While biologists tend to represent proteins as colored ovals existing in an idealized environment, in reality, proteins are complex molecules with intricate secondary and tertiary structures: they harbor rugged landscapes on their surfaces, with charged or hydrophobic hills and valleys serving as pockets to which potential small molecules can bind. At these twists and turns, proteins contain their active sites, including structural sites, binding sites, and catalytic sites. Metabolites (drugs) that enter a protein's binding site or catalytic site can either switch on the function of the protein (agonists) or prevent further reactions (antagonists). Such an effect is especially common if the drug bears chemical similarity to the natural ligand of the protein.

Non-steroidal anti-inflammatory drugs (NSAIDs), which cause both reversible and irreversible inhibitory processes, are a familiar drug class that illustrates drug-protein interactions. In general, NSAIDs inhibit the action of cyclooxygenases (coded by the COX genes), which mediate inflammation (see below: Molecular and Physiological Effects; reviewed in [5]). For instance, ibuprofen inhibits cyclooxygenases in a reversible fashion, by localizing to the enzyme's critical catalytic site and competing with arachidonic acid to prevent the modification of the substrate [6].

Alternatively, a drug can react covalently with a protein's critical structural, binding, or catalytic site to affect the structure of the site or the protein as a whole, causing protein inactivation. In the case of NSAIDs, aspirin irreversibly inactivates cyclooxygenases by acetylating critical serine residues (e.g. Serine 530 of COX-2): the bulky sidechain renders the catalytic sites unable to modify arachidonic acid [7]. Irreversible reactions can also work in the opposite direction, where the protein modifies the structure of the drug, potentially altering its activity (see below: PK Interactions).

Often, such an interaction occurs because a drug bears structural similarity to the molecule's natural ligand. For instance, methotrexate is an antifolate drug used to treat a number of diseases, including cancers and autoimmune diseases. Methotrexate is structurally similar to dihydrofolate (Figure 1A) and as such, binds to the same region of dihydrofolate reductase (DHFR) (Figure 1B). Dihydrofolate typically fits into DHFR in a known conformation (Figure 1C), but a phenylalanine to arginine mutation changes this binding conformation (Figure 1D–E). This mutation is hypothesized to confer methotrexate resistance in individuals with this variant [8].

Such drug-protein interactions are often associated with the intended action of the drug, whether they involve what the body does to the drug (pharmacokinetics, PK) or what the drug does to the body (pharmacodynamics, PD). However, drug-protein interactions may also lead to off-target interactions, which can cause adverse events. Along the way, variants in genes can affect these interactions, which influence the pharmacological effect of the drug (See Figure 2 of [9]).

2.2.1. Pharmacokinetic (PK) interactions. On the way to its target and on its way out, a drug may interact with many proteins that aid or hinder its progress. These interactions define a drug's pharmacokinetics, which encompass absorption, distribution, metabolism, and excretion (ADME) processes. These parameters determine how quickly a drug reaches its target and how long its action can last.

When a drug is administered, it must first be absorbed by the body and distributed to the relevant organs and cells. One important parameter, bioavailability, is the fraction of the dose of the drug that ends up in systemic circulation, much of which is based on mode of administration: intravenous delivery would provide 100% bioavailability, while an orally ingested tablet or capsule may be incompletely absorbed by the gastrointestinal tract or metabolized before it reaches systemic circulation. For non-injection methods (as most prescription drugs are administered), bioavailability often depends on absorption and enzymatic action. If the drug is administered orally, bioavailability is influenced by gastric emptying (i.e. transit time), gastrointestinal enzymatic action, gastrointestinal absorption, and liver metabolism. Since drugs absorbed from the gastrointestinal system are taken to the liver via the portal vein prior to entering systemic circulation, the liver can exert a tremendous effect through "first pass metabolism". Once a drug has entered systemic circulation, issues of molecular transport affect the drug's ability to distribute (or reach its target). Genetic variation in the proteins that mediate these processes can affect the absorption and distribution of certain drugs. For instance, the class of ABC (ATP binding cassette) transporters is involved in many of the transport processes in the circulation of drugs and metabolites, especially in the gut and across the blood-brain barrier: polymorphisms in these genes are associated with altered bioavailability of certain drugs, such as the cardiac drug digoxin (digitalis; reviewed in [10]).
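As a rough illustration of the bioavailability concept in Section 2.2.1, absolute bioavailability is commonly estimated by comparing the dose-normalized area under the plasma concentration-time curve (AUC) after oral dosing with that after intravenous dosing, which is taken as 100%. The sketch below assumes this standard AUC ratio and a linear trapezoidal rule; the function names and all concentration values are hypothetical and are not taken from this chapter.

```python
from typing import Sequence


def auc_trapezoid(times: Sequence[float], conc: Sequence[float]) -> float:
    """Area under a concentration-time curve by the linear trapezoidal rule."""
    if len(times) != len(conc) or len(times) < 2:
        raise ValueError("need at least two matched (time, concentration) points")
    area = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        if dt <= 0:
            raise ValueError("times must be strictly increasing")
        area += dt * (conc[i] + conc[i - 1]) / 2.0
    return area


def absolute_bioavailability(auc_oral: float, dose_oral: float,
                             auc_iv: float, dose_iv: float) -> float:
    """Fraction of an oral dose reaching systemic circulation (IV defined as 1.0)."""
    if min(auc_iv, dose_iv, dose_oral) <= 0:
        raise ValueError("doses and the IV AUC must be positive")
    return (auc_oral / dose_oral) / (auc_iv / dose_iv)


# Hypothetical plasma concentrations (mg/L) sampled at the same times (hours)
# after an intravenous and an oral dose of 100 mg each.
times_h = [0.0, 1.0, 2.0, 4.0, 8.0]
iv_conc = [10.0, 8.0, 6.0, 3.0, 1.0]
oral_conc = [0.0, 4.0, 5.0, 3.0, 1.0]

auc_iv = auc_trapezoid(times_h, iv_conc)
auc_oral = auc_trapezoid(times_h, oral_conc)
f_oral = absolute_bioavailability(auc_oral, 100.0, auc_iv, 100.0)
print(f"AUC iv = {auc_iv:.1f}, AUC oral = {auc_oral:.1f}, F = {f_oral:.2f}")
```

With these made-up numbers the oral dose reaches roughly two thirds of the intravenous exposure. The chapter's point is that variants in transporters or metabolizing enzymes can shift this fraction between individuals.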



Figure 1. Methotrexate binds to the folate-binding region of DHFR. (A) Structural similarity between methotrexate and dihydrofolate. (B) Methotrexate (green) and dihydrofolate (blue) fit into the same binding pocket of DHFR. (C) The conformation of dihydrofolate bound to the reference version of the receptor. (D–E) Two possible conformations of dihydrofolate bound to the F31R/Q35E variants of the receptor. These variants have decreased affinity to methotrexate, relative to dihydrofolate. Reprinted with permission from [8].
doi:10.1371/journal.pcbi.1002817.g001

The body's metabolism of a drug can lead to the conversion of a precursor drug into an active metabolite or the breakdown of the active form into an inactive form for excretion. As with absorption and distribution, inter-individual variation in metabolism can often be explained by genetics (specifically, changes in the proteins that interact with the drug). Perhaps the most famous drug-metabolizing proteins are members of the cytochrome P450 family (CYP genes), which are involved in the phase I metabolism of the majority of known drugs [11]. Polymorphisms in these genes have been implicated in human drug response variation, affecting up to 25% of all drug therapies (reviewed in [12]). For instance, CYP2C9 plays a major role in the metabolism of warfarin to the inactive hydroxylated forms, including 7-hydroxywarfarin ([13], reviewed in [14]). As such, CYP2C9 is the second greatest contributor to the variation in warfarin dosage discovered thus far, which has led to its inclusion in pharmacogenetic dosing equations [15].

Finally, the body constantly cycles through the gamut of small molecules that flow through it. For example, the kidney is involved in finely regulating ionic concentrations and purging out unwanted metabolites. As small molecules, drugs are not exempt from these processes and are also excreted from the body, purging what was brought in and circulated by absorption and distribution. For instance, one member of the ABC family, P-glycoprotein (P-gp or ABCB1), is a transporter protein that actively pumps drugs and other metabolites out of cells (a detailed view into the mechanism of P-gp can be found in [16]). Upregulation of P-gp causes increased efflux of small molecules, which causes multi-drug resistance. For example, resistance to statins and chemotherapeutic drugs occurs because the drugs are pumped out before achieving their therapeutic effect (reviewed in [17]). Thus, inhibition of P-gp has remained an active area of research for augmenting cancer treatment [18]. Additionally, upregulation of elimination mediators such as P-gp should be considered for pharmacogenomic dose adjustments, with the caveat that increasing a drug's dose may have other potential detrimental effects.

Figure 2. Association methods. (A) An association study with cases and controls. Millions of genetic loci are probed to ascertain association, or separation between genotypes in cases and controls. (B) Each SNP is tested independently using a 2×2 contingency table and a χ² test or Fisher's exact test. (C) Each SNP is assessed for genome-wide significance, after Bonferroni correction. Reprinted from [64].
doi:10.1371/journal.pcbi.1002817.g002
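The per-SNP test summarized in Figure 2 (panels B and C) can be sketched in a few lines. This is a minimal illustration, assuming an allelic 2×2 table (rows: cases and controls; columns: counts of the two alleles), a Pearson χ² test without continuity correction, and a Bonferroni threshold of 0.05 divided by the number of SNPs tested. The function names and the counts are hypothetical, not data from any study cited here.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square statistic and p-value (1 d.f.) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column needs a non-zero total")
    stat = n * (a * d - b * c) ** 2 / margins
    # For one degree of freedom, P(X > stat) equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def bonferroni_significant(p_value: float, n_tests: int, alpha: float = 0.05) -> bool:
    """True if p_value passes the Bonferroni-corrected threshold alpha / n_tests."""
    return p_value < alpha / n_tests


# Hypothetical allele counts for one SNP:
# cases (e.g. adverse event):  60 minor alleles, 140 major alleles
# controls (no adverse event): 30 minor alleles, 170 major alleles
stat, p = chi_square_2x2(60, 140, 30, 170)
print(f"chi2 = {stat:.2f}, p = {p:.2e}")
print("genome-wide significant for 1,000,000 SNPs:",
      bonferroni_significant(p, n_tests=1_000_000))
```

With these invented counts the SNP is nominally associated (p below 0.001) but falls well short of the corrected threshold of 5e-8 for one million tests, which illustrates why the multiple-testing step in panel C matters.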



2.2.2. Pharmacodynamic (PD) interactions. Pharmacodynamics (PD) encapsulates the specific effect of the drug on its targets and downstream pathways. The drug-target interactions can be on-target, where interactions lead to a therapeutic effect, or off-target, where interactions lead to undesired effects. PD also deals with how a drug concentration affects the target: what concentration is needed to reach the maximum effect, beyond which additional drug does not increase response (maximal effect), and what concentration is required to reach half of this maximal effect (sensitivity).

In many cases, structurally similar molecules (e.g. a drug that is similar to a protein's natural ligand) can bind and affect the same region of a protein and produce a pharmacological effect. For instance, vitamin K and warfarin both interact with VKORC1 (Vitamin K epOxide Reductase Complex subunit 1), an enzyme that typically converts the inactive epoxidized form of Vitamin K back to the active reduced form [19]. Warfarin binds to VKORC1 near its catalytic site (See Figure 3 of [20]), inhibiting the reduction reaction; the ensuing lack of active Vitamin K results in the downstream anti-coagulant effects of warfarin (See Figure 2 of [21] and below: Molecular and Physiological Effects). Polymorphisms in VKORC1 are intensely linked to the efficacy of warfarin [22] by affecting warfarin's ability to bind to VKORC1 and displace vitamin K. As such, sensitivity to warfarin varies significantly in individuals, leading to twenty-fold dose differences. Warfarin's optimal dose can be better estimated by including VKORC1 polymorphisms in a dosing equation rather than using clinical factors

hydrocortisone can occur (See Chapter 2 of [2]).

2.3. Propagation through Pathways

As in the example of hydrocortisone, once a drug affects a gene (whether on-target or off-target), the effects can propagate through multiple proteins in the same pathway. Biology does not occur in a vacuum: proteins are dynamic and interact with many other proteins to produce a physiological function.

In the simplest cases, if the direct effect of a drug is the inhibition of a functional protein, all downstream effects of that protein will be affected. For instance, if a drug disrupts a kinase's active site, all downstream factors in a kinase cascade would not be phosphorylated. As in the case of hydrocortisone, a drug's activation of a transcription factor's DNA-binding domain will switch on the expression of the transcription factor's targets. These downstream targets lead to many of the biological effects of a given drug. Thus, a variant in a pharmacogene may be considerably upstream or downstream of the drug's direct protein interactions, but still affect the action of the drug.

For instance, suppose protein A is known to interact with proteins B and C. When a drug is used to block protein A in order to inhibit protein B's downstream effects, the interaction between proteins A and C may also be affected. If protein A and C's interaction is essential for healthy cellular function, administration of the drug could lead to severe adverse events. Most of the interactions discussed so far comprise on-target effects (A and B), while innocent bystander interactions (A and C) are known as off-target events. In other cases, the drug may exert an effect on an unrelated protein D (that may,

bronchial smooth muscle, causing bronchial spasm, a dangerous event for asthmatics (See Chapter 13 of [2]). Another example is tamoxifen, the selective estrogen receptor modulator (SERM), which has improved outcomes in patients with estrogen receptor positive breast cancers. This drug antagonizes the estrogen receptor in the breast, blocking one of the signals that the cancer cells rely on. However, tamoxifen also has agonist activity at the estrogen receptors in endometrial tissue. This off target action can lead to a 2- to 7-fold increased risk of endometrial cancer [23].

Alternatively, a drug may interact with a protein (unrelated to the intended target) to produce an off-target adverse event. For example, in addition to the on-target adverse events described above, tamoxifen is also associated with cardiac abnormalities and muscle cramping. Preliminary data (discovered by docking methods, see below: Cheminformatics) suggest that these events may be due to an off-target interaction with sarcoplasmic reticulum Ca2+ ion channel ATPase protein (SERCA) [24].

2.5. Molecular and Physiological Effects

A drug's interaction with its target and the downstream effects (through any of the target's pathways) leads to the alterations in cellular physiology. In some cases, a cellular systemic response may be activated or switched off, such as apoptosis or inflammation. The cell may signal to other cells to produce a larger response, which is then observed in the larger context of the body. For instance, warfarin's inhibition of VKORC1 slows the vitamin K-dependent clotting pathway. This results in decreased
alone [15]. for example, bear structural resemblance thrombus formation by platelets, or collo-
Often, a drugs mechanism of action to protein A). quially known as blood thinning. In
involves its localization to some binding other cases, a drug may suppress a bodys
pocket that then disrupts (or enhances) the 2.4. Adverse Events (Off-Target) natural response. For instance, NSAIDs
function of the protein. For example, Drugs are designed for their therapeutic such as aspirin and ibuprofen inhibit COX
hydrocortisone is a lipid-soluble drug that effects, which require the molecule to bind proteins, preventing the conversion of
diffuses across the cell membrane and to one or more targets that then produce arachidonic acid to prostoglandin H2
interacts with the glucocorticoid receptors. downstream effects. Adverse events, how- (PGH2) and blocking the downstream
These receptors reside in an inactive ever, can occur when the on-target production of other prostoglandins, which
conformation because they are bound to interaction produces a potentially related, mediate inflammation and pain response.
heat shock proteins, which hold the but unintended effect, or when drugs bind While in the case of VKORC1, phar-
glucocorticoid receptors in the inactive to off-target proteins to produce an macogenomic variation is observed at the
state. The binding of hydrocortisone unrelated, unintended effect. Such effects direct site of action of warfarin, variation
causes the dissociation of the heat shock may be harmful to the patient, but may in downstream receptors can also influ-
protein and allows the DNA-binding and occasionally be inadvertently helpful (see ence the effect of drugs on the body.
transcription-activating binding domains below: Drug Repurposing). For instance, For instance, calumenin (CALU) is an
of the glucocorticoid receptor to enter an this adverse event can occur due to the inhibitor of the vitamin K-dependent
active conformation. Now, target genes intended interaction in an unintended clotting pathway. While calumenins ef-
can be transcribed, and the many anti- tissue: the b blockers used to treat heart fects are downstream of the direct inter-
inflammatory downstream effects of failure can also block b receptors in the action between VKORC1 and warfarin,

PLOS Computational Biology | www.ploscompbiol.org 6 December 2012 | Volume 8 | Issue 12 | e1002817


Figure 3. Cheminformatics methods. New associations discovered by cheminformatics methods. The Similarity Ensemble Approach (SEA) uses
ligand similarity methods to discover potential new associations between drugs and targets. Reprinted with permission from [33].
doi:10.1371/journal.pcbi.1002817.g003
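
As a rough illustration of the ligand-similarity idea named in the caption, the sketch below scores a query ligand against a small library with the Tanimoto coefficient and keeps the candidates above a threshold. It is not the SEA algorithm itself, only a pairwise similarity of the general kind such methods can build on. The fingerprints (sets of feature indices), the molecule names, the function names, and the 0.5 threshold are all hypothetical values chosen for illustration.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two fingerprints,
    each given as a collection of 'on' feature indices."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def similar_ligands(query, library, threshold=0.5):
    """Return (name, score) pairs scoring at or above threshold, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical toy data: a known ligand and a three-molecule library.
known_ligand = {1, 4, 7, 9, 12}
library = {
    "mol_A": {1, 4, 7, 9, 13},
    "mol_B": {2, 3, 5},
    "mol_C": {1, 4, 7, 9, 12},
}

hits = similar_ligands(known_ligand, library, threshold=0.5)
for name, score in hits:
    print(f"{name}: {score:.2f}")
```

Only the molecules that pass the threshold would then be carried forward to more expensive steps such as docking or binding assays.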



variants in calumenin are also associated with differences in warfarin dosage [25].

3. Methods for Discovery of Pharmacogenomic Genes and Variants
Pharmacogenomic research aims to identify the genes (and gene variants) involved in the interaction between a drug and the body. For any of the pharmacogenomic applications discussed below, there exist methods for discovering relevant genes and variants (typically single nucleotide polymorphisms, or SNPs) related to drug response. Traditional SNP-based methods, such as genome wide association studies (GWAS), can be used to discover candidate regions of interest. Alternatively, analysis of other sources of data, including expression or biochemical data, may provide additional gene candidates. Once candidate variants are identified, further computational and experimental follow-up may be required to fully characterize all the genes and pathways involved in the drug's progression through the body.

3.1. Association Methods
In a GWAS, hundreds of thousands or millions of SNPs (representing regions of the genome with the most inter-individual variation) are probed on a DNA microarray for each individual in a set of cases and controls (Figure 2A). For each SNP, significance of the association between a SNP and the trait is measured by a chi-squared test, based on a 2×2 contingency table of alleles (or genotypes, if a dominant or recessive model is assumed) and case/control status (Figure 2B). In the case of a continuous independent variable, such as drug dose, a likelihood-ratio test or a Wald test is applied to measure whether there is a significantly different dose between the two groups of genotypes.

Each SNP is tested independently, and thus significance (p-values) must be corrected for multiple hypotheses, usually using a Bonferroni correction or False Discovery Rate (FDR). A SNP that reaches genome-wide significance (Figure 2C) is then a candidate for follow-up analysis, as are genes in or near the significant SNP, genes for which the SNP is an eQTL (a SNP associated with the expression of some other gene), and genes in the same pathway as these genes.

The two most important considerations for the design of any pharmacogenomic study include the selection of representative genetic markers, as well as phenotypically well-characterized patients (including cases and controls). The first of these, design of a suitable genotyping array, is technically easy and inexpensive, though the exact design can depend on the desired balance between unbiased genome-wide studies and a targeted SNP panel (see below). As in any trait-association study, the second consideration, the selection, characterization, and covariate identification of cases and controls, provides a significant challenge.

Because performing a million independent tests requires stringent significance correction, large numbers of cases and controls, often in the thousands, are required to discover a SNP that will achieve genome-wide significance.

SNP-based GWAS methods are effective when there is a strong signal from some SNP for the size of the study (that is, when there is good separation between genotypes for the cases and controls). However, under this stringent independence model, weaker signals may be lost among the noise that plagues genetic association studies. Thus, combining data from multiple SNPs in a single gene can boost power and decrease the number of hypotheses for multiple hypothesis correction [26]. Alternatively, if we have prior information about the drug's mechanism of action, we can create targeted SNP panels, limited to genes in the drug target's pathway, to decrease the hypothesis space [27].

As the price of high-throughput sequencing continues to fall, many investigators are turning to exome or whole genome sequencing to discover genetic factors of drug response. Such technologies have the advantage of remaining unbiased in SNP discovery, detecting less common (and even personal) mutations, and capturing larger-scale information, including copy number variants (CNVs) and structural variants (SVs).

Often, in major association studies, the SNP platform (DNA microarray) used is composed of SNPs that serve as tags for a larger stretch of nearby SNPs. Such an approach is possible due to the presence of linkage disequilibrium in the genome, a phenomenon where SNPs tend to be inherited together (linked); the particular structure of these haploblocks (which SNPs are typically inherited together) is specific to each racial population. Because different populations have different linkage structures and a different series of polymorphisms, platforms that are optimized for one population may not be the best choice for another. This problem is further complicated by underlying differences in genomes: the effect a given SNP has on drug response may be different (or even the opposite) because of a hidden interaction with an alternate variant of another gene. Specifically, since many of the original genotyping platforms were developed for Caucasian populations, studies on Africans or Asians will require different approaches. Additionally, the first SNP identified is typically simply an associated variant, rather than the causative variant. In order to determine the specific proteins directly involved in drug response, further experimental or informatics analysis must be performed on genes and variants linked to the associated variant.

3.2. Expression Methods
In addition, other sources of data can be used to identify genes involved in drug response, including RNA expression data from microarrays or RNA-Seq experiments from drug-treated samples. For instance, using expression profiles from patients with a disease of interest, one can identify the genes involved in the progression of the disease and identify potential drug target candidates. Alternatively, expression profiles generated from a drug-treated sample (compared to control) can be used to determine a molecular response to a drug. Ideally, such drug treatment experiments would be done on humans in order to generate organic in vivo physiological response. However, such experiments are unethical for experimental (early phase) drugs, require significant regulatory approval, and are expensive. Thus, established cell lines have provided a valuable, lower-cost resource for investigators to generate gene expression profiles.

One such effort, the connectivity map (CMAP), a publicly available resource of gene expression data of cell lines treated with various small molecules, has been used to compare expression profiles (See Figure 1 of [28]) to identify metabolite-protein interactions, small molecules with similar binding profiles, and metabolites that may mimic or suppress disease [28]. For instance, this approach predicted gedunin to be an inhibitor of HSP90 due to the similarity between gedunin's expression profile and the profile of known inhibitors. Despite the lack of structural similarity between gedunin and other HSP90 inhibitors, CMAP's predicted result was validated biochemically.

Thus, cell lines can be used as surrogates for individuals, where a cellular phenotype is used as a proxy for the individual's own physiological response based on the cellular expression response



to a drug treatment. For instance, one can search for associations between a cell line's genetic makeup and cell viability after drug treatment (the IC50 of the drug for each cell line) [29]. Alternatively, similar methods can be used to characterize toxicological response: treating cell lines with a drug and measuring gene expression can suggest genes involved in the drug's toxicity. While not yet extensively employed in practice, other sources of high-throughput experimental omics data, such as metabolomics or proteomics data, could be used for similar analyses.

3.3. Cheminformatics/NLP (Other Discovery Methods)
While not strictly a pharmacogenomics method, cheminformatics has provided a valuable tool for investigators in the initial stages of drug discovery. For instance, combining information about protein structure and small molecule structure, docking methods predict the best fit of a molecule (or all molecules in a database such as PubChem or ChEMBL) by minimizing the conformation energy of the molecule-protein fit. Such methods are computationally expensive, as they explore a large search space for each pair of molecule and protein and use molecular dynamics or genetic algorithms to optimize fits. Therefore, molecule docking can be limited to the active site of a protein with a group of molecules to decrease the search space. Alternatively, if a given molecule is previously known to interact with a protein, molecule similarity metrics can be included to suggest similar molecules as protein-binding candidates. In this way, a search limited to ligands that score above a similarity threshold to the known ligand would be much faster than a search through all of PubChem. While such predictions must still be confirmed through biochemistry (such as binding assays), these methods can be used to limit the hypothesis space for drug discovery, prioritizing the expensive, lower-throughput biochemical assays.

For a potential drug target, cheminformatics methods can be used to identify new hits or optimize leads by suggesting molecules that may disrupt the function of the protein. For instance, docking methods were used to successfully identify novel molecules that could serve as inhibitors of CTX-M β-lactamase at millimolar binding affinities [30]. Various algorithms have been developed for screening ligand-target fits using docking (reviewed in [31]). Additionally, methods that incorporate the structure of the ligand along with known interactions can identify patterns of related drug targets [32]. Such an approach can suggest new functions for known drugs, explain off-target adverse events, and importantly, predict polypharmacology, or the action of a single drug on multiple targets [33] (Figure 3). These methods leverage small molecule databases such as PubChem and ChEMBL, which maintain structures and properties of small molecules and ligands, as well as bioassay results of these compounds.

Additionally, a wealth of scientific information is available in the biomedical literature as lower-throughput free text. Thus, text mining techniques such as natural language processing (NLP), which exploits sentence syntax to pull structured knowledge from the literature, can be used to mine PubMed and other sources of published information to discover new drug-protein interactions [34].

3.4. Pathway Discovery
Once a candidate gene is identified, studying the gene's known genetic networks, cascades, and pathways can help identify other possible candidates that affect drug action. For instance, if a kinase is identified as a drug target, the proteins it phosphorylates (and any proteins affected downstream) may be relevant to the study of the drug. Additionally, knowledge of biological pathways influencing a disease can aid in the drug discovery process (see below: Drug Discovery).

Numerous online or downloadable resources exist for pathway and network analysis, such as Biocarta, Ingenuity, KEGG, and PharmGKB. For a gene-drug relationship of interest, information on the gene's network or pathway can be used to limit the hypothesis space of other analyses and experiments. Pathway analysis can connect the dots between known gene-drug interactions to generate new hypotheses of key genes that may also contribute to the pharmacogenomics of the drug. Additionally, a mechanism of action can be formalized by closing the loop between all the genes involved.

3.5. Validation and Application
The methods discussed thus far provide only computational evidence for potential drug-protein interactions. In order to prove drug-protein interactions and effects, follow-up biochemical methods, such as measuring binding affinity or functional assays, are required to demonstrate a molecule's potential therapeutic activity or to definitively prove an interaction. Ideally, multiple sources of evidence can be integrated to fully characterize the physiological response to a drug. Once sufficient confidence is generated for a potential pharmacogenomic mechanism, the first step towards clinical application involves the storage and dissemination of the information in a curated database, such as PharmGKB. Combining information from multiple analyses will allow for more powerful characterization of the pharmacogenomic response. For instance, dosing equations for sensitive drugs such as warfarin can be developed by multiple linear regression of observed doses on variants (as well as clinical covariates) [15]. Finally, a centralized resource such as PharmGKB will allow for systematic pharmacogenomic analysis, such as automated annotation of an input of genomic variants.

4. Pharmacogenomics in Drug Discovery
Pharmacogenomics can impact how the pharmaceutical industry develops drugs, as early as the drug discovery process itself (Figure 4). First, cheminformatics and pathway analysis can aid in the discovery of suitable gene targets, followed by small molecules as leads for potential drugs. Additionally, discovery of pharmacogenomic variants for the design of clinical trials can allow for safer, more successful passage of drugs through the pharmaceutical pipeline.

4.1. Small Molecule Candidate Identification
A key starting point in developing a drug for illness or disease involves finding a suitable gene to target. Typically, genes implicated in a disease can be discovered by GWAS, exome sequencing, analysis of RNA expression profiles, or other biochemical methods. These genes and others in the same pathway can be considered as candidate drug targets. The potential target space could be limited by excluding genes on the basis of their similarity to other genes (possibly due to paralogy) to avoid off-target effects.

Once potential gene or pathway targets are identified, cheminformatics methods can be used to generate predictions for potential leads (or drug candidates) for a high-throughput drug screen. For instance, protein structure information can be combined with small molecule structure information to predict favorable drug-gene interactions. After such predictions are generated, follow-up biochemical experiments would be required to confirm the interaction before the small molecules are considered further.



Figure 4. Drug discovery. Pharmacogenomics can be used at multiple steps along the drug discovery pipeline to minimize costs, as well as
increase throughput and safety. First, association and expression methods (as well as pathway analysis) can be used to identify potential gene targets
for a given disease. Cheminformatics can then be used to narrow the number of targets to be tested biochemically, as well as identifying potential
polypharmacological factors that could contribute to adverse events. After initial trials (including animal models and Phase I trials),
pharmacogenomics can identify variants that may potentially affect dosing and efficacy. This information can then be used in designing a larger
Phase III clinical trial, excluding non-responders and targeting the drug towards those more likely to respond favorably.
doi:10.1371/journal.pcbi.1002817.g004
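
The association step named in the caption relies on the per-SNP test described under Association Methods. The following is a minimal sketch of that test, assuming an allelic 2×2 table (case/control by minor/major allele) and a Bonferroni-corrected threshold. The allele counts, the number of tests, and the function names are hypothetical, and a real analysis would typically also account for covariates.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (1 degree of freedom, no continuity
    correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # degenerate table: no evidence of association
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def p_value_1df(chi2):
    """Upper-tail probability of a chi-squared variable with 1 degree of
    freedom, using P(X > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Hypothetical allele counts for one SNP (rows: cases, controls;
# columns: minor allele, major allele).
cases_minor, cases_major = 300, 700
controls_minor, controls_major = 200, 800

chi2 = chi2_2x2(cases_minor, cases_major, controls_minor, controls_major)
p = p_value_1df(chi2)
threshold = bonferroni_threshold(0.05, 1_000_000)  # one million SNPs assumed
genome_wide_significant = p < threshold

print(f"chi2 = {chi2:.2f}, p = {p:.2e}, threshold = {threshold:.0e}")
print("genome-wide significant:", genome_wide_significant)
```

With these made-up counts the SNP shows a visible difference in allele frequency between cases and controls yet does not pass the corrected threshold, which is the situation the text describes when it notes that thousands of cases and controls are often needed.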

In a similar vein, pathway analysis can be used to select new, potentially safer drug targets. Namely, if a drug (which targets some gene) is initially discovered as effective, but found to cause adverse events, safer alternatives might be found by searching for drugs that target genes in the same pathway as the original gene.

4.2. Clinical Trial Pipeline
Once a small molecule has been biochemically identified as a lead and a lack of toxicity verified in animal models, the small molecule goes through a series of increasingly larger phases of clinical trials. Basic efficacy and relative safety are demonstrated before and during Phase II clinical trials, on the path to Phase III.



However, Phase III trials often require thousands of patients, and thus, a pharmaceutical company would ideally be confident that the drug will successfully pass and be profitable before investing in such an expensive endeavor.

Most of the time, patient response to a drug is variable during the initial Phase II trials, and as this response is often related to genetic factors (PK or PD protein variability), pharmacogenomics can be used to limit the cohort for Phase III trials. Specifically, if a protein variant is identified that separates drug responders from non-responders, individuals with the non-responder variant could be excluded from the next phase of the trial (reviewed in [35]). While this would limit the scope and usability of the drug, it would ensure the passage of the drug through the trial. As such, pharmaceutical companies would need to balance the loss in revenue of a less globally applicable drug with the risk of FDA rejection of the drug.

Figure 5. Drug repurposing. Docking methods suggest binding site similarity between COMT (green) and InhA (blue). The overlap between the predicted locations of their cofactors (purple and orange, respectively) and ligands (red and yellow, respectively) suggests potential similarity in their functions. Thus, the same drug that has been used to inhibit COMT (entacapone) was predicted to inhibit the M. tuberculosis protein InhA for potential treatment of tuberculosis. Reprinted from [36].
doi:10.1371/journal.pcbi.1002817.g005

4.3. Drug Repurposing
As mentioned previously, cheminformatics methods can be used to identify novel drug-protein interactions. While these predicted interactions can be used to discover new small molecules for therapeutic purposes, any new drug must still go through the significant regulatory hurdles of safety and efficacy testing. However, drugs already on the market for some therapeutic purpose are FDA-approved for safe use in humans, and their repurposing would simply involve demonstrating that the drug can be used effectively for a different indication. In general, any method that can be used to characterize off-target effects can be used in drug repurposing, by finding effects that are salubrious.

For instance, docking methods have been used in discovering novel functions for already-established small molecules (known as drug repurposing or repositioning). The similarity between a drug target for Parkinson's disease, catechol-O-methyltransferase (COMT), and a bacterial protein in Mycobacterium tuberculosis (the enoyl-acyl carrier protein reductase, InhA) narrowed down an investigation of potential drug targets for M. tuberculosis infections (Figure 5). From this result, entacapone, a drug already approved to treat Parkinson's by inhibiting COMT, was predicted to bind to InhA, which was then validated biochemically and shown to have antibacterial activity [36]. Thus, while full efficacy for treatment of tuberculosis must still be demonstrated in larger studies, studies on a known safe drug are significantly cheaper and carry much lower risk.

5. Applying Pharmacogenomic Knowledge
Pharmacogenomics has the potential to transform the way medicine is practiced, by replacing broad methods of screening and treatment with a more personalized approach that takes into account both clinical factors and the patient's genetics. As demonstrated previously, genetic variation can greatly influence the nature of the effects a drug will have on an individual (whether it will work or cause an adverse event), as well as the amount of drug required to produce the desired effect. To this end, pharmacogenomics will impact the way drugs are prescribed, dosed, and monitored for adverse reactions.

On an individual scale, the derivation of clinically actionable pharmacological information from the genome is already a reality: the clinical annotation of a patient's full genome sequence has suggested the patient's likely resistance to clopidogrel, positive response to lipid-lowering drugs, and lower initial dose requirement of warfarin [37]. Thus, physicians will use pharmacogenomics alongside traditional clinical practices to predict which drugs are more or less likely to work, which patients will require more or less medication to achieve therapeutic response, and which drugs should be avoided on the basis of adverse events. In order to achieve these goals, the findings of the research lab need to be translated into the clinic, and the practice of using pharmacogenomics must be integrated into the existing medical system (Figure 6).

5.1. Prescribing
When a physician treats a condition, there can be multiple approaches to that treatment. Currently, a physician considers clinical and social factors when choosing an approach, asking questions such as, "how is the patient's organ function?", "have there been any past problems with this type of treatment?", "how compliant will the patient be with one treatment versus another?", and "for this kind of patient, what is the best evidence-based treatment?". Based on his or her clinical experience, the physician then chooses a drug to use. If there are multiple treatments available, the physician will choose one and monitor the patient's progress. Having the ability to know which drugs will work best beforehand can improve care, because a physician will administer the best treatment and not waste time on a treatment that is likely to fail for a particular individual.

One area where gene-based prescribing is steadily advancing is cancer genomics. Cancer drugs generally have



Figure 6. Applying pharmacogenomics in the clinic. A proposed clinical workflow including pharmacogenomic information. A physician considers the patient's current presentation and past history when coming up with a working diagnosis and, based on his or her clinical judgment, decides what drugs the patient may need. For example, if the physician wanted to add clopidogrel to the patient's regimen, the physician would input it into the electronic medical record (EMR). The EMR would interrogate the genome and present a message such as "clopidogrel sensitivity: POOR METABOLIZER, REDUCED ANTI-PLATELET EFFECT - gene: CYP2C19 - gene result *2/*2". Based on this recommendation, the physician may adjust the dose accordingly or choose another drug. In this case, the physician will likely increase the dose of clopidogrel in order to achieve therapeutic effect. Reprinted with permission from [65].
doi:10.1371/journal.pcbi.1002817.g006
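
The EMR lookup described in the caption can be sketched as a simple table-driven rule. This is only an illustration under stated assumptions: the allele table covers just the *1 and *2 alleles, the phenotype labels other than the caption's POOR METABOLIZER are assumed, and the function names and message format (modeled on the caption's example) do not correspond to the interface of any real EMR system.

```python
# Illustrative allele table (assumption): *1 = normal function, *2 = no function.
# A real system would draw allele definitions from a curated resource.
ALLELE_FUNCTION = {"*1": "normal", "*2": "none"}

# Phenotype by number of no-function alleles (labels are assumed for this sketch).
PHENOTYPE_BY_NO_FUNCTION_COUNT = {
    0: "NORMAL METABOLIZER",
    1: "INTERMEDIATE METABOLIZER",
    2: "POOR METABOLIZER",
}


def cyp2c19_phenotype(diplotype):
    """Map a diplotype string such as '*1/*2' to a metabolizer phenotype."""
    alleles = diplotype.split("/")
    if len(alleles) != 2 or any(a not in ALLELE_FUNCTION for a in alleles):
        raise ValueError(f"unsupported CYP2C19 diplotype: {diplotype!r}")
    no_function = sum(ALLELE_FUNCTION[a] == "none" for a in alleles)
    return PHENOTYPE_BY_NO_FUNCTION_COUNT[no_function]


def clopidogrel_message(diplotype):
    """Build an alert string in the style of the Figure 6 example."""
    phenotype = cyp2c19_phenotype(diplotype)
    # Only the poor-metabolizer wording is taken from the figure caption.
    if phenotype == "POOR METABOLIZER":
        phenotype += ", REDUCED ANTI-PLATELET EFFECT"
    return (
        f"clopidogrel sensitivity: {phenotype} - "
        f"gene: CYP2C19 - gene result {diplotype}"
    )


message = clopidogrel_message("*2/*2")
print(message)
```

Keeping the allele table separate from the message logic means the rule can be updated as curated annotations change without touching the code that formats the alert.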

many toxic side effects, and in many cases of advanced cancers, physicians guess and test medications by prescribing them and monitoring progress. In addition, the very nature of cancer is personal, insofar as each specific cancer is caused by the unique sum of individual somatic mutations (that is, mutations that occur in the individual after birth and are not inherited or passed on). Certain signatures of cancers, or mutations that produce similar cancer phenotypes, allow for the grouping of cancers into distinctions, such as leukemia or lymphoma, but even these exhibit significant variability among classifications. Thus, the ability to sequence and study the genomes of cancer cells of an individual can help identify the driving somatic mutations and provide a tool for rational drug choice. For example, the median survival for advanced or recurrent endometrial cancer is very poor, due to the fact that physicians treat empirically with chemotherapy, which may have no therapeutic benefits. Researchers studying mutations in the pathways of endometrial cancer cell lines found that response to doxorubicin, a chemotherapy used to treat endometrial cancer, was related to mutations in the Src pathways, which are involved in cell proliferation, motility, and survival. By pinpointing mutations in this pathway, the researchers were able to rationalize supplementing the drug regimen with the addition of SU6656, a drug that competitively inhibits the Src pathway, which increased the sensitivity of some of the cell lines to doxorubicin [38]. As cancers are typically characterized by a lack of error-correction mechanisms and inhibited apoptosis, such an approach is particularly important, as the initial failure of a chemotherapeutic drug allows time for a cancer to develop further mutations and spread further. In the future, interrogating cancer genomes could allow rational drug prescribing, decreasing the amount of time spent on ineffective therapies and increasing the number of successful cures.

equivalent to that of a prior myocardial infarction in a nondiabetic individual. Presently, the physician chooses drugs based on his or her best clinical judgment and then monitors the outcome of the treatment. However, as the tolerance and efficacy of certain popularly prescribed drugs have been shown to be tied to genetics, such information could be used in prescription decisions. For instance, statins are a class of drugs that are inhibitors of HMG-CoA reductase, an enzyme that helps produce cholesterol in the liver. Thus, statins are given in an effort to lower cholesterol, particularly low-density lipoprotein (LDL) cholesterol, whose increased levels are a cardiac risk factor. Statins are often prescribed to patients with type II diabetes and high cholesterol in order to help them reach a healthier cholesterol range. Even though studies have suggested genetic influences on statin efficacy and tolerance, such findings are not yet widely applied in clinical medicine. One study found that in individuals with diabetes, variation in the HMG-CoA reductase gene was associated with a decreased response to statin therapy. In this study, a significantly greater percentage of individuals heterozygous for the G minor allele of rs17238540 were unable to reach target cholesterol and triglyceride goals when compared to individuals homozygous for the major allele. Additionally, these individuals had a 13% smaller reduction in total cholesterol and a 27% smaller reduction in triglycerides. This is an example of just one variation in the HMG-CoA gene; other variations certainly exist and can impact how well a patient responds to statins [39]. Another gene that has been found to affect response to statins is the APOE gene, which is associated with the regulation of total cholesterol and LDL cholesterol. There are several variants in this gene, and there are differences between how type II diabetic individuals carrying these variants respond to statins. For instance, the individuals homozygous for the E2 variant were all able to reach their target LDL cholesterol; however

going to succeed or if another drug would be a better choice [40].

One major caveat of gene-based prescription decisions (as well as dosing, as discussed below) involves the applicability of a finding in one population to other populations (see above: Association Methods). While a pharmacogenomic effect may be true for a given population (with a certain genetic background, in animal model parlance), it may not directly apply to other populations due to unknown genetic factors, especially combinatorial effects. Because there is no current standard for translating a result between ethnicities, follow-up work is required for each specific pharmacogenomic interaction before it is applied in a clinical setting.

5.2. Adverse Drug Reactions
Another factor physicians need to consider when choosing a drug is the risk of adverse events, or any detrimental, unintended consequence of administering a drug at indicated clinical doses. In a milder form, an adverse event could be an allergic rash from penicillin. These events can also be much more intense: severe adverse drug reactions (SADRs) are those that can cause significant injury or even death, and are estimated to occur in about 2 million patients a year in the United States. In fact, SADRs are the fourth leading cause of death in the United States, with about 100,000 yearly deaths. Because of the impact of SADRs, scientists and physicians hope that the application of pharmacogenomics can help predict which patients are most susceptible to experiencing an SADR to a given drug. With this knowledge in hand, a physician can either more closely monitor these patients or choose an alternative therapy [41].

For instance, statins have been associated with a rare but incredibly severe adverse reaction: myopathy and rhabdomyolysis. A study looking at the possible genetic influences of this reaction found a SNP in the SLCO1B1 gene associated with this severe
Pharmacogenomics can also play a role in 32% of individuals homozygous for the E4 adverse drug reaction, with an odds ratio of
drug decisions for prevalent conditions, variant failed to reach target LDL cholesterol. 4.5 [42]. However, there are also cases of
allowing physicians to predict when a Moreover, E2 variant homozygotes had a individuals who experience milder symptoms
commonly successful therapy may fail. For significantly greater lipid lower response to and develop statin intolerance. Some of these
instance, there is an arsenal of drugs doctors statins than some of the other variants. Thus, individuals experience an elevation in crea-
can use to combat the co-morbidities of type APOE is another gene that may be predictive tine kinase or alanine aminotransferase while
II diabetes. These co-moribidities are usually of statin resistance or reduced efficacy. on statins, indicating possible muscle or liver
cardiac risk factors, such as lipid abnormal- Knowledge of these genes could play a role damage. A recent study found that the
ities and high blood pressure: the cardiac risk in the future of drug prescribing, as physicians functional variants V174A and N130D in
factor conferred by type II diabetes is would be able to predict a priori if a drug was the SLCO1B1 gene, which encodes the

PLOS Computational Biology | www.ploscompbiol.org 13 December 2012 | Volume 8 | Issue 12 | e1002817


organic anion transporting polypeptide OATP1B1, are predictive of statin intolerance [43]. OATP1B1 in these individuals has reduced maximal transport ability, possibly leading to higher levels of statins in the patient's blood. Currently, studies are underway to determine whether the available statin drugs differ with regard to these pharmacogenetic components, in order to better inform physicians about the drug choices they make.

In the effort to apply pharmacogenetics in the clinic, trials have already shown that screening tests have clinical utility. For instance, abacavir, a nucleoside reverse-transcriptase inhibitor used to treat AIDS, causes a hypersensitivity reaction in 5 to 8% of patients. This reaction can include fever, rash, and gastrointestinal or respiratory symptoms. Since this adverse reaction necessitates stopping therapy (and patients cannot be put on the drug again because of the risk of a more severe reaction upon re-exposure), physicians could avoid prescribing this drug if they were capable of predicting which individuals would have a reaction. Recently, HLA-B*5701 was found to be associated with hypersensitivity to abacavir. Armed with this information, a double-blind, randomized, prospective study in nearly 2000 patients was conducted to determine if screening for this variant could help prevent hypersensitivity reactions in AIDS patients. The results supported the use of pharmacogenetics in the clinic: prescreening eliminated immunologically confirmed hypersensitivity reactions and significantly decreased hypersensitivity symptoms, compared to the control group [44].

5.3. Dosing

Once a physician has chosen a drug based on efficacy and consideration of adverse events, the next step is to determine the correct dose at which to administer the drug. Currently, clinical factors such as gender, weight, and kidney or liver function may be taken into account when dosing a medication. However, genetics can play a large role in how a drug is dosed as well.

As mentioned previously, a major reason drug doses differ between individuals is polymorphisms in proteins involved in pharmacokinetics or pharmacodynamics. Variation in enzymes involved in pharmacokinetics, such as the Cytochrome P450 metabolic enzymes (mainly CYP2D6, CYP2C9, and CYP3A4), can affect the availability of drugs reaching their targets. Alternatively, the targets themselves (PD genes) can respond differently based on their specific structure.

One of the emerging examples of dosing based on genetics is the anticoagulant warfarin. Prescriptions for warfarin number about 30 million annually and are indicated to prevent myocardial infarction, venous thrombosis, and cardioembolic stroke. However, the dose needed to achieve adequate anticoagulation can vary by as much as twentyfold between patients. Currently, physicians start with an initial dose and titrate (adjust) over time until the target international normalized ratio (INR), an indicator of anticoagulation, is reached. However, until the therapeutic dose is reached, there is the opportunity for over-dosing, which leads to excessive anticoagulation and thus an increased risk of hemorrhaging and bleeding, or under-dosing, which can lead to ineffectiveness and thus an increased risk of thromboembolic events. The discovery of variants affecting warfarin dosing has led to the creation of algorithms that use clinical (such as weight and other drug status) and pharmacogenetic (variants in CYP2C9 and VKORC1; see above, PK and PD Interactions, respectively) information in order to predict a patient's optimal starting warfarin dose. One such pharmacogenetic dosing algorithm, produced by the International Warfarin Pharmacogenetic Consortium, predicted doses significantly more accurately than an algorithm using clinical factors alone [15]. However, one of the drawbacks is that these predictions are most accurate in a Caucasian population; additional research is needed in populations of different ancestries in order to produce a more broadly applicable pharmacogenetic algorithm.

5.4. Applying Pharmacogenomics in the Clinic

Though examples exist of how pharmacogenomics could impact prescribing drugs, predicting adverse events, and dosing drugs, the actual application of pharmacogenomics is just beginning to gain traction. As pharmacogenomics knowledge steadily increases and the infrastructure for its usage continually develops, the day when all physicians regularly apply genetics to drug dosing draws closer. The challenges that remain include surmounting regulatory hurdles, developing ways to continually update known findings, delivering knowledge to physicians, and integrating genomics into medicine. However, scientists have worked to address these challenges, and pharmacogenomics will likely serve as one of the first major clinical applications of personalized genomic medicine.

In the United States, the FDA regulates drugs and drug labels. Therefore, the communication between scientists and the FDA will be critical to the adoption of pharmacogenomic information on drug labels. Evaluation will depend on the trial design, sample size, reproducibility, and effect size [45]. One benefit of pharmacogenomics is that the associations between genetics and drug effects are more concrete and immediately applicable than in other translational bioinformatics concepts such as disease risk assessment, where scientists are struggling with missing heritability and combinations of moderate risks. Because of this, unlike other therapies, which require a randomized clinical trial in order to prove efficacy, the application of pharmacogenomic principles may not require the same level of scrutiny. Rather than providing some novel therapy, the vast majority of pharmacogenomic findings are simply supplementing physician knowledge about previously approved drugs. Physicians already utilize the clinical backgrounds of their patients (i.e. weight, gender, presumed organ function, drug interactions, compliance) when making decisions about drugs. As long as adding the variable of genetic information is non-inferior to the current standard of care, there should not be resistance to its implementation [46].

Once a biomarker is shown to be important, other decisions will have to be made: Should testing for the biomarker be required, or should it just be recommended? Socio-economic considerations along with the predictive value of the biomarker will need to be considered. At first pass, the use of pharmacogenomic data may be left completely to the clinician's judgment until the FDA has formalized its role in their application. Once a pharmacogenetic biomarker is approved, the drug's label will need to reflect the genetic components involved: biomarkers identifying the patient population that should receive the drug would be printed under indication, biomarkers related to drug mechanism may appear under the clinical pharmacology section, and biomarkers related to safety may be indicated in adverse events. The challenge for the FDA and clinicians alike will be to remain vigilant about incorporating new information as the onslaught of pharmacogenetic associations continues to pour in [45].

Pharmacogenetic research continues to discover new drug-gene interactions. The volume of new findings exceeds the capabilities of any individual to parse. Thus, bioinformatics will have to play an integral role in the translation of the data to the bedside. Text mining (see Methods: Cheminformatics/NLP) will be instrumental to extracting structured data from the literature in order to update knowledge bases, such as PharmGKB. Ultimately, this knowledge will be integrated into a centralized database to make the information accessible to all.

In order to fully translate pharmacogenomics into the clinic, this information must be well integrated with the electronic medical



Table 1. Examples of pharmacogenomics used in this chapter. Additional examples can be found at PharmGKB.

Drug | Gene (Selected Examples) | SNPs/Genotypes (Selected Examples) | Sources
Mercaptopurine | Inosine triphosphate pyrophosphatase (ITPA), Thiopurine methyltransferase (TPMT) | rs41320251, rs1800584 | [63], [62]
Succinylcholine | Butyrylcholinesterase (BCHE) | rs28933390, rs28933389 | [61]
Perhexiline | Cytochrome P450 2D6 (CYP2D6) | CYP2D6 *4/*5, *5/*6, *4/*6 | [60]
Clopidogrel | Cytochrome P450 2C19 (CYP2C19) | rs4244285 | [59]
Albuterol | Beta-2 adrenergic receptor (ADRB2) | rs1042713 | [58]
Metoprolol | Beta-1 adrenergic receptor (ADRB1) | rs1801252 | [57]
Methotrexate | Methylenetetrahydrofolate reductase (MTHFR) | rs4846051 | [56]
Warfarin | Cytochrome P450 2C9 (CYP2C9), Vitamin K epoxide reductase (VKORC1), Calumenin (CALU) | rs1799853, rs1057910, rs7294, rs9934438, rs9923231, rs339097 | [55], [54], [53], [52], [25], [51]
Atorvastatin | P-glycoprotein (ABCB1) | rs1045642, rs2032582 | [50]
Statins | HMG-CoA reductase (HMGCR), Apolipoprotein E (APOE), Solute carrier organic anion transporter family, member 1B1 (SLCO1B1) | rs17238540, APOE - E2, E4, rs4149056, rs2306283 | [39], [40], [49], [43]
Abacavir | HLA-B*5701 genes | rs2395029, rs3093726 | [48]

doi:10.1371/journal.pcbi.1002817.t001

system (Figure 6). Full adoption will require a curated, updated database with FDA or evidence-based approved drug-gene interactions that would be available for physicians to use in their medical practice. For example, PharmGKB is primarily used as a scientific tool for identifying drug-gene interactions. However, its clinical utility was shown when it was used to generate drug recommendations based on an individual's fully sequenced genome [37]. Such resources serve as the precursor to the systems that will be in place when all individuals have sequenced genomes readily available for physician use.

Finally, for pharmacogenomics to be widely applied, personal genomics needs to become ingrained into modern medicine. Physicians and patients must be educated as to the benefits of genomic medicine, in order to dispel any myths and to avoid ethical issues. Moreover, genetic testing facilities meeting the U.S. government's Clinical Laboratory Improvement Amendments (CLIA) certification requirements need to be established in order to provide patients with genomic data that is considered acceptable for clinical use. Finally, insurance companies must be on board to reimburse genetic testing. Since sequencing costs continue to fall drastically, the debates surrounding cost will soon become moot [46]. Thus, we are rapidly entering an age where every patient can have his or her genome available. With the availability of an individual's genome, a physician looking to administer a drug such as a statin can check to see whether or not the statin would be expected to work and if any possible adverse events might be expected (Figure 6).

Pharmacogenetics is a rapidly developing field; however, some challenges remain in implementing scientific findings from the bench to the bedside. Because of the continued development and work in this field, these challenges will be addressed, ushering in an age of personalized drug treatments.

6. Summary

Pharmacogenomics encompasses the interaction between human genetics and drugs, which can be affected by variation in genes involved in pharmacokinetics (PK) and pharmacodynamics (PD). Thus, a major goal of pharmacogenomics is to elucidate which genes affect drug action, using cheminformatics, expression studies, and genome-wide association studies (GWAS). Association methods can be used to discover novel associations by comparing the genetic differences between cases with a certain phenotype and controls. Expression analysis and cheminformatics can be used to expand knowledge about drug-gene interactions by comparing gene expression or interaction profiles among drugs and genes. Analysis of these studies can yield information about how these genes affect drug action. Because of differences in haplotype structure between populations, studies validated in one population may not be directly applicable to a different population. However, as knowledge accumulates about drug-gene interactions, scientists can contribute to databases, such as PharmGKB, documenting known relationships (Table 1). As the volume of knowledge grows, text mining methods may become instrumental in interrogating the literature and collecting relevant data for clinical use. The application of pharmacogenomics in the clinic can help inform physicians in drug prescribing, drug dosing, and prediction of adverse events. Because many of the drugs undergoing pharmacogenomic study are already FDA-approved, adoption of pharmacogenomics in the clinic is mostly dependent on the availability of genome sequencing and the development of implementation infrastructure. Moreover, pharmacogenomics can also aid in drug development, providing pharmaceutical companies with an additional tool to design more successful, cheaper trials. Thus, pharmacogenomics promises to help launch medicine and drug development into the realm of personalized care.

7. Exercises

1. (A) Download a genotype and phenotype dataset of your choosing. Using PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/) or a statistical program such as R (http://www.r-project.org/), calculate the association (using a Fisher's exact test) between <Trait> and each SNP. After Bonferroni correction, does any SNP reach genome-wide significance? (B) Does using a different correction method such as Benjamini or False Discovery Rate (FDR) result in any more significant SNPs?
2. (A) Use a pharmacogenomic database (such as PharmGKB) to find genes that may interact with metformin. (B) Are any of these genes known to interact with other drugs? Which drugs? (C) Bonus question: Are any of these drugs related (by structure or function) to metformin?
3. (A) Implement a warfarin dosing equation (e.g. the one found in [15]). If you have a personal genotype, input your information and calculate your optimal starting warfarin dose; otherwise, calculate the optimal dose (as predicted by both the clinical and pharmacogenetic algorithms) for a 66-year old Caucasian (175 cm, 75 kg), not taking amiodarone or enzyme inhibitors, who is rs9923231 TT and CYP2C9 *2/*2. (B) Would the clinical algorithm have over- or under-estimated his (or your) dose and what are the potential consequences of such an error?
4. You are a physician and would like to prescribe simvastatin. What parts of the genome would you want interrogated to know about prescribing this drug and why?
5. Read about the clinical uses of a whole genome or exome in healthy [37] and diseased [47] individuals. How can pharmacogenomics be directly applied in a clinical setting?

Answers to the Exercises can be found in Text S1.

Supporting Information

Text S1 Answers to Exercises. (DOCX)
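As a starting point for Exercise 1, the per-SNP association test and the two multiple-testing corrections can be sketched in plain Python. This is a minimal illustration, not the PLINK or R implementation: the function names, the 2x2 layout (carriers versus non-carriers of the minor allele, in cases versus controls), and the toy counts are all assumptions made for the example, and the SNP labels are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def bonferroni(pvalues, alpha=0.05):
    """Flag p-values that pass the Bonferroni threshold alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Flag p-values declared significant by the Benjamini-Hochberg procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under rank/m * alpha
        if pvalues[i] <= rank / m * alpha:
            cutoff = rank
    significant = [False] * m
    for i in order[:cutoff]:
        significant[i] = True
    return significant


if __name__ == "__main__":
    # Hypothetical counts per SNP:
    # (case carriers, case non-carriers, control carriers, control non-carriers)
    toy_counts = {
        "snp_1": (30, 20, 10, 40),
        "snp_2": (22, 28, 20, 30),
        "snp_3": (5, 45, 18, 32),
    }
    names = list(toy_counts)
    pvals = [fisher_exact_two_sided(*toy_counts[n]) for n in names]
    for n, p, bon, bh in zip(names, pvals, bonferroni(pvals),
                             benjamini_hochberg(pvals)):
        print(n, f"p={p:.3g}", "Bonferroni:", bon, "BH:", bh)
```

With a real genome-wide dataset the number of tests m is the number of SNPs, so the Bonferroni threshold becomes very small; the Benjamini-Hochberg procedure controls the false discovery rate instead and will typically flag at least as many SNPs.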

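For Exercise 3, the structure of a dosing calculation that combines clinical and pharmacogenetic inputs can be sketched as below. This is only a scaffold under stated assumptions: it assumes a linear model for the square root of the weekly dose, and the function name, genotype labels, and every coefficient are placeholders invented for illustration. They are not the published values from [15], which must be substituted before the numbers mean anything.

```python
# Placeholder coefficients for illustration only. These are NOT the published
# values from [15] and must not be used to dose patients.
PLACEHOLDER_COEFFS = {
    "intercept": 5.0,
    "age_decades": -0.25,
    "height_cm": 0.01,
    "weight_kg": 0.01,
    "amiodarone": -0.5,
    # Genotype effects relative to the reference genotype (hypothetical values)
    "vkorc1_rs9923231": {"CC": 0.0, "CT": -0.8, "TT": -1.6},
    "cyp2c9": {"*1/*1": 0.0, "*1/*2": -0.5, "*2/*2": -1.0},
}


def predict_weekly_dose_mg(age_years, height_cm, weight_kg,
                           vkorc1=None, cyp2c9=None, amiodarone=False,
                           coeffs=PLACEHOLDER_COEFFS):
    """Predict a weekly dose from a linear model on sqrt(dose).

    Leaving both genotypes as None gives a clinical-only estimate;
    supplying them adds the pharmacogenetic terms.
    """
    sqrt_dose = (coeffs["intercept"]
                 + coeffs["age_decades"] * (age_years // 10)
                 + coeffs["height_cm"] * height_cm
                 + coeffs["weight_kg"] * weight_kg)
    if amiodarone:
        sqrt_dose += coeffs["amiodarone"]
    if vkorc1 is not None:
        sqrt_dose += coeffs["vkorc1_rs9923231"][vkorc1]
    if cyp2c9 is not None:
        sqrt_dose += coeffs["cyp2c9"][cyp2c9]
    # The model is fitted on the square root of the dose, so square the result
    return sqrt_dose ** 2


if __name__ == "__main__":
    # The patient described in Exercise 3, evaluated with placeholder coefficients
    clinical = predict_weekly_dose_mg(66, 175, 75)
    pharmacogenetic = predict_weekly_dose_mg(66, 175, 75,
                                             vkorc1="TT", cyp2c9="*2/*2")
    print(f"clinical-only estimate:   {clinical:.1f} mg/week (placeholder)")
    print(f"pharmacogenetic estimate: {pharmacogenetic:.1f} mg/week (placeholder)")
```

Keeping the coefficients in a single dictionary that is passed to the function makes it straightforward to swap in the published clinical and pharmacogenetic coefficient sets and compare the two predictions, as part (B) of the exercise asks.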
Further Reading

• Altman RB, Flockhart D, Goldstein DB (2012) Principles of pharmacogenetics and pharmacogenomics. Cambridge: Cambridge University Press. 400 p.
• Altman RB, Kroemer HK, McCarty CA, Ratain MJ, Roden D (2010) Pharmacogenomics: will the promise be fulfilled? Nat Rev Genet 12: 69-73.
• Altman RB (2011) Pharmacogenomics: noninferiority is sufficient for initial implementation. Clin Pharmacol Ther 89: 348-350.
• Klein TE, Chang JT, Cho MK, Easton KL, Fergerson R, et al. (2001) Integrating genotype and phenotype information: an overview of the PharmGKB project. Pharmacogenetics Research Network and Knowledge Base. Pharmacogenomics J 1: 167-170.
• Roses AD (2000) Pharmacogenetics and the practice of medicine. Nature 405: 857-865.
• Roses AD (2004) Pharmacogenetics and drug development: the path to safer and more effective drugs. Nat Rev Genet 5: 645-656.

Glossary

• Adverse event - A side effect, or unintended consequence of taking a drug.
• Cheminformatics - Methods that utilize chemical structures of metabolites and/or protein structure to discover potential drug-gene interactions.
• Drug Target - The specific protein whose interaction with a drug constitutes that drug's mechanism of action.
• (Gene) Expression - The relative amount of RNA from a gene in a cell at a given snapshot in time, often used as a proxy for activity of the gene in the condition in which the experiment was performed.
• Hit - A small molecule that disrupts the function of a potential drug target (for treatment of a disease).
• Lead - An optimized (often chemically modified) hit with high specificity for its target and reasonable pharmacogenomic properties.
• Linkage - The property that multiple SNPs are often inherited together. When a SNP is associated with a trait or disease, it is not necessarily the causal SNP, but may be linked to other variation that is the molecular and physiological cause of the association.
• (DNA) Microarray - An experimental method that probes hundreds of thousands or millions of regions of the genome to determine the genotype at each locus.
• Off-target effect - The effects of a drug propagated by interactions with proteins other than the drug target (innocent bystanders).
• On-target effect - The effects of a drug propagated by the intended interaction with the drug target.
• Pharmacodynamics - The mechanisms that relate to what the drug does to the body, including on-target and off-target effects, intended and unintended, beneficial or harmful.
• Pharmacogenomics - The study and application of genetic factors relating to the body's response to drugs.
• Pharmacokinetics - The range of mechanisms that relate to what the body does to the drug, including absorption, distribution, metabolism, and elimination of a drug.
• Polymorphism - A mutation in the genome that varies among individuals in a sizable fraction (often, minor allele frequency >0.01) of the population.
• Polypharmacology - The interaction of a drug with multiple targets.
• SADR - Severe Adverse Drug Reaction. An adverse event that results in significant injury or death.
• SNP - Single Nucleotide Polymorphism (see Polymorphism)



References
1. Abbott A (2003) With your genes? Take one of these, three times a day. Nature 425: 760-762.
2. Katzung BG, Masters SB, and Trevor AJ (2012) Basic & clinical pharmacology. New York: McGraw-Hill Medical.
3. Yu I-W, Bukaveckas BL (2008) Pharmacogenetic tests in asthma therapy. Clin Lab Med 28: 645-665.
4. Azuma J, Nonen S (2009) Chronic heart failure: beta-blockers and pharmacogenetics. Eur J Clin Pharmacol 65: 3-17.
5. Smith WL, DeWitt DL, Garavito RM (2000) Cyclooxygenases: structural, cellular, and molecular biology. Annu Rev Biochem 69: 145-182.
6. Rome LH, Lands WE (1975) Structural requirements for time-dependent inhibition of prostaglandin biosynthesis by anti-inflammatory drugs. Proc Natl Acad Sci USA 72: 4863-4865.
7. DeWitt DL, el-Harith EA, Kraemer SA, Andrews MJ, Yao EF, et al. (1990) The aspirin and heme-binding sites of ovine and murine prostaglandin endoperoxide synthases. J Biol Chem 265: 5192-5198.
8. Volpato JP, Yachnin BJ, Blanchet J, Guerrero V, Poulin L, et al. (2009) Multiple conformers in active site of human dihydrofolate reductase F31R/Q35E double mutant suggest structural basis for methotrexate resistance. J Biol Chem 284: 20079-20089.
9. Giacomini KM, Brett CM, Altman RB, Benowitz NL, Dolan ME, et al. (2007) The pharmacogenetics research network: from SNP discovery to clinical drug response. Clin Pharmacol Ther. pp. 328-345.
10. Dietrich CG, Geier A, Oude Elferink RPJ (2003) ABC of oral bioavailability: transporters as gatekeepers in the gut. Gut 52: 1788-1795.
11. Evans WE, Relling MV (1999) Pharmacogenomics: translating functional genomics into rational therapeutics. Science 286: 487-491.
12. Ingelman-Sundberg M, Sim SC, Gomez A, Rodriguez-Antona C (2007) Influence of cytochrome P450 polymorphisms on drug therapies: pharmacogenetic, pharmacoepigenetic and clinical aspects. Pharmacol Ther 116: 496-526.
13. Rettie AE, Korzekwa KR, Kunze KL, Lawrence RF, Eddy AC, et al. (1992) Hydroxylation of warfarin by human cDNA-expressed cytochrome P-450: a role for P-4502C9 in the etiology of (S)-warfarin-drug interactions. Chem Res Toxicol 5: 54-59.
14. Goldstein JA, de Morais SM (1994) Biochemistry and molecular biology of the human CYP2C subfamily. Pharmacogenetics. pp. 285-299.
15. Consortium IWP, Klein TE, Altman RB, Eriksson N, Gage BF, et al. (2009) Estimation of the warfarin dose with clinical and pharmacogenetic data. N Engl J Med 360: 753-764.
16. Jones PM, George AM (2004) The ABC transporter structure and mechanism: perspectives on recent research. Cell Mol Life Sci 61: 682-699.
17. Leschziner GD, Andrew T, Pirmohamed M, Johnson MR (2007) ABCB1 genotype and PGP expression, function and therapeutic drug response: a critical review and recommendations for future research. Pharmacogenomics J 7: 154-179.
18. Thomas H, Coley HM (2003) Overcoming multidrug resistance in cancer: an update on the clinical strategy of inhibiting p-glycoprotein. Cancer Control 10: 159-165.
19. Zimmermann A, Matschiner JT (1974) Biochemical basis of hereditary resistance to warfarin in the rat. Biochem Pharmacol 23: 1033-1040.
20. Oldenburg J, Watzka M, Rost S, Muller CR (2007) VKORC1: molecular target of coumarins. J Thromb Haemost 5 Suppl 1: 1-6.
21. Owen RP, Altman RB, Klein TE (2008) PharmGKB and the International Warfarin Pharmacogenetics Consortium: the changing role for pharmacogenomic databases and single-drug pharmacogenetics. Human mutation 29: 456-460.
22. Rost S, Fregin A, Ivaskevicius V, Conzelmann E, Hortnagel K, et al. (2004) Mutations in VKORC1 cause warfarin resistance and multiple coagulation factor deficiency type 2. Nature 427: 537-541.
23. Bland AE, Calingaert B, Secord AA, Lee PS, Valea FA, et al. (2009) Relationship between tamoxifen use and high risk endometrial cancer histologic types. Gynecol Oncol 112: 150-154.
24. Xie L, Wang J, Bourne PE (2007) In silico elucidation of the molecular mechanism defining the adverse effect of selective estrogen receptor modulators. PLoS Comput Biol 3: e217. doi:10.1371/journal.pcbi.0030217
25. Voora D, Koboldt DC, King CR, Lenzini PA, Eby CS, et al. (2010) A polymorphism in the VKORC1 regulator calumenin predicts higher warfarin dose requirements in African Americans. Clin Pharmacol Ther 87: 445-451.
26. Tatonetti NP, Dudley JT, Sagreiya H, Butte AJ, Altman RB (2010) An integrative method for scoring candidate genes from association studies: application to warfarin dosing. BMC Bioinformatics 11 Suppl 9: S9.
27. Thorn CF, Whirl-Carrillo M, Klein TE, Altman RB (2007) Pathway-based approaches to pharmacogenomics. Current Pharmacogenomics 5: 79-86.
28. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, et al. (2006) The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313: 1929-1935.
29. Gamazon ER, Huang RS, Cox NJ, Dolan ME (2010) Chemotherapeutic drug susceptibility associated SNPs are enriched in expression quantitative trait loci. Proc Natl Acad Sci USA 107: 9287-9292.
30. Chen Y, Shoichet BK (2009) Molecular docking and ligand specificity in fragment-based inhibitor discovery. Nat Chem Biol 5: 358-364.
31. Kolb P, Ferreira RS, Irwin JJ, Shoichet BK (2009) Docking and chemoinformatic screens for new ligands and targets. Curr Opin Biotechnol 20: 429-436.
32. Keiser MJ, Roth BL, Armbruster BN, Ernsberger P, Irwin JJ, et al. (2007) Relating protein pharmacology by ligand chemistry. Nat Biotechnol 25: 197-206.
33. Keiser MJ, Setola V, Irwin JJ, Laggner C, Abbas AI, et al. (2009) Predicting new molecular targets for known drugs. Nature 462: 175-181.
34. Garten Y, Coulet A, Altman RB (2010) Recent progress in automatically extracting information from the pharmacogenomic literature. Pharmacogenomics 11: 1467-1489.
35. Roses AD (2004) Pharmacogenetics and drug development: the path to safer and more effective drugs. Nat Rev Genet 5: 645-656.
36. Kinnings SL, Liu N, Buchmeier N, Tonge PJ, Xie L, et al. (2009) Drug discovery using chemical systems biology: repositioning the safe medicine Comtan to treat multi-drug and extensively drug resistant tuberculosis. PLoS Comput Biol 5: e1000423. doi:10.1371/journal.pcbi.1000423
37. Ashley EA, Butte AJ, Wheeler MT, Chen R, Klein TE, et al. (2010) Clinical assessment incorporating a personal genome. Lancet 375: 1525-1535.
38. Indermaur MD, Xiong Y, Kamath SG, Boren T, Hakam A, et al. (2010) Genomic-directed targeted therapy increases endometrial cancer cell sensitivity to doxorubicin. Am J Obstet Gynecol 203: 158.e151140.
39. Donnelly LA, Doney ASF, Dannfald J, Whitley AL, Lang CC, et al. (2008) A paucimorphic variant in the HMG-CoA reductase gene is associated with lipid-lowering response to statin treatment in diabetes: a GoDARTS study. Pharmacogenetics and Genomics 18: 1021-1026.
40. Donnelly LA, Palmer CNA, Whitley AL, Lang CC, Doney ASF, et al. (2008) Apolipoprotein E genotypes are associated with lipid-lowering responses to statin treatment in diabetes: a Go-DARTS study. Pharmacogenetics and Genomics 18: 279-287.
41. Giacomini KM, Krauss RM, Roden DM, Eichelbaum M, Hayden MR, et al. (2007) When good drugs go bad. Nature 446: 975-977.
42. Group SC, Link E, Parish S, Armitage J, Bowman L, et al. (2008) SLCO1B1 variants and statin-induced myopathy - a genomewide study. N Engl J Med 359: 789-799.
43. Donnelly LA, Doney ASF, Tavendale R, Lang CC, Pearson ER, et al. (2011) Common nonsynonymous substitutions in SLCO1B1 predispose to statin intolerance in routinely treated individuals with type 2 diabetes: a go-DARTS study. Clin Pharmacol Ther 89: 210-216.
44. Mallal S, Phillips E, Carosi G, Molina J-M, Workman C, et al. (2008) HLA-B*5701 screening for hypersensitivity to abacavir. N Engl J Med 358: 568-579.
45. Surh LC, Pacanowski MA, Haga SB, Hobbs S, Lesko LJ, et al. (2010) Learning from product labels and label changes: how to build pharmacogenomics into drug-development programs. Pharmacogenomics 11: 1637-1647.
46. Altman RB (2011) Pharmacogenomics: noninferiority is sufficient for initial implementation. Clin Pharmacol Ther 89: 348-350.
47. Worthey EA, Mayer AN, Syverson GD, Helbling D, Bonacci BB, et al. (2010) Making a definitive diagnosis: Successful clinical application of whole exome sequencing in a child with intractable inflammatory bowel disease. Genetics in medicine: official journal of the American College of Medical Genetics.
48. Mallal S, Nolan D, Witt C, Masel G, Martin AM, et al. (2002) Association between presence of HLA-B*5701, HLA-DR7, and HLA-DQ3 and hypersensitivity to HIV-1 reverse-transcriptase inhibitor abacavir. Lancet 359: 727-732.
49. Link E, Parish S, Armitage J, Bowman L, Heath S, et al. (2008) SLCO1B1 variants and statin-induced myopathy - a genomewide study. N Engl J Med 359: 789-799.
50. Rebecchi IM, Rodrigues AC, Arazi SS, Genvigir FD, Willrich MA, et al. (2009) ABCB1 and ABCC1 expression in peripheral mononuclear cells is influenced by gene polymorphisms and atorvastatin treatment. Biochem Pharmacol 77: 66-75.
51. Voora D, Koboldt DC, King CR, Lenzini PA, Eby CS, et al. A polymorphism in the VKORC1 regulator calumenin predicts higher warfarin dose requirements in African Americans. Clin Pharmacol Ther 87: 445-451.
52. D'Andrea G, D'Ambrosio RL, Di Perna P, Chetta M, Santacroce R, et al. (2005) A polymorphism in the VKORC1 gene is associated with an interindividual variability in the dose-anticoagulant effect of warfarin. Blood 105: 645-649.
53. Aithal GP, Day CP, Kesteven PJ, Daly AK (1999) Association of polymorphisms in the cytochrome P450 CYP2C9 with warfarin dose requirement and risk of bleeding complications. Lancet 353: 717-719.
54. Rettie AE, Wienkers LC, Gonzalez FJ, Trager WF, Korzekwa KR (1994) Impaired (S)-warfarin metabolism catalysed by the R144C allelic variant of CYP2C9. Pharmacogenetics 4: 39-42.
55. Crespi CL, Miller VP (1997) The R144C change in the CYP2C9*2 allele alters interaction of the cytochrome P450 with NADPH:cytochrome P450 oxidoreductase. Pharmacogenetics 7: 203-210.



56. Hughes LB, Beasley TM, Patel H, Tiwari HK, chrome P450 2C19 genotype with the antiplatelet 63. Stocco G, Cheok MH, Crews KR, Dervieux T,
Morgan SL, et al. (2006) Racial or ethnic effect and clinical efficacy of clopidogrel therapy. French D, et al. (2009) Genetic polymorphism of
differences in allele frequencies of single-nucleo- JAMA 302: 849857. inosine triphosphate pyrophosphatase is a deter-
tide polymorphisms in the methylenetetrahydro- 60. Barclay ML, Sawyers SM, Begg EJ, Zhang M, minant of mercaptopurine metabolism and tox-
folate reductase gene and their influence on Roberts RL, et al. (2003) Correlation of CYP2D6 icity during treatment for acute lymphoblastic
response to methotrexate in rheumatoid arthritis. genotype with perhexiline phenotypic metaboliz- leukemia. Clin Pharmacol Ther 85: 164172.
Ann Rheum Dis 65: 12131218. er status. Pharmacogenetics 13: 627632. 64. Takeuchi F, McGinnis R, Bourgeois S, Barnes C,
57. Johnson JA, Zineh I, Puckett BJ, McGorray SP, 61. Nogueira CP, Bartels CF, McGuire MC, Adkins Eriksson N, et al. (2009) A genome-wide associ-
Yarandi HN, et al. (2003) Beta 1-adrenergic S, Lubrano T, et al. (1992) Identification of two ation study confirms VKORC1, CYP2C9, and
receptor polymorphisms and antihypertensive different point mutations associated with the CYP4F2 as principal genetic determinants of
response to metoprolol. Clin Pharmacol Ther fluoride-resistant phenotype for human butyr- warfarin dose. PLoS Genet 5: e1000433.
74: 4452. ylcholinesterase. Am J Hum Genet 51: 821 doi:10.1371/journal.pgen.1000433
58. Israel E, Chinchilli VM, Ford JG, Boushey HA, 828. 65. Pulley JM, Denny JC, Peterson JF, Bernard GR,
Cherniack R, et al. (2004) Use of regularly 62. Otterness DM, Szumlanski CL, Wood TC, Vnencak-Jones CL, et al. (2012) Operational
scheduled albuterol treatment in asthma: geno- Weinshilboum RM (1998) Human thiopurine implementation of prospective genotyping for
type-stratified, randomised, placebo-controlled methyltransferase pharmacogenetics. Kindred
personalized medicine: the design of the Vander-
cross-over trial. Lancet 364: 15051512. with a terminal exon splice junction mutation
bilt PREDICT project. Clin Pharmacol Ther 92:
59. Shuldiner AR, OConnell JR, Bliden KP, Gandhi that results in loss of activity. J Clin Invest 101:
8795.
A, Ryan K, et al. (2009) Association of cyto- 10361044.



Education

Chapter 8: Biological Knowledge Assembly and Interpretation

Ju Han Kim1,2,3*
1 Division of Biomedical Informatics, Seoul National University College of Medicine, Seoul, Korea, 2 Seoul National University Biomedical Informatics (SNUBI), Seoul National University College of Medicine, Seoul, Korea, 3 Systems Biomedical Informatics National Core Research Center (SBI-NCRC), Seoul National University College of Medicine, Seoul, Korea

Abstract: Most methods for large-scale gene expression microarray and RNA-Seq data analysis are designed to determine the lists of genes or gene products that show distinct patterns and/or significant differences. The most challenging and rate-limiting step, however, is to determine what the resulting lists of genes and/or transcripts biologically mean. Biomedical ontology and pathway-based functional enrichment analysis is widely used to interpret the functional role of tightly correlated or differentially expressed genes. The groups of genes are assigned to the associated biological annotations using Gene Ontology terms or biological pathways and then tested if they are significantly enriched with the corresponding annotations. Unlike previous approaches, Gene Set Enrichment Analysis takes quite the reverse approach by using pre-defined gene sets. Differential co-expression analysis determines the degree of co-expression difference of paired gene sets across different conditions. Outcomes in DNA microarray and RNA-Seq data can be transformed into the graphical structure that represents biological semantics. A number of biomedical annotation and external repositories including clinical resources can be systematically integrated by biological semantics within the framework of concept lattice analysis. This array of methods for biological knowledge assembly and interpretation has been developed during the past decade and has clearly improved our biological understanding of large-scale genomic data from the high-throughput technologies.

This article is part of the "Translational Bioinformatics" collection for PLOS Computational Biology.

Citation: Kim JH (2012) Chapter 8: Biological Knowledge Assembly and Interpretation. PLoS Comput Biol 8(12): e1002858. doi:10.1371/journal.pcbi.1002858
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: © 2012 Ju Han Kim. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by the basic science research program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2010-0028631). The funders had no role in the preparation of the manuscript.
Competing Interests: The author has declared that no competing interests exist.
* E-mail: juhan@snu.ac.kr

1. Introduction

One of the challenges in DNA microarray and RNA-Seq data analysis is to extract biological meanings from the massive amounts of transcriptome expression data. Most of the microarray and RNA-Seq data analysis methods are designed to determine the lists of genes or gene products that show distinct patterns and/or significant differences. Clustering and differential expression analysis, for example, typically generate lists of significantly clustered and Differentially Expressed Genes (DEGs), respectively. The most challenging and rate-limiting step, however, is to determine what the resulting lists of genes or gene products biologically mean.

The first analytic approach for the biological interpretation of obtained gene lists was to manually collect and put down all available descriptive information concerning each gene next to it and to try to infer the collective meaning of the textual descriptors for the group of genes under the biological systems context. The assumption here is that if a certain keyword is significantly over-represented or a meaningful pattern is found among the textual descriptors for a gene group, then the keyword or the pattern can be regarded as the semantic interpretation of the gene group.

It seems that Tavazoie et al. [1] were the first to formally analyze the over-representation of functional annotations for lists of genes with semantic interpretations. By means of partitional clustering and motif discovery, given genome-wide gene-expression clusters, they analyzed significantly over-represented regulatory motifs in the upstream sequences of clustered yeast genes to uncover new regulons (i.e., sets of co-regulated genes) and their putative cis-regulatory elements. Here, the discovered motifs are in effect regarded as functional annotations of the corresponding genes. Many Functional Annotation Analysis (FAA) methods have been developed to test whether certain Gene Ontology (GO) terms [2] or biological pathways are significantly enriched within a particular list of genes. Many GO and biological pathway-based tools for gene expression analysis have been developed and proven to be useful [3–9].

FAA is an attempt to extract biological semantics from given lists of genes that are determined without considering any biological meaning but by a quantitative statistical analysis like clustering and DEG analysis methods. Gene Set Enrichment Analysis (GSEA) [10,11], however, takes quite the reverse way. GSEA uses pre-defined gene sets with a priori established biological meanings like biological pathways. For each pre-defined gene set, GSEA tries to determine if it shows significant expression change. Therefore, what GSEA essentially tests is whether the pre-defined biological meaning assigned to the gene set shows significant change or not. It has been demonstrated that GSEA can detect subtle but set-wise coordinated expression changes that cannot be detected by individual gene tests [10].

The gene-set approach greatly improves biological interpretability by using pre-defined gene sets with established biological meanings. The same strategy can be applied

PLOS Computational Biology | www.ploscompbiol.org 1 December 2012 | Volume 8 | Issue 12 | e1002858


to differential co-expression analysis. Cho et al. proposed the dCoxS algorithm, which determines whether the coordinated co-expression pattern of a pair of gene sets shows significant changes across different conditions [12]. If a pair of gene sets (or pathways) shows a positive co-expression pattern in normal tissue but a negative co-expression pattern in cancer cells, then it can be assumed that the pair of gene sets may play an important role in the cancerous transformation. This dyadic relation can easily be extended to create a network of gene sets showing differential co-expression patterns across different conditions.

Sometimes, given the genomic scale, even the extracted list of biological meanings and significant functional annotations is so big and complex that it needs to be systematically organized. Ordering of obtained semantics using concept lattice analysis improves biological interpretation of microarray gene-expression data. BioLattice considers gene expression clusters as objects and annotations as attributes and provides a graphical executive summary (i.e. the context of the whole experiment) of the order relations by arranging them on a concept lattice in an order based on the set inclusion theory [13].

A wide range of tools and resources in microarray and RNA-Seq data analysis have a potential impact on personalized medicine and are invaluable in biomedical research. Integrative analysis of heterogeneous biological and clinical data is essential to discover meaningful knowledge. The construction of semantic relationships of biological resources makes it possible to unify multi-layered and heterogeneously formatted data from genome to phenome. Semantic analysis integrating gene expression profiles and annotations into a unified framework enables us to interpret complex biomedical data in a comprehensive and organized fashion.

The outline for this chapter is as follows. In Section 2, a comprehensive survey of biomedical annotation resources will be given with major ontology and biological pathway-based analysis methods. Section 3 describes gene set-wise differential expression analysis methods with their semantic interpretation power. Section 4 describes differential co-expression analysis. Finally, in Section 5, application of formal concept analysis for systematic semantic interpretation of gene expression profiles will be introduced, followed by a summary in Section 6.

What to Learn in This Chapter

• How to find genes associated with a particular disease (or condition) from microarray or RNA-Seq data
• How to find biological pathways and/or biomedical ontology terms for the interpretation of particular gene groups associated with a particular disease
• How to characterize biological properties of a particular list of genes
• Which data resources are useful for interpreting large-scale gene expression profiles
• What are the limitations of individual gene-based analysis for determining differentially expressed genes (even with multiple hypothesis correction)
• How to identify gene groups that are differentially expressed or differentially co-expressed between normal and disease samples
• Compare in terms of semantic interpretation the functional annotation analysis methods for co-expressed genes as in clustering and for pre-defined gene sets as in GSEA
• How to organize and visualize a massive and redundant annotation list of genes or gene sets into a unified framework of biological understanding

2. Pathway and Ontology-Based Analysis

GO and biological pathway-based analysis is one of the most powerful methods for inferring the biological meanings of observed expression changes (Figure 1). It enables us to analyze a list of interesting genes resulting from microarray and RNA-Seq experiments, without molecular biologists' help. The genes in the list may be the ones statistically significantly up or down regulated between conditions (i.e. DEGs), where the number of genes in the list depends on the threshold of significance. Another method is to perform a co-expression (or clustering) analysis grouping genes with similar expression patterns across different experimental conditions. Many genome databases provide GO annotations to their genes and gene products, which are also members of biological pathways. FAA determines which biological pathways or GO terms are significantly overrepresented in a given list of genes. GO annotation and pathway membership frequencies for a list of genes obtained by differential expression analysis (Figure 1(a)) or co-expression analysis (or clustering) (Figure 1(b)) are input to statistical analyses to test if they are significantly over-represented. For example, in Figure 1, the genes in the gene list (i.e. selected genes) are significantly enriched with a GO term, GO:000123, but not with GO:000126. This means that the genes are significantly associated with the biological meaning of the GO term, GO:000123.

Figure 1. Functional annotation analysis based on biological pathways and GO terms. Annotation frequencies for a list of genes obtained by differential expression and co-expression analyses of microarray and RNA-Seq data are input to a statistical analysis of significant over-representation within the selected group. C: conditions, g: genes, s: gene groups.
doi:10.1371/journal.pcbi.1002858.g001

In principle, any attribute of a gene can be applied for FAA including transcription factor binding sites [1], clinical phenotypes like disease associations, MeSH (Medical Subject Heading) terms, microRNA binding sites, protein family memberships, chromosomal bands, etc. as well as GO terms and biological pathways. Moreover, these features may in turn have their own ontological structures as illustrated in Figure 2. GO and MeSH have a tree-like graph structure, which is more formally a DAG (Directed Acyclic Graph), in which each term may be a child of one or more parents. Pathways have directed graph structures. Clusters may also be organized into a hierarchical tree or a graph structure. ArrayXPath [6,9] provides one of the most comprehensive collections of these structured features for annotation analysis [14].

Differential expression analysis determines significantly down- or up-regulated genes (or DEGs) between two conditions, e.g. control and treatment groups to explore the effect of a drug. Student's t-test, Wilcoxon's rank sum test and ANOVA may be applied to detect DEGs. Given the huge number of genes to be tested, the multiple-hypothesis-testing problem should be properly managed. Co-expression analysis puts similar expression profiles together and different ones apart, returning lists of co-expressed genes that are assumed to be tightly co-regulated. Clustering algorithms can be classified into hierarchical-tree clustering and partitional clustering. While some partitional clustering algorithms do not impose a structure on the clusters, others like Self Organizing Feature Maps (SOM) organize clusters into a grid structure. Imposing a structure based on cluster similarity may be performed after clustering.

Although DEGs are different from clusters, biological interpretation of the resulting lists of significantly up- or down-regulated DEGs (Figure 1(a)) may also benefit from the same ontology and pathway-based annotation analysis. Clustering is classified as an unsupervised method. Results from supervised methods for a variety of classification tasks can sometimes be organized into a list based on, for example, their contributions to the task. In principle, any list of genes can, with care, be subjected to ontology and pathway-based annotation analysis.

Metabolic pathways like KEGG and MetaCyc and signaling pathways like BioCarta are very powerful resources for the understanding of shared biological processes of a group of genes. Pathways are commonly presented as directed graphs, where nodes mainly represent molecules such as proteins and compounds, and edges represent relation types between two nodes. MetaCyc is an experimentally determined non-redundant metabolic pathway database. It is the largest collection containing over 1400 metabolic pathways [15]. It is a part of the BioCyc collection of pathways and genome databases developed by SRI International. The pathway figures of MetaCyc are not static diagrams, so they can be updated and expanded, while KEGG provides static collections of pathway diagrams.

One major goal of ontology is to provide a shared understanding of a certain domain of information. GO was first created as controlled vocabularies for standardized annotation of genome databases. Genes and gene products are annotated by GO as well as free text input by curators. DAG structures are imposed on the three controlled vocabularies of GO: Molecular Function (MF), Cellular Component (CC), and Biological Process (BP). To each node (or GO term), a set of genes is annotated. MIPS began as a source for data on yeast biology, and now provides an integrated source for experimental, literature and computationally-predicted protein properties for a variety of complete genomes as well. MeSH has many clinical terms including disease names. Other knowledge resources like OMIM (Online Mendelian Inheritance in Man) Morbid Map can also be used to associate genes to MeSH disease names. GO and MeSH are now parts of UMLS (Unified Medical Language System), which has a semantic network structure. In principle, any biomedical ontology can be systematically applied for improving biomedical understanding of gene expression microarray and RNA-Seq data.

Once the genes of interest are successfully associated with correct functional annotations, the next step is to examine if there are any GO terms that have a larger than expected subset of listed genes in their annotation list. For example, if 20% of the genes in a gene list are annotated with a GO term "apoptosis" while



Figure 2. Collection of biological knowledge-based annotation resources for genes and gene clusters. The right panel shows an
example of GO enrichment analysis result for a yeast cell division experiment.
doi:10.1371/journal.pcbi.1002858.g002

only 1% of the genes in the whole human genome fall into this functional category, then the gene list can be regarded as strongly related with the functional annotation. Most statistical tests like Chi-square, binomial and hypergeometric tests can be applied. The Chi-square test cannot be used to test data of small sample size. The hypergeometric test is widely used for functional enrichment analysis of gene lists, but it is computationally more intensive.

Suppose we have a total of $N$ genes with $n$ genes belonging to a group of interest (cluster or DEGs). Among them $M$ genes are annotated to a specific GO term and $k$ genes belong to the interest group and are annotated to the specific GO term. The probability of having at most $k$ genes can be calculated by the hypergeometric distribution according to the following:

$$
P(X \le k) = \sum_{y=0}^{k} h(y \mid N; M; n) = \sum_{y=0}^{k} \frac{\binom{M}{y}\binom{N-M}{n-y}}{\binom{N}{n}}
$$

For over-representation, the quantity of interest is the complementary upper tail, $P(X \ge k) = 1 - P(X \le k-1)$.

The hypergeometric distribution is a discrete probability distribution describing the number of successes in a series of draws without replacement from a finite population. The test is equivalent to a one-tailed Fisher's exact test. One should consider the choice of universe (or background), which makes a substantial impact on the result. All genes having at least one GO annotation, all genes ever known in genome databases, all genes on the microarray, or all transcripts of RNA-Seq data that pass non-specific filters can be candidate universes. One more problem comes from the hierarchical tree (or graphical) structure of GO categories (or pathways), while the hypergeometric test assumes independence of categories. A parent term can be rated as significant simply because of the influence of its significant children. Moreover, more general statements require stronger evidence than is required to prove more specific statements. Conditional hypergeometric testing methods [16,17] exclude GO terms if there is no evidence beyond that provided by their significant children. Because many tests are performed, p-values must be interpreted with caution.

Pathway and ontology-based analysis consists of database mapping, statistical testing, and presentation steps [18]. Mapping gene lists to GO terms or pathways requires resolving gene name ambiguities and inconsistencies (not discussed here) using a wide range of genomic resources and techniques. Visual and textual presentation helps users to understand biological semantics and contexts. A number of analysis tools with these steps have been introduced: ArrayXPath, Pathway Miner and EASE in pathway analysis; GOFish, GOTree Machine, FatiGO, GOAL, GoMiner and FuncAssociate in ontology analysis; and GeneMerge, MAPPFinder, DAVID, GFINDer and Onto-Tools in both analyses [14].

3. Gene Set-Wise Differential Expression Analysis

Researchers' primary interest with DNA microarray and RNA-Seq data is to identify differentially expressed genes (DEGs). To this aim, a number of statistical methods have been introduced, evaluating statistical significance of individual genes between two conditions. Gene set-wise differential expression analysis, however, evaluates coordinated differential expression of gene groups, the meanings of which are previously defined as those of biological pathways. The first method developed in this category is Gene Set Enrichment Analysis (GSEA), which evaluates for each a priori defined gene set the significant association with phenotypic classes in DNA microarray experiments [10].
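The hypergeometric enrichment test described in Section 2 can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the implementation of any tool named in this chapter; the function names and the example numbers (a universe of 20,000 genes, 200 of them annotated, a list of 100 genes with 20 annotated) are assumptions chosen to echo the 20% versus 1% apoptosis example.

```python
from math import comb


def hypergeom_pmf(y, N, M, n):
    """h(y | N; M; n): probability that exactly y of the n listed genes carry the term."""
    return comb(M, y) * comb(N - M, n - y) / comb(N, n)


def hypergeom_cdf(k, N, M, n):
    """P(X <= k), the cumulative sum written in the text."""
    return sum(hypergeom_pmf(y, N, M, n) for y in range(0, min(k, n) + 1))


def enrichment_pvalue(k, N, M, n):
    """P(X >= k): one-tailed over-representation p-value (upper tail)."""
    return sum(hypergeom_pmf(y, N, M, n) for y in range(k, min(M, n) + 1))


# Hypothetical numbers: N = 20,000 genes in the universe, M = 200 (1%) annotated
# with the term, n = 100 genes in the list, k = 20 (20%) of them annotated.
p_value = enrichment_pvalue(k=20, N=20000, M=200, n=100)
print(f"P(X >= 20) = {p_value:.3e}")
```

Exact integer binomial coefficients keep the sketch free of rounding issues for moderate problem sizes. In practice the result still depends on the choice of universe N discussed above, and the p-values from many GO terms would need a multiple-testing adjustment.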



Figure 3. Differential expression analysis for individual genes and predefined gene sets. C: conditions, c: condition sets, g: genes, s: gene
groups, S: predefined gene sets. (Modified from [30]).
doi:10.1371/journal.pcbi.1002858.g003

While FAA tries to determine over-represented GO terms or biological pathways after determining significant co-expression clusters or DEG lists (Figure 3(a) and (c)), GSEA takes the reverse-annotation or gene set-wise approach (Figure 3(b)). This gene set-wise differential expression analysis method successfully identified modest but coordinated changes in gene expression that might have been missed by conventional individual gene-wise differential expression analysis. Moreover, the gene set-wise approach provides straightforward biological interpretation because the gene sets are defined by biological knowledge. GSEA's success clearly demonstrates that many tiny expression changes can collectively create a big change that is statistically significant. Another advantage is that utilizing pre-defined and well-established gene sets rather than finding or creating novel lists of genes markedly improves semantic interpretability and computational feasibility. It is believed that functionally related genes often show a coordinated expression pattern to accomplish their functional role.

GSEA first creates a ranked list of genes according to their differential expression between experimental conditions and then determines, for each a priori defined gene set, whether members of the gene set tend to occur toward the top (or bottom) of the ranked list, in which case the gene set is correlated with the phenotypic class distinction. For a gene set of interest, $S$, the Enrichment Score (ES) is calculated by evaluating the fraction of genes in $S$ ("hits"), weighted by their correlation, and the fraction of genes not in $S$ ("misses") present up to a given position $i$ in the ranked gene list, $L$, where $N$ genes are ordered according to the correlation, $r(g_j) = r_j$, of their expression profiles with the phenotypic class distinction:

$$
P_{hit}(S,i) = \sum_{\substack{g_j \in S \\ j \le i}} \frac{|r_j|^p}{N_R}, \quad \text{where } N_R = \sum_{g_j \in S} |r_j|^p
$$

$$
P_{miss}(S,i) = \sum_{\substack{g_j \notin S \\ j \le i}} \frac{1}{N - N_H}
$$

where $N_H$ indicates the number of genes in $S$ and $p$ is an exponent to control the weight of the step. The ES is the maximum deviation from zero of $P_{hit} - P_{miss}$. It corresponds to a weighted Kolmogorov-Smirnov-like statistic.

GSEA assesses the significance by permuting the class labels. Concerning the definition of the null hypothesis, methods can be classified into competitive and self-contained tests [19]. A competitive test compares differential expression of the gene set to a standard defined by the complement of that gene set. A self-contained test, in contrast, compares the gene set to a fixed standard that does not depend on the measurements of genes outside the gene set. The competitive test is more popular than the self-contained test.
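The running-sum statistic defined by P_hit and P_miss above can be sketched as follows. This is an illustrative reading of the two formulas, not the GSEA software itself; the function name, argument names and toy gene identifiers are assumptions, and the permutation of class labels used to assess significance is not shown.

```python
def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Return ES(S): the maximum deviation from zero of P_hit - P_miss.

    ranked_genes : gene identifiers, already ordered by their correlation r_j
    correlations : the r_j values, in the same order as ranked_genes
    gene_set     : the a priori defined gene set S
    p            : exponent controlling the weight of each hit step
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    n_total = len(ranked_genes)              # N
    n_hits = sum(in_set)                     # N_H
    if n_hits == 0 or n_hits == n_total:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    # N_R: normalizing constant for the weighted hit steps.
    n_r = sum(abs(r) ** p for r, hit in zip(correlations, in_set) if hit)
    if n_r == 0:
        raise ValueError("all genes in the set have zero correlation")
    miss_step = 1.0 / (n_total - n_hits)

    p_hit = p_miss = 0.0
    best = 0.0
    for r, hit in zip(correlations, in_set):
        if hit:
            p_hit += abs(r) ** p / n_r
        else:
            p_miss += miss_step
        deviation = p_hit - p_miss
        if abs(deviation) > abs(best):
            best = deviation
    return best


# Toy ranked list (hypothetical identifiers and correlations).
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
r = [2.0, 1.5, 1.0, -0.5, -1.0, -2.0]
es_top = enrichment_score(genes, r, {"g1", "g2"})       # set concentrated at the top
es_bottom = enrichment_score(genes, r, {"g5", "g6"})    # set concentrated at the bottom
```

With p = 0 every hit contributes equally and the statistic reduces to the unweighted Kolmogorov-Smirnov form; a positive ES indicates enrichment toward the top of the list and a negative ES enrichment toward the bottom.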



Figure 4. Differential co-expression analyses. Differential co-expression (a) of clusters can be detected by a method proposed by Kostka and
Spang [26], (b) of gene pairs can be detected by a method proposed by Lai et al. [24], and (c) of paired gene sets by a method proposed by Cho et al.
[12]. C: conditions, g: genes, s: gene clusters, S: a priori defined gene sets. (Modified from [30]).
doi:10.1371/journal.pcbi.1002858.g004

Typical gene sets are regulatory-motif, function-related, and disease-related sets. MSigDB (Molecular Signatures Database) is one of the leading gene set databases (http://www.broadinstitute.org/gsea/msigdb), containing a total of 6769 gene sets which are classified into five different collections (positional, curated, motif, computational and GO gene sets). Several interesting extensions were proposed in terms of sample level applications. For example, researchers developed genomic signatures to identify the activation status of oncogenic pathways and predict the sensitivity to individual chemotherapeutic drugs [20,21]. Significance Analysis of Function and Expression (SAFE) [22] extends GSEA to cover multiclass, continuous and survival phenotypes. It also provides more options for the test statistic, including Wilcoxon rank sum, Kolmogorov-Smirnov and hypergeometric statistics.

4. Differential Co-Expression Analysis

Co-expression analysis determines the degree of co-expression of a group (or cluster) of genes under a certain condition. Unlike co-expression analysis, differential co-expression analysis determines the degree of co-expression difference of a gene pair or a gene cluster across different conditions, which may relate to key biological processes provoked by changes in environmental conditions [12,23–25]. Differential co-expression analysis methods can be categorized into three major types (Figure 4): (a) differential co-expression of gene cluster(s) [26], (b) gene pair-wise differential co-expression [24] and (c) differential co-expression of paired gene sets [12].

To identify differentially co-expressed gene cluster(s) between two conditions (C1 and C2 in Figure 4(a)), a method determines whether a cluster shows a significant conditional difference in the degree of co-expression. An additive model-based scoring can be used based on the mean squared residual [26]. Let conditions and genes be denoted by $J$ and $I$, respectively. The mean squared residual of the model is a measurement of co-expression of genes:

$$
S(I,J) = \frac{1}{(|I|-1)(|J|-1)} \sum_{i \in I,\, j \in J} \left(a_{ij} - a_{i\cdot} - a_{\cdot j} + a_{\cdot\cdot}\right)^2
$$

where an entry $a_{ij}$ is the expression level of gene $i$ in condition $j$, $a_{i\cdot}$ is the mean expression level of gene $i$ across conditions, $a_{\cdot j}$ is the mean expression level of all genes in condition $j$, and $a_{\cdot\cdot}$ is the mean expression level of all genes across all conditions. A group of genes with a low score $S$ has highly correlated expression. Given two groups $J_1$ and $J_2$, e.g. disease and control, the method minimizes the score, $S(I)$, of a set of genes, $I$:

$$
S(I) = \frac{S(I,J_1)}{S(I,J_2)} = \frac{(|J_2|-1) \sum_{i \in I,\, j \in J_1} \left(a^{1}_{ij} - a^{1}_{i\cdot} - a^{1}_{\cdot j} + a^{1}_{\cdot\cdot}\right)^2}{(|J_1|-1) \sum_{i \in I,\, j \in J_2} \left(a^{2}_{ij} - a^{2}_{i\cdot} - a^{2}_{\cdot j} + a^{2}_{\cdot\cdot}\right)^2}
$$

A greedy downhill approach finds local minima of the score. Another approach uses a t-statistic for each cluster to evaluate the difference of the degree of co-expression between conditions, after creating gene expression clusters [27]. These methods can be viewed as an attempt to find gene clusters that are tightly co-regulated (i.e. highly co-expressed) in one condition (e.g. normal) but not in another (e.g. cancer).

To identify differentially co-expressed gene pairs (Figure 4(b)), an F-statistic can be calculated as the expected conditional F-statistic (ECF), a modified F-statistic, for all pairs of genes between two conditions [24]. A meta-analytic approach can also detect gene pairs with significant differential co-expression between normal and cancer samples [25]. These methodologies can be regarded as an attempt to discover gene pairs that are, in principle, positively correlated in one condition (e.g. normal) and negatively correlated in another (e.g. cancer). Identification of differentially co-expressed gene clusters or gene pairs usually does not use a priori defined gene sets or pairs but tries to find the best ones among all possible combinations without considering prior knowledge. Thus the biological interpretation of the clusters or pairs may also be improved by ontology and pathway-based annotation analysis. The idea of finding gene clusters that show positive correlation in one condition and negative correlation in another condition sounds very interesting. However, it seems that there is very little chance for such a cluster to exist. Similarly, one can hardly find such a set among a priori defined gene sets (i.e. biological pathways). It is even difficult to expect a biological pathway whose members are all highly positively (or negatively) co-expressed in a condition because a biological pathway is a complex functional system with interacting positive and negative feedback loops. Thus, members of a biological pathway may not be contained in a single co-expression cluster, especially when the cluster is not very big, but be split into different clusters.

Figure 5. The dCoxS algorithm. Expression matrices of two gene sets (upper panel) are transformed into Rényi relative entropy matrices by all sample pair-wise comparisons (middle panel). For each condition, the Interaction Score (IS), a kind of correlation coefficient, between a pair of entropy matrices is obtained. Upper diagonal heat maps in the middle panel are transformed into scatter plots in the lower panel where ISs are depicted as fitted lines. (Modified from [12]).
doi:10.1371/journal.pcbi.1002858.g005

The dCoxS (differential co-expression of gene sets) algorithm identifies (a priori defined or semantically enriched) gene set pairs differentially co-expressed across different conditions (Figure 4(c) and Figure 5) [12]. Biological pathways can be used as pre-defined gene sets and the differential co-expression of the biological



Figure 6. Concept lattice. The binary relation set R = { (C1,b), (C1,f), (C1,j), (C2,b), (C2,d), …, (C5,e), (C5,h)} can be represented as (a) a relation matrix,
(b) a directed bipartite graph, and (c) a concept lattice. Colored rectangles in the relation matrix represent concepts. The same color represents the
same concept in (a) and (c). (Modified from [13]).
doi:10.1371/journal.pcbi.1002858.g006
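The step from a relation matrix to a concept lattice, as in panels (a) and (c) of Figure 6, can be sketched in a few lines of Python. This is only an illustrative sketch: the three-cluster context below is hypothetical (it is not the relation R of Figure 6), the function names are our own, and the brute-force enumeration over object subsets is not the Ganter algorithm used by BioLattice, so it is practical only for very small contexts.

```python
from itertools import combinations

# Hypothetical context: clusters (objects) mapped to annotations (attributes).
# These pairs are illustrative only; they are NOT the relation R of Figure 6.
context = {
    "C1": {"a", "b"},
    "C2": {"b", "c"},
    "C3": {"a", "b", "c"},
}


def common_attributes(objects, context, all_attributes):
    """Return the attributes shared by every object in `objects`."""
    shared = set(all_attributes)
    for obj in objects:
        shared &= context[obj]
    return frozenset(shared)


def objects_having(attributes, context):
    """Return the objects that carry every attribute in `attributes`."""
    return frozenset(o for o, attrs in context.items() if attributes <= attrs)


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every object subset."""
    all_attributes = set().union(*context.values())
    objs = sorted(context)
    concepts = set()
    for size in range(len(objs) + 1):
        for subset in combinations(objs, size):
            intent = common_attributes(subset, context, all_attributes)
            extent = objects_having(intent, context)
            concepts.add((extent, intent))
    return concepts


concepts = formal_concepts(context)

# List concepts from the smallest extent to the largest; a smaller extent
# always comes with a larger (or equal) set of shared annotations.
for extent, intent in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
```

Each printed pair is one node of the lattice: the clusters in the extent share exactly the annotations in the intent, which is why every label needs to be drawn only once in the diagram.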

pathway pairs between conditions is analyzed. To measure the expression similarity between paired gene sets under the same condition, dCoxS defines the interaction score (IS) as the correlation coefficient between the sample-wise entropies. Even when the two pathways contain different numbers of genes, the IS can always be obtained because it uses only sample-wise distances.

$$
IS = \frac{\sum_{i<j}\left(RE^{G_1}_{ij}-\overline{RE^{G_1}}\right)\left(RE^{G_2}_{ij}-\overline{RE^{G_2}}\right)}{\sqrt{\sum_{i<j}\left(RE^{G_1}_{ij}-\overline{RE^{G_1}}\right)^{2}}\;\sqrt{\sum_{i<j}\left(RE^{G_2}_{ij}-\overline{RE^{G_2}}\right)^{2}}}
$$

where $RE^{G_1}$ and $RE^{G_2}$ are the matrices of the Renyi relative entropy of the gene sets $G_1$ and $G_2$, indexed by sample pairs $(i, j)$, and the overbar denotes the mean over all sample pairs $i<j$. When estimating the relative entropy, multivariate kernel density estimation was used to model gene-gene correlation structure.

For example, when we compute the IS of a pair of pathway expression matrices with dimensions 20 (samples) by 25 and by 15 (genes) for a condition, we calculate 190 (= (20 × 19)/2) sample pair-wise entropy distances for each pathway expression matrix. The IS is obtained by calculating the correlation coefficient between the two entropy vectors. Finally, the statistical significance of the difference of the Fisher's Z-transformed ISs between two conditions is tested for each pathway pair.

$$
Z_f = \frac{1}{2}\ln\left(\frac{1+IS}{1-IS}\right)
$$

$$
p\left(Z \ge \left|\frac{Z_{f1}-Z_{f2}}{\sqrt{\dfrac{1}{N_1-3}+\dfrac{1}{N_2-3}}}\right|\right)
$$

$Z_{f1}$ and $Z_{f2}$ are the Fisher's Z-transformed values of the IS under the two conditions, and $N_1$ and $N_2$ are the numbers of upper-diagonal elements, calculated as $n(n-1)/2$ ($n$ = number of samples) for each condition.

For the purpose of comparison, all gene pair-wise $Z_f$ values are calculated for each condition and the conditional difference of the Fisher's Z-transformed correlation coefficients is tested for each gene pair as follows:

$$
Z_f = \frac{1}{2}\ln\left(\frac{1+CC}{1-CC}\right)
$$

The p-value of the difference in the $Z_f$ values is calculated using the standard normal distribution:

$$
p\left(Z \ge \left|\frac{Z_{f1}-Z_{f2}}{\sqrt{\dfrac{1}{N_1-3}+\dfrac{1}{N_2-3}}}\right|\right)
$$

where CC indicates the correlation coefficient of a gene pair, $Z_{fi}$ the Fisher's Z-transformed correlation coefficient and $N_i$ the number of samples in condition $i$. The p value for differential co-expression is obtained according to the difference



Figure 7. BioLattice of mouse renal inflammation induced by glomerular basement membrane (GBM) antibody.
doi:10.1371/journal.pcbi.1002858.g007

between the Z values from the normal distribution. For each gene pair, three p values are obtained, one from each condition and another from the difference between the conditions. Bonferroni correction is applied.

5. Biological Interpretation and Biological Semantics

Biological interpretation of genomic data requires a variety of semantic knowledge. Biomedical semantics provides rich descriptions for biomedical domain knowledge. Biomedical semantics is a valuable resource not only for biological interpretation but also for multi-layered heterogeneous data integration and genotype-phenotype association. Symbolic inference algorithms may add further value.

Although GO and pathway-based analysis of co-expressed gene groups is one of the most powerful approaches for interpreting microarray experiments, it has limitations. The result, for example, is typically a long unordered list of annotations for tens or hundreds of gene clusters. Most of the analysis tools evaluate only one cluster at a time in a sequential manner without considering the informative association network of clusters and annotations. It is very time-consuming to read the massive annotation lists for a large number of clusters. It is extremely hard to manually assemble the puzzle pieces (i.e., the cluster-annotation sets) into an executive summary (i.e., the context of the whole experiment). Many annotations are redundant such that many clusters share the same annotations in a very complex manner. Ideally, the assembly should involve eliminating redundant attributes and organizing the pieces in a well-defined order for better biological understanding and insight into the underlying context of the experiment under investigation.

BioLattice is a mathematical framework based on concept lattice analysis to organize traditional clusters and associated annotations into a lattice of concepts for better biological interpretation of microarray gene-expression data [13]. BioLattice considers gene expression clusters as objects and annotations as attributes and provides a graphical summary of the order relations by arranging them on a concept lattice in an order based on the set inclusion relation. Complex relations among clusters and annotations are clarified, ordered and visualized. Redundancy of annotation is completely removed. It also has the advantage that heterogeneous biological knowledge resources (such as transcription factor binding, chromosomal co-location and protein-protein interaction networks) can be added to better explore the underlying structures. The representation of relationships between clusters can give more insight into the functions of interesting genes.

Figure 6 demonstrates a context (or a gene expression dataset) with clusters and annotations. Note that the relation matrix between objects (i.e., rows or clusters) and attributes (i.e., columns or annotations) can be represented by a bipartite graph (Figure 6(b)) or a concept lattice (Figure 6(c)). A concept lattice organizes all clusters and annotations of a relation matrix into a single unified structure with no redundancy and no loss of information. It is worth noting that the cluster labels, C1 to C5, and the annotation labels appear once and only once in the lattice diagram (Figure 6(c)). Now one can interpret the whole experimental context (Figure 6(a)) by reading the ordered concepts with clusters and annotations.

Structural analysis methods like prominent sub-lattice analysis and core-periphery structure analysis may help further understanding [13]. Figure 7 shows a BioLattice for a mouse anti-GBM glomerulonephritis model [28]. Genes showing significant time-dose effect were clustered



into 100 clusters and annotated with GO terms. The whole complex of clusters and annotations is organized into a single unified lattice graph, providing an executive summary. The Ganter algorithm [29] can be used to construct BioLattice. A web-based tool using Perl, JavaScript and Scalable Vector Graphics is available at http://www.snubi.org/software/biolattice/. Prominent sub-lattice analysis reveals a meaningful sub-structure converging into cluster 85, which has the GO term chemotaxis and inherits proteolysis and peptidolysis (clusters 58 and 96), inflammatory response, immune response, protein amino acid phosphorylation, and cell surface receptor linked signal transduction (cluster 60), signal transduction (cluster 19), intracellular signaling cascade (cluster 65). It is clearly visualized that cellular immune response system activation is the core pathological process in the IgA nephropathy model of kidney and clusters 19, 58, 60, 65, 5 and 96 are within those concepts.

Context in concept lattice analysis is a triplet (G, M, I) consisting of two sets G and M and a relation I between G and M. The elements of G and M are called objects and attributes, respectively. We denote gIm or (g, m) ∈ I to show that object g has attribute m. For a set A ⊆ G of objects, we define A′ := { m ∈ M | gIm for all g ∈ A } (i.e., the set of attributes common to the objects in A). Correspondingly, for a set B ⊆ M of attributes, we define B′ := { g ∈ G | gIm for all m ∈ B } (i.e., the set of objects that have all attributes in B).

Concept lattice analysis models concepts as units of thought, consisting of two parts. A concept of the context (G, M, I) is a pair (A, B) with A ⊆ G, B ⊆ M, A′ = B and B′ = A. We call A and B the extent and the intent, respectively, of concept (A, B). The extent consists of all objects belonging to the concept while the intent contains all attributes shared by the objects. The set of all concepts of the context (G, M, I) is denoted by C(G, M, I). A concept lattice is drawn by ordering (A, B), which are defined as concepts of the context (G, M, I). The set of all concepts of a context together with the partial order (A1, B1) ≤ (A2, B2) :⇔ A1 ⊆ A2 (which is equivalent to B1 ⊇ B2) is called a concept lattice.

We regard A as defining gene expression clusters that share common knowledge attributes and B as defining the knowledge terms that are annotated to the clusters. The concepts are arranged in a hierarchical order so that the order C1 ≤ C2 ⇔ A1 ⊆ A2 ⇔ B1 ⊇ B2 is defined for C1 = (A1, B1), C2 = (A2, B2). The top element of a lattice is a unit concept, representing a concept that contains all objects. The bottom element is a zero concept having no object.

6. Summary

This chapter has shown major computational approaches to facilitate biological interpretation of high-throughput microarray and RNA-Seq experiments. Enrichment analysis with ontologies, biological pathways or external resources is widely used to interpret the functional role of correlated genes or differentially expressed genes. In the analysis steps, the groups of genes are assigned to the associated biological annotation terms using GO terms or biological pathways. Then it is necessary to examine whether gene members are statistically enriched in each of the annotation terms or pathways relative to a background set, using a statistical test such as the Chi-square, Fisher's exact, binomial or hypergeometric test. Unlike previous approaches identifying a set of significant genes, Gene Set Enrichment Analysis uses pre-defined sets to search for groups of functionally related genes with coordinated expression across a list of genes ranked by differential expression. Differential co-expression analysis determines the degree of co-expression difference of a gene set pair across different conditions. The dCoxS algorithm identifies differentially co-expressed gene sets under different conditions. Outcomes in microarray and RNA-Seq data can be transformed into a graphical structure that represents biological semantics. A number of biomedical annotation and external repositories including clinical resources can be integrated by biological semantics analysis tools such as BioLattice.

7. Exercises

1) Select significantly DEGs from the train dataset of AML (Acute Myelocytic Leukemia) and ALL (acute lymphoblastic leukemia) expression data (http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=43) and find enriched GO terms from an ontology analysis tool. Dataset and analysis functions are also included in the R statistical package golubEsets in Bioconductor.

2) List significantly enriched pathways using a pathway analysis tool with the dataset in Exercise 1.

3) Find KEGG pathways significantly associated with leukemia subtype in the 2-sample comparison of AML and ALL by GSEA through the Kolmogorov–Smirnov test. Analysis and data set are provided by SAFE R (http://bioconductor.org/packages/2.0/bioc/html/safe.html).

4) Identify the differentially co-expressed gene set pairs using dCoxS with simulated data in (http://www.snubi.org/publication/dCoxS). Compute the interaction score between matrix M and M1 using the ias function. Then compute the interaction score between M and M2. Finally, using the compcorr function, estimate the significance of the difference of ias. Note that in the compcorr function, n1 and n2 are the numbers of all possible sample pairs.

5) Report semantic relationships of pathways and GO terms using BioLattice (http://www.snubi.org/software/biolattice/). Use the result of k-means clustering (k = 10) with DEG in Exercise 1. Select Category as biological process, p-value < 0.05.

Answers to the Exercises can be found in Text S1.

Supporting Information

Text S1 Answers to Exercises (DOCX)

Further Reading

• Draghici S (2003) Data analysis tools for DNA microarrays. Chapman and Hall/CRC Press.
• Curtis RK, Oresic M, Vidal-Puig A (2005) Pathways to the analysis of microarray data. Trends Biotechnol 23(8): 429–435.
• Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S (2005) Bioinformatics and computational biology solutions using R and Bioconductor. Springer.
• Deshmukh SR, Purohit SG (2007) Microarray data: statistical analysis using R. Oxford: Alpha Science International Ltd.
• Guerra R, Goldstein DR (2008) Meta-analysis and combining information in genetics and genomics. Chapman and Hall.
• Werner T (2008) Bioinformatics applications for pathway analysis of microarray data. Current Opinion in Biotechnology 19(1): 50–54.
• Emmert-Streib F, Dehmer M (2010) Medical biostatistics for complex diseases. Wiley.
• Kann MG (2010) Advances in translational bioinformatics: computational approaches for the hunting of disease genes. Brief Bioinform 11(1): 96–110.

Glossary

Bioconductor: a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology written in R Statistical Package.

Clustering: algorithm that puts similar things together and different things apart.

Gene expression profiling: the measurement of the activity (or expression) of thousands of genes at once to create a global picture of cellular function using DNA microarray technology.

Gene set: a meaningful grouping of genes like biological pathways, genes sharing certain regulatory-motifs, genes sharing certain functional annotations, and certain disease-related gene sets.

Gene Set Enrichment Analysis: an algorithm to determine whether an a priori defined set of genes shows statistically significant coordinated differential expression between conditions.

Gene Ontology: a set of controlled vocabularies in molecular function, biological process and cellular component for the standardized annotations of genes and gene products across all species.

Hypergeometric distribution: a discrete probability distribution that describes the number of successes in a sequence of n draws from a finite population without replacement, just as the binomial distribution describes the number of successes for draws with replacement.

Kolmogorov–Smirnov test (KS test): a nonparametric test for the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample KS test), or to compare two samples (two-sample KS test).
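As a companion to Exercise 4, the Fisher Z comparison of two correlation-type scores described in Section 4 can be sketched as follows. This is a minimal sketch under stated assumptions, not the dCoxS implementation: the function names `fisher_z` and `compare_correlations` are hypothetical (they are not the `ias` or `compcorr` functions of the dCoxS package), and the p-value is the one-sided tail p(Z ≥ |z|) exactly as the formula in Section 4 is written; a two-sided test would double it.

```python
import math


def fisher_z(r):
    """Fisher's Z-transform of a correlation-type score r with -1 < r < 1."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def compare_correlations(r1, n1, r2, n2):
    """Test the difference between two correlation-type scores.

    r1, r2 : scores (e.g. IS or CC) under condition 1 and condition 2
    n1, n2 : numbers of observations behind each score; for the IS this is
             the number of sample pairs, n * (n - 1) / 2
    Returns the z statistic and the tail probability p(Z >= |z|).
    """
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (fisher_z(r1) - fisher_z(r2)) / standard_error
    # Upper tail of the standard normal distribution evaluated at |z|.
    p = 0.5 * math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical example: 20 samples per condition give 190 sample pairs.
n_pairs = 20 * 19 // 2
z, p = compare_correlations(0.9, n_pairs, 0.1, n_pairs)
print(f"z = {z:.2f}, p = {p:.3g}")
```

When many gene set pairs are tested, the resulting p-values still need a multiple-testing correction such as the Bonferroni correction mentioned in Section 4.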

References
1. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM (1999) Systematic determination of genetic network architecture. Nat Genet 22(3): 281–285.
2. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1): 25–29.
3. Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR (2002) GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat Genet 31(1): 19–20.
4. Al-Shahrour F, Díaz-Uriarte R, Dopazo J (2004) FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 20(4): 578–580.
5. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, et al. (2004) TermFinder: open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 20(18): 3710–3715.
6. Chung HJ, Kim M, Park CH, Kim J, Kim JH (2004) ArrayXPath: mapping and visualizing microarray gene-expression data with integrated biological pathway resources using Scalable Vector Graphics. Nucleic Acids Res 32(Web Server issue): W460–464.
7. Zhang B, Schmoyer D, Kirov S, Snoddy J (2004) GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies. BMC Bioinformatics 5: 16.
8. Zhong S, Storch KF, Lipan O, Kao MC, Weitz CJ, et al. (2004) GoSurfer: a graphical interactive tool for comparative analysis of large gene sets in Gene Ontology space. Appl Bioinformatics 3(4): 261–264.
9. Chung HJ, Park CH, Han MR, Lee S, Ohn JH, et al. (2005) ArrayXPath II: mapping and visualizing microarray gene-expression data with biomedical ontologies and integrated biological pathway resources using Scalable Vector Graphics. Nucleic Acids Res 33(Web Server issue): W621–626.
10. Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, et al. (2003) PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 34(3): 267–273.
11. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102(43): 15545–15550.
12. Cho SB, Kim J, Kim JH (2009) Identifying set-wise differential co-expression in gene expression microarray data. BMC Bioinformatics 10: 109.
13. Kim J, Chung HJ, Jung Y, Kim KK, Kim JH (2008) BioLattice: a framework for the biological interpretation of microarray gene expression data using concept lattice analysis. J Biomed Inform 41(2): 232–241.
14. Yue L, Reisdorf WC (2005) Pathway and ontology analysis: emerging approaches connecting transcriptome data and clinical endpoints. Curr Mol Med 5(1): 11–21.
15. Caspi R, Altman T, Dale JM, Dreher K, Fulcher CA, et al. (2010) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 38(Database issue): D473–479.
16. Alexa A, Rahnenfuhrer J, Lengauer T (2006) Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22(13): 1600–1607.
17. Falcon S, Gentleman R (2007) Using GOstats to test gene lists for GO term association. Bioinformatics 23(2): 257–258.
18. Huang da W, Sherman BT, Lempicki RA (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37(1): 1–13.
19. Goeman JJ, Buhlmann P (2007) Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 23(8): 980–987.
20. Bild AH, Yao G, Chang JT, Wang Q, Potti A, et al. (2006) Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 439(7074): 353–357.
21. Potti A, Dressman HK, Bild A, Riedel RF, Chan G (2006) Genomic signatures to guide the use of chemotherapeutics. Nat Med 12(11): 1294–1300.
22. Barry WT, Nobel AB, Wright FA (2005) Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics 21(9): 1943–1949.
23. Li KC (2002) Genome-wide co-expression dynamics: theory and application. Proc Natl Acad Sci U S A 99(26): 16875–16880.
24. Lai Y, Wu B, Chen L, Zhao H (2004) A statistical method for identifying differential gene-gene co-expression patterns. Bioinformatics 20(17): 3146–55.
25. Choi JK, Yu U, Yoo OJ, Kim S (2005) Differential co-expression analysis using microarray data and its application to human cancer. Bioinformatics 21(24): 4348–4355.
26. Kostka D, Spang R (2004) Finding disease specific alterations in the co-expression of genes. Bioinformatics 20 Suppl 1: i194–199.
27. Watson M (2006) CoXpress: differential co-expression in gene expression data. BMC Bioinformatics 7: 509.
28. Kim JH, Ha IS, Hwang CI, Lee YJ, Kim Y, et al. (2004) Gene expression profiling of anti-GBM glomerulonephritis model: the role of NFkB in immune complex-mediated kidney disease. Kidney International 66(5): 1826–1837.
29. Ganter B, Wille R (1999) Formal concept analysis: mathematical foundations. Berlin; New York: Springer.
30. Emmert-Streib F, Dehmer M (2010) Medical Biostatistics for Complex Diseases. Wiley.



Education

Chapter 9: Analyses Using Disease Ontologies


Nigam H. Shah*, Tyler Cole, Mark A. Musen
Center for Biomedical Informatics Research, Stanford University, Stanford, California, United States of America

This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

Abstract: Advanced statistical methods used to analyze high-throughput data such as gene-expression assays result in long lists of significant genes. One way to gain insight into the significance of altered expression levels is to determine whether Gene Ontology (GO) terms associated with a particular biological process, molecular function, or cellular component are over- or under-represented in the set of genes deemed significant. This process, referred to as enrichment analysis, profiles a gene-set, and is widely used to make sense of the results of high-throughput experiments. The canonical example of enrichment analysis is when the output dataset is a list of genes differentially expressed in some condition. To determine the biological relevance of a lengthy gene list, the usual solution is to perform enrichment analysis with the GO. We can aggregate the annotating GO concepts for each gene in this list, and arrive at a profile of the biological processes or mechanisms affected by the condition under study. While GO has been the principal target for enrichment analysis, the methods of enrichment analysis are generalizable. We can conduct the same sort of profiling along other ontologies of interest. Just as scientists can ask "Which biological process is over-represented in my set of interesting genes or proteins?" we can also ask "Which disease (or class of diseases) is over-represented in my set of interesting genes or proteins?". For example, by annotating known protein mutations with disease terms from the ontologies in BioPortal, Mort et al. recently identified a class of diseases, blood coagulation disorders, that were associated with a 14-fold depletion in substitutions at O-linked glycosylation sites. With the availability of tools for automatic annotation of datasets with terms from disease ontologies, there is no reason to restrict enrichment analyses to the GO. In this chapter, we will discuss methods to perform enrichment analysis using any ontology available in the biomedical domain. We will review the general methodology of enrichment analysis, the associated challenges, and discuss the novel translational analyses enabled by the existence of public, national computational infrastructure and by the use of disease ontologies in such analyses.

Citation: Shah NH, Cole T, Musen MA (2012) Chapter 9: Analyses Using Disease Ontologies. PLoS Comput Biol 8(12): e1002827. doi:10.1371/journal.pcbi.1002827
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: © 2012 Shah et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: We acknowledge support from NIH grant U54 HG004028 for the National Center for Biomedical Ontology. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: nigam@stanford.edu

1. Introduction

Advanced statistical methods are most often used to perform the analysis of high-throughput data such as gene-expression assays [1–5], the result of which is a long list of significant genes. Extracting biological meaning from such lists is a nontrivial and time-consuming task, which is exacerbated by the inconsistencies in free-text gene annotations. The Gene Ontology (GO) offers a taxonomy that provides a mechanism to determine statistically significant functional subgroups within gene groups. One way to gain insight into the biological significance of alterations in gene expression levels is to determine whether the GO terms associated with the particular biological process, molecular function, or cellular component are over- or under-represented in the set of genes deemed significant by the statistical analysis [6]. This process, often referred to as enrichment analysis, can be used to summarize a gene-set [7], although it can also be relevant for other high-throughput measurement modalities including proteomics, metabolomics, and studies using tissue-microarrays [8].

With the availability of tools for automatic ontology-based annotation of datasets with terms from biomedical ontologies besides the GO, we need not restrict enrichment analysis to the GO. In this chapter, we outline the methodology of enrichment analysis, the associated challenges, and discuss novel analyses enabled by performing enrichment analysis using disease ontologies. We first review the current methods of GO based enrichment analysis to provide a foundation for discussing analyses using Disease Ontologies. Note that there is also research underway on the use of pathways for enrichment analyses as well as comparing statistically significant, concordant differences between two biological states as in Gene Set Enrichment Analysis [9], which are not discussed here.

1.1 Gene Ontology Enrichment Analysis

The goal of enrichment analysis is to determine which biological processes (or molecular functions) might be predominantly affected in the set of genes that were deemed interesting or significantly changed. The simplest approach is to calculate functional enrichment/depletion for each GO term: a higher (or lower) proportion of genes with certain annotations among the significantly changed genes than among all of the genes measured in the experiment. The finding of enrichment should not be interpreted as evidence implicating the GO term in the process studied without an appropriate statistical test.

The calculation of GO based functional enrichment involves two sets of items (usually genes or proteins): 1) The reference set, which is the set of items with which the significant-set is to be compared; the reference set may comprise all of the genes in the genome or all of the genes for which there were probes in the high throughput experiment; 2) The set of interest, which is the subset or list of significant genes that is to be analyzed for

PLOS Computational Biology | www.ploscompbiol.org 1 December 2012 | Volume 8 | Issue 12 | e1002827


enrichment (or depletion) of GO terms in their annotations.

What to Learn in This Chapter

• Review the commonly used approach of Gene Ontology based enrichment analysis
• Understand the pitfalls associated with current approaches
• Understand the national infrastructure available for using alternative ontologies for enrichment analysis
• Learn about a generalized enrichment analysis workflow and its application using disease ontologies

The analysis process (Figure 1) counts the GO annotations for both gene lists to calculate the number of genes (n and m) annotated with a particular GO term in each list and then calculates the probability (p-value) of the occurrence of at least n genes belonging to that category among the N genes in the set of interest, given that m genes are annotated with that term among the M genes in the reference set.

There are multiple ways to calculate the probability of observing a specific enrichment value. The simplest approach is to use a binomial model. For example, if one assumes that the probability of picking a gene annotated with the GO term apoptosis is fixed and is equal to the proportion of genes annotated with apoptosis in the reference set, then the binomial distribution provides the probability of obtaining a particular proportion of apoptosis genes among the genes in the set of interest by chance [10]. Such an approximation is quite reasonable for large reference sets (e.g. the whole genome) because the probability of selecting a gene annotated with the term apoptosis into the set of interest does not change significantly after each selection.

However, when a gene or protein is picked from a smaller reference set, then the probability that the next picked gene is annotated to apoptosis is affected by whether the previously picked genes were annotated to apoptosis. Under these circumstances, the hypergeometric distribution (a discrete probability distribution that describes the number of successes in a fixed number of draws from a finite population without replacement) is a better statistical model. Another option is Fisher's exact test or the chi-squared distribution, both of which take into consideration how the probabilities change when a gene is picked. The hypergeometric p-value is calculated using the following formula:

$$
P_n = \frac{\dbinom{m}{n}\dbinom{M-m}{N-n}}{\dbinom{M}{N}}
$$

Because the p-value refers to the occurrence of at least n annotated genes, $P_k$ is summed over $k = n, \ldots, \min(m, N)$.

The p-value reports the likelihood of finding n genes annotated with a particular GO term in the set of interest by chance alone, given the number of genes annotated with that GO term in the reference set. A biological process, molecular function or cellular location (represented by a GO term) is called enriched if the p-value is less than 0.05. GO annotations form the cornerstone of enrichment analysis in sets of differentially expressed genes. The GO project's Web site lists over 50 tools that can be used in this process [11].

Enrichment analysis can be done as a hypothesis-generating task, such as asking which GO terms are significant in a particular set of genes, or a hypothesis-driven task, such as asking whether apoptosis is significantly enriched or depleted in a particular set of genes.

In the hypothesis-driven setting, the analysis can include all of the genes that are annotated both directly to apoptosis and to its child nodes to maximize the statistical power because no correction for multiple comparisons is required. The hypothesis-generating approach allows an unbiased search for significant GO annotations. The analysis can be done with a bottom-up approach where for every leaf term the genes annotated with that GO term are also assigned to its immediate parent term. One can propagate the annotations recursively up along parent nodes until a significant node is found or until the root is reached. (Note: this upward propagation of annotations is

Figure 1. An overview of the process to calculate enrichment of GO categories. The steps usually followed are: (1) Get annotations for each
gene in reference set and the set of interest. (2) Count the occurrence (n) of each GO term in the annotations of the genes comprising the set of
interest. (3) Count the occurrence (m) of that same GO term in the annotations of the reference set. (4) Assess how surprising it is to find n, given m,
M and N.
doi:10.1371/journal.pcbi.1002827.g001
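The four steps of Figure 1 can be sketched for a single GO term as follows. This is a minimal sketch under stated assumptions: the gene identifiers, the annotation table and the function names are hypothetical, and the p-value is taken as the upper tail of the standard hypergeometric distribution (the probability of at least n annotated genes in the set of interest), using only Python's standard library.

```python
from math import comb


def count_annotated(genes, annotations, term):
    """Count how many genes in `genes` are annotated with `term`."""
    return sum(1 for gene in genes if term in annotations.get(gene, ()))


def hypergeom_pmf(k, N, m, M):
    """Probability of exactly k annotated genes among N genes drawn without
    replacement from M reference genes, m of which are annotated."""
    return comb(m, k) * comb(M - m, N - k) / comb(M, N)


def enrichment_p_value(n, N, m, M):
    """Probability of observing at least n annotated genes by chance."""
    return sum(hypergeom_pmf(k, N, m, M) for k in range(n, min(m, N) + 1))


# Step 1: hypothetical annotations for a toy reference set of 10 genes.
reference_set = [f"g{i}" for i in range(1, 11)]
annotations = {g: {"apoptosis"} for g in ("g1", "g2", "g3", "g4")}
set_of_interest = ["g1", "g2", "g5"]

# Steps 2 and 3: count the term in the set of interest and in the reference set.
n = count_annotated(set_of_interest, annotations, "apoptosis")
m = count_annotated(reference_set, annotations, "apoptosis")
N, M = len(set_of_interest), len(reference_set)

# Step 4: assess how surprising n is, given m, M and N.
p_value = enrichment_p_value(n, N, m, M)
print(f"n={n}, N={N}, m={m}, M={M}, p={p_value:.4f}")
```

In a hypothesis-generating analysis this calculation is repeated for every GO term, so the resulting p-values need a multiple-testing correction before any term is called enriched.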



referred to as computing the transitive can lower the p-value cutoff in order to trend. The canonical example of enrich-
closure of the annotation set over the reduce the FDR to acceptable levels. ment analysis is in the interpretation of a
graph of the Gene Ontology). Newer Multiple hypothesis testing is a general list of differentially expressed genes in
approaches can also perform the enrich- problem that is not specific to GO (see some condition. The usual approach is to
ment analysis accounting for the position [15] for a general review). perform enrichment analysis with the GO
of the term in the GO hierarchy [12 A related issue arising from performing [17]. There are currently over 400 publi-
14]. the transitive closurethe propagation of cations on methods and tools for GO-
1.1.1 Interpretation of p- annotations along the parent-child based enrichment, but (to the best of our
values. The p-values should be pathsis that the parallel tests performed knowledge) only a single tool, Genes2Mesh,
interpreted with caution because the for nodes in a given path will be correlated uses something other than the GO (i.e. the
choice of the reference set to which the because the same genes can appear several Medical Subject Headings or MeSH), to
set of interest is compared affects the p- times on each path. Correction methods calculate enrichment [19].
value. For whole genome arrays, using the that assumes independence of categories While GO has been the principal target
list of all genes on the array as the might not function well in this situation for enrichment analysis, we can carry out
reference set is equivalent to using the and might preclude identification of some the same sort of profiling using Disease
complete list of genes in the genome. categories that are indeed enriched [6]. It Ontologies. Just as scientists can ask
However, for arrays containing a selected is possible to use the structure of the GO Which biological process is over-represented in
subset of genes associated with a biological to decorrelate the analysis of various terms my set of interesting genes or proteins?, they also
process, the choice of the gene set to use as [1214] or to use corrections methods should be able to ask Which disease (or class
the reference set is not obvious. Moreover, such as a BenjaminiYekutieli correction, of diseases) is over-represented in my set of
the p-value calculation using the which accounts for the dependency be- interesting genes or proteins? For example, by
hypergeometric distribution assumes the tween the multiple tests [16]. annotating known protein mutations with
independence of the GO annotation disease terms from the ontologies in
categories, an assumption that may not 1.2 Summary of Existing Limitations BioPortal, Mort et al. recently identified
be justified. In 2005, Khatri and Draghici noted a class of diseasesblood coagulation
Another difficulty in determining sig- that, despite widespread adoption, GO- disordersthat were associated with a
nificance using the calculated p-value and based enrichment analysis has intrinsic 14-fold depletion in substitutions at O-
a cutoff of 0.05, especially in the hypoth- drawbacks [17] and scientists must still linked glycosylation sites [20].
esis-generating approach mentioned rely on literature searches to understand a There are several resources that can be
above, is that multiple testing increases set of genes fully. These drawbacks used as disease ontologies for enrichment
the likelihood of obtaining what appears represent conceptual limitations of the analysis. We use the term disease ontol-
to be a statistically significant value by current state of the art and include: ogy to refer to artifactsterminologies,
chance. Multiple testing occurs because vocabularies as well as ontologiesthat
the GO term to be tested for enrichment N Incomplete annotationseven today, can provide a hierarchy of parent-child
is not pre-selected, but each term is tested. roughly 20% of genes lack any GO terms for disease conditions. One of the
This allows multiple opportunities (equal annotation most elaborate ontology for diseases is the
to the number of terms tested) to obtain a
statistically significant p-value by chance
N Annotation bias because of inter-rela- Systematized Nomenclature for Medicine-
Clinical Terms (SNOMED CT) is consid-
tionships between annotations (e.g.
alone in a given gene list. However, annotation with certain GO terms is ered to be the most comprehensive,
correcting for multiple testing by using a not conditionally independent). multilingual clinical healthcare terminolo-
Bonferroni correction in which the critical gy in the world [21]. SNOMED CT was a
p-value cut-off is divided by the number
N Lack of a systematic mechanism to
joint development between the NHS in
define a level of abstraction, to com-
of tests performed is too restrictive pensate for differing levels of granular- England and the College of American
especially when annotations are propa- ity. Pathologists (CAP). It was formed in 1999
gated up to the root node via a transitive by the convergence of SNOMED RT and
closure; then the number of tests is equal The remainder of the chapter discusses the United Kingdoms Clinical Terms
to the number of terms in the GO approaches to using existing, public bioin- Version 3 (formerly known as the Read
hierarchy. formatics tools to address these limitations Codes). As of 2007, SNOMED CT is
In this situation, calculation of the false and use disease ontologies in such analy- maintained and distributed by the Inter-
discovery rate (FDR), which provides an ses. national Health Terminology Standards
estimate of the percentage of false posi- Development Organization (IHTSDO).
tives among the categories considered 2. Using Disease Ontologies Currently, SNOMED CT contains more
enriched at a certain p-value cutoff, Going beyond GO Annotations than 311,000 active concepts with unique
allows for a more informed choice of the meanings and formal logic-based defini-
p-value cutoff. One can estimate the false As we have discussed, enrichment tions organized into multiple hierarchies.
discovery rate (FDR) for the enriched analysis provides a means of understand- The disease hierarchy is available under
categories by performing simulations ing the results of high-throughput datasets the clinical finding root node (analogous to
which generate a user-specified number [17,18]. Conceptually, enrichment analy- the biological process root node in the
of random gene sets of the same size as sis involves associating elements in the Gene Ontology). Another widely used
the set of interest and calculate the results of high-throughput data analysis to disease ontology is the National Cancer
average number of categories that are concepts in an ontology of interest, using Institute thesaurus (NCIt), which is an
considered enriched in the random gene the ontology hierarchy to create a sum- ontology that provides terms for clinical
sets, at a p-value cutoff of 0.05. If the marization of the result, and computing care, translational and basic research, and
FDR is above the desired threshold, we statistical significance for any observed public information and administrative

activities. NCIt is a widely recognized standard for biomedical coding and reference, used by a variety of public and private institutions including the Clinical Data Interchange Standards Consortium Terminology (CDISC), the U.S. Food and Drug Administration (FDA), the Federal Medication Terminologies (FMT), and the National Council for Prescription Drug Programs (NCPDP). The disease hierarchy is available under the root node of Diseases, Disorders and Findings. The most widely used disease ontology is the International Classification of Diseases (ICD), which is part of the WHO Family of International Classifications. Version 9 of ICD is widely used in the United States for billing purposes in the health care system. Finally, there is an effort to create an ontology of human diseases (available at http://diseaseontology.sourceforge.net) that conforms to the principles of the Open Biomedical Ontologies Foundry [22]. The Human Disease Ontology has been under review by the OBO Foundry since 2006. For the purpose of the current discussion, and enrichment analysis in general, essentially any disease ontology that provides a clear hierarchy of parent-child terms for diseases would be suitable for use.

Enrichment analysis owes its popularity to the fact that the process is methodologically straightforward and yields easily interpretable results. Apart from analyzing results of high-throughput experiments, enrichment analysis can also be used as an exploratory tool to generate hypotheses for clinical research. Computationally generated annotations (from multiple ontologies) on patient cohorts can provide a foundation for enrichment analysis for risk-factor determination. For example, enrichment analysis can identify general classes of drugs, diseases, and test results that are commonly found in readmitted transplant patients but not in healthy recipients. As noted, the GO has been the principal target for such analysis and, despite widespread adoption, GO-based enrichment analysis has intrinsic drawbacks, the primary ones being incompleteness of and bias among available manually created annotations. Below, we discuss recent advances in the use of ontologies for automated creation of annotations that allow us to address these drawbacks and apply enrichment analysis using disease ontologies.

2.1 Advances in Ontology Access and Automated Annotation

There are several recent advances that enable us to use disease ontologies in enrichment analysis. The most obvious advancement is that almost all biomedical ontologies are now available in public repositories such as BioPortal [23] (built as a part of the NIH's Biomedical Information Science and Technology Initiative), which enables the use of terms from multiple ontologies in data analysis workflows. As of this writing, the BioPortal library contains more than 204 publicly accessible biomedical ontologies and their metadata, ranging in domains from genomics to clinical medicine to biomedical software resources, and comprising nearly 1.5 million terms. BioPortal's ontology library includes ontologies that individual investigators submit directly to BioPortal, as well as terminologies drawn from both the Unified Medical Language System (UMLS) and the WHO Family of International Classifications (WHO-FIC). The BioPortal library also includes the ontologies that are candidates to the OBO Foundry, which is an initiative to create a set of well-documented and well-defined reference ontologies that are designed to work with one another to form a single, non-redundant system [22]. In addition to ontologies, BioPortal contains more than 1 million mappings between similar terms in different ontologies and 16.4 billion automatically created annotations on records from 22 public databases of biomedical data. Resources such as BioPortal provide a unified view of all their ontologies, which may be encoded in different formats, each of which has its own purpose, scope, and use. The unified view of the content enables uniform programmatic access to all ontologies and terminologies in the library for use in data analysis workflows.

The availability of automated annotation tools, such as the Annotator Web service from the NCBO and MetaMap from the National Library of Medicine, allows the creation of ontology-based annotations from free-text descriptions of gene and protein functions (such as GeneRIFs); as a result, the lack of preexisting, manually assigned annotations is no longer a bottleneck. For example, the Annotator Web service enables users to provide textual metadata of an item of interest (such as a GeneRIF describing a gene's function or an abstract corresponding to a PubMed record) to computationally generate ontology-based annotations for the item of interest. The user specifies which ontologies to use, and whether also to use mappings to other ontologies or transitive closure of hierarchy relations to extend the annotations. The service returns the ontology terms that it recognizes from the text (the annotations) and their position in the submitted record.

Finally, large annotation repositories such as the Resource Index, which is a large repository of automatically created annotations by the NCBO, and the NIF database index, which is another large repository of computationally generated annotations on public data sources relevant to neuroscience, provide a source of co-occurrence statistics among ontology terms in annotations. The availability of such annotation corpora makes the dependence between annotations with different ontology terms explicit.

Given these publicly available sources for ontologies, tools for creating ontology-based annotations and large repositories (or corpora) of annotations, it is now feasible to use disease ontologies in enrichment analysis in a manner similar to the Gene Ontology.

As we have discussed, one key aspect of calculating statistical enrichment is the choice of a reference-term frequency. It is not clear what the appropriate reference-term frequency should be when calculating enrichment of ontology terms for which a background set is not defined. For example, in the case of Gene Ontology annotations, the background set is usually the GO annotations of the set of genes on which the data were collected or the GO annotations of a set of genes known in the genome for the species on which the data were collected. A background set is not available, however, when calculating enrichment using disease ontologies that have not been used for manual annotation in the way the Gene Ontology has. For this situation, there are two main options: 1) construct a reference set programmatically (discussed in Section 2.3); or 2) use the frequency of particular terms in a large corpus, such as the Resource Index, MEDLINE abstracts, or Web pages indexed by Internet search engines such as Google.

Multiple hypothesis testing, which arises because each term is tested for enrichment individually, is also unavoidable when performing enrichment analysis with disease ontologies. However, methods of correcting the resultant increase in false discovery rates that work in the case of GO-based enrichment analyses are directly applicable when using disease ontologies for such analyses.

Several researchers have noted that enrichment analysis is more meaningful when performed for combinations of terms [24]. For example, it is biologically more meaningful to know that a certain


Figure 2. Workflow schematic of enrichment analysis. If the input set has only textual annotations, we first run the Annotator service to create
ontology-term annotations. The annotation counts in the input set are first aggregated along the ontology hierarchy and then compared with a
background set for a statistically significant difference in the frequency of each ontology term. If a significant difference in the term frequency is
found, that term is called enriched in the input set of entities. The results of the analysis are returned either as a tag-cloud, a graph, or as an XML
output that users can process as required.
doi:10.1371/journal.pcbi.1002827.g002
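
The aggregation step named in the Figure 2 caption (counting annotations "along the ontology hierarchy") amounts to a transitive closure over child-parent links. The sketch below is illustrative only: the tiny ontology, the element identifiers, and the function names are hypothetical, and a real workflow would obtain the parent links from an ontology repository rather than from a hand-written dictionary.

```python
from collections import Counter


def ancestors(term, parents):
    """Return every term reachable from `term` by following child-parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents):
    """Add all implied (indirect) annotations to each element's direct ones."""
    closed = {}
    for element, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        closed[element] = full
    return closed


def aggregate_counts(closed):
    """Count, for each term, how many elements are annotated with it."""
    counts = Counter()
    for terms in closed.values():
        counts.update(terms)  # sets, so each element counts once per term
    return counts


# Hypothetical IS_A hierarchy: child -> list of parents.
parents = {
    "atherosclerosis": ["vascular disease"],
    "vascular disease": ["disease"],
    "cancer": ["disease"],
}
annotations = {
    "geneA": {"atherosclerosis"},
    "geneB": {"cancer"},
    "geneC": {"vascular disease"},
}
counts = aggregate_counts(propagate(annotations, parents))
```

Storing each element's terms as a set means an element reached through several paths is still counted once per term, which keeps the aggregate frequencies comparable with a background computed the same way.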

molecular function in a certain biological process at a certain cellular location is enriched than it is to know about each of the terms separately. Similarly, when using ontologies other than GO, it is more meaningful to look for enrichment of combinations such as certain adverse reactions in a given disease when treated by a particular drug. However, exhaustively examining all possible 3-term combinations of ontology terms is computationally expensive, and most of the random term combinations make no biological sense. The identification of combinations that are meaningful and appear at a high enough frequency to justify their use in enrichment computations is an exciting and fruitful area of research.

2.2 DIY Disease Ontology-based Enrichment Analysis Workflow

We have seen that progress in the current state of the art in storing, accessing and using ontologies for annotation provides components that allow enrichment analysis when preexisting annotations do not exist, as in the case of disease ontologies. We now discuss a workflow to conduct enrichment analysis in domains beyond just expression analysis. A schematic of the workflow is shown in Figure 2.

A user can start with two principal types of inputs. In the first case, the user already has the elements of the dataset of interest annotated with specific ontology terms, i.e. the user already has a file associating element identifiers (gene names, patient ID numbers, etc.) with ontology term identifiers. In the second case, the user has associations of identifiers to textual descriptions instead of ontology terms. For example, a user might have a file associating gene IDs with their GeneRIF descriptions from NCBI. In this situation a user can invoke the NCBO Annotator service [25,26] to process these textual descriptions and assign ontology terms to the element identifiers (Step 0). Given the user's selection of an ontology, the annotator processes the input text (say GeneRIFs) to identify concepts that match ontology terms (based on preferred names or synonyms). The implementation details and accuracy of the Annotator service are described in [25]. The result is a list of computationally annotated element identifiers based on the input textual description, and this output is equivalent to the first input type. Using this step, we are able to create ontology-based annotations from free-text descriptions. Thus, we are no longer reliant on the availability of exhaustive manually curated annotations, such as those required with GO-based analyses.

Step 1: After this optional preprocessing step, for each ontology term in the input dataset one can programmatically traverse the ontology structure and retrieve the complete listing of paths from the concept to the root(s) of the ontology using Web services [27]. A traversal through each of these paths essentially recapitulates the ontology hierarchy. Each term along the path is associated as an annotation to the element identifier in the input dataset with which the starting term was associated. This procedure of tracing terms back to the graph's root performs the transitive closure of the annotations over the ontology hierarchy. In essence, for each child-parent (IS_A) relationship, we generate the complete set of implied (indirect) annotations based on child-parent relationships, by traversing and aggregating along the ontology hierarchy.

Step 2: Once the ontology terms and their aggregate frequencies in the input dataset are calculated, we arrive at the step of determining the meaning or significance of the results. Enrichment analysis with GO has benefited from the existence of a natural and easily defensible choice for a background set: all of the given organism's genes, all genes measured on the platform, etc. For most of the disease ontologies we consider, no such comprehensive distribution exists [28]; and as discussed before, for calculating statistical enrichment, we need the background term frequency to determine if the aggregate annotation counts after Step 1 are surprising given the background. By leveraging existing projects and resources, there are several methods by which a user can address this problem. We discuss a couple of heuristic approaches to address this problem, and in Section 2.3 discuss a systematic process to create custom reference sets.

In the first approach, one can access a database of automatically created annotations over the entirety of MEDLINE abstracts and use this annotation source as an approximate proxy for the true background frequency of a specific term. To generate the background frequency, for a given term X, we retrieve the text strings corresponding to its preferred name and all of its synonyms, and then add up the MEDLINE occurrence counts for each of these strings. We


Figure 3. Tag cloud output: An example for the annotations of grants from FY1981 using SNOMEDCT. Blue denotes low-frequency
terms and red denotes highly frequent terms. Many concepts, such as neoplasm of digestive tract, occur at high frequencies in most years, possibly
denoting the constant focus on cancer research. An appropriate background term frequency distribution is necessary to determine significance of the
high frequency.
doi:10.1371/journal.pcbi.1002827.g003
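
As the Figure 3 caption notes, a raw term frequency means little without a background frequency to compare against. The sketch below shows one way such a background could be estimated from corpus occurrence counts: sum the counts for a term's preferred name and synonyms (m) and divide by the total number of corpus entries (M). All names and numbers are hypothetical, and summing string counts is an approximation, since an entry that mentions two synonyms is counted twice.

```python
def background_frequency(strings, occurrence_counts, total_entries):
    """Return (m, m / M) for a term given its preferred name and synonyms.

    strings           -- preferred name plus synonyms of the term
    occurrence_counts -- mapping from lower-cased string to corpus count
    total_entries     -- total number of entries in the corpus (M)
    """
    m = sum(occurrence_counts.get(s.lower(), 0) for s in strings)
    return m, m / total_entries


def fold_change(observed, n_input, background):
    """Ratio of the term's frequency in the input set to its background frequency."""
    return (observed / n_input) / background


# Made-up counts for illustration only.
corpus_counts = {"myocardial infarction": 120, "heart attack": 30}
m, freq = background_frequency(
    ["Myocardial infarction", "Heart attack"], corpus_counts, total_entries=10_000
)
ratio = fold_change(observed=6, n_input=100, background=freq)
```

A ratio above 1 suggests over-representation and below 1 under-representation; whether the difference is statistically significant still has to be tested.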

return this number (m) as well as the total number of entries in the MEDLINE annotation database (M). The fraction m/M then represents the background frequency of the term X in the annotated corpus. Using this frequency we can compute significant comparative over- or under-representation in the input dataset.

The second approach uses NCBO's Resource Index, which is a repository of automatically created annotations. Access to the Resource Index allows a user to make the same sort of calculations as with the MEDLINE term frequencies, but also offers information on the co-occurrence of ontological terms in textual descriptions and annotations of datasets, enabling the user to quantify the degree to which terms are independent or correlated in the annotation space.

Step 3: There are several possible output mechanisms for such an analysis workflow. The simplest is a tag cloud, which intuitively summarizes the results of the analysis (Figure 3). The sizes and colors of terms in the cloud indicate the relative frequency of the terms, offering a high-level overview. However, a tag cloud's representative ability is limited because there is no easy way to show significance relative to some expectation, or to show the elements in the input associated with some term.

The second output format is XML, which is amenable to postprocessing by the user, as needed. The result for each term contains its respective frequency information in the input data along with the counts on which the frequency is based. The results on each term can also contain the list of identifiers that mapped to that term. Each node includes information on the level in the ontology at which the term is found. Using such an output, it is straightforward to create graphical visualizations similar to those that most GO-based enrichment analysis tools provide [29]; see the example in Figure 4.

2.2.1. Ensuring quality. For any such custom analysis workflow it is essential to set up tests that ensure technical accuracy before interpreting the results for scientific significance. To evaluate technical accuracy, we suggest that users create benchmark data sets similar to those of Toronen and colleagues [30], who created gene lists with a selected enrichment level and a selected number of independent, over-represented classes to compare different GO-based enrichment methods. In the case of analyses using disease ontologies, the benchmark data sets would comprise gene lists enriched for specific disease terms, clinical-trial lists enriched for a specific drug being studied, lists of research publications that are enriched for known NCIt terms, and so on. A sample benchmark list of aging-related genes and their annotations is provided in Section 5. Exercises. This dataset was compiled by computationally creating disease term annotations on 261 human genes designated to be related to aging according to the GenAge database [31]. The annotations of this gene list are enriched for disorders, such as atherosclerosis, that are known to be associated with aging. Such benchmark data sets can be used to ensure accuracy of the enrichment statistics as well as to evaluate the appropriateness of different sources of reference-term frequencies for computing enrichment.

The inconsistency of abstraction levels in ontologies is an often discussed stumbling block for enrichment analysis [17]. Two terms at equal depths may not represent concepts of similar granularity, creating a bias in the reported term enrichment. By comprehensively analyzing the frequencies of terms in MEDLINE and the NCBO Resource Index, a user can perform a thorough analysis of dependencies among ontology-term annotations to make existing biases explicit as well as to define custom abstraction levels using methods developed by Alterovitz et al. [32]. The development of methods to reliably identify the appropriate level of abstraction at which to report the results of enrichment analysis is another exciting and fruitful area of research.

2.3 Creating Reference Sets for Custom Enrichment Analysis

As discussed before, a key prerequisite for performing enrichment analysis is the availability of an appropriate reference dataset to compare against when looking for over- or under-represented terms. In this section, we describe: (i) a general method that uses hand-curated GO annotations as a starting point for creating reference datasets for enrichment analysis using other ontologies; and (ii) a gene-disease reference annotation dataset for performing disease-based enrichment.

GO annotations are unique because highly trained curators associate GO terms to gene products manually, based on literature review. We describe how, with the availability of tools for automatic ontology-based annotation with terms from disease ontologies, it is possible to create reference annotation datasets for enrichment analysis using ontologies other than the GO, for example, the Human Disease Ontology.

Unlike GO terms, which actually appear in the text with low frequency, or gene identifiers, which are ambiguous, disease terms are amenable to automated term extraction techniques. Therefore, using tools that recognize mentions of ontology terms in user-submitted text, we can automatically recognize occurrences of terms from the Human Disease Ontology (DO) in a given corpus of text [28]; the key is to identify the text source that can be relied upon to recognize disease terms to associate with genes.

Unlike other natural-language techniques for finding gene-disease associations, our proposed method uses manually curated GO annotations as the starting basis to identify the text source from which to recognize disease terms. Basically, we use manually curated GO annotations to identify those publications that were the

Figure 4. The figure shows a visualization generated using the GO TermFinder tool. The GO graph layout shows the significantly enriched
GO terms in the annotations of the analyzed gene set. The color of the nodes is an indication of their Bonferroni-corrected p-value (orange <= 1e-10;
yellow 1e-10 to 1e-8; green 1e-8 to 1e-6; cyan 1e-6 to 1e-4; blue 1e-4 to 1e-2; tan > 0.01).
doi:10.1371/journal.pcbi.1002827.g004
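
The Bonferroni-corrected p-values mentioned in the Figure 4 caption, and the Benjamini-Yekutieli correction that the text suggests for dependent tests, can both be written as simple adjustments of a list of raw p-values. The sketch below is a plain illustration using the standard textbook definitions; it is not taken from GO TermFinder, and the function names and example p-values are our own.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capping at 1."""
    k = len(pvalues)
    return [min(1.0, p * k) for p in pvalues]


def benjamini_yekutieli(pvalues):
    """Benjamini-Yekutieli adjusted p-values (valid under arbitrary dependence).

    The i-th smallest p-value is scaled by k * c(k) / i, where c(k) is the
    k-th harmonic number, and monotonicity is enforced from the largest down.
    """
    k = len(pvalues)
    c = sum(1.0 / i for i in range(1, k + 1))
    order = sorted(range(k), key=lambda i: pvalues[i])
    adjusted = [0.0] * k
    running_min = 1.0
    for rank in range(k, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * k * c / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up raw p-values for four terms.
raw = [0.01, 0.04, 0.03, 0.5]
bonf = bonferroni(raw)
by = benjamini_yekutieli(raw)
```

A term is then reported as enriched when its adjusted p-value falls below the chosen cutoff. Bonferroni controls the chance of any false positive, whereas Benjamini-Yekutieli controls the false discovery rate, so the two answer slightly different questions.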

basis for associating a GO term with a particular gene.

Figure 5 summarizes our method. First, we start with GO annotations, which provide the PubMed identifiers of the papers on the basis of which a curator associated gene products with specific GO terms. The annotations essentially give us a link between gene identifiers and PubMed articles, and only those PubMed articles that were deemed to be relevant for GO annotation curation. Next, we recognize terms from an ontology of interest (e.g. the Human Disease Ontology) in the titles and abstracts of those articles. Finally, we associate the recognized ontology terms with the gene identifiers with which the analyzed article was associated.

In order to demonstrate the feasibility of the proposed workflow and to provide a sample reference annotation set for performing disease ontology based analyses in the exercises of this chapter, we download GO annotation files for human gene products from geneontology.org. These files are tab-delimited text files that contain, among other things, a list of gene identifiers, associated GO terms, and the publication source (a PubMed identifier) on the basis of which that GO annotation was created. We remove all electronically inferred annotations (IEA) from the annotation file. We also remove all qualified annotations, such as negated (NOT) ones. As a result, we obtain a list of publications and the genes they describe, i.e. gene-publication tuples. In the next step, using the PubMed identifiers obtained from the GO annotation files, we fetch each article's title and abstract using the National Library of Medicine eUtils. We save each article's title and abstract as a file and annotate it via the Annotator service using the disease ontology as the target. Once we have the publication-disease tuples, we cross-reference them with the gene-publication tuples, resulting in gene-disease associations for 7316 human genes.

Out of 25,000 currently estimated human genes, we are able to annotate 7316 genes (29.2%) with at least one disease term from the Human Disease Ontology. Previous methods that use advanced text mining have been able to annotate 4408 genes (17.7%) [33]. A study based on OMIM associated 1777 genes (7.1%) with disease terms to create a human "diseasome" [34], and an automated approach using MetaMap as the concept recognizer and GeneRIFs as well as descriptions from OMIM as the input textual descriptions annotated roughly 14.9% of the human genome with disease terms [28]. Because the number of human genes known at the time of each study varies, we make the comparisons loosely.

In order to validate our background annotation set, we evaluated our gene-disease association dataset in several ways described in [35]. First, we examined a set of genes related specifically to aging from the GenAge database [31] for their coherence in terms of the assigned disease annotations. Next, we performed disease-based enrichment analysis on the same aging-related gene set using our newly created reference annotation set. The results of the enrichment analysis are shown in Figure 6 and the analysis itself is offered as an exercise for the reader in Section 6. Exercises. What differentiates our suggested method from other approaches [28,36] for finding gene-disease associations is the use of GO annotations as a basis for identifying reliable gene-publication records that serve as the foundation for generating automated annotations. Furthermore, researchers can reuse our method to examine function along other dimensions. For example, researchers can use the Pathway ontology to generate gene-pathway associations.

2.3.1 Ensuring quality. When using an automated annotation process to create a reference annotation set, there are some caveats to consider. First, not all ontologies are equally suited for creating automated annotations. Second, automated annotation depends highly on the quality of the input text corpus. Third, some errors in annotation are inevitable in an automated process. We discuss these issues below.

Using other ontologies. Although we specifically focus on creating annotations with terms from the Human Disease Ontology, the method we have devised (Figure 5) can create annotations with terms from other ontologies. In the presented workflow, to obtain a background dataset for enrichment for some ontology other than DO, researchers would simply configure a parameter for the Annotator Web service to use their ontology of choice from BioPortal. In fact, other researchers have used a similar annotation workflow to

Figure 5. Workflow for generating background annotation sets for enrichment analysis: We obtain a set of PubMed articles from
manually curated GO annotations, which we process using the NCBO Annotator service.
doi:10.1371/journal.pcbi.1002827.g005
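
The workflow in Figure 5 can be sketched as two small steps: filter a GO annotation file down to gene-publication pairs, then join those pairs with publication-disease pairs. The sketch below is an illustration under stated assumptions, not the authors' code. It assumes the tab-delimited GAF column layout (symbol in column 3, qualifier in column 4, reference in column 6, evidence code in column 7) and that unqualified annotations have an empty qualifier column, as in older GAF files; for files that always fill that column, one would filter on NOT instead. The Annotator call itself is not shown: `pub_diseases` stands in for its output, and all identifiers in the example are made up.

```python
from collections import defaultdict


def gene_publication_pairs(gaf_lines):
    """Extract (gene symbol, PMID) pairs from GAF lines, dropping IEA and
    qualified annotations (assumes an empty qualifier means 'unqualified')."""
    pairs = set()
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue
        symbol, qualifier, reference, evidence = cols[2], cols[3], cols[5], cols[6]
        if evidence == "IEA" or qualifier:
            continue
        for ref in reference.split("|"):
            if ref.startswith("PMID:"):
                pairs.add((symbol, ref[len("PMID:"):]))
    return pairs


def gene_disease_pairs(gene_pubs, pub_diseases):
    """Cross-reference gene-publication and publication-disease tuples."""
    by_pub = defaultdict(set)
    for pmid, disease in pub_diseases:
        by_pub[pmid].add(disease)
    return {(gene, d) for gene, pmid in gene_pubs for d in by_pub.get(pmid, ())}


# Hypothetical GAF rows (identifiers are placeholders, not real records).
rows = [
    ["DB", "ID1", "GENE1", "", "GO:0000001", "PMID:111", "IDA"],
    ["DB", "ID1", "GENE1", "", "GO:0000002", "PMID:222", "IEA"],
    ["DB", "ID2", "GENE2", "NOT", "GO:0000001", "PMID:333", "IMP"],
    ["DB", "ID3", "GENE3", "", "GO:0000003", "PMID:111|GO_REF:0000001", "TAS"],
]
gaf_lines = ["!gaf-version: 2.0\n"] + ["\t".join(r) + "\n" for r in rows]

gene_pubs = gene_publication_pairs(gaf_lines)
# Stand-in for Annotator output on each article's title and abstract.
pub_diseases = [("111", "atherosclerosis"), ("333", "cancer")]
gene_diseases = gene_disease_pairs(gene_pubs, pub_diseases)
```

Keeping the two steps separate makes it easy to swap in a different ontology: only the publication-term pairs change, while the gene-publication pairs derived from the curated GO annotations stay the same.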


Figure 6. Disease terms significantly enriched in annotations of aging-related genes: The tag cloud shows those disease terms in
the annotations of the 261 aging-related genes that are statistically enriched given our gene-disease background annotation
dataset. Terms that are significantly enriched appear larger. We used a binomial test to detect enriched disease terms in the aging-related gene set.
Note that mis-annotated terms (such as "Recruitment") and non-informative terms (such as "Disease") are not deemed enriched by the statistical analysis.
doi:10.1371/journal.pcbi.1002827.g006
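
The binomial test mentioned in the caption above can be illustrated with a minimal sketch. It assumes that the null probability of a gene carrying a term is the term's frequency in the background annotation set, which the text does not spell out. The function names and all counts below are hypothetical and are not taken from the chapter's datasets.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def term_enrichment_p(study_hits, study_size, background_hits, background_size):
    """One-sided binomial p-value for a term being over-represented.

    Assumption: the null probability is the fraction of background genes
    annotated with the term.
    """
    p_null = background_hits / background_size
    return binomial_upper_tail(study_hits, study_size, p_null)


# Hypothetical counts: 30 of 261 study genes carry a term that annotates
# 500 of 20000 background genes (expected about 6.5 under the null).
p_value = term_enrichment_p(30, 261, 500, 20000)
print(f"p = {p_value:.3g}")
```

In practice each term's p-value would still need a correction for multiple hypothesis testing before it is called enriched.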

recognize morphological features in textual descriptions of fish species [37].

Not all ontologies are viable candidates for automatic annotation because not all ontology terms appear in the text of a MEDLINE abstract. For example, using term-frequency counts in MEDLINE abstracts [38], we calculated that disease terms are mentioned 46% more often than GO terms in MEDLINE abstracts. As another example, only 10% of the manually assigned GO terms can be detected directly in the paper abstract supporting that particular GO annotation. Because disease terms are mentioned significantly more often than GO terms, the automated annotation process works well for annotating genes with disease ontology terms.

Missing annotations. Out of the 261 aging-related genes in our evaluation subset, the Annotator left out 24 genes (9%), for which we have no disease terms in our gene-disease association dataset. These missed annotations provide an opportunity for refining the annotation workflows to use sources of text beyond just the papers referenced in GO annotations.

Annotation errors. Some errors in annotation are inevitable in an automated process. For example, in the reference annotation set we created, TP53 was also annotated, wrongly, to "Recruitment". Papers that were the basis of creating GO annotations for TP53 certainly mention the term "Recruitment"; however, that term is not a disease. The term "Recruitment" is in the Human Disease Ontology and is declared to be a synonym of "auditory recruitment", which has no asserted superclass or place in the hierarchy, indicating a possible error in the ontology. However, because such errors will affect annotation of both the set of interest and the reference set equally, the errors will most likely cancel each other out when computing statistical enrichment (Figure 6), though that is not guaranteed. Advanced text mining can potentially provide checks against such errors by analyzing the context in which a potential disease term is mentioned.

3. Novel Use Cases Enabled

We believe that extending the current enrichment-analysis methods to ontologies beyond GO, and extending the method beyond analyzing gene and protein annotations to any set of entities for term enrichment, will enable several novel use cases. For example, a user might analyze a set of papers published in the last three years in a particular domain (say, signal transduction) and identify which pathway was mentioned most frequently. Similarly, a user could analyze descriptions of genes controlled by a particular ultra-conserved region of DNA to generate hypotheses about the region's function in



specific disease processes. We discuss the potential of some of the novel use cases enabled by disease ontology based enrichment analysis.

Analysis of protein annotations. To demonstrate the feasibility of performing enrichment analysis and recovering known GO annotations, as well as to demonstrate enrichment analysis with multiple ontologies, we analyzed a list of 261 known aging-related genes from the GenAge database [31]. We started by collecting textual descriptions for UniProt protein entries corresponding to each human gene in the GenAge database. The textual descriptions included the protein name, gene name, general descriptions of the function and catalytic activity, as well as keywords and GO terms. We processed this text as described in the workflow in Figure 6 and created annotations from Medical Subject Heading (MSH), Online Mendelian Inheritance in Man (OMIM), UMLS Metathesaurus (MTH) and Gene Ontology (GO).

We created a background set of annotations on 19671 proteins by applying the same protocol to manually annotated and reviewed proteins from SwissProt (Jan 2010 version). We calculated enrichment and depletion of specific terms, corrected for multiple hypotheses and obtained a list of significant terms for all four ontologies. Not surprisingly, "aging" is an enriched term. There were several other terms enriched, such as "electron transport" (2.79e-10), "protein kinase activity" (2.8e-10) and "nucleotide excision repair" (8.78e-07), which appeared in MSH, MTH, and GO. The enriched terms also included aging-associated diseases such as "Alzheimer's disease" (0.01), "Werner syndrome" (5.3e-05), "Diabetes Mellitus" (1.5e-04) and "neurodegeneration" (2.5e-03) from OMIM.

This case study demonstrates that enrichment analysis with multiple ontologies is feasible and that it enables a comprehensive characterization of the biological signal present in gene/protein lists [39]. For example, by annotating known protein mutations with disease terms from the ontologies in BioPortal, Mort et al. recently identified a class of diseases (blood coagulation disorders) that were associated with a 14-fold depletion in substitutions at O-linked glycosylation sites [20].

Analysis of funding trends. To demonstrate the feasibility of such analyses in a novel domain, we processed the funding allocations of the NIH in fiscal years 1980-1989. We aimed to identify trends in institutional funding priorities over time, as represented by changes in the relative frequencies of ontology concepts from year to year. Using a database containing the complete set of grants in this interval (with their titles, amounts, recipient institutions, etc.), we selected grants worth over $1,250,000 (in constant 2008 dollars). We annotated the titles of these grants with SNOMED-CT terms and used the annotation sets to generate tag clouds for each year, such as the one shown in Figure 3 for year 1981, to create a visual summary of funding trends on a per-year basis. Further analysis cross-linking annotations on grants with annotations on publications from specific institutions can enable comparative analysis of the research efficacy at different institutions.

Hypothesis generation for Clinical Research. Finally, enrichment analysis can also be used as an exploratory tool to generate hypotheses for clinical research by analyzing annotations on medical records in conjunction with annotation of molecular datasets. For example, in the case of kidney transplants, extended-criteria donor (ECD) organs have a 40% rate of delayed graft function and a higher incidence of rejection compared to standard-criteria donor (SCD) kidneys. Identifying causes of this difference is crucial to identifying patients in whom an ECD transplant has a high chance of working.

At several research sites, the datasets collected to address this question comprise immunological metrics beyond the standard clinical risk factors, including multi-parameter flow-cytometric analysis of the peripheral immune-cell repertoire, genomic analysis, and donor-specific functional assessments. These patient data sets can be annotated using automated methods [8,26] to enable enrichment analysis for risk-factor determination.

For example, simple enrichment analysis might allow identification of classes of drugs, diseases, and test results that are commonly found only in readmitted transplant patients. Enrichment analysis to identify common pairs of terms of different semantic types can identify combinations of drug classes and co-morbidities, or test risk-factors and co-morbidities, that are common in this population.

4. Summary

Because enrichment analysis with GO is widely accepted and scientifically valuable, we argue that the logical next step is to extend this methodology to other ontologies, specifically disease ontologies. Given the recent advances in ontology repositories and methods of automated annotation, we argue that enrichment analysis based on textual descriptions is possible.

We have systematically discussed how to accomplish enrichment analysis using ontologies other than the Gene Ontology, as well as how to address some of the limitations of existing analysis methods. For example, the roughly 20% of genes that lack annotations can now be associated, via their GeneRIFs, with terms from disease ontologies. We have outlined possible directions of research for overcoming other limitations, such as inconsistent abstraction levels in ontologies, performing the analysis using combinations of ontology terms, and accounting for annotation bias.

In order to perform enrichment analysis using ontologies other than the GO, a key prerequisite is the availability of a background set of annotations to enable the enrichment calculation. We have described a general method, which uses hand-curated GO annotations as a starting point, for creating background datasets for enrichment analysis using other ontologies, such as the Human Disease Ontology, for which hand-curated annotations are not available.

To demonstrate the feasibility and utility of our proposals, we have created a background set of annotations to enable enrichment analysis with the Human Disease Ontology and validated that background set by using the created annotations to examine the coherence of known aging-related genes and by performing enrichment analysis on an aging-related gene set from the GenAge database [31]. We make the set of aging-related genes and the reference annotation set available for reader exercises in enrichment analysis.

We argue that enrichment analysis using computationally created ontology-based annotations from textual descriptions is possible, thus introducing enrichment analysis as a research methodology in new domains, such as hypothesis generation for clinical research, without requiring manually created annotations.

5. Exercises

(1) For the 260 aging-related genes in Dataset S1, perform enrichment analysis using the Human Disease ontology, using Dataset S2 as the reference annotation set. Some considerations while working through the problem:



• The genes are listed with their UniprotIDs.
• Using the notation in Section 1.1, the values of N and M are the total number of unique genes in the aging set and total set, respectively, and not the number of unique terms. The values of n and m are the unique genes that are annotated with a given term in the corresponding set.
• When performing the hypergeometric test, if the test calculates the p value based on finding a value of n greater than or less than what was observed (instead of equal to what was observed), remember to add or subtract 1 from the number of genes annotated with a given term when calculating. If you are using a function to calculate, refer to the documentation to understand the input required.
• Consider from which tail of the hypergeometric distribution you wish to calculate the p value.

(2) For the 260 aging-related genes, perform enrichment analysis using SNOMED-CT (Systematized Nomenclature of Medicine-Clinical Terms). Use the GeneRIF (Gene Reference into Function) database as the source text to annotate with disease terms from SNOMED-CT. Choose an appropriate reference annotation set and justify the choice. Some considerations while working through the problem:

• An index of GeneRIFs, maintained by the National Center for Biotechnology Information (NCBI) and the National Institutes of Health (NIH), can be downloaded from here: ftp://ftp.ncbi.nih.gov/gene/GeneRIF/
• Mapping from UniprotIDs to GeneIDs, which are used in the GeneRIF database, can be done here: http://www.uniprot.org/help/mapping. Note that you will get 261 GeneIDs for the 260 UniprotIDs.
• Annotation using the National Center for Biomedical Ontology's BioPortal Annotator Service requires obtaining an API key. This can be done after registration and going to "Account", where your API key will be displayed: http://bioportal.bioontology.org/
• Information on the programmatic use of the BioPortal Annotator as a client can be found here: http://www.bioontology.org/wiki/index.php/Annotator_Web_service. Example code from numerous languages, including Java, R, Python, Ruby, Excel, HTML, and Perl, can be found here: http://www.bioontology.org/wiki/index.php/Annotator_Client_Examples. All NCBO REST Web services require the parameter apikey = YourApiKey. It is strongly encouraged that all users of the NCBO Annotator Web service use only the virtual ontology identifier. To do so, set the isVirtualOntologyId parameter to true. This will ensure that you access the version of the ontology that is actually in the database. Failure to do this will result in your code breaking every time the database is updated.
• Output from the annotation service can be conveniently parsed in XML. To see an example of what this might look like, visit http://bioportal.bioontology.org/annotator. Insert the sample text or use text of your choice, make selection(s) under "Select Ontologies" and "Select UMLS Semantic Types", and click "Get Annotations". At the bottom, by "Format Results As", you can select XML to see the XML tree structure of the Annotator output.
• The suggested ontology for this exercise is SNOMED-CT (ontology ID: 1353) and semantic types Anatomical Structure (T017), Disease or Syndrome (T047), Neoplastic Process (T191), and NCBO BioPortal concept (T999).
• Some processing of the GeneRIF text may be necessary to prevent errors in annotation. It is suggested to remove GeneRIFs with new line characters (\n) and replace single or double quotes with white space.
• Many GeneIDs have multiple GeneRIF entries. Annotation is more efficient if all of the GeneRIF entries for a given gene are concatenated and passed to the annotator, instead of annotating individual GeneRIF entries for the same gene.
• Due to the large number of GeneRIFs, the BioPortal Annotator may time out while the user is looping through genes to annotate. It is suggested that the annotation be done incrementally and the results joined afterward, or that the annotations be saved intermittently, to avoid time-consuming re-annotation.
• The given set of aging genes will have considerably more annotation terms per gene than the set of all genes in the GeneRIF database. This bias should be a consideration when deciding on an appropriate M. There are numerous approaches to address this, and a simple method may be to limit the reference set of genes M to only those with at least a given number of annotated terms. You may also want to limit the results to only those terms that appear at least a given number of times in the aging gene annotations.

Answers to the Exercises can be found in Text S1.

Further Reading

• Tirrell R, Evani U, Berman AE, Mooney SD, Musen MA, et al. (2010) An ontology-neutral framework for enrichment analysis. AMIA Annual Symposium proceedings/AMIA Symposium 2010: 797-801.
• Shah NH, Jonquet C, Chiang AP, Butte AJ, Chen R, et al. (2009) Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics 10 Suppl 2: S1.
• Alterovitz G, Xiang M, Mohan M, Ramoni MF (2007) GO PaD: the Gene Ontology Partition Database. Nucleic Acids Res 35(Database issue): D322-D327.
• Myhre S, Tveit H, Mollestad T, Laegreid A (2006) Additional gene ontology structure for improved biological reasoning. Bioinformatics 22: 2020-2027.
• Toronen P, Pehkonen P, Holm L (2009) Generation of Gene Ontology benchmark datasets with various types of positive signal. BMC Bioinformatics 10: 319.

Supporting Information

Dataset S1 Data file for Exercise 1 (TXT)
Dataset S2 Data file for Exercise 1 (TXT)
Dataset S3 Data file for Exercise 1 (OBO)
Dataset S4 Additional info on genes mentioned in S1 and S2. Can be used in lieu of GeneRIFs in Exercise 2. (TXT)
Text S1 Answers to Exercises (DOCX)
Table S1 Exercise 1 analysis results (CSV)
Table S2 Exercise 2 analysis results (CSV)
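
The hypergeometric considerations listed for Exercise 1 (which tail to use, and when to shift the observed count by 1) can be sketched as follows. This is an illustrative sketch rather than the solution in Text S1. It reuses the N, M, n, m notation from the exercise and assumes the aging set is a subset of the reference set. The function names and the small counts are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, N, m, M):
    """P(exactly k of the N study genes carry the term), given that
    m of the M reference genes carry it."""
    return comb(m, k) * comb(M - m, N - k) / comb(M, N)


def enrichment_p(n, N, m, M):
    """Upper tail, P(X >= n): the observed count n itself is included."""
    return sum(hypergeom_pmf(k, N, m, M) for k in range(n, min(N, m) + 1))


def depletion_p(n, N, m, M):
    """Lower tail, P(X <= n): the observed count n itself is included."""
    return sum(hypergeom_pmf(k, N, m, M) for k in range(0, n + 1))


# Toy numbers (hypothetical): 3 study genes, 2 of them annotated with the
# term; 10 reference genes, 4 of them annotated.
p_up = enrichment_p(2, 3, 4, 10)

# A library routine that returns P(X > x) must be called with x = n - 1 to
# include the observed count; equivalently P(X >= n) = 1 - P(X <= n - 1).
p_up_via_complement = 1.0 - depletion_p(2 - 1, 3, 4, 10)
print(p_up, p_up_via_complement)
```

The two printed values agree, which is a quick way to check that a library function's tail convention is being used as intended.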



Acknowledgments

We acknowledge use of the publicly available GO Term Finder server at http://go.princeton.edu/cgi-bin/GOTermFinder. We also thank Paea LePendu for assistance in creating solutions to the exercises.

References

1. Altman RB, Raychaudhuri S (2001) Whole-genome expression analysis: challenges beyond clustering. Curr Opin Struct Biol 11: 340-347.
2. Brazma A, Vilo J (2000) Gene expression data analysis. FEBS Lett 480: 17-24.
3. Quackenbush J (2002) Microarray data normalization and transformation. Nat Genet 32 Suppl: 496-501.
4. Tusher VG, Tibshirani R, Chu G (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 98: 5116-5121.
5. Huttenhower C, Hibbs M, Myers C, Troyanskaya OG (2006) A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics 22: 2890-2897.
6. Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, et al. (2003) GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol 4: R28.
7. Rhee SY, Wood V, Dolinski K, Draghici S (2008) Use and misuse of the gene ontology annotations. Nat Rev Genet 9: 509-515.
8. Shah NH, Jonquet C, Chiang AP, Butte AJ, Chen R, et al. (2009) Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics 10 Suppl 2: S1.
9. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102: 15545-15550.
10. Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA (2003) Global functional profiling of gene expression. Genomics 81: 98-104.
11. (2002) Gene ontology consortium website.
12. Alexa A, Rahnenfuhrer J, Lengauer T (2006) Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22: 1600-1607.
13. Grossmann S, Bauer S, Robinson PN, Vingron M (2007) Improved detection of overrepresentation of Gene-Ontology annotations with parent-child analysis. Bioinformatics 23: 3024-3031.
14. Schlicker A, Rahnenfuhrer J, Albrecht M, Lengauer T, Domingues FS (2007) GOTax: investigating biological processes and biochemical activities along the taxonomic tree. Genome Biol 8: R33.
15. Farcomeni A (2008) A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion. Stat Methods Med Res 17: 347-388.
16. Benjamini Y, Yekutieli D (2001) The control of the false discovery rate under dependency. Ann Stat 29: 1165-1188.
17. Khatri P, Draghici S (2005) Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 21: 3587-3595.
18. Shah NH, Fedoroff NV (2004) CLENCH: a program for calculating Cluster ENriCHment using the Gene Ontology. Bioinformatics 20: 1196-1197.
19. Ade AS, States DJ, Wright ZC (2007) Genes2Mesh. Ann Arbor, MI: National Center for Integrative Biomedical Informatics.
20. Mort M, Evani US, Krishnan VG, Kamati KK, Baenziger PH, et al. (2010) In silico functional profiling of human disease-associated and polymorphic amino acid substitutions. Human Mutation 31: 335-346.
21. Spackman KA (2004) SNOMED CT milestones: endorsements are added to already-impressive standards credentials. Healthc Inform 21: 54, 56.
22. Smith B, Ashburner M, Rosse C, Bard J, Bug W, et al. (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25: 1251-1255.
23. Noy NF, Shah NH, Whetzel PL, Dai B, Dorf M, et al. (2009) BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Res 37(Web Server issue): W170-W173.
24. Myhre S, Tveit H, Mollestad T, Laegreid A (2006) Additional gene ontology structure for improved biological reasoning. Bioinformatics 22: 2020-2027.
25. Shah NH, Bhatia N, Jonquet C, Rubin D, Chiang AP, et al. (2009) Comparison of concept recognizers for building the Open Biomedical Annotator. BMC Bioinformatics 10 Suppl 9: S14.
26. Jonquet C, Shah NH, Musen MA (2009) The Open Biomedical Annotator; 2009 March 15-17; San Francisco, CA. pp. 56-60.
27. (2010) NCBO REST services.
28. Osborne JD, Flatow J, Holko M, Lin SM, Kibbe WA, et al. (2009) Annotating the human genome with Disease Ontology. BMC Genomics 10 Suppl 1: S6.
29. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, et al. (2004) TermFinder: open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 20: 3710-3715.
30. Toronen P, Pehkonen P, Holm L (2009) Generation of Gene Ontology benchmark datasets with various types of positive signal. BMC Bioinformatics 10: 319.
31. de Magalhaes JP, Budovsky A, Lehmann G, Costa J, Li Y, et al. (2009) The Human Ageing Genomic Resources: online databases and tools for biogerontologists. Aging Cell 8: 65-72.
32. Alterovitz G, Xiang M, Hill DP, Lomax J, Liu J, et al. (2010) Ontology engineering. Nat Biotechnol 28: 128-130.
33. Altman RB, Bergman CM, Blake J, Blaschke C, Cohen A, et al. (2008) Text mining for biology - the way forward: opinions from leading scientists. Genome Biol 9 Suppl 2: S7.
34. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, et al. (2007) The human disease network. Proc Natl Acad Sci U S A 104: 8685-8690.
35. Lependu P, Musen MA, Shah NH (2011) Enabling enrichment analysis with the Human Disease Ontology. J Biomed Inform 44 Suppl 1: S31-S38.
36. Krallinger M, Leitner F, Valencia A (2010) Analysis of biological processes and diseases using text mining approaches. Methods Mol Biol 593: 341-382.
37. Sarkar N (2010) Using biomedical ontologies to enable morphology based phylogenetics: a feasibility study for fishes; 2010; Boston, MA.
38. Xu R, Musen MA, Shah NH (2010) A comprehensive analysis of five million UMLS metathesaurus terms using eighteen million MEDLINE citations. AMIA Annual Symposium proceedings/AMIA Symposium 2010: 907-911.
39. Tirrell R, Evani U, Berman AE, Mooney SD, Musen MA, et al. (2010) An ontology-neutral framework for enrichment analysis. AMIA Annual Symposium proceedings/AMIA Symposium 2010: 797-801.



Education

Chapter 10: Mining Genome-Wide Genetic Markers


Xiang Zhang1, Shunping Huang2, Zhaojun Zhang2, Wei Wang3*
1 Department of Electrical Engineering and Computer Science, Case Western Reserve University, Ohio, United States of America, 2 Department of Computer Science,
University of North Carolina at Chapel Hill, North Carolina, United States of America, 3 Department of Computer Science, University of California at Los Angeles, California,
United States of America

Abstract: Genome-wide association study (GWAS) aims to discover genetic factors underlying phenotypic traits. The large number of genetic factors poses both computational and statistical challenges. Various computational approaches have been developed for large-scale GWAS. In this chapter, we will discuss several widely used computational approaches in GWAS. The following topics will be covered: (1) An introduction to the background of GWAS. (2) The existing computational approaches that are widely used in GWAS. This will cover single-locus, epistasis detection, and machine learning methods that have been recently developed in the biology, statistics, and computer science communities. This part will be the main focus of this chapter. (3) The limitations of current approaches and future directions.

This article is part of the "Translational Bioinformatics" collection for PLOS Computational Biology.

Citation: Zhang X, Huang S, Zhang Z, Wang W (2012) Chapter 10: Mining Genome-Wide Genetic Markers. PLoS Comput Biol 8(12): e1002828. doi:10.1371/journal.pcbi.1002828
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: © 2012 Zhang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by the following grants: NSF IIS-1162369, NSF IIS-0812464, NIH GM076468 and NIH MH090338. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: weiwang@cs.ucla.edu

1. Introduction

With the advancement of genotyping technology, genome-wide high-density single nucleotide polymorphisms (SNPs) of human and other organisms are now available [1,2]. The goal of genome-wide association studies (GWAS) is to seek strong associations between phenotype and genetic variations in a population that represent (genomically proximal) causal genetic effects. As the most abundant source of genetic variation, millions of SNPs have been genotyped across the entire genome. Analyzing such a large number of markers poses great challenges to traditional computational and statistical methods. In this chapter, we introduce the basic concept of genome-wide association study, and discuss recently developed methods for GWAS.

Genome-wide association study is an interdisciplinary problem of biology, statistics and computer science [3,4,5,6]. In this section, we will first provide a brief introduction to the necessary biological background. We will then formalize the problem and discuss both traditional and recently developed methods for genome-wide analysis of associations.

A human genome contains over 3 billion DNA base pairs. There are four possible nucleotides at each base in the DNA: adenine (A), guanine (G), thymine (T), and cytosine (C). In some locations in the genome, a genetic variation may be found which involves two or more nucleotides across different individuals. These genetic variations are known as single-nucleotide polymorphisms (SNPs), i.e., variations of a single nucleotide in the DNA sequence. In most cases, there are two possible nucleotides for a variant. We denote the more frequent one as "0", and the less frequent one as "1". For bases on autosomal chromosomes, there are two parallel nucleotides, which leads to three possible combinations, "00", "01" and "11". These genotype combinations are known as "major homozygous site", "heterozygous site" and "minor homozygous site", respectively. These genetic variations contribute to the phenotypic differences among the individuals. (A phenotype is the composite of an organism's observable characteristics or traits.) Genome-wide association study (GWAS) aims to find strong associations between SNPs and phenotypes across a set of individuals.

More formally, let $X = \{X_1, X_2, \ldots, X_N\}$ be the set of $N$ SNPs for $M$ individuals in the study, and $Y$ be the phenotype of interest. The goal of GWAS is to find SNPs (markers) in $X$ that are highly associated with $Y$. There are several challenging issues that need to be addressed when developing an analytic method for GWAS [7,8].

Scalability. Most GWAS datasets consist of a large number of SNPs. Therefore the algorithms for GWAS need to be highly scalable. For example, for a typical human GWAS, the dataset may contain up to millions of SNPs and involve thousands of individuals. Inefficient methods may consume a large amount of computational resources and time to find highly associated SNPs.

Missing markers. Even with the current dense genotyping technique, many genetic variants are still not genotyped. Current methods usually assume genetic linkage to enhance the power. Imputation, which tries to impute the unknown markers by using existing SNP databases, is another popular approach to handle missing markers. Well-known related projects include the International HapMap project [9] and the 1000 Genomes Project [10].

Complex traits. One approach in GWAS is to test the association between the trait and each marker in a genome, which is successful in detecting a single-gene related disease. However, this approach may have problems in finding markers associated with complex traits. This is because complex traits are affected by multiple genes, and each gene may only have a weak association with the phenotype. Such markers with low marginal effects are hard to detect by the single-locus methods.
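
The allele coding described in the introduction (more frequent nucleotide as 0, less frequent as 1, giving the genotypes 00, 01 and 11) can be illustrated with a short sketch. It assumes each individual's genotype at a biallelic SNP is given as a two-letter string, and it summarizes the genotype by the number of minor-allele copies (0, 1 or 2). The function name and the sample data are hypothetical.

```python
from collections import Counter


def encode_snp(genotypes):
    """Map two-letter genotypes such as 'AG' to minor-allele counts.

    0 = major homozygous, 1 = heterozygous, 2 = minor homozygous.
    Assumption: the SNP is biallelic; on a frequency tie, the allele seen
    first is treated as the major one.
    """
    allele_counts = Counter(allele for g in genotypes for allele in g)
    if len(allele_counts) > 2:
        raise ValueError("expected at most two distinct alleles at one SNP")
    major_allele = allele_counts.most_common(1)[0][0]
    return [sum(1 for allele in g if allele != major_allele) for g in genotypes]


# Five hypothetical individuals at one SNP: A occurs 6 times, G 4 times,
# so A is the major allele.
codes = encode_snp(["AA", "AG", "GG", "AA", "AG"])
print(codes)  # [0, 1, 2, 0, 1]
```

Applying the function to every SNP yields the genotype data X on which the association tests operate.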

PLOS Computational Biology | www.ploscompbiol.org 1 December 2012 | Volume 8 | Issue 12 | e1002828


What to Learn in This Chapter

• The background of Genome-wide association study (GWAS).
• The existing computational approaches that are widely used in GWAS. This will cover single-locus, epistasis detection, and machine learning methods.
• The limitations of current approaches and future directions.

In the remainder of the chapter, we will first discuss the single-locus methods. We will then study epistasis detection (multi-locus) approaches which are designed for association studies of complex traits. For epistasis detection, we will mainly focus on exact two-locus association mapping methods.

2. Single-Locus Association Mapping

With the rapid development of high-throughput genotyping technology, millions of SNPs are now available for genome-wide association studies. The single-locus association test is a traditional approach to association studies. Specifically, for each SNP, a statistical test is performed to evaluate the association between the SNP and the phenotype. A variety of tests can be applied depending on the data types. The phenotype involved in a study can be case-control (binary), quantitative (continuous), or categorical. We categorize the statistical tests based on the kind of phenotypes they can be applied to.

2.1 Problem Formalization

Let $\{X_1, \ldots, X_N\}$ be a set of $N$ SNPs for $M$ individuals and $X_n = \{X_{n1}, \ldots, X_{nM}\}$ $(1 \le n \le N)$. We use 0, 1, 2 to represent the homozygous major allele, heterozygous allele, and homozygous minor allele, respectively. Thus we have that $X_{nm} \in \{0, 1, 2\}$ $(1 \le n \le N, 1 \le m \le M)$. Let $Y = \{y_1, \ldots, y_M\}$ be the phenotype. Note that the values that $Y$ can take depend on its type.

2.2 Case-Control Phenotype

In a case-control study, the phenotype can be represented as a binary variable with 0 representing controls and 1 representing cases.

Many tests can be used to assess the significance of the association between a single SNP and a binary phenotype. The test statistics are usually based on the contingency table. The null hypothesis is that there is no association between the rows and columns of the contingency table.

2.2.1 Pearson's $\chi^2$ test. Pearson's $\chi^2$ test can be used to test a null hypothesis stating that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution [11].

The value of the test statistic is

$$X^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$

where $E_{ij} = \frac{O_{i\cdot}\, O_{\cdot j}}{S}$. The degree of freedom is 2.

2.2.2 G-test. G-test is an approximation of the log-likelihood ratio. The test statistic is

$$G = 2 \sum_i \sum_j O_{ij} \cdot \ln\!\left(\frac{O_{ij}}{E_{ij}}\right),$$

where $E_{ij} = \frac{O_{i\cdot}\, O_{\cdot j}}{S}$.

The null hypothesis is that the observed frequencies result from random sampling from a distribution with the given expected frequencies. The distribution of $G$ is approximately that of $\chi^2$, with the same degree of freedom as in the corresponding $\chi^2$ test. When applied to a reasonable size of samples, the G-test and the $\chi^2$ test will lead to the same conclusions.

2.2.3 Fisher exact test. When the sample size is small, the Fisher exact test is useful to determine the significance of the association. The p-value of the test is the probability of the contingency table given the fixed margins. The probability of obtaining such values in Table 1 is given by the hypergeometric distribution:

$$p = \frac{\binom{O_{\cdot 0}}{O_{00}} \binom{O_{\cdot 1}}{O_{01}} \binom{O_{\cdot 2}}{O_{02}}}{\binom{S}{O_{0\cdot}}} = \frac{(O_{\cdot 0}!\, O_{\cdot 1}!\, O_{\cdot 2}!)\,(O_{0\cdot}!\, O_{1\cdot}!)}{S!\,(O_{00}!\, O_{01}!\, O_{02}!\, O_{10}!\, O_{11}!\, O_{12}!)}$$

Most modern statistical packages can calculate the significance of Fisher tests. The actual computation performed by the existing software packages may be different from the exact formulation given above because of the numerical difficulties. A simple, somewhat better computational approach relies on a gamma function or log-gamma function. How to accurately compute hypergeometric and binomial probabilities remains an active research area.

2.2.4 Cochran-Armitage test. For complex traits, contributions to disease risk from SNPs are widely considered to be roughly additive. In other words, the heterozygous alleles will have an intermediate risk between two homozygous alleles. Cochran-Armitage test can be used in this case [12,5]. Let the test statistic $U$ be the following:

$$U = O_{1\cdot}\, O_{0\cdot} \sum_{i=0}^{2} i \left( \frac{O_{1i}}{O_{1\cdot}} - \frac{O_{0i}}{O_{0\cdot}} \right)$$

After substitution, we get

$$U = S\,(O_{11} + 2 O_{12}) - O_{1\cdot}\,(O_{\cdot 1} + 2 O_{\cdot 2})$$

The variance of $U$ under the null hypothesis can be computed as

$$\mathrm{Var}(U) = \frac{(S - O_{1\cdot})\, O_{1\cdot}}{S} \left[ S\,(O_{\cdot 1} + 4 O_{\cdot 2}) - (O_{\cdot 1} + 2 O_{\cdot 2})^2 \right]$$

Notice that for a large sample size $S$, we have $\frac{U}{\sqrt{\mathrm{Var}(U)}} \sim N(0,1)$, hence $\frac{U^2}{\mathrm{Var}(U)} \sim \chi^2_1$.
Table 1. Contingency table for a single
A contingency table records the SNP Xn and a phenotype Y . 2.2.5 Summary. There is no overall
frequencies of different events. Table 1 winner of the introduced tests. Cochran-
is an example contingency table. For a Armitage test may not be the best if the risks
SNP Xn and a phenotype Y , and we use Xn ~0 Xn ~1 Xn ~2 Totals are deviated from the additive model.
Oij to denote the number of individuals Meanwhile, x2 test, G-test, and Fisher exact
Y ~0 O00 O01 O02 O0:
whose Xn equals test can handle the full range of risks, but they
Pi and Y equals P j. Also,
Y ~1 O10 O11 O12 O1: will unavoidably lose some power in the
we have Oi: ~ Oij and O:j ~ Oij .
j i Totals O:0 O: 1 O:2 S detection of additive ones. Different tests may
The P total number of individuals be applied on the same data to detect
S~ Oij . doi:10.1371/journal.pcbi.1002828.t001 different effects.
i,j
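As a rough illustration of Sections 2.2.1 and 2.2.2, the sketch below computes the Pearson $\chi^2$ statistic and the G statistic for one 2 × 3 contingency table laid out as in Table 1. The function names and the counts are hypothetical and chosen only for the example; the code assumes every row and column total is non-zero.

```python
import math

def expected_counts(table):
    """E_ij = O_i. * O_.j / S for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def pearson_chi2(table):
    """Pearson statistic: sum over cells of (O - E)^2 / E."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

def g_statistic(table):
    """G statistic: 2 * sum over cells of O * ln(O / E); empty cells add 0."""
    expected = expected_counts(table)
    return 2.0 * sum(o * math.log(o / e)
                     for obs_row, exp_row in zip(table, expected)
                     for o, e in zip(obs_row, exp_row) if o > 0)

# Hypothetical counts. Rows: controls (Y=0), cases (Y=1); columns: genotypes 0, 1, 2.
counts = [[40, 40, 20],
          [20, 40, 40]]
chi2 = pearson_chi2(counts)
g = g_statistic(counts)
print(f"chi2 = {chi2:.3f}, G = {g:.3f}")
```

Both statistics would then be compared against a $\chi^2$ distribution with 2 degrees of freedom; computing that tail probability is left to a statistics package.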

PLOS Computational Biology | www.ploscompbiol.org 2 December 2012 | Volume 8 | Issue 12 | e1002828
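The Cochran-Armitage statistic of Section 2.2.4 can be sketched in the same way. The function below follows the substituted form of $U$ and the stated null variance; the function name and the example counts are assumptions made for illustration, and the table layout is that of Table 1 (row 0 controls, row 1 cases).

```python
def cochran_armitage(table):
    """Return (U, Var(U), U^2 / Var(U)) for a 2x3 table with additive weights 0, 1, 2."""
    controls, cases = table
    col = [controls[g] + cases[g] for g in range(3)]   # O_.0, O_.1, O_.2
    n_cases = sum(cases)                               # O_1.
    total = n_cases + sum(controls)                    # S
    weighted = col[1] + 2 * col[2]                     # O_.1 + 2 O_.2
    u = total * (cases[1] + 2 * cases[2]) - n_cases * weighted
    var = ((total - n_cases) * n_cases / total
           * (total * (col[1] + 4 * col[2]) - weighted ** 2))
    return u, var, u * u / var

# Hypothetical counts. Rows: controls (Y=0), cases (Y=1); columns: genotypes 0, 1, 2.
counts = [[40, 40, 20],
          [20, 40, 40]]
u, var_u, trend_stat = cochran_armitage(counts)
print(u, var_u, round(trend_stat, 3))
```

The last value is the quantity that is approximately $\chi^2_1$ distributed for large $S$.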


2.3 Quantitative Phenotype
In addition to case-control phenotypes, many complex traits are quantitative. This type of study is often referred to as quantitative trait locus (QTL) analysis. The standard tools for testing the association between a single marker and a continuous outcome are analysis of variance (ANOVA) and linear regression.

2.3.1 One-way ANOVA. The F-test in one-way analysis of variance is used to assess whether the expected values of a quantitative variable within several pre-defined groups differ from each other. For each SNP $X_n$, we can divide all the individuals into three groups according to their genotypes. Let $Y'_i$ $(i \in \{0,1,2\})$ be the subset of phenotype values of the individuals whose genotype equals $i$. We denote the number of phenotype values in $Y'_i$ by $M_i$, so that $Y'_i = \{y_{i1}, \ldots, y_{iM_i}\}$. Notice that $\bigcup_{i=0}^{2} Y'_i = Y$ and $\sum_{i=0}^{2} M_i = M$.
The total sum of squares (SST) can be divided into two parts, the between-group sum of squares (SSB) and the within-group sum of squares (SSW):

$$SST = \sum_{m=1}^{M} (y_m - \bar{Y})^2 = \sum_{i=0}^{2} \sum_{m=1}^{M_i} (y_{im} - \bar{Y})^2,$$

$$SSB = \sum_{i=0}^{2} M_i\, (\bar{Y}'_i - \bar{Y})^2, \text{ and}$$

$$SSW = SST - SSB = \sum_{i=0}^{2} \sum_{m=1}^{M_i} (y_{im} - \bar{Y}'_i)^2,$$

where

$$\bar{Y} = \frac{1}{M} \sum_{m=1}^{M} y_m \quad \text{and} \quad \bar{Y}'_i = \frac{1}{M_i} \sum_{m=1}^{M_i} y_{im}.$$

The F-test statistic is $F = \frac{SSB/2}{SSW/(M-3)}$, and $F$ follows the F-distribution with 2 and $M-3$ degrees of freedom under the null hypothesis, i.e., $F \sim F_{(2,\,M-3)}$.

2.3.2 Linear regression. In the linear regression model, a least-squares regression line is fit between the phenotype values and the genotype values [11]. For simplicity, we denote the genotypes of a single SNP by $x_1, x_2, \ldots, x_M$. Based on the data $(x_1, y_1), \ldots, (x_M, y_M)$, we need to fit a line of the form $Y = a + bx$.
We have the sums of squares as follows:

$$SS_{xx} = \sum_{i=1}^{M} (x_i - \bar{x})^2, \quad SS_{yy} = \sum_{i=1}^{M} (y_i - \bar{Y})^2, \quad \text{and} \quad SS_{xy} = \sum_{i=1}^{M} (x_i - \bar{x})(y_i - \bar{Y}),$$

where $\bar{x} = \frac{1}{M} \sum_{i=1}^{M} x_i$ and $\bar{Y} = \frac{1}{M} \sum_{i=1}^{M} y_i$.
The least-squares estimator of $b$ is $\frac{SS_{xy}}{SS_{xx}}$. To evaluate the significance of the obtained model, a hypothesis test for $b = 0$ is then applied.

2.4 Multiple Testing Problem
In a typical GWAS, the test needs to be performed many times. We should pay attention to a statistical issue known as the multiple testing problem. In the remainder of this section, we will discuss the multiple testing problem and how to effectively control the error rate in GWAS.
The type 1 error rate is the probability that a null hypothesis is rejected when it is actually true. In other words, it is the chance of observing a positive (significant) result when there is no true effect. If a test is performed multiple times, the overall type 1 error rate will increase. This is called the multiple testing problem.
Let $\alpha$ be the type 1 error rate for a statistical test. If the test is performed $n$ times independently, the experiment-wise error rate $\alpha'$ is given by

$$\alpha' = 1 - (1 - \alpha)^n.$$

For example, if $\alpha = 0.05$ and $n = 20$, then $\alpha' = 1 - (1 - 0.05)^{20} = 0.64$. In this case, the chance of getting at least one false positive is 64%.
Because of the multiple testing problem, a test result may not be that significant even if its p-value is less than a significance level $\alpha$. To solve this problem, the nominal p-value needs to be corrected (adjusted).

2.5 Family-Wise Error Rate Control
For the single-locus test, we denote the p-value for an association test of a SNP $X_i$ and a phenotype $Y$ by $p(X_i, Y)$, and the corrected p-value by $p'(X_i, Y)$. The family-wise error rate (FWER), or experiment-wise error rate, is the probability of at least one false association. We use $\alpha'$ to denote the family-wise error rate, and it is given by

$$\alpha' = P(\text{reject } H_0 \mid H_0) = P(\text{reject at least one of } H_i\ (1 \le i \le n) \mid H_0),$$

where $n$ is the total number of tests and $H_0$ is the hypothesis that all the $H_i$ $(1 \le i \le n)$ are true.
Many methods can be used to control the FWER. Bonferroni correction is a commonly used method, in which p-values are enlarged to account for the number of comparisons being performed. The permutation test [13] is also widely used to correct for multiple testing in GWAS.

2.5.1 Bonferroni correction. In Bonferroni correction, the p-value of a test is multiplied by the number of tests in the multiple comparison:

$$p'(X_i, Y) = p(X_i, Y) \times N.$$

Here the number of tests is the number of SNPs $N$ in a study. Bonferroni correction is a single-step procedure, in which each of the p-values is independently corrected.

2.5.2 Permutation tests. In the permutation test, the data are reshuffled. For each permutation, p-values for all the tests are re-calculated, and the minimal p-value is retained. After $K$ permutations, we have $K$ minimal p-values in total. The corrected p-value is given by the proportion of minimal p-values that are less than the original p-value.
Let $\{Y_1, \ldots, Y_K\}$ be the set of $K$ permutations. For each permutation $Y_k$ $(1 \le k \le K)$, the minimal p-value $p_{Y_k}$ is given by

$$p_{Y_k} = \min\{\, p(X_i, Y_k) \mid 1 \le i \le N \,\}.$$

Then we have the corrected p-value

$$p'(X_i, Y) = \frac{\#\{\, p_{Y_k} < p(X_i, Y) \mid 1 \le k \le K \,\}}{K}.$$

The permutation method takes advantage of the correlation structure between SNPs. It is less stringent than Bonferroni correction.

2.6 False Discovery Rate Control
The false discovery rate (FDR) is the expected proportion of type 1 errors among all hypotheses declared significant. Controlling it is less conservative than controlling the family-wise error rate. For example, if 100 observed results are claimed to be significant, and the FDR is 0.1, then 10 of the results are expected to be false discoveries.
One way to control the FDR is as follows [14]. The p-values of SNPs and the phenotype are ranked from smallest to largest. We denote the ordered p-values by $p_1, \ldots, p_N$. Starting from the largest p-value and proceeding to the smallest, the original p-value is multiplied by the total number of SNPs and divided by its rank. For the $i$th p-value $p_i$, its corrected p-value $p'_i$ is given by

$$p'_i = p_i \times \left(\frac{N}{i}\right).$$
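To make the single-marker tests of Section 2.3 concrete, here is a minimal sketch that computes a one-way ANOVA F statistic and the least-squares slope $SS_{xy}/SS_{xx}$ for one SNP. The F statistic is computed with the usual degrees-of-freedom normalization, $(SSB/(g-1))/(SSW/(M-g))$ with $g$ observed genotype groups; the function names and the toy data are assumptions for illustration only.

```python
def one_way_anova_f(genotypes, phenotypes):
    """F statistic for phenotype values grouped by genotype."""
    groups = {}
    for g, y in zip(genotypes, phenotypes):
        groups.setdefault(g, []).append(y)
    m = len(phenotypes)
    grand_mean = sum(phenotypes) / m
    sst = sum((y - grand_mean) ** 2 for y in phenotypes)
    # Between-group sum of squares, weighted by group size M_i.
    ssb = sum(len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2
              for ys in groups.values())
    ssw = sst - ssb
    k = len(groups)
    return (ssb / (k - 1)) / (ssw / (m - k))

def regression_slope(x, y):
    """Least-squares estimate of b in Y = a + b x, i.e. SS_xy / SS_xx."""
    m = len(x)
    x_bar = sum(x) / m
    y_bar = sum(y) / m
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return ss_xy / ss_xx

# Hypothetical toy data: genotype coded 0/1/2 and a quantitative phenotype.
genotypes = [0, 0, 1, 1, 2, 2]
phenotypes = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0]
f_stat = one_way_anova_f(genotypes, phenotypes)
slope = regression_slope(genotypes, phenotypes)
print(f_stat, slope)
```

The ANOVA treats the three genotypes as unordered groups, whereas the regression slope assumes an additive effect of the minor-allele count, which mirrors the contrast drawn for case-control tests in Section 2.2.5.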


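The corrections of Sections 2.4 to 2.6 are simple enough to sketch directly. Two details below are assumptions that go beyond the formulas in the text: corrected p-values are capped at 1, and the FDR adjustment keeps a running minimum while walking from the largest p-value to the smallest (the usual step-up convention, so that adjusted values stay monotone). Function names and the example p-values are hypothetical.

```python
def experiment_wise_error(alpha, n):
    """alpha' = 1 - (1 - alpha)^n for n independent tests."""
    return 1.0 - (1.0 - alpha) ** n

def bonferroni(p_values):
    """Multiply each p-value by the number of tests (capped at 1)."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

def fdr_adjust(p_values):
    """p_i' = p_i * N / rank, processed from the largest p-value down."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical nominal p-values for four SNPs.
p_nominal = [0.01, 0.04, 0.03, 0.005]
p_bonf = bonferroni(p_nominal)
p_fdr = fdr_adjust(p_nominal)
alpha_20 = experiment_wise_error(0.05, 20)
print(p_bonf, p_fdr, round(alpha_20, 2))
```

On these numbers the FDR-adjusted values are never larger than the Bonferroni ones, which illustrates why FDR control is described as less conservative.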

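The permutation procedure of Section 2.5.2 can be sketched as follows. As a stated simplification, the code retains the maximal $\chi^2$ statistic per permutation instead of the minimal p-value; the two are equivalent only when every SNP's test has the same null distribution, which is assumed here. The helper names, the seed, and the toy data are all hypothetical.

```python
import random

def chi2_stat(genotypes, phenotype):
    """Pearson chi-square statistic of a 2x3 genotype-by-phenotype table."""
    counts = [[0, 0, 0], [0, 0, 0]]
    for g, y in zip(genotypes, phenotype):
        counts[y][g] += 1
    rows = [sum(r) for r in counts]
    cols = [sum(c) for c in zip(*counts)]
    total = sum(rows)
    stat = 0.0
    for i in range(2):
        for j in range(3):
            e = rows[i] * cols[j] / total
            if e > 0:  # skip genotype classes that never occur
                stat += (counts[i][j] - e) ** 2 / e
    return stat

def permutation_corrected_p(snps, phenotype, k=200, seed=0):
    """Corrected p-value per SNP: share of permutations whose best statistic beats it."""
    rng = random.Random(seed)
    observed = [chi2_stat(s, phenotype) for s in snps]
    shuffled = list(phenotype)
    max_stats = []
    for _ in range(k):
        rng.shuffle(shuffled)  # reshuffle phenotype labels, keep genotypes fixed
        max_stats.append(max(chi2_stat(s, shuffled) for s in snps))
    return [sum(m > o for m in max_stats) / k for o in observed]

# Hypothetical data: SNP 0 tracks the phenotype perfectly, SNP 1 is unrelated.
phenotype = [0] * 10 + [1] * 10
snps = [[0] * 10 + [2] * 10, [0, 1] * 10]
corrected = permutation_corrected_p(snps, phenotype)
print(corrected)
```

Because genotypes are left untouched and only the phenotype is shuffled, the correlation structure between SNPs is preserved in every permutation, which is the property the text credits for the method being less stringent than Bonferroni correction.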
In this section, we have discussed commonly used methods in single-locus studies, the multiple testing problem, and how to control the error rate in GWAS. In the next section, we will introduce methods used for two-locus association studies. We will focus on one class of work that finds exact solutions when searching for SNP-SNP interactions in GWAS.

3. Exact Methods for Two-Locus Association Study

The vast number of SNPs has posed a great computational challenge to genome-wide association studies. In order to understand the underlying biological mechanisms of complex phenotypes, one needs to consider the joint effect of multiple SNPs simultaneously. Although the idea of studying the association between phenotype and multiple SNPs is straightforward, the implementation is nontrivial. For a study with a total of $N$ SNPs, in order to find the association between $n$ SNPs and the phenotype, a brute-force approach is to exhaustively enumerate all $\binom{N}{n}$ possible SNP combinations and evaluate their associations with the phenotype. The computational burden imposed by this enormous search space often makes the complete genome-wide association study intractable. Moreover, although the permutation test has been considered the gold standard method for multiple testing correction, it dramatically increases the computational burden because the process needs to be performed for all permuted data.
In this section, we will focus on recently developed exact methods for two-locus epistasis detection. Different from the single-locus approach, the goal of two-locus epistasis detection is to identify interacting SNP-pairs that have strong association with the phenotype. FastANOVA [15] is an algorithm for the two-locus ANOVA (analysis of variance) test on quantitative traits, and FastChi [16] is an algorithm for the two-locus chi-square test on case-control phenotypes. COE [17] is a general method that can be applied to a wide range of tests. TEAM [18] is designed for studies involving a large number of individuals, such as human studies. In this section, we will discuss these algorithms and their strengths and limitations.

3.1 The FastANOVA Algorithm
FastANOVA utilizes an upper bound of the two-locus ANOVA test to prune the search space. The upper bound is expressed as the sum of two terms. The first term is based on the single-SNP ANOVA test. The second term is based on the genotype of the SNP-pair and is independent of permutations. This property allows SNP-pairs to be indexed in a 2D array based on the genotype relationship between SNPs. Since the number of entries in the 2D array is bounded by the number of individuals in the study, many SNP-pairs share a common entry. Moreover, it can be shown that all SNP-pairs indexed by the same entry have exactly the same upper bound. Therefore, we can compute the upper bound for a group of SNP-pairs together. Another important property is that the indexing structure only needs to be built once and can be reused for all permuted data. Utilizing the upper bound and the indexing structure, FastANOVA only needs to perform the ANOVA test on a small number of candidate SNP-pairs without the risk of missing any significant pair. We discuss the algorithm in further detail in the following.
Let $\{X_1, X_2, \ldots, X_N\}$ be the set of SNPs of $M$ individuals $(X_i \in \{0,1\},\ 1 \le i \le N)$ and $Y = \{y_1, y_2, \ldots, y_M\}$ be the quantitative phenotype of interest, where $y_m$ $(1 \le m \le M)$ is the phenotype value of individual $m$.
For any SNP $X_i$ $(1 \le i \le N)$, we represent the F-statistic from the ANOVA test of $X_i$ and $Y$ as $F(X_i, Y)$. For any SNP-pair $(X_i X_j)$, we represent the F-statistic from the ANOVA test of $(X_i X_j)$ and $Y$ as $F(X_i X_j, Y)$.
The basic idea of the ANOVA test is to partition the total sum of squared deviations $SS_T$ into the between-group sum of squared deviations $SS_B$ and the within-group sum of squared deviations $SS_W$:

$$SS_T = SS_B + SS_W.$$

In our application of the two-locus association study, Table 2 and Table 3 show the possible groupings of phenotype values by the genotypes of $X_i$ and $(X_i X_j)$ respectively.

Table 2. Grouping of Y by $X_i$.

X_i = 1    X_i = 0
group A    group B

doi:10.1371/journal.pcbi.1002828.t002

Table 3. Grouping of Y by $X_i X_j$.

           X_i = 1      X_i = 0
X_j = 1    group a_1    group b_1
X_j = 0    group a_2    group b_2

doi:10.1371/journal.pcbi.1002828.t003

Let $A$, $B$, $a_1$, $a_2$, $b_1$, $b_2$ represent the groups as indicated in Table 2 and Table 3. We use $SS_B(X_i, Y)$ and $SS_B(X_i X_j, Y)$ to distinguish the one-locus (i.e., single-SNP) and two-locus (i.e., SNP-pair) analyses. Specifically, we have

$$SS_T(X_i, Y) = SS_B(X_i, Y) + SS_W(X_i, Y),$$

$$SS_T(X_i X_j, Y) = SS_B(X_i X_j, Y) + SS_W(X_i X_j, Y).$$

The F-statistics for ANOVA tests on $X_i$ and $(X_i X_j)$ are:

$$F(X_i, Y) = \frac{M - 2}{2 - 1} \times \frac{SS_B(X_i, Y)}{SS_T(X_i, Y) - SS_B(X_i, Y)}, \qquad (1.1)$$

$$F(X_i X_j, Y) = \frac{M - g}{g - 1} \times \frac{SS_B(X_i X_j, Y)}{SS_T(X_i X_j, Y) - SS_B(X_i X_j, Y)}, \qquad (1.2)$$

where $g$ in Equation (1.2) is the number of groups that the genotype of $(X_i X_j)$ partitions the individuals into. Possible values of $g$ are 3 or 4, assuming all SNPs are distinct: if none of groups $A$, $B$, $a_1$, $a_2$, $b_1$, $b_2$ is empty, then $g = 4$; if one of them is empty, then $g = 3$.
Let $T = \sum_{y_m \in Y} y_m$ be the sum of all phenotype values. The total sum of squared deviations does not depend on the grouping of individuals:

$$SS_T(X_i, Y) = SS_T(X_i X_j, Y) = \sum_{y_m \in Y} y_m^2 - \frac{T^2}{M}.$$

Let $T_{group} = \sum_{y_m \in group} y_m$ be the sum of phenotype values in a specific group, and $n_{group}$ be the number of individuals in that group. $SS_B(X_i, Y)$ and $SS_B(X_i X_j, Y)$ can be calculated as follows:

$$SS_B(X_i, Y) = \frac{T_A^2}{n_A} + \frac{T_B^2}{n_B} - \frac{T^2}{M},$$



$$SS_B(X_i X_j, Y) = \frac{T_{a_1}^2}{n_{a_1}} + \frac{T_{a_2}^2}{n_{a_2}} + \frac{T_{b_1}^2}{n_{b_1}} + \frac{T_{b_2}^2}{n_{b_2}} - \frac{T^2}{M}.$$

Note that for any group of $A$, $B$, $a_1$, $a_2$, $b_1$, $b_2$, if $n_{group} = 0$, then $\frac{T_{group}^2}{n_{group}}$ is defined to be 0.
Let $\{y_m \mid y_m \in A\} = \{y_{A_1}, y_{A_2}, \ldots, y_{A_{n_A}}\}$ be the phenotype values in group $A$. Without loss of generality, assume that these phenotype values are arranged in ascending order, i.e.,

$$y_{A_1} \le y_{A_2} \le \cdots \le y_{A_{n_A}}.$$

Let $\{y_m \mid y_m \in B\} = \{y_{B_1}, y_{B_2}, \ldots, y_{B_{n_B}}\}$ be the phenotype values in group $B$. Without loss of generality, assume that these phenotype values are arranged in ascending order, i.e.,

$$y_{B_1} \le y_{B_2} \le \cdots \le y_{B_{n_B}}.$$

We have the overall upper bound on $SS_B(X_i X_j, Y)$:
Theorem 1 (Upper bound of $SS_B(X_i X_j, Y)$)

$$SS_B(X_i X_j, Y) \le SS_B(X_i, Y) + R_1(X_i X_j Y) + R_2(X_i X_j Y).$$

The notations in the bound can be found in Table 4.

Table 4. Notations for the bounds.

Symbols              Formulas
$l_{a_1}$            $\sum_{i=1}^{n_{a_1}} y_{A_i}$
$u_{a_1}$            $\sum_{i=n_A - n_{a_1} + 1}^{n_A} y_{A_i}$
$R_1(X_i X_j Y)$     $\frac{\max\{(n_A l_{a_1} - n_{a_1} T_A)^2,\ (n_A u_{a_1} - n_{a_1} T_A)^2\}}{n_{a_1} (n_A - n_{a_1})\, n_A}$
$l_{b_1}$            $\sum_{i=1}^{n_{b_1}} y_{B_i}$
$u_{b_1}$            $\sum_{i=n_B - n_{b_1} + 1}^{n_B} y_{B_i}$
$R_2(X_i X_j Y)$     $\frac{\max\{(n_B l_{b_1} - n_{b_1} T_B)^2,\ (n_B u_{b_1} - n_{b_1} T_B)^2\}}{n_{b_1} (n_B - n_{b_1})\, n_B}$

doi:10.1371/journal.pcbi.1002828.t004

The upper bound in Theorem 1 is tight. The tightness of the bound is obvious from the derivation of the upper bound, since there exists some genotype of SNP-pair $(X_i X_j)$ that makes the equality hold.
We now discuss how to apply the upper bound in Theorem 1 in detail. The set of all SNP-pairs is partitioned into non-overlapping groups such that the upper bound can be readily applied to each group. For every $X_i$ $(1 \le i \le N)$, let $AP(X_i)$ be the set of SNP-pairs

$$AP(X_i) = \{(X_i X_j) \mid i + 1 \le j \le N\}.$$

For all SNP-pairs in $AP(X_i)$, $n_A$, $T_A$, $n_B$, $T_B$ and $SS_B(X_i, Y)$ are constants. Moreover, $l_{a_1}$, $u_{a_1}$ are determined by $n_{a_1}$, and $l_{b_1}$, $u_{b_1}$ are determined by $n_{b_1}$. Therefore, in the upper bound, $n_{a_1}$ and $n_{b_1}$ are the only variables that depend on $X_j$ and may vary for different SNP-pairs $(X_i X_j)$ in $AP(X_i)$.
Note that $n_{a_1}$ is the number of 1s in $X_j$ when $X_i$ takes value 1, and $n_{b_1}$ is the number of 1s in $X_j$ when $X_i$ takes value 0. It is easy to prove that switching $n_{a_1}$ and $n_{a_2}$ does not change the F-statistic value or the correctness of the upper bound. This is also true if we switch $n_{b_1}$ and $n_{b_2}$. Therefore, without loss of generality, we can always assume that $n_{a_1}$ is the smaller of the number of 1s and the number of 0s in $X_j$ when $X_i$ takes value 1, and $n_{b_1}$ is the smaller of the number of 1s and the number of 0s in $X_j$ when $X_i$ takes value 0.
If there are $m$ 1s and $(M - m)$ 0s in $X_i$, then for any $(X_i X_j) \in AP(X_i)$, the possible values that $n_{a_1}$ can take are $\{0, 1, 2, \ldots, \lfloor m/2 \rfloor\}$. The possible values that $n_{b_1}$ can take are $\{0, 1, 2, \ldots, \lfloor (M - m)/2 \rfloor\}$.
To efficiently retrieve the candidates, the SNP-pairs $(X_i X_j)$ in $AP(X_i)$ are grouped by their $(n_{a_1}, n_{b_1})$ values and indexed in a 2D array, referred to as $Array(X_i)$.
Suppose that there are 32 individuals, and the genotype of $X_i$ consists of half 0s and half 1s. Thus for the SNP-pairs in $AP(X_i)$, the possible values of $n_{a_1}$ and $n_{b_1}$ are $\{0, 1, 2, \ldots, 8\}$. Figure 1 shows the $9 \times 9$ array, $Array(X_i)$, whose entries represent the possible values of $(n_{a_1}, n_{b_1})$ for the SNP-pairs $(X_i X_j) \in AP(X_i)$. The entries in the same column have the same $n_{a_1}$ value. The entries in the same row have the same $n_{b_1}$ value. The $n_{a_1}$ value of each column is noted beneath each column. The $n_{b_1}$ value of each row is noted to the left of each row. Each entry of the array is a pointer to the SNP-pairs $(X_i X_j) \in AP(X_i)$ having the corresponding $(n_{a_1}, n_{b_1})$ values.
For any SNP $X_i$, the maximum number of entries in $Array(X_i)$ is $(\lceil M/4 \rceil + 1)^2$. The proof of this property is straightforward and omitted here. In order to find candidate SNP-pairs, we scan all entries in $Array(X_i)$ to calculate their upper bounds. Since the SNP-pairs indexed by the same entry share the same $(n_{a_1}, n_{b_1})$ value, they have the same upper bound. In this way, we can calculate the upper bound for a group of SNP-pairs together. Note that for typical genome-wide association studies, the number of individuals $M$ is much smaller than the number of SNPs $N$. Therefore, the additional cost of accessing $Array(X_i)$ is minimal compared to performing ANOVA tests for all pairs $(X_i X_j) \in AP(X_i)$.
For multiple tests, the permutation procedure is often used in genetic analysis for controlling the family-wise error rate. For genome-wide association studies, permutation is less commonly used because it often entails prohibitively long computation times. Our FastANOVA algorithm makes the permutation procedure feasible in genome-wide association studies.
Let $Y' = \{Y_1, Y_2, \ldots, Y_K\}$ be the $K$ permutations of the phenotype $Y$. Following the idea discussed above, the upper bound in Theorem 1 can be easily incorporated in the algorithm to handle the permutations. For every SNP $X_i$, the indexing structure $Array(X_i)$ is independent of the permuted phenotypes in $Y'$. The correctness of this property relies on the fact that, for any $(X_i X_j) \in AP(X_i)$, $n_{a_1}$ and $n_{b_1}$ only depend on the genotype of the SNP-pair and thus remain constant for different phenotype permutations. Therefore, for each $X_i$, once we build $Array(X_i)$, it can be reused in all permutations.

3.2 The FastChi Algorithm
As our initial attempt to develop scalable algorithms for genome-wide association studies, FastANOVA is specifically designed for the ANOVA test on quantitative phenotypes. Another category of phenotypes is generated in case-control studies, where the phenotypes are binary variables representing disease/non-disease individuals. The chi-square test is one of the most commonly used statistics in binary phenotype association



study. We can extend the principles in FastANOVA for an efficient two-locus chi-square test. The general idea of FastChi is similar to that of FastANOVA, i.e., reformulating the chi-square test statistic to establish an upper bound of the two-locus chi-square test, and indexing the SNP-pairs according to their genotypes in order to effectively prune the search space and reuse redundant computations. Here we briefly introduce the FastChi algorithm.

Figure 1. The index array $Array(X_i)$ for efficient retrieval of the candidate SNP-pairs.
doi:10.1371/journal.pcbi.1002828.g001

For SNP $X_i$, we represent the chi-square test value of $X_i$ and the binary phenotype $Y$ as $\chi^2(X_i, Y)$. For any SNP-pair $X_i$ and $X_j$, we use $\chi^2(X_i X_j, Y)$ to represent the chi-square test value for the combined effect of $(X_i X_j)$ with $Y$. Let $A$, $B$, $C$, $D$ represent the following events respectively: $Y = 0 \wedge X_i = 0$; $Y = 0 \wedge X_i = 1$; $Y = 1 \wedge X_i = 0$; $Y = 1 \wedge X_i = 1$. Let $O_{event}$ denote the observed value of an event. $T_1$, $T_2$, $S_1$, $S_2$, $R_1$, and $R_2$ represent the formulas shown in Table 5. We have the upper bound of $\chi^2(X_i X_j, Y)$ stated in Theorem 2.

Table 5. Notations used in the derivation of the upper bound for two-locus Chi-square test.

Symbols    Formulas
$T_1$      $\frac{M^2}{(O_A + O_B)(O_A + O_C)(O_C + O_D)}$
$S_1$      $\max\{O_A^2,\ O_C^2\}$
$R_1$      $\min\left\{ \left.\frac{O_{X_j=1}}{O_{X_j=0}}\right|_{X_i=0},\ \left.\frac{O_{X_j=0}}{O_{X_j=1}}\right|_{X_i=0} \right\}$
$T_2$      $\frac{M^2}{(O_A + O_B)(O_B + O_D)(O_C + O_D)}$
$S_2$      $\max\{O_B^2,\ O_D^2\}$
$R_2$      $\min\left\{ \left.\frac{O_{X_j=1}}{O_{X_j=0}}\right|_{X_i=1},\ \left.\frac{O_{X_j=0}}{O_{X_j=1}}\right|_{X_i=1} \right\}$

doi:10.1371/journal.pcbi.1002828.t005

Theorem 2 (Upper bound of $\chi^2(X_i X_j, Y)$)

$$\chi^2(X_i X_j, Y) \le \chi^2(X_i, Y) + T_1 S_1 R_1 + T_2 S_2 R_2.$$

For a given phenotype $Y$ and SNP $X_i$, $\chi^2(X_i, Y)$, $T_1$, $S_1$, $T_2$, and $S_2$ are constants. $R_1$ and $R_2$ are the only variables that depend on $X_j$ and may vary for different SNP-pairs $(X_i X_j) \in AP(X_i)$. (Recall that $AP(X_i) = \{(X_i X_j) \mid i + 1 \le j \le N\}$.) Thus for a given $X_i$, we can treat the equation $\chi^2(X_i, Y) + T_1 S_1 R_1 + T_2 S_2 R_2 = \theta$ as a straight line in the 2-D space of $R_1$ and $R_2$. The SNP-pairs whose $(R_1(X_i X_j), R_2(X_i X_j))$ values fall below the line can be pruned without any further test.
Suppose that there are 32 individuals, and $X_i$ contains half 0s and half 1s. For the SNP-pairs in $AP(X_i)$, the possible values of $R_1$ (and $R_2$) are $\{\frac{0}{16}, \frac{1}{15}, \frac{2}{14}, \frac{3}{13}, \frac{4}{12}, \frac{5}{11}, \frac{6}{10}, \frac{7}{9}, \frac{8}{8}\}$. Figure 2 shows the 2-D space of $R_1$ and $R_2$. The blue stars represent the values that $(R_1, R_2)$ can take. The line $\chi^2(X_i, Y) + T_1 S_1 R_1 + T_2 S_2 R_2 = \theta$ is plotted in the figure. Only the SNP-pairs whose $(R_1, R_2)$ values are in the shaded region are subject to the two-locus chi-square test.
Similar to FastANOVA, in FastChi, we can index the SNP-pairs in $AP(X_i)$ according to their genotype relationships, i.e., by the values of $(R_1, R_2)$. Experimental results demonstrate that FastChi is an order of magnitude faster than the brute-force alternative.

3.3 The COE Algorithm
Both FastANOVA and FastChi rework the formula of the ANOVA test and the chi-square test to estimate an upper bound of the test value for SNP pairs. These upper bounds are used to identify candidate SNP pairs that may have a strong epistatic effect. Repetitive computation in a permutation test is also identified and performed once, and the results are stored for use by all permutations. These two strategies lead to substantial speedup, especially for large permutation tests, without compromising the accuracy of the test. These approaches are guaranteed to find the optimal solutions. However, a common drawback of these methods is that they are designed for specific tests, i.e., the chi-square test and the ANOVA test. The upper bounds used in these methods do not work for other statistical tests, which are



also routinely used by researchers. In addition, new statistics for epistasis detection are continually emerging in the literature. Therefore, it is desirable to develop a general model that supports a variety of statistical tests.

Figure 2. Pruning SNP-pairs in $AP(X_i)$ using the upper bound.
doi:10.1371/journal.pcbi.1002828.g002

The COE algorithm takes advantage of convex optimization. It can be shown that a wide range of statistical tests, such as the chi-square test, the likelihood ratio test (also known as the G-test), and entropy-based tests, are all convex functions of the observed frequencies in contingency tables. Since the maximum value of a convex function is attained at the vertices of its convex domain, by constraining the observed frequencies in the contingency tables, we can determine the domain of the convex function and get its maximum value. This maximum value is used as the upper bound on the test statistics to filter out insignificant SNP-pairs. COE is applicable to all tests that are convex.

3.4 The TEAM Algorithm
The methods we have discussed so far provide promising alternatives for GWAS. However, there are two major drawbacks that limit their applicability. First, they are designed for relatively small sample sizes and only consider homozygous markers (i.e., each SNP can be represented as a $\{0,1\}$ binary variable). In human studies, the sample size is usually large and most SNPs contain heterozygous genotypes and are coded using $\{0,1,2\}$. These make previous methods intractable. Second, although the family-wise error rate (FWER) and the false discovery rate (FDR) are both widely used for error control, previous methods are designed only to control the FWER. From a computational point of view, the difference between FWER and FDR control is that, to estimate the FWER, for each permutation, only the maximum two-locus test value is needed. To estimate the FDR, on the other hand, for each permutation, all two-locus test values must be computed.
To address these limitations, TEAM is proposed for efficient epistasis detection in human GWAS. TEAM has several advantages over previous methods. It supports both homozygous and heterozygous data. By exhaustively computing all two-locus test values in the permutation test, it enables both FWER and FDR control. It is applicable to all statistics based on the contingency table. Previous methods are either designed for specific tests or require the test statistics to satisfy certain properties. Experimental results demonstrate that TEAM is more efficient than existing methods for large sample studies.
TEAM incorporates the permutation test for proper error control. The key idea is to incrementally update the contingency tables of two-locus tests. We show that only four of the eighteen observed frequencies in the contingency table need to be updated to compute the test value. In the algorithm, we build a minimum spanning tree [19] on the SNPs. The nodes of the tree are SNPs. Each edge represents the genotype difference between the two connected SNPs. This tree structure can be utilized to speed up the updating process for the contingency tables. A majority of the individuals are pruned and only a small portion are scanned to update the contingency tables. This is advantageous in human studies, which usually involve



thousands of individuals. Extensive experimental results demonstrate the efficiency of the TEAM algorithm.
As a summary of the exact two-locus algorithms, FastANOVA and FastChi are designed for specific tests and binary genotype data. The COE algorithm is a more general method that can be applied to all convex tests. The TEAM algorithm is more suitable for large sample human GWAS.

4. Multifactor Dimensionality Reduction

Multifactor dimensionality reduction (MDR) [20] is a data mining method to identify interactions among discrete variables for binary outcomes. It can be used to detect high-order gene-gene and gene-environment interactions in case-control studies. By pooling multi-locus SNPs into two groups, one classified as high-risk and the other classified as low-risk, MDR effectively reduces the predictors from n dimensions to one dimension. Then, the one-dimensional variable is evaluated through cross-validation. The steps are repeated for all other n-factor combinations, and the factor model which has the lowest prediction error is chosen as the best n-factor model. Its detailed steps are as follows:

• Divide the data into 10 equal subsets.
• Select a set of n factors from the pool of all factors in the training set.
• Create a contingency table for these n factors by counting the number of cases and controls in each combination.
• Compute the case-control ratio in each combination. Label a combination as high-risk if its ratio is greater than a certain threshold; otherwise, mark it as low-risk.
• Use the labels to classify individuals. Compute the misclassification rate.
• Repeat the previous steps for all combinations of n factors across the 10 training and testing subsets.
• Choose as the best model the one whose average misclassification rate is minimized and whose cross-validation consistency is maximized.

MDR designs a constructive induction method that combines two or more SNPs before testing for association. The power of the MDR approach is that it can be combined with other methodologies, including the ones described in this chapter.

5. Logistic Regression

Logistic regression is a statistical method for predicting binary and categorical outcomes. It is widely used in GWAS [21,22]. The basic idea is to use a linear model for the log-odds of the occurrence of a specific outcome. Logistic regression is applicable to both single-locus and multi-locus association studies and can incorporate covariates and other factors in the model.
Let $Y \in \{0,1\}$ be a binary variable representing disease status (diseased versus non-diseased), and $X \in \{0,1,2\}$ be a SNP. The conditional probability of having the disease given a SNP is $\theta(X) = P(Y = 1 \mid X)$. We define the logit function to convert the range of the probability from $[0,1]$ to $(-\infty, +\infty)$:

$$\mathrm{logit}(X) = \ln \frac{\theta(X)}{1 - \theta(X)}.$$

The logit can be considered as a latent continuous variable that will be fit to a linear predictor function:

$$\mathrm{logit}(X) \sim \beta_0 + \beta \times X.$$

To cope with multiple SNP loci and potential covariates, we can modify the above model. For example, in the following model the logit is fit with predictors of SNPs $(X_1, X_2)$ and covariates $(Z_1, Z_2)$:

$$\mathrm{logit}(X) \sim \beta_0 + \beta_1 \times X_1 + \beta_2 \times X_2 + \beta_3 \times X_1 \times X_2 + \beta_4 \times Z_1 + \beta_5 \times Z_2.$$

Although logistic regression can handle complicated models, it may be computationally demanding when the number of predictors is large [23].

6. Summary

The potential of genome-wide association studies for the identification of genetic variants that underlie phenotypic variation is well recognized. The availability of large SNP data sets generated by high-throughput genotyping methods poses great computational and statistical challenges. In this chapter, we have discussed several computational approaches to detect associations between genetic markers and phenotypes. For further reading, readers are encouraged to refer to [11,7,24,25] for discussions about current progress and challenges in large-scale genetic association studies.

7. Exercises

Question 1: The table below contains binary genotype and case-control phenotype data from ten individuals. Give the contingency table and use the $\chi^2$ test to compute the association test score.

Genotype    Phenotype
1           0
0           0
1           1
0           0
1           0
0           0
1           1
0           0
1           1
0           0

Question 2: Assuming that we have the following SNP and phenotype data, is the SNP significantly associated with the phenotype? Here, we represent each SNP site as the number of minor alleles at that locus, so 0 and 2 are for major and minor homozygous sites, respectively, and 1 is for the heterozygous sites. We also assume that minor alleles contribute to the phenotype and the effect is additive. In other words, the effect from a minor homozygous site should be twice as large as that from a heterozygous site. You may use any test methods introduced in the chapter. How about permutation tests?

Genotype    Phenotype
1           0.53
2           0.78
2           0.81
1           -0.23
1           -0.73
0           0.81
2           0.27
0           2.59
1           1.84
0           0.03

Question 3: Categorize the following methods in the table. The methods are the $\chi^2$ test, G-test, ANOVA, Student's t-test, Pearson's correlation, linear regression, and logistic regression.

case-control phenotype    quantitative phenotype

Question 4: Why is it important to study multiple-locus association? What are the challenges?

Answers to the Exercises can be found in Text S1.
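Returning to the FastANOVA bound of Section 3.1 (Theorem 1 and Table 4), the sketch below evaluates the exact two-locus $SS_B$ and its upper bound for one pair of binary SNPs, so the inequality can be checked numerically. It is only an illustration of the formulas, not the authors' indexed implementation; all function names and the toy genotypes are assumptions.

```python
def ss_between(groups, total_sum, m):
    """SS_B = sum over non-empty groups of T_group^2 / n_group, minus T^2 / M."""
    return sum(sum(g) ** 2 / len(g) for g in groups if g) - total_sum ** 2 / m

def r_term(sorted_values, n1):
    """R_1 or R_2 of Table 4 for a group sorted in ascending order."""
    n = len(sorted_values)
    if n1 == 0 or n1 == n:
        return 0.0  # no split inside the group, so no gain over the single-SNP term
    t = sum(sorted_values)
    lower = sum(sorted_values[:n1])       # l: sum of the n1 smallest values
    upper = sum(sorted_values[n - n1:])   # u: sum of the n1 largest values
    return max((n * lower - n1 * t) ** 2,
               (n * upper - n1 * t) ** 2) / (n1 * (n - n1) * n)

def exact_and_bound(xi, xj, y):
    """Return (SS_B(XiXj, Y), upper bound from Theorem 1)."""
    m, total = len(y), sum(y)
    group_a = sorted(v for a, v in zip(xi, y) if a == 1)
    group_b = sorted(v for a, v in zip(xi, y) if a == 0)
    cells = {(a, b): [] for a in (0, 1) for b in (0, 1)}
    for a, b, v in zip(xi, xj, y):
        cells[(a, b)].append(v)
    exact = ss_between(list(cells.values()), total, m)
    bound = (ss_between([group_a, group_b], total, m)
             + r_term(group_a, len(cells[(1, 1)]))    # n_a1
             + r_term(group_b, len(cells[(0, 1)])))   # n_b1
    return exact, bound

# Hypothetical data: 8 individuals, two binary SNPs, quantitative phenotype.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
xi = [1, 1, 1, 1, 0, 0, 0, 0]
exact, bound = exact_and_bound(xi, [1, 0, 1, 0, 1, 1, 0, 0], y)
tight_exact, tight_bound = exact_and_bound(xi, [1, 1, 0, 0, 1, 1, 0, 0], y)
print(exact, bound, tight_exact, tight_bound)
```

The second pair splits each group into its smallest and largest values, which is the kind of genotype for which the bound is attained; a pruning scheme would skip the exact test whenever the bound falls below the significance threshold.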

PLOS Computational Biology | www.ploscompbiol.org 8 December 2012 | Volume 8 | Issue 12 | e1002828


Further Reading

• Cantor RM, Lange K, Sinsheimer JS (2010) Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am J Hum Genet 86(1): 6–22.
• Cordell HJ (2009) Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet 10(6): 392–404.
• Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, et al. (2009) Finding the missing heritability of complex diseases. Nature 461(7265): 747–753.
• Moore JH, Williams SM (2009) Epistasis and its implications for personal genetics. Am J Hum Genet 85(3): 309–320.
• Phillips PC (2008) Epistasis - the essential role of gene interactions in the structure and evolution of genetic systems. Nat Rev Genet 9(11): 855–867.
• Wang K, Li M, Hakonarson H (2010) Analysing biological pathways in genome-wide association studies. Nat Rev Genet 11: 843–854.

Supporting Information
Text S1 Answers to Exercises
(PDF)

References
1. Churchill GA, Airey DC, Allayee H, Angel JM, Attie AD, et al. (2004) The collaborative cross, a community resource for the genetic analysis of complex traits. Nat Genet 36: 1133–1137.
2. The International HapMap Consortium (2003) The international hapmap project. Nature 426(6968): 789–796.
3. Saxena R, Voight B, Lyssenko V, Burtt N, de Bakker P, et al. (2007) Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science 316: 1331–1336.
4. Scuteri A, Sanna S, Chen W, Uda M, Albai G, et al. (2007) Genome-wide association scan shows genetic variants in the FTO gene are associated with obesity-related traits. PLoS Genet 3(7): e115. doi:10.1371/journal.pgen.0030115
5. The Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.
6. Weedon M, Lettre G, Freathy R, Lindgren C, Voight B, et al. (2007) A common variant of HMGA2 is associated with adult and childhood height in the general population. Nat Genet 39: 1245–1250.
7. Hirschhorn J, Daly M (2005) Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6: 95–108.
8. McCarthy M, Abecasis G, Cardon L, Goldstein D, Little J, et al. (2008) Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9(5): 356–369.
9. Thorisson GA, Smith AV, Krishnan L, Stein LD (2005) The international hapmap project web site. Genome Res 15: 1592.
10. The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073.
11. Balding DJ (2006) A tutorial on statistical methods for population association studies. Nat Rev Genet 7(10): 781–791.
12. Samani NJ, Erdmann J, Hall AS, Hengstenberg C, Mangino M, et al. (2007) Genomewide association analysis of coronary artery disease. N Engl J Med 357: 443–453.
13. Westfall PH, Young SS (1993) Resampling-based multiple testing. Wiley: New York.
14. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol 57(1): 289–300.
15. Zhang X, Zou F, Wang W (2008) FastANOVA: an efficient algorithm for genome-wide association study. KDD 2008: 821–829.
16. Zhang X, Zou F, Wang W (2009) FastChi: an efficient algorithm for analyzing gene-gene interactions. PSB 2009: 528–539.
17. Zhang X, Pan F, Xie Y, Zou F, Wang W (2010) COE: a general approach for efficient genome-wide two-locus epistatic test in disease association study. J Comput Biol 17(3): 401–415.
18. Zhang X, Huang S, Zou F, Wang W (2010) TEAM: Efficient two-locus epistasis tests in human genome-wide association study. Bioinformatics 26(12): 217–227.
19. Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to algorithms. MIT Press and McGraw-Hill.
20. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, et al. (2001) Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet 69: 138–147.
21. Cordell HJ (2002) Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum Mol Genet 11: 2463–2468.
22. Wason J, Dudbridge F (2010) Comparison of multimarker logistic regression models, with application to a genomewide scan of schizophrenia. BMC Genet 11: 80.
23. Yang C, Wan X, Yang Q, Xue H, Tang N, et al. (2011) A hidden two-locus disease association pattern in genome-wide association studies. BMC Bioinformatics 12: 156.
24. Hoh J, Ott J (2003) Mathematical multi-locus approaches to localizing complex human trait genes. Nat Rev Genet 4: 701–709.
25. Musani S, Shriner D, Liu N, Feng R, Coffey C, et al. (2007) Detection of gene×gene interactions in genome-wide association studies of human population data. Hum Hered 63(2): 67–84.



Education

Chapter 11: Genome-Wide Association Studies


William S. Bush1*, Jason H. Moore2
1 Department of Biomedical Informatics, Center for Human Genetics Research, Vanderbilt University Medical School, Nashville, Tennessee, United States of America,
2 Departments of Genetics and Community Family Medicine, Institute for Quantitative Biomedical Sciences, Dartmouth Medical School, Lebanon, New Hampshire, United
States of America

Abstract: Genome-wide association studies (GWAS) have evolved over the last ten years into a powerful tool for investigating the genetic architecture of human disease. In this work, we review the key concepts underlying GWAS, including the architecture of common diseases, the structure of common human genetic variation, technologies for capturing genetic information, study designs, and the statistical methods used for data analysis. We also look forward to the future beyond GWAS.

This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

Citation: Bush WS, Moore JH (2012) Chapter 11: Genome-Wide Association Studies. PLoS Comput Biol 8(12): e1002822. doi:10.1371/journal.pcbi.1002822
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: © 2012 Bush, Moore. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by NIH grants ROI-LM010098, ROI-LM009012, ROI-AI59694, RO1-EY022300, and RO1-LM011360. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: william.s.bush@vanderbilt.edu

1. Important Questions in Human Genetics

A central goal of human genetics is to identify genetic risk factors for common, complex diseases such as schizophrenia and type II diabetes, and for rare Mendelian diseases such as cystic fibrosis and sickle cell anemia. There are many different technologies, study designs and analytical tools for identifying genetic risk factors. We will focus here on the genome-wide association study or GWAS, which measures and analyzes DNA sequence variations from across the human genome in an effort to identify genetic risk factors for diseases that are common in the population. The ultimate goal of GWAS is to use genetic risk factors to make predictions about who is at risk and to identify the biological underpinnings of disease susceptibility for developing new prevention and treatment strategies. One of the early successes of GWAS was the identification of the Complement Factor H gene as a major risk factor for age-related macular degeneration or AMD [1–3]. Not only were DNA sequence variations in this gene associated with AMD, but the biological basis for the effect was demonstrated. Understanding the biological basis of genetic effects will play an important role in developing new pharmacologic therapies.

While understanding the complexity of human health and disease is an important objective, it is not the only focus of human genetics. Accordingly, one of the most successful applications of GWAS has been in the area of pharmacology. Pharmacogenetics has the goal of identifying DNA sequence variations that are associated with drug metabolism and efficacy as well as adverse effects. For example, warfarin is a blood-thinning drug that helps prevent blood clots in patients. Determining the appropriate dose for each patient is important and believed to be partly controlled by genes. A recent GWAS revealed DNA sequence variations in several genes that have a large influence on warfarin dosing [4]. These results, and more recent validation studies, have led to genetic tests for warfarin dosing that can be used in a clinical setting. This type of genetic test has given rise to a new field called personalized medicine that aims to tailor healthcare to individual patients based on their genetic background and other biological features. The widespread availability of low-cost technology for measuring an individual's genetic background has been harnessed by businesses that are now marketing genetic testing directly to the consumer. Genome-wide association studies, for better or for worse, have ushered in the exciting era of personalized medicine and personal genetic testing. The goal of this chapter is to introduce and review GWAS technology, study design and analytical strategies as an important example of translational bioinformatics. We focus here on the application of GWAS to common diseases that have a complex multifactorial etiology.

2. Concepts Underlying the Study Design

2.1 Single Nucleotide Polymorphisms

The modern unit of genetic variation is the single nucleotide polymorphism or SNP. SNPs are single base-pair changes in the DNA sequence that occur with high frequency in the human genome [5]. For the purposes of genetic studies, SNPs are typically used as markers of a genomic region, with the large majority of them having a minimal impact on biological systems. SNPs can have functional consequences, however, causing amino acid changes, changes to mRNA transcript stability, and changes to transcription factor binding affinity [6]. SNPs are by far the most abundant form of genetic variation in the human genome.

SNPs are notably a type of common genetic variation; many SNPs are present in a large proportion of human populations [7]. SNPs typically have two alleles, meaning that within a population there are two commonly occurring base-pair possibilities for a SNP location. The frequency of a SNP is given in terms of the minor allele frequency, that is, the frequency of the less common allele. For example, a SNP with a minor allele (G) frequency of 0.40 implies that 40% of the allele copies (chromosomes) in a population are G, while the more common allele (the major allele) makes up the remaining 60%.
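The minor allele frequency described in Section 2.1 is counted over allele copies, two per individual for an autosomal SNP. A minimal sketch, assuming genotypes arrive as two-letter strings and using a hypothetical function name and made-up sample:

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return (minor_allele, frequency) for one biallelic SNP.

    genotypes: one two-character string per individual, e.g. "AG".
    Each individual contributes two allele copies to the count.
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) != 2:
        raise ValueError("expected exactly two alleles at a biallelic SNP")
    (_major, _major_count), (minor, minor_count) = counts.most_common(2)
    return minor, minor_count / sum(counts.values())


# Hypothetical sample of ten individuals (twenty allele copies, eight of them G).
sample = ["AA", "AA", "AA", "AA", "AG", "AG", "AG", "AG", "GG", "GG"]
allele, maf = minor_allele_frequency(sample)
print(allele, maf)
```

Note that six of the ten individuals in this made-up sample carry at least one G even though the allele frequency is 0.40, which is why allele frequency and carrier frequency should not be used interchangeably.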

PLOS Computational Biology | www.ploscompbiol.org 1 December 2012 | Volume 8 | Issue 12 | e1002822


What to Learn in This Chapter

• Basic genetic concepts that drive genome-wide association studies
• Genotyping technologies and common study designs
• Statistical concepts for GWAS analysis
• Replication, interpretation, and follow-up of association results

Commonly occurring SNPs lie in stark contrast to genetic variants that are implicated in more rare genetic disorders, such as cystic fibrosis [8]. These conditions are largely caused by extremely rare genetic variants that ultimately induce a detrimental change to protein function, which leads to the disease state. Variants with such low frequency in the population are sometimes referred to as mutations, though they can be structurally equivalent to SNPs (single base-pair changes in the DNA sequence). In the genetics literature, the term SNP is generally applied to common single base-pair changes, and the term mutation is applied to rare genetic variants.

2.2 Failures of Linkage for Complex Disease

Cystic fibrosis (and most rare genetic disorders) can be caused by multiple different genetic variants within a single gene. Because the effect of the genetic variants is so strong, cystic fibrosis follows an autosomal recessive inheritance pattern in families with the disorder. One of the major successes of human genetics was the identification of multiple mutations in the CFTR gene as the cause of cystic fibrosis [8]. This was achieved by genotyping families affected by cystic fibrosis using a collection of genetic markers across the genome, and examining how those genetic markers segregate with the disease across multiple families. This technique, called linkage analysis, was subsequently applied successfully to identify genetic variants that contribute to rare disorders like Huntington disease [9]. When applied to more common disorders, like heart disease or various forms of cancer, linkage analysis has not fared as well. This implies that the genetic mechanisms that influence common disorders are different from those that cause rare disorders [10].

2.3 Common Disease Common Variant Hypothesis

The idea that common diseases have a different underlying genetic architecture than rare disorders, coupled with the discovery of several susceptibility variants for common disease with high minor allele frequency (including alleles in the apolipoprotein E or APOE gene for Alzheimer's disease [11] and the PPARγ gene in type II diabetes [12]), led to the development of the common disease/common variant (CD/CV) hypothesis [13].

This hypothesis states simply that common disorders are likely influenced by genetic variation that is also common in the population. There are several key ramifications of this for the study of complex disease. First, if common genetic variants influence disease, the effect size (or penetrance) for any one variant must be small relative to that found for rare disorders. For example, if a SNP with 40% frequency in the population causes a highly deleterious amino acid substitution that directly leads to a disease phenotype, nearly 40% of the population would have that phenotype. Thus, the allele frequency and the population prevalence are completely correlated. If, however, that same SNP caused a small change in gene expression that alters risk for a disease by some small amount, the prevalence of the disease and the influential allele would be only slightly correlated. As such, common variants almost by definition cannot have high penetrance.

Secondly, if common alleles have small genetic effects (low penetrance), but common disorders show heritability (inheritance in families), then multiple common alleles must influence disease susceptibility. For example, twin studies might estimate the heritability of a common disease to be 40%, that is, 40% of the total variance in disease risk is due to genetic factors. If the allele of a single SNP incurs only a small degree of disease risk, that SNP only explains a small proportion of the total variance due to genetic factors. As such, the total genetic risk due to common genetic variation must be spread across multiple genetic factors. These two points suggest that traditional family-based genetic studies are not likely to be successful for complex diseases, prompting a shift toward population-based studies.

The frequency with which an allele occurs in the population and the risk incurred by that allele for complex diseases are key components to consider when planning a genetic study, impacting the technology needed to gather genetic information and the sample size needed to discover statistically significant genetic effects. The spectrum of potential genetic effects is sometimes visualized and partitioned by effect size and allele frequency (figure 1). Genetic effects in the upper left are more amenable to smaller family-based studies and linkage analysis, and may require genotyping relatively few genetic markers. Effects in the lower right are typical of findings from GWAS, requiring large sample sizes and a large panel of genetic markers. Effects in the upper right, most notably CFH, have been identified using both linkage analysis and GWAS. Effects in the lower left are perhaps the most difficult challenge, requiring genomic sequencing of large samples to associate rare variants with disease.

Over the last five years, the common disease/common variant hypothesis has been tested for a variety of common diseases, and while much of the heritability for these conditions is not yet explained, common alleles certainly play a role in susceptibility. The National Human Genome Research Institute GWAS catalog (http://www.genome.gov/gwastudies) lists over 3,600 SNPs identified for common diseases or traits, and in general, common diseases have multiple susceptibility alleles, each with small effect sizes (typically increasing disease risk to between 1.2 and 2 times the population risk) [14]. From these results we can say that for most common diseases, the CD/CV hypothesis is true, though it should not be assumed that the entire genetic component of any common disease is due to common alleles only.

3. Capturing Common Variation

3.1 The Human Haplotype Map Project

To test the common disease/common variant hypothesis for a phenotype, a systematic approach is needed to interrogate much of the common variation in the human genome. First, the location and density of commonly occurring SNPs are needed to identify the genomic regions and individual sites that must be examined by genetic studies. Secondly, population-specific differences in genetic variation must be cataloged so that studies of phenotypes in different populations can be conducted with the proper design. Finally, correlations among common genetic variants must be determined so that genetic studies do not collect redundant information. The International HapMap Project was designed to identify variation



Figure 1. Spectrum of Disease Allele Effects. Disease associations are often conceptualized in two dimensions: allele frequency and effect size.
Highly penetrant alleles for Mendelian disorders are extremely rare with large effect sizes (upper left), while most GWAS findings are associations of
common SNPs with small effect sizes (lower right). The bulk of discovered genetic associations lie on the diagonal denoted by the dashed lines.
doi:10.1371/journal.pcbi.1002822.g001
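The argument in Section 2.3 that heritability must be spread across many common alleles can be made concrete with a standard quantitative-genetics approximation. The sketch below is a simplification rather than the chapter's own calculation: it treats the trait as quantitative, assumes Hardy-Weinberg proportions and a purely additive per-allele effect, and the function names and the effect size of 0.05 standard deviations are hypothetical.

```python
import math


def variance_explained(allele_freq, beta):
    """Fraction of trait variance explained by one additive SNP.

    allele_freq: frequency p of the effect allele.
    beta: per-allele effect, in phenotypic standard deviation units.
    Under Hardy-Weinberg proportions the 0/1/2 allele count has variance 2p(1-p),
    so the SNP contributes 2p(1-p) * beta**2 of the total variance.
    """
    return 2.0 * allele_freq * (1.0 - allele_freq) * beta ** 2


def snps_needed(heritability, allele_freq, beta):
    """Number of independent SNPs of this size needed to account for the heritability."""
    return math.ceil(heritability / variance_explained(allele_freq, beta))


# Hypothetical common variant: frequency 0.40, effect of 0.05 SD per allele.
per_snp = variance_explained(allele_freq=0.40, beta=0.05)
print("variance explained by one SNP:", per_snp)
print("SNPs needed for 40% heritability:", snps_needed(0.40, 0.40, 0.05))
```

Under these assumptions a single common SNP of small effect accounts for only a fraction of a percent of the variance, so hundreds of such variants would be required to reach a heritability of 40%.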

across the genome and to characterize correlations among variants.

The International HapMap Project used a variety of sequencing techniques to discover and catalog SNPs in European descent populations, the Yoruba population of African origin, Han Chinese individuals from Beijing, and Japanese individuals from Tokyo [15,16]. The project has since been expanded to include 11 human populations, with genotypes for 1.6 million SNPs [7]. HapMap genotype data allowed the examination of linkage disequilibrium.

3.2 Linkage Disequilibrium

Linkage disequilibrium (LD) is a property of SNPs on a contiguous stretch of genomic sequence that describes the degree to which an allele of one SNP is inherited or correlated with an allele of another SNP within a population. The term linkage disequilibrium was coined by population geneticists in an attempt to mathematically describe changes in genetic variation within a population over time. It is related to the concept of chromosomal linkage, where two markers on a chromosome remain physically joined on a chromosome through generations of a family. In figure 2, two founder chromosomes are shown (one in blue and one in orange). Recombination events within a family from generation to generation break apart chromosomal segments. This effect is amplified through generations, and in a population of fixed size undergoing random mating, repeated random recombination events will break apart segments of contiguous chromosome (containing linked alleles) until eventually all alleles in the population are in linkage equilibrium, that is, independent. Thus, linkage between markers on a population scale is referred to as linkage disequilibrium.

The rate of LD decay is dependent on multiple factors, including the population size, the number of founding chromosomes in the population, and the number of generations for which the population has existed. As such, different human subpopulations have different degrees and patterns of LD. African-descent populations are the most ancestral and have smaller regions of LD due to the accumulation of more recombination events in that group. European-descent and Asian-descent populations were created by founder events (a sampling of chromosomes from the African population), which altered the number of founding chromosomes, the population size, and the generational age of the population. These populations on average have larger regions of LD than African-descent groups.

Many measures of LD have been proposed [17], though all are ultimately related to the difference between the observed frequency of co-occurrence for two alleles (i.e. a two-marker haplotype) and the frequency expected if the two markers are independent. The two commonly used measures of linkage disequilibrium are $D'$ and $r^2$ [15,17], shown in equations 1 and 2. In these equations, $p_{12}$ is the frequency of the ab haplotype, $p_{1\cdot}$ is



Figure 2. Linkage and Linkage Disequilibrium. Within a family, linkage occurs when two genetic markers (points on a chromosome) remain
linked on a chromosome rather than being broken apart by recombination events during meiosis, shown as red lines. In a population, contiguous
stretches of founder chromosomes from the initial generation are sequentially reduced in size by recombination events. Over time, a pair of markers
or points on a chromosome in the population move from linkage disequilibrium to linkage equilibrium, as recombination events eventually occur
between every possible point on the chromosome.
doi:10.1371/journal.pcbi.1002822.g002
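The decay from linkage disequilibrium toward linkage equilibrium pictured in figure 2 has a simple closed form in the idealized case. The sketch below uses the textbook result that, under random mating in a very large population, the disequilibrium coefficient between two markers shrinks by a factor of (1 - c) per generation, where c is the recombination fraction between them. It deliberately ignores the population size, founder, and drift effects that the text identifies as important in real populations, and the function names are hypothetical.

```python
import math


def ld_after(d0, recomb_fraction, generations):
    """Disequilibrium coefficient after a number of generations of random mating.

    d0: starting disequilibrium coefficient D between the two markers.
    recomb_fraction: recombination fraction c between the markers (0 to 0.5).
    Implements D_t = D_0 * (1 - c) ** t, the infinite-population idealization.
    """
    if not 0.0 <= recomb_fraction <= 0.5:
        raise ValueError("recombination fraction must lie between 0 and 0.5")
    return d0 * (1.0 - recomb_fraction) ** generations


def generations_to_reach(recomb_fraction, fraction=0.5):
    """Generations needed for D to fall to the given fraction of its starting value."""
    if not 0.0 < recomb_fraction <= 0.5:
        raise ValueError("recombination fraction must be greater than 0 and at most 0.5")
    return math.log(fraction) / math.log(1.0 - recomb_fraction)


# Unlinked markers (c = 0.5) lose half of their disequilibrium every generation,
# while tightly linked markers (c = 0.01) keep it for many generations.
print(ld_after(0.25, 0.5, 1))
print(generations_to_reach(0.01))
```

The contrast between the two calls illustrates why nearby SNPs stay correlated long enough to be useful as markers while distant ones do not.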

the frequency of the a allele, and $p_{\cdot 2}$ is the frequency of the b allele. (Equations 1 and 2 label the alleles explicitly: $p_{AB}$, $p_{Ab}$, $p_{aB}$, and $p_{ab}$ are two-marker haplotype frequencies, and $p_A$, $p_a$, $p_B$, and $p_b$ are allele frequencies.)

$$
D' =
\begin{cases}
\dfrac{p_{AB}\,p_{ab} - p_{Ab}\,p_{aB}}{\min(p_A p_b,\; p_a p_B)} & \text{if } p_{AB}\,p_{ab} - p_{Ab}\,p_{aB} > 0 \\[2ex]
\dfrac{p_{AB}\,p_{ab} - p_{Ab}\,p_{aB}}{\min(p_A p_B,\; p_a p_b)} & \text{if } p_{AB}\,p_{ab} - p_{Ab}\,p_{aB} < 0
\end{cases}
\tag{1}
$$

$$
r^2 = \frac{(p_{AB}\,p_{ab} - p_{Ab}\,p_{aB})^2}{p_A\, p_B\, p_a\, p_b}
\tag{2}
$$

$D'$ is a population genetics measure that is related to recombination events between markers and is scaled between 0 and 1. A $D'$ value of 0 indicates complete linkage equilibrium, which implies frequent recombination between the two markers and statistical independence under principles of Hardy-Weinberg equilibrium. A $D'$ of 1 indicates complete LD, indicating no recombination between the two markers within the population. For the purposes of genetic analysis, LD is generally reported in terms of $r^2$, a statistical measure of correlation. High $r^2$ values indicate that two SNPs convey similar information, as one allele of the first SNP is often observed with one allele of the second SNP, so only one of the two SNPs needs to be genotyped to capture the allelic variation. There are dependencies between these two statistics; $r^2$ is sensitive to the allele frequencies of the two markers, and can only be high in regions of high $D'$.

One often forgotten issue associated with LD measures is that current technology does not allow direct measurement of haplotype frequencies from a sample, because each SNP is genotyped independently and the phase or chromosome of origin for each allele is unknown. Many well-developed and documented methods for inferring haplotype phase and estimating the subsequent two-marker haplotype frequencies exist, and generally lead to reasonable results [18].

SNPs that are selected specifically to capture the variation at nearby sites in the genome are called tag SNPs because alleles for these SNPs tag the surrounding stretch of LD. As noted before, patterns of LD are population specific and as such, tag SNPs selected for one population may not work well for a different population. LD is exploited to optimize genetic studies by avoiding the genotyping of SNPs that provide redundant information. Based on analysis of data from the HapMap project, >80% of commonly occurring SNPs in European descent populations can be captured using a subset of 500,000 to one million SNPs scattered across the genome [19].

3.3 Indirect Association

The presence of LD creates two possible positive outcomes from a genetic association study. In the first outcome, the SNP influencing a biological system that ultimately leads to the phenotype is directly genotyped in the study and found to be statistically associated with the trait. This is referred to as a direct association, and the genotyped SNP is sometimes referred to as the functional SNP. The second possibility is that the influential SNP is not directly typed, but instead a tag SNP in high LD with the influential SNP is typed and statistically associated with the phenotype (figure 3). This is referred to as an indirect association [10]. Because of these two possibilities, a significant SNP association from a GWAS should not be assumed to be the causal variant and may require



Figure 3. Indirect Association. Genotyped SNPs often lie in a region of high linkage disequilibrium with an influential allele. The genotyped SNP
will be statistically associated with disease as a surrogate for the disease SNP through an indirect association.
doi:10.1371/journal.pcbi.1002822.g003
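How well a genotyped tag SNP stands in for an influential SNP, as in figure 3, is usually judged with the LD measures of equations 1 and 2. The sketch below computes both from the four two-marker haplotype frequencies. It assumes phased haplotype frequencies are already available (in practice they are estimated, as noted in Section 3.2), the function name is hypothetical, and because the text describes the first measure on a 0 to 1 scale, the absolute value is returned, which is an assumption about how equation 1 is meant to be read.

```python
def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    """Return (D prime, r squared) from two-marker haplotype frequencies.

    The four frequencies must sum to 1. Allele frequencies are obtained
    as marginal sums of the haplotype frequencies.
    """
    if abs(p_AB + p_Ab + p_aB + p_ab - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A, p_a = p_AB + p_Ab, p_aB + p_ab
    p_B, p_b = p_AB + p_aB, p_Ab + p_ab
    if min(p_A, p_a, p_B, p_b) <= 0.0:
        raise ValueError("both markers must be polymorphic")

    d = p_AB * p_ab - p_Ab * p_aB  # numerator shared by equations 1 and 2
    if d > 0:
        d_prime = d / min(p_A * p_b, p_a * p_B)
    elif d < 0:
        d_prime = d / min(p_A * p_B, p_a * p_b)
    else:
        d_prime = 0.0
    r_squared = d * d / (p_A * p_B * p_a * p_b)
    # Assumption: report the magnitude so the value lies between 0 and 1.
    return abs(d_prime), r_squared


# Hypothetical haplotype frequencies: the aB haplotype is never observed.
d_prime, r_squared = ld_measures(p_AB=0.5, p_Ab=0.1, p_aB=0.0, p_ab=0.4)
print(d_prime, r_squared)
```

In this made-up example one of the four haplotypes is absent, so the first measure reaches its maximum while the second stays well below 1 because the two markers have different allele frequencies. This is the dependency the text describes: a high correlation requires complete or near-complete LD, but complete LD alone does not guarantee that one SNP is a good proxy for the other.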

additional studies to map the precise location of the influential SNP.

Conceptually, the end result of GWAS under the common disease/common variant hypothesis is that a panel of 500,000 to one million markers will identify common SNPs that are associated with common phenotypes. To conduct such a study practically requires a genotyping technology that can accurately capture the alleles of 500,000 to one million SNPs for each individual in a study in a cost-effective manner.

4. Genotyping Technologies

Genome-wide association studies were made possible by the availability of chip-based microarray technology for assaying one million or more SNPs. Two primary platforms have been used for most GWAS. These include products from Illumina (San Diego, CA) and Affymetrix (Santa Clara, CA). These two competing technologies have been recently reviewed [20] and offer different approaches to measure SNP variation. For example, the Affymetrix platform prints short DNA sequences as a spot on the chip that recognizes a specific SNP allele. Alleles (i.e. nucleotides) are detected by differential hybridization of the sample DNA. Illumina, on the other hand, uses a bead-based technology with slightly longer DNA sequences to detect alleles. The Illumina chips are more expensive to make but provide better specificity.

Aside from the technology, another important consideration is the SNPs that each platform has selected for assay. This can be important depending on the specific human population being studied. For example, it is important to use a chip that has more SNPs with better overall genomic coverage for a study of Africans than Europeans. This is because African genomes have had more time to recombine and therefore have less LD between alleles at different SNPs. More SNPs are needed to capture the variation across the African genome.

It is important to note that the technology for measuring genomic variation is changing rapidly. Chip-based genotyping platforms such as those briefly mentioned above will likely be replaced over the next few years with inexpensive new technologies for sequencing the entire genome. These next-generation sequencing methods will provide all the DNA sequence variation in the genome. It is time now to retool for this new onslaught of data.

5. Study Design

Regardless of assumptions about the genetic model of a trait, or the technology used to assess genetic variation, no genetic study will have meaningful results without a thoughtful approach to characterize the phenotype of interest. When embarking on a genetic study, the initial focus should be on identifying precisely what quantity or trait genetic variation influences.

5.1 Case Control versus Quantitative Designs

There are two primary classes of phenotypes: categorical (often binary case/control) or quantitative. From the statistical perspective, quantitative traits are preferred because they improve power to detect a genetic effect, and often have a more interpretable outcome. For some disease traits of interest, quantitative disease risk factors have already been identified. High-density lipoprotein (HDL) and low-density lipoprotein (LDL) cholesterol levels are strong predictors of heart disease, and so genetic studies of heart disease outcomes can be conducted by examining these levels as a quantitative trait. Assays for HDL and LDL levels, being already useful for clinical practice, are precise and ubiquitous measurements that are easy to obtain. Genetic variants that influence these levels have a clear interpretation: for example, a unit change in LDL level per allele or by genotype class. With an easily measurable, ubiquitous quantitative trait, GWAS of blood lipids have been conducted in numerous cohort studies. Their results were also easily combined to conduct an extremely well-powered massive meta-analysis, which revealed 95 loci associated with lipid traits in more than 100,000 people [21]. Here, HDL and LDL may be the primary traits of interest or can be considered intermediate quantitative traits or endophenotypes for cardiovascular disease.

Other disease traits do not have well-established quantitative measures. In these circumstances, individuals are usually classified as either affected or unaffected, a binary categorical variable. Consider the vast difference in measurement error associated with classifying individuals as either case or control versus precisely measuring a quantitative trait. For example, multiple sclerosis is a complex clinical phenotype that is often diagnosed over a long period of time by ruling out other possible conditions. However, despite the loose classification of case and control, GWAS of multiple sclerosis have been enormously successful, implicating more than 10 new genes for the disorder [22]. So while quantitative outcomes are preferred, they are not required for a successful study.

5.2 Standardized Phenotype Criteria

A major component of the success with multiple sclerosis and other well-conducted case/control studies is the definition of rigorous phenotype criteria, usually presented as a rule list based on clinical variables. Multiple sclerosis studies often use the McDonald criteria for establishing case/control status and defining clinical subtypes [23]. Standardized methods like the McDonald criteria establish a concise, evidence-based approach that can be uniformly applied by multiple diagnosing clinicians to ensure that consistent pheno-



type definitions are used for a genetic study.

Standardized phenotype rules are particularly critical for multi-center studies to prevent introducing a site-based effect into the study. Even when established phenotype criteria are used, there may be variability among clinicians in how those criteria are used to assign case/control status. Furthermore, some quantitative traits are susceptible to bias in measurement. For example, with cataract severity, lens photographs are used to assign cases to one of three types of lens opacity. In situations where there may be disagreement among clinicians, a subset of study records is often examined by clinicians at multiple centers to assess interrater agreement as a measure of phenotyping consistency [24]. High interrater agreement means that phenotype rules are being consistently applied across multiple sites, whereas low agreement suggests that criteria are not uniformly interpreted or applied, and may indicate a need to establish narrower phenotype criteria.

5.3 Phenotype Extraction from Electronic Medical Records

The last few years of genetic research have seen the growth of large clinical bio-repositories that are linked to electronic medical records (EMRs) [25]. The development of these resources will certainly advance the state of human genetics research and foster integration of genetic information into clinical practice. From a study design perspective, identifying phenotypes from EMRs can be challenging. Electronic medical records were established for clinical care and administrative purposes, not for research. As such, idiosyncrasies arise due to billing practices and other logistical reasons, and great care must be taken not to introduce biases into a genetic study.

The established methodology for conducting electronic phenotyping is to devise an initial selection algorithm (using structured EMR fields, such as billing codes, or text mining procedures on unstructured text), which identifies a record subset from the bio-repository. In cases where free text is parsed, natural language processing (NLP) is used in conjunction with a controlled vocabulary such as the Unified Medical Language System (UMLS) to relate text to more structured and uniform medical concepts. In some instances, billing codes alone may be sufficient to accurately identify individuals with a particular phenotype, but often combinations of billing and procedure codes, along with free text, are necessary. Because every medical center has its own set of policies, care providers, and health insurance providers, some algorithms developed in one clinical setting may not work as well in another.

Once a manageable subset of records is obtained by an algorithm, the accuracy of the results is examined by clinicians or other phenotype experts, whose review serves as a gold standard for comparison. The positive predictive value (PPV) of the initial algorithm is assessed, and based on feedback from case reviewers, the selection algorithm is refined. This process of case review followed by algorithmic refinement is continued until the desired PPV is reached.

This approach has been validated by replicating established genotype-phenotype relationships using EMR-derived phenotypes [16], and has been applied to multiple clinical and pharmacogenomic conditions [26-28].

6. Association Test

6.1 Single Locus Analysis

When a well-defined phenotype has been selected for a study population, and genotypes are collected using sound techniques, the statistical analysis of genetic data can begin. The de facto analysis of genome-wide association data is a series of single-locus statistical tests, examining each SNP independently for association to the phenotype. The statistical test conducted depends on a variety of factors, but first and foremost, statistical tests are different for quantitative traits versus case/control studies.

Quantitative traits are generally analyzed using generalized linear model (GLM) approaches, most commonly the Analysis of Variance (ANOVA), which is similar to linear regression with a categorical predictor variable, in this case genotype classes. The null hypothesis of an ANOVA using a single SNP is that there is no difference between the trait means of any genotype group. The assumptions of GLM and ANOVA are 1) the trait is normally distributed; 2) the trait variance within each group is the same (the groups are homoskedastic); 3) the groups are independent.

Dichotomous case/control traits are generally analyzed using either contingency table methods or logistic regression. Contingency table tests examine and measure the deviation from independence that is expected under the null hypothesis that there is no association between the phenotype and genotype classes. The most ubiquitous form of this test is the popular chi-square test (and the related Fisher's exact test).

Logistic regression is an extension of linear regression where the outcome of a linear model is transformed using a logistic function that predicts the probability of having case status given a genotype class. Logistic regression is often the preferred approach because it allows for adjustment for clinical covariates (and other factors), and can provide adjusted odds ratios as a measure of effect size. Logistic regression has been extensively developed, and numerous diagnostic procedures are available to aid interpretation of the model.

For both quantitative and dichotomous trait analysis (regardless of the analysis method), there are a variety of ways that genotype data can be encoded or shaped for association tests. The choice of data encoding can have implications for the statistical power of a test, as the degrees of freedom for the test may change depending on the number of genotype-based groups that are formed. Allelic association tests examine the association between one allele of the SNP and the phenotype. Genotypic association tests examine the association between genotypes (or genotype classes) and the phenotype. The genotypes for a SNP can also be grouped into genotype classes or models, such as dominant, recessive, multiplicative, or additive models [29].

Each model makes different assumptions about the genetic effect in the data. Assuming two alleles for a SNP, A and a, a dominant model (for A) assumes that having one or more copies of the A allele increases risk compared to a (i.e. Aa or AA genotypes have higher risk). The recessive model (for A) assumes that two copies of the A allele are required to alter risk, so individuals with the AA genotype are compared to individuals with Aa and aa genotypes. The multiplicative model (for A) assumes that if there is a 3× risk for having a single A allele, there is a 9× risk for having two copies of the A allele: in this case if the risk for Aa is k, the risk for AA is k². The additive model (for A) assumes that there is a uniform, linear increase in risk for each copy of the A allele, so if the risk is 3× for Aa, there is a 6× risk for AA; in this case the risk for Aa is k and the risk for AA is 2k. A common practice for GWAS is to examine additive models only, as the additive model has reasonable power to detect both additive and dominant effects, but it is important to note that an additive model may be underpowered to detect some recessive effects [30]. Rather than
choosing one model a priori, some studies evaluate multiple genetic models coupled with an appropriate correction for multiple testing.

6.2 Covariate Adjustment and Population Stratification

In addition to selecting an encoding scheme, statistical tests should be adjusted for factors that are known to influence the trait, such as sex, age, study site, and known clinical covariates. Covariate adjustment reduces spurious associations due to sampling artifacts or biases in study design, but adjustment comes at the price of using additional degrees of freedom, which may impact statistical power. One of the more important covariates to consider in genetic analysis is a measure of population substructure. There are often known differences in phenotype prevalence due to ethnicity, and allele frequencies are highly variable across human subpopulations, meaning that in a sample with multiple ethnicities, ethnic-specific SNPs will likely be associated with the trait due to population stratification. To prevent population stratification, the ancestry of each sample in the dataset is measured using STRUCTURE [31] or EIGENSTRAT [32], methods that compare genome-wide allele frequencies to those of HapMap ethnic groups. The results of these analyses can be used to exclude samples with similarity to a non-target population, or they can be used as a covariate in association analysis. EIGENSTRAT is commonly used in this circumstance, where principal component analysis is used to generate principal component values that could be described as an ethnicity score. When used as covariates, these scores adjust for minute ancestry effects in the data.

6.3 Corrections for Multiple Testing

A p-value, which is the probability of seeing a test statistic equal to or greater than the observed test statistic if the null hypothesis is true, is generated for each statistical test. This effectively means that lower p-values indicate that if there is no association, the chance of seeing this result is extremely small.

Statistical tests are generally called significant and the null hypothesis is rejected if the p-value falls below a predefined alpha value, which is nearly always set to 0.05. This means that when the null hypothesis is in fact true, it is rejected 5% of the time and we detect a false positive. This probability is relative to a single statistical test; in the case of GWAS, hundreds of thousands to millions of tests are conducted, each one with its own false positive probability. The cumulative likelihood of finding one or more false positives over the entire GWAS analysis is therefore much higher. For a somewhat morbid analogy, consider the probability of having a car accident. If you drive your car today, the probability of having an accident is fairly low. However, if you drive every day for the next five years, the probability of you having one or more accidents over that time is much higher than the probability of having one today.

One of the simplest approaches to correct for multiple testing is the Bonferroni correction. The Bonferroni correction adjusts the alpha value from α = 0.05 to α = (0.05/k) where k is the number of statistical tests conducted. For a typical GWAS using 500,000 SNPs, statistical significance of a SNP association would be set at 1e-7. This correction is the most conservative, as it assumes that each association test of the 500,000 is independent of all other tests, an assumption that is generally untrue due to linkage disequilibrium among GWAS markers.

An alternative to adjusting the false positive rate (alpha) is to determine the false discovery rate (FDR). The false discovery rate is an estimate of the proportion of significant results (usually at alpha = 0.05) that are false positives. Under the null hypothesis that there are no true associations in a GWAS dataset, p-values for association tests would follow a uniform distribution (evenly distributed from 0 to 1). Originally developed by Benjamini and Hochberg, FDR procedures essentially correct for this number of expected false discoveries, providing an estimate of the number of true results among those called significant [33]. These techniques have been widely applied to GWAS and extended in a variety of ways [34].

Permutation testing is another approach for establishing significance in GWAS. While somewhat computationally intensive, permutation testing is a straightforward way to generate the empirical distribution of test statistics for a given dataset when the null hypothesis is true. This is achieved by randomly reassigning the phenotypes of each individual to another individual in the dataset, effectively breaking the genotype-phenotype relationship of the dataset. Each random reassignment of the data represents one possible sampling of individuals under the null hypothesis, and this process is repeated a predefined number of times N to generate an empirical distribution with resolution N, so a permutation procedure with an N of 1000 gives an empirical p-value with a resolution of 1/1000. Several software packages have been developed to perform permutation testing for GWAS studies, including the popular PLINK software [35], PRESTO [36], and PERMORY [37].

Another commonly used approach is to rely on the concept of genome-wide significance. Based on the distribution of LD in the genome for a specific population, there is an effective number of independent genomic regions, and thus an effective number of statistical tests that should be corrected for. For European-descent populations, this threshold has been estimated at 7.2e-8 [38]. This reasonable approach should be used with caution, however, as the only scenario where this correction is appropriate is when hypotheses are tested on the genome scale. Candidate gene studies or replication studies with a focused hypothesis do not require correction to this level, as the number of effective, independent statistical tests is much, much lower than what is assumed for genome-wide significance.

6.4 Multi-Locus Analysis

In addition to single-locus analyses, genome-wide association studies provide an enormous opportunity to examine interactions among genetic variants throughout the genome. Multi-locus analysis, however, is not nearly as straightforward as conducting single-locus tests, and presents numerous computational, statistical, and logistical challenges [39].

Because most GWAS genotype between 500,000 and one million SNPs, examining all pair-wise combinations of SNPs is a computationally intractable approach, even for highly efficient algorithms. One approach to this issue is to reduce or filter the set of genotyped SNPs, eliminating redundant information. A simple and common way to filter SNPs is to select a set of results from a single-SNP analysis based on an arbitrary significance threshold and exhaustively evaluate interactions in that subset. This can be perilous, however, as selecting SNPs to analyze based on main effects will prevent certain multi-locus models from being detected: so-called purely epistatic models with statistically undetectable marginal effects. With these models, a large component of the heritability is concentrated in the interaction rather than in the main effects. In other words, a specific combination of markers
(and only the combination of markers) incurs a significant change in disease risk. The benefits of this analysis are that it performs an unbiased analysis for interactions within the selected set of SNPs. It is also far more computationally and statistically tractable than analyzing all possible combinations of markers.

Another strategy is to restrict examination of SNP combinations to those that fall within an established biological context, such as a biochemical pathway or a protein family. As these techniques rely on electronic repositories of structured biomedical knowledge, they generally couple a bioinformatics engine that generates SNP-SNP combinations with a statistical method that evaluates combinations in the GWAS dataset. For example, the Biofilter approach uses a variety of public data sources with logistic regression and multifactor dimensionality reduction methods [40,41]. Similarly, INTERSNP uses logistic regression, log-linear, and contingency table approaches to assess SNP-SNP interaction models [42].

7. Replication and Meta-Analysis

7.1 Statistical Replication

The gold standard for validation of any genetic study is replication in an additional independent sample. That said, there are a variety of criteria involved in defining replication of a GWAS result. This was the subject of an NHGRI working group, which outlined several criteria for establishing a positive replication [43]. These criteria are discussed in the following paragraphs.

Replication studies should have sufficient sample size to detect the effect of the susceptibility allele. Often, the effects identified in an initial GWAS suffer from winner's curse, where the detected effect is likely stronger in the GWAS sample than in the general population [44]. This means that replication samples should ideally be larger to account for the over-estimation of effect size. With replication, it is important for the study to be well-powered to identify spuriously associated SNPs where the null hypothesis is most likely true; in other words, to confidently call the initial GWAS result a false positive.

Replication studies should be conducted in an independent dataset drawn from the same population as the GWAS, in an attempt to confirm the effect in the GWAS target population. Once an effect is confirmed in the target population, other populations may be sampled to determine if the SNP has an ethnic-specific effect. Replication of a significant result in an additional population is sometimes referred to as generalization, meaning the genetic effect is of general relevance to multiple human populations.

Identical phenotype criteria should be used in both GWAS and replication studies. Replication of a GWAS result should be thought of as the replication of a specific statistical model: a given SNP predicts a specific phenotype effect. Using even slightly different phenotype definitions between GWAS and replication studies can cloud the interpretation of the final result.

A similar effect should be seen in the replication set from the same SNP, or a SNP in high LD with the GWAS-identified SNP. Because GWAS typically use SNPs that are markers that were chosen based on LD patterns, it is difficult to say what SNP within the larger genomic region is mechanistically influencing disease risk. With this in mind, the unit of replication for a GWAS should be the genomic region, and all SNPs in high LD are potential replication candidates. However, continuity of effect should be demonstrated across both studies, with the magnitude and direction of effect being similar for the genomic region in both datasets. If SNPs in high LD are used to demonstrate the effect in replication, the direction of effect must be determined using a reference panel to determine two-SNP haplotype frequencies. For example, if allele A is associated in the GWAS with an odds ratio of 1.5, and allele T of a nearby SNP is associated in the replication set with an odds ratio of 1.46, it must be demonstrated that allele A and allele T carry effects in the same direction. The most straightforward way to assess this is to examine a reference panel, such as the HapMap data, for a relevant population. If this panel shows that allele A from SNP 1 and allele T from SNP 2 form a two-marker haplotype in 90% of the sample, then this is a reasonable assumption. If however the panel shows that allele A from SNP 1 and allele A from SNP 2 form the predominant two-marker haplotype, the effect has probably flipped in the replication set. Mapping the effect through the haplotype would be equivalent to observing an odds ratio of 1.5 in the GWAS and 0.685 in the replication set.

In brief, the general strategy for a replication study is to repeat the ascertainment and design of the GWAS as closely as possible, but examine only specific genetic effects found significant in the GWAS. Effects that are consistent across the two studies can be labeled replicated effects.

7.2 Meta-Analysis of Multiple Analysis Results

The results of multiple GWAS studies can be pooled together to perform a meta-analysis. Meta-analysis techniques were originally developed to examine and refine significance and effect size estimates from multiple studies examining the same hypothesis in the published literature. With the development of large academic consortia, meta-analysis approaches allow the synthesis of results from multiple studies without requiring the transfer of protected genotype or clinical information to parties who were not part of the original study approval; only statistical results from a study need be transferred. For example, a recent publication examining lipid profiles was based on a meta-analysis of 46 studies [21]. A study of this magnitude would be logistically difficult (if not impossible) without meta-analysis. Several software packages are available to facilitate meta-analysis, including STATA products and METAL [45,46].

A fundamental principle in meta-analysis is that all studies included examined the same hypothesis. As such, the general design of each included study should be similar, and the study-level SNP analysis should follow near-identical procedures across all studies (see Zeggini and Ioannidis [47] for an excellent review). Quality control procedures that determine which SNPs are included from each site should be standardized, along with any covariate adjustments, and the measurement of clinical covariates and phenotypes should be consistent across multiple sites. The sample sets across all studies should be independent, an assumption that should always be examined as investigators often contribute the same samples to multiple studies. Also, an extremely important and somewhat bothersome logistical matter is ensuring that all studies report results relative to a common genomic build and reference allele. If one study reports its results relative to allele A and another relative to allele B, the meta-analysis result for this SNP may be non-significant because the effects of the two studies nullify each other.

With all of these factors to consider, it is rare to find multiple studies that match perfectly on all criteria. Therefore, study heterogeneity is often statistically quantified in a meta-analysis to determine the degree to which studies differ. The most popular measures of study heterogeneity are the Q statistic and the I² index [48], with the I² index favored in more recent studies. Coefficients resulting from a meta-analysis have variability (or error) associated with
them, and the I² index represents the approximate proportion of this variability that can be attributed to heterogeneity between studies [49]. I² values fall into low (<25), medium (>25 and <75), and high (>75) heterogeneity, and have been proposed as a way to identify studies that should perhaps be removed from a meta-analysis. It is important to note that these statistics should be used as a guide to identifying studies that perhaps examine a different underlying hypothesis than others in the meta-analysis, much like outlier analysis is used to identify unduly influential points. Just as with outliers, however, a study should only be excluded if there is an obvious reason to do so based on the parameters of the study, not simply because a statistic indicates that this study increases heterogeneity. Otherwise, agnostic statistical procedures designed to reduce meta-analysis heterogeneity will increase false discoveries.

7.3 Data Imputation

To conduct a meta-analysis properly, the effect of the same allele across multiple distinct studies must be assessed. This can prove difficult if different studies use different genotyping platforms (which use different SNP marker sets). As this is often the case, GWAS datasets can be imputed to generate results for a common set of SNPs across all studies. Genotype imputation exploits known LD patterns and haplotype frequencies from the HapMap or 1000 Genomes project to estimate genotypes for SNPs not directly genotyped in the study [50].

The concept is similar in principle to haplotype phasing algorithms, where the contiguous set of alleles lying on a specific chromosome is estimated. Genotype imputation methods extend this idea to human populations. First, a collection of shared haplotypes within the study sample is computed to estimate haplotype frequencies among the genotyped SNPs. Phased haplotypes from the study sample are compared to reference haplotypes from a panel of much more dense SNPs, such as the HapMap data. The matched reference haplotypes contain genotypes for surrounding markers that were not genotyped in the study sample. Because the study sample haplotypes may match multiple reference haplotypes, surrounding genotypes may be given a score or probability of a match based on the haplotype overlap. For example, rather than assign an imputed SNP a single allele A, the probability of possible alleles is reported (0.85 A, 0.12 C, 0.03 T) based on haplotype frequencies. This information can be used in the analysis of imputed data to take into account uncertainty in the genotype estimation process, typically using Bayesian analysis approaches [51]. Popular algorithms for genotype imputation include BimBam [52], IMPUTE [53], MaCH [54], and Beagle [55].

Much like conducting a meta-analysis, genotype imputation must be conducted with great care. The reference panel (e.g. the 1000 Genomes data or the HapMap project) must contain haplotypes drawn from the same population as the study sample in order to facilitate a proper haplotype match. If a study was conducted using individuals of Asian descent, but only European descent populations are represented in the reference panel, the genotype imputation quality will be poor as there is a lower probability of a haplotype match. Also, the reference allele for each SNP must be identical in both the study sample and the reference panel. Finally, the analysis of imputed genotypes should account for the uncertainty in genotype state generated by the imputation process.

8. The Future

Genome-wide association studies have had a huge impact on the field of human genetics. They have identified new genetic risk factors for many common human diseases and have forced the genetics community to think on a genome-wide scale. On the horizon is whole-genome sequencing. Within the next few years we will see the arrival of cheap sequencing technology that will replace one million SNPs with the entire genomic sequence of three billion nucleotides. Challenges associated with data storage and manipulation, quality control and data analysis will be many times more complex, thus challenging computer science and bioinformatics infrastructure and expertise. Merging sequencing data with that from other high-throughput technology for measuring the transcriptome, the proteome, the environment and phenotypes such as the massive amounts of data that come from neuroimaging will only serve to complicate our goal to understand the genotype-phenotype relationship for the purpose of improving healthcare. Integrating these many levels of complex biomedical data along with their coupling with experimental systems is the future of human genetics.

9. Exercises

1. True or False: Common diseases, such as type II diabetes and lung cancer, are likely caused by mutations to a single gene. Explain your answer.
2. Will the genotyping platforms designed for GWAS of European Descent populations be of equal utility in African Descent populations? Why or why not?
3. When conducting a genetic study, what additional factors should be measured and adjusted for in the statistical analysis?
4. True or False: SNPs that are associated to disease using GWAS design should be immediately considered for molecular studies. Explain your answer.

Answers to the Exercises can be found in Text S1.

Supporting Information

Text S1 Answers to Exercises (DOCX)

Acknowledgments

Thanks are extended to Ms. Davnah Urbach for her editorial assistance.
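
As a companion to the meta-analysis discussion in Section 7.2, the following minimal Python sketch illustrates why results must be aligned to a common reference allele and how the Q statistic and I² index can be computed. It assumes a fixed-effect, inverse-variance weighted model, which is one common choice and not necessarily what METAL or the cited studies use. The function names, the two toy studies, and their standard errors are hypothetical and chosen only for illustration.

```python
from math import exp, log


def align_to_reference(beta, effect_allele, reference_allele):
    """Return the log odds ratio expressed relative to the reference allele.

    Assumes a biallelic SNP, so an effect reported for the other allele
    is simply the negative of the effect for the reference allele.
    """
    return beta if effect_allele == reference_allele else -beta


def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance weighted pooled effect, Cochran's Q, and I^2 in percent."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I^2 is the share of total variability beyond what chance alone would give.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical studies: (effect allele, log odds ratio, standard error).
# Both describe the same effect, but study 2 reports it for the other allele.
studies = [("A", log(1.5), 0.1), ("B", log(1 / 1.5), 0.1)]
ses = [se for _, _, se in studies]

raw_betas = [beta for _, beta, _ in studies]
aligned_betas = [align_to_reference(beta, allele, "A") for allele, beta, _ in studies]

raw_result = fixed_effect_meta(raw_betas, ses)
aligned_result = fixed_effect_meta(aligned_betas, ses)

print("unaligned: pooled OR", round(exp(raw_result[0]), 3), "I2", round(raw_result[2], 1))
print("aligned:   pooled OR", round(exp(aligned_result[0]), 3), "I2", round(aligned_result[2], 1))
```

Without alignment the two effects cancel and the heterogeneity index is high; after alignment the pooled odds ratio recovers the shared effect and the heterogeneity disappears.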

Further Reading

• 1000 Genomes Project Consortium, Altshuler D, Durbin RM, Abecasis GR, Bentley DR, et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061-1073.
• Haines JL, Pericak-Vance MA (2006) Genetic analysis of complex disease. New York: Wiley-Liss. 512 p.
• Hartl DL, Clark AG (2006) Principles of population genetics. Sunderland (Massachusetts): Sinauer Associates, Inc. 545 p.
• NCI-NHGRI Working Group on Replication in Association Studies, Chanock SJ, Manolio T, Boehnke M, Boerwinkle E, et al. (2007) Replicating genotype-phenotype associations. Nature 447: 655-660.
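
Section 6.1 describes how genotypes can be encoded under dominant, recessive, or additive models and then tested with a contingency table. The sketch below shows one way this could look in Python using only the standard library. The genotype lists are invented toy data, the helper names are hypothetical, and the chi-square test is the plain Pearson statistic with one degree of freedom and no continuity correction; with cell counts this small, Fisher's exact test would normally be preferred.

```python
from math import erfc, sqrt


def encode(genotype, model, risk_allele="A"):
    """Map a two-character genotype such as 'Aa' to a numeric code."""
    copies = genotype.count(risk_allele)  # 0, 1, or 2 copies of the risk allele
    if model == "additive":
        return copies                     # one unit per copy
    if model == "dominant":
        return int(copies >= 1)           # Aa and AA are grouped together
    if model == "recessive":
        return int(copies == 2)           # only AA differs from the rest
    raise ValueError(f"unknown model: {model}")


def binary_table(cases, controls, model):
    """Build a 2x2 table (a, b, c, d) for a dominant or recessive encoding.

    a, b = cases with / without the coded genotype class;
    c, d = controls with / without it.
    """
    a = sum(encode(g, model) for g in cases)
    c = sum(encode(g, model) for g in controls)
    return a, len(cases) - a, c, len(controls) - c


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 df) for [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(stat / 2))        # upper tail of chi-square with 1 df
    return stat, p_value


# Toy genotypes, invented for illustration only.
cases = ["AA", "Aa", "Aa", "AA", "aa", "Aa"]
controls = ["aa", "aa", "Aa", "aa", "aa", "aa"]

table = binary_table(cases, controls, "dominant")
stat, p_value = chi_square_2x2(*table)
print(table, round(stat, 3), round(p_value, 4))
```

Swapping "dominant" for "recessive" regroups the same genotypes into different classes, which is how the choice of encoding changes the test that is actually performed.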



Glossary

GWAS: genome-wide association study; a genetic study design that attempts to identify commonly occurring genetic variants that
contribute to disease risk
Personalized Medicine: the science of providing health care informed by individual characteristics, such as genetic variation
SNP: single nucleotide polymorphism; a single base-pair change in the DNA sequence
Linkage Analysis: the attempt to statistically relate transmission of an allele within families to inheritance of a disease
Common disease/Common variant hypothesis: The hypothesis that commonly occurring diseases in a population are caused in part
by genetic variation that is common to that population
Linkage disequilibrium: the degree to which an allele of one SNP is observed with an allele of another SNP within a population
Direct association: the statistical association of a functional or influential allele with a disease
Indirect association: the statistical association of an allele to disease that is in strong linkage disequilibrium with the allele that is
functional or influential for disease
Population stratification: the false association of an allele to disease due to both differences in population frequency of the allele and
differences in ethnic prevalence or sampling of affected individuals
False positive: from statistical hypothesis testing, the rejection of a null hypothesis when the null hypothesis is true
Genome-wide significance: a false-positive rate threshold established by empirical estimation of the independent genomic regions
present in a population
Replication: the observation of a statistical association in a second, independent dataset (often the same population as the first
association)
Generalization: the replication of a statistical association in a second population
Imputation: the estimation of unknown alleles based on the observation of nearby alleles in high linkage disequilibrium
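
The false positive and genome-wide significance entries above relate to the multiple testing corrections of Section 6.3. As a rough illustration, the Python sketch below computes a Bonferroni threshold and applies the Benjamini-Hochberg step-up procedure to a short list of p-values. The function names and the ten p-values are hypothetical; a real GWAS would involve hundreds of thousands of tests and would typically rely on established software.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Indices of tests rejected by the Benjamini-Hochberg step-up procedure.

    Sort the p-values, find the largest rank r with p_(r) <= q * r / m,
    and reject the r smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# The 500,000-SNP example from the text: 0.05 / 500,000.
gwas_threshold = bonferroni_threshold(0.05, 500_000)

# Ten invented p-values for illustration.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

small_threshold = bonferroni_threshold(0.05, len(p_values))
bonferroni_hits = [i for i, p in enumerate(p_values) if p <= small_threshold]
fdr_hits = benjamini_hochberg(p_values, q=0.05)

print(gwas_threshold, bonferroni_hits, fdr_hits)
```

On this toy list the Bonferroni rule keeps fewer results than the FDR procedure, which reflects the more conservative nature of the Bonferroni correction described in the text.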

References
1. Haines JL, Hauser MA, Schmidt S, Scott WK, Olson LM, et al. (2005) Complement factor H variant increases the risk of age-related macular degeneration. Science 308: 419-421. doi: 10.1126/science.1110359
2. Edwards AO, Ritter R, III, Abel KJ, Manning A, Panhuysen C, et al. (2005) Complement factor H polymorphism and age-related macular degeneration. Science 308: 421-424. doi: 10.1126/science.1110189
3. Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, et al. (2005) Complement factor H polymorphism in age-related macular degeneration. Science 308: 385-389. doi: 10.1126/science.1109557
4. Cooper GM, Johnson JA, Langaee TY, Feng H, Stanaway IB, et al. (2008) A genome-wide scan for common genetic variants with a large influence on warfarin maintenance dose. Blood 112: 1022-1027. doi: 10.1182/blood-2008-01-134247
5. 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061-1073. doi: 10.1038/nature09534
6. Griffith OL, Montgomery SB, Bernier B, Chu B, Kasaian K, et al. (2008) ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res 36: D107-D113. doi: 10.1093/nar/gkm967
7. Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA, et al. (2010) Integrating common and rare genetic variation in diverse human populations. Nature 467: 52-58. doi: 10.1038/nature09298
8. Kerem B, Rommens JM, Buchanan JA, Markiewicz D, et al. (1989) Identification of the cystic fibrosis gene: genetic analysis. Science 245: 1073-1080.
9. MacDonald ME, Novelletto A, Lin C, Tagle D, Barnes G, et al. (1992) The Huntington's disease candidate region exhibits many different haplotypes. Nat Genet 1: 99-103. doi: 10.1038/ng0592-99
10. Hirschhorn JN, Daly MJ (2005) Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6: 95-108. doi: 10.1038/nrg1521
11. Corder EH, Saunders AM, Strittmatter WJ, Schmechel DE, Gaskell PC, et al. (1993) Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer's disease in late onset families. Science 261: 921-923.
12. Altshuler D, Hirschhorn JN, Klannemark M, Lindgren CM, Vohl MC, et al. (2000) The common PPARgamma Pro12Ala polymorphism is associated with decreased risk of type 2 diabetes. Nat Genet 26: 76-80. doi: 10.1038/79216
13. Reich DE, Lander ES (2001) On the allelic spectrum of human disease. Trends Genet 17: 502-510.
14. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106: 9362-9367. doi: 10.1073/pnas.0903103106
15. International HapMap Consortium (2005) A haplotype map of the human genome. Nature 437: 1299-1320. doi: 10.1038/nature04226
16. Ritchie MD, Denny JC, Crawford DC, Ramirez AH, Weiner JB, et al. (2010) Robust replication of genotype-phenotype associations across multiple diseases in an electronic medical record. Am J Hum Genet 86: 560-572. doi: 10.1016/j.ajhg.2010.03.003
17. Devlin B, Risch N (1995) A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29: 311-322. doi: 10.1006/geno.1995.9003
18. Fallin D, Schork NJ (2000) Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am J Hum Genet 67: 947-959. doi: 10.1086/303069
19. Li M, Li C, Guan W (2008) Evaluation of coverage variation of SNP chips for genome-wide association studies. Eur J Hum Genet 16: 635-643. doi: 10.1038/sj.ejhg.5202007
20. Distefano JK, Taverna DM (2011) Technological issues and experimental design of gene association studies. Methods Mol Biol 700: 3-16. doi: 10.1007/978-1-61737-954-3_1
21. Teslovich TM, Musunuru K, Smith AV, Edmondson AC, Stylianou IM, et al. (2010) Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466: 707-713. doi: 10.1038/nature09270
22. Habek M, Brinar VV, Borovecki F (2010) Genes associated with multiple sclerosis: 15 and counting. Expert Rev Mol Diagn 10: 857-861. doi: 10.1586/erm.10.77
23. Polman CH, Reingold SC, Edan G, Filippi M, Hartung HP, et al. (2005) Diagnostic criteria for multiple sclerosis: 2005 revisions to the McDonald Criteria. Ann Neurol 58: 840-846. doi: 10.1002/ana.20703
24. Chew EY, Kim J, Sperduto RD, Datiles MB, III, Coleman HR, et al. (2010) Evaluation of the age-related eye disease study clinical lens grading system AREDS report No. 31. Ophthalmology 117: 2112-2119. doi: 10.1016/j.ophtha.2010.02.033
25. Denny JC, Ritchie MD, Crawford DC, Schildcrout JS, Ramirez AH, et al. (2010) Identification of genomic predictors of atrioventricular conduction: using electronic medical records as a tool for genome science. Circulation 122: 2016-2021. doi: 10.1161/CIRCULATIONAHA.110.948828
26. Wilke RA, Berg RL, Linneman JG, Peissig P, Starren J, et al. (2010) Quantification of the clinical modifiers impacting high-density lipoprotein cholesterol in the community: Personalized Medicine Research Project. Prev Cardiol 13: 63-68. doi: 10.1111/j.1751-7141.2009.00055.x
27. Kullo IJ, Fan J, Pathak J, Savova GK, Ali Z, et al. (2010) Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease. J Am Med Inform Assoc 17: 568-574. doi: 10.1136/jamia.2010.004366
28. McCarty CA, Wilke RA (2010) Biobanking and pharmacogenomics. Pharmacogenomics 11: 637-641. doi: 10.2217/pgs.10.13
29. Lewis CM (2002) Genetic association studies: design, analysis and interpretation. Brief Bioinform 3: 146-153.
30. Lettre G, Lange C, Hirschhorn JN (2007) Genetic model testing and statistical power in population-based association studies of quantitative traits. Genet Epidemiol 31: 358-362. doi: 10.1002/gepi.20217
31. Falush D, Stephens M, Pritchard JK (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164: 1567-1587.
32. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904-909. doi: 10.1038/ng1847
33. Hochberg Y, Benjamini Y (1990) More powerful procedures for multiple significance testing. Stat Med 9: 811-818.
34. van den Oord EJ (2008) Controlling false discoveries in genetic studies. Am J Med Genet B Neuropsychiatr Genet 147B: 637-644. doi: 10.1002/ajmg.b.30650
35. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81: 559-575. doi: 10.1086/519795
36. Browning BL (2008) PRESTO: rapid calculation of order statistic distributions and multiple-testing adjusted P-values via permutation for one and two-stage genetic association studies. BMC Bioinformatics 9: 309. doi: 10.1186/1471-2105-9-309
37. Pahl R, Schafer H (2010) PERMORY: an LD-exploiting permutation test algorithm for powerful genome-wide association testing. Bioinformatics 26: 2093-2100. doi: 10.1093/bioinformatics/btq399
38. Dudbridge F, Gusnanto A (2008) Estimation of significance thresholds for genomewide association scans. Genet Epidemiol 32: 227-234. doi: 10.1002/gepi.20297
39. Moore JH, Ritchie MD (2004) STUDENTJAMA. The challenges of whole-genome approaches to common diseases. JAMA 291: 1642-1643. doi: 10.1001/jama.291.13.1642
40. Grady BJ, Torstenson ES, McLaren PJ, de Bakker PI, Haas DW, et al. (2011) Use of biological knowledge to inform the analysis of gene-gene interactions involved in modulating virologic failure with efavirenz-containing treatment regimens in art-naive actg clinical trials participants. Pac Symp Biocomput 253-264.
41. Bush WS, Dudek SM, Ritchie MD (2009) Biofilter: a knowledge-integration system for the multi-locus analysis of genome-wide association studies. Pac Symp Biocomput 368-379.
42. Herold C, Steffens M, Brockschmidt FF, Baur MP, Becker T (2009) INTERSNP: genome-wide interaction analysis guided by a priori information. Bioinformatics 25: 3275-3281. doi: 10.1093/bioinformatics/btp596
43. Chanock SJ, Manolio T, Boehnke M, Boerwinkle E, Hunter DJ, et al. (2007) Replicating genotype-phenotype associations. Nature 447: 655-660. doi: 10.1038/447655a
44. Zollner S, Pritchard JK (2007) Overcoming the winner's curse: estimating penetrance parameters from case-control data. Am J Hum Genet 80: 605-615. doi: 10.1086/512821
45. Sanna S, Jackson AU, Nagaraja R, Willer CJ, Chen WM, et al. (2008) Common variants in the GDF5-UQCC region are associated with variation in human height. Nat Genet 40: 198-203. doi: 10.1038/ng.74
46. Willer CJ, Sanna S, Jackson AU, Scuteri A, Bonnycastle LL, et al. (2008) Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nat Genet 40: 161-169. doi: 10.1038/ng.76
47. Zeggini E, Ioannidis JP (2009) Meta-analysis in genome-wide association studies. Pharmacogenomics 10: 191-201. doi: 10.2217/14622416.10.2.191
48. Huedo-Medina TB, Sanchez-Meca J, Marin-Martinez F, Botella J (2006) Assessing heterogeneity in meta-analysis: Q statistic or I2 index? Psychol Methods 11: 193-206. doi: 10.1037/1082-989X.11.2.193
49. Higgins JP (2008) Commentary: Heterogeneity in meta-analysis should be expected and appropriately quantified. Int J Epidemiol 37: 1158-1160. doi: 10.1093/ije/dyn204
50. Li Y, Willer C, Sanna S, Abecasis G (2009) Genotype imputation. Annu Rev Genomics Hum Genet 10: 387-406. doi: 10.1146/annurev.genom.9.081307.164242
51. Marchini J, Howie B, Myers S, McVean G, Donnelly P (2007) A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39: 906-913. doi: 10.1038/ng2088
52. Guan Y, Stephens M (2008) Practical issues in imputation-based association mapping. PLoS Genet 4: e1000279. doi: 10.1371/journal.pgen.1000279
53. Howie BN, Donnelly P, Marchini J (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5: e1000529. doi: 10.1371/journal.pgen.1000529
54. Biernacka JM, Tang R, Li J, McDonnell SK, Rabe KG, et al. (2009) Assessment of genotype imputation methods. BMC Proc 3 Suppl 7: S5.
55. Browning BL, Browning SR (2009) A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet 84: 210-223. doi: 10.1016/j.ajhg.2009.01.005

PLOS Computational Biology | www.ploscompbiol.org 10 December 2012 | Volume 8 | Issue 12 | e1002822


Education

Chapter 12: Human Microbiome Analysis


Xochitl C. Morgan1, Curtis Huttenhower1,2*
1 Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, United States of America, 2 The Broad Institute of MIT and Harvard, Cambridge,
Massachusetts, United States of America

Abstract: Humans are essentially sterile during gestation, but during and after birth, every body surface, including the skin, mouth, and gut, becomes host to an enormous variety of microbes, bacterial, archaeal, fungal, and viral. Under normal circumstances, these microbes help us to digest our food and to maintain our immune systems, but dysfunction of the human microbiota has been linked to conditions ranging from inflammatory bowel disease to antibiotic-resistant infections. Modern high-throughput sequencing and bioinformatic tools provide a powerful means of understanding the contribution of the human microbiome to health and its potential as a target for therapeutic interventions. This chapter will first discuss the historical origins of microbiome studies and methods for determining the ecological diversity of a microbial community. Next, it will introduce shotgun sequencing technologies such as metagenomics and metatranscriptomics, the computational challenges and methods associated with these data, and how they enable microbiome analysis. Finally, it will conclude with examples of the functional genomics of the human microbiome and its influences upon health and disease.

This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

Citation: Morgan XC, Huttenhower C (2012) Chapter 12: Human Microbiome Analysis. PLoS Comput Biol 8(12): e1002808. doi:10.1371/journal.pcbi.1002808
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: © 2012 Morgan, Huttenhower. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by the NIH grant 1R01HG005969-01. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: chuttenh@hsph.harvard.edu

1. Introduction

The question of what it means to be human is more often encountered in metaphysics than in bioinformatics, but it is surprisingly relevant when studying the human microbiome. We are born consisting only of our own eukaryotic human cells, but over the first several years of life, our skin surface, oral cavity, and gut are colonized by a tremendous diversity of bacteria, archaea, fungi, and viruses. The community formed by this complement of cells is called the human microbiome; it contains almost ten times as many cells as are in the rest of our bodies and accounts for several pounds of body weight and orders of magnitude more genes than are contained in the human genome [1,2]. Under normal circumstances, these microbes are commensal, helping to digest our food and to maintain our immune systems. Although the human microbiome has long been known to influence human health and disease [1], we have only recently begun to appreciate the breadth of its involvement. This is almost entirely due to the recent ability of high-throughput sequencing to provide an efficient and cost-effective tool for investigating the members of a microbial community and how they change. Thus, dysfunctions of the human microbiota are increasingly being linked to disease ranging from inflammatory bowel disease to diabetes to antibiotic-resistant infection, and the potential of the human microbiome as an early detection biomarker and target for therapeutic intervention is a vibrant area of current research.

2. A Brief History of Microbiome Studies

Historically, members of a microbial community were identified in situ by stains that targeted their physiological characteristics, such as the Gram stain [3]. These could distinguish many broad clades of bacteria but were non-specific at lower taxonomic levels. Thus, microbiology was almost entirely culture-dependent; it was necessary to grow an organism in the lab in order to study it. Specific microbial species were detected by plating samples on specialized media selective for the growth of that organism, or they were identified by features such as the morphological characteristics of colonies, their growth on different media, and metabolic production or consumption. This approach limited the range of organisms that could be detected to those that would actively grow in laboratory culture, and it led to the close study of easily-grown, now-familiar model organisms such as Escherichia coli. However, E. coli as a taxonomic unit accounts for at most 5% of the microbes occupying the typical human gut [2]. The vast majority of microbial species have never been grown in the laboratory, and options for studying and quantifying the uncultured were severely limited until the development of DNA-based culture-independent methods in the 1980s [4].

Culture-independent techniques, which analyze the DNA extracted directly from a sample rather than from individually cultured microbes, allow us to investigate several aspects of microbial communities (Figure 1). These include taxonomic diversity, such as how many of which microbes are present in a community, and functional metagenomics, which attempts to describe which biological tasks the members of a community can or do carry out. The earliest DNA-based methods probed extracted community DNA for genes of interest by hybridization, or amplified specifically-targeted genes by PCR prior to sequencing. These studies were typically able to describe diversity at

PLOS Computational Biology | www.ploscompbiol.org 1 December 2012 | Volume 8 | Issue 12 | e1002808


a broad level, or detect the presence or absence of individual biochemical functions, but with few details in either case.

One of the earliest targeted metagenomic assays for studying uncultured communities without prior DNA extraction was fluorescent in situ hybridization (FISH), in which fluorescently-labeled, specific oligonucleotide probes for marker genes are hybridized to a microbial community [5]. FISH probes can be targeted to almost any level of taxonomy from species to phylum. Although FISH was initially limited to the 16S rRNA marker gene and thus to diversity studies, it has since been expanded to functional gene probes that can be used to identify specific enzymes in communities [6]. However, it remains a primarily low-throughput, imaging-based technology.

To investigate microbial communities efficiently at scale, almost all current studies employ high-throughput DNA sequencing, increasingly in combination with other genome-scale platforms such as proteomics or metabolomics. Although DNA sequencing has existed since the 1970s [7,8], it was historically quite expensive; sequencing environmental DNA further required the additional time and expense of clone library construction. It was not until the 2005 advent of next-generation high-throughput sequencing [9] that it became economically feasible for most scientists to sequence the DNA of an entire environmental sample, and metagenomic studies have since become increasingly common.

What to Learn in This Chapter
• An overview of the analysis of microbial communities
• Understanding the human microbiome from phylogenetic and functional perspectives
• Methods and tools for calculating taxonomic and phylogenetic diversity
• Metagenomic assembly and pathway analysis
• The impact of the microbiome on its host

3. Taxonomic Diversity

3.1 The 16S rRNA Marker Gene

Like a metazoan, a microbial community consists fundamentally of a collection of individual cells, each carrying a distinct complement of genomic DNA. Communities, however, obviously differ from multicellular organisms in that their component cells may or may not carry identical genomes, although substantial subsets of these cells are typically assumed to be clonal. One can thus assign a frequency to each distinct genome within the community describing either the absolute number of cells in which it is carried or their relative abundance within the population. As it is impractical to fully sequence every genome in every cell (a statement that should remain safely true no matter how cheap high-throughput sequencing becomes), microbial ecology has defined a number of molecular markers that (more or less) uniquely tag distinct genomes. Just as the make, model, and year of a car identify its components without the need to meticulously inspect the entirety of every such car, a marker is a DNA sequence that identifies the genome that contains it, without the need to sequence the entire genome.

Although different markers can be chosen for analyzing different populations, several properties are desirable for a good marker. A marker should be present in every member of a population, should differ only and always between individuals with distinct genomes, and, ideally, should differ proportionally to the evolutionary distance between distinct genomes. Several such markers have been defined, including ribosomal protein subunits, elongation factors, and RNA polymerase subunits [10], but by far the most ubiquitous (and historically significant [11]) is the small or 16S ribosomal RNA subunit gene [12]. This 1.5 Kbp gene is commonly referred to as the 16S rRNA (after transcription) or sometimes rDNA; it satisfies the criteria of a marker by containing both highly conserved, ubiquitous sequences and regions that vary with greater or lesser frequency over evolutionary time. It is relatively cheap and simple to sequence only the 16S sequences from a microbiome [13], thus describing the population as a set of 16S sequences and the number of times each was detected. Sequences assayed in this manner have been characterized for a wide range of cultured species and environmental isolates; these are stored and can be automatically matched against several databases including GreenGenes [14], the Ribosomal Database Project [15], and Silva [16].

3.2 Binning 16S rRNA Sequences into OTUs

A bioinformatic challenge that arises immediately in the analysis of rRNA genes is the precise definition of a unique sequence. Although much of the 16S rRNA gene is highly conserved, several of the sequenced regions are variable or hypervariable, so small numbers of base pairs can change in a very short period of evolutionary time [17]. Horizontal transfer, multicopy or ambiguous rDNA markers, and other confounding factors do, however, blur the biological meaning of species as well as our ability to resolve them technically [17]. Finally, because 16S regions are typically sequenced using only a single pass, there is a fair chance that they will thus contain at least one sequencing error. This means that requiring tags to be 100% identical will be extremely conservative and treat essentially clonal genomes as different organisms. Some degree of sequence divergence is typically allowed (95%, 97%, or 99% are sequence similarity cutoffs often used in practice [18]), and the resulting cluster of nearly-identical tags (and thus assumedly identical genomes) is referred to as an Operational Taxonomic Unit (OTU) or sometimes phylotype. OTUs take the place of species in many microbiome diversity analyses because named species genomes are often unavailable for particular marker sequences. The assignment of sequences to OTUs is referred to as binning, and it can be performed by A) unsupervised clustering of similar sequences [19], B) phylogenetic models incorporating mutation rates and evolutionary relationships [20], or C) supervised methods that directly assign sequences to taxonomic bins based on labeled training data [21] (which also applies to whole-genome shotgun sequences; see below).

The binning process allows a community to be analyzed in terms of discrete bins or OTUs, opening up a range of computationally tractable representations for biological analysis. If each OTU is treated as a distinct category, or each 16S sequence is binned into a named phylum or other taxonomic category, a pool of microbiome sequences can be represented as a histogram of bin counts [22]. Alternately, this histogram can be binarized into presence/absence calls for each bin across a collection of related samples. Because diverse, general OTUs will always be present in related communities, and overly-specific OTUs may not appear outside of their sample of origin, the latter approach is typically most useful for low-complexity microbiomes or OTUs at an



Figure 1. Bioinformatic methods for functional metagenomics. Studies that aim to define the composition and function of uncultured
microbial communities are often referred to collectively as metagenomic, although this refers more specifically to particular sequencing-based
assays. First, community DNA is extracted from a sample, typically uncultured, containing multiple microbial members. The bacterial taxa present in
the community are most frequently defined by amplifying the 16S rRNA gene and sequencing it. Highly similar sequences are grouped into
Operational Taxonomic Units (OTUs), which can be compared to 16S databases such as Silva [16], Green Genes [14], and RDP [15] to identify them as
precisely as possible. The community can be described in terms of which OTUs are present, their relative abundance, and/or their phylogenetic
relationships. An alternate method of identifying community taxa is to directly metagenomically sequence community DNA and compare it to
reference genomes or gene catalogs. This is more expensive but provides improved taxonomic resolution and allows observation of single nucleotide
polymorphisms (SNPs) and other variant sequences. The functional capabilities of the community can also be determined by comparing the
sequences to functional databases (e.g. KEGG [170] or SEED [171]). This allows the community to be described as relative abundances of its genes and
pathways. Figure adapted from [172].
doi:10.1371/journal.pcbi.1002808.g001
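
The grouping of highly similar 16S sequences into OTUs, and the description of a community by OTU counts or by presence/absence, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any published tool: reads are assumed to be pre-aligned and of equal length, identity is the plain fraction of matching positions, and the function names (`identity`, `greedy_otus`, `otu_table`) and the 0.97 default threshold are chosen here for illustration only.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two equal-length, pre-aligned reads."""
    if len(a) != len(b):
        raise ValueError("toy identity assumes equal-length, pre-aligned reads")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads, threshold=0.97):
    """Assign each read to the first OTU centroid it matches at >= threshold.

    A read that matches no existing centroid founds a new OTU.
    Returns (centroids, labels), where labels[i] is the OTU index of reads[i].
    """
    centroids, labels = [], []
    for read in reads:
        for k, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                labels.append(k)
                break
        else:  # no centroid was similar enough
            centroids.append(read)
            labels.append(len(centroids) - 1)
    return centroids, labels


def otu_table(labels, n_otus):
    """Summarize one sample as a count histogram and a presence/absence vector."""
    counts = Counter(labels)
    histogram = [counts.get(k, 0) for k in range(n_otus)]
    presence = [int(c > 0) for c in histogram]
    return histogram, presence


if __name__ == "__main__":
    base = "ACGT" * 25                 # hypothetical 100 bp marker read
    near = "T" + base[1:]              # one mismatch: 99% identical to base
    far = "T" * 10 + base[10:]         # eight mismatches: 92% identical to base
    reads = [base, near, far, base]

    centroids, labels = greedy_otus(reads, threshold=0.97)
    print(labels, otu_table(labels, len(centroids)))
```

In this toy input, the read with a single mismatch joins the OTU of its parent sequence at the 97% cutoff, whereas requiring 100% identity (`threshold=1.0`) would split it into an OTU of its own, which is the over-splitting effect of sequencing errors that similarity cutoffs are meant to absorb.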

appropriately tuned level of specificity. Bioinformaticians studying 16S sequences must choose whether to analyze a collection of taxonomically-binned microbiomes as a set of abundance histograms, or as a set of binary presence/absence vectors. However, either representation can be used as input to decomposition methods such as Principal Components Analysis or Canonical Correlation Analysis [23] to determine which OTUs represent the most significant sources of population variance and/or correlate with community metadata such as temperature, pH, or clinical features [24,25].

3.3 Measuring Population Diversity

An important concept when dealing with OTUs or other taxonomic bins is that of population diversity, the number of distinct bins in a sample or in the originating population. This is of critical importance in human health, since a number of disease conditions have been shown to correlate with decreased microbiome diversity, presumably as one or a few microbes overgrow during immune or nutrient imbalance in a process not unlike an algal bloom [26]. Intriguingly, recent results have also shown that essentially no bacterial clades are widely and consistently shared among the human microbiome [2]. Many organisms are abundant in some individuals, and many organisms are prevalent among most individuals, but none are universal. Although they can vary over time and share some similarity with some individuals, our intestinal contents appear to be highly personalized when considered in terms of microbial presence, absence, and abundance.

Two mathematically well-defined questions arise when quantifying population diversity (Figure 2): given that x bins have been observed in a sample of size y from a population of size z, how many bins are expected to exist in the population; or, given that x bins exist in a population of size z, how big must the sample size y be to observe all of them at least once? In other words, "If I've sequenced some amount of diversity, how much more exists in my microbiome?" and, "How much do I need to sequence to completely characterize my microbiome?" The latter is known as the Coupon Collector's Problem, as identical questions can be asked if a cereal manufacturer has randomly hidden one of several different possible prize coupons in each box of cereal [27]. Within a community, several estimators including the Chao1 [28], Abundance-based Coverage Estimator (ACE) [29], and Jackknife [30] measures exist for calculating alpha diversity, the number (richness) and distribution (evenness) of taxa expected within a single population. These give rise to figures known as collector's or rarefaction curves, since increasing numbers of sequenced taxa allow increasingly precise estimates of total population diversity [31]. Additionally, when comparing multiple popula-

Figure 2. Ecological representations of microbial communities: collector's curves, alpha, and beta diversity. These examples describe
the A) sequence counts and B) relative abundances of six taxa (A, B, C, D, E, and F) detected in three samples. C) A collector's curve, typically
generated using a richness estimator such as Chao1 [28] or ACE [29], approximates the relationship between the number of sequences drawn from
each sample and the number of taxa expected to be present based on detected abundances. D) Alpha diversity captures both the organismal
richness of a sample and the evenness of the organisms' abundance distribution. Here, alpha diversity is defined by the Shannon index [32],
$H = -\sum_{i=1}^{S} p_i \ln(p_i)$, where $p_i$ is the relative abundance of taxon $i$, although many other alpha diversity indices may be employed. E) Beta diversity
represents the similarity (or difference) in organismal composition between samples. In this example, it can be simplistically defined by the equation
$\beta = (n_1 - c) + (n_2 - c)$, where $n_1$ and $n_2$ are the number of taxa in samples 1 and 2, respectively, and $c$ is the number of shared taxa, but again many
metrics such as Bray-Curtis [34] or UniFrac [24] are commonly employed.
doi:10.1371/journal.pcbi.1002808.g002
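
The quantities in this caption can be written out directly. The sketch below is illustrative only: the function names and the dictionary-of-counts input are assumptions made here, and `collectors_curve` simply accumulates the taxa observed over a shuffled list of reads rather than applying a richness estimator such as Chao1 or ACE.

```python
import math
import random


def relative_abundance(counts):
    """Convert raw taxon counts to relative abundances p_i, skipping absent taxa."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no sequences")
    return {taxon: n / total for taxon, n in counts.items() if n > 0}


def shannon_index(counts):
    """Shannon index H = -sum_i p_i ln(p_i) over the taxa present in one sample."""
    p = relative_abundance(counts)
    return -sum(p_i * math.log(p_i) for p_i in p.values())


def simple_beta(counts_1, counts_2):
    """Simplistic beta diversity (n1 - c) + (n2 - c): taxa unique to either sample."""
    taxa_1 = {t for t, n in counts_1.items() if n > 0}
    taxa_2 = {t for t, n in counts_2.items() if n > 0}
    shared = len(taxa_1 & taxa_2)
    return (len(taxa_1) - shared) + (len(taxa_2) - shared)


def collectors_curve(counts, seed=0):
    """Number of distinct taxa seen after each read, drawn in random order.

    Simplification: this is the observed-richness accumulation curve only.
    """
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    random.Random(seed).shuffle(reads)
    seen, curve = set(), []
    for taxon in reads:
        seen.add(taxon)
        curve.append(len(seen))
    return curve


if __name__ == "__main__":
    sample_1 = {"A": 5, "B": 5, "C": 5, "D": 5}   # hypothetical, perfectly even
    sample_2 = {"A": 17, "B": 1, "C": 1, "E": 1}  # hypothetical, one dominant taxon
    print(shannon_index(sample_1), shannon_index(sample_2))
    print(simple_beta(sample_1, sample_2))
    print(collectors_curve(sample_2))
```

Both toy samples contain four taxa, so they have equal richness; the Shannon index is higher for the even sample than for the one dominated by a single taxon, which is the evenness component that panel D refers to.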



tions, beta diversity measures including absolute or relative overlap describe how many taxa are shared between them (Figure 2). An alpha diversity measure thus acts like a summary statistic of a single population, while a beta diversity measure acts like a similarity score between populations, allowing analysis by sample clustering or, again, by dimensionality reductions such as PCA [20]. Alpha diversity is often quantified by the Shannon Index [32], $H = -\sum_{i=1}^{S} p_i \ln(p_i)$, or the Simpson Index [33], $D = \sum_{i=1}^{S} p_i^2$, where $p_i$ is the fraction of total species comprised by species $i$. Beta diversity can be measured by simple taxa overlap or quantified by the Bray-Curtis dissimilarity [34], $BC_{ij} = \frac{S_i + S_j - 2C_{ij}}{S_i + S_j}$, where $S_i$ and $S_j$ are the number of species in populations $i$ and $j$, and $C_{ij}$ is the total number of species at the location with the fewest species. Like similarity measures in expression array analysis, many alpha- and beta-diversity measures have been developed that each reveal slightly different aspects of community ecology.

Alternatively, the diversity within or among communities can be analyzed in terms of its phylogenetic distribution rather than by isolating discrete bins. This method of quantifying community diversity describes it in terms of the total breadth or depth of the phylogenetic branches spanned by a microbiome (or shared among two or more). For example, consider a collection of n highly-related 16S sequences. These might be treated either as one OTU or as n distinct taxa, depending on how finely they are binned, but a phylogenetic analysis will consider them to span a small evolutionary distance no matter how large n becomes. Conversely, two highly-divergent binned OTUs are typically no different than two similar OTUs, but a phylogenetic method would score them as spanning a large evolutionary distance. OTU-based and phylogenetic methods tend to be complementary, in that each will reveal different aspects of community structure. OTUs are highly sensitive to the specific means by which taxa are binned, for example, whereas phylogenetic measures are sensi-

to apparent diversity. As a simple example, consider that a single base pair error in a 100 bp sequence read will create a new OTU at the 99% similarity threshold. Apparent diversity can thus be dramatically modified by the choice of marker gene, the region within it that is sequenced, the biochemical marker extraction and amplification processes, and the read length and noise characteristics of the sequencing platform. Accounting for such errors computationally continues to be a fruitful area of research, particularly as 454-based technologies have transitioned to the Illumina platform, as current solutions can discard all but the highest-quality sequence regions [18]. A major confound in many early molecular richness analyses was the abundance of chimeric sequences, or reads in which two unique marker sequences (typically 16S regions) adhere during the amplification process, creating an apparently novel taxon. Although sequence chimeras can now be reliably removed computationally [13,19,35], this filtering process is still an essential early step in any microbiome analysis.

A final consideration in the computational analysis of community structure assays is the use of microarray-based methods for 16S (and other marker) quantification within a microbiome. Just as high-throughput RNA sequencing parallels gene expression microarrays, 16S rDNA sequencing parallels phylochips, microarrays constructed with probes complementary to a variety of 16S and other marker sequences [36]. The design and analysis of such arrays can be challenging, as 16S sequences (or any good genomic markers) will be highly similar, and the potential for extensive cross-hybridization must be taken into account both when determining what sequences to place on a chip and how to quantify their abundance after hybridization [37]. The continued usefulness of such arrays will be dictated by future trends in high-throughput sequencing costs and barcoding, but at present phylochips are beginning to be constructed to capture functional sequences in combination with measures of taxon abundances in high throughput, and they represent an interesting option for popu-

term metagenomics is used with some frequency to describe the entire body of high-throughput studies now possible with microbial communities, although it also refers more specifically to whole-metagenome shotgun (WMS) sequencing of genomic DNA fragments from a community's metagenome [38,39]. Metatranscriptomics, a close relative, implies shotgun sequencing of reverse-transcribed RNA transcripts [40,41], metaproteomics [42,43] the quantification of protein or peptide levels, and metametabolomics (or, less awkwardly, community metabolomics) [44,45] the investigation of small-molecule metabolites. Of these assays, the latter three in particular are still in their infancy, but are carried out using roughly the same technologies as their culture-based counterparts, and the resulting data can typically be analyzed using comparable computational methods.

As of this writing, no complete metametabolomic studies from uncultured microbiomes have yet been published, although their potential usefulness in understanding e.g. the human gut microbiome and its role in energy harvest, obesity, and metabolic disorders is clear [44]. Metaproteomic and metatranscriptomic studies have primarily focused on environmental samples [46,47,48], but human stool metatranscriptomics [41,49] and medium-throughput human gut metaproteomics [42,43] have also been successfully executed and analyzed using bioinformatics similar to those for metagenomes (see below) [42]. Quantification of the human stool metatranscriptome and metaproteome in tandem with host biomolecular activities should yield fascinating insights into our relationship with our microbial majority.

DNA extraction and WMS sequencing from uncultured samples developed, like many sequencing technologies, concurrently with the Human Genome Project [2,50,51,52], and as with other community genomic assays, the earliest applications were to environmental microbes due to the ease of isolation and extraction [53,54]. WMS techniques are in some ways much the same now as they were then, modulo the need for complex Sanger
tive to the method of tree construction. lation-level microbiome assays. clone library construction: isolate micro-
Like the OTU-based diversity estimators bial cells of a target size range (e.g. viral,
discussed above, several standard metrics 4. Shotgun Sequencing and bacterial, or eukaryotic), lyse the cells
such as UniFrac [20] exist for quantifying Metagenomics (taking care not to lose DNA to native
phylogenetic diversity, and these can be DNAses), isolate DNA, fragment it to a
treated as single-sample descriptors or as While measures of community diversity target length, and sequence the resulting
multiple-sample similarity measures. have dominated historical analyses, mod- fragments [55,56]. Since this procedure
It is critically important in any micro- ern high-throughput methods are being can be performed on essentially any
biome richness analysis to account for the developed for a host of other meta heterogeneous population, does not suffer
contribution that technical noise will make assays from uncultured microbes. The from the single-copy and evolutionary

PLOS Computational Biology | www.ploscompbiol.org 5 December 2012 | Volume 8 | Issue 12 | e1002808


assumptions of marker genes, and does not require (although can include) amplification, it can to some degree produce a less biased community profile than does 16S sequencing [57].

4.1 Metagenome Data Analysis

Unlike whole-genome shotgun (WGS) sequencing of individual organisms, in which the end product is typically a single fully assembled genome, metagenomes tend not to have a single "finish line" and have been successfully analyzed using a range of assembly techniques. The simplest is no assembly at all - the short reads produced as primary data can, after cleaning to reduce sequencing error [18], be treated as taxonomic markers or as gene fragments and analyzed directly. Since microbial genomes typically contain few intergenic sequences, most fragments will contain pieces of one or more genes; these can be used to quantify enzymatic or pathway abundances directly as described below [1,58,59,60]. Alternatively, metagenome-specific assembly algorithms have been proposed that reconstruct only the open reading frames from a population (its ORFeome), recruiting highly sequence-similar fragments on an as-needed basis to complete single gene sequences and avoiding assembly of larger contigs [61,62]. The most challenging option is to attempt full assemblies for complete genomes present in the community, which is rarely possible save in very simple communities or with extreme sequencing depth [53,54]. When successful, this has the obvious benefit of establishing synteny, structural variation, and opening up the range of tools developed for whole-genome analysis [63], and guided assemblies using read mapping (rather than de novo assembly) can be used when appropriate reference genomes are available. However, care must be taken in interpreting any such assemblies, since horizontal transfer and community complexity prevent unambiguous assemblies in essentially all realistic cases [64]. A more feasible middle ground is emerging around maximal assemblies that capture the largest unambiguous contigs in a community [65], allowing e.g. local operon structure to be studied without introducing artificial homogeneity into the data. In any of these cases - direct analysis of reads, ORF assembly, maximal unambiguous scaffolds, or whole genomes - subsequent analyses typically focus on the functional aspects of the resulting genes and pathways as detailed below.

A key bioinformatic tradeoff in analyzing metagenomic WMS sequences, regardless of their degree of assembly, is whether they should be analyzed by homology, de novo, or a combination thereof. An illustrative example is the task of determining which parts of each sequence read (or ORF/contig/etc.) encode one or more genes, i.e. gene finding or calling. By homology, each sequence can be BLASTed [66] against a large database of reference genomes, which will retrieve any similar known reading frames; the boundaries of these regions of similarity thus become the start and stop of the metagenomic open reading frames. This method is robust to sequencing and assembly errors, but it is sensitive to the contents of the reference database. Conversely, de novo methods have been developed to directly bin [67,68,69] and call genes within [61,62] metagenomic sequences using DNA features alone (GC content, codon usage, etc.). As with genome analysis for newly sequenced single organisms, most de novo methods rely on interpolated [70] or profile [71] Hidden Markov Models (HMMs) or on other machine learners that perform classification based on encoded sequence features [72,73]. This is a far more challenging task, making it sensitive to errors in the computational prediction process, but it enables a greater range of discovery and community characterization efforts by relying less on prior knowledge. Hybrid methods for e.g. taxonomic binning [69] have recently been developed that consume both sequence similarity and de novo sequence features as input, and for some tasks such systems might represent a sweet spot between computational complexity, availability of prior knowledge, and biological accuracy. This tradeoff between knowledge transfer by homology and de novo prediction from sequence is even more pronounced when characterizing predicted genes, as discussed below.

5. Computational Functional Metagenomics

Essentially any analysis of a microbial community is "functional" in the sense that it aims to determine the overall phenotypic consequences of the community's composition and biomolecular activity. For example, the Human Microbiome Project began to investigate what typical human microbial community members are doing [60], how they are affecting their human hosts [2], what impact they have on health or disease, and these help to suggest how pro- or antibiotics can be used to change community behavior for the better [74]. The approaches referred to as computational functional metagenomics, however, typically focus on the function (either biochemically or phenotypically) of individual genes and gene products within a community and fall into one of two categories. Top-down approaches screen a metagenome for a functional class of interest, e.g. a particular enzyme family, transporter or chelator, pathway, or biological activity, essentially asking the question, "Does this community carry out this function and, if so, in what way?" Bottom-up approaches attempt to reconstruct profiles, either descriptive or predictive, of overall functionality within a community, typically relying on pathway and/or metabolic reconstructions and asking the question, "What functions are carried out by this community?"

Either approach relies, first, on cataloging some or all of the gene products present in a community and assigning them molecular functions and/or biological roles in the typical sense of protein function predictions [53,54,59]. As with so many bioinformatic methods, the simplest techniques rely on BLAST [66]: a top-down investigation can BLAST representatives of gene families of interest into the community metagenome to determine their presence and abundance [63], and a bottom-up approach can BLAST reads or contigs from a metagenome into a large annotated reference database such as nr to perform knowledge transfer by homology [75,76,77]. Top-down approaches dovetail well with experimental screens for individual gene product function [6], and bottom-up approaches are more descriptive of the community as a whole [78].

As each metagenomic sample can contain millions of reads and databases such as nr in turn contain millions of sequences, computational efficiency is a critical consideration in either approach. On one hand, stricter nucleotide searches or direct read mapping to reference genomes [79,80] improve runtime and specificity at the cost of sensitivity; on the other, more flexible characterizations of sequence function such as HMMs [72,73] tend to simultaneously increase coverage, accuracy, and computational expense. Any of these sequence annotation methods can be run directly on short reads, on ORF assemblies, or on assembled contigs, and statistical methods have been proposed to more accurately estimate the frequencies of functions in the underlying community when they are under-sampled (requiring the estimation of unobserved values [81]) or over-sampled (correcting for loci with greater than 1× coverage [82]). In any of these cases, the end result
of such an analysis is an abundance profile for each metagenomic sample quantifying the frequency of gene products in the community; the profiles for several related communities can be assembled into a frequency matrix resembling a microarray dataset. Gene products (rows) in such a profile can be identified by functional descriptors such as Gene Ontology [83] or KEGG [84] terms, protein families such as Pfams [73] or TIGRfams [72], enzymatic [85], transport [86], or other structural classes [87], or most often as orthologous families such as HomoloGenes [88], COGs [89], NOGs [90], or KOs [84].

A logical next step, given such an abundance profile of orthologous families, is to assemble them into profiles of community metabolic and functional pathways. This requires an appropriate catalog of reference pathways such as KEGG [84], MetaCyc [91], or GO [83], although it should be noted that none of these is currently optimized for modeling communities rather than single organisms in monoculture [90]. The pathway inference process is similar to that performed when annotating an individual newly sequenced genome [92] and consists of three main steps: A) assigning each ortholog to one or more pathways, B) gap filling or interpolation of missing annotations, and C) determining the presence and/or abundance of each pathway. The first ortholog assignment step is necessary since many gene families participate in multiple pathways; phosphoenolpyruvate carboxykinase, for example, is used in the TCA cycle, glycolysis, and in various intercellular signaling mechanisms [93]. The abundance mass for each enzyme is distributed across its functions in one or more possible pathways; methods for doing this range from the simple assumption that it is equally active in all reference pathways (as currently done by KAAS [94] or MG-RAST [76]) to the elimination of unlikely pathways and the redistribution of associated mass in a maximum parsimony fashion [95]. Second, once all observed orthologs have been assigned to pathways (when possible), gaps or holes in the reference pathways can be filled, using the assumption that the enzymes necessary to operate a nearly complete pathway should be present somewhere in the community. Essentially three methods have been successfully employed for gap filling: searching for alternative pathway fragments to explain the discrepancy [96,97], purely mathematical smoothing to replace the missing enzymes' abundances with numerically plausible values [81], and targeted searches of the metagenome of interest for more distant homologues with which to fill the hole [98]. Since we are currently able to infer function for only a fraction of the genes in any given complete genome, let alone metagenome, any of these approaches should be deemed hypothetical at best; nevertheless, like any missing value imputation process, they can provide numerically stable guesses that are substantially better than random [99]. Finally, as described above for taxa, the resulting data can be used to summarize each reference pathway either qualitatively (i.e. with what likelihood is it present in the community?) or quantitatively (how abundant is it in the community?), and in its simplest form condenses the abundance matrix of orthologous families into an abundance (or presence/absence) matrix of pathways. Either the ortholog or pathway matrices can then be tested for differentially abundant features representing diagnostic biomarkers with potential explanatory power for the phenotype of interest, using statistical methods developed for identical tests in expression biomarker discovery [100] and genome-wide association studies [101].

However, our prior knowledge of (primarily) metabolic pathways can be leveraged to produce richer inferences from such pathway abundance information. Given sufficient information about the pathways in a community, it is relatively straightforward to predict what metabolic compounds have the potential to be produced. However, it is much more difficult to infer what metabolite pools and fluxes in the community will actually be under a specific set of environmental conditions [102,103]. Multi-organism flux balance analysis (FBA) is an emerging tool to enable such analyses [104], but given the extreme difficulty of constructing accurate models for even single organisms [105] or of determining model parameters in a multi-organism community [53], no successful reconstructions have yet been performed for complex microbiomes. The area holds tremendous promise, however, first with respect to metabolic engineering - it is not yet clear what successes might be achieved with respect to biofuel production or bioremediation using synthetically manipulated communities in place of individual organisms [106,107]. Second, in addition to metabolite profiling, multi-organism growth prediction allows the determination of mutualisms, parasitisms, and commensalisms among taxa in the community [108,109,110], opening the door to basic biological discoveries regarding community dynamics [25,111,112] and to therapeutic probiotic treatments for dysbioses in the human microbiome [113,114].

6. Host Interactions and Interventions

A final but critical aspect of translational metagenomics lies in understanding not only a microbial community but also its environment - that is, its interaction with a human host. Our microbiota would be of interest to basic research alone if they were not heavily influenced by host immunity and, in turn, a major influence on host health and disease. The skin of humans hosts relatively few taxa (e.g. Propionibacterium [115]), the nasal cavity somewhat more (e.g. Corynebacterium [116]), the oral cavity (dominated by Streptococcus) several hundred taxa (with remarkable diversity even among saliva, tongue, teeth, and other substrates [117,118]) and the gut over 500 taxa with densities over 10^11 cells/g [2,119]. Almost none of these communities are yet well-understood, although anecdotes abound. The skin microbiome is thought to be a key factor in antibiotic resistant Staphylococcus aureus infections [120,121]; nasal communities have interacted with the pneumococcus population to influence its epidemiological carriage patterns subsequent to vaccination programs [122]; and extreme dysbiosis in cystic fibrosis can be a precursor to pathogenic infection [123].

The gut, however, is currently the best-studied human microbiome [119,124,125]. It is a dynamic community changing over the course of days [126,127], over the longer time scales of infant development [112,128,129,130] and aging [131,132], in response to natural perturbations such as diet [59,133,134,135] and illness [114,136], and modified in as-yet-unknown ways by the modern prevalence of travel, chemical additives, and antibiotics [126]. Indeed, the human gut microbiome has proven difficult to study exactly because it is so intimately related to the physiology of its host; inasmuch as no two people share identical microbiota, most microbiomes are strikingly divergent between distinct host species, rendering results from model organisms difficult to interpret [137,138]. Nevertheless, studies in wild type vertebrates such as mice [139,140] and zebrafish [141,142] have found a number of similarities in their microbiotic function and host interactions. In particular, germ-free organisms have yielded insights into the microbiota's role in maturation of the host immune system and, surprisingly, even
anatomical development of the intestine [143,144]. Similarly, gnotobiotic systems in which an organism's natural microbiota are replaced with their human analog are a current growth area for closer study of the phenotypic consequences of controlled microbiotic perturbations [145].

One of the highest-profile demonstrations of this technique and of the microbiota's influence on human health has been in an ongoing study of the microbiome in obesity [146]. Early studies in wild-type mice [139] demonstrated gross taxonomic shifts in the composition and diversity of the microbiomes of obese individuals; follow-ups in gnotobiotic mice confirmed that this phenotype was transmissible via the microbiome [147]. These initial studies were taxonomically focused and found that, while high-level phyla were robustly perturbed in obesity (which incurs a reduction in Bacteroidetes and concomitant increase in Firmicutes [139]), few if any specific taxa seemed to be similarly correlated [138,140]. Subsequent functional metagenomics, first in mouse [148] and later a small human cohort [59], established that these shifts are more consistent at the functional level than at the taxonomic level, enriching the microbiome's capacity for energy harvest and dysregulating fat storage and signaling within the host. While these observations represent major descriptive triumphs, further computational and experimental work must yet be performed to establish the underlying biomolecular mechanisms and whether they are correlative, causative, or may be targeted by interventions to actively treat obesity [59].

A similarly complex community for which we have a greater understanding of the functional mechanisms at play is the formation of biofilms in the oral cavity preceding caries (cavities) or periodontitis [149]. While we are still investigating the microbiota of the saliva [150] and of the oral soft tissues [151], colonization of the tooth enamel is somewhat better understood due to the removal of significant interaction with host tissue. Even more strikingly, this biofilm, or physically structured consortium of multiple microbial taxa, must reestablish itself from almost nothing each time we brush our teeth - a process that can be achieved within hours [152]. Streptococci in particular possess a number of surface adhesins and receptors that enable them to behave as early colonizers on bare tooth surface and to bind together a variety of subsequent microbes [153]. These fairly minimal bacteria are metabolically supported by Veillonella and Actinomyces species, and their aggregation leads to local nutritive and structural environments favorable to e.g. Fusobacterium and Porphyromonas [154]. Each of these steps is mediated by a combination of cell surface recognition molecules, extracellular physical interactions, metabolic codependencies, and explicit intercellular signaling, providing an excellent example of the complexity with which structured microbiomes can evolve. Indeed, the evolvability of such systems, both as a whole [155] and at the molecular level [156], is yet another aspect of the work remaining to computationally characterize microbiotic biomolecular and community function.

Finally, the microbiota clearly represent a key component of future personalized medicine. First, the number and diversity of phenotypes linked to the composition of the microbiota is immense: obesity, diabetes, allergies, autism, inflammatory bowel disease, fibromyalgia, cardiac function, various cancers, and depression have all been reported to correlate with microbiome function [157]. Even without causative or modulatory roles, there is tremendous potential in the ability to use the taxonomic or metagenomic composition of a subject's gut or oral flora (both easily sampled) as a diagnostic or prognostic biomarker for any or all of these conditions. Commercial personal genomics services such as 23andMe (Mountain View, CA) promise to decode your disease risk based on somatic DNA from a saliva sample; bioinformatic techniques have yet to be developed that will allow us to do the same using microbial DNA.

Second, the microbiota are amazingly plastic; they change metagenomically within hours and metatranscriptomically within minutes in response to perturbations ranging from broad-spectrum antibiotics to your breakfast bacon and eggs [41,126,127]. For any phenotype to which they are causally linked, this opens the possibility of pharmaceutical, prebiotic (nutrients promoting the growth of beneficial microbes [113,119]), or probiotic treatments. Indeed, Nobel Prize winner Ilya Mechnikov famously named Lactobacillus bulgaricus, a primary yogurt-producing bacterium, for its apparent contribution to the longevity of yogurt-consuming Bulgarians [158], and despite a degree of unfortunate popular hype, the potential health benefits of a variety of probiotic organisms are indeed supported by recent findings [125,159]. Unfortunately, we currently understand few of the mechanisms by which these interventions operate. Do the supplemented organisms outcompete specific pathogens, do they simply increase their own numbers, or do they shift the overall systems-level balance of many taxa within the community? Do they reduce the levels of detrimental metabolites in the host, or do they increase the levels of beneficial compounds? Do they change biomolecular activity being carried out in microbial cells, adjacent host epithelial or immune cells, or distal cells through host signaling mechanisms? Or, as in polygenic genetic disorders, does a combination of many factors result in health or disease status as an emergent phenotype?

The human microbiome has been referred to as a "forgotten organ" [160], and the truth of both words is striking. Our trillions of microbial passengers account for a proportion of our metabolism and signaling at least as great as that performed by more integral body parts, and after a century of molecular biology, we have only begun to realize their importance within the last few years. To close with a success story, the popular press [161] recently reported on the full recovery of a patient suffering from Clostridium difficile-associated diarrhea, which had led her to lose over 60 pounds in less than a year. C. difficile is often refractory to antibiotics, with spores able to repopulate from very low levels, and the patient's normal microbiota had been decimated by the infection and subsequent treatment. Finally, she received a simple fecal transplant from her husband, in which the host microbiome was replaced with that of a donor. Within days, not only had she begun a complete recovery, but a metagenomic survey of her microbiota showed that the new community was almost completely established and had restored normal taxonomic abundances [162]. While this is an extreme case, similar treatments have shown a success rate of some 90% historically [163], all of which occurred before modern genomic techniques allowed us to more closely examine the microbiota. Imagine performing any other organ transplant with such a high rate of success - while blindfolded!

Like so many other discoveries of the genomic era, the study of the human microbiome has begun with amazing achievements, and it will require continued experimental and bioinformatic efforts to better understand the biology of these microbial communities and to see it translated into clinical practice.

7. Summary

The human microbiome consists of unicellular microbes - mainly bacterial, but also archaeal, viral, and eukaryotic -
that occupy nearly every surface of our bodies and have been linked to a wide range of phenotypes in health and disease. High-throughput assays have offered the first comprehensive culture-free techniques for surveying the members of these communities and their biomolecular activities at the transcript, protein, and metabolic levels. Most current technologies rely on DNA sequencing to examine either individual taxonomic markers in a microbial community, typically the 16S ribosomal subunit gene, or the composite metagenome of the entire community. Taxonomic analyses lend themselves to computational techniques rooted in microbial ecology, including diversity measures within (alpha) and between (beta) samples; these can be defined quantitatively (based on abundance) or qualitatively (based on presence/absence), and they may or may not take into account the phylogenetic relatedness of the taxa being investigated. Finally, in the absence of information regarding specific named species in a community, sequences are often clustered by similarity into Operational Taxonomic Units (OTUs) as the fundamental unit of analysis within a sample.

In contrast, whole-genome shotgun analyses begin with sequences sampled from the entire community metagenome. These can also be taxonomically binned, or they can be assembled, partially assembled into ORFeomes, or characterized directly at the read level. Characterization typically consists of function assignment similar to that performed for genes during annotation of a single organism's genome; once genes in the metagenome are defined, they can be mapped or BLASTed to reference sequence databases or analyzed intrinsically using e.g. codon frequencies or HMM profiles. Finally, the frequencies of enzymes and other gene products so determined can be assigned to pathways, allowing inference of the overall metabolic potential of the community and inference of diagnostic and potentially explanatory functional biomarkers. Ongoing studies are beginning to investigate the ways in which the microbiota can be directly engineered using pharmaceuticals, prebiotics, probiotics, or diet as a preventative or treatment for a wide range of disorders.

8. Exercises

Q1. You have a collection of 16S rRNA gene sequencing data, which consists of an Illumina run in which the 100 bp V6 hypervariable region has been amplified. The error rate of Illumina sequencing has been estimated as 1.3×10^-3 per base pair [164], and you have 30 million Illumina reads. Will binning your reads into OTUs at 100% or 97% give you a more interpretable estimation of the number of OTUs present? Why?

Q2. You have collections of 16S rRNA gene reads from two environmental samples, A and B. You examine 50 reads each from sample A and sample B, which correspond to four taxa in A and two taxa in B. You examine 25 more reads from each library and detect two more taxa in A and one more in B. In total, two of these taxa are present in both communities A and B. Which sample has higher alpha diversity by counting taxonomic richness? What is the beta diversity between A and B using simple overlap of taxa? Using Bray-Curtis dissimilarity?

Q3. You examine 1,000 more sequences from samples A and B, detecting 10 additional taxa in A and 25 in B. Which sample has higher alpha diversity now, as measured by taxonomic richness? Why is this different from your previous answer? What statement can you make about the ecological evenness of communities A and B as a result?

Q4. What factors in the microbial environment might you expect to be reflected in metabolism, signaling, and biomolecular function between skin bacteria and oral bacteria? What impact would you expect this to have on the pathways carried in these community metagenomes, or on their alpha diversities?

Q5. It is estimated that 25% of the population has Clostridium difficile in their intestines. Why is this not usually a problem?

Q6. Consider the impact upon the human microbiome of two perturbations: social contact and brushing your teeth. What short-term and long-term impact do you expect on alpha diversity? Beta diversity?

Q7. Calculate richness, the inverse Simpson index, and the Shannon index for each sample described in the table below. Which has the highest alpha diversity? Why is the answer different according to which measurement you use?

OTU   Sample 1   Sample 2   Sample 3
A     20         20         30
B     20         20         30
C     1          20         30
D     1          20         0
E     1          0          1

Answers to the Exercises can be found in Text S1.

Supporting Information

Text S1 Answers to Exercises.
(DOCX)

Acknowledgments

We thank Nicola Segata for assistance with figures.
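
As a minimal sketch of the three alpha-diversity measures named in Q7, the following Python computes richness, the inverse Simpson index, and the Shannon index from a vector of OTU counts. The function names and the `SAMPLES` dictionary are illustrative assumptions rather than part of any published tool; the counts are copied from the Q7 table, and the Shannon index is assumed here to use the natural logarithm (other bases only rescale it).

```python
import math

# OTU counts (rows A-E) for each sample, copied from the Q7 table.
SAMPLES = {
    "Sample 1": [20, 20, 1, 1, 1],
    "Sample 2": [20, 20, 20, 20, 0],
    "Sample 3": [30, 30, 30, 0, 1],
}


def proportions(counts):
    """Relative abundances of the OTUs actually observed (count > 0)."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]


def richness(counts):
    """Number of OTUs observed at least once."""
    return sum(1 for c in counts if c > 0)


def inverse_simpson(counts):
    """1 / sum(p_i^2); equals richness when all observed OTUs are equally abundant."""
    return 1.0 / sum(p * p for p in proportions(counts))


def shannon(counts):
    """-sum(p_i * ln p_i), using the natural logarithm (an assumption of this sketch)."""
    return -sum(p * math.log(p) for p in proportions(counts))


def summarize(samples):
    """Map each sample name to (richness, inverse Simpson, Shannon)."""
    return {
        name: (richness(c), inverse_simpson(c), shannon(c))
        for name, c in samples.items()
    }


if __name__ == "__main__":
    for name, (r, d, h) in summarize(SAMPLES).items():
        print(f"{name}: richness={r}, inverse Simpson={d:.2f}, Shannon={h:.2f}")
```

Richness counts only presence, whereas the inverse Simpson and Shannon indices also weight each OTU by its relative abundance, so the ranking of samples can change with the measure chosen.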

Further Reading
It is difficult to recommend comprehensive literature in an area that is changing
so rapidly, but the bioinformatics of microbial community studies are currently
best covered by the reviews in [22,56,165]. Computational tools for metagenomic
analysis include [13,19,63,75,76,77,166]. An overview of microbial ecology from a
phylogenetic perspective is provided in [167,168], and the use of the 16S subunit
as a marker gene is reviewed in [12]. Likewise, experimental and computational
functional metagenomics are discussed in [6,25,169]. The clinical relevance of the
human microbiome is far-ranging and is comprehensively reviewed in [157].


Glossary
alpha diversity: within-sample taxonomic diversity

beta diversity: between-sample taxonomic diversity

binning: assignment of sequences to taxonomic units

biofilm: a physically (and often temporally) structured aggregate of microorganisms, often containing multiple taxa, and often
adhered to each other and/or to a defined substrate

chimera: an artificial DNA sequence generated during amplification, consisting of a combination of two (or more) true
underlying sequences

collector's curve: a plot in which the horizontal axis represents samples (often DNA sequences) and the vertical axis represents
diversity (e.g. number of distinct taxa)

community structure: used most commonly to refer to the taxonomic composition of a microbial community; can also refer to
the spatiotemporal distribution of taxa

diversity: a measure of the taxonomic distribution within a community, either in terms of distinct taxa or in terms of their
evolutionary/phylogenetic distance

FBA: Flux Balance Analysis, a computational method for inferring the metabolic behavior of a system given prior knowledge of
the enzymatic reactions of which it is capable

functional metagenomics: computational or experimental analysis of a microbial community with respect to the biochemical
and other biomolecular activities encoded by its composite genome

gap filling: the process of imputing missing or inaccurate gene abundances in a set of pathways

germ-free: a host animal containing no microorganisms

gnotobiotic: a host animal containing a defined set of microorganisms, either synthetically implanted or transferred from
another host; often used to refer to model organisms with humanized microbiota

holes: missing genes in a set of reference pathways; see gap filling

interpolation: see gap filling

marker: a gene or other DNA sequence that can be (ideally) unambiguously assigned to a particular taxon or function

metagenome: the total genomic DNA of all organisms within a community

metagenomics: the study of uncultured microbial communities, typically relying on high-throughput experimental data and
bioinformatic techniques

metametabolome: the total metabolite pool (and possibly fluxes) of a community

metaproteome: the total proteome of all organisms within a community

metatranscriptome: the total transcribed RNA pool of all organisms within a community

microbiome: the total microbial community and biomolecules within a defined environment

microbiota: the total collection of microbial organisms within a community, typically used in reference to an animal host

microflora: an older term used synonymously with microbiota

ORFeome: the total collection of open reading frames within a metagenome

ortholog: in strict usage, a homologous gene in two species distinguished only by a speciation event; in practice, used to
denote any gene sufficiently homologous as to represent strong evidence for conserved biological function

OTU: Operational Taxonomic Unit, a cluster of organisms similar at the sequence level beyond some threshold (e.g. 95%), used
in place of species, genus, etc.

phylochip: a microarray containing taxonomic (and sometimes functional) marker sequences

phylotype: see OTU

prebiotic: a food substance metabolized by the microbiota so as to directly or indirectly benefit the host

probiotic: a live microorganism consumed by the host with direct or indirect health benefits

rarefaction curve: see collector's curve

richness: see diversity

16S rRNA: the transcribed form of the 16S ribosomal subunit gene, the smaller RNA component of the prokaryotic ribosome,
used as the most common taxonomic marker for microbial communities

WGS: Whole-Genome Shotgun, used to describe shotgun sequencing of individual organisms and, sometimes, of microbial
communities, although the latter usage is not completely accurate, as no single whole genome is typically involved

WMS: Whole-Metagenome Shotgun sequencing, used in reference to undirected metagenomic sequencing to distinguish it
from sequencing directed to specific taxonomic marker genes
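The diversity terms defined in this glossary can be made concrete with a short calculation. The following is a minimal sketch, not the chapter's own implementation: it computes richness and the Shannon index [32] as within-sample (alpha) measures, Bray-Curtis dissimilarity [34] as a between-sample (beta) measure, and a collector's curve from an ordered list of taxon assignments. The function names, taxon labels, and counts are hypothetical and chosen only for illustration.

```python
import math


def richness(counts):
    """Alpha diversity as richness: number of taxa observed at least once."""
    return sum(1 for n in counts.values() if n > 0)


def shannon_index(counts):
    """Shannon index H = -sum(p * ln p) over observed taxa (reflects evenness too)."""
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total)
                for n in counts.values() if n > 0)


def bray_curtis(a, b):
    """Beta diversity as Bray-Curtis dissimilarity between two count tables.

    Assumes at least one of the two samples is non-empty.
    """
    taxa = set(a) | set(b)
    diff = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    return diff / (sum(a.values()) + sum(b.values()))


def collectors_curve(assignments):
    """Cumulative number of distinct taxa seen after each successive sequence."""
    seen, curve = set(), []
    for taxon in assignments:
        seen.add(taxon)
        curve.append(len(seen))
    return curve


# Hypothetical count tables (taxon -> number of sequences); not data from the chapter.
sample_a = {"taxon_1": 50, "taxon_2": 50}
sample_b = {"taxon_1": 97, "taxon_2": 1, "taxon_3": 1, "taxon_4": 1}

print("richness:", richness(sample_a), richness(sample_b))
print("Shannon:", shannon_index(sample_a), shannon_index(sample_b))
print("Bray-Curtis:", bray_curtis(sample_a, sample_b))
print("collector's curve:",
      collectors_curve(["taxon_1", "taxon_2", "taxon_1", "taxon_3"]))
```

In this made-up example, sample B is richer (four taxa against two), yet sample A has the higher Shannon index because its sequences are spread evenly across taxa. This is the distinction between richness and evenness that the exercises probe.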

References
1. Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, et al. (2010) A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464: 59–65.
2. (2012) Structure, function and diversity of the healthy human microbiome. Nature 486: 207–214.
3. Gram HC (1884) Über die isolierte Färbung der Schizomyceten in Schnitt- und Trockenpräparaten. Fortschritte der Medizin 2: 185–189.
4. Pace NR, Stahl DA, Lane DJ, Olsen GJ (1986) The analysis of natural microbial populations by ribosomal RNA sequences. Advances in Microbial Ecology 9: 1–55.
5. Amann RI, Ludwig W, Schleifer KH (1995) Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev 59: 143–169.
6. Handelsman J (2004) Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev 68: 669–685.
7. Sanger F, Coulson AR (1975) A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of molecular biology 94: 441–448.
8. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the United States of America 74: 5463–5467.
9. Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, et al. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447: 799–816.
10. Bocchetta M, Ceccarelli E, Creti R, Sanangelantoni AM, Tiboni O, et al. (1995) Arrangement and nucleotide sequence of the gene (fus) encoding elongation factor G (EF-G) from the hyperthermophilic bacterium Aquifex pyrophilus: phylogenetic depth of hyperthermophilic bacteria inferred from analysis of the EF-G/fus sequences. J Mol Evol 41: 803–812.
11. Lane DJ, Pace B, Olsen GJ, Stahl DA, Sogin ML, et al. (1985) Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses. Proc Natl Acad Sci U S A 82: 6955–6959.
12. Tringe SG, Hugenholtz P (2008) A renaissance for the pioneering 16S rRNA gene. Curr Opin Microbiol 11: 442–446.
13. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, et al. (2010) QIIME allows analysis of high-throughput community sequencing data. Nat Methods 7: 335–336.
14. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, et al. (2006) Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 72: 5069–5072.
15. Cole JR, Wang Q, Cardenas E, Fish J, Chai B, et al. (2009) The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res 37: D141–145.
16. Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, et al. (2007) SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res 35: 7188–7196.
17. Achtman M, Wagner M (2008) Microbial diversity and the genetic nature of microbial species. Nat Rev Microbiol 6: 431–440.
18. Schloss PD (2010) The effects of alignment quality, distance calculation method, sequence filtering, and region on the analysis of 16S rRNA gene-based studies. PLoS Comput Biol 6: e1000844.
19. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, et al. (2009) Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75: 7537–7541.
20. Hamady M, Lozupone C, Knight R (2010) Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. ISME J 4: 17–27.
21. Wang Q, Garrity GM, Tiedje JM, Cole JR (2007) Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73: 5261–5267.
22. Hamady M, Knight R (2009) Microbial community profiling for human microbiome projects: Tools, techniques, and challenges. Genome Res 19: 1141–1152.
23. Johnson RA, Wichern DW (2007) Applied Multivariate Statistical Analysis: Prentice Hall.
24. Lozupone C, Knight R (2005) UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 71: 8228–8235.
25. Gianoulis TA, Raes J, Patel PV, Bjornson R, Korbel JO, et al. (2009) Quantifying environmental adaptation of metabolic pathways in metagenomics. Proc Natl Acad Sci U S A 106: 1374–1379.
26. Sellner KG, Doucette GJ, Kirkpatrick GJ (2003) Harmful algal blooms: causes, impacts and detection. J Ind Microbiol Biotechnol 30: 383–406.
27. Hildebrand MV (1993) The Birthday Problem. American Mathematical Monthly 100: 643.
28. Chao A (1984) Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics 11: 265–270.
29. Chao A, Ma M-C, Yang MCK (1993) Stopping rules and estimation for recapture debugging with unequal failure rates. Biometrika 80: 193–201.
30. Heltshe JF, Forrester NE (1983) Estimating species richness using the jackknife procedure. Biometrics 39: 1–11.
31. Colwell RK, Coddington JA (1994) Estimating terrestrial biodiversity through extrapolation. Phil Trans R Soc London B 345: 101–118.
32. Shannon CE (1948) A mathematical theory of communication. Bell System Technical Journal 27: 379–423, 623–656.
33. Simpson EH (1949) Measurement of diversity. Nature 163: 688.
34. Bray JR, Curtis JT (1957) An ordination of upland forest communities of southern Wisconsin. Ecological Monographs 27: 325–349.
35. Huber T, Faulkner G, Hugenholtz P (2004) Bellerophon: a program to detect chimeric sequences in multiple sequence alignments. Bioinformatics 20: 2317–2319.
36. Brodie EL, Desantis TZ, Joyner DC, Baek SM, Larsen JT, et al. (2006) Application of a high-density oligonucleotide microarray approach to study bacterial population dynamics during uranium reduction and reoxidation. Appl Environ Microbiol 72: 6288–6298.
37. Schatz MC, Phillippy AM, Gajer P, DeSantis TZ, Andersen GL, et al. (2010) Integrated microbial survey analysis of prokaryotic communities for the PhyloChip microarray. Appl Environ Microbiol 76: 5636–5638.
38. Riesenfeld CS, Schloss PD, Handelsman J (2004) Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 38: 525–552.
39. Chen K, Pachter L (2005) Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLoS Comput Biol 1: 106–112.
40. Gilbert JA, Field D, Huang Y, Edwards R, Li W, et al. (2008) Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities. PLoS One 3: e3042.
41. Booijink CC, Boekhorst J, Zoetendal EG, Smidt H, Kleerebezem M, et al. (2010) Metatranscriptome analysis of the human fecal microbiota reveals subject-specific expression profiles, with genes encoding proteins involved in carbohydrate metabolism being dominantly expressed. Appl Environ Microbiol 76: 5533–5540.
42. Verberkmoes NC, Russell AL, Shah M, Godzik A, Rosenquist M, et al. (2009) Shotgun metaproteomics of the human distal gut microbiota. ISME J 3: 179–189.
43. Li X, LeBlanc J, Truong A, Vuthoori R, Chen SS, et al. (2011) A metaproteomic approach to study human-microbial ecosystems at the mucosal luminal interface. PLoS One 6: e26542.
44. Turnbaugh PJ, Gordon JI (2008) An invitation to the marriage of metagenomics and metabolomics. Cell 134: 708–713.
45. Wikoff WR, Anfora AT, Liu J, Schultz PG, Lesley SA, et al. (2009) Metabolomics analysis reveals large effects of gut microflora on mammalian blood metabolites. Proc Natl Acad Sci U S A 106: 3698–3703.
46. Wilmes P, Bond PL (2006) Metaproteomics: studying functional gene expression in microbial ecosystems. Trends Microbiol 14: 92–97.
47. Poretsky RS, Hewson I, Sun S, Allen AE, Zehr JP, et al. (2009) Comparative day/night metatranscriptomic analysis of microbial communities in the North Pacific subtropical gyre. Environ Microbiol 11: 1358–1375.
48. Shi Y, Tyson GW, DeLong EF (2009) Metatranscriptomics reveals unique microbial small RNAs in the ocean's water column. Nature 459: 266–269.
49. Giannoukos G, Ciulla DM, Huang K, Haas BJ, Izard J, et al. (2012) Efficient and robust RNA-seq process for cultured bacteria and complex community transcriptomes. Genome biology 13: R23.
50. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, et al. (2001) Initial sequencing and analysis of the human genome. Nature 409: 860–921.
51. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. (2001) The sequence of the human genome. Science 291: 1304–1351.
52. (2012) A framework for human microbiome research. Nature 486: 215–221.
53. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, et al. (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428: 37–43.
54. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66–74.
55. Hugenholtz P, Tyson GW (2008) Microbiology: metagenomics. Nature 455: 481–483.
56. Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P (2008) A bioinformatician's guide to metagenomics. Microbiol Mol Biol Rev 72: 557–578, Table of Contents.
57. Sogin ML, Morrison HG, Huber JA, Mark Welch D, Huse SM, et al. (2006) Microbial diversity in the deep sea and the underexplored rare biosphere. Proc Natl Acad Sci U S A 103: 12115–12120.
58. Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, et al. (2007) Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods 4: 495–500.
59. Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, et al. (2009) A core gut microbiome in obese and lean twins. Nature 457: 480–484.
60. Abubucker S, Segata N, Goll J, Schubert AM, Izard J, et al. (2012) Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLoS computational biology 8: e1002358.
61. Hoff KJ, Lingner T, Meinicke P, Tech M (2009) Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Res 37: W101–105.
62. Rho M, Tang H, Ye Y (2010) FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res.
63. Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M (2007) CAMERA: a community resource for metagenomics. PLoS Biol 5: e75.
64. Nagarajan N, Cook C, Di Bonaventura M, Ge H, Richards A, et al. (2010) Finishing genomes with limited resources: lessons from an ensemble of microbial genomes. BMC Genomics 11: 242.
65. Pop M (2009) Genome assembly reborn: recent computational challenges. Brief Bioinform 10: 354–366.
66. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, et al. (2009) BLAST+: architecture and applications. BMC Bioinformatics 10: 421.
67. Teeling H, Meyerdierks A, Bauer M, Amann R, Glockner FO (2004) Application of tetranucleotide frequencies for the assignment of genomic fragments. Environ Microbiol 6: 938–947.
68. McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I (2007) Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 4: 63–72.
69. Brady A, Salzberg SL (2009) Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods 6: 673–676.
70. Salzberg SL, Pertea M, Delcher AL, Gardner MJ, Tettelin H (1999) Interpolated Markov models for eukaryotic gene finding. Genomics 59: 24–31.
71. Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14: 755–763.
72. Haft DH, Selengut JD, White O (2003) The TIGRFAMs database of protein families. Nucleic Acids Res 31: 371–373.
73. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, et al. (2008) The Pfam protein families database. Nucleic Acids Res 36: D281–288.
74. Veiga P, Gallini CA, Beal C, Michaud M, Delaney ML, et al. (2010) Bifidobacterium animalis subsp. lactis fermented milk product reduces inflammation by altering a niche for colitogenic microbes. Proc Natl Acad Sci U S A.
75. Markowitz VM, Ivanova NN, Szeto E, Palaniappan K, Chu K, et al. (2008) IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res 36: D534–538.
76. Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, et al. (2008) The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9: 386.
77. Goll J, Rusch D, Tanenbaum DM, Thiagarajan M, Li K, et al. (2010) METAREP: JCVI Metagenomics Reports - an open source tool for high-performance comparative metagenomics. Bioinformatics.
78. Eisen JA (2007) Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes. PLoS Biol 5: e82.
79. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10: R25.
80. Li H, Durbin R (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26: 589–595.
81. Rodriguez-Brito B, Rohwer F, Edwards RA (2006) An application of statistics to comparative metagenomics. BMC Bioinformatics 7: 162.
82. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol 5: e77.
83. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–29.
84. Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M (2010) KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res 38: D355–360.
85. NC-IUBMB (1999) Nomenclature committee of the international union of biochemistry and molecular biology (NC-IUBMB), Enzyme Supplement 5 (1999). Eur J Biochem 264: 610–650.
86. Ren Q, Chen K, Paulsen IT (2007) TransportDB: a comprehensive database resource for cytoplasmic membrane transport systems and outer membrane channels. Nucleic Acids Res 35: D274–279.
87. Emanuelsson O, Brunak S, von Heijne G, Nielsen H (2007) Locating proteins in the cell using TargetP, SignalP and related tools. Nat Protoc 2: 953–971.
88. Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, et al. (2010) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 38: D5–16.
89. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4: 41.
90. Muller J, Szklarczyk D, Julien P, Letunic I, Roth A, et al. (2010) eggNOG v2.0: extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations. Nucleic Acids Res 38: D190–195.
91. Caspi R, Altman T, Dale JM, Dreher K, Fulcher CA, et al. (2010) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 38: D473–479.
92. Nelson KE, Weinstock GM, Highlander SK, Worley KC, Creasy HH, et al. (2010) A catalog of reference genomes from the human microbiome. Science 328: 994–999.
93. Izui K, Matsumura H, Furumoto T, Kai Y (2004) Phosphoenolpyruvate carboxylase: a new era of structural biology. Annu Rev Plant Biol 55: 69–84.
94. Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M (2007) KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res 35: W182–185.
95. Ye Y, Doak TG (2009) A parsimony approach to biological pathway reconstruction/inference for genomes and metagenomes. PLoS Comput Biol 5: e1000465.
96. Reed JL, Patel TR, Chen KH, Joyce AR, Applebee MK, et al. (2006) Systems approach to refining genome annotation. Proc Natl Acad Sci U S A 103: 17480–17484.
97. Satish Kumar V, Dasika MS, Maranas CD (2007) Optimization based automated curation of metabolic reconstructions. BMC Bioinformatics 8: 212.
98. Green ML, Karp PD (2004) A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 5: 76.
99. Durot M, Bourguignon PY, Schachter V (2009) Genome-scale models of bacterial metabolism: reconstruction and applications. FEMS Microbiol Rev 33: 164–190.
100. Ghosh D, Poisson LM (2009) Omics data and levels of evidence for biomarker discovery. Genomics 93: 13–16.
101. Hirschhorn JN, Daly MJ (2005) Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6: 95–108.
102. Freilich S, Kreimer A, Borenstein E, Yosef N, Sharan R, et al. (2009) Metabolic-network-driven analysis of bacterial ecological strategies. Genome Biol 10: R61.
103. Tepper N, Shlomi T (2010) Predicting metabolic engineering knockout strategies for chemical production: accounting for competing pathways. Bioinformatics 26: 536–543.



104. Stolyar S, Van Dien S, Hillesland KL, Pinel N, Lie TJ, et al. (2007) Metabolic modeling of a mutualistic microbial community. Mol Syst Biol 3: 92.
105. Thiele I, Palsson BO (2010) A protocol for generating a high-quality genome-scale metabolic reconstruction. Nat Protoc 5: 93–121.
106. Lorenz P, Eck J (2005) Metagenomics and industrial applications. Nat Rev Microbiol 3: 510–516.
107. Sommer MO, Church GM, Dantas G (2010) A functional metagenomic approach for expanding the synthetic biology toolbox for biomass conversion. Mol Syst Biol 6: 360.
108. Faust K, Sathirapongsasuti JF, Izard J, Segata N, Gevers D, et al. (2012) Microbial Co-occurrence Relationships in the Human Microbiome. PLoS computational biology 8: e1002606.
109. Little AE, Robinson CJ, Peterson SB, Raffa KF, Handelsman J (2008) Rules of engagement: interspecies interactions that regulate microbial communities. Annu Rev Microbiol 62: 375–401.
110. Vartoukian SR, Palmer RM, Wade WG (2010) Strategies for culture of unculturable bacteria. FEMS Microbiol Lett 309: 1–7.
111. Vaishampayan PA, Kuehl JV, Froula JL, Morgan JL, Ochman H, et al. (2010) Comparative metagenomics and population dynamics of the gut microbiota in mother and infant. Genome Biol Evol 2010: 53–66.
112. Trosvik P, Stenseth NC, Rudi K (2010) Convergent temporal dynamics of the human infant gut microbiota. ISME J 4: 151–158.
113. Jia W, Li H, Zhao L, Nicholson JK (2008) Gut microbiota: a potential new territory for drug targeting. Nat Rev Drug Discov 7: 123–129.
114. Round JL, Mazmanian SK (2009) The gut microbiota shapes intestinal immune responses during health and disease. Nat Rev Immunol 9: 313–323.
115. Grice EA, Kong HH, Conlan S, Deming CB, Davis J, et al. (2009) Topographical and temporal diversity of the human skin microbiome. Science 324: 1190–1192.
116. Frank DN, Feazel LM, Bessesen MT, Price CS, Janoff EN, et al. (2010) The human nasal microbiota and Staphylococcus aureus carriage. PLoS One 5: e10598.
117. Segata N, Haake SK, Mannon P, Lemon KP, Waldron L, et al. (2012) Composition of the adult digestive tract bacterial microbiome based on seven mouth surfaces, tonsils, throat and stool samples. Genome biology 13: R42.
118. Dewhirst FE, Chen T, Izard J, Paster BJ, Tanner AC, et al. (2010) The Human Oral Microbiome. J Bacteriol.
119. Guarner F, Malagelada JR (2003) Gut flora in health and disease. Lancet 361: 512–519.
120. Blaser MJ, Falkow S (2009) What are the consequences of the disappearing human microbiota? Nat Rev Microbiol 7: 887–894.
121. Dominguez-Bello MG, Costello EK, Contreras M, Magris M, Hidalgo G, et al. (2010) Delivery mode shapes the acquisition and structure of the initial microbiota across multiple body habitats in newborns. Proc Natl Acad Sci U S A 107: 11971–11975.
122. Weinberger DM, Trzcinski K, Lu YJ, Bogaert D, Brandes A, et al. (2009) Pneumococcal capsular polysaccharide structure predicts serotype prevalence. PLoS Pathog 5: e1000476.
123. Cox MJ, Allgaier M, Taylor B, Baek MS, Huang YJ, et al. (2010) Airway microbiota and pathogen abundance in age-stratified cystic fibrosis patients. PLoS One 5: e11044.
124. Nicholson JK, Holmes E, Wilson ID (2005) Gut microorganisms, mammalian metabolism and personalized health care. Nat Rev Microbiol 3: 431–438.
125. Garrett WS, Gordon JI, Glimcher LH (2010) Homeostasis and inflammation in the intestine. Cell 140: 859–870.
126. Dethlefsen L, Relman DA (2010) Microbes and Health Sackler Colloquium: Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation. Proc Natl Acad Sci U S A.
127. Dethlefsen L, Huse S, Sogin ML, Relman DA (2008) The pervasive effects of an antibiotic on the human gut microbiota, as revealed by deep 16S rRNA sequencing. PLoS Biol 6: e280.
128. Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, et al. (2012) Human gut microbiome viewed across age and geography. Nature 486: 222–227.
129. Kurokawa K, Itoh T, Kuwahara T, Oshima K, Toh H, et al. (2007) Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res 14: 169–181.
130. Koenig JE, Spor A, Scalfone N, Fricker AD, Stombaugh J, et al. (2010) Microbes and Health Sackler Colloquium: Succession of microbial consortia in the developing infant gut microbiome. Proc Natl Acad Sci U S A.
131. Claesson MJ, Cusack S, O'Sullivan O, Greene-Diniz R, de Weerd H, et al. (2010) Microbes and Health Sackler Colloquium: Composition, variability, and temporal stability of the intestinal microbiota of the elderly. Proc Natl Acad Sci U S A.
132. Claesson MJ, Jeffery IB, Conde S, Power SE, O'Connor EM, et al. (2012) Gut microbiota composition correlates with diet and health in the elderly. Nature 488: 178–184.
133. Wu GD, Chen J, Hoffmann C, Bittinger K, Chen YY, et al. (2011) Linking long-term dietary patterns with gut microbial enterotypes. Science 334: 105–108.
134. Spencer MD, Hamp TJ, Reid RW, Fischer LM, Zeisel SH, et al. (2011) Association between composition of the human gastrointestinal microbiome and development of fatty liver with choline deficiency. Gastroenterology 140: 976–986.
135. Zhang C, Zhang M, Wang S, Han R, Cao Y, et al. (2010) Interactions between gut microbiota, host genetics and diet relevant to development of metabolic syndromes in mice. ISME J 4: 232–241.
136. Dethlefsen L, McFall-Ngai M, Relman DA (2007) An ecological and evolutionary perspective on human-microbe mutualism and disease. Nature 449: 811–818.
137. Muegge BD, Kuczynski J, Knights D, Clemente JC, Gonzalez A, et al. (2011) Diet drives convergence in gut microbiome functions across mammalian phylogeny and within humans. Science 332: 970–974.
138. Ley RE, Turnbaugh PJ, Klein S, Gordon JI (2006) Microbial ecology: human gut microbes associated with obesity. Nature 444: 1022–1023.
139. Ley RE, Backhed F, Turnbaugh P, Lozupone CA, Knight RD, et al. (2005) Obesity alters gut microbial ecology. Proc Natl Acad Sci U S A 102: 11070–11075.
140. Samuel BS, Gordon JI (2006) A humanized gnotobiotic mouse model of host-archaeal-bacterial mutualism. Proc Natl Acad Sci U S A 103: 10011–10016.
141. Rawls JF, Samuel BS, Gordon JI (2004) Gnotobiotic zebrafish reveal evolutionarily conserved responses to the gut microbiota. Proc Natl Acad Sci U S A 101: 4596–4601.
142. Rawls JF, Mahowald MA, Ley RE, Gordon JI (2006) Reciprocal gut microbiota transplants from zebrafish and mice to germ-free recipients reveal host habitat selection. Cell 127: 423–433.
143. Ivanov, II, Atarashi K, Manel N, Brodie EL, Shima T, et al. (2009) Induction of intestinal Th17 cells by segmented filamentous bacteria. Cell 139: 485–498.
144. Ivanov, II, Littman DR (2010) Segmented filamentous bacteria take the stage. Mucosal Immunol 3: 209–212.
145. Turnbaugh PJ, Ridaura VK, Faith JJ, Rey FE, Knight R, et al. (2009) The effect of diet on the human gut microbiome: a metagenomic analysis in humanized gnotobiotic mice. Sci Transl Med 1: 6ra14.
146. Ley RE (2010) Obesity and the human microbiome. Curr Opin Gastroenterol 26: 5–11.
147. Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, et al. (2006) An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444: 1027–1031.
148. Turnbaugh PJ, Backhed F, Fulton L, Gordon JI (2008) Diet-induced obesity is linked to marked but reversible alterations in the mouse distal gut microbiome. Cell Host Microbe 3: 213–223.
149. Marsh PD (2006) Dental plaque as a biofilm and a microbial community - implications for health and disease. BMC Oral Health 6 Suppl 1: S14.
150. Nasidze I, Li J, Quinque D, Tang K, Stoneking M (2009) Global diversity in the human salivary microbiome. Genome Res 19: 636–643.
151. Zijnge V, van Leeuwen MB, Degener JE, Abbas F, Thurnheer T, et al. (2010) Oral biofilm architecture on natural teeth. PLoS One 5: e9321.
152. Guggenheim M, Shapiro S, Gmur R, Guggenheim B (2001) Spatial arrangements and associative behavior of species in an in vitro oral biofilm model. Appl Environ Microbiol 67: 1343–1350.
153. Yoshida Y, Palmer RJ, Yang J, Kolenbrander PE, Cisar JO (2006) Streptococcal receptor polysaccharides: recognition molecules for oral biofilm formation. BMC Oral Health 6 Suppl 1: S12.
154. Jenkinson HF, Lamont RJ (2005) Oral microbial communities in sickness and in health. Trends Microbiol 13: 589–595.
155. Ley RE, Hamady M, Lozupone C, Turnbaugh PJ, Ramey RR, et al. (2008) Evolution of mammals and their gut microbes. Science 320: 1647–1651.
156. Hehemann JH, Correc G, Barbeyron T, Helbert W, Czjzek M, et al. (2010) Transfer of carbohydrate-active enzymes from marine bacteria to Japanese gut microbiota. Nature 464: 908–912.
157. Sekirov I, Finlay BB (2009) The role of the intestinal microbiota in enteric infection. J Physiol 587: 4159–4167.
158. van de Guchte M, Penaud S, Grimaldi C, Barbe V, Bryson K, et al. (2006) The complete genome sequence of Lactobacillus bulgaricus reveals extensive and ongoing reductive evolution. Proc Natl Acad Sci U S A 103: 9274–9279.
159. Martin FP, Wang Y, Sprenger N, Yap IK, Lundstedt T, et al. (2008) Probiotic modulation of symbiotic gut microbial-host metabolic interactions in a humanized microbiome mouse model. Mol Syst Biol 4: 157.
160. O'Hara AM, Shanahan F (2006) The gut flora as a forgotten organ. EMBO Rep 7: 688–693.
161. Zimmer C (2010) How Microbes Defend and Define Us. The New York Times. New York, NY.
162. Khoruts A, Dicksved J, Jansson JK, Sadowsky MJ (2010) Changes in the composition of the human fecal microbiome after bacteriotherapy for recurrent Clostridium difficile-associated diarrhea. J Clin Gastroenterol 44: 354–360.
163. Borody TJ (2000) Flora Power fecal bacteria cure chronic C. difficile diarrhea. Am J Gastroenterol 95: 3028–3029.
164. Degnan PH, Ochman H (2011) Illumina-based analysis of microbial community diversity. The ISME journal.
165. Wooley JC, Godzik A, Friedberg I (2010) A primer on metagenomics. PLoS Comput Biol 6: e1000667.
166. Mitra S, Klar B, Huson DH (2009) Visual and statistical comparison of metagenomes. Bioinformatics 25: 1849–1855.
167. Atlas RM, Bartha R (1997) Microbial Ecology: Fundamentals and Applications: Benjamin Cummings.



168. Pace NR (1997) A molecular view of microbial diversity and the biosphere. Science 276: 734–740.
169. Raes J, Bork P (2008) Molecular eco-systems biology: towards an understanding of community function. Nat Rev Microbiol 6: 693–699.
170. Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, et al. (2008) KEGG for linking genomes to life and the environment. Nucleic Acids Res 36: D480–484.
171. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, et al. (2005) The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 33: 5691–5702.
172. Morgan XC, Segata N, Huttenhower C (in press) Biodiversity and functional genomics in the human microbiome. Trends Genet. doi:10.1016/j.tig.2012.09.005. Epub ahead of print 7 November 2012.



Education

Chapter 13: Mining Electronic Health Records in the Genomics Era
Joshua C. Denny*
Departments of Biomedical Informatics and Medicine, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America

Abstract: The combination of improved genomic analysis methods, decreasing genotyping costs, and increasing computing resources has led to an explosion of clinical genomic knowledge in the last decade. Similarly, healthcare systems are increasingly adopting robust electronic health record (EHR) systems that not only can improve health care, but also contain a vast repository of disease and treatment data that could be mined for genomic research. Indeed, institutions are creating EHR-linked DNA biobanks to enable genomic and pharmacogenomic research, using EHR data for phenotypic information. However, EHRs are designed primarily for clinical care, not research, so reuse of clinical EHR data for research purposes can be challenging. Difficulties in use of EHR data include: data availability, missing data, incorrect data, and vast quantities of unstructured narrative text data. Structured information includes billing codes, most laboratory reports, and other variables such as physiologic measurements and demographic information. Significant information, however, remains locked within EHR narrative text documents, including clinical notes and certain categories of test results, such as pathology and radiology reports. For relatively rare observations, combinations of simple free-text searches and billing codes may prove adequate when followed by manual chart review. However, to extract the large cohorts necessary for genome-wide association studies, natural language processing methods to process narrative text data may be needed. Combinations of structured and unstructured textual data can be mined to generate high-validity collections of cases and controls for a given condition. Once high-quality cases and controls are identified, EHR-derived cases can be used for genomic discovery and validation. Since EHR data includes a broad sampling of clinically-relevant phenotypic information, it may enable multiple genomic investigations upon a single set of genotyped individuals. This chapter reviews several examples of phenotype extraction and their application to genetic research, demonstrating a viable future for genomic discovery using EHR-linked data.

This article is part of the "Translational Bioinformatics" collection for PLOS Computational Biology.

Citation: Denny JC (2012) Chapter 13: Mining Electronic Health Records in the Genomics Era. PLoS Comput Biol 8(12): e1002823. doi:10.1371/journal.pcbi.1002823

Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America

Published December 27, 2012

Copyright: © 2012 Joshua C. Denny. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This article was supported in part by grants from the National Library of Medicine R01 LM 010685 and the National Human Genome Research Institute U01 HG004603. The funders had no role in the preparation of the manuscript.

Competing Interests: The author has declared that no competing interests exist.

* E-mail: josh.denny@vanderbilt.edu

1. Introduction and Motivation

Typical genetic research studies have used purpose-built cohorts or observational studies for genetic research. As of 2012, more than 1000 genome-wide association analyses have been performed, not to mention a vast quantity of candidate gene studies [1]. Many of these studies have investigated multiple disease and phenotypic traits within a single patient cohort, such as the Wellcome Trust [2] and Framingham research cohorts [3–5]. Typically, patient questionnaires and/or research staff are used to ascertain phenotypic traits for a patient. While these study designs may offer high validity and repeatability in their assessment of a given trait, these models are typically very costly and often represent only a cross-section of time. In addition, rare diseases may take a significant time to accrue in these datasets.

Another model that is gaining acceptance is genetic discovery based solely or partially from phenotype information derived solely from the electronic health record (EHR) [6]. In these models, a hospital collects DNA for research, and maintains a linkage between the DNA sample and the EHR data for that patient. The primary source of phenotypic information, therefore, is the EHR. Depending on the design of the biobank model, some EHR-linked biobanks have the ability to supplement EHR-accrued data with purpose-collected research data.

The EHR model for genetic research offers several key advantages, but also faces prominent challenges to successful implementation. A primary advantage is cost. EHRs contain a longitudinal record of robust clinical data that is produced as a byproduct of routine clinical care. Thus, it is a rich, real-world dataset that requires little additional funding to obtain. Both study designs share costs for obtaining and storing DNA.

Another advantage of EHR-linked DNA databanks is the potential to reuse genetic information to investigate a broad range of additional phenotypes beyond the original study. This is particularly true for dense genetic data such as generated through genome-wide association studies or large-scale sequencing data. For instance, a patient may be genotyped once as part of a study on diabetes, and then later participate in another analysis for cardiovascular disease.

Major efforts in EHR DNA biobanking are underway at a number of institutions. One of the major driving forces has been the National Human Genome Research Institute (NHGRI)-sponsored Electronic Medical Records and Genomics (eMERGE) network [7], which began in

PLOS Computational Biology | www.ploscompbiol.org 1 December 2012 | Volume 8 | Issue 12 | e1002823


What to Learn in This Chapter

• Describe the types of information available in Electronic Health Records (EHRs), and the relative sensitivity and positive predictive value of each
• Describe the difference between unstructured and structured information in the EHR
• Describe methods for developing accurate phenotype algorithms that integrate structured and unstructured EHR information, and the roles played by billing codes, laboratory values, medication data, and natural language processing
• Describe recent uses of EHR-derived phenotypes to study genome-phenome relationships
• Describe the cost advantages unique to EHR-linked biobanks, and the ability to reuse genetic data for many studies
• Understand the role of EHRs to enable phenome-wide association studies of genetic variants

2007 and, as of 2012, consists of nine sites that are performing genome-wide association studies using phenotypic data derived from the EHR. The National Institutes of Health (NIH)-sponsored Pharmacogenomics Research Network (PGRN) also includes sites performing genetic research using EHR data as their source of phenotypic data. Another example is the Kaiser Permanente Research Program on Genes, Environment and Health, which has genotyped 100,000 members with linked EHR data [8].

2. Classes of Data Available in EHRs

EHRs are designed primarily to support clinical care, billing, and, increasingly, other functions such as quality improvement initiatives aimed at improving the health of a population. Thus, the types of data and the methods of storing them are optimized to support these missions. The primary types of information available from EHRs are: billing data, laboratory results and vital signs, provider documentation, documentation from reports and tests, and medication records. Billing data and many laboratory results are available in most systems as structured "name-value pair" data. Clinical documentation, many test results (such as echocardiograms and radiology testing), and medication records are often found in narrative or semi-narrative text formats. Researchers creating electronic phenotype algorithms (discussed in Section 6.2) typically utilize multiple types of information (e.g., billing codes, laboratory results, medication data, and/or NLP) to achieve high accuracy when identifying cases and controls from the EHR.

Table 1 summarizes the types of data available in the EHR and their strengths and weaknesses.

2.1 Billing Data

Billing data typically consists of codes derived from the International Classification of Diseases (ICD) and Current Procedural Terminology (CPT). ICD is a hierarchical terminology of diseases, signs, symptoms, and procedure codes maintained by the World Health Organization (WHO). While the majority of the world uses ICD version 10, the United States (as of 2012) uses ICD version 9-CM; the current Centers for Medicare & Medicaid Services guidelines mandate a transition to ICD-10-CM in the United States by October 1, 2014. Because of their widespread use as required components for billing, and due to their ubiquity within EHR systems, billing codes are frequently used for research purposes [9–14]. Prior research has demonstrated that such administrative data can have poor sensitivity and specificity [15,16]. Despite this, they remain an important part of more complex phenotype algorithms that achieve high performance [17–19].

CPT codes are created and maintained by the American Medical Association. They serve as the chief coding system providers use to bill for clinical services. Typically, CPTs are paired with ICD codes, the latter providing the reason (e.g., a disease or symptom) for a clinical encounter or procedure. This satisfies the requirements of insurers, who require certain allowable diagnoses and symptoms to pay for a given procedure. For example, insurance companies will pay for a brain magnetic resonance imaging (MRI) scan that is ordered for a number of complaints (such as known cancers or symptoms such as headache), but not for unrelated symptoms such as chest pain.

Within the context of establishing a particular diagnosis from EHR data, CPT codes tend to have high specificity but low sensitivity, while ICD-9 codes have comparatively lower specificity but higher sensitivity. For instance, to establish the diagnosis of coronary artery disease, one could look for a CPT code for coronary artery bypass surgery or percutaneous coronary angioplasty, or for one of several ICD-9 codes. If the CPT code is present, there is a high probability that the patient has a corresponding diagnosis of coronary disease. However, many patients without these CPT codes also have coronary disease, but either have not received these interventions or received them at a different hospital. In contrast, a clinician may bill an ICD-9 code for coronary disease based on clinical suspicion without a firm diagnosis. Figure 1 shows the results of a study that compared the use of natural language processing (NLP) and CPT codes to detect patients who have received colorectal cancer screening, via a colonoscopy within the last ten years, at one institution. In this study, only 61% (106 out of 174 total) of the documented completed colonoscopies were found via CPT codes [20]. The most common cause of false negatives was a colonoscopy completed at another hospital. CPT codes, however, had a very high precision (i.e., positive predictive value; see Box 1), with only one false positive.

2.2 Laboratory and Vital Signs

Laboratory data and vital signs form a longitudinal record of mostly structured data in the medical record. In addition to being stored as name-value pair data, these fields and values can be encoded using standard terminologies. The most common controlled vocabulary used to represent laboratory tests and vital signs is the Logical Observation Identifiers Names and Codes (LOINC®), which is a Consolidated Health Informatics standard for representation of laboratory and test names and is part of Health Level 7 (HL7) [21,22]. Despite the growing use of LOINC, many (perhaps most) hospital lab systems still use local dictionaries to encode laboratory results internally. Hospital laboratory systems or testing companies may change over time, resulting in different internal codes for the same test result. Thus, care is needed to implement selection logic based on laboratory results. Indeed, a 2009–2010 data standardization effort at Vanderbilt University Medical Center found that the concepts of weight and height each had more than five internal representations. Weights and heights were also recorded by different systems using different field names and stored internally with different units (e.g., kilograms, grams, and pounds for weight;



Table 1. Strengths and weaknesses of data classes within EHRs.

| | ICD codes | CPT codes | Laboratory Data | Medication records | Clinical Documentation |
|---|---|---|---|---|---|
| Availability in EHR systems | Near-universal | Near-universal | Near-universal | Variable | Variable |
| Recall | Medium | Poor | Medium | Inpatient: High; Outpatient: Variable | Medium |
| Precision | Medium | High | High | Inpatient: High; Outpatient: Variable | Medium-High |
| Fragmentation effect | Medium | High | Medium-High | Medium | Low-Medium |
| Query method | Structured | Structured | Mostly structured | Structured, text queries, and NLP | NLP, text queries, and rarely structured |
| Strengths | Easy to query; serves as a good first pass of disease status | Easy to query; high precision | Value depends on test; high data validity | Can have high validity | Best record of what providers thought |
| Weaknesses | Disease codes often used for screening when disease not actually present; accuracy hindered by billing realities and clinic workflow | Most susceptible to missing data errors (e.g., performed at another hospital); procedure receipt influenced by patient and payer factors external to disease process | May need to aggregate different variations of the same data elements; normal ranges and units may change over time | Often need to interface inpatient and outpatient records; medication records from outside providers not present; medications prescribed not necessarily taken | Difficult to process automatically; interpretation accuracy depends on assessment method; may suffer from significant "cut and paste"; not universally available in EHRs; may be self-contradictory |
| Summary | Essential first element for electronic phenotyping | Helpful addition if relevant | Helpful addition if relevant | Useful for confirmation and a marker of severity | Useful for confirming common diagnoses or for finding rare ones |

doi:10.1371/journal.pcbi.1002823.t001
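
Section 2.2 notes that weights were stored in kilograms, grams, and pounds by different systems, and Table 1 lists the need to aggregate variations of the same data element as a weakness of laboratory data. The following is a minimal sketch of how such values might be brought to a single unit before use in selection logic. The unit labels (`kg`, `g`, `lb`) and the function name `weight_to_kg` are illustrative assumptions, not names taken from any particular EHR.

```python
# Conversion factors to kilograms. The unit labels are hypothetical;
# a real system would map its own local field names and unit strings.
KG_PER_UNIT = {
    "kg": 1.0,
    "g": 0.001,
    "lb": 0.45359237,
}


def weight_to_kg(value, unit):
    """Convert a recorded weight to kilograms; raise on an unknown unit."""
    key = unit.strip().lower()
    if key not in KG_PER_UNIT:
        # Failing loudly is safer than silently mixing units.
        raise ValueError(f"unrecognized weight unit: {unit!r}")
    return value * KG_PER_UNIT[key]


# Three made-up records describing roughly the same weight.
records = [(70.0, "kg"), (70000.0, "g"), (154.0, "lb")]
weights_kg = [round(weight_to_kg(v, u), 1) for v, u in records]
print(weights_kg)
```

Raising an error on an unrecognized unit is a deliberate choice in this sketch: a value in an unexpected unit is more likely to be noticed than one that is silently passed through.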

Figure 1. Comparison of natural language processing (NLP) and CPT codes to detect completed colonoscopies in 200 patients. In
this study, more completed colonoscopies were found via NLP than with billing codes alone, and only one colonoscopy was found with billing codes
that was not found with NLP. NLP examples were reviewed for accuracy.
doi:10.1371/journal.pcbi.1002823.g001
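
As a worked example of the metrics defined in Box 1, the sketch below applies them to the CPT-code counts reported in Section 2.1: 106 of 174 documented completed colonoscopies were found, with one false positive. It assumes that the 174 documented colonoscopies are the gold standard positives. Specificity and negative predictive value are omitted because the text does not report true negatives. The function names are illustrative.

```python
def recall(tp, fn):
    """Sensitivity: true positives over gold standard positives."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Positive predictive value: true positives over all predicted positives."""
    return tp / (tp + fp)


def f_measure(r, p):
    """Harmonic mean of recall and precision."""
    return 2 * r * p / (r + p)


# Counts for CPT codes as reported in the text (assumed gold standard: 174).
tp = 106
fn = 174 - 106
fp = 1

r = recall(tp, fn)
p = precision(tp, fp)
f = f_measure(r, p)
print(f"recall={r:.2f} precision={p:.2f} F-measure={f:.2f}")
```

The recall of about 0.61 matches the 61% quoted in the text, while the precision is close to 1, which illustrates the "high specificity but low sensitivity" pattern described for CPT codes.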



Box 1. Metrics Commonly Used to Evaluate Phenotype Selection Algorithms

$$\text{Sensitivity (Recall)} = \frac{\text{True Positives}}{\text{Gold standard positives}}$$

$$\text{Specificity} = \frac{\text{True Negatives}}{\text{Gold standard negatives}}$$

$$\text{Positive Predictive Value (PPV, Precision)} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

$$\text{Negative Predictive Value (NPV)} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Negatives}}$$

$$F\text{-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$

centimeters, meters, inches, and feet for height).

Structured laboratory results are often a very important component of phenotype algorithms, and can represent targets for genomic investigation [3,4,23]. An algorithm to identify type 2 diabetes (T2D) cases and controls, for instance, used laboratory values (e.g., hemoglobin A1c and glucose values) combined with billing codes and medication mentions [17]. Similarly, an algorithm to determine genomic determinants of normal cardiac conduction required normal electrolyte (potassium, calcium, and magnesium) values [16]. In these settings, investigation of the determinants of the values requires careful selection of the value to be investigated. For instance, an analysis of determinants of uric acid or red blood cell indices would exclude patients treated with certain antineoplastic agents (which can increase uric acid or suppress erythrocyte production), and, similarly, an analysis of white blood cell indices would exclude patients with active infections and certain medications at the time of the laboratory measurement.

2.3 Provider Documentation

Clinical documentation represents perhaps the richest and most diverse source of phenotype information. Provider documentation is required for nearly all billing of tests and clinical visits, and is frequently found in EHR systems. To be useful for phenotyping efforts, clinical documentation must be in the form of electronically-available text that can be used for subsequent manual review, text searches, or NLP. Such documents can be created via computer-based documentation (CBD) systems or dictated and transcribed. The most common form of computable text is unstructured narrative text documents, although a number of developers have also created structured documentation tools [24]. Narrative text documents can be processed by text queries or by NLP systems, as discussed in Section 3.

For some phenotypes, crucial documents may only be available as hand-written documents, and thus not amenable to text searching or NLP. Unavailability may result from clinics that are slow adopters, have very high patient volumes, or have specific workflows not well accommodated by the EHR system [25]. However, these hand-written documents may be available electronically as scanned copies. Recent efforts have shown that intelligent character recognition (ICR) software may be useful for processing scanned documents containing hand-written fields (Figure 2) [26,27]. This task can be challenging, however, and works best when the providers are completing pre-formatted forms.

2.4 Documentation from Reports and Tests

Provider-generated reports and test results include radiology and pathology reports and some procedure results such as echocardiograms. They are often in the form of narrative text results. Many of these contain a mixture of structured and unstructured results. Examples include an electrocardiogram report, which typically has structured interval durations and may contain a structured field indicating whether the test was abnormal or not. However, most electrocardiogram (ECG) reports also contain a narrative text "impression" representing the cardiologist's interpretation of the result (e.g., "consider anterolateral myocardial ischemia" or "Since last ECG, patient has developed atrial fibrillation") [28]. For ECGs, the structured content (e.g., the intervals measured on the ECG) is generated using automated algorithms and has varying accuracy [29].

2.5 Medication Records

Medication records serve an important role in accurate phenotype characterization. They can be used to increase the precision of case identification, and to help ensure that patients believed to be controls do not actually have the disease. Medications received by a patient serve as confirmation that the treating physician believed the disease was present to a sufficient degree that they prescribed a treating medication. It is particularly helpful to find the presence or absence of medications highly specific or sensitive for the disease. For instance, a patient with diabetes will receive either oral or injectable hypoglycemic agents; these medications are both highly sensitive and specific for treating diabetes, and can also be used to help differentiate type 1 diabetes (treated almost exclusively with insulin) from T2D (which is typically a disease of insulin resistance and thus can be treated with a combination of oral and injectable hypoglycemic agents).

Medication records can be in varying forms within an electronic record. With the increased use of computerized provider order entry (CPOE) systems to manage hospital stays, inpatient medication records are often available in highly structured records that may be mapped to controlled vocabularies. In addition, many hospital systems are installing automated bar-code medication administration records by which hospital staff record each individual drug administration for each patient [30]. With this information, accurate drug exposures and their times can be



Figure 2. Use of Intelligent Character Recognition to codify handwriting. Figure courtesy of Luke Rasmussen, Northwestern University.
doi:10.1371/journal.pcbi.1002823.g002
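
Section 2.5 describes focused free-text searching for a set of medications, for which the researcher supplies the brand names, generics, and abbreviations to look for. The sketch below illustrates that idea with a case-insensitive, whole-word search. The term list is a hypothetical and deliberately incomplete example built from the atenolol/Tenormin pair mentioned in this chapter, and the helper names are assumptions made for illustration.

```python
import re

# Hypothetical term list for one medication; a real list would also cover
# combination products and local abbreviations.
ATENOLOL_TERMS = ["atenolol", "tenormin"]


def build_pattern(terms):
    """Compile one case-insensitive, whole-word pattern for all terms."""
    alternatives = "|".join(re.escape(t) for t in terms)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def mentions_medication(note, pattern):
    """Return True if any listed term appears in the note text."""
    return pattern.search(note) is not None


pattern = build_pattern(ATENOLOL_TERMS)
notes = [
    "Continue Tenormin 50 mg daily.",
    "Patient stopped metoprolol last year.",
]
hits = [mentions_medication(n, pattern) for n in notes]
print(hits)
```

As the chapter points out, a search of this kind has to be rebuilt for each medication and does not recover dose, frequency, or duration. It also does not distinguish a current prescription from a negated mention or a listed allergy.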

constructed for each inpatient. Even without electronic medication administration records (such as bar-code systems), research has shown that CPOE-ordered medications are given with fairly high reliability [31].

Outpatient medication records are often recorded via narrative text entries within clinical documentation, patient problem lists, or communications with patients through telephone calls or patient portals. Many EHR systems have incorporated outpatient prescribing systems, which create structured medical records during generation of new prescriptions and refills. However, within many EHR systems, electronic prescribing tools are optional, not yet widely adopted, or have only been used within recent history. Thus, accurate construction of a patient's medication exposure history often requires NLP techniques. For specific algorithms, focused free-text searching for a set of medications can be efficient and effective [17]. This approach requires the researcher to generate the list of brand names, generics, combination medications, and abbreviations that would be used, but has the advantage that it can be easily accomplished using relational database queries. The downside is that this approach requires re-engineering for each medication or set of medications to be searched, and does not allow for the retrieval of other medication data, such as dose, frequency, and duration. A more general-purpose approach can be achieved with NLP, which is discussed in greater detail in Section 3 below.

3. Natural Language Processing to Support Clinical Knowledge Extraction

Although many documentation tools include structured and semi-structured elements, the vast majority of computer-based documentation (CBD) remains in "natural language" narrative formats [24]. Thus, to be useful for data mining, narrative data must be processed through use of text-searching (e.g., keyword searching) or NLP systems. Keyword searching can effectively identify rare physical exam findings in text [32], and extension to use of regular expression pattern matching has been used to extract blood pressure readings [33]. NLP computer algorithms scan and parse unstructured "free-text" documents, applying syntactic and semantic rules to extract structured representations of the information content, such as concepts recognized from a controlled terminology [34–37]. Early NLP efforts to extract medical concepts from clinical text documents focused on coding in the Systematized Nomenclature of Pathology or the ICD for financial and billing purposes [38], while more recent efforts often use complete versions of the Unified Medical Language System (UMLS) [39–41], SNOMED-CT [16], and/or domain-specific vocabularies such as RxNorm for medication extraction [42]. NLP systems utilize varying approaches to "understanding" text, including rule-based and statistical approaches using syntactic and/or semantic information. Natural language



processors can achieve classification rates similar to those of manual reviewers, and can be superior to keyword searches. A number of researchers have demonstrated the effectiveness of NLP for large-scale text-processing tasks. Melton and Hripcsak used MedLEE to recognize instances of adverse events in hospital discharge summaries [43]. Friedman and colleagues evaluated NLP for pharmacovigilance to discover adverse drug events from clinical records by using statistical methods that associate extracted UMLS disease concepts with extracted medication names [40]. These studies show the potential for NLP to aid in specific phenotype recognition.

Using either NLP systems or keyword searching, the primary task in identifying a particular phenotype is to filter out concepts (or keywords) within a corpus of documents that indicate statements other than the patient having the disease. Researchers may desire to specify particular document types (e.g., documents within a given domain, problem lists, etc.) or particular types of visits or specialists (e.g., requiring a visit with an ophthalmologist). Some common NLP tasks needed in phenotype classification include identifying family medical history context and negated terms (e.g., "no cardiac disease"), and removing drug allergies when searching for patients taking a certain medication. Recognition of sections within documents can be handled using structured section labels, specialized NLP systems such as SecTag [44], or more general-purpose NLP systems such as MedLEE [45] or HITEX [46]. A number of solutions have been proposed for negation detection; among the more widespread are adaptations of the NegEx algorithm developed by Chapman et al., which uses a series of negation phrases and boundary words to identify negated text [47]. NegEx or similar algorithms can be used as a standalone system or be integrated within a number of general-purpose NLP systems including MedLEE [48], the KnowledgeMap concept identifier [49], cTAKES [50], and the National Library of Medicine's MetaMap [51].

Medication information extraction is an important area for clinical applications that benefits from specialized NLP tools. Most general-purpose NLP systems will recognize medications by the medication ingredient mentioned in the text but may not identify the relevant medication metadata such as dose, frequency, and route. In addition, a general-purpose NLP system using as its vocabulary the UMLS will likely recognize "atenolol" and "Tenormin" (a United States brand name for atenolol) as two different concepts, since each is represented by a separate concept in the UMLS. Medication-specific NLP systems focus on extracting such metadata for a medication. Sirohi and Peissig applied a commercial medication NLP system to derive structured medication information [52], which was later linked to laboratory data and used to explore the pharmacodynamics of statin efficacy (a cholesterol-lowering medication) [53]. Xu et al. developed a similar system at Vanderbilt called MedEx, which had recall and precision ≥0.90 for discharge summaries and clinic notes on Vanderbilt clinical documents [42]. Additionally, the 2009 i2b2 NLP challenge focused on medication extraction using de-identified discharge summaries from Partners Healthcare, and 20 teams competed to identify medications and their signatures. The best systems achieved F-measures ≥0.80 [54]. Much work remains to be done in this area, as extraction of both medication names and associated signature information can be challenging when considering the full breadth of clinical documentation formats available, including provider-staff and provider-patient communications, which often contain less formal and misspelled representations of prescribed medications.

For more information on NLP methods and applications, please see the article on text mining elsewhere in this collection (submitted).

4. EHR-Associated Biobanks: Enabling EHR-Based Genomic Science

DNA biobanks associated with EHR systems can be composed of either "all comers" or a focused collection, and pursue either a conventional consented "opt-in" or an "opt-out" approach. Currently, the majority of DNA biobanks have an opt-in approach that selects patients for particular research studies. Two population-based models in the eMERGE network are the Personalized Medicine Research Population (PMRP) project of the Marshfield Clinic (Marshfield, WI) [55] and Northwestern University's NUgene project (Chicago, IL). The PMRP project selected 20,000 individuals who receive care in the geographic region of the Marshfield Clinic. These patients have been consented, surveyed, and have given permission to the investigators for recontact in the future if additional information is needed. The NUgene project, which has enrolled nearly 10,000 people through 2012, uses a similar approach, obtaining patients' consent during outpatient clinic visits [56]. Another example of an EHR-associated biobank is the Kaiser Permanente biobank, which has genotyped 100,000 individuals [57].

The alternative "opt-out" approach is evidenced by Vanderbilt University's BioVU, which associates DNA with de-identified EHR data [58]. In this model, patients have the opportunity to "opt out" of the DNA biobank by checking a box on the standard "Consent to Treatment" form signed as part of routine clinical care. A majority of patients (>90%) do not check this box, indicating assent to the use of their DNA in the biobank [58]. If the patient does not opt out, blood that is scheduled to be discarded after routine laboratory testing is instead sent for DNA extraction, and the DNA is stored for potential future use. To ensure that no one knows with certainty if a subject's DNA is in BioVU, an additional small percentage of patients are randomly excluded.

The BioVU model requires that the DNA and associated EHR data be de-identified in order to assure that the model complies with the policies of non-human subjects research. The full text of the EHR undergoes a process of de-identification with software programs that remove Health Insurance Portability and Accountability Act (HIPAA) identifiers from all clinical documentation in the medical record. At the time of this writing, text de-identification for BioVU is performed using the commercial product DE-ID [59] with additional pre- and post-processing steps. However, a number of other clinical text de-identification software packages have been studied, some of which are open source [60,61]. Multiple reviews by both the local institutional review board and the federal Office for Human Research Protections have affirmed this status as non-human subjects research according to 45 CFR 46 [58]. Nonetheless, all research conducted within BioVU and the associated de-identified EHR (called the "Synthetic Derivative") is overseen by the local Institutional Review Board. An opt-out model similar to BioVU is used by Partners Healthcare for the Crimson biobank, which can accrue patients who meet specific phenotype criteria as they have routine blood draws.

An advantage of the opt-out approach is rapid sample accrual. BioVU began col-



Figure 3. General figure for identifying cases and controls using EHR data. Application of electronic selection algorithms leads to division of a population of patients into four groups, the largest of which comprises patients who were excluded because they lack sufficient evidence to be either a case or control patient. Definite cases and controls cross some predefined threshold of positive predictive value (e.g., PPV ≥ 95%), and thus do not require manual review. For very rare phenotypes or complicated case definitions, the category of possible cases may need to be reviewed manually to increase the sample size.
doi:10.1371/journal.pcbi.1002823.g003
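
Section 3 describes negation detection with NegEx as relying on a series of negation phrases and boundary words. The sketch below illustrates only that general idea and is not the published NegEx algorithm: a mention counts as negated when a negation phrase precedes it with no boundary word in between. The phrase lists are short hypothetical examples, and the function name is an assumption made for illustration.

```python
import re

# Hypothetical, heavily abbreviated phrase lists for illustration only.
NEGATION_PHRASES = {"no", "denies", "without"}
BOUNDARY_WORDS = {"but", "however", "although"}


def is_negated(sentence, concept):
    """Report whether the first mention of `concept` lies in a negated scope.

    A negation phrase opens a scope; a boundary word closes it.
    Returns False when the concept is not mentioned at all.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    target = concept.lower().split()
    in_negated_scope = False
    for i, token in enumerate(tokens):
        if tokens[i:i + len(target)] == target:
            return in_negated_scope
        if token in NEGATION_PHRASES:
            in_negated_scope = True
        elif token in BOUNDARY_WORDS:
            in_negated_scope = False
    return False


print(is_negated("No cardiac disease.", "cardiac disease"))
print(is_negated("Denies chest pain but has cardiac disease.", "cardiac disease"))
```

A sketch this small ignores multi-word negation phrases, negations that follow the concept, and family history context, all of which production systems have to handle.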

lecting DNA samples in 2007, adding be entered through a variety of sources. single measure of the overall performance
about 500 new samples weekly, and has Most commonly, administrative staff re- of an algorithm that can be used to
over 150,000 subjects as of September cord race/ethnicity via structured data compare two algorithms or selection
2012. Since it enrolls subjects prospective- collection tools in the EHR. Often, this logics. Since the scale of the graph is 0 to
ly, investigation of rare phenotypes may be field can be ignored (left as unknown), 1 on both axes, the performance of a
possible with such systems. The major especially in busy clinical environments, perfect algorithm is 1, and random chance
disadvantage of the opt-out approach is such as emergency departments. Un- is 0.5.
that it precludes recontact of the patients known percentages of patients can range
since their identity has been removed. between 9% and 23% of subjects [17,18]. 6.2 Creation of Phenotype Selection
However, the Synthetic Derivative is Among those patients for whom data is Logic
continually updated as new information entered, a study of genetic ancestry infor- Initial work in phenotype detection has
is added to the EHR, such that the mative markers correlated well with EHR- often focused on a single modality of EHR
amount of phenotypic information for reported race/ethnicities [64]. In addition, data. A number of studies have used
included patients grows over time. a study within the Veterans Administration
billing data, some comparing directly to
(VA) hospital system noted that over 95%
other genres of data, such as NLP. Li et al.
5. Race and Ethnicity in EHR- of all EHR-derived race/ethnicity agreed
compared the results of ICD-9 encoded
with self-reported race/ethnicity using
Derived Biobanks diagnoses and NLP-processed discharge
nearly one million records [65]. Thus,
summaries for clinical trial eligibility
Given that much genetic information despite concerns over EHR-derived ances-
queries, finding that use of NLP provided
varies greatly within ancestral populations, tral information, such information, when
more valuable data sources for clinical trial
accurate knowledge of genetic ancestry present, appears similar to self-report
pre-screening than ICD-9 codes [15].
information is essential to allow for proper ancestry information.
Savova et al. has used cTAKES to
genetic study design and control of discover peripheral arterial disease cases
population stratification. Without it, one 6. Phenotype-Driven Discovery by looking for particular key words in
can see numerous spurious genetic associ- in EHRs radiology reports, and then aggregating
ations due solely to race/ethnicity [62]. 6.1 Measure of Phenotype Selection the individual instances using AND-OR-
Single nucleotide polymorphisms (SNPs) Logic Performance NOT Boolean logic to classify cases into
common in one population may be rare in The evaluation of phenotype selection four categories: positive, negative, proba-
another. In large-scale GWA analyses, one logic can use metrics similar to informa- ble, and unknown [66].
can tolerate less accurate knowledge of tion retrieval tasks. Common metrics are Phenotype algorithms can be created
ancestry a priori, since the large amount of sensitivity (or recall), specificity, positive multiple ways, depending of the rarity of
genetic data allows one to calculate the predictive value (PPV, also known as the phenotype, the capabilities of the EHR
genetic ancestry of the subject using precision), and negative predictive value system, and the desired sample size of the
catalogs of SNPs known to vary between (see Box 1). If a population is assessed for study. Generally, phenotype selection logics
races. Alternatively, one can also adjust for case and control status, then another (algorithms) are composed of one or more
genetic ancestry using tools such as useful metric is comparing the receiver of four elements: billing code data, other
EIGENSTRAT [63]. However, in smaller operator characteristic (ROC) curves. structured (coded) data such as laboratory
candidate gene studies, it is important to ROC curves graph the sensitivity vs. false values and demographic data, medication
know the ancestry beforehand. positive rate (or, 1-specificity) given a information, and NLP-derived data. Struc-
Self-reported race/ethnicity data is often continuous measure of the outcome of tured data can be retrieved effectively from
used in genetic studies. In contrast race/ the algorithm. By calculating the area most EHR systems. These data can be
ethnicity as recorded within an EHR may under the ROC curve (AUC), one has a combined through simple Boolean logic

[17] or through machine learning methods such as logistic regression [18], to achieve a predefined specificity or
positive predictive value. A drawback to the use of machine learning methods (such as logistic regression models)
is that they may not be as portable to other EHR systems as simpler Boolean logic, depending on how the models are
constructed. The application of many phenotype selection logics can be thought of as partitioning individuals
into four buckets: definite cases (with sufficiently high PPV), possible cases (which can be manually reviewed if
needed), controls (which do not have the disease with acceptable PPV), and individuals excluded from the analysis
due to either potentially overlapping diagnoses or insufficient evidence (Figure 3).

For many algorithms, sensitivity (or recall) is not necessarily evaluated, assuming there are an adequate number
of cases. A possible concern in not evaluating recall (sensitivity) of a phenotype algorithm is that there may be
a systematic bias in how patients were selected. For example, consider a hypothetical algorithm to find patients
with T2D whose logic was to select all patients that had at least one billing code for T2D and also required that
cases receive an oral hypoglycemic medication. This algorithm may be highly specific for finding patients with
T2D (instead of type 1 diabetes), but would miss those patients who had progressed in disease severity such that
oral hypoglycemic agents no longer worked and who now require insulin treatment. Thus, this phenotype algorithm
could miss the more severe cases of T2D. However, for a practical application, such assessments of recall can be
challenging given large sample sizes of rare diseases. Certain assumptions (e.g., that a patient should have at
least one billing code for the disease) are reasonable and likely do not lead to significant bias.

For other algorithms, the temporal relationships of certain elements are very important. Consider an algorithm to
determine whether a certain combination of medications adversely impacted a given lab, such as kidney function or
glucose [67]. Such an algorithm would need to take into account the temporal sequence and time between the
particular medications and laboratory tests. For example, glucose changes within minutes to hours of a single
administration of insulin, but the development of glaucoma from corticosteroids (a known side effect) would not
be expected to happen acutely following a single dose.

For very rare diseases or findings, one may desire to find every case, and thus the logic may simply be a union
of keyword text queries and billing codes followed by manual review of all returned cases. Examples include the
rare physical exam finding hippus (exaggerated pupillary oscillations occurring in the setting of altered mental
status) [32], or potential drug adverse events (e.g., Stevens-Johnson syndrome), which are often very rare but
severe.

Since EHRs represent longitudinal records of patient care, they are biased toward those events that are recorded
as part of medical care. Thus, they are particularly useful for investigating disease-based phenotypes, but
potentially less efficacious for investigating non-disease phenotypes such as hair or eye color, left vs. right
handedness, cognitive attributes, biochemical measures (beyond routine labs), etc. On the other hand, they may be
particularly useful for analyzing disease progression over time.

Table 2. Methods of finding cases and controls for genetic analysis of five common diseases.

Disease              | Methods                                                       | Cases | Controls | Case PPV | Control PPV
Atrial fibrillation  | NLP of ECG impressions; ICD9 codes; CPT codes                 | 168   | 1695     | 98%      | 100%
Crohn's Disease      | ICD9 codes; Medications (text)                                | 116   | 2643     | 100%     | 100%
Type 2 Diabetes      | ICD9 codes; Medications (text); Text searches (controls)      | 570   | 764      | 100%     | 100%
Multiple Sclerosis   | ICD9 codes or text diagnosis                                  | 66    | 1857     | 87%*     | 100%
Rheumatoid Arthritis | ICD9 codes; Medications (text); Text searches (exclusions)    | 170   | 701      | 97%      | 100%

*Given the small number of multiple sclerosis cases, all possible cases were manually validated to ensure high recall.
doi:10.1371/journal.pcbi.1002823.t002

7. Examples of Genetic Discovery Using EHRs

The growth of EHR-driven genomic research (EDGR), that is, genomic research proceeding primarily from EHR data
linked to DNA samples, is a recent phenomenon [6]. Preceding these most recent research initiatives, other studies
laid the groundwork for use of EHR data to study genetic phenomena. Rzhetsky et al. used billing codes from the
EHRs of 1.5 million patients to analyze disease co-occurrence in 161 conditions as a proxy for possible genetic
overlap [68]. Chen et al. compared laboratory measurements and age with gene expression data to identify rates of
change that correlated with genes known to be involved in aging [69]. A study at Geisinger Clinic evaluated SNPs
in the 9p21 region that are known to be associated with cardiovascular disease and early myocardial infarction
[70]. They found these SNPs were associated with heart disease and T2D using EHR-derived data. Several specific
examples of EDGR are detailed below.

7.1 Replicating Known Genetic Associations for Five Diseases

An early replication study of known genetic associations with five diseases was performed in BioVU. The study was
designed to test the hypothesis that an EHR-linked DNA biobank could be used for genetic association analyses.
The goal was to use only EHR data for phenotype information. The first 10,000 samples accrued in BioVU were
genotyped at 21 SNPs that are known to be associated with these five diseases (atrial fibrillation, Crohn's
disease, multiple sclerosis, rheumatoid arthritis,
and T2D). Reported odds ratios were 1.14-2.36 in at least two previous studies prior to the analysis. Automated
phenotype identification algorithms were developed using NLP techniques (to identify key findings, medication
names, and family history), billing code queries, and structured data elements (such as laboratory results) to
identify cases (n = 70-698) and controls (n = 808-3818). Final algorithms achieved PPV of ≥97% for cases and 100%
for controls on randomly selected cases and controls (Table 2) [17]. For each of the target diseases, the
phenotype algorithms were developed iteratively, with a proposed selection logic applied to a set of EHR subjects,
and random cases and controls evaluated for accuracy. The results of these reviews were used to refine the
algorithms, which were then redeployed and reevaluated on a unique set of randomly selected records to provide
final PPVs.

Used alone, ICD9 codes had PPVs of 56-89% compared to a gold standard represented by the final algorithm. Errors
were due to coding errors (e.g., typos), misdiagnoses from non-specialists (e.g., a non-specialist diagnosed a
patient as having rheumatoid arthritis followed by a rheumatologist who revised the diagnosis to psoriatic
arthritis), and indeterminate diagnoses that later evolved into well-defined ones (e.g., a patient thought to
have Crohn's disease was later determined to have ulcerative colitis, another type of inflammatory bowel
disease). Each of the 21 tests of association yielded point estimates in the expected direction, and eight of the
known associations achieved statistical significance [17].

Table 3. eMERGE network participants.

Institution | Biorepository Overview | Model | Size | EHR Summary | Phenotyping Methods
Group Health¹ (Seattle, WA) | GHC Biobank: Alzheimer's Disease Patient Registry and Adult Changes in Thought Study | Disease specific cohort | 4000 | Comprehensive vendor-based EHR since 2004 | Structured data extraction, NLP
Marshfield Clinic Research Foundation¹ (Marshfield, WI) | Personalized Medicine Research Project: Marshfield Clinic, an integrated regional health system | Population based | 20,000 | Comprehensive internally developed EHR since 1985 | Structured data extraction, NLP, Intelligent Character Recognition
Mayo Clinic¹ (Rochester, MN) | Disease cohort: derived from vascular laboratory & exercise stress testing labs | Disease specific cohorts | 16,500 | Comprehensive internally developed EHR since 1995 | Structured data extraction, NLP
Northwestern University¹ (Chicago, IL) | NUgene Project: Northwestern affiliated hospitals and outpatient clinics | Population based | >10,000 | Comprehensive vendor-based inpatient and outpatient (different systems) EHR since 2000 | Structured data extraction, text searches, NLP
Vanderbilt University¹ (Nashville, TN) | BioVU: primarily drawn from outpatient routine laboratory samples | Population based | 150,000 | Comprehensive internally developed EHR since 2000 | Structured data extraction, NLP
Geisinger Health System² (Pennsylvania) | MyCode: enrollment of health plan participants | Population based | >30,000 | Comprehensive vendor-based EHR | Structured data extraction, NLP
Mount Sinai Medical Center² (New York, NY) | Institute for Personalized Medicine Biobank: outpatient enrollment | Population based | >30,000 | Comprehensive vendor-based EHR since 2004 | Structured data extraction, NLP
Cincinnati Children's Hospital³ (Cincinnati, OH) | General and disease cohorts | Population based | >3,000 | Comprehensive vendor-based EHR | Structured data extraction, NLP
Children's Hospital of Philadelphia³ (Philadelphia, PA) | General and disease cohorts | Population based | >100,000 | Comprehensive vendor-based EHR | Structured data extraction, NLP
Boston Children's³ (Boston, MA) | Crimson: on-demand, de-identified phenotype-driven collection | Disease based | Virtual | Comprehensive internally developed EHR | Structured data extraction, NLP

Sizes represent approximate sizes as of 2012; many sites are still actively recruiting. NLP = Natural Language Processing.
Sites joined with ¹eMERGE-I in 2007, ²eMERGE-II in 2011, or as ³pediatric sites in 2012.
doi:10.1371/journal.pcbi.1002823.t003

7.2 Demonstrating Multiethnic Associations with Rheumatoid Arthritis

Using a logistic regression algorithm operating on billing data, NLP-derived features, medication records, and
laboratory data, Liao et al. developed an algorithm to accurately identify rheumatoid arthritis patients [18].
Kurreeman et al. used this algorithm on EHR data to identify a population of 1,515 cases and 1,480 matched
controls [71]. These researchers genotyped 29 SNPs that had been associated with RA in at least one prior study.
Sixteen of these SNPs achieved statistical significance, and 26/29 had odds ratios in the same direction and with
similar effect sizes. The authors also demonstrated that portions of these risk alleles were associated with
rheumatoid arthritis in
East Asian, African, and Hispanic American populations.

7.3 eMERGE Network

The eMERGE network is composed of nine institutions as of 2012 (http://gwas.org; Table 3). Each site has a DNA
biobank linked to robust, longitudinal EHR data. The initial goal of the eMERGE network was to investigate the
feasibility of genome-wide association studies using EHR data as the primary source for phenotypic information.
Each of these sites initially set out to investigate one or two primary phenotypes (Table 3). Network sites have
currently created and evaluated electronic phenotype algorithms for 14 different primary and secondary
phenotypes, with nearly 30 more planned. After defining phenotype algorithms, each site then performed
genome-wide genotyping at one of two NIH-supported genotyping centers.

The primary goals of an algorithm are to perform with high precision (≥95%) and reasonable recall. Algorithms
incorporate billing codes, laboratory and vital signs data, test and procedure results, and clinical
documentation. NLP is used to both increase recall (find additional cases) and achieve greater precision (via
improved specificity). These phenotype algorithms are available for download from PheKB (http://phekb.org).

Initial plans were for each site to analyze their own phenotypes independently. However, the network has realized
the benefits of synergy. Central efforts across the network were involved in harmonization of the collective
genetic data.

7.4 Early Genome-Wide Association Studies from the eMERGE Network

As of 2012, the eMERGE Network has published GWAS on atrioventricular conduction [72], red blood cell [23] and
white blood cell [73] traits, primary hypothyroidism [74], and erythrocyte sedimentation rate [75], with others
ongoing. The first two studies published by the network were single-site GWAS; later studies have realized the
advantage of pooling data across multiple sites to increase the sample size available for a study. Importantly,
several studies in eMERGE have explicitly evaluated the portability of the electronic phenotype algorithms by
reviewing algorithms at multiple sites. Evaluation of the hypothyroidism algorithm at the five eMERGE-I sites,
for instance, noted an overall weighted PPV of 92.4% and 98.5% for cases and controls, respectively [74]. Similar
results have been found with T2D [76], cataracts [27], and rheumatoid arthritis [77] algorithms.

As a case study, the GWAS for atrioventricular conduction (as measured by the PR interval on the ECG), conducted
entirely within samples drawn from one site, identified variants in SCN10A. SCN10A is a sodium channel expressed
in autonomic nervous system tissue and is now known to be involved in cardiac regulation. The phenotype algorithm
identified patients with normal ECGs who did not have evidence of prior heart disease, were not on medications
that would interfere with cardiac conduction, and had normal electrolytes. The phenotype algorithm used NLP and
billing code queries to search for the presence of prior heart disease and medication use [72]. Of note, the
algorithm highlights the importance of using clinical note section tagging and negation to exclude only those
patients with heart disease, as opposed to patients whose records contained negated heart disease concepts (e.g.,
"no myocardial infarction") or heart disease concepts in related individuals (e.g., "mother died of a heart
attack"). Use of NLP improved recall of cases by 129% compared with simple text searching, while maintaining a
positive predictive value of 97% (Figure 4) [78,72].

Figure 4. Use of NLP to identify patients without heart disease for a genome-wide analysis of normal cardiac conduction. Using
simple text searching, 1564 patients would have been eliminated unnecessarily due to negated terms, family medical history of heart disease, or low
dose medication use that would not affect measurements on the electrocardiogram. Use of NLP improves recall of these cases without sacrificing
positive predictive value. The final case cohort represented the patients used for GWAS in [71].
doi:10.1371/journal.pcbi.1002823.g004

The study of RBC traits identified four variants associated with RBC traits. One of these, SLC17A1, had not been
previously identified, and is involved in sodium-phosphate co-transport in the kidney. The latter study of RBC
traits utilized patients genotyped at one site as cases and controls
for their primary phenotype of peripheral arterial disease (PAD). Thus, this represents an in silico GWAS for a
new finding that did not require new genotyping, but instead leveraged the available data within the EHR. The
eMERGE study of primary hypothyroidism, similarly, identified a novel association with FOXE1, a thyroid
transcription factor, without any new genotyping by using samples derived from five eMERGE sites.

7.5 Phenome-Wide Association Studies (PheWAS)

Typical genetic analyses investigate many genetic loci against a single trait or disease. Such analyses cannot
identify pleiotropic associations, and may miss important confounders in an analysis. Another approach,
engendered by the rich phenotype record included in the EHR, is to simultaneously investigate many phenotypes
associated with a given genetic locus. A phenome-wide association study (PheWAS) is, in a sense, a reverse GWAS.
PheWAS investigations require large representative patient populations with definable phenotypic
characteristics. Such studies only recently became feasible, facilitated by linkage of DNA biorepositories to EHR
systems, which can provide a comprehensive, longitudinal record of disease.

The first PheWAS studies were performed on 6,005 patients genotyped for five SNPs with seven previously known
disease associations [79]. This PheWAS used ICD9 codes linked to a code-translation table that mapped ICD9 codes
to 776 disease phenotypes. In this study, PheWAS methods replicated four of seven previously known associations
with p < 0.011. Figure 5 shows one illustrative PheWAS plot of phenotype associations with an HLA-DRA SNP known
to be associated with multiple sclerosis. Of note, this PheWAS not only demonstrates a strong association between
this SNP and multiple sclerosis, but also highlights other possible associations, such as Type 1 diabetes and
acquired hypothyroidism. Recent explorations into PheWAS methods using NLP have shown greater efficacy for
detecting associations: with the same patients, NLP-based PheWAS replicated six of the seven known associations,
generally with more significant p-values [80].

Figure 5. A PheWAS plot for rs3135388 in HLA-DRA. This region has known associations with multiple sclerosis. The red line indicates
statistical significance at Bonferroni correction. The blue line represents p < 0.05. This plot is generated from updated data from [78] and the updated
PheWAS methods as described in [73].
doi:10.1371/journal.pcbi.1002823.g005

PheWAS methods may be particularly useful for highlighting pleiotropy and clinically associated diseases. For
example, an early GWAS for T2D identified, among others, FTO loci as an associated variant [81]. A later GWAS
demonstrated this risk association was mediated through the effect of FTO on increasing body mass index, and thus
increasing risk of T2D within those individuals. Such effects may be identified through broad phenome scans made
possible through PheWAS.
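
The PheWAS idea described above (one genetic variant tested against many phenotypes, with a Bonferroni-corrected
significance line as in Figure 5) can be sketched in a few lines of Python. This is a minimal illustration under
stated assumptions, not the method used in [79]: the choice of Fisher's exact test, the function names
`fisher_exact_two_sided` and `phewas`, and the toy cohort of 40 patients with two made-up phenotypes are all
hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over every table that is no more probable than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))
    return min(1.0, total)


def phewas(patients, carriers, phenotypes, alpha=0.05):
    """Test one variant against every phenotype (assumed inputs are sets of patient ids)."""
    threshold = alpha / len(phenotypes)  # Bonferroni correction over all phenotypes
    non_carriers = patients - carriers
    rows = []
    for name, affected in phenotypes.items():
        a = len(carriers & affected)       # carriers with the phenotype
        b = len(carriers - affected)       # carriers without it
        c = len(non_carriers & affected)   # non-carriers with the phenotype
        d = len(non_carriers - affected)   # non-carriers without it
        p = fisher_exact_two_sided(a, b, c, d)
        rows.append((name, p, p < threshold))
    return sorted(rows, key=lambda r: r[1])  # most significant first


# Hypothetical toy cohort: ids 0-19 carry the variant, ids 20-39 do not.
patients = set(range(40))
carriers = set(range(20))
phenotypes = {
    "phenotype_A": set(range(15)) | {20, 21},          # enriched in carriers
    "phenotype_B": set(range(5)) | set(range(20, 25)),  # evenly distributed
}
results = phewas(patients, carriers, phenotypes)
for name, p, significant in results:
    print(f"{name}: p = {p:.2e}, passes Bonferroni threshold: {significant}")
```

In a real study the phenotype sets would come from a code-translation table (such as the ICD9-to-phenotype
mapping mentioned above), and a regression model adjusting for covariates such as genetic ancestry would usually
be preferable to an unadjusted exact test.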

8. Conclusions and Future Directions

EHRs have long been seen as a vehicle to improve healthcare quality, cost, and safety. However, their growing
adoption in the United States and elsewhere is demonstrating their capability as a broad tool for research.
Enabling tools include enterprise data warehouses and software to process unstructured information, such as
de-identification and NLP. When linked to biological data such as DNA or tissue biorepositories, EHRs can become
a powerful tool for genomic analysis. One can imagine future repositories also storing intermittent plasma
samples to allow for proteomic analyses.

A key advantage of EHR-based genetic studies is that they allow for the collection of phenotype information as a
byproduct of routine healthcare. Moreover, this information collection grows over time and is continually refined
as new information may confirm or refute a diagnosis for a given individual. Through the course of one's life, a
number of information points concerning disease, response to treatment, and laboratory and test data are
collected. Aggregation of this information can allow for generation of large sample sizes of patients with
certain diseases or medication exposures. Moreover, once a subject receives dense genotyping for one EHR-based
study, their genetic data can be reused for many other genotypic studies, allowing for relatively low-cost reuse
of the genetic material (once a given phenotype can be found in the EHR).

Three major rate-limiting steps impede utilization of EHR data for genetic analysis. A major challenge is
derivation of accurate collections of cases and controls for a given disease of interest, usually achieved
through creation and validation of phenotype selection logics. These algorithms take significant time and effort
to develop and often require adjustment and a skilled team to deploy at a secondary site. Another challenge is
the availability of phenotypic information. Many patients may be observed at a given healthcare facility only for
certain types of care (e.g., primary care or a certain subspecialist), leading to fragmented knowledge of a
patient's medical history and medication exposures. Future growth of Health Information Exchanges could
substantially improve these information gaps. Finally, DNA biobanks require a significant institutional
investment and ongoing financial, ethical, and logistical support to run effectively. Thus, they are not
ubiquitous.

As genomics moves beyond discovery into clinical practice, the future of personalized medicine is one in which
our genetic information could be simply a click of the mouse away [82]. In this future, DNA-enabled EHR systems
will assist in more accurate prescribing, risk stratification, and diagnosis. Genomic discovery in EHR systems
provides a real-world test bed to validate and discover clinically meaningful genetic effects.

9. Exercises

1) Compare and contrast the basic types of data available in an Electronic Health Record (EHR) that are useful
   for mining genetic data. What are some of the strengths and drawbacks of each type of data?
2) Explain what a phenotype algorithm is and why it is necessary. For example, how can use of natural language
   processing improve upon use of billing codes alone?
3) Select a clinical disease and design a phenotype algorithm for it.
4) How might a phenotype algorithm be different for a very rare disease (e.g., prion diseases) vs. a more common
   one (e.g., Type 2 diabetes)? How would a phenotype algorithm be different for a physical exam finding (e.g.,
   hippus or a particular type of heart murmur) vs. a disease?
5) Describe the differences between a DNA biobank linked to an EHR and one collected as part of a non-EHR
   research cohort. What are the advantages and disadvantages of a de-identified DNA biobank vs. an identified
   DNA biobank (either linked to an EHR or not)?
6) It is often harder to create algorithms to find drug-response phenotypes (such as adverse drug events) than
   for a chronic disease. Give several reasons why this might be.

Answers to the Exercises can be found in Text S1.

Supporting Information

Text S1 Answers to Exercises.
(DOCX)

Further Reading

• Shortliffe EH, Cimino JJ, editors (2006) Biomedical informatics: computer applications in health care and biomedicine. 3rd
  edition. Springer. 1064 p.
  Chapters of particular relevance: Chapter 2 (Biomedical data: their acquisition, storage, and use), Chapter 8 (Natural language
  and text processing in biomedicine), Chapter 12 (Electronic health record systems)
• Hristidis V, editor (2009) Information discovery on electronic health records. 1st edition. Chapman and Hall/CRC. 331 p.
  Chapters of particular relevance: Chapter 2 (Electronic health records), Chapter 4 (Data quality and integration issues in
  electronic health records), Chapter 7 (Data mining and knowledge discovery on EHRs).
• Wilke RA, Xu H, Denny JC, Roden DM, Krauss RM, et al. (2011) The emerging role of electronic medical records in
  pharmacogenomics. Clin Pharmacol Ther 89: 379-386. doi:10.1038/clpt.2010.260.
• Roden DM, Xu H, Denny JC, Wilke RA (2012) Electronic medical records as a tool in clinical pharmacology: opportunities and
  challenges. Clin Pharmacol Ther. Available: http://www.ncbi.nlm.nih.gov/pubmed/22534870. Accessed 30 June 2012.
• Kohane IS (2011) Using electronic health records to drive discovery in disease genomics. Nat Rev Genet 12: 417-428.
  doi:10.1038/nrg2999.

Glossary

• Candidate gene study: A study of specific genetic loci in which a phenotype-genotype association may exist (e.g.,
  hypothesis-led genotype experiment).
• Computer-based documentation (CBD): Any electronic note or report found within an EHR system. Typically, these can
  be dictated or typed directly into a note writer system (which may leverage templates) available within the EHR. Notably,
  CBD excludes scanned documents.
• Computerized Provider Order Entry (CPOE): A system for allowing a provider (typically a clinician or a nurse
  practitioner) to enter, electronically, an order for a patient. Typical examples include medication prescribing or test ordering.
  These systems allow for a precise electronic record of orders given and also can provide decision support to help improve
  care.
• Electronic Health Record (EHR): Any comprehensive electronic medical record system storing all the data about a
  patient's encounters with a healthcare system, including medical diagnoses, physician notes, and prescribing records. EHRs
  include CPOE and CBD systems (among others), and allow for easy information retrieval of clinical notes and results.
• Genome-wide association study (GWAS): A broad scale study of a number of points selected along a genome without
  using a prior hypothesis. Typically, these studies analyze more than 500,000 loci on the genome.
• Genotype: The specific DNA sequence at a given location.
• Natural language processing (NLP): Use of algorithms to create structured data from unstructured, narrative text
  documents. Examples include use of comprehensive NLP software solutions to find biomedical concepts in documents, as
  well as more focused applications of techniques to extract features from notes, such as blood pressure readings.
• Phenome-wide association study (PheWAS): A broad scale study of a number of phenotypes selected along the genome
  without regard to a prior hypothesis as to which phenotype(s) a given genetic locus may be associated with.
• Phenotype selection logic (or algorithm): A series of Boolean rules or machine learning algorithms incorporating such
  information as billing codes, laboratory values, medication records, and NLP, designed to derive a case and control
  population from EHR data.
• Phenotype: Any observable attribute of an individual.
• Single nucleotide polymorphism (SNP): A single locus on the genome that shows variation in the human population.
• Structured data: Data that is already recorded in a system in a structured name-value pair format and can be easily queried
  via a database.
• Unified Medical Language System (UMLS): A comprehensive metavocabulary maintained by the National Library of
  Medicine which combines >100 individual standardized vocabularies. The UMLS is composed of the Metathesaurus, the
  Specialist Lexicon, and the Semantic Network. The largest component of the UMLS is the Metathesaurus, which contains the
  term strings, concept groupings of terms, and concept interrelationships.
• Unstructured data: Data contained in narrative text documents such as the clinical notes generated by physicians and
  certain types of text reports, such as pathology results or procedures such as echocardiograms.
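
As an illustration of the "phenotype selection logic" entry above, the following Python sketch applies simple
Boolean rules to partition patients into the four groups of Figure 3 and then scores the rules against a manual
chart review using the metrics from Section 6.1. It is a hedged sketch only: the `Patient` fields, the specific
rules (loosely modeled on the hypothetical T2D algorithm in Section 6.2), and the eight-patient cohort with its
review labels are invented for illustration and do not reproduce any published algorithm.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    t2d_codes: int        # number of T2D billing codes (hypothetical field)
    t1d_codes: int        # number of type 1 diabetes billing codes
    on_oral_agent: bool   # receives an oral hypoglycemic medication
    glucose_tested: bool  # has at least one glucose lab on record


def classify(p):
    """Boolean selection logic returning one of four buckets."""
    if p.t2d_codes > 0 and p.t1d_codes > 0:
        return "excluded"        # potentially overlapping diagnoses
    if p.t2d_codes >= 1 and p.on_oral_agent:
        return "definite_case"
    if p.t2d_codes >= 1:
        return "possible_case"   # candidate for manual review
    if p.t1d_codes == 0 and p.glucose_tested:
        return "control"
    return "excluded"            # insufficient evidence


def metrics(predicted, truth):
    """Sensitivity, specificity, PPV, and NPV from paired booleans."""
    tp = sum(p and t for p, t in zip(predicted, truth))
    fp = sum(p and not t for p, t in zip(predicted, truth))
    fn = sum(not p and t for p, t in zip(predicted, truth))
    tn = sum(not p and not t for p, t in zip(predicted, truth))
    # This toy example has non-zero denominators; real code should guard them.
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical cohort paired with chart-review labels (True = has T2D).
cohort = [
    (Patient(2, 0, True, True), True),
    (Patient(1, 0, True, True), True),
    (Patient(3, 0, False, True), True),   # insulin-treated T2D, missed by the rule
    (Patient(1, 0, True, True), False),   # miscoded patient
    (Patient(0, 0, False, True), False),
    (Patient(0, 0, False, True), False),
    (Patient(0, 0, False, True), False),
    (Patient(1, 1, False, True), False),
]
labels = [classify(p) for p, _ in cohort]
scores = metrics([lab == "definite_case" for lab in labels], [t for _, t in cohort])
print(labels)
print(scores)
```

The third patient shows the recall concern raised in Section 6.2: requiring an oral hypoglycemic agent keeps the
rule specific but sends an insulin-treated T2D patient to the "possible case" bucket rather than the definite
cases.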

References
1. Hindorff LA, Sethupathy P, Junkins HA, Ramos 7. Manolio TA (2009) Collaborative genome-wide 12. Dean BB, Lam J, Natoli JL, Butler Q, Aguilar D,
EM, Mehta JP, et al. (2009) Potential etiologic association studies of diverse diseases: programs of et al. (2009) Use of Electronic Medical Records
and functional implications of genome-wide the NHGRIs office of population genomics. for Health Outcomes Research: A Literature
association loci for human diseases and traits. Pharmacogenomics 10: 235241. Review. Med Care Res Rev. Available: http://
Proc Natl Acad Sci USA 106: 93629367. 8. Kaiser Permanente, UCSF Scientists Complete www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd =
doi:10.1073/pnas.0903103106. NIH-Funded Genomics Project Involving 100,000 Retrieve&db = PubMed&dopt = Citationlistuids =
2. Wellcome Trust Case Control Consortium (2007) People (n.d.). Available: http://www.dor.kaiser. 19279318.
Genome-wide association study of 14,000 cases of org/external/news/press_releases/Kaiser_ 13. Elixhauser A, Steiner C, Harris DR, Coffey RM
seven common diseases and 3,000 shared con- Permanente,_UCSF_Scientists_Complete_NIH- (1998) Comorbidity measures for use with
trols. Nature 447: 661678. Funded_Genomics_Project_Involving_100,000_ administrative data. Medical care 36: 827.
3. Dehghan A, Kottgen A, Yang Q, Hwang S-J, Kao People/. Accessed 13 September 2011. 14. Charlson ME, Pompei P, Ales KL, MacKenzie
WL, et al. (2008) Association of three genetic loci 9. Herzig SJ, Howell MD, Ngo LH, Marcantonio CR (1987) A new method of classifying prognostic
with uric acid concentration and risk of gout: a ER (2009) Acid-suppressive medication use and comorbidity in longitudinal studies: development
genome-wide association study. Lancet 372: 1953 the risk for hospital-acquired pneumonia. Jama and validation. Journal of chronic diseases 40:
1961. doi:10.1016/S0140-6736(08)61343-4. 301: 21202128. 373383.
4. Benjamin EJ, Dupuis J, Larson MG, Lunetta KL, 10. Klompas M, Haney G, Church D, Lazarus R, 15. Li L, Chase HS, Patel CO, Friedman C, Weng C
Booth SL, et al. (2007) Genome-wide association Hou X, et al. (2008) Automated identification of (2008) Comparing ICD9-encoded diagnoses and
with select biomarker traits in the Framingham acute hepatitis B using electronic medical record NLP-processed discharge summaries for clinical
Heart Study. BMC Med Genet 8 Suppl 1: S11. data to facilitate public health surveillance. PLoS trials pre-screening: a case study. AMIA. Annual
doi:10.1186/1471-2350-8-S1-S11. ONE 3: e2626. doi:10.1371/journal.pone. Symposium proceedings/AMIA Symposium:
5. Kiel DP, Demissie S, Dupuis J, Lunetta KL, 0002626. 404408.
Murabito JM, et al. (2007) Genome-wide association 11. Kiyota Y, Schneeweiss S, Glynn RJ, Cannuscio 16. Elkin PL, Ruggieri AP, Brown SH, Buntrock J,
with bone mass and geometry in the Framingham CC, Avorn J, et al. (2004) Accuracy of Medicare Bauer BA, et al. (2001) A randomized controlled
Heart Study. BMC Med Genet 8 Suppl 1: S14. claims-based diagnosis of acute myocardial in- trial of the accuracy of clinical record retrieval using
6. Kohane IS (2011) Using electronic health records farction: estimating positive predictive value on SNOMED-RT as compared with ICD9-CM.
to drive discovery in disease genomics. Nat Rev the basis of review of hospital records. American Proceedings/AMIA. Annual Symposium: 159
Genet 12: 417428. doi:10.1038/nrg2999. heart journal 148: 99104. 163.

PLOS Computational Biology | www.ploscompbiol.org 13 December 2012 | Volume 8 | Issue 12 | e1002823


17. Ritchie MD, Denny JC, Crawford DC, Ramirez Expressions to Abstract Blood Pressure and tions. J Am Med Inform Assoc 17: 507513.
AH, Weiner JB, et al. (2010) Robust replication of Treatment Intensification Information from the doi:10.1136/jamia.2009.001560.
genotype-phenotype associations across multiple Text of Physician Notes. Journal of the American 51. Aronson AR, Lang F-M (2010) An overview of
diseases in an electronic medical record. Medical Informatics Association 13: 691695. MetaMap: historical perspective and recent
Am J Hum Genet 86: 560572. doi:10.1016/ doi:10.1197/jamia.M2078. advances. J Am Med Inform Assoc 17: 229
j.ajhg.2010.03.003. 34. Sager N, Lyman M, Bucknall C, Nhan N, Tick LJ 236. doi:10.1136/jamia.2009.002733.
18. Liao KP, Cai T, Gainer V, Goryachev S, Zeng- (1994) Natural language processing and the 52. Sirohi E, Peissig P (2005) Study of effect of drug
treitler Q, et al. (2010) Electronic medical records representation of clinical data. J Am Med Inform lexicons on medication extraction from electronic
for discovery research in rheumatoid arthritis. Assoc 1: 142160. medical records. Pac Symp Biocomput: 308318.
Arthritis Care Res (Hoboken) 62: 11201127. 35. Haug PJ, Ranum DL, Frederick PR (1990) 53. Wilke RA, Berg RL, Linneman JG, Zhao C,
doi:10.1002/acr.20184. Computerized extraction of coded findings from McCarty CA, et al. (2008) Characterization of low-
19. Conway M, Berg RL, Carrell D, Denny JC, Kho free-text radiologic reports. Work in progress. density lipoprotein cholesterol-lowering efficacy
AN, et al. (2011) Analyzing the heterogeneity and Radiology 174: 543548. for atorvastatin in a population-based DNA
complexity of electronic health record oriented 36. Friedman C, Hripcsak G, Shablinsky I (1998) An biorepository. Basic Clin Pharmacol Toxicol 103:
phenotyping algorithms. AMIA Annu Symp Proc evaluation of natural language processing meth- 354359. doi:10.1111/j.1742-7843.2008.00291.x.
2011: 274283. odologies. Proceedings/AMIA. Annual Sympo- 54. Uzuner O, Solti I, Cadag E (2010) Extracting
20. Denny JC, Peterson JF, Choma NN, Xu H, sium: 855859. medication information from clinical text. Journal
Miller RA, et al. (2010) Extracting timing and 37. Denny JC, Smithers JD, Miller RA, Spickard A of the American Medical Informatics Association
status descriptors for colonoscopy testing from (2003) Understanding medical school curricu- 17: 514518. doi:10.1136/jamia.2010.003947.
electronic medical records. J Am Med Inform lum content using KnowledgeMap. J Am Med 55. McCarty CA, Nair A, Austin DM, Giampietro
Assoc 17: 383388. doi:10.1136/jamia. Inform Assoc 10: 351362. PF (2007) Informed consent and subject motiva-
2010.004804. 38. Dunham GS, Pacak MG, Pratt AW (1978) tion to participate in a large, population-based
21. Huff SM, Rocha RA, McDonald CJ, De Moor GJ, Automatic indexing of pathology data. Journal genomics study: the Marshfield Clinic Personal-
Fiers T, et al. (1998) Development of the Logical of the American Society for Information Science ized Medicine Research Project. Community
Observation Identifier Names and Codes (LOINC) 29: 8190. Genet 10: 29. doi:10.1159/000096274.
vocabulary. J Am Med Inform Assoc 5: 276292. 39. Denny JC, Spickard A, Miller RA, Schildcrout J, 56. NUgene Project (n.d.). Available: https://www.
22. Logical Observation Identifiers Names and Codes Darbar D, et al. (2005) Identifying UMLS nugene.org/. Accessed 16 September 2012.
(2007). Available: http://www.regenstrief.org/ concepts from ECG Impressions using Knowl- 57. Kaiser Permanente, UCSF Scientists Complete
medinformatics/loinc/. edgeMap. AMIA. Annual Symposium proceed- NIH-Funded Genomics Project Involving 100,000
23. Kullo IJ, Ding K, Jouni H, Smith CY, Chute CG ings [electronic resource]/AMIA Symposium: People (n.d.). Available: http://www.dor.kaiser.
(2010) A genome-wide association study of red 196200. org/external/news/press_releases/Kaiser_
blood cell traits using the electronic medical record. 40. Wang X, Hripcsak G, Markatou M, Friedman C Permanente,_UCSF_Scientists_Complete_NIH-
PLoS ONE 5: e13011. doi:10.1371/journal. (2009) Active computerized pharmacovigilance Funded_Genomics_Project_Involving_100,000_
pone.0013011 using natural language processing, statistics, and People/. Accessed 13 September 2011.
24. Rosenbloom ST, Stead WW, Denny JC, Giuse D, electronic health records: a feasibility study. J Am 58. Roden DM, Pulley JM, Basford MA, Bernard
Lorenzi NM, et al. (2010) Generating Clinical Med Inform Assoc 16: 328337. GR, Clayton EW, et al. (2008) Development of a
Notes for Electronic Health Record Systems. 41. Meystre SM, Haug PJ (2008) Randomized large-scale de-identified DNA biobank to enable
Appl Clin Inform 1: 232243. doi:10.4338/ACI- controlled trial of an automated problem list with personalized medicine. Clinical pharmacology
2010-03-RA-0019. improved sensitivity. International journal of and therapeutics 84: 362369.
25. Rosenbloom ST, Denny JC, Xu H, Lorenzi N, medical informatics. Available: http://www. 59. Gupta D, Saul M, Gilbertson J (2004) Evaluation
Stead WW, et al. (2011) Data from clinical notes: a ncbi.nlm.nih.gov/entrez/query.fcgi?cmd = Retrieve of a deidentification (De-Id) software engine to
perspective on the tension between structure and &db = PubMed&dopt = Citation&list_uids = 18280787. share pathology reports and clinical documents
flexible documentation. J Am Med Inform Assoc 42. Xu H, Stenner SP, Doan S, Johnson KB, for research. American journal of clinical pathol-
18: 181186. doi:10.1136/jamia.2010.007237. Waitman LR, et al. (2010) MedEx: a medication ogy 121: 176186.
26. Rasmussen LV, Peissig PL, McCarty CA, Starren information extraction system for clinical narra- 60. Aberdeen J, Bayer S, Yeniterzi R, Wellner B,
J (2012) Development of an optical character tives. J Am Med Inform Assoc 17: 1924. Clark C, et al. (2010) The MITRE Identification
recognition pipeline for handwritten form fields doi:10.1197/jamia.M3378. Scrubber Toolkit: design, training, and assess-
from an electronic health record. Journal of the 43. Melton GB, Hripcsak G (2005) Automated ment. Int J Med Inform 79: 849859.
American Medical Informatics Association: JA- detection of adverse events using natural lan- doi:10.1016/j.ijmedinf.2010.09.007.
MIA 19: e90e95. doi:10.1136/amiajnl-2011- guage processing of discharge summaries. J Am 61. Uzuner O, Luo Y, Szolovits P (2007) Evaluating the
000182. Med Inform Assoc 12: 448457. state-of-the-art in automatic de-identification. J Am
27. Peissig PL, Rasmussen LV, Berg RL, Linneman 44. Denny JC, Spickard A, Johnson KB, Peterson Med Inform Assoc 14: 550563. doi:10.1197/jamia.
JG, McCarty CA, et al. (2012) Importance of multi- NB, Peterson JF, et al. (2009) Evaluation of a M2444.
modal approaches to effectively identify cataract method to identify and categorize section headers 62. Cardon LR, Palmer LJ (2003) Population stratifi-
cases from electronic health records. J Am Med in clinical documents. J Am Med Inform Assoc cation and spurious allelic association. Lancet 361:
Inform Assoc 19: 225234. doi:10.1136/amiajnl- 16: 806815. doi:10.1197/jamia.M3037. 598604. doi:10.1016/S0140-6736(03)12520-2.
2011-000456. 45. Friedman C, Shagina L, Lussier Y, Hripcsak G 63. Price AL, Patterson NJ, Plenge RM, Weinblatt
28. Denny JC, Spickard A, Miller RA, Schildcrout J, (2004) Automated encoding of clinical documents ME, Shadick NA, et al. (2006) Principal compo-
Darbar D, et al. (2005) Identifying UMLS based on natural language processing. J Am Med nents analysis corrects for stratification in ge-
concepts from ECG Impressions using Knowl- Inform Assoc 11: 392402. nome-wide association studies. Nat Genet 38:
edgeMap. AMIA. Annual Symposium proceed- 46. Zeng QT, Goryachev S, Weiss S, Sordo M, 904909. doi:10.1038/ng1847.
ings/AMIA Symposium: 196200. Murphy SN, et al. (2006) Extracting principal 64. Dumitrescu L, Ritchie MD, Brown-Gentry K,
29. Willems JL, Abreu-Lima C, Arnaud P, van diagnosis, co-morbidity and smoking status for Pulley JM, Basford M, et al. (2010) Assessing the
Bemmel JH, Brohet C, et al. (1991) The asthma research: evaluation of a natural language accuracy of observer-reported ancestry in a
diagnostic performance of computer programs processing system. BMC medical informatics and biorepository linked to electronic medical records.
for the interpretation of electrocardiograms. The decision making 6: 30. Genet Med 12: 648650. doi:10.1097/GIM.0-
New England journal of medicine 325: 1767 47. Chapman WW, Bridewell W, Hanbury P, b013e3181efe2df.
1773. Cooper GF, Buchanan BG (2001) A simple 65. Sohn M-W, Zhang H, Arnold N, Stroupe K,
30. Poon EG, Keohane CA, Yoon CS, Ditmore M, algorithm for identifying negated findings and Taylor BC, et al. (2006) Transition to the new
Bane A, et al. (2010) Effect of bar-code technol- diseases in discharge summaries. Journal of race/ethnicity data collection standards in the
ogy on the safety of medication administration. biomedical informatics 34: 301310. Department of Veterans Affairs. Popul Health
N Engl J Med 362: 16981707. doi:10.1056/ 48. Friedman C, Shagina L, Lussier Y, Hripcsak G Metr 4: 7. doi:10.1186/1478-7954-4-7.
NEJMsa0907115. (2004) Automated encoding of clinical documents 66. Savova GK, Fan J, Ye Z, Murphy SP, Zheng J,
31. FitzHenry F, Peterson JF, Arrieta M, Waitman based on natural language processing. J Am Med et al. (2010) Discovering peripheral arterial
LR, Schildcrout JS, et al. (2007) Medication Inform Assoc 11: 392402. disease cases from radiology notes using natural
administration discrepancies persist despite elec- 49. Denny JC, Miller RA, Waitman LR, Arrieta MA, language processing. AMIA Annu Symp Proc
tronic ordering. J Am Med Inform Assoc 14: 756 Peterson JF (2009) Identifying QT prolongation 2010: 722726.
764. doi:10.1197/jamia.M2359. from ECG impressions using a general-purpose 67. Tatonetti NP, Denny JC, Murphy SN, Fernald
32. Denny JC, Arndt FV, Dupont WD, Neilson EG Natural Language Processor. International jour- GH, Krishnan G, et al. (2011) Detecting Drug
(2008) Increased hospital mortality in patients nal of medical informatics 78 Suppl 1: S3442. Interactions From Adverse-Event Reports: Inter-
with bedside hippus. The American journal of 50. Savova GK, Masanz JJ, Ogren PV, Zheng J, action Between Paroxetine and Pravastatin In-
medicine 121: 239245. Sohn S, et al. (2010) Mayo clinical Text Analysis creases Blood Glucose Levels. Clin Pharmacol
33. Turchin A, Kolatkar NS, Grant RW, Makhni and Knowledge Extraction System (cTAKES): Ther. Available: http://www.ncbi.nlm.nih.gov/
EC, Pendergrass ML, et al. (2006) Using Regular architecture, component evaluation and applica- pubmed/21613990. Accessed 7 June 2011.

PLOS Computational Biology | www.ploscompbiol.org 14 December 2012 | Volume 8 | Issue 12 | e1002823


68. Rzhetsky A, Wajngurt D, Park N, Zheng T (2007) 73. Crosslin DR, McDavid A, Weston N, Nelson SC, Medical Informatics Association: JAMIA 19:
Probing genetic overlap among complex human Zheng X, et al. (2012) Genetic variants associated e162e169. doi:10.1136/amiajnl-2011-000583.
phenotypes. Proc Natl Acad Sci USA 104: with the white blood cell count in 13,923 subjects 78. Denny JC, Kho A, Chute CG, Carrell D,
1169411699. doi:10.1073/pnas.0704820104. in the eMERGE Network. Hum Genet 131: 639 Rasmussen L, et al. (2010) Use of Electronic
69. Chen DP, Weber SC, Constantinou PS, Ferris 652. doi:10.1007/s00439-011-1103-9. Medical Records for Genomic Research
TA, Lowe HJ, et al. (2008) Novel integration of 74. Denny JC, Crawford DC, Ritchie MD, Bielinski SJ, Preliminary Results and Lessons from the
hospital electronic medical records and gene Basford MA, et al. (2011) Variants Near FOXE1 Are eMERGE Network.
expression measurements to identify genetic Associated with Hypothyroidism and Other Thyroid 79. Denny JC, Ritchie MD, Basford MA, Pulley JM,
markers of maturation. Pac Symp Biocomput: Conditions: Using Electronic Medical Records for Bastarache L, et al. (2010) PheWAS: demonstrat-
243254. Genome- and Phenome-wide Studies. Am J Hum ing the feasibility of a phenome-wide scan to
70. Wood GC, Still CD, Chu X, Susek M, Erdman R, Genet 89: 529542. doi:10.1016/j.ajhg.2011.09.008. discover gene-disease associations. Bioinformatics
et al. (2008) Association of chromosome 9p21 SNPs 75. Kullo IJ, Ding K, Shameer K, McCarty CA, 26: 12051210. doi:10.1093/bioinformatics/
with cardiovascular phenotypes in morbid obesity Jarvik GP, et al. (2011) Complement receptor 1 btq126.
using electronic health record data. Genomic Med 2: gene variants are associated with erythrocyte 80. Denny JC, Bastarache L, Crawford DC, Ritchie
3343. doi:10.1007/s11568-008-9023-z. sedimentation rate. Am J Hum Genet 89: 131 MD, Basford MA, et al. (2010) Scanning the
71. Kurreeman F, Liao K, Chibnik L, Hickey B,
138. doi:10.1016/j.ajhg.2011.05.019. EMR Phenome for Gene-Disease Associations
Stahl E, et al. (2011) Genetic basis of autoanti-
76. Kho AN, Hayes MG, Rasmussen-Torvik L, Pacheco using Natural Language Processing. Proc AMIA
body positive and negative rheumatoid arthritis
JA, Thompson WK, et al. (2012) Use of diverse Annu Fall Symp.
risk in a multi-ethnic cohort derived from
electronic medical record systems to identify genetic 81. Scott LJ, Mohlke KL, Bonnycastle LL, Willer CJ,
electronic health records. Am J Hum Genet 88:
5769. doi:10.1016/j.ajhg.2010.12.007. risk for type 2 diabetes within a genome-wide Li Y, et al. (2007) A genome-wide association
72. Denny JC, Ritchie MD, Crawford DC, Schildcr- association study. J Am Med Inform Assoc 19: study of type 2 diabetes in Finns detects multiple
out JS, Ramirez AH, et al. (2010) Identification of 212218. doi:10.1136/amiajnl-2011-000439. susceptibility variants. Science 316: 13411345.
genomic predictors of atrioventricular conduc- 77. Carroll RJ, Thompson WK, Eyler AE, Mandelin 82. Collins F (2009) Opportunities and challenges for
tion: using electronic medical records as a tool for AM, Cai T, et al. (2012) Portability of an the NIHan interview with Francis Collins.
genome science. Circulation 122: 20162021. algorithm to identify rheumatoid arthritis in Interview by Robert Steinbrook. N Engl J Med
doi:10.1161/CIRCULATIONAHA.110.948828. electronic health records. Journal of the American 361: 13211323. doi:10.1056/NEJMp0905046.

PLOS Computational Biology | www.ploscompbiol.org 15 December 2012 | Volume 8 | Issue 12 | e1002823


Education

Chapter 14: Cancer Genome Analysis


Miguel Vazquez, Victor de la Torre, Alfonso Valencia*
Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain

Abstract: Although there is great promise in the benefits to be obtained by analyzing cancer genomes, numerous
challenges hinder different stages of the process, from the problem of sample preparation and the validation of the
experimental techniques, to the interpretation of the results. This chapter specifically focuses on the technical issues
associated with the bioinformatics analysis of cancer genome data. The main issues addressed are the use of database
and software resources, the use of analysis workflows and the presentation of clinically relevant action items. We
attempt to aid new developers in the field by describing the different stages of analysis and discussing current
approaches, as well as by providing practical advice on how to access and use resources, and how to implement
recommendations. Real cases from cancer genome projects are used as examples.

This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

Citation: Vazquez M, de la Torre V, Valencia A (2012) Chapter 14: Cancer Genome Analysis. PLoS Comput
Biol 8(12): e1002824. doi:10.1371/journal.pcbi.1002824
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland,
Baltimore County, United States of America
Published December 27, 2012
Copyright: © 2012 Vazquez et al. This is an open-access article distributed under the terms of the Creative
Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium,
provided the original author and source are credited.
Funding: This article was supported in part by the grant from the Spanish Ministry of Science and Innovation
BIO2007-66855 and the EU FP7 project ASSET, grant agreement 259348. The funders had no role in the
preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: valencia@cnio.es

1. Introduction

Cancer is commonly defined as a disease of the genes, a definition that emphasizes the importance of cataloguing
and analyzing tumor-associated mutations. The recent advances in sequencing technology have underpinned the
progress in several large-scale projects to systematically compile genomic information related to cancer. For example,
the Cancer Genome Atlas (http://cancergenome.nih.gov/) and the projects overseen by the International Cancer
Genome Consortium [1] (http://icgc.org/) have focused on identifying links between cancer and genomic variation.
Unsurprisingly, the analysis of genomic mutations associated with cancer is also making its way into clinical
applications [2–4].

Cancer may be favored by genetic predisposition, although it is thought to be primarily caused by mutations in
specific tissues that accumulate over time. Genetic predisposition is represented by germline variants and, indeed,
many common germline variants have been associated with specific diseases, as well as with altered drug
susceptibility and/or toxicity. The association of germline variants with clinical features and disease is mainly
achieved through Genome-Wide Association Studies (GWAS). GWAS use large cohorts of cases to analyze the
relationship between the disease and thousands or millions of mutations across the entire genome, and they are the
subject of a separate chapter in this issue.

The study of cancer genomes differs significantly from GWAS, because the variants accumulate during the lifetime of
the organism only in the tumor or the affected tissues, and they are not transmitted from generation to generation.
These are known as somatic mutations. Mutations accumulate as the tumors progress through processes that are not
completely understood and that depend on the evolution of the different cell types in the tumor, i.e., clonal versus
parallel evolution [5]. Regardless of which model is more relevant, the tumor genome includes mutations that
facilitate tumorigenesis or that are essential for the generation of the tumor (known as tumor drivers), and others
that have accumulated during the growth of the tumor (known as passengers) [6]. Distinguishing driver from
passenger mutations is crucial for the interpretation of cancer genomes [5].

Depending on the type of data and the aim of the analysis, cancer genome analysis may focus on the cancer type or
on the patient. The first approach consists of examining a cohort of patients suffering from a particular type of
cancer, and is used to identify biomarkers, characterize cancer subtypes with clinical or therapeutic implications, or
simply to advance our understanding of the tumorigenic process. The second approach involves examining the
genome of a particular cancer patient in the search for specific alterations that may be susceptible to tailored
therapy. Although both approaches draw on common experimental and bioinformatics techniques, they analyze
different types of information, have different goals and require the results to be presented in distinct ways.

The development of Next Generation Sequencing (NGS) has not only helped to identify genetic variants, but also
represents an important aid in the study of epigenetics (DNAseq and ChipSeq of histone methylation marks),
transcriptional regulation and splicing (RNAseq). The combined power of such genomic data provides a more
complete definition of cancer genomes.

To aid developers new to the field of cancer genomics, this chapter will discuss the particularities of cancer genome
analysis, as well as the main scientific and technical challenges, and potential solutions.

2. Overview of Cancer Genome Analysis

The sequence of steps in an idealized cancer genome analysis pipeline is presented in Figure 1. For each step, the
biological disciplines involved, the bioinformatics techniques used and some of the most salient challenges that
arise are listed.

PLOS Computational Biology | www.ploscompbiol.org 1 December 2012 | Volume 8 | Issue 12 | e1002824


What to Learn in This Chapter

This chapter presents an overview of how cancer genomes can be analyzed,
discussing some of the challenges involved and providing practical advice on
how to address them. As the primary analysis of experimental data is described
elsewhere (sequencing, alignment and variant calling), we will focus on the
secondary analysis of the data, i.e., the selection of candidate driver genes,
functional interpretation and the presentation of the results. Emphasis is placed
on how to build applications that meet the needs of researchers, academics and
clinicians. The general features of such applications are laid out, along with advice
on their design and implementation. This document should serve as a starter
guide for bioinformaticians interested in the analysis of cancer genomes,
although we also hope that more experienced bioinformaticians will find
interesting solutions to some key technical issues.

2.1. Sequencing, Alignment and Variant Calling

After samples are sequenced, sequencing reads are aligned to a reference genome and all differences are identified
through a process known as variant calling. The output of the variant calling is a list of genomic variations
organized according to their genomic location (chromosome and position) and the variant allele. They may be
accompanied by scores measuring the sequencing quality over that region or the prevalence of the variant allele in
the samples. The workflow employed for this type of analysis is commonly known as a primary analysis (for more
information on sequencing, alignment and variant calling, please refer to [7,8]).

This chapter describes the subsequent steps in the analysis of the variants detected at the genome level. This
process is relatively well established and is the main subject of this chapter.

2.2. Consequence, Recurrence Analysis and Candidate Drivers

The list of somatic variants obtained from the primary analysis of DNA sequences is carefully examined to identify
mutations that may alter the function of protein products. DNA mutations are translated into mutations in RNA
transcripts, and from RNA into proteins, potentially altering their amino acid sequence. The impact of these amino
acid alterations on protein function can range from largely irrelevant (if they do not affect any region of the protein
involved in catalysis or binding, or if they do not significantly alter the structure and stability of the protein) to
highly deleterious (for example, if the amino acid changes result in the formation of a truncated protein lacking
important functional regions). The severity of these alterations can be assessed

Figure 1. Idealized cancer analysis pipeline. The column on the left shows a list of sequential steps. The columns on the right show the
bioinformatics and molecular biology disciplines involved at each step, the types of techniques employed and some of the current challenges faced.
doi:10.1371/journal.pcbi.1002824.g001
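
As a minimal illustration of the consequence analysis outlined in Section 2.2, the Python sketch below takes a coding
sequence, applies a single-base substitution and reports whether the resulting amino acid change is synonymous,
missense or nonsense (the last of which would give a truncated protein). It is a sketch under simplifying assumptions,
not the method of any particular tool: the function name `classify_point_mutation` and its arguments are hypothetical,
the input is assumed to be a forward-strand coding sequence with the untranslated regions already removed, and only
the standard genetic code is considered.

```python
# Standard genetic code, built from the conventional T/C/A/G ordering of bases.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_point_mutation(cds, offset, alt_base):
    """Classify a single-base substitution in a coding sequence.

    cds      -- coding sequence starting at the start codon (UTRs removed)
    offset   -- 0-based position of the mutated base within cds
    alt_base -- the variant base (A, C, G or T)

    Returns (codon_number, reference_aa, variant_aa, consequence),
    where codon_number is 1-based and '*' denotes a stop codon.
    """
    cds = cds.upper()
    alt_base = alt_base.upper()
    usable_length = len(cds) - len(cds) % 3  # ignore any incomplete trailing codon
    if not 0 <= offset < usable_length:
        raise ValueError("offset lies outside the complete codons of the sequence")
    if alt_base not in BASES:
        raise ValueError("alt_base must be one of A, C, G or T")
    if cds[offset] == alt_base:
        raise ValueError("variant base is identical to the reference base")

    codon_index = offset // 3          # which triplet is affected
    position_in_codon = offset % 3     # which base of that triplet
    start = codon_index * 3
    ref_codon = cds[start:start + 3]
    alt_codon = (
        ref_codon[:position_in_codon] + alt_base + ref_codon[position_in_codon + 1:]
    )

    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        consequence = "synonymous"
    elif alt_aa == "*":
        consequence = "nonsense"       # premature stop, truncated protein
    elif ref_aa == "*":
        consequence = "stop_lost"
    else:
        consequence = "missense"
    return codon_index + 1, ref_aa, alt_aa, consequence


if __name__ == "__main__":
    toy_cds = "ATGGCCTGGTAA"  # toy sequence: Met-Ala-Trp-stop
    for offset, alt in [(5, "T"), (4, "T"), (8, "A")]:
        print(offset, alt, classify_point_mutation(toy_cds, offset, alt))
```

A real workflow would additionally need to handle the strand of the gene, exon boundaries, multiple transcripts per
gene and the genome build, and would normally delegate the severity assessment to dedicated pathogenicity predictors.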



using specialized software tools known as protein mutation pathogenicity predictors. Mutations are also examined to
identify recurrence, which may point to key genes and mutational hotspots. The predicted consequences of the
mutations and their recurrence are used to select potential driver mutations that may be directly involved in the
tumorigenic process. Note that not all mutations that have deleterious consequences for protein function are
necessarily involved in cancer, as the proteins affected may not play any fundamental role in tumorigenesis.

2.3. Pathways and Functional Analysis

Genes that are recurrently mutated in cancer tend to be easily identifiable, and obvious examples include TP53 and
KRAS, which are mutated in many cancer types. More often, mutations are more widely distributed and the
probability of finding the same gene mutated in several cases is low, making it more difficult to identify common
functional features associated with a given cancer.

Pathway analysis offers a means to overcome this challenge by associating mutated genes with known signaling
pathways, regulatory networks, clusters in protein interaction networks, protein complexes or general functional
classes, such as those defined in the Gene Ontology database. A number of statistical methods have been developed
to determine the significance of the associations between mutated genes and these functional classes.

Pathway analysis has now become a fundamental component of cancer genome analysis and it is described in almost
all cancer genome publications. In this sense, cancer is not only a disease of the genes but also a disease of the
pathways.

2.4. Integration, Visualization and Interpretation

Information on the mutational status of genes can be better understood if it is integrated with information about
gene expression and related to alterations in: the copy number of each gene (CNVs), a very common phenomenon in
cancer; mutations in promoters and enhancers; variations in the affinity of transcription factors and DNA binding
proteins; or dysregulation of epigenetic control.

The importance of the relationships between different genome data sources is illustrated by the case of chronic
lymphocytic leukemia (CLL). The consequences of mutations in the SF3B1 splicing factor,
investigated in studies of DNA methylation [10] and RNA sequencing in the same patients (Ferreira et al. submitted).
At the technical level, the analysis of heterogeneous genomic data adds further complications to analysis workflows,
as the underlying biological bases are often not fully understood. Consequently, relatively few published studies
have effectively combined more than a few combinations of such data [11–13]. These studies are usually supported
by visualization tools to analyze the results within specialist applications tailored to fit the specific set of data
generated.

Finally, in a personalized medicine application, the results must be related to information of clinical relevance, such
as potentially related drugs and therapies.

2.5. Current Challenges

In general terms, three key challenges exist when analyzing cancer genomes: (1) the heterogeneity of the data to be
analyzed, which ranges from genomic mutations in coding regions to alterations in gene expression or epigenetic
marks; (2) the range of databases and software resources required to analyse and interpret the results; and (3) the
comprehensive expertise required to understand the implications of such varied experimental data.

3. Critical Bioinformatics Tasks in Cancer Genome Analysis

An overview of the four main tasks that should be performed when analyzing the cancer genome is shown in Figure 2,
along with the associated requirements. In the first instance, the mutations initially detected at the DNA level must
be trimmed to include only somatic variations, removing the germline SNPs detected in healthy tissue of the same
individuals or in the general population. The description of the different stages of analysis that we present begins
with this list of somatic variants and their associated genomic locations.

3.1. Mapping between Coordinate Systems

Translating mutational information derived from genomic coordinates to other data types is an obvious first step.
Although this may seem trivial, its importance should not be underestimated given that alterations in single
nucleotides can have significant consequences.

The position of DNA mutations in transcripts and protein products must be obtained by translating their coordinates
across various systems. For example, point
ped to different transcripts by finding the exon affected, the offset of the mutation inside that exon and the position
of the exon inside the transcript. By removing the 5' UTR region of the transcript sequence and dividing the rest into
triplets, the affected codon can be identified, as well as the possible amino acid replacement. Ensembl BioMart
provides all the information necessary to perform this type of mapping, while a number of other systems also provide
this functionality (see Table 1).

One important technical consideration when mapping genomic variants is the version of the genome build. It is
essential to use the correct build, and many mapping tools support different versions of the genome build.
Moreover, the data in Ensembl is thoroughly versioned, so that the BioMart interface can be used to gather all
genomic information consistently for any particular build. Thus, entities (mutations, genes, transcripts or proteins)
can be linked back to the appropriate version using the Ensembl web site archives.

3.2. Driver Mutations and Pathogenicity Prediction

In addition to false variants introduced by technical errors, some variants present in the samples may not contribute
to cancer development. The terms driver and passenger were first used in 1964 in the context of viral infections that
drive cancer [6]. However, they are now used to distinguish mutations that drive cancer onset and progression from
those that play little or no role in such processes but that are propagated by their co-existence with driver
mutations. The problem of distinguishing driver from passenger mutations remains unsolved. Experimental assays of
activity are one means of testing the tumorigenic potential of mutations [14], although such assays are difficult to
perform at scale. Consequently, a number of complementary in-silico methods have been developed to identify driver
mutations. Statistical approaches seek to identify traces of mutation selection during tumor formation by looking at
the prevalence of mutations in particular genes in sample cohorts, or the ratios of synonymous versus
non-synonymous mutations in particular candidate genes. However, such statistical approaches require large sample
cohorts to achieve sufficient power. Alternatively, in-silico predictions of pathogenicity can be used to restrict the
list of potential driver mutations to those that are likely to alter protein function [15].

Several tools that implement different
detected by exon sequencing [9], were mutations in coding regions can be map- versions of these general concepts can be

PLOS Computational Biology | www.ploscompbiol.org 3 December 2012 | Volume 8 | Issue 12 | e1002824


Figure 2. Main tasks in an analysis pipeline. Starting with the patient information derived from NGS experiments, the variants are mapped
between genes and proteins, evaluated for pathogenicity, considered systemically through functional analysis, and the resulting conclusions
translated into actionable results.
doi:10.1371/journal.pcbi.1002824.g002
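
The trimming step described in Section 3, in which only somatic variants are kept by discarding anything also seen in the matched healthy tissue or in the general population, can be sketched as follows. This is a minimal illustration under stated assumptions: the Variant record, the function name somatic_only and the toy coordinates are hypothetical and do not come from any tool listed in this chapter. It also assumes that all three inputs are expressed on the same genome build.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (all fields are illustrative)."""
    chrom: str
    pos: int   # 1-based position on the chosen genome build
    ref: str   # reference allele
    alt: str   # observed alternative allele


def somatic_only(tumor: Iterable[Variant],
                 matched_normal: Iterable[Variant],
                 population_snps: Iterable[Variant]) -> List[Variant]:
    """Keep tumor variants absent from the matched normal sample and from
    a catalogue of known population variants."""
    germline = set(matched_normal) | set(population_snps)
    return [v for v in tumor if v not in germline]


# Toy data with made-up coordinates.
tumor = [Variant("1", 100, "C", "T"),
         Variant("1", 250, "G", "A"),
         Variant("2", 75, "A", "G")]
matched_normal = [Variant("1", 250, "G", "A")]
population_snps = [Variant("2", 75, "A", "G")]

somatic = somatic_only(tumor, matched_normal, population_snps)
print(somatic)
```

Exact matching on chromosome, position and alleles is the simplest possible criterion; a production pipeline would probably also need to normalize indel representations before comparing.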

used to perform pathogenicity predictions for point mutations in coding regions (see Table 1). Prediction is far more complicated for genomic aberrations and mutations that affect non-coding regions of DNA, an area of basic research that is still in its early stages. However, the large collections of genomic information gathered by the ENCODE project [16] will doubtless play a key role in this research. Despite their limited scope, mutations in coding regions are the most useful for cancer genome analysis. This is partly because it is still cheaper to sequence exomes than full genomes, and partly because they are closer to actionable medical items, given that most drugs target proteins. Indeed, most clinical success stories based on cancer genome analysis have involved the analysis of point mutations in proteins [3].

In particular, we have focused on the need to analyze the consequences of mutations in alternative isoforms of each gene, in addition to those in the main isoforms. Despite the potential implications of alternative splicing, this problem remains largely overlooked by current applications. A common solution is to assign the genomic mutations to just one of the several potential isoforms, without considering their possible effect on other splice isoforms, and in most cases without knowing which isoform is actually produced in that particular tissue. The availability of RNAseq data should solve this problem by demonstrating which isoforms are specifically expressed in the cell type of interest, in which case, additional software will be necessary to analyze the data generated by the new experiments.

3.3. Functional Interpretation

Some genes harbor a large number of mutations in cancer genomes, such as TP53 and KRAS, whose importance and relevance as cancer drivers have been well established. Frequently, however, genomic data reveal the presence of mutated genes that are far less prevalent, and the significance of these genes must be considered in the context of the functional units they are part of. For example, SF3B1 was mutated in only 10 out of 105 samples of chronic lymphocytic leukemia (CLL) in the study conducted by the ICGC consortium [9], and in 14 out of 96 in the study performed at the Broad Institute [17]. While these numbers are statistically significant, many other components of the RNA splicing and transport machinery are also mutated in CLL. Even if these mutations occur at lower frequencies, they further emphasize the importance of this gene [18].

Functional interpretation aims to identify large biological units that correlate better with the phenotype than individual mutated genes, and as such, it can produce a more general interpretation of the acquired genomic information. The involvement of genes in specific biological, metabolic and signaling pathways is the type of functional annotation most commonly considered and thus, functional analysis is often termed pathway analysis. However, functional annotations may also include other types of biological associations such as cellular location, protein domain composition, and classes of cellular or biochemical terms, such as GO terms (Table 2 lists some useful databases along with the relevant functional annotations).

Over the last decade, multiple statistical approaches have been developed to identify functional annotations (also known as labels) that are significantly associated with lists of entities, collectively known as enrichment analysis. Indeed, the current systems for functional interpretation have been derived from the systems previously developed to analyze expression arrays, and they have been adapted to analyze lists of cancer-related genes. As this step is critical for functional interpretation, special care must be taken when selecting methods to be incorporated into the analysis pipeline. Cases in which the characteristics of the data challenge the assumptions of the methods are particularly delicate. For instance, a hypergeometric test might be appropriate to analyze gene lists that are differentially expressed in gene expression arrays. However, when dealing with lists of mutated genes this approach does not account for factors such as the number of mutations per gene, the size of the genes, or the presence of genes in overlapping genomic clusters (where one mutation may simultaneously

Table 1. Selection of the software packages used in cancer genome analysis.

Software | Functionality | Availability
VEP | Mutation mapping | Local installation or web site
ANNOVAR | Mutation mapping | Local installation
VARIANT | Mutation mapping | Local installation, web site, and web service
Mutation Assessor, SIFT | For protein variants | Web site and web service
Condel | Consensus prediction | Web site and web service
wKinMut | Kinase specific | Web site and web service
Genecodis | Annotation enrichment for gene lists | Web site and web service
FatiGO, David | Annotation enrichment for gene lists | Web site
Cytoscape | Network visualization and analysis | Local installation. Can be embedded in browser applications
R | Statistics and plotting | Local installation
Taverna | Workflow enactment | Local installation
Galaxy | Workflow enactment | Browser application

doi:10.1371/journal.pcbi.1002824.t001
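
The hypergeometric test mentioned in Section 3.3 as a typical set-based enrichment method can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any tool in Table 1: the function name enrichment_pvalue and the toy counts are assumptions. As the text points out, a test of this kind ignores the number of mutations per gene, gene size and overlapping genomic clusters.

```python
from math import comb


def enrichment_pvalue(hits: int, list_size: int,
                      annotated: int, background: int) -> float:
    """Upper-tail hypergeometric probability P(X >= hits).

    background: number of genes that could have been selected
    annotated:  background genes carrying the annotation (e.g. a pathway)
    list_size:  number of genes in the list under study (e.g. mutated genes)
    hits:       genes in the list that carry the annotation
    """
    max_hits = min(list_size, annotated)
    if not (0 <= hits <= max_hits and max(list_size, annotated) <= background):
        raise ValueError("inconsistent counts")
    total = comb(background, list_size)
    tail = sum(comb(annotated, i) * comb(background - annotated, list_size - i)
               for i in range(hits, max_hits + 1))
    return tail / total


# Toy numbers (made up): 10 mutated genes, 5 of them in a pathway that
# contains 50 of the 1000 background genes.
p = enrichment_pvalue(hits=5, list_size=10, annotated=50, background=1000)
print(f"p = {p:.2e}")
```

Exact integer arithmetic with math.comb keeps the sketch dependency-free; for many annotations, a multiple-testing correction would also be needed, which is omitted here.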



Table 2. Selection of databases commonly used in our workflows.

Database | Entities | Properties
Ensembl | Genes, proteins, transcripts, regulatory regions, variants | Genomic positions, relationships between them, identifiers in different formats, GO terms, PFAM domains
Entrez | Genes, articles | Articles for genes, abstracts of articles, links to full text
UniProt | Proteins | PDBs, known variants
KEGG, Reactome, Biocarta, Gene Ontology | Genes | Pathways, processes, function, cell location
TFacts | Genes | Transcription regulation
Barcode | Genes | Expression by tissue
PINA, HPRD, STRING | Proteins | Interactions
PharmGKB | Drugs, proteins, variants | Drug targets, pharmacogenetics
STITCH, Matador | Drugs, proteins | Drug targets
Drug clinical trials | Investigational drugs | Diseases or conditions in which they are being tested
GEO, ArrayExpress | Genes (microarray probes) | Expression values
ICGC, TCGA | Cancer genomes | Point mutations, methylation, CNV, structural variants
dbSNP, 1000 genomes | Germline variations | Association with diseases or conditions
COSMIC | Somatic variations | Association with cancer types

doi:10.1371/journal.pcbi.1002824.t002
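
Rank-based enrichment, which Section 3.3 presents as an alternative to set-based tests, can be illustrated with an unweighted Kolmogorov-Smirnov-style running sum. This is a deliberately simplified sketch and not the GSEA algorithm itself [19], which uses a weighted statistic and permutation-based significance; the function name and the toy gene labels are assumptions made for the example.

```python
from typing import Iterable, Sequence


def ks_enrichment_score(ranked_genes: Sequence[str],
                        gene_set: Iterable[str]) -> float:
    """Maximum deviation from zero of a running sum over a ranked gene list.

    The sum increases by 1/n_hits at every gene that belongs to the set and
    decreases by 1/n_miss otherwise, so it starts and ends at zero.
    """
    members = set(gene_set)
    hits = [gene in members for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(hits) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running = 0.0
    score = 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(score):
            score = running
    return score


# Toy ranking (made-up labels), ordered from highest to lowest score.
ranking = ["G1", "G2", "G3", "G4"]
top_score = ks_enrichment_score(ranking, {"G1", "G2"})     # set at the top
bottom_score = ks_enrichment_score(ranking, {"G3", "G4"})  # set at the bottom
print(top_score, bottom_score)
```

The per-gene score that defines the ranking is where corrections such as gene length or background mutation rate could be introduced, as suggested in the text.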

affect several genes). As none of these issues are accommodated by the standard approaches used for gene expression analysis, new developments are clearly required for cancer genome analysis.

To alleviate the rigidity introduced by the binary nature of set-based approaches, whereby genes are either on the list or they are not, some enrichment analysis approaches study the over-representation of annotations/labels using rank-based statistics. A common choice for rank-based approaches is to use some variation of the Kolmogorov-Smirnov non-parametric statistic, as employed in gene set enrichment analysis (GSEA) [19]. Another benefit of rank approaches is that the scores used can be designed to account for some of the features that are not well handled by set-based approaches. Accordingly, considerations of background mutation rates based on gene length, sequencing quality or heterogeneity in the initial tumor samples can be incorporated into the scoring scheme. However, rank statistics are still unable to handle other issues, such as mutations affecting clusters of genes that are functionally related (e.g., proto-cadherins), which still challenge the assumption of independence made by most statistical approaches. Note that from a bioinformatics perspective, sets of entities are often conceptually simpler to work with than ranked lists when crossing information derived from different sources. Moreover, from an application perspective, information summarized in terms of sets of entities is often more actionable than ranks or scores.

A different type of analysis considers the relationships between entities based on their connections in protein interaction networks. This approach has been used to measure the proximity of groups of cancer-related genes and other groups of genes or functions, by labeling nodes with specific characteristics (such as roles in biological pathways or functional classes) [20].

Functional interpretation can therefore be facilitated by the use of a wide array of alternative analyses. Different approaches can potentially uncover hidden functional implications in genomic data, although the integration of these results remains a key challenge.

3.4. Applicable Results: Diagnosis, Patient Stratification and Drug Therapies

For clinical applications, the results of cancer genome analysis need to be translated into practical advice for clinicians, providing potential drug therapies, better tumor classification or early diagnostic markers. While bioinformatics systems can support these decisions, it will be up to expert users to present these findings in the context of the relevant medical and clinical information available at any given time. In the case of our institution's (CNIO) personalized cancer medicine approach, we use mouse xenografts (also known as avatar models) to test the effects of drugs on tumors prior to considering their potential to treat patients [4]. In turn, the results of these xenograft studies are fed back into the system for future analyses.

Drug-related information and the tools with which to analyze it are essential for the analysis of personalized data (some of the key databases linking known gene variants to diseases and drugs are listed in Table 2). Accessing this information and integrating chemical informatics methodologies into bioinformatics systems presents new challenges for bioinformaticians and system developers.

4. Resources for Genome Analysis in Cancer

4.1. Databases

Although complex, the data required for genome analysis can usually be represented in a tabular format. Tab separated values (TSV) files are the de facto standard when sharing database resources. For a developer, these files have several practical advantages over other standard formats popular in computer science (namely XML): they are easier to read, write and parse with scripts; they are relatively succinct; the format is straightforward and the contents can be inferred from the first line of the file, which typically holds the names of the columns.

Some databases describe entities and their properties, such as: proteins and the drugs that target them; germline variations and the diseases with which they are associated; or genes along with the factors that regulate their transcription. Other databases are repositories of experimental data, such as the Gene Expression Omnibus and ArrayExpress, which contain data from microarray experiments on a wide range of



samples and under a variety of experimental conditions. For cancer genome studies, cancer-specific repositories will soon be the main reference, such as those developed by the ICGC and TCGA projects. Indeed, these repositories contain complete genotypes that offer a perfect opportunity to test new approaches with real data.

Bioinformaticians know that crossing information from different sources is not a trivial task, as different resources use a variety of identifiers. Even very similar entities can have different identifiers in two different databases (e.g., genes in Entrez and Ensembl). Some resources borrow identifiers for their own data, along with HGNC gene symbols, while databases such as KEGG have their own identifiers for genes, and offer equivalence tables that map them to gene symbols or other common formats.

In addition to entities being referenced by different identifier formats, in distinct resources they may also adhere to slightly different definitions (e.g., regarding what constitutes a gene). Furthermore, as mentioned above, the differences between genome builds can substantially affect the mapping between coordinate systems, and they can also give rise to differences between entities.

In general, translating identifiers can be cumbersome and incompatibilities may exist between resources. For example, MutationAssessor, which predicts the pathogenicity of protein mutations [15], uses UniProt identifiers. Analysis systems using Ensembl data for coordinate mappings, such as our own, render mutations using Ensembl Protein IDs, and in some cases there are problems in translating identifiers, and mutations may even be assigned to the wrong isoforms. To prevent these potential errors, MutationAssessor double checks that the original amino acid matches the sequence it is using and refuses to make a prediction otherwise. Although avoiding incorrect predictions is a valid strategy, in practice it substantially reduces the number of predictions that can be made.

Identifier translation is a very common task in bioinformatics in general, and in cancer genome analysis in particular. In practice, we use the Ensembl BioMart web service to download identifier equivalence tables (in TSV format), which map different identifier formats between and across genes, proteins, array probes, etc. We build fast indexes over these equivalence tables and make them ubiquitously accessible to all our functionalities through simple API calls, web services, or command line statements. While potentially encumbered by semantic incompatibilities between entity definitions in multiple resources, a thoroughly versioned translation equivalence system is an invaluable asset for database integration.

4.2. Software Resources

In cancer analysis pipelines, several tasks must be performed that require supporting software. These range from simple database searches to cross-check lists of germline mutations with lists of known SNPs, to running complex computational methods to identify protein-protein interaction sub-networks affected by mutations. Some cancer analysis workflows opt to develop these functionalities in-house, while others delegate them to third party software with the implicit burdens of installation and configuration. Table 1 lists some software resources that are useful when implementing analysis workflows, and succinctly describes their functionality and availability.

The functionalities required in a genome analysis workflow can be divided into four classes, depending on how they are accessed (Table 3): via web services, local or browser based applications, command line tools, or application programming interfaces (APIs). It is not uncommon for resources to make their data and functionalities available in several ways, a trend that is already evident in databases like Ensembl, where the information can be examined using the web interface, downloaded via the BioMart web service, batch downloaded from an FTP server, or queried through the Perl API.

Bioinformaticians should strive to make their resources widely available to allow others to use them in the most convenient manner. Depending on the characteristics of the workflow, some accessibility modes (e.g., web service, local application, or API) will be more convenient than others. For example, if a relatively systematic workflow has to be applied to a batch of datasets, then command-line tools are very convenient as they are easy to script. Because a cancer genome analysis pipeline may require several connected analytical steps, it is important to be able to script them to avoid manual operations, thereby guaranteeing the sustainability and reproducibility of the results. Conversely, if the user is concerned with the analysis of just one dataset but interpretation of the results requires more careful examination, visual interfaces such as browser-based applications may be the most convenient end-user interface, as these can link the results to knowledge databases to set the context.

5. Workflow Enactment Tools and Visual Interfaces

Given the complexity of cancer genome analysis, it is worth discussing how to design and execute (enact) workflows, which may become very elaborate. Workflows can be thought of as analysis recipes, whereby each analysis entails enacting that workflow using new data. Ideally a workflow should be comprehensive and cover the complete analysis process from the raw data to the final results. These workflows may involve processing different types of data and may require specific

Table 3. Types of third party software and their general characteristics.

Software type | Installation | User friendly | Scriptable | Reusable (1)
Browser app. | NO | YES | NO (2) | NO
Web server | NO | NO | YES | NO
Local app | YES | YES | NO (3) | NO
Command line | YES | NO | YES | YES (4)
API | YES | NO | YES | YES

(1) Reusable means that the code, in whole or in part, can be reused for some other purpose.
(2) May be scriptable using web scraping.
(3) May support some macro definitions and batch processing.
(4) If the source code is provided and is easy to pick apart.
doi:10.1371/journal.pcbi.1002824.t003
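
Two ideas from Section 4.1 lend themselves to a short sketch: building an index over an identifier equivalence table in TSV format, and checking that the reference amino acid of a mutation matches the protein sequence before trusting a prediction. The code below is only an illustration under stated assumptions. The column names, the placeholder identifiers (GENE_A, PROT_A1, ...) and both function names are hypothetical, and the check mimics the behaviour described for MutationAssessor rather than reproducing its actual code.

```python
import csv
import io
from typing import Dict, Set


def build_index(tsv_text: str, source_col: str,
                target_col: str) -> Dict[str, Set[str]]:
    """Map each source identifier to the set of equivalent target identifiers.

    The first line of the TSV is expected to hold the column names.
    Rows with an empty source or target field are skipped.
    """
    index: Dict[str, Set[str]] = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        source, target = row[source_col], row[target_col]
        if source and target:
            index.setdefault(source, set()).add(target)
    return index


def reference_matches(protein_sequence: str, position: int,
                      expected_residue: str) -> bool:
    """True if the 1-based position exists and holds the expected residue."""
    return (1 <= position <= len(protein_sequence)
            and protein_sequence[position - 1] == expected_residue)


# Toy equivalence table with placeholder identifiers (not real database IDs).
TOY_TSV = ("gene_id\tprotein_id\n"
           "GENE_A\tPROT_A1\n"
           "GENE_A\tPROT_A2\n"
           "GENE_B\t\n")

index = build_index(TOY_TSV, "gene_id", "protein_id")
print(sorted(index["GENE_A"]))

# A mutation described as K2R is only accepted if residue 2 really is K.
print(reference_matches("MKT", 2, "K"))
```

Returning a set of targets per source makes one-to-many mappings explicit, which matters when a gene has several protein isoforms; an in-memory dictionary is the simplest stand-in for the fast indexes mentioned in the text.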



adaptations for the analysis of certain types of experiments. Often, parts of the analysis will be repeated in a different context and thus, one of the objectives of workflow enactment tools is to reuse code efficiently. A number of systems have been designed to facilitate the construction of workflows (e.g., Taverna [21] and Galaxy [22], which both offer visual interfaces to orchestrate workflows across a very wide range of available functionalities).

Although visual workflow enactment approaches have become reasonably popular, they still have several important limitations. Firstly, despite recent efforts, these approaches remain overly complex for non-bioinformaticians. Secondly, they are quite inflexible in terms of the presentation and exploration of the results, and thus, understanding the results requires the user to do additional work outside of the system. Finally, the expressiveness of these approaches is limited when compared with general purpose programming languages. Experienced developers will find them of limited utility, and prefer to have their functionalities accessible by APIs derived from general purpose programming languages.

The information presented to the user needs to closely match his/her needs, especially in more translational settings. Too much information may mask important conclusions, while too little may leave the user unsure as to the validity of their findings. This further emphasizes the need to customize workflows and the manner in which results are displayed, in order to best fit these aspects to the particularities of each user.

In a more academic setting, close collaboration between the researcher and the bioinformatician facilitates the development of custom interfaces that can better adapt to given datasets, and answer the very specific questions that may arise during data exploration. In our institution, we use a programmatic workflow enactment system that orchestrates a wide variety of tasks, ranging from coordinate mapping to enrichment analysis. This system is controlled via a browser application designed to rapidly produce custom reports using a template-based HTML report generation system. It is a system that was developed entirely in-house but that makes use of third party software, allowing us to address the requirements of our collaborators in a timely manner.

6. Summary

Cancer genome analysis involves the manipulation of large datasets and the application of complex methods. The heterogeneity of the data and the disparity of the software implementations represent an additional layer of complexity, which requires the use of systems that can be easily adapted and reconfigured. Additionally, interpretation of the results in terms of specific biological questions is more effective if done in close collaboration with experts in the field. This represents a specific challenge for software development in terms of interactivity and representation standards. Cancer genome analysis systems need to be capable of conveniently managing this complexity and of adapting to the specific characteristics of each analysis.

Finally, it is worth noting that bioinformatics systems will soon have to move beyond the current research environments and into clinical settings, a challenge that will involve more industrial development that can better cope with issues of sustainability, robustness and accreditation, while still incorporating the latest bioinformatics components that will continue to be generated in research laboratories. This constitutes a new and exciting frontier for bioinformatics software developers.

7. Exercise Questions

I. Name three general issues that bioinformaticians face when analyzing cancer genome data.
II. What are the four main tasks in cancer genome analysis in a clinical setting once the primary analysis has been performed?
III. Why is it important to use the correct genome build?
IV. What do we mean by driver mutation?
V. There are two key principles that help determine driver mutations in-silico. What are they?
VI. Give several reasons why point mutations in coding regions are so important.
VII. Name three issues that challenge the assumptions made by the standard pathway enrichment analysis tools when applied to genomic mutations.
VIII. Discuss the problems that arise with identifiers when integrating information across different databases.
IX. Why are command line tools generally more convenient than browser-based applications for processing a batch of analyses?
X. How would an application aimed at researchers differ from one aimed at clinicians in terms of the information presented?

Answers to the Exercises can be found in Text S1.

Supporting Information

Text S1 Answers to Exercises (DOCX)

Further Reading

- Weinberg RA (2006) The biology of cancer. 1st ed. Garland Science. 850 p.
- Ng PC, Murray SS, Levy S, Venter JC (2009) An agenda for personalized medicine. Nature 461: 724-726. doi:10.1038/461724a.
- Fernald GH, Capriotti E, Daneshjou R, Karczewski KJ, Altman RB (2011) Bioinformatics challenges for personalized medicine. Bioinformatics 27: 1741-1748. doi:10.1093/bioinformatics/btr295.
- Valencia A, Hidalgo M (2012) Getting personalized cancer genome analysis into the clinic: the challenges in bioinformatics. Genome Med 4: 61. doi:10.1186/gm362.
- Whirl-Carrillo M, McDonagh EM, Hebert JM, Gong L, Sangkuhl K, et al. (2012) Pharmacogenomics knowledge for personalized medicine. Clin Pharmacol Ther 92: 414-417. doi:10.1038/clpt.2012.96.
- Baudot A, de la Torre V, Valencia A (2010) Mutated genes, pathways and processes in tumours. EMBO Rep 11: 805-810. doi:10.1038/embor.2010.133.
- Huang DW, Sherman BT, Lempicki RA (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucl Acids Res 37: 1-13. doi:10.1093/nar/gkn923.
- Khatri P, Sirota M, Butte AJ (2012) Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol 8: e1002375. doi:10.1371/journal.pcbi.1002375.
- Stein L (2002) Creating a bioinformatics nation. Nature 417: 119-120. doi:10.1038/417119a.



References
1. Hudson TJ, Anderson W, Artez A, Barker AD, Bell C, et al. (2010) International network of cancer genome projects. Nature 464: 993-998. doi:10.1038/nature08987.
2. Roychowdhury S, Iyer MK, Robinson DR, Lonigro RJ, Wu Y-M, et al. (2011) Personalized oncology through integrative high-throughput sequencing: a pilot study. Sci Transl Med 3: 111ra121. doi:10.1126/scitranslmed.3003161.
3. Villarroel MC, Rajeshkumar NV, Garrido-Laguna I, De Jesus-Acosta A, Jones S, et al. (2011) Personalizing cancer treatment in the age of global genomic analyses: PALB2 gene mutations and the response to DNA damaging agents in pancreatic cancer. Mol Cancer Ther 10: 3-8. doi:10.1158/1535-7163.MCT-10-0893.
4. Valencia A, Hidalgo M (2012) Getting personalized cancer genome analysis into the clinic: the challenges in bioinformatics. Genome Med 4: 61. doi:10.1186/gm362.
5. Baudot A, Real FX, Izarzugaza JMG, Valencia A (2009) From cancer genomes to cancer models: bridging the gaps. EMBO Rep 10: 359-366. doi:10.1038/embor.2009.46.
6. Andrewes C (1964) Tumour-viruses and Virus-tumours. Br Med J 1: 653-658.
7. Nielsen R, Paul JS, Albrechtsen A, Song YS (2011) Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics 12: 443-451. doi:10.1038/nrg2986.
8. Metzker ML (2010) Sequencing technologies - the next generation. Nat Rev Genet 11: 31-46. doi:10.1038/nrg2626.
9. Quesada V, Conde L, Villamor N, Ordonez GR, Jares P, et al. (2011) Exome sequencing identifies recurrent mutations of the splicing factor SF3B1 gene in chronic lymphocytic leukemia. Nat Genet 44: 47-52. doi:10.1038/ng.1032.
10. Kulis M, Heath S, Bibikova M, Queiros AC, Navarro A, et al. (2012) Epigenomic analysis detects widespread gene-body DNA hypomethylation in chronic lymphocytic leukemia. Nat Genet 44: 1236-1242. doi:10.1038/ng.2443.
11. Chuang H-Y, Rassenti L, Salcedo M, Licon K, Kohlmann A, et al. (2012) Subnetwork-based analysis of chronic lymphocytic leukemia identifies pathways that associate with disease progression. Blood 120: 2639-2649. doi:10.1182/blood-2012-03-416461.
12. The Cancer Genome Atlas Network (2012) Comprehensive molecular portraits of human breast tumours. Nature 490: 61-70. doi:10.1038/nature11412.
13. Chen R, Mias GI, Li-Pook-Than J, Jiang L, Lam HYK, et al. (2012) Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 148: 1293-1307. doi:10.1016/j.cell.2012.02.009.
14. Frohling S, Scholl C, Levine RL, Loriaux M, Boggon TJ, et al. (2007) Identification of driver and passenger mutations of FLT3 by high-throughput DNA sequence analysis and functional assessment of candidate alleles. Cancer Cell 12: 501-513. doi:10.1016/j.ccr.2007.11.005.
15. Reva B, Antipin Y, Sander C (2011) Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res 39: e118. doi:10.1093/nar/gkr407.
16. Bernstein BE, Birney E, Dunham I, Green ED, Gunter C, et al. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489: 57-74. doi:10.1038/nature11247.
17. Wang L, Lawrence MS, Wan Y, Stojanov P, Sougnez C, et al. (2011) SF3B1 and other novel cancer genes in chronic lymphocytic leukemia. N Engl J Med 365: 2497-2506. doi:10.1056/NEJMoa1109016.
18. Damm F, Nguyen-Khac F, Fontenay M, Bernard OA (2012) Spliceosome and other novel mutations in chronic lymphocytic leukemia, and myeloid malignancies. Leukemia 26: 2027-2031.
19. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 102: 15545-15550. doi:10.1073/pnas.0506580102.
20. Glaab E, Baudot A, Krasnogor N, Schneider R, Valencia A (2012) EnrichNet: network-based gene set enrichment analysis. Bioinformatics 28: i451-i457. doi:10.1093/bioinformatics/bts389.
21. Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, et al. (2006) Taverna: a tool for building and running workflows of services. Nucleic Acids Res 34: W729-W732. doi:10.1093/nar/gkl320.
22. Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, et al. (2005) Galaxy: a platform for interactive large-scale genome analysis. Genome Res 15: 1451-1455. doi:10.1101/gr.4086505.



Education

Chapter 15: Disease Gene Prioritization


Yana Bromberg*
Department of Biochemistry and Microbiology, School of Environmental and Biological Sciences, Rutgers University, New Brunswick, New Jersey, United States of America

Abstract: Disease-causing aberrations in the normal function of a gene define that gene as a disease gene. Proving a causal link between a gene and a disease experimentally is expensive and time-consuming. Comprehensive prioritization of candidate genes prior to experimental testing drastically reduces the associated costs. Computational gene prioritization is based on various pieces of correlative evidence that associate each gene with the given disease and suggest possible causal links. A fair amount of this evidence comes from high-throughput experimentation. Thus, well-developed methods are necessary to reliably deal with the quantity of information at hand. Existing gene prioritization techniques already significantly improve the outcomes of targeted experimental studies. Faster and more reliable techniques that account for novel data types are necessary for the development of new diagnostics, treatments, and cures for many diseases.

This article is part of the "Translational Bioinformatics" collection for PLOS Computational Biology.

Citation: Bromberg Y (2013) Chapter 15: Disease Gene Prioritization. PLoS Comput Biol 9(4): e1002902. doi:10.1371/journal.pcbi.1002902
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published April 25, 2013
Copyright: © 2013 Yana Bromberg. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: YB is funded by Rutgers, New Brunswick start-up funding, Gordon and Betty Moore Foundation grant, and USDA-NIFA and NJAES grants Project No. 10150-0228906. The funders had no role in the preparation of the manuscript.
Competing Interests: The author has declared that no competing interests exist.
* E-mail: YanaB@rci.rutgers.edu

PLOS Computational Biology | www.ploscompbiol.org 1 April 2013 | Volume 9 | Issue 4 | e1002902

What to Learn in This Chapter

• Identification of specific disease genes is complicated by gene pleiotropy, polygenic nature of many diseases, varied influence of environmental factors, and overlying genome variation.
• Gene prioritization is the process of assigning likelihood of gene involvement in generating a disease phenotype. This approach narrows down, and arranges in the order of likelihood in disease involvement, the set of genes to be tested experimentally.
• The gene priority in disease is assigned by considering a set of relevant features such as gene expression and function, pathway involvement, and mutation effects.
• In general, disease genes tend to 1) interact with other disease genes, 2) harbor functionally deleterious mutations, 3) code for proteins localizing to the affected biological compartment (pathway, cellular space, or tissue), 4) have distinct sequence properties such as longer length and a higher number of exons, 5) have more orthologues and fewer paralogues.
• Data sources (directly experimental, extracted from knowledge-bases, or text-mining based) and mathematical/computational models used for gene prioritization vary widely.

1. Introduction

In 1904 Dr. James Herrick reported [1] the findings of peculiar elongated and sickle shaped red blood cells discovered by Dr. Ernest Irons in a hospital patient afflicted with shortness of breath, heart palpitations, and various other aches and pains. This was the first documented case of sickle cell disease in the United States. More than forty years later, in 1949, sickle cell anemia became the first disease to be characterized on a molecular level [2,3]. Thus, implicitly, the first disease-associated gene, coding for the beta-globin chain of hemoglobin A, was discovered.

It took another thirty years before, in 1983, a study of the DNA of families afflicted with Huntington's disease revealed its association with a gene on chromosome 4 called huntingtin (HTT) [4]. Huntington's became the first genetic disease mapped using polymorphism information (G8 DNA probe/genetic marker), closely followed by the discovery, in the same year, of the association of phenylketonuria with polymorphisms in a hepatic enzyme, phenylalanine hydroxylase [5]. These advances provided a route for predicting the likelihood of disease development and even stirred some worries regarding the possibility of the rise of medical eugenics [6]. Interestingly, it took another ten years for the HTT sequence to be identified and for the precise nature of the Huntington's-associated mutation to be determined [7].

The recent explosion in high-throughput experimental techniques has contributed significantly to the identification of disease-associated genes and mutations. For instance, the latest release of SwissVar [8], a variation-centered view of the Swiss-Prot database of genes and proteins [9,10], reports nearly 20 thousand mutations in 3,500 genes associated with over three thousand broad disease classes. Unfortunately, the improved efficiency in production of association data (e.g. genome-wide association studies, GWAS) has not been matched by a similar improvement in accuracy. Thus, the sheer quantity of existing but as yet unvalidated data has resulted in information overflow. While association and linkage studies provide a lot of information, incorporation of other sources of evidence is necessary to narrow down the candidate search space. Computational methods, i.e. gene prioritization techniques, are therefore necessary to effectively translate the experimental data into legible disease-gene associations [11].

2. Background

The Merriam-Webster dictionary defines the word disease as "a condition of the living animal or plant body or of one of its parts that impairs normal functioning and is typically manifested by distinguishing signs and symptoms." Thus, disease is defined with respect to normal function of said body or body part. Note that this definition also describes the malfunction of individual cells or cell groups. In fact, many diseases can and should be defined on a cellular level. Understanding a disease, and potentially finding curative or preventive measures, requires answering three questions: (1) What is the affected function? (2) What functional activity levels are considered normal given the environmental contexts? (3) What is the direction and amount of change in this activity necessary to cause the observed phenotype?

Contrary to the view that historically prevailed in classical genetics, it is rarely the case that one gene is responsible for one function. Rather, an assembly of genes constitutes a functional module or a molecular pathway. By definition, a molecular pathway leads to some specific end point in cellular functionality via a series of interactions between molecules in the cell. Alterations in any of the normally occurring processes, molecular interactions, and pathways lead to disease. For example, folate metabolism is an important molecular pathway, disruptions of which have been associated with many disorders including colorectal cancer [12] and coronary heart disease [13]. Because this pathway involves 19 proteins interacting via numerous cycles and feedback loops [14], it is not surprising that there are a number of different ways in which it can be broken. The changes in concentrations and/or activity levels of any of the pathway members directly affect the pathway end-products (e.g. pyrimidine and/or methylated DNA). The specifics of a given change define the severity and the type of the resulting disease; see Box 1 for discussion on disease types. Moreover, since the view of a single pathway as a discrete and independent entity (with no overlap with other pathways) is an oversimplification, it is increasingly evident that different diseases are also interdependent.

Box 1. Genetic similarities of different disease types.
Diseases can be very generally classified by their associated causes: pathogenic (caused by an infection), environmentally determined (caused by inanimate environmental stressors and deficiencies, such as physical trauma, nutrient deficiency, radiation exposure and sleep deprivation), and genetically hereditary or spontaneous (defined by germline mutations and spontaneous errors in DNA transcription, respectively). Moreover, certain genotypes are more susceptible to the effects of pathogens and environmental stress, contributing to a deadly interplay between disease causes. Regardless of the cause of disease, its manifestations are defined by the changes in the affected function. For example, cancer is the result of DNA damage occurring in a normal cell and leading toward a growth and survival advantage. The initial damage is generally limited to a fairly small number of mutations in key genes, such as proto-oncogenes and tumor suppressor genes [135]. The method of accumulation of these mutants is not very important. A viral infection may cause cancer by enhancing proto-oncogene function [136] or by inserting viral oncogenes into the host cell genome. An inherited genetic variant may disrupt or silence a single allele of a mismatch-repair gene as in Lynch syndrome [137]. Spontaneous transcription errors and influence of environmental factors, e.g. continued exposure to high levels of ionizing radiation, may result in oncogene and tumor suppressor-gene mutations leading to the development of cancer [138]. Thus, the same broad types of disease can be caused by the disruption of the same mechanisms or pathways resulting from any of the three types of causes.

3. Interpreting What We Know

Identifying the genetic underpinnings of the observed disease is a major challenge in human genetics. Since disease results from the alteration of normal function, identifying disease genes requires defining molecular pathways whose disrupted functionality is necessary and sufficient to cause the observed disease. Pathway function changes due to (1) changes in gene expression (i.e. quantity and concentration of product), (2) changes in structure of the gene-product (e.g. conformational change, binding site obstruction, loss of ligand affinity, etc.), (3) introduction of new pathway members (e.g. activation of previously silent genes), and (4) environmental disruptions (e.g. increased temperatures due to inflammation or decreased ligand concentrations due to malnutrition). While all members of the affected pathways can be construed as disease genes, the identification of a subset of the true causative culprits is difficult. Obscuring such identification are individual genome variation (i.e. the reference definition of normal is person-specific), multigenic nature and complex phenotypes of most diseases, varied influence of environmental factors, as well as experimental data heterogeneity and constraints.

Disease genes are most often identified using: (1) genome-wide association or linkage analysis studies, (2) similarity or linkage to and co-regulation/co-expression/co-localization with known disease genes, and (3) participation in known disease-associated pathways or compartments. In bioinformatics, these are represented by multiple sources of evidence, both direct, i.e. evidence coming from one's own experimental work and from literature, and indirect, i.e. guilt-by-association data. The latter means that genes that are in any way related to already established disease-associated genes are promoted in the suspect list. Additionally, implied gene-disease links, such as functional deleteriousness of mutations affecting candidate genes, contribute to establishing associations. The manner in which each guilty association is derived varies from tool to tool and all of them deserve consideration. Very broadly, gene-disease associations are inferred from (Figure 1):

1. Functional Evidence: the suspect gene is a member of the same molecular pathways as other disease-genes; inferred from: direct molecular interactions, transcriptional co-(regulation/expression/localization), genetic linkage, sequence/structure similarity, and paralogy (in-species homology resulting from a gene duplication event)
2. Cross-species Evidence: the suspect gene has homologues implicated in generating similar phenotypes in other organisms
3. Same-compartment Evidence: the suspect gene is active in disease-associated pathways (e.g. ion channels), cellular compartments (e.g. cell membrane), and tissues (e.g. liver).
4. Mutation Evidence: suspect genes are affected by functionally deleterious mutations in genomes of diseased individuals
5. Text Evidence: there is ample co-occurrence of gene and disease terms in scientific texts. Note that textual co-occurrence represents some form of biological evidence, which does not yet lend itself to explicit documentation.

Figure 1. Overview of gene prioritization data flow. In order to prioritize disease-gene candidates various pieces of information about the disease and the candidate genetic interval are collected (green layer). These describe the biological relationships and concepts (blue layer) relating the disease to the possible causal genes. Note, the blue layer (representing the biological meaning) should ideally be blind to the content of the green layer (information collection); i.e. any resource that describes the needed concepts may be used by a gene prioritization method.
doi:10.1371/journal.pcbi.1002902.g001

3.1 Functional Evidence

3.1.1 Molecular interactions. Gene prioritization tools, from the earliest field pioneers like G2D [15,16,17] to the more recent ENDEAVOUR [18,19] and GeneWanderer [20,21], among many others, have used gene-gene (protein-protein) interaction and/or pathway information to prioritize candidate genes. Biologically this makes sense, because if diseases result from pathway breakdown then disabling any of the pathway components can produce similar phenotypes; i.e. genes responsible for similar diseases often participate in the same interaction networks [22,23]. To illustrate this point, consider the interaction partners of the melanocortin 4 receptor (MC4R) in Figure 2, generated by the STRING server [24,25]. Note, not all known interactions are shown; the inclusion parameter is STRING server likelihood >0.9.

MC4R is a hypothalamic receptor with a primary function of energy homeostasis and food intake regulation. Functionally deleterious polymorphisms in this receptor are known to be associated with severe obesity [26,27,28]. Here, MC1R, MC3R, and MC5R are membrane bound melanocortin (1,3,5) receptors that interact with MC4R via shared binding partners. Syndecan-3 (SDC3), agouti signaling protein precursor (ASIP), agouti related protein precursor (AgRP), pro-opiomelanocortin (POMC) and/or their processed derivatives directly bind MC4R for varied purposes of the MC4R signaling pathway. Finally, the reported interactions with Neuropeptide Y-precursor (NPY) and the growth hormone releasing protein (GHRL) are literature derived and may reflect indirect, but tight connectivity. By the token of same pathway evidence, MC4R interactors, whether agonists or antagonists, may be predicted to be linked to obesity. In fact, mutations that negatively affect normal POMC production or processing have been shown to be obesity-associated [29,30] and gene association studies have linked AgRP with anorexia and bulimia nervosa behavioral traits [31], representative of food intake abnormalities. Other pathway participants have also been marked and extensively studied for obesity association.
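
The same-pathway reasoning above can be sketched in a few lines of code. The example below is a minimal illustration only, not the algorithm of G2D, ENDEAVOUR, GeneWanderer or STRING: it assumes an undirected interaction list with a confidence value per edge, keeps edges above a 0.9 cutoff, and ranks candidates by the summed confidence of their links to already known disease genes. The function names, the placeholder gene GENE_X and all confidence values are hypothetical and were chosen for illustration; they are not STRING output.

```python
from collections import defaultdict

# Hypothetical interaction confidences (illustrative only, NOT real STRING scores).
EDGES = [
    ("MC4R", "POMC", 0.99),
    ("MC4R", "AGRP", 0.98),
    ("MC4R", "MC3R", 0.93),
    ("POMC", "MC3R", 0.95),
    ("AGRP", "MC3R", 0.91),
    ("MC4R", "NPY", 0.92),
    ("NPY", "GHRL", 0.94),
    ("MC4R", "GENE_X", 0.40),  # placeholder gene; edge falls below the cutoff
]


def build_network(edges, min_confidence=0.9):
    """Build an undirected graph from edges whose confidence exceeds the cutoff."""
    network = defaultdict(dict)
    for gene_a, gene_b, confidence in edges:
        if confidence > min_confidence:
            network[gene_a][gene_b] = confidence
            network[gene_b][gene_a] = confidence
    return network


def guilt_by_association(network, known_disease_genes, candidates):
    """Rank candidates by summed confidence of direct links to known disease genes."""
    scores = {}
    for gene in candidates:
        neighbours = network.get(gene, {})
        scores[gene] = sum(
            confidence
            for neighbour, confidence in neighbours.items()
            if neighbour in known_disease_genes
        )
    # Highest score first; ties keep the order in which candidates were given.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    network = build_network(EDGES)
    known = {"MC4R", "POMC"}  # genes treated as established for this toy example
    candidates = ["MC3R", "AGRP", "NPY", "GHRL", "GENE_X"]
    for gene, score in guilt_by_association(network, known, candidates):
        print(f"{gene}\t{score:.2f}")
```

In this toy setting a candidate that touches two known disease genes outranks one that touches a single gene, and a candidate connected only through an intermediate (GHRL via NPY) scores zero. Real tools typically go further, for example by propagating evidence along indirect paths or by combining network evidence with the other evidence types listed above.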



Figure 2. MC4R-centered protein-protein interaction network. The figure illustrates the protein-protein interaction neighborhood of the human melanocortin 4 receptor (MC4R) as illustrated by the confidence view of the STRING 8.3 server. The nodes of the graph represent human proteins and the connections illustrate their known or predicted, direct and indirect interactions. The connection between any two protein-nodes is based on the available information mined from relevant databases and literature. The network includes all protein interactions that have >0.9 estimated probability.
doi:10.1371/journal.pcbi.1002902.g002

3.1.2 Regulatory and genetic linkage. Co-regulation of genes has traditionally been thought to point to their involvement in the same molecular pathways [32] and, by that token, to similar disease phenotypes; e.g. [33,34]. For example, GPR30, a novel G-protein coupled estrogen receptor, is co-expressed with the classical estrogen receptor ERβ [33]. The former (GPR30) has been linked to endometrial carcinoma [35] so it is no surprise that the latter (ERβ) is also associated with this type of cancer [33]. However, co-regulation doesn't always have to mean the same pathway; studies have shown that consistently co-expressed genes, while possibly genetically linked [36,37], may also reside in distinct pathways [38]. Additionally, co-expressed non-paralogous genes, independent of common pathway involvement, often cluster together in different species and fall into chromosomal regions with low recombination rates [39,40], suggesting genetic linkage [39,40]. These findings suggest that clusters of co-expressed genes are selectively advantageous [36]. Possibly, these clusters are groups of genes that despite the apparent functional heterogeneity may be jointly involved in orchestrating complicated cellular functionality [41]. Evolutionary pressure works on maintaining co-expression of these genes and on keeping recombination rates within the clusters low. Thus, the fine-tuned cooperation of alleles is not broken by recombination, but rather transmitted as one entity to the next generation. De-regulation of these clusters is therefore likely to be deleterious to the organism and develop into disease.

Genes co-expressed with or genetically linked to other disease genes are also likely to be disease-associated. However, while genetic linkage and co-regulation are valuable markers of disease association, they also pose a specificity problem; i.e. a given disease-associated gene may be co-regulated with or linked to another disease-associated gene, where the two diseases are not identical. Genetic linkage similarly poses a problem for GWAS where it is difficult to distinguish between driver mutations, the actual causes of disease, and passenger mutations, co-occurring with the disease-mutations due to genetic linkage.

3.1.3 Similar sequence/structure/function. Reduced or absent phenotypic effect in response to gene knockout/inactivation is a common occurrence [42,43], largely explained by functional compensation, i.e. partial interchangeability of paralogous genes. In humans, genes with at least one paralogue, approximated by 90% sequence identity, are about three times less likely to be associated with disease as compared to genes with more remote homologs [44]. However, in the cases where paralogous functional compensation is insufficient to restore normal function, inactivation of any of the paralogues leads to the same or a similar disease. Prioritization tools thus often use functional similarity as an input feature. For example, one GeneOntology (GO, [45]) defined MC4R function is melanocyte-stimulating hormone receptor activity (GO:0004980). There are two other human gene products sharing this function: MSHR (MC1R, 52% sequence identity) and MC3R (61%). Predictors relying on functional similarity to annotate disease association would inevitably link both of these with obesity. These findings are confirmed by the recent studies for MC3R [46], but the jury still remains out for MC1R involvement.

Quantifying functional similarity is of utmost importance for the above approach. Using ontology-defined functions (e.g. GeneOntology) this problem reduces to finding a distance between two ontology nodes/subtrees (e.g. [47,48,49,50]). For un-annotated genes, however, sequence and structure homology is often used to transfer functional annotations from studied genes and proteins [51,52]. Since functionally similar genes are likely to produce similar disease phenotypes, sequence/structure similarities are good indicators of similar disease involvement. Additionally, disease genes are often associated with specific gene and protein features such as higher exon number and longer gene length, protein length, presence of signal peptides, higher distance to a neighboring gene and 3' UTR length, and lower sequence divergence from their orthologues [53,54]. Moreover, disordered proteins are often implicated in cancer [55].

3.2 Cross-species Evidence

Animal models exist for a broad range of human diseases in a number of well-studied laboratory organisms, i.e. mouse, zebrafish, fruit fly, etc. However, straightforward cross-species comparisons of orthologues and their associated phenotypic traits are also very useful. A high number of orthologues (consistent presence in multiple species) generally highlights essential genes that are prone to disease involvement. Orthologues generally participate in similar molecular pathways although different levels of function are necessary for different organisms (e.g. human MC4R is more functional than its polar bear orthologue [56]). Thus, cross-species tissue-specific phenotypic differentiation due to slightly varied sequences may be useful for gene prioritization. For example, the human MC4R and almost all of its close orthologues (e.g. in mouse, rat, pig, and cow) contain a conserved valine residue in the 95th position of the amino acid sequence. In the polar bear orthologue, however, this position is frequently occupied by an isoleucine residue [56]. When considering MC4R involvement in generating an obesity phenotype, it is useful to note that polar bears have a need for increased body fat content for thermal insulation, water buoyancy, and energy storage requirements [56] as compared to humans and to other organisms that share a conserved V95. Thus, one can imagine that the V95I mutation, while deleterious to the function of the receptor, is a polar bear specific adaptation to its environment, and may have a similar (increased body fat) effect in humans. In fact, V95I does inactivate the human receptor [57,58] and associates with obesity.

Comparing human and animal phenotypes is not always straightforward. Washington et al [59] have shown that phenotype ontologies facilitate genotype-phenotype comparisons across species. Disease phenotypes recorded in their ontology (OBD, ontology based database) can be compared to the similarly built cross-species phenotype ontologies using a set of proposed similarity metrics. Finding related phenotypes across species suggests orthologous human candidate genes. For instance, phenotypic similarities of eye abnormalities recorded in human and fly suggest that PAX6, a human orthologue of the phenotype-associated fly gene ey, is a possible disease-gene candidate. Further investigation shows that mutations in PAX6 may result in aniridia (absence of iris), corneal opacity (aniridia-related keratopathy), cataract (lens clouding), glaucoma, and long-term retinal degeneration (Figure 3) [59].

Figure 3. Correlating cross-species phenotypes. Phenotypes of wild-type (top) and PAX6 ortholog mutations (bottom) in human, mouse, zebrafish, and fly can be described with the EQ method suggested by Washington et al [59]. Once phenotypic descriptions are standardized across species, genotypic variations can be assessed as well.
doi:10.1371/journal.pcbi.1002902.g003

A correlation of gene co-expression across species is also useful for annotating disease genes [60,61]. Genes that are part of the same functional module are generally co-expressed. Also, there is evidence for co-expression of visibly functionally unrelated genes [37,62,63]. The explanation of these co-expression clusters having an evolutionary advantage only holds true for otherwise unjustified conservation of these clusters throughout different species; i.e. cross-species comparison of protein co-expression may be used for validation of disease-gene co-expression inference. Using this assumption, Ala et al [61] narrowed down the initial list of 1,762 genes in the loci mapped via genetic linkage to 850 OMIM (Online Mendelian Inheritance in Man) [64] phenotypes to twenty times fewer (81) possible disease-causing genes. For example, in their analysis a cluster of functionally unrelated genes co-expressed in human and mouse contained a bona fide disease-gene KCNIP4 (partial epilepsy with pericentral spikes).

3.3 Compartment Evidence

Changes in gene expression in disease-affected tissues are associated with many complex diseases [65]. Tissue specificity is also important for choosing correct protein-protein interaction networks, as some proteins interact in some tissues, but rarely in others [66]. Disease-associated cellular pathways (e.g. ion channels or endocytic membrane transport) and compartments (e.g. membrane or nucleus) implicate pathway/compartment-specific gene-products in disease as well. For example, autosomal recessive generalized myotonia (Becker's disease, GM) and autosomal dominant myotonia congenita (Thomsen's disease, MC) are characterized by skeletal muscle stiffness [67]. This phenotype is the result of muscle membrane hyperexcitability and, in conjunction with observed alterations in muscle chloride and sodium currents, points to possible involvement of deficiencies of the muscle chloride channel. In fact, studies point to the mutations in the transmembrane region of CLC-1, the muscle chloride channel coding gene, as the culprit [67]. Another example is that of the multiple storage diseases, such as Tay-Sachs, Gaucher, Niemann-Pick and Pompe disease, which are caused by the impairment of the degradation pathways of the intracellular vesicular transport. In fact, many of the genes implicated in these diseases encode proteins localized to endosomes (e.g. NPC1 in Niemann-Pick [68]) or lysosomes (e.g. GBA [69] in Gaucher, GAA in Pompe [70] and HEXA in Tay-Sachs [71]).

3.4 Mutant Evidence

By definition, every genetic disease is associated with some sort of mutation that alters normal functionality. In fact, primary selection of candidates for further analysis is often largely based on observations of polymorphisms in diseased individuals, which are absent in healthy controls (e.g. GWAS). However, not all observed polymorphisms are associated with deleterious effects. Note that, on average, gain and loss of function mutations are considered to alter normal functionality equally deleteriously. Most of the observed variation does not at all manifest phenotypically, some is weakly deleterious with respect to normal function, and less still is weakly beneficial. In nature strongly beneficial mutations are very rare; they spread rapidly in the population and cannot be considered disease-associated. On the other hand, strongly deleterious or inactivating mutations are often incompatible with life. A small percentage of mutations of this type, affecting genes whose function is not life-essential, are often associated with monogenic Mendelian disorders. Strongly deleterious
mutations in the genes whose CNTNAP2 genes. Both of these genes have coding-region SNPs. dbSNP data suggests
function may somehow be compensated been implicated in other neurological an even larger percentage of synonymity
(e.g. via paralogue activity) are associated disorders [75,76] reaffirming the possible ~188 thousand (44%), which is possibly
with complex disorders, where the level of disease link. Inversions, translocations and due to evolutionary pressure eliminating
compensation affects the observed pheno- large deletions and insertions have all been functionally deleterious non-synonymous
type. Complex disorders may also accu- implicated in different forms of disease. SNPs. MNPs are rare as compared to
mulate weakly deleterious mutations to Even very small indels, resulting in an open- SNPs, but are over-represented amongst
generate a strongly negative phenotype. reading frame shift (frameshift mutations), the protein altering variants, almost always
Intuitively it is clear that a selected are often sufficient to cause disease. For changing the affected amino acid, or two
candidate gene, carrying a deleterious instance, one of the causes of Tay-Sachs is a neighboring ones, or introducing a
mutation in an affected individual is more deletion of a single cytosine nucleotide in nonsense mutation (stop-codon) [83].
likely to be disease-associated than one the coding sequence of a lysosomal enzyme Identifying and annotating functional
which contains functionally neutral mu- beta-hexosaminidase [71]. effects of SNPs and MNPs is important in
tants or no variation at all. In most cases of diseases that are the context of gene prioritization because
3.4.1 Structural variation. Structural associated with SVs the prioritization of genes selected for further disease-associa-
variation (SV) is the least studied of all types disease-causing genes is reduced to finding tion studies are more likely to contain a
of mutations. It has long been assumed that those that are directly affected by the deleterious mutation or be under the
less than 10% of human genetic variation is mutation. Lots of work has been done in control of one (e.g. mutations affecting
in the form of genome structural variants this direction, including development of transcription factor or microRNA binding
(insertions and deletions, inversions, the CNVinetta package [77] for mining sites). In recent years a number of methods
translocations, aneuploidy, and copy and visualizing CNVs, GASV approach were created for identifying mutations as
number variations - CNVs). However, for identifying structural variation bound- functionally deleterious. PromoLign [84],
because each of the structural variants is aries more precisely [78], and software PupaSNP finder [85], and RAVEN [86]
large (kb-Mb scale), the total number of base created by Ritz et al for searching for look for SNPs affecting transcription,
pairs affected by SVs may actually be structural variants in strobe sequencing SNPper [87] finds and annotates SNP
comparable to the number of base pairs data [79]. SV identification is still a new locations, conservation, and possible func-
affected by the much more common SNPs field, but the advances in methodologies tionalities so that they can be visually
(single nucleotide polymorphisms). will have a great impact on our under- assessed, and SNPselector [88] and
Moreover, high throughput detection of standing and study of many of the known FASTSNP [89] assess various SNP fea-
structural variants is notoriously difficult diseases. tures such as whether it alters the binding
and is only now becoming possible with 3.4.2 Nucleotide polymorphisms. The site of a transcription factor, affects the
better sequencing techniques and CNV other ,90% of human variation exists in promoter/regulatory region, damages the
arrays. Thus, more SVs may be discovered the form of SNPs (single nucleotide 39 UTR sequence that may affect post-
in the near future. We do not currently know polymorphisms) and MNPs (multi- transcriptional regulation, or eliminates a
what proportion of genetic disease is caused nucleotide polymorphisms; consecutive necessary splice site. Coding synonymous
by SVs, but we suspect that it is high. nucleotide substitutions, usually of length SNPs have recently been shown to have
Due to the above mentioned constraints two or three). A single human genome is the same chance of being involved in a
on SV identification, there are only ~180 expected to contain roughly 10-15 million disease mechanism as non-coding SNPs
thousand structural variants reported in SNPs per person [80]. As many as 93% of [90]. This effect may be due to codon
one of the most complete mutation all human genes contain at least one SNP usage bias or to changes in splicing or
collections, the Database of Genomic and 98% of all genes are in the vicinity miRNA binding sites [91]. However, few
Variants, DGV [72]. Gross changes to (~5 kb) of a SNP [81]. The latest release of (if any) computational methods are able
genome sequence are very likely to be NCBI dbSNP database [82] (build 137) to make predictions with regard to their
disease associated, but also frequently gene contains nearly 43 million validated human functional effects.
non-specific. For instance, Down's syndrome, SNPs, 17.5 million of which have been Non-synonymous SNPs are somewhat
trisomy 21, is an example of a experimentally mapped to functionally more studied. Early termination of the
whole extra chromosome gain and cri du distinct regions of the genome (i.e. mRNA protein is very often associated with
chat syndrome results from the deletion of UTR, intron, or coding regions). Non- disease so genes with nonsense mutants
the short arm of chromosome 5 [73]. All coding region SNPs (,17.2 million) are are automatically moved up in the list of
of the genes found in these regions of the trivially more prevalent than coding SNPs possible suspects. Missense SNPs and
genome are, by default, associated with (,432 thousand) as non-coding DNA MNPs, which alter the protein sequence
the observed disease but neither can be makes up the vast majority of the without destroying it, may or may not be
considered primarily causal. When the genome. Coding SNPs, however, are disease associated. In fact, most methods
damage is less extensive the genes involved over-represented in disease associations; estimate that only 2530% of the nsSNPs
may be further evaluated for causation. e.g. OMIM contains 2430 non-coding negatively affect protein function [92].
For instance, several epilepsy-associated SNPs (0.0001% of all) and 5327 coding Databases like OMIM [93], and more
genes are known, but functionally-signifi- ones (0.01% of all 100-fold enrichment). explicitly, SNPdbe [94], SNPeffect [95],
cant mutations in these account for only a Due to the redundancy of the genetic code, PolyDoms [96], Mutation@A Glance [97]
small fraction of observed disease cases. coding SNPs can be further subdivided into and DMDM [98] map SNPs to known
One study [74] reports that CNV mutants synonymous (no effect on protein sequence) structural/functional effects and diseases.
found in epileptic individuals but not in the and non-synonymous (single amino acid Computational tools that make predictions
general population account for nearly nine substitution) SNPs. Simple statistics of the about functional and disease-associated
percent of all cases. Among these are CNVs genetic code suggest that synonymous effects of SNPs include SNAP [99,100],
resulting from deletions in AUTS2 and SNPs should account for 24% of all SIFT [101,102], PolyPhen [103,104],

PLOS Computational Biology | www.ploscompbiol.org 6 January 2013 | Volume 9 | Issue 4 | e1002902


PHD-SNP [105], SNPs3D [106], and many others. Most of these methods are binary in essence; that is, they point to a deficiency without suggesting specifics of the disease or molecular mechanisms of functional failure. Nevertheless, they are very useful in conjunction with other data described above. The recent trend in mutation analysis has seen the development of tools, like SNPNexus [107] and SNPEffectPredictor [108], that are no longer limited by DNA type and predict effects for both non-coding and coding region SNPs.

3.5 Text Evidence

The body of science that addresses gene-disease associations has been growing in leaps and bounds since the mapping of a hemoglobin mutation to sickle cell anemia. Some researchers have been proactive in making their data computationally available from databases like dbSNP, GAD [109], COSMIC [110], etc. Others have contributed by depositing knowledge obtained through reading and manual curation into the likes of PMD [111], GeneRIF [112] and UniProt [9]. However, huge amounts of data, which could potentially improve the performance of any gene prioritization method, remain hidden in plain sight in natural language text of scientific publications. Consider, for example, a scientist who is interested in prioritizing breast cancer genes. A casual search in PubMed for the term combination "breast cancer" generates over two hundred thousand matches. Limiting the field to "genetics of breast cancer" reduces the count to slightly fewer than fifty thousand. The past thirty days have brought about 46 new papers. Thus, someone interested in getting all the genetic information out of the PubMed collection would need to dedicate his or her life to reading. Fortunately, scientific text mining tools have recently come of age [113,114,115]. The new tools will allow for intelligent identification of possible gene-gene and disease-gene correlations [116,117,118]. For example, the Information Hyperlinked Over Proteins, IHOP method [119] links gene/protein names in scientific texts via associated phenotypes and interaction information. For automated link extraction, however, the existing gene prioritization techniques rely mostly on term co-occurrence statistics (e.g. PosMed [120] and GeneDistiller [121]) and gene-function annotations (e.g. ENDEAVOR [122] and PolySearch [123]), which can then be related to diseases as described above.

For a significantly oversimplified example of this type of processing consider searching PubMed for the terms "breast cancer" and "BRCA1". The initial search returns 50 articles, as compared to 21 for breast cancer with BRCA2, 6 with PIK3CA, 1 with TOX3, and 0 for MC4R or CLC1 associations. While the number of publications reflects many extraneous factors such as the popularity and research age of the protein, it is also very much reflective of the possibility of gene-disease association. Thus, BRCA1 and BRCA2 would be the most likely candidates for cancer association, followed by PIK3CA and TOX3. MC4R and CLC1 would not make the cut. Note that PubMed now defaults to a smart search engine, which identifies all aliases of the gene and the disease while cutting out more promiscuous matches; i.e. turning off the translation of terms would result in significantly more, but less accurate, matches. Using specialized tools like PolySearch (or IHOP) to perform the same queries produces more refined and quantifiable results (Figure 4).

Figure 4. PolySearch gene-disease associations. PolySearch uses PubMed lookup results to prioritize diseases associated with a given gene. Here, screen shots of the top two results (where available; sorted by relevancy score metric) from PolySearch are shown. According to these, BRCA1 and PIK3CA are associated with breast cancer, while MC4R and CLC1 are not. These results quantitatively confirm intuitive inferences made from simple PubMed searches.
doi:10.1371/journal.pcbi.1002902.g004

4. The Inputs and Outputs

Existing disease-gene prioritization methods vary based on the types of inputs that they use to produce their varied outputs. Functionality of prioritization methods is defined by previously known information about the disease and by candidate search space [124], which may be either submitted by the user or automatically selected by the tool. Disease information is generally limited to lists of known disease-associated genes, affected tissues and pathways and relevant keywords. The candidate search space either does not have to be input at all (i.e. it is the entire genome) or is defined by the suspect (for varied experimental reasons) genomic region. The prioritization accuracy, in large part, depends on the accuracy and specificity of the inputs. Thus, providing a list of very broad keywords may reduce the performance specificity, while an incorrect candidate search space automatically decreases sensitivity. Prioritization methods generally output ranked/ordered lists of genes, oftentimes associated with p-values, classifier scores, etc.

Overall, input and output requirements and formats are a very important part of establishing a tool's relevance for its users. As with other bioinformatics methods, the ease of use and the steepness of the learning curve for a given gene prioritization method often define the user base at least as strictly as does its performance.
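
As a rough illustration of the ranked-list output that prioritization methods produce, the PubMed article counts quoted in the breast cancer example (50, 21, 6, 1, 0, 0) can be turned into a naive ranking. This is only a sketch: the function name `rank_by_cooccurrence` and the `min_count` cutoff are hypothetical, and raw article counts are far cruder than the relevancy scores that tools such as PolySearch report.

```python
# Naive literature-based ranking: more co-occurring articles, higher rank.
# Counts are the ones quoted in the text for "breast cancer" + gene name.
pubmed_counts = {
    "BRCA1": 50,
    "BRCA2": 21,
    "PIK3CA": 6,
    "TOX3": 1,
    "MC4R": 0,
    "CLC1": 0,
}


def rank_by_cooccurrence(counts, min_count=1):
    """Return (gene, count) pairs sorted by descending article count.

    Genes with fewer than `min_count` co-occurring articles are dropped
    (a hypothetical cutoff, chosen here only for illustration).
    """
    kept = [(gene, n) for gene, n in counts.items() if n >= min_count]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_cooccurrence(pubmed_counts)
for position, (gene, n) in enumerate(ranking, start=1):
    print(f"{position}. {gene} ({n} articles)")
```

With the default cutoff, MC4R and CLC1 are dropped, mirroring the "would not make the cut" reasoning in the text; a real tool would also normalize for how heavily each gene is studied.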



Box 2. Illustrating basic functionality of a standard (on-line fully-interconnected feed-forward sigmoid-function back-propagating) neural network.

In the example network of Figure 5A there are three fully interconnected layers of neurons (input, hidden, and output layers); i.e. each neuron in one layer is connected to every neuron in the next layer. The three input neurons encode biologically relevant pieces of data relating a given gene G to a given disease D. For each G and D, i_neuron1 is the fraction of articles (out of 1000) containing in-text co-occurrences of G and D and i_neuron2 represents the presence/absence of a sequence-similar gene G' associated with D (i_neuron3 = G/G' sequence identity). The hidden (inference) layer consists of two neurons, h_neuron1 and h_neuron2, with activation thresholds $h_1$ and $h_2$, respectively. The single output, o_neuron (threshold $h_O$), represents the involvement of G in causing D: 0 = no involvement, 1 = direct causation. The starting weights of the network ($w_{i1-h1}$, $w_{i1-h2}$, ..., $w_{h2-o}$) are arbitrarily assigned random values between 0 and 1. Intuitively, the function of the network is to convert input neuron values into output neuron values via a network of weights and hidden neurons. Mathematically, the network is described as follows:

The value ($d_x$) of neuron $x$ is the sum of inputs into $x$ from the previous layer of neurons ($Y_{i=1 \to n}$ in general; in our example: $I_{1 \to 3}$, $H_{1 \to 2}$). Each of the $n$ inputs is a product of the value of neuron $Y_i$ and the weight of the connection between $Y_i$ and $x$ ($w_{Y_i \to x}$).

$$d_x = \sum_{i=1}^{n} Y_i \, w_{Y_i \to x}$$

The value of the output ($z_x$) of a neuron $x$ based on its $d_x$ and its threshold $h_x$ is:

$$z_x = f(d_x + h_x)$$

In our case, the function ($f$) is a sigmoid, where $a$ is a real number constant (optimized for any given network, but generally initially chosen to be between 0.5 and 2).

$$f(x) = \frac{1}{1 + e^{-ax}}$$

Thus, to compute the output of every neuron in the network we need to use the formula:

$$z_x = \frac{1}{1 + e^{-a(d_x + h_x)}}$$

Note that to compute the output of the o_neuron ($z_O$; the prediction made by the network) we first have to compute the outputs of all h_neurons ($z_{H_{i=1 \to n}}$).
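
The forward computation described above can be sketched in a few lines of Python. This is a minimal illustration and not the WEKA implementation: the function names, the example weights and the input values are all made up for the demonstration; only the formulas for the neuron value and the sigmoid output come from the box.

```python
import math


def sigmoid(x, a=0.5):
    """f(x) = 1 / (1 + e^(-a*x)); `a` is the steepness constant from the text."""
    return 1.0 / (1.0 + math.exp(-a * x))


def neuron_output(inputs, weights, threshold, a=0.5):
    """Output of one neuron: f(weighted sum of inputs + threshold)."""
    d_x = sum(y * w for y, w in zip(inputs, weights))
    return sigmoid(d_x + threshold, a)


def forward(inputs, w_in_hid, h_hid, w_hid_out, h_out, a=0.5):
    """Compute the hidden outputs first, then the single network output."""
    z_hid = [neuron_output(inputs, w, h, a) for w, h in zip(w_in_hid, h_hid)]
    z_out = neuron_output(z_hid, w_hid_out, h_out, a)
    return z_hid, z_out


# Hypothetical 3-2-1 network: weights and thresholds are arbitrary values.
w_in_hid = [[0.2, 0.7, 0.5],   # weights into h_neuron1 from the 3 inputs
            [0.9, 0.1, 0.4]]   # weights into h_neuron2 from the 3 inputs
h_hid = [0.0, 0.0]
w_hid_out = [0.6, 0.3]
h_out = 0.0

# Made-up inputs: literature fraction, homologue present, sequence identity.
z_hid, z_out = forward([0.05, 1.0, 0.6], w_in_hid, h_hid, w_hid_out, h_out)
print(z_hid, z_out)
```

Because the sigmoid maps any real value into the interval (0, 1), the output can be read directly on the 0 (no involvement) to 1 (direct causation) scale used in the box.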

In a supervised learning paradigm, experimentally established pairs of inputs and outputs are given to the network during training (Figure 5C). After each input, the network output ($z_O$) is compared to the observed result ($R$). If the network makes a classification error its weights are adjusted to reflect that error. Establishing the best way to update weights and thresholds in response to error is one of the major challenges of neural networks. Many techniques use some form of the delta rule, a gradient descent-based optimization algorithm that makes changes to function variables proportionate to the negative of the approximate gradient of the function at the given point. [It's OK if you didn't understand that sentence; the basic idea is to change the weights and thresholds in the direction opposite of the direction of the error]. In our example, we use the delta rule with back-propagation. This means that to compute the error of the hidden layer, the threshold of the output layer ($h_O$) and the weights connecting the hidden layer to the output layer ($w_{h1 \to O}$, $w_{h2 \to O}$) need to be changed first.

The steps are as follows:

1. Compute the error ($e_O$) of $z_O$ as compared to result $R$. Note that the difference between the expected and the observed values defines the gradient ($g$) at the output neuron.

$$e_O = z_O (1 - z_O)(R - z_O)$$

2. Compute the change in the threshold of the output layer ($\Delta h_O$), using a variable $l$, the learning rate constant (a real number, often initialized to 0.1-0.2 and optimized for each network).

$$\Delta h_O = l \, e_O$$



3. Compute the change in the weights connecting the hidden layer to the output, $w_{H_i \to O}$.

$$\Delta w_{H_i \to O} = \Delta h_O \, H_i$$

4. Compute the gradient ($g_i$) at hidden neurons

$$g_i = e_O \, w_{H_i \to O}$$

Note, from here all steps are the same as above


5. Compute the error at $z_{H_i}$

$$e_{H_i} = z_{H_i} (1 - z_{H_i}) \, g_i$$

6. Compute the change in $h_{H_i}$

$$\Delta h_{H_i} = l \, e_{H_i}$$

7. Compute the change in $w_{I_j \to H_i}$

$$\Delta w_{I_j \to H_i} = \Delta h_{H_i} \, I_j$$

In the on-line updating mode of our example, weights and thresholds are altered after each set of input transmissions. Once the network has seen the full set of input/output pairs (one epoch/iteration), training continues re-using the same set until the performance is satisfactory. Note that neural networks are sensitive to dataset imbalance; i.e. it is preferable to balance the training data, such that instances of each class are presented a roughly equal number of times.
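
Steps 1-7 and the on-line epoch loop can be put together as follows. This is a sketch under stated assumptions rather than the procedure WEKA uses: the dictionary layout, the function names, the zero initial thresholds and the four-instance toy data set are all invented for illustration (they are not the Figure 5C data), and each change is applied as soon as it is computed, following the order of the steps in the box.

```python
import math
import random


def sigmoid(x, a):
    return 1.0 / (1.0 + math.exp(-a * x))


def forward(x, net, a):
    """Return the hidden-layer outputs and the single network output."""
    z_h = [sigmoid(sum(xj * w for xj, w in zip(x, ws)) + h, a)
           for ws, h in zip(net["w_ih"], net["h_h"])]
    z_o = sigmoid(sum(z * w for z, w in zip(z_h, net["w_ho"])) + net["h_o"], a)
    return z_h, z_o


def train_step(x, r, net, a=0.5, l=0.2):
    """One on-line delta-rule update for a single (inputs, result) pair."""
    z_h, z_o = forward(x, net, a)
    e_o = z_o * (1 - z_o) * (r - z_o)        # step 1: output error
    dh_o = l * e_o                           # step 2: output threshold change
    net["h_o"] += dh_o
    for i, z in enumerate(z_h):              # step 3: hidden-to-output weights
        net["w_ho"][i] += dh_o * z
    for i, z in enumerate(z_h):
        g_i = e_o * net["w_ho"][i]           # step 4: gradient at hidden neuron
        e_h = z * (1 - z) * g_i              # step 5: hidden error
        dh_h = l * e_h                       # step 6: hidden threshold change
        net["h_h"][i] += dh_h
        for j, xj in enumerate(x):           # step 7: input-to-hidden weights
            net["w_ih"][i][j] += dh_h * xj


def squared_error(data, net, a):
    return sum((r - forward(x, net, a)[1]) ** 2 for x, r in data)


def train(data, net, epochs=500, a=0.5, l=0.2):
    for _ in range(epochs):                  # one epoch = one pass over the data
        for x, r in data:
            train_step(x, r, net, a, l)


random.seed(0)
net = {
    "w_ih": [[random.random() for _ in range(3)] for _ in range(2)],
    "h_h": [0.0, 0.0],                       # assumption: thresholds start at 0
    "w_ho": [random.random() for _ in range(2)],
    "h_o": 0.0,
}
# Toy data (made up): literature fraction, homologue present, identity -> result.
toy = [([0.90, 1.0, 0.8], 1), ([0.80, 1.0, 0.6], 1),
       ([0.10, 0.0, 0.0], 0), ([0.05, 0.0, 0.1], 0)]

before = squared_error(toy, net, a=0.5)
train(toy, net)
after = squared_error(toy, net, a=0.5)
print(f"squared error: {before:.3f} -> {after:.3f}")
```

The toy set is deliberately balanced (two instances per class), in line with the note on dataset imbalance. Many textbook formulations compute the hidden-layer gradient from the weights as they were before the update; the difference is small for a small learning rate.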

In testing, updating of the weights no longer takes place; i.e. the $z_O$ for any given set of inputs is constant over time. See Exercise 8 for hands-on experience with testing. Note that there are many variations on the type and parameters of network learning (propagation mode and direction, weight update rules, thresholds for stopping, etc.). Please consult the relevant literature for more information, e.g. [134].

5. The Processing

Gene prioritization methods use different algorithms to make sense of all the data they extract, including mathematical/statistical models/methods (e.g. GeneProspector [125]), fuzzy logic (e.g. ToppGene [126,127]), and artificial learning devices (e.g. PROSPECTR [54]), among others. Some methods use combinations of the above. Objectively, there is no one methodology that is better than the others for all data inputs. For more details on computational methods used in the various approaches please refer to relevant tool publications and method-specific computer science/mathematics literature, e.g. [128,129,130,131,132,133,134].

To illustrate the general concepts of relying on the various computational techniques for gene prioritization we will consider the use of an artificial neural network (ANN). Keep in mind that while methods and their requirements differ, the notion of identifying patterns in the data that may be indicative of disease-gene involvement remains the same throughout. In simplest terms, a neural network is essentially a mathematical model that defines a function f: X → Y, where a distribution over X (the inputs to the network) is mapped to a distribution over Y (the outputs/classifications). The word "network" in the name "artificial neural network" refers to the set of connections between the neurons (Figure 5). The functionality of the network is defined by the transmission of signal from activated neurons in one layer to the neurons in another layer via established (and weighted) connections. Besides the choice and number of inputs and outputs, the parameters defining a given ANN are (1) interconnection patterns, (2) the process by which the weights of connections are selected/updated (learning function), and (3) the activation thresholds (functions) of any one given neuron. Training a network means optimizing these parameters using an existing set of inputs (and, possibly, outputs). Ultimately, a trained network could then relatively accurately recognize learned patterns in previously unseen data. For more details regarding the possible types and parameters of neural networks see [132,134]. For an illustration of network application see Box 2 and Figure 5.

6. Summary

The development of high throughput technologies has augmented our abilities to identify genetic deficiencies and inconsistencies that lead to the development of diseases. However, a large portion of information in the heaps of data that these methods produce is incomprehensible to the naked eye. Moreover, inferences that could potentially be made from combining



Figure 5. Predicting gene-disease involvement using artificial neural networks (ANNs). In a supervised learning paradigm, the neural
networks are trained using experimental data correlating inputs (descriptive features relating genes to diseases) to outputs (likelihood of gene-
disease involvement). The training and testing procedures for the generalized network (Panel A) are described in text. In our example, the WEKA
[129,130,131,139] ANN (Panel B; a = 0.5, l = 0.2) is trained using the training set (Panel C) repeated 500 times (epochs). The network memorizes
(Predictions in Panel C) the patterns in the training set and is capable of making accurate predictions for four out of seven instances it has not seen
before (test set, Panel D). It is important to note here that the erroneously assigned instances (yellow highlight) in the test set are, for the most part,
unlike the training instances. The first one has very little literature correlation (0.01), while sequence similarity to another disease-involved gene is fairly high
(0.55). The second maps an unlikely candidate gene (very low literature, no homology) to disease, and the third has barely enough literature mapping
and borderline homology. None of these instances was consistently represented in the training set. This example highlights the
importance of training using a representative training set, while testing on a set that is not equivalent to the training set.
doi:10.1371/journal.pcbi.1002902.g005

different studies and existing research results are beyond reach for anyone of human (not cyborg) descent. Gene prioritization methods (Table 1) have been developed to make sense of this data by extracting and combining the various pieces necessary to link genes to diseases. These methods rely on experimental work such as disease gene linkage analysis and genome wide studies to establish the search space of candidate genes that may possibly be involved in generating the observed phenotype. Further, they utilize mathematical and computational models of disease to filter the original set of genes based on gene and protein sequence, structure, function, interaction, expression, and tissue and cellular localization information. Data repositories that contain the necessary information are diverse in both content and format and require deep knowledge of the stored information to be properly interpreted. Moreover, the models utilizing the various sources assign different weights to the information they extract based on perceived quality and importance of each piece of data available in the context of the entire set of descriptors, a function unlikely to be reproduced in manual data interpretation. Thus, computational gene prioritization techniques serve as interpreters both of newly retrieved data and of information contained in previous studies. They also are the bridge that connects seemingly unrelated inferences, creating an easily comprehensible outlook on an important problem of disease gene annotation.

7. Exercises

1. Search the GAD (http://geneticassociationdb.nih.gov/) database for all genes reported to be associated with diabetes. Refine this set to find only the positively associated genes. How many are there? Why was the total data set reduced? Count the number of unique diabetes associated genes or explain why this is not feasible. How many SNPs associate these genes with diabetes? Is it realistically possible to experimentally evaluate individual effects of each SNP in this set?

2. Using STRING (http://string-db.org/), find all genes (hint: use limit of 50) interacting with insulin (confidence >0.99). Note, this confidence limit is extremely high; computational techniques would normally deal with lower limits and thus larger data sets. What is the insulin gene name used by STRING? How many interaction partners does your query return? Switch to STRING evidence view. Pick three genes connected to insulin via text mining, but without "insulin" in their full name, and find one reference for each in PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) suggesting that these genes are involved with diabetes. Report Gene IDs (e.g. MC4R), PubMed IDs and publication citations. Use PolySearch



Table 1. The available data sources and gene prioritization tools.

Data Type: Experiment, observation
Data Content: Linkage, association, pedigree, relevant texts and other data
Possible Sources: User provided
Tools: CAESAR [140], CANDID [141], ENDEAVOR [122], G2D [15,16,17], Gentrepid [142], GeneDistiller [121], PGMapper [143], PRINCE [144], Prioritizer [145], SUSPECTS [146], ToppGene [126,127]

Data Type: Sequence, structure, meta-data
Data Content: Sequence conservation, exon number, coding region length, known structural domains and sequence motifs, chromosomal location, protein localization, and other gene-centered information and predictions
Possible Sources: SCOP [147], PFam [148,149], ProSite [150], UniProt, Entrez Gene [151], ENSEMBL [152], InterPro [153], LocDB [154], GeneCards [155], PredictProtein [156]
Tools: CAESAR, CANDID, ENDEAVOR, G2D, Gentrepid, GeneDistiller, GeneProspector [125], MedSim [157], MimMiner [158], PGMapper, PhenoPred [159], Prioritizer, PROSPECTR [54], SNPs3D [106], SUSPECTS, ToppGene

Data Type: Pathway, protein-protein interaction, genetic linkage, expression
Data Content: Disease-gene associations, pathways and gene-gene/protein-protein interactions/interaction predictions, and gene expression data
Possible Sources: KEGG [160,161], STRING, Reactome [162,163], DIP [164], BioGRID [165], GEO [166,167], ArrayExpress [168], ReLiance [169]
Tools: CAESAR, CANDID, DiseaseNet [170], ENDEAVOR, G2D, Gentrepid, GeneDistiller, GeneWanderer [20], MaxLink [171], MedSim, PGMapper, PhenoPred, PRINCE, Prioritizer, SNPs3D, SUSPECTS, ToppGene

Data Type: Non-human data
Data Content: Information about related genes and phenotypes in other species
Possible Sources: OrthoDisease [172], OrthoMCL [173], MGD [174], Pathbase [175]
Tools: CAESAR, CANDID, ENDEAVOR, GeneDistiller, GeneProspector, GeneWanderer, MedSim, Prioritizer, PROSPECTR, SNPs3D, SUSPECTS, ToppGene

Data Type: Ontologies
Data Content: Gene, disease, phenotype, and anatomic ontologies
Possible Sources: GO, DO [176], MPO [177,178], HPO [179], eVOC [180]
Tools: CAESAR, ENDEAVOR, G2D, GeneDistiller, MedSim, PhenoPred, Prioritizer, SNPs3D, ToppGene

Data Type: Mutation associations and effects
Data Content: Information about existing mutations, their functional and structural effects and their association with diseases, predictions of functional or structural effects for the mutations in the gene in question
Possible Sources: dbSNP, PMD [111], GAD, DMDM, SNAP, PolyDoms, SNPdbe, SNPselector, RAVEN, SNPeffect, PHD-SNP, Mutation@A Glance, PromoLign, SIFT, PolyPhen, PupaSNP finder, FASTSNP
Tools: CAESAR, CANDID, GeneProspector, GeneWanderer, PROSPECTR, SNPs3D, SUSPECTS

Data Type: Literature
Data Content: Mixed information of all types extracted from literature references (e.g. disease-gene correlation and non-ontology based gene-function assignment)
Possible Sources: PubMed, PubMed Central, HGMD [181], GeneRIF, OMIM
Tools: CAESAR, CANDID, DiseaseNet, ENDEAVOR, G2D, Gentrepid, GeneDistiller, GeneProspector, GeneWanderer, MedSim, MimMiner, PGMapper, PolySearch [123], PRINCE, Prioritizer, PROSPECTR, SNPs3D, SUSPECTS, ToppGene

There is a wide range of data sources that can be used to infer the above-described pieces of evidence. The existing tools try to take advantage of many (if not all) of
them. This table summarizes the collections and methodologies that make current state of the art in gene prioritization possible. Note, not all resources mentioned here
are utilized by all gene prioritization tools nor are all data sources available listed. Moreover, some resources may be classified as more than one data-type. Many of the
resources reported here are available electronically through the gene prioritization portal [124].
doi:10.1371/journal.pcbi.1002902.t001

(http://wishart.biology.ualberta.ca/polysearch) gene to disease mapping with your gene IDs to do the same. Does your experience confirm that the functional molecular interaction evidence works? Why?

3. In AmiGO (GO term browser, http://www.geneontology.org), find the human insulin record (hint: use the insulin ID obtained above). What is the Swiss-Prot ID for insulin? Go to the term view. How many GO term associations does insulin have? Reduce the view to molecular function terms. How many terms are left? Create a tree view of these terms (hint: use the "Perform an action" dropdown). Which of the terms is the most exact in defining the likely molecular function of insulin (lowest term in a tree hierarchy)? Display gene products in "GO:0005158: insulin receptor binding", reduce the set to human proteins, and look at the inferred tree. How many gene products are in this term? Pick a set of three gene products (report IDs) and use them to search PolySearch for diabetes associations. In question 3 we used the common pathway evidence to show the relationship of genes to diabetes. What type of predictive evidence is used here?

4. Search the Mammalian Phenotype Ontology for keyword diabetes and select increased susceptibility (MPO, http://www.informatics.jax.org/searches/MP_form.shtml). How many genotypes are returned? Display the genotypes and click on the Airetm1Mand/Aire+ genotype for further exploration. What is the affected gene? Click on gene title (Gene link in Nomenclature section) to display further information. What is an orthologue? What is the human orthologue of your mouse gene? Look up this gene in OMIM (http://www.ncbi.nlm.nih.gov/omim) for association with diabetes. Copy/paste the citation from OMIM, describing the gene relationship to diabetes in humans. Do your



results confirm the cross-species evidence?

5. Search GeneCards (http://www.genecards.org, utilize advanced search) for genes expressing in the pancreas (hint: pancreatic tissue is often affected in diabetes). How many are there? Explore the GeneCard for CCKBR for diabetes association. Do you find that this gene confirms the disease compartment evidence? What database, referenced in GeneCards, contains the CCKBR-diabetes association? Now look at the GeneCard of PLEKHG4. Is there evidence for this gene being associated with diabetes (whether in the GeneCards record or otherwise)? Explain your ideas in detail, paying special attention to the disease compartment line of evidence.

6. Search UniProt (http://www.uniprot.org) for all reviewed [reviewed:yes] human [organism:Homo sapiens [9606]] protein entries that contain natural variants with reference to diabetes [annotation:(type:natural_variations diabetes)]. Use advanced search with specific limits (i.e. sequence annotation, natural_variations, term diabetes). How many proteins fit this description? Locate the entry for insulin (identifier from question 3) and find the total number of known coding variants of this sequence. How many are annotated as associated with any form of diabetes? (hint: read the general annotation section for correspondence of abbreviations to types of diabetes). Run SNAP (http://www.rostlab.org/services/snap/) to predict functional effects of all variants (hint: use comma separated batch submit). How many are predicted to be functionally non-neutral? Do SNAP predictions of functional effect correlate with annotated disease associations? Does this result confirm the mutant implication for nsSNPs?

7. Search PolySearch for all genes associated with diabetes. How many results are returned? Look at the PubMed articles that associate hemoglobin with diabetes (follow the link from PolySearch). How many are there? Do you find this number large enough to convince you of hemoglobin-diabetes association and why? From reading article titles/extracted sentences, can you identify a biological reason for connecting hemoglobin to diabetes? If one looks especially convincing, cite that article (hint: it's OK to not find one). For the first three articles, can you identify a biological reason for connecting hemoglobin to diabetes? Go back to the list of diabetes related genes and look at TCF7L2 articles. Are the biological reasons for matching TCF7L2 to diabetes clearly defined? Cite the most convincing article. Why do you think TCF7L2 is ranked lower in association than hemoglobin? Is there significant evidence for calcium channel (CACNA1E) involvement in diabetes? Consider the PubMed citations. Do you agree with PolySearch classification of this gene-disease association? Does your experience with PolySearch confirm the text evidence function of gene prioritization methods?

8. WEKA exercises (choose one). Download and install WEKA (http://www.cs.waikato.ac.nz/~ml/weka/). Using a text editor (or Microsoft Excel) create comma delimited values (CSV) files identical to the ones described in Figure 5C-D (i.e. copy over the training and testing files and replace spaces with commas). Save the files and open the training file in WEKA's Explorer GUI. You should have four columns of data (Text, Homology, ID, Disease) corresponding to four attributes of each data instance.

8.1. Defined Questions: Run the MultiLayer Perceptron with parameters (momentum = 0.5, learning = 0.2, trained using the training set, Figure 5C, repeated 500 times/epochs). Test with the test set (Figure 5D) and output predictions for each test entry (make a screenshot). Assume that everything predicted below 0.5 is 0, and everything above is 1. What is your performance (number of true/false positives/negatives, positive/negative accuracy/coverage, overall accuracy)? Try using the Decision Stump classifier with default parameters (take screenshot of output). If everything below 0.5 is 0, and everything above is 1, what is your performance? Is it better or worse than the neural net?

8.2. Open ended: Experiment with different tools available from WEKA's Classify section, setting the testing set to your test-file's location. First, run the MultiLayer Perceptron with parameters as described in Figure 5, then try to alter the parameters (momentum term, learning rate, and number of epochs). Try using Linear Regression, Decision Table, or Decision Stump classifiers with default parameters. Is your performance on the test set better or worse? Close the WEKA Explorer, reformat your train/test files in the text editor to replace Disease column values by Boolean (True/False) values, and re-open the training file. Use BayesianNet and RandomForest classifiers to test on the testing file. Does your performance improve? Note that without further understanding of each of the tools, it is nearly impossible to determine which method is applicable to your data.

Answers to the Exercises can be found in Text S1.

Supporting Information

Text S1 Answers to Exercises. (DOCX)

Acknowledgments

The author would like to thank Chengsheng Zhu (Rutgers University), Chani Weinreb (Columbia University) and Nikolay Samusik (Max Planck Institute, Dresden) for critical reading and comments to the manuscript. She also acknowledges the help of Gregory Behringer (Rutgers University) and of all of the students of the Spring 2012 Bioinformatics course at Rutgers in testing the exercises.



Glossary

• Annotation: any additional information about a genetic sequence. Annotation types are extremely varied, including functional, structural, regulatory, location-related, organism-specific, experimentally derived, predicted, etc.
• CNV, copy number variation: an alteration of the genome, which results in an individual having a non-standard number of copies of one or more DNA sections.
• Gene prioritization: the process of arranging possible disease causing genes in order of their likelihood of disease involvement.
• GWAS, genome wide association studies: the examination of all genes in the genome to correlate their variation to phenotypic trait variation across individuals in a given population.
• Genetic linkage: tendency of certain genetic regions on the same chromosome to be inherited together more often than expected due to limited recombination between them.
• Genetic marker: a DNA sequence variant with a known location that can be used to identify specific subsets of individuals (cells, species, individual organisms, etc.).
• Homologue: a gene derived from a common ancestor with the reference gene. Generally, gene A is a homologue of gene B if both are derived from a common ancestor.
• Linkage disequilibrium: tendency of certain genetic regions (not necessarily on the same chromosome) to be inherited together more often than expected from considering their population frequencies. In reference to gene prioritization, this phenomenon may complicate establishment of causal genes due to their consistent inheritance in complex with non-causal genetic regions.
• Orthologues: homologous genes separated by a speciation event. Generally, gene A is an orthologue of gene B if A and B are homologous, but reside in different species. Orthologues often perform the same general function in different organisms.
• Paralogues: homologous genes separated by a duplication event (often followed by copy differentiation). Generally, gene A is a paralogue of gene B if A and B are homologous and reside in the same species. A and B can be functionally identical or, on the contrary, very different, but are often only slightly dissimilar.
• Pleiotropy: the influence of a single gene on a number of phenotypic traits.

Further Reading

• Alterovitz G, Ramoni M, eds. (2010) Knowledge-based bioinformatics: from analysis to interpretation. Padstow, Cornwall: John Wiley and Sons Ltd.
• Bromberg Y, Capriotti E, eds. (2012) SNP-SIG 2011: identification and annotation of SNPs in the context of structure, function and disease. Proceedings from SNP-SIG 2011 conference, Vienna, Austria. BMC Genomics 13 Suppl 4.
• Chen JY, Youn E, Mooney SD (2009) Connecting protein interaction data, mutations, and disease using bioinformatics. Methods Mol Biol 541: 449-461.
• Dalkilic MM, Costello JC, Clark WT, Radivojac P (2008) From protein-disease associations to disease informatics. Front Biosci 13: 3391-3407.
• Evans JA, Rzhetsky A (2011) Advancing science through mining libraries, ontologies, and communities. J Biol Chem 286: 23659-23666.
• Kann MG (2007) Protein interactions and disease: computational approaches to uncover the etiology of diseases. Brief Bioinform 8: 333-346.
• Krallinger M, Leitner F, Valencia A (2010) Analysis of biological processes and diseases using text mining approaches. Methods Mol Biol 593: 341-382.
• Liberles DA, Teichmann SA, Bahar I, Bastolla U, Bloom J, et al. (2012) The interface of protein structure, protein biophysics, and molecular evolution. Protein Sci 21: 769-785.
• Maulik U, Bandyopadhyay S, Wang JTL, eds. (2010) Computational intelligence and pattern analysis in biological informatics. Hoboken, NJ: John Wiley and Sons, Inc.
• Mooney SD, Krishnan VG, Evani US (2010) Bioinformatic tools for identifying disease gene and SNP candidates. Methods Mol Biol 628: 307-319.
• Moreau Y, Tranchevent LC (2012) Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nat Rev Genet 13: 523-536.
• Oti M, Brunner HG (2007) The modular nature of genetic diseases. Clin Genet 71: 1-11.
• Piro RM, Di Cunto F (2007) Computational approaches to disease-gene prediction: rationale, classification and successes. FEBS J 279: 678-696.

References
1. Herrick JB (2001) Peculiar elongated and sickle- 23. Gandhi TK, Zhong J, Mathivanan S, Karthick 42. Gu Z, Steinmetz LM, Gu X, Scharfe C, Davis
shaped red blood corpuscles in a case of severe L, Chandrika KN, et al. (2006) Analysis of the RW, et al. (2003) Role of duplicate genes in
anemia. 1910. Yale J Biol Med 74: 179184. human protein interactome and comparison genetic robustness against null mutations. Na-
2. Pauling L, Itano HA, Singer SJ, Wells IC (1949) with yeast, worm and fly interaction datasets. ture 421: 6366.
Sickle cell anemia, a molecular disease. Science Nat Genet 38: 285293. 43. Conant GC, Wagner A (2004) Duplicate genes
109: 443. 24. Jensen LJ, Kuhn M, Stark M, Chaffron S, and robustness to transient gene knock-downs in
3. Ingram VM (1956) A specific chemical differ- Creevey C, et al. (2009) STRING 8a global Caenorhabditis elegans. Proc Biol Sci 271: 89
ence between the globins of normal human and view on proteins and their functional interac- 96.
sickle-cell anaemia haemoglobin. Nature 178: tions in 630 organisms. Nucleic Acids Res 37: 44. Hsiao TL, Vitkup D (2008) Role of duplicate
792794. D412416. genes in robustness against deleterious human
4. Gusella JF, Wexler NS, Conneally PM, 25. Snel B, Lehmann G, Bork P, Huynen MA mutations. PLoS Genet 4: e1000014.
Naylor SL, Anderson MA, et al. (1983) A (2000) STRING: a web-server to retrieve and doi:10.1371/journal.pgen.1000014
polymorphic DNA marker genetically linked display the repeatedly occurring neighbourhood 45. Ashburner M, Ball CA, Blake JA, Botstein D,
to Huntingtons disease. Nature 306: 234 of a gene. Nucleic Acids Res 28: 34423444. Butler H, et al. (2000) Gene ontology: tool for
238. 26. Huszar D, Lynch CA, Fairchild-Huntress V, the unification of biology. The Gene Ontology
5. Woo SL, Lidsky AS, Guttler F, Chandra T, Dunmore JH, Fang Q, et al. (1997) Targeted Consortium. Nat Genet 25: 2529.
Robson KJ (1983) Cloned human phenylalanine disruption of the melanocortin-4 receptor results 46. Mencarelli M, Walker GE, Maestrini S, Alberti
hydroxylase gene allows prenatal diagnosis and in obesity in mice. Cell 88: 131141. L, Verti B, et al. (2008) Sporadic mutations in
carrier detection of classical phenylketonuria. 27. Lubrano-Berthelier C, Le Stunff C, Bougneres melanocortin receptor 3 in morbid obese
Nature 306: 151155. P, Vaisse C (2004) A homozygous null mutation individuals. Eur J Hum Genet 16: 581586.
6. Robertson M (1984) Towards a medical eugen- delineates the role of the melanocortin-4 recep- 47. Lord PW, Stevens RD, Brass A, Goble CA
ics? Br Med J (Clin Res Ed) 288: 429430. tor in humans. J Clin Endocrinol Metab 89: (2003) Investigating semantic similarity measures
7. (1993) A novel gene containing a trinucleotide 20282032. across the Gene Ontology: the relationship
repeat that is expanded and unstable on 28. Farooqi IS, Keogh JM, Yeo GS, Lank EJ, between sequence and annotation. Bioinfor-
Huntingtons disease chromosomes. The Hun- Cheetham T, et al. (2003) Clinical spectrum of matics 19: 12751283.
tingtons Disease Collaborative Research obesity and mutations in the melanocortin 4 48. Wang JZ, Du Z, Payattakool R, Yu PS, Chen
Group. Cell 72: 971983. receptor gene. N Engl J Med 348: 10851095. CF (2007) A new method to measure the
8. Yip YL, Famiglietti M, Gos A, Duek PD, David 29. Challis BG, Coll AP, Yeo GS, Pinnock SB, semantic similarity of GO terms. Bioinformatics
FP, et al. (2008) Annotating single amino acid Dickson SL, et al. (2004) Mice lacking pro- 23: 12741281.
polymorphisms in the UniProt/Swiss-Prot opiomelanocortin are sensitive to high-fat feed- 49. del Pozo A, Pazos F, Valencia A (2008) Defining
knowledgebase. Hum Mutat 29: 361366. ing but respond normally to the acute anorectic functional distances over gene ontology. BMC
9. UniProt Consortium (2010) The Universal effects of peptide-YY(3-36). Proc Natl Acad Bioinformatics 9: 50.
Protein Resource (UniProt) in 2010. Nucleic Sci U S A 101: 46954700. 50. Schlicker A, Albrecht M (2010) FunSimMat
Acids Res 38: D142148. 30. Yaswen L, Diehl N, Brennan MB, Hochgesch- update: new features for exploring functional
wender U (1999) Obesity in the mouse model of similarity. Nucleic Acids Res 38: D244D248.
10. Bairoch A, Apweiler R (2000) The SWISS-
PROT protein sequence database and its pro-opiomelanocortin deficiency responds to 51. Punta M, Ofran Y (2008) The rough guide to in
supplement TrEMBL in 2000. Nucleic Acids peripheral melanocortin. Nat Med 5: 1066 silico function prediction, or how to use
1070. sequence and structure information to predict
Res 28: 4548.
protein function. PLoS Comput Biol 4:
11. Moreau Y, Tranchevent LC (2012) Computa- 31. Helder SG, Collier DA (2011) The genetics of
e1000160. doi:10.1371/journal.pcbi.1000160
tional tools for prioritizing candidate genes: eating disorders. Curr Top Behav Neurosci 6:
52. Rentzsch R, Orengo CA (2009) Protein function
boosting disease gene discovery. Nature reviews 157175.
predictionthe power of multiplicity. Trends
Genetics 13: 523536. 32. van Noort V, Snel B, Huynen MA (2003)
Biotechnol 27: 210219.
12. Potter JD (1999) Colorectal cancer: molecules Predicting gene function by conserved co-
53. Lopez-Bigas N, Ouzounis CA (2004) Genome-
and populations. J Natl Cancer Inst 91: 916 expression. Trends Genet 19: 238242.
wide identification of genes likely to be involved
932. 33. Huang GS, Gunter MJ, Arend RC, Li M, Arias-
in human genetic disease. Nucleic Acids Res 32:
13. Frosst P, Blom HJ, Milos R, Goyette P, Pulido H, et al. (2010) Co-expression of GPR30 31083114.
Sheppard CA, et al. (1995) A candidate genetic and ERbeta and their association with disease 54. Adie EA, Adams RR, Evans KL, Porteous DJ,
risk factor for vascular disease: a common progression in uterine carcinosarcoma. Pickard BS (2005) Speeding disease gene
mutation in methylenetetrahydrofolate reduc- Am J Obstet Gynecol 203: 242 e241245. discovery by sequence based candidate prioriti-
tase. Nat Genet 10: 111113. 34. Jesmin J, Rashid MS, Jamil H, Hontecillas R, zation. BMC Bioinformatics 6: 55.
14. Thomas DC, Conti DV, Baurley J, Nijhout F, Bassaganya-Riera J (2010) Gene regulatory 55. Iakoucheva LM, Brown CJ, Lawson JD, Obra-
Reed M, et al. (2009) Use of pathway informa- network reveals oxidative stress as the underly- dovic Z, Dunker AK (2002) Intrinsic disorder in
tion in molecular epidemiology. Hum Genomics ing molecular mechanism of type 2 diabetes and cell-signaling and cancer-associated proteins.
4: 2142. hypertension. BMC Med Genomics 3: 45. J Mol Biol 323: 573584.
15. Perez-Iratxeta C, Bork P, Andrade MA (2002) 35. Smith HO, Leslie KK, Singh M, Qualls CR, 56. Staubert C, Tarnow P, Brumm H, Pitra C,
Association of genes to genetically inherited Revankar CM, et al. (2007) GPR30: a novel Gudermann T, et al. (2007) Evolutionary aspects
diseases using data mining. Nat Genet 31: 316 indicator of poor survival for endometrial in evaluating mutations in the melanocortin 4
319. carcinoma. Am J Obstet Gynecol 196: 386 receptor. Endocrinology 148: 46424648.
16. Perez-Iratxeta C, Bork P, Andrade-Navarro MA e381389; discussion 386 e389311. 57. Xiang Z, Litherland SA, Sorensen NB, Proneth
(2007) Update of the G2D tool for prioritization 36. Elizondo LI, Jafar-Nejad P, Clewing JM, B, Wood MS, et al. (2006) Pharmacological
of gene candidates to inherited diseases. Nucleic Boerkoel CF (2009) Gene clusters, molecular characterization of 40 human melanocortin-4
Acids Res 35: W212216. evolution and disease: a speculation. Curr receptor polymorphisms with the endogenous
17. Perez-Iratxeta C, Bork P, Andrade MA (2010) Genomics 10: 6475. proopiomelanocortin-derived agonists and the
G2D: Candidate Genes to Inherited Diseases. 37. Spellman PT, Rubin GM (2002) Evidence for agouti-related protein (AGRP) antagonist. Bio-
18. Tranchevent LC, Barriot R, Yu S, Van Vooren large domains of similarly expressed genes in the chemistry 45: 72777288.
S, Van Loo P, et al. (2008) ENDEAVOUR Drosophila genome. J Biol 1: 5. 58. Hinney A, Hohmann S, Geller F, Vogel C, Hess
update: a web resource for gene prioritization in 38. Yu CL, Louie TM, Summers R, Kale Y, C, et al. (2003) Melanocortin-4 receptor gene:
multiple species. Nucleic Acids Res 36: W377 Gopishetty S, et al. (2009) Two distinct path- case-control study and transmission disequilibri-
384. ways for metabolism of theophylline and caffeine um test confirm that functionally relevant
19. Tranchevent LC, Moreau Y (2009) ENDEAV- are coexpressed in Pseudomonas putida CBB5. mutations are compatible with a major gene
OUR. J Bacteriol 191: 46244632. effect for extreme obesity. J Clin Endocrinol
20. Kohler S, Bauer S, Horn D, Robinson PN 39. Singer GA, Lloyd AT, Huminiecki LB, Wolfe Metab 88: 42584267.
(2008) Walking the interactome for prioritization KH (2005) Clusters of co-expressed genes in 59. Washington NL, Haendel MA, Mungall CJ,
of candidate disease genes. Am J Hum Genet 82: mammalian genomes are conserved by natural Ashburner M, Westerfield M, et al. (2009)
949958. selection. Mol Biol Evol 22: 767775. Linking human diseases to animal models using
21. Kohler S (2008) GeneWanderer. 40. Hurst LD, Williams EJ, Pal C (2002) Natural ontology-based phenotype annotation. PLoS
22. Sun J, Zhao Z (2010) A comparative study of selection promotes the conservation of linkage of Biol 7: e1000247. doi:10.1371/journal.-
cancer proteins in the human protein-protein co-expressed genes. Trends Genet 18: 604606. pbio.1000247
interaction network. BMC Genomics 11 Suppl 41. Dawkins R (1976) The Selfish Gene. New York 60. Mootha VK, Lepage P, Miller K, Bunkenborg J,
3: S5. City: Oxford University Press. Reich M, et al. (2003) Identification of a gene

PLOS Computational Biology | www.ploscompbiol.org 14 January 2013 | Volume 9 | Issue 4 | e1002902


causing human cytochrome c oxidase deficiency 79. Ritz A, Bashir A, Raphael BJ (2010) Structural 99. Bromberg Y, Rost B (2007) SNAP: predict effect
by integrative genomics. Proc Natl Acad variation analysis with strobe reads. Bioinfor- of non-synonymous polymorphisms on function.
Sci U S A 100: 605610. matics 26: 12911298. Nucleic Acids Res 35: 38233835.
61. Ala U, Piro RM, Grassi E, Damasco C, Silengo 80. Botstein D, Risch N (2003) Discovering geno- 100. Bromberg Y, Yachdav G, Rost B (2008) SNAP
L, et al. (2008) Prediction of human disease types underlying human phenotypes: past suc- predicts effect of mutations on protein function.
genes by human-mouse conserved coexpression cesses for mendelian disease, future approaches Bioinformatics 24: 23972398.
analysis. PLoS Comput Biol 4: e1000043. for complex disease. Nat Genet 33 Suppl: 228 101. Kumar P, Henikoff S, Ng PC (2009) Predicting
doi:10.1371/journal.pcbi.1000043 237. the effects of coding non-synonymous variants
62. Michalak P (2008) Coexpression, coregulation, 81. Chakravarti A (2001) To a future of genetic on protein function using the SIFT algorithm.
and cofunctionality of neighboring genes in medicine. Nature 409: 822823. Nat Protoc 4: 10731081.
eukaryotic genomes. Genomics 91: 243248. 82. Sherry ST, Ward M, Sirotkin K (1999) dbSNP- 102. Ng PC, Henikoff S (2003) SIFT: Predicting
63. Fukuoka Y, Inaoka H, Kohane IS (2004) Inter- database for single nucleotide polymorphisms amino acid changes that affect protein function.
species differences of co-expression of neighbor- and other classes of minor genetic variation. Nucleic Acids Res 31: 38123814.
ing genes in eukaryotic genomes. BMC Geno- Genome Res 9: 677679. 103. Ramensky V, Bork P, Sunyaev S (2002) Human
mics 5: 4. 83. Rosenfeld JA, Malhotra AK, Lencz T (2010) non-synonymous SNPs: server and survey.
64. McKusick-Nathans Institute of Genetic Medi- Novel multi-nucleotide polymorphisms in the Nucleic Acids Res 30: 38943900.
cine (JHUB, MD) and National Center for human genome characterized by whole genome 104. Adzhubei IA, Schmidt S, Peshkin L, Ramensky
Biotechnology Information, National Library of and exome sequencing. Nucleic Acids Res 38: VE, Gerasimova A, et al. (2010) A method and
Medicine (Bethesda, MD) (2010) Online Men- 61026111. server for predicting damaging missense muta-
delian Inheritance in Man, OMIM (TM). 84. Zhao T, Chang LW, McLeod HL, Stormo GD tions. Nat Methods 7: 248249.
65. Stranger BE, Nica AC, Forrest MS, Dimas A, (2004) PromoLign: a database for upstream 105. Capriotti E, Calabrese R, Casadio R (2006)
Bird CP, et al. (2007) Population genomics of region analysis and SNPs. Hum Mutat 23: Predicting the insurgence of human genetic
human gene expression. Nat Genet 39: 1217 534539. diseases associated to single point protein
1224. 85. Conde L, Vaquerizas JM, Santoyo J, Al- mutations with support vector machines and
66. Jiang B-B, Wang J-G, Wang Y, Xiao J-F (2009) Shahrour F, Ruiz-Llorente S, et al. (2004) evolutionary information. Bioinformatics 22:
Gene Prioritization for Type 2 Diabetes in PupaSNP Finder: a web tool for finding SNPs 27292734.
Tissue-specific Protein Interaction Networks. with putative effect at transcriptional level. 106. Yue P, Melamud E, Moult J (2006) SNPs3D:
Systems Biology 10801131: 319328. Nucleic Acids Res 32: W242248. candidate gene and SNP selection for association
67. Koch MC, Steinmeyer K, Lorenz C, Ricker K, 86. Andersen MC, Engstrom PG, Lithwick S, studies. BMC Bioinformatics 7: 166.
Wolf F, et al. (1992) The skeletal muscle chloride Arenillas D, Eriksson P, et al. (2008) In silico 107. Chelala C, Khan A, Lemoine NR (2009)
channel in dominant and recessive human detection of sequence variations modifying SNPnexus: a web database for functional
myotonia. Science 257: 797800. transcriptional regulation. PLoS Comput Biol annotation of newly discovered and public
4: e5. doi:10.1371/journal.pcbi.0040005 domain single nucleotide polymorphisms. Bioin-
68. Greer WL, Riddell DC, Gillan TL, Girouard
87. Riva A, Kohane IS (2002) SNPper: retrieval and formatics 25: 655661.
GS, Sparrow SM, et al. (1998) The Nova Scotia
(type D) form of Niemann-Pick disease is caused analysis of human SNPs. Bioinformatics 18: 108. McLaren W, Pritchard B, Rios D, Chen Y,
16811685. Flicek P, et al. (2010) Deriving the consequences
by a G3097RT transversion in NPC1. Amer-
of genomic variants with the Ensembl API and
ican journal of human genetics 63: 5254. 88. Xu H, Gregory SG, Hauser ER, Stenger JE,
SNP Effect Predictor. Bioinformatics 26: 2069
69. Liou B, Kazimierczuk A, Zhang M, Scott CR, Pericak-Vance MA, et al. (2005) SNPselector: a
2070.
Hegde RS, et al. (2006) Analyses of variant acid web tool for selecting SNPs for genetic associ-
109. Becker KG, Barnes KC, Bright TJ, Wang SA
beta-glucosidases: effects of Gaucher disease ation studies. Bioinformatics 21: 41814186.
(2004) The genetic association database. Nat
mutations. The Journal of biological chemistry 89. Yuan HY, Chiou JJ, Tseng WH, Liu CH, Liu
Genet 36: 431432.
281: 42424253. CK, et al. (2006) FASTSNP: an always up-to-
110. Forbes SA, Bindal N, Bamford S, Cole C, Kok
70. Shieh JJ, Wang LY, Lin CY (1994) Point date and extendable service for SNP function
CY, et al. (2011) COSMIC: mining complete
mutation in Pompe disease in Chinese. Journal analysis and prioritization. Nucleic Acids Res
cancer genomes in the Catalogue of Somatic
of inherited metabolic disease 17: 145148. 34: W635641.
Mutations in Cancer. Nucleic Acids Res 39:
71. Lau MM, Neufeld EF (1989) A frameshift 90. Chen R, Davydov EV, Sirota M, Butte AJ D945950.
mutation in a patient with Tay-Sachs disease (2010) Non-synonymous and synonymous cod- 111. Kawabata T, Ota M, Nishikawa K (1999) The
causes premature termination and defective ing SNPs show similar likelihood and effect size Protein Mutant Database. Nucleic Acids Res 27:
intracellular transport of the alpha-subunit of of human disease association. PLoS ONE 5: 355357.
beta-hexosaminidase. J Biol Chem 264: 21376 e13574. doi:10.1371/journal.pone.0013574 112. Mitchell JA, Aronson AR, Mork JG, Folk LC,
21380. 91. Parmley JL, Hurst LD (2007) How do synony- Humphrey SM, et al. (2003) Gene indexing:
72. Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, mous mutations affect fitness? Bioessays 29: characterization and analysis of NLMs GeneR-
Donahoe PK, et al. (2004) Detection of large- 515519. IFs. AMIA Annu Symp Proc: 460464.
scale variation in the human genome. Nat Genet 92. Ng PC, Henikoff S (2006) Predicting the effects 113. Hirschman L, Yeh A, Blaschke C, Valencia A
36: 949951. of amino acid substitutions on protein function. (2005) Overview of BioCreAtIvE: critical assess-
73. Chen H (2007) Cri du chat syndrome. Medscape Annu Rev Genomics Hum Genet 7: 6180. ment of information extraction for biology.
Reference. Available: http://emedicine. 93. Amberger J, Bocchini CA, Scott AF, Hamosh A BMC Bioinformatics 6 Suppl 1: S1.
medscape.com/article/942897-overview. Ac- (2009) McKusicks Online Mendelian Inheri- 114. Altman RB, Bergman CM, Blake J, Blaschke C,
cessed 16 January 2013. tance in Man (OMIM). Nucleic Acids Res 37: Cohen A, et al. (2008) Text mining for biology
74. Mefford HC, Muhle H, Ostertag P, von Spiczak D793796. the way forward: opinions from leading scien-
S, Buysse K, et al. (2010) Genome-wide copy 94. Schaefer C, Meier A, Rost B, Bromberg Y tists. Genome Biol 9 Suppl 2: S7.
number variation in epilepsy: novel susceptibility (2012) SNPdbe: constructing an nsSNP func- 115. Blaschke C, Andrade MA, Ouzounis C, Valen-
loci in idiopathic generalized and focal epilep- tional impacts database. Bioinformatics 28: 601 cia A (1999) Automatic extraction of biological
sies. PLoS Genet 6: e1000962. doi:10.1371/ 602. information from scientific text: protein-protein
journal.pgen.1000962 95. Reumers J, Schymkowitz J, Ferkinghoff-Borg J, interactions. Proc Int Conf Intell Syst Mol Biol:
75. Kalscheuer VM, FitzPatrick D, Tommerup N, Stricher F, Serrano L, et al. (2005) SNPeffect: a 6067.
Bugge M, Niebuhr E, et al. (2007) Mutations in database mapping molecular phenotypic effects 116. Laurila JB, Naderi N, Witte R, Riazanov A,
autism susceptibility candidate 2 (AUTS2) in of human non-synonymous coding SNPs. Nu- Kouznetsov A, et al. (2010) Algorithms and
patients with mental retardation. Hum Genet cleic Acids Res 33: D527532. semantic infrastructure for mutation impact
121: 501509. 96. Jegga AG, Gowrisankar S, Chen J, Aronow BJ extraction and grounding. BMC Genomics 11
76. Alarcon M, Abrahams BS, Stone JL, Duvall JA, (2007) PolyDoms: a whole genome database for Suppl 4: S24.
Perederiy JV, et al. (2008) Linkage, association, the identification of non-synonymous coding 117. Caporaso JG, Baumgartner WA, Jr., Randolph
and gene-expression analyses identify SNPs with the potential to impact disease. DA, Cohen KB, Hunter L (2007) MutationFin-
CNTNAP2 as an autism-susceptibility gene. Nucleic Acids Res 35: D700706. der: a high-performance system for extracting
Am J Hum Genet 82: 150159. 97. Hijikata A, Raju R, Keerthikumar S, Ramaba- point mutation mentions from text. Bioinfor-
77. Wittig M, Helbig I, Schreiber S, Franke A (2010) dran S, Balakrishnan L, et al. (2010) Muta- matics 23: 18621865.
CNVineta: a data mining tool for large case- tion@A Glance: an integrative web application 118. Mika S, Rost B (2004) NLProt: extracting
control copy number variation datasets. Bioin- for analysing mutations from human genetic protein names and sequences from papers.
formatics 26: 22082209. diseases. DNA Res 17: 197208. Nucleic Acids Res 32: W634637.
78. Sindi S, Helman E, Bashir A, Raphael BJ (2009) 98. Peterson TA, Adadey A, Santana-Cruz I, Sun Y, 119. Hoffmann R, Valencia A (2004) A gene network
A geometric approach for classification and Winder A, et al. (2010) DMDM: domain for navigating the literature. Nat Genet 36: 664.
comparison of structural variants. Bioinformatics mapping of disease mutations. Bioinformatics 120. Thornblad TA, Elliott KS, Jowett J, Visscher
25: i222230. 26: 24582459. PM (2007) Prioritization of positional candidate

PLOS Computational Biology | www.ploscompbiol.org 15 April 2013 | Volume 9 | Issue 4 | e1002902


genes using multiple web-based software tools. tizing candidate genes for complex human traits. 162. DEustachio P (2011) Reactome knowledgebase
Twin Res Hum Genet 10: 861870. Genet Epidemiol 32: 779790. of human biological pathways and processes.
121. Seelow D, Schwarz JM, Schuelke M (2008) 142. George RA, Liu JY, Feng LL, Bryson-Richard- Methods Mol Biol 694: 4961.
GeneDistillerdistilling candidate genes from son RJ, Fatkin D, et al. (2006) Analysis of protein 163. Matthews L, Gopinath G, Gillespie M, Caudy
linkage intervals. PLoS ONE 3: e3874. sequence and interaction data for candidate M, Croft D, et al. (2009) Reactome knowledge-
doi:10.1371/journal.pone.0003874 disease gene prediction. Nucleic Acids Res 34: base of human biological pathways and process-
122. Aerts S, Lambrechts D, Maity S, Van Loo P, e130. es. Nucleic Acids Res 37: D619622.
Coessens B, et al. (2006) Gene prioritization 143. Xiong Q, Qiu Y, Gu W (2008) PGMapper: a 164. Xenarios I, Salwinski L, Duan XJ, Higney P,
through genomic data fusion. Nat Biotechnol web-based tool linking phenotype to genes. Kim SM, et al. (2002) DIP, the Database of
24: 537544. Bioinformatics 24: 10111013. Interacting Proteins: a research tool for studying
123. Cheng D, Knox C, Young N, Stothard P, 144. Vanunu O, Magger O, Ruppin E, Shlomi T, cellular networks of protein interactions. Nucleic
Damaraju S, et al. (2008) PolySearch: a web- Sharan R (2010) Associating genes and protein Acids Res 30: 303305.
based text mining system for extracting relation- complexes with disease via network propagation. 165. Stark C, Breitkreutz BJ, Chatr-Aryamontri A,
ships between human diseases, genes, mutations, PLoS Comput Biol 6: e1000641. doi:10.1371/ Boucher L, Oughtred R, et al. (2011) The
drugs and metabolites. Nucleic Acids Res 36: journal.pcbi.1000641 BioGRID Interaction Database: 2011 update.
W399405. 145. Franke L, van Bakel H, Fokkens L, de Jong ED, Nucleic Acids Res 39: D698704.
124. Tranchevent LC, Capdevila FB, Nitsch D, De Egmont-Petersen M, et al. (2006) Reconstruc- 166. Barrett T, Troup DB, Wilhite SE, Ledoux P,
Moor B, De Causmaecker P, et al. (2011) A tion of a functional human gene network, with Evangelista C, et al. (2011) NCBI GEO: archive
guide to web tools to prioritize candidate genes. for functional genomics data sets10 years on.
an application for prioritizing positional candi-
Briefings in bioinformatics 12: 2232. Nucleic Acids Res 39: D10051010.
date genes. Am J Hum Genet 78: 10111025.
125. Yu W, Wulf A, Liu T, Khoury MJ, Gwinn M 167. Edgar R, Domrachev M, Lash AE (2002) Gene
146. Adie EA, Adams RR, Evans KL, Porteous DJ,
(2008) Gene Prospector: an evidence gateway Expression Omnibus: NCBI gene expression
Pickard BS (2006) SUSPECTS: enabling fast
for evaluating potential susceptibility genes and and hybridization array data repository. Nucleic
and effective prioritization of positional candi-
interacting risk factors for human diseases. BMC Acids Res 30: 207210.
Bioinformatics 9: 528. dates. Bioinformatics 22: 773774.
147. Murzin AG, Brenner SE, Hubbard T, Chothia 168. Parkinson H, Sarkans U, Kolesnikov N, Abey-
126. Chen J, Bardes EE, Aronow BJ, Jegga AG gunawardena N, Burdett T, et al. (2011)
(2009) ToppGene Suite for gene list enrichment C (1995) SCOP: a structural classification of
proteins database for the investigation of se- ArrayExpress updatean archive of microarray
analysis and candidate gene prioritization. and high-throughput sequencing-based func-
Nucleic Acids Res 37: W305311. quences and structures. J Mol Biol 247: 536
540. tional genomics experiments. Nucleic Acids
127. Chen J, Xu H, Aronow BJ, Jegga AG (2007) Res 39: D10021004.
Improved human disease candidate gene prior- 148. Finn RD, Mistry J, Tate J, Coggill P, Heger A,
et al. (2010) The Pfam protein families database. 169. Iacucci E, Tranchevent LC, Popovic D, Pavlo-
itization using mouse phenotype. BMC Bioin- poulos GA, De Moor B, et al. (2012) ReLiance:
formatics 8: 392. Nucleic Acids Res 38: D211222.
149. Bateman A, Birney E, Durbin R, Eddy SR, a machine learning and literature-based priori-
128. Nilsson N (1997) Artificial Intelligence: A New tization of receptorligand pairings. Bioinfor-
Synthesis. San Francisco: Morgan Kaufmann Howe KL, et al. (2000) The Pfam protein
matics 28: i569i574.
Publishers. 513 p. families database. Nucleic Acids Res 28: 263
266. 170. Navlakha S, Kingsford C (2010) The power of
129. Bouckaert R, Frank E, Hall M, Holmes G,
protein interaction networks for associating
Pfahringer B, et al. (2010) WEKA-experiences 150. Sigrist CJ, Cerutti L, de Castro E, Langendijk-
genes with diseases. Bioinformatics 26: 1057
with a java open-source project. . Journal of Genevaux PS, Bulliard V, et al. (2010) PRO-
1063.
Machine Learning Research 11: 25332541. SITE, a protein domain database for functional
171. Ostlund G, Lindskog M, Sonnhammer EL
130. Frank E, Hall M, Trigg L, Holmes G, Witten IH characterization and annotation. Nucleic Acids
(2010) Network-based Identification of novel
(2004) Data mining in bioinformatics using Res 38: D161166.
cancer genes. Mol Cell Proteomics 9: 648655.
Weka. Bioinformatics 20: 24792481. 151. Maglott D, Ostell J, Pruitt KD, Tatusova T
172. OBrien KP, Westerlund I, Sonnhammer EL
131. Gewehr JE, Szugat M, Zimmer R (2007) (2011) Entrez Gene: gene-centered information
(2004) OrthoDisease: a database of human
BioWekaextending the Weka framework for at NCBI. Nucleic Acids Res 39: D5257.
disease orthologs. Hum Mutat 24: 112119.
bioinformatics. Bioinformatics 23: 651653. 152. Flicek P, Amode MR, Barrell D, Beal K, Brent
132. Steeb W-H (2008) The nonlinear workbook: 173. Li L, Stoeckert CJ, Jr., Roos DS (2003)
S, et al. (2011) Ensembl 2011. Nucleic Acids Res
chaos, fractals, cellular automata, neural net- OrthoMCL: identification of ortholog groups
39: D800806.
works, genetic algorithms, gene expression for eukaryotic genomes. Genome Res 13: 2178
153. Hunter S, Apweiler R, Attwood TK, Bairoch A, 2189.
programming, support vector machine, wave- Bateman A, et al. (2009) InterPro: the integra-
lets, hidden Markov models, fuzzy logic with 174. Bult CJ, Eppig JT, Kadin JA, Richardson JE,
tive protein signature database. Nucleic Acids Blake JA (2008) The Mouse Genome Database
C++, Java and symbolic C++ programs. 4th Res 37: D211215.
edition. Singapore: World Scientific Publishing. (MGD): mouse biology and model systems.
154. Rastogi S, Rost B (2011) LocDB: experimental Nucleic Acids Res 36: D724728.
628 p. annotations of localization for Homo sapiens
133. Ben-Gal I (2007) Bayesian networks. In: Ruggeri 175. Schofield PN, Gruenberger M, Sundberg JP
and Arabidopsis thaliana. Nucleic Acids Res 39: (2010) Pathbase and the MPATH ontology:
F, Kennett R, Faltin F, editors. Encyclopedia of D230234.
statistics in quality and reliability. Chichester, community resources for mouse histopathology.
155. Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet Vet Pathol 47: 10161020.
England: John Wiley and Sons. D (1997) GeneCards: integrating information
134. Habra A (2005) neural networks - an introduc- 176. Osborne JD, Lin S, Zhu L, Kibbe WA (2007)
about genes, proteins and diseases. Trends Mining biomedical data using MetaMap Trans-
tion. Available: http://www.tek271.com/ Genet 13: 163.
documents/others/into-to-neural-networks. Ac- fer (MMtx) and the Unified Medical Language
156. Rost B, Yachdav G, Liu J (2004) The Pre- System (UMLS). Methods Mol Biol 408: 153
cessed 16 January 2013.
dictProtein server. Nucleic Acids Res 32: W321 169.
135. Sarasin A (2003) An overview of the mechanisms
326. 177. Smith CL, Eppig JT (2009) The mammalian
of mutagenesis and carcinogenesis. Mutat Res
157. Schlicker A, Lengauer T, Albrecht M (2010) phenotype ontology: enabling robust annotation
544: 99106.
136. Parsonnet J (1999) Microbes and malignancy : Improving disease gene prioritization using the and comparative analysis. Wiley Interdiscip Rev
infection as a cause of human cancers. New semantic similarity of Gene Ontology terms. Syst Biol Med 1: 390399.
York: Oxford University Press. xii, 465 p. Bioinformatics 26: i561567. 178. Smith CL, Goldsmith CA, Eppig JT (2005) The
137. Hitchins MP (2010) Inheritance of epigenetic 158. van Driel MA, Bruggeman J, Vriend G, Brunner Mammalian Phenotype Ontology as a tool for
aberrations (constitutional epimutations) in can- HG, Leunissen JA (2006) A text-mining analysis annotating, analyzing and comparing phenotyp-
cer susceptibility. Adv Genet 70: 201243. of the human phenome. Eur J Hum Genet 14: ic information. Genome Biol 6: R7.
138. Williams D (2008) Radiation carcinogenesis: 535542. 179. Robinson PN, Kohler S, Bauer S, Seelow D,
lessons from Chernobyl. Oncogene 27 Suppl 2: 159. Radivojac P, Peng K, Clark WT, Peters BJ, Horn D, et al. (2008) The Human Phenotype
S918. Mohan A, et al. (2008) An integrated approach Ontology: a tool for annotating and analyzing
139. Hall M, Frank E, Holmes G, Pfahringer B, to inferring gene-disease associations in humans. human hereditary disease. Am J Hum Genet 83:
Reutemann P, et al. (2009) The WEKA Data Proteins 72: 10301037. 610615.
Mining Software: an update. SIGKDD Explo- 160. Kanehisa M, Goto S (2000) KEGG: kyoto 180. Kelso J, Visagie J, Theiler G, Christoffels A,
rations 11: 1018. encyclopedia of genes and genomes. Nucleic Bardien S, et al. (2003) eVOC: a controlled
140. Gaulton KJ, Mohlke KL, Vision TJ (2007) A Acids Res 28: 2730. vocabulary for unifying gene expression data.
computational system to select candidate genes 161. Kanehisa M, Goto S, Furumichi M, Tanabe M, Genome Res 13: 12221230.
for complex human traits. Bioinformatics 23: Hirakawa M (2010) KEGG for representation 181. Stenson PD, Ball EV, Mort M, Phillips AD,
11321140. and analysis of molecular networks involving Shiel JA, et al. (2003) Human Gene Mutation
141. Hutz JE, Kraja AT, McLeod HL, Province MA diseases and drugs. Nucleic Acids Res 38: D355 Database (HGMD): 2003 update. Hum Mutat
(2008) CANDID: a flexible method for priori- 360. 21: 577581.



Education

Chapter 16: Text Mining for Translational Bioinformatics


K. Bretonnel Cohen*, Lawrence E. Hunter
Computational Bioscience Program, University of Colorado School of Medicine, Aurora, Colorado, United States of America

Abstract: Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research (translating basic science results into new interventions) and T2 translational research, or translational research for public health. Potential use cases include better phenotyping of research subjects, and pharmacogenomic research. A variety of methods for evaluating text mining applications exist, including corpora, structured test suites, and post hoc judging. Two basic principles of linguistic structure are relevant for building text mining applications. One is that linguistic structure consists of multiple levels. The other is that every level of linguistic structure is characterized by ambiguity. There are two basic approaches to text mining: rule-based, also known as knowledge-based; and machine-learning-based, also known as statistical. Many systems are hybrids of the two approaches. Shared tasks have had a strong effect on the direction of the field. Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing.

Citation: Cohen KB, Hunter LE (2013) Chapter 16: Text Mining for Translational Bioinformatics. PLoS Comput Biol 9(4): e1003044. doi:10.1371/journal.pcbi.1003044
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published April 25, 2013
Copyright: © 2013 Cohen, Hunter. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was funded in part by grants NIH 5 R01 LM009254-06, NIH 5 R01 LM008111-07, NIH 5 R01 GM083649-04, and NIH 5 R01 LM009254-03 to Lawrence E. Hunter. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: kevin.cohen@gmail.com

This article is part of the "Translational Bioinformatics" collection for PLOS Computational Biology.

1. Introduction

Text mining for translational bioinformatics is a new field with enormous research potential. It is a subfield of biomedical natural language processing (BioNLP) that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa.

1.1 Use Cases

The foundational question in text mining for translational bioinformatics is what the use cases are. It is not immediately obvious how the questions that text mining for translational bioinformatics should try to answer are different from the questions that are approached in BioNLP in general. The answer lies at least in part in the nature of the specific kinds of information that text mining should try to gather, and in the uses to which that information is intended to be put. However, these probably only scratch the surface of the domain of text mining for translational bioinformatics, and the latter has yet to be clearly defined.

One step in the direction of a definition for use cases for text mining for translational bioinformatics is to determine classes of information found in clinical text that would be useful for basic biological scientists, and classes of information found in the basic science literature that would be of use to clinicians. This in itself would be a step away from the usual task definitions of BioNLP, which tend to focus either on finding biological information for biologists, or on finding clinical information for clinicians. However, it is likely that there is no single set of data that would fit the needs of biological scientists on the one hand or clinicians on the other, and that information needs will have to be defined on a bespoke basis for any given translational bioinformatics task.

One potential application is better phenotyping. Experimental experience indicates that strict phenotyping of patients improves the ability to find disease genes. When phenotyping is too broad, the genetic association may be obscured by variability in the patient population. An example of the advantage of strict phenotyping comes from the work of [1,2]. They worked with patients with diagnoses of pulmonary fibrosis. However, having a diagnosis of pulmonary fibrosis in the medical record was not, in itself, a strict enough definition of the phenotype for their work [1]. They defined strict criteria for study inclusion and ensured that patients met the criteria through a number of methods, including manual review of the medical record. With their sharpened definition of the phenotype, they were able to identify 102 genes that were up-regulated and 89 genes that were down-regulated in the study group. This included Plunc (palate, lung and nasal epithelium associated), a gene not previously associated with pulmonary fibrosis. Automation of the step of manually reviewing medical records would potentially allow for the inclusion or exclusion of much larger populations of patients in similar studies.

Another use for text mining in translational bioinformatics is aiding in the preparation of Cochrane reviews and other meta-analyses of experimental studies. Again, text mining could be used to identify cohorts that should be included in the meta-analysis, as well as to determine P-values and other indicators of significance levels.
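As a rough illustration of the meta-analysis use case, reported P-values can sometimes be pulled out of a sentence with a regular expression. The sketch below is only that: the pattern, the function name `find_p_values`, and the example sentence are assumptions made for illustration, and published articles report significance in many more ways than this pattern covers.

```python
import re

# Hypothetical pattern: the letter "p", an optional "-value" or " value",
# a comparison operator, then a number in decimal or scientific notation.
P_VALUE = re.compile(
    r"\bp(?:[- ]?values?)?\s*(<=|>=|<|>|=)\s*(\d*\.?\d+(?:e[-+]?\d+)?)",
    re.IGNORECASE,
)

def find_p_values(text):
    """Return (comparator, value) pairs for each P-value mention found in text."""
    return [(m.group(1), float(m.group(2))) for m in P_VALUE.finditer(text)]

# Invented example sentence, not taken from any study.
sentence = "Expression differed between groups (P < 0.05); the trend was weaker (p = 0.2)."
print(find_p_values(sentence))
```

A pattern like this finds the number but not what it refers to; linking each value to the cohort and comparison it describes is the harder text mining problem.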

PLOS Computational Biology | www.ploscompbiol.org 1 April 2013 | Volume 9 | Issue 4 | e1003044


What to Learn in This Chapter

Text mining is an established field, but its application to translational bioinformatics is quite new and it presents myriad research opportunities. It is made difficult by the fact that natural (human) language, unlike computer language, is characterized at all levels by rampant ambiguity and variability. Important sub-tasks include gene name recognition, or finding mentions of gene names in text; gene normalization, or mapping mentions of genes in text to standard database identifiers; phenotype recognition, or finding mentions of phenotypes in text; and phenotype normalization, or mapping mentions of phenotypes to concepts in ontologies. Text mining for translational bioinformatics can necessitate dealing with two widely varying genres of text: published journal articles, and prose fields in electronic medical records. Research into the latter has been impeded for years by lack of public availability of data sets, but this has very recently changed and the field is poised for rapid advances. Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing.

Most of the applications discussed here fall into the category of T1 translational research, that is, translating basic science results into new interventions (http://grants.nih.gov/grants/guide/notice-files/NOT-AG-08-003.html). There are also applications in translational research for public health, also known as T2 translational research (op. cit.). This is true both in the case of mining information for public health experts and for the general public. For public health experts, there is a growing body of work on various factors affecting disease monitoring in electronic medical records, such as work by Chapman and colleagues on biosurveillance and disease and syndrome outbreak detection (e.g., [3,4], among others). For the general public, simplifying technical texts can be helpful. [5] describes work in this area.

1.1.1 The pharmacogenomics perspective. One area of research that has made some steps towards defining a use case for text mining is pharmacogenomics. An example of this is the PharmGKB database. Essential elements of their definition of pharmacogenomics text mining include finding relationships between genotypes, phenotypes, and drugs. As in the case of other applications that we will examine in this chapter, mining this information requires as a first step the ability to find mentions of the semantic types of interest when they are mentioned in text. These will be of increasing utility if they can be mapped to concepts in a controlled vocabulary. Each semantic type presents unique challenges. For example, finding information about genotypes requires finding mentions of genes (see Section 4.3 below), finding mentions of mutations and alleles, and mapping these to each other; finding mentions of drugs, which is more difficult than it is often assumed to be [6]; and finding mentions of phenotypes. The latter is especially difficult, since so many things can fit within the definition of "phenotype." A phenotype is the entirety of observable characteristics of an organism [7]. The wide range and rapidly changing technologies for measuring observable features of patient phenotypes require the text mining user to be very specific about what observables they want to capture. For example, phenotypes can include any behavior, ranging from duration of mating dances in flies to alcohol-seeking in humans. They can also include any measurable physical characteristic, ranging from very macro characteristics such as hair color to very granular ones such as specific values for any of the myriad laboratory assays used in modern clinical medicine.

There is some evidence from the PharmGKB and the Comparative Toxicogenomics Database experiences that text mining can scale up processing in terms of the number of diseases studied and the number of gene-disease, drug-disease, and drug-gene associations discovered [8]. Furthermore, experiments with the PharmGKB database suggest that pharmacogenomics is currently more powerful than genomics for finding such associations and has reached the point of being ready for translation of research results to clinical practice [9].

1.1.2 The i2b2 perspective. Informatics for Integrating Biology and the Bedside (i2b2) is a National Center for Biomedical Computing devoted to translational bioinformatics. It has included text mining within its scope of research areas. Towards this end, it has sponsored a number of shared tasks (see Section 5 below) on the subject of text mining. These give us some insight into i2b2's definition of use cases for text mining for translational bioinformatics. i2b2's focus has been on extracting information from free text in clinical records. Towards this end, i2b2 has sponsored shared tasks on deidentification of clinical documents, determining smoking status, detecting obesity and its comorbidities, medical problems, treatments, and tests. Note that there are no genomic components to this data.

1.2 Text Mining, Natural Language Processing, and Computational Linguistics

Text mining, natural language processing, and computational linguistics are often used more or less interchangeably, and indeed one can find papers on text mining and natural language processing at the annual meeting of the Association for Computational Linguistics, and papers from any of these categories at meetings for any of the other categories. However, technically speaking, some differences exist between them. Computational linguistics strictly defined deals with building computationally testable models of human linguistic behavior. Natural language processing has to do with building a wide range of applications that take natural language as their input. Text mining is more narrow than natural language processing, and deals with the construction of applications that provide a solution to a specific information need. For example, a syntactic analyzer would be an example of a natural language processing application; a text mining application might use that syntactic analyzer as part of the process for filling the very specific information need of finding information about protein-protein interactions. This chapter will include information about both natural language processing and text mining [10-12].

1.3 Evaluation Techniques and Evaluation Metrics in Text Mining

A variety of methods for evaluating text mining applications exist. They typically apply the same small family of metrics as figures of merit.

1.3.1 Corpora. One paradigm of evaluation in text mining is based on the assumption that all evaluation should take place on naturally occurring texts. These texts are annotated with data or metadata about what constitutes the right answers for some task. For example, if the intended application to be tested is designed to locate mentions of gene names in free text, then the occurrence of every gene name in the text would be marked. The mark-up is known as annotation. (Note that this is a very different use of the word "annotation" from its use in the model organism database construction community.) The resulting set of



annotated documents is known as a corpus (plural corpora). Given a corpus, an application is judged by its ability to replicate the set of annotations in the corpus. Some types of corpora are best built by linguists, e.g., those involving syntactic analysis, but there is abundant evidence that biomedical scientists can build good corpora if they follow best practices in corpus design (see e.g., [13]).

1.3.2 Structured test suites. Structured test suites are built on the principles of software testing. They contain groups of inputs that are classified according to aspects of the input. For example, a test suite for applications that recognize gene names might contain sentences with gene names that end with numbers, that do not end with numbers, that consist of common English words, or that are identical to the names of diseases. Unlike a standard corpus, test suites may contain data that is manufactured for the purposes of the test suite. For example, a test suite for recognizing Gene Ontology terms [14] contains the term "cell migration," but also the manufactured variant "migration of cells." (Note that being manufactured does not imply being unrealistic.) Structured test suites have the major advantage of making it much more straightforward to evaluate both the strengths and the weaknesses of an application. For example, application of a structured test suite to an application for recognizing Gene Ontology terms made it clear that the application was incapable of recognizing terms that contain the word "in." This was immediately obvious because the test suite contained sets of terms that contain function words, including a set of terms that all contain the word "in." To duplicate this insight with a corpus would require assembling all errors, then hoping that the fact that no terms containing the word "in" were recognized jumped out at the analyst. In general, results on a structured test suite should not be expected to reflect performance as measured by the standard metrics on a corpus, since the distribution of types of inputs in the test suite does not reflect the distribution of those types of inputs in naturally occurring data. However, it has been shown that structured test suites can be used to predict values of metrics for specific equivalence classes of data (inputs that should all be expected to test the same condition and produce the same result) [15]. We return to the use of test suites in Section 6.

1.3.3 Post hoc judging. Sometimes preparation of corpora is impractical. For example, there may be too many inputs that need to be annotated. In these cases, post hoc judging is sometimes applied. That is, a program produces outputs, and then a human judges whether or not they are correct. This is especially commonly used when a large number of systems are being evaluated. In this case, the outputs of the systems can be pooled, and the most common outputs (i.e., the ones produced by the most systems) are selected for judging.

1.3.4 Metrics. A small family of related metrics is usually used to evaluate text mining systems. Accuracy, or the number of correct answers divided by the total number of answers, is rarely used.

Precision. Precision is defined as the number of correct system outputs ("true positives," or TP) divided by the total number of system outputs (the count of TP plus the "false positives" (FP), i.e., erroneous system outputs). It is often compared loosely to specificity, but is actually more analogous to positive predictive value.

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall. Recall is defined as the number of true positives divided by the total number of potential system outputs, i.e., true positives plus "false negatives" (FN), which are things that should have been output by the system, but were not. This will differ from task type to task type. For example, in information retrieval (Section 4.1), it is the number of relevant documents retrieved divided by the total number of relevant documents in the collection. In named entity recognition of genes (Section 4.3), it is defined as the number of correct gene names output by the system divided by the total number of gene names in the corpus.

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

Balanced F-measure. The balanced F-measure attempts to reduce precision and recall to a single measure. It is calculated as the (weighted) harmonic mean of precision and recall. It includes a parameter β that is usually set to one, giving precision and recall equal weight. Setting β greater than one weights recall more heavily (as β grows, F approaches recall). Setting β less than one weights precision more heavily.

$$F = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}$$

where P is precision and R is recall.

2. Linguistic Fundamentals

Building applications for text mining for translational bioinformatics is made easier by some understanding of the nature of linguistic structure. Two basic principles are relevant. One is that linguistic structure consists of multiple layers. The other is that every layer of linguistic structure is characterized by ambiguity.

All linguistic analyses in text mining are descriptive in nature. That is, they seek only to describe the nature of human linguistic productions, much as one might attempt to describe the multi-dimensional structure of a protein. Linguistic analyses are not prescriptive; that is, they do not attempt to prescribe or enforce standards for language use.

2.1 Layers of Linguistic Structure

The layers of linguistic structure vary somewhat between written and spoken language (although many are shared). We focus here on the layers that are relevant to written language, focusing particularly on scientific journal articles and on clinical documents.

2.1.1 Document structure. The first layer of the structure of written documents that is relevant to text mining for translational bioinformatics is the structure of individual documents. In the case of journal articles, this consists first of all of the division of the document into discrete sections, typically in what is known as the IMRD model: an abstract, introduction, methods section, results section, discussion, and bibliography. Acknowledgments may be present, as well.

The ability to segment a document into these sections is important because different sections often require different processing techniques and because different sections should be focused on for different types of information. For example, methods sections are frequent sources of false positives for various semantic classes, which led researchers to ignore them in much early research. However, they are also fruitful sections for finding information about experimental methods, and as it has become clear that mining information about experimental methods is important to biologists [16], it has become clear that methods must be developed for dealing with methods sections. Abstracts have been shown to have different structural and content characteristics from article bodies [17]; most research to date has focused on abstracts, and it is clear that new approaches will be required to fully exploit the information in article bodies.

Segmenting and labeling document sections can be simple when documents are provided in XML and a DTD is available. However, this is often not the



case; for instance, many documents are available for processing only in HTML format. In this situation, two tasks arise: finding the boundaries of the sections, and labelling the sections. The latter is made more complicated by the fact that a surprising range of phrases are used to label the different sections of a scientific document. For example, the methods section may be called Methods, Methods and Materials, Materials and Methods, Experimental Procedures, Patients and Methods, Study Design, etc. Similar issues exist for structured abstracts; in the case of unstructured abstracts, it has been demonstrated that they can be segmented into sections using a generative technique [18].

Clinical documents present a far more complex set of challenges than even scientific journal articles. For one thing, there is a much wider range of clinical document types: admission notes, discharge summaries, radiology reports, pathology reports, office visit notes, etc. Hospitals frequently differ from each other in the types of documents that they use, as do individual physicians' practices. Furthermore, even within a given hospital, different physicians may structure the same document type differently. For example, just in the case of emergency room visit reports, one of the authors built a classification system that determined, for a given document, what specialty it would belong to (e.g., cardiology or pediatrics) if it had been generated by a specialist. He found that not only did each hospital require a different classification system, but different doctors within the same emergency room required different classifiers. [19] describes an iterative procedure for building a segmenter for a range of clinical document types.

Once the document has been segmented into sections, paragraphs must be identified. Here the segmentation task is typically easy, but ordering may present a problem. For example, it may not be clear where figure and table captions should be placed.

2.1.2 Sentences. Once the document has been segmented into paragraphs, the paragraphs must be further segmented into sentences. Sentence segmentation is a surprisingly difficult task. Even for newswire text, it is difficult enough to constitute a substantial homework problem. For biomedical text, it is considerably more difficult. Two main difficulties arise. One is the fact that the function of periods is ambiguous; that is, a period may serve more than one function in a written text, such as marking the end of an abbreviation (Dr.), marking the individual letters of an abbreviation (p.r.n.), separating the integer and fractional parts of a number (3.14), and so on. A period may even serve two functions, as for example when "etc." is at the end of a sentence, in which case the period marks both the end of the abbreviation and the end of the sentence. Furthermore, some of the expected cues to sentence boundaries are absent in biomedical text. For example, in texts about molecular biology, it is possible for a sentence to begin with a lower-case letter when a mutant form of a gene is being mentioned.

Various approaches have been taken to the sentence segmentation task. The KeX/PROPER system [20] uses a rule-based approach. The LingPipe system provides a popular machine-learning-based approach through its LingPipe API. Its model is built on PubMed/MEDLINE documents and works well for journal articles, but it is not likely to work well for clinical text (although this has not been evaluated). In clinical documents, it is often difficult to define any notion of "sentence" at all.

2.1.3 Tokens. Written sentences are built up of tokens. Tokens include words, but also punctuation marks, in cases where those punctuation marks should be separated from words that they are attached to. The process of segmenting a sentence into tokens is known as tokenization. For example, consider the simple case of periods. When a period marks the end of a sentence, it should be separated from the word that it is attached to. The string "regulation." will not be found in any biomedical dictionary, but "regulation" will. However, in many other instances, such as when it is part of an abbreviation or a number, it should not be separated. The case of hyphens is even more difficult. Hyphens may have several functions in biomedical text. If they indicate the absence of a symptom (e.g., -fever), they should probably be separated, since they have their own meaning, indicating the absence of the symptom. On the other hand, they should remain in place when separating parts of a word, such as up-regulate.

The status of tokenization in building pipelines of text mining applications is complicated. It may be the case that a component early in the pipeline requires tokenized text, while a component later in the pipeline requires untokenized text. Also, many applications have a built-in tokenizer, and conflicts between different tokenization strategies may cause conflicts in later analytical strategies.

2.1.4 Stems and lemmata. For some applications, it is advantageous to reduce words to stems or lemmata. Stems are normalized forms of words that reduce all inflected forms to the same string. They are not necessarily actual words themselves; for example, the stem of "city" and "cities" is "citi," which is not a word in the English language. Their utility comes in applications that benefit from this kind of normalization without needing to know exactly which words are the roots, primarily machine-learning-based applications.

The term lemma (plural lemmata) is overloaded. It can mean the root word that represents a set of related words. For example, the lemma of the set {phosphorylate, phosphorylates, phosphorylated, phosphorylating} is phosphorylate. Note that in this case, we have an actual word. Lemma can also mean the set of words that can instantiate a particular root word form; on this meaning, the lemma of phosphorylate is {phosphorylate, phosphorylates, phosphorylated, phosphorylating}. Lemmata have a clear advantage over stems for some applications. However, while it is always possible to determine the stem of a word (typically using a rule-based approach, such as the Porter stemmer [21]), it is not always possible to determine the lemma of a word automatically. The BioLemmatizer [22] is a recently released tool that shows high performance on the lemmatization task.

2.1.5 Part of speech. It is often useful to know the part of speech, technically known as lexical category, of the tokens in a sentence. However, the notion of part of speech is very different in linguistic analysis than in the elementary school conception, and text mining systems typically make use of about eighty parts of speech, rather than the eight or so that are taught in school. We go from eight to eighty primarily by subdividing parts of speech further than the traditional categories, but also by adding new ones, such as parts of speech of sentence-medial and sentence-final punctuation. Parts of speech are typically assigned to tokens by applications called part of speech taggers. Part of speech tagging is made difficult by the fact that many words are ambiguous as to their part of speech. For example, in medical text, the word "cold" can be an adjective or it can be a reference to a medical condition. A word can have several parts of speech, e.g., "still." A variety of part of speech taggers that are specialized for biomedical text exist, including MedPOST [23], LingPipe, and the GENIA tagger [24].
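The period and hyphen cases discussed under tokenization (Section 2.1.3) can be sketched as a small rule-based tokenizer. This is a minimal illustration under stated assumptions, not any published tokenizer: the abbreviation list, the function name `tokenize`, and the choice to treat every leading hyphen as a marker of absence are all hypothetical and were chosen only to mirror the examples in the text.

```python
# Hypothetical, deliberately tiny abbreviation list; a real system needs far more.
ABBREVIATIONS = {"dr.", "etc.", "e.g.", "i.e.", "p.r.n."}

def tokenize(sentence):
    """Split a sentence into tokens using simple rules for periods and hyphens."""
    tokens = []
    for chunk in sentence.split():
        lead, trail = [], []
        # A leading hyphen marks the absence of a symptom (-fever),
        # so it becomes a token of its own.
        if chunk.startswith("-") and len(chunk) > 1:
            lead.append("-")
            chunk = chunk[1:]
        # Peel off trailing commas, semicolons, and colons.
        while len(chunk) > 1 and chunk[-1] in ",;:":
            trail.insert(0, chunk[-1])
            chunk = chunk[:-1]
        # A final period is split off unless the chunk is a known abbreviation.
        # Numbers such as 3.14 have no final period, and word-internal hyphens
        # such as up-regulate are never touched, so both stay whole.
        if len(chunk) > 1 and chunk.endswith(".") and chunk.lower() not in ABBREVIATIONS:
            trail.insert(0, ".")
            chunk = chunk[:-1]
        tokens.extend(lead + [chunk] + trail)
    return tokens

# Invented example sentence covering the cases named in the text.
print(tokenize("Dr. Lee noted -fever and up-regulation of 3.14 units during regulation."))
```

The sketch cannot represent a period that serves two functions at once: a sentence-final "etc." keeps its period as part of the abbreviation, and the sentence boundary goes unmarked. That limitation is one reason a fixed rule list tends to need revision for each new genre of text.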



2.1.6 Syntactic structure. The end, or that when a passive form of a verb processing steps, making them hybrid
syntactic structure of a sentence is the way is used, the subject can be omitted. systems.
in which the phrases of the sentence relate
to each other. For example, in the article 3. The Two Families of 4. Text Mining Tasks
title Visualization of bacterial glycocalyx with a Approaches: Rule-Based and
scanning electron microscope (PMID 9520897), In Section 2.1, we discussed elements of
Learning-Based linguistic analysis. These analytical tasks
the phrase with a scanning electron microscope is
associated with visualization, not with There are two basic approaches to text are carried out in support of some higher-
bacterial glycocalyx. Automatic syntactic mining: rule-based, also known as knowl- level text mining tasks. Many types of text
analysis is made difficult by the existence edge-based, and machine-learning-based, mining tasks exist. We will discuss only the
of massive ambiguity. For example, while also known as statistical. most common ones here, but a partial list
one possible interpretation of that title is Rule-based approaches to text mining are includes:
that the visualization is done with a based on the application of rules, typically
scanning electron microscope, another manually constructed, to linguistic inputs. N Information retrieval
possible interpretation is that the For example, a rule-based approach to N Document classification
bacterial glycocalyx has a scanning
electron microscope. (Consider the
syntactic analysis might postulate that N Named entity recognition

analogous famous example I saw the man


given a string like phosphorylation of MAPK
by MAPKK, the phrase that follows the
N Named entity normalization
with the binoculars, where one possible word by is the doer of the phosphorylation, N Relation or information extraction
interpretation is that I used the and the phrase that follows the word of is N Question-answering
bionoculars to visualize the man, whereas the undergoer of the phosphorylation. Or, N Summarization
another possible interpretation is that I a rule-based approach might specify that
saw a man and that man had some in the pattern A X noun the X is an
binoculars.) It is very easy for humans to adjective, while in the pattern The adjective 4.1 Information Retrieval
determine which interpretation of the X verb the X is a noun, allowing us to Information retrieval is the task of, given an
article title is correct. However, it is very differentiate between the word cold as an information need and a set of documents,
difficult for computers to make this adjective in the former case and as a finding the documents that are relevant to
determination. There are many varieties medical condition in the latter case. Rule- filling that information need. PubMed/
of syntactic ambiguity, and it is likely that based solutions can be constructed for all MEDLINE is an example of a biomedical
any nontrivial sentence contains at least levels of linguistic analysis. information retrieval system for scientific
one. Machine-learning-based approaches to text journal articles; Google is an information
Syntactic analysis is known as parsing. mining are based on an initial step of retrieval system for web pages. Early
The traditional approach to automated feeding the system a set of data that is information retrieval assumed that all
syntactic analysis attempts to discover the labelled with the correct answers, be they documents were classified with some code
phrasal structure of a sentence, as de- parts of speech for tokens or the locations and typically required the assistance of a
scribed above. A new approach called of gene names in text. The job of the librarian to determine the appropriate
dependency parsing focuses instead on rela- system is then to figure out cues that code of interest. Keyword-based retrieval,
tionships between individual words. It is indicate which of the ambiguous analyses in which the user enters a set of words that
thought to better reflect the semantics of a should be applied. For instance, a system a relevant text would be expected to
sentence, and is currently popular in for document classification may learn that contain and the content of the texts in
BioNLP. if a document contains the word murine, the set of documents are searched for those
Along with determining the phrasal or then it is likely to be of interest to words, was a revolution made possible by
dependency structure of a sentence, some researchers who are interested in mice. the introduction of computers and elec-
parsers also make limited attempts to label Many different algorithms for machine tronic forms of documents in the hospital
the syntactic functions, such as subject and learning exist, but the key to a successful or research environment. The naive ap-
object, of parts of a sentence. system is the set of features that are used to proach to keyword-based retrieval simply
perform the classification. For example, a checks for the presence or absence of the
2.2 The Nature of Linguistic Rules part of speech tagger may use the words in the query, known as boolean
When we think of linguistic rules, we apparent parts of speech of the two search. Modern approaches use relatively
are most likely to think of the rules that we preceding words as a feature for deciding simple mathematical techniques to deter-
learn in school that impose arbitrary the part of speech of a third word. mine (a) the relative importance of words
norms on language usage, such as Say It is often claimed that machine learn- in the query in deciding whether or not a
you and I, not you and me, or a preposition ing systems can be built more quickly than document is relevantthe assumption
is a bad thing with which to end a sentence. rule-based systems due to the time that it here is that not all words are equally
These are known as prescriptive rules. Text takes to build rules manually. However, importantand (b) how well a given word
mining never deals with prescriptive rules. building feature extractors is time-con- reflects the actual relevance of a given
Rather, it always deals with descriptive rules. suming, and building the labelled train- document to the query. For example, we
Descriptive rules describe the parts of the ing data with the right answers is much can determine, given a count of how often
language and the ways in which they can more so. There is no empirical support for the words hypoperfusion and kidney occur in
combine, without any implied judgement the claim that learning-based systems can the set of documents as a whole, that if we
as to whether they are good or bad. be built more quickly than rule-based are looking for documents about kidney
For example, a linguistic rule might specify systems. Furthermore, it is frequently the hypoperfusion, we should give more
that certain classes of verbs can be case that putative learning-based systems weight to the rarer of the two words;
converted to nouns by adding -tion to their actually apply rules in pre- or post- given a count of how often the words kidney

PLOS Computational Biology | www.ploscompbiol.org 5 April 2013 | Volume 9 | Issue 4 | e1003044


and "hypoperfusion" occur in two documents, we can determine which of the two documents is most relevant to the query.

4.2 Document Classification
Document classification is the task of classifying a document as a member of one or more categories. In a typical document classification workflow, one is supplied with a stream of documents, and each one requires classification. This differs from the information retrieval situation, in which information needs are typically ad hoc. For example, curators of a model organism database may require journal articles to be classified as to whether or not they are relevant for further examination. Other classification tasks motivated by curation have included classifying journal articles as to whether or not they are about embryogenesis. Document classification typically uses very simple feature sets, such as the presence or absence of the words from the training data. When this is the only feature, it is known as a "bag of words" representation. However, it has also been found useful to use more abstract, conceptual features. For example, [25] found the presence or absence of mentions of mouse strains to be a useful feature, regardless of the identity of the particular strain.

4.3 Named Entity Recognition
Named entity recognition is the task of finding mentions of specific semantic classes in a text. In general language processing, the most heavily studied semantic classes have been persons, places, and organizations; thus, the term "named entity". In genomic BioNLP, the most heavily studied semantic class has been gene and protein names. However, other semantic classes have been studied as well, including cell lines and cell types. In clinical NLP, the range of semantic classes is wider, encompassing a large number of types included in the Unified Medical Language System [26]. The UMLS includes a "Metathesaurus" which combines a large number of clinically and biologically relevant controlled vocabularies, comprising many semantic classes. In the clinical domain, there is an industry standard tool for named entity recognition, called MetaMap [27,28]. Biological named entity recognition remains a subject of current research. Machine learning methods predominate. Feature sets generally include typographical features of a token (e.g., having mixed-case letters or not, containing a hyphen or not, ending with a numeral or not, etc.) as well as features of the surrounding tokens.

Early results in named entity recognition were consistent with the hypothesis that this task could not be achieved by simply starting with a "dictionary" of gene names and looking for those gene names in text. At least three problems were immediately evident with this approach: the fact that new gene names are coined constantly, the fact that a number of gene names are homographs of common English words, and the fact that many genes have names or synonyms that are unhelpful, such as "putative oxidoreductase" (Entrez Gene ID 6393330). However, recent evidence has suggested that dictionary-based approaches can achieve moderate success if the dictionary and the data to be processed are subjected to extensive preprocessing [29] or post-hoc filtering, e.g., by the success or failure of a subsequent gene normalization step (see Section 4.4) [30].

4.4 Named Entity Normalization
Named entity normalization is the process of taking a mention of a named entity in free text and returning a specific database identifier that it refers to. In the biological domain, this has been studied most extensively in the case of genes and proteins, and the corresponding task is known as gene normalization. In the clinical domain, it has been approached simultaneously with named entity recognition, again using the MetaMap application (see Section 4.3). There are two major problems in gene normalization. The first is that many species have genes with the same name. For example, the BRCA1 gene is found in an enormous number of animals. Thus, finding the appropriate gene identifier requires knowing the species under discussion, which is a research problem in itself. The other problem is that a single species may have multiple genes with the same name. For example, humans have five genes named TRP-1. Gene normalization is often approached as a problem in word sense disambiguation, the task of deciding which dictionary entry a given text string refers to (e.g., the "cold" example referred to above). A popular approach to this utilizes knowledge about the gene and the context in which the gene is mentioned. For example, the SUMMARY fields of the candidate genes might be used as a source of words that indicate what we know about the gene. Then, if we see the words "cation" and "channel" in the text surrounding the gene name, we should expect that we have an instance of the TRP1 with Entrez Gene ID 7220, while if we see the word "proline", we should suspect that we have an instance of the TRP1 with Entrez Gene ID 189930. Approaches might vary with respect to what they use as the knowledge source (e.g., Entrez Gene SUMMARY fields, Entrez Gene PRODUCT fields, the contents of publications linked to the Entrez Gene entry), and what they consider the context of the gene mention, e.g., the sentence, the surrounding sentences, the entire abstract, etc.

4.5 Relation or Information Extraction
Information extraction, or more recently relation extraction, is the process of mining very specific types of facts from text. Information extraction systems are by definition restricted to a very specific type of information. For example, a typical genomic information extraction system might extract assertions about protein-protein interactions, or a clinical information extraction system might mine assertions about relationships between diseases and their treatments. Most systems target binary relations, such as the ones just described. However, more ambitious systems have extracted relationships with as many as four participants. One system [31] targeted protein transport relations, with a four-way relationship that included the transporting protein, the transported protein, the beginning location of the transported protein, and the destination.

Rule-based approaches use typical sentence patterns. These may consist of text literals or may involve syntactic analyses [32]. Learning-based approaches have classically used bag-of-words representations (see Section 4.2), but more recent approaches have had success using features taken from syntactic analysis, particularly dependency parsing [33].

4.6 Question-Answering
Question-answering is the task of taking a question and a source of information as input and returning an answer. Early approaches to question-answering assumed that the source of information was a database, but modern approaches assume that the answer exists in some PubMed/MEDLINE document or (for non-biomedical applications) in some web page. Question-answering differs from information retrieval in that the goal is to return a specific answer, not a document containing the answer. It differs from information extraction in that it is meant to allow for ad hoc queries, while information extraction focuses on very specific information needs. Question-an-



Table 1. Some knowledge sources for biomedical natural language processing.

Resource | Description
Informatics for Integrating Biology and the Bedside (i2b2 - https://www.i2b2.org/) | National Center for Biomedical Computing with focus on translational research that facilitates and provides data sets for clinical natural language processing research
Gene Ontology (https://www.geneontology.org) | Controlled vocabulary with relationships including partonymy and inheritance, designed for describing gene functions, broadly construed
Entrez Gene (https://www.ncbi.nlm.nih.gov/gene) | Source for gene names, symbols, and synonyms; also the source for GeneRIFs and SUMMARY fields
PubMed/MEDLINE (https://www.ncbi.nlm.nih.gov/pubmed) | The National Library of Medicine's database of abstracts of biomedical publications (MEDLINE) and search interface for accessing them (PubMed)
Unified Medical Language System (https://www.nlm.nih.gov/research/umls/) | Large lexical and conceptual resource, including the UMLS Metathesaurus, which aggregates a large number of biomedical and some genomic vocabularies
SWISSPROT (https://www.uniprot.org/) | Database of information about proteins with literature references, useful as a gold standard
PharmGKB (https://www.pharmgkb.org/) | Database of relationships between a number of clinical, genomic, and other entities with literature references, useful as a gold standard
Comparative Toxicogenomics Database (https://ctdbase.org/) | Database of relationships between genes, diseases, and chemicals, with literature references, useful as a gold standard

Various terminological resources, data sources, and gold-standard databases for biomedical natural language processing.
doi:10.1371/journal.pcbi.1003044.t001

swering typically involves determining the type of answer that is expected (a time? a location? a person?), formulating a query that will return documents containing the answer, and then finding the answer within the documents that are returned.

Various types of questions have varying degrees of difficulty. The best results are achieved for so-called "factoid" questions, such as "where are lipid rafts located?", while "why" questions are very difficult. In the biomedical domain, definition questions have been extensively studied [34-36]. The medical domain presents some unique challenges. For example, questions beginning with "when" might require times as their answer (e.g., "when does blastocyst formation occur in humans?"), but also may require very different sorts of answers, e.g., "when should antibiotics be given for a sore throat?" [37]. A shared task in 2005 involved a variety of types of genomic questions adhering to specific templates (and thus overlapping with information extraction), such as "what is the biological impact of a mutation in the gene X?".

4.7 Summarization
Summarization is the task of taking a document or set of documents as input and returning a shorter text that conveys the information in the longer text(s). There is a great need for this capability in the biomedical domain: a search in PubMed/MEDLINE for the gene p53 returns 56,464 publications as of the date of writing.

In the medical domain, summarization has been applied to clinical notes, journal articles, and a variety of other input types. For example, one system, MITRE's MiTAP, does multi-document summarization of epidemiological reports, newswire feeds, email, online news, television news, and radio news to detect disease outbreaks.

In the genomics domain, there have been three major areas of summarization research. One has been the automatic generation of GeneRIFs. GeneRIFs are short text snippets, less than 255 characters in length, associated with specific Entrez Gene entries. Typically they are manually cut-and-pasted from article abstracts. Lu et al. developed a method for finding them automatically using a variant of the Edmundsonian paradigm, a classic approach to single-document summarization [38,39]. In the Edmundsonian paradigm, sentences in a document are given points according to a relatively simple set of features, including position in the document, presence of "cue words" (words that indicate that a sentence is likely to be a good summary sentence), and absence of "stigma words" (words that indicate that a sentence is not likely to be a good summary sentence).

Another summarization problem is finding the best sentence for asserting a protein-protein interaction. This task was made popular by the BioCreative shared task. The idea is to boil down a set of articles to the single sentence that best gives evidence that the interaction occurs. Again, simple features work well, such as looking for references to figures or tables [40].

Finally, there has been a small body of work on the generation of SUMMARY fields. More sophisticated measures have been applied here, such as the PageRank algorithm [41].

5. Shared Tasks

The natural language processing community has a long history of evaluating applications through the shared task paradigm. Similar to CASP, a shared task involves agreeing on a task definition, a data set, and a scoring mechanism. In biomedical text mining, shared tasks have had a strong effect on the direction of the field. There have been both clinically oriented and genomically oriented shared tasks.

In the clinical domain, the 2007 NLP Challenge [42] involved assigning ICD-9-CM codes to radiology reports of chest x-rays and renal procedures. Also in the clinical domain, i2b2 has sponsored a number of shared tasks, described in Section 1.1.2. (At the time of writing, the National Institute of Standards and Technology is preparing a shared task involving electronic medical records under the aegis of the annual Text Retrieval Conference. The task has not yet been defined.)

In the genomics domain, the predominant shared tasks have been the BioCreative shared tasks and a five-year series of tasks in a special genomics track of the Text Retrieval Conference [43]. Some of the tasks were directly relevant to translational bioinformatics. The tasks varied from year to year and included information retrieval (Section 4.1), production of GeneRIFs (Section 4.7), document classification (Section 4.2), and question-answering (Section 4.6). A topic that was frequently investigated by participants was the contribution of controlled vocabularies to performance on text mining tasks. Results were equivocal; it was found that



they could occasionally increase performance, but only when used intelligently, e.g., with appropriate preprocessing or filtering of items in the terminologies; blind use of vocabulary resources does not improve performance.

The BioCreative series of shared tasks has been oriented more towards model organism database curation than towards translational bioinformatics, but some of the subtasks that were involved are of utility in translational bioinformatics. BioCreative tasks have included gene name recognition in text (Section 4.3), mining information about relationships between genes and their functions (Section 4.5), mining information about protein-protein interactions (Section 4.5), information retrieval (Section 4.1), and relating mentions of genes in text to database entries in Entrez Gene and SWISSPROT (Section 4.4).

6. Software Engineering for Text Mining

Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing. General software testing is covered in such standard books as [44]. The special requirements of software testing for natural language processing applications are not covered in the standard books on software testing, but a small but growing body of literature discusses the special issues that arise here. There are two basic paradigms for evaluating text mining applications. The standard paradigm involves running large corpora through the application and determining the F-measure achieved. However, this approach is not satisfactory for quality assurance and software testing. It is good for achieving overall estimates of performance, but does a poor job of indicating what the application is good at and what it is bad at. For this task, structured test suites and application of the general principles of software testing are much more appropriate. Structured test suites are discussed in Section 1.3.2. It is helpful to consult with a descriptive linguist when designing test suites for assessing an application's ability to handle linguistic phenomena. [15] and [14] describe basic principles for constructing test suites for linguistic phenomena by applying the techniques of software testing and of descriptive linguistics. The former includes a methodology for the automatic generation of test suites of arbitrary size and complexity. [45] presents a quantitative examination of the effectiveness of corpora versus structured test suites for software testing, and demonstrates that structured test suites achieve better code coverage (the percentage of code that is executed during the test phase; bugs cannot be discovered in code that is not executed) than corpora, and also offer a significant advantage in terms of time and efficiency. They found that a structured test suite that achieved higher code coverage than a 3.9 million word corpus could be run in about 11 seconds, while it took about four and a half hours to process the corpus. [46] discusses the application of the software engineering concept of the "fault model", informed by insights from linguistics, to discovering a serious error in their ontology linking tool.

User interface assessment requires special techniques not found in other areas of software testing for natural language processing. User interface testing has been most heavily studied in the case of literature search interfaces. Here the work of [47,48] is most useful, and can serve as a tutorial on interface evaluation.

7. Exercises

1. Obtain a copy of a patient record collection from the i2b2 National Center for Biomedical Computing (see e.g., [49]). Download the MetaMap application or API and run it over a set of ten discharge summaries. Use Google to find the current links for the i2b2 data sets and for downloading MetaMap. Note that using the MetaMap application will require writing code to extract results from the MetaMap output file, while using the API will require writing your own application. Which outputs might you consider to identify phenotypes that could be relevant for your research interests?
2. Obtain a collection of 1,000 PubMed abstracts by querying with the terms "gene" and "mutation" and downloading the 1,000 most recent. Run the EMU mutation extractor (http://bioinf.umbc.edu/EMU/ftp) or a similar tool on them. What genotypes can you identify in the output?
3. A researcher has a collection of 10,000 documents. She wants to retrieve all documents relevant to pulmonary hypertension. The collection contains 250 documents that are relevant to pulmonary hypertension. An information retrieval program written by a colleague returns 100 documents. 80 of these are actually relevant to pulmonary hypertension. What are the precision, recall, and F-measure for this system?
4. Explain the difference between descriptive linguistic rules and prescriptive linguistic rules. Be sure to say which type text mining is concerned with.

Answers to the Exercises can be found in Text S1.

Supporting Information

Text S1 Answers to Exercises.
(DOCX)

Acknowledgments

Anna Divoli provided helpful comments on the manuscript.
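The corpus-based evaluation paradigm of Section 6, and Exercise 3, both rest on precision, recall, and the F-measure. The following is a minimal sketch of how these are computed from raw document counts. The function name `precision_recall_f1` and the counts in the usage lines are hypothetical, and the sketch assumes the balanced F-measure (the harmonic mean of precision and recall, often written F1), since the text does not say which weighting is intended.

```python
def precision_recall_f1(retrieved, relevant_retrieved, relevant_total):
    """Return (precision, recall, F1) from raw document counts.

    retrieved          -- number of documents the system returned
    relevant_retrieved -- how many of the returned documents are relevant
    relevant_total     -- number of relevant documents in the whole collection
    """
    if retrieved < 0 or relevant_total < 0:
        raise ValueError("counts must be non-negative")
    if not 0 <= relevant_retrieved <= min(retrieved, relevant_total):
        raise ValueError("relevant_retrieved is inconsistent with the other counts")

    # Precision: fraction of the returned documents that are relevant.
    precision = relevant_retrieved / retrieved if retrieved else 0.0
    # Recall: fraction of all relevant documents that were returned.
    recall = relevant_retrieved / relevant_total if relevant_total else 0.0
    # Balanced F-measure: harmonic mean of precision and recall (assumed weighting).
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts, deliberately different from those in Exercise 3.
    p, r, f = precision_recall_f1(retrieved=40, relevant_retrieved=30, relevant_total=60)
    print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

With these illustrative counts, the system is right about three quarters of what it returns but finds only half of what it should, and the single F value sits between the two. This is why, as Section 6 notes, an overall score says little about which inputs the application handles badly.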

Further Reading

[50] is a book-length treatment of biomedical natural language processing, oriented towards readers with some background in NLP. [51] takes the perspective of model organism database curation as the primary motivating task for text mining. It includes a review of the ten most important papers and resources for BioNLP. [52] is an excellent introduction to text mining in general. [10] is the standard textbook on natural language processing.

There are a number of seminal papers on biomedical text mining besides those already cited in the text. These include [20,53-57].

Table 1 lists a variety of terminological resources, data sources, and gold-standard databases for biomedical natural language processing.

References

1. Steele MP, Speer MC, Loyd JE, Brown KK, Herron A, et al. (2005) Clinical and pathologic features of familial interstitial pneumonia. Am J Respir Crit Care Med 172: 1146-1152.
2. Boon K, Bailey N, Yang J, Steel M, Groshong S, et al. (2009) Molecular phenotypes distinguish patients with relatively stable from progressive idiopathic pulmonary fibrosis (IPF). PLoS ONE 4: e5134. doi:10.1371/journal.pone.0005134.
3. Chapman W, Dowling J, Wagner M (2004) Fever detection from free-text clinical records for biosurveillance. J Biomed Inform 37: 120-127.
4. Chapman W, Dowling J (2007) Can chief complaints detect febrile syndromic patients? Journal of Advances in Disease Surveillance 3.



5. Elhadad N (2006) User-sensitive text summarization: application to the medical domain [Ph.D. thesis]. New York: Columbia University.
6. Uzuner O, Solti I, Cadag E (2010) Extracting medication information from clinical text. J Am Med Inform Assoc 17: 514-518.
7. Hunter LE (2009) The processes of life: an introduction to molecular biology. Cambridge (MA): MIT Press.
8. Wiegers TC, Davis AP, Cohen KB, Hirschman L, Mattingly CJ (2009) Text mining and manual curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database (CTD). BMC Bioinformatics 10: 326.
9. Altman RB (2011) Pharmacogenomics: "noninferiority" is sufficient for initial implementation. Clin Pharmacol Ther 89: 348-350.
10. Jurafsky D, Martin JH (2008) Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. Pearson Prentice Hall.
11. Manning C, Schuetze H (1999) Foundations of statistical natural language processing. Cambridge (MA): MIT Press.
12. Jackson P, Moulinier I (2002) Natural language processing for online applications: text retrieval, extraction, and categorization. 2nd edition. John Benjamins Publishing Company.
13. Cohen KB, Fox L, Ogren PV, Hunter L (2005) Empirical data on corpus design and usage in biomedical natural language processing. In: AMIA 2005 symposium proceedings. pp. 156-160.
14. Cohen KB, Roeder C, Baumgartner WA Jr, Hunter L, Verspoor K (2010) Test suite design for biomedical ontology concept recognition systems. In: Proceedings of the Language Resources and Evaluation Conference.
15. Cohen KB, Tanabe L, Kinoshita S, Hunter L (2004) A resource for constructing customized test suites for molecular biology entity identification systems. In: HLT-NAACL 2004 Workshop: BioLINK 2004, Linking Biological Literature, Ontologies and Databases. Association for Computational Linguistics, pp. 1-8.
16. Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, et al. (2008) The BioCreative II critical assessment for information extraction in biology challenge. Genome Biol 9.
17. Cohen KB, Johnson HL, Verspoor K, Roeder C, Hunter LE (2010) The structural and content aspects of abstracts versus bodies of full text journal articles are different. BMC Bioinformatics 11: 492.
18. Lin J, Karakos D, Demner-Fushman D, Khudanpur S (2006) Generative content models for structural analysis of medical abstracts. In: Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology. New York, New York: Association for Computational Linguistics. pp. 65-72.
19. Demner-Fushman D, Abhyankar S, Jimeno-Yepes A, Loane R, Rance B, et al. (2011) A knowledge-based approach to medical records retrieval. In: Proceedings of TREC 2011.
20. Fukuda K, Tamura A, Tsunoda T, Takagi T (1998) Toward information extraction: identifying protein names from biological papers. In: Pac Symp Biocomput. pp. 707-718.
24. Tsuruoka Y, Tateishi Y, Kim JD, Ohta T, McNaught J, et al. (2005) Developing a robust part-of-speech tagger for biomedical text. In: Proceedings of the 10th Panhellenic Conference on Informatics. pp. 382-392.
25. Caporaso JG, Baumgartner WA Jr, Cohen KB, Johnson HL, Paquette J, et al. (2005) Concept recognition and the TREC Genomics tasks. In: The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings.
26. Lindberg D, Humphreys B, McCray A (1993) The Unified Medical Language System. Methods Inf Med 32: 281-291.
27. Aronson A (2001) Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program. In: Proc AMIA 2001. pp. 17-21.
28. Aronson AR, Lang FM (2010) An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc 17: 229-236.
29. Hanisch D, Fundel K, Mevissen HT, Zimmer R, Fluck J (2005) ProMiner: rule-based protein and gene entity recognition. BMC Bioinformatics 6 Suppl 1: S14.
30. Verspoor K, Roeder C, Johnson HL, Cohen KB, Baumgartner WA Jr, et al. (2010) Exploring species-based strategies for gene normalization. IEEE/ACM Trans Comput Biol Bioinform 7: 462-471.
31. Hunter L, Lu Z, Firby J, Baumgartner WA Jr, Johnson HL, et al. (2008) OpenDMAP: An open-source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-specific gene expression. BMC Bioinformatics 9: 78.
32. Kilicoglu H, Bergler S (2009) Syntactic dependency based heuristics for biological event extraction. In: Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task. Boulder, Colorado: Association for Computational Linguistics. pp. 119-127.
33. Kim JD, Ohta T, Pyysalo S, Kano Y, Tsujii J (2009) Overview of BioNLP'09 shared task on event extraction. In: BioNLP 2009 Companion Volume: Shared Task on Entity Extraction. pp. 1-9.
34. Lin J, Demner-Fushman D (2005) Automatically evaluating answers to definition questions. In: Proceedings of the 2005 Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP 2005). pp. 931-938.
35. Yu H, Wei Y (2006) The semantics of a definiendum constrains both the lexical semantics and the lexicosyntactic patterns in the definiens. In: HLT-NAACL BioNLP Workshop: Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis. ACL, pp. 1-8.
36. Yu H, Lee M, Kaufman D, Ely J, Osheroff J, et al. (2007) Development, implementation, and a cognitive evaluation of a definitional question answering system for physicians. J Biomed Inform 40: 236-251.
37. Zweigenbaum P (2003) Question answering in biomedicine. In: Proceedings of the workshop on natural language processing for question answering. pp. 1-4.
relations from biomedical text. Genome Biol 9 Suppl 2: S9.
41. Jin F, Huang M, Lu Z, Zhu X (2009) Towards automatic generation of gene summary. In: Proceedings of the BioNLP 2009 Workshop. Boulder, Colorado: Association for Computational Linguistics. pp. 97-105.
42. Pestian JP, Brew C, Matykiewicz P, Hovermale D, Johnson N, et al. (2007) A shared task involving multi-label classification of clinical free text. In: Proceedings of BioNLP 2007. Association for Computational Linguistics.
43. Hersh W, Voorhees E (2008) TREC genomics special issue overview. Information Retrieval.
44. Kaner C, Nguyen HQ, Falk J (1999) Testing computer software. 2nd edition. John Wiley and Sons.
45. Cohen KB, Baumgartner WA Jr, Hunter L (2008) Software testing and the naturally occurring data assumption in natural language processing. In: Software Engineering, Testing, and Quality Assurance for Natural Language Processing. Columbus, Ohio: Association for Computational Linguistics. pp. 23-30.
46. Johnson HL, Cohen KB, Hunter L (2007) A fault model for ontology mapping, alignment, and linking systems. In: Pacific Symposium on Biocomputing. World Scientific Publishing Company. pp. 233-244.
47. Hearst M, Divoli A, Jerry Y, Wooldridge M (2007) Exploring the efficacy of caption search for bioscience journal search interfaces. In: Biological, translational, and clinical language processing. Prague, Czech Republic: Association for Computational Linguistics. pp. 73-80.
48. Divoli A, Hearst MA, Wooldridge MA (2008) Evidence for showing gene/protein name suggestions in bioscience literature search interfaces. Pac Symp Biocomput 2008: 568-579.
49. Uzuner O, South BR, Shen S, Duvall SL (2011) 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc 18: 552-556.
50. Cohen KB, Demner-Fushman D (forthcoming) Biomedical natural language processing. John Benjamins Publishing Company.
51. Cohen KB (2010) Biomedical text mining. In: Indurkhya N, Damerau FJ, editors. Handbook of natural language processing. 2nd edition.
52. Jackson P, Moulinier I (2002) Natural language processing for online applications: text retrieval, extraction, and categorization. John Benjamins Publishing Company.
53. Nobata C, Collier N, Tsujii J (1999) Automatic term identification and classification in biology texts. In: Proceedings of the fifth Natural Language Processing Pacific Rim Symposium (NLPRS). pp. 369-374.
54. Blaschke C, Andrade MA, Ouzounis C, Valencia A (1999) Automatic extraction of biological information from scientific text: protein-protein interactions. Proc Int Conf Intell Syst Mol Biol 1999: 60-67.
55. Craven M, Kumlien J (1999) Constructing biological knowledge bases by extracting information from text sources. Proc Int Conf Intell Syst Mol Biol 1999: 77-86.
56. Friedman C, Kra P, Yu H, Krauthammer M,
21. Porter MF (1980) An algorithm for suffix 38. Lu Z, Cohen BK, Hunter L (2006) Finding Rzhetsky A (2001) GENIES: a natural-language
stripping. Program 14: 130137. GeneRIFs via Gene Ontology annotations. In: processing system for the extraction of molecular
22. Liu H, Christiansen T, Baumgartner WA Jr, PSB 2006. pp. 5263. pathways from journal articles. Bioinformatics 17:
Verspoor K (2012) BioLemmatizer: a lemmatiza- 39. Lu Z, Cohen KB, Hunter L (2007) GeneRIF S74S82.
tion tool for morphological processing of biomed- quality assurance as summary revision. In: Pacific 57. Rzhetsky A, Iossifov I, Koike T, Krauthammer
ical text. J Biomed Semantics 3: 3. Symposium on Biocomputing. M, Kra P, et al. (2004) Geneways: a system for
23. Smith L, Rindesch T, Wilbur WJ (2004) Med- 40. Baumgartner WA Jr, Lu Z, Johnson HL, extracting, analyzing, visualizing, and integrating
post: A part-of-speech tagger for biomedical text. Caporaso JG, Paquette J, et al. (2008) Concept molecular pathway data. J Biomed Inform 37:
Bioinformatics 20: 23202321. recognition for extracting protein interaction 4353.



Education

Chapter 17: Bioimage Informatics for Systems Pharmacology
Fuhai Li, Zheng Yin, Guangxu Jin, Hong Zhao, Stephen T. C. Wong*
NCI Center for Modeling Cancer Development, Department of Systems Medicine and Bioengineering, The Methodist Hospital Research Institute, Weil Medical College of
Cornell University, Houston, Texas, United States of America

Abstract: Recent advances in automated high-resolution fluorescence microscopy and robotic handling have made the systematic and cost-effective study of diverse morphological changes within a large population of cells possible under a variety of perturbations, e.g., drugs, compounds, metal catalysts, RNA interference (RNAi). Cell population-based studies deviate from conventional microscopy studies on a few cells, and could provide stronger statistical power for drawing experimental observations and conclusions. However, it is challenging to manually extract and quantify phenotypic changes from the large amounts of complex image data generated. Thus, bioimage informatics approaches are needed to rapidly and objectively quantify and analyze the image data. This paper provides an overview of the bioimage informatics challenges and approaches in image-based studies for drug and target discovery. The concepts and capabilities of image-based screening are first illustrated by a few practical examples investigating different kinds of phenotypic changes caused by drugs, compounds, or RNAi. The bioimage analysis approaches, including object detection, segmentation, and tracking, are then described. Subsequently, the quantitative features, phenotype identification, and multidimensional profile analysis for profiling the effects of drugs and targets are summarized. Moreover, a number of publicly available software packages for bioimage informatics are listed for further reference. It is expected that this review will help readers, including those without bioimage informatics expertise, understand the capabilities, approaches, and tools of bioimage informatics and apply them to advance their own studies.

This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

Citation: Li F, Yin Z, Jin G, Zhao H, Wong STC (2013) Chapter 17: Bioimage Informatics for Systems Pharmacology. PLoS Comput Biol 9(4): e1003043. doi:10.1371/journal.pcbi.1003043

Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America

Published April 25, 2013

Copyright: 2013 Li et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The research is supported by NIH R01 LM008696, NIH R01 CA121225, NIH R01 LM009161, NIH R01 AG028928, NIH U54CA149169 and CPRIT RP110532. The funders had no role in the preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: stwong@tmhs.org

PLOS Computational Biology | www.ploscompbiol.org 1 April 2013 | Volume 9 | Issue 4 | e1003043

What to Learn in This Chapter

- What automated approaches are necessary for analysis of phenotypic changes, especially for drug and target discovery?
- What quantitative features and machine learning approaches are commonly used for quantifying phenotypic changes?
- What resources are available for bioimage informatics studies?

1. Introduction

The old adage that a picture is worth a thousand words certainly applies to the identification of phenotypic variations in biomedical studies. Bright field microscopy, by detecting light transmitted through thin and transparent specimens, has been widely used to investigate cell size, shape, and movement. The recent development of fluorescent proteins, e.g., green fluorescent protein and its derivatives [1], enabled the investigation of the phenotypic changes of subcellular protein structures, e.g., chromosomes and microtubules, revolutionizing optical imaging in biomedical studies. Fluorescent proteins are bound to specific proteins that are uniformly located in relevant cellular structures, e.g., chromosomes, and emit longer wavelength light, e.g., green light, after exposure to shorter wavelength light, e.g., blue light. Thus, the spatial morphology and temporal dynamic activities of subcellular protein structures can be imaged with a fluorescence microscope, an optical microscope that can specifically detect emitted fluorescence of a specific wavelength [2]. In current image-based studies, five-dimensional (5D) image data of thousands of cells (cell populations) can be acquired: spatial (3D), time lapse (1D), and multiple fluorescent probes (1D).

With advances in automated high-resolution microscopy, fluorescent labeling, and robotic handling, image-based studies have become popular in drug and target discovery. These image-based studies are often referred to as High Content Analysis (HCA) [3], which focuses on extracting and analyzing quantitative phenotypic data automatically from large amounts of cell images with approaches in image analysis, computer vision, and machine learning [3,4]. Applications of HCA for screening drugs and targets are referred to as High Content Screening (HCS), which focuses on identifying compounds or genes that cause desired phenotypic changes [5–7]. The image data contain rich information content for understanding biological processes and drug effects, indicate diverse and heterogeneous behaviors of individual cells, and provide stronger statistical power in drawing experimental observations and conclusions, compared to conventional microscopy studies on a few cells. However, extracting and mining the phenotypic changes from the large scale, complex image data is daunting. It is not feasible to manually analyze these data. Hence, bioimage informatics approaches are needed to automatically and objectively analyze large scale image data, and to extract and quantify the phenotypic changes to profile the effects of drugs and targets.

Bioimage informatics in image-based studies usually consists of multiple analysis modules [3,8,9], as shown in Figure 1. Each of the analysis tasks is challenging, and different approaches are often required for the analysis of different types of images. To facilitate image-based screening studies, a number of bioimage informatics software packages have been developed and are publicly available [9]. This chapter provides an overview of the bioimage informatics approaches in image-based studies for drug and target discovery to help readers, including those without bioimage informatics expertise, understand the capabilities, approaches, and tools of bioimage informatics and apply them to advance their own studies. The remainder of this chapter is organized as follows. Section 2 introduces a number of practical screening applications for discovery of potential drugs and targets. Section 3 describes the challenges and approaches for quantitative image analysis, e.g., object detection, segmentation, and tracking. Section 4 introduces techniques for quantification of segmented objects, including feature extraction, phenotype classification, and clustering. Section 5 reviews a number of prevalent approaches for profiling drug effects based on the quantitative phenotypic data. Section 6 lists major, publicly available software packages of bioimage informatics analysis, and finally, a brief summary is provided in Section 7.

2. Example Image-based Studies for Drug and Target Discovery

There are a variety of image-based studies for discovery of drugs, targets, and mechanisms of biological processes. A good starting point for learning about bioimage informatics approaches is to study practical image-based studies, and a number of examples are summarized below.

2.1 Multicolor Cell Imaging-based Studies for Drug and Target Discovery

Fixed cell images with multiple fluorescent markers have been widely used for drug and target screening in scientific research. For example, the effects of hundreds of compounds were profiled for phenotypic changes using multicolor cell images in [10–12]. Hundreds of quantitative features were extracted to indicate the phenotypic changes caused by these compounds, and then computational approaches were proposed to identify the effective compounds, categorize them, characterize their dose-dependent response, and suggest novel targets and mechanisms for these compounds [10–12]. Moreover, phenotypic heterogeneity was investigated by using a subpopulation-based approach to characterize drug effects in [13], and to distinguish cell populations with distinct drug sensitivities in [14]. Also in [15,16], the phenotypic changes of proteins inside individual Drosophila Kc167 cells treated with RNAi libraries were investigated by using high resolution fluorescence microscopy, and bioimage informatics analysis was applied to quantify these images to identify genes regulating the phenotypic changes of interest. Figure 2 shows an image of Drosophila Kc167 cells, which were treated with RNAi and stained to visualize the nuclear DNA (red), F-actin (green), and α-tubulin (blue). Freely available software packages, such as CellProfiler [17], Fiji

Figure 1. The flowchart of bioimage informatics for drug and target discovery.
doi:10.1371/journal.pcbi.1003043.g001



Figure 2. A representative image of Drosophila Kc167 cells treated with RNAi. The red, green, and blue colors are the DNA, F-actin, and α-tubulin channels.
doi:10.1371/journal.pcbi.1003043.g002

[18], Icy [19], GCELLIQ [20], and PhenoRipper [21] can be used for the multicolor cell image analysis.

2.2 Live-cell Imaging-based Studies for Cell Cycle and Migration Regulator Discovery

Two hallmarks of cancer cells are uncontrolled cell proliferation and migration. These are also good phenotypes for screening drugs and targets that regulate cell cycle progression and cell migration in time-lapse images. For example, out of 22,000 human genes, about 600 were identified as related to mitosis by using live cell (time-lapse) imaging and RNAi treatment in the MitoCheck project (www.mitocheck.org) [22,23]. The project is now being expanded, in the MitoSys (systems biology of mitosis) project (http://www.mitosys.org/), to study how these identified genes work together to regulate cell mitosis, in which mistakes can lead to cancer. Also, live cell imaging of HeLa cells was used to discover drugs and compounds that regulate cell mitosis in [24,25]. Moreover, the time-lapse images of live cells were used to study the dynamic behaviors of stem cells in [26,27] and to predict cell fates of neural progenitor cells using their dynamic behaviors in [28]. Figure 3 shows a single frame of live HeLa cell images and the images of four cell cycle phases: interphase, prophase, metaphase, and anaphase [25]. The publicly available software packages for time-lapse image analysis include, for example, the plugins of CellProfiler [17], Fiji [18], BioimageXD [29], Icy [19], CellCognition [23], DCELLIQ [30], and TLM-Tracker [31].

2.3 Neuron Imaging-based Studies for Neurodegenerative Disease Drug and Target Discovery

Neuronal morphology is illustrative of neuronal function and can be instructive toward the dysfunctions seen in neurodegenerative diseases, such as Alzheimer's and Parkinson's disease [32,33]. For example, the 3D neuron synaptic morphological and structural changes were investigated by using super-resolution microscopy, e.g., STED microscopy, to study brain functions and disorders under different stimulations [34–36]. Other advanced optical techniques were also proposed in [37,38] to image and reconstruct the 3D structure of live neurons. Figure 4 shows an example of the 2D neuron images used in [39]. In [40], neuronal degeneration was mimicked by treating mice with different dosages of Aβ peptide, which may cause the loss of neurites, and drugs that rescue the loss of neurites were identified as candidates for Alzheimer's disease (AD) therapy. Figure 5 shows an example of neurites and nuclei images acquired in [40]. To quantitatively analyze neuron images, a number of publicly available software packages have been developed, for example, NeurphologyJ [41], NeuronJ [42], NeuriteTracer (Fiji plugin) [43], NeuriteIQ [44], NeuronMetrics [45], NeuronStudio



Figure 3. Examples of HeLa cell nuclei and cell cycle phase images. (A) A frame of HeLa cell nuclei time-lapse image sequence; (B) Example
images of four cell cycle phases.
doi:10.1371/journal.pcbi.1003043.g003

[46,47], NeuronIQ [39,48], and Vaa3D [49,50]. A review of software packages for neuron image analysis was also reported in [51].

2.4 Caenorhabditis elegans Imaging-based Studies for Drug and Target Discovery

Caenorhabditis elegans (C. elegans) is a common animal model for drug and target discovery. Consisting of only hundreds of cells, it is an excellent model to study cellular development and organization. For example, the invariant embryonic development of C. elegans was recorded by time-lapse imaging, and the embryonic lineages of each cell were then reconstructed by cell tracking to study the functions of genes underpinning the development process [52–54]. Moreover, an atlas of C. elegans, which quantified the nuclear locations and statistics on their spatial patterns in development, was built based on the confocal image stacks via the software CellExplorer [55,56]. In addition, CellProfiler provides an image analysis pipeline for delineating bodies, and quantifying the expression changes of specific proteins, e.g., clec-60 and pharynx, of individual C. elegans under different treatments [57].

These examples have demonstrated diverse cellular phenotypes in different image-based studies. To quantify and analyze the complex phenotypic changes of cells and sub-cellular components from large scale image data, bioimage informatics approaches are needed.

3. Quantitative Bioimage Analysis

After image acquisition, phenotypic changes need to be quantified for characterizing functions of drugs and targets. Due to the large amounts of images generated, it is not feasible to quantify the images manually. Therefore, automated image analysis is essential for the quantification of phenotypic changes. In general, the challenges of quantitative image analysis include object detection, segmentation, tracking, and visualization. The word "object" in this context means the object captured in the bioimages, e.g., the nucleus and cell. The following sections will introduce techniques used to address these challenges.

3.1 Object Detection

Object detection is to detect the locations of individual objects. It is important, especially when the objects cluster together, to facilitate the segmentation task by providing the position and initial boundary information of individual objects. Based on the shape of objects, two



categories of object detection techniques are developed: blob structure detection, e.g., particles and cell nuclei, and tube structure detection, e.g., neurons, blood vessels.

The shape information of blob objects can be used to detect the centers of objects using distance transformation [58]. The concavity of two touching objects would cause two local maxima in the distance image, such that thresholding or seeded watershed can be applied to the distance image to detect and separate the touching blob objects [59]. The intensity information is also often used for blob detection. Blob objects usually have relatively high intensity in the center, and relatively low intensity in the peripheral regions. For example, the Laplacian-of-Gaussian (LOG) filter is effective [60–63] to detect blob objects based on the intensity information. After LOG filtering, local maximum response points often correspond to centers of blob objects, as shown in Figure 6. Moreover, the intensity gradient information is also used for blob detection. For example, in [64] the intensity gradient vectors were smoothed by using the gradient vector flow approach [65] so that the smoothed gradient vectors continuously point to the object centers. Consequently, the blob object centers can be detected by following the gradient vectors [64]. In addition, the boundary points of blob objects with high gradient amplitude can be used to detect their centers, based on the idea of the Hough Transform [66]. For example, in [67] an iterative radial voting method was developed to detect such object centers based on the boundary points. In brief, the detected boundary points vote for the blob center with oriented kernels iteratively, and the orientation and size of the kernels are updated based on the voting results. Finally, the maximum response points in the voting image are selected as the centers of objects. The advantage of this method is that it can detect the centers of objects with noisy appearance [67]. The distance transform and the intensity gradient information can also be combined for object detection [68]. For other blob objects with complex appearances, machine learning approaches based on local features [69,70] can also be used for object detection [71,72], as in Fiji (trainable segmentation plugin) [18] and Ilastik [73].

Figure 4. A representative 2D neuron image. The bright spots near the backbones of neurons are the dendritic spines.
doi:10.1371/journal.pcbi.1003043.g004

Figure 5. A representative image of neurites. Red indicates nuclei and green represents neurites.
doi:10.1371/journal.pcbi.1003043.g005

Tubular structure detection is based on the premise that the intensity remains constant in the direction along the tube, and varies dramatically in the direction perpendicular to the tube. To find the local direction of tube center lines, the eigenvector corresponding to the minimum and negative eigenvalue of the Hessian matrix was proposed in [44,74]. Center line points can be characterized by their local geometric attributes, i.e., the first derivative is close to zero and the magnitude of the second derivative is large in a direction perpendicular to the tube center line [42,44,74]. After the center line point



Figure 6. An example of blob-structure (HeLa cell nuclei) detection. The red spots indicate the detected centers of objects.
doi:10.1371/journal.pcbi.1003043.g006
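
The LoG-based blob detection illustrated in Figure 6 can be sketched in a few lines. The following is a minimal NumPy sketch under stated assumptions, not the implementation used in the cited studies: the function names (`log_kernel`, `filter2d`, `detect_blobs`), the kernel radius of three standard deviations, the relative threshold of 0.5, and the synthetic two-blob image are all chosen here for illustration. The kernel is the negated LoG, so bright blobs give positive peaks.

```python
import numpy as np

def log_kernel(sigma, radius=None):
    """Negated Laplacian-of-Gaussian kernel (positive centre), scale factor omitted."""
    radius = int(3 * sigma) if radius is None else radius
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    r2 = (x ** 2 + y ** 2) / (2.0 * sigma ** 2)
    kernel = (1.0 - r2) * np.exp(-r2)
    return kernel - kernel.mean()  # zero mean: flat background gives no response

def filter2d(image, kernel):
    """Same-size filtering with zero padding (the kernel is symmetric)."""
    kh, kw = kernel.shape
    h, w = image.shape
    padded = np.pad(image.astype(float), ((kh // 2,) * 2, (kw // 2,) * 2), mode="constant")
    out = np.zeros((h, w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def detect_blobs(image, sigma, rel_threshold=0.5):
    """Return (row, col) of strict local maxima of the LoG response above a threshold."""
    response = filter2d(image, log_kernel(sigma))
    h, w = response.shape
    padded = np.pad(response, 1, mode="constant", constant_values=-np.inf)
    neighbours = np.full((h, w), -np.inf)
    for dy in range(3):
        for dx in range(3):
            if (dy, dx) != (1, 1):
                neighbours = np.maximum(neighbours, padded[dy:dy + h, dx:dx + w])
    peaks = (response > neighbours) & (response > rel_threshold * response.max())
    return [(int(r), int(c)) for r, c in np.argwhere(peaks)]

# Synthetic image (an assumption for the demo): two Gaussian "nuclei".
yy, xx = np.mgrid[0:30, 0:60]
image = sum(np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * 3.0 ** 2))
            for cy, cx in [(15, 15), (15, 40)])
centers = detect_blobs(image, sigma=3.0)
print(centers)
```

In practice the filter scale would be matched to the expected object radius, and a library routine would replace the explicit loops; the loops are kept here only to make each step visible.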

detection, a linking process is needed to connect these center line points into continuous center lines based on their direction and distance. For example, in NeuronJ, Dijkstra's shortest-path algorithm was used, based on the Gaussian derivative features, to detect the neuron's centerline between two given points on the neuron [42]. Figure 7 provides an example of neurite images, and Figure 8 shows the corresponding centerline detection results [44] based on the local Gaussian derivative features. In addition to the approaches based on Gaussian derivatives, there are other tubular structure detection approaches. For example, four sets of kernels (edge detectors) were designed to detect the neuron edges and centerlines [75], and super-ellipsoid modeling was designed to fit the local geometry of blood vessels [76].

Moreover, machine learning-based tubular structure detection is widely used. For example, blood vessel detection in retinal images is a representative tubular structure detection task addressed with supervised learning approaches [77,78]. In these methods, the local features, e.g., intensity and wavelet features, of an image patch containing a given pixel are calculated, and then a classifier is trained using these local features based on a set of training points [77,78]. A good survey of blood vessel (tube structure) detection approaches in retinal images was reported in [79]. For more approaches and details of tubular structure detection, readers should refer to the aforementioned neuron image analysis software packages.

Figure 7. A representative neurite image for centerline detection.
doi:10.1371/journal.pcbi.1003043.g007

In summary, blobs and tubes are the dominating structures in bioimages. The detection results provide the position and initial boundary information for the quantification and segmentation processes. In other words, the segmentation process tries to delineate boundaries of objects starting from the detected centers or centerlines of objects. Without the guidance of detection results, object segmentation would be more challenging.

3.2 Object Segmentation

The goal of object segmentation is to delineate boundaries of individual objects of interest in images. Segmentation is the basis for quantifying phenotypic changes. Although a number of image segmentation methods have been reported, this remains an open challenge due to the complexity of morphological appearances of objects. This section introduces a number of widely used segmentation methods.

Threshold segmentation [80] is the simplest method:

$$T(I) = \begin{cases} 1, & t_2 > I(x,y) > t_1 \\ 0, & \text{otherwise} \end{cases}$$

where I(x,y) is the image, and t1 and t2 are the intensity thresholds. As an extension of the thresholding method, Fuzzy-C-Means [81] can be used to separate images into more regions based on intensity information. These methods could divide the image into objects and background, but fail to separate the object clumps (i.e., multiple objects touching together). Watershed segmentation and its derivatives are widely used segmentation methods. They build object boundaries between objects on the pixels with local maximum intensity, which act like dams to avoid flooding from different basins (object regions) [82]. To avoid the over-segmentation problem of the watershed approach, the marker-controlled watershed (or seeded watershed) approach, in which the floods start from the marker or seed points (the object detection results), was proposed [68,83–85]. Figure 9 illustrates the segmentation result of HeLa cell nuclei using the seeded watershed method based on the cell detection results.

Active contour models are another set of widely used segmentation methods [86–90]. Generally, there are two kinds of active contour models: boundary-driven and region-competition models. In the boundary-driven model, the evolution of the contours (boundaries of objects) is determined by the local gradient. In other words, the boundary fronts move toward the outside (or inside) quickly in the regions with low intensity variation (gradient), and slowly in the regions with high gradient (where the boundaries are). When great intensity variation appears inside cells, or the boundary is weak, this method often fails [91]. Instead of using gradient information, the region-competition model makes use of the intensity similarity

Figure 8. An example of neurite centerline detection. (A) The centerline confidence image obtained by using the local Gaussian derivative
features. Higher intensity indicates higher confidence of pixels on the centerlines. (B) The neurite centerline detection result image. Different colors
indicate the disconnected branches.
doi:10.1371/journal.pcbi.1003043.g008
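
A centerline confidence image of the kind shown in Figure 8 can be approximated from the Hessian of a Gaussian-smoothed image. The sketch below is an illustration under assumptions, not the method of [44]: it assumes bright tubes on a dark background, uses the negated smallest Hessian eigenvalue as the confidence, and the function names, the smoothing scale, and the synthetic horizontal line are hypothetical. For a bright tube, the eigenvector of the most negative eigenvalue points across the tube, so the tube direction is the orthogonal eigenvector.

```python
import numpy as np

def gaussian_smooth(image, sigma):
    """Separable Gaussian smoothing with a kernel truncated at 3 sigma."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    g = np.exp(-t ** 2 / (2.0 * sigma ** 2))
    g /= g.sum()
    rows = np.apply_along_axis(np.convolve, 1, image.astype(float), g, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, g, mode="same")

def centerline_confidence(image, sigma=2.0):
    """Return (confidence, normal) from the per-pixel 2x2 Hessian.

    confidence: negated smallest eigenvalue, clipped at zero (large on bright ridges).
    normal: eigenvector of that eigenvalue, in (y, x) component order.
    """
    smooth = gaussian_smooth(image, sigma)
    gy, gx = np.gradient(smooth)          # first derivatives
    hyy, _ = np.gradient(gy)              # second derivatives
    hxy, hxx = np.gradient(gx)
    hessian = np.stack([np.stack([hyy, hxy], -1),
                        np.stack([hxy, hxx], -1)], -2)   # shape (H, W, 2, 2)
    eigvals, eigvecs = np.linalg.eigh(hessian)            # eigenvalues ascending
    confidence = np.maximum(-eigvals[..., 0], 0.0)
    normal = eigvecs[..., :, 0]
    return confidence, normal

# Synthetic "neurite" (an assumption for the demo): a horizontal bright line at row 10.
yy, xx = np.mgrid[0:21, 0:40]
image = np.exp(-(yy - 10) ** 2 / (2 * 1.5 ** 2))
confidence, normal = centerline_confidence(image, sigma=2.0)
ridge_row = int(np.argmax(confidence[:, 20]))
print(ridge_row, normal[10, 20])
```

Thresholding the confidence image and linking the surviving points along the tube direction would then give the continuous centerlines described in the text.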



Figure 9. An example of HeLa nuclei segmentation using the seeded watershed algorithm. The green contours are the boundaries of
nuclei.
doi:10.1371/journal.pcbi.1003043.g009
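
The seeded watershed segmentation illustrated in Figure 9 can be sketched by combining thresholding, a distance transform, and flooding from seed points. The code below is a minimal sketch under assumptions rather than the procedure of the cited work: it uses a priority-queue flooding variant of the seeded watershed, a brute-force distance transform suitable only for small images, hand-placed seeds standing in for detection results, and hypothetical function names.

```python
import heapq
import numpy as np

def threshold(image, t1, t2=np.inf):
    """T(I) = 1 where t1 < I(x, y) < t2, else 0."""
    return (image > t1) & (image < t2)

def distance_transform(mask):
    """Brute-force Euclidean distance from each foreground pixel to the background."""
    fg, bg = np.argwhere(mask), np.argwhere(~mask)
    d2 = ((fg[:, None, :] - bg[None, :, :]) ** 2).sum(axis=2)
    dist = np.zeros(mask.shape)
    dist[mask] = np.sqrt(d2.min(axis=1))
    return dist

def seeded_watershed(elevation, markers, mask):
    """Flood from labelled seeds in order of increasing elevation, inside mask only."""
    labels = markers.copy()
    h, w = labels.shape
    heap = [(float(elevation[y, x]), k, int(y), int(x))
            for k, (y, x) in enumerate(np.argwhere(markers > 0))]
    heapq.heapify(heap)
    count = len(heap)  # unique tie-breaker so equal elevations never compare further
    while heap:
        _, _, y, x = heapq.heappop(heap)
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and labels[ny, nx] == 0:
                labels[ny, nx] = labels[y, x]
                heapq.heappush(heap, (float(elevation[ny, nx]), count, ny, nx))
                count += 1
    return labels

# Synthetic image (an assumption for the demo): two touching disks of radius 8.
yy, xx = np.mgrid[0:21, 0:41]
image = (((yy - 10) ** 2 + (xx - 10) ** 2 <= 64) |
         ((yy - 10) ** 2 + (xx - 24) ** 2 <= 64)).astype(float)
mask = threshold(image, 0.5)
dist = distance_transform(mask)
markers = np.zeros(mask.shape, dtype=int)
markers[10, 10], markers[10, 24] = 1, 2   # seeds, e.g. from a detection step
labels = seeded_watershed(-dist, markers, mask)
print(np.unique(labels))
```

Flooding the negated distance image means each seed claims the pixels of its own disk before either flood reaches the narrow neck between the two objects, which is where the dividing boundary forms.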

 
information to separate the image into regions with similar intensity. Region competition-based active contour models could solve the weak boundary problem; however, they require that the intensity of touching objects is separable [87]. To implement these active contour models, the level set representation is widely used [92]. A level set is an n+1 dimensional function that can easily represent any n dimensional shape without parameters. The inside regions of objects are indicated by using positive levels, and outside regions are represented using negative levels. For this implementation, the initial boundary (zero level) is required, and the signed distance function is often used to initialize the level set function [92,93]. To evolve the level set functions (grow the boundaries of objects), the following two equations are classical models. The first equation is often called the geodesic active contour (GAC) [86], and the second one is often named the Chan and Vese active contour (CV) [87].

GAC level set evolution equation:

$$\frac{d\psi}{dt} = \alpha \nabla g \cdot \nabla \psi + g(\kappa + c)\,|\nabla \psi|$$

CV level set evolution equation:

$$\frac{d\psi}{dt} = \delta_\varepsilon(\psi)\left[\mu \cdot \kappa - \nu - \lambda_1 (I - c_1)^2 + \lambda_2 (I - c_2)^2\right]$$

where $\psi$ denotes the level-set function, $g$ indicates the gradient function, $\nabla$ is the gradient operator, and $c$, $c_1$, and $c_2$ are constant variables. $\delta_\varepsilon(x) = \frac{1}{\pi}\,\frac{\varepsilon}{\varepsilon^2 + x^2}$ is an approximation of the Dirac function (to indicate the boundary bands), which is the derivative of the Heaviside function denoting inside/outside regions of objects: $H(x) = \frac{1}{2}\left(1 + \frac{2}{\pi}\arctan\left(\frac{x}{\varepsilon}\right)\right)$, and the curvature term,

$$\kappa = \mathrm{div}\left(\frac{\nabla \psi}{|\nabla \psi|}\right) = \frac{\psi_{xx}\psi_y^2 - 2\psi_x\psi_{xy}\psi_y + \psi_x^2\psi_{yy}}{\left(\psi_x^2 + \psi_y^2\right)^{3/2}},$$

indicates the local smoothness of boundaries, and div is the divergence operation. Figure 10 demonstrates the segmentation result using the GAC level set approach.

An additional segmentation method, Voronoi segmentation [94], first defines the centers of objects and then constructs the boundaries between two objects on the pixels that are equidistant from the two centers. In CellProfiler, the Voronoi segmentation method was extended by considering the local intensity variations in the distance metric to achieve better segmentation results [95]. This method is fast and generates level set comparable



Figure 10. An example of segmentation of Drosophila cell images using the level set approach.
doi:10.1371/journal.pcbi.1003043.g010
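
To make the level set evolution more concrete, the CV equation of Section 3.2 can be discretized with an explicit Euler step. The following is a minimal sketch under assumptions, not the implementation behind Figure 10 (which uses the GAC model): c1 and c2 are taken as the mean intensities inside (positive levels) and outside the current contour, the parameter values (mu, nu, lambda1, lambda2, time step, epsilon) and the function names are illustrative, the curvature is computed as the divergence of the normalized gradient, and the test image is a synthetic bright square.

```python
import numpy as np

def dirac_eps(x, eps=1.0):
    """Smoothed Dirac function: (1/pi) * eps / (eps^2 + x^2)."""
    return (1.0 / np.pi) * eps / (eps ** 2 + x ** 2)

def curvature(psi):
    """kappa = div(grad(psi) / |grad(psi)|), using central differences."""
    gy, gx = np.gradient(psi)
    norm = np.sqrt(gx ** 2 + gy ** 2) + 1e-8   # avoid division by zero
    return np.gradient(gy / norm, axis=0) + np.gradient(gx / norm, axis=1)

def chan_vese_step(psi, image, mu=0.1, nu=0.0, lam1=1.0, lam2=1.0, dt=1.0, eps=1.0):
    """One explicit Euler step of the CV level set equation (inside = positive levels)."""
    inside = psi > 0
    c1 = image[inside].mean()      # assumes the inside region is non-empty
    c2 = image[~inside].mean()     # assumes the outside region is non-empty
    force = (mu * curvature(psi) - nu
             - lam1 * (image - c1) ** 2 + lam2 * (image - c2) ** 2)
    return psi + dt * dirac_eps(psi, eps) * force

# Synthetic image (an assumption for the demo): a bright square on a dark background.
yy, xx = np.mgrid[0:32, 0:32]
image = np.zeros((32, 32))
image[8:24, 8:24] = 1.0

# Initial level set: signed distance to a circle that only partly covers the square.
psi = 8.0 - np.sqrt((yy - 14) ** 2 + (xx - 14) ** 2)
for _ in range(1000):
    psi = chan_vese_step(psi, image)
segmented = psi > 0
print(int(segmented.sum()))
```

On real images the level set function is usually re-initialized or regularized during the evolution, and the number of iterations is chosen by a convergence criterion rather than fixed in advance.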

results. The graph cut segmentation method views the image as a graph, in which each pixel is a vertex and adjacent pixels are connected [63,96,97]. It cuts the graph into several small graphs from the regions where adjacent pixels have the most different properties, e.g., intensity.

Different from the aforementioned segmentation approaches, local feature and machine learning-based segmentation approaches are implemented, for example, in Fiji (trainable segmentation plugin) [18] and Ilastik [73]. Users can interactively select the training sample pixels/voxels or small image patches, and then classifiers are automatically trained based on the features of the training pixels or voxels (or patches) to predict the classes, e.g., cells or background, of the pixels or voxels (or patches) in a new image. The image patches could be circular or square neighborhoods of a given point, and could also be regions (superpixels) obtained by clustering analysis. For example, Simple Linear Iterative Clustering (SLIC) made use of the intensity and coordinate information of pixels to separate the image into uniformly sized and biologically meaningful regions [98,99], and then machine learning approaches were used to identify the regions of interest, e.g., boundary superpixels, for object segmentation [99].

3.3 Object Tracking

To study the dynamic behaviors and phenotypic changes of objects over time (e.g., cell cycle progression and migration), object tracking using time-lapse image sequences is necessary. Figure 11 shows a HeLa cell division process in four frames at different time points, and Figures 12 and 13 show examples of cell migration trajectories and cell lineages reconstructed from the time-lapse images of HeLa cells [30]. Object tracking is a challenging task due to the complex dynamic behaviors of objects over time. In general, cell tracking approaches can be classified into three categories: model evolution-based tracking, spatial-temporal volume segmentation-based tracking, and segmentation-based tracking.

In the model evolution-based tracking approaches, cells or nuclei are initially detected and segmented in the first frame, and then their boundaries and positions evolve frame by frame. Some tracking techniques in this category are mean-shift [100] and parametric active contours [88,101]. However, neither mean-shift nor parametric active contours can cope well with cell division and nuclei clusters. Though the level set method enables topological change, e.g., cell division, it also allows the fusion of overlapping cells. Extending these methods to cope with

PLOS Computational Biology | www.ploscompbiol.org 9 April 2013 | Volume 9 | Issue 4 | e1003043


Figure 11. Time-lapse images indicating cell cycle progression. The cell in the red square in the first frame (A) divided into two cells in frame
60 (B). The descendant cells divided again in frames 152 and 156, respectively, as shown in the red squares in (C) and (D).
doi:10.1371/journal.pcbi.1003043.g011

these tracking challenges is nontrivial and increases computation time [90,102–104]. For example,
the coupled geometric active contours model was proposed to prevent object fusion by representing
each object with an independent level set in [105], and this was further extended to 3D cell
tracking in [90]. The other approach to explicitly block cell merging is to introduce topology
constraints, i.e., labeling object regions with different numbers or colors. For example, the
region labeling map was employed in [27,106] to deal with cell merging, and planar graph-vertex
coloring was employed to separate the neighboring contours. Based on the four-color theorem
[108,109], four separate level set functions could then easily deal with cell merging [107].
For the spatial-temporal volume segmentation-based tracking, 2D image sequences were viewed as 3D
volume data (2D spatial + temporal), and shape- and size-constrained level set segmentation
approaches were applied to segment the traces of objects and reconstruct the cell lineage in
[110–112].
For detection and segmentation-based tracking, objects are first detected and segmented, and then
these objects are associated between two consecutive frames, based on their morphology, position,
and motion [30,113–115]. These tracking approaches are usually fast, but their accuracy is closely
related to detection and segmentation results, similarity measurements, and association
strategies. The cell center position, shape, intensity, migration distance, and spatial context
information were used as similarity measurements in [113,115]. For the association approaches,
the overlap region and distance based method was employed in [114], in which objects in the
current frame were associated with the nearest objects in the next frame. Then the false matches,
e.g., many-to-one or one-to-many, were further corrected through post processing. Different
from the individual object association above, all segmented objects were simultaneously associated
by using integer programming optimization in [113,116]:

$$x^{*} = \arg\max_{x \in \{0,1\}^{N}} S x, \quad \text{s.t. } A x \le 1,$$

where $A x \le 1$ restricts that one object can be associated to one object at most, $A$ is an
$(m+n) \times N$ matrix, and the first $m$ rows correspond to $m$ objects in frame $t$, and the
last $n$ rows denote objects in frame $t+1$. $N$ is the



Figure 12. Examples of cell migration trajectories. Different colors represent different trajectories.
doi:10.1371/journal.pcbi.1003043.g012
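
The integer-programming association between frame $t$ and frame $t+1$ can be illustrated on a toy example. The following is a minimal sketch under stated assumptions, not the implementation of [113,116]: it enumerates every binary vector $x$ by brute force (practical only for a handful of cells), and the centroid-distance similarity, the cell coordinates, and the function names `build_problem` and `solve_association` are hypothetical.

```python
from itertools import product
import math


def build_problem(cells_t, cells_t1):
    """Enumerate all N = m*n candidate associations between two frames.

    Returns (pairs, S, A): pairs[j] = (i, k) links cell i in frame t to
    cell k in frame t+1, S[j] is the similarity of that link, and A is
    the (m+n) x N constraint matrix.
    """
    m, n = len(cells_t), len(cells_t1)
    pairs = [(i, k) for i in range(m) for k in range(n)]
    # Hypothetical similarity: decays with the distance between centroids.
    S = [math.exp(-math.dist(cells_t[i], cells_t1[k])) for i, k in pairs]
    A = [[0] * len(pairs) for _ in range(m + n)]
    for j, (i, k) in enumerate(pairs):
        A[i][j] = 1        # first m rows: objects in frame t
        A[m + k][j] = 1    # last n rows: objects in frame t+1
    return pairs, S, A


def solve_association(S, A):
    """Brute-force argmax of S.x over x in {0,1}^N subject to A.x <= 1."""
    best_x, best_score = None, -1.0
    for x in product((0, 1), repeat=len(S)):
        if any(sum(a * xi for a, xi in zip(row, x)) > 1 for row in A):
            continue  # an object would be associated more than once
        score = sum(s * xi for s, xi in zip(S, x))
        if score > best_score:
            best_x, best_score = x, score
    return best_x


# Two cells in frame t, three in frame t+1 (the third one newly entered).
cells_t = [(0.0, 0.0), (10.0, 0.0)]
cells_t1 = [(10.5, 0.5), (0.5, 0.0), (30.0, 30.0)]
pairs, S, A = build_problem(cells_t, cells_t1)
x_star = solve_association(S, A)
links = [pairs[j] for j, xj in enumerate(x_star) if xj]
print(links)
```

In this toy case the third cell of frame $t+1$ stays unmatched, which corresponds to the newly entered cells that need a separate linking step. For realistic numbers of cells an integer programming or assignment solver would replace the enumeration.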

number of all possible associations among objects in frame $t$ and frame $t+1$. $S$ is a
$1 \times N$ similarity matrix, and $S_j = S(c_{t+1}^{k} \mid c_{t}^{i})$. For the unmatched cells,
e.g., newborn or newly entered cells, a linking process is usually needed to link them to the
parent cells or to start a new trajectory. This optimal matching strategy was also used in [27]
to link object trajectory segments, i.e., broken or newly appearing trajectories.
As an alternative to frame-by-frame association strategies, Bayesian filters, e.g., Particle
filter and Interacting Multiple Model (IMM) filters [117,118], are also used for object tracking.
The goal of these filters is to recursively estimate a model of object migration in an image
sequence. Generally, in the Bayesian methods, a state vector, $x_t$, is defined to indicate the
characters of objects, e.g., position, velocity, and intensity. Then, two models are defined based
on the state vector. The first is the state evolution model, $x_t = f_t(x_{t-1}) + \varepsilon_t$,
which describes the evolution of the state, where $f_t$ is the state evolution function at time
point $t$, and $\varepsilon_t$ is a noise term, e.g., Gaussian noise. The other is the observation
model, $z_t = h_t(x_t) + \eta_t$, which maps the state vector into observations that are
measurable in the image, where $h_t$ is the map function, and $\eta_t$ is the noise. Based on the
two models and Bayes' rule, the posterior density of the object state is estimated as follows:

$$p(x_t \mid z_{1:t}) \propto p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1}), \qquad
p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1},$$

where $p(z_t \mid x_t)$ is defined based on the observation model, and $p(x_t \mid x_{t-1})$ is
defined based on the state evolution model. The basic principle of the particle filter is to
approximate the posterior density by a set of stochastically drawn samples (particles), and it
has been employed for object tracking in fluorescent images in [119–121]. In some biological
studies, the motion dynamics of objects are complex. Therefore, one motion model might not be able
to describe object motion dynamics well. The IMM filter is employed to incorporate multiple motion
models, and the motion model of objects can transition from one to another in the next frame
with certain probabilities. For example, the IMM filter with three motion models, i.e., random
walk, first-order, and second-order linear extrapolation, was used for 3D object tracking in
[118], and for 2D cell tracking in [27].

3.4 Image Visualization
Most of the aforementioned software packages provide functions to visualize 2D images and the
analysis results. However,
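
The recursive estimation performed by a particle filter (predict with the state evolution model, weight with the observation model, resample) can be sketched in a few lines. This is an illustrative bootstrap particle filter, not the trackers of [119–121]; the 1D random-walk state model, the Gaussian observation noise, the parameter values, and the function name `particle_filter` are all assumptions made for the example.

```python
import math
import random


def particle_filter(observations, n_particles=500, sigma_state=1.0,
                    sigma_obs=1.0, seed=0):
    """Bootstrap particle filter for a 1D object position.

    Assumed state evolution model: x_t = x_{t-1} + eps_t, eps_t ~ N(0, sigma_state^2)
    Assumed observation model:     z_t = x_t + eta_t,     eta_t ~ N(0, sigma_obs^2)
    Returns the posterior-mean estimate of x_t for every frame.
    """
    rng = random.Random(seed)
    # Initialise the particles around the first observation.
    particles = [observations[0] + rng.gauss(0.0, sigma_obs)
                 for _ in range(n_particles)]
    estimates = []
    for z in observations:
        # 1. Predict: draw each particle from p(x_t | x_{t-1}).
        particles = [p + rng.gauss(0.0, sigma_state) for p in particles]
        # 2. Update: weight each particle by the likelihood p(z_t | x_t).
        #    The tiny constant avoids an all-zero weight vector.
        weights = [math.exp(-0.5 * ((z - p) / sigma_obs) ** 2) + 1e-300
                   for p in particles]
        total = sum(weights)
        weights = [w / total for w in weights]
        # 3. Estimate: the weighted mean approximates the posterior mean.
        estimates.append(sum(w * p for w, p in zip(weights, particles)))
        # 4. Resample so the particle set follows the posterior density.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


# Synthetic track: a cell drifting 0.5 units per frame, observed with noise.
noise = random.Random(1)
truth = [0.5 * t for t in range(30)]
obs = [x + noise.gauss(0.0, 1.0) for x in truth]
est = particle_filter(obs)
```

An IMM-style extension would run several such motion models side by side and mix them with model-transition probabilities; that part is not shown here.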



for higher dimensional images, e.g., 3D, 4D (including time), and 5D (including multiple color
channels), visualization is challenging. Fiji [18], Icy [19], and BioimageXD [29], for example,
are widely used bioimage analysis and visualization software packages for higher dimensional
images. In addition, NeuronStudio [46,47] is a software package tailored for neuron image analysis
and visualization. FarSight [122] and Vaa3D [123] were also developed for analysis and
visualization of 3D, 4D, and 5D microscopy images. For developing customized visualization tools,
the Visualization Toolkit (VTK) is a favorite choice (http://www.vtk.org/) as it is open source
and developed specifically for 3D visualization. ParaView (http://www.paraview.org/) and ITK-SNAP
(http://www.itksnap.org/) are popular 3D image analysis and visualization software packages based
on the Insight Toolkit (ITK) (http://www.itk.org/) and VTK.
This section has introduced a number of major methods for object detection, segmentation,
tracking, and visualization in bioimage analysis. These analyses are essential and provide a basis
for the following quantification of morphological changes.

Figure 13. Examples of cell lineages constructed by the tracking algorithm. The black
numbers are the time of cell division (hours). The bottom red numbers indicate the number of
traces, and the numbers inside circles are the labels of cells in that frame.
doi:10.1371/journal.pcbi.1003043.g013

4. Numerical Features and Morphological Phenotypes

4.1 Numerical Features
To quantitatively measure the phenotypic changes of segmented objects, a set of descriptive
numerical features is needed. For example, four categories of quantitative features, measuring
morphological appearances of segmented objects, are widely used in imaging informatics studies for
object classification and identification, i.e., wavelet features [124,125], geometry features
[126], Zernike moment features [127], and Haralick texture features [128]. In brief, Discrete
Wavelet Transformation (DWT) features characterize images in both scale and frequency domains. Two
important DWT feature sets are the Gabor wavelet [129] and the Cohen–Daubechies–Feauveau wavelet
(CDF9/7) [130] features. Geometry features describe the shape and texture features of the
individual cells, e.g., the maximum value, mean value, and standard deviation of the intensity,
the lengths of the longest axis, the shortest axis, and their ratio, the area of the cell, the
perimeter, the compactness of the cell
($\text{compactness} = \text{perimeter}^2 / (4\pi \cdot \text{area})$), the area of the minimum
convex image, and the roughness (area of cell/area of convex shape).
The calculation of Zernike moment features was introduced in [131]. First, the center of mass of
the cell image was calculated, then the average radius for each cell was computed, and the pixel
$p(x, y)$ of the cell image was mapped to a unit circle to obtain the projected pixel $p(x', y')$.
Then Zernike moment features were calculated based on the projected image $I(x', y')$. The
Haralick texture features are extracted from the gray-level spatial-dependence matrices, including
the angular second moment, contrast, correlation, sum of the squares, inverse difference moment,
sum of the average, sum of the variance, sum of entropy, entropy, difference of the variance,
difference of entropy, information measures of correlation, and maximal correlation coefficient
[132]. More descriptions and calculation programs about these Subcellular Location Features (SLF)
and SLF-based machine learning approaches for image classification can be found at:
http://murphylab.web.cmu.edu/services/SLF/features.html.

4.2 Phenotype Identification
Although these numerical features are informative to describe the phenotypic changes, it can be
difficult to understand these changes in terms of visual and understandable phenotypic changes.
For example, the increase or decrease of cell size can be understood; however, it is not clear
what the physical meaning of the increase or decrease is for certain wavelet features. Therefore,
transforming the numerical features into biologically meaningful features (phenotypes) is
important. This section introduces a number of widely used phenotype identification approaches.
4.2.1. Cell cycle phase identification. In cell cycle studies, drug and target effects are
indicated by the dwelling time of cell cycle phases, e.g., interphase, prophase, metaphase and
anaphase. Additional cell cycle phases, e.g., Prometa-, Ana 1-, Ana 2-, and Telo-phases, were also
investigated in [133] and [23,134]. After object segmentation and tracking, cell motion traces can
be extracted, as shown in Figure 14, and then automated cell cycle phase identification is
needed to calculate the dwelling time of individual cells in different phases.
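
Two of the geometry features listed in Section 4.1, compactness and roughness, can be computed directly from a segmented binary mask. The sketch below is only an illustration under simplifying assumptions: the perimeter is estimated by counting pixel edges that border the background, the convex shape is taken as the convex hull of the pixel corners, and the function names are hypothetical. Bioimage packages generally use more refined perimeter estimators.

```python
import math


def area_and_perimeter(mask):
    """Pixel-count area and exposed-edge perimeter of a binary mask."""
    rows, cols = len(mask), len(mask[0])
    area = perimeter = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            area += 1
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                inside = 0 <= rr < rows and 0 <= cc < cols
                if not inside or not mask[rr][cc]:
                    perimeter += 1  # this pixel edge borders the background
    return area, perimeter


def convex_hull_area(mask):
    """Area of the convex hull of all foreground pixel corners."""
    pts = sorted({(c + dc, r + dr)
                  for r, row in enumerate(mask)
                  for c, v in enumerate(row) if v
                  for dr in (0, 1) for dc in (0, 1)})

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(points):
        # One pass of Andrew's monotone chain algorithm.
        chain = []
        for p in points:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    hull = half_hull(pts) + half_hull(pts[::-1])
    # Shoelace formula over the hull vertices.
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]))
    return abs(twice_area) / 2


def geometry_features(mask):
    area, perimeter = area_and_perimeter(mask)
    return {
        "area": area,
        "perimeter": perimeter,
        "compactness": perimeter ** 2 / (4 * math.pi * area),
        "roughness": area / convex_hull_area(mask),
    }


square = [[1, 1],
          [1, 1]]
l_shape = [[1, 0],
           [1, 1]]
print(geometry_features(square))
print(geometry_features(l_shape))
```

With these definitions a convex object has roughness 1, and the value drops below 1 as the outline becomes more concave, as for the L-shaped mask.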



Cell cycle phase identification can be viewed as a pattern classification problem. The
aforementioned numerical features and a number of classifiers can be used to identify the
corresponding phases of individual segmented cells, e.g., support vector machine (SVM)
[115,133,135], K-nearest neighbors (KNN), and naive Bayesian classifiers [114]. However, the
classification accuracy is often poor for cell cycle phases appearing for a short time, e.g.,
prophase and metaphase, due to the imbalance of sample size compared to interphase, and the
segmentation bias. Fortunately, the cell cycle phase transition rules, e.g., from interphase to
prophase, and from prophase to metaphase, can be used to reduce identification errors. Thus, a set
of cell cycle phase identification approaches based on the cell tracking results were proposed to
achieve high identification accuracy. This problem is often formulated as follows, and as shown in
Figure 15. Let $x = (x_1, x_2, \ldots, x_T)$ denote a cell image sequence of length $T$. Each cell
image is represented by a numerical feature vector $\Phi(x_i) \in \mathbb{R}^d$ (using the
aforementioned numerical features). Let $y = (y_1, y_2, \ldots, y_T)$ represent the corresponding
cell cycle phase sequence that needs to be predicted. Based on the cell cycle progression rules,
for example, the variation of nuclei size and intensity were used as an index to identify the
mitosis phases of cells in [25], and Hidden Markov Modeling (HMM) was used to identify the cell
cycle phases in CellCognition [23]. In brief, the transition probability from one phase to the
other was learned from the training data of cell cycle progressions, which could improve the
accuracy of cell cycle phase identification. As an extension of HMM, Temporally Constrained
Combinatorial Clustering (TC3), which is an unsupervised learning approach for cell cycle phase
identification, was designed and combined with Gaussian Mixture Model (GMM) and HMM to achieve
robust and accurate cell cycle identification results in [134]. Also, in [133] a Finite State
Machine (FSM) was employed to check the phase transition consistency and correct the erroneous
cell cycle phases predicted by the SVM classifier [115]. Moreover, the cell cycle phases could be
identified during the segmentation and linking process in the spatiotemporal volumetric
segmentation-based tracking methods [110–112].
4.2.2 User defined phenotype, identification, and classification. In certain image-based studies,
cells may not have an intrinsic phenotype, e.g., cell cycle phases, but may exhibit unpredicted
and novel phenotypes caused by experimental perturbations, e.g., drugs or RNAi treatments. These
phenotypes are often defined by well-trained biologists to characterize drug and target effects
[16]. Figure 16 shows images of Drosophila cells with three defined phenotypes: Normal, Ruffling
and Spiky [136].
In large scale screening studies, however, it is subjective and time-consuming for biologists to
uncover novel phenotypes from millions of cells. Thus, automated discovery of novel phenotypes is
important. For example, an automated phenotype discovery method was proposed in [20]. In brief, a
GMM was constructed first for the existing phenotypes. Then the quantitative cellular data from
new cellular images were combined with samples generated from the GMM, and the cluster number of
the combined data was estimated using gap statistics [137]. Then, clustering analysis was
performed on the combined data set, in which some of the cells from the new cellular images were
merged into the existing phenotypes, and the clusters that could not be merged into any existing
phenotype classes were considered as new phenotype candidates. After the phenotypes are defined,
classifiers can be built conveniently based on the training data and the numerical features for
classifying cells into one of the predefined phenotypes. However, it is tedious to manually
collect enough training samples of the rare and unusual phenotypes. To solve this challenge, an
iterative machine learning based approach was proposed in [138]. First, a tentative rule
(classifier) was determined based on a few samples of a given phenotype, and then the classifier
presented users a set of cells that were classified into the phenotype based on the tentative
rule. Users would then manually correct the classification errors, and the corrections were used
to refine the rule. This method could collect plenty of training samples after several rounds of
error correction and rule refinement [138].
This section introduced numerical feature extraction, phenotype identification, and
classification. These analyses provide quantitative phenotypic change data for identifying
candidate targets and drug hits that cause desirable phenotypic changes. The following section
will describe approaches to analyze the quantitative phenotypic profile data for drug and target
identification.

5. Multidimensional Profiling Analysis

The aim of profiling analysis is to characterize the functions of drugs and targets, divide them
into groups with similar phenotypic changes, and identify the candidates causing desired
phenotypic changes. To help analyze and organize these multidimensional phenotypic profile data,
some publicly available software packages have been designed, for example, CellProfiler Analyst
(http://www.cellprofiler.org/) and PhenoRipper (http://www.phenoripper.org). In addition, KNIME
(http://www.knime.org/) is a publicly available pipeline and workflow system to help organize
different data flows. It also provides connections to bioimage analysis software packages, e.g.,
Fiji [18] and CellProfiler [9], and enables users to conveniently build specific data analysis
pipelines in KNIME. This section describes some prevalent approaches in analyzing quantitative
phenotypic profile data.

5.1 Clustering Analysis
Clustering analysis divides experimental perturbations, e.g., drugs and RNAis, into groups that
have similar phenotypic changes. As clustering analysis approaches, e.g., Hierarchical Clustering
[139] and Consensus Clustering [140], are well established, their technical details will not be
discussed here. In addition to the aforementioned software, Cluster 3.0
(http://www.falw.vu/~huik/cluster.htm) and Java TreeView (http://jtreeview.sourceforge.net/) are
two additional easy-to-use clustering analysis software packages available in the public domain.

5.2 SVM-based Multivariate Profiling Analysis
An SVM classifier was employed for analyzing the multivariate drug profiles in [141]. To measure
the phenotypic change caused by drug treatments, the cell populations harvested from the
drug-treated wells were compared with cells collected from the control wells (no drug treatment).
The difference between the control and drug treatment was indicated by two factors that are the
outputs of the SVM classifier. One is the accuracy of classification, which indicates the
magnitude of the drug effect. The other is the normal vector (d-profile) of the hyperplane
separating the two cell populations, which indicates the phenotypic changes caused by the drug.
Figure 17 illustrates the idea; the yellow arrow is the d-profile indicating the direction of
drug effects in the



Figure 14. A segment of a cell cycle progression sequence. Four cell cycle phases, interphase, prophase, metaphase, and anaphase, appear in order.
doi:10.1371/journal.pcbi.1003043.g014
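
The use of phase transition rules to correct frame-wise classification errors, as in the HMM-based identification of Section 4.2.1, can be sketched with Viterbi decoding over a sequence such as the one in Figure 14. This is a simplified illustration and not the CellCognition implementation: the transition probabilities and the per-frame class probabilities are hypothetical numbers (in practice they would be learned from training sequences and produced by a classifier such as an SVM), and the sequence is assumed to start in interphase.

```python
import math

PHASES = ["interphase", "prophase", "metaphase", "anaphase"]

# Hypothetical transition probabilities encoding the progression rule:
# a cell either stays in its phase or advances to the next one.
TRANSITIONS = {
    "interphase": {"interphase": 0.9, "prophase": 0.1},
    "prophase": {"prophase": 0.6, "metaphase": 0.4},
    "metaphase": {"metaphase": 0.6, "anaphase": 0.4},
    "anaphase": {"anaphase": 0.6, "interphase": 0.4},
}


def log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(frame_probs, start_phase="interphase"):
    """Most likely phase sequence y given per-frame class probabilities.

    frame_probs[t][phase] plays the role of the emission term for frame t.
    """
    score = {s: (log(frame_probs[0].get(s, 0.0)) if s == start_phase
                 else float("-inf")) for s in PHASES}
    back = []
    for probs in frame_probs[1:]:
        new_score, pointers = {}, {}
        for s in PHASES:
            # Best predecessor of phase s under the transition rules.
            prev = max(PHASES,
                       key=lambda r: score[r] + log(TRANSITIONS[r].get(s, 0.0)))
            new_score[s] = (score[prev] + log(TRANSITIONS[prev].get(s, 0.0))
                            + log(probs.get(s, 0.0)))
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace the best path backwards from the best final phase.
    path = [max(PHASES, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Frame 2 is misclassified frame-wise as anaphase (0.40 vs. 0.35 for prophase).
frames = [
    {"interphase": 0.90, "prophase": 0.05, "metaphase": 0.03, "anaphase": 0.02},
    {"interphase": 0.80, "prophase": 0.10, "metaphase": 0.05, "anaphase": 0.05},
    {"interphase": 0.10, "prophase": 0.35, "metaphase": 0.15, "anaphase": 0.40},
    {"interphase": 0.05, "prophase": 0.15, "metaphase": 0.70, "anaphase": 0.10},
    {"interphase": 0.05, "prophase": 0.05, "metaphase": 0.20, "anaphase": 0.70},
]
framewise = [max(p, key=p.get) for p in frames]
decoded = viterbi(frames)
print(framewise)
print(decoded)
```

Because a jump from interphase directly to anaphase has zero transition probability, the decoded sequence replaces the implausible frame-wise label with prophase.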

phenotypic feature space. Drugs with similar d-profiles were found to have the same functional
targets, and thus the d-profile could be used to predict functions of new drugs or compounds.

5.3 Factor-based Multidimensional Profiling Analysis
In the set of numerical features, some are highly correlated within groups but poorly correlated
with features in other groups. One possible explanation is that the features in one group measure
a common biological process, such as increase or decrease of nuclei size. The challenge using
these numerical features directly is that biological meanings of certain phenotypic features are
often vague. It is thus difficult to explain the phenotypic changes represented by these numerical
features as aforementioned. To remove the redundant features and make the biological meanings of
numerical features explicitly clear, factor analysis was employed in [12]. The basic principle of
factor analysis is to determine the independent common traits (factors). Mathematically it is
formulated by the following equation.

$$\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots &        &        & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn}
\end{bmatrix}
= X_{mn} = \mu_{mn} + L_{mk} F_{kn} + \varepsilon_{mn}$$

where $\mu_{mn}$ is the mean value of each row, $F_{kn}$ denotes the $k$ factors, and $L_{mk}$ is
the loading matrix, which is the coordinates of the $n$ samples in the new $k$-dimensional space.
In other words, the $k$ factors are independent and are the underlying biological processes that
regulate the phenotypic changes. For example, six factors representing nuclei size, DNA
replication, chromosome condensation, nuclei morphology, EdU texture, and nuclei ellipticity, were
obtained through factor analysis in [12].

5.4 Subpopulation-based Heterogeneity Profiling Analysis
In image-based screening studies, heterogeneous phenotypes often appeared within a cell
population, as shown in Figures 2 and 16, which indicated that individual cells responded to
perturbations differently [142]. However, the heterogeneity information was ignored in most
screening studies. To better make use of the heterogeneous phenotypic responses, a subpopulation
based approach was proposed to study the phenotypic heterogeneity for characterizing drug effects
in [13], and distinguishing cell populations with distinct drug sensitivities in [14]. The basic
principle of the subpopulation based

Figure 15. The graphical representation of cell cycle phase identification.


doi:10.1371/journal.pcbi.1003043.g015



Figure 16. A representative image of Drosophila cells with three phenotypes: (A) Normal, (B) Ruffling and (C) Spiky phenotypes.
doi:10.1371/journal.pcbi.1003043.g016

method is to characterize the phenotypic heterogeneity with a mixture of phenotypically distinct
subpopulations. This idea was implemented by fitting a GMM in the numerical space, and each model
component of the GMM represents a distinct subpopulation. To profile the effects of perturbations,
cells collected from perturbation conditions were first classified into

Figure 17. An illustration of drug profiling using the normal vector of hyperplane of SVM. The red and blue spots indicate the spatial
distribution of cells in the numeric feature space. The yellow arrow represents the normal vector of the hyperplane (the blue plane). The top left and
bottom right (MB231 cell) images are from drug treated and control conditions respectively.
doi:10.1371/journal.pcbi.1003043.g017
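
The two outputs of the SVM-based profiling in Section 5.2, the classification accuracy and the normal vector of the separating hyperplane (d-profile), can be sketched as follows. This is a simplified stand-in for the method of [141], written under several assumptions: a linear soft-margin SVM is trained by plain subgradient descent, features are z-scored over the pooled cells, accuracy is measured on the training cells rather than by cross-validation, and the three-feature synthetic data (the drug shifts only the first feature) and all function names are hypothetical.

```python
import math
import random


def standardize(X):
    """Z-score every feature over the pooled cells."""
    d, n = len(X[0]), len(X)
    means = [sum(x[j] for x in X) / n for j in range(d)]
    sds = [math.sqrt(sum((x[j] - means[j]) ** 2 for x in X) / n) or 1.0
           for j in range(d)]
    return [[(x[j] - means[j]) / sds[j] for j in range(d)] for x in X]


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the regularised hinge loss."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            # Only cells inside the margin contribute to the subgradient.
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) < 1:
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def drug_profile(control, treated):
    """Return (training accuracy, unit normal vector) for one drug."""
    X = standardize(control + treated)
    y = [-1] * len(control) + [1] * len(treated)
    w, b = train_linear_svm(X, y)
    hits = sum((sum(wj * xj for wj, xj in zip(w, xi)) + b > 0) == (yi > 0)
               for xi, yi in zip(X, y))
    norm = math.sqrt(sum(wj * wj for wj in w))
    return hits / len(y), [wj / norm for wj in w]


rng = random.Random(0)
control = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(100)]
treated = [[rng.gauss(4, 1), rng.gauss(0, 1), rng.gauss(0, 1)]
           for _ in range(100)]
accuracy, d_profile = drug_profile(control, treated)
print(accuracy, d_profile)
```

In this synthetic case the d-profile should point mainly along the first feature, the only one changed by the simulated drug. Comparing d-profiles of different drugs, for example by the angle between them, is one way to group drugs with similar effects.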



Table 1. List of publicly available bioimage informatics software packages.

Name | Link | Basic Functions
ImageJ | http://rsb.info.nih.gov/ij/ | General image analysis with rich plugins
Fiji (A distribution of ImageJ) | http://fiji.sc/ | Bioimage analysis with rich plugins
CellProfiler | http://www.cellprofiler.org/ | Bioimage analysis with rich analysis pipelines
CellProfiler Analyst | http://www.cellprofiler.org/ | Screening data analysis with machine learning approaches
Icy | http://icy.bioimageanalysis.org/index.php | Bioimage analysis
BioimageXD | http://www.bioimagexd.net/ | 3D Bioimage analysis and Visualization
PhenoRipper | http://www.phenoripper.org | Bioimage analysis for rapid exploration and interpretation of bioimage data in drug screening
FarSight | http://www.farsight-toolkit.org/wiki/Main_Page | Dynamic Biological Microenvironments from 4D/5D Microscopy Data
Vaa3D | http://penglab.janelia.org/proj/v3d/V3D/About_V3D.html | Bioimage visualization and analysis
Cell Analyzer | http://penglab.janelia.org/proj/cellexplorer/cellexplorer/What_is_Cell_Explorer.html | C. elegans image analysis
AceTree and StarryNite | http://starrynite.sourceforge.net/ | C. elegans embryo cell tracking and lineage reconstruction
Ilastik | http://www.ilastik.org/ | Image classification and segmentation
Image Quantitators (ZFIQ, DCELLIQ, GCELLIQ, NeuriteIQ, NeuronIQ) | http://www.methodisthealth.com/bbpsoftware | A set of image analysis software packages for cell tracking in time-lapse images, and RNAi cell, neuron, neurite and Zebrafish image analysis
CellCognition | http://cellcognition.org/software/cecoganalyzer | Cell tracking in time-lapse image analysis
TLMTracker | http://www.tlmtracker.tu-bs.de/index.php/Main_Page | Cell tracking in time-lapse image analysis
NeuronJ | http://www.imagescience.org/meijering/software/neuronj/ | Neurite Tracing and Quantification
NeurphologyJ | http://life.nctu.edu.tw/~microtubule/neurphologyJ.html | Neuron image analysis
NeuronStudio | http://research.mssm.edu/cnic/tools-ns.html | Neuron image analysis
CellOrganizer | http://cellorganizer.org/ | Synthetically model and simulate fluorescent microscopic cell images
SimuCell | http://www.simucell.org | Synthetically model and simulate fluorescent microscopic cell images
PatternUnmixer | http://murphylab.web.cmu.edu/software/PatternUnmixer2.0/ | Model fundamental sub-cellular patterns
µManager | http://valelab.ucsf.edu/~MM/MMwiki/ | Control of automated microscopes
ScanImage | http://openwiki.janelia.org/wiki/display/ephus/ScanImage%2C+Ephus%2C+and+other+DAQ+software | Control of automated microscopes
OME | http://www.openmicroscopy.org/site | Image Database Software
Bisque | http://www.bioimage.ucsb.edu/bisque | Image Database Software
OMERO.searcher | http://murphylab.web.cmu.edu/software/searcher/ | Content-based bioimage search
KNIME | http://www.knime.org/example-workflows | Workflow system for data analytics, reporting and integration

doi:10.1371/journal.pcbi.1003043.t001
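
The subpopulation-based profiling of Section 5.4 (classify each cell into one GMM component, then use the fraction of cells per component as the profile of a condition) can be sketched as follows. The example assumes that a GMM has already been fitted; the three spherical components, their parameters, the 2D cell feature values, and the function names are hypothetical and serve only to illustrate the profile computation, not the procedures of [13,14].

```python
import math
from collections import Counter

# Hypothetical, already-fitted GMM: (weight, mean, standard deviation) of
# each spherical component in a 2D numerical feature space.
GMM = [
    (0.5, (0.0, 0.0), 1.0),  # subpopulation 0
    (0.3, (5.0, 0.0), 1.0),  # subpopulation 1
    (0.2, (0.0, 5.0), 1.0),  # subpopulation 2
]


def log_component_density(x, weight, mean, sd):
    """Log of weight * N(x | mean, sd^2 * I)."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return (math.log(weight) - 0.5 * sq / sd ** 2
            - d * math.log(sd) - 0.5 * d * math.log(2 * math.pi))


def assign(x):
    """Index of the most probable subpopulation for one cell."""
    return max(range(len(GMM)),
               key=lambda k: log_component_density(x, *GMM[k]))


def subpopulation_profile(cells):
    """Fraction of cells assigned to each subpopulation."""
    counts = Counter(assign(x) for x in cells)
    return [counts[k] / len(cells) for k in range(len(GMM))]


control = [(0.2, -0.1), (-0.5, 0.3), (0.1, 0.4), (4.8, 0.2)]
treated = [(5.1, -0.3), (4.7, 0.4), (0.3, 5.2), (0.0, 0.1)]
print(subpopulation_profile(control))  # [0.75, 0.25, 0.0]
print(subpopulation_profile(treated))  # [0.25, 0.5, 0.25]
```

The resulting fraction vectors are the features that characterize each perturbation; a shift of cells from one subpopulation to another shows up as a change in the corresponding entries.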

one of the subpopulations, and then the portions of cells belonging to each subpopulation were
calculated as features to further characterize the effects of perturbations. For more details,
please refer to [13,14].

6. Publicly Available Bioimage Informatics Software Packages

A number of commercial bioimage informatics software tools, e.g., GE-InCellAnalyzer [143],
Cellomics [144], Cellumen [145], MetaXpress [146], and BD Pathway [147], have been developed and
are widely used in pharmaceutical companies and academic institutions. In addition to the
commercially available software packages, there are a number of publicly available bioimage
informatics software packages [9], which provide even more powerful functions with cutting-edge
algorithms and screening-specific analysis pipelines. For the convenience of finding these popular
software packages, they are listed in Table 1. It is difficult to summarize all of their
capabilities and functions because many of them are designed for flexible bioimage analysis with a
set of diverse plugins and function modules, e.g., Fiji, CellProfiler, Icy, and BioimageXD. The
software selection for specific applications is also non-trivial, and the best way might be to
check their websites and online documents. In addition to the bioimage informatics software
packages, there are other software packages, including the microscope control software for image
acquisition (µManager and ScanImage) and image database software (OME, Bisque and OMERO.searcher).
Also, certain cellular image simulation software packages, e.g., CellOrganizer and SimuCell,
provide useful insights into the organizations of proteins of interest within individual cells.
These software packages represent the prevalent directions of bioimage informatics research, thus
their websites and features are worth checking.

7. Summary

With the advances of fluorescent microscopy and robotic handling, image-



based screening has been widely used for drug and target discovery by systematically investigating
morphological changes within cell populations. The bioimage informatics approaches to
automatically detect, quantify, and profile the phenotypic changes caused by various
perturbations, e.g., drug compounds and RNAi, are essential to the success of these image-based
screening studies. In this chapter, an overview of the current bioimage informatics approaches for
systematic drug discovery was provided. A number of practical examples were first described to
illustrate the concepts and capabilities of image-based screening for drug and target discovery.
Then, the prevalent bioimage informatics techniques, e.g., object detection, segmentation,
tracking and visualization, were discussed. Subsequently, the widely used numerical features,
phenotype identification, classification, and profiling analysis were introduced to characterize
the effects of drugs and targets. Finally, the major publicly available bioimage informatics
software packages were listed for future reference. We hope that this review provided sufficient
information and insights for readers to apply the approaches and techniques of bioimage
informatics to advance their research projects.

8. Exercises

Q1. Understand the principle of using green fluorescent protein (GFP) to label the chromosome of
HeLa cells.
Q2. Download a cellular image processing software package, then download some cell images, and use
them as examples to perform the cell detection, segmentation, and feature extraction, and provide
the analysis results.
Q3. Download a time-lapse image analysis software package, then download some time-lapse images,
and use them as examples to perform cell tracking, and cell cycle phase classification, and
provide the analysis results.
Q4. Download a neuron image analysis software package, then download some neuron images, and use
them as examples to perform dendrite and spine detection, and provide the analysis results.
Q5. Implement the watershed and level set segmentation methods by using ITK functions
(http://www.itk.org/) and test them on some cell images.
Answers to the Exercises can be found in Text S1.

Supporting Information

Text S1 Answers to Exercises.
(DOCX)

Acknowledgments

This paper summarizes over a decade of highly productive collaborations with many colleagues
worldwide. The authors would like to acknowledge their collaborators, in particular, Norbert
Perrimon, Jeff Lichtman, Bernando Sabatini, Randy King, Junying Yuan, and Tim Mitchison from
Harvard Medical School; Alexei Degterev and Eric Miller from Tufts University; Weiming Xia from
Boston VA Medical Center and Boston University, Jun Lu from Stanford University; Chris Bakal from
Institute of Cancer Research, Royal Cancer Hospital, U.K.; Yan Feng of Novartis Institutes of
Biomedical Research; Shih Fu Chang of Columbia University; Marta Lipinski from the University of
Maryland at Baltimore; Jinwen Ma from Peking University of China; Liang Ji from Tsinghua
University of China; Myong Hee Kim of EWHA Womans University, Korea; Yong Zhang from IBM Research;
and Guanglei Xiong from Siemens Corporate Research. The raw image data presented in this paper
were mostly generated from the labs of our biological collaborators. We would also like to thank
our colleagues at the Department of Systems Medicine and Bioengineering, The Methodist Hospital
Research Institute for their discussions, notably Xiaofeng Xia, Kemi Cui, Zhong Xue, and Jie
Cheng, as well as former members including Xiaowei Chen, Ranga Srinivasan, Peng Shi, Yue Huang,
Gang Li, Xiaobo Zhou, Jingxin Nie, Jun Wang, Tianming Liu, Huiming Peng, Yong Zhang, and Qing Li.
We would also like to thank James Mancuso, Derek Cridebring, Luanne Novak, and Rebecca Danforth
for proofreading and discussion.

Further Reading

• Taylor DL (2010) A personal perspective on high-content screening (HCS): from the beginning. J Biomol Screen 15(7): 720–725.
• Shariff A, Kangas J, Coelho LP, Quinn S, Murphy RF (2010) Automated image analysis for high-content screening and analysis.
J Biomol Screen 15(7): 726–734.
• Dufour A, Shinin V, Tajbakhsh S, Guillen-Aghion N, Olivo-Marin JC, et al. (2005) Segmenting and tracking fluorescent cells in
dynamic 3-D microscopy with coupled active surfaces. IEEE Trans Image Process 14: 1396–1410.
• Danuser G (2011) Computer vision in cell biology. Cell 147(5): 973–978.
• Murray JI, Bao Z, Boyle TJ, Boeck ME, Mericle BL, et al. (2008) Automated analysis of embryonic gene expression with cellular
resolution in C. elegans. Nat Methods 5: 703–709.
• Rodriguez A, Ehlenberger DB, Dickstein DL, Hof PR, Wearne SL (2008) Automated three-dimensional detection and shape
classification of dendritic spines from fluorescence microscopy images. PLoS ONE 3: e1997. doi:10.1371/journal.pone.0001997.
• Bakal C, Aach J, Church G, Perrimon N (2007) Quantitative morphological signatures define local signaling networks
regulating cell morphology. Science 316(5832): 1753–1756.
• Neumann B, Walter T, Heriche JK, Bulkescher J, Erfle H, et al. (2010) Phenotypic profiling of the human genome by time-lapse
microscopy reveals cell division genes. Nature 464: 721–727.
• Allan C, Burel JM, Moore J, Blackburn C, Linkert M, et al. (2012) OMERO: flexible, model-driven data management for
experimental biology. Nat Methods 9: 245–253.
• Eliceiri KW, Berthold MR, Goldberg IG, Ibanez L, Manjunath BS, et al. (2012) Biological imaging software tools. Nat Methods 9:
697–710.

PLOS Computational Biology | www.ploscompbiol.org 17 April 2013 | Volume 9 | Issue 4 | e1003043


Glossary

• Cellular phenotype: A cellular phenotype is a distinct morphological appearance or behavior of cells as observed under fluorescence, phase contrast, or bright field microscopy.
• Green fluorescent protein (GFP): GFP is used as a protein reporter: it is attached to specific proteins and exhibits bright green fluorescence when exposed to light in the blue to ultraviolet range.
• Fluorescence microscope: A fluorescence microscope is an optical microscope that uses a high-intensity light source to excite a fluorescent species in a sample of interest.
• High content analysis (HCA): HCA focuses on automatically extracting and analyzing quantitative phenotypic data from large numbers of cell images using automated image analysis, computer vision, and machine learning approaches.
• High content screening (HCS): HCS refers to applications of HCA to screening drugs and targets, with the aim of identifying compounds or genes that cause desired phenotypic changes.
• RNA interference (RNAi): RNAi is a biological process in which RNA molecules inhibit gene expression, typically by causing the destruction of specific mRNA molecules.
• Automated image analysis: Automated image analysis aims to analyze images quantitatively by computer programs with minimal human intervention.
• Object detection: Object detection automatically locates objects of interest in images.
• Blob structure detection: Blob structure detection finds the positions of objects of interest that have circular or sphere-like structures, e.g., nuclei and particles.
• Tube structure detection: Tube structure detection finds the centerlines of objects that have long, tube-like structures, e.g., neuronal dendrites and blood vessels.
• Object segmentation: Object segmentation automatically delineates the boundaries of objects of interest in images.
• Object tracking: Object tracking identifies the motion traces of objects of interest in time-lapse images.
• Feature extraction: Feature extraction quantifies the morphological appearance of segmented objects by calculating a set of numerical features.
• Phenotype classification: Phenotype classification assigns each segmented object to a sub-group whose phenotype is distinct from those of the other sub-groups.
• Cell cycle phase identification: Cell cycle phase identification automatically identifies the cell cycle phase that a given cell is in according to its morphological appearance.
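
The glossary entries for object segmentation and feature extraction can be made concrete with a small sketch. The example below is only an illustration under stated assumptions, not the method of any software package discussed in this chapter: it assumes a tiny synthetic 8-bit grayscale image, uses a global threshold chosen by maximizing between-class variance (the idea behind Otsu's method) followed by 4-connected component labeling as a simple stand-in for segmentation, and computes area and centroid as example features. The function names (otsu_threshold, label_objects, extract_features) are hypothetical.

```python
from collections import deque


def otsu_threshold(image, levels=256):
    """Return the gray level that maximizes between-class variance."""
    hist = [0] * levels
    for row in image:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * n for level, n in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # intensity sum of those pixels
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def label_objects(mask):
    """Label 4-connected foreground regions; return (label image, count)."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or labels[r][c]:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one object
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


def extract_features(labels, count):
    """Compute area and centroid (row, column) for each labeled object."""
    sums = {k: [0, 0, 0] for k in range(1, count + 1)}  # area, sum_y, sum_x
    for y, row in enumerate(labels):
        for x, k in enumerate(row):
            if k:
                sums[k][0] += 1
                sums[k][1] += y
                sums[k][2] += x
    return {k: {"area": a, "centroid": (sy / a, sx / a)}
            for k, (a, sy, sx) in sums.items()}


# Synthetic image (an assumption for illustration): dark background, two bright objects.
image = [
    [10, 12, 11, 10, 10, 11],
    [10, 200, 210, 10, 10, 10],
    [11, 205, 220, 10, 190, 12],
    [10, 10, 10, 10, 195, 10],
    [12, 10, 11, 10, 10, 10],
]
threshold = otsu_threshold(image)
mask = [[value > threshold for value in row] for row in image]
labels, count = label_objects(mask)
for obj_id, feats in extract_features(labels, count).items():
    print(obj_id, feats)
```

A single global threshold will merge touching cells into one object, so a sketch like this is only a starting point; the resulting feature vectors are the kind of input that phenotype classification operates on.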

References
1. Tsien RY (1998) The green fluorescent protein. content screening and ligand-target prediction 21. Rajaram S, Pavie B, Wu LF, Altschuler SJ
Annu Rev Biochem 67: 509544. to identify mechanism of action. Nat Chem Biol (2012) PhenoRipper: software for rapidly profil-
2. Lichtman JW, Conchello JA (2005) Fluorescence 4: 5968. ing microscopy images. Nat Methods 9: 635
microscopy. Nat Methods 2: 910919. 13. Slack MD, Martinez ED, Wu LF, Altschuler SJ 637.
3. Shariff A, Kangas J, Coelho LP, Quinn S, (2008) Characterizing heterogeneous cellular 22. Neumann B, Walter T, Heriche JK, Bulkescher
Murphy RF (2010) Automated image analysis responses to perturbations. Proc Natl Acad J, Erfle H, et al. (2010) Phenotypic profiling of
for high-content screening and analysis. J Biomol Sci U S A 105: 1930619311. the human genome by time-lapse microscopy
Screen 15: 726734. 14. Singh DK, Ku C-J, Wichaidit C, Steininger RJ, reveals cell division genes. Nature 464: 721727.
4. Danuser G (2011) Computer vision in cell Wu LF, et al. (2010) Patterns of basal signaling 23. Held M, Schmitz MH, Fischer B, Walter T,
biology. Cell 147: 973978. heterogeneity can distinguish cellular popula- Neumann B, et al. (2010) CellCognition: time-
5. Taylor DL (2010) A personal perspective on tions with different drug sensitivities. Mol Syst resolved phenotype annotation in high-through-
high-content screening (HCS): from the begin- Biol 6: 369. put live cell imaging. Nat Methods 7: 747
ning. J Biomol Screen 15(7): 720255. 15. Bakal C, Linding R, Llense F, Heffern E, 754.
6. Abraham VC, Taylor DL, Haskins JR (2004) Martin-Blanco E, et al. (2008) Phosphorylation 24. Shi Q, King RW (2005) Chromosome nondis-
High content screening applied to large-scale networks regulating JNK activity in diverse junction yields tetraploid rather than aneuploid
cell biology. Trends Biotechnol 22: 1522. genetic backgrounds. Science 322: 453456. cells in human cell lines. Nature 437: 1038
7. Giuliano KA, DeBiasio RL, Dunlay RT, Gough 16. Bakal C, Aach J, Church G, Perrimon N (2007) 1042.
A, Volosky JM, et al. (1997) High-content Quantitative morphological signatures define 25. Sigoillot FD, Huckins JF, Li F, Zhou X, Wong
screening: a new approach to easing key local signaling networks regulating cell morphol- ST, et al. (2011) A time-series method for
bottlenecks in the drug discovery process. ogy. Science 316: 17531756. automated measurement of changes in mitotic
J Biomol Screen 2: 249259. 17. Carpenter AE, Jones TR, Lamprecht MR, Clarke and interphase duration from time-lapse movies.
8. Peng H (2008) Bioimage informatics: a new area C, Kang IH, et al. (2006) CellProfiler: image PLoS ONE 6: e25511. doi:10.1371/journal.
of engineering biology. Bioinformatics 24: 1827 analysis software for identifying and quantifying pone.0025511
1836. cell phenotypes. Genome Biol 7: R100. 26. Miki T, Lehmann T, Cai H, Stolz DB, Strom
9. Eliceiri KW, Berthold MR, Goldberg IG, Ibanez 18. Schindelin J, Arganda-Carreras I, Frise E, SC (2005) Stem cell characteristics of amniotic
L, Manjunath BS, et al. (2012) Biological imaging Kaynig V, Longair M, et al. (2012) Fiji: an epithelial cells. Stem Cells 23: 15491559.
software tools. Nat Methods 9: 697710. open-source platform for biological-image anal- 27. Li K, Miller ED, Chen M, Kanade T, Weiss LE,
10. Yarrow JC, Feng Y, Perlman ZE, Kirchhausen ysis. Nat Methods 9: 676682. et al. (2008) Cell population tracking and lineage
T, Mitchison TJ (2003) Phenotypic screening of 19. de Chaumont F, Dallongeville S, Chenouard N, construction with spatiotemporal context. Med
small molecule libraries by high throughput cell Herve N, Pop S, et al. (2012) Icy: an open Image Anal 12: 546566.
imaging. Comb Chem High Throughput Screen bioimage informatics platform for extended 28. Cohen AR, Gomes FL, Roysam B, Cayouette M
6: 279286. reproducible research. Nat Methods 9: 690696. (2011) Computational prediction of neural
11. Perlman Z, Slack M, Feng Y, Mitchison T, Wu 20. Yin Z, Zhou X, Bakal C, Li F, Sun Y, et al. progenitor cell fates. Nat Methods 7: 213218.
L, et al. (2004) Multidimensional drug profiling (2008) Using iterative cluster merging with 29. Kankaanpaa P, Paavolainen L, Tiitta S, Karja-
by automated microscopy. Science 306: 1194 improved gap statistics to perform online lainen M, Paivarinne J, et al. (2012) BioIma-
1198. phenotype discovery in the context of high- geXD: an open, general-purpose and high-
12. Young DW, Bender A, Hoyt J, McWhinnie E, throughput RNAi screens. BMC Bioinformatics throughput image-processing platform. Nat
Chirn G-W, et al. (2008) Integrating high- 9: 264. Methods 9: 683689.

PLOS Computational Biology | www.ploscompbiol.org 18 April 2013 | Volume 9 | Issue 4 | e1003043


30. Li F, Zhou X, Ma J, Wong TCS (2010) Multiple 50. Peng H, Long F, Myers G (2011) Automatic 3D estimation in computerized video time-lapse
Nuclei Tracking Using Integer Programming for neuron tracing using all-path pruning. Bioinfor- microscopy. 643109643109.
Quantitative Cancer Cell Cycle Analysis. IEEE matics 27: i239i247. 72. Jiang S, Zhou X, Kirchhausen T, Wong ST
Trans Med Imaging 29: 96105. 51. Meijering E (2010) Neuron tracing in perspec- (2007) Detection of molecular particles in live
31. Klein J, Leupold S, Biegler I, Biedendieck R, tive. Cytometry A 77: 693704. cells via machine learning. Cytometry A 71:
Munch R, et al. (2012) TLM-Tracker: software 52. Boyle TJ, Bao Z, Murray JI, Araya CL, 563575.
for cell segmentation, tracking and lineage Waterston RH (2006) AceTree: a tool for visual 73. Sommer C, Straehle C, Kothe U, Hamprecht
analysis in time-lapse microscopy movies. Bioin- analysis of Caenorhabditis elegans embryogen- FA (2011) Ilastik: interactive learning and
formatics 28: 22762277. esis. BMC Bioinformatics 7: 275. segmentation toolkit. pp. 230233. 2011 IEEE
32. Segal M (2005) Dendritic spines and long-term 53. Bao Z, Murray JI, Boyle T, Ooi SL, Sandel MJ, International Symposium on Biomedical Imag-
plasticity. Nat Rev Neurosci 6: 277284. et al. (2006) Automated cell lineage tracing in ing; 30 March2 April 2011.
33. Hyman BT (2001) Molecular and anatomical Caenorhabditis elegans. Proc Natl Acad Sci U S A 74. Steger C (1998) An unbiased detector of
studies in Alzheimers disease. Neurologia 16: 103: 27072712. curvilinear structures. IEEE Transactions on
100104. 54. Sarov M, Murray JI, Schanze K, Pozniakovski Pattern Analysis and Machine Intellegence 20:
34. Ding JB, Takasaki KT, Sabatini BL (2009) A, Niu W, et al. (2012) A genome-scale resource 113125.
Supraresolution imaging in brain slices using for in vivo tag-based protein function explora- 75. Al-Kofahi KA, Lasek S, Szarowski DH, Pace
stimulated-emission depletion two-photon laser tion in C. elegans. Cell 150: 855866. CJ, Nagy G, et al. (2002) Rapid automated
scanning microscopy. Neuron 63: 429437. 55. Liu X, Long F, Peng H, Aerni SJ, Jiang M, et al. three-dimensional tracing of neurons from con-
35. Nagerl UV, Willig KI, Hein B, Hell SW, (2009) Analysis of cell fate from single-cell gene focalimage stacks. IEEE Transactions on Infor-
Bonhoeffer T (2008) Live-cell imaging of expression profiles in C. elegans. Cell 139: 623 mation Technology in Biomedicine 6: 171187.
dendritic spines by STED microscopy. Proc 633. 76. Tyrrell JA, di Tomaso E, Fuja D, Ricky T,
Natl Acad Sci U S A 105: 1898218987. 56. Long F, Peng H, Liu X, Kim SK, Myers E Kozak K, et al. (2007) Robust 3-D modeling of
36. Carter AG, Sabatini BL (2004) State-dependent (2009) A 3D digital atlas of C. elegans and its vasculature imagery using superellipsoids. IEEE
calcium signaling in dendritic spines of striatal application to single-cell analyses. Nat Methods Trans Med Imaging 26: 223237.
medium spiny neurons. Neuron 44: 483493. 6: 667672. 77. Soares JVB, Leandro JJG, Cesar RM, Jelinek
37. Duemani Reddy G, Kelleher K, Fink R, Saggau 57. Wahlby C, Kamentsky L, Liu ZH, Riklin-Raviv HF, Cree MJ (2006) Retinal vessel segmentation
P (2008) Three-dimensional random access T, Conery AL, et al. (2012) An image analysis using the 2-D Gabor wavelet and supervised
multiphoton microscopy for functional imaging toolbox for high-throughput C. elegans assays. classification. IEEE Trans Med Imaging 25:
of neuronal activity. Nat Neurosci 11: 713720. Nat Methods 9: 714716. 12141222.
38. Iyer V, Hoogland TM, Saggau P (2006) Fast 58. Borgefors G (1986) Distance transformations in 78. Staal J, Abramoff MD, Niemeijer M, Viergever
functional imaging of single neurons using digital images. Computer Vision, Graphics, and MA, van Ginneken B (2004) Ridge-based vessel
random-access multiphoton (RAMP) microsco- Image Processing 34: 344371. segmentation in color images of the retina. IEEE
py. J Neurophysiol 95: 535545. 59. Wahlby C, Sintorn I, Erlandsson F, Borgefors Trans Med Imaging 23: 501509.
39. Cheng J, Zhou X, Miller E, Witt RM, Zhu J, et G, Bengtsson E (2004) Combining intensity, 79. Fraz M, Remagnino P, Hoppe A, Uyyanonvara
al. (2007) A novel computational approach for edge and shape information for 2D and 3D B, Rudnicka A, et al. (2012) Blood vessel
automatic dendrite spines detection in two- segmentation of cell nuclei in tissue sections. segmentation methodologies in retinal images -
photon laser scan microscopy. J Neurosci Meth- J Microsc 215: 6776. a survey. Comput Methods Programs Biomed
ods 165: 122134. 60. Lindeberg T (1993) Detecting salient blob-like 108(1): 407433.
40. Ofengeim D, Shi P, Miao B, Fan J, Xia X, et al. image structures and their scales with a scale- 80. Otsu N (1978) A threshold selection method
(2012) Identification of small molecule inhibitors space primal sketch: a method for focus-of- from gray level histgram. IEEE Transactions on
of neurite loss induced by Abeta peptide using attention. Int J Comput Vision 11: 283318. System, Man, and Cybernetics 8: 6266.
high content screening. J Biol Chem 287: 8714 61. Lindeberg T (1998) Feature detection with 81. Dunn JC (1973) A fuzzy relative of the
8723. automatic scale selection. Int J Comput Vision ISODATA process and its use in detecting
41. Ho SY, Chao CY, Huang HL, Chiu TW, 30: 79116. compact well-separated clusters. Journal of
Charoenkwan P, et al. (2011) NeurphologyJ: an 62. Byun J, Verardo MR, Sumengen B, Lewis GP, Cybernetics 3: 3257.
automatic neuronal morphology quantification Manjunath BS, et al. (2006) Automated tool for 82. Vincent L, Soille P (1991) Watersheds in digital
method and its application in pharmacological the detection of cell nuclei in digital microscopic spaces: an efficient algorithm based on immer-
discovery. BMC Bioinformatics 12: 230. images: application to retinal images. Mol Vis sion simulations. IEEE Transactions on Pattern
42. Meijering E, Jacob M, Sarria JC, Steiner P, 12: 949960. Analysis and Machine Intelligence 13: 583598.
Hirling H, et al. (2004) Design and validation of 63. Al-Kofahi Y, Lassoued W, Lee W, Roysam B 83. Beucher S (1992) The watershed transformation
a tool for neurite tracing and analysis in (2010) Improved automatic detection and seg- applied to image segmentation. Scanning Mi-
fluorescence microscopy images. Cytometry A mentation of cell nuclei in histopathology croscopy International 6: 299314.
58: 167176. images. IEEE Trans Biomed Eng 57: 841852. 84. Meyer F, Beucher S (1990) Morphological
43. Pool M, Thiemann J, Bar-Or A, Fournier AE 64. Li G, Liu T, Tarokh A, Nie J, Guo L, et al. segmentation. Journal of Visual Communication
(2008) NeuriteTracer: a novel ImageJ plugin for (2007) 3D cell nuclei segmentation based on and Image Representation 1: 2146.
automated quantification of neurite outgrowth. gradient flow tracking. BMC Cell Biology 8: 40. 85. Wahlby C, Lindblad J, Vondrus M, Bengtsson
J Neurosci Methods 168: 134139. 65. Xu C, Prince JL (1998) Snakes, shapes, and E, Bjorkesten L (2002) Algorithms for cytoplasm
44. Xiong G, Zhou X, Degterev A, Ji L, Wong STC gradient vector flow. IEEE Transactions on segmentation of fluorescence labelled cells.
(2006) Automated neurite labeling and analysis Image Processing 7: 359369. Analytical Cellular Pathology 24: 101111.
in fluorescence microscopy images. Cytometry 66. Duda RO, Hart PE (1972) Use of the Hough 86. Casselles V, Kimmel R, Sapiro G (1997)
Part A 69A: 494505. transformation to detect lines and curves in Geodesic active contours. International Journal
45. Narro ML, Yang F, Kraft R, Wenk C, Efrat A, pictures. Commun ACM 15: 1115. of Computer Vision 22: 6179.
et al. (2007) NeuronMetrics: software for semi- 67. Parvin B, Yang Q, Han J, Chang H, Rydberg B, 87. Chan T, Vese L (2001) Active contours without
automated processing of cultured neuron imag- et al. (2007) Iterative voting for inference of edges. IEEE Transactions on Image Processing
es. Brain Res 1138: 5775. structural saliency and characterization of sub- 10: 266277.
46. Rodriguez A, Ehlenberger DB, Dickstein DL, cellular events. IEEE Trans Image Process 16: 88. Zimmer C, Labruyere E, Meas-Yedid V,
Hof PR, Wearne SL (2008) Automated three- 615623. Guillen N, Olivo-Marin J (2002) Segmentation
dimensional detection and shape classification of 68. Lin G, Adiga U, Olson K, Guzowski JF, Barnes and tracking of migrating cells in videomicro-
dendritic spines from fluorescence microscopy CA, et al. (2003) A hybrid 3D watershed scopy with parametric active contours: a tool for
images. PLoS ONE 3: e1997. doi:10.1371/ algorithm incorporating gradient cues and cell-based drug testing. IEEE Trans Med
journal.pone.0001997 object models for automatic segmentation of Imaging 21: 12121221.
47. Wearne SL, Rodriguez A, Ehlenberger DB, nuclei in confocal image stacks. Cytometry A 56: 89. Yan P, Zhou X, Shah M, Wong ST (2008)
Rocher AB, Henderson SC, et al. (2005) New 2336. Automatic segmentation of high-throughput
techniques for imaging, digitization and analysis 69. Lienhart R, Maydt J (2002) An extended set of RNAi fluorescent cellular images. IEEE Trans
of three-dimensional neural morphology on Haar-like features for rapid object detection. pp. Inf Technol Biomed 12: 109117.
multiple scales. Neuroscience 136: 661680. I-900I-903. Vol. 1. Proceedings of the 2002 90. Dufour A, Shinin V, Tajbakhsh S, Guillen-
48. Zhang Y, Zhou X, Witt RM, Sabatini BL, International Conference on Image Processing. Aghion N, Olivo-Marin JC, et al. (2005)
Adjeroh D, et al. (2007) Dendritic spine 70. Viola P, Jones M (2001) Rapid object detection Segmenting and tracking fluorescent cells in
detection using curvilinear structure detector using a boosted cascade of simple features. dynamic 3-D microscopy with coupled active
and LDA classifier. Neuroimage 36: 346360. Proceedings of the 2001 IEEE Computer surfaces. IEEE Transactions on Image Process-
49. Peng H, Ruan Z, Atasoy D, Sternson S (2010) Society Conference on Computer Vision and ing 14: 13961410.
Automatic reconstruction of 3D neuron struc- Pattern Recognition. pp. I-511I-518. Vol. 1. 91. Caselles V, Kimmel R, Sapiro G (1997)
tures using a graph-augmented deformable 71. He W, Wang X, Metaxas D, Mathew R, White Geodesic active contours. International Journal
model. Bioinformatics 26: i38i46. E (2007) Cell segmentation for division rate of Computer Vision 22: 6179.

PLOS Computational Biology | www.ploscompbiol.org 19 April 2013 | Volume 9 | Issue 4 | e1003043


92. Osher S, Sethian JA (1988) Fronts propagating 109. Appel K, Haken W, Koch J (1977) Every planar 126. Chen X, Zhou X, Wong ST (2006) Automated
with curvature-dependent speed: Algorithms map is four colorable part II. Reducibility. segmentation, classification, and tracking of
based on Hamilton-Jacobi formulations. Journal Illinois Journal of Mathematics: 491567. cancer cell nuclei in time-lapse microscopy.
of Computation Physics 79: 1249. 110. Padfield DR, Rittscher J, Sebastian T, Thomas IEEE Trans Biomed Eng 53: 762766.
93. Chunming L, Chenyang X, Changfeng G, Fox N, Roysam B (2006) Spatio-temporal cell cycle 127. Boland M, Murphy R (2001) A neural network
MD (2005) Level set evolution without re- analysis using 3D level set segmentation of classifier capable of recognizing the patterns of
initialization: a new variational formulation. unstained nuclei in line scan confocal fluores- all major subcellular structures in fluorescence
IEEE Computer Society Conference on Com- cence images. 3rd IEEE International Sympo- microscope images of HeLa cells. Bioinformatics
puter Vision and Pattern Recognition; 2025 sium on Biomedical Imaging; 69 April 2006. 17: 12131223.
June 2005. pp. 430436. Vol. 1. pp. 10361039. 128. Haralick R (1979) Statistical and structural
94. Aurenhammer F (1991) Voronoi diagrams - a 111. Padfield DR, Rittscher J, Roysam B (2008) approaches to texture. Proceedings of IEEE
survey of a fundamental geometric data struc- Spatio-temporal cell segmentation and tracking 67: 786804.
ture. ACM Comput Surv 23: 345405. for automated screening. 5th IEEE International 129. Manjunatha BS, Ma WY (1996) Texture
95. Jones T, Carpener A, Golland P (2005) Voronoi- Symposium on Biomedical Imaging; 1417 May features for browsing and retrieval of image
based segmentation of cells on image manifolds. 2008. pp. 376379. data. IEEE Trans Pattern Anal Mach Intell 18:
Lecture Notes in Computer Science: 535543. 112. Padfield D, Rittscher J, Thomas N, Roysam B 837842.
96. Shi J, Malik J (2000) Normalized cuts and image (2009) Spatio-temporal cell cycle phase analysis 130. Cohen A, Daubechies I, Feauveau JC (1992) Bi-
using level sets and fast marching methods.
segmentation. IEEE Trans Pattern Anal Mach orthogonal bases of compactly supported wave-
Medical Image Analysis 13: 143155.
Intell 22: 888905. lets. Communications on Pure and Applied
113. Al-Kofahi O, Radke RJ, Goderie SK, Shen Q,
97. Felzenszwalb PF, Huttenlocher DP (2004) Effi- Mathematics 45: 485560.
Temple S, et al. (2006) Automated cell lineage
cient graph-based image segmentation. 131. Zernike F (1934) Beugungstheorie des schnei-
construction: a rapid method to analyze clonal
Int J Comput Vision 59: 167181. dencerfarhens undseiner verbesserten form, der
development established with murine neural
98. Radhakrishna A, Shaji A, Smith K, Lucchi A, phasenkontrastmethode. Physica 1: 689704.
progenitor cells. Cell Cycle 5: 327335.
Fua P, et al. (June, 2010) SLIC superpixels. 114. Chen X, Zhou X, Wong STC (2006) Automated 132. Haralick RM, Shanmugam K, Dinstein I (1973)
Technical report 149300, EPFL. segmentation, classification, and tracking of Textural features for image classification. IEEE
99. Lucchi A, Smith K, Achanta R, Lepetit V, Fua cancer cell nuclei in time-lapse microscopy. Transactions on Systems, Man and Cybernetics
P (2010) A fully automated approach to IEEE Transactions on Biomedical Engineering 6: 610620.
segmentation of irregularly shaped cellular 53: 762766. 133. Harder N, Mora-Bermudez F, Godinez WJ,
structures in EM images. Proceedings of the 115. Harder N, Mora-Bermudez F, Godinez WJ, Wunsche A, Eils R, et al. (2009) Automatic
13th International Conference on Medical Ellenberg J, Eils R, et al. (2006) Automated analysis of dividing cells in live cell movies to
Image Computing and Computer-Assisted In- analysis of the mitotic phases of human cells in detect mitotic delays and correlate phenotypes in
tervention. Med Image Comput Comput Assist 3D fluorescence microscopy image sequences. time. Genome Res 19: 21132124.
Interv 13(Pt 2): 463471. Med Image Comput Comput Assist Interv 9: 134. Zhong Q, Busetto AG, Fededa JP, Buhmann
100. Debeir O, Ham PV, Kiss R, Decaestecker C 840848. JM, Gerlich DW (2012) Unsupervised modeling
(2005) Tracking of migrating cells under phase- 116. Li K, Chen M, Kanade T (2007) Cell popula- of cell morphology dynamics for time-lapse
contrast video microscopy with combined mean- tion tracking and lineage construction with microscopy. Nat Methods 9: 711713.
shift processes. IEEE Trans Med Imaging 24: spatiotemporal context. Med Image Comput 135. Wang M, Zhou X, Li F, Huckins J, King WR, et
697711. Comput Assist Interv 10: 295302. al. (2008) Novel cell segmentation and online
101. Zimmer C, Olivo-Marin JC (2005) Coupled 117. Blom HAP (1984) An efficient filter for abruptly SVM for cell cycle phase identification in
parametric active contours. IEEE Trans Pattern changing systems. Proceedings of 23rd IEEE automated microscopy. Bioinformatics 24: 94
Anal Mach Intell 27: 18381842. Conference on Decision and Control 23: 656658. 101.
102. Yang F, Mackey MA, Ianzini F, Gallardo G, 118. Genovesio A, Liedl T, Emiliani V, Parak WJ, 136. Wang J, Zhou X, Bradley PL, Chang SF,
Sonka M (2005) Cell segmentation, tracking, Coppey-Moisan M, et al. (2006) Multiple Perrimon N, et al. (2008) Cellular phenotype
and mitosis detection using temporal context. particle tracking in 3-D+t microscopy: method recognition for high-content RNA interference
Med Image Comput Comput Assist Interv 8(Pt and application to the tracking of endocytosed genome-wide screening. J Biomol Screen 13:
1): 302309. quantum dots. IEEE Trans Image Process 15: 2939.
103. Bunyak F, Palaniappan K, Nath SK, Baskin TL, 10621070. 137. Yan M, Ye K (2007) Determining the number of
Gang D (2006) Quantitative cell motility for in 119. Smal I, Draegestein K, Galjart N, Niessen W, clusters using the weighted gap statistic. Biomet-
vitro wound healing using level set-based active Meijering E (2007) Rao-blackwellized marginal rics 63: 10311037.
contour tracking. Proc IEEE Int Symp Biomed particle filtering for multiple object tracking in 138. Jones TR, Carpenter AE, Lamprecht MR,
Imaging 2006 April 6: 10401043. molecular bioimaging. Proceedings of the 20th Moffat J, Silver SJ, et al. (2009) Scoring diverse
104. Dzyubachyk O, van Cappellen WA, Essers J, International Conference on Information Pro- cellular morphologies in image-based screens
Niessen WJ, Meijering E (2010) Advanced level- cessing in Medical Imaging. Kerkrade, The with iterative feedback and machine learning.
set-based cell tracking in time-lapse fluorescence Netherlands: Springer-Verlag. Proc Natl Acad Sci U S A 106: 18261831.
120. Smal I, Niessen W, Meijering E (2006) Bayesian 139. Young DW, Bender A, Hoyt J, McWhinnie E,
microscopy. IEEE Trans Med Imaging 29: 852
tracking for fluorescence microscopic imaging; Chirn GW, et al. (2008) Integrating high-content
867.
69 April 2006. pp. 550553. screening and ligand-target prediction to identify
105. Bo Z, Zimmer C, Olivo-Marin JC (2004)
121. Godinez WJ, Lampe M, Worz S, Muller B, Eils mechanism of action. Nat Chem Biol 4: 5968.
Tracking fluorescent cells with coupled geomet-
R, et al. (2007) Tracking of virus particles in
ric active contours. IEEE International Sympo- 140. Frise E, Hammonds AS, Celniker SE Systematic
time-lapse fluorescence microscopy image se-
sium on Biomedical Imaging; 1518 April 2004. image-driven analysis of the spatial Drosophila
quences. 1215 April 2007. pp. 256259.
pp. 476479. Vol. 1. embryonic expression landscape. Mol Syst Biol
122. Luisi J, Narayanaswamy A, Galbreath Z,
106. Li K, Miller ED, Weiss LE, Campbell PG, 6: 345.
Roysam B (2011) The FARSIGHT trace editor:
Kanade T (2006) Online tracking of migrating 141. Loo LH, Wu LF, Altschuler SJ (2007) Image-
an open source tool for 3-D inspection and
and proliferating cells imaged with phase- efficient pattern analysis aided editing of auto- based multivariate profiling of drug responses
contrast microscopy. Conference on Computer mated neuronal reconstructions. Neuroinfor- from single cells. Nat Methods 4: 445453.
Vision and Pattern Recognition Workshop. New matics 9: 305315. 142. Altschuler SJ, Wu LF (2010) Cellular heteroge-
York City, New York. pp. 65. 123. Peng H, Ruan Z, Long F, Simpson JH, Myers neity: do differences make a difference? Cell
107. Nath SK, Palaniappan K, Bunyak F (2006) Cell EW (2010) V3D enables real-time 3D visualiza- 141: 559563.
segmentation using coupled level sets and graph- tion and quantitative analysis of large-scale 143. GE-InCellAnalyzer. http://www.biacore.com/
vertex coloring. Proceedings of the 9th Interna- biological image data sets. Nat Biotechnol 28: high-content-analysis/index.html.
tional Conference on Medical Image Comput- 348353. 144. Cellomics. http://www.cellomics.com/content/
ing and Computer-Assisted Intervention. Med 124. Manjunath B, Ma W (1996) Texture features for menu/About_Us/.
Image Comput Comput Assist Interv 9 (Pt 1): browsing and retrieval of image data. IEEE 145. Cellumen. http://www.cellumen.com/.
101108. Trans Pattern Anal Mach Intell 18: 837842. 146. MetaXpress. http://www.moleculardevices.
108. Appel K, Haken W (1977) Every planar map is 125. Zhou X, Wong STC (2006) Informatics chal- com/pages/software/metaxpress.html.
four colorable part I. Discharging. Illinois lenges of high-throughput microscopy. IEEE 147. BD-Pathway. http://www.bdbiosciences.ca/
Journal of Mathematics: 429490. Signal Processing Magazine 23: 6372. bioimaging/cell_biology/pathway/software/.

PLOS Computational Biology | www.ploscompbiol.org 20 April 2013 | Volume 9 | Issue 4 | e1003043

