Research Proposal

Bioinformatics approach to evaluation of Transcription factor genes and diseases (Cancer)
Brijesh Singh Yadav (Senior Research Associate, URC, Allahabad)
E-Mail: brijeshbioinfo@gmail.com

Problem Statement:
The purpose of the proposed research is the development of a computational approach to
quantitatively evaluate associations between transcription factor encoding genes and
human diseases, based on available literature evidence. The approach will analyze a set
of candidate genes and determine which genes are linked to human diseases, which
properties are involved in these gene-disease linkages, and which clusters of similar
genes are involved in particular diseases.
During the course of the research, I shall explore methods for recapitulating existing
associations and predicting novel associations based on diverse forms of data pertaining
to genes and diseases. These methods will evaluate the resulting associations in a
quantitative manner, and the resulting analyses will be validated to determine the efficacy
of the methods.
Background:
Identification of functional causes and contributing mechanisms of disease is a principal
aim of biomedical research. In many cases, the term “disease” broadly applies to a
heterogeneous set of observable properties, which may arise from multiple molecular
processes. Disease is often characterized by symptoms and a pattern of progression over
time. The area of cancer is particularly broad, encompassing a wide range of
complex, abnormal phenotypes. Compared to diseases associated with other organs,
many types of cancer, such as brain cancer, tend to be poorly understood: they are difficult to
characterize and have complex genetic components involving multiple genes.
Transcription factors are key regulators of gene expression. They act through processes such
as the recruitment of transcription initiation factors and conformational change of DNA,
working alone or as part of protein complexes.
GeneSeeker can find genes within a chromosomal region that are expressed in
particular tissues, by examining human and mouse expression data. Another method
associates disease genes with anatomical locations by text mining PubMed
abstracts to link eVOC anatomical ontology terms to gene names.
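As a rough illustration of the text-mining idea, the sketch below counts how often a gene name and an anatomical term appear in the same abstract. It is a minimal example under stated assumptions, not the published method: the function name cooccurrence_counts, the placeholder gene symbols GENE1 and GENE2, and the toy abstracts are all invented for illustration, and real systems would need proper tokenization and synonym handling.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(abstracts, gene_names, anatomy_terms):
    """Count abstracts in which a gene name and an anatomy term both appear."""
    counts = Counter()
    for text in abstracts:
        # Very crude tokenization: lowercase, drop commas and full stops.
        tokens = set(text.lower().replace(",", " ").replace(".", " ").split())
        genes_here = [g for g in gene_names if g.lower() in tokens]
        terms_here = [t for t in anatomy_terms if t.lower() in tokens]
        for gene, term in product(genes_here, terms_here):
            counts[(gene, term)] += 1
    return counts


# Toy abstracts with placeholder gene symbols (invented, not real data).
abstracts = [
    "GENE1 expression was altered in brain tissue.",
    "GENE2 and GENE1 were both detected in liver samples.",
    "GENE2 amplification was observed in brain tumours.",
    "Variants of GENE1 are frequent in brain cancer.",
]
counts = cooccurrence_counts(abstracts, ["GENE1", "GENE2"], ["brain", "liver"])
for pair, n in counts.most_common():
    print(pair, n)
```

Raw co-occurrence counts are only a starting point; a quantitative evaluation would normally correct for how frequently each term occurs overall.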
Machine learning approaches can be used when a representative set of disease genes is
available to use as training data. In DGP, a decision tree classification approach is used to
find features common to disease genes based on a training set composed of sample
disease and control proteins. Features were protein length, BLASTP ratios (conservation
score) between a protein and its highest scoring homologue within taxonomic groups
(representing phylogenetic conservation and extent) and the conservation score with the
closest paralogue. The study indicates that, on average, hereditary disease genes (genes
taken from OMIM) in comparison to randomly selected genes are longer, more
conserved, phylogenetically extended and without close paralogues.
PROSPECTR uses a wider variety of features, including the length of the gene, the
length of its coding sequence, the length of its cDNA, length of the protein, GC content
and percentage protein identity with its nearest homologue in various species (mouse,
worm, fly). The investigators used an alternating decision tree, taking genes from OMIM
and comparing against genes not found in OMIM. They also generated two independent
test sets – one using genes from the Human Gene Mutation Database with randomly
selected control genes, and another set of 54 genes not in OMIM, again with a set of
randomly selected control genes.
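The sequence-derived features listed above are straightforward to compute. The following sketch shows one possible way to derive lengths, GC content and percentage identity from sequences. It is illustrative only: the function names, the choice to compute GC content on the gene sequence, and the gap-excluding definition of percentage identity are assumptions, not details taken from PROSPECTR.

```python
def gc_content(sequence):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def sequence_features(gene_seq, cds_seq, protein_seq):
    """Collect simple numerical features for one gene."""
    return {
        "gene_length": len(gene_seq),
        "cds_length": len(cds_seq),
        "protein_length": len(protein_seq),
        "gc_content": gc_content(gene_seq),
    }


def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Toy sequences, invented for illustration.
features = sequence_features("ATGGCCTAAATT", "ATGGCCTAA", "MA")
identity = percent_identity("MA-KT", "MAGKS")
print(features, identity)
```

Feature vectors of this kind could then be passed to a classifier trained on disease and control genes.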
POCUS takes another machine learning approach, using a selected training set of genes
linked to the target disease. POCUS identifies common features between all the training
genes – InterPro domains, GO annotations, similar expression profile – and assesses the
probability that such common features would be shared by chance. This method depends on
a carefully selected training set of genes, and focuses on the likelihood of these genes all
sharing common, disease-related properties, in contrast to methods that focus on
overrepresentation of properties among the training genes.
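To make the idea of "shared by chance" concrete, the sketch below uses a hypergeometric model: if K of N genes in the genome carry a given annotation, it computes the probability that a randomly drawn set of n genes contains at least k annotated genes, with k = n giving the probability that all of them share it. This is a generic textbook calculation offered as an assumption-laden illustration; it is not the scoring scheme actually used by POCUS, and the function names are hypothetical.

```python
from math import comb


def prob_at_least(total_genes, annotated_genes, sample_size, min_shared):
    """P(at least min_shared of sample_size random genes carry the annotation)."""
    denominator = comb(total_genes, sample_size)
    upper = min(sample_size, annotated_genes)
    numerator = sum(
        comb(annotated_genes, i)
        * comb(total_genes - annotated_genes, sample_size - i)
        for i in range(min_shared, upper + 1)
    )
    return numerator / denominator


def prob_all_share(total_genes, annotated_genes, sample_size):
    """P(every gene in a random sample carries the annotation)."""
    return prob_at_least(total_genes, annotated_genes, sample_size, sample_size)


# Toy numbers: 10 genes in total, 4 annotated, training set of 2 genes.
p_all = prob_all_share(10, 4, 2)
p_any = prob_at_least(10, 4, 2, 1)
print(p_all, p_any)
```

A small probability suggests that the shared feature is unlikely to be a coincidence of sampling.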
Proposed Method:
Most of the existing methods for the computational prediction of linkages between genes
and disease take as input a preliminary list of candidate genes (e.g. genes in a genomic
region linked in a genetic study to a disease), and return as output either a reduced or a
ranked list. The underlying approaches differ substantively between methods. Examples
of characteristics used in the methods include numerical features derived from the raw
sequence of genes and/or encoded proteins, existing annotations of proteins and genes,
and abstracts or articles directly referring to the gene. The current methods focus on using
properties from a representative set of genes to identify similar genes from the candidate
set.
We propose a method of extracting gene-disease associations that will emphasise
verifiable supporting evidence for the predicted associations, and a quantitative
evaluation of the strength of the association. We shall investigate both the associations
between genes and disease and the properties of each gene-disease association.
We shall consider three base entities – Genes, Diseases, Evidence – and the relationships
between these entities.
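One possible way to represent the three base entities and their relationships is sketched below. This is a minimal data model under stated assumptions: the class names, fields, and the AssociationStore helper are hypothetical and serve only to show how each gene-disease link can be kept together with its supporting evidence.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Gene:
    symbol: str


@dataclass(frozen=True)
class Disease:
    name: str


@dataclass(frozen=True)
class Evidence:
    source_id: str    # hypothetical identifier, e.g. for a literature record
    description: str


@dataclass
class AssociationStore:
    """Maps each (gene, disease) pair to the set of evidence supporting it."""
    links: dict = field(default_factory=dict)

    def add(self, gene, disease, evidence):
        self.links.setdefault((gene, disease), set()).add(evidence)

    def evidence_for(self, gene, disease):
        return self.links.get((gene, disease), set())

    def diseases_for(self, gene):
        return {d for (g, d) in self.links if g == gene}


# Placeholder entities, invented for illustration.
store = AssociationStore()
gene = Gene("GENE1")
disease = Disease("disease X")
store.add(gene, disease, Evidence("ref-001", "co-mentioned in an abstract"))
store.add(gene, disease, Evidence("ref-002", "shared annotation"))
print(len(store.evidence_for(gene, disease)))
```

Keeping evidence attached to each link means that every predicted association can be traced back to verifiable support.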
Goal of Research:
Our goal will be to predict Gene-Disease relationships based on the existence of
relationships between other entity pairings. After initial study of mammalian gene-disease
relationships, we will broaden the approach to incorporate entity relationships involving
orthologous genes in model organisms or related diseases. These paths of supporting
evidence will be quantitatively evaluated, making it possible to both extract strongly
supported gene-disease linkages and to rank these linkages.
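The quantitative evaluation of evidence paths could take many forms. The sketch below shows one simple option, offered purely as an assumption: each step in a path carries a confidence between 0 and 1, a path is scored by the product of its step confidences, and independent paths supporting the same link are combined with a noisy-OR rule. The scoring scheme, function names, and toy values are illustrative and are not a method specified in this proposal.

```python
from math import prod


def path_score(edge_weights):
    """Score one evidence path as the product of its step confidences."""
    return prod(edge_weights)


def combined_score(paths):
    """Combine independent paths with a noisy-OR: 1 - prod(1 - path score)."""
    return 1.0 - prod(1.0 - path_score(p) for p in paths)


def rank_links(evidence_paths):
    """Return (link, score) pairs sorted from strongest to weakest support."""
    scored = {link: combined_score(paths) for link, paths in evidence_paths.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Toy input: each link maps to a list of paths, each path a list of confidences.
# A two-step path might pass through an orthologous gene in a model organism.
evidence_paths = {
    ("GENE1", "disease X"): [[0.9], [0.5, 0.8]],
    ("GENE2", "disease X"): [[0.6, 0.5]],
}
ranking = rank_links(evidence_paths)
for link, score in ranking:
    print(link, round(score, 3))
```

Under this scheme a link supported by several moderately strong paths can outrank one supported by a single path, which matches the aim of both extracting and ranking well-supported linkages.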
Although the thesis itself will investigate properties of transcription factor genes in
Cancer diseases, the methods and analysis will be designed for general application. For
the initial analysis of the main gene-disease associations.
References:
1. Van Driel MA, Cuelenaere K, Kemmeren PP, Leunissen JA, Brunner HG, et al. (2005)
GeneSeeker: extraction and integration of human disease-related information
from web-based genetic databases. Nucleic Acids Research 33: 758.
2. Tiffin N, Kelso J, Powell A, Pan H, Bajic V, et al. (2005) Integration of text- and data-
mining using ontologies successfully selects disease gene candidates. Nucleic
Acids Research 33: 1544-1552.
3. López-Bigas N, Ouzounis C (2004) Genome-wide identification of genes likely to be
involved in human genetic disease. Nucleic Acids Research 32: 3108.
4. Adie E, Adams R, Evans K, Porteous D, Pickard B (2005) Speeding disease gene
discovery by sequence based candidate prioritization. BMC Bioinformatics 6: 55.
5. Turner F, Clutterbuck D, Semple C (2003) POCUS: mining genomic sequence
annotation to predict disease genes. Genome Biology 4: 75.