
Introduction to Bioinformatics

Bioinformatics is a modern discipline that integrates several branches of science, namely Biology, Chemistry and Information Technology.

Informatics related to Biological and Medical sciences:
Bioinformatics
Structural Bioinformatics
Medical Informatics
Chemoinformatics
Pharmacy Informatics
Clinical Informatics

Bioinformatics has a strong interdisciplinary character. It can be considered a confluence of Biology, Computer Science, Information Technology, Mathematics, Chemistry, Physics and Medicine, with the objectives of developing tools to analyze biological, biochemical and biophysical data and of generating new knowledge in these areas. People trained and skilled in all of these fields do not exist; if this area is to develop in our country, such people will have to be trained.

In other words, Bioinformatics is
the combination of biology and information technology. It is a branch of science that deals with the computer-based analysis of large biological data sets.

It incorporates the development of databases to store and search data, and of statistical tools and algorithms to analyze and determine relationships between biological data sets, such as macromolecular sequences, structures, expression profiles and biochemical pathways.

[Figure: DNA, RNA and protein synthesis]

COMPUTERS IN BIOLOGY
Development of new scientific methods and algorithms for managing large amounts of sequence and structural data.
As the full genome sequences of many species and data from structural genomics, micro-arrays and proteomics became available, integrating these data on a common platform requires sophisticated bioinformatics tools (Sequence - Structure - Function). Organizing these data into knowledge databases and developing appropriate software tools for analyzing them are going to be major challenges. India, as a major player in the IT industry, has the potential to develop such resources at an affordable cost.

Structural Bioinformatics in Drug Discovery

[Figure: flowchart with the following boxes]
Target protein sequence
Homology modeling of target protein
Crystal structure of target protein (X-ray)
Virtual library of compounds or QSAR analysis
Large scale docking
Lead identification & lead optimization
Confirm using crystallography, kinetic analysis
Compound development (Drug)

Fig: Schematic outline of the application of SB (homology modeling) and crystallography (structural molecular biology) in the drug discovery process.


Table: Some important structural bioinformatics databases, resources and tools (S.No., database and its importance, URL)

1. National Center for Biotechnology Information (NCBI): Provides a general search for nucleotide sequences, protein sequences, biomolecule 3D structures, genomes, taxonomy or literature.
   URL: http://www.ncbi.nlm.nih.gov/Entrez/

2. Structural Genomics Target Database (sgtdb): 3-D models of all sequences under investigation by structural genomics centers.
   URL: http://spam.sdsc.edu/

3. Structure Comparison Database (CE): Pair-wise structure comparisons based on the Combinatorial Extension (CE) algorithm for both a representative set and the complete set of protein structures; includes alignments.
   URL: http://cl.sdsc.edu/ce.html

4. CKAAP DB: Database of structures with Conserved Key Amino Acid Positions.
   URL: http://ckaaps.sdsc.edu/perl/browser.pl

5. Protein Data Bank (PDB): The single worldwide source of primary structural data on biological macromolecules determined experimentally.
   URL: http://www.rcsb.org/pdb

6. Extended GO Annotation of PDB Chains: Use of structure comparison to extend the coverage of GO terms in the PDB.
   URL: http://spdc.sdsc.edu/

7. PDBbind: Designed to provide a collection of experimentally measured binding affinity data (Kd, Ki and IC50) exclusively for the protein-ligand complexes available in the PDB.
   URL: http://www.pdbbind.org/

Bioinformatics
Information Resources And Networks

Outline
EMBnet - European Molecular Biology Network
DBs and Tools
NCBI - National Center For Biotechnology Information
DBs and Tools
Nucleic Acid Sequence Databases
Protein Information Resources
Metabolic Databases
Mapping Databases
Databases concerning Mutations
Literature Databases

EMBnet European Molecular Biology Network


Founded in 1988
A network that links European laboratories that use biocomputing and bioinformatics in molecular biology research
A science-based group of collaborating nodes throughout Europe, with further nodes outside Europe
Provides information, services and training to its users
Works to increase the availability and accessibility of data resources and computing tools
Aims to increase knowledge and proficiency in bioinformatics through education and training

EMBnet - Nodes
EMBnet (41 nodes)
National Nodes (18): governmental
Specialist Nodes (9): academic, industrial research centers
Associate Nodes (11): biocomputing centers from non-European countries

EMBnet - Nodes
National Nodes
Appointed by the governments. Provide on-line services, user support and training.
Vienna Biocenter - Austria
CSC - Finland
DKFZ - Germany
INCBI - Ireland
IEN-AdR - Italy
Bio - Norway
PEN - Portugal
CNB-CSIC - Spain
SIB - Switzerland
BEN - Belgium
INFOBIOGEN - France
HEN - Hungary
INN - Israel
CMBI - Netherlands
IBB - Poland
GeneBee - Russia
BMC - Sweden
SEQNET - UK

EMBnet - Nodes
Specialist Nodes
Academic, industrial or research centers in specific areas of bioinformatics. Largely responsible for maintenance of biological databases and software.
MIPS (Munich Information Center for Protein Sequences)
ICGEB
Pharmacia
F. Hoffmann-La Roche
EBI (Hinxton Hall, Cambridge, UK): important key specialist node and home of the EMBL, SWISS-PROT and TrEMBL databases
HGMP-RC
Sanger
UCL

EMBnet - Nodes
Associate Nodes
Centers from non-European countries
IBBM - Argentina
ANGIS - Australia
EMBnet - Brazil
CBR - Canada
EMBnet - Chile
CBI - China
EMBnet - Colombia
CIGB - Cuba
CDFD - India
CIFN - Mexico
SANBI - South Africa

EMBnet's Mission
Assist in biotechnology and bioinformatics related research
Provide training and education
Exploit network infrastructures
Investigate and develop new technologies
Bridge between commercial and academic sectors

Who are EMBnet's Users?
More than 40,000 registered users from all over the world, as well as a larger number of Internet users
All scientists working in the Life Sciences, from undergraduate students to top-level scientists, in academia as well as industry, can get support from EMBnet

EMBnet's SRS
Sequence Retrieval System - SRS

[Figure: EMBnet with its National, Specialist and Associate Nodes]

Result of a research project with EMBnet to interrogate all the resources gathered together
SRS is a network browser for DBs in molecular biology
SRS allows any flat-file DB to be indexed to any other
It supports queries across a range of different DB types via a single interface, independent of the underlying data structures or query languages
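
The idea of indexing one flat-file databank against another can be sketched in a few lines. This is only an illustration of the principle, not how SRS itself is implemented: the two-letter line codes (ID, AC, DR) and the `//` terminator follow the EMBL/SWISS-PROT flat-file style, while the sample records, accession numbers and function names are invented for the example.

```python
from collections import defaultdict

# Two toy flat-file records; entries and accession numbers are invented.
FLAT_FILE = """\
ID   TOY1_HUMAN
AC   P00001;
DR   EMBL; X00001;
DR   PDB; 1ABC;
//
ID   TOY2_MOUSE
AC   P00002;
DR   EMBL; X00002;
//
"""

def parse_entries(text):
    """Split a flat file into entries; each entry maps a line code to its values."""
    entries = []
    current = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-entry marker
            if current:
                entries.append(dict(current))
            current = defaultdict(list)
        elif line.strip():
            code, value = line[:2], line[5:].strip()
            current[code].append(value)
    return entries

def build_link_index(entries):
    """Map (target databank, target identifier) to the accessions that cite it."""
    index = defaultdict(list)
    for entry in entries:
        accession = entry["AC"][0].rstrip(";")
        for ref in entry.get("DR", []):
            parts = [p.strip() for p in ref.rstrip(";").split(";")]
            databank, identifier = parts[0], parts[1]
            index[(databank, identifier)].append(accession)
    return dict(index)

entries = parse_entries(FLAT_FILE)
link_index = build_link_index(entries)
print(link_index[("EMBL", "X00002")])   # accessions that cite EMBL entry X00002
```

Once such an index exists, a query against one databank can be followed into another through a single lookup, without the caller needing to know how either file is laid out.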

http://srs.embl-heidelberg.de:8000/srs5/
Sequence Retrieval System Network Browser for Databanks in Molecular Biology

Data Bank       No. Entries  Indexing Date  Group            Availability
SWISSPROT            163235  10-Jun-2005    Sequence         ok
SWISSNEW              81134  22-Mar-2006    Sequence         ok
NRDB                2269647  29-Mar-2006    Sequence         ok
SWALL               3022528  22-Mar-2006    Sequence         ok
UNIPROT_SPROT        212425  22-Mar-2006    Sequence         ok
UNIPROT_TREMBL      2666963  23-Mar-2006    Sequence         ok
TREMBLNEW            624819  12-Dec-2005    Sequence         ok
TREMBL              2576118  04-Oct-2005    Sequence         ok
SPTREMBL            1449374  16-Jun-2005    Sequence         ok
SPTREMBLNEW          143140  17-Jun-2005    Sequence         ok
REMTREMBL             92182  20-Jun-2005    Sequence         ok
PIR                  283416  16-Jun-2005    Sequence         ok
WORMPEP               19538  16-Jun-2005    Sequence         ok
DROSOPHILA            14100  16-Jun-2005    Sequence         ok
EMBLNEW             4035816  21-Nov-2005    Sequence         ok
EMBL               20343598  30-Dec-2005    Sequence         ok
EMBLEST            31990232  06-Jan-2006    Sequence         ok
EMBLWGS            11106060  24-Sep-2005    Sequence         ok
GENBANK            19233264  18-Nov-2005    Sequence         ok
GENBANKEST         31008556  23-Feb-2006    Sequence         ok
REFSEQP                8006  16-Jun-2005    Sequence         ok
SUBTILIST                 -  16-Jun-2005    Sequence         ok
PROSITE                1935  22-Mar-2006    SeqRelated       ok
PROSITEDOC             1407  22-Mar-2006    SeqRelated       ok
BLOCKS                 4034  16-Jun-2005    SeqRelated       ok
EPD                    1375  16-Jun-2005    SeqRelated       ok
ENZYME                 4173  16-Jun-2005    SeqRelated       ok
PRINTS                  865  16-Jun-2005    SeqRelated       ok
TFSITE                 4342  07-Apr-2003    TransFac         ok
TFFACTOR               1799  07-Apr-2003    TransFac         ok
TFCELL                  816  07-Apr-2003    TransFac         ok
TFCLASS                  27  07-Apr-2003    TransFac         ok
TFMATRIX                246  07-Apr-2003    TransFac         ok
TFGENE                 1035  07-Apr-2003    TransFac         ok
PDB                   34927  08-Feb-2006    Protein3DStruct  ok
DSSP                  30832  22-Nov-2005    Protein3DStruct  ok
HSSP                  30369  08-Feb-2006    Protein3DStruct  ok
PDBFINDER             35701  28-Mar-2006    Protein3DStruct  ok
NRL3D                  6063  16-Jun-2005    Protein3DStruct  ok
FLYGENES               7556  16-Jun-2005    Genome           ok
FLYREFS                   0  07-Apr-2003    Genome           ok
OMIM                  17004  18-Oct-2005    Mutations        ok
REPTILIA               8364  18-Jan-2006    Others           ok

NCBI National Center For Biotechnology Information


Leading American information provider
Established in 1988 as a division of the National Library of Medicine (NLM)
Located on the campus of the National Institutes of Health (NIH) in Bethesda, Maryland

Mission:
Development of new information technologies to aid our understanding of the molecular and genetic processes that underlie health and disease
Creation of systems for storing and analysing biological information
Development of advanced methods of computer-based information processing
Facilitation of user access to DBs and software
Co-ordination of efforts to gather biotechnology information worldwide

NCBI
Since 1992, maintains GenBank and collaborates with the international nucleotide DBs EMBL (Europe) and DDBJ (Japan)
Provides Entrez, which facilitates access to biological DBs (similar to the SRS provided by EMBnet)

NCBI - Responsibilities
conducts research on biomedical problems at the molecular level using mathematical and computational methods
maintains collaborations with several NIH (National Institutes of Health) institutes, academia, industry, and other governmental agencies
promotes scientific communication by sponsoring meetings, workshops, and lecture series
supports training on basic and applied research in computational biology for postdoctoral fellows through the NIH Intramural Research Program
engages members of the international scientific community in informatics research and training through the Scientific Visitors Program
develops, distributes, supports, and coordinates access to a variety of databases and software for the scientific and medical communities
develops and promotes standards for databases, data deposition and exchange, and biological nomenclature

Nucleic Acid Sequence Databases
The principal nucleic acid sequence databases are GenBank, EMBL and DDBJ, which each collect a portion of the total sequence data reported world-wide and exchange new and updated entries on a daily basis.

Nucleic acid sequence databases
EMBL (Europe)
GenBank (USA)
DDBJ (Japan)
ENSEMBL (project between EMBL-EBI and the Sanger Institute)
dbEST (division of GenBank)
GSDB (division of GenBank)

source: http://www3.ebi.ac.uk/Services/DBStats/

Nucleic Acid Sequence Databases - EMBL


This morning the EMBL Database contained 127,450,085,130 nucleotides in 69,666,551 entries.
Breakdown by entry type:

Entry type                    Entries     Nucleotides
Standard                      56,843,150  61,498,109,356
Constructed (CON)             497,187     n/a
Third Party Annotation (TPA)  4,884       334,827,880
Whole Genome Shotgun (WGS)    12,318,618  64,837,183,592

The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications. The database is produced in an international collaboration with GenBank (USA) and the DNA Database of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis.
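
As a small worked example, the share of each entry type in the figures quoted in the breakdown above can be computed directly. The numbers are copied from that breakdown; the function and variable names are introduced here for illustration only.

```python
# Figures quoted in the breakdown above as (entries, nucleotides).
# None marks the value given as "n/a" in the source.
TOTAL_ENTRIES = 69_666_551
TOTAL_NUCLEOTIDES = 127_450_085_130

BREAKDOWN = {
    "Standard": (56_843_150, 61_498_109_356),
    "Constructed (CON)": (497_187, None),
    "Third Party Annotation (TPA)": (4_884, 334_827_880),
    "Whole Genome Shotgun (WGS)": (12_318_618, 64_837_183_592),
}

def shares(breakdown, total_entries, total_nucleotides):
    """Return {entry type: (share of entries, share of nucleotides or None)}."""
    result = {}
    for entry_type, (entries, nucleotides) in breakdown.items():
        entry_share = entries / total_entries
        if nucleotides is None:
            nucleotide_share = None
        else:
            nucleotide_share = nucleotides / total_nucleotides
        result[entry_type] = (entry_share, nucleotide_share)
    return result

embl_shares = shares(BREAKDOWN, TOTAL_ENTRIES, TOTAL_NUCLEOTIDES)
for entry_type, (entry_share, nucleotide_share) in embl_shares.items():
    n_text = "n/a" if nucleotide_share is None else f"{nucleotide_share:.1%}"
    print(f"{entry_type}: {entry_share:.1%} of entries, {n_text} of nucleotides")
```

With these figures, WGS entries hold roughly half of all nucleotides in under a fifth of the entries. The four listed types do not add up exactly to the quoted totals, so the shares are relative to the totals rather than to the sum of the rows.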

Nucleic Acid Sequence Databases - EMBL


Number of entries (current: 69,666,551)
Total nucleotides (current: 127,450,085,130)

Ref: EMBL Nucleotide Sequence Database: developments in 2005, Nucleic Acids Research, 2006, Vol. 34, D10-D15

Nucleic Acid Sequence Databases - EMBL


[Figure: breakdown by nucleotide count. Species shown: Homo sapiens, Bos taurus, Macaca mulatta, Mus musculus, Canis familiaris, Loxodonta africana, Rattus norvegicus, Monodelphis domestica, Pan troglodytes, Danio rerio, Other]

Nucleic Acid Sequence Databases - GenBank

GenBank, which is produced at NCBI, is split into smaller, discrete divisions.
This facilitates fast, specific searches by restricting queries to particular database subsets.
During 1992-1997, the level of EST and STS data within GenBank grew 10-fold. Even so, the overall sequence information contributed by such partial data was still less than that of the higher quality sequences in the other major divisions.

Specialised Genomic Resources


In addition to the comprehensive DNA sequence DBs, there is a variety of more specialised genomic resources. These so-called boutique DBs bring focus to species-specific genomics and to particular sequencing techniques.

Specialised Genomic Resources
SGD - Saccharomyces Genome Database
UniGene - gene-oriented clusters from GenBank
TIGR - Databases of The Institute for Genomic Research
ACeDB - A C. elegans DataBase

Specialised Genomic Databases


SGD (Saccharomyces Genome Database): a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae
http://genome-www.stanford.edu/Saccharomyces

ACeDB (A C. elegans DataBase)
http://www.acedb.org (C. elegans)

FlyBase (A Database of Drosophila Genes & Genomes)
http://flybase.bio.indiana.edu (fruit fly)

MGD (Mouse Genome Database)
http://www.informatics.jax.org (mouse)

Protein Information Resources


Levels of protein sequence and structural organisation:
primary: The primary structure of a protein is its amino acid sequence.
secondary: The secondary structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands).
tertiary: The tertiary structure of a protein arises from the packing of its secondary structure elements, which may form discrete domains within a fold.

Principles of Protein Structure

primary structure

ACDEFGHIKLMNPQRSTVWY

Protein Information Resources


Levels of protein sequence and structural organisation, and the databases that hold them:
primary - sequence (e.g. AVILDRYFH) - primary database
secondary - motif (e.g. [AS]-[IL]2-X-[DE]-R-[FYW]2-H) - secondary database
tertiary - domain, module - structure database
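
A motif of the kind shown at the secondary level can be searched for with an ordinary regular expression. The sketch below assumes the motif is written in PROSITE-style syntax (elements separated by `-`, `x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` for repeats). Rewriting the slide's motif in that syntax is an assumption, and the test sequence is invented.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex_parts = []
    for element in pattern.split("-"):
        # Separate an optional repeat count such as "(2)" or "(2,4)".
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core.lower() == "x":
            core = "."                        # x matches any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # {..} lists forbidden residues
        # "[..]" classes and single letters are already valid regex syntax.
        regex_parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(regex_parts)

# Assumed PROSITE-style rendering of the motif; the sequence is made up.
MOTIF = "[AS]-[IL](2)-x-[DE]-R-[FYW](2)-H"
SEQUENCE = "MKAILGDRFYHQ"

motif_regex = prosite_to_regex(MOTIF)
hit = re.search(motif_regex, SEQUENCE)
print(motif_regex)
if hit:
    print(hit.start(), hit.group())
```

Regular expressions of this kind are rigid: a single substitution outside the allowed residues at any position means the sequence is not reported as a match.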

Primary Protein Databases


The primary structure of a protein is its amino acid sequence.
These sequences are stored in primary databases as linear alphabets that denote the constituent residues.

Protein sequence databases
SWISS-PROT - Protein knowledgebase
TrEMBL - Computer-annotated supplement to Swiss-Prot
PIR - Protein Information Resource
MIPS - Munich Information Centre for Protein Sequences
NRL-3D - produced by PIR

Protein Sequence Databases


Swiss-Prot contains 197,228 sequence entries, comprising 71,501,181 amino acids abstracted from 135,257 references.
Total number of species represented in Swiss-Prot: 9,520
The average sequence length in Swiss-Prot is 362 amino acids.
Swiss-Prot is the most highly annotated protein sequence DB.

Table of the most represented species
No.  Frequency  Species
1    13049      Homo sapiens (Human)
2    10132      Mus musculus (Mouse)
3    5189       Saccharomyces cerevisiae (Baker's yeast)
4    4847       Escherichia coli
5    4669       Rattus norvegicus (Rat)
6    3665       Arabidopsis thaliana (Mouse-ear cress)
7    2863       Schizosaccharomyces pombe (Fission yeast)
8    2814       Bacillus subtilis
9    2750       Caenorhabditis elegans
10   2286       Drosophila melanogaster (Fruit fly)

Composite Protein Sequence Databases


Composite databases amalgamate a variety of different primary databases.
They render sequence searching much more efficient, because they obviate the need to interrogate multiple resources.
Different composite databases use different primary sources and different redundancy criteria in their amalgamation procedures.
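
A minimal sketch of such an amalgamation is given below. It assumes the simplest possible redundancy criterion, exact sequence identity, with the first source listed taking priority; real composite databases apply their own, more elaborate criteria. The source names, identifiers and sequences are invented.

```python
def merge_non_redundant(sources):
    """Merge (name, records) sources in priority order.

    Assumed redundancy criterion: two entries are duplicates when their
    sequences are identical (case-insensitive). The first source listed wins.
    """
    merged = {}
    for source_name, records in sources:
        for identifier, sequence in records:
            key = sequence.upper()
            if key not in merged:
                merged[key] = (source_name, identifier)
    return [(src, ident, seq) for seq, (src, ident) in merged.items()]

# Toy primary sources; identifiers and sequences are invented.
primary_a = [("A1", "MKTAYIAK"), ("A2", "GSHMLEDP")]
primary_b = [("B1", "mktayiak"), ("B2", "AVILDRYFH")]

composite = merge_non_redundant([("primary_a", primary_a),
                                 ("primary_b", primary_b)])
for source, identifier, sequence in composite:
    print(source, identifier, sequence)
```

Changing the order of the sources, or relaxing the identity test to tolerate near-identical sequences, gives a different composite from the same inputs, which is why composites built from the same primary databases can differ in content.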

Composite Protein Sequence Databases


Composite DB and its primary sources:
NRDB (Non-Redundant DataBase): PDB, SWISS-PROT, PIR, GenPept, SWISS-PROTupdate, GenPeptupdate
OWL: SWISS-PROT, PIR, GenBank, NRL-3D
MIPSX: PIR1-4, MIPSOwn, MIPSTrn, MIPSH, PIRMOD, NRL-3D, SWISS-PROT, EMTrans, GBTrans, Kabat, PseqIP
SP+TrEMBL: SWISS-PROT, TrEMBL

Secondary databases
Secondary databases contain pattern data, i.e., diagnostic signatures for protein families. These signatures encode the most highly conserved features of multiply aligned sequences, which are often crucial to the structure or function of the protein. The secondary structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands), which, in sequence alignments, are often apparent as well-conserved motifs. Patterns may be stored as regular expressions, fingerprints, blocks, profiles, etc.

Secondary databases
Secondary DB  Primary source  Stored information
PROSITE       SWISS-PROT      Regular expressions (patterns)
Profiles      SWISS-PROT      Weighted matrices (profiles)
PRINTS        OWL             Aligned motifs (fingerprints)
BLOCKS        PROSITE/PRINTS  Aligned motifs (blocks)
IDENTIFY      BLOCKS/PRINTS   Fuzzy regular expressions (patterns)
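
To make the difference between a rigid pattern and a weighted matrix concrete, the sketch below builds a simple log-odds profile from an ungapped block of aligned motifs and scores candidate segments against it. It is a generic illustration, not the method of any database in the table: the uniform background frequency, the pseudocount of 1 and the toy block are all assumptions.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(aligned_motifs, pseudocount=1.0):
    """Return one dict of log-odds scores per column of an ungapped block."""
    n_sequences = len(aligned_motifs)
    length = len(aligned_motifs[0])
    background = 1.0 / len(ALPHABET)          # assumed uniform background
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_motifs)
        column = {}
        for aa in ALPHABET:
            freq = (counts[aa] + pseudocount) / (
                n_sequences + pseudocount * len(ALPHABET))
            column[aa] = math.log2(freq / background)
        profile.append(column)
    return profile

def score(profile, segment):
    """Sum the column scores of a segment with the same length as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))

# Invented block of four aligned motif occurrences.
block = ["AILD", "SILE", "ALLD", "AIIE"]
profile = build_profile(block)

for candidate in ("AILD", "SLIE", "WWWW"):
    print(candidate, round(score(profile, candidate), 2))
```

Unlike a regular expression, the profile gives every segment a score, so a sequence that deviates at one position is ranked lower rather than rejected outright.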

Secondary databases
TRANSFAC http://transfac.gbf.de
EPD http://www.epd.isb-sib.ch
InterPro http://www.ebi.ac.uk/interpro/
PROSITE http://www.expasy.ch/prosite
BLOCKS http://blocks.fhcrc.org
PRINTS ftp://ftp.seqnet.dl.ac.uk/pub/database/prints
PFAM http://www.sanger.ac.uk/Software/Pfam/index.shtml
ProDom http://www.toulouse.inra.fr/prodom.html
GeneCards http://bioinformatics.weizmann.ac.il/cards
ENSEMBL http://www.ensembl.org
EcoCyc http://ecocyc.panbio.com/ecocyc/ecocyc.html

Secondary databases
There is some overlap in content between the secondary databases.
PDBsum alone has 35,291 entries.
Pattern DB growth is slow because the addition of detailed family annotation is very time consuming.
PROSITE and PRINTS are the only comprehensively, manually annotated secondary DBs.

To address the annotation bottleneck, the secondary database curators have together created a unified database of protein families known as InterPro.

Structure Classification DBs


Contain 3D structures available from crystallographic and spectroscopic studies

Structure Classification Databases
PDBsum - Protein Data Bank
CATH - Class, Architecture, Topology, Homology
SCOP - Structural Classification of Proteins

Structure Classification DBs


PDB
http://www.rcsb.org

SCOP
http://scop.mrc-lmb.cam.ac.uk/scop

CATH
http://www.biochem.ucl.ac.uk/bsm/cath

DSSP
http://www.sander.ebi.ac.uk/dssp

FSSP
http://www.ebi.ac.uk/dali/fssp

HSSP
http://www.sander.ebi.ac.uk/hssp

Metabolic Databases
A number of metabolic databases are available electronically, some with features for querying and visualizing metabolic pathways and regulatory networks.

KEGG (Kyoto Encyclopedia of Genes and Genomes)


http://www.genome.ad.jp/kegg

ENZYME (Enzyme nomenclature database)


http://www.expasy.ch/enzyme

BRENDA (Enzyme Information System)


http://www.brenda.uni-koeln.de

EMP (Enzymes and Metabolic Pathways database)


http://www.empproject.com

Mapping Databases
OMIM
(Online Mendelian Inheritance in Man)

http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim

GDB (The GDB Human Genome Database)


http://www.gdb.org

RHDB
http://corba.ebi.ac.uk/RHdb

Databases concerning Mutations


dbSNP
http://www.ncbi.nlm.nih.gov/SNP

HGBASE
http://hgbase.cgr.ki.se

The SNP Consortium (TSC)


http://snp.cshl.org

HAEMA
http://europium.csc.mrc.ac.uk/usr/WWW/WebPages/database.dir/quiz.dir/intrquiz.htm

Literature Databases
PubMed
http://www.ncbi.nlm.nih.gov/entrez/query

Bioinformatics Online
http://www.bioinformatics.oupjournals.org

Nature
http://www.nature.com

Science
http://www.sciencemag.org

Database tools for displaying and annotating genomic sequence data


Viewer           URL
Artemis          www.sanger.ac.uk/Software/Artemis
ACeDB            www.acedb.org/Tutorial/brieftutorial/shtml
Apollo           www.ensembl.org/apollo
EnsEMBL          www.ensembl.org
NCBI map viewer  www.ncbi.nlm.nih.gov
GoldenPath       genome.ucsc.edu

Database formats
There is no universally agreed format for genome databases, and several viewers and browsers have been developed with graphical displays for genomic sequence analysis and annotation.

Common formats
There are several conventions for representing nucleic acid and protein sequences, of which the following are widely used:
NBRF/PIR
FASTA
GDE

These formats have limited facilities for comments, which must include a unique identifier code and sequence accession number.
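
Of the formats listed above, FASTA is the simplest to read programmatically: a header line starting with `>` carries the identifier and optional comment, and the lines that follow hold the sequence. A minimal reader might look like this; the sample records are invented, and treating the first word of the header as the identifier is a common convention rather than a rule stated in the text.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)               # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def split_header(header):
    """Split a header into (identifier, description) at the first space."""
    identifier, _, description = header.partition(" ")
    return identifier, description

# Invented example records.
SAMPLE = """\
>seq1 hypothetical protein
MKTAYIAK
QRQISFVK
>seq2
AVILDRYFH
"""

fasta_records = parse_fasta(SAMPLE)
for header, sequence in fasta_records:
    identifier, description = split_header(header)
    print(identifier, len(sequence), description)
```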

Formats for multiple sequence alignment


There are separate formats for multiple sequence alignment representation, of which the following are popular:
MSF
PHYLIP
ALN

Files of structural data


Structural data are maintained as flat files using the PDB format. Such files contain orthogonal atomic coordinates together with annotations, comments and experimental details.

http://www.pdb.org
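
Because PDB files are fixed-column flat files, the atomic coordinates can be recovered by slicing each ATOM line at known positions. The sketch below uses the column layout of the ATOM record as commonly documented for the PDB format (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x, y, z in 31-54); treat these positions as an assumption to be checked against the format specification. The helper that writes a line exists only so the example is self-contained, and the coordinates are invented.

```python
def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z,
                     occupancy=1.0, temp_factor=0.0):
    """Write one ATOM record using fixed column widths (illustrative helper)."""
    return ("ATOM  {:>5d} {:<4s} {:>3s} {:1s}{:>4d}    "
            "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}").format(
        serial, name, res_name, chain, res_seq, x, y, z,
        occupancy, temp_factor)

def parse_atom_line(line):
    """Read the main fields of an ATOM record by column position."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }

def parse_atoms(text):
    """Parse every ATOM line in a block of PDB-formatted text."""
    return [parse_atom_line(line) for line in text.splitlines()
            if line.startswith("ATOM")]

# Invented coordinates for two atoms of one residue.
pdb_text = "\n".join([
    "REMARK toy example",
    format_atom_line(1, " N", "MET", "A", 1, 27.340, 24.430, 2.614),
    format_atom_line(2, " CA", "MET", "A", 1, 26.266, 25.413, 2.842),
    "END",
])

atoms = parse_atoms(pdb_text)
for atom in atoms:
    print(atom["serial"], atom["name"], atom["res_name"],
          atom["x"], atom["y"], atom["z"])
```

Splitting on whitespace instead of slicing by column would fail on real files, where adjacent fields can run together without a separating space.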

Submission of sequences
Sequences may be submitted to any of the three primary databases using the tools provided by the database curators. Such tools include WebIn and BankIt, which can be used over the Internet, and Sequin, a stand-alone application.
http://www.ebi.ac.uk/embl/Submission/webin.html

http://www.ncbi.nlm.nih.gov/BankIt/

Database interrogation
All the databases discussed above can be searched by sequence similarity. However, detailed text-based searches of the annotations are also possible using tools such as Entrez. The simplest way to cross-reference between the primary nucleotide sequence databases and SWISS-PROT is to search by accession number, as this provides an unambiguous identifier of genes and their products.
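
A text-based search by accession number can also be issued programmatically. The sketch below only builds a query URL for NCBI's E-utilities `esearch` service, which is the programmatic counterpart of Entrez; it does not send the request. The `[ACCN]` field tag and the parameter names reflect the E-utilities documentation as understood here and should be checked against it, and the accession `X00001` is a placeholder, not a reference to a particular record.

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(database, term, max_results=20):
    """Build (but do not send) an esearch query URL."""
    params = {
        "db": database,          # Entrez database to search, e.g. "nucleotide"
        "term": term,            # query text, optionally with field tags
        "retmax": max_results,   # maximum number of identifiers to return
        "retmode": "json",       # ask for a JSON response
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)

# Placeholder accession restricted to the accession field (assumed tag).
esearch_url = build_esearch_url("nucleotide", "X00001[ACCN]")
print(esearch_url)
```

Restricting the term to the accession field avoids the false hits a free-text search can return when the same string also appears in a title or comment.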
