Subject : Bioinformatics
Databases at EBI
Nucleotide databases
Functional Genomics Databases
Protein databases
Structure databases
Sequence Analysis
Pairwise analysis
Multiple sequence alignment
Homology Searching
Summary
Exercises
Glossary
References
Introduction
In keeping with the tremendous growth in the field of computational biology, a need was felt to
establish an independent, parallel research institute that would act not just as a mirror
housing the GenBank nucleotide resources of NCBI, but would also develop matching
databases and analysis tools. The European Molecular Biology Laboratory (EMBL) was
thus established in 1974 and is now supported with funding from 20 member states of the
European Union (EU), Israel and Australia. EMBL currently operates five research institutes
in different countries, with the main institute at Heidelberg, Germany.
The five institutes of EMBL with their core research activities are (http://www.embl.org/):
In the following sections, we limit ourselves to the bioinformatics research, databases and
facilities of EMBL that are located at EMBL-EBI.
The European Bioinformatics Institute (EBI-EMBL) was established in 1980 as the EMBL
Nucleotide Sequence Data Library at Heidelberg, with the objective of creating a database of
published nucleotide sequences. It was in fact the world's first public nucleotide database,
preceding NCBI by eight years (NCBI was established in 1988). Subsequently, in 1992,
EMBL decided to establish EBI as a dedicated research and analysis facility at the Wellcome
Trust Genome Campus, Hinxton (UK), in close proximity to the Sanger Sequencing Center.
At present, EBI-EMBL houses databases and provides services and analysis tools for all major
research disciplines requiring computational support. In addition, EBI is a partner in and
coordinator of the International Nucleotide Sequence Database Collaboration (INSDC;
http://www.insdc.org) for public domain nucleotide sequence information, together with
GenBank at NCBI (www.ncbi.nlm.nih.gov) and the DNA Data Bank of Japan (DDBJ;
www.ddbj.nig.ac.jp).
a. Biological Ontologies
b. Literature
d. Nucleotides
f. Protein
g. Proteomics
h. Small Molecules
i. Structure
Databases at EBI
The following section will deal with selected databases of EBI-EMBL:
Nucleotide databases
a. European Nucleotide Archive (ENA): ENA receives nucleotide data from a variety of
sources, including small-scale sequencing studies, sequencing centers and the INSDC partners
(i.e. GenBank and DDBJ). To better manage the sequencing resources, ENA is
divided into several sub-databases, such as:
Sequence Read Archive (SRA), for Next Generation Sequencing (NGS) data
EMBL-Bank, for assembled and annotated sequence data (note that nucleotide data
should be submitted to only one of GenBank, EBI or DDBJ and not to all of
them, as data submitted to one database are automatically replicated or
sent to the other two)
Source: http://www.ebi.ac.uk/ena/
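Records retrieved from ENA are commonly handled as FASTA text. The following minimal Python sketch shows one way such text could be parsed into header and sequence pairs. The URL pattern for the ENA Browser API and the function names are assumptions that do not come from this chapter, so the current ENA documentation should be checked before relying on them. The example record is made up, and the download function is defined but not called, so the sketch runs offline.

```python
from urllib.request import urlopen

# Assumed URL pattern for the ENA Browser API (not given in this chapter).
ENA_FASTA_URL = "https://www.ebi.ac.uk/ena/browser/api/fasta/{accession}"


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fetch_ena_fasta(accession):
    """Download one record as FASTA text (hypothetical helper, needs network)."""
    with urlopen(ENA_FASTA_URL.format(accession=accession)) as response:
        return response.read().decode("utf-8")


# Offline demonstration with a made-up record.
example = ">demo_1 hypothetical sequence\nACGTAC\nGGTA\n"
print(parse_fasta(example))
```

With network access, `parse_fasta(fetch_ena_fasta(accession))` would combine the two steps for a real accession number.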
b. DGVa: The Database of Genomic Variants archive (DGVa) stores genomic structural variation
data and is analogous to the dbVar database of NCBI. The data at DGVa can be accessed
via the Ensembl (www.ensembl.org) portal.
Source: www.ensembl.org
c. EGA: The European Genome-phenome Archive (EGA) stores data from studies that
are carried out with the objective of understanding the linkages between genotype and
phenotype, especially in biomedical research. This database is analogous to the dbGaP
database at NCBI. Such data may have been generated from genome-wide association
studies (GWAS). As the studies and datasets generally deal with disorders such as
cancer, coronary artery defects, hypertension, rheumatoid arthritis and diabetes, strict
control over submission and public access is implemented on ethical grounds (as the
database contains information about patients and subjects taking part in the studies) to
prevent misuse of data.
Source: https://www.ebi.ac.uk/ega/
A Data Access Committee (DAC) determines access to an EGA dataset on a case-by-case
basis, and users are required to sign an agreement before downloading and subsequently
using the dataset.
d. ENA-Genome: This database contains completed genome sequence data from a
variety of organisms, such as:
Bacteria
Eukaryotes
Organelles
Phages
Plasmids
Viroids
EMBL-EBI developed the Ensembl Genomes tool to browse, analyse and visualize
genome sequencing data. Currently, close to 350 completed genome
sequences are available for browsing, analysis and downloading. The Ensembl Genomes
server provides tools for analysis at all levels of genome
organization, such as the whole genome, chromosome, genome segment, gene and
transcript levels. Its genome visualization and analysis tool also
provides links to molecular function, gene ontology, protein summary and structure
tables.
Figure : The genome browser at Ensembl provides genome visualization and analysis tools
at various levels of genome organization
Source: http://plants.ensembl.org/index.html
Alignment tools built into the database allow users to perform analyses and detect
polymorphisms at HLA loci.
IPD-MHC contains sequences of the major histocompatibility complex (MHC) for a large
number of species
IPD-KIR is the database for human Killer-cell Immunoglobulin-like Receptors (KIR) and
contains information about 614 alleles (http://www.ebi.ac.uk/ipd/kir/stats.html)
Source: https://www.ebi.ac.uk/metagenomics/
Figure : Analysis of sequence data at the Metagenomics portal to reveal the genetic makeup
of the sample
Source: https://www.ebi.ac.uk/metagenomics/
ArrayExpress: This database contains functional genomics data, including microarray data
that are MIAME compliant (see the chapter on NCBI for MIAME information). As of December
2012, the database contained datasets from 34145 experiments. ArrayExpress also
contains a Gene Expression Atlas containing datasets from 3558 experiments from 99484
Source: https://www.ebi.ac.uk/arrayexpress/
ArrayExpress not only accepts microarray data, but is also a central repository for sequence
data generated as part of functional genomics studies. For example,
sequence data from non-human and human non-identifiable experiments, such as
RNA-sequencing (RNA-seq) and ChIP-seq data generated via high-throughput technologies, are also
deposited at ArrayExpress. If the data are from human, potentially identifiable samples,
they should be submitted to the EGA (European Genome-phenome Archive) database
(http://www.ebi.ac.uk/fg/doc/help/UHTS_submissions.html). The sequence data submitted to
ArrayExpress are eventually transferred to the ENA (European Nucleotide Archive), and the
metadata from EGA can also be accessed via ArrayExpress.
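The submission rule described above can be summarized as a small decision function. This is only an illustrative Python sketch of the rule as stated in this section. The function name and its two boolean parameters are hypothetical, and real submissions should follow the current guidance of the archives themselves.

```python
def choose_archive(is_human, potentially_identifiable):
    """Suggest a submission archive for functional genomics sequence data.

    Human, potentially identifiable data go to EGA; non-human data and
    human non-identifiable data go to ArrayExpress.
    """
    if is_human and potentially_identifiable:
        return "EGA"
    return "ArrayExpress"


# Example: RNA-seq data from a plant, and from identifiable human patients.
print(choose_archive(is_human=False, potentially_identifiable=False))
print(choose_archive(is_human=True, potentially_identifiable=True))
```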
Protein databases:
There are several databases catering to the various dimensions of bioinformatics needs for
protein analysis at EBI.
InterPro is the primary, integrated database for protein families, domains and
functional sites. InterPro collates information from various other resources to prepare
and catalogue information about protein structure, domain, function, signature and other
features, and allows users to search and perform analysis. InterProScan is the software that
allows users to analyse their protein sequences against the InterPro database.
UniProt is the centralized universal database for all protein sequences. Each protein
sequence is annotated with functional details, amino acid sequence, taxonomic description
and other information from related disciplines. UniProt has several sub-databases such as:
The database termed PANDIT (Protein and Associated Nucleotide Domains with
Inferred Trees) is no longer updated but remains available for browsing. It
contains a collection of multiple sequence alignments of protein domains together with
their inferred phylogenetic trees.
The CSA (Catalytic Site Atlas) contains a collection of catalytic sites for proteins and
enzymes from experimental and structural data. CSA contains information for nearly 29,000
entries.
Figure : Catalytic site atlas compiles information about catalytic sites of various enzymes
based on experimental and structural data
Source: http://www.ebi.ac.uk/thornton-srv/databases/CSA/
Structure databases:
a. DSSP database stores nearly 84000 secondary structure assignments for Protein
Data Bank (PDB) entries
(http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+LibInfo+-lib+DSSP)
b. FSSP (Family of structurally similar proteins) stores alignments based on
structure for PDB entries
c. HSSP (Homology derived structures of proteins) is a database that merges 2-D
and 3-D information (structural) with 1-D (sequence) information
d. PDBe (Protein Data Bank in Europe) is a database of experimentally derived
structures of proteins and other biomolecules. The structure files can be viewed using
free viewers such as RasMol (rasmol.org/)
e. EMDB (Electron Microscopy Data Bank) database stores electron microscopy
structural information for macromolecular complexes and subcellular structures. It
currently stores nearly 1570 structures (http://www.ebi.ac.uk/pdbe/emdb/)
f. PDBeNMR stores NMR derived structures in PDBe
g. PDBeChem is the central repository for chemical and ligand molecules found in
PDBe and currently has over 15000 ligand structures
(http://www.ebi.ac.uk/pdbe-srv/pdbechem/)
Figure: Few examples of structures from EMDB and PDB database at EBI-EMBL
Sequence Analysis
Sequence analysis can be performed at EBI for several purposes, including pairwise
alignment and multiple sequence alignment.
Pairwise alignment:
Local alignment of two sequences can be performed with EMBOSS Water, which
implements the Smith-Waterman algorithm
(http://www.ebi.ac.uk/Tools/psa/emboss_water/)
The PromoterWise tool compares two DNA sequences that may have undergone
inversions and translocations, and is therefore considered useful for detecting cis elements
such as promoters (http://www.ebi.ac.uk/Tools/psa/promoterwise/)
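To make the idea of local alignment concrete, the following is a minimal Python sketch of the Smith-Waterman dynamic programming scheme. It is a simplified illustration and not the EMBOSS Water implementation. It assumes a fixed match/mismatch score and a linear gap penalty, with illustrative values chosen here, whereas production tools typically use substitution matrices and separate gap opening and extension penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment.

    Simplified scoring (an assumption of this sketch): fixed match and
    mismatch scores and a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below zero, which is what
    # allows an alignment to start anywhere in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

In this example only the shared segment ACG is reported, because extending the alignment into the flanking mismatches would lower the score.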
Figure: Pairwise global alignment using EMBOSS Needle, and DNA-protein
alignment using GeneWise
Source: http://www.ebi.ac.uk/Tools/psa/
Figure : A multiple sequence alignment output obtained with progressive alignment tool
Clustal
Source: http://www.ebi.ac.uk/Tools/msa/clustalw2/
MAFFT (Multiple Alignment using Fast Fourier Transform) is regarded as a fast multiple
sequence alignment program and is available at http://www.ebi.ac.uk/Tools/msa/mafft/.
The third multiple sequence alignment tool at EBI is MUSCLE (Multiple Sequence
Comparison by Log-Expectation), which is reported to be both fast and accurate compared
to other MSA tools.
Source: http://www.ebi.ac.uk/Tools/msa/mafft/
Homology Searches: The local alignment tool BLAST can be used to search sequence
databases for homology or similarity to a query sequence.
c. After selecting the tool, an appropriate query is to be entered into the box (nucleotide for
BLASTN and protein for BLASTP),
i. a summary overview of all the subjects that have matches to the query with
ranking based on the percent similarity, query coverage and e-value,
ii. a pairwise alignment between the query and the subject, and
iv. Link to Gene Expression database (ArrayExpress) for expression profile of the
transcripts (subject)
Source: http://www.ebi.ac.uk/Tools/sss/ncbiblast/nucleotide.html
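The summary overview ranks subjects using measures such as percent similarity, query coverage and e-value. The short Python sketch below illustrates one plausible way to order a list of hits by these measures. The `Hit` record, the sort order (lowest e-value first, then higher coverage, then higher identity), the e-value cutoff and the example values are all assumptions made for illustration; they are not the ranking used by the EBI BLAST service.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary record for one BLAST subject."""
    subject_id: str
    percent_identity: float
    query_coverage: float
    e_value: float


def rank_hits(hits, max_e_value=1e-5):
    """Drop weak hits, then sort the rest from most to least significant."""
    kept = [h for h in hits if h.e_value <= max_e_value]
    return sorted(
        kept,
        key=lambda h: (h.e_value, -h.query_coverage, -h.percent_identity),
    )


# Made-up hits, for demonstration only.
hits = [
    Hit("subject_A", 85.0, 90.0, 1e-40),
    Hit("subject_B", 99.0, 100.0, 1e-120),
    Hit("subject_C", 40.0, 30.0, 0.5),
    Hit("subject_D", 88.0, 95.0, 1e-40),
]
for hit in rank_hits(hits):
    print(hit.subject_id, hit.e_value, hit.query_coverage, hit.percent_identity)
```

A lower e-value indicates that a match of that quality is less likely to have arisen by chance, which is why it is used as the primary sort key in this sketch.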
Figure : Snapshots of BLASTP result output windows-2 showing links to genome view,
protein families and classification at UniProt and Gene Ontology (functional classification)
Source: http://www.ebi.ac.uk/Tools/sss/ncbiblast/
Exercises
1. What do EMBL and EBI stand for?
3. EMBL and EBI were established in the years ------ and ------- respectively.
4. What are the various institutes and core research areas of EMBL?
5. NCBI, EMBL and DDBJ are coordinated by ------------------ for public domain nucleotide
sequence information.
8. Which database stores genomic structural variation information? What is the comparable
database at NCBI?
10. Which tool developed by EBI can be used to browse and visualize genome sequencing
data? List the attributes of the tool.
11. Where would you find datasets related to immunologically relevant biomolecules?
12. What is metagenomics? Which database contains the metagenome data? With the help
of flowchart list the steps taken to retrieve and browse metagenome data.
13. What are the various databases for protein and their associated analysis tools at EBI?
14. Which database would you explore to retrieve and analyse gene expression datasets?
15. How do the two databases ArrayExpress and EGA differ from each other?
18. List the features associated with PANDIT and CSA databases.
19. Determination and prediction of protein structure is an important feature for proteomic
research. Which databases in EBI cater to such information?
20. Comment on some of the modules and features associated with EMBOSS.
21. How would you perform a pairwise comparison of DNA using BLASTN?
22. How would you perform BLASTP at EBI? Comment on the features of the results that are
obtained with BLASTP.
Glossary
Summary
The European Molecular Biology Laboratory (EMBL) was established in 1974, as a need was
felt for a parallel institute for hosting and developing computational data and tools.
Five institutes meet the overall goals and objectives of EMBL, with the
European Bioinformatics Institute (EBI) dedicated to bioinformatics research. A
collaborative effort under the International Nucleotide Sequence Database Collaboration
(INSDC) allows EMBL-EBI to share data and resources with NCBI (U.S.A) and DDBJ (Japan).
At EMBL-EBI, the databases are divided into Literature, Ontology, Functional Genomics,
Nucleotides, Pathways, Proteins, Proteomics, Small Molecules and Structures. The
nucleotide database is further sub-divided into ENA-Genome (for genomic sequence), SRA
(for NGS data), DGVa (for genomic variant data) and EGA (for genome-phenome data). The
functional genomics databases are exemplified by ArrayExpress (microarray data);
protein databases include InterPro, UniProt, PANDIT and CSA; and structure databases include
DSSP, FSSP, PDBe, EMDB etc. All these databases are provided with links to analysis tools.
For example, EMBOSS Needle, EMBOSS Water and PromoterWise are tools for pairwise
sequence analysis, while for multiple sequence alignment several tools such as ClustalW,
MUSCLE and MAFFT are available. EMBL-EBI thus provides a complete suite of databases and
tools for a wide range of sequence, structure and functional analyses.
References
Works Cited
1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local
alignment search tool. J. Mol. Biol. 215: 403-410
2. Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) A model of evolutionary change in
proteins. In "Atlas of Protein Sequence and Structure" 5(3), M.O. Dayhoff (ed.), 345-352
3. Henikoff, S. and Henikoff, J. (1992) Amino acid substitution matrices from protein blocks.
Proc. Natl. Acad. Sci. USA 89: 10915-10919
4. Edgar, R.C. (2004) MUSCLE: a multiple sequence alignment method with reduced
time and space complexity. BMC Bioinformatics 5: 113. doi:10.1186/1471-2105-5-113
5. Katoh, K., Misawa, K., Kuma, K. and Miyata, T. (2002) MAFFT: a novel method for rapid
multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30: 3059-3066
6. Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam, H.,
Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J. and Higgins, D.G.
(2007) Clustal W and Clustal X version 2.0. Bioinformatics 23: 2947-2948
7. Magrane, M. and UniProt Consortium (2011) UniProt Knowledgebase: a hub
Suggested Readings
1. Bioinformatics and Functional Genomics: 2nd Edition, Jonathan Pevsner (2009),
Wiley-Blackwell
2. Wang et al. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews
Genetics 10(1): 57-63
Web Links
1. http://www.embl.org/
2. www.ebi.ac.uk
3. http://www.insdc.org
4. www.ensembl.org
5. http://www.ebi.ac.uk/fg/doc/help/UHTS_submissions.html
6. http://www.ebi.ac.uk/pdbe/emdb/
7. http://www.ebi.ac.uk/Tools/psa/emboss_needle/ and
http://www.ebi.ac.uk/Tools/psa/emboss_water/
8. http://www.ebi.ac.uk/Tools/psa/genewise
9. http://www.ebi.ac.uk/Tools/psa/promoterwise
10. http://www.ebi.ac.uk/Tools/msa/clustalw2/
11. http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+LibInfo+-lib+DSSP
12. http://www.ebi.ac.uk/Tools/sss/ncbiblast/