EMBL

Subject: Bioinformatics
Lesson : European Molecular Biology Laboratory (EMBL)
Lesson Developer: Sandip Das
Department/ College: Department of Botany, University of Delhi
Institute of Life Long Learning

Table of Contents
Chapter : European Molecular Biology Laboratory

Introduction
European Bioinformatics Institute
Databases at EBI
Nucleotide databases
Functional Genomics Databases
Protein databases
Structure databases
Sequence Analysis
Pairwise analysis
Multiple sequence alignment
Homology Searching
Summary
Exercises
Glossary
References

Introduction
In keeping with the tremendous growth in the field of computational biology, a need was felt to
establish an independent and parallel research institute that would act not just as a mirror
housing the GenBank nucleotide resources of NCBI, but would also develop matching
databases and analysis tools. The European Molecular Biology Laboratory (EMBL) was
thus established in 1974 and is now supported by funding from 20 member states (mostly
European countries, together with Israel) and from Australia as an associate member state.
EMBL currently operates five research institutes in different countries, with the main
institute at Heidelberg, Germany.
The five institutes of EMBL with their core research activities are (http://www.embl.org/):
a. EMBL Heidelberg (Germany; http://www.embl.de/)
b. EMBL Grenoble (France; http://www.embl.fr/index.php) - Structural Biology
c. EMBL-European Bioinformatics Institute (EBI) Hinxton (UK; http://www.ebi.ac.uk/) - Bioinformatics
d. EMBL Hamburg (Germany; http://www.embl-hamburg.de/index.php) - Structural Biology
e. EMBL Monterotondo (Italy; http://www.embl.it/index.php) - Mouse Biology
The broad goals of EMBL are:
a. Basic research in Molecular biology
b. Training of manpower, i.e. students, scientists and visitors
c. Develop new tools, technologies and methods
d. Offer service to the research community
e. Transfer technology to industry for commercialization

In the following sections we limit ourselves to the bioinformatics research, databases and
facilities of EMBL, which are located at EMBL-EBI.
European Bioinformatics Institute (EBI)
The European Bioinformatics Institute (EBI-EMBL) traces its origin to the EMBL
Nucleotide Sequence Data Library, established in 1980 at Heidelberg with the objective of
creating a database of published nucleotide sequences. It was in fact the world's first public
nucleotide database, preceding NCBI by eight years (NCBI was established in 1988).
Subsequently, in 1992, EMBL decided to establish EBI as a dedicated research-cum-analysis
facility at the Wellcome Trust Genome Campus, Hinxton (UK), in close proximity to the
Sanger sequencing centre.
At present, EBI-EMBL houses databases and provides services and analysis tools for all major
research disciplines requiring computational support. In addition, EBI is a partner in, and
coordinator for, the International Nucleotide Sequence Database Collaboration (INSDC;
http://www.insdc.org) for public-domain nucleotide sequence information, together with
GenBank at NCBI (www.ncbi.nlm.nih.gov) and the DNA Data Bank of Japan (DDBJ;
www.ddbj.nig.ac.jp).
The following are the broad categories of databases at EBI-EMBL:
a. Biological Ontologies
b. Literature
c. Functional Genomics or microarray
d. Nucleotides
e. Pathways and Networks
f. Protein
g. Proteomics
h. Small Molecules
i. Structure

Figure: Web portals of EMBL and EBI
Source: http://www.embl.org/ , http://www.ebi.ac.uk/
Databases at EBI
The following section will deal with selected databases of EBI-EMBL:
Nucleotide databases
a. European Nucleotide Archive (ENA): ENA receives nucleotide data from a variety of
sources, including small-scale sequencing studies, sequencing centres and the INSDC partners
(i.e. GenBank and DDBJ). In order to better manage the sequencing resources, ENA has been
divided into several sub-databases, such as:
ENA-Genome - for genome sequencing data
Sequence Read Archive (SRA) - for Next Generation Sequencing (NGS) data
EMBL-Bank - for assembled and annotated sequence data
Note that nucleotide data should be submitted to only one of GenBank, EBI or DDBJ and not
to all of them, as data submitted to any one of these databases are automatically replicated
to the other two.

Figure: ENA database with information about sequences
Source: http://www.ebi.ac.uk/ena/

b. DGVa: The Database of Genomic Variants archive (DGVa) is a publicly accessible
database that stores information about genomic structural variants that have a role in
causing disease. Such variants may:
range in size from a few nucleotides to several kilobases or even megabases,
be structural rearrangements, i.e. insertions, deletions and translocations, or
be copy number variants (CNVs).
The DGVa is analogous to the dbVar database of NCBI. The data at DGVa can be accessed
via the Ensembl portal (www.ensembl.org).

Figure : DGVa database accessed through ENSEMBL portal
Source: www.ensembl.org

c. EGA: The European Genome-phenome Archive (EGA) stores data from studies that
are carried out with the objective of understanding the links between genotype and
phenotype, especially in biomedical research. This database is analogous to the dbGaP
database at NCBI. Such data may have been generated from genome-wide association
studies (GWAS). As the studies and datasets generally deal with disorders such as
cancer, coronary artery defects, hypertension, rheumatoid arthritis and diabetes, and
contain information about the patients and subjects taking part in the study, strict
control over submission and public access is implemented on ethical grounds to prevent
misuse of data.
Figure : The European Genome-phenome Archive database
Source: https://www.ebi.ac.uk/ega/

A Data Access Committee (DAC) decides on a case-by-case basis whether users may access an
EGA dataset, and requires the signing of an agreement before the dataset is downloaded and
used.
d. ENA-Genome: This database contains completed genome sequence data from a
variety of organisms and genetic elements, such as:
Archaea and archaeal viruses
Bacteria
Eukaryotes
Organelles
Phages
Plasmids
Viroids
EMBL-EBI developed the Ensembl Genomes tool to browse, analyse and visualize
genome sequencing data. Currently, close to 350 completed genome
sequences are available for browsing, analysis and downloading. The Ensembl Genomes
server provides tools for analysis at all levels of genome
organization: whole genome, chromosome, genome segment, gene and
transcript. The genome visualization and analysis tool at Ensembl Genomes also
provides links to molecular function, gene ontology, protein summary and structure
tables.

Figure : Genome browser at ENSEMBL provides genome visualization and analysis tool
at various levels of genome organization
Source: http://plants.ensembl.org/index.html

e. Several other databases, such as the Immuno Polymorphism Database (IPD) and its
associated resources (IMGT/HLA, IMGT/LIGM, IPD-MHC, IPD-KIR etc.), Metagenomics and
patent data resources, are also part of the nucleotide resources at EBI-EMBL.
The IMGT/HLA database is the nucleotide sequence database for the human major
histocompatibility complex (HLA). This database is a part of the international
ImMunoGeneTics project (IMGT), and the data are reported under the
following categories of alleles
(http://www.ebi.ac.uk/ipd/imgt/hla/stats.html):
HLA Class I alleles (6725)
HLA Class II alleles (1771)
HLA alleles (8496; Class I and Class II combined)
Other non-HLA alleles (148)
Confidential alleles (8)
Alignment tools built into the database allow users to perform analyses and detect
polymorphisms at HLA loci.
IMGT/LIGM is similarly a database for immunoglobulins and T-cell receptors.
IPD-MHC contains sequences of the major histocompatibility complex for a large
number of species.
IPD-HPA is the database for human platelet antigens.
IPD-KIR is the database for human Killer-cell Immunoglobulin-like Receptors and
contains information about 614 alleles (http://www.ebi.ac.uk/ipd/kir/stats.html).
EBI Metagenomics contains sequence information from microflora samples that
have been collected from various environments. Examples include core
gut microflora, aquatic microflora from Antarctica, glaciers, ocean samples and meat
samples. The metagenome sequences are analysed to reveal the
frequency of predicted CDS (coding DNA sequences), their GO (Gene Ontology)
annotation, and putative proteins with biochemical, cellular and molecular functions.
Figure : The IPD and Metagenome web portal at EBI-EMBL
Source: https://www.ebi.ac.uk/metagenomics/

Figure : Analysis of sequence data at Metagenome to reveal the genetic makeup of the sample
Source: https://www.ebi.ac.uk/metagenomics/
Functional Genomics database:
ArrayExpress: This database contains functional genomics data, including microarray data
that are MIAME compliant (see the chapter on NCBI for MIAME information). As of December
2012, the database contained datasets from 34145 experiments. ArrayExpress also
contains a Gene Expression Atlas with datasets from 3558 experiments, comprising 99484
assays performed under 20806 conditions. Both the ArrayExpress Experiments archive and
the Gene Expression Atlas provide a searchable browser for users to access
data.

Figure : ArrayExpress portal for functional genomics data
Source: https://www.ebi.ac.uk/arrayexpress/
ArrayExpress not only accepts microarray data, but is also a central repository for sequence
data generated as part of functional genomics studies. For example,
sequence data from non-human and non-identifiable human experiments, such as RNA
sequencing (RNA-seq) and ChIP-seq data generated via high-throughput technologies, are also
deposited at ArrayExpress. Data from potentially identifiable human subjects
should instead be submitted to the EGA (European Genome-phenome Archive) database
(http://www.ebi.ac.uk/fg/doc/help/UHTS_submissions.html). The sequence data submitted to
ArrayExpress are eventually transferred to the ENA (European Nucleotide Archive), and the
metadata from EGA can also be accessed via ArrayExpress.
Protein Database:
There are several databases at EBI catering to the various bioinformatics needs of
protein analysis.
InterPro is the primary, integrated database for protein families, domains and
functional sites. InterPro collates information from various other resources to prepare
and catalogue information about protein structure, domain, function, signature and other
features, and allows users to search and perform analyses. InterProScan is the software
that allows users to analyse their protein sequences against the InterPro database.
UniProt is the centralized universal database for all protein sequences. Each protein
sequence is annotated with functional details, amino acid sequence, taxonomic description
and other information from related disciplines. UniProt has several sub-databases, such as:
UniProt Archive (UniParc): contains non-redundant protein sequences from all
public databases such as EMBL-WGS, WormBase, USPTO, EPO, FlyBase, PIR etc.
UniProt Reference Clusters (UniRef): clusters of similar protein sequences, each
represented by a single sequence, giving a representative subset of the UniProt sequences
UniProtKB/Swiss-Prot: the manually annotated and reviewed section of the UniProt
Knowledgebase (UniProtKB), with annotations that include Gene Ontology terms
UniProt Metagenomic and Environmental Sequences (UniMES): protein database for
environmental and metagenomic samples
The database termed PANDIT (Protein and Associated Nucleotide Domains with
Inferred Trees) is no longer updated but is still available for browsing. It
contains a collection of multiple sequence alignments of protein domains, together with
their inferred phylogenetic trees.
Figure: The UniProt database provides a collection of non-redundant protein sequences,
whereas PANDIT is a pre-aligned database of proteins based on their domains and covers a
wide phylogenetic range.
Source: http://www.uniprot.org/ , http://www.ebi.ac.uk/goldman-srv/pandit/
The CSA (Catalytic Site Atlas) contains a collection of catalytic sites for proteins and
enzymes from experimental and structural data. CSA contains information for nearly 29,000
entries.
Figure : Catalytic site atlas compiles information about catalytic sites of various enzymes
based on experimental and structural data
Source: http://www.ebi.ac.uk/thornton-srv/databases/CSA/

Structure databases:
Structural information for a variety of molecules, including proteins and other
macromolecules, is grouped under the Structure databases.
a. DSSP database stores nearly 84000 secondary structure assignments for Protein
Data Bank (PDB) entries (http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+LibInfo+-lib+DSSP)
b. FSSP (Families of Structurally Similar Proteins) stores structure-based alignments
for PDB entries
c. HSSP (Homology-derived Secondary Structure of Proteins) is a database that merges 2-D
and 3-D (structural) information with 1-D (sequence) information
d. PDBe (Protein Data Bank in Europe) is a database for experimentally derived
structures of proteins and other biomolecules. The structure files can be viewed using
free viewers such as RasMol (rasmol.org/)
e. EMDB (Electron Microscopy Data Bank) stores electron microscopy
structural information for macromolecular complexes and subcellular structures. It
currently stores nearly 1570 structures (http://www.ebi.ac.uk/pdbe/emdb/)
f. PDBeNMR stores NMR-derived structures in PDBe
g. PDBeChem is the central repository for chemical and ligand molecules found in
PDBe and currently has over 15000 ligand structures (http://www.ebi.ac.uk/pdbe-srv/pdbechem/)
Figure: A few examples of structures from the EMDB and PDBe databases at EBI-EMBL
Source: http://www.ebi.ac.uk/pdbe/emdb/ , http://www.ebi.ac.uk/pdbe/

Sequence Analysis
Sequence analysis can be performed at EBI for several objectives, including pairwise
alignment and multiple alignment.
Pairwise alignment:
Global alignment of nucleotide and protein sequences can be performed with the
Needleman-Wunsch algorithm via the EMBOSS Needle web server
(http://www.ebi.ac.uk/Tools/psa/emboss_needle/). The two sequences to be
aligned are either pasted into the input window or uploaded as FASTA
files, after which the job is submitted.
Local alignment of two sequences can be performed via EMBOSS Water, which
implements the Smith-Waterman algorithm
(http://www.ebi.ac.uk/Tools/psa/emboss_water/)
GeneWise compares a protein sequence to a DNA sequence
(http://www.ebi.ac.uk/Tools/psa/genewise/)
PromoterWise compares two DNA sequences that may have undergone
inversions and translocations, and is thus considered useful for detecting cis elements
such as promoters (http://www.ebi.ac.uk/Tools/psa/promoterwise/)
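The difference between global (Needleman-Wunsch) and local (Smith-Waterman) alignment can be seen in a minimal dynamic-programming sketch that returns only the optimal score. It is a simplification and not the EMBOSS implementation: the EMBOSS programs use substitution matrices and separate gap-open and gap-extend penalties, whereas the function below (align_score, a name chosen here for illustration) assumes a fixed match/mismatch score and a single linear gap penalty.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return the optimal pairwise alignment score of sequences a and b.

    local=False: Needleman-Wunsch (global) - both sequences are aligned
                 end to end, so leading and trailing gaps are penalised.
    local=True:  Smith-Waterman (local) - cell scores never drop below
                 zero and the best cell anywhere in the matrix is returned.
    Scoring values are illustrative assumptions (linear gap penalty).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment: the first row and column accumulate gap penalties.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                cell = max(cell, 0)          # a local alignment may restart
                best = max(best, cell)
            score[i][j] = cell

    return best if local else score[-1][-1]


# Two sequences sharing the core "GATTACA" but with unrelated flanks.
global_score = align_score("TTTGATTACATTT", "CCCGATTACACCC")
local_score = align_score("TTTGATTACATTT", "CCCGATTACACCC", local=True)
```

The local score reflects only the shared core, while the global score is pulled down by the mismatching flanks; this is why a local method is preferred when only part of two sequences is expected to be similar.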

Figure: Pairwise global alignment using EMBOSS Needle, and DNA-protein comparison using
GeneWise
Source: http://www.ebi.ac.uk/Tools/psa/
Multiple sequence alignment:

Multiple sequence alignment attempts to achieve the following objectives:
a. Identify conserved regions, which are slow evolving
b. Identify diverged regions, which are rapidly evolving
c. Identify functional regions such as domains, motifs and invariant residues
d. Perform molecular phylogeny
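Once an alignment is available, conserved and diverged regions can be located by scoring each column. The sketch below is one simple way to do this, assuming the aligned sequences are plain strings of equal length with "-" as the gap character; the function names and the toy alignment are illustrative only and are not part of any EBI tool.

```python
from collections import Counter


def column_conservation(aligned):
    """Fraction of sequences sharing the most common residue in each column.

    aligned: list of equal-length strings taken from a multiple sequence
    alignment, with '-' marking gaps. Gaps never count as the shared residue,
    so a gapped column cannot reach a score of 1.0.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


def invariant_columns(aligned):
    """Return 0-based indices of columns where every sequence has the same residue."""
    return [i for i, s in enumerate(column_conservation(aligned)) if s == 1.0]


# Toy alignment (hypothetical sequences).
alignment = ["ACGT-A", "ACCTTA", "ACGTTA"]
conservation = column_conservation(alignment)
invariant = invariant_columns(alignment)
```

Columns with a score of 1.0 are candidates for invariant residues; runs of low-scoring columns point to rapidly diverging regions.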
ClustalW: ClustalW is a progressive method of sequence alignment and is available as a
web-based service at http://www.ebi.ac.uk/Tools/msa/clustalw2/

Figure : A multiple sequence alignment output obtained with the progressive alignment tool
Clustal
Source: http://www.ebi.ac.uk/Tools/msa/clustalw2/

MAFFT (Multiple Alignment using Fast Fourier Transform) is a fast multiple
sequence alignment program, available at http://www.ebi.ac.uk/Tools/msa/mafft/.
The third multiple sequence alignment tool at EBI is MUSCLE (Multiple Sequence
Comparison by Log-Expectation), which is reported to be both fast and accurate compared
with other MSA tools.

Figure : A MAFFT output for multiple sequence alignment
Source: http://www.ebi.ac.uk/Tools/msa/mafft/
Homology Searches: The local alignment tool BLAST can be used
to search for sequence homology or similarity.
a. Launch BLASTN or BLASTP from
http://www.ebi.ac.uk/Tools/sss/ncbiblast/
b. Choose the program, i.e. BLASTN or BLASTP.
c. Enter an appropriate query into the box (nucleotide for
BLASTN and protein for BLASTP).
d. Select the appropriate database to be searched.
e. Initiate the homology search by pressing the submit button.
The results of BLASTN include
i. a summary overview of all the subjects that have matches to the query with
ranking based on the percent similarity, query coverage and e-value,
ii. a pairwise alignment between the query and the subject, and
iii. a graphical overview
The results of BLASTP are richer and include
i. a summary overview of all the subjects that have matches to the query with
ranking based on the percent similarity, query coverage and e-value,
ii. a pairwise alignment between the query and the subject, and
iii. a graphical overview
iv. Link to Gene Expression database (ArrayExpress) for expression profile of the
transcripts (subject)
v. Genome view with location of the subject
vi. Protein family information
vii. Gene Ontology Information i.e. functional classification
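The e-value used for ranking hits estimates how many alignments with at least the observed score would be expected by chance in a search of that size, so smaller values indicate more significant matches. A minimal sketch, assuming the standard relation E = m x n x 2^(-S'), where S' is the bit score and m and n are the query and database lengths: BLAST itself works with effective (edge-corrected) lengths, and the subject names and numbers below are hypothetical.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score.

    Uses E = m * n * 2**(-S'), with raw (not effective) lengths, which is a
    simplifying assumption of this sketch.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def rank_hits(hits, query_len, db_len):
    """Sort (subject_id, bit_score) pairs by ascending e-value (best hit first)."""
    scored = [
        (subject_id, expect_value(bits, query_len, db_len))
        for subject_id, bits in hits
    ]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical hits for a 100-residue query against a 1,000,000-residue database.
hits = [("subject_A", 30.0), ("subject_B", 50.0), ("subject_C", 42.5)]
ranked = rank_hits(hits, query_len=100, db_len=1_000_000)
```

Because the e-value depends on database size, the same alignment receives a larger e-value when a larger database is searched, which is worth keeping in mind when comparing results across databases.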

Figure: Snapshots of the various output windows of BLASTN at EBI-EMBL

Figure: Snapshots of BLASTP result output window-1
Source: http://www.ebi.ac.uk/Tools/sss/ncbiblast/nucleotide.html

Figure : Snapshots of BLASTP result output windows-2 showing links to genome view,
protein families and classification at UniProt and Gene Ontology (functional classification)
Source: http://www.ebi.ac.uk/Tools/sss/ncbiblast/

Exercises
1. What do EMBL and EBI stand for?
2. What are the objectives of EBI/EMBL?
3. EMBL and EBI were established in the years ------ and ------- respectively.
4. What are the various institutes and core research areas of EMBL?
5. NCBI, EMBL and DDBJ are coordinated by ------------------ for public domain nucleotide
sequence information.
6. What are the categories of databases at EBI?
7. What is ENA? What are the various sub-databases of ENA?
8. Which database stores genomic structural variation information? What is the comparable
database at NCBI?
9. Genotype-phenotype relationship data can be found in ------------- database.
10. Which tool developed by EBI can be used to browse and visualize genome sequencing
data? List the attributes of the tool.
11. Where would you find datasets related to Immunologically relevant biomolecules?
12. What is metagenomics? Which database contains the metagenome data? With the help
of a flowchart, list the steps taken to retrieve and browse metagenome data.
13. What are the various databases for protein and their associated analysis tools at EBI?
14. Which database would you explore to retrieve and analyse gene expression datasets?
15. How do the two databases ArrayExpress and EGA differ from each other?
16. What are the various databases for protein at EBI?
17. How would you retrieve a protein sequence from EBI?
18. List the features associated with PANDIT and CSA databases.
19. Determination and prediction of protein structure is an important feature for proteomic
research. Which databases in EBI cater to such information?

20. Comment on some of the modules and features associated with EMBOSS.
21. How would you perform Pairwise comparison of DNA using BLASTN?
22. How would you perform BLASTP at EBI? Comment on the features of the results that are
obtained from BLASTP.
23. What are the various multiple sequence alignment tools at EBI?

24. Expand the following:
a. BLASTN
b. DGVa
c. ENA
d. NGS
e. SRA
f. GWAS
g. GEO
h. CSA
i. HSSP
j. FSSP
25. Prepare a list of comparable databases between NCBI and EBI.
Glossary
a. EMBOSS: European Molecular Biology Open Software Suite, a collection of free,
open-source software developed by EMBNet for bioinformatics analysis
b. EST: Expressed Sequence Tags are generated through single-pass sequencing of the 5′
and 3′ ends of cDNA clones from a cDNA library, and are a rapid and inexpensive
way to get a snapshot of the transcript profile and generate sequence data
c. Metagenomics: The study of genetic material directly recovered from
environmental samples without first isolating or purifying the source organisms from
the milieu, mixture or consortium of organisms. Environmental samples may
include those from soil (soil microflora), ocean, water bodies, gut, skin etc.
d. MHC: Major histocompatibility complex; cell-surface antigens found on white blood
cells (WBC) as well as on other cells of the body

e. MIAME: Minimum Information About a Microarray Experiment, an internationally
accepted standard for reporting microarray experiments.
The MIAME guidelines require researchers to record and submit the experimental
design, sample annotation, raw data and processed data. Further details can be
retrieved at http://www.mged.org/Workgroups/MIAME/miame_2.0.html or at
http://www.ncbi.nlm.nih.gov/geo/info/MIAME.html
f. Next Generation Sequencing (NGS): All new sequencing technologies that are not
dependent on Sanger's chain-termination method are broadly grouped under NGS.
Some of these include reversible chain-termination reactions, single-molecule
sequencing and ligation-based sequencing.
g. RNA-Seq: A technology that allows researchers to sequence RNA or the
transcriptome using NGS-based deep sequencing (Wang et al. 2009).
Summary
The European Molecular Biology Laboratory (EMBL) was established in 1974, as a need was
felt for a parallel institute for hosting and developing computational data and tools.
Five institutes together meet the overall goals and objectives of EMBL, with the
European Bioinformatics Institute (EBI) dedicated to bioinformatics research. A
collaborative effort under the International Nucleotide Sequence Database Collaboration
(INSDC) allows EMBL-EBI to share data and resources with NCBI (U.S.A) and DDBJ (Japan).
At EMBL-EBI, the databases are divided into Literature, Ontology, Functional Genomics,
Nucleotides, Pathways, Proteins, Proteomics, Small Molecules and Structures. The
nucleotide databases include ENA-Genome (for genomic sequence), SRA
(for NGS data), DGVa (for genomic variant data) and EGA (for genome-phenome data). The
functional genomics databases are exemplified by ArrayExpress (microarray data);
protein databases include InterPro, UniProt, PANDIT and CSA; and structure databases include
DSSP, FSSP, PDBe, EMDB etc. All these databases are provided with links to analysis tools.
For example, EMBOSS Needle, EMBOSS Water and PromoterWise are tools for pairwise
sequence analysis, while for multiple sequence alignment several tools such as ClustalW,
MUSCLE and MAFFT are available. EMBL thus provides a complete suite of databases and tools
for a wide range of sequence, structure and functional analysis.
References
Works Cited

1. Altschul S.F., Gish W., Miller W., Myers E.W. and Lipman D.J. (1990) Basic local
alignment search tool. J. Mol. Biol. 215: 403-410
2. Dayhoff M.O., Schwartz R.M. and Orcutt B.C. (1978) A model of evolutionary change in
proteins. In "Atlas of Protein Sequence and Structure" 5(3), M.O. Dayhoff (ed.), 345-352
3. Henikoff S. and Henikoff J. (1992) Amino acid substitution matrices from protein blocks.
Proc. Natl. Acad. Sci. USA 89: 10915-10919
4. Edgar R.C. (2004) MUSCLE: a multiple sequence alignment method with reduced
time and space complexity. BMC Bioinformatics 5:113. doi:10.1186/1471-2105-5-113
5. Katoh, Misawa, Kuma, Miyata (2002) MAFFT: a novel method for rapid multiple
sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30: 3059-3066
6. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H,
Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG.
(2007) Clustal W and Clustal X version 2.0. Bioinformatics 23: 2947-2948
7. Magrane M. and UniProt Consortium (2011) UniProt Knowledgebase: a hub
of integrated protein data. Database (Oxford) 2011, bar009
8. Hunter et al. (2012) InterPro in 2011: new developments in the family and
domain prediction database. Nucleic Acids Res. 40(D1): D306-D312
Suggested Readings
1. Pevsner J. (2009) Bioinformatics and Functional Genomics, 2nd Edition. Wiley-Blackwell
2. Wang et al. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews
Genetics 10(1): 57-63
Web Links
1. http://www.embl.org/
2. www.ebi.ac.uk
3. http://www.insdc.org
4. www.ensembl.org
5. http://www.ebi.ac.uk/fg/doc/help/UHTS_submissions.html

6. http://www.ebi.ac.uk/pdbe/emdb/
7. http://www.ebi.ac.uk/Tools/psa/emboss_needle/
8. http://www.ebi.ac.uk/Tools/psa/emboss_water/
9. http://www.ebi.ac.uk/Tools/psa/genewise
10. http://www.ebi.ac.uk/Tools/psa/promoterwise
11. http://www.ebi.ac.uk/Tools/msa/clustalw2/
12. http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+LibInfo+-lib+DSSP
13. http://www.ebi.ac.uk/Tools/sss/ncbiblast/
