BI Biological+Databases

PUBLICLY AVAILABLE BIOLOGICAL DATABASES
PDB
The PDB archive contains information about experimentally-determined structures of proteins, nucleic acids, and
complex assemblies. As a member of the wwPDB, the RCSB PDB curates and annotates PDB data according to
agreed upon standards.
The RCSB PDB also provides a variety of tools and resources. Users can perform simple and advanced searches
based on annotations relating to sequence, structure and function. These molecules are visualized, downloaded,
and analyzed by users who range from students to specialized scientists.
The PDB is the Protein Data Bank, a single worldwide repository for 3D structural data of biological molecules. A PDB
file, typically with a "pdb" file extension, contains the 3D structural data of a particular biological molecule. In short,
a PDB file is broken into two sections: (i) a header that contains background information on the molecule in
question, such as authors and experimental conditions, and (ii) 3D coordinate data that contain the vital experimental
data in the form of 3D Cartesian coordinates, B-factors, atom information, and more.
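The two-section layout described above can be illustrated with a minimal Python sketch. It assumes the fixed-column layout of ATOM/HETATM records from the published PDB file format; the sample entry, its author name and the `parse_pdb` helper are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    b_factor: float


# Hypothetical, heavily simplified entry: header records first, then coordinates.
SAMPLE_PDB = "\n".join([
    "HEADER    EXAMPLE ENTRY (HYPOTHETICAL)",
    "AUTHOR    A.N.AUTHOR",
    "EXPDTA    X-RAY DIFFRACTION",
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "END",
])


def parse_pdb(text):
    """Split a PDB file into (header records, list of atoms)."""
    header = {}
    atoms = []
    for line in text.splitlines():
        record = line[0:6].strip()          # record name, columns 1-6
        if record in ("ATOM", "HETATM"):
            atoms.append(Atom(
                serial=int(line[6:11]),     # columns 7-11
                name=line[12:16].strip(),   # columns 13-16
                res_name=line[17:20].strip(),
                chain=line[21],             # column 22
                res_seq=int(line[22:26]),
                x=float(line[30:38]),       # Cartesian coordinates, in angstroms
                y=float(line[38:46]),
                z=float(line[46:54]),
                b_factor=float(line[60:66]),
            ))
        elif record in ("HEADER", "AUTHOR", "EXPDTA"):
            # Only three header record types are collected in this sketch.
            header.setdefault(record, []).append(line[10:].strip())
    return header, atoms


header, atoms = parse_pdb(SAMPLE_PDB)
print(header["EXPDTA"], atoms[0].name, atoms[0].b_factor)
```

Real entries carry many more header record types and thousands of coordinate lines; a maintained parser is the safer choice for production work.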
Protein Data Bank files, containing some form of macromolecular coordinate set, are visualized via graphic
computing. A myriad of advanced molecular visualization programs are used in academic and industrial settings, and
some of these are specific to the field of research or the techniques used to collect the coordinate data. For
example, X-ray crystallographers use O, XtalView, MAIN, or other programs for crystallographic modeling. These
allow the researcher to model the coordinate set into the electron density from the collected X-ray data.
However, PDB files are "final" deposited coordinates. This means that the depositor has made their best effort to
provide a final model that is as accurate as they could make it. Programs like Molscript, PyMOL,
SwissPDBview/DeepView, Molmol, SETOR, DINO and others are routinely used to visualize deposited
coordinates.
In order to produce molecular graphics, one requires the following: a graphics workstation, some form of
molecular graphics software, and, most importantly, training.
Symmation has been formally trained in X-ray crystallography and bioinformatics. In addition to 8 years of
experience, we have the necessary software and computer equipment. We are able to provide a complete solution to
molecular graphics visualization that includes raw data interpretation, remodeling, refinement, structure analysis,
and publication-quality graphics.
GenBank
GenBank (1) is a comprehensive public database of nucleotide sequences and
supporting bibliographic and biological annotation, built and distributed by the
National Center for Biotechnology Information (NCBI), a division of the National
Library of Medicine (NLM), located on the campus of the US National Institutes of
Health (NIH) in Bethesda, MD. NCBI builds GenBank primarily from the submission
of sequence data from authors and from the bulk submission of expressed
sequence tag (EST), genome survey sequence (GSS) and other high-throughput
data from sequencing centers. The US Office of Patents and Trademarks (USPTO)
also contributes sequences from issued patents. GenBank incorporates sequences
submitted to the EMBL Data Library (2) in the United Kingdom and the DNA
Databank of Japan (DDBJ) (3) as part of a long-standing international collaboration
between the three databases in which data are exchanged daily to ensure a uniform
and comprehensive collection of sequence information. NCBI makes the GenBank
data available at no cost over the Internet, via FTP and a wide range of web-based
retrieval and analysis services, which operate on the GenBank data (4).
ORGANIZATION OF THE DATABASE

GenBank continues to grow at an exponential rate with 7.9 million new sequences
added over the past 12 months. As of Release 143 in August 2004, GenBank
contained over 41.8 billion nucleotide bases from 37.3 million individual sequences.
Complete genomes (http://www.ncbi.nlm.nih.gov/Genomes/index.html) represent a growing
portion of the database, with over 50 of more than 180 complete microbial
genomes in GenBank deposited over the past year. The number of eukaryote
genomes for which coverage and assembly are good continues to increase as well,
with over 20 such assemblies now available, including that of the reference human
genome.
SEQUENCE-BASED TAXONOMY
Database sequences are classified and can be queried using a comprehensive
sequence-based taxonomy (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html)
developed by NCBI in collaboration with EMBL and DDBJ and with the valuable
assistance of external advisers and curators. Over 165 000 named species are
represented in GenBank and new species are being added at the rate of over 2000
per month. About 19% of the sequences in GenBank are of human origin and 13%
of all sequences are human ESTs. After Homo sapiens, the top species in GenBank
in terms of number of bases are Mus musculus, Rattus norvegicus, Danio rerio, Zea
mays, Oryza sativa, Drosophila melanogaster, Gallus gallus and Canis familiaris.
Each GenBank entry includes a concise description of the sequence, the scientific
name and taxonomy of the source organism, bibliographic references and a table of
features (http://www.ncbi.nlm.nih.gov/collab/FT/index.html) listing areas of biological
significance, such as coding regions and their protein translations, transcription
units, repeat regions, and sites of mutations or modifications.
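The parts of an entry listed above (description, source organism, feature table) can be picked out of a flat-file record with a short Python sketch. It assumes the usual GenBank flat-file layout, with keywords starting in column 1, feature keys indented by five spaces and qualifiers indented by 21 spaces. The sample record and the `parse_genbank` helper are hypothetical.

```python
# Hypothetical record, trimmed to the lines this sketch reads.
SAMPLE_GB = "\n".join([
    "LOCUS       EXAMPLE01                 60 bp    DNA     linear   SYN 01-JAN-2004",
    "DEFINITION  Hypothetical example record.",
    "SOURCE      synthetic construct",
    "  ORGANISM  synthetic construct",
    "FEATURES             Location/Qualifiers",
    "     source          1..60",
    '                     /organism="synthetic construct"',
    "     CDS             1..60",
    '                     /product="hypothetical protein"',
    "ORIGIN",
    "//",
])


def parse_genbank(text):
    """Extract the definition, organism and feature table from one record."""
    record = {"definition": "", "organism": "", "features": []}
    in_features = False
    for line in text.splitlines():
        if line.startswith("DEFINITION"):
            record["definition"] = line[12:].strip()
        elif line.startswith("  ORGANISM"):
            record["organism"] = line[12:].strip()
        elif line.startswith("FEATURES"):
            in_features = True
        elif line.startswith("ORIGIN") or line.startswith("//"):
            in_features = False
        elif in_features and line[:5] == "     " and line[5:6].strip():
            # Feature key line: key, then location.
            key, location = line.split()[:2]
            record["features"].append(
                {"key": key, "location": location, "qualifiers": {}}
            )
        elif in_features and line.strip().startswith("/") and record["features"]:
            # Qualifier line belonging to the most recent feature.
            name, _, value = line.strip()[1:].partition("=")
            record["features"][-1]["qualifiers"][name] = value.strip('"')
    return record


record = parse_genbank(SAMPLE_GB)
print(record["organism"], [f["key"] for f in record["features"]])
```

Multi-line definitions, multi-line locations and the sequence itself are deliberately ignored here.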
The files in the GenBank distribution have traditionally been partitioned into
‘divisions’ that roughly correspond to taxonomic groups such as bacteria (BCT),
viruses (VRL), primates (PRI) and rodents (ROD). In recent years, divisions have
been added to support specific sequencing strategies. These include divisions for
EST, GSS, high-throughput genomic (HTG) and high-throughput cDNA (HTC)
sequences, making a total of 17 divisions. For convenience in file transfer, the
larger divisions, such as the EST and PRI, are partitioned into multiple files for the
bimonthly GenBank releases on NCBI's FTP site.
DNA Data Bank of Japan (DDBJ)
DNA Data Bank of Japan (DDBJ) is the sole nucleotide sequence data bank in Asia that is officially certified to
collect nucleotide sequences from researchers and to issue internationally recognized accession numbers
to data submitters. Since we exchange the collected data with EMBL-Bank/EBI (European Bioinformatics
Institute) and GenBank/NCBI (National Center for Biotechnology Information) on a daily basis, the three
data banks share virtually the same data at any given time. The virtually unified database is called "the
International Nucleotide Sequence Database (INSD)". DDBJ collects sequence data mainly from Japanese
researchers, but also accepts data from, and issues accession numbers to, researchers in any other country.
SWISS-PROT
SWISS-PROT (1) is an annotated protein sequence database established in 1986 and maintained collaboratively,
since 1987, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library (now
the EMBL Outstation-The European Bioinformatics Institute; 2). The SWISS-PROT protein sequence data bank
consists of sequence entries. Sequence entries are composed of different line types, each with their own format.
For standardization purposes the format of SWISS-PROT (3) follows as closely as possible that of the EMBL
nucleotide sequence database.
Annotation
In SWISS-PROT, as in most other sequence databases, two classes of data can be distinguished, the core data and
the annotation. For each sequence entry the core data consists of the sequence data, the citation information
(bibliographical references) and the taxonomic data (description of the biological source of the protein), while the
annotation consists of a description of the following items: (i) function(s) of the protein; (ii) post-translational
modification(s), for example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.; (iii) domains and sites,
for example calcium binding regions, ATP binding sites, zinc fingers, homeobox, kringle, etc.; (iv) secondary
structure; (v) quaternary structure; (vi) similarities to other proteins; (vii) disease(s) associated with deficiency of
the protein; (viii) sequence conflicts, variants, etc.
We try to include as much annotation information as possible in SWISS-PROT. To obtain this information we use, in
addition to the publications that report new sequence data, review articles to periodically update the annotations
of families or groups of proteins. We also make use of external experts, who have been recruited to send us their
comments and updates concerning specific groups of proteins.
We believe that our having systematic recourse both to publications other than those reporting the core data and
to subject referees represents a unique and beneficial feature of SWISS-PROT.
In SWISS-PROT annotation is mainly found in the comment lines (CC), in the feature table (FT) and in the keyword
lines (KW). Most comments are classified by 'topics', an approach which permits easy retrieval of specific
categories of data from the database.
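As a rough illustration of the line-type structure, the Python sketch below groups the lines of an entry by their two-letter code and then indexes the comment (CC) lines by topic. It assumes the flat-file convention of a two-character line code, data starting at column 6, topics introduced by `-!-`, and `//` terminating the entry. The entry content and both helper functions are hypothetical.

```python
# Hypothetical entry; the text of every line is illustrative only.
SAMPLE_ENTRY = "\n".join([
    "ID   EXMP_HUMAN     STANDARD;      PRT;   120 AA.",
    "DE   HYPOTHETICAL EXAMPLE PROTEIN.",
    "CC   -!- FUNCTION: BINDS CALCIUM",
    "CC       (ILLUSTRATIVE TEXT ONLY).",
    "CC   -!- SIMILARITY: BELONGS TO AN EXAMPLE FAMILY.",
    "KW   CALCIUM-BINDING; REPEAT.",
    "FT   DOMAIN       10     45       EXAMPLE DOMAIN.",
    "//",
])


def group_by_line_code(entry):
    """Map each two-letter line code (ID, DE, CC, ...) to its data lines."""
    lines = {}
    for raw in entry.splitlines():
        if raw.startswith("//"):      # end-of-entry marker
            break
        code, data = raw[:2], raw[5:].rstrip()
        lines.setdefault(code, []).append(data)
    return lines


def comment_topics(cc_lines):
    """Index CC lines by topic; continuation lines extend the current topic."""
    topics = {}
    current = None
    for data in cc_lines:
        if data.startswith("-!- "):
            current, _, text = data[4:].partition(": ")
            topics[current] = text
        elif current is not None:
            topics[current] += " " + data.strip()
    return topics


grouped = group_by_line_code(SAMPLE_ENTRY)
print(comment_topics(grouped["CC"])["FUNCTION"])
```

Looking up a single topic, as in the last line, is the kind of category-specific retrieval that the topic classification makes easy.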
Minimal redundancy
Many sequence databases contain, for a given protein sequence, separate entries which correspond to different
literature reports. In SWISS-PROT we try as much as possible to merge all these data, so as to minimize the
redundancy of the database. If conflicts exist between various sequencing reports they are indicated in the feature
table of the corresponding entry.
Integration with other databases
It is important to provide the users of biomolecular databases with a degree of integration between the three types
of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures), as well
as with specialized data collections. SWISS-PROT is currently cross-referenced with 24 different databases. Cross-
references are provided in the form of pointers to information related to SWISS-PROT entries and found in data
collections other than SWISS-PROT. For example, the sample sequence shown in Figure 1 contains data bank
reference (DR) lines that point to EMBL, PIR, OMIM and PROSITE. In this particular example it is therefore possible
to retrieve the nucleic acid sequence(s) that encodes that protein (EMBL), the description of genetic disease(s)
associated with that protein (OMIM) or the pattern specific for that family of proteins (PROSITE).
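A minimal Python sketch of how such DR lines might be turned into pointers follows. It assumes each DR line has the form "database; identifier; identifier." with semicolon-separated fields and a closing period. The accession values shown are placeholders rather than real identifiers, and `parse_cross_references` is a hypothetical helper.

```python
# Placeholder DR lines; none of these identifiers refer to real records.
SAMPLE_DR_LINES = [
    "DR   EMBL; X00000; HSEXAMPLE.",
    "DR   PIR; A00000; A00000.",
    "DR   OMIM; 000000; -.",
    "DR   PROSITE; PS00000; EXAMPLE_PATTERN.",
]


def parse_cross_references(lines):
    """Map each external database name to the identifier lists pointing into it."""
    refs = {}
    for line in lines:
        if not line.startswith("DR   "):
            continue
        # Drop the line code and the closing period, then split the fields.
        fields = [f.strip() for f in line[5:].rstrip().rstrip(".").split(";")]
        database, identifiers = fields[0], fields[1:]
        refs.setdefault(database, []).append(identifiers)
    return refs


refs = parse_cross_references(SAMPLE_DR_LINES)
print(refs["EMBL"])      # identifiers to look up in the nucleotide database
print(refs["PROSITE"])   # identifiers of family-specific patterns
```

Keying the result by database name mirrors the use described above: one lookup yields everything needed to follow the pointer into EMBL, OMIM or PROSITE.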
CATH
Details
The CATH Protein Structure Classification database is a hierarchical classification of protein
structural domains based on four levels:
• Class (C) - Describes the secondary structure composition of the protein. Proteins can be
either Alpha Beta, Mainly Beta, Mainly Alpha, or have Few Secondary Structures.
• Architecture (A) - Describes the gross overall shape of the protein, assigned based on the
general orientation of secondary structural elements. Examples of categories are barrel,
3-layer sandwich, etc. Assignments on this level are made by hand.
• Topology (T) - Structures are grouped into categories based on fold shape and similar
orientation of secondary structural elements, using the SSAP structure comparison
algorithm.
• Homology (H) - These groups of proteins are thought to share a common ancestor based
on sequence and structural similarity. Structures are clustered into the same homologous
superfamily if they are above a certain sequence identity and SSAP score.
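The four levels can be read directly from a CATH identifier. The Python sketch below assumes the dotted numbering CATH uses for its nodes (for example 3.40.50.300) and the class numbering 1 to 4. Neither convention is stated in the text above, so both should be checked against the CATH site. The `parse_cath_code` helper is hypothetical.

```python
# Class numbering as used by CATH (an assumption not stated in the text above).
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}

LEVELS = ("class", "architecture", "topology", "homologous_superfamily")


def parse_cath_code(code):
    """Split a dotted CATH code into the node identifier at each level."""
    parts = code.split(".")
    if not 1 <= len(parts) <= len(LEVELS) or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a CATH code: {code!r}")
    # Each level's identifier is the prefix of the code up to that depth.
    return {
        level: ".".join(parts[: depth + 1])
        for depth, level in enumerate(LEVELS[: len(parts)])
    }


levels = parse_cath_code("3.40.50.300")
print(levels)
print(CATH_CLASSES[int(levels["class"])])
```

Because every level is a prefix of the next, two domains share a topology exactly when their codes agree on the first three numbers.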
PubMed
PubMed, available via the NCBI Entrez retrieval system, was developed by the National Center for
Biotechnology Information (NCBI) at the National Library of Medicine (NLM), located at the U.S. National
Institutes of Health (NIH). Entrez is the text-based search and retrieval system used at NCBI for services
including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, OMIM,
and many others. PubMed provides access to citations from biomedical literature. LinkOut provides access to full-
text articles at journal Web sites and other related Web resources. PubMed also provides access and links to the
other Entrez molecular biology resources.
Publishers participating in PubMed electronically submit their citations to NCBI prior to or at the time of
publication. If the publisher has a web site that offers full-text of its journals, PubMed provides links to that site as
well as biological resources, consumer health information, research tools, and more. There may be a charge to
access the text or information.
Use the Batch Citation Matcher to match citations to PubMed using bibliographic information such as journal,
volume, issue, page number, and year, or the Entrez Programming Utilities that provide access to Entrez data
outside of the regular Web query interface.
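A minimal Python sketch of programmatic access through the Entrez Programming Utilities follows. It only builds request URLs for the `esearch` and `efetch` endpoints and sends nothing over the network. The helper names are hypothetical and the PubMed IDs in the usage lines are placeholders. NCBI's usage guidelines (request rate limits, API keys) apply to any real client and are not handled here.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="pubmed", retmax=20):
    """URL for a text search; the response lists matching record IDs."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_url(ids, db="pubmed", rettype="abstract", retmode="text"):
    """URL that retrieves the records for the given IDs."""
    query = urlencode(
        {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


print(esearch_url("protein data bank"))
print(efetch_url(["1", "2"]))  # placeholder IDs, not real citations
```

The same two calls work for the other Entrez databases by changing `db`, for example to a nucleotide or protein database.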
PubMed Coverage
PubMed provides access to bibliographic information that includes MEDLINE, as well as:
• The out-of-scope citations (e.g., articles on plate tectonics or astrophysics)
from certain MEDLINE journals, primarily general science and chemistry
journals, for which the life sciences articles are indexed for MEDLINE.
• Citations that precede the date that a journal was selected for MEDLINE
indexing.
• Some additional life science journals that submit full text to PubMed Central
and receive a qualitative review by NLM.
For additional information, please see the NLM Fact Sheet: What's the Difference Between MEDLINE and
PubMed?
KEGG
KEGG is an integrated database resource consisting of 16 main databases, broadly categorized into systems
information, genomic information, and chemical information as shown below. Genomic and chemical information
represents the molecular building blocks of life in the genomic and chemical spaces, respectively, and systems
information represents functional aspects of the biological systems, such as the cell and the organism, that are
built from the building blocks. KEGG has been widely used as a reference knowledge base for biological
interpretation of large-scale datasets generated by sequencing and other high-throughput experimental
technologies.
Systems information
• KEGG PATHWAY - Pathway maps for metabolism and other cellular processes, as well as human
diseases; manually created from published materials
• KEGG BRITE - Functional hierarchies (ontologies) representing our knowledge on various aspects
of biological systems; manually created from published materials
• KEGG MODULE - Tighter functional units for pathways and complexes; manually defined
• KEGG DISEASE - List of disease genes and molecules; manually entered from published materials
• KEGG DRUG - Chemical structures and associated information of approved drugs in Japan, USA,
and Europe; manually entered from published materials

Genomic information
• KEGG ORTHOLOGY - KEGG Orthology (KO) groups based on PATHWAY and BRITE; manually defined
• KEGG GENOME - Genome maps and organism information; generated from RefSeq and other public
resources
• KEGG GENES - Gene catalogs of complete genomes with manual annotation; generated from
RefSeq and other public resources
• KEGG SSDB - Sequence similarity scores and best-hit relations; computationally derived from
GENES by pairwise genome comparisons of all protein-coding genes
• KEGG DGENES - Gene catalogs of draft genomes with automatic annotation; generated from web
resources
• KEGG EGENES - Gene catalogs (consensus contigs) of EST data with automatic annotation;
generated from dbEST

Chemical information
• KEGG COMPOUND - Chemical compounds; manually entered from published materials
• KEGG GLYCAN - Glycans; manually entered from published materials
• KEGG REACTION - Chemical reactions; manually defined from ENZYME and PATHWAY
• KEGG RPAIR - Chemical structure transformation patterns; manually defined from REACTION
• KEGG ENZYME - Enzyme nomenclature; generated from ExplorEnz with annotation by KEGG
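The three-way categorization can also be held as a small lookup structure, sketched below in Python. The assignment of each database to a category is read from the table above and should be treated as an assumption wherever the layout is ambiguous. The `category_of` helper is hypothetical.

```python
# Category -> databases, as read from the table above (assumed grouping).
KEGG_DATABASES = {
    "Systems information": ["PATHWAY", "BRITE", "MODULE", "DISEASE", "DRUG"],
    "Genomic information": [
        "ORTHOLOGY", "GENOME", "GENES", "SSDB", "DGENES", "EGENES",
    ],
    "Chemical information": ["COMPOUND", "GLYCAN", "REACTION", "RPAIR", "ENZYME"],
}


def category_of(database):
    """Return the category of a KEGG database, accepting 'KEGG X' or 'X'."""
    name = database.upper()
    if name.startswith("KEGG "):
        name = name[len("KEGG "):]
    for category, members in KEGG_DATABASES.items():
        if name in members:
            return category
    raise KeyError(database)


total = sum(len(members) for members in KEGG_DATABASES.values())
print(total, category_of("KEGG SSDB"))
```

Summing the group sizes is a quick consistency check against the count of 16 main databases given in the text.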
