BiCG 2013 Module2

13-05-23
Canadian Bioinformatics Workshops

bioinformatics.ca
You are free to:

Copy, share, adapt, or re-mix; Photograph, film, or broadcast; Blog, live-blog, or post video of;
This presentation. Provided that:

You attribute the work to its author and respect the rights and licenses associated with its components.
Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero. Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at: http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites
Module 2a
bioinformatics.ca
Module 2a: Cloud Computing with AWS

B.F. Francis Ouellette
http://durtridingurl.blogspot.ca/2011/04/cloud-kingdom.html
Disclaimer
I do not (and will not) profit in any way, shape or form, from any of the brands, products or companies I may mention.
E-mail
francis@oicr.on.ca @bffo #CBWBiCG #CBW2013
Learning Objectives of Module

Introduction to cloud computing
Use of the wiki in this workshop
How to log into the cloud
Review of databases used in bioinformatics, and some used in cancer genomics
UCSC Genome Browser
IGV and other visualization tools
Disk Capacity vs Sequencing Capacity, 1990-2009
[Figure: log-scale plot by year (1990-2012) of hard disk storage (MB/$, doubling time = 14 months), pre-nextgen sequencing (bp/$, doubling time = 19 months), and nextgen sequencing (bp/$, doubling time = 4 months).]
About DNA and computers

We'll hit the $1000 genome during 2013 or 2014, then need to think about the $100 genome.
The doubling time of sequencing is 5-6 months.
The doubling time of storage and network bandwidth is 12 months.
The doubling time of CPU speed is 18 months.
The cost of sequencing a base pair will equal the cost of storing a base pair by 2018.
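To get a feel for what these doubling times imply, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `growth_factor` and the five-year horizon are choices made here, and the 5.5 month figure is simply the midpoint of the 5-6 months quoted above.

```python
def growth_factor(years: float, doubling_time_months: float) -> float:
    """Fold increase after `years` for a quantity that doubles every `doubling_time_months`."""
    return 2 ** (years * 12 / doubling_time_months)


# Doubling times quoted on the slide, in months (sequencing: midpoint of 5-6).
doubling_times = {"sequencing": 5.5, "storage and bandwidth": 12, "CPU speed": 18}

for name, months in doubling_times.items():
    # Print the fold increase over an (arbitrarily chosen) five-year horizon.
    print(f"{name}: x{growth_factor(5, months):,.0f} over 5 years")
```

The point of the comparison is the gap between the curves: sequencing output grows far faster than the storage and compute needed to handle it.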
What is the general biomedical scientist to do?

Lots of data
Poor IT infrastructure in many labs
Where do they go?
Write more grants?
Get bigger hardware?
Look to the sky?
Genomic companies already there!

Complete Genomics pipeline:
ACGTACGT AAGTTCGG ATGGCGTA GTCCCTTT TTGGGGTG TAGTGAGG CGCTGATT CGGAGAG
All of the hard work done here!
Most people already there!

Google Docs
Dropbox
Netflix
Twitter
Amazon Web Services (AWS)

Infinite storage (scalable): S3 (Simple Storage Service)
Compute per hour: EC2 (Elastic Compute Cloud)
Ready when you are
High Performance Computing (HPC)
Multiple football fields of HPC throughout the world
HPC capacity is expanded one container at a time:
http://goo.gl/7PVAl
Some of the challenges with cloud computing:
Not cheap!
Getting files to and from there
Not the best solution for everybody
Standardization
PHI: personal health information & security concerns
In the USA: Patriot Act
Some of the advantages with cloud computing:
At the CBW: we are supported by an AWS in Education grant award from Amazon.
There are better ways of transferring large files, and AWS now makes it free to upload files.
A number of datasets exist on AWS (e.g. 1000 Genomes data).
Many useful bioinformatics AMIs (Amazon Machine Images) exist on AWS, e.g. CloudBioLinux and CloudMan (Galaxy).
Many flavors of cloud are available, not just AWS.
In this workshop:
Some tools (data) are:
on your computer
on the web
on the cloud
You will become efficient at traversing these various spaces, finding the resources you need, and using what is best for you.
There are different ways of using the cloud:
1. Command line (like your own very powerful Unix box)
2. With a web browser (e.g. Galaxy): not in this workshop
Things we have set up:

We loaded data files to an S3 bucket.
We brought up an Ubuntu (Linux) instance, and loaded a whole bunch of software for NGS analysis.
We then cloned this, and made separate instances for everybody in the class.
We've simplified the security: you basically all have the same login and file access, and opened ports. In your own world you would be more secure.
ICGC Controlled Access Datasets

Detailed Phenotype and Outcome data Region of residence Risk factors Examination Surgery Drugs Radiation Sample Slide Specific histological features Analyte Aliquot Donor notes Gene Expression (probe-level data) Raw genotype calls Gene-sample identifier links Genome sequence files
ICGC OA Datasets
Cancer Pathology Histologic type or subtype Histologic nuclear grade Patient/Person Gender Age range Gene Expression (normalized) DNA methylation Genotype frequencies Computed Copy Number and Loss of Heterozygosity Newly discovered somatic variants
http://goo.gl/w4mrV
For this workshop: all on Wiki!

http://bioinformatics.ca/workshop_wiki/
Login: FirstnameLastname
Password: guest
On Mac: Control+
File permissions:
ls -l (long listing)
drwx------+ 67 francis staff 2278 22 May 21:25 ../
-rw-r--r--@  1 francis staff 1696 22 May 21:31 CBWkey.pem
rwx : owner    rwx : group    rwx : world
r read (4)    w write (2)    x execute (1)
Whichever way you add these 3 numbers, you know which integers were used (6 is always 4+2, 5 is 4+1, 4 is by itself, 0 is none of them, etc.).
So, when you have: chmod 600 <file name>
it is rw for the file owner only.
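The octal arithmetic above can be checked with a short Python sketch. The helper name `describe_mode` is made up for this illustration; it simply tests the 4, 2 and 1 bits of each digit and prints the letters that `ls -l` would show.

```python
def describe_mode(octal_digits: str) -> str:
    """Turn a three-digit chmod mode such as "600" into an ls-style string."""
    letters = []
    for digit in octal_digits:           # one digit each for owner, group, world
        value = int(digit, 8)
        letters.append("r" if value & 4 else "-")  # read    = 4
        letters.append("w" if value & 2 else "-")  # write   = 2
        letters.append("x" if value & 1 else "-")  # execute = 1
    return "".join(letters)


print(describe_mode("600"))  # rw-------
print(describe_mode("644"))  # rw-r--r--
print(describe_mode("700"))  # rwx------
```

Mode 644 corresponds to the `-rw-r--r--` shown for CBWkey.pem in the listing, and 600 removes the read bits for group and world.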
Logging in to AWS
Windows
Use your assigned student #
From now on, just double-click CBW to login.
So, at this point:

Your laptop is ready for the workshop
If it is not, you know where to get the information you need
You know how to use the wiki for this workshop
You know where all of the lectures are
You have read all of the pre-lecture material
If not, you know where the papers are, and you are a speed reader
You know how to log in to AWS
We are now on Lunch Break and we are networking, and having a great time with our CBW colleagues. Wish you were here? Register for Canadian Bioinformatics Workshops here: bioinformatics.ca
13-05-24
Module 2b
bioinformatics.ca
Module 2b: Databases and Visualization Tools

B.F. Francis Ouellette
"Nothing in Biology Makes Sense Except in the Light of Evolution" Theodosius Dobzhansky, 1973
http://goo.gl/doHu7
With apologies to Theodosius: "Nothing in Bioinformatics Makes Sense Except in the Light of Evolution"
Francis Ouellette, 2012

http://goo.gl/doHu7
Q: Why do we have Bioinformatics? A: Open Data from Genomic and Proteomics Technologies
BLAST would not have been invented if GenBank or the Atlas of Protein Sequence and Structure did not exist beforehand.
DDBJ/EMBL/GenBank is the open access source of all publicly available DNA sequences.
The Atlas of Protein Sequence and Structure, published from 1965 to 1978 under the editorship of Margaret Dayhoff, later became the PIR, and then the UniProtKB database.
How do you define bioinformatics? (Think, pair, share.)
Bioinformatics is about integrating biological themes together with the help of computer tools and biological databases, and gaining new knowledge about the system in study.
Open Source Open Access
Open Data
Bioinformatics reagent: Databases

Organized array of information
Place where you put things in, and (if all is well) you should be able to get them out again.
Resource for other databases and tools.
Simplify the information space by specialization.
Bonus: Allows you to make discoveries.
Important question to ask: what is the data model?
Bioinformatics experiments:
Reagents: Sequence, Sequence Databases
Method: BLAST search
  Query-Database    Program
  P-P               BLASTP
  N-P               BLASTX
  P-N               TBLASTN
  N-N               BLASTN
  N(P)-N(P)         TBLASTX
Interpretation: Alignment, Similarity, Hypothesis testing
Know your reagents
Know your methods
Do your controls
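The choice of BLAST program from the query and database types can be written down as a small lookup. This is a sketch for illustration only: the function `choose_blast` and its `translate_both` flag are hypothetical names, not part of any BLAST toolkit.

```python
# (query type, database type) -> BLAST program.
BLAST_PROGRAMS = {
    ("protein", "protein"): "BLASTP",
    ("nucleotide", "protein"): "BLASTX",      # query is translated
    ("protein", "nucleotide"): "TBLASTN",     # database is translated
    ("nucleotide", "nucleotide"): "BLASTN",
}


def choose_blast(query: str, database: str, translate_both: bool = False) -> str:
    """Pick the BLAST program for a query/database pair of sequence types."""
    if translate_both:
        # TBLASTX compares a translated nucleotide query to a translated nucleotide database.
        if (query, database) != ("nucleotide", "nucleotide"):
            raise ValueError("TBLASTX needs a nucleotide query and a nucleotide database")
        return "TBLASTX"
    return BLAST_PROGRAMS[(query, database)]


print(choose_blast("nucleotide", "protein"))
```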
Bioinformatics Citizenship: What it means, and what does it cost?
Nature 409:452
Databases
Information system: the Library of Congress, Google, Entrez, EnsEMBL, UCSC genome browser
Query system: a list you look at, a catalogue, indexed files, SQL, grep
Storage system: boxes, Oracle, MySQL, PC binary files, Unix text files, bookshelves
Data: GenBank flat file, COSMIC record, interaction record, title of a book, book
http://www.ncbi.nlm.nih.gov/gquery/
All [filter]
Oct 30th, 2012
Formats
DNA sequence (GenBank Flat Files)
Protein sequences
Other formats to know about: FASTA, GFF3, XML
GenBank Flat File (GBFF)
Header
Title Taxonomy Citation
Features (AA seq)
DNA Sequence
FASTA
>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R
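Because FASTA is just a ">" header line followed by sequence lines, it can be read with a few lines of Python. The sketch below is a minimal reader written for this illustration (the name `read_fasta` is not from any library), and the example uses only the first two sequence lines of the GCN4 record above; in practice a library such as Biopython would normally be used.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Truncated copy of the record above (first two sequence lines only).
example = """>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
""".splitlines()

for header, sequence in read_fasta(example):
    identifier = header.split()[0]      # text up to the first space
    print(identifier, len(sequence))
```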
Databases
Primary (archival)
GenBank/EMBL/DDBJ, UniProt, PDB, Medline (PubMed), IntAct
Secondary (curated)
RefSeq, Taxon, UniProt, OMIM, SGD, BioSamples/BioProjects
http://nar.oxfordjournals.org/content/41/D1.toc
January 2013
http://nar.oxfordjournals.org/content/40/D1.toc
January 2012
Sequence Databases
Primary DNA (archive)
DDBJ/ENA/GenBank
Primary protein (curated/automation)

UniProtKB
Curated Databases (lots of human labour)

RefSeq (Genomic, mRNA and protein)
UniProtKB/Swiss-Prot and neXtProt
What is GenBank?
GenBank is the NIH genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information these records contain.
http://www.ncbi.nlm.nih.gov/genbank/
Benson et al., Nucleic Acids Res. 2013
http://www.ncbi.nlm.nih.gov/pubmed/23193287
ENA
EB-eye
Adapted from Rolf Apweiler
Types of files in GenBank

From one-gene investigators
Often a very well annotated cDNA
A genomic segment from a new invertebrate
A mitochondrion or virus
From population/phylogenetic analysis

rRNA amplicon from environmental sampling
From Genome Centers:

Gene expression:
Expressed Sequence Tags Full Length Insert cDNA TSA
Genome sequencing projects

HTG CON
Functional divisions:
PAT  Patent
EST  Expressed Sequence Tags
TSA  Transcriptome Shotgun Assembly
STS  Sequence Tagged Site
GSS  Genome Survey Sequence
HTG  High Throughput Genome (unfinished)
HTC  High Throughput cDNA (unfinished)
CON  Contig assembly instructions
ENV  Environmental sampling methods
Organismal divisions:
BCT PRI FUN ROD INV SYN MAM VRL PHG VRT PLN
Guiding Principles
In GenBank, records are grouped for various reasons: understanding this is key to using and fully taking advantage of this database.
Identifiers
You need identifiers which are stable through time
You need identifiers which will always refer to specific sequences
You need these identifiers to track the history of sequence updates
You also need feature and annotation identifiers (to track important things)
Genes Transcripts Proteins ((( Phenotype )))
LOCUS, Accession, NID and protein_id

LOCUS: Unique string of 10 letters and numbers in the database. Not maintained amongst databases, and is therefore a poor sequence identifier.
ACCESSION: A unique identifier to that record, citable entity; does not change when record is updated. A good record identifier, ideal for citation in publication.
VERSION: ID system where the accession and version play the same function as the accession and gi number.
Nucleotide gi: Geninfo identifier (gi), a unique integer which will change every time the sequence changes.
Protein gi: Geninfo identifier (gi), a unique integer which will change every time the sequence changes.
protein_id: Identifier which has the same structure and function as the nucleotide Accession.version numbers, but slightly different format.
LOCUS, Accession, gi and PID

LOCUS       HSU40282     1789 bp    mRNA            PRI       21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U40282.1  GI:3150001
     CDS             157..1515
                     /gene="ILK"
                     /note="protein serine/threonine kinase"
                     /codon_start=1
                     /product="integrin-linked kinase"
                     /protein_id="AAC16892.1"
                     /db_xref="GI:3150002"

LOCUS: HSU40282
ACCESSION: U40282
VERSION (Accession.version): U40282.1
GI: 3150001
Protein gi: 3150002
protein_id: AAC16892.1
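The Accession.version scheme is easy to handle in code, which is one reason it makes a good identifier. A minimal sketch, assuming identifiers of the form shown in the record above (the names `VersionedAccession` and `parse_accession_version` are invented for this example):

```python
from typing import NamedTuple


class VersionedAccession(NamedTuple):
    accession: str   # stable, citable part (does not change on update)
    version: int     # incremented when the sequence changes


def parse_accession_version(text: str) -> VersionedAccession:
    """Split an identifier such as "U40282.1" into accession and version."""
    accession, _, version = text.partition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {text!r}")
    return VersionedAccession(accession, int(version))


nucleotide = parse_accession_version("U40282.1")     # VERSION line of the record
protein = parse_accession_version("AAC16892.1")      # /protein_id qualifier
print(nucleotide.accession, nucleotide.version)
print(protein.accession, protein.version)
```

Two identifiers with the same accession but different versions refer to the same record at different points in its sequence history.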
Accession number space

GenBank:
1+5 (L12345, U00001) 2+6 (AF000001, AC000003)
WGS (Not distributed with GenBank)

4+2+6 (AAAA01000001, AAAD01000001)
Protein:
1+5 or 3+5
All have accession.version
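The letter+digit patterns listed above can be expressed as regular expressions. The sketch below is illustrative only: it encodes just the formats named on this slide (accession spaces have grown since), and the function name `classify_accession` is an assumption of this example. Note that a 1+5 accession alone does not say whether it is nucleotide or protein.

```python
import re

# Letters + digits patterns taken from the slide above.
ACCESSION_PATTERNS = {
    "1+5": re.compile(r"[A-Z]\d{5}"),            # e.g. L12345, U00001 (nucleotide or protein)
    "2+6": re.compile(r"[A-Z]{2}\d{6}"),         # e.g. AF000001, AC000003
    "3+5": re.compile(r"[A-Z]{3}\d{5}"),         # protein, e.g. AAC16892
    "4+2+6": re.compile(r"[A-Z]{4}\d{2}\d{6}"),  # WGS, e.g. AAAA01000001
}


def classify_accession(identifier: str):
    """Return the names of the slide's formats that the accession matches."""
    accession = identifier.split(".")[0]   # drop a trailing .version if present
    return [name for name, pattern in ACCESSION_PATTERNS.items()
            if pattern.fullmatch(accession)]


for example in ("U00001", "AF000001", "AAAA01000001", "AAC16892.1"):
    print(example, classify_accession(example))
```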
Secondary Accession Numbers

When you retire accession numbers, these are often put in the secondary accession number space (e.g. GenBank accession number L05146).
With the removal of sequence length limits, GenBank now allows continuous ranges of secondary accessions.
As of GenBank Release 146.0 (February 2005), it is legal to represent continuous ranges of secondary accessions by a start accession, a dash character, and an end accession (e.g. for the E. coli genome):
ACCESSION U00096 AE000111-AE000510
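When parsing ACCESSION lines, such a range has to be expanded back into individual accessions. A minimal sketch of that expansion, assuming both ends share the same letter prefix and digit width (the helper name `expand_accession_range` is made up for this example):

```python
import re


def expand_accession_range(text: str):
    """Expand "AE000111-AE000510" into the individual accessions it covers."""
    start, dash, end = text.partition("-")
    if not dash:
        return [start]                      # a single accession, not a range
    first = re.fullmatch(r"([A-Z]+)(\d+)", start)
    last = re.fullmatch(r"([A-Z]+)(\d+)", end)
    if not first or not last:
        raise ValueError(f"cannot parse accession range: {text!r}")
    if first.group(1) != last.group(1) or len(first.group(2)) != len(last.group(2)):
        raise ValueError(f"range endpoints have different formats: {text!r}")
    prefix, width = first.group(1), len(first.group(2))
    low, high = int(first.group(2)), int(last.group(2))
    if high < low:
        raise ValueError(f"range end precedes range start: {text!r}")
    # Zero-pad the counter so every accession keeps the original digit width.
    return [f"{prefix}{number:0{width}d}" for number in range(low, high + 1)]


secondary = expand_accession_range("AE000111-AE000510")
print(len(secondary), secondary[0], secondary[-1])
```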
EST: Expressed Sequence Tag
Expressed Sequence Tags are short (300-1000 bp) single reads from mRNA (cDNA) which are produced in large numbers. They represent a snapshot of what is expressed in a given tissue and developmental stage.
Also see:
http://www.ncbi.nlm.nih.gov/dbEST/
http://www.ncbi.nlm.nih.gov/UniGene/
LOCUS       CX016035                 296 bp    mRNA    linear   EST 06-DEC-2004
DEFINITION  qt06h09.g1 Whole Heart Library (DOGEST5) Canis familiaris cDNA,
            mRNA sequence.
ACCESSION   CX016035
VERSION     CX016035.1  GI:56398446
KEYWORDS    EST.
SOURCE      Canis familiaris (dog)
  ORGANISM  Canis familiaris
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Carnivora; Fissipedia; Canidae; Canis.
REFERENCE   1  (bases 1 to 296)
  AUTHORS   Balija,V.S., Nascimento,L.U. and McCombie,W.R.
  TITLE     ESTs from Canis familiaris whole heart (dog)
  JOURNAL   Unpublished (2004)
COMMENT     Contact: W. Richard McCombie
            Lita Annenberg Hazen Genome Sequencing Center
            Cold Spring Harbor Laboratory
            PO Box 100, Cold Spring Harbor, NY 11724, USA
            Tel: 516 367 8884
            Fax: 516 367 8874
            Email: mccombie@cshl.org.
FEATURES             Location/Qualifiers
     source          1..296
                     /organism="Canis familiaris"
                     /mol_type="mRNA"
                     /db_xref="taxon:9615"
                     /sex="Unknown"
                     /dev_stage="3 month old normal canine"
                     /lab_host="XL10 Gold"
                     /clone_lib="Whole Heart Library (DOGEST5)"
                     /note="Organ: Heart; Vector: pBluescript II SK; Site_1:
                     EcoRI; Site_2: XhoI; Library constructed using pBluescript
                     XR kit from Stratagene. Cloned cDNA was size selected
                     between 1-3 kb. Mark Haskins VMD, PhD, Pathology and
                     Medical Genetics, School of Veterinary Medicine,
                     University of Pennsylvania, 3800 Spruce Street,
                     Philadelphia, PA 19104-6051"
ORIGIN
        1 ctccaccgcg gtggcggccg ctctagaact agtggatccc ccgggctgca ggaattcggc
       61 acgaggaggg tcttttatta aaaccaggtg agtcactcca ttcgctgaga aaaggcacac
      121 ttatgttcca gatccacgtc gcctccctcg ggctgggggg tggctggccc actctgtcca
      181 gacctctttt tcattacaga tggacactgg ggggcagtga tggatcagag cgttcttatg
      241 gccgggcctt ggtttatggc ttggatttgg gatcagaggg gagggtgaag gtgtgg
//
Transcriptome Shotgun Assembly

TSA is an archive of computationally assembled sequences from primary data such as ESTs, traces and Next Generation Sequencing Technologies. The overlapping sequence reads from a complete transcriptome are assembled into transcripts by computational methods instead of by traditional cloning and sequencing of cloned cDNAs. The primary sequence data used in the assemblies must have been experimentally determined by the same submitter. TSA sequence records differ from EST and GenBank records because there are no physical counterparts to the assemblies.
LOCUS       EU284816                 447 bp    mRNA    linear   TSA 30-JUN-2008
DEFINITION  TSA: Elaeis guineensis EgEfMPOB00001 ribosomal L28-like protein
            mRNA, complete cds, mRNA sequence.
ACCESSION   EU284816
VERSION     EU284816.1  GI:192908657
DBLINK      BioProject: PRJNA30465
KEYWORDS    TSA; Transcriptome Shotgun Assembly.
SOURCE      Elaeis guineensis (African oil palm)
  ORGANISM  Elaeis guineensis
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; Liliopsida; Arecaceae; Arecoideae;
            Cocoseae; Elaeidinae; Elaeis.
REFERENCE   1  (bases 1 to 447)
  AUTHORS   Low,E.T., Alias,H., Boon,S.H., Shariff,E.M., Tan,C.Y., Ooi,L.C.,
            Cheah,S.C., Raha,A.R., Wan,K.L. and Singh,R.
  TITLE     Oil palm (Elaeis guineensis Jacq.) tissue culture ESTs: identifying
            genes associated with callogenesis and embryogenesis
  JOURNAL   BMC Plant Biol. 8, 62 (2008)
   PUBMED   18507865
  REMARK    Publication Status: Online-Only
REFERENCE   2  (bases 1 to 447)
  AUTHORS   Low,E.T.L., Alias,H., Boon,S.H., Shariff,E.M., Tan,C.Y.A.,
            Ooi,L.C.L., Cheah,S.C., Wan,K.L., Rahim,R.A. and Singh,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-NOV-2007) Advanced Biotechnology and Breeding Centre,
            Biology Division, Malaysian Palm Oil Board, 6, Persiaran Institusi,
            Bandar Baru Bangi, Kajang, Selangor 43000, Malaysia
PRIMARY     TSA_SPAN            PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
            1-447               EY404781.1         14-460
            1-436               EY409465.1         72-507
FEATURES             Location/Qualifiers
     source          1..447
                     /organism="Elaeis guineensis"
                     /mol_type="mRNA"
                     /db_xref="taxon:51953"
     CDS             1..447
                     /codon_start=1
                     /product="ribosomal L28-like protein"
                     /protein_id="ACF06437.1"
                     /db_xref="GI:192908658"
                     /translation="MATVPGPLIWEIVKRNNAFLVKQFGNGNAMVQFSKEPNNLYNVN
                     SYKHSGLANKKTVAIQPGGKDLSVVLATSKTKKQNKPGNLYNRSVMKKEFRKMAKAVK
                     NQVTDNYYRPDLTKAALARLSAVHRSLKVAKSGVKKRNRQAVQVKY"
ORIGIN
        1 atggctactg ttcccggacc acttatctgg gagattgtga agagaaacaa tgcttttctt
       61 gttaagcagt ttggtaatgg caatgccatg gtgcagttta gcaaagagcc gaacaatctc
      121 tacaatgtga attcctacaa gcactccggg ttggcgaaca agaagactgt ggccattcag
      181 ccaggaggga aggacctctc tgtggttctt gcaaccagta agacaaagaa gcagaacaaa
      241 cctggtaatc tgtacaacag gtcggttatg aagaaagagt tccggaagat ggcaaaggca
      301 gtcaagaacc aggtgaccga caactactac aggcctgatt tgaccaaagc agctcttgca
      361 aggctcagtg ctgtacatcg cagcctcaag gttgccaagt ctggagtgaa gaagaggaac
      421 cggcaggctg ttcaagtaaa atactag
//
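The ORIGIN block of a GenBank flat file stores the sequence in numbered lines of 10-base chunks. As a rough sketch of how that layout maps back to a plain sequence string (the helper `parse_origin` is written for this illustration, and the sample is the last two sequence lines of the TSA record, as read from the slide):

```python
def parse_origin(lines):
    """Join the sequence chunks of a GenBank ORIGIN block, dropping position numbers."""
    chunks = []
    for line in lines:
        fields = line.split()
        # Sequence lines start with the 1-based position of their first base;
        # anything else ("ORIGIN", "//", blank lines) is skipped.
        if not fields or not fields[0].isdigit():
            continue
        chunks.extend(fields[1:])
    return "".join(chunks)


# Last two sequence lines of the TSA record EU284816, plus the record terminator.
tail = [
    "      361 aggctcagtg ctgtacatcg cagcctcaag gttgccaagt ctggagtgaa gaagaggaac",
    "      421 cggcaggctg ttcaagtaaa atactag",
    "//",
]
sequence = parse_origin(tail)
print(len(sequence), sequence[-3:])
```

The final three bases, tag, are the stop codon that closes the CDS annotated as 1..447.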
WGS: Whole Genome Shotgun (Not in GenBank release, not shared with ENA/DDBJ)
Contigs from ongoing Whole Genome Shotgun sequencing projects.
The nucleotides from WGS projects go into the BLAST wgs database, whereas the proteins go into the BLAST nr database.
More info, and how to submit to this division: http://www.ncbi.nlm.nih.gov/Genbank/wgs.html
Accession format is 4+2+6
http://www.ncbi.nlm.nih.gov/projects/WGS/WGSprojectlist.cgi
WGS record (not in GenBank)
Sequences NOT in GenBank

WGS: whole genome shotgun
TPA: third party annotations
SNPs
SAGE tags (serial analysis of gene expression)
RefSeq (Genomic, mRNA, or protein)
Consensus sequences
What is UniProtKB?
UniProt is a protein sequence database that is the result of a merger of SWISS-PROT and PIR, and is funded by the NIH, EMBL and the Swiss Government.
It is the main distributed, annotated, and curated protein sequence database.
Data in UniProt is derived from coding sequence annotations in ENA (GenBank/ENA/DDBJ) nucleic acid sequence data, and from sequences in PIR and Swiss-Prot (SP).
UniProt is a flat-file database, just like GenBank or ENA/EMBL.
http://www.uniprot.org/
http://database.oxfordjournals.org/content/2011/bar009.long
UniProtKB/Swiss-Prot
ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   [TAXONOMY]
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RX   [CITATION]
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE +
CC       NH(3) + 2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC   --------------------------------------------------------------------------
CC   [DISCLAIMER]
CC   --------------------------------------------------------------------------
DR   [DATABASE cross-reference]
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET      0      0
FT   BINDING     203    203       PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DT   01-JUL-1993 (REL. 26, LAST SEQUENCE UPDATE)
DT   01-NOV-1995 (REL. 32, LAST ANNOTATION UPDATE)
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   SACCHAROMYCES CEREVISIAE (BAKER'S YEAST).
OC   EUKARYOTA; FUNGI; ASCOMYCOTA; HEMIASCOMYCETES; SACCHAROMYCETALES;
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RN   [1]
RP   SEQUENCE FROM N.A., AND PARTIAL SEQUENCE.
RX   MEDLINE; 92250430. [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., TANAKA K., NAITO K., HEIKE C., SHINODA S., YAMAMOTO S.,
RA   OHMORI S., OSHIMA T., TOH-E A.;
RT   "Cloning and characterization of the CYS3 (CYI1) gene of
RT   Saccharomyces cerevisiae.";
RL   J. BACTERIOL. 174:3339-3347(1992).
RN   [2]
RP   SEQUENCE FROM N.A., AND CHARACTERIZATION.
RC   STRAIN=DBY939;
RX   MEDLINE; 93328685. [NCBI, ExPASy, Israel, Japan]
RA   YAMAGATA S., D'ANDREA R.J., FUJISAKI S., ISAJI M., NAKAMURA K.;
RT   "Cloning and bacterial expression of the CYS3 gene encoding
RT   cystathionine gamma-lyase of Saccharomyces cerevisiae and the
RT   physicochemical and enzymatic properties of the protein.";
RL   J. BACTERIOL. 175:4800-4808(1993).
RN   [3]
RP   SEQUENCE FROM N.A.
RC   STRAIN=S288C / AB972;
RX   MEDLINE; 93289814. [NCBI, ExPASy, Israel, Japan]
RA   BARTON A.B., KABACK D.B., CLARK M.W., KENG T., OUELLETTE B.F.F.,
RA   STORMS R.K., ZENG B., ZHONG W.W., FORTIN N., DELANEY S., BUSSEY H.;
RT   "Physical localization of yeast CYS3, a gene whose product resembles
RT   the rat gamma-cystathionase and Escherichia coli cystathionine gamma-
RT   synthase enzymes.";
RL   YEAST 9:363-369(1993).
RN   [4]
RP   SEQUENCE FROM N.A.
RC   STRAIN=S288C / AB972;
RX   MEDLINE; 93209532. [NCBI, ExPASy, Israel, Japan]
RA   OUELLETTE B.F.F., CLARK M.W., KENG T., STORMS R.K., ZHONG W.W.,
RA   ZENG B., FORTIN N., DELANEY S., BARTON A.B., KABACK D.B., BUSSEY H.;
RT   "Sequencing of chromosome I from Saccharomyces cerevisiae: analysis
RT   of a 32 kb region between the LTE1 and SPO7 genes.";
RL   GENOME 36:32-42(1993).
RN   [5]
RP   SEQUENCE OF 1-18, AND CHARACTERIZATION.
RX   MEDLINE; 93289817. [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., ISHII N., NAITO K., MIYOSHI S.-I., SHINODA S., YAMAMOTO S.,
RA   OHMORI S.;
RT   "Cystathionine gamma-lyase of Saccharomyces cerevisiae: structural
RT   gene and cystathionine gamma-synthase activity.";
RL   YEAST 9:389-397(1993).
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE +
CC       NH(3) + 2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between the Swiss Institute of Bioinformatics and the EMBL outstation
CC   the European Bioinformatics Institute. There are no restrictions on its
CC   use by non-profit institutions as long as its content is in no way
CC   modified and this statement is not removed. Usage by and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to license@isb-sib.ch).
CC   --------------------------------------------------------------------------
DR   EMBL; L05146; AAC04945.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; L04459; AAA85217.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; D14135; BAA03190.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   PIR; S31228; S31228.
DR   YEPD; 5280; -.
DR   SGD; L0000470; CYS3. [SGD / YPD]
DR   PFAM; PF01053; Cys_Met_Meta_PP; 1.
DR   PROSITE; PS00868; CYS_MET_METAB_PP; 1.
DR   DOMO; P31373.
DR   PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR   PROTOMAP; P31373.
DR   PRESAGE; P31373.
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET      0      0
FT   BINDING     203    203       PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
//
Where do I plant my (human) flag?

Across a large human genome of 2 X 3,000,000,000 bp Changing coordinate system?
EnsEMBL UCSC Genome Browser NCBI Genome Browser
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/
http://genome.ucsc.edu/FAQ/FAQreleases.html
BioProjects & BioSamples
http://www.ncbi.nlm.nih.gov/pubmed/22139929
Schematic depicting how BioProject, BioSample and data objects can be organized and linked.
Barrett T et al. Nucl. Acids Res. 2012;40:D57-D63

Published by Oxford University Press 2011.
OK, what do you want to do with this?

Make discoveries and gain insights in cancer biology?
Identify diagnostic or prognostic markers for a particular tumor sub-type?
You are actually a closet geneticist, and you just like to look at variation in your favorite gene family in whatever sample you can get your hands on?
You want to integrate methylation marks with gene expression data to better understand cell biology?
You want to develop a new de novo whole genome assembly algorithm that works on cancer genomes?
You want to identify shared signal pathways in all head & neck tumor types?
Historical perspective on the Human Genome Data

Human Expressed Sequence Tags (mRNA) sequencing
Human genome mapping and sequencing
Population analysis and polymorphism measurements
Genome Wide Association Studies <the Homer paper>
The Cancer Genome Atlas pilot
The 1000 Genomes Project
The Cancer Genome Atlas
The International Cancer Genome Consortium
http://dcc.icgc.org/
ftp://data.dcc.icgc.org/
ftp://data.dcc.icgc.org/current/Pancreatic_Cancer-OICR-CA/ssm.oicrPanc.txt.gz
1. Cancer Type  Pancreatic Cancer (OICR, CA)
2. Assembly Version  GRCh37
3. Chromosome  X
4. Chromosome start  96212915
5. Chromosome end  96212915
6. Chromosome strand  1
7. Reference genome allele  C
8. RefSNP allele
9. RefSNP strand  1
10. External Variation ID  ce3d9ca5-3634-44d8-a8
11. Mutation ID
12. Mutation type  single base substitution
13. Mutation  C>T
14. Tumour genotype  C/T
15. Control genotype  C/C
16. Consequence type  upstream_gene_variant
17. CDS mutation
18. AA mutation
19. Ensembl Gene ID  ENSG00000214628
20. Transcript affected  ENST00000398690
21. Gene name  RP11-392M9.2.1
22. Is annotated  not annotated
23. Validation status  validated
24. Analysis ID  exomeSSMAgil_201202
25. Platform  Illumina HiSeq
26. Validation platform  Ion Torrent PGM
27. Raw data repository  dbSNP
28. Raw data accession  http://www.ncbi.nlm.nih.gov/snp
29. Base calling algorithm  genome_analyzer_sofware.ilmn
30. Alignment algorithm  Novoalign
31. Variation calling algorithm  GATK
32. Donor ID  PCSI_0072
33. Sex  male
34. Age at diagnosis  83
35. Age at enrollment  82
36. ICD-10  C25
37. Donor tumour stage at diagnosis  0
38. Specimen ID  PCSI_0072_Pa_P
39. Specimen type  primary tumour
40. Tumour confirmed  primary tumour
41. Tumour histological type  G2
42. Tumour grade
43. Tumour stage
44. Specimen donor treatment type  no treatment
45. Specimen donor treatment type other
46. Analyzed sample ID  PCSI_0072_Pa_P
47. Sample type  Primary tumour
48. Matched sample ID  PCSI_0072_Ly_R
49. Initial Release Date  April 15, 2010
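A record like this is one row of a tab-delimited table, so it can be filtered with Python's standard csv module. The sketch below makes several assumptions that are not confirmed by the slide: the column names (`chromosome`, `chromosome_start`, `mutation_type`, `mutation`) are hypothetical stand-ins for the real header of the ssm file, and the second sample row is a made-up placeholder. For the real gzipped file, a handle from `gzip.open(path, "rt")` could be passed in instead of the in-memory sample.

```python
import csv
import io


def substitutions_on(handle, chromosome):
    """Return the single-base-substitution rows on one chromosome.

    `handle` is any text file object holding a tab-delimited table with a
    header row; the column names used here are assumptions, not the
    confirmed ICGC header.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row for row in reader
        if row["chromosome"] == chromosome
        and row["mutation_type"] == "single base substitution"
    ]


# First data row: values from the slide. Second row: a made-up placeholder.
sample = io.StringIO(
    "chromosome\tchromosome_start\tmutation_type\tmutation\n"
    "X\t96212915\tsingle base substitution\tC>T\n"
    "1\t1\tplaceholder type\t-\n"
)

hits = substitutions_on(sample, "X")
for row in hits:
    print(row["chromosome"], row["chromosome_start"], row["mutation"])
```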
Data, nomenclature, and metadata are critical for all of our work!
If databases get it wrong, the onus is on the user to let the databases know that it is wrong!
http://goo.gl/bGjMH
Another source of important Cancer Data:

http://www.sanger.ac.uk/genetics/CGP/cosmic/

ftp://ftp.sanger.ac.uk/pub/CGP/cosmic/data_export/
Identify yourself
Fill out the detailed form, which includes:
Contact and Project Information
Information Technology details and procedures for keeping data secure
Data Access Agreement
All of these documents are compiled into a PDF file that you print and have your institution sign on your behalf.
[Diagram of data access for TCGA and ICGC. Labels: ERA, Open, TCGA, dbGaP, BAM, DACO, Open, ICGC, EGA, BAM + EGA id, VCF]
What is Cancer Data?

Structured clinical data about the patient
Structured clinical data about the treatment
Structured clinical data about the tumor
Associated with a number of positions (hundreds, if not thousands) in the nucleotide coordinate system of one reference genome.
ICGC is implementing NCBI's BioProjects: http://www.ncbi.nlm.nih.gov/bioproject
About UCSC Genome Browser

Browse many eukaryotic genomes (yeast to human)
Most annotations are there
Important evolutionary and variation data representation
Very flexible and configurable views
Graphical and table views (Galaxy uses this)
Upload your data into custom tracks and share with colleagues
Client/server application with its issues, but a great app!
Best Way to represent data?

We have been thinking mostly in a linear scale, and decorating the string of DNA with annotations.
What are annotations?

Data being Collected

Phenotype: clinical data, i.e. tumor pathology, age, gender, treatment, survival (controlled access)
Germ line data: SNPs (controlled access)
Somatic mutations in tumour (open)
Copy number variations (open)
RNA abundance & splicing (open)
RNA sequences (controlled access)
DNA methylation (open)
Getting data
COSMIC: http://www.sanger.ac.uk/genetics/CGP/cosmic
ICGC Home: http://icgc.org
ICGC DCC: http://dcc.icgc.org
TCGA Home: http://cancergenome.nih.gov
TCGA: http://tcga-data.nci.nih.gov/tcga
UCSC Home: http://genome.ucsc.edu
UCSC Human: http://genome.ucsc.edu/cgi-bin/hgGateway
UCSC Human Cancer: https://genome-cancer.ucsc.edu
Integrative Genomics Viewer (IGV)
IGV can use and display many file formats

http://www.broadinstitute.org/software/igv/FileFormats
IGV: file formats, e.g. BAM (the binary version of SAM, the Sequence Alignment/Map format)
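Because BAM is binary and compressed, it cannot be inspected with `cat` or `grep` the way SAM can. A quick sanity check is to decompress the start of the file and look for the BAM magic bytes (`BAM` followed by byte 1). This is a minimal sketch using only the Python standard library. The function name is hypothetical, and it relies on BAM's BGZF blocks being readable as ordinary gzip data.

```python
import gzip


def looks_like_bam(path):
    """Return True if the file decompresses to data starting with b'BAM\\x01'.

    Returns False for anything that cannot be read as gzip data,
    including a plain-text SAM file or a missing file.
    """
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # Not gzip/BGZF data, truncated data, or an unreadable path.
        return False
```

This only checks the file signature. It says nothing about whether the BAM is sorted or indexed.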
Ask your question, and then gather the data, the tools and hardware you need
Data and Databases: you will take workshops, you will read papers, and you will go online: SeqAnswers & maybe the bioinformatics.ca Links Directory
Tools: you will take workshops, you will read papers, and you will go online: SeqAnswers & maybe the bioinformatics.ca Links Directory
Hardware: you need to decide?
What can you do with IGV?

Visualization of different genomic data types:
• aligned sequence reads
• mutations
• copy number
• RNA interference screens
• gene expression
• methylation
and genomic annotations
List of supported data formats:
http://www.broadinstitute.org/software/igv/FileFormats
For this example:
• *.bam for the alignment file
• *.gtf for the genome annotation data
Using IGV to visualize sequence alignment and genomic annotations
Step 1: Choose the genome in the list (or import your own genome file)
Here we have selected hg18 because it was used for the alignment

Step 2: Import your alignment file
File->Load from File
You can also load a file from a URL, a DAS source, or a server

Sample files source: http://manuals.bioinformatics.ucr.edu/home/gui-ngs-analysis and ftp://ftp.broad.mit.edu/pub/igv/INMEGEN2010/
Step 2: Import your sequence alignment file
If you download a *.bam file, it must be sorted and indexed, and the index *.bai file must be in the same directory.
You can visualize several alignment files at the same time for the same species.
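A missing index is a common reason an alignment will not load, so it can help to check for it before opening IGV. This is a minimal Python sketch with a hypothetical function name. It assumes the index is named either `sample.bam.bai` or `sample.bai`, the two conventions commonly seen. It does not verify that the BAM is actually sorted.

```python
from pathlib import Path


def find_bam_index(bam_path):
    """Return the path of an index file next to the BAM, or None if missing."""
    bam = Path(bam_path)
    candidates = [
        bam.with_name(bam.name + ".bai"),  # sample.bam -> sample.bam.bai
        bam.with_suffix(".bai"),           # sample.bam -> sample.bai
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

If this returns `None`, the file still needs to be indexed (and sorted first, if it is not already) before it is loaded.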

Step 3: select the data to display
You can either:
• select a chromosome
• select the coordinates
• search for a gene

Step 4: visualize the read alignments on the sequence
You will not see the alignment if the region you are looking at is too large for IGV. Zoom in using the + sign (in red) or by double-clicking on the display area.
double-click here to zoom in and see the alignment
Cytoband
Genomic coordinates
Track names
Data panel
Genomic annotations (default: RefSeq)
Coverage of reads on the sequence

White reads: low alignment score
Other colors: depend on the color alignment code selected (e.g. insert size, pair orientation, read strand)
Annotated introns
Annotated exons
2 examples of variation compared to the reference sequence
Lighter color bases: low quality bases
Reference sequence (here hg18)

Step 5.1: download a genomic annotation file from the UCSC Table Browser
1) Go to http://genome.ucsc.edu and click on Tables
Several ways of downloading gene annotation files can be used, for example directly from the source sequence databases
Select the genome (here hg18)
Select the gene annotations (here Ensembl)
Select the file format (here GTF)
Choose your file name and click on the "get output" button

Step 5.2: load the genomic annotation file in IGV
Select File->Load from File and choose the GTF file you have downloaded
You now have access to RefSeq and Ensembl gene annotations:
The more data and annotations you load, the more memory you need.
You can select a higher memory threshold if you need it when you launch IGV.
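Before loading an annotation file, it can be useful to see what it contains. A GTF file is tab-delimited with nine columns (seqname, source, feature, start, end, score, strand, frame, attributes). The following is a minimal Python sketch with hypothetical function names that counts records per feature type. It assumes comment lines start with `#` and does not handle other non-record lines a download might contain.

```python
from collections import Counter

GTF_COLUMNS = ("seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes")


def parse_gtf_line(line):
    """Split one GTF line into a dict, or return None for comments/blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based start coordinate
    record["end"] = int(record["end"])      # inclusive end coordinate
    return record


def count_features(lines):
    """Count GTF records per feature type (e.g. exon, CDS)."""
    records = (parse_gtf_line(line) for line in lines)
    return Counter(r["feature"] for r in records if r is not None)
```

Typical use would be `count_features(open("my_annotations.gtf"))`, where the file name is whatever was chosen in Step 5.1.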

In this example you can visualize a deletion (10 kb, from the IGV publication*)
*Robinson et al. (2011) Nature Biotechnology 29: 24-26

You can also visualize copy number variation data (from the IGV publication*)
*Robinson et al. (2011) Nature Biotechnology 29: 24-26
Following OpenHelix, UCSC, & SeqAnswers

OpenHelix
http://www.openhelix.com/
Twitter: @openhelix
Blog: http://blog.openhelix.com/
UCSC
http://genome.ucsc.edu/
Twitter: @GenomeBrowser
More tutorials: http://genome.ucsc.edu/training.html
SEQanswers
Forum for NGS technologies
http://seqanswers.com/
Lab exercise:
Update of the stats for 2013 http://goo.gl/tkMXa
CosmicCompleteExport_v64_260313.tsv
How many records in this file?
How many genes?
Which are most mutated?
How many publications?
Which is the most mutated gene in TCGA set?
What concerns about these?
Some hints:
1> cat file.tsv | cut -f 1 | sort | uniq -c | sort -nr > foo.out &
2> grep TCGA file.tsv | wc -l
3> man cut
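The first hint (cut one column, then count and rank its values) can also be sketched in Python, which may be easier to extend to other columns. This is a minimal sketch with a hypothetical function name. It assumes the file is tab-delimited and starts with a header line. The shell hint does not skip the header, so its counts would include the header value.

```python
import csv
from collections import Counter


def count_column(handle, column=1, has_header=True):
    """Count values in a 1-based column of a tab-delimited file, most common first."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header line, if any
    # Rows too short to have the requested column are ignored.
    counts = Counter(row[column - 1] for row in reader if len(row) >= column)
    return counts.most_common()
```

For example, `count_column(open("file.tsv"), column=1)[:10]` would list the ten most frequent values in the first column, if that column holds the gene name.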
Get simple somatic mutations from OICR's pancreatic cancer ICGC project
> cat ssm.oicrPanc.txt | wc -l
What is this?
> cat ssm.oicrPanc.txt | awk -F'\t' '{print $3}' | sort | uniq -c
What is this?
> cat ssm.oicrPanc.txt | awk -F'\t' '{print $23}' | sort | uniq -c
What is this?
Etc ... See wiki
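One thing to watch with `uniq -c`: it counts runs of adjacent identical lines, so a column needs to be sorted first if you want one total per distinct value. This is a minimal Python sketch of that behaviour. The function name is hypothetical and the column values are made up for illustration.

```python
from itertools import groupby


def uniq_c(values):
    """Mimic `uniq -c`: count runs of adjacent identical values."""
    return [(key, sum(1 for _ in group)) for key, group in groupby(values)]


chromosomes = ["X", "1", "X", "1", "1"]   # made-up column values

# Unsorted input: the same value is reported once per run.
print(uniq_c(chromosomes))

# Sorted input: one count per distinct value.
print(uniq_c(sorted(chromosomes)))
```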
