
13-05-23

Canadian Bioinformatics Workshops


bioinformatics.ca

Module #: Title of Module


You are free to:


Copy, share, adapt, or re-mix; Photograph, film, or broadcast; Blog, live-blog, or post video of;

This presentation. Provided that:


You attribute the work to its author and respect the rights and licenses associated with its components.
Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero. Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at: http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites

Module 2a: Cloud Computing with AWS

B.F. Francis Ouellette

http://durtridingurl.blogspot.ca/2011/04/cloud-kingdom.html

Disclaimer
I do not (and will not) profit in any way, shape or form, from any of the brands, products or companies I may mention.


E-mail

francis@oicr.on.ca @bffo #CBWBiCG #CBW2013


Learning Objectives of Module


Introduction of cloud computing
Use of the wiki in this workshop
How to log into the cloud
Review of databases used in bioinformatics and some used in Cancer Genomics
UCSC Genome Browser
IGV and other visualization tools


Disk Capacity vs Sequencing Capacity, 1990-2009

[Figure: log-scale plot by year (1990 to 2012) of disk storage (MB/$) and DNA sequencing (bp/$). Hard disk storage: doubling time = 14 months. Pre-nextgen sequencing: doubling time = 19 months. Nextgen sequencing: doubling time = 4 months.]


About DNA and computers


We'll hit the $1000 genome during 2013 or 2014, then need to think about the $100 genome.
The doubling time of sequencing is 5-6 months.
The doubling time of storage and network bandwidth is 12 months.
The doubling time of CPU speed is 18 months.
The cost of sequencing a base pair will equal the cost of storing a base pair by 2018.
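
To make the doubling times concrete, here is a minimal sketch of the arithmetic: a quantity that doubles every d months grows by a factor of 2^(t/d) after t months. The function name and the 36-month horizon are illustrative assumptions, and sequencing is taken at 6 months (the upper end of the 5-6 month range quoted above).

```python
def fold_increase(months: float, doubling_time_months: float) -> float:
    """Fold growth after `months` for a quantity doubling every `doubling_time_months`."""
    return 2.0 ** (months / doubling_time_months)


# Doubling times quoted on the slide; 6 months is an assumed value within "5-6 months".
DOUBLING_MONTHS = {"sequencing": 6, "storage/network": 12, "CPU speed": 18}

if __name__ == "__main__":
    horizon = 36  # hypothetical three-year horizon, chosen only for illustration
    for name, doubling in DOUBLING_MONTHS.items():
        print(f"{name}: x{fold_increase(horizon, doubling):.0f} in {horizon} months")
```

Over three years this gives roughly 64-fold more sequence per dollar against 8-fold more storage and 4-fold more CPU, which is the gap the slide is pointing at.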


What is the general biomedical scientist to do?

Lots of data
Poor IT infrastructure in many labs
Where do they go?
Write more grants?
Get bigger hardware?
Look to the sky?

Genomic companies already there!


Complete Genomics pipeline:
ACGTACGT AAGTTCGG ATGGCGTA GTCCCTTT TTGGGGTG TAGTGAGG CGCTGATT CGGAGAG

All of the hard work done here!


Most people already there!


Google Docs
Dropbox
Netflix
Twitter

Amazon Web Services (AWS)


Infinite storage (scalable): S3 (Simple Storage Service)
Compute per hour: EC2 (Elastic Compute Cloud)
Ready when you are
High Performance Computing (HPC)
Multiple football fields of HPC throughout the world
HPC is expanded one container at a time:

http://goo.gl/7PVAl

Some of the challenges with cloud computing:
Not cheap!
Getting files to and from there
Not the best solution for everybody
Standardization
PHI: personal health information & security concerns
In the USA: PATRIOT Act

Some of the advantages with cloud computing:
At the CBW: we received a grant from Amazon, so we are supported by an AWS in Education grant award.
There are better ways of transferring large files, and AWS now makes it free to upload files.
A number of datasets exist on AWS (e.g. 1000 Genomes data).
Many useful bioinformatics AMIs (Amazon Machine Images) exist on AWS: e.g. CloudBioLinux & CloudMan (Galaxy)
Many flavors of cloud available, not just AWS

In this workshop:
Some tools (data) are:
on your computer
on the web
on the cloud

You will become efficient at traversing these various spaces, finding the resources you need, and using what is best for you.
There are different ways of using the cloud:
1. Command line (like your own very powerful Unix box)
2. With a web browser (e.g. Galaxy): not in this workshop

Things we have set up:


Loaded data files to an S3 bucket.
We brought up an Ubuntu (Linux) instance, and loaded a whole bunch of software for NGS analysis.
We then cloned this, and made separate instances for everybody in the class.
We've simplified the security: you basically all have the same login, file access, and open ports.
In your own world you would be more secure.

ICGC Controlled Access Datasets


Detailed Phenotype and Outcome data: Region of residence; Risk factors; Examination; Surgery; Drugs; Radiation; Sample; Slide; Specific histological features; Analyte; Aliquot; Donor notes
Gene Expression (probe-level data)
Raw genotype calls
Gene-sample identifier links
Genome sequence files

ICGC OA (Open Access) Datasets
Cancer Pathology: Histologic type or subtype; Histologic nuclear grade
Patient/Person: Gender; Age range
Gene Expression (normalized)
DNA methylation
Genotype frequencies
Computed Copy Number and Loss of Heterozygosity
Newly discovered somatic variants
http://goo.gl/w4mrV

For this workshop: all on Wiki!


http://bioinformatics.ca/workshop_wiki/
Login: FirstnameLastname
Password: guest


On Mac: Control+


File permissions:
ls -l (long listing)

drwx------+ 67 francis staff 2278 22 May 21:25 ../
-rw-r--r--@  1 francis staff 1696 22 May 21:31 CBWkey.pem

rwx : owner   rwx : group   rwx : world
r read (4)   w write (2)   x execute (1)

Whichever way you add these 3 numbers, you know which integers were used (6 is always 4+2, 5 is 4+1, 4 is by itself, 0 is none of them, etc.). So, when you have:
chmod 600 <file name>
it is rw for the file owner only.
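
The digit arithmetic above can be sketched in a few lines of Python. The function name `describe_mode` is hypothetical; it simply tests each octal digit for the read (4), write (2) and execute (1) bits and prints the same rwx string that `ls -l` shows.

```python
def describe_mode(octal_mode: str) -> str:
    """Turn a three-digit chmod mode such as '600' into an ls-style 'rw-------' string."""
    if len(octal_mode) != 3 or any(ch not in "01234567" for ch in octal_mode):
        raise ValueError(f"expected three octal digits, got {octal_mode!r}")
    flags = []
    for digit in octal_mode:  # order: owner, group, world
        value = int(digit)
        flags.append("r" if value & 4 else "-")
        flags.append("w" if value & 2 else "-")
        flags.append("x" if value & 1 else "-")
    return "".join(flags)


if __name__ == "__main__":
    for mode in ("600", "644", "700"):
        print(mode, describe_mode(mode))
```

For the key file in the listing above, `describe_mode("644")` corresponds to the `rw-r--r--` shown, and `chmod 600` tightens it to `rw-------`.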


Logging in to AWS


Windows


Use your assigned student #

From now on, just double-click CBW to login.

So, at this point:


Your laptop is ready for the workshop
If it is not, you know where to get the information you need
You know how to use the wiki for this workshop
You know where all of the lectures are
You have read all of the pre-lecture material
If not, you know where the papers are, and you are a speed reader
You know how to log in to AWS

We are now on Lunch Break and we are networking, and having a great time with our CBW colleagues. Wish you were here? Register for Canadian Bioinformatics Workshops here: bioinformatics.ca

13-05-24


Module 2b: Databases and Visualization Tools

B.F. Francis Ouellette


Learning Objectives of Module


Introduction of cloud computing
Use of the wiki in this workshop
How to log into the cloud
Review of databases used in bioinformatics and some used in Cancer Genomics
UCSC Genome Browser
IGV and other visualization tools

"Nothing in Biology Makes Sense Except in the Light of Evolution" Theodosius Dobzhansky, 1973
http://goo.gl/doHu7

With apologies to Theodosius: "Nothing in Bioinformatics Makes Sense Except in the Light of Evolution"

Francis Ouellette, 2012


http://goo.gl/doHu7

Q: Why do we have Bioinformatics?
A: Open Data from Genomic and Proteomics Technologies

BLAST would not have been invented if GenBank or the Atlas of Protein Sequence and Structure did not exist beforehand.
DDBJ/EMBL/GenBank is the open access source of all publicly available DNA sequences.
The Atlas of Protein Sequence and Structure, published from 1965-1978 under the editorship of Margaret Dayhoff, later became the PIR, and then the UniProtKB database.

How do you define bioinformatics? (Think, pair, share.)

Bioinformatics is about integrating biological themes together with the help of computer tools and biological databases, and gaining new knowledge about the system in study.

Open Source
Open Access
Open Data

Bioinformatics reagent: Databases


Organized array of information
Place where you put things in, and (if all is well) you should be able to get them out again.
Resource for other databases and tools.
Simplify the information space by specialization.
Bonus: Allows you to make discoveries.
Important question to ask: what is the data model?

Bioinformatics experiments:

Reagents: Sequence; Sequence Databases
Method: BLAST search
Interpretation: Alignment; Similarity; Hypothesis testing

Query-Database   Program
P-P              BLASTP
N-P              BLASTX
P-N              TBLASTN
N-N              BLASTN
N(P)-N(P)        TBLASTX

Know your reagents
Know your methods
Do your controls
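
The BLAST table above reads as query type against database type, where P is protein, N is nucleotide, and N(P) is nucleotide translated to protein. A minimal sketch of that lookup follows; the function name `choose_blast` and the `translate_both` flag are illustrative assumptions, not part of any BLAST API.

```python
# Program names as used on the slide, keyed by (query type, database type).
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",      # query is translated
    ("protein", "nucleotide"): "tblastn",     # database is translated
    ("nucleotide", "nucleotide"): "blastn",
}


def choose_blast(query: str, database: str, translate_both: bool = False) -> str:
    """Pick the BLAST flavour for a query/database pair of 'protein' or 'nucleotide'."""
    if translate_both:
        # tblastx compares translated nucleotide against translated nucleotide.
        if (query, database) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    try:
        return BLAST_PROGRAMS[(query, database)]
    except KeyError:
        raise ValueError(f"unknown combination: {query!r} vs {database!r}") from None


if __name__ == "__main__":
    print(choose_blast("nucleotide", "protein"))
```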


Bioinformatics Citizenship: What it means, and what does it cost?

Nature 409:452

Databases

Information system: The Library of Congress, Google, Entrez, EnsEMBL, UCSC Genome Browser
Query system: a list you look at, a catalogue, indexed files, SQL, grep
Storage system: boxes, Oracle, MySQL, PC binary files, Unix text files, bookshelves
Data: GenBank flat file, COSMIC record, interaction record, title of a book, book
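
As a rough illustration of those layers, the sketch below uses Python's built-in `sqlite3` module: the rows are the data, the in-memory SQLite file is the storage system, and the SQL statement is the query system. The table name, column names and function names are assumptions made for this example; the three accessions and lengths are taken from records shown elsewhere in this module.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Storage layer: an in-memory SQLite database holding a few records (the data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE record (accession TEXT PRIMARY KEY, organism TEXT, length_bp INTEGER)"
    )
    conn.executemany(
        "INSERT INTO record VALUES (?, ?, ?)",
        [
            ("U40282", "Homo sapiens", 1789),
            ("CX016035", "Canis familiaris", 296),
            ("EU284816", "Elaeis guineensis", 447),
        ],
    )
    conn.commit()
    return conn


def accessions_longer_than(conn: sqlite3.Connection, min_bp: int) -> list:
    """Query layer: an SQL question asked of the storage layer."""
    rows = conn.execute(
        "SELECT accession FROM record WHERE length_bp > ? ORDER BY accession", (min_bp,)
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(accessions_longer_than(build_demo_db(), 400))
```

An information system such as Entrez would sit on top of many such query and storage layers; that part is not sketched here.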


http://www.ncbi.nlm.nih.gov/gquery/

All [filter]

Oct 30th, 2012


Formats
DNA sequence (GenBank Flat Files)
Protein Sequences
Other formats to know about: FASTA, GFF3, XML
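
Of the formats listed above, FASTA is the simplest to read by hand or by program: a header line starting with ">" followed by one or more lines of sequence. The parser below is a minimal sketch under that assumption only (the function name `parse_fasta` is hypothetical, and real-world files may need a dedicated library).

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its sequence joined across lines."""
    chunks: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            chunks[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks[header].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


if __name__ == "__main__":
    example = [
        ">gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4",
        "MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI",
        "IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN",
    ]
    for name, seq in parse_fasta(example).items():
        print(name, len(seq))
```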


GenBank Flat File (GBFF)

Header: Title, Taxonomy, Citation
Features (AA seq)
DNA Sequence

FASTA

>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R

Databases
Primary (archival): GenBank/EMBL/DDBJ, UniProt, PDB, Medline (PubMed), IntAct

Secondary (curated): RefSeq, Taxon, UniProt, OMIM, SGD, Biosamples/Bioprojects

http://nar.oxfordjournals.org/content/41/D1.toc (January 2013)

http://nar.oxfordjournals.org/content/40/D1.toc (January 2012)

Sequence Databases
Primary DNA (archive)
DDBJ/ENA/GenBank

Primary protein (curated/automation)


UniProtKB

Curated Databases (lots of human labour)


RefSeq (Genomic, mRNA and protein)
UniProtKB/SwissProt and neXtProt

What is GenBank?
GenBank is the NIH genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information these records contain.
http://www.ncbi.nlm.nih.gov/genbank/
Benson et al., Nucleic Acids Res. 2013: http://www.ncbi.nlm.nih.gov/pubmed/23193287

ENA

EB-eye

Adapted from Rolf Apweiler


Types of files in GenBank

From one-gene investigators:
  Often a very well annotated cDNA
  A genomic segment from a new invertebrate
  A mitochondrion or virus

From population/phylogenetic analysis:
  rRNA amplicon from environmental sampling

From Genome Centers:
  Gene expression: Expressed Sequence Tags, Full Length Insert cDNA, TSA
  Genome sequencing projects: HTG, CON

Functional Divisions
PAT  Patent
EST  Expressed Sequence Tags
TSA  Transcriptome Shotgun Assembly
STS  Sequence Tagged Site
GSS  Genome Survey Sequence
HTG  High Throughput Genome (unfinished)
HTC  High throughput cDNA (unfinished)
CON  Contig assembly instructions
ENV  Environmental sampling methods

Organismal divisions:
BCT PRI FUN ROD INV SYN MAM VRL PHG VRT PLN

Guiding Principles
In GenBank, records are grouped for various reasons: understanding this is key to using and fully taking advantage of this database.

Identifiers
You need identifiers which are stable through time
Need identifiers which will always refer to specific sequences
Need these identifiers to track history of sequence updates
Also need feature and annotation identifiers (need to track important things): Genes, Transcripts, Proteins, (Phenotype)

LOCUS, Accession, NID and protein_id


LOCUS: Unique string of 10 letters and numbers in the database. Not maintained amongst databases, and is therefore a poor sequence identifier.
ACCESSION: A unique identifier to that record, citable entity; does not change when record is updated. A good record identifier, ideal for citation in publication.
VERSION: ID system where the accession and version play the same function as the accession and gi number.
Nucleotide gi: Geninfo identifier (gi), a unique integer which will change every time the sequence changes.
Protein gi: Geninfo identifier (gi), a unique integer which will change every time the sequence changes.
protein_id: Identifier which has the same structure and function as the nucleotide Accession.version numbers, but slightly different format.

LOCUS, Accession, gi and PID

LOCUS       HSU40282     1789 bp    mRNA            PRI       21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U40282.1  GI:3150001

     CDS             157..1515
                     /gene="ILK"
                     /note="protein serine/threonine kinase"
                     /codon_start=1
                     /product="integrin-linked kinase"
                     /protein_id="AAC16892.1"
                     /db_xref="GI:3150002"

LOCUS: HSU40282
ACCESSION: U40282
VERSION (Accession.version): U40282.1
GI (gi): 3150001
Protein gi: 3150002
protein_id: AAC16892.1
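
To show how the accession, version and gi sit together on the VERSION line, here is a small parsing sketch. The regular expression and the function name `parse_version_line` are assumptions written for this example, covering only the layout shown in the record above; the gi part is treated as optional so that a line without it still parses.

```python
import re

# Matches lines such as "VERSION     U40282.1  GI:3150001".
VERSION_RE = re.compile(
    r"^VERSION\s+(?P<accession>[A-Z_]+\d+)\.(?P<version>\d+)(?:\s+GI:(?P<gi>\d+))?\s*$"
)


def parse_version_line(line: str) -> dict:
    """Split a GenBank VERSION line into accession, integer version and optional gi."""
    match = VERSION_RE.match(line)
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    gi = match.group("gi")
    return {
        "accession": match.group("accession"),
        "version": int(match.group("version")),
        "gi": int(gi) if gi is not None else None,
    }


if __name__ == "__main__":
    print(parse_version_line("VERSION     U40282.1  GI:3150001"))
```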


Accession number space


GenBank:
  1+5 (L12345, U00001)
  2+6 (AF000001, AC000003)

WGS (not distributed with GenBank):
  4+2+6 (AAAA01000001, AAAD01000001)

Protein:
  1+5 or 3+5

All have accession.version
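
The letter+digit shapes above can be checked with regular expressions. The sketch below covers only the four shapes named on this slide (1+5, 2+6, 3+5 and 4+2+6); the function name `accession_shape` is hypothetical, and accession formats not listed here are reported as unrecognised rather than guessed at.

```python
import re
from typing import Optional

# Shapes from the slide: letters followed by digits. 1+5 is shared by nucleotide and protein.
SHAPES = [
    ("1+5", re.compile(r"[A-Z]\d{5}")),
    ("2+6", re.compile(r"[A-Z]{2}\d{6}")),
    ("3+5", re.compile(r"[A-Z]{3}\d{5}")),
    ("4+2+6", re.compile(r"[A-Z]{4}\d{2}\d{6}")),
]


def accession_shape(identifier: str) -> Optional[str]:
    """Return the slide's shape label for an accession, ignoring any '.version' suffix."""
    accession = identifier.split(".", 1)[0]
    for label, pattern in SHAPES:
        if pattern.fullmatch(accession):
            return label
    return None  # not one of the shapes listed on the slide


if __name__ == "__main__":
    for example in ("U00001", "AF000001", "AAAA01000001", "AAC16892.1"):
        print(example, accession_shape(example))
```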


Secondary Accession Numbers


When you retire accession numbers, these are often put in the secondary accession number space (e.g. GenBank accession number L05146).
With the removal of sequence length limits, GenBank now allows continuous ranges of secondary accessions.
As of GenBank Release 146.0 (February 2005), it is legal to represent continuous ranges of secondary accessions by a start accession, a dash character, and an end accession (e.g. for the E. coli genome):
ACCESSION U00096 AE000111-AE000510
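
A start-dash-end range such as AE000111-AE000510 stands for every accession in between. The sketch below expands such a range, assuming (as in the E. coli example) that both ends share the same letter prefix and the same number of digits; the function name `expand_accession_range` is hypothetical.

```python
import re
from typing import List


def expand_accession_range(span: str) -> List[str]:
    """Expand 'AE000111-AE000510' into the full list of accessions it covers."""
    start, sep, end = span.partition("-")
    first = re.fullmatch(r"([A-Z]+)(\d+)", start)
    last = re.fullmatch(r"([A-Z]+)(\d+)", end)
    if not sep or first is None or last is None:
        raise ValueError(f"not an accession range: {span!r}")
    if first.group(1) != last.group(1) or len(first.group(2)) != len(last.group(2)):
        raise ValueError("both ends must share the same prefix and digit width")
    prefix, width = first.group(1), len(first.group(2))
    low, high = int(first.group(2)), int(last.group(2))
    if low > high:
        raise ValueError("range start is after range end")
    return [f"{prefix}{number:0{width}d}" for number in range(low, high + 1)]


if __name__ == "__main__":
    accessions = expand_accession_range("AE000111-AE000510")
    print(len(accessions), accessions[0], accessions[-1])
```

For the E. coli example this yields 400 secondary accessions, from AE000111 through AE000510.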


EST: Expressed Sequence Tag
Expressed Sequence Tags are short (300-1000 bp) single reads from mRNA (cDNA) which are produced in large numbers. They represent a snapshot of what is expressed in a given tissue and developmental stage.
Also see:

http://www.ncbi.nlm.nih.gov/dbEST/
http://www.ncbi.nlm.nih.gov/UniGene/

LOCUS       CX016035                 296 bp    mRNA    linear   EST 06-DEC-2004
DEFINITION  qt06h09.g1 Whole Heart Library (DOGEST5) Canis familiaris cDNA,
            mRNA sequence.
ACCESSION   CX016035
VERSION     CX016035.1  GI:56398446
KEYWORDS    EST.
SOURCE      Canis familiaris (dog)
  ORGANISM  Canis familiaris
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Carnivora; Fissipedia; Canidae; Canis.
REFERENCE   1  (bases 1 to 296)
  AUTHORS   Balija,V.S., Nascimento,L.U. and McCombie,W.R.
  TITLE     ESTs from Canis familiaris whole heart (dog)
  JOURNAL   Unpublished (2004)
COMMENT     Contact: W. Richard McCombie
            Lita Annenberg Hazen Genome Sequencing Center
            Cold Spring Harbor Laboratory
            PO Box 100, Cold Spring Harbor, NY 11724, USA
            Tel: 516 367 8884
            Fax: 516 367 8874
            Email: mccombie@cshl.org.
FEATURES             Location/Qualifiers
     source          1..296
                     /organism="Canis familiaris"
                     /mol_type="mRNA"
                     /db_xref="taxon:9615"
                     /sex="Unknown"
                     /dev_stage="3 month old normal canine"
                     /lab_host="XL10 Gold"
                     /clone_lib="Whole Heart Library (DOGEST5)"
                     /note="Organ: Heart; Vector: pBluescript II SK; Site_1:
                     EcoRI; Site_2: XhoI; Library constructed using pBluescript
                     XR kit from Stratagene. Cloned cDNA was size selected
                     between 1-3 kb. Mark Haskins VMD, PhD, Pathology and
                     Medical Genetics, School of Veterinary Medicine,
                     University of Pennsylvania, 3800 Spruce Street,
                     Philadelphia, PA 19104-6051"
ORIGIN
        1 ctccaccgcg gtggcggccg ctctagaact agtggatccc ccgggctgca ggaattcggc
       61 acgaggaggg tcttttatta aaaccaggtg agtcactcca ttcgctgaga aaaggcacac
      121 ttatgttcca gatccacgtc gcctccctcg ggctgggggg tggctggccc actctgtcca
      181 gacctctttt tcattacaga tggacactgg ggggcagtga tggatcagag cgttcttatg
      241 gccgggcctt ggtttatggc ttggatttgg gatcagaggg gagggtgaag gtgtgg
//

Transcriptome Shotgun Assembly


TSA is an archive of computationally assembled sequences from primary data such as ESTs, traces and Next Generation Sequencing Technologies. The overlapping sequence reads from a complete transcriptome are assembled into transcripts by computational methods instead of by traditional cloning and sequencing of cloned cDNAs. The primary sequence data used in the assemblies must have been experimentally determined by the same submitter. TSA sequence records differ from EST and GenBank records because there are no physical counterparts to the assemblies.


LOCUS       EU284816                 447 bp    mRNA    linear   TSA 30-JUN-2008
DEFINITION  TSA: Elaeis guineensis EgEfMPOB00001 ribosomal L28-like protein
            mRNA, complete cds, mRNA sequence.
ACCESSION   EU284816
VERSION     EU284816.1  GI:192908657
DBLINK      BioProject: PRJNA30465
KEYWORDS    TSA; Transcriptome Shotgun Assembly.
SOURCE      Elaeis guineensis (African oil palm)
  ORGANISM  Elaeis guineensis
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; Liliopsida; Arecaceae; Arecoideae;
            Cocoseae; Elaeidinae; Elaeis.
REFERENCE   1  (bases 1 to 447)
  AUTHORS   Low,E.T., Alias,H., Boon,S.H., Shariff,E.M., Tan,C.Y., Ooi,L.C.,
            Cheah,S.C., Raha,A.R., Wan,K.L. and Singh,R.
  TITLE     Oil palm (Elaeis guineensis Jacq.) tissue culture ESTs:
            identifying genes associated with callogenesis and embryogenesis
  JOURNAL   BMC Plant Biol. 8, 62 (2008)
   PUBMED   18507865
  REMARK    Publication Status: Online-Only
REFERENCE   2  (bases 1 to 447)
  AUTHORS   Low,E.T.L., Alias,H., Boon,S.H., Shariff,E.M., Tan,C.Y.A.,
            Ooi,L.C.L., Cheah,S.C., Wan,K.L., Rahim,R.A. and Singh,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-NOV-2007) Advanced Biotechnology and Breeding
            Centre, Biology Division, Malaysian Palm Oil Board, 6, Persiaran
            Institusi, Bandar Baru Bangi, Kajang, Selangor 43000, Malaysia
PRIMARY     TSA_SPAN            PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
            1-447               EY404781.1         14-460
            1-436               EY409465.1         72-507
FEATURES             Location/Qualifiers
     source          1..447
                     /organism="Elaeis guineensis"
                     /mol_type="mRNA"
                     /db_xref="taxon:51953"
     CDS             1..447
                     /codon_start=1
                     /product="ribosomal L28-like protein"
                     /protein_id="ACF06437.1"
                     /db_xref="GI:192908658"
                     /translation="MATVPGPLIWEIVKRNNAFLVKQFGNGNAMVQFSKEPNNLYNVN
                     SYKHSGLANKKTVAIQPGGKDLSVVLATSKTKKQNKPGNLYNRSVMKKEFRKMAKAVK
                     NQVTDNYYRPDLTKAALARLSAVHRSLKVAKSGVKKRNRQAVQVKY"
ORIGIN
        1 atggctactg ttcccggacc acttatctgg gagattgtga agagaaacaa tgcttttctt
       61 gttaagcagt ttggtaatgg caatgccatg gtgcagttta gcaaagagcc gaacaatctc
      121 tacaatgtga attcctacaa gcactccggg ttggcgaaca agaagactgt ggccattcag
      181 ccaggaggga aggacctctc tgtggttctt gcaaccagta agacaaagaa gcagaacaaa
      241 cctggtaatc tgtacaacag gtcggttatg aagaaagagt tccggaagat ggcaaaggca
      301 gtcaagaacc aggtgaccga caactactac aggcctgatt tgaccaaagc agctcttgca
      361 aggctcagtg ctgtacatcg cagcctcaag gttgccaagt ctggagtgaa gaagaggaac
      421 cggcaggctg ttcaagtaaa atactag
//


WGS: Whole Genome Shotgun (not in GenBank release, not shared with ENA/DDBJ)
Contigs from ongoing Whole Genome Shotgun sequencing projects
The nucleotides from WGS projects go into the BLAST wgs database, whereas the proteins go into the BLAST nr database.
More info, and how to submit to this division: http://www.ncbi.nlm.nih.gov/Genbank/wgs.html
Accession format is 4+2+6

http://www.ncbi.nlm.nih.gov/projects/WGS/WGSprojectlist.cgi

WGS record (not in GenBank)


Sequences NOT in GenBank


WGS: whole genome shotgun
TPA: third party annotations
SNPs
SAGE tags (serial analysis of gene expression)
RefSeq (Genomic, mRNA, or protein)
Consensus sequences

What is UniProtKB?
UniProt is a protein sequence database that is the result of a merge of SWISS-PROT and PIR, and is funded by the NIH, EMBL and the Swiss government.
It is the main distributed, annotated, and curated protein sequence database.
Data in UniProt is derived from coding sequence annotations in GenBank/ENA/DDBJ nucleic acid sequence data, and from sequences in PIR and SWISS-PROT.
UniProt is a flat-file database, just like GenBank or ENA/EMBL.

http://www.uniprot.org/
http://database.oxfordjournals.org/content/2011/bar009.long

UniprotKB/SwissProt
Overview of the entry:

ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   TAXONOMY
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RX   CITATION
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE + NH(3) +
CC       2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC   --------------------------------------------------------------------------
CC   DISCLAIMER
CC   --------------------------------------------------------------------------
DR   DATABASE cross-reference
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET      0      0
FT   BINDING     203    203       PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
//

Full entry:

ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DT   01-JUL-1993 (REL. 26, LAST SEQUENCE UPDATE)
DT   01-NOV-1995 (REL. 32, LAST ANNOTATION UPDATE)
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   SACCHAROMYCES CEREVISIAE (BAKER'S YEAST).
OC   EUKARYOTA; FUNGI; ASCOMYCOTA; HEMIASCOMYCETES; SACCHAROMYCETALES;
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RN   [1]
RP   SEQUENCE FROM N.A., AND PARTIAL SEQUENCE.
RX   MEDLINE; 92250430. [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., TANAKA K., NAITO K., HEIKE C., SHINODA S., YAMAMOTO S.,
RA   OHMORI S., OSHIMA T., TOH-E A.;
RT   "Cloning and characterization of the CYS3 (CYI1) gene of
RT   Saccharomyces cerevisiae.";
RL   J. BACTERIOL. 174:3339-3347(1992).
RN   [2]
RP   SEQUENCE FROM N.A., AND CHARACTERIZATION.
RC   STRAIN=DBY939;
RX   MEDLINE; 93328685. [NCBI, ExPASy, Israel, Japan]
RA   YAMAGATA S., D'ANDREA R.J., FUJISAKI S., ISAJI M., NAKAMURA K.;
RT   "Cloning and bacterial expression of the CYS3 gene encoding
RT   cystathionine gamma-lyase of Saccharomyces cerevisiae and the
RT   physicochemical and enzymatic properties of the protein.";
RL   J. BACTERIOL. 175:4800-4808(1993).
RN   [3]
RP   SEQUENCE FROM N.A.
RC   STRAIN=S288C / AB972;
RX   MEDLINE; 93289814. [NCBI, ExPASy, Israel, Japan]
RA   BARTON A.B., KABACK D.B., CLARK M.W., KENG T., OUELLETTE B.F.F.,
RA   STORMS R.K., ZENG B., ZHONG W.W., FORTIN N., DELANEY S., BUSSEY H.;
RT   "Physical localization of yeast CYS3, a gene whose product resembles
RT   the rat gamma-cystathionase and Escherichia coli cystathionine gamma-
RT   synthase enzymes.";
RL   YEAST 9:363-369(1993).
RN   [4]
RP   SEQUENCE FROM N.A.
RC   STRAIN=S288C / AB972;
RX   MEDLINE; 93209532. [NCBI, ExPASy, Israel, Japan]
RA   OUELLETTE B.F.F., CLARK M.W., KENG T., STORMS R.K., ZHONG W.W.,
RA   ZENG B., FORTIN N., DELANEY S., BARTON A.B., KABACK D.B., BUSSEY H.;
RT   "Sequencing of chromosome I from Saccharomyces cerevisiae: analysis
RT   of a 32 kb region between the LTE1 and SPO7 genes.";
RL   GENOME 36:32-42(1993).
RN   [5]
RP   SEQUENCE OF 1-18, AND CHARACTERIZATION.
RX   MEDLINE; 93289817. [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., ISHII N., NAITO K., MIYOSHI S.-I., SHINODA S., YAMAMOTO S.,
RA   OHMORI S.;
RT   "Cystathionine gamma-lyase of Saccharomyces cerevisiae: structural
RT   gene and cystathionine gamma-synthase activity.";
RL   YEAST 9:389-397(1993).
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE + NH(3) +
CC       2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between the Swiss Institute of Bioinformatics and the EMBL outstation
CC   the European Bioinformatics Institute. There are no restrictions on its
CC   use by non-profit institutions as long as its content is in no way
CC   modified and this statement is not removed. Usage by and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to license@isb-sib.ch).
CC   --------------------------------------------------------------------------
DR   EMBL; L05146; AAC04945.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; L04459; AAA85217.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; D14135; BAA03190.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   PIR; S31228; S31228.
DR   YEPD; 5280; -.
DR   SGD; L0000470; CYS3. [SGD / YPD]
DR   PFAM; PF01053; Cys_Met_Meta_PP; 1.
DR   PROSITE; PS00868; CYS_MET_METAB_PP; 1.
DR   DOMO; P31373.
DR   PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR   PROTOMAP; P31373.
DR   PRESAGE; P31373.
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET      0      0
FT   BINDING     203    203       PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
//
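
Because every annotation line in this flat-file layout begins with a two-letter code (ID, AC, DT, ...), a first pass at reading an entry can simply group lines by that code. The sketch below assumes the classic layout shown here (code in columns 1-2, content from column 6, sequence lines indented, "//" as terminator); the function name `group_by_line_code` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_line_code(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect the content of each line under its two-letter line code."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            break  # end of entry
        if len(line) < 2 or line.startswith("  "):
            continue  # blank lines and indented sequence lines carry no code
        grouped[line[:2]].append(line[5:].strip())
    return dict(grouped)


if __name__ == "__main__":
    entry = [
        "ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.",
        "AC   P31373;",
        "DT   01-JUL-1993 (REL. 26, CREATED)",
        "DT   01-NOV-1995 (REL. 32, LAST ANNOTATION UPDATE)",
        "SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;",
        "     TLQESDKFAT KAIHAGEHVD",
        "//",
    ]
    for code, values in group_by_line_code(entry).items():
        print(code, values)
```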

Where do I plant my (human) flag?


Across a large human genome of 2 x 3,000,000,000 bp
Changing coordinate system?
EnsEMBL
UCSC Genome Browser
NCBI Genome Browser

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/

http://genome.ucsc.edu/FAQ/FAQreleases.html

BioProjects & BioSamples
http://www.ncbi.nlm.nih.gov/pubmed/22139929

Schematic depicting how BioProject, BioSample and data objects can be organized and linked.

Barrett T et al. Nucl. Acids Res. 2012;40:D57-D63


Published by Oxford University Press 2011.


OK, what do you want to do with this?


Make discoveries about insights in cancer biology?
Identify diagnostic or prognostic markers for a particular tumor sub-type?
You are actually a closet geneticist, and you just like to look at variation in your favorite gene family in whatever sample you can get your hands on?
You want to integrate methylation marks with gene expression data to better understand cell biology?
You want to develop a new de novo whole genome assembly algorithm that works on cancer genomes?
Want to identify shared signal pathways in all head & neck tumor types?

Historical perspective on the Human Genome Data


Human Expressed Seq Tags (mRNA) sequencing
Human genome mapping and sequencing
Population analysis and polymorphism measurements
Genome Wide Association Studies
<the Homer paper>
The Cancer Genome Atlas pilot
The 1000 Genomes Project
The Cancer Genome Atlas
The International Cancer Genome Consortium

ICGC Controlled Access Datasets


Detailed Phenotype and Outcome data: Region of residence; Risk factors; Examination; Surgery; Drugs; Radiation; Sample; Slide; Specific histological features; Analyte; Aliquot; Donor notes
Gene Expression (probe-level data)
Raw genotype calls
Gene-sample identifier links
Genome sequence files

ICGC OA (Open Access) Datasets
Cancer Pathology: Histologic type or subtype; Histologic nuclear grade
Patient/Person: Gender; Age range
Gene Expression (normalized)
DNA methylation
Genotype frequencies
Computed Copy Number and Loss of Heterozygosity
Newly discovered somatic variants
http://goo.gl/w4mrV

http://dcc.icgc.org/

ftp://data.dcc.icgc.org/

ftp://data.dcc.icgc.org/current/Pancreatic_Cancer-OICR-CA/ssm.oicrPanc.txt.gz
1. Cancer Type: Pancreatic Cancer (OICR, CA)
2. Assembly Version: GRCh37
3. Chromosome: X
4. Chromosome start: 96212915
5. Chromosome end: 96212915
6. Chromosome strand: 1
7. Reference genome allele: C
8. RefSNP allele:
9. RefSNP strand: 1
10. External Variation ID: ce3d9ca5-3634-44d8-a8
11. Mutation ID:
12. Mutation type: single base substitution
13. Mutation: C>T
14. Tumour genotype: C/T
15. Control genotype: C/C
16. Consequence type: upstream_gene_variant
17. CDS mutation:
18. AA mutation:
19. Ensembl Gene ID: ENSG00000214628
20. Transcript affected: ENST00000398690
21. Gene name: RP11-392M9.2.1
22. Is annotated: not annotated
23. Validation status: validated
24. Analysis ID: exomeSSMAgil_201202
25. Platform: Illumina HiSeq
26. Validation platform: Ion Torrent PGM
27. Raw data repository: dbSNP
28. Raw data accession: http://www.ncbi.nlm.nih.gov/snp
29. Base calling algorithm: genome_analyzer_software.ilmn
30. Alignment algorithm: Novoalign
31. Variation calling algorithm: GATK
32. Donor ID: PCSI_0072
33. Sex: male
34. Age at diagnosis: 83
35. Age at enrollment: 82
36. ICD-10: C25
37. Donor tumour stage at diagnosis: 0
38. Specimen ID: PCSI_0072_Pa_P
39. Specimen type: primary tumour
40. Tumour confirmed: primary tumour
41. Tumour histological type: G2
42. Tumour grade:
43. Tumour stage:
44. Specimen donor treatment type: no treatment
45. Specimen donor treatment type other:
46. Analyzed sample ID: PCSI_0072_Pa_P
47. Sample type: Primary tumour
48. Matched sample ID: PCSI_0072_Ly_R
49. Initial Release Date: April 15, 2010
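
A file such as ssm.oicrPanc.txt.gz can be read one record at a time without unpacking it first. The sketch below assumes the file is gzip-compressed, tab-delimited text with a header row; that layout, the function name `read_tsv_gz`, and the column names in the demo are assumptions for illustration, not the documented ICGC schema.

```python
import csv
import gzip
import io
from typing import Dict, Iterator


def read_tsv_gz(path_or_file) -> Iterator[Dict[str, str]]:
    """Yield one dict per row from a gzip-compressed, tab-delimited file with a header."""
    with gzip.open(path_or_file, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


if __name__ == "__main__":
    # Build a tiny in-memory stand-in for the real download (hypothetical column names).
    buffer = io.BytesIO()
    with gzip.open(buffer, mode="wt", encoding="utf-8", newline="") as out:
        out.write("chromosome\tchromosome_start\tmutation\n")
        out.write("X\t96212915\tC>T\n")
    buffer.seek(0)
    for row in read_tsv_gz(buffer):
        print(row["chromosome"], row["chromosome_start"], row["mutation"])
```

Passing a file path instead of the in-memory buffer works the same way, since `gzip.open` accepts either.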


Data, nomenclature and metadata are critical for all of our work!
If a database gets it wrong, the onus is on the user to let the database know that it is wrong!

http://goo.gl/bGjMH


any db
....


Another source of important Cancer Data:



http://www.sanger.ac.uk/genetics/CGP/cosmic/


ftp://ftp.sanger.ac.uk/pub/CGP/cosmic/data_export/


Identify yourself

Fill out the detailed form, which includes:
  Contact and project information
  Information technology details and procedures for keeping data secure
  Data Access Agreement

All of these documents are combined into a PDF file that you print and have your institution sign on your behalf.


[Figure: access routes to TCGA and ICGC data. Labels: TCGA, Open, ERA, dbGaP, BAM; ICGC, Open, DACO, EGA, BAM + EGA id, VCF]

What is Cancer Data?


Structured clinical data about the patient
Structured clinical data about the treatment
Structured clinical data about the tumor
Associated with a number of positions (hundreds, if not thousands) in the nucleotide coordinate system of one reference genome.

ICGC is implementing NCBI's BioProjects: http://www.ncbi.nlm.nih.gov/bioproject


About UCSC Genome Browser


Browse many eukaryotic genomes (yeast to human)
Most annotations are there
Important evolutionary and variation data representation
Very flexible and configurable views
Graphical and table views (Galaxy uses this)
Upload your data into custom tracks and share with colleagues
Client/server application with its issues, but a great app!

Best Way to represent data?


We have been thinking mostly in a linear scale, and decorating the string of DNA with annotations.

What are annotations?



Data being Collected


Phenotype: clinical data, i.e. tumor pathology, age, gender, treatment, survival (controlled access)
Germ line data: SNPs (controlled access)
Somatic mutations in tumour (open)
Copy number variations (open)
RNA abundance & splicing (open)
RNA sequences (controlled access)
DNA methylation (open)


Getting data
COSMIC: http://www.sanger.ac.uk/genetics/CGP/cosmic
ICGC Home: http://icgc.org
ICGC DCC: http://dcc.icgc.org
TCGA Home: http://cancergenome.nih.gov
TCGA: http://tcga-data.nci.nih.gov/tcga
UCSC Home: http://genome.ucsc.edu
UCSC Human: http://genome.ucsc.edu/cgi-bin/hgGateway
UCSC Human Cancer: https://genome-cancer.ucsc.edu


Integrative Genomics Viewer (IGV)


IGV can use and display many file formats

http://www.broadinstitute.org/software/igv/FileFormats


IGV: file formats, e.g. BAM (the binary version of SAM, the Sequence Alignment/Map format)


Ask your question, and then gather the data, the tools and hardware you need
Data and Databases: you will take workshops, you will read papers, and you will go on-line: SeqAnswers & maybe the bioinformatics.ca Links Directory
Tools: you will take workshops, you will read papers, and you will go on-line: SeqAnswers & maybe the bioinformatics.ca Links Directory
Hardware: you need to decide?


What can you do with IGV?


Visualization of different genomic data types:
  aligned sequence reads
  mutations
  copy number
  RNA interference screens
  gene expression
  methylation
  genomic annotations

List of supported data formats:
http://www.broadinstitute.org/software/igv/FileFormats

For this example:
  *.bam for the alignment file
  *.gtf for the genome annotation data

Using IGV to visualize sequence alignment and genomic annotations

Step 1: Choose the genome from the list (or import your own genome file)

Here we have selected hg18 because it was used for the alignment


Using IGV to visualize sequence alignment and genomic annotations


Step 2: Import your alignment file (File -> Load from File)

You can also load a file from a URL, a DAS source or a server


Sample files source: http://manuals.bioinformatics.ucr.edu/home/gui-ngs-analysis and ftp://ftp.broad.mit.edu/pub/igv/INMEGEN2010/

Using IGV to visualize sequence alignment and genomic annotations

Step 2: Import your sequence alignment file

If you load a *.bam file, it must be sorted and indexed, and the index file (*.bai) must be in the same directory.
You can visualize several alignment files at the same time for the same species.
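
Sorting and indexing are usually done with samtools before the file is opened in IGV. The sketch below is one possible way to do it, not part of the original slides: it assumes samtools 1.x is installed (older 0.1.x releases use a different sort syntax), and the function name prepare_bam and the file names in the usage comment are placeholders.

```shell
#!/bin/sh
# prepare_bam IN.bam OUT.bam
# Coordinate-sort IN.bam into OUT.bam, then write the index OUT.bam.bai
# in the same directory as OUT.bam.
# Assumption: samtools 1.x is on the PATH.
prepare_bam() {
    if ! command -v samtools >/dev/null 2>&1; then
        echo "samtools not found on PATH" >&2
        return 127
    fi
    samtools sort -o "$2" "$1" && samtools index "$2"
}

# Hypothetical usage (file names are placeholders):
#   prepare_bam reads.bam reads.sorted.bam
```

Keeping reads.sorted.bam and reads.sorted.bam.bai side by side is what lets IGV find the index when the BAM file is loaded.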


Using IGV to visualize sequence alignment and genomic annotations


Step 3: select the data to display

You can either:
  select a chromosome
  select the coordinates
  search for a gene

Using IGV to visualize sequence alignment and genomic annotations


Step 4: visualize the read alignments on the sequence
You will not see the alignment if the region you are looking at is too large for IGV: zoom in using the + sign (in red) or by double-clicking on the display area

double-click here to zoom in and see the alignment


Using IGV to visualize sequence alignment and genomic annotations

Cytoband

Genomic coordinates

Track names

Data panel

Genomic annotations (default: RefSeq)

Using IGV to visualize sequence alignment and genomic annotations

Coverage of reads on the sequence


White reads: low alignment score
Other colors: depend on the alignment coloring option selected (e.g. insert size, pair orientation, read strand)

Annotated introns

Annotated exons


Using IGV to visualize sequence alignment and genomic annotations

2 examples of variation compared to the reference sequence
Lighter color bases: low quality bases

Reference sequence (here hg18)

Using IGV to visualize sequence alignment and genomic annotations


Step 5.1: download a genomic annotation file from the UCSC Table Browser

1) Go to http://genome.ucsc.edu and click on Tables

Gene annotation files can be downloaded in several ways, for example directly from the source sequence databases


Using IGV to visualize sequence alignment and genomic annotations

Select the genome (here hg18)
Select the gene annotations (here Ensembl)
Select the file format (here GTF)

Choose your file name and click on the get output button
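
Before loading a downloaded GTF file into IGV, a quick command-line check can confirm it looks as expected. The sketch below uses a toy three-line GTF created on the spot (its gene and transcript ids are made up); it relies only on the standard GTF layout of nine tab-separated columns, with the feature type in column 3. A real download may also contain comment lines starting with #, which this sketch does not handle.

```shell
#!/bin/sh
# Toy GTF with the 9 standard tab-separated columns:
# seqname, source, feature, start, end, score, strand, frame, attributes.
# The ids G1 and T1 are invented for illustration.
workdir=$(mktemp -d)
gtf="$workdir/toy.gtf"
printf 'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' > "$gtf"
printf 'chr1\ttoy\tCDS\t120\t200\t.\t+\t0\tgene_id "G1"; transcript_id "T1";\n' >> "$gtf"
printf 'chr1\ttoy\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> "$gtf"

# Count lines that do not have exactly 9 tab-separated fields.
malformed=$(awk -F'\t' 'NF != 9' "$gtf" | wc -l | tr -d ' ')

# Tally feature types (column 3), most frequent first.
feature_counts=$(cut -f 3 "$gtf" | sort | uniq -c | sort -nr)

echo "malformed lines: $malformed"
printf '%s\n' "$feature_counts"

rm -r "$workdir"
```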

Using IGV to visualize sequence alignment and genomic annotations


Step 5.2: load the genomic annotation file in IGV

Select File -> Load from File and choose the GTF file you have downloaded.
You now have access to RefSeq and Ensembl gene annotations:

The more data and annotations you load, the more memory you need.
If you need more, you can select a higher memory limit when you launch IGV.


Using IGV to visualize sequence alignment and genomic annotations


In this example you can visualize a deletion (10 kb, from the IGV publication*)

*Robinson et al. (2011) Nature Biotechnology 29: 24-26

Using IGV to visualize sequence alignment and genomic annotations


You can also visualize copy number variation data (from IGV publication*)

*Robinson et al. (2011) Nature Biotechnology 29: 24-26


Following OpenHelix, UCSC, & SeqAnswers


OpenHelix
http://www.openhelix.com/
Twitter: @openhelix
Blog: http://blog.openhelix.com/

UCSC
http://genome.ucsc.edu/
Twitter: @GenomeBrowser
More tutorials: http://genome.ucsc.edu/training.html

SEQanswers
Forum for NGS technologies
http://seqanswers.com/


Lab exercise:
Update of the stats for 2013 http://goo.gl/tkMXa


CosmicCompleteExport_v64_260313.tsv
How many records in this file?
How many genes? Which are most mutated?
How many publications?
Which is the most mutated gene in the TCGA set?
What concerns about these?
Some hints:
1> cat file.tsv | cut -f 1 | sort | uniq -c | sort -nr > foo.out &
2> grep TCGA file.tsv | wc -l
3> man cut
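
To see what the hinted pipelines return before running them on the full export, they can be tried on a toy file. The sketch below is an illustration under stated assumptions: the two-column layout, the header labels and the sample names are invented, and it assumes the real export has a header row and the gene name in column 1, as the cut -f 1 hint suggests.

```shell
#!/bin/sh
# Toy tab-separated file standing in for the COSMIC export.
# Column layout, header labels and sample names are made up.
workdir=$(mktemp -d)
tsv="$workdir/toy_cosmic.tsv"
printf 'Gene name\tSample name\n' > "$tsv"
printf 'TP53\tTCGA-AA-0001\n' >> "$tsv"
printf 'KRAS\tS2\n' >> "$tsv"
printf 'TP53\tS3\n' >> "$tsv"
printf 'KRAS\tTCGA-AA-0002\n' >> "$tsv"
printf 'TP53\tS4\n' >> "$tsv"

# Records: every line except the header.
records=$(tail -n +2 "$tsv" | wc -l | tr -d ' ')

# Distinct genes: unique values in column 1.
genes=$(tail -n +2 "$tsv" | cut -f 1 | sort -u | wc -l | tr -d ' ')

# Most frequent gene: count per gene, sort by count descending, take the top one.
top_gene=$(tail -n +2 "$tsv" | cut -f 1 | sort | uniq -c | sort -nr | head -n 1 | awk '{print $2}')

# Lines mentioning TCGA anywhere (same idea as grep TCGA | wc -l).
tcga_records=$(grep -c TCGA "$tsv")

echo "records=$records genes=$genes top_gene=$top_gene tcga_records=$tcga_records"

rm -r "$workdir"
```

On the real file, counting lines is not the same as counting distinct mutations or samples, which is one of the concerns the exercise asks about.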

Get simple somatic mutations from OICR's pancreatic cancer ICGC project
> cat ssm.oicrPanc.txt | wc -l
What is this?
> cat ssm.oicrPanc.txt | awk -F'\t' '{print $3}' | sort | uniq -c
What is this?
> cat ssm.oicrPanc.txt | awk -F'\t' '{print $23}' | sort | uniq -c
What is this?

Etc ... See wiki
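
Two details of the awk commands are easy to get wrong: the tab separator must be quoted, and uniq -c only merges adjacent identical lines, so the input should be sorted first. The sketch below shows both effects on a toy three-column file (cancer type, assembly, chromosome) whose rows are invented for illustration; it does not use the real ssm.oicrPanc.txt.

```shell
#!/bin/sh
# Toy file with three tab-separated columns; the rows are invented.
workdir=$(mktemp -d)
ssm="$workdir/toy_ssm.txt"
printf 'Pancreatic Cancer (OICR, CA)\tGRCh37\tX\n' > "$ssm"
printf 'Pancreatic Cancer (OICR, CA)\tGRCh37\t1\n' >> "$ssm"
printf 'Pancreatic Cancer (OICR, CA)\tGRCh37\tX\n' >> "$ssm"

# With an explicit tab separator, $3 is the chromosome column.
chrom_counts=$(awk -F'\t' '{print $3}' "$ssm" | sort | uniq -c | sort -nr)

# Without -F, awk splits on any whitespace, so $3 is a fragment of column 1.
default_split=$(awk '{print $3}' "$ssm" | head -n 1)

# Without sort, uniq -c reports X twice because the two X lines are not adjacent.
unsorted_groups=$(awk -F'\t' '{print $3}' "$ssm" | uniq -c | wc -l | tr -d ' ')

printf '%s\n' "$chrom_counts"
echo "third field with default splitting: $default_split"
echo "groups reported without sort: $unsorted_groups"

rm -r "$workdir"
```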
