13-05-23
Module 2a
bioinformatics.ca
http://durtridingurl.blogspot.ca/2011/04/cloud-kingdom.html
Disclaimer
I do not (and will not) profit in any way, shape or form, from any of the brands, products or companies I may mention.
[Figure: growth plot on logarithmic axes, 1990 to 2012. Doubling time = 14 months.]
http://goo.gl/7PVAl
Some of the challenges with cloud computing:
- Not cheap!
- Getting files to and from the cloud
- Not the best solution for everybody
- Standardization
- PHI (personal health information) and security concerns
- In the USA: the Patriot Act
Some of the advantages of cloud computing:
- At the CBW, we received a grant from Amazon, so this workshop is supported by an AWS in Education grant award.
- There are better ways of transferring large files, and AWS now makes it free to upload files.
- A number of datasets exist on AWS (e.g. 1000 Genomes data).
- Many useful bioinformatics AMIs (Amazon Machine Images) exist on AWS, e.g. CloudBioLinux and CloudMan (Galaxy).
- Many flavors of cloud are available, not just AWS.
In this workshop:
Some tools (and data) are
- on your computer
- on the web
- on the cloud.
You will become efficient at traversing these various spaces, finding the resources you need, and using what is best for you.
There are different ways of using the cloud:
1. Command line (like your own very powerful Unix box)
2. With a web browser (e.g. Galaxy): not in this workshop
ICGC OA Datasets
- Cancer Pathology: histologic type or subtype; histologic nuclear grade
- Patient/Person: gender; age range
- Gene Expression (normalized)
- DNA methylation
- Genotype frequencies
- Computed Copy Number and Loss of Heterozygosity
- Newly discovered somatic variants
http://goo.gl/w4mrV
On Mac: Control+
File permissions:
ls -l (long listing)
drwx------+ 67 francis  staff  2278 22 May 21:25 ../
-rw-r--r--@  1 francis  staff  1696 22 May 21:31 CBWkey.pem
The nine permission characters are read in three groups: rwx for the owner, rwx for the group, rwx for the world.
r read (4)
w write (2)
x execute (1)
Whichever way you add these three numbers, you know which ones were used (6 is always 4+2, 5 is 4+1, 4 is by itself, 0 is none of them, etc.). So, when you have:
chmod 600 <file name>
it is rw for the file owner only.
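The arithmetic above can be checked with a short shell sketch. The helper name octal_to_rwx is hypothetical (it is not a standard command); it only decodes each octal digit into its r (4), w (2) and x (1) bits, and the temporary file stands in for a key file such as CBWkey.pem.

```shell
# Decode a three-digit octal mode (e.g. 600) into the rwx string that ls -l shows.
# octal_to_rwx is a hypothetical helper written for this illustration.
octal_to_rwx() {
    out=""
    # Split the mode into single digits: "600" becomes "6 0 0".
    for digit in $(printf '%s' "$1" | sed 's/./& /g'); do
        r="-"; w="-"; x="-"
        if [ $((digit & 4)) -ne 0 ]; then r="r"; fi   # 4 = read
        if [ $((digit & 2)) -ne 0 ]; then w="w"; fi   # 2 = write
        if [ $((digit & 1)) -ne 0 ]; then x="x"; fi   # 1 = execute
        out="$out$r$w$x"
    done
    printf '%s\n' "$out"
}

octal_to_rwx 600   # rw-------
octal_to_rwx 644   # rw-r--r--
octal_to_rwx 755   # rwxr-xr-x

# Apply mode 600 to a throwaway file and look at the long listing.
keyfile=$(mktemp)
chmod 600 "$keyfile"
ls -l "$keyfile"
rm -f "$keyfile"
```

Mode 600 is the usual choice for a private key file, since it removes every group and world bit.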
Logging in to AWS
Windows
From now on, just double-click CBW to login.
We are now on Lunch Break and we are networking, and having a great time with our CBW colleagues. Wish you were here? Register for Canadian Bioinformatics Workshops here: bioinformatics.ca
13-05-24
Module 2b
bioinformatics.ca
Disclaimer
I do not (and will not) profit in any way, shape or form, from any of the brands, products or companies I may mention.
"Nothing in Biology Makes Sense Except in the Light of Evolution" Theodosius Dobzhansky, 1973
http://goo.gl/doHu7
With apologies to Theodosius: "Nothing in Bioinformatics Makes Sense Except in the Light of Evolution"
Q: Why do we have Bioinformatics? A: Open Data from Genomic and Proteomics Technologies
BLAST would not have been invented if GenBank or the Atlas of Protein Sequence and Structure had not existed beforehand.
DDBJ/EMBL/GenBank is the open access source of all publicly available DNA sequences.
The Atlas of Protein Sequence and Structure, published from 1965 to 1978 under the editorship of Margaret Dayhoff, later became the PIR, and then the UniProtKB database.
How do you define bioinformatics? Think, pair, share.
Bioinformatics is about integrating biological themes together with the help of computer tools and biological databases, and gaining new knowledge about the system in study.
Open Data
Bioinformatics experiments:
Sequence Reagents:
Sequence Databases
Alignment Interpretation:
Similarity Hypothesis testing
Do your controls
Nature 409:452
15
Databases
GenBank flat file COSMIC record Interaction Record Title of a book Book
Information system Query system Storage System Data The Library of Congress Google Entrez EnsEMBL UCSC genome browser
http://www.ncbi.nlm.nih.gov/gquery/
All [filter]
Formats
DNA sequence (GenBank Flat Files) Protein Sequences Other formats to know about
FASTA GFF3 XML
Header
DNA Sequence
FASTA
>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R
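Because a FASTA record is just a ">" header line followed by sequence lines, standard Unix tools are enough to summarize one. The sketch below saves the GCN4 record from this slide to a temporary file and counts records and residues; the function names count_records and count_residues are hypothetical helpers, not part of any package.

```shell
# Save the slide's FASTA record to a temporary file.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R
EOF

# Every record starts with a ">" header line, so counting headers counts records.
count_records() { grep -c '^>' "$1"; }

# Drop header lines, remove newlines, and count the remaining characters.
count_residues() { grep -v '^>' "$1" | tr -d '\n' | wc -c | tr -d ' '; }

count_records "$fasta"    # 1
count_residues "$fasta"   # 281

# Show only the header line (the identifiers and the description).
grep '^>' "$fasta"

rm -f "$fasta"
```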
Databases
Primary (archival)
GenBank/EMBL/DDBJ UniProt PDB Medline (PubMed) IntAct
Secondary (curated)
RefSeq Taxon UniProt OMIM SGD Biosamples/Bioprojects
http://nar.oxfordjournals.org/content/40/D1.toc
January 2012
Sequence Databases
Primary DNA (archive)
DDBJ/ENA/GenBank
What is GenBank?
GenBank is the NIH genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information these records contain.
http://www.ncbi.nlm.nih.gov/genbank/
Benson et al., Nucleic Acids Res. 2013 http://www.ncbi.nlm.nih.gov/pubmed/23193287
ENA
EB-eye
Functional Divisions
PAT  Patent
EST  Expressed Sequence Tags
TSA  Transcriptome Shotgun Assembly
STS  Sequence Tagged Site
GSS  Genome Survey Sequence
HTG  High Throughput Genome (unfinished)
HTC  High Throughput cDNA (unfinished)
CON  Contig assembly instructions
ENV  Environmental sampling methods
Organismal divisions:
BCT PRI FUN ROD INV SYN MAM VRL PHG VRT PLN
Guiding Principles
In GenBank, records are grouped for various reasons: understanding this is key to using and taking full advantage of this database.
Identifiers
- You need identifiers which are stable through time
- You need identifiers which will always refer to specific sequences
- You need these identifiers to track the history of sequence updates
- You also need feature and annotation identifiers (to track important things)
Genes Transcripts Proteins ((( Phenotype )))
3150002
AAC16892.1
157..1515
    /gene="ILK"
    /note="protein serine/threonine kinase"
    /codon_start=1
    /product="integrin-linked kinase"
    /protein_id="AAC16892.1"
    /db_xref="GI:3150002"
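An identifier such as AAC16892.1 carries two pieces of information: the stable accession and a version suffix that tracks sequence updates. The following is a minimal sketch of pulling the /protein_id value out of a qualifier line and splitting it; the helper names protein_id, accession_of and version_of are hypothetical.

```shell
# A qualifier line as it appears in the feature table above.
qualifier='/protein_id="AAC16892.1"'

# Extract the quoted value of /protein_id from a qualifier line.
protein_id() {
    printf '%s\n' "$1" | sed -n 's/.*\/protein_id="\([^"]*\)".*/\1/p'
}

# Everything before the first dot is the stable accession.
accession_of() { printf '%s\n' "${1%%.*}"; }

# Everything after the last dot is the sequence version.
version_of() { printf '%s\n' "${1##*.}"; }

id=$(protein_id "$qualifier")
echo "$id"            # AAC16892.1
accession_of "$id"    # AAC16892
version_of "$id"      # 1
```

Scripts that need to follow a sequence across updates would typically key on the accession and compare versions separately.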
Protein:
1+5 or 3+5
EST: Expressed Sequence Tag Expressed Sequence Tags are shorter (300-1000 bp) single reads from mRNA (cDNA) which are produced in large numbers. They represent a snapshot of what is expressed in a given tissue, and developmental stage.
Also see:
LOCUS       CX016035                 296 bp    mRNA    linear   EST 06-DEC-2004
DEFINITION  qt06h09.g1 Whole Heart Library (DOGEST5) Canis familiaris cDNA,
            mRNA sequence.
ACCESSION   CX016035
VERSION     CX016035.1  GI:56398446
KEYWORDS    EST.
SOURCE      Canis familiaris (dog)
  ORGANISM  Canis familiaris
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Carnivora; Fissipedia; Canidae; Canis.
REFERENCE   1  (bases 1 to 296)
  AUTHORS   Balija,V.S., Nascimento,L.U. and McCombie,W.R.
  TITLE     ESTs from Canis familiaris whole heart (dog)
  JOURNAL   Unpublished (2004)
COMMENT     Contact: W. Richard McCombie
            Lita Annenberg Hazen Genome Sequencing Center
            Cold Spring Harbor Laboratory
            PO Box 100, Cold Spring Harbor, NY 11724, USA
            Tel: 516 367 8884
            Fax: 516 367 8874
            Email: mccombie@cshl.org.
FEATURES             Location/Qualifiers
     source          1..296
                     /organism="Canis familiaris"
                     /mol_type="mRNA"
                     /db_xref="taxon:9615"
                     /sex="Unknown"
                     /dev_stage="3 month old normal canine"
                     /lab_host="XL10 Gold"
                     /clone_lib="Whole Heart Library (DOGEST5)"
                     /note="Organ: Heart; Vector: pBluescript II SK; Site_1:
                     EcoRI; Site_2: XhoI; Library constructed using
                     pBluescript XR kit from Stratagene. Cloned cDNA was size
                     selected between 1-3 kb. Mark Haskins VMD, PhD, Pathology
                     and Medical Genetics, School of Veterinary Medicine,
                     University of Pennsylvania, 3800 Spruce Street,
                     Philadelphia, PA 19104-6051"
Sequence (296 bases, in the order extracted from the slide):
gtggcggccg tcttttatta gatccacgtc tcattacaga ggtttatggc ctctagaact
aaaccaggtg gcctccctcg tggacactgg ttggatttgg agtggatccc agtcactcca
ggctgggggg ggggcagtga gatcagaggg ccgggctgca ttcgctgaga tggctggccc
tggatcagag gagggtgaag ggaattcggc aaaggcacac actctgtcca cgttcttatg
gtgtgg
LOCUS       EU284816                 447 bp    mRNA    linear   TSA 30-JUN-2008
DEFINITION  TSA: Elaeis guineensis EgEfMPOB00001 ribosomal L28-like protein
            mRNA, complete cds, mRNA sequence.
ACCESSION   EU284816
VERSION     EU284816.1  GI:192908657
DBLINK      BioProject: PRJNA30465
KEYWORDS    TSA; Transcriptome Shotgun Assembly.
SOURCE      Elaeis guineensis (African oil palm)
  ORGANISM  Elaeis guineensis
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; Liliopsida; Arecaceae; Arecoideae;
            Cocoseae; Elaeidinae; Elaeis.
REFERENCE   1  (bases 1 to 447)
  AUTHORS   Low,E.T., Alias,H., Boon,S.H., Shariff,E.M., Tan,C.Y., Ooi,L.C.,
            Cheah,S.C., Raha,A.R., Wan,K.L. and Singh,R.
  TITLE     Oil palm (Elaeis guineensis Jacq.) tissue culture ESTs: identifying
            genes associated with callogenesis and embryogenesis
  JOURNAL   BMC Plant Biol. 8, 62 (2008)
   PUBMED   18507865
  REMARK    Publication Status: Online-Only
REFERENCE   2  (bases 1 to 447)
  AUTHORS   Low,E.T.L., Alias,H., Boon,S.H., Shariff,E.M., Tan,C.Y.A.,
            Ooi,L.C.L., Cheah,S.C., Wan,K.L., Rahim,R.A. and Singh,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-NOV-2007) Advanced Biotechnology and Breeding Centre,
            Biology Division, Malaysian Palm Oil Board, 6, Persiaran Institusi,
            Bandar Baru Bangi, Kajang, Selangor 43000, Malaysia
PRIMARY     TSA_SPAN            PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
            1-447               EY404781.1         14-460
            1-436               EY409465.1         72-507
FEATURES             Location/Qualifiers
     source          1..447
                     /organism="Elaeis guineensis"
                     /mol_type="mRNA"
                     /db_xref="taxon:51953"
     CDS             1..447
                     /codon_start=1
                     /product="ribosomal L28-like protein"
                     /protein_id="ACF06437.1"
                     /db_xref="GI:192908658"
                     /translation="MATVPGPLIWEIVKRNNAFLVKQFGNGNAMVQFSKEPNNLYNVN
                     SYKHSGLANKKTVAIQPGGKDLSVVLATSKTKKQNKPGNLYNRSVMKKEFRKMAKAVK
                     NQVTDNYYRPDLTKAALARLSAVHRSLKVAKSGVKKRNRQAVQVKY"
ORIGIN
        1 atggctactg ttcccggacc acttatctgg gagattgtga agagaaacaa tgcttttctt
       61 gttaagcagt ttggtaatgg caatgccatg gtgcagttta gcaaagagcc gaacaatctc
      121 tacaatgtga attcctacaa gcactccggg ttggcgaaca agaagactgt ggccattcag
      181 ccaggaggga aggacctctc tgtggttctt gcaaccagta agacaaagaa gcagaacaaa
      241 cctggtaatc tgtacaacag gtcggttatg aagaaagagt tccggaagat ggcaaaggca
      301 gtcaagaacc aggtgaccga caactactac aggcctgatt tgaccaaagc agctcttgca
      361 aggctcagtg ctgtacatcg cagcctcaag gttgccaagt ctggagtgaa gaagaggaac
      421 cggcaggctg ttcaagtaaa atactag
//
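GenBank flat files put a keyword (LOCUS, ACCESSION, VERSION, ORGANISM and so on) at the start of each line, which makes them easy to query with awk. The sketch below is illustrative only: it writes a few header lines of the EU284816 record to a temporary file, and the helper names gb_field and gb_division are hypothetical. It assumes the division code is the second-to-last field of the LOCUS line, as in the records shown here.

```shell
# A few header lines of the TSA record, saved to a temporary file.
gb=$(mktemp)
cat > "$gb" <<'EOF'
LOCUS       EU284816                 447 bp    mRNA    linear   TSA 30-JUN-2008
ACCESSION   EU284816
VERSION     EU284816.1  GI:192908657
SOURCE      Elaeis guineensis (African oil palm)
  ORGANISM  Elaeis guineensis
EOF

# Print the text that follows a given keyword (first field of the line).
gb_field() {
    awk -v key="$2" '$1 == key { $1 = ""; sub(/^ +/, ""); print }' "$1"
}

# Print the GenBank division code from the LOCUS line (second-to-last field).
gb_division() {
    awk '$1 == "LOCUS" { print $(NF - 1) }' "$1"
}

gb_field "$gb" ACCESSION   # EU284816
gb_field "$gb" ORGANISM    # Elaeis guineensis
gb_division "$gb"          # TSA

rm -f "$gb"
```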
WGS: Whole Genome Shotgun (not in the GenBank release, not shared with ENA/DDBJ)
Contigs from ongoing Whole Genome Shotgun sequencing projects.
The nucleotides from WGS projects go into the BLAST wgs database, whereas the proteins go into the BLAST nr database.
More info, and how to submit to this division: http://www.ncbi.nlm.nih.gov/Genbank/wgs.html
Accession format is 4+2+6
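One way to read "4+2+6" is as a pattern: four letters (the project code), two digits (the assembly version) and six digits (the contig number). That reading is an assumption made for this sketch, and both the helper name is_wgs_accession and the example accession ABCD01000001 are hypothetical.

```shell
# Return success when the argument matches the assumed 4+2+6 layout:
# 4 uppercase letters, then 2 digits, then 6 digits.
is_wgs_accession() {
    printf '%s\n' "$1" | grep -Eq '^[A-Z]{4}[0-9]{2}[0-9]{6}$'
}

# ABCD01000001 is a made-up accession used only to exercise the pattern.
for acc in ABCD01000001 CX016035 EU284816; do
    if is_wgs_accession "$acc"; then
        echo "$acc looks like a WGS accession"
    else
        echo "$acc does not match 4+2+6"
    fi
done
```

The two EST/TSA accessions from the earlier records (CX016035, EU284816) fail the check because they follow a letters-plus-digits layout of a different length.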
http://www.ncbi.nlm.nih.gov/projects/WGS/WGSprojectlist.cgi
What is UniProtKB?
UniProt is a protein sequence database that resulted from the merger of SWISS-PROT and PIR and is funded by the NIH, EMBL and the Swiss government. It is the main distributed, annotated, and curated protein sequence database. Data in UniProt are derived from coding sequence annotations in GenBank/ENA/DDBJ nucleic acid sequence data, and from sequences in PIR and Swiss-Prot (SP). UniProt is a flat-file database, just like GenBank or ENA/EMBL.
http://www.uniprot.org/
http://database.oxfordjournals.org/content/2011/bar009.long
UniProtKB/Swiss-Prot
Simplified view of the entry:
ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   TAXONOMY
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RX   CITATION
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE +
CC       NH(3) + 2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC   --------------------------------------------------------------------------
CC   DISCLAIMER
CC   --------------------------------------------------------------------------
DR   DATABASE cross-reference
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET      0      0
FT   BINDING     203    203       PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
//
Full entry:
ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DT   01-JUL-1993 (REL. 26, LAST SEQUENCE UPDATE)
DT   01-NOV-1995 (REL. 32, LAST ANNOTATION UPDATE)
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   SACCHAROMYCES CEREVISIAE (BAKER'S YEAST).
OC   EUKARYOTA; FUNGI; ASCOMYCOTA; HEMIASCOMYCETES; SACCHAROMYCETALES;
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RN   [1]
RP   SEQUENCE FROM N.A., AND PARTIAL SEQUENCE.
RX   MEDLINE; 92250430. [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., TANAKA K., NAITO K., HEIKE C., SHINODA S., YAMAMOTO S.,
RA   OHMORI S., OSHIMA T., TOH-E A.;
RT   "Cloning and characterization of the CYS3 (CYI1) gene of
RT   Saccharomyces cerevisiae.";
RL   J. BACTERIOL. 174:3339-3347(1992).
RN   [2]
RP   SEQUENCE FROM N.A., AND CHARACTERIZATION.
RC   STRAIN=DBY939;
RX   MEDLINE; 93328685. [NCBI, ExPASy, Israel, Japan]
RA   YAMAGATA S., D'ANDREA R.J., FUJISAKI S., ISAJI M., NAKAMURA K.;
RT   "Cloning and bacterial expression of the CYS3 gene encoding
RT   cystathionine gamma-lyase of Saccharomyces cerevisiae and the
RT   physicochemical and enzymatic properties of the protein.";
RL   J. BACTERIOL. 175:4800-4808(1993).
RN   [3]
RP   SEQUENCE FROM N.A.
RC   STRAIN=S288C / AB972;
RX   MEDLINE; 93289814. [NCBI, ExPASy, Israel, Japan]
RA   BARTON A.B., KABACK D.B., CLARK M.W., KENG T., OUELLETTE B.F.F.,
RA   STORMS R.K., ZENG B., ZHONG W.W., FORTIN N., DELANEY S., BUSSEY H.;
RT   "Physical localization of yeast CYS3, a gene whose product resembles
RT   the rat gamma-cystathionase and Escherichia coli cystathionine gamma-
RT   synthase enzymes.";
RL   YEAST 9:363-369(1993).
RN   [4]
RP   SEQUENCE FROM N.A.
RC   STRAIN=S288C / AB972;
RX   MEDLINE; 93209532. [NCBI, ExPASy, Israel, Japan]
RA   OUELLETTE B.F.F., CLARK M.W., KENG T., STORMS R.K., ZHONG W.W.,
RA   ZENG B., FORTIN N., DELANEY S., BARTON A.B., KABACK D.B., BUSSEY H.;
RT   "Sequencing of chromosome I from Saccharomyces cerevisiae: analysis
RT   of a 32 kb region between the LTE1 and SPO7 genes.";
RL   GENOME 36:32-42(1993).
RN   [5]
RP   SEQUENCE OF 1-18, AND CHARACTERIZATION.
RX   MEDLINE; 93289817. [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., ISHII N., NAITO K., MIYOSHI S.-I., SHINODA S., YAMAMOTO S.,
RA   OHMORI S.;
RT   "Cystathionine gamma-lyase of Saccharomyces cerevisiae: structural
RT   gene and cystathionine gamma-synthase activity.";
RL   YEAST 9:389-397(1993).
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE +
CC       NH(3) + 2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between the Swiss Institute of Bioinformatics and the EMBL outstation
CC   the European Bioinformatics Institute. There are no restrictions on its
CC   use by non-profit institutions as long as its content is in no way
CC   modified and this statement is not removed. Usage by and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to license@isb-sib.ch).
CC   --------------------------------------------------------------------------
DR   EMBL; L05146; AAC04945.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; L04459; AAA85217.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; D14135; BAA03190.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   PIR; S31228; S31228.
DR   YEPD; 5280; -.
DR   SGD; L0000470; CYS3. [SGD / YPD]
DR   PFAM; PF01053; Cys_Met_Meta_PP; 1.
DR   PROSITE; PS00868; CYS_MET_METAB_PP; 1.
DR   DOMO; P31373.
DR   PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR   PROTOMAP; P31373.
DR   PRESAGE; P31373.
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET      0      0
FT   BINDING     203    203       PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
//
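In this flat-file layout every line starts with a two-letter line code (ID, AC, DE, DR, ...), so grep can pull out one kind of information at a time. The sketch below uses a handful of lines from the CYS3_YEAST entry; the helper name sp_lines is hypothetical.

```shell
# A few lines of the CYS3_YEAST entry, saved to a temporary file.
sp=$(mktemp)
cat > "$sp" <<'EOF'
ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
DR   PIR; S31228; S31228.
DR   SGD; L0000470; CYS3.
//
EOF

# Print the content of every line that carries the given two-letter code.
sp_lines() {
    grep "^$2 " "$1" | sed "s/^$2 *//"
}

sp_lines "$sp" AC                      # P31373;
sp_lines "$sp" DR | cut -d';' -f1      # PIR and SGD, one per line
sp_lines "$sp" DR | wc -l              # number of cross-references

rm -f "$sp"
```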
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/
http://genome.ucsc.edu/FAQ/FAQreleases.html
Schematic depicting how BioProject, BioSample and data objects can be organized and linked.
ICGC OA Datasets
- Cancer Pathology: histologic type or subtype; histologic nuclear grade
- Patient/Person: gender; age range
- Gene Expression (normalized)
- DNA methylation
- Genotype frequencies
- Computed Copy Number and Loss of Heterozygosity
- Newly discovered somatic variants
http://goo.gl/w4mrV
http://dcc.icgc.org/
ftp://data.dcc.icgc.org/
ftp://data.dcc.icgc.org/current/Pancreatic_Cancer-OICR-CA/ssm.oicrPanc.txt.gz
Fields of one record (field number, field name: value; fields with no value were blank on the slide):
1. Cancer Type: Pancreatic Cancer (OICR, CA)
2. Assembly Version: GRCh37
3. Chromosome: X
4. Chromosome start: 96212915
5. Chromosome end: 96212915
6. Chromosome strand: 1
7. Reference genome allele: C
8. RefSNP allele:
9. RefSNP strand: 1
10. External Variation ID: ce3d9ca5-3634-44d8-a8
11. Mutation ID:
12. Mutation type: single base substitution
13. Mutation: C>T
14. Tumour genotype: C/T
15. Control genotype: C/C
16. Consequence type: upstream_gene_variant
17. CDS mutation:
18. AA mutation:
19. Ensembl Gene ID: ENSG00000214628
20. Transcript affected: ENST00000398690
21. Gene name: RP11-392M9.2.1
22. Is annotated: not annotated
23. Validation status: validated
24. Analysis ID: exomeSSMAgil_201202
25. Platform: Illumina HiSeq
26. Validation platform: Ion Torrent PGM
27. Raw data repository: dbSNP
28. Raw data accession: http://www.ncbi.nlm.nih.gov/snp
29. Base calling algorithm: genome_analyzer_sofware.ilmn
30. Alignment algorithm: Novoalign
31. Variation calling algorithm: GATK
32. Donor ID: PCSI_0072
33. Sex: male
34. Age at diagnosis: 83
35. Age at enrollment: 82
36. ICD-10: C25
37. Donor tumour stage at diagnosis: 0
38. Specimen ID: PCSI_0072_Pa_P
39. Specimen type: primary tumour
40. Tumour confirmed: primary tumour
41. Tumour histological type: G2
42. Tumour grade:
43. Tumour stage:
44. Specimen donor treatment type: no treatment
45. Specimen donor treatment type other:
46. Analyzed sample ID: PCSI_0072_Pa_P
47. Sample type: Primary tumour
48. Matched sample ID: PCSI_0072_Ly_R
49. Initial Release Date: April 15, 2010
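With this many fields, it helps to number the columns before reaching for cut or awk. The sketch below assumes the file is tab-delimited and that its first line is a header row of field names; the three-column sample and the helper name number_columns are hypothetical stand-ins for the real file.

```shell
# Tiny tab-delimited stand-in for the real file: a header row plus one record.
ssm=$(mktemp)
printf 'Cancer Type\tAssembly Version\tChromosome\n' > "$ssm"
printf 'Pancreatic Cancer (OICR, CA)\tGRCh37\tX\n' >> "$ssm"

# Print "column number <TAB> column name" for each field of the header line.
number_columns() {
    head -n 1 "$1" | tr '\t' '\n' | awk '{ printf "%d\t%s\n", NR, $0 }'
}

number_columns "$ssm"

rm -f "$ssm"
```

The number printed next to a field name is the one to use with cut -f or with awk's $N.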
Data, nomenclature and metadata are critical for all of our work!
If databases get it wrong, the onus is on the user to let the databases know that it is wrong!
http://goo.gl/bGjMH
http://www.sanger.ac.uk/genetics/CGP/cosmic/
ftp://ftp.sanger.ac.uk/pub/CGP/cosmic/data_export/
Identify yourself
Fill out the detailed form, which includes:
- Contact and Project Information
- Information Technology details and procedures for keeping data secure
- Data Access Agreement
All of these documents are put into a PDF file that you print and get your institution to sign on your behalf.
VCF
Getting data
Cosmic: http://www.sanger.ac.uk/genetics/CGP/cosmic
ICGC Home: http://icgc.org
ICGC DCC: http://dcc.icgc.org
TCGA Home: http://cancergenome.nih.gov
TCGA: http://tcga-data.nci.nih.gov/tcga
UCSC Home: http://genome.ucsc.edu
UCSC Human: http://genome.ucsc.edu/cgi-bin/hgGateway
UCSC Human Cancer: https://genome-cancer.ucsc.edu
IGV: file formats, e.g. BAM (the binary version of SAM, the Sequence Alignment/Map format)
Ask your question, and then gather the data, the tools and hardware you need
Data and Databases: you will take workshops, you will read papers, and you will go online: SEQanswers and maybe the bioinformatics.ca Links Directory.
Tools: you will take workshops, you will read papers, and you will go online: SEQanswers and maybe the bioinformatics.ca Links Directory.
Hardware: you need to decide.
For this example:
- *.bam for the alignment file
- *.gtf for the genome annotation data
Step 1: Choose the genome in the list (or import your own genome file)
Here we have selected hg18 because it was used for the alignment
If you download a *.bam file, it must be sorted and indexed, and the index *.bai file must be in the same directory.
You can visualize several alignment files at the same time for the same species.
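A quick pre-flight check can catch a missing index before a BAM file is loaded. The sketch below only tests for an index file next to the BAM; it assumes the index is named either sample.bam.bai or sample.bai, and the helper name has_bam_index and the file names are hypothetical. Sorting and indexing themselves are commonly done with samtools (samtools sort -o sorted.bam input.bam, then samtools index sorted.bam), which is mentioned in comments but not run here.

```shell
# Succeed if an index file sits next to the given BAM file.
# Assumed naming: sample.bam.bai or sample.bai in the same directory.
has_bam_index() {
    bam=$1
    [ -f "$bam.bai" ] || [ -f "${bam%.bam}.bai" ]
}

# Demonstration with empty placeholder files in a scratch directory.
workdir=$(mktemp -d)
touch "$workdir/sample.bam"

if has_bam_index "$workdir/sample.bam"; then
    echo "index found"
else
    # With samtools installed, the usual fix would be:
    #   samtools sort -o sorted.bam sample.bam
    #   samtools index sorted.bam
    echo "no index found: sort and index the BAM file first"
fi

touch "$workdir/sample.bam.bai"
has_bam_index "$workdir/sample.bam" && echo "index found"

rm -rf "$workdir"
```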
Cytoband
Genomic coordinates
Track names
Data panel
Annotated introns
Annotated exons
Two examples of variation compared to the reference sequence.
Lighter color bases: low quality bases.
Gene annotation files can be downloaded in several ways, for example directly from the source sequence databases.
Select the genome (here hg18).
Select the gene annotations (here Ensembl).
Select the file format (here GTF).
Choose your file name and click on the "get output" button.
Select File->Load from file and choose the GTF file you have downloaded.
You now have access to RefSeq and Ensembl gene annotations.
The more data and annotations you load, the more memory you need. You can select a higher memory threshold when you launch IGV if you need it.
UCSC
http://genome.ucsc.edu/
Twitter: @GenomeBrowser
More tutorials: http://genome.ucsc.edu/training.html
SEQanswers
Forum for NGS technologies
http://seqanswers.com/
Lab exercise: Update of the stats for 2013
http://goo.gl/tkMXa
CosmicCompleteExport_v64_260313.tsv
- How many records are in this file?
- How many genes? Which are most mutated?
- How many publications?
- Which is the most mutated gene in the TCGA set?
- What concerns do you have about these?
Some hints:
1> cat file.tsv | cut -f 1 | sort | uniq -c | sort -nr > foo.out &
2> grep TCGA file.tsv | wc -l
3> man cut
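The first two hints can be tried on a toy file before running them on the full export. The sketch below uses a made-up two-column, tab-delimited file (gene name, sample name); the values and the helper name count_column are hypothetical and do not come from COSMIC.

```shell
# Made-up two-column TSV: gene name, sample name.
tsv=$(mktemp)
printf 'TP53\tTCGA-01\nKRAS\tS2\nTP53\tS3\nKRAS\tTCGA-04\nTP53\tS5\n' > "$tsv"

# Hint 1: cut one column, sort it so identical values are adjacent,
# count each value with uniq -c, then sort the counts from high to low.
count_column() {
    cut -f "$2" "$1" | sort | uniq -c | sort -nr
}

count_column "$tsv" 1            # TP53 appears 3 times, KRAS 2 times

# Hint 2: count the lines that mention TCGA anywhere.
grep TCGA "$tsv" | wc -l         # 2

rm -f "$tsv"
```

Note that grep TCGA matches the string in any column, which is one of the things to keep in mind when interpreting the count.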
Get simple somatic mutations from OICR's pancreatic cancer ICGC project
> cat ssm.oicrPanc.txt | wc -l
What is this?
> cat ssm.oicrPanc.txt | awk -F'\t' '{print $3}' | uniq -c
What is this?
> cat ssm.oicrPanc.txt | awk -F'\t' '{print $23}' | uniq -c
What is this?
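Two details of these commands are easy to get wrong: the tab separator has to be quoted for awk, and uniq -c only merges adjacent identical lines. The sketch below shows a variant that skips a header line and sorts before counting. The three-column sample file and the helper name column_counts are hypothetical, and the assumption that the first line is a header is mine.

```shell
# Made-up tab-delimited file with a header row; column 3 is the chromosome.
ssm=$(mktemp)
printf 'donor\tassembly\tchromosome\n' > "$ssm"
printf 'D1\tGRCh37\tX\nD2\tGRCh37\t1\nD3\tGRCh37\tX\n' >> "$ssm"

# Count the distinct values of one column, ignoring the header line.
# -F'\t' sets a tab separator; sort makes equal values adjacent for uniq -c.
column_counts() {
    awk -F'\t' -v col="$2" 'NR > 1 { print $col }' "$1" | sort | uniq -c
}

column_counts "$ssm" 3     # 1 occurrence of "1", 2 occurrences of "X"

# Without sort, uniq -c counts runs of adjacent lines instead:
awk -F'\t' 'NR > 1 { print $3 }' "$ssm" | uniq -c

rm -f "$ssm"
```

In this toy file the unsorted version reports X twice, once per run, which is why the sorted form is usually what you want when counting values per column.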