Bio Tics

Bioinformatics
ABE 2007 Kent Koster Group 3
Why bioinformatics?

Other techniques raise more questions than they answer. Bioinformatics is what answers the questions those techniques generate.
Outline
Bioinformatics Defined
Evolution of Bioinformatics
Bioinformatics History
Common Uses of Bioinformatics
Procedures and Tools of Bioinformatics
Our Procedure
Our Results
Resources

Bioinformatics Defined

Bioinformatics is a broad term covering the use of computer algorithms to analyze biological data.
It differs from computational biology: computational biology is the use of computer technology to answer a single, hypothesis-based question, while bioinformatics is the omnibus use of computerized statistical analysis to make statistical or comparative inferences, i.e. converting data to information.
The nebulous genesis of bioinformatics

1977: ΦX174 phage genome sequenced
1990: Paper published in the Journal of Molecular Biology describes a sequence alignment search algorithm
1990s: Software used to find fragment overlap for the Human Genome Project
1992: NCBI takes over the GenBank DNA sequence database in response to the growing number of gene patents
The nebulous genesis of bioinformatics

1994: Entrez Global Query Cross-Database Search System allows users to search the GenBank database
1995: Dr. Owen White writes software to help find gene elements (promoters, start and stop codons, etc.) in the sequenced Haemophilus influenzae genome
1996: NCBI-BLAST created to provide powerful heuristic searches against the GenBank database
Genomics to Proteomics through Bioinformatics

Because proteins are ultimately the tool of all* gene expression, proteomics is, in effect, the product science made possible by bioinformatics
A proteome is the collection of all proteins expressed in a cell at a given time
Every organism has 1 genome, but many proteomes
In addition to high-throughput protein analysis, proteomics is researched through cDNA analysis (RT-PCR)
Proteomics represents a methodical addition of large-scale biology to traditional molecular biology, made possible by bioinformatics
Common Uses of Bioinformatics

Homology and Comparative Modeling

Protein or gene homology is nucleotide or amino acid sequences or domains shared between different proteins, whether from the same or different organisms

Gene or Protein Identification

Searching databases for nucleotide or amino acid sequences that match sequences in unknown samples

So, how do ya do it?

DNA Sequencing
Sequence Formats
Sequence Homology Software Tools
Aligning Tools
Annotated Information
Protein Folding

DNA Sequencing

Sanger Method

New DNA chains being synthesized by DNA polymerase are terminated when dideoxynucleotides (ddNTPs, added to the reaction mixture in a ~1/100 ratio) are incorporated into the chain
DNA Sequencing
Fluorescent dyes are bound to the ddNTPs, allowing each molecule to be detected when it is excited by a laser
Terminated DNA chains are run on a gel, and fragments are resolved by size
By combining the fluorescence readings from each size of nucleotide chain, the DNA sequence is computed
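The read-out step can be sketched in a few lines, assuming each terminated chain has already been reduced to a (length, base) pair. That pairing, and the function name `read_sequence`, are illustrative assumptions. A real base caller works on raw fluorescence traces and has to deal with noise and overlapping peaks.

```python
def read_sequence(fragments):
    """Rebuild a sequence from terminated chains.

    fragments: iterable of (length, base) pairs, where length is the size of
    the terminated chain and base is the nucleotide reported by the dye on
    its final ddNTP. Assumes exactly one fragment per length (idealized).
    """
    # The gel resolves chains by size, so sorting by length gives the
    # bases in 5' to 3' order.
    ordered = sorted(fragments, key=lambda fragment: fragment[0])
    return "".join(base for _, base in ordered)


# Hypothetical readings, deliberately out of order.
readings = [(3, "G"), (1, "A"), (4, "C"), (2, "T")]
print(read_sequence(readings))
```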

Example Sequence Chromatograph
Sequence Analysis

First Things First: Sequence File Formats
Most common for nucleotides: FASTA / Multi-FASTA
> followed by any unicode text; the entire line is read as the sequence title
Carriage return followed by a continuous 5' to 3' nucleotide sequence, or a protein sequence using 1-letter codes
Example:
>E. coli Globin-coupled chemotaxis sensory transducer (TM domain)
ATGGACCTGATCACAAATGCGATTTAGAGACCTGATCACAAATG
CGATGACCTGATCACAAATGCGATGACCTGATCACAAATGCGA
TGTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGAT
CTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATT
AA
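As a rough illustration of the format, here is a minimal sketch of a FASTA / multi-FASTA reader. The function name `parse_fasta` and the sample records are made up for the example. It assumes every record starts with a `>` title line, and it joins the wrapped sequence lines that follow.

```python
def parse_fasta(text):
    """Parse FASTA or multi-FASTA text into a list of (title, sequence) pairs."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            # Sequence lines may be wrapped; join them into one string.
            chunks.append(line.replace(" ", "").upper())
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


# Hypothetical two-record multi-FASTA input.
example = """>sample_1 hypothetical title
ATGGACCTGATC
ACAAATGCGATT
>sample_2
ATGAAA
"""
for name, seq in parse_fasta(example):
    print(name, len(seq))
```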
Sequence Homology Software

NCBI-BLAST NCBI

Run by the National Center for Biotechnology Information
BLAST uses a heuristic algorithm based on the Smith-Waterman algorithm
The algorithm searches the database for a small string (the seed) within the query (default 11 for nucleotide searches); when it detects a match, it searches for shared nucleotides at each end of the seed to extend the match
Gaps are taken into account, then the matches are presented in order of statistical significance
http://www.ncbi.nlm.nih.gov/BLAST/
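The seed-and-extend idea can be shown with a toy sketch. This is not NCBI's implementation. It only extends through exact matches, ignores gaps and scoring matrices, and compares one query against one subject sequence instead of a database. The function name `seed_and_extend` and the test sequences are assumptions for illustration.

```python
def seed_and_extend(query, subject, word_size=11):
    """Return (length, query_start, subject_start) of the longest exact
    match grown from a shared seed word, or None if no seed is shared."""
    # Index every word of length word_size in the subject.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)

    best = None
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            # Extend to the left of the seed while nucleotides agree.
            left = 0
            while (i - left - 1 >= 0 and j - left - 1 >= 0
                   and query[i - left - 1] == subject[j - left - 1]):
                left += 1
            # Extend to the right of the seed while nucleotides agree.
            right = word_size
            while (i + right < len(query) and j + right < len(subject)
                   and query[i + right] == subject[j + right]):
                right += 1
            hit = (left + right, i - left, j - left)
            if best is None or hit[0] > best[0]:
                best = hit
    return best


query = "TTATGGACCTGATCACAAATT"
subject = "CCCCATGGACCTGATCACAAAGGGG"
print(seed_and_extend(query, subject))
```

The real algorithm keeps extending through mismatches until the running score drops too far, which is what lets it find diverged sequences.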
Different Types of BLAST

Nucleotide-nucleotide BLAST (BLASTN):

Basic nucleotide sequence searches
The BLAST that you used for your sequences

Protein-protein BLAST (BLASTP):

Similar technology used to search amino acid sequences

Position-Specific Iterative BLAST (PSI-BLAST):

A more advanced protein BLAST useful for analyzing relationships between divergently evolved proteins
Different Types of BLAST

BLASTX and BLASTN variants:

Use six-frame translation for proteins and nucleotides, respectively, in the search

MegaBLAST:

Used for BLASTing several sequences at once to cut down on processing load and server reporting time
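Six-frame translation, as used by the translated BLAST variants above, means reading a nucleotide sequence in three offsets on each strand. The sketch below uses the standard genetic code. The function names and the frame labels (`+1` to `-3`) are choices made for this example, not part of any BLAST interface.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate three forward frames and three reverse-complement frames."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```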

Interpreting BLAST Results

Max/Total Score

Calculated from the number of matches and gaps. Higher relative to your query length is better

E Value: $E = Kmn\,e^{-\lambda S}$

Translation: the E Value is the number of matches scoring at least this well that would be expected by random chance in a search of this size.
e.g. E = 1e-6 means that about one chance match this good would be expected in 1,000,000 such searches
Smaller E Values are better
Values larger than E = 1e-5 are too likely to be due to chance
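To make the formula concrete, the sketch below evaluates E = K·m·n·e^(−λS), where m is the query length, n is the database length, S is the alignment score, and K and λ are statistical parameters that depend on the scoring system. The numeric values of K and λ used here are placeholders chosen for illustration, not values taken from any particular BLAST run.

```python
import math


def expected_hits(score, query_len, db_len, k, lam):
    """E = K * m * n * exp(-lambda * S): expected number of chance hits
    scoring at least `score` for a query of length m against n database
    letters."""
    return k * query_len * db_len * math.exp(-lam * score)


# Placeholder parameters (assumed for illustration only).
K, LAM = 0.5, 1.0
for s in (20, 30, 40):
    e = expected_hits(s, query_len=500, db_len=1_000_000, k=K, lam=LAM)
    print(f"S={s}  E={e:.3g}")
```

The E value falls exponentially as the score rises and grows linearly with database size, so the same alignment looks less significant against a larger database.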
Interpreting BLAST Results

Query Coverage

The percent of the query sequence matched by the database entry

Max Ident

The percent identity, i.e. the percent that the genes match up within the limits of the full match (e.g. deletions or additions reduce this value)
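Both numbers can be worked out by hand from a gapped alignment. The sketch below assumes the alignment is given as two equal-length strings with `-` for gaps and computes identity over all alignment columns. The function name `alignment_stats` and the sample strings are illustrative, and NCBI's report may round or combine multiple matching regions differently.

```python
def alignment_stats(aligned_query, aligned_subject, query_length):
    """Return (percent_identity, percent_query_coverage) for one alignment."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned strings must have the same length")
    columns = len(aligned_query)
    # A column counts as a match only if both letters agree and are not gaps.
    matches = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    # Query letters that take part in the alignment (gaps excluded).
    query_letters = sum(1 for q in aligned_query if q != "-")
    identity = 100.0 * matches / columns
    coverage = 100.0 * query_letters / query_length
    return identity, coverage


# Hypothetical alignment: one insertion in the subject and one mismatch,
# covering 8 letters of a 10-letter query.
identity, coverage = alignment_stats("ACGT-ACGT", "ACGTTACCT", query_length=10)
print(round(identity, 1), coverage)
```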

Sequence Aligning Software

Clustal (free)

ClustalX: software
ClustalW: web, http://www.ebi.ac.uk/clustalw/

DNAStar ($$$)

Functionality is similar, but the difference is in interface, tools, and speed of algorithms

SMART
Simple Modular Architecture Research Tool
Run by EMBL (European Molecular Biology Laboratory)
While BLAST compares nucleotide sequences and then informs you of any domains that may have been annotated to them, SMART compares by domains

PFAM

Protein domain database
Manually curated, trading volume for quality
Uses hidden Markov models for domain pattern recognition
Run by Sanger Institute in the UK
Heuristic server-load analysis predicts when key protein analysis report is due and crashes server
http://www.sanger.ac.uk/Software/Pfam/
Interpro
Database of protein domains and functional sites
Best source of annotation
Other tools sometimes draw annotation from Interpro
Run by the European Bioinformatics Institute
http://www.ebi.ac.uk/interpro/

Protein Folding

Lowest energy state folding

Ab initio: tremendously resource heavy, can only be done for tiny proteins
Distributed computing is used for mid-sized proteins

Folding@Home
Human Proteome Folding Project
Rosetta@Home
Predictor@Home
Protein Folding

Software-assisted manual folding

Use knowledge of biochemistry to fold protein into predicted structure, then software to find lowest energy state
Commercial Programs:
Protein Shop
Profold

Manual Motif Verification

Ramachandran Plot: plots the φ against the ψ backbone angles (on the N-terminal and C-terminal sides of each subunit)
Our Procedure

Colonies were selected from nutrient plates

Each group selected two colonies to sequence
Colonies which survived ampicillin treatment were possibly transformed by the vector, which contained an ampicillin resistance gene
Presence of PDI insert was expected to disrupt ccdB (lethal protein) and LacZ gene expression in vector plasmid
LacZ expression resulted in some blue colonies, as the colonies were able to cleave X-Gal substrate into blue product
Initial Questions Guiding Colony Selection

How did some blue colonies survive?
Did all blue colonies come from the PCR product?
Did the white colonies contain the PDI inserts?
Were some colonies able to survive without the ampicillin resistance plasmid?
What was the actual sequence of the commercial positive control insert?
Some samples were transformed with inserts collected from PCR instead of gel electrophoresis. Could non-PDI sequences have ligated to the vector and been inserted into bacteria?
Procedure
Samples were prepared with T3 and T7 (forward and reverse) primers in solution for sequencing
Samples were sent to UH Manoa lab for sequencing
Chromatogram results were viewed with Finch TV to determine quality

Procedure

Sequences were trimmed at the 5' and 3' ends, then Finch TV was used to try to locate the vector's restriction enzyme sites
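Locating restriction sites in a trimmed read is a plain substring search. The slides do not say which enzymes cut the vector, so the sketch below uses the EcoRI recognition site (GAATTC) purely as a stand-in. The function names and the sample read are also hypothetical.

```python
def find_sites(sequence, site):
    """Return the 0-based start position of every occurrence of site."""
    sequence, site = sequence.upper(), site.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def insert_between(sequence, left_site, right_site):
    """Return the bases lying between the first left_site and the next
    right_site, or None if either site is missing."""
    sequence = sequence.upper()
    left = sequence.find(left_site.upper())
    if left == -1:
        return None
    begin = left + len(left_site)
    right = sequence.find(right_site.upper(), begin)
    if right == -1:
        return None
    return sequence[begin:right]


# Hypothetical read with the stand-in site on both sides of a short insert.
read = "NNACGAATTCAAACCCGGGTTTGAATTCTTNN"
print(find_sites(read, "GAATTC"))
print(insert_between(read, "GAATTC", "GAATTC"))
```

Measuring the length of the returned insert gives the insert size between the two sites.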
Procedure

Sequences were exported in FASTA format
Procedure was repeated for the other strands
Pair-wise alignment was performed for both strands of each sample with EBI's tools
Consensus sequence from pair-wise alignment was searched for in BLAST
Gene information was located from BLAST annotation and TAIR website
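For a sense of what a pair-wise alignment tool does, here is a minimal global alignment sketch in the Needleman-Wunsch style. The scoring values (match +1, mismatch -1, gap -2) and the function name `global_align` are assumptions for illustration, and the EBI tools use more elaborate scoring. When aligning two reads of the same sample, the reverse-primer read would first need to be reverse-complemented so both strings run in the same direction.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for placing a[i-1] opposite b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up, and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


top, bottom, total = global_align("ACGT", "AGT")
print(top)
print(bottom)
print(total)
```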
Results

General Remarks

Because colonies were selected prior to the identity of the positive control insert being questioned, no control colonies were sequenced
All sequenced white colonies definitively had PDI gene insert, save for one interesting exception
Some blue colonies showed multiple nucleotide chromatogram readings, suggesting either sample contamination or separately transformed E. coli growing as one colony
Group 3 Results
Sequenced 1 blue and 1 white colony from same plate
Colonies were transformed with PCR product, not gel-recovered DNA
White colonies had PDI insert
Blue colonies had a 154 bp partial insert, disrupting the ccdB gene but remaining in-frame and allowing a partially functional LacZ alpha gene to be expressed

Group 3 White Colony

T7 strand definitively showed the presence of a PDI insert
Group 3 White Colony

T3 and T7 strand consensus sequence also showed PDI gene presence
Group 3 Blue Colony

Blue colony T3 showed multiple signals
Group 3 Blue Colony

However, T7 strand was salvageable
A 154 nucleotide sequence was found between the restriction sites

Group 1 Results
White colony from PCR product showed PDI gene in both T3 and T7 strands
White colony from gel purification:

T7 strand sequenced as multiple signals
T3 strand sequenced excellently

Group 1 Gel White Colony

T3 sequence showed only nucleotides 1540-2320 of the vector
Group 2 Results

White Colony from gel purification

White colonies sequenced with PDI gene

Blue w/ White Ring Colony from PCR

Both T3 and T7 strand sequencing showed consistent multiple signals

Group 4 Results
1 white colony from PCR and 1 white colony from gel purification were sequenced
Both showed PDI gene

Final Remarks

All white colonies had the PDI gene, except one with a modified vector
All blue colonies were transformed with the direct PCR product (not gel purified)
Group 3 showed that a small (154 bp) insert that stays in-frame with the LacZ gene can knock out ccdB while still allowing the expression of an at least partially functioning LacZ gene
Some blue colonies with white rings could be 2 separate lines living together

Bacteria transformed with ampicillin resistance gene could deplete area of ampicillin, allowing bacteria without the gene to crowd the white bacteria out of the area of depleted ampicillin
How could bacteria without the insert survive both ccdB expression and ampicillin selection in broth?

ccdB gene could be lost due to mutation
Bacteria could have cut the plasmid, deleting ccdB but retaining the ampicillin resistance gene and possibly LacZ
No group sequenced the positive control insert: sequence still a mystery!
Resources

http://www.bioinformatics.org
http://syntheticbiology.org/Tools.html
NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
SMART: http://smart.embl-heidelberg.de/
PFAM: http://www.sanger.ac.uk/Software/Pfam/
Interpro: http://www.ebi.ac.uk/interpro/
Canadian Bioinformatics Helpdesk Newsletter (Ramachandran Plot): http://gchelpdesk.ualberta.ca/news/22sep05/cbhd_news_22sep05.php
Finch TV: http://www.geospiza.com/finchtv/
EBI Pair-wise alignment: http://www.ebi.ac.uk/emboss/align/index.html
TAIR: http://www.arabidopsis.org
