
Bioinformatics

ABE 2007 Kent Koster Group 3

Why bioinformatics?


Other techniques raise more questions than they answer. Bioinformatics is what answers the questions those techniques generate.

Outline
Bioinformatics Defined
Evolution of Bioinformatics
Bioinformatics History
Common Uses of Bioinformatics
Procedures and Tools of Bioinformatics
Our Procedure
Our Results
Resources


Bioinformatics Defined


Bioinformatics is a broad term covering the use of computer algorithms to analyze biological data.
It differs from computational biology: computational biology is the use of computer technology to answer a single, hypothesis-based question, while bioinformatics is the omnibus use of computerized statistical analysis to make statistical or comparative inferences, i.e. converting data to information.

The nebulous genesis of bioinformatics


 

1977: ΦX174 phage genome sequenced
1990: Paper published in the Journal of Molecular Biology describes a sequence alignment search algorithm
1990s: Software used to find fragment overlap for the Human Genome Project
1992: NCBI takes over the GenBank DNA sequence database in response to the growing number of gene patents

The nebulous genesis of bioinformatics




1994: Entrez Global Query Cross-Database Search System allows users to search the GenBank database
1995: Dr. Owen White writes software to help find gene elements (promoters, start and stop codons, etc.) in the sequenced Haemophilus influenzae genome
1996: NCBI-BLAST created to provide powerful heuristic searches against the GenBank database

Genomics to Proteomics through Bioinformatics




  

Because proteins are ultimately the tool of all* gene expression, proteomics is, in effect, the product science made possible by bioinformatics
A proteome is the collection of all proteins expressed in a cell at a given time
Every organism has 1 genome, but many proteomes
In addition to high-throughput protein analysis, proteomics is researched through cDNA analysis (RT-PCR)
Proteomics represents a methodical addition of large-scale biology to traditional molecular biology, made possible by bioinformatics

Common Uses of Bioinformatics




Homology and Comparative Modeling
  Protein or gene homology is nucleotide or amino acid sequences or domains shared between different proteins, whether from the same or different organisms

Gene or Protein Identification
  Searching databases for nucleotide or amino acid sequences that match sequences in unknown samples




So, how do ya do it?


DNA Sequencing
Sequence Formats
Sequence Homology Software Tools
Aligning Tools
Annotated Information
Protein Folding


DNA Sequencing


Sanger Method


New nucleotide chains of DNA being replicated by DNA polymerase are terminated when dideoxynucleotides (ddNTPs, added to the reaction mixture in a ~1/100 ratio) are incorporated into the chain

DNA Sequencing
Fluorescent dyes are bound to the ddNTPs, allowing the molecule to be detected when it is excited by a laser
Terminated DNA chains are run on a gel, and fragments are resolved by size
By combining the fluorescence readings from each size of nucleotide chain, the DNA sequence is computed


Example Sequence Chromatograph

Sequence Analysis
     

First Things First
Sequence File Formats:
Most common for nucleotides: FASTA / Multi-FASTA
  ">" followed by any unicode text; the entire line is read as the sequence title
  Carriage return followed by a continuous 5'-3' nucleotide sequence or protein sequence using 1-letter codes
Example:
>E. coli Globin-coupled chemotaxis sensory transducer (TM domain)
ATGGACCTGATCACAAATGCGATTTAGAGACCTGATCACAAATG
CGATGACCTGATCACAAATGCGATGACCTGATCACAAATGCGA
TGTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGAT
CTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATT
AA
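The FASTA layout described above (a ">" title line followed by sequence lines) is simple enough to read by hand. The following is a minimal sketch, not the parser used by any particular tool; the function name `parse_fasta` and the sample titles are made up for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {title: sequence} for FASTA or Multi-FASTA text lines."""
    records: Dict[str, str] = {}
    title = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            title = line[1:].strip()  # everything after ">" is the title
            records[title] = ""
        elif title is not None:
            # sequence lines may be wrapped; join them into one string
            records[title] += line.upper()
    return records


# Hypothetical two-record Multi-FASTA input
example = """>sample_1 hypothetical title
ATGGACCTGATC
ACAAATGCG
>sample_2 another hypothetical title
TTAGGC
"""
records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq), seq)
```

A dictionary keyed by title is the simplest choice here; it assumes titles are unique within a file, which a real Multi-FASTA file does not guarantee.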

Sequence Homology Software




NCBI-BLAST
  Run by the National Center for Biotechnology Information
  BLAST uses a heuristic algorithm based on the Smith-Waterman algorithm
  The algorithm searches the database for a short word from the query (default word size 11 for nucleotide searches); when it detects a match (the seed), it searches for shared nucleotides at each end of the seed to extend the match
  Gaps are taken into account, then the matches are presented in order of statistical significance
  http://www.ncbi.nlm.nih.gov/BLAST/
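The seed-and-extend idea can be sketched in a few lines. This is a deliberately simplified toy, not NCBI's implementation: it only extends exact matches, ignores gaps and scoring matrices, and ranks hits by length instead of statistical significance. The function name `seed_and_extend` is an assumption for illustration.

```python
def seed_and_extend(query: str, subject: str, word_size: int = 11):
    """Toy ungapped seed-and-extend; returns (q_start, s_start, length) hits."""
    # Index every word of length word_size in the subject sequence.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)

    hits = set()
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            # Extend left while the flanking characters still match.
            qs, ss = i, j
            while qs > 0 and ss > 0 and query[qs - 1] == subject[ss - 1]:
                qs -= 1
                ss -= 1
            # Extend right the same way.
            qe, se = i + word_size, j + word_size
            while qe < len(query) and se < len(subject) and query[qe] == subject[se]:
                qe += 1
                se += 1
            hits.add((qs, ss, qe - qs))
    # Longest matches first, as a crude stand-in for ranking by significance.
    return sorted(hits, key=lambda h: -h[2])


core = "ATGACCTGATCACAAATGCG"          # shared region (made-up example)
query = "GGG" + core + "TTT"
subject = "CCCC" + core + "AAAA"
print(seed_and_extend(query, subject))
```

Indexing the subject by fixed-length words is what makes the search fast: only positions that already share a full word are ever extended.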

Different Types of BLAST




Nucleotide-nucleotide BLAST (BLASTN):
  Basic nucleotide sequence searches
  The BLAST that you used for your sequences

Protein-protein BLAST (BLASTP):
  Similar technology used to search amino acid sequences

Position-Specific Iterative BLAST (PSI-BLAST):
  A more advanced protein BLAST, useful for analyzing relationships between divergently evolved proteins

Different Types of BLAST




BLASTX and BLASTN variants:
  Use six-frame translation for proteins and nucleotides, respectively, in the search

MegaBLAST:
  Used for BLASTing several sequences at once to cut down on processing load and server reporting time
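Six-frame translation means translating a nucleotide sequence in all three reading frames on both strands. A minimal sketch follows, assuming the standard genetic code; the function names are illustrative and this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X', stops are '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> dict:
    """Translate frames +1..+3 (given strand) and -1..-3 (reverse complement)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```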


Interpreting BLAST Results




Max/Total Score
  Calculated from the number of matches and gaps. Higher relative to your query length is better

E Value: $E = Kmn\,e^{-\lambda S}$ (m = query length, n = database length, S = score, K and λ = scaling parameters)
  Translation: the E Value is the number of matches scoring at least this well that would be expected by random chance when searching a database of this size
  e.g. E = 1e-6 means that roughly one chance match this good would be expected per 1,000,000 comparable searches
  Smaller E Values are better
  Values larger than E = 1e-5 are too likely to be due to chance
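To get a feel for how quickly the E Value falls as the score rises, the formula E = Kmn·e^(−λS) can be evaluated directly. This is only a sketch: the default values of `k` and `lam` below are arbitrary placeholders, not the parameters BLAST uses for any real scoring system, and the lengths are made up.

```python
import math


def expect_value(score: float, query_len: int, db_len: int,
                 k: float = 0.1, lam: float = 0.3) -> float:
    """E = K * m * n * exp(-lambda * S); k and lam are placeholder values."""
    return k * query_len * db_len * math.exp(-lam * score)


# Hypothetical 500-nucleotide query against a 1,000,000-nucleotide database
for s in (20, 40, 60, 80):
    print(f"S = {s:3d}  E = {expect_value(s, 500, 1_000_000):.3g}")
```

Because the score sits in the exponent, each fixed increase in S divides E by a constant factor, which is why strong hits have E Values many orders of magnitude below weak ones.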

Interpreting BLAST Results




Query Coverage
  The percent of the query sequence matched by the database entry

Max Ident
  The percent identity, i.e. the percent that the genes match up within the limits of the full match (e.g. deletions or additions reduce this value)
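Both values can be illustrated on a single gapped alignment. The sketch below assumes the two aligned strings are already available, with "-" marking gaps; BLAST's own bookkeeping may differ in detail, and the function names are illustrative.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percent of alignment columns where both sequences carry the same residue."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for q, s in zip(aligned_query, aligned_subject)
                  if q == s and q != "-")
    return 100.0 * matches / len(aligned_query)


def query_coverage(aligned_query: str, query_length: int) -> float:
    """Percent of the full query that takes part in the alignment."""
    aligned_residues = sum(1 for q in aligned_query if q != "-")
    return 100.0 * aligned_residues / query_length


# Made-up alignment: 6 columns, one mismatch, one gap in the query
aligned_q = "ACGT-A"
aligned_s = "ACCTGA"
print(percent_identity(aligned_q, aligned_s))   # identity over the aligned region
print(query_coverage(aligned_q, 10))            # query is assumed to be 10 nt long
```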


Sequence Aligning Software




Clustal (free)
  ClustalX: software
  ClustalW: web
DNAStar ($$$)
Functionality is similar, but the difference is in interface, tools, and speed of algorithms
http://www.ebi.ac.uk/clustalw/


SMART
Simple Modular Architecture Research Tool
Run by EMBL (European Molecular Biology Laboratory)
While BLAST compares nucleotide sequences and then informs you of any domains that may have been annotated to them, SMART compares by domains


PFAM
  

 

Protein domain database
Manually curated, trading volume for quality
Uses hidden Markov models for domain pattern recognition
Run by the Sanger Institute in the UK
Heuristic server-load analysis predicts when key protein analysis report is due and crashes server
http://www.sanger.ac.uk/Software/Pfam/

Interpro
Database of protein domains and functional sites
Best source of annotation
Other tools sometimes draw annotation from Interpro
Run by the European Bioinformatics Institute
http://www.ebi.ac.uk/interpro/


Protein Folding


Lowest energy state folding
  Ab initio: tremendously resource heavy, can only be done for tiny proteins
  Distributed computing is used for mid-sized proteins
    Folding@Home
    Human Proteome Folding Project
    Rosetta@Home
    Predictor@Home

Protein Folding


Software-assisted manual folding
  Use knowledge of biochemistry to fold the protein into a predicted structure, then software to find the lowest energy state
  Commercial Programs:
    Protein Shop
    Profold


Manual Motif Verification




Ramachandran Plot: plot of the φ versus ψ backbone angles (on the N and C sides of each residue's alpha carbon)

Our Procedure


Colonies were selected from nutrient plates


 

Each group selected two colonies to sequence
Colonies which survived ampicillin treatment were possibly transformed by the vector, which contained an ampicillin resistance gene
Presence of the PDI insert was expected to disrupt ccdB (lethal protein) and LacZ gene expression in the vector plasmid
LacZ expression resulted in some blue colonies, as the colonies were able to cleave X-Gal substrate into blue product

Initial Questions Guiding Colony Selection


     

How did some blue colonies survive?
Did all blue colonies come from the PCR product?
Did the white colonies contain the PDI inserts?
Were some colonies able to survive without the ampicillin resistance plasmid?
What was the actual sequence of the commercial positive control insert?
Some samples were transformed with inserts collected from PCR instead of gel electrophoresis. Could non-PDI sequences have ligated to the vector and been inserted into bacteria?

Procedure
Samples were prepared with T3 and T7 (forward and reverse) primers in solution for sequencing
Samples were sent to the UH Manoa lab for sequencing
Chromatogram results were viewed with Finch TV to determine quality


Procedure


Sequences were trimmed at the 5' and 3' ends, then Finch TV was used to try to locate the restriction enzyme sites on the vector

Procedure
  

Sequences were exported in FASTA format
Procedure was repeated for the other strands
Pair-wise alignment was performed for both strands of each sample with EBI's tools
Consensus sequence from the pair-wise alignment was searched for in BLAST
Gene information was located from BLAST annotation and the TAIR website
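The pair-wise alignment and consensus steps can be illustrated with a textbook Needleman-Wunsch global alignment. This is a swapped-in sketch for illustration only: the source does not say which algorithm or scoring the EBI tool used, the scoring values below are arbitrary, and the example reads are made up. It assumes the second read comes from the opposite strand and must be reverse-complemented first.

```python
def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Needleman-Wunsch global alignment; returns (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        step = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[len(a)][len(b)]


def consensus(aligned_a: str, aligned_b: str) -> str:
    """Keep agreeing bases, fill a gap from the other read, mark conflicts as N."""
    out = []
    for x, y in zip(aligned_a, aligned_b):
        if x == y:
            out.append(x)
        elif x == "-":
            out.append(y)
        elif y == "-":
            out.append(x)
        else:
            out.append("N")
    return "".join(out)


forward_read = "ATGGACCTGATC"                 # made-up forward read
reverse_read = reverse_complement("ATGGACTGATC")  # made-up read, one base missing
aln_a, aln_b, aln_score = global_align(forward_read, reverse_complement(reverse_read))
print(aln_a)
print(aln_b)
print(consensus(aln_a, aln_b))
```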

Results


General Remarks


Because colonies were selected prior to the identity of the positive control insert being questioned, no control colonies were sequenced
All sequenced white colonies definitively had the PDI gene insert, save for one interesting exception
Some blue colonies showed multiple nucleotide chromatogram readings, suggesting either sample contamination or separately transformed E. coli growing as one colony

Group 3 Results
Sequenced 1 blue and 1 white colony from the same plate
Colonies were transformed with PCR product, not gel-recovered DNA
White colonies had the PDI insert
Blue colonies had a 154 bp partial insert, disrupting the ccdB gene but remaining in-frame and allowing a partially functional LacZ alpha gene to be expressed


Group 3 White Colony




T7 strand definitively showed the presence of a PDI insert

Group 3 White Colony




T3 and T7 strand consensus sequence also showed PDI gene presence

Group 3 Blue Colony




Blue colony T3 showed multiple signals

Group 3 Blue Colony


However, T7 strand was salvageable
A 154 nucleotide sequence was found between the restriction sites


Group 1 Results
White colony from PCR product showed PDI gene in both T3 and T7 strands
White colony from gel purification:
  T7 strand sequenced as multiple signals
  T3 strand sequenced excellently




Group 1 Gel White Colony




T3 sequence showed only nucleotides 1540-2320 of the vector

Group 2 Results


White Colony from gel purification
  White colonies sequenced with PDI gene

Blue w/ White Ring Colony from PCR
  Both T3 and T7 strand sequencing showed consistent multiple signals




Group 4 Results
1 white colony from PCR and 1 white colony from gel purification were sequenced
Both showed PDI gene


Final Remarks
  

All white colonies had the PDI gene, except one with a modified vector
All blue colonies were transformed with the direct PCR product (not gel purified)
Group 3 showed that a small (154 bp) insert that stays in-frame with the LacZ gene can knock out ccdB while still allowing the expression of an at least partially functioning LacZ gene
Some blue colonies with white rings could be 2 separate lines living together
  Bacteria transformed with the ampicillin resistance gene could deplete an area of ampicillin, allowing bacteria without the gene to crowd the white bacteria out of the area of depleted ampicillin
How could bacteria without the insert survive both ccdB expression and ampicillin selection in broth?
  ccdB gene could be lost due to mutation
  Bacteria could have cut the plasmid, deleting ccdB but possibly retaining the LacZ and ampicillin resistance genes
No group sequenced the positive control insert; its sequence is still a mystery!

Resources
      

  

http://www.bioinformatics.org
http://syntheticbiology.org/Tools.html
NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
SMART: http://smart.embl-heidelberg.de/
PFAM: http://www.sanger.ac.uk/Software/Pfam/
Interpro: http://www.ebi.ac.uk/interpro/
Canadian Bioinformatics Helpdesk Newsletter (Ramachandran Plot): http://gchelpdesk.ualberta.ca/news/22sep05/cbhd_news_22sep05.php
Finch TV: http://www.geospiza.com/finchtv/
EBI Pair-wise alignment: http://www.ebi.ac.uk/emboss/align/index.html
TAIR: http://www.arabidopsis.org
