Jenny Wu
UCI Genomics High Throughput Facility
Outline
Goals : Practical guide to NGS data processing
Bioinformatics in NGS data analysis
Basics: terminology, data file formats, general
workflow
Data Analysis Pipeline
Sequence QC and preprocessing
Obtaining and preparing reference
Sequence mapping
Downstream analysis workflow and software
Example: RNA-Seq analysis with Tuxedo
protocol
Summary and future plan
Why Next Generation Sequencing
One can sequence hundreds of
millions of short sequences (35bp-
120bp) in a single run in a short
period of time with low per base cost.
[Figure: Informatics. Source: wall.hms.harvard.edu]
Bioinformatics Challenges
in NGS Data Analysis
VERY large text files (tens of millions of lines
long)
Can't do business as usual with familiar tools
Impossible memory usage and execution time
Manage, analyze, store, transfer and archive huge
files
Need for powerful computers and expertise
Informatics groups must manage compute
clusters
New algorithms and software are required, and they
are often open source and Unix/Linux based.
Collaboration of IT, bioinformaticians and
Basic NGS Workflow
NGS Data Analysis Overview
Olson et al.
Terminology
Coverage (depth): The number of nucleotides from
reads that are mapped to a given position.
Quality Score: Each called base comes with a quality
score which measures the probability of base call error.
Mapping: Align reads to a reference to identify their origin.
Assembly: Merging of fragments of DNA in order to
reconstruct the original sequence.
Duplicate reads: Reads that are identical.
Multi-reads: Reads that can be mapped to multiple
locations equally well.
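The quality score and coverage definitions above can be made concrete with a small sketch. It assumes the standard Phred scale (Q = -10 * log10(P)) and the Phred+33 ASCII encoding used by recent Illumina FASTQ files; the function names and the toy alignments are hypothetical and only serve as illustration.

```python
from collections import Counter


def phred_to_error_prob(q):
    """Phred scale: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_quality(qual, offset=33):
    """Turn an ASCII quality string into Phred scores (assumes Phred+33)."""
    return [ord(c) - offset for c in qual]


def depth_per_position(alignments):
    """Count how many reads cover each reference position.

    alignments: iterable of (start, length) pairs, 0-based (toy input,
    not a real SAM/BAM parser).
    """
    depth = Counter()
    for start, length in alignments:
        for pos in range(start, start + length):
            depth[pos] += 1
    return depth


scores = decode_quality("BCCFF")
print(scores)                      # [33, 34, 34, 37, 37]
print(phred_to_error_prob(30))     # about 0.001, i.e. 1 error in 1000 calls
print(depth_per_position([(0, 5), (3, 5)])[3])  # 2
```

A score of 30 therefore corresponds to roughly one miscalled base per thousand, and position 3 in the toy example has depth 2 because both reads overlap it.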
What does the data look like?
Common NGS Data Formats
FASTA Format (Reference
Seq)
FASTQ Format (reads)
FASTQ Format (Illumina Example)
Record header fields: flow cell ID, lane, tile, tile coordinates, read, barcode.
Each record has four lines: record header, read bases, separator (with optional repeated header), read quality scores.
@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
@DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
+
@@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
@DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
+
CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
@DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
+
CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ
NOTE: for paired-end runs, there is a second file
with one-to-one corresponding headers and reads.
(Passarelli, 2012)
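The four-line record layout shown above can be read with a few lines of Python. This is a minimal sketch, not a production parser: it assumes unwrapped records (exactly four lines each) and the Illumina header layout instrument:run:flowcell:lane:tile:x:y followed by read:filtered:control:barcode. The names `FastqRecord`, `read_fastq` and `parse_illumina_header` are hypothetical, and the example bases and qualities are shortened toy values.

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str
    bases: str
    quality: str


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ file object, four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        bases = handle.readline().rstrip("\n")
        handle.readline()  # separator line: '+', optionally repeating the header
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(bases) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield FastqRecord(header[1:], bases, quality)


def parse_illumina_header(header):
    """Split an Illumina-style header into named fields (assumed layout)."""
    left, right = header.split(" ")
    instrument, run, flowcell, lane, tile, x, y = left.split(":")
    read, is_filtered, control, barcode = right.split(":")
    return {
        "instrument": instrument, "run": int(run), "flowcell": flowcell,
        "lane": int(lane), "tile": int(tile), "x": int(x), "y": int(y),
        "read": int(read), "is_filtered": is_filtered,
        "control": control, "barcode": barcode,
    }


# Toy input: real header from the example, bases and qualities truncated.
text = (
    "@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA\n"
    "CAGGA\n+\nBCCFF\n"
)
for rec in read_fastq(io.StringIO(text)):
    fields = parse_illumina_header(rec.header)
    print(fields["flowcell"], fields["lane"], fields["tile"], fields["barcode"])
```

For paired-end runs the same reader could be applied to both files in lockstep, since the second file has one-to-one corresponding headers.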
Data Analysis Pipeline
Each stage is listed as data or task, with file format or software in parentheses.
Raw reads (FASTQ)
Read QC and preprocessing (FASTQC, FASTX-toolkit, PRINSEQ)
Analysis-ready reads (FASTQ)
Collecting reference sequences (FASTA) and annotation (GTF/GFF)
Read mapping (Bowtie, BWA, MAQ)
Mapped reads (SAM/BAM)
Local realignment, base quality recalibration
Visualization (IGV, UCSC GB)
Whole Genome Sequencing: variant calling, annotation
RNA-Seq: transcript assembly, quantification
ChIP-Seq: peak calling
Methyl-Seq: methylation calling
Why QC?
Sequencing runs cost money
Consequences of not assessing the data
Sequencing a poor library on multiple
runs is throwing money away!
General information
Olson et al.
Short Read Alignment
Software
Short Reads Mapping
Software
How to choose an aligner?
There are many aligners and they
vary a lot in performance (accuracy,
memory usage, speed, etc).
Factors to consider : application,
platform, read length, downstream
analysis, etc.
Constant trade-off between speed and
sensitivity (e.g. MAQ vs. Bowtie)
Guaranteed high accuracy will take
longer.
NGS Applications and Analysis Strategy
Name | Nucleic acid population | Brief analysis strategy
RNA-Seq | RNA (may be poly-A mRNA or total RNA) | Alignment of reads to genes; variations for detecting splice junctions and quantifying abundance
Small RNA sequencing | Small RNA (often miRNA) | Alignment of reads to small RNA references (e.g. miRbase), then to the genome; quantify abundance
ChIP-Seq | DNA bound to protein, captured via antibody (ChIP = Chromatin ImmunoPrecipitation) | Align reads to reference genome, identify peaks & motifs
RIP-Seq | RNA bound to protein, captured via antibody (RIP = RNA ImmunoPrecipitation) | Align reads to reference genome and/or genes, identify peaks and motifs
Methylation Analysis | Select methylated genomic DNA regions, or convert methylated nucleotides to alternate forms | Align reads to reference and either identify peaks or regions of methylation
SNP calling/discovery | All or some genomic DNA or RNA | Either align reads to reference and identify statistically significant SNPs, or compare multiple samples to each other to identify SNPs
Structural Variation Analysis | Genomic DNA, with two reads (mate-pair reads) per DNA template | Align mate-pairs to reference sequence and interpret structural variants
de novo Sequencing | Genomic DNA (possibly with external data e.g. cDNA, genomes of closely related species, etc.) | Piece-together reads to assemble contigs, scaffolds, and (ideally) whole-genome sequence
Metagenomics | Entire RNA or DNA from a (usually microbial) community | Phylogenetic analysis of sequences
Whole Genome Sequencing, Exome Sequencing
Analysis: Variant Calling: SNPs, InDels
Software: ssahaSNP, Samtools, PyroBayes
RNA-Seq: Transcriptome analysis
Analysis: 1. Transcriptome assembly 2. Abundance quantification 3. Differential expression and regulation
Software: Tophat, STAR, Cufflinks, edgeR
ChIP-Seq: Protein DNA binding site
Analysis: Peak Identification
Software: MACS, AREM, PeakSeq
Methyl-Seq: Methylation pattern analysis
Analysis: Methylation calling
Software: Bismark, BS Seeker
RNA-seq (Tuxedo Protocol)
1. Read mapping (SAM/BAM)
2. Transcript assembly and quantification (GTF/GFF)
3. Merge assembled transcripts from multiple samples
4. Differential Expression analysis
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
1. Spliced Alignment: Tophat
Tophat: a spliced short read aligner for RNA-seq.
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
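The four Tophat invocations above differ only in the sample name, so they can be generated rather than typed by hand. The sketch below is a hypothetical helper (`tophat_command` is not part of Tophat) that rebuilds exactly the argument lists shown on the slide; it prints the commands and does not run them.

```python
def tophat_command(sample, threads=8, gtf="genes.gtf", index="genome"):
    """Build the argument list for one paired-end sample (no execution)."""
    return [
        "tophat",
        "-p", str(threads),        # number of threads
        "-G", gtf,                 # known gene annotation
        "-o", f"{sample}_thout",   # output directory
        index,                     # Bowtie index basename
        f"{sample}_1.fq",          # first reads of each pair
        f"{sample}_2.fq",          # second reads of each pair
    ]


# Two conditions (C1, C2) with two replicates each (R1, R2), as on the slide.
samples = [f"C{c}_R{r}" for c in (1, 2) for r in (1, 2)]
for sample in samples:
    print(" ".join(tophat_command(sample)))
```

Each list could be passed to `subprocess.run(cmd, check=True)` on a machine where Tophat is installed; keeping the arguments as a list avoids shell quoting problems.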
2. Transcript assembly and abundance quantification: Cufflinks
Cufflinks: a program that assembles aligned
RNA-Seq reads into transcripts, estimates their
abundances, and tests for differential
expression and regulation transcriptome-wide.
$ more assemblies.txt
./C1_R1_clout/transcripts.gtf
./C1_R2_clout/transcripts.gtf
./C2_R1_clout/transcripts.gtf
./C2_R2_clout/transcripts.gtf
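The list file shown above names one transcripts.gtf per sample and is the input for merging the assemblies. A minimal sketch of generating it follows. The `cuffmerge_command` helper mirrors the invocation given in the published Tuxedo protocol (`-g` reference annotation, `-s` genome FASTA, `-p` threads); the file name `genome.fa` and all function names are assumptions for illustration, and nothing is executed.

```python
import tempfile
from pathlib import Path


def assembly_paths(samples):
    """Per-sample Cufflinks output paths, following the *_clout naming above."""
    return [f"./{s}_clout/transcripts.gtf" for s in samples]


def write_manifest(samples, path):
    """Write one assembly path per line."""
    Path(path).write_text("\n".join(assembly_paths(samples)) + "\n")


def cuffmerge_command(manifest, gtf="genes.gtf", genome_fasta="genome.fa", threads=8):
    """Argument list for merging assemblies (genome.fa is an assumed name)."""
    return ["cuffmerge", "-g", gtf, "-s", genome_fasta, "-p", str(threads), str(manifest)]


samples = ["C1_R1", "C1_R2", "C2_R1", "C2_R2"]
with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "assemblies.txt"
    write_manifest(samples, manifest)
    manifest_text = manifest.read_text()

print(manifest_text, end="")
print(" ".join(cuffmerge_command("assemblies.txt")))
```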
4. Differential Expression: Cuffdiff
Cuffdiff: a program that compares
transcript abundance between
samples.
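Cuffdiff takes the merged annotation followed by one comma-separated group of alignment files per condition. The sketch below shows how such a command line could be assembled for the two conditions used in this example. It assumes Tophat's default `accepted_hits.bam` output name and the `merged_asm/merged.gtf` path that Cuffmerge writes by default; the output directory `diff_out` and the helper name `cuffdiff_command` are hypothetical, and the command is only printed.

```python
def cuffdiff_command(conditions, merged_gtf="merged_asm/merged.gtf",
                     out_dir="diff_out", threads=8):
    """Build a Cuffdiff argument list.

    conditions: dict mapping a condition label to its replicate sample names;
    replicates of one condition are joined with commas into a single argument.
    """
    labels = ",".join(conditions)
    groups = [
        ",".join(f"./{s}_thout/accepted_hits.bam" for s in replicates)
        for replicates in conditions.values()
    ]
    return ["cuffdiff", "-o", out_dir, "-p", str(threads),
            "-L", labels, merged_gtf, *groups]


cmd = cuffdiff_command({"C1": ["C1_R1", "C1_R2"], "C2": ["C2_R1", "C2_R2"]})
print(" ".join(cmd))
```

Grouping replicates per condition in this way is what lets the comparison be made between C1 and C2 rather than between individual runs.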
Click on ruler
Click and drag
Use scroll bar
Use keyboard: Arrow keys, Page Up, Page Down, Home, End
http://www.broadinstitute.org/igv/UserGuide
Nielsen, C.B., et al. Visualizing genomes: techniques and
challenges. Nature Methods 7:S5-S15 (2010)
Summary
NGS technologies are transforming
molecular biology.
Bioinformatics analysis is a crucial
part of NGS applications.
Data formats, terminology, general
workflow
Analysis pipeline
Software for various NGS applications
RNA-seq with Tuxedo suite
Thank you!