
Introduction to Next Generation Sequencing (NGS) Data Analysis

Jenny Wu
UCI Genomics High Throughput Facility
Outline
Goals: practical guide to NGS data processing
Bioinformatics in NGS data analysis
Basics: terminology, data file formats, general workflow
Data Analysis Pipeline
Sequence QC and preprocessing
Obtaining and preparing reference
Sequence mapping
Downstream analysis workflow and software
Example: RNA-Seq analysis with Tuxedo protocol
Summary and future plan
Why Next Generation Sequencing
One can sequence hundreds of millions of short sequences (35 bp to 120 bp) in a single run, in a short period of time and at a low per-base cost.

Illumina/Solexa GA II / HiSeq 2000, 2500
Life Technologies/Applied Biosystems SOLiD
Roche/454 FLX, Titanium

Reviews: Michael Metzker (2010) Nature Reviews Genetics
Quail et al. (2012) BMC Genomics Jul 24;13:341.
Why Bioinformatics
[Figure: diagram including "Informatics"; source: wall.hms.harvard.edu]
Bioinformatics Challenges in NGS Data Analysis
VERY large text files (tens of millions of lines long)
Can't do business as usual with familiar tools
Impossible memory usage and execution time
Manage, analyze, store, transfer and archive huge files
Need for powerful computers and expertise
Informatics groups must manage compute clusters
New algorithms and software are required; they are often open source and Unix/Linux based.
Collaboration of IT, bioinformaticians and
Basic NGS Workflow
NGS Data Analysis Overview

Olson et al.
Terminology
Coverage (depth): The number of nucleotides from reads that are mapped to a given position.
Quality score: Each called base comes with a quality score that measures the probability of a base-call error.
Mapping: Aligning reads to a reference to identify their origin.
Assembly: Merging fragments of DNA in order to reconstruct the original sequence.
Duplicate reads: Reads that are identical.
Multi-reads: Reads that can be mapped to multiple locations equally well.
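To make the coverage definition concrete, here is a minimal sketch that counts depth per reference position. It assumes each mapped read is described by a hypothetical (start, length) pair with 1-based starts and no gaps; real pipelines read these values from SAM/BAM records instead.

```python
def per_base_coverage(read_alignments, ref_length):
    """Return a list where entry i is the depth at 1-based position i + 1.

    read_alignments: iterable of (start, length) tuples (assumed, ungapped).
    """
    depth = [0] * ref_length
    for start, length in read_alignments:
        # Clip the read to the end of the reference.
        end = min(start + length, ref_length + 1)
        for pos in range(start, end):
            depth[pos - 1] += 1
    return depth


# Three 5 bp reads on a 10 bp reference.
reads = [(1, 5), (3, 5), (4, 5)]
depth = per_base_coverage(reads, 10)
mean_depth = sum(depth) / len(depth)
print(depth)       # [1, 1, 2, 3, 3, 2, 2, 1, 0, 0]
print(mean_depth)  # 1.5
```

Positions 4 and 5 are covered by all three reads, so their depth is 3; positions 9 and 10 are not covered at all.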
What does the data look like?
Common NGS Data Formats
FASTA Format (Reference Seq)
FASTQ Format (reads)
FASTQ Format (Illumina Example)
Each record has four lines: header, read bases, separator (with optional repeated header), read quality scores.
Header fields labeled on the slide: flow cell ID, lane, tile, tile coordinates, read, barcode.

@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
@DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
+
@@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
@DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
+
CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
@DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
+
CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ
NOTE: for paired-end runs, there is a second file
with one-to-one corresponding headers and reads.
(Passarelli, 2012)
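The four-line record layout can be read with a few lines of Python. This is a sketch only: the function names are made up for illustration, the header layout is assumed to be instrument:run:flowcell:lane:tile:x:y followed by read:filtered:control:barcode, and the quality encoding is assumed to be Phred+33 (older Illumina pipelines used a different offset).

```python
import io


def read_fastq(handle):
    """Yield (header, bases, qualities) from a FASTQ stream, 4 lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        bases = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator, optionally repeating the header
        qualities = handle.readline().rstrip("\n")
        yield header, bases, qualities


def parse_illumina_header(header):
    """Split a header like the slide example into named fields (assumed layout)."""
    location, _, tags = header.lstrip("@").partition(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, is_filtered, control, barcode = tags.split(":")
    return {"instrument": instrument, "run": run, "flowcell": flowcell,
            "lane": lane, "tile": tile, "x": x, "y": y,
            "read": read, "filtered": is_filtered, "barcode": barcode}


def phred33_to_error_prob(quality_char):
    """Convert one quality character to a base-call error probability."""
    q = ord(quality_char) - 33  # assumed Phred+33 offset
    return 10 ** (-q / 10)


record = (
    "@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA\n"
    "CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT\n"
    "+\n"
    "BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ\n"
)
records = list(read_fastq(io.StringIO(record)))
fields = parse_illumina_header(records[0][0])
print(fields["flowcell"], fields["lane"], fields["tile"], fields["barcode"])
```

For a paired-end run, the same reader could be applied to both files in parallel and the header fields before the space compared record by record.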
Data Analysis Pipeline
[Workflow diagram; legend: data, task, file format, software]
Raw reads (FASTQ)
Read QC and preprocessing: FASTQC, FASTX-toolkit, PRINSEQ
Analysis-ready reads (FASTQ)
Collecting reference sequences (FASTA) and annotation (GTF/GFF)
Read mapping: Bowtie, BWA, MAQ
Mapped reads (SAM/BAM)
Local realignment, base quality recalibration
Visualization (IGV, UCSC GB)
Whole Genome Sequencing: variant calling, annotation
RNA-Seq: transcript assembly, quantification
ChIP-Seq: peak calling
Methyl-Seq: methylation calling
Why QC?
Sequencing runs cost money
Consequences of not assessing the data:
Sequencing a poor library on multiple runs is throwing money away!
Data analysis costs money and time
Cost of analyzing data, CPU time $$
Cost of storing raw sequence data $$$
Hours of analysis could be wasted $$$$
Downstream analysis can be incorrect.
How to QC?
$ fastqc s_1_1.fastq

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (available on HPC)

Tutorial: http://www.youtube.com/watch?v=bz93ReOv87Y
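As a rough illustration of one thing a QC report summarizes, the sketch below averages base quality at each read position. It is not how FastQC is implemented; the function name is hypothetical and a Phred+33 encoding is assumed.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Average Phred score at each read position across all reads.

    quality_strings: iterable of FASTQ quality lines (assumed Phred+33).
    Reads may differ in length; each position averages the reads that reach it.
    """
    totals, counts = [], []
    for qualities in quality_strings:
        for i, ch in enumerate(qualities):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
means = mean_quality_per_position(["II", "5I"])
print(means)  # [30.0, 40.0]
```

A drop in the per-position mean toward the 3' end of reads is a common reason to trim before mapping.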
The UCSC Genome Browser Homepage

General information

Get genome annotation here!

Get reference sequences here!


Specific information
new features, current status, etc.
Getting reference sequences
Getting reference annotation
Sequence Mapping Challenges
Alignment (mapping) is the first step once read sequences are obtained.
The task: to align sequencing reads against a known reference.
Difficulties: high volume of data, size of reference genome, computation time, read length constraints, ambiguity caused by repeats and sequencing errors.
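The two sources of ambiguity can be seen with a deliberately naive exact-match search. This toy sketch (hypothetical function and sequences) is not how production aligners work; tools such as Bowtie and BWA use an index of the reference and tolerate mismatches.

```python
def find_exact_hits(read, reference):
    """Return every 1-based position where the read matches the reference exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start + 1)
        start = reference.find(read, start + 1)
    return hits


reference = "ACGTACGTTTGACC"

unique_hit = find_exact_hits("TTGA", reference)  # maps to one place
multi_hit = find_exact_hits("ACGT", reference)   # repeat: a multi-read
no_hit = find_exact_hits("ACGA", reference)      # one wrong base: no exact match

print(unique_hit, multi_hit, no_hit)  # [9] [1, 5] []
```

The repeated "ACGT" cannot be placed unambiguously, and the read with a single erroneous base is lost entirely unless mismatches are allowed, which is where the extra computation time goes.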
Short Read Alignment

Olson et al.
Short Read Alignment Software
Short Reads Mapping Software
How to choose an aligner?
There are many aligners, and they vary a lot in performance (accuracy, memory usage, speed, etc.).
Factors to consider: application, platform, read length, downstream analysis, etc.
Constant trade-off between speed and sensitivity (e.g. MAQ vs. Bowtie).
Guaranteed high accuracy will take longer.
NGS Applications and Analysis Strategy
Name | Nucleic acid population | Brief analysis strategy
RNA-Seq | RNA (may be poly-A mRNA or total RNA) | Alignment of reads to genes; variations for detecting splice junctions and quantifying abundance
Small RNA sequencing | Small RNA (often miRNA) | Alignment of reads to small RNA references (e.g. miRBase), then to the genome; quantify abundance
ChIP-Seq | DNA bound to protein, captured via antibody (ChIP = Chromatin ImmunoPrecipitation) | Align reads to reference genome, identify peaks & motifs
RIP-Seq | RNA bound to protein, captured via antibody (RIP = RNA ImmunoPrecipitation) | Align reads to reference genome and/or genes, identify peaks and motifs
Methylation Analysis | Select methylated genomic DNA regions, or convert methylated nucleotides to alternate forms | Align reads to reference and either identify peaks or regions of methylation
SNP calling/discovery | All or some genomic DNA or RNA | Either align reads to reference and identify statistically significant SNPs, or compare multiple samples to each other to identify SNPs
Structural Variation Analysis | Genomic DNA, with two reads (mate-pair reads) per DNA template | Align mate-pairs to reference sequence and interpret structural variants
de novo Sequencing | Genomic DNA (possibly with external data e.g. cDNA, genomes of closely related species, etc.) | Piece together reads to assemble contigs, scaffolds, and (ideally) whole-genome sequence
Metagenomics | Entire RNA or DNA from a (usually microbial) community | Phylogenetic analysis of sequences

(Hunicke-Smith et al, 2010)


Application Specific Software
[Diagram: mapped reads feed four application branches]
Whole Genome Sequencing, Exome Sequencing: variant calling (SNPs, InDels). Software: ssahaSNP, Samtools, PyroBayes
RNA-Seq: transcriptome analysis (1. transcriptome assembly, 2. abundance quantification, 3. differential expression and regulation). Software: Tophat, STAR, Cufflinks, edgeR
ChIP-Seq: protein DNA binding site analysis (peak identification). Software: MACS, AREM, PeakSeq
Methyl-Seq: methylation pattern analysis (methylation calling). Software: Bismark, BS Seeker
RNA-seq (Tuxedo Protocol)
1. Read mapping (SAM/BAM)
2. Transcript assembly and quantification (GTF/GFF)
3. Merge assembled transcripts from multiple samples
4. Differential expression analysis

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
1. Spliced Alignment: Tophat
Tophat: a spliced short read aligner for RNA-seq.

$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq

$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq

$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq

$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
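The four Tophat calls differ only in the sample name, so they can be generated rather than typed. The sketch below only builds the command lines shown on the slide; the helper name is hypothetical, and actually running them (for example with subprocess.run) assumes Tophat is installed and the index and FASTQ files exist.

```python
import shlex


def tophat_command(sample, threads=8, annotation="genes.gtf", index="genome"):
    """Build the paired-end Tophat call used on the slide for one sample."""
    return [
        "tophat", "-p", str(threads), "-G", annotation,
        "-o", f"{sample}_thout", index,
        f"{sample}_1.fq", f"{sample}_2.fq",
    ]


# Two conditions (C1, C2) with two replicates each, as in the slide.
samples = ["C1_R1", "C1_R2", "C2_R1", "C2_R2"]
commands = [tophat_command(sample) for sample in samples]

for command in commands:
    print(shlex.join(command))
    # To execute: subprocess.run(command, check=True)
```

Keeping each command as a list of arguments avoids shell quoting problems if a file name ever contains spaces.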
2. Transcript Assembly and Abundance Quantification: Cufflinks
Cufflinks: a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.

$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam

$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam

$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam

$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
3. Final Transcriptome Assembly: Cuffmerge
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt

$ more assemblies.txt

./C1_R1_clout/transcripts.gtf
./C1_R2_clout/transcripts.gtf
./C2_R1_clout/transcripts.gtf
./C2_R2_clout/transcripts.gtf
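The manifest file passed to Cuffmerge is just one GTF path per line, one for each Cufflinks output directory. A minimal sketch for producing that text, assuming the `<sample>_clout` directory naming used on the slides (the function name is hypothetical):

```python
def assembly_manifest(samples):
    """Return the manifest text: one Cufflinks transcripts.gtf path per line."""
    lines = [f"./{sample}_clout/transcripts.gtf" for sample in samples]
    return "\n".join(lines) + "\n"


manifest = assembly_manifest(["C1_R1", "C1_R2", "C2_R1", "C2_R2"])
print(manifest, end="")

# To save it under the name used on the slide:
# with open("assemblies.txt", "w") as handle:
#     handle.write(manifest)
```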
4. Differential Expression: Cuffdiff
Cuffdiff: a program that compares transcript abundance between samples.

$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam ./C2_R1_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
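In the Cuffdiff call, replicates of one condition are joined with commas and the conditions themselves are separated by spaces, in the same order as the labels given to -L. The sketch below builds that argument list from a mapping of condition to replicates; the function name and the dictionary layout are illustrative assumptions, while the file names and options are the ones from the slide.

```python
import shlex


def cuffdiff_command(conditions, merged_gtf="merged_asm/merged.gtf",
                     genome="genome.fa", out_dir="diff_out", threads=8):
    """Build a Cuffdiff call; conditions maps label -> list of sample names."""
    labels = ",".join(conditions)
    # One comma-joined BAM group per condition, in label order.
    bam_groups = [
        ",".join(f"./{sample}_thout/accepted_hits.bam" for sample in replicates)
        for replicates in conditions.values()
    ]
    return ["cuffdiff", "-o", out_dir, "-b", genome, "-p", str(threads),
            "-L", labels, "-u", merged_gtf, *bam_groups]


command = cuffdiff_command({"C1": ["C1_R1", "C1_R2"], "C2": ["C2_R1", "C2_R2"]})
print(shlex.join(command))
```

Because Python dictionaries preserve insertion order, the label list and the BAM groups stay aligned.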
Integrative Genomics Viewer (IGV)
http://www.broadinstitute.org/igv
Visualizing RNA-seq mapping with IGV
Specify range or term in search box
Click on ruler
Click and drag
Use scroll bar
Use keyboard: Arrow keys, Page Up, Page Down, Home, End
http://www.broadinstitute.org/igv/UserGuide
Nielsen, C.B., et al. Visualizing genomes: techniques and challenges. Nature Methods 7:S5-S15 (2010)
Summary
NGS technologies are transforming molecular biology.
Bioinformatics analysis is a crucial part of NGS applications.
Data formats, terminology, general workflow
Analysis pipeline
Software for various NGS applications
RNA-seq with Tuxedo suite
Thank you!
