Workshop13 Day1

Introduction to Next Generation Sequencing (NGS) Data Analysis
Jenny Wu
UCI Genomics High Throughput Facility
Outline
Goals: Practical guide to NGS data processing
Bioinformatics in NGS data analysis
Basics: terminology, data file formats, general workflow
Data Analysis Pipeline
Sequence QC and preprocessing
Obtaining and preparing reference
Sequence mapping
Downstream analysis workflow and software
Example: RNA-Seq analysis with Tuxedo protocol
Summary and future plan
Why Next Generation Sequencing
A single run can sequence hundreds of
millions of short reads (35-120 bp)
in a short period of time and at a
low per-base cost.
Illumina/Solexa GA II / HiSeq 2000, 2500

Life Technologies/Applied Biosystems SOLiD
Roche/454 FLX, Titanium
Reviews: Michael Metzker (2010) Nature Reviews Genetics

Quail et al (2012) BMC Genomics Jul 24;13:341.
Why Bioinformatics
Informatics
(wall.hms.harvard.edu)
Bioinformatics Challenges
in NGS Data Analysis
VERY large text files (tens of millions of lines long)
Can't do "business as usual" with familiar tools
Impossible memory usage and execution time
Manage, analyze, store, transfer and archive huge files
Need for powerful computers and expertise
Informatics groups must manage compute clusters
New algorithms and software are required; they are often open source and Unix/Linux based.
Collaboration of IT, bioinformaticians and
Basic NGS Workflow
NGS Data Analysis Overview
Olson et al.
Terminology
Coverage (depth): The number of nucleotides from reads that are mapped to a given position.
Quality Score: Each called base comes with a quality score, which measures the probability of a base-call error.
Mapping: Aligning reads to a reference to identify their origin.
Assembly: Merging fragments of DNA in order to reconstruct the original sequence.
Duplicate reads: Reads that are identical.
Multi-reads: Reads that can be mapped to multiple locations equally well.
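To make two of these terms concrete, here is a minimal Python sketch (not taken from the slides) that computes per-position coverage from a handful of alignments and flags duplicate reads. The alignment intervals and read sequences are hypothetical toy data, and the 0-based, half-open interval convention is an assumption of this example.

```python
from collections import Counter


def coverage_depth(read_intervals, reference_length):
    """Per-position depth from 0-based, half-open (start, end) alignments."""
    depth = [0] * reference_length
    for start, end in read_intervals:
        # Clip each alignment to the reference so out-of-range reads are safe.
        for pos in range(max(start, 0), min(end, reference_length)):
            depth[pos] += 1
    return depth


def duplicate_reads(sequences):
    """Return the sequences that occur more than once, with their counts."""
    counts = Counter(sequences)
    return {seq: n for seq, n in counts.items() if n > 1}


# Hypothetical toy data: three 5 bp reads on a 10 bp reference.
alignments = [(0, 5), (2, 7), (4, 9)]
depth = coverage_depth(alignments, 10)
mean_depth = sum(depth) / len(depth)
print(depth, mean_depth)

# Hypothetical reads, two of which are identical.
print(duplicate_reads(["ACGT", "ACGT", "TTTT"]))
```

Real pipelines usually compute these values from SAM/BAM files with dedicated tools; the sketch only illustrates the definitions.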
What does the data look like?
Common NGS Data Formats
FASTA Format (Reference Seq)
FASTQ Format (reads)
FASTQ Format (Illumina Example)
Each record has four lines: record header, read bases, separator (with optional repeated header), and read quality scores.
The record header encodes the flow cell ID, lane, tile, tile coordinates, read number and barcode.
@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
@DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
+
@@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
+
CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
+
CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ
NOTE: for paired-end runs, there is a second file
with one-to-one corresponding headers and reads.
(Passarelli, 2012)
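The four-line record layout above can be read with a few lines of Python. The sketch below is illustrative rather than part of the original slides. It assumes strictly four lines per record, the colon-separated Illumina header layout (instrument, run, flow cell, lane, tile, x, y, then read, filter flag, control bits, barcode), and Phred+33 quality encoding. The sample record reuses the first header from the slide, with bases and qualities shortened to ten characters for brevity.

```python
import io


def read_fastq(handle):
    """Yield (header, bases, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        bases = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quals = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(bases) != len(quals):
            raise ValueError(f"bases and qualities differ in length for {header!r}")
        yield header[1:], bases, quals


def parse_illumina_header(header):
    """Split an Illumina-style header into named fields (assumed layout)."""
    location, _, flags = header.partition(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, filtered, control, barcode = flags.split(":")
    return {
        "instrument": instrument, "run": run, "flowcell": flowcell,
        "lane": lane, "tile": tile, "x": x, "y": y,
        "read": read, "filtered": filtered, "control": control,
        "barcode": barcode,
    }


def error_probability(quality_char, offset=33):
    """Convert one quality character to a base-call error probability."""
    phred = ord(quality_char) - offset  # Phred+33 is an assumption here
    return 10 ** (-phred / 10)


# First header from the slide; bases/qualities truncated to 10 characters.
SAMPLE = (
    "@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA\n"
    "CAGGAGTCTT\n"
    "+\n"
    "BCCFFFDFHH\n"
)

for header, bases, quals in read_fastq(io.StringIO(SAMPLE)):
    fields = parse_illumina_header(header)
    worst = max(error_probability(c) for c in quals)
    print(fields["lane"], fields["tile"], fields["barcode"], bases, worst)
```

For paired-end data, the same reader could be run over both files in lockstep to confirm that the headers correspond one-to-one.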
Data Analysis Pipeline
Raw reads (FASTQ)
Read QC and preprocessing: FASTQC, FASTX-toolkit, PRINSEQ
Analysis-ready reads (FASTQ)
Collecting reference sequences (FASTA) and annotation (GTF/GFF)
Read mapping: Bowtie, BWA, MAQ
Mapped reads (SAM/BAM)
Local realignment, base quality recalibration
Visualization (IGV, UCSC GB)
Whole Genome Sequencing: variant calling, annotation
RNA-Seq: transcript assembly, quantification
ChIP-Seq: peak calling
Methyl-Seq: methylation calling
(Diagram legend: data, task, file format, software)
Why QC?
Sequencing runs cost money
Consequences of not assessing the Data
Sequencing a poor library on multiple
runs is throwing money away!
Data analysis costs money and time

Cost of analyzing data, CPU time $$
Cost of storing raw sequence data $$$
Hours of analysis could be wasted $$$$
Downstream analysis can be incorrect.
How to QC?
$ fastqc s_1_1.fastq
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/, available on HPC

Tutorial : http://www.youtube.com/watch?v=bz93ReOv87Y
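One of the things a QC report summarizes is how base quality changes along the read. The Python sketch below illustrates that idea along with a simple 3' quality trim. It is not FastQC's implementation, and the Phred+33 offset, the threshold of 20 and the toy quality strings are assumptions made for the example.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Mean Phred score at each read position across all reads."""
    totals, counts = [], []
    for quals in quality_strings:
        for i, ch in enumerate(quals):
            if i == len(totals):
                # First read long enough to reach this position.
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def trim_low_quality_tail(bases, quals, min_q=20, offset=33):
    """Remove trailing bases whose quality is below min_q."""
    end = len(quals)
    while end > 0 and ord(quals[end - 1]) - offset < min_q:
        end -= 1
    return bases[:end], quals[:end]


# Hypothetical toy reads: 'I' is Q40, '5' is Q20, '#' is Q2 under Phred+33.
print(mean_quality_per_position(["II", "5I"]))
print(trim_low_quality_tail("ACGT", "II##"))
```

A steady drop in mean quality toward the 3' end is a common reason to trim reads before mapping.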
The UCSC Genome Browser Homepage
General information
Get genome annotation here!
Get reference sequences here!

Specific information
new features, current status, etc.
Getting reference sequences
Getting reference annotation
Sequence Mapping Challenges
Alignment (mapping) is the first step once read sequences are obtained.
The task: to align sequencing reads against a known reference.
Difficulties: high volume of data, size of reference genome, computation time, read length constraints, ambiguity caused by repeats and sequencing errors.
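The ambiguity caused by repeats and sequencing errors can be shown with a toy mapper. The Python sketch below uses a simple hash-based seed-and-verify scheme on a made-up reference. It is a teaching example only and is not how production aligners such as Bowtie or BWA are implemented. The reference string, the seed length k and the mismatch limit are all hypothetical choices.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Seed with the read's first k-mer, then verify by counting mismatches."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# Hypothetical reference containing the repeat "GATTACA" at positions 0 and 9.
REFERENCE = "GATTACAGGGATTACATTT"
K = 4
INDEX = build_index(REFERENCE, K)

print(map_read("GATTACA", REFERENCE, INDEX, K))    # read falls inside the repeat
print(map_read("GATTACAG", REFERENCE, INDEX, K))   # one mismatch tolerated
print(map_read("ACAGGG", REFERENCE, INDEX, K))     # unique location
```

A read that lies entirely within the repeat maps equally well to both copies (a multi-read), while tolerating mismatches to absorb sequencing errors increases the number of candidate locations that must be checked.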
Short Read Alignment
Olson et al.
Short Read Alignment Software
Short Read Mapping Software
How to choose an aligner?
There are many aligners and they
vary a lot in performance (accuracy,
memory usage, speed, etc).
Factors to consider : application,
platform, read length, downstream
analysis, etc.
Constant trade-off between speed and
sensitivity (e.g. MAQ vs. Bowtie)
Guaranteeing high accuracy takes
longer.
NGS Applications and Analysis Strategy
Name | Nucleic acid population | Brief analysis strategy
RNA-Seq | RNA (may be poly-A mRNA or total RNA) | Alignment of reads to genes; variations for detecting splice junctions and quantifying abundance
Small RNA sequencing | Small RNA (often miRNA) | Alignment of reads to small RNA references (e.g. miRBase), then to the genome; quantify abundance
ChIP-Seq | DNA bound to protein, captured via antibody (ChIP = Chromatin ImmunoPrecipitation) | Align reads to reference genome, identify peaks & motifs
RIP-Seq | RNA bound to protein, captured via antibody (RIP = RNA ImmunoPrecipitation) | Align reads to reference genome and/or genes, identify peaks and motifs
Methylation Analysis | Select methylated genomic DNA regions, or convert methylated nucleotides to alternate forms | Align reads to reference and either identify peaks or regions of methylation
SNP calling/discovery | All or some genomic DNA or RNA | Either align reads to reference and identify statistically significant SNPs, or compare multiple samples to each other to identify SNPs
Structural Variation Analysis | Genomic DNA, with two reads (mate-pair reads) per DNA template | Align mate-pairs to reference sequence and interpret structural variants
de novo Sequencing | Genomic DNA (possibly with external data e.g. cDNA, genomes of closely related species, etc.) | Piece together reads to assemble contigs, scaffolds, and (ideally) whole-genome sequence
Metagenomics | Entire RNA or DNA from a (usually microbial) community | Phylogenetic analysis of sequences
(Hunicke-Smith et al, 2010)

Application Specific Software
Mapped reads
Whole Genome Sequencing, Exome Sequencing: variant calling (SNPs, InDels). Software: ssahaSNP, Samtools, PyroBayes
RNA-Seq: transcriptome analysis (1. transcriptome assembly, 2. abundance quantification, 3. differential expression and regulation). Software: Tophat, STAR, Cufflinks, edgeR
ChIP-Seq: protein DNA binding site (peak identification). Software: MACS, AREM, PeakSeq
Methyl-Seq: methylation pattern analysis (methylation calling). Software: Bismark, BS Seeker
RNA-seq (Tuxedo Protocol)
1. Read mapping (SAM/BAM)
2. Transcript assembly and quantification (GTF/GFF)
3. Merge assembled transcripts from multiple samples
4. Differential expression analysis
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
1. Spliced Alignment: Tophat
Tophat: a spliced short read aligner for RNA-seq.
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
The command is repeated for each remaining sample (read files C1_R2_2.fq, C2_R1_2.fq, C2_R2_2.fq and their mates).
2. Transcript assembly and abundance quantification: Cufflinks
Cufflinks: a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
3. Final Transcriptome assembly: Cuffmerge
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
$ more assemblies.txt
./C1_R1_clout/transcripts.gtf
4. Differential Expression: Cuffdiff
CuffDiff: a program that compares transcript abundance between samples.
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
    ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam \
    ./C2_R1_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
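When there are several samples, the four Tuxedo steps are easier to keep consistent if the command lines are generated rather than typed by hand. The Python sketch below only builds and prints the commands; it does not run them. It uses just the flags that appear on the slides. The `_1.fq`/`_2.fq` naming for every sample, the sample grouping and the helper names are assumptions of this example. Replicates of one condition are joined with commas and conditions are separated by spaces, which is how cuffdiff is usually invoked.

```python
import shlex

# Hypothetical layout: two conditions with two replicates each.
SAMPLES = {"C1": ["C1_R1", "C1_R2"], "C2": ["C2_R1", "C2_R2"]}


def tuxedo_commands(samples, threads=8, gtf="genes.gtf",
                    genome_index="genome", genome_fasta="genome.fa"):
    """Return the Tuxedo command lines as lists of arguments."""
    p = str(threads)
    all_samples = [s for reps in samples.values() for s in reps]
    cmds = []
    # Step 1: spliced alignment of each sample (paired file names assumed).
    for s in all_samples:
        cmds.append(["tophat", "-p", p, "-G", gtf, "-o", f"{s}_thout",
                     genome_index, f"{s}_1.fq", f"{s}_2.fq"])
    # Step 2: transcript assembly and quantification per sample.
    for s in all_samples:
        cmds.append(["cufflinks", "-p", p, "-o", f"{s}_clout",
                     f"{s}_thout/accepted_hits.bam"])
    # Step 3: merge the per-sample assemblies listed in assemblies.txt.
    cmds.append(["cuffmerge", "-g", gtf, "-s", genome_fasta, "-p", p,
                 "assemblies.txt"])
    # Step 4: replicates joined by commas, one argument per condition.
    bam_groups = [",".join(f"./{s}_thout/accepted_hits.bam" for s in reps)
                  for reps in samples.values()]
    cmds.append(["cuffdiff", "-o", "diff_out", "-b", genome_fasta, "-p", p,
                 "-L", ",".join(samples), "-u", "merged_asm/merged.gtf",
                 *bam_groups])
    return cmds


def assemblies_file(samples):
    """Contents of assemblies.txt: one transcripts.gtf path per sample."""
    return "".join(f"./{s}_clout/transcripts.gtf\n"
                   for reps in samples.values() for s in reps)


COMMANDS = tuxedo_commands(SAMPLES)
for cmd in COMMANDS:
    print(shlex.join(cmd))
print(assemblies_file(SAMPLES), end="")
```

Keeping each command as an argument list means it could later be passed to `subprocess.run` without shell quoting problems, once the tools are installed.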
Integrative Genomics Viewer (IGV)
http://www.broadinstitute.org/igv
Visualizing RNA-seq mapping with IGV
Specify range or term in search box
Click on ruler
Click and drag
Use scroll bar
Use keyboard: Arrow keys, Page Up, Page Down, Home, End
http://www.broadinstitute.org/igv/UserGuide
Nielsen, C.B., et al. Visualizing genomes: techniques and challenges. Nature Methods 7:S5-S15 (2010)
Summary
NGS technologies are transforming
molecular biology.
Bioinformatics analysis is a crucial
part of NGS applications.
Data formats, terminology, general
workflow
Analysis pipeline
Software for various NGS applications
RNA-seq with Tuxedo suite
Thank you!
