
The Genome Access Course

Next Generation DNA Sequencing

Illumina HiSeq X: 1.8 Tbp (3 billion reads) in ~3 days
(as of 11/6/2014)
November 2014
Whole Genome Shotgun Sequencing

1. Randomly Fragment Genomic DNA
2. Sequence Fragments
3. Genome Assembly

...ATCCGTAAATGGGCTGATACTACTAATGC
            TGGGCTGATACTACTAATGCCAAACTGTACTAGTCCTG...
...ATCCGTAAATGGGCTGATACTACTAATGCCAAACTGTACTAGTCCTG...

Contiguous Sequence (Contig)
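
The assembly step on this slide can be illustrated with a minimal sketch that merges two reads by their longest suffix/prefix overlap. This is only a toy example using the two reads shown above; real assemblers handle sequencing errors, repeats, and millions of reads. The function names and the `min_len` parameter are assumptions made for illustration.

```python
def longest_overlap(left: str, right: str, min_len: int = 5) -> int:
    """Return the length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_len: int = 5):
    """Merge two reads into one contig, or return None if they do not overlap."""
    size = longest_overlap(left, right, min_len)
    if size == 0:
        return None
    return left + right[size:]


# The two reads from the slide (without the leading/trailing "...").
read_a = "ATCCGTAAATGGGCTGATACTACTAATGC"
read_b = "TGGGCTGATACTACTAATGCCAAACTGTACTAGTCCTG"

contig = merge_reads(read_a, read_b)
print(contig)
```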

RNA Sequencing (RNA-Seq)


cDNA made from RNA
Sequence cDNA Fragments

Garber et al, Nat Methods (2011)

1. Characterize all RNA in sample

2. Gene expression level proportional to number of reads

3. Detect alternatively spliced transcripts
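
Point 2 can be made concrete with a small sketch that scales raw read counts to counts per million (CPM), so that samples sequenced to different depths can be compared. CPM is one simple normalization chosen here for illustration, not necessarily the one used in the course; the gene names and counts are made up.

```python
def counts_per_million(read_counts: dict) -> dict:
    """Scale raw per-gene read counts to counts per million mapped reads."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads were counted")
    return {gene: count * 1_000_000 / total for gene, count in read_counts.items()}


# Hypothetical read counts for three genes in one sample.
counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
cpm = counts_per_million(counts)

for gene, value in cpm.items():
    print(gene, value)
```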



Typical Next Gen Experiments

• Genome sequencing
  – Novel genomes
  – Resequencing
• Transcriptome sequencing (RNA-seq)
  – Characterize transcripts with or without reference genome
    • Typical length
    • Short (microRNAs, …)
  – Find differentially expressed transcripts
• Other
  – Methyl-seq
  – ChIP-seq


Illumina Sequencing

1. DNA Sample
2. Construct Library
3. Cluster Generation in Flow Cell
4. Sequencing by Synthesis

200+ million reads per lane (>100 bp reads)


Types of Sequencing Libraries


Single-End Reads - 5’ or 3’ (random)
Paired-End Reads - 5’ and 3’ (ends of 200-500 bp fragments)
Mate-Pair Reads - 5’ and 3’ (ends of 2-5 kbp fragments)


Taken from GIGA Newsletter 13 – Université de Liège


What Does the Data Look Like?


FASTQ File Format
• Sequence
• Quality (ASCII character for each base)

> 200 million reads in one lane

Files are so big that they are broken up into 40 million reads per file
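
To make the format concrete, here is a minimal sketch that reads one FASTQ record (four lines: an `@` header, the sequence, a `+` separator, and the quality string) and converts each quality character to a numeric score. It assumes the Phred+33 encoding; older Illumina files may use a different offset. The read shown is made up.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into (name, sequence, quality string)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a valid FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, quality


def phred_scores(quality: str, offset: int = 33):
    """Decode one ASCII character per base into a Phred quality score."""
    return [ord(char) - offset for char in quality]


# A made-up record for illustration.
record = ["@read1\n", "GATTACA\n", "+\n", "IIIIH#5\n"]
name, seq, qual = parse_fastq_record(record)
scores = phred_scores(qual)
print(name, seq, scores)
```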


Example Analysis Workflow

Paired-End FASTQ Files (_R1.txt, _R2.txt)

1. FastQC (Diagnostics)
2. Trim Reads (if needed)
3. Align Reads to Genome
4. SAM File
5. BAM File


Sequence Composition Diagnostics

Unbiased Reads

Biased Reads

First Position Nearly Always “T”
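
The kind of diagnostic shown on this slide can be sketched as a per-position base composition table. This is a simplified stand-in for what a tool such as FastQC reports, not its actual implementation; the reads are made up so that the first position is always "T".

```python
from collections import Counter


def base_composition(reads):
    """Return, for each read position, the fraction of A, C, G and T observed."""
    length = min(len(read) for read in reads)
    table = []
    for pos in range(length):
        counts = Counter(read[pos] for read in reads)
        total = sum(counts.values())
        table.append({base: counts[base] / total for base in "ACGT"})
    return table


# Made-up reads with a biased first position.
reads = ["TACGT", "TCGAA", "TGGCC", "TTACG"]
composition = base_composition(reads)

for pos, fractions in enumerate(composition, start=1):
    print(pos, fractions)
```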


GC Bias in First ~15 bp Due to Random Hexamer Priming


Trim Sequences Prior To Analysis


• Make sure sequencing adapters are removed
• Trim ends of sequence based on quality scores
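
Both trimming steps can be sketched in a few lines. This is a simplified illustration, not how any particular trimming tool works: the adapter is removed only on an exact match, low-quality bases are removed only from the 3' end, and the Phred+33 encoding is assumed. The adapter string, the threshold of 20, and the read are assumptions chosen for the example.

```python
def clip_adapter(sequence: str, quality: str, adapter: str):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    index = sequence.find(adapter)
    if index == -1:
        return sequence, quality
    return sequence[:index], quality[:index]


def trim_low_quality_tail(sequence: str, quality: str, min_score: int = 20, offset: int = 33):
    """Remove bases from the 3' end while their quality score is below min_score."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_score:
        end -= 1
    return sequence[:end], quality[:end]


# Made-up read: 10 bases of insert followed by an illustrative adapter sequence.
adapter = "AGATCGGAAGAGC"
seq = "GATTACAGGC" + adapter
qual = "IIIIIII5#!" + "I" * len(adapter)

seq, qual = clip_adapter(seq, qual, adapter)
trimmed_seq, trimmed_qual = trim_low_quality_tail(seq, qual)
print(trimmed_seq, trimmed_qual)
```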


FastX Toolkit – Hannon Lab at CSHL

Trimmomatic




Sequence Alignment/Map (SAM) Format

• Common file format to store:
  – Reads
  – Quality of each base
  – How reads align to a reference sequence
• Generated by most next gen analysis software
• samtools software package
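
A SAM alignment line is tab-delimited with eleven mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The sketch below pulls out the fields named on this slide and checks two FLAG bits. The `SamRecord` type and the example line are made up for illustration; in practice a library or samtools itself would be used.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # how the read aligns to the reference
    seq: str     # read sequence
    qual: str    # per-base quality string


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[9], fields[10])


def is_unmapped(record: SamRecord) -> bool:
    return bool(record.flag & 0x4)


def is_reverse_strand(record: SamRecord) -> bool:
    return bool(record.flag & 0x10)


# Made-up alignment line.
line = "read1\t16\tchr1\t272\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII\n"
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, is_reverse_strand(record))
```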

samtools Used to Manipulate SAM Files

SAM File → samtools → BAM File
                    → Pileup File → Call Variants

Pileup output file
chr1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
chr1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
chr1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
chr1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
chr1 276 G 22 TTTTTTTTTTTTTTTTTTTTTTT 33;+<<7=7<<7<&<<1;<<6<
chr1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
chr1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
chr1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
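
Each pileup line lists the chromosome, position, reference base, read depth, read bases, and base qualities. In the read-bases column, `.` and `,` mean a match to the reference on the forward and reverse strand, a letter means a mismatch, `^` plus one character marks the start of a read, and `$` marks the end of a read. The sketch below counts matches and mismatches for two of the lines above. It is a simplified reader for illustration, not a variant caller; real pileup files are tab-delimited, and splitting on any whitespace is an assumption made to accept the text as shown on the slide.

```python
import re


def parse_pileup_line(line: str) -> dict:
    """Split one pileup line into its six columns."""
    chrom, pos, ref, depth, bases, quals = line.split()[:6]
    return {"chrom": chrom, "pos": int(pos), "ref": ref,
            "depth": int(depth), "bases": bases, "quals": quals}


def count_pileup_bases(bases: str) -> dict:
    """Count reference matches and mismatches in the read-bases column."""
    counts = {"match": 0, "mismatch": 0}
    i = 0
    while i < len(bases):
        char = bases[i]
        if char == "^":            # read start; the next character is a mapping quality
            i += 2
            continue
        if char in "+-":           # indel: sign, length, then that many bases
            digits = re.match(r"\d+", bases[i + 1:])
            if digits is None:
                raise ValueError("malformed indel in pileup")
            i += 1 + len(digits.group()) + int(digits.group())
            continue
        if char in ".,":
            counts["match"] += 1
        elif char.upper() in "ACGTN":
            counts["mismatch"] += 1
        i += 1                     # '$' (read end) and '*' (deletion) are skipped
    return counts


line_272 = "chr1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&"
line_277 = "chr1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<"

for line in (line_272, line_277):
    rec = parse_pileup_line(line)
    print(rec["pos"], rec["ref"], rec["depth"], count_pileup_bases(rec["bases"]))
```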


Binary Alignment/Map (BAM) Files

• Common file format to store reads and their alignment to a reference sequence
  – Generated by most next gen analysis software
• samtools software package
• UCSC Genome Browser and Ensembl can display them as a custom track
  – IGV from Broad very useful


UCSC Genome Browser with 1000 Genomes Project Data


Integrative Genomics Viewer (IGV)


LookSeq at Sanger Mouse Genomes Project


Glo1 CNV Present in Mouse Genomes Data for A/J


Proximal Flank       Glo1 Locus           Distal Flank
Chr17: 30.5Mb        Chr17: 30.7Mb        Chr17: 31.2Mb
Max ~50x coverage    Max >100x coverage   Max ~50x coverage
50kb                 50kb                 50kb


Glo1 CNV Not Present in Mouse Genomes Data for NZO


Proximal Flank       Glo1 Locus           Distal Flank
Chr17: 30.5Mb        Chr17: 30.7Mb        Chr17: 31.2Mb
Max ~25x coverage    Max ~25x coverage    Max ~25x coverage
50kb                 50kb                 50kb
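
The reasoning behind the two Glo1 slides is that a duplicated region shows roughly twice the read coverage of its flanks. A rough sketch of that comparison follows, using the approximate maximum coverage values from the slides (with ">100x" entered as 100). The 1.5 ratio cutoff is an arbitrary assumption for illustration, and real CNV detection uses coverage across the whole region rather than a single maximum.

```python
def coverage_ratio(locus_coverage: float, flank_coverages) -> float:
    """Ratio of locus coverage to the mean coverage of the flanking regions."""
    mean_flank = sum(flank_coverages) / len(flank_coverages)
    return locus_coverage / mean_flank


def looks_amplified(locus_coverage: float, flank_coverages, min_ratio: float = 1.5) -> bool:
    """Flag a locus whose coverage is clearly higher than its flanks."""
    return coverage_ratio(locus_coverage, flank_coverages) >= min_ratio


# Approximate values read from the slides: (Glo1 locus, [proximal flank, distal flank]).
strains = {
    "A/J": (100, [50, 50]),
    "NZO": (25, [25, 25]),
}

for strain, (locus, flanks) in strains.items():
    print(strain, coverage_ratio(locus, flanks), looks_amplified(locus, flanks))
```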


Galaxy (http://main.g2.bx.psu.edu)


Public Data Repositories


NCBI: SRA Formatted Files → SRA ToolKit → FASTQ Files
EBI: FASTQ Files → Automatically Forward FASTQ Files to Galaxy


NCBI BioProject


NCBI Gene Expression Omnibus


Overall Analysis Workflow

FASTQ Files

Secondary Analysis
1. Read Preprocessing & Diagnostics
2. Align Reads to Reference
3. Analysis of Aligned Reads
   e.g., Read counts per gene from RNA-Seq

Tertiary Analysis
1. Analysis of Read Counts
   e.g., Differentially expressed genes
2. Analysis of Gene Lists
   1. Enrichment
   2. Pathway and networks
3. Analysis of Expression Patterns
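
The "read counts per gene" step that links secondary and tertiary analysis can be sketched as counting aligned reads that overlap each gene's coordinates. This is a deliberately naive illustration: the gene names, coordinates, and alignments are made up, and real counting tools also consider strand, exons, and reads that overlap more than one gene.

```python
def count_reads_per_gene(alignments, genes):
    """Count alignments (chrom, start, end) that overlap each gene interval."""
    counts = {name: 0 for name in genes}
    for chrom, start, end in alignments:
        for name, (gene_chrom, gene_start, gene_end) in genes.items():
            if chrom == gene_chrom and start <= gene_end and end >= gene_start:
                counts[name] += 1
    return counts


# Hypothetical gene intervals: name -> (chromosome, start, end), 1-based inclusive.
genes = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}

# Hypothetical aligned reads: (chromosome, start, end).
alignments = [
    ("chr1", 120, 219),
    ("chr1", 450, 549),
    ("chr1", 600, 699),
    ("chr1", 1150, 1249),
    ("chr2", 120, 219),
]

gene_counts = count_reads_per_gene(alignments, genes)
print(gene_counts)
```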


Push-Button Bioinformatics … Be Careful


Third Generation Sequencing
