Next Generation Sequencing (NGS)/Introduction

ABOUT THIS BOOK

The first four chapters are general introductions to broad concepts of bioinformatics and NGS in particular. They are 'required pre-requisites', and will be referred to in the rest of the book:
- In the Introduction, we give a nearly complete overview of the field, starting with sequencing technologies, their properties, strengths and weaknesses, then covering the various biological processes they can assay and common sequencing terminology, and finishing with an overview of a typical sequencing workflow.
- In Big Data we deal with some of the (perhaps unexpected) difficulties that arise when dealing with typical volumes of NGS data. From shipping hard drives around the world, to the amount of memory you'll need in your computer to assemble the data when they arrive, these issues often take novices by surprise. We'll get into the file formats, archives, and algorithms that have been developed to deal with these problems.
- In Bioinformatics from the outside we will discuss the interfaces used by bioinformaticians. We will present the command line with its text interface and blinking cursor, but also more user friendly graphical user interfaces (GUIs) which were developed specially for bioinformatics pipelines.
- In Pre-processing we will discuss the best practices for controlling the quality of an NGS dataset and cleaning out low-quality data.

The next five chapters describe the analyses which can be done using a reference genome sequence, assuming one is available:
- In Alignment we will discuss how to map a set of reads to a reference dataset.
- In DNA Variation we will describe how to call variants (either SNVs, CNVs or breakends) using mapped reads.
- In RNA we will explain how to determine exons, isoforms and gene expression levels from mapped RNA-seq reads.
- In Epigenetics we will describe pull down assays which are used to determine epigenetic traits such as histone or CpG methylation.
- In Chromatin structure we will discuss technologies used to determine the structure of the chromatin, e.g. the placement of the histones or the physical proximity of different chromosomal regions when the DNA lies in the nucleus.

Finally the last two chapters will describe analyses in the absence of a reference genome:
- De novo assembly will describe how to assemble a genome from NGS reads.
- De novo RNA assembly will explain how to assemble a transcriptome from NGS reads only.

Introduction

Platforms and Technologies

NGS platforms employ different technologies to decode the identity of nucleotides in DNA, or detect covalent modifications such as methylation on the nucleotides.

NGS platforms evolve quickly. Usually, new technologies & platforms are announced at the Advances in Genome Biology & Technology (AGBT) conference ^[1]

For educational purposes, see the reviews of NGS platforms published in 2011 ^[2].

File format and terminology

FASTA

The FASTA format, generally indicated with the suffix .fa or .fasta, is a straightforward, human readable format. Normally, each file consists of a set of sequences, where each sequence is represented by a one line header, starting with the '>' character, followed by the corresponding nucleotide sequence, in multiple lines of regular width (generally 60 or 80 characters wide). In practice, some tools may produce a sequence with a header and a single long line of sequence. For more detailed information see the FASTA Wikipedia page.
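
As an illustration of the layout described above, here is a minimal Python sketch of a FASTA reader. The function name `parse_fasta` and the example records are hypothetical, not part of any particular tool. Sequence lines are joined per record, so wrapped and single-line records are treated the same way.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted lines.

    Assumes headers are unique; a repeated header would overwrite
    the earlier record in this simple sketch.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line  # join wrapped sequence lines
    return records


# Hypothetical example: one wrapped record and one single-line record.
example = """>seq1 hypothetical example record
ACGTACGTAC
GTACGT
>seq2
TTGACA
""".splitlines()

records = parse_fasta(example)
```

For large genomes a streaming reader that yields one record at a time would use less memory than building a dictionary.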

FASTQ

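FASTQ is a human-readable text file format that provides 4 lines of data per sequence, in this order:

- Sequence identifier, on a line starting with the '@' character
- The sequence
- Comments, on a line starting with the '+' character
- Quality scores, one character per base of the sequence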

FASTQ format is commonly used to store sequencing reads, in particular from Illumina and Ion Torrent platforms.

Paired-end reads may be stored either in one FASTQ file (with the two reads of each pair alternating) or in two separate FASTQ files. The sequence identifiers of paired-end reads may end in "/1" and "/2" respectively.
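
The "/1" and "/2" convention can be checked with a few lines of Python. This is only a sketch: the helper names and read identifiers are hypothetical, and it assumes an alternating file lists the "/1" read of each pair immediately before its "/2" read.

```python
from typing import List, Optional, Tuple


def split_mate_suffix(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'name/1' or 'name/2' into (name, mate number)."""
    for mate in (1, 2):
        suffix = f"/{mate}"
        if identifier.endswith(suffix):
            return identifier[: -len(suffix)], mate
    return identifier, None  # no recognizable mate suffix


def pair_interleaved(identifiers: List[str]) -> List[str]:
    """Return fragment names, checking that reads alternate /1, /2."""
    if len(identifiers) % 2 != 0:
        raise ValueError("interleaved input must hold an even number of reads")
    names = []
    for first, second in zip(identifiers[0::2], identifiers[1::2]):
        name1, mate1 = split_mate_suffix(first)
        name2, mate2 = split_mate_suffix(second)
        if name1 != name2 or (mate1, mate2) != (1, 2):
            raise ValueError(f"reads {first!r} and {second!r} are not a pair")
        names.append(name1)
    return names


# Hypothetical identifiers from an alternating (interleaved) file.
fragment_names = pair_interleaved(["frag_a/1", "frag_a/2", "frag_b/1", "frag_b/2"])
```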

Example FASTQ entry for one Illumina read:

@EAS20_8_6_1_3_1914/1
CGCGTAACAAAAGTGTCTATAATCACGGCAGAAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATGCCATAGCATTTTTATCCATAAGATT
+
HHHHHHHHHFHGGHHHHHHHHHHHHHHHHHHHHEHHHHHHHHHHHHHHGHHHGHHHGHIHHHHHHHHHHHHHHHGCHHHHFHHHHHHHGGGCFHBFBCCF
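
The four-line record structure can be read with a short Python sketch like the one below. The names `FastqRecord` and `parse_fastq` and the short example record are hypothetical; the sketch assumes each record occupies exactly four lines, as in the entry above, and does not handle sequences wrapped over several lines.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    comment: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one FastqRecord per group of four lines."""
    it = iter(lines)
    for first in it:
        ident = first.rstrip()
        if not ident:
            continue  # tolerate blank lines between records
        try:
            seq = next(it).rstrip()
            plus = next(it).rstrip()
            qual = next(it).rstrip()
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not ident.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {ident!r}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(ident[1:], seq, plus[1:], qual)


# Hypothetical short record, shaped like the Illumina entry above.
example = ["@read1/1\n", "ACGTN\n", "+\n", "IIII#\n"]
records = list(parse_fastq(example))
```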

Generally, FASTQ data are stored in files with the suffix .fq or .fastq, compressed with gzip, which is indicated by the additional suffix .gz or .gzip.
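
Because the compression is signalled by the file suffix, a script can choose how to open a file from its name alone. The following Python sketch shows one way to do this with the standard `gzip` module; the helper name `open_reads` and the file name are hypothetical.

```python
import gzip
import os
import tempfile
from typing import TextIO


def open_reads(path: str) -> TextIO:
    """Open a plain or gzip-compressed text file for reading."""
    if path.endswith((".gz", ".gzip")):
        return gzip.open(path, "rt")  # "rt" decodes the bytes to text
    return open(path, "r")


# Round trip on a temporary file with a hypothetical record.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fastq.gz")
    with gzip.open(path, "wt") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n")
    with open_reads(path) as handle:
        first_line = handle.readline().rstrip()
```

Reading the compressed file directly avoids keeping a decompressed copy on disk.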

For more detailed information see the FASTQ Wikipedia page.

SFF

SFF is a binary file format used to encode sequencing reads from the 454 platform.

For more detailed information see http://en.wikipedia.org/wiki/Standard_Flowgram_Format

SAM/BAM

SAM and BAM are file formats used to encode short-read alignments. See the Alignment chapter for more information.

FASTG

FASTG is an emerging file format for genome assemblies that takes ambiguities into account. FASTG is like FASTA, but the G stands for ‘graph’.

VCF

The Variant Call Format (VCF) is a specification used in bioinformatics for storing gene sequence variations. See [1] for more information.

Read lengths

As of Feb 2013, the read lengths of second-generation sequencing platforms are shorter than those of conventional Sanger sequencing, which creates challenges in read mapping and assembly.

- Illumina: the most widely used platforms can produce reads of up to 250 bp. In practice, ~100 bp is what is mostly accessible to researchers worldwide.
- Ion Torrent: varies, typically peaking at 400 bp
- SOLiD: 50-75 bp

Paired-/Single-ends

Single-end reads means that each fragment is sequenced from one end only.
In paired-end sequencing, a single fragment is sequenced from both its 5' and 3' ends, giving rise to a forward and a reverse read. The two reads can be separated by a number of unsequenced bases (the inner insert size) or can overlap, in which case they can be merged into a single, longer contiguous read. Using paired-end reads can improve the accuracy of read mapping onto a reference genome. The typical fragment size (external insert size) is 200 bp to 500 bp.
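
The relation between fragment size, read length and inner insert size is simple arithmetic, sketched below in Python. The function names are hypothetical, and the sketch assumes both reads of a pair have the same length and are no longer than the fragment.

```python
def inner_insert_size(fragment_length: int, read_length: int) -> int:
    """Unsequenced bases between the two reads (negative if they overlap)."""
    if read_length > fragment_length:
        raise ValueError("read length exceeds fragment length")
    return fragment_length - 2 * read_length


def describe_pair(fragment_length: int, read_length: int) -> str:
    """Say whether the two reads of a pair are separated or overlapping."""
    inner = inner_insert_size(fragment_length, read_length)
    if inner >= 0:
        return f"reads separated by {inner} unsequenced bases"
    return f"reads overlap by {-inner} bases"


# A 300 bp fragment with 2 x 100 bp reads leaves 100 bp unsequenced;
# a 150 bp fragment with the same reads gives a 50 bp overlap.
separated = describe_pair(300, 100)
overlapping = describe_pair(150, 100)
```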

Mate-pairs

Mate-pair sequencing differs from paired-end sequencing in how the sequencing library is made. In mate-pair sequencing, 2-5 kb fragments are selected and sequenced from both ends, which gives information on how nucleotides that lie far apart are linked together. Mate-pairs are better suited to studying genomic structural rearrangements and help de novo genome assembly. They also facilitate sensitive structural variant (SV) detection across a wider SV size spectrum and in repetitive areas of the genome.

Colorspace

Colorspace is a 2-base encoding system commercialized by Life Tech and used in SOLiD platforms.
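
To make the idea of a 2-base encoding concrete, here is a minimal Python sketch of the scheme as it is commonly described: each color (0 to 3) represents the transition between two adjacent bases, and a known first base is needed to decode the colors back into nucleotides. The function names are hypothetical, the XOR formulation is one convenient way to express the color table, and the input is assumed to be a non-empty sequence of A, C, G and T.

```python
BASES = "ACGT"  # position in this string is the 2-bit code: A=0, C=1, G=2, T=3


def encode_colorspace(sequence: str) -> str:
    """Encode a nucleotide sequence as its first base plus one color per dinucleotide."""
    codes = [BASES.index(base) for base in sequence]
    # The color of a dinucleotide is the XOR of the two base codes,
    # so identical bases (AA, CC, GG, TT) always give color 0.
    colors = [str(a ^ b) for a, b in zip(codes, codes[1:])]
    return sequence[0] + "".join(colors)


def decode_colorspace(encoded: str) -> str:
    """Decode a first base followed by colors back into nucleotides."""
    current = BASES.index(encoded[0])
    bases = [encoded[0]]
    for color in encoded[1:]:
        current ^= int(color)  # apply the transition to the previous base
        bases.append(BASES[current])
    return "".join(bases)


encoded = encode_colorspace("ATGGC")
decoded = decode_colorspace(encoded)
```

Because every base is decoded from the one before it, a single wrong color changes all downstream bases, which is why colorspace reads are usually handled by tools that are aware of the encoding.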

Quality scores

A quality score indicates the probability that a base call is incorrect. Quality scores are stored in the FASTQ format.

Various encoding schemes are available, including, most commonly, Phred quality scores.
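
A Phred quality score Q is related to the error probability p by Q = -10 log10(p), so Q = 20 corresponds to a 1 in 100 chance that the base call is wrong. The Python sketch below converts a FASTQ quality string into scores and probabilities. The function names are hypothetical, and the ASCII offset of 33 is an assumption: other encodings, such as an offset of 64, exist, so the offset should be checked for each dataset.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to Phred scores (offset 33 is an assumption)."""
    return [ord(char) - offset for char in quality]


def phred_to_error_probability(score: int) -> float:
    """Invert Q = -10 * log10(p) to get p = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# Hypothetical quality string: 'I' -> 40, '#' -> 2, '5' -> 20 with offset 33.
scores = decode_quality("I#5")
probabilities = [phred_to_error_probability(score) for score in scores]
```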

Error profiles & Sequencing biases

Uses of NGS

DNA

To find mutations in tumor cells.

RNA

To reconstruct the transcriptome (genome-based or de novo) from reverse-transcribed RNA, so that researchers can count how many reads align to annotated parts of the transcriptome. This is used to compare gene expression between samples that are dramatically different from each other and to build the biochemical pathways of an organism.

ChIP

ChIP-sequencing, also known as ChIP-seq, is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to map global binding sites precisely for any protein of interest. Previously, ChIP-on-chip was the most common technique utilized to study these protein–DNA relations.

Chromatin structure

General NGS Workflow Overview

References

[1] http://agbt.org/

[2] http://www.ncbi.nlm.nih.gov/pubmed/21612267
