Next Generation Sequencing (NGS)

The need for an up-to-date synthesis of next generation sequencing know-how

The high demand for low-cost sequencing has driven the development of high-throughput sequencing, also known as next generation sequencing (NGS). A single NGS run produces thousands to millions of sequences concurrently. Next generation sequencing has become a commodity: with the commercialization of various affordable desktop sequencers, NGS has come within the reach of traditional wet-lab biologists. In recent years, genome-wide computational analysis has increasingly been used as a backbone to foster novel discovery in biomedical research. However, as the quantity of sequence data grows exponentially, the analysis bottleneck has yet to be solved.

The current sources of NGS informatics knowledge are extremely fragmented. A novice can read review articles in various journals, follow discussion threads on forums such as Biostar^[1] or SEQanswers^[2], or sign up for courses organized by various institutes. Finding a centralized synthesis is much more difficult. Books are available, but the field develops so fast that book chapters risk being obsolete by the time they are printed. Moreover, continually updating a text would presumably take up much of the schedule of a handful of authors.

Drawing on the evident goodwill and community spirit displayed on discussion forums, and exploiting the collaborative tools made available by the Wikimedia Foundation, we propose to initiate a collaborative WikiBook on NGS. Our plan is to collect enough text that people are motivated to contribute to it, essentially providing the same information as a forum but in a tidier form. Ultimately, our goal is to create a collective lab book that explains the key concepts and describes best practices in NGS.

Target audience

This set of dynamic materials is designed for bench biologists: advanced PhD students and early career postdoctoral researchers who have little or no bioinformatics experience and an interest in NGS data analysis. Advanced materials may be added as the community contributes and as the needs and trends in the field develop. The flexibility of online material should allow readers to skip details on a first read, yet have immediate access to the details they need. However, the overall structure and style should be designed first and foremost for the non-bioinformatician reader.

Some chapters come with practical exercises so that readers can familiarize themselves with the steps.

Stuck at data analysis?

Seek help from online communities such as Biostar and SEQanswers. Please make sure you follow the guidelines set out by Dall'Olio et al.^[3]

About this book

The first four chapters are general introductions to broad concepts of bioinformatics and NGS in particular. They are prerequisites and will be referred to throughout the rest of the book:
- In the Introduction, we give a nearly complete overview of the field, starting with sequencing technologies, their properties, strengths and weaknesses, then covering the various biological processes they can assay and a section on common sequencing terminology. We finish with an overview of a typical sequencing workflow.
- In Big Data we deal with some of the (perhaps unexpected) difficulties that arise when dealing with typical volumes of NGS data. From shipping hard drives around the world, to the amount of memory you'll need in your computer to assemble the data when they arrive, these issues often take novices by surprise. We'll get into the file formats, archives, and algorithms that have been developed to deal with these problems.
- In Bioinformatics from the outside we will discuss the interfaces used by bioinformaticians. We will present the command line with its text interface and blinking cursor, but also more user-friendly graphical user interfaces (GUIs) which were developed specifically for bioinformatics pipelines.
- In Pre-processing we will discuss the best practices for controlling the quality of an NGS dataset and cleaning out low-quality data.
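
To give a flavour of the quality control covered in Pre-processing, the following is a minimal sketch, not a recommended pipeline. It assumes reads stored as four-line FASTQ records with Phred+33 quality encoding, and it discards reads whose mean quality falls below a threshold. The function names, the threshold of 20 and the two example reads are all illustrative assumptions, not taken from any particular tool.

```python
import io


def parse_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input (or a blank line): stop reading
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(handle, min_mean_quality=20):
    """Keep reads whose mean Phred score reaches the (illustrative) threshold."""
    return [
        record
        for record in parse_fastq(handle)
        if mean_phred(record[2]) >= min_mean_quality
    ]


# Two made-up reads: 'I' encodes Phred 40, '!' encodes Phred 0.
EXAMPLE_FASTQ = (
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGT\n+\n!!!!\n"
)

kept = filter_reads(io.StringIO(EXAMPLE_FASTQ))
print([name for name, _, _ in kept])  # prints ['read1']
```

Real datasets are usually handled with dedicated QC and trimming tools rather than hand-written parsers; the sketch only shows what "cleaning out low-quality data" means at the level of a single read.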

The next five chapters describe the analyses which can be done using a reference genome sequence, assuming one is available:
- In Alignment we will discuss how to map a set of reads to a reference dataset.
- In DNA Variation we will describe how to call variants (either SNVs, CNVs or breakends) using mapped reads.
- In RNA we will explain how to determine exons, isoforms and gene expression levels from mapped RNA-seq reads.
- In Epigenetics we will describe pull-down assays, which are used to determine epigenetic traits such as histone or CpG methylation.
- In Chromatin structure we will discuss technologies used to determine the structure of the chromatin, e.g. the placement of the histones or the physical proximity of different chromosomal regions when the DNA lies in the nucleus.

Finally, the last two chapters describe analyses in the absence of a reference genome:
- In De novo assembly we will describe how to assemble a genome from NGS reads.
- In De novo RNA assembly we will explain how to assemble a transcriptome from NGS reads only.

Details

- In Pre-processing: FASTQ, QC, trimming, error correction, etc.
- In Alignment: formats, algorithms, assessment.
- In DNA Variation: protocols, formats, databases, visualization.
- In RNA: transcriptomics workflow, tools, gene prediction, formats, databases.
- In Epigenetics: ... bisulphite sequencing.
- In Chromatin structure: ... ChIP-seq?
- In De novo assembly: algorithms, workflows, tools, databases.
- In De novo RNA assembly: similarities, differences and challenges relative to DNA assembly.

References

[1] Parnell, Laurence D.; et al. (27 October 2011). "BioStar: An Online Question & Answer Resource for the Bioinformatics Community". PLoS Computational Biology. 7 (10): e1002216. doi:10.1371/journal.pcbi.1002216.
[2] Li, J.-W.; et al. (13 March 2012). "SEQanswers: an open access community for collaboratively decoding genomes". Bioinformatics. 28 (9): 1272–1273. doi:10.1093/bioinformatics/bts128.
[3] Dall'Olio, Giovanni M.; et al. (28 September 2011). "Ten Simple Rules for Getting Help from Online Scientific Communities". PLoS Computational Biology. 7 (9): e1002202. doi:10.1371/journal.pcbi.1002202.
