Next Generation Sequencing (NGS)
The Need for an Up-To-Date Synthesis of Next Generation Sequencing Know-How
Next generation sequencing (NGS) has become a commodity. With the commercialization of various affordable desktop sequencers, NGS will be within reach of more traditional wet-lab biologists. As seen in recent years, genome-wide computational analysis is increasingly used as a backbone to foster novel discovery in biomedical research. However, as the quantity of sequence data increases exponentially, the analysis bottleneck has yet to be solved.
The current sources of NGS informatics know-how are extremely fragmented. A novice could read review articles in various journals, follow discussion threads on forums such as Biostar or SEQanswers, or sign up for courses organized by various institutes. Finding a centralized synthesis is much more difficult. Books are available, but the field develops so fast that book chapters risk being obsolete by the time they are printed. Moreover, continually updating the text would presumably cost a handful of authors a great deal of their time.
Drawing on the obvious goodwill and community spirit displayed on discussion forums, and exploiting the collaborative tools made available by the Wikimedia Foundation, we propose to initiate a collaborative WikiBook on NGS. Our plan is to collect enough text that people will have an incentive to contribute to it, essentially providing the same information as a forum but in a tidier form. Ultimately, our goal is to create a collective lab book that explains the key concepts and describes best practices in NGS.
This set of dynamic materials is designed for bench biologists (advanced PhD students and early career postdoctoral researchers with little or no bioinformatics experience who demonstrate an interest in NGS data analysis). Advanced materials might be added as the community contributes and as the needs and trends in the field develop. The flexibility of online material should allow the reader to ignore details on a first read, yet have immediate access to the details they need. However, the overall structure and style should be designed primarily for the non-bioinformatician reader.
Some chapters come with practical exercises so that readers can familiarize themselves with the steps.
Stuck on data analysis?
TABLE OF CONTENTS
ABOUT THIS BOOK
- The first four chapters are general introductions to broad concepts of bioinformatics, and of NGS in particular. They are prerequisites and will be referred to in the rest of the book:
- In the Introduction, we give a near-complete overview of the field. We start with sequencing technologies, their properties, strengths and weaknesses, then cover the various biologies that they assay, and continue with a section on common sequencing terminology. We finish with an overview of a typical sequencing workflow.
- In Big Data we deal with some of the (perhaps unexpected) difficulties that arise when dealing with typical volumes of NGS data, from shipping hard drives around the world to the amount of memory you'll need in your computer to assemble the data when they arrive. We'll get into the file formats, archives, and algorithms that have been developed to deal with these problems.
- In Bioinformatics from the outside we will discuss the interfaces used by bioinformaticians. We will present the command line with its text interface and blinking cursor, but also the more user-friendly graphical user interfaces (GUIs) that were developed specially for bioinformatics pipelines.
- In Pre-processing we will discuss best practices for controlling the quality of an NGS dataset and cleaning out low-quality data.
- The next five chapters describe the analyses which can be done using a reference genome sequence, assuming one is available:
- In Alignment we will discuss how to map a set of reads to a reference dataset.
- In DNA Variation we will describe how to call variants (either SNVs, CNVs or breakends) using mapped reads.
- In RNA we will explain how to determine exons, isoforms and gene expression levels from mapped RNA-seq reads.
- In Epigenetics we will describe pull-down assays that are used to determine epigenetic traits such as histone or CpG methylation.
- In Chromatin structure we will discuss technologies used to determine the structure of the chromatin, e.g. the placement of the histones or the physical proximity of different chromosomal regions when the DNA lies in the nucleus.
- Finally, the last two chapters will describe analyses in the absence of a reference genome:
- De novo assembly will describe how to assemble a genome from NGS reads.
- De novo RNA assembly will explain how to assemble a transcriptome from NGS reads only.
- Pre-processing: FASTQ, QC, trimming, error correction, etc.
- Alignment: formats, algorithms, assessment.
- DNA Variation: protocols, formats, databases, visualization.
- RNA: transcriptomics workflow, tools, gene prediction, formats, databases.
- Epigenetics: ... bisulphite sequencing.
- Chromatin structure: ... ChIP-seq?
- De novo assembly: algorithms, workflows, tools, databases.
- De novo RNA assembly: similarities, differences and challenges relative to DNA assembly.
If you contribute a substantial amount of work to this WikiBook, please add yourself to the /Authors page! Don't worry if people move, edit, or delete parts of your contributions - it's normal here. Also, don't worry about formatting - put the text in place and someone will come and format it for you. Don't be shy to ask others for help!
- Parnell LD, Lindenbaum P, Shameer K et al. BioStar: an online question & answer resource for the bioinformatics community, PLoS Comput Biol 2011;7:e1002216.
- Li JW, Schmieder R, Ward RM et al. SEQanswers: an open access community for collaboratively decoding genomes, Bioinformatics 2012;28:1272-1273.
- Dall'Olio GM, Marino J, Schubert M et al. Ten simple rules for getting help from online scientific communities, PLoS Comput Biol 2011;7:e1002202.