Next Generation Sequencing (NGS)/Big Data


Big Data

Data Deluge

The first problem you are likely to face is the sheer size of NGS FASTQ files: the "data deluge" problem. You no longer deal only with microplate readings or digitized gel photos; NGS data can be huge. For example, the compressed FASTQ files from a single 60x human whole genome sequencing (WGS) sample can still require about 200 GB. A small project with 10–20 WGS samples can generate ~4 TB of raw data. These estimates do not include the disk space required for downstream analysis.
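The arithmetic behind these figures can be sketched as follows. This is only a rough illustration: the function name `raw_storage_tb` is hypothetical, the per-sample size of about 200 GB is the figure quoted above, and decimal units (1 TB = 1000 GB) are assumed.

```python
def raw_storage_tb(n_samples, gb_per_sample=200.0):
    """Rough raw-data footprint in TB (hypothetical helper).

    gb_per_sample defaults to the ~200 GB quoted above for the
    compressed FASTQ files of one 60x human WGS sample.
    Assumes decimal units: 1 TB = 1000 GB.
    """
    return n_samples * gb_per_sample / 1000.0


# Small project: 10 to 20 WGS samples.
for n in (10, 20):
    print(f"{n} samples: ~{raw_storage_tb(n):.1f} TB of raw data")
```

The upper end of the range (20 samples) gives the ~4 TB mentioned above. Space for alignments, variant calls and other downstream files would come on top of this.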

Storing data

A Biostars post[1] suggests the following storage options:

  • Very high end: enterprise cluster and SAN.
  • High end: Two mirrored servers in separate buildings or Cloud.
  • Typical: External hard drives and/or NAS with RAID 5/6.
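When sizing a NAS for the "typical" option, it helps to remember that RAID 5 gives up one disk's worth of capacity to parity and RAID 6 gives up two. The sketch below illustrates this under simplifying assumptions: all disks are identical, and filesystem overhead and hot spares are ignored. The helper name `usable_capacity_tb` and the 8 x 4 TB array are hypothetical examples, not taken from the cited post.

```python
# Parity overhead per RAID level, in whole disks.
PARITY_DISKS = {5: 1, 6: 2}


def usable_capacity_tb(n_disks, disk_tb, raid_level):
    """Usable capacity of an array of identical disks (hypothetical helper).

    Ignores filesystem overhead and hot spares.
    """
    parity = PARITY_DISKS[raid_level]
    min_disks = parity + 2  # RAID 5 needs at least 3 disks, RAID 6 at least 4
    if n_disks < min_disks:
        raise ValueError(f"RAID {raid_level} needs at least {min_disks} disks")
    return (n_disks - parity) * disk_tb


# Assumed example array: 8 disks of 4 TB each.
for level in (5, 6):
    print(f"RAID {level}: {usable_capacity_tb(8, 4, level)} TB usable")
```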

Moving data

Moving data between collaborators is also non-trivial. For RNA-Seq samples, FTP may suffice, but for WGS data, shipping hard drives may be the only practical option.
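A back-of-the-envelope estimate shows why network transfer can become impractical for large datasets. The sketch below is illustrative only: the helper `transfer_hours`, the link speeds and the 80% link efficiency are assumptions chosen for the example, and decimal units are used throughout.

```python
def transfer_hours(size_gb, link_mbps, efficiency=0.8):
    """Estimated hours to move size_gb gigabytes over a network link.

    link_mbps is the nominal link speed in megabits per second.
    efficiency is an assumed fraction of that speed actually achieved.
    """
    size_megabits = size_gb * 1000 * 8  # GB -> MB -> megabits
    effective_mbps = link_mbps * efficiency
    return size_megabits / effective_mbps / 3600  # seconds -> hours


# Assumed scenario: a 4 TB WGS project over two hypothetical links.
for mbps in (100, 1000):
    hours = transfer_hours(4000, mbps)
    print(f"{mbps} Mbit/s: ~{hours:.0f} h (~{hours / 24:.1f} days)")
```

Under these assumptions, a 100 Mbit/s link needs several days of uninterrupted transfer for a 4 TB project, which is the kind of situation where shipping drives becomes attractive.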

Externalizing compute requirements from the research group

It is difficult for a single lab to maintain sufficient computing facilities. A single lab will probably own some basic computing hardware; however, many tasks have computational demands (e.g. memory for de novo genome assembly) so large that they must be performed elsewhere. An institution or core facility may host a centralized cluster. Alternatively, one might consider running the task in the cloud.

  • NIH maintains a centralized computing cluster called Biowulf.
  • Cloud computing has been suggested for bioinformatics workloads.[2][3] EBI has adopted a cloud-based platform called Helix Nebula.[4]

References

  1. Wo, H. (24 March 2011). "Question: Huge Ngs Data Storage And Transferring". Biostars. Biostar Genomics, LLC. Retrieved 28 April 2016.
  2. Akhlaghpour, H. (3 July 2012). "Genomic Analysis in the Cloud". YouTube. Google. Retrieved 28 April 2016.
  3. Schadt, E.E.; Linderman, M.D.; Sorenson, J.; Lee, L.; Nolan, G.P. (2010). "Computational solutions to large-scale data management and analysis". Nature Reviews Genetics. 11 (9): 647–57. doi:10.1038/nrg2857. PMC 3124937. PMID 20717155.
  4. Lueck, R. (16 January 2013). "Big data and HPC on-demand: Large-scale genome analysis on Helix Nebula – the Science Cloud" (PDF). Trust-IT Services. Retrieved 28 April 2016.