Next Generation Sequencing (NGS)/DNA Variants

DNA Variants

Linking to Wikipedia

What is SNPs? What is Mutation? What is Genetic Variation?

Protocols

Whole genome, exome, etc. Consequences for downstream analysis

Typical workflow

File formats

VCF

VCF stands for Variant Call Format. It was created by the 1000 Genomes Project as a way to store small-scale variation data (SNPs, InDels, short structural rearrangements), and has since become the de facto standard format for storing such data. The official, detailed description can be found here (VCF version 4.1, as of writing).

VCF can store information about a variant, such as its position on a reference sequence, the reference and alternate alleles, stable variant identifier (e.g. rs number), as well as the observed allele(s) in multiple samples. VCF can also hold aggregate information about the variant across all samples (e.g. total coverage depth, allele frequencies etc.), as well as a list of filters that the variant failed during the current analysis.

The basic VCF file format is ASCII text. A header section identifies the VCF format version, defines FILTER and INFO fields, and other meta-data. This is followed by the actual data table, consisting of a single row containing the standard headers and the sample names, and one row per variant. All columns in the table header and the data rows are separated by tab (\t) characters:

#CHROM POS    ID     REF    ALT     QUAL FILTER INFO                    FORMAT      Sample1        Sample2        Sample3
2      4370   rs6057 G      A       29   .      NS=2;DP=13;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:52,51 1|0:48:8:51,51 1/1:43:5:.,.

(for more exhaustive examples, see the official description)

Creating a dataset

SAMTOOLS

others...

Reference datasets

Human=> Variants=>1000 genomes, HapMap,etc

Other species

Viewing datasets

Ensembl

UCSC

IGV

Tablet?

Comparing datasets

VCF tools

SEQwiki content dump

SNP detection

SNPs, or single nucleotide polymorphisms, are heritable single base changes in a genome versus a reference sequence. They are part of the more generic set of Single Nucleotide Variations (SNVs), which also encompasses somatic single base changes which are not passed to offspring and are due to environmental damage. Tools for SNP identification can also be used for SNV identification, though tools specific for SNV identification exist as well. In some contexts, such as cancer genomes, SNV identification is complicated by heterogeneous DNA samples.

SNP identification programs must distinguish system noise (instrument errors, PCR errors, etc) from actual variation. They generally do so by modeling various error types and the expected distribution of calls under homozygous reference (AA), homozygous variant (BB) and heterozygous variant (AB) states. Confidence in calls is generally affected by the reported sequence quality values and read depth. Some SNP/SNV callers work by comparing individual samples to a reference, whereas others can simultaneously call in multiple samples using information from each sample to assist calling in the other samples. SNP callers for mixed population samples also exist.

A common source of error in SNP/SNV calling is misalignment due to pseudogenes, repeated genomic segments or close orthologs; in these cases the co-alignment of reads arising from different genomic regions can result in a false positive call. Another source of error can be local misalignment (or ambiguous alignment) due to indels in reads (either true indel variations or sequencing errors); realignment tools such as Dindel and those found in GATK can generate more consistent treatment of indels to reduce this source of err.r Many SNP/SNV callers are designed for diploid DNA, and may not work well in samples with higher ploidy. As noted above, heterogeneity in samples such as tumor samples can frustrate SNV calling, and some callers are specifically designed to cope with this. Tumor samples may also have altered copy number due to gene or chromosomal amplification, meaning they are effectively of triploid or higher ploidy in some regions.

SNP/SNV callers often call only these polymorphisms, and not (for example) small indels. Users of these tools should also take care when calling adjacent pairs of SNPs/SNVs, as the phasing of these (or more distant SNPs) is not reported in many callers' reports.

Decision Helper

I want to quickly call SNP versus a reference =>Freebayes, samtools


Software Packages

↑Jump back a section

Free Software

Freebayes

Freebayes is the successor of Poly- Giga- and BAMBayes and should be much faster than these. Like these it relies on BAM files. It has also been described in some more detail by its developer on Biostar

  • Pros
    • very easy to run for simple SNP calling
    • Does not assume any ploidy
    • can read BAM files via STDIN

GATK

The Genome Analysis toolkit GATK allows multiple steps. The authors used their pipeline for variant calling using the NA12878 exome data set and compared their results to those of Crossbow (which uses SOAPsnp). Based on these results they concluded that crossbow had a lower spcecificity.

One easy way to to run GATK and other tools might be to use this variant pipeline mentioned on Biostar

  • Important reminder
    • If you run GATK framework in your own pipeline, you have to bear in mind GATK has Stringent file formatting requirement.
    • e.g. chromosomes ordering in genome reference file has to be in canonical order.
    • BAM header has to be present in every BAM file.
    • The BAM file has to be sorted, preferably by Picards because it write the proper header after sorting
    • Read-group tag has to be present in each BAM. Either input the correct tag during mapping or you may waste your time in fixing the BAM file afterwards
  • Pro
    • Likely relatively specific (The authors show higher specificity than crossbow)
  • Con
    • relatively complex pipelines

MAQ

MAQ

  • Pros
    • performed slightly better than sopasnp and beter than snvnmix according to an independent comparison

samtools

samtools using the mpileup command http://samtools.sourceforge.net/mpileup.shtml

samtools pileup (without the m) is deprecated and has been removed in recent SAMtools versions.

Sibelia

Sibelia is a comparative genomic tool toassists biologists in analysing the genomic variations that correlate with pathogens, or the genomic changes that help microorganisms adapt in different environments. Sibelia will also be helpful for the evolutionary and genome rearrangement studies for multiple strains of microorganisms.

  • Pros
    • Works well for multiple bacterial genomes.
    • Easy to run and cross-platform, licensed under GPL.
  • Cons
    • Works slow for large genomes.

http://bioinf.spbau.ru/sibelia

SOAPsnp

SOAPsnp is e.g. used in the Crossbow pipeline.

SNVMix

SNVMix The authors of SNVMix compared their tool to MAQ v0.6.8 and found better performance as judged by area under the curve when using Affymetrix SNP 6.0 data. However in an independent comparison using MAQ 0.71 MAQ performed better.

  • Cons
    • Might be unstable in high coverage region according to an independent comparison.
    • Might be less precise than MAQ and SOAPsnp
↑Jump back a section

Commercial Software

CLCBio

Further Reading Material and References

  • Further Reading
  • Original Publications
  • Comparisons
    • Nielsen R, Paul JS, Albrechtsen A, Song YS Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. (2011) 12:443-51. The article gives general reccommendations for a workflow and suggests to use a calibration step as implemented by GATK or SOAPsnp
    • Wang et al., 2011 A comparison of short read aligners and performance assesment of MAQ (0.71), SOAPsnp (1.03) and SNVmix(2-0.11.8-r4) where MAQ performed best
↑Jump back a section
Last modified on 18 May 2013, at 08:28