Next Generation Sequencing (NGS)/DNA Variants
Linking to Wikipedia
Whole genome, exome, etc. Consequences for downstream analysis
VCF stands for Variant Call Format. It was created by the 1000 Genomes Project as a way to store small-scale variation data (SNPs, InDels, short structural rearrangements), and has since become the de facto standard format for storing such data. The official, detailed description can be found here (VCF version 4.1, as of writing).
VCF can store information about a variant, such as its position on a reference sequence, the reference and alternate alleles, stable variant identifier (e.g. rs number), as well as the observed allele(s) in multiple samples. VCF can also hold aggregate information about the variant across all samples (e.g. total coverage depth, allele frequencies etc.), as well as a list of filters that the variant failed during the current analysis.
The basic VCF file format is ASCII text. A header section identifies the VCF format version, defines FILTER and INFO fields, and other meta-data. This is followed by the actual data table, consisting of a single row containing the standard headers and the sample names, and one row per variant. All columns in the table header and the data rows are separated by tab (\t) characters:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 2 4370 rs6057 G A 29 . NS=2;DP=13;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:52,51 1|0:48:8:51,51 1/1:43:5:.,.
(for more exhaustive examples, see the official description)
Creating a dataset
Human=> Variants=>1000 genomes, HapMap,etc
SEQwiki content dump
SNPs, or single nucleotide polymorphisms, are heritable single base changes in a genome versus a reference sequence. They are part of the more generic set of Single Nucleotide Variations (SNVs), which also encompasses somatic single base changes which are not passed to offspring and are due to environmental damage. Tools for SNP identification can also be used for SNV identification, though tools specific for SNV identification exist as well. In some contexts, such as cancer genomes, SNV identification is complicated by heterogeneous DNA samples.
SNP identification programs must distinguish system noise (instrument errors, PCR errors, etc) from actual variation. They generally do so by modeling various error types and the expected distribution of calls under homozygous reference (AA), homozygous variant (BB) and heterozygous variant (AB) states. Confidence in calls is generally affected by the reported sequence quality values and read depth. Some SNP/SNV callers work by comparing individual samples to a reference, whereas others can simultaneously call in multiple samples using information from each sample to assist calling in the other samples. SNP callers for mixed population samples also exist.
A common source of error in SNP/SNV calling is misalignment due to pseudogenes, repeated genomic segments or close orthologs; in these cases the co-alignment of reads arising from different genomic regions can result in a false positive call. Another source of error can be local misalignment (or ambiguous alignment) due to indels in reads (either true indel variations or sequencing errors); realignment tools such as Dindel and those found in GATK can generate more consistent treatment of indels to reduce this source of err.r Many SNP/SNV callers are designed for diploid DNA, and may not work well in samples with higher ploidy. As noted above, heterogeneity in samples such as tumor samples can frustrate SNV calling, and some callers are specifically designed to cope with this. Tumor samples may also have altered copy number due to gene or chromosomal amplification, meaning they are effectively of triploid or higher ploidy in some regions.
SNP/SNV callers often call only these polymorphisms, and not (for example) small indels. Users of these tools should also take care when calling adjacent pairs of SNPs/SNVs, as the phasing of these (or more distant SNPs) is not reported in many callers' reports.
I want to quickly call SNP versus a reference =>Freebayes, samtools
Freebayes is the successor of Poly- Giga- and BAMBayes and should be much faster than these. Like these it relies on BAM files. It has also been described in some more detail by its developer on Biostar
- very easy to run for simple SNP calling
- Does not assume any ploidy
- can read BAM files via STDIN
The Genome Analysis toolkit GATK allows multiple steps. The authors used their pipeline for variant calling using the NA12878 exome data set and compared their results to those of Crossbow (which uses SOAPsnp). Based on these results they concluded that crossbow had a lower spcecificity.
- Important reminder
- If you run GATK framework in your own pipeline, you have to bear in mind GATK has Stringent file formatting requirement.
- e.g. chromosomes ordering in genome reference file has to be in canonical order.
- BAM header has to be present in every BAM file.
- The BAM file has to be sorted, preferably by Picards because it write the proper header after sorting
- Read-group tag has to be present in each BAM. Either input the correct tag during mapping or you may waste your time in fixing the BAM file afterwards
- Likely relatively specific (The authors show higher specificity than crossbow)
- relatively complex pipelines
- performed slightly better than sopasnp and beter than snvnmix according to an independent comparison
samtools using the mpileup command http://samtools.sourceforge.net/mpileup.shtml
samtools pileup (without the m) is deprecated and has been removed in recent SAMtools versions.
Sibelia is a comparative genomic tool toassists biologists in analysing the genomic variations that correlate with pathogens, or the genomic changes that help microorganisms adapt in different environments. Sibelia will also be helpful for the evolutionary and genome rearrangement studies for multiple strains of microorganisms.
- Works well for multiple bacterial genomes.
- Easy to run and cross-platform, licensed under GPL.
- Works slow for large genomes.
SOAPsnp is e.g. used in the Crossbow pipeline.
SNVMix The authors of SNVMix compared their tool to MAQ v0.6.8 and found better performance as judged by area under the curve when using Affymetrix SNP 6.0 data. However in an independent comparison using MAQ 0.71 MAQ performed better.
- Might be unstable in high coverage region according to an independent comparison.
- Might be less precise than MAQ and SOAPsnp
Further Reading Material and References
- Further Reading
- Alkan et al., 2011 Finding variation using different approaches
- Original Publications
- Nielsen R, Paul JS, Albrechtsen A, Song YS Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. (2011) 12:443-51. The article gives general reccommendations for a workflow and suggests to use a calibration step as implemented by GATK or SOAPsnp
- Wang et al., 2011 A comparison of short read aligners and performance assesment of MAQ (0.71), SOAPsnp (1.03) and SNVmix(2-0.11.8-r4) where MAQ performed best