Information & Computational Sciences

False Positive SNPs

False positive SNPs are a major issue in NGS-based variant analysis and – if present in large numbers – can have disastrous consequences for downstream analysis and application, e.g. the design of genotyping platforms. We have been investigating the root causes of false positive SNPs in a 3-year studentship jointly funded by the James Hutton Institute and the University of Dundee.

The major output from this work is a large multifactorial study that examines how a number of variables involved in mapping and variant calling affect the rate of false positives generated (quality of the reference sequence, read length, choice of mapper and variant caller, mapping stringency, and filtering of SNPs by read mapping quality and read depth). We simulated reads of different lengths from the A. thaliana genome sequence, assembled these, and then mapped the reads back onto both the de novo assembly and, as a control, the original genome. We then ran a variant caller over these mappings and compared the numbers of variant calls generated. Because the simulated reads were sampled in error- and variant-free haploid mode, every variant called is by definition a false positive. We also investigated which reads were mapped to the correct location and which were not.
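The logic of the control can be sketched in a few lines. This is a toy illustration, not the study's pipeline (which used real mappers and callers): reads are exact substrings of a reference, so a correctly placed read produces no mismatches, while a mismapped one immediately creates spurious variant sites.

```python
def count_false_positive_sites(reference, placements):
    """Count reference positions where a mapped read disagrees with the
    reference. With error-free, variant-free reads, every such mismatch
    is by construction a false positive."""
    fp_sites = set()
    for mapped_start, seq in placements:
        for offset, base in enumerate(seq):
            if reference[mapped_start + offset] != base:
                fp_sites.add(mapped_start + offset)
    return len(fp_sites)

reference = "ACGTACGGTTCAGCTAACGGATCGATTGCA" * 10
read = reference[10:30]  # an exact, error-free 20 bp read

# Correctly mapped: the read matches the reference perfectly.
print(count_false_positive_sites(reference, [(10, read)]))  # -> 0

# Mismapped by 3 bp: the same read now produces mismatch sites
# that a variant caller would report as SNPs.
print(count_false_positive_sites(reference, [(7, read)]))
```

In the real pipeline the same reasoning holds at genome scale: any call made from error-free haploid reads is an artefact of mapping or calling, never a true variant.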

The results were complex and multi-faceted, with many significant interactions between factors. We were able to show that, as a result of different factor level combinations alone, the number of false positive SNPs varied by five orders of magnitude. Using a poorly assembled, fragmented reference sequence led to a dramatic increase in the number of FP SNPs generated, as did relaxed read mapping and a lack of SNP filtering. The effect of read length was more complex – longer reads did not always result in fewer false positives, as one might expect from their presumed greater mapping specificity. Instead, we observed numerous instances where longer reads did more damage when mismapped, contributing greater numbers of mismatches, which in turn led to increased numbers of false positives.

The study is a stark warning to bioinformaticians working with fragmented assemblies (often the case for non-model organisms) not to blindly adopt practices considered acceptable for the human genome or assemblies of similar completeness, such as running mappers at their default mismatch stringency. Our results show that when the true mapping targets for reads are unavailable due to misassembly or non-assembly, reads are readily mapped elsewhere unless mapping is performed at the most stringent settings. These mismappings lead to false positives, and, depending on the factor combination, this can happen on an epidemic scale.
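Filtering calls on mapping quality and read depth, one of the factors in the study, can be sketched as follows. The record format and the thresholds (MQ ≥ 30, 5 ≤ DP ≤ 100) are illustrative assumptions, not the settings used in the paper; the point is that mismapping-driven false positives tend to fail at least one such check.

```python
def filter_snps(calls, min_mq=30, min_dp=5, max_dp=100):
    """Keep only calls backed by confidently mapped reads at a plausible
    depth. MQ is mapping quality, DP is read depth at the site."""
    return [
        c for c in calls
        if c["MQ"] >= min_mq and min_dp <= c["DP"] <= max_dp
    ]

# Hypothetical variant records for illustration.
calls = [
    {"POS": 1201, "MQ": 60, "DP": 22},   # well supported -- kept
    {"POS": 1540, "MQ": 7,  "DP": 18},   # ambiguous mapping -- dropped
    {"POS": 2033, "MQ": 55, "DP": 2},    # too shallow -- dropped
    {"POS": 2790, "MQ": 48, "DP": 412},  # repeat pile-up -- dropped
]
print([c["POS"] for c in filter_snps(calls)])  # -> [1201]
```

Such filters are cheap insurance: low mapping quality flags reads that could sit equally well at several loci, and extreme depth flags collapsed repeats, both hallmarks of the mismapping problem described above.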

This work has now been published in the journal BMC Bioinformatics. The source code used for the analyses in this paper is available here.