A probabilistic approach for SNP discovery in high-throughput human resequencing data

  1. Rose Hoberman (1,2),
  2. Joana Dias (3),
  3. Bing Ge (3),
  4. Eef Harmsen (3),
  5. Michael Mayhew (1,2),
  6. Dominique J. Verlaan (3,4,5),
  7. Tony Kwan (3,4,5),
  8. Ken Dewar (3,4,5),
  9. Mathieu Blanchette (1,2,6) and
  10. Tomi Pastinen (3,4,5,6)
  1. McGill Centre for Bioinformatics, McGill University, Montréal H3G 0B1, Canada;
  2. School of Computer Sciences, McGill University, Montréal H3A 2T5, Canada;
  3. McGill University and Genome Québec Innovation Centre, Montréal H3G 1A4, Canada;
  4. Department of Human Genetics, McGill University Health Centre (MUHC), McGill University, Montréal H3G 1A4, Canada;
  5. Department of Medical Genetics, McGill University Health Centre (MUHC), McGill University, Montréal H3G 1A4, Canada

    Abstract

    New high-throughput sequencing technologies are generating large amounts of sequence data, allowing the development of targeted large-scale resequencing studies. For these studies, accurate identification of polymorphic sites is crucial. Heterozygous sites are particularly difficult to identify, especially in regions of low coverage. We present a new strategy for identifying heterozygous sites in a single individual by using a machine learning approach that generates a heterozygosity score for each chromosomal position. Our approach also facilitates the identification of regions with unequal representation of two alleles and other poorly sequenced regions. The availability of confidence scores allows for a principled combination of sequencing results from multiple samples. We evaluate our method on a gold-standard genotype data set from HapMap and are able to classify sites in this data set as heterozygous or homozygous with 98.5% accuracy. In de novo data, our probabilistic heterozygote detection (“ProbHD”) identifies 93% of heterozygous sites at a <5% false call rate (FCR), as estimated from independent genotyping results. In a direct comparison of ProbHD with high-coverage 1000 Genomes sequencing available for a subset of our data, we observe >99.9% overall agreement for genotype calls and close to 90% agreement for heterozygote calls. Overall, our data indicate that high-throughput resequencing of human genomic regions requires careful attention to systematic biases in sample preparation as well as sequence contexts, and that the impact of these biases can be alleviated by machine learning-based sequence analyses, allowing more accurate extraction of true DNA variants.

    Footnotes

    • 6 Corresponding authors.

      E-mail blanchem{at}mcb.mcgill.ca; fax (514) 398-3387.

      E-mail tomi.pastinen{at}mcgill.ca; fax (514) 398-1738.

    • 7 Notice that the numbers of homozygous and heterozygous sites are much more balanced than in actual genomic data—this is addressed in the section on de novo het-calling.

    • 8 At low coverage (<15×), the distribution of GigaBayes scores does not allow much of a trade-off between sensitivity and FCR, as the majority of hets are assigned score 0.

    • 9 If f is the FCR on a balanced test set, and s is the sensitivity obtained on that set, then the expected FCR on a data set with a polymorphism rate of p is (1 − p)·f / ((1 − p)·f + p·s).

    • 10 The upper bound of the 90% Bayesian credible interval on the fraction of reads derived from the underrepresented allele is <0.45.

    • 11 Adding to the confusion, the term false-positive rate (FP/(FP+TN)) is sometimes used to denote what we call false call rate (FP/(FP+TP)), and specificity is sometimes used to refer to 1 − false call rate. Most often, terms are used without a clear explanation of the mathematical quantity represented.

    • 12 For training the classifier, only sites classified as polymorphic by Sequenom were included.

    • [Supplemental material is available online at http://www.genome.org. Alignment and SNP-calling software is available at http://www.mcb.mcgill.ca/~blanchem/reseq.]

    • Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.092072.109.

      • Received February 1, 2009.
      • Accepted July 13, 2009.
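
The quantities named in footnotes 9 and 11 can be made concrete with a short sketch. The Python below is illustrative only and is not part of the ProbHD software: the function names, the example counts, and the example rates are hypothetical, and `expected_fcr` simply transcribes the expression in footnote 9 as written.

```python
def false_call_rate(fp: int, tp: int) -> float:
    """False call rate as defined in footnote 11: FP / (FP + TP)."""
    return fp / (fp + tp)


def false_positive_rate(fp: int, tn: int) -> float:
    """False-positive rate as defined in footnote 11: FP / (FP + TN)."""
    return fp / (fp + tn)


def expected_fcr(f: float, s: float, p: float) -> float:
    """Expression from footnote 9, transcribed as written.

    f: FCR measured on a balanced test set
    s: sensitivity measured on that same set
    p: polymorphism rate of the target data set
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    numerator = (1.0 - p) * f
    return numerator / (numerator + p * s)


if __name__ == "__main__":
    # Hypothetical counts: the two rates differ because they use
    # different denominators (calls made vs. true homozygous sites).
    print(false_call_rate(fp=5, tp=95))
    print(false_positive_rate(fp=5, tn=995))

    # Hypothetical rates: the same f and s give very different values
    # at a balanced rate (p = 0.5) and at a rare one (p = 0.001).
    for p in (0.5, 0.001):
        print(p, expected_fcr(f=0.01, s=0.9, p=p))
```

With these made-up inputs, the same balanced-set figures give a much higher expected FCR once heterozygous sites are rare, which is the imbalance footnote 7 points to.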