There's an existing (short) forum thread on this, but no answer. It seems like no matter what dataset I'm using, a hugely disproportionate number of the SNPs called by samtools pileup -cv have SNP quality of exactly 228. Often that's the highest quality represented in the list, but not always.

The docs say the SNP quality column contains the:

Phred-scaled likelihood that the genotype is identical to the reference, which is also called `SNP quality'. Suppose the reference base is A and in alignment we see 17 G and 3 A. We will get a low consensus quality because it is difficult to distinguish an A/G heterozygote from a G/G homozygote. We will get a high SNP quality, though, because the evidence of a SNP is very strong.

A range of 0-255 doesn't actually sound like a Phred-scaled number, though. Could some odd logarithmic conversion be responsible for throwing a bunch of stuff into quality 228?

I have a couple of pileups I'm looking at right now, each of which contains about 50K variants (this is human data from SOLiD, if it matters). Really, I see three odd things about this data:

  • both sets have spikes at 228, whether small-but-noticeable or totally-dominant
  • there's another trend-defying spike at SNP quality 30
  • the whole distribution is oddly spiky, with scores divisible by three being much more common

The two datasets are just two examples out of many; one normal-ish, and one with an unusually large spike. (Clicking the thumbnail below will take you straight to the image, not a page with ads or anything.)

Can anyone shed any light on these SNP quality mysteries? Thanks!

asked 10 Aug '10, 12:30

Jenn's gravatar image

Jenn
3629
accept rate: 0%

Interestingly, it looks like the read quality column has exactly the same characteristics (the three bullet points from above), though of course individual read qualities and SNP qualities don't necessarily correspond.

(10 Aug '10, 15:57) Jenn

I'll take a stab at this, but I've only briefly looked at the supplemental data in the original MAQ paper (this is the same method for SNP calling used by samtools pileup).

The SNP quality is determined using an estimate of the probabilities. Their estimation only has a certain amount of resolution (because nucleotide counts are always integers), so this is not a continuous function. The phred-like score is likewise not continuous (it gets rounded to an integer), but the number of nucleotides and the SNP quality will be on different scales, resulting in the "divisible by 3" feature you noticed. If the SNP quality and nucleotide counts could be fractional that would likely be pretty smooth. As far as the spikes at 228 and 30, I'm not entirely sure where those come from. With the kind of probability estimation they're using there are often edge cases (when both have equal numbers of reads, when one has exactly one read, when there's only two reads total, etc.) and it's possible that those edge cases get rounded to those particular scores, resulting in their over-representation. I would look to see if the SNPs generating those scores have anything in common.

link

answered 10 Aug '10, 20:05

mrawlins's gravatar image

mrawlins
431119
accept rate: 16%

Good point about diving into the MAQ literature. I'll see if I can find some time to do that. I haven't yet found anything in common for the spike SNPs, though I'll keep poking at it. Thanks for the new directions of inquiry! (I won't call this the "answer" yet, lest it discourage others from contributing.)

(11 Aug '10, 08:51) Jenn
Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Tags:

×13
×2
×2

Asked: 10 Aug '10, 12:30

Seen: 1,822 times

Last updated: 11 Aug '10, 08:51

powered by OSQA