|
There's an existing (short) forum thread on this, but no answer. It seems like no matter what dataset I'm using, a hugely disproportionate number of the SNPs called by samtools pileup -cv have SNP quality of exactly 228. Often that's the highest quality represented in the list, but not always. The docs say the SNP quality column contains the:
A range of 0-255 doesn't actually sound like a Phred-scaled number, though. Could some odd logarithmic conversion be responsible for throwing a bunch of stuff into quality 228? I have a couple of pileups I'm looking at right now, each of which contains about 50K variants (this is human data from SOLiD, if it matters). Really, I see three odd things about this data:
The two datasets are just two examples out of many; one normal-ish, and one with an unusually large spike. (Clicking the thumbnail below will take you straight to the image, not a page with ads or anything.) Can anyone shed any light on these SNP quality mysteries? Thanks! |
|
I'll take a stab at this, but I've only briefly looked at the supplemental data in the original MAQ paper (this is the same method for SNP calling used by samtools pileup). The SNP quality is determined using an estimate of the probabilities. Their estimation only has a certain amount of resolution (because nucleotide counts are always integers), so this is not a continuous function. The phred-like score is likewise not continuous (it gets rounded to an integer), but the number of nucleotides and the SNP quality will be on different scales, resulting in the "divisible by 3" feature you noticed. If the SNP quality and nucleotide counts could be fractional that would likely be pretty smooth. As far as the spikes at 228 and 30, I'm not entirely sure where those come from. With the kind of probability estimation they're using there are often edge cases (when both have equal numbers of reads, when one has exactly one read, when there's only two reads total, etc.) and it's possible that those edge cases get rounded to those particular scores, resulting in their over-representation. I would look to see if the SNPs generating those scores have anything in common. Good point about diving into the MAQ literature. I'll see if I can find some time to do that. I haven't yet found anything in common for the spike SNPs, though I'll keep poking at it. Thanks for the new directions of inquiry! (I won't call this the "answer" yet, lest it discourage others from contributing.)
(11 Aug '10, 08:51)
Jenn
|

Interestingly, it looks like the read quality column has exactly the same characteristics (the three bullet points from above), though of course individual read qualities and SNP qualities don't necessarily correspond.