I have fastq and bam files. How can I get basic read statistics.

Number of raw reads

Raw data reads(Mb)

Number of mapped reads

Mapped bases(Mb)

Mean depth(coverage)

asked 20 Jul '10, 07:50

hongiiv's gravatar image

hongiiv
1261
accept rate: 0%

edited 20 Jul '10, 10:21


Also check samtools flagstat.

link

answered 21 Jul '10, 08:55

Bruins's gravatar image

Bruins
161
accept rate: 0%

1

Number of raw reads ---> grep -c "^>" fastq.filename

Raw data read(MB) ---> Not yet...

Number of mapped reads ---> samtools flagstat

Mapped bases(Mb) ---> Not yet...

Mean depth(coverage) ---> using GATK DepthOfCoverage

Thank you.

(21 Jul '10, 09:24) hongiiv

For the number of raw reads, using the following command

grep -c "^>" fastq.filename

should do it.

Getting the Mb count of your reads involves getting the length of each read and adding them together. This can be done in Perl fairly easily. It can also be included in Java as described below, by adding all the unmapped read lengths.

The rest of the statistics are available using the samtools libraries. The samtools utilities probably have some of these functions, but I haven't spent much time with them. The Java implementation of samtools (Picard) is one of the nicest libraries I've ever used.

For counting the number of mapped reads you want to loop over all reads in the bam file and count if getReadUnmappedFlag is false (i.e. the read mapped). To get a mapped bases count, add getReadLength for each read that mapped. For the mean depth you can take the mapped bases count and divide by the total number of nucleotide positions. All that should take a few hours for an experienced Java programmer, and a few days for a novice. It should run in less than an hour.

link

answered 20 Jul '10, 12:42

mrawlins's gravatar image

mrawlins
431119
accept rate: 16%

I found GATK for Mean depth(coverage).

http://www.broadinstitute.org/gsa/wiki/index.php/Depth_of_Coverage_v3.0

java -Xmx4g -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R ref.fa -I KOBB0909_103.final.bam -I KOBB0909_111.final.bam -L target_region.list -o test.out

============target_region.list============

chr2:27677011-27677236

chr2:27677669-27677870

chr2:27678783-27678982

============test.out============

Locus Total_Depth Average_Depth_sample Depth_for_KOBB0909_103.sort.bam Depth_for_KOBB0909_111.sort.bam

chr2:27677011 44 22.00 36 8

chr2:27677012 45 22.50 37 8

chr2:27677013 46 23.00 37 9

chr2:27677014 46 23.00 37 9

chr2:27677015 46 23.00 37 9

chr2:27677016 46 23.00 37 9

chr2:27677017 46 23.00 37 9

chr2:27677018 46 23.00 37 9

link

answered 21 Jul '10, 08:48

hongiiv's gravatar image

hongiiv
1261
accept rate: 0%

hi @hongiv, Also have a look at HTSeq python library, it does coverage and quality from the command-line.

So you could get a reads.sam.pdf by doing:

$ htseq-qa reads.sam

You can also calculate coverage via some programming.

link

answered 27 Jul '10, 16:23

brentp's gravatar image

brentp
865
accept rate: 33%

Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Tags:

×2
×2

Asked: 20 Jul '10, 07:50

Seen: 1,928 times

Last updated: 28 Dec '10, 22:58

powered by OSQA