0 Comments

用bedtools计算序列覆盖度coverage

发布于:2015-10-13  |   作者:admin  |   已聚集:人围观
  为了来根据我的contig coverage来聚类,我通过bowtie比对reads和contig,获得了比对的sam文件,然后通过samtools进行了处理,但是得到的结果只有多少条reads匹配到我的contig上,不是coverage.(具体怎么操作,之前博文有交代)。
   接着我采用bedtools来接着对samtools处理的比对数据进行分析。BEDTools是可用于genomic features的比较,相关操作及进行注释的工具。而genomic features通常使用Browser Extensible Data (BED) 或者 General Feature Format (GFF)文件表示,用UCSC Genome Browser进行可视化比较。
bedtools
安装
下载地址:https://code.google.com/p/bedtools/downloads/list
 git://github.com/arq5x/bedtools.git
1. 解压缩
2.Cd到解压缩文件中
3.Make
4.如果遇到问题,联系aaronquinlan at gmail dot com
5.运行 make test
6. Copy the files in bin/ to ~/bin or if you have the privileges, to /usr/local/bin.  Make sure that the directory to which you copy the tools is in your $PATH  (我在其他的地方运行了一下,居然可以用,不知道是不是之前就已经安装好了,这一步就没有做)
7. Use the tools.
 
在终端输入 bedtools genomecov 你可以看到具体的说明,具体的我就不说了
Tool:    bedtools genomecov (aka genomeCoverageBed)
Version: v2.17.0
Summary: Compute the coverage of a feature file among a genome.
 
Usage: bedtools genomecov [OPTIONS] -i -g
 
Options: 
-ibam 输入文档必须是BAM格式,同时也要是排序后的BAM文件
 
-d Report the depth at each genome position (with one-based coordinates).
Default behavior is to report a histogram.
 
-dz Report the depth at each genome position (with zero-based coordinates).
Reports only non-zero positions.
Default behavior is to report a histogram.
 
-bg Report depth in BedGraph format. For details, see:
genome.ucsc.edu/goldenPath/help/bedgraph.html
 
-bga Report depth in BedGraph format, as above (-bg).
However with this option, regions with zero 
coverage are also reported. This allows one to
quickly extract all regions of a genome with 0 
coverage by applying: "grep -w 0$" to the output.
 
-split Treat "split" BAM or BED12 entries as distinct BED intervals.
when computing coverage.
For BAM files, this uses the CIGAR "N" and "D" operations 
to infer the blocks for computing coverage.
For BED12 files, this uses the BlockCount, BlockStarts, and BlockEnds
fields (i.e., columns 10,11,12).
 
-strand Calculate coverage of intervals from a specific strand.
With BED files, requires at least 6 columns (strand is column 6). 
- (STRING): can be + or -
 
-5 Calculate coverage of 5" positions (instead of entire interval).
 
-3 Calculate coverage of 3" positions (instead of entire interval).
 
-max Combine all positions with a depth >= max into
a single bin in the histogram. Irrelevant
for -d and -bedGraph
- (INTEGER)
 
-scale Scale the coverage by a constant factor.
Each coverage value is multiplied by this factor before being reported.
Useful for normalizing coverage by, e.g., reads per million (RPM).
- Default is 1.0; i.e., unscaled.
- (FLOAT)
 
-trackline Adds a UCSC/Genome-Browser track line definition in the first line of the output.
- See here for more details about track line definition:
     http://genome.ucsc.edu/goldenPath/help/bedgraph.html
- NOTE: When adding a trackline definition, the output BedGraph can be easily
     uploaded to the Genome Browser as a custom track,
     BUT CAN NOT be converted into a BigWig file (w/o removing the first line).
 
-trackopts Writes additional track line definition parameters in the first line.
- Example:
  -trackopts 'name="My Track" visibility=2 color=255,30,30'
  Note the use of single-quotes if you have spaces in your parameters.
- (TEXT)
 
Notes: 
(1) The genome file should tab delimited and structured as follows:
 
For example, Human (hg19):
chr1 249250621
chr2 243199373
...
chr18_gl000207_random 4262  #关键是这一步,我需要提取我的scaffold文件名和长度,在这一步理解的不够好,花了点时间啊
 
(2) The input BED (-i) file must be grouped by chromosome.
A simple "sort -k 1,1 > .sorted" will suffice.
 
(3) The input BAM (-ibam) file must be sorted by position.
A "samtools sort " should suffice.
 
Tips: 
One can use the UCSC Genome Browser's MySQL database to extract
chromosome sizes. For example, H. sapiens:
 
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
"select chrom, size from hg19.chromInfo" > hg19.genome
 
 
 
 
达到目的的最后命令:
对scaffold.fa前处理
cd /sam/idba_ud/myproject/clean_reads/index
grep 'scaffold' scaffold.fa>scaffold.txt
cut -f1,2 -d " " scaffold.txt>scaffold.txt2
cat scaffold.txt2 |sed 's/>scaffold/scaffold/g'|sed 's/length_//g' >scaffold.txt3
获得coverage
bedtools genomecov -ibam eg2.sorted.bam -g scaffold.txt3 > coverage.txt
 
出来的结果:第一列是scaffold,第二列是起始位置,第三列是终止位置,第四列是总大小,第五列是覆盖度
在genomecov里面,第三列是总共覆盖的碱基数目
 
 awk '{$2==0}' coverage |less
正常是会出来结果,但是我这个什么都么有??
后来也就先放下了,等待高手来解答。
 
 
参考资料:
Biostar  http://www.biostars.org/p/16121/
http://www.plob.org/2012/09/26/3748.html
官网 http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html
Biostar  http://www.biostars.org/p/5165/ (里面有很多中方法,非常不错的一个帖子)

转自:http://blog.sina.com.cn/s/blog_670445240101l7li.html
标签:bedtools(2)
    输入验证码:
点击我更换验证码