Using Cufflinks

panhoy 2015-01-25


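I. Introduction

The Cufflinks suite consists mainly of the programs cufflinks, cuffmerge, cuffcompare, and cuffdiff. It is used primarily to estimate gene expression levels and to find differentially expressed genes.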

II. Installation

See the Cufflinks download page.
1. Building Cufflinks from source requires the Boost C++ libraries. Download and install Boost; by default it installs under /usr/local.

$ tar jxvf boost_1_53_0.tar.bz2
$ cd boost_1_53_0
$ ./bootstrap.sh
$ sudo ./b2 install

2. Install SAMtools.

Download SAMtools.
$ tar jxvf samtools-0.1.18.tar.bz2
$ cd samtools-0.1.18
$ make
$ sudo su 
# mkdir /usr/local/include/bam
# cp libbam.a /usr/local/lib
# cp *.h /usr/local/include/bam/
# cp samtools /usr/bin/

3. Install the Eigen libraries.

Download Eigen.
$ tar jxvf 3.1.2.tar.bz2
$ cd eigen-eigen-5097c01bcdc4
$ sudo cp -r Eigen/ /usr/local/include/

4. Install Cufflinks.

$ tar zxvf cufflinks-2.0.2.tar.gz
$ cd cufflinks-2.0.2
$ ./configure --prefix=/path/to/cufflinks/install --with-boost=/usr/local/ --with-eigen=/usr/local/include//Eigen/
$ make
$ make install

5. Alternatively, download the precompiled Linux x86_64 binary. This skips all of the steps above; the unpacked programs run as they are. (Recommended.)

III. Using Cufflinks

1. Overview of Cufflinks

Starting from TopHat alignments, and with or without a reference GTF annotation, cufflinks estimates the FPKM of each isoform (of each gene) and writes the assembled transcriptome to transcripts.gtf.

Notes:

1. Fragment length estimation: for paired-end data, cufflinks estimates the fragment length distribution from the alignments themselves. For single-end data it assumes a Gaussian distribution by default, or uses the parameters you provide.

2. Multi-mapped reads: by default a read is divided evenly among its alignments, so a read mapping to 10 positions counts as 10% of a read at each position.

3. Assembling bacterial transcriptomes with cufflinks is generally not recommended; Glimmer is suggested instead. If an annotation is available, however, cufflinks and cuffdiff can still be used to measure expression and differential expression.

4. cufflinks and cuffdiff cannot report FPKM values for individual exons or splicing events.

5. For time-series data, run cuffdiff with -T (--time-series).

6. If the progress bar reaches 99% and then stalls: a few loci with very many aligned reads need far more CPU time, and they are generally finished only after every other locus is done. Use -M to supply a mask file that filters out rRNA or mitochondrial RNA.

7. If cufflinks or cuffdiff crashes with a 'bad_alloc' error, or takes an extremely long time to finish, the machine has run out of memory while assembling or quantifying a very highly expressed gene. Fix: lower --max-bundle-frags. Try 500000 first, and keep lowering it if the error persists.

8. If every gene and transcript in the cuffdiff output has FPKM = 0, the chromosome names in the GTF do not match those in the BAM.

9. Limitations of cuffdiff and cufflinks: some reported genes and transcripts are spurious (causes: sequencing depth, sequencing quality, the number of times each sample was sequenced, and annotation errors).

10. A large fold change does not by itself indicate significance (such genes often have many isoforms or few sequenced reads, that is, low overall expression). Genes that cuffdiff reports with large fold changes therefore carry some uncertainty.

11. In the transcripts.gtf file produced by cufflinks, transcripts labeled CUFF are newly assembled transcripts. Likewise, the CUFF label in the output of the other modules marks new transcripts.

12. If you see the following error:

You are using Cufflinks v2.2.1, which is the most recent release.
open: No such file or directory
File 30 doesn't appear to be a valid BAM file, trying SAM...
Error: cannot open alignment file 30 for reading
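This means that an option is malformed. For example, "-min-intron-length" was typed where "--min-intron-length" was intended, so the option's value (here 30) ended up being read as the alignment file name.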
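Note 8 above (all FPKM values equal to zero) comes down to comparing two sets of names. The sketch below is one way to check this before a long run. It assumes you already have the GTF lines and the SAM/BAM header lines as text (the header can be printed with `samtools view -H`). The function names and toy inputs are invented for illustration.

```python
def gtf_seqnames(gtf_lines):
    """Collect the sequence names found in column 1 of GTF records."""
    names = set()
    for line in gtf_lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


def sam_seqnames(header_lines):
    """Collect reference names from the SN: field of @SQ header lines."""
    names = set()
    for line in header_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


# Toy inputs: the GTF says "chr1" while the alignment header says "1".
gtf = ['chr1\tCufflinks\texon\t1\t1000\t.\t+\t.\tgene_id "g1";\n']
header = ["@HD\tVN:1.0\tSO:coordinate\n", "@SQ\tSN:1\tLN:1000\n"]
shared = gtf_seqnames(gtf) & sam_seqnames(header)
print(sorted(shared) if shared else "no shared sequence names")
```

An empty intersection is the situation described in note 8; a partial one (for example, only the mitochondrial name differs) would affect only some loci.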

2. Usage

$ cufflinks [options]* <aligned_reads.(sam/bam)>

A typical example:
$ cufflinks -p 8 -G transcript.gtf --library-type fr-unstranded -o cufflinks_output tophat_out/accepted_hits.bam

3. General options

-h | --help

-o | --output-dir   default: ./
    Name of the output directory.

-p | --num-threads  default: 1
    Number of CPU threads to use.

-G | --GTF
    Supply a GFF file and estimate isoform expression against it. No new transcripts are assembled, and alignments that are incompatible with the reference transcripts are ignored.

-g | --GTF-guide
    Supply a GFF file to guide the assembly (RABT assembly). The output contains the reference transcripts as well as novel genes and isoforms.

-M | --mask-file
    Supply a GFF file. Cufflinks ignores all reads that align within the transcripts in this file. It usually holds rRNA annotation, and may also hold mitochondrial or other transcripts you want ignored. Removing these RNAs helps the estimation of mRNA expression.

-b | --frag-bias-correct
    Supply a FASTA file so that Cufflinks runs its bias detection and correction algorithm, which can noticeably improve the accuracy of transcript abundance estimates.

-u | --multi-read-correct
    Have Cufflinks run an initial estimation step so that reads mapping to multiple genomic locations are weighted more accurately.

--library-type  default: fr-unstranded
    Library type of the reads. For strand-specific libraries the alignments carry an XS tag. Standard Illumina data is usually fr-unstranded.

--library-norm-method
    See the official documentation. Three methods: classic-fpkm (the default); geometric (as used by DESeq); quartile (fragment counts and total mapped counts are scaled by the 75th percentile).
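The default treatment of multi-mapped reads (each read split evenly over its alignments, before any -u correction) can be sketched in a few lines. This is an illustration of the 1/n rule only, not Cufflinks' implementation. The function name and the read and locus labels are made up.

```python
from collections import defaultdict


def uniform_multiread_counts(alignments):
    """alignments: iterable of (read_id, locus) pairs.

    Returns {locus: fractional read count}, where a read with n alignments
    contributes 1/n to each locus it hits.
    """
    hits = defaultdict(list)
    for read_id, locus in alignments:
        hits[read_id].append(locus)

    counts = defaultdict(float)
    for loci in hits.values():
        share = 1.0 / len(loci)
        for locus in loci:
            counts[locus] += share
    return dict(counts)


# r1 maps uniquely to geneA; r2 maps to both geneA and geneB.
alignments = [("r1", "geneA"), ("r2", "geneA"), ("r2", "geneB")]
print(uniform_multiread_counts(alignments))
```

With -u, Cufflinks replaces the even split by weights derived from an initial abundance estimate, so the shares would no longer be equal.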

4. Abundance estimation options

-m | --frag-len-mean  default: 200
    Expected mean fragment length. Cufflinks now learns the mean fragment length itself, so setting this by hand is not recommended.

-s | --frag-len-std-dev  default: 80
    Standard deviation of the fragment length. Cufflinks now learns this value itself, so setting it by hand is not recommended.

-N | --upper-quartile-norm
    Normalize by the 75th percentile of the number of fragments mapped to individual loci, instead of by the total. This helps when looking for differential expression among low-abundance genes and transcripts.

--total-hits-norm  default: TRUE
    When computing FPKM, count all fragments, including those not compatible with any reference transcript. This is the opposite of the next option and is enabled by default.

--compatible-hits-norm
    When computing FPKM, count only fragments that are compatible with a reference transcript. Disabled by default. It takes effect only together with --GTF and has no effect in RABT or ab initio mode.

--max-mle-iterations
    Number of iterations allowed in the maximum likelihood estimation. Default: 5000.

--max-bundle-frags
    Maximum number of fragments a locus may have before it is skipped. Default: 1000000.

--no-effective-length-correction
    Cufflinks will not employ its "effective" length normalization to transcript FPKM.

--no-length-correction
    Cufflinks will not normalize fragment counts by transcript length at all. Use this when the fragment count is independent of the size of the features being quantified (for example small RNA libraries, where no fragmentation takes place, or 3 prime end sequencing, where the sampled RNA fragments are all essentially the same length). Use with caution.
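To make the normalization options concrete, here is a simplified FPKM calculation. It uses the textbook definition (fragments per kilobase of transcript per million mapped fragments) and ignores Cufflinks' effective-length correction and its statistical assignment of fragments to isoforms. The numbers are invented.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# 500 fragments on a 2,000 bp transcript.
# Denominator 1: all 10 million mapped fragments (as with --total-hits-norm).
print(fpkm(500, 2000, 10_000_000))
# Denominator 2: only the 8 million fragments compatible with the annotation
# (as with --compatible-hits-norm); the same transcript gets a higher FPKM.
print(fpkm(500, 2000, 8_000_000))
```

The two calls differ only in the denominator, which is the whole difference between the two hit-normalization options in this simplified view.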

5. Common assembly options

-L | --label  default: CUFF
    Cufflinks reports transfrags in GTF format; this sets the prefix of their IDs.

-F/--min-isoform-fraction <0.0-1.0>
    After estimating the isoform abundances of a gene, Cufflinks filters out transcripts of very low abundance, because they cannot be trusted. This also removes exons supported by very few reads. The default is 0.1, i.e. 10% of the abundance of the most abundant isoform (the major isoform) of the gene.

-j/--pre-mrna-fraction <0.0-1.0>
    The minimum depth of coverage in the intronic region covered by the alignment is divided by the number of spliced reads, and if the result is lower than this parameter value, the intronic alignments are ignored. The default is 15%.

-I/--max-intron-length
    Maximum intron length. Cufflinks will not report transcripts with introns longer than this, and will ignore SAM alignments with REF_SKIP CIGAR operations longer than this. The default is 300,000.

-a/--junc-alpha <0.0-1.0>
    The alpha value for the binomial test used to filter false-positive spliced alignments. Default: 0.001.

-A/--small-anchor-fraction <0.0-1.0>
    Spliced reads with less than this fraction of their length on either side of a junction are considered suspicious and may be filtered out before assembly. Default: 0.09.

--min-frags-per-transfrag  default: 10
    Assembled transfrags supported by fewer RNA-seq fragments than this are not reported.

--overhang-tolerance
    The number of bp allowed to extend past an exon boundary when deciding whether a read or another transcript is compatible with a transcript. The default is 8 bp, matching the bowtie/TopHat defaults.

--max-bundle-length
    Maximum genomic length allowed for a given bundle. The default is 3,500,000 bp.

--min-intron-length  default: 50
    Minimum intron size.

--trim-3-avgcov-thresh
    Minimum average coverage required to attempt 3' trimming. The default is 10.

--trim-3-dropoff-frac
    The fraction of average coverage below which to trim the 3' end of an assembled transcript. The default is 0.1.

--max-multiread-fraction <0.0-1.0>
    The fraction of a transfrag's supporting reads that may be multiply mapped to the genome. A transcript composed of more than this fraction will not be reported by the assembler. Default: 0.75 (75% multireads or more is suppressed).

--overlap-radius  default: 50
    Transfrags separated by less than this distance are merged.

Advanced Reference Annotation Based Transcript (RABT) assembly options, to consider when you use -g/--GTF-guide:

--3-overhang-tolerance
    The number of bp allowed to overhang the 3' end of a reference transcript when determining if an assembled transcript should be merged with it (i.e., the assembled transcript is not novel). The default is 600 bp.

--intron-overhang-tolerance
    The number of bp allowed to enter the intron of a reference transcript when determining if an assembled transcript should be merged with it (i.e., the assembled transcript is not novel). The default is 50 bp.

--no-faux-reads
    This option disables tiling of the reference transcripts with faux reads. Use this if you only want to use sequencing reads in assembly but do not want to output assembled transcripts that lie within reference transcripts. All reference transcripts in the input annotation will also be included in the output.

Other options (minor)

-v/--verbose
    Print status updates and other diagnostic information.

-q/--quiet
    Print nothing except warnings and errors.

--no-update-check
    Turn off the automatic check for a newer Cufflinks release.

6. Cufflinks input and output files

The input to cufflinks is a SAM or BAM file, and it must be sorted by reference position. The SAM/BAM output of TopHat is already sorted. An unsorted SAM file from another source can be sorted as follows:

sort -k 3,3 -k 4,4n hits.sam > hits.sam.sorted
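One caveat with the plain `sort` command is that it treats `@` header lines like any other line. A minimal Python sketch of the same ordering (reference name, then numeric position) that keeps header lines at the top is shown here. It works on lines already read into memory, and the function name and toy records are illustrative only.

```python
def sort_sam_lines(lines):
    """Keep '@' header lines first, then order alignments by reference name
    (column 3) and numeric leftmost position (column 4)."""
    header = [line for line in lines if line.startswith("@")]
    records = [line for line in lines if line.strip() and not line.startswith("@")]

    def sort_key(line):
        fields = line.split("\t")
        return fields[2], int(fields[3])

    return header + sorted(records, key=sort_key)


sam = [
    "@HD\tVN:1.0\n",
    "r1\t0\tchr2\t5\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r3\t0\tchr1\t20\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
]
print("".join(sort_sam_lines(sam)), end="")
```

For BAM files or large inputs, a dedicated tool such as `samtools sort` is the more practical route; the sketch only shows what "sorted by reference position" means.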



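1. transcripts.gtf

This file contains the isoforms assembled by Cufflinks. The first 7 columns are standard GTF, and the last column holds the attributes. The meaning of each column:

Column  Name        Example     Description
1       seqname     chrX        Chromosome or contig name
2       source      Cufflinks   Name of the program that generated the file
3       feature     exon        Record type, normally "transcript" or "exon"
4       start       1           Start position (1-based)
5       end         1000        End position
6       score       1000        The most abundant isoform of each gene scores 1000; minor isoforms score proportionally lower
7       strand      +           Cufflinks' guess of the strand the isoform came from: '+', '-' or '.'
8       frame       .           Not used; Cufflinks does not predict start or stop codon positions
9       attributes  ...         See below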


Each GTF record carries the following attributes:

Attribute          Example    Description
gene_id            CUFF.1     Cufflinks gene id
transcript_id      CUFF.1.1   Cufflinks transcript id
FPKM               101.267    Isoform-level abundance in Fragments Per Kilobase of exon model per Million mapped fragments
frac               0.7647     Reserved; ignore it, as it may be removed in the future
conf_lo            0.07       Lower bound of the 95% confidence interval of the isoform abundance: lower bound = FPKM * (1.0 - conf_lo)
conf_hi            0.1102     Upper bound of the 95% confidence interval of the isoform abundance: upper bound = FPKM * (1.0 + conf_hi)
cov                100.765    Estimated depth of read coverage across the whole transcript
full_read_support  yes        With RABT assembly, reports whether all introns and exons are fully covered by reads
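As a worked example of the attribute table, the sketch below parses an attribute string built from the example values above and applies the two confidence-interval formulas exactly as stated there. The parsing helper is an assumption of this sketch, not part of Cufflinks. Check how your Cufflinks version reports conf_lo and conf_hi before reusing the formulas.

```python
import re


def parse_gtf_attributes(attr_field):
    """Turn 'key "value"; key "value";' into a dict of strings."""
    return dict(re.findall(r'(\S+) "([^"]*)"', attr_field))


attrs = parse_gtf_attributes(
    'gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "101.267"; '
    'frac "0.7647"; conf_lo "0.07"; conf_hi "0.1102"; cov "100.765";'
)

fpkm_value = float(attrs["FPKM"])
# Bounds as defined in the table: FPKM scaled by (1 - conf_lo) and (1 + conf_hi).
lower = fpkm_value * (1.0 - float(attrs["conf_lo"]))
upper = fpkm_value * (1.0 + float(attrs["conf_hi"]))
print(attrs["transcript_id"], round(lower, 3), round(upper, 3))
```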



2. isoforms.fpkm_tracking

FPKM estimates for isoforms (the alternative transcripts of each gene).

3. genes.fpkm_tracking

FPKM estimates for genes.

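IV. Using Cuffmerge

1. Overview of Cuffmerge

Cuffmerge merges the transcripts.gtf files produced by the individual Cufflinks runs into one more complete transcript annotation, merged.gtf, which is then used by Cuffdiff for differential expression analysis.

2. Usage

$ cuffmerge [options]* <assembly_GTF_list.txt>
The input is a text file listing the paths of the GTF files. A typical example: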
$ cuffmerge -o ./merged_asm -p 8 assembly_list.txt
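The list file passed to cuffmerge is plain text with one GTF path per line. A small sketch that builds this text from per-sample cufflinks output directories is shown here. The directory names are hypothetical, and the helper only returns the text so that you decide where to save it (for example as assembly_list.txt).

```python
from pathlib import PurePosixPath


def assembly_list_text(sample_dirs):
    """Return the manifest text: one transcripts.gtf path per line."""
    paths = [str(PurePosixPath(d) / "transcripts.gtf") for d in sample_dirs]
    return "\n".join(paths) + "\n"


# Hypothetical per-sample cufflinks output directories.
text = assembly_list_text(["clout_C1_R1", "clout_C1_R2", "clout_C2_R1"])
print(text, end="")
```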

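3. Options

-h | --help

-o  default: ./merged_asm
    Write the results to this directory.

-g | --ref-gtf
    Merge this reference GTF into the final result as well.

-p | --num-threads  default: 1
    Number of CPU threads to use.

-s | --ref-sequence <seq_dir>/<seq_fasta>
    Points to the genomic DNA sequence. If it is a directory, each contig must be a separate FASTA file; if it is a single FASTA file, all contigs must be in it. Cuffmerge uses the reference sequence to help classify transfrags and to exclude repeats. For example, transcripts made up largely of lower-case bases are classified as repeats.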

4. Cuffmerge output

By default the result is written to merged.gtf inside the output directory.

V. Using Cuffcompare

1. Overview of Cuffcompare

Cuffcompare compares the GTF files produced by Cufflinks with each other and with a reference GTF, for example to find novel transcripts.

2. Usage

$ cuffcompare [options]* <cuff1.gtf> [cuff2.gtf] ... [cuffN.gtf]

Example:
$ cuffcompare -o cuffcmp cuff1.gtf cuff2.gtf

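3. Options

-h
    Print the help message.

-V
    Verbose mode: report progress while processing.

-C
    Also write "contained" transcripts to the .combined.gtf file.

-o  default: cuffcmp
    Prefix of the output files.

-r
    Reference GFF file, used to assess the accuracy of the gene models in the input GTF files. Each isoform in each input GTF is compared against this reference and tagged as overlapping, matching, or novel.

-R
    Used together with -r: ignore reference transcripts that are not overlapped by any transfrag in the input GTF files.

-s
    Points to the genomic DNA sequence. If it is a directory, each contig must be a separate FASTA file; if it is a single FASTA file, all contigs must be in it. Lower-case bases are used to classify the corresponding transcripts as repeats.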

4. Output

Three files are written to the current directory:

.stats
    Accuracy statistics from the comparison with the reference annotation. The Sn and Sp columns show sensitivity and specificity; the fSn and fSp columns show "fuzzy" variants of the same calculations, which allow some tolerance in the coordinates. (If -o is not set, the file prefix defaults to cuffcmp.)

.combined.gtf
    The union of all transfrags from all samples. A transfrag present in several samples is reported only once.

.tracking
    This file matches transcripts up between samples. Each row contains a transcript structure that is present in one or more input GTF files. Because the transcripts will generally have different IDs (unless you assembled your RNA-Seq reads against a reference transcriptome), cuffcompare examines the structure of each of the transcripts, matching transcripts that agree on the coordinates and order of all of their introns, as well as strand. Matching transcripts are allowed to differ on the length of the first and last exons, since these lengths will naturally vary from sample to sample due to the random nature of sequencing.

Example:



TCONS_00000045 XLOC_000023 Tcea|uc007afj.1    j            q1:exp.115|exp.115.0|100|3.061355|0.350242|0.350207      q2:60hr.292|60hr.292.0|100|4.094084|0.000000|0.000000

In this example, a transcript present in the two input files, called exp.115.0 in the first and 60hr.292.0 in the second, doesn't match any reference transcript exactly, but shares exons with uc007afj.1, an isoform of the gene Tcea, as indicated by the class code j. The columns are as follows:

1  Cufflinks transfrag id    TCONS_00000045   Internal transfrag id
2  Cufflinks locus id        XLOC_000023      Internal locus id
3  Reference gene id         Tcea             Gene id in the reference annotation, or "-" if no reference transcript matched
4  Reference transcript id   uc007afj.1       Transcript id in the reference annotation, or "-" if no reference transcript matched
5  Class code                j                Type of match between the transfrag and the reference transcript

In the file itself, columns 3 and 4 are joined by "|". Each column after the class code describes the transfrag in one sample, in this form:

qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo>|<conf_hi>|<cov>|<len>
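The row layout just described can be split mechanically. The following sketch parses the example row into its fixed columns and per-sample entries. It splits on whitespace, which is enough for the example. The function name and the returned dictionary keys are choices made for this illustration.

```python
def parse_tracking_row(line):
    """Split one cuffcompare .tracking row into fixed columns and
    per-sample entries (each entry becomes a list of '|'-separated fields)."""
    cols = line.split()
    ref_gene, _, ref_transcript = cols[2].partition("|")
    samples = {}
    for entry in cols[4:]:
        if entry == "-":
            continue  # transfrag not present in this sample
        label, _, payload = entry.partition(":")
        samples[label] = payload.split("|")
    return {
        "transfrag_id": cols[0],
        "locus_id": cols[1],
        "ref_gene": ref_gene,
        "ref_transcript": ref_transcript,
        "class_code": cols[3],
        "samples": samples,
    }


row = (
    "TCONS_00000045 XLOC_000023 Tcea|uc007afj.1 j "
    "q1:exp.115|exp.115.0|100|3.061355|0.350242|0.350207 "
    "q2:60hr.292|60hr.292.0|100|4.094084|0.000000|0.000000"
)
parsed = parse_tracking_row(row)
print(parsed["class_code"], parsed["samples"]["q2"][1])
```

Note that the number of fields per sample entry differs between Cufflinks versions (the example row has six), which is why the sketch keeps them as a plain list instead of naming each one.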

The .refmap and .tmap files are written to the same directory as each input GTF.

.refmap columns:

1  Reference gene name       Gene name in the reference GTF
2  Reference transcript id   Transcript id in the reference GTF
3  Class code                How the Cufflinks transcript matches the reference transcript: c means a partial match, = means a full match
4  Cufflinks matches         Ids of the Cufflinks transcripts that match the reference transcript

.tmap columns:

1  Reference gene name       Gene name in the reference GTF
2  Reference transcript id   Transcript id in the reference GTF
3  Class code                How the Cufflinks transcript matches the reference transcript (see the class code table)
4  Cufflinks gene id
5  Cufflinks transcript id
6  Fraction of major isoform (FMI)
7  FPKM
8  FPKM_conf_lo
9  FPKM_conf_hi
10 Coverage
11 Length
12 Major isoform ID



Class codes:

Priority	Code	Description
1	`=`	Complete match of intron chain
2	`c`	Contained
3	`j`	Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript
4	`e`	Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment.
5	`i`	A transfrag falling entirely within a reference intron
6	`o`	Generic exonic overlap with a reference transcript
7	`p`	Possible polymerase run-on fragment (within 2Kbases of a reference transcript)
8	`r`	Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case
9	`u`	Unknown, intergenic transcript
10	`x`	Exonic overlap with reference on the opposite strand
11	`s`	An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors)
12	`.`	(.tracking file only, indicates multiple classifications)

VI. Using Cuffdiff

1. Overview of Cuffdiff

Cuffdiff finds significant differences in transcript expression.

2. Usage

Cuffdiff detects significant changes in transcript expression, splicing, and promoter use.

$ cuffdiff [options]* <transcripts.gtf> <sample1_replicate1.sam[,...,sample1_replicateM.sam]> <sample2_replicate1.sam[,...,sample2_replicateM.sam]> ... [sampleN_replicate1.sam[,...,sampleN_replicateM.sam]]

Here transcripts.gtf is a file produced by cufflinks, cuffcompare, or cuffmerge, or by another program. Multiple replicates of one sample are separated by commas. When more than one sample is given, cuffdiff tests for differential expression between the samples.

A typical example:
$ cuffdiff --labels label1,label2 -p 8 --time-series --multi-read-correct --library-type fr-unstranded --poisson-dispersion transcripts.gtf sample1.sam sample2.sam

Cuffdiff accepts BAM/SAM files or the CXB files produced by cuffquant. BAM and SAM files can be mixed, but BAM/SAM files cannot be mixed with CXB files.

3. Options

-h | --help

-o | --output-dir  default: ./
    Output directory.

-L | --labels  default: q1,q2,...qN
    A label for each sample or condition.

-p | --num-threads  default: 1
    Number of CPU threads to use.

-T | --time-series
    Compare the samples in the order given instead of testing all pairs: the second SAM against the first, the third against the second, the fourth against the third, and so on.

-N | --upper-quartile-norm
    Normalize by the 75th percentile of the number of fragments mapped to individual loci, instead of by the total. This helps when looking for differential expression among low-abundance genes and transcripts.

--total-hits-norm
    When computing FPKM, count all fragments, including those not compatible with any reference transcript. This is the opposite of the next option and is disabled by default.

--compatible-hits-norm
    When computing FPKM, count only fragments that are compatible with a reference transcript. Enabled by default; it reduces the influence of ribosomal RNA reads on the expression estimates.

-b | --frag-bias-correct (usually genome.fa)
    Supply a FASTA file so that Cufflinks runs its bias detection and correction algorithm, which can noticeably improve the accuracy of transcript abundance estimates.

-u | --multi-read-correct
    Run an initial estimation step so that reads mapping to multiple genomic locations are weighted more accurately.

-c | --min-alignment-count  default: 10
    If fewer fragments than this align to a locus, no significance test is performed for it, and its expression is treated as not significantly different.

-M | --mask-file
    Supply a GFF file. All reads that align within the transcripts in this file are ignored. It usually holds rRNA annotation, and may also hold mitochondrial or other transcripts you want ignored. Removing these RNAs helps the estimation of mRNA expression.

--FDR  default: 0.05
    The allowed false discovery rate.

--library-type  default: fr-unstranded
    Library type of the reads. For strand-specific libraries the alignments carry an XS tag. Standard Illumina data is usually fr-unstranded.

--dispersion-method
    See the official documentation.

Other advanced options:

-m | --frag-len-mean  default: 200
    Expected mean fragment length. Cufflinks now learns the mean fragment length itself, so setting this by hand is not recommended.

-s | --frag-len-std-dev  default: 80
    Standard deviation of the fragment length. Cufflinks now learns this value itself, so setting it by hand is not recommended.

-v/--verbose
    Print status updates and other diagnostic information.

-q/--quiet
    Print nothing except warnings and errors.

--no-update-check
    Turn off the automatic check for a newer Cufflinks release.

-F/--min-isoform-fraction <0.0-1.0>
    Changing this is not recommended. Alternative isoforms whose abundance is below this fraction of the major isoform are rounded down to zero. Default: 1e-5.

--max-bundle-frags
    Maximum number of fragments a locus may have before it is skipped. Default: 1000000.

--num-frag-count-draws (default: 100) and --num-frag-assign-draws (default: 50)

--min-reps-for-js-test
    Cuffdiff won't test genes for differential regulation unless the conditions in question have at least this many replicates. Default: 3.

--no-effective-length-correction
    Cuffdiff will not employ its "effective" length normalization to transcript FPKM.

--no-length-correction
    Cuffdiff will not normalize fragment counts by transcript length at all. Use this when the fragment count is independent of the size of the features being quantified (for example small RNA libraries, where no fragmentation takes place, or 3 prime end sequencing, where the sampled RNA fragments are all essentially the same length). Use with caution.

--max-mle-iterations
    Number of iterations allowed in the maximum likelihood estimation. Default: 5000.

--poisson-dispersion
    Use the Poisson fragment dispersion model instead of learning one in each condition.
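The effect of -T/--time-series on which sample pairs get tested can be illustrated with a short sketch. It only enumerates the pairs as the option description states them and says nothing about the statistics. The labels are invented time points.

```python
from itertools import combinations


def comparisons(labels, time_series=False):
    """Return the (first, second) sample pairs that would be compared."""
    if time_series:
        # Each sample is compared only with the one given just before it.
        return list(zip(labels, labels[1:]))
    # Default: every pair of samples is compared.
    return list(combinations(labels, 2))


labels = ["0hr", "12hr", "24hr", "36hr"]
print(comparisons(labels))
print(comparisons(labels, time_series=True))
```

With N samples, the default gives N(N-1)/2 comparisons while the time-series mode gives N-1, which also shortens the run.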

4. Cuffdiff output

1. FPKM tracking files: cuffdiff computes the FPKM of each transcript, primary transcript, and gene in each sample. The FPKM of a primary transcript or gene is the sum of the FPKMs of the transcripts in that primary transcript group or gene group.

`isoforms.fpkm_tracking`	Transcript FPKMs
`genes.fpkm_tracking`	Gene FPKMs. Tracks the summed FPKM of transcripts sharing each `gene_id`
`cds.fpkm_tracking`	Coding sequence FPKMs. Tracks the summed FPKM of transcripts sharing each `p_id`, independent of `tss_id`
`tss_groups.fpkm_tracking`	Primary transcript FPKMs. Tracks the summed FPKM of transcripts sharing each `tss_id`



2. Count tracking files: cuffdiff estimates the number of fragments that originated from each transcript, primary transcript, and gene in each sample. The count for a primary transcript or gene is the sum of the counts of the transcripts in that primary transcript group or gene group.

`isoforms.count_tracking`	Transcript counts
`genes.count_tracking`	Gene counts. Tracks the summed counts of transcripts sharing each `gene_id`
`cds.count_tracking`	Coding sequence counts. Tracks the summed counts of transcripts sharing each `p_id`, independent of `tss_id`
`tss_groups.count_tracking`	Primary transcript counts. Tracks the summed counts of transcripts sharing each `tss_id`

3. Read group tracking files: the expression and fragment count of each transcript, primary transcript, and gene in each replicate.

`isoforms.read_group_tracking`	Transcript read group tracking
`genes.read_group_tracking`	Gene read group tracking. Tracks the summed expression and counts of transcripts sharing each `gene_id` in each replicate
`cds.read_group_tracking`	Coding sequence FPKMs. Tracks the summed expression and counts of transcripts sharing each `p_id`, independent of `tss_id` in each replicate
`tss_groups.read_group_tracking`	Primary transcript FPKMs. Tracks the summed expression and counts of transcripts sharing each `tss_id` in each replicate

4. Differential expression tests: tests of differential expression between samples for spliced transcripts, primary transcripts, genes, and coding sequences. For each pair of samples x and y, the following four files are produced:

`isoform_exp.diff`	Transcript differential FPKM.
`gene_exp.diff`	Gene differential FPKM. Tests differences in the summed FPKM of transcripts sharing each `gene_id`
`tss_group_exp.diff`	Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each `tss_id`
`cds_exp.diff`	Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each `p_id` independent of `tss_id`

Each file has the following format:

Column number	Column name	Example	Description
1	Tested id	`XLOC_000001`	A unique identifier describing the transcript, gene, primary transcript, or CDS being tested
2	gene	`Lypla1`	The `gene_name`(s) or `gene_id`(s) being tested
3	locus	`chr1:4797771-4835363`	Genomic coordinates for easy browsing to the genes or transcripts being tested.
4	sample 1	`Liver`	Label (or number if no labels provided) of the first sample being tested
5	sample 2	`Brain`	Label (or number if no labels provided) of the second sample being tested
6	Test status	`NOTEST`	Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing.
7	FPKM_x	`8.01089`	FPKM of the gene in sample x
8	FPKM_y	`8.551545`	FPKM of the gene in sample y
9	log2(FPKM_y/FPKM_x)	`0.06531`	The (base 2) log of the fold change y/x
10	test stat	`0.860902`	The value of the test statistic used to compute significance of the observed change in FPKM
11	p value	`0.389292`	The uncorrected p-value of the test statistic
12	q value	`0.985216`	The FDR-adjusted p-value of the test statistic
13	significant	`no`	Can be either "yes" or "no", depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple-testing
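A common follow-up is to pull the significant rows out of gene_exp.diff. The sketch below does this on an in-memory copy of the file. It assumes the header names written by Cufflinks 2.x (`gene`, `status`, `log2(fold_change)`, `q_value`, `significant`), so check the header of your own file first. The first data row reuses the example values from the table; the second is an invented gene added so that something passes the filter.

```python
import csv
import io


def significant_rows(diff_text, min_abs_log2fc=1.0):
    """Yield (gene, log2 fold change, q value) for rows with status OK,
    significant == 'yes', and |log2 fold change| >= min_abs_log2fc."""
    reader = csv.DictReader(io.StringIO(diff_text), delimiter="\t")
    for row in reader:
        if row["status"] != "OK" or row["significant"] != "yes":
            continue
        log2fc = float(row["log2(fold_change)"])
        if abs(log2fc) >= min_abs_log2fc:
            yield row["gene"], log2fc, float(row["q_value"])


DIFF_TEXT = (
    "test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\tvalue_1\tvalue_2\t"
    "log2(fold_change)\ttest_stat\tp_value\tq_value\tsignificant\n"
    "XLOC_000001\tXLOC_000001\tLypla1\tchr1:4797771-4835363\tLiver\tBrain\tNOTEST\t"
    "8.01089\t8.551545\t0.06531\t0.860902\t0.389292\t0.985216\tno\n"
    # Invented row for illustration only.
    "XLOC_000002\tXLOC_000002\tGeneB\tchr1:100-200\tLiver\tBrain\tOK\t"
    "2.0\t16.0\t3.0\t5.1\t0.0001\t0.004\tyes\n"
)
print(list(significant_rows(DIFF_TEXT)))
```

Filtering on the test status as well as on `significant` keeps NOTEST, LOWDATA, HIDATA, and FAIL rows out of downstream lists.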

5. Differential splicing tests: splicing.diff. For each primary transcript, the amount of differential splicing among its isoforms. Only primary transcripts with two or more isoforms are listed.

Column number	Column name	Example	Description
1	Tested id	`TSS10015`	A unique identifier describing the primary transcript being tested.
2	gene name	`Rtkn`	The `gene_name` or `gene_id` that the primary transcript being tested belongs to
3	locus	`chr6:83087311-83102572`	Genomic coordinates for easy browsing to the genes or transcripts being tested.
4	sample 1	`Liver`	Label (or number if no labels provided) of the first sample being tested
5	sample 2	`Brain`	Label (or number if no labels provided) of the second sample being tested
6	Test status	`OK`	Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing.
7	Reserved	`0`
8	Reserved	`0`
9	√JS(x,y)	`0.22115`	The splice overloading of the primary transcript, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the splice variants
10	test stat	`0.22115`	The value of the test statistic used to compute significance of the observed overloading, equal to √JS(x,y)
11	p value	`0.000174982`	The uncorrected p-value of the test statistic.
12	q value	`0.985216`	The FDR-adjusted p-value of the test statistic
13	significant	`yes`	Can be either "yes" or "no", depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple-testing
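Column 9 of this table is the square root of the Jensen-Shannon divergence between the relative isoform abundances in the two samples. A minimal sketch of that quantity is given here. The choice of log base 2 (which bounds the result between 0 and 1) is an assumption of this sketch, and Cuffdiff's own calculation also accounts for uncertainty in the abundance estimates, so its reported values need not match.

```python
import math


def sqrt_js(p, q):
    """Square root of the Jensen-Shannon divergence (log base 2) between
    two abundance vectors of equal length."""
    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a == 0 contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    p, q = normalize(p), normalize(q)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    js = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, js))  # guard against tiny negative rounding


# Two isoforms whose relative abundances shift between samples x and y.
print(sqrt_js([0.7, 0.3], [0.2, 0.8]))
```

Identical isoform proportions give 0, and completely disjoint isoform usage gives 1, which is why the measure is convenient for comparing splicing between samples.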



6. Differential coding output: cds.diff. For each gene, the amount of differential CDS output between samples. Only genes producing two or more distinct CDS (multi-protein genes) are listed.

Column number	Column name	Example	Description
1	Tested id	`XLOC_000002-[chr1:5073200-5152501]`	A unique identifier describing the gene being tested.
2	gene name	`Atp6v1h`	The `gene_name` or `gene_id`
3	locus	`chr1:5073200-5152501`	Genomic coordinates for easy browsing to the genes or transcripts being tested.
4	sample 1	`Liver`	Label (or number if no labels provided) of the first sample being tested
5	sample 2	`Brain`	Label (or number if no labels provided) of the second sample being tested
6	Test status	`OK`	Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing.
7	Reserved	`0`
8	Reserved	`0`
9	√JS(x,y)	`0.0686517`	The CDS overloading of the gene, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the coding sequences
10	test stat	`0.0686517`	The value of the test statistic used to compute significance of the observed overloading, equal to √JS(x,y)
11	p value	`0.00546783`	The uncorrected p-value of the test statistic
12	q value	`0.985216`	The FDR-adjusted p-value of the test statistic
13	significant	`yes`	Can be either "yes" or "no", depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple-testing



7. Differential promoter use: promoters.diff. Differential promoter use between samples. Only genes producing two or more distinct primary transcripts (multi-promoter genes) are listed.

8. Read group info: read_groups.info. Lists, for each replicate, the key properties cuffdiff used during quantification.

Column number	Column name	Example	Description
1	file	`mCherry_rep_A/accepted_hits.bam`	BAM or SAM file containing the data for the read group
2	condition	`mCherry`	Condition to which the read group belongs
3	replicate_num	`0`	Replicate number of the read group
4	total_mass	`4.72517e+06`	Total number of fragments for the read group
5	norm_mass	`4.72517e+06`	Fragment normalization constant used during calculation of FPKMs.
6	internal_scale	`1.23916`	Internal scaling factor, used to transform replicates of a single condition onto the "internal" common count scale.
7	external_scale	`0.96`	External scaling factor, used to transform counts from different conditions onto an internal common count scale.



9. Run info: run.info. Information about the run.

The FPKM tracking files have the following format:

Column number	Column name	Example	Description
1	tracking_id	`TCONS_00000001`	A unique identifier describing the object (gene, transcript, CDS, primary transcript)
2	class_code	`=`	The class_code attribute for the object, or "-" if not a transcript, or if class_code isn't present
3	nearest_ref_id	`NM_008866.1`	The reference transcript to which the class code refers, if any
4	gene_id	`NM_008866`	The gene_id(s) associated with the object
5	gene_short_name	`Lypla1`	The gene_short_name(s) associated with the object
6	tss_id	`TSS1`	The tss_id associated with the object, or "-" if not a transcript/primary transcript, or if tss_id isn't present
7	locus	`chr1:4797771-4835363`	Genomic coordinates for easy browsing to the object
8	length	`2447`	The number of base pairs in the transcript, or '-' if not a transcript/primary transcript
9	coverage	`43.4279`	Estimate for the absolute depth of read coverage across the object
10	q0_FPKM	`8.01089`	FPKM of the object in sample 0
11	q0_FPKM_lo	`7.03583`	The lower bound of the 95% confidence interval on the FPKM of the object in sample 0
12	q0_FPKM_hi	`8.98595`	The upper bound of the 95% confidence interval on the FPKM of the object in sample 0
13	q0_status	`OK`	Quantification status for the object in sample 0. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution.

The count tracking files have the following format:

1 tracking_id TCONS_00000001 A unique identifier describing the object (gene, transcript, CDS, primary transcript)

2 q0_count 201.334 Estimated (externally scaled) number of fragments generated by the object in sample 0

3 q0_count_variance 5988.24 Estimated variance in the number of fragments generated by the object in sample 0

4 q0_count_uncertainty_var 170.21 Estimated variance in the number of fragments generated by the object in sample 0 due to fragment assignment uncertainty.

5 q0_count_dispersion_var 4905.63 Estimated variance in the number of fragments generated by the object in sample 0 due to cross-replicate variability.

6 q0_status OK Quantification status for the object in sample 0. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution.





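VII. Problems encountered when using Cufflinks

1. With the latest version of cuffdiff, comparing RNA-seq samples that have no replicates gives no differentially expressed genes. Why?

From v2.0.1 onward, cuffdiff no longer seems to support RNA-seq data without replicates. Using an earlier version works around this.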

VIII. Cuffquant

Cuffquant quantifies gene and transcript expression for a single BAM file. It produces a CXB file, abundances.cxb, which can be used as input to cuffdiff (this speeds cuffdiff up) or as input to cuffnorm.

Usage: cuffquant [options]* <annotation.(gtf/gff)> <aligned_reads.(sam/bam)>

Its options have the same meaning as the options described earlier:

-h/--help; -o/--output-dir; -p/--num-threads; -M/--mask-file; -b/--frag-bias-correct; -u/--multi-read-correct; --library-type; -m/--frag-len-mean; -s/--frag-len-std-dev; --max-mle-iterations; --max-bundle-frags; --no-effective-length-correction; --no-length-correction; -v/--verbose; -q/--quiet; --no-update-check

IX. Cuffnorm

Cuffnorm takes the output of cuffquant as input and simply computes normalized expression levels for genes and transcripts. Use cuffnorm when all you want is a set of comparable expression values for genes, transcripts, CDS groups, and TSS groups, for example to draw a heatmap or a scatter plot of gene expression.

cuffnorm [options]* <transcripts.gtf> <sample1_replicate1.sam[,...,sample1_replicateM.sam]> <sample2_replicate1.sam[,...,sample2_replicateM.sam]> ... [sampleN_replicate1.sam[,...,sampleN_replicateM.sam]]

Options: these are similar to the options described earlier.

-h/--help; -o/--output-dir; -L/--labels; -p/--num-threads; --total-hits-norm (disabled by default); --compatible-hits-norm (enabled by default); --library-type; --library-norm-method; --output-format; -v/--verbose; -q/--quiet; --no-update-check

Cuffnorm reports the normalized expression level of each gene, transcript, TSS group, and CDS group in the experiment. It does not test for differential expression. By default the output is written in the "simple-table" format, which differs from the format of the cuffdiff output files. If you want files in the cuffdiff format, add: --output-format cuffdiff

Cuffnorm reports FPKM values and normalized estimates of the number of fragments that originate from each gene, transcript, TSS group, and CDS group. Because these values are already normalized, they are not suitable as input for downstream tools that require raw counts.

You can also create a file, for example sample_sheet.txt, that lists the paths of the SAM files and serves as input to cuffdiff or cuffnorm. Its format is:

sample_id      group_label
C1_R1.sam       C1
C1_R2.sam       C1
C2_R1.sam       C2
C2_R2.sam       C2
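The sample sheet maps each alignment file to a condition. As a small illustration, the sketch below groups the files by `group_label` and prints them in the comma-separated, per-condition form used on the cuffdiff and cuffnorm command lines. The function name is made up, and splitting on any whitespace is a simplification of this sketch.

```python
from collections import defaultdict


def group_replicates(sheet_text):
    """Map each group_label to the list of its sample files, in file order."""
    groups = defaultdict(list)
    lines = [line for line in sheet_text.splitlines() if line.strip()]
    for line in lines[1:]:  # skip the header row
        sample_id, group_label = line.split()
        groups[group_label].append(sample_id)
    return dict(groups)


SHEET = """sample_id\tgroup_label
C1_R1.sam\tC1
C1_R2.sam\tC1
C2_R1.sam\tC2
C2_R2.sam\tC2
"""

groups = group_replicates(SHEET)
# Replicates joined by commas, conditions separated by spaces.
print(" ".join(",".join(files) for files in groups.values()))
```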

The output files are:

FPKM tracking files: estimated expression levels.

Count tracking files: estimated fragment count values.

Read group tracking files: per-replicate expression and count data.

For each of genes, transcripts, TSS groups, and CDS groups, cuffnorm reports two kinds of files: *.fpkm_table files and *.count_table files.
