A Look at Transcript-Level Evidence for Chinese Spring Genes

洋溢九洲 2021-01-04

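Going from an assembled genome sequence to a gene annotation is simple in one sense and hard in another. The hard part is reaching an accuracy above 95% at the transcript level. We have covered some aspects of gene annotation in earlier posts.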

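Gene annotation generally means using bioinformatics methods to determine the position, structure, and function of the genes in an assembled genome. It typically comprises ab initio annotation, homology-based annotation, and annotation based on transcriptome and proteome data. Transcriptome- and proteome-based annotation is currently the most accurate approach, but because it is impossible to sample the transcriptome across every tissue and time point, the results of homology-based and ab initio annotation are needed as a supplement. Gene annotation is the foundation of molecular biology research: if the annotation is wrong or incomplete, the studies built on it suffer as well.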

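Many algorithms and programs have been developed for gene annotation. Ab initio tools include SNAP (Korf 2004), TwinScan (Korf et al. 2001), FGENESH (Salamov and Solovyev 2000), Augustus (Stanke et al. 2006), Genscan (Burge and Karlin 1997), and GAZE (Howe et al. 2002). These programs usually need a set of known genes as training data and then predict genes with the trained model. In theory, as long as the training set contains enough genes, this approach can predict every gene locus, but it cannot accurately delimit the exon-intron structure of those genes.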

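Homology-based annotation maps transcript or protein sequences from closely related species onto the genome to be annotated. Commonly used tools include BLAST, BLAT (Kent 2002), Splign (Kapustin et al. 2008), Spidey (Wheelan et al. 2001), sim4 (Florea et al. 1998), Exonerate (Slater and Birney 2005), gmap (Wu and Watanabe 2005), Magic-BLAST (Boratyn et al. 2019), and minimap2 (Li 2018). Of these, gmap, Magic-BLAST, and minimap2 are newer-generation aligners that can quickly align large numbers of transcript sequences to a genome. Homology-based annotation helps discover gene loci, but because genomes differ between species, gene structure and expression status still need to be supported by transcript-level evidence from the species itself.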
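To make the idea of a spliced transcript-to-genome alignment concrete, here is a minimal sketch of how exon and intron coordinates can be read off one alignment record. It assumes the aligner writes SAM output in which spliced gaps appear as `N` operations in the CIGAR string; the function names `exon_blocks` and `introns` are made up for this illustration and do not belong to any of the tools listed above.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def exon_blocks(pos, cigar):
    """Return 1-based inclusive exon intervals on the reference.

    pos   -- 1-based leftmost reference position (the SAM POS field)
    cigar -- the SAM CIGAR string of a spliced alignment
    """
    blocks = []
    start = pos    # first reference base of the current exon block
    cursor = pos   # next reference base that has not been consumed yet
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "MD=X":
            # These operations consume reference bases inside an exon.
            cursor += length
        elif op == "N":
            # A skipped reference region, interpreted here as an intron.
            if cursor > start:
                blocks.append((start, cursor - 1))
            cursor += length
            start = cursor
        # I, S, H and P do not consume reference bases.
    if cursor > start:
        blocks.append((start, cursor - 1))
    return blocks


def introns(blocks):
    """Return the 1-based inclusive gaps between consecutive exon blocks."""
    return [(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(blocks, blocks[1:])]


if __name__ == "__main__":
    blocks = exon_blocks(1000, "50M200N30M2I20M100N40M")
    print(blocks)
    print(introns(blocks))
```

The intron coordinates are the part worth comparing against an existing gene model, since splice junctions are where ab initio and cross-species predictions tend to disagree with the transcript evidence.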

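Transcriptome-based annotation aligns transcript sequences from different sources to the genome and then annotates genes according to where the transcripts land. The alignment software is the same as that used for aligning homologous transcripts above. The transcript sequences usually come from ESTs, full-length cDNAs, transcripts obtained by second-generation sequencing, and transcripts obtained by third-generation sequencing. Because second-generation sequencing works on fragmented molecules, assembling the reads into full-length transcripts produces some false-positive transcripts. Transcripts from third-generation sequencing avoid this problem, but their high error rate and relatively high cost mean they have been used in only a few studies. In addition, to obtain gene orientation and more precise transcription start and end sites, data from second-generation platforms such as strand-specific RNA-seq, Cap Analysis Gene Expression (CAGE-seq), and PolyA-seq have also been added to genome annotation pipelines (Wang et al. 2019). Compared with transcriptomics, high-throughput proteomics has not yet had a decisive breakthrough. Ribosome profiling (Ribo-seq) can stand in for high-throughput proteomics to some extent. It captures the mRNA fragments that are being translated, but I have not yet seen a report of this kind of data being used in an annotation pipeline. Normal transcription of a genome can also produce transcriptional noise that does not correspond to real genes, so annotation should take expression level into account and exclude likely noise.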
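The last point, excluding likely transcriptional noise by expression level, can be sketched as a simple filter. This is only an illustration under stated assumptions: the expression values are taken to be TPM per sample, and the thresholds (`min_tpm=1.0`, `min_samples=2`), the function name, and the transcript IDs are all hypothetical, not values from this post or from any published pipeline.

```python
def expressed_transcripts(tpm_by_transcript, min_tpm=1.0, min_samples=2):
    """Keep transcripts that reach min_tpm in at least min_samples samples.

    tpm_by_transcript -- dict mapping a transcript ID to a list of TPM
                         values, one value per RNA-seq sample
    Both thresholds are illustrative assumptions, not recommended values.
    """
    kept = []
    for transcript_id, tpms in tpm_by_transcript.items():
        # Count the samples in which this transcript is clearly expressed.
        n_expressed = sum(1 for value in tpms if value >= min_tpm)
        if n_expressed >= min_samples:
            kept.append(transcript_id)
    return sorted(kept)


if __name__ == "__main__":
    # Hypothetical transcript IDs and TPM values.
    table = {
        "tx_a": [12.0, 8.5, 0.0, 3.1],   # expressed in three samples
        "tx_b": [0.2, 0.0, 0.4, 0.1],    # low everywhere, likely noise
        "tx_c": [0.0, 25.0, 0.0, 0.0],   # seen in a single sample only
    }
    print(expressed_transcripts(table))
```

Requiring support in more than one sample is a deliberate trade-off: it drops one-off signals, but it can also drop genes expressed in a single tissue or stage, so the sample-count threshold should be chosen with the sampling design in mind.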

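To improve the accuracy and completeness of gene annotation, the three approaches above can be combined. Some software integrates all three into a single pipeline, such as MAKER (Cantarel et al. 2008), MAKER-P (Campbell et al. 2014), PASA (Haas et al. 2003), and Funannotate[1]. Some comprehensive biological database sites have also developed their own annotation pipelines, such as the Gramene pipeline (Liang et al. 2009), the Ensembl gene annotation system (Aken et al. 2016), the NCBI Eukaryotic Genome Annotation Pipeline[2], and PGSB[3]. As transcripts from third-generation sequencing become more common, annotation software built around third-generation transcriptome data has appeared as well, such as LoReAn (Cook et al. 2019) and mikado (Venturini et al. 2018). Moreover, with falling sequencing costs and better assembly methods, assembling a new genome de novo has become easier. For species that already have a gene annotation, the existing annotation can be transferred to the new genome, and bioinformatics tools are available that make this convenient (Konig et al. 2016; Song et al. 2019). In short, software that integrates different annotation approaches greatly simplifies gene annotation, and manual curation can then be applied on top of it to correct gene models that may still be wrong. Apollo, the tool developed by Dunn et al. (2019), makes that manual curation much more convenient.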
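As a rough illustration of how transcript evidence can be used to prioritise gene models for manual curation, the sketch below labels a predicted model by how well its introns are supported by aligned transcripts. This is not how MAKER, PASA, or any other pipeline named above works internally; the three labels, the function name `classify_model`, and the coordinates are assumptions made up for the example.

```python
def classify_model(model_introns, evidence_intron_chains):
    """Label a gene model by its intron support from transcript alignments.

    model_introns          -- list of (start, end) introns of the model
    evidence_intron_chains -- list of intron lists, one per aligned transcript

    Returns "single_exon" if the model has no introns, "full" if one
    transcript contains every model intron, "partial" if at least one
    model intron is seen in some transcript, and "none" otherwise.
    """
    model = list(model_introns)
    if not model:
        return "single_exon"
    seen = set()
    for chain in evidence_intron_chains:
        chain_set = set(chain)
        # A single transcript supporting the whole intron chain is the
        # strongest evidence considered in this sketch.
        if all(intron in chain_set for intron in model):
            return "full"
        seen.update(chain_set)
    if any(intron in seen for intron in model):
        return "partial"
    return "none"


if __name__ == "__main__":
    # Hypothetical coordinates for one model and two aligned transcripts.
    model = [(1050, 1249), (1300, 1399)]
    evidence = [[(1050, 1249)], [(1050, 1249), (1300, 1399), (1500, 1600)]]
    print(classify_model(model, evidence))
```

Models labelled "partial" or "none" would be the natural ones to open first in a curation tool, while "full" models could be spot-checked.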

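The same annotation approaches apply to wheat genomes. The annotations of Triticum urartu, Aegilops tauschii, wild emmer, cultivated emmer, and Chinese Spring all used the three approaches above and the associated software. In annotating the Chinese Spring genome, the International Wheat Genome Sequencing Consortium (IWGSC) used several methods: besides the PGSB and PASA pipelines, it also used TriAnnot, a pipeline developed specifically for annotating wheat genomes (Leroy et al. 2012). TriAnnot covers transposon annotation, gene annotation, and the subsequent functional annotation of genes. Even so, practical use has shown that the current Chinese Spring gene annotation still contains errors. For example, the wheat male sterility gene Ms2 is absent from the current annotation version (Ni et al. 2017).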

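Everything above focuses on methods and tools. In practice, I have also worked on improving the gene annotation of Chinese Spring and of barley myself, and some of the results are already on the wheat multi-omics website. The work relies mainly on transcript-level evidence, such as wheat ESTs, RNA-seq data, and PacBio data. The RNA-seq data alone amount to more than 2,000 samples.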

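After wrestling with all sorts of tools and methods, though, I found that reaching high accuracy at the transcript level requires a large volume of deeply sequenced full-length transcripts (PacBio data). Second-generation RNA-seq data alone are not a realistic way to get there. On top of that, manual curation with Apollo is needed before the accuracy gets reasonably high. I once tried Apollo on barley and made a rough time estimate: working on nothing else and going through the genes one by one would take about half a year. More to the point, this kind of work gets extremely boring after a while. For a stretch I put my spare time into it and checked about 110 Mb over a few days, but I did not keep it up. We often talk about time management, and I think that is the wrong framing. What really needs managing is a person's energy and attention. Often the time is there but the energy is not, especially for those of us who do not exercise regularly.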

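Last year I heard that the IWGSC was working on version 2.0 of the annotation, but it has not been released so far. I would rather they do it properly, even if it comes out later. Whether or not it is out, when you look at the genes in a particular interval, it is worth consulting this transcript-level evidence.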
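For readers who want to pull the annotated features and transcript evidence for one interval, here is a minimal sketch. It assumes both are available as GFF3 text (nine tab-separated columns, 1-based inclusive coordinates); the function name, the chromosome name `chr1A`, and the feature IDs are hypothetical and do not refer to real Chinese Spring records.

```python
def features_in_interval(gff_lines, seqid, start, end):
    """Return GFF3 features on seqid that overlap [start, end].

    Coordinates are 1-based and inclusive, as in GFF3. Each hit is a
    tuple (type, start, end, strand, attributes).
    """
    hits = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 feature line
        f_start, f_end = int(cols[3]), int(cols[4])
        # Two closed intervals overlap when each starts before the other ends.
        if cols[0] == seqid and f_start <= end and f_end >= start:
            hits.append((cols[2], f_start, f_end, cols[6], cols[8]))
    return hits


if __name__ == "__main__":
    # Hypothetical records: one annotated gene and two aligned transcripts.
    gff = [
        "##gff-version 3",
        "chr1A\tannotation\tgene\t1000\t5000\t.\t+\t.\tID=gene_x",
        "chr1A\tpacbio\tmatch\t4800\t9000\t.\t+\t.\tID=read_1",
        "chr1B\tpacbio\tmatch\t1000\t5000\t.\t-\t.\tID=read_2",
    ]
    for hit in features_in_interval(gff, "chr1A", 4500, 6000):
        print(hit)
```

Running the same query against the annotation track and each evidence track makes it easy to see, side by side, where a gene model in the interval lacks transcript support.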

Aken BL, Ayling S, Barrell D, Clarke L, Curwen V, Fairley S, Fernandez Banet J, Billis K, Garcia Giron C, Hourlier T, Howe K, Kahari A, Kokocinski F, Martin FJ, Murphy DN, Nag R, Ruffier M, Schuster M, Tang YA, Vogel JH, White S, Zadissa A, Flicek P, Searle SM (2016) The Ensembl gene annotation system. Database (Oxford) 2016

Boratyn GM, Thierry-Mieg J, Thierry-Mieg D, Busby B, Madden TL (2019) Magic-BLAST, an accurate RNA-seq aligner for long and short reads. BMC Bioinformatics 20:405

Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J Mol Biol 268:78-94

Campbell MS, Law M, Holt C, Stein JC, Moghe GD, Hufnagel DE, Lei J, Achawanantakun R, Jiao D, Lawrence CJ, Ware D, Shiu SH, Childs KL, Sun Y, Jiang N, Yandell M (2014) MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant Physiol 164:513-524

Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sanchez Alvarado A, Yandell M (2008) MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18:188-196

Cook DE, Valle-Inclan JE, Pajoro A, Rovenich H, Thomma B, Faino L (2019) Long-Read Annotation: automated eukaryotic genome annotation based on long-read cDNA sequencing. Plant Physiol 179:38-54

Dunn NA, Unni DR, Diesh C, Munoz-Torres M, Harris NL, Yao E, Rasche H, Holmes IH, Elsik CG, Lewis SE (2019) Apollo: Democratizing genome annotation. PLoS Comput Biol 15:e1006790

Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res 8:967-974

Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK, Jr., Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O (2003) Improving the *Arabidopsis* genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31:5654-5666

Howe KL, Chothia T, Durbin R (2002) GAZE: a generic framework for the integration of gene-prediction data by dynamic programming. Genome Res 12:1418-1427

Kapustin Y, Souvorov A, Tatusova T, Lipman D (2008) Splign: algorithms for computing spliced alignments with identification of paralogs. Biol Direct 3:20

Kent WJ (2002) BLAT--the BLAST-like alignment tool. Genome Res 12:656-664

Konig S, Romoth LW, Gerischer L, Stanke M (2016) Simultaneous gene finding in multiple genomes. Bioinformatics 32:3388-3395

Korf I (2004) Gene finding in novel genomes. BMC Bioinformatics 5:59

Korf I, Flicek P, Duan D, Brent MR (2001) Integrating genomic homology into gene structure prediction. Bioinformatics 17 Suppl 1:S140-148

Leroy P, Guilhot N, Sakai H, Bernard A, Choulet F, Theil S, Reboux S, Amano N, Flutre T, Pelegrin C, Ohyanagi H, Seidel M, Giacomoni F, Reichstadt M, Alaux M, Gicquello E, Legeai F, Cerutti L, Numa H, Tanaka T, Mayer K, Itoh T, Quesneville H, Feuillet C (2012) TriAnnot: a versatile and high performance pipeline for the automated annotation of plant genomes. Front Plant Sci 3:5

Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34:3094-3100

Liang C, Mao L, Ware D, Stein L (2009) Evidence-based gene predictions in plant genomes. Genome Res 19:1912-1923

Ni F, Qi J, Hao Q, Lyu B, Luo MC, Wang Y, Chen F, Wang S, Zhang C, Epstein L, Zhao X, Wang H, Zhang X, Chen C, Sun L, Fu D (2017) Wheat *Ms2* encodes for an orphan protein that confers male sterility in grass species. Nat Commun 8:15121

Salamov AA, Solovyev VV (2000) Ab initio gene finding in *Drosophila* genomic DNA. Genome Res 10:516-522

Slater GS, Birney E (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6:31

Song B, Sang Q, Wang H, Pei H, Wang F, Gan X (2019) A weighted sequence alignment strategy for gene structure annotation lift over from reference genome to a newly sequenced individual. bioRxiv

Stanke M, Schoffmann O, Morgenstern B, Waack S (2006) Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7:62

Venturini L, Caim S, Kaithakottil GG, Mapleson DL, Swarbreck D (2018) Leveraging multiple transcriptome assembly methods for improved gene structure annotation. Gigascience 7

Wang K, Wang D, Zheng X, Qin A, Zhou J, Guo B, Chen Y, Wen X, Ye W, Zhou Y, Zhu Y (2019) Multi-strategic RNA-seq analysis reveals a high-resolution transcriptional landscape in cotton. Nat Commun 10:4714

Wheelan SJ, Church DM, Ostell JM (2001) Spidey: a tool for mRNA-to-genomic alignments. Genome Res 11:1952-1957

Wu TD, Watanabe CK (2005) GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21:1859-1875

------

[1] https://funannotate.

[2] https://www.ncbi.nlm./books/NBK169439

[3] http://pgsb./plant
