
Principles and Methods of Large-Scale Genome Sequencing
Hu Songnian (胡松年)

The "genome": the "periodic table" of the life sciences
- The discovery of the periodic table laid the foundation for research and development in physics and chemistry in the twentieth century.
- Anatomical charts of the human body laid the foundation for the development of modern medicine.
- The "genome sequence map" will lay the foundation for life-science research and the bio-industry in the twenty-first century.

The secrets of life are hidden in a "book written in four letters"
GCTTCTTCCTCATTTTCTCTTGCCGCCACCATGCCGCCACCA
TCATTTTCTCTTGCCGCCACCATGCTTCTTCCTCATTTTCTCT
CCACCATGCCGCCACCACGCCACCATGCTTCTTCCTCATCTC
GCTTTCTTGCCGCCACCATGCCGCCACCGCTTCTTCCtTCTCT

Basic theoretical research in genomics
Genomics seeks to reveal the relationships among four integrated systems:
- The genome as an information carrier (the relationship between overall conservation and local imbalance of base pairs and repeat sequences)
- The genome as an integrated body of genetic material (genes as functional and structural units, and their relationship to genetic mechanisms)
- The genome as an integrated body of biochemical molecules (gene products as functional molecules, and their relationship to molecular and cellular mechanisms)
- The integrated body of species evolution (natural selection of species in their geographic and atmospheric environment)

Monopoly and rapid turnover of sequencing instruments
[Figure: timeline from 1990 to 2020 with throughput marks at 1000, 4000 and 4500 Mb, annotated "Less Than 5 yrs". Instruments shown: ABI 373, ABI 377, ABI 3700, ABI 3700xl, ABI 3130, ABI 3130xl, ABI 3730, ABI 3730xl, GA-I, GA-II, GA-IIx, HiSeq 1000/2000, SOLiD, SOLiD 2, SOLiD 3, 5500 SOLiD, 5500xl SOLiD.]

Current state of sequencing instruments
First generation (stable demand)
- ABI: 3130xL, 3730xL, 3500xL
Second generation (rapid development)
- Roche: Genome Sequencer FLX System, GS Junior System
- Illumina: Genome Analyzer IIx, MiSeq, HiSeq 1000, HiSeq 2000
- Life Technologies (ABI): 5500 SOLiD System, 5500xL SOLiD System, Ion Torrent PGM
- Danaher Motion: Polonator G.007
- Complete Genomics
- Wuxi Aijiyin Bioinformatics Technology Co., Ltd.: AG-100
- Shenzhen Huayinkang Gene Technology Co., Ltd.: Pstar-1
- Beijing Institute of Genomics / Institute of Semiconductors, Chinese Academy of Sciences: BIGIS-1, BIGIS-4
Third generation (about to reach the market)
- Helicos BioSciences: Helicos Genetic Analysis System
- Pacific Biosciences: RS System

Reaction components: DNA template, primers, DNA polymerase, dNTPs, buffer.

Each cycle consists of denaturation (90 °C), annealing (54 °C) and extension (72 °C).

Two overlapping reads and the sequence assembled from them:
ATGCCGTAGGCCTAGC
      TAGGCCTAGCTCGGA
ATGCCGTAGGCCTAGCTCGGA

Clone by Clone
1. Genomic DNA
2. BAC library
3. BACs or contigs correctly positioned according to the physical map
4. Candidate clones for shotgun sequencing
5. Subclones for shotgun sequencing
6. Sequencing and assembly
7. Complete genome sequence

Whole Genome Shotgun
1. Genomic DNA
2. Shotgun cloning
3. Sequencing and whole-genome sequence assembly
4. Complete genome sequence

BAC by BAC versus Whole Genome Shotgun
"the sequencing of the human genome is likely to be the only large sequencing project carried to completion by the methods described in this issue."
Maynard V. Olson, "The maps: Clone by clone by clone", Nature 409, 816-818 (2001)

Working draft (framework map) and finished map
[Figure: a chromosome with Gap1 and Gap2, comparing the "Working Draft" (90%; 4X) with the Finished Genome (99.99%; 8X).]

BAC by BAC

The sequence of the human genome
J. C. Venter et al., Science 291: 1304-1351, 16 Feb. 2001

The main results and progress of the Human Genome Project are embodied in these "four maps":

- Genetic map: also called the linkage map; it gives the relative positions of genes or DNA markers on the chromosomes and the genetic distances between them.
- Physical map: a genome map that uses mapped DNA marker sequences such as STSs as landmarks and the actual DNA length (bp, kb, Mb) as the map distance.
- Transcript map: a molecular genetic map built with ESTs (expressed sequence tags) as markers.
- Sequence map: the genomic DNA sequence obtained by genome sequencing, with A, T, G and C as the unit of the map.

Overall procedure
- Construction of the physical map
- Screening of large-insert clones
- Shotgun sequencing and construction of the "working draft"
- Full sequence assembly and construction of the "finished map"

Making the physical map

A physical map is a chromosome map displayed with specific DNA sequences as markers. The distance between markers is expressed as a physical distance, such as base pairs (bp), kb or Mb. The finest physical map is the nucleotide sequence itself; the coarsest is the karyotype.

The STS map is one of the most basic and most useful chromosome physical maps. An STS (Sequence Tagged Site) is a specific short sequence of about 200 to 300 bp selected at random from the human genome; each STS is unique in the genome. An STS map uses STSs as landmarks (on average one per 100 kb) to place DNA clone fragments in order on the genome.

Sources of STSs
- Random genomic sequences
- Expressed gene sequences, such as ESTs
- Genetic marker sequences, such as microsatellite markers
Information about STSs can be found in the Genome Database (GDB): http://gdbwww.gdb.org

Steps
- Determine each STS sequence and its position in the genome
- Construct a large-insert genomic library (BAC library)
- Screen and position clones using specific STSs as markers
- Order the STS-containing clones in the genome
The Genome Database (GDB) contains information on at least 24,568 STS landmarks.

Basic requirements for a vector
- It can replicate independently in the host cell
- It has a multiple cloning site into which foreign DNA fragments can be inserted
- It has suitable selectable markers, such as drug resistance
- It is of suitable size and easy to isolate and purify
- It has a high copy number

The concept of a library: a population of recombinant DNA clones containing random fragments that together represent all the genes of an organism.
Vector: a tool that carries foreign DNA into a host cell. Common vectors include plasmid vectors, phage vectors and bacterial artificial chromosomes (BACs).
Host: an organism that can accommodate foreign DNA fragments. Common hosts include Escherichia coli and yeast.

Construction of a BAC library
- Restriction enzymes: NotI, SacI
- Pulsed-field gel electrophoresis yields large DNA fragments of about 200 kb
- After purification, the fragments are ligated to the vector
- Electroporation introduces the ligation products into competent E. coli cells
- BAC vectors carrying foreign DNA inserts are grown on solid medium containing chloramphenicol
- Each colony is a single clone carrying the same foreign DNA insert

Screening of BAC clones

Seed clones are screened with the "STS-PCR reaction pool" scheme, using specific STS markers.
BAC clones that share overlapping fragments are assembled into a contig according to their STS information and positioned on the genome.

Regional mapping
Minimal tiling path selected for sequencing.

[Figure: "Beijing Map" of chromosome 3p, starting at 3pter, showing STS markers (for example stSG50796, WI-21858, SGC-34652, D3S4170, D3S1304) and the BAC clones placed on them (for example 605m01, 114k09, 265o10, ctb-159n23, AC055767). Legend: Beijing Center; South center; North center; Beijing and South; Beijing and North; STS markers; End certified; Phase 3; Mapped on 3p by sequence from other center; Mapped on 3p by fingerprint from other center; Mapped on 3p by FISH; Mapped not on 3p by FISH; Mapped not on 3p by sequence from NCBI; Sequenced BACs without mapping information.]

BAC Pooling Protocol
1,152 (plates) × 384 (wells/plate) × 1 (BAC/well) = 442,368 BACs
48 × 8 (plates) × 384 (wells/plate) × 1 (BAC/well) = 147,456 BACs
Each BAC clone contains about 150 kbp of human insert.
Coverage of the whole genome by 147,456 BAC clones:
(147,456 BAC clones × 150 kbp) / 3,000,000 kbp (genomic DNA) = 7.3728

48 superpools
48 groups in total, 8 plates per group: every eight 96-well plates form one superpool, so 384 96-well plates form 48 superpools.

Composition of plate pools, row pools and column pools
[Figure: eight plates (plate 1, plate 2, ... plate 8) with columns 1 to 12, showing how plate pools, row pools and column pools are formed.]

- Superpool: eight 96-well plates, 768 clones in total
- Plate pool: 96 clones
- Row pool: 12 clones
- Column pool: 8 clones
Pooling greatly reduces the screening workload and cost, and the screening results are accurate and reliable: 28 versus 768.
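The pooling arithmetic on these slides can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and it assumes that the "28" on the slide is the sum of 8 plate pools, 8 row pools and 12 column pools per superpool, which the slide does not spell out.

```python
# Worked arithmetic for the BAC pooling slides (illustrative helper names).

ROW_LETTERS = "ABCDEFGH"  # rows of a 96-well plate


def library_coverage(n_clones, insert_kbp, genome_kbp):
    """Fold coverage of the genome by a clone library."""
    return n_clones * insert_kbp / genome_kbp


def clones_per_superpool(plates=8, rows=8, cols=12):
    """Number of clones pooled into one superpool of 96-well plates."""
    return plates * rows * cols


def pcrs_per_superpool(plates=8, rows=8, cols=12):
    """PCRs needed to resolve one positive superpool.

    Assumption: one reaction per plate pool, row pool and column pool.
    """
    return plates + rows + cols


def well_address(plate, row_index, col_index):
    """Combine one positive plate, row and column pool into a well address."""
    return f"plate {plate}, well {ROW_LETTERS[row_index]}{col_index + 1}"


if __name__ == "__main__":
    print(library_coverage(147_456, 150, 3_000_000))
    print(pcrs_per_superpool(), "PCRs instead of", clones_per_superpool())
    print(well_address(3, 2, 6))
```

The address is only unambiguous when a superpool holds a single positive clone; with two or more positives, several plate, row and column combinations are possible, and each candidate has to be confirmed individually.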

Sheet of superpools, plate pools, row pools and column pools

I. BAC Screening
- The first 48 samples are the results of screening the superpools (sp) with primer OGG1.51; the last 48 samples are the results of screening the superpools with primer OGG1.52.
- Results of screening the plate, row and column pools of sp #27, 34 and 45 with primer OGG1.52.
- Identification of the BAC clone (+ marks a positive clone).
- Colony PCR with primer OGG1.52.

The density of STSs is not yet high enough for a high-resolution physical map, and STSs are unevenly distributed in the genome, so many regions are not covered by any positive clone and gaps form. Seed clones therefore have to be extended by fingerprinting (the FPC method) or by end-sequence walking (Walking by End Sequence) to form continuous clone contigs. Clones obtained by these extension methods are called extension clones.
[Figure: Contig 1 and Contig 2 with their overlapping sequences, the extension primers, and the extension clones obtained by screening.]

Fingerprinting (Walking by Fingerprinting database)
- BAC clones are cultured in 96 deep-well plates
- Complete digestion with HindIII
- 1% agarose gel electrophoresis
[Figure: gel with fragments from 20 kb to 300 bp; molecular weight marker every 5th lane.]
A seed clone close to a gap is picked and digested to build its fingerprint. The fingerprint is compared against the FPC database to find the contig of overlapping clones that contains this clone, and the clones covering the gap region are identified from that contig, which extends the map.
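The idea behind fingerprint comparison can be illustrated with a small sketch. It is not the scoring used by the FPC software; it only assumes that two bands count as the same when their sizes agree within a relative tolerance, and that a candidate sharing more bands with the seed clone is more likely to overlap it. The function names, the 2% tolerance and the fragment sizes are all hypothetical.

```python
# Toy comparison of restriction fingerprints (lists of fragment sizes in bp).

def shared_bands(fp_a, fp_b, tolerance=0.02):
    """Count bands matched one-to-one between two fingerprints.

    Two bands match when their sizes differ by at most `tolerance`
    (relative to the larger band). Each band is used at most once.
    """
    a, b = sorted(fp_a), sorted(fp_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance * max(a[i], b[j]):
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


def rank_candidates(seed, candidates, tolerance=0.02):
    """Sort candidate clones by the number of bands shared with the seed."""
    scores = {name: shared_bands(seed, fp, tolerance)
              for name, fp in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    seed = [300, 1200, 2500, 4100, 8000, 15000]       # made-up sizes
    candidates = {
        "clone_X": [305, 1190, 2510, 6000, 9000],     # made-up sizes
        "clone_Y": [500, 700, 3300],
    }
    print(rank_candidates(seed, candidates))
```

A real analysis would also weigh how likely a given number of shared bands is to arise by chance, which this sketch ignores.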

[Figure: Clone A, Clone B and Clone C are each completely digested with HindIII and compared in the FPC database; slide labels: Clone A, Clone B, Clone C, C A B.]

Misplacement of clones during contig building

End-sequence walking (Walking by End Sequence)
A seed clone close to a gap is picked and end-sequenced. The end sequences are compared against the genome database to find unique sequence fragments that can serve as new STS landmarks. PCR primers are then designed for the new landmarks, and new clones are screened with the STS-PCR "reaction pool" scheme, which extends the map.
[Figure: query result from the end sequence database for the sequence of clone 350A18.]

IV. Clone Identification
1. STS-PCR
2. BAC end sequencing
3. Fingerprinting
4. FISH

[Figure: electrophoresis of 15 clones after HindIII digestion; lanes include CK1, CK2, 13f06, 267l16, 481o07, 250a15, 204c23 and 340j13.]

Drawing the "working draft"
Clones are positioned according to the result of a blastn comparison of their sequences against the STS database. Comparing the end sequences identifies which end of a clone extends beyond the contig.

Walking can then be carried out promptly to screen for new clones.

Shotgun sequencing, assembly and finishing

Workflow

Shotgun Sequencing I: RANDOM PHASE
Shotgun Sequencing II: ASSEMBLY
Shotgun Sequencing III: FINISHING

The Consed interface showing the result of sequence assembly
1. Filling "intraclone gaps"

Finishing of BAC-453F3
[Figure: labels Sp6, 1 kb.]

Insert size. The size of the clone-insert from which a clone-end pair is taken.
Contig. The result of joining an overlapping collection of sequence reads.

Scaffold. The result of connecting non-overlapping contigs by using paired-end reads.
N50 size. As applied to contigs or scaffolds, that size above which 50% of the assembled sequence can be found.

Genome assembly strategy
- Contig assembly
- Scaffolding
- Internal gap closing

http:/
Whole genome sequencing projects

Table. Basic information on recently sequenced genomes. (NR: not reported.)

Organism   | Genome size | Strategy                | Coverage            | Contigs: # | N50     | Max     | Total   | Scaffolds: # | N50     | Max     | Total
Human      | 3.0 Gb      | Solexa                  | 45x                 | 2.76M      | 1.5 kb  | 18.8 kb | 2.18 Gb | NR           | NR      | NR      | NR
Apple      | 742.3 Mb    | Sanger+454              | 4.4x+12.5x          | 122,146    | 16,171  | NR      | 603.9 Mb| 1,629        | 102 kb  | NR      | 598.3
Castor     | 320 Mb      | Sanger                  | 4.59x               | 54,000     | 21.1 kb | 190 kb  | 324 Mb  | 25,828       | 496.5 kb| 4.7 Mb  | 350.6 Mb
Grapevine  | 500 Mb      | Sanger+454              | 7x+4.2x             | 58,611     | 18.2 kb | 238 kb  | 531 Mb  | 2,093        | 1.33 Mb | 7.8 Mb  | 421 Mb
Panda      | 2.4 Gb      | Solexa                  | 74x                 | 200,604    | 36,728  | 434,635 | 2.25 Gb | 81,496       | 1.22 Mb | 6.05 Mb | 2.30 Gb
Strawberry | 220 Mb      | 454+Solexa+SOLiD        | 24.5x+6.4x+6.4x     | 16,487     | 28,072  | 215,349 | 202 Mb  | 3,263        | 1.44 Mb | 4.1 Mb  | 214 Mb
Cacao      | 430 Mb      | 454+Sanger+Solexa       | 16.7x+44x           | 25,912     | 19.8 kb | 190 kb  | 291.4   | 4,792        | 473.8 kb| 3415 kb | 326.9 Mb
Tomato     | 900 Mb      | 454+Sanger+Solexa+SOLiD | 31x+3.6x+82x+140x   | 110,872    | 55.7 kb | NR      | 763 Mb  | 3,761        | 4.45 Mb | NR      | 782 Mb
Potato     | 840 Mb      | 454+Solexa+SOLiD        | 11x+106x+0.2x       | 111,187    | 31 kb   | NR      | 683 Mb  | 66,301       | 387 kb  | NR      | 727 Mb

Flowchart of the WGS de novo assembly
454 part
1. Genomic DNA
2. DNA fragmentation, construct fragmented libraries
3. Generate sequencing reads using 454 technology
4. Sequencing error correction
5. Output contigs
Solexa part
1. Genomic DNA
2. DNA fragmentation, construct paired-end libraries with variant insert sizes
3. Generate sequencing reads using Illumina GA technology
4. Sequencing pre-process
5. Output contigs and mini scaffolds
Both parts
1. Hybrid assembly and scaffolding
2. Fill in intra-scaffold gaps and get the final scaffolds

454 reads process
[Flowchart boxes: Raw reads; Kmer evaluation; Q20, remove adaptor, trim; Sequencing pre-process; Newbler assembly; Assembled reads; Unassembled reads; Unigene coverage; Kmer evaluation; Solexa mapping; Nr/Nt blast; Contig status; Assembly; Hybrid scaffolding.]

Solexa reads process
[Flowchart boxes: Raw reads; Kmer evaluation; Sequencing pre-process; Soap assembly; Assembled reads; Unassembled reads; Unigene coverage; Kmer evaluation; Solexa mapping; Nr/Nt blast; Contig status; Assembly; Mapping to 454 contig; Hybrid scaffolding; Cov/Comp.]
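The N50 size used in the glossary and in the table of sequenced genomes can be computed directly from a list of contig or scaffold lengths. The sketch below follows the common convention: sort the lengths from longest to shortest and report the length at which the running total first reaches half of the assembled sequence. The function name and the example lengths are illustrative only.

```python
# N50 of an assembly from its contig (or scaffold) lengths.

def n50(lengths):
    """Return the N50 size of a collection of sequence lengths.

    At least half of the total assembled length lies in sequences
    that are this long or longer.
    """
    if not lengths:
        raise ValueError("n50() needs at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20, 10]  # made-up lengths, total 300
    print(n50(contigs))
```

Because N50 depends only on the length distribution, it says nothing about whether the contigs are assembled correctly, so it is usually read together with the total assembled length and the number of sequences, as in the table.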

Hybrid assembly
[Figure: long reads are assembled into contigs; short reads are used for scaffolding (labels: A + C, B; A + B, C) to give scaffolds; gaps are then fixed.]

EST-based assembly with NGS short reads: constructing bigger scaffolds
[Figure: an EST unigene links Scaf A, Scaf C, Scaf B and Scaf D into a new scaffold A B C D.]

Raw sequencing reads pre-processing I: significance and purpose
- Sequencing library quality control
- Sequencing bias analysis
  - Inherent properties of certain second-generation sequencers
  - Genome sequencing "black hole" effect
  - Transcriptome sampling and quantification bias
- Ready for mapping
- Ready for de novo assembly

Raw sequencing reads pre-processing II: pipeline
- Sequencing read numbers
- Duplicate detection, regional distribution analysis and trimming
- Adapter detection and trimming
- Read quality analysis and low-quality read filter
  - Average quality density distribution
  - Average quality positional distribution
  - Regional distribution
  - F-R correlation
  - GC content-quality correlation
- Insert length distribution

Raw data pre-process

Image analysis and basecalling
GOAT pipeline (OLB1.6), CASAVA

Quality Control: GERALD Summary.htm
Lane: 1
Lane Yield (kbases): 526305
Clusters (raw): 97464 +/- 4878
Clusters (PF): 87676 +/- 9219
1st Cycle Int (PF): 75 +/- 21
% intensity after 20 cycles (PF): 86.17 +/- 5.25
% PF Clusters: 89.76 +/- 5.95
% Align (PF): 99.06 +/- 0.25
Alignment Score (PF): 102.41 +/- 1.62
% Error Rate (PF): 1.30 +/- 0.22

Fastq and Quality
Solexa reads in Fastq format

s_1_1_sequence.txt
@HWI-EAS724_0001:8:32:374:374#0/1
GAGCTGTATATGAATAATAGTTCGTTTTTCATTATCCAAGATGGATCGGTATAAAGTCTGCTAAAATAAAGGTACAACG
+HWI-EAS724_0001:8:32:374:374#0/1
fcfcfggdfggggfggggcggggggggfgggggcgggfWgggggggggfgcggdgcgcggggfacbbbbgcgggggd

s_1_2_sequence.txt
@HWI-EAS724_0001:8:32:374:374#0/2
TACCGTTAATAGCAGTAATATCATAATAGTAATAGCATCATAACGGTAGTCCCATAAAAGTGTGTCAGTAGTAGTAGTA
+HWI-EAS724_0001:8:32:374:374#0/2
ggggfgggggd_adcggggeggfggeggegfgeececdegggggfegcfegggegggfgacacedbd_cYb

The Illumina 1.3 format encodes a Phred quality score from 0 to 40 using ASCII 64 to 104.
Error probability (p):
- Solexa scale: p = 0.01, Q = 19.96; p = 0.05, Q = 12.8; p = 0.10, Q = 9.5
- Phred scale: p = 0.01, Q = 20; p = 0.05, Q = 13; p = 0.10, Q = 10

Data assessment I: read quality distribution
[Figure: low-quality versus high-quality reads.]
Trim: 3' end trim if Q …
Assessment: distance distribution between two low-quality bases (Q20 …)?

Lane data usage in different Solexa libraries: filter duplicate reads

Average reads per start point

Read Correction
Correct Illumina GA short reads, Kmer = 17
Genome size prediction: M = N × (L − K + 1) / L
- N = total length (bp) / genome size
- L = average read length (bp)
- M: …

Genome size estimation using Kmer
Before estimating the genome size, we make one assumption: the k-mers picked out of the reads traverse the whole genome sequence.
According to the Lander-Waterman model, the estimate can be written as:
G = Knum / Kdepth
Here G is the genome size, Knum is the total number of k-mers and Kdepth is the expected depth of the k-mers.
If we obtain the expected k-mer depth, we can calculate the genome size. Because the k-mer frequency follows a Poisson distribution, we can take the peak of the k-mer distribution curve as the expected k-mer depth and calculate the genome size from it.
Note: there are 15,437,084,746 k-mers in total and the peak in the figure is at a depth of 8, so the genome size is estimated as 15,437,084,746 / 8 = 1.93 G.

High-quality read rate after pre-processing
Assembly: raw data versus pre-processed data?

Questions
Genome size estimati
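The relation G = Knum / Kdepth can be tried out on a small scale. The following is a minimal sketch, not the pipeline used for the slides: it counts k-mers in memory with `collections.Counter`, builds the depth histogram, takes the most common depth as the peak (ignoring depth 1, on the assumption that k-mers seen once are mostly sequencing errors), and divides. All function names are hypothetical, and real data sets need a dedicated k-mer counter rather than a Python dictionary.

```python
# Toy genome size estimate from k-mer depth: G = Knum / Kdepth.
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer of length k in the given reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def depth_histogram(kmer_counts):
    """Map each depth to the number of distinct k-mers seen at that depth."""
    return Counter(kmer_counts.values())


def peak_depth(histogram, min_depth=2):
    """Depth with the most distinct k-mers, skipping depths below min_depth.

    Assumption: k-mers below min_depth are dominated by sequencing errors.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    return max(usable, key=lambda d: (usable[d], d))


def estimate_genome_size(total_kmers, kdepth):
    """Genome size estimate G = Knum / Kdepth."""
    return total_kmers / kdepth


if __name__ == "__main__":
    # Numbers quoted on the slide: 15,437,084,746 k-mers, peak depth 8.
    size = estimate_genome_size(15_437_084_746, 8)
    print(f"{size / 1e9:.2f} Gb")
```

On real reads, `total_kmers` would be `sum(count_kmers(reads, 17).values())` and `kdepth` would come from `peak_depth(depth_histogram(...))`, with k = 17 as on the read-correction slide.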

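The quality encoding described on the Fastq slide (Illumina 1.3, ASCII 64 to 104 for scores 0 to 40) and the two error-probability scales can be reproduced with the standard definitions Q_phred = -10·log10(p) and Q_solexa = -10·log10(p / (1 - p)). This is a small sketch with hypothetical function names; it decodes a quality string and converts between error probability and score.

```python
# Decoding Illumina 1.3 quality strings and converting error probabilities.
import math

ILLUMINA_13_OFFSET = 64  # ASCII offset stated on the slide


def decode_illumina13(quality_string, offset=ILLUMINA_13_OFFSET):
    """Turn a quality string into a list of integer quality scores."""
    return [ord(char) - offset for char in quality_string]


def phred_to_error(q):
    """Error probability for a Phred score."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def error_to_solexa(p):
    """Solexa-scale score for an error probability p (0 < p < 1)."""
    return -10 * math.log10(p / (1 - p))


if __name__ == "__main__":
    print(decode_illumina13("fcfcfggd"))  # start of the first quality line
    for p in (0.01, 0.05, 0.10):
        print(p, round(error_to_phred(p), 2), round(error_to_solexa(p), 2))
```

The two scales agree closely for high-quality bases and diverge as p grows, which is why the slide lists slightly different values for the same error probabilities.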