




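DNA Sequence Classification
Abstract: This is a "supervised classification" problem. We first tabulate the frequencies of 1-character, 2-character, and 3-character strings in the 20 training sequences, which gives a basic feature set of 41 variables, and then use principal component analysis to extract 4 features from it. Fisher's linear discriminant is then used for classification, giving the following results for the 20 artificial sequences and the 182 natural sequences:
1) The 20 artificial sequences: 22, 23, 25, 27, 29, 34, 35, 36, 37 belong to class A; the rest belong to class B.
2) The 182 natural sequences: 1, 4, 8, 10, 27, 29, 32, 41, 43, 48, 54, 63, 70, 72, 75, 76, 81,
86, 90, 92, 102, 110, 116, 119, 126, 131, 144, 150, 157, 159, 160, 161, 162, 163,
164, 165, 166, 169, 170, 182 belong to class B; the rest belong to class A.
Finally, a validation test shows that the classification model is fairly effective.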
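I. Problem Restatement
The draft of the complete DNA sequence from the Human Genome Project is a string of about 3 billion characters, built from the 4 characters A, T, C, G in a definite order, with no "sentence breaks" and no punctuation. Although little is understood about it, some regularities and structure have been found. For example, some segments of the full sequence encode proteins: of the 64 distinct 3-character strings formed from the 4 characters, most encode the 20 amino acids that make up proteins. As another example, A and T are notably more abundant in segments that do not encode proteins, and studying the structure of DNA sequences using the abundance of particular bases as a feature has also produced some results. In addition, statistical methods have shown that certain segments of the sequence are correlated with one another. These findings suggest that DNA sequences contain both local and global structure, and that uncovering this structure is of great value for understanding the complete DNA sequence. The most common idea in this research at present is to omit some details of the sequence, emphasize its features, and then represent it as a suitable mathematical object.
As an attempt to study the structure of DNA sequences, the following classification problem on sets of sequences is posed:
1) From 20 artificial sequences of known class (sequences 1 to 10 are class A, 11 to 20 are class B), extract features and construct a classification method, and use these sequences of known class to assess whether your method is good enough. Then use the method you find satisfactory to classify another 20 unlabeled artificial sequences (numbered 21 to 40), and report the class of each by sequence number in ascending order (omit any that cannot be classified).
2) Classify 182 natural DNA sequences (all of them longer) by the same method, and report the results as in 1).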
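II. Model Assumptions
1. The starting position of the DNA base triplets (3-character strings) in each sequence, and gene expression, do not affect the classification result.
2. Compressing the 64 kinds of 3-character strings into 20 groups does not affect the classification result.
3. The 182 longer natural sequences share common features with the 20 sample sequences of known class.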
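III. Model Construction and Solution
Determining what structure DNA sequences have, and what regularities are hidden in the seemingly random arrangement of the 4 bases A, T, C, G, is the basis for interpreting the draft of the complete DNA sequence from the Human Genome Project, and is one of the most important topics in bioinformatics.
The problem provides 20 artificial DNA sequences known to fall into two classes, and asks us to extract features from them and construct a classification method in order to classify 20 unlabeled artificial DNA sequences and 182 natural DNA sequences. This is a "supervised classification" problem in pattern recognition: the classification criteria and the number of classes are fixed in advance, regularities are found by processing the information in a batch of known samples, and the unknowns are then predicted by computer. The given samples of known class are called training samples. We approach such a problem in three steps: building a mathematical classification model (which includes forming and extracting features and formulating a classification decision), examining the efficiency of the classification model, and predicting the unknowns.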
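(I) Feature Formation and Extraction
To classify effectively, we must first generate a set of basic features from the objects to be recognized and then transform these basic features to obtain the features that best reflect the essence of the classification. This is the process of feature formation and extraction. After listing a feature parameter set that is as complete as possible, mathematical methods are used to reduce the number of feature parameters to a minimum (while still guaranteeing good classification). The reasons are: 1. Redundant feature parameters bring little benefit and introduce noise that interferes with classification and with building the mathematical model. 2. To keep the ratio of the number of samples to the number of feature parameters large enough without requiring too many samples, it is best to reduce the number of feature parameters to a minimum. Pattern recognition calculations generally require the number of samples to be at least 3 times the number of variables; otherwise the results are not reliable enough. This problem has 20 training samples, so about 6 to 8 feature parameters are appropriate (20/3 ≈ 6.7).
We study how the 4 characters A, T, C, G are arranged and combined in DNA sequences, chiefly the frequencies with which characters and character strings occur in a sequence, and extract structural feature parameters of the DNA sequences from them.
1. Feature formation
We tabulate the frequencies of arrangements of 1 character, 2 characters, and 3 characters in each sequence to form the basic feature set.
(1) Frequencies of single characters
Table 1 lists the frequencies (in percent) of the 4 characters A, T, C, G in the 20 samples. Because A and T are notably more abundant in segments that do not encode proteins, we treat whether A and T are especially abundant as a feature, and Table 1 also lists the sum of the frequencies of A and T. (The program is given in Appendix 1.)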
Table 1
No.      A      C      T      G    A+T
  1  29.73  17.12  13.51  39.64  43.24
  2  27.03  16.22  15.32  41.44  42.34
  3  27.03  21.62   6.31  45.05  33.33
  4  42.34  10.81  28.83  18.02  71.17
  5  23.42  23.42  10.81  42.34  34.23
  6  35.14  12.61  12.61  39.64  47.75
  7  35.14   9.91  18.92  36.04  54.05
  8  27.93  16.22  18.92  36.94  46.85
  9  20.72  20.72  15.32  43.24  36.04
 10  18.18  27.27  13.64  40.91  31.82
 11  35.45   4.55  50.00  10.00  85.45
 12  32.73   2.73  50.00  14.55  82.73
 13  25.45  10.00  51.82  12.73  77.27
 14  30.00   8.18  50.00  11.82  80.00
 15  29.09   0.00  64.55   6.36  93.64
 16  36.36   8.18  46.36   9.09  82.73
 17  35.45  24.55  26.36  13.64  61.82
 18  29.09  11.82  50.00   9.09  79.09
 19  21.82  14.55  56.36   7.27  78.18
 20  20.00  17.27  56.36   6.36  76.36
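(2) Frequencies of 2-character strings
The 4 characters A, T, C, G form 16 distinct 2-character strings. Table 2 lists the frequency of each 2-character string in the 20 samples. (A "rolling" algorithm is used; for example, ATTCG contains the four 2-character strings AT, TT, TC, CG.) (The program is similar to that in Appendix 1.)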
Table 2
No.     AA     AC     AT     AG     TA     TC     TG     TT     CA     CT     CC     CG     GA     GT     GC     GG
  1   9.01   9.01   3.60   8.11   4.50   0.90   4.50   3.60   3.60   3.60   1.80   8.11  11.71   2.70   5.41  18.92
  2   9.91   7.21   3.60   5.41   2.70   1.80   5.41   5.41   4.50   1.80   0.90   9.01   9.91   4.50   5.41  21.62
  3   5.41  11.71   3.60   5.41   2.70   1.80   0.90   0.90   5.41   0.90   0.90  14.41  13.51   0.90   7.21  23.42
  4  18.92   5.41  11.71   5.41  10.81   1.80   5.41  10.81   5.41   1.80   0.90   2.70   6.31   4.50   2.70   4.50
  5   6.31   8.11   1.80   7.21   1.80   2.70   2.70   3.60   5.41   4.50   2.70  10.81   9.91   0.90   9.01  21.62
  6  15.32   2.70   6.31   9.91   3.60   1.80   1.80   5.41   4.50   0.00   0.00   8.11  10.81   0.90   8.11  19.82
  7  15.32   1.80  10.81   7.21   4.50   2.70   6.31   5.41   0.90   1.80   0.90   6.31  13.51   0.90   4.50  16.22
  8   8.11   3.60   6.31   9.91   5.41   3.60   2.70   7.21   2.70   3.60   1.80   8.11  10.81   1.80   7.21  16.22
  9   9.01   0.90   4.50   6.31   0.00   3.60   7.21   4.50   3.60   2.70   2.70  11.71   7.21   3.60  13.51  18.02
 10   6.36   3.64   1.82   6.36   1.82   5.45   2.73   3.64   5.45   3.64   4.55  13.64   4.55   3.64  13.64  18.18
 11  15.45   2.73  14.55   2.73  16.36   0.91   1.82  30.00   0.91   0.91   0.91   1.82   2.73   4.55   0.00   2.73
 12  13.64   0.91  10.91   6.36  15.45   1.82   1.82  30.91   0.91   0.91   0.00   0.91   2.73   7.27   0.00   4.55
 13   6.36   4.55  10.00   4.55  12.73   1.82   2.73  34.55   2.73   2.73   1.82   1.82   3.64   4.55   1.82   2.73
 14   8.18   0.91  12.73   7.27  13.64   6.36   1.82  28.18   2.73   4.55   0.00   0.91   5.45   4.55   0.91   0.91
 15  13.64   0.00  12.73   1.82  13.64   0.00   2.73  48.18   0.00   0.00   0.00   0.00   1.82   3.64   0.00   0.91
 16  16.36   3.64  15.45   0.91  13.64   4.55   4.55  22.73   1.82   5.45   0.00   0.91   4.55   2.73   0.00   1.82
 17  17.27   5.45  10.91   1.82  10.00   6.36   4.55   5.45   4.55   7.27   9.09   2.73   3.64   2.73   3.64   3.64
 18   8.18   7.27  11.82   1.82  15.45   1.82   0.91  30.91   3.64   3.64   1.82   2.73   1.82   3.64   0.91   2.73
 19   2.73   2.73  13.64   1.82  14.55   9.09   0.91  31.82   1.82   8.18   1.82   2.73   2.73   2.73   0.91   0.91
 20   6.36   6.36   6.36   0.91   9.09  10.00   3.64  32.73   2.73  13.64   0.91   0.00   1.82   3.64   0.00   0.91
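The "rolling" count described in (2) can be sketched as follows. This is a minimal Python illustration rather than the authors' program (their Appendix 1 is written in Fortran); the function name `kmer_frequencies` and its parameters are assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k, alphabet="ACTG"):
    """Return the percentage frequency of every length-k string in `sequence`.

    A sliding ("rolling") window of width k moves one character at a time,
    so a sequence of length n contributes n - k + 1 overlapping strings.
    """
    seq = sequence.upper()
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    # Report every possible k-string, including those that never occur.
    return {
        "".join(p): (100.0 * counts["".join(p)] / total if total else 0.0)
        for p in product(alphabet, repeat=k)
    }


# The example from the text: ATTCG contains AT, TT, TC, CG.
freq2 = kmer_frequencies("ATTCG", 2)
print({kmer: value for kmer, value in freq2.items() if value > 0})
```

For ATTCG with k = 2, each of AT, TT, TC, CG receives 25% and the other 12 strings receive 0%. Calling the same function with k = 1 or k = 3 gives the counts behind the single-character and 3-character features.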
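(3) Frequencies of 3-character strings
The 4 characters A, T, C, G form 64 distinct 3-character strings, which encode the 20 amino acids of proteins. Figure 2 of reference [1] gives a coding of these 20 amino acids (see Figure 1). Therefore, when computing the frequencies of 3-character strings, we merge the 3-character strings that represent the same amino acid in Figure 1 into one class and count only the frequencies of the 20 classes. (The starting position of a string within a sequence segment is ignored and the "rolling" algorithm is again used; for example, ACGTCC contains the four 3-character strings ACG, CGT, GTC, TCC.) See Table 3. (The program is similar to that in Appendix 1.)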
[Figure 1: diagram sorting the 64 codons into 20 classes; image not recoverable]
Symmetries of the diamond code sort the 64 codons into 20 classes, indicated here by 20 colors. All the codons in each class specified the same amino acid.
Figure 1. Diagram given by Brian Hayes in the article "The Invention of the Genetic Code"
(Note: in the figure DNA is transcribed as RNA, so "U" stands for "T".)
Table 3
No.     b1     b2     b3     b4     b5     b6     b7     b8     b9    b10    b11    b12    b13    b14    b15    b16    b17    b18    b19    b20
  1   1.77   3.54   2.65   0.88   0.00   0.00   7.96   0.88   4.42   2.65  17.70  10.62   3.54   4.42   4.42   7.08   1.77   3.54  13.27   7.08
  2   1.89   1.89   0.94   0.94   0.00   0.94   1.89   0.94   4.72  12.26   7.55  11.32   8.49   3.77   3.77   6.60   9.43   6.60   7.55   2.83
  3   0.98   0.00   0.00   5.88   0.98   8.82   2.94   0.00   0.00   2.94  10.78   5.88  13.73   0.00   4.90   3.92  19.61   1.96   8.82   5.88
  4   0.00   0.00   0.00   0.87   0.00   0.87  13.04   1.74   6.09   2.61  11.30  13.04   3.48   5.22   3.48   8.70   3.48   1.74  14.78   7.83
  5   2.86   0.00   0.00   3.81   0.95   3.81   3.81   0.00   3.81   3.81   9.52   9.52  12.38   2.86   9.52   4.76   7.62   2.86   7.62   9.52
  6   0.00   0.00   0.88   2.63   0.00   1.75  13.16   0.88   4.39   1.75  14.04   9.65   7.02   5.26   4.39  11.40   2.63   1.75  10.53   6.14
  7   1.92   0.00   0.00   2.88   0.96   4.81   2.88   0.00   1.92   4.81  12.50   6.73  13.46   1.92   6.73   4.81  10.58   3.85   9.62   7.69
  8   2.56   3.42   0.00   0.85   0.85   0.85  12.82   0.85   1.71   0.85  20.51   2.56   3.42   9.40   5.98  11.11   0.85   4.27  11.97   3.42
  9   0.00   0.00   0.00   2.97   2.97   9.90   2.97   0.00   0.99   3.96   6.93   1.98  13.86   1.98   2.97   3.96  23.76   2.97   8.91   6.93
 10   1.87   0.93   3.74   2.80   0.00   0.00   2.80   0.00   7.48   8.41   9.35   7.48   3.74  14.95  12.15   0.00   2.80   4.67   7.48   7.48
 11   0.00   0.89   0.00   0.00   0.00   1.79   8.04   0.00   5.36   4.46  15.18   8.04   8.93   4.46   3.57   8.04   4.46   6.25  13.39   5.36
 12   2.73   0.00   0.91   2.73   0.91   3.64   4.55   3.64   3.64   1.82   9.09   5.45   3.64   5.45   6.36   7.27   8.18   5.45  10.91   9.09
 13   1.80   0.90   0.90   0.90   0.00   0.90   9.01   0.00   3.60   7.21  14.41   8.11   7.21   6.31   7.21   4.50   1.80   7.21  11.71   4.50
 14   2.94   0.00   0.00   5.88   0.00   6.86   1.96   0.00   3.92   6.86   3.92   9.80  13.73   0.98   5.88   2.94  10.78   0.98  10.78   9.80
 15   2.91   1.94   2.91   1.94   0.00   5.83   1.94   0.00   1.94   9.71   5.83   8.74  10.68   1.94   3.88   3.88   8.74   2.91  11.65  10.68
 16   2.86   0.95   0.00  11.43   1.90   1.90   2.86   0.00   4.76   3.81   5.71   8.57   8.57   6.67   9.52   4.76   5.71   2.86   7.62   7.62
 17   1.92   0.96   1.92   4.81   1.92   3.85   1.92   0.96   0.96   6.73   4.81   8.65  10.58   2.88   6.73   2.88   9.62   6.73   8.65   7.69
 18   1.71   0.85   1.71   0.85   0.85   2.56  16.24   0.85   1.71   0.85  16.24   5.13   6.84   5.98   3.42  11.11   1.71   5.13  11.11   3.42
 19   0.94   0.94   1.89   0.94   0.94   0.94   1.89   0.94  10.38   7.55   5.66   9.43   8.49   8.49   7.55   5.66   6.60  11.32   6.60   0.94
 20   0.86   0.86   0.00   1.72   0.86   0.86  17.24   0.86   2.59   1.72  15.52   7.76   5.17   3.45   4.31   9.48   5.17   5.17   9.48   5.17
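where
b1 = aaa + ata,  b2 = aca + aga,  b3 = cac + ctc,  b4 = ccc + cgc,
b5 = gag + gtg,  b6 = gcg + ggg,  b7 = tat + ttt,  b8 = tct + tgt,
b9 = aac + caa + atc + cta,   b10 = aag + gaa + atg + gta,
b11 = aat + taa + att + tta,  b12 = acc + cca + agc + cga,
b13 = acg + gca + agg + gga,  b14 = act + tca + agt + tga,
b15 = cag + gac + ctg + gtc,  b16 = cat + tac + ctt + ttc,
b17 = ccg + gcc + cgg + ggc,  b18 = cct + tcc + cgt + tgc,
b19 = gat + tag + gtt + ttg,  b20 = gct + tcg + ggt + tgg.
Taken together, these form a basic feature set of 41 variables (5 from Table 1, 16 from Table 2, and 20 from Table 3).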
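2. Feature extraction
The basic feature set above has 41 variables, so the samples lie in a high-dimensional space. Feature extraction uses a transformation to represent the samples in a lower-dimensional space so that most of the properties of X can be expressed by Y; that is, the p-dimensional random vector X is transformed into a q-dimensional random vector Y (q < p). We use principal component analysis for feature extraction, in the following steps:
(1) Find the eigenvalues of the covariance matrix V of X, denoted
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k > 0, \qquad \lambda_{k+1} = \cdots = \lambda_p = 0.$$
(2) Find orthonormal eigenvectors $a_1, a_2, \ldots, a_k$ corresponding to $\lambda_1, \lambda_2, \ldots, \lambda_k$. The i-th principal component is then
$$y_i = X a_i, \qquad i = 1, 2, \ldots, k.$$
(3) Compute the contribution rate of the i-th principal component,
$$u_i = \lambda_i \Big/ \sum_{j=1}^{k} \lambda_j, \qquad i = 1, 2, \ldots, k,$$
and the cumulative contribution rate of the first m principal components,
$$v_m = \sum_{i=1}^{m} u_i.$$
(4) Find q such that $v_q > v_0$ ($v_0$ is usually between 0.85 and 1), then take $W = (a_1, a_2, \ldots, a_q)$ and
$$Y = XW.$$
The contribution rate computed in step (3) represents the ability of a principal component to express X: the larger the contribution rate, the stronger that ability. As long as the cumulative contribution rate of the first q principal components exceeds the given fraction $v_0$, the low-dimensional feature $Y = (y_1, y_2, \ldots, y_q)$ can be used to reflect the variation of the high-dimensional feature $(x_1, x_2, \ldots, x_p)$.
We now apply feature extraction to the random vector X of the 41 features of the 20 samples of known class.
The computed cumulative contribution rate of the first 4 principal components is 96%, so 4 variables are extracted as features: taking $W = (a_1, a_2, a_3, a_4)$, we have $Y = XW$, and the 4 components of Y form the feature parameter vector extracted from the basic feature set. (The program and results are given in Appendix 2.)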
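Steps (3) and (4) amount to a short calculation once the eigenvalues are known. The sketch below is a hypothetical Python illustration, not the authors' MATLAB program; the function names and the example eigenvalues are assumptions chosen only to show the selection rule.

```python
def contribution_rates(eigenvalues):
    """Return (rates, cumulative) for eigenvalues sorted in decreasing order."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    rates = [value / total for value in ordered]
    cumulative = []
    running = 0.0
    for rate in rates:
        running += rate
        cumulative.append(running)
    return rates, cumulative


def choose_num_components(eigenvalues, threshold=0.85):
    """Smallest q whose cumulative contribution rate exceeds `threshold`."""
    _, cumulative = contribution_rates(eigenvalues)
    for q, value in enumerate(cumulative, start=1):
        if value > threshold:
            return q
    # Fall back to all components if the threshold is never exceeded.
    return len(cumulative)


# Hypothetical eigenvalues, chosen only to illustrate the rule.
example = [6.0, 2.0, 1.0, 0.6, 0.3, 0.1]
print(choose_num_components(example, threshold=0.95))
```

With these made-up eigenvalues the cumulative rates are roughly 0.60, 0.80, 0.90, 0.96, 0.99, 1.00, so a threshold of 0.95 selects 4 components and a threshold of 0.85 selects 3.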
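(II) Formulating the Classification Decision
The feature parameters have now been chosen; the multidimensional space they span is called the feature space. The classification decision assigns an object to be recognized to a class by statistical methods in the feature space. The basic approach is to determine, on the basis of the training sample set, a decision rule such that classifying the objects under this rule gives the smallest misclassification rate or the smallest loss.
Here we choose Fisher's linear discriminant as the classification decision; that is, we choose a linear discriminant function U(X) such that
$$\frac{\{E_1[U(X)] - E_2[U(X)]\}^2}{D_1[U(X)] + D_2[U(X)]} = \max \tag{1}$$
where $E_i$ and $D_i$ denote expectation and variance under population i, i = 1, 2.
The meaning of (1) is: construct a linear discriminant function U(X) to classify the samples so that the average probability of error is smallest, that is, the values of U(X) under different populations should be separated as much as possible. Specifically, the between-population difference $\{E_1[U(X)] - E_2[U(X)]\}^2$ should be as large as possible relative to the within-population difference $D_1[U(X)] + D_2[U(X)]$. Taking
$$U(X) = (\bar{X}_1 - \bar{X}_2)^T (\Sigma_1 + \Sigma_2)^{-1} X$$
satisfies (1), where $\bar{X}_i$ is the estimate of the mean vector of population i and $\Sigma_i$ is the estimate of the covariance matrix of population i. The classification threshold is taken as
$$U_0 = U(\alpha \bar{X}_1 + (1 - \alpha) \bar{X}_2)$$
where $0 < \alpha < 1$; in this problem the two classes have equal numbers of samples, so we may take $\alpha = 1/2$. If $U(\bar{X}_1) > U_0 > U(\bar{X}_2)$, then X is assigned to population 1 when $U(X) > U_0$ and to population 2 when $U(X) < U_0$.
Using the feature set formed by the 4 principal components above together with this classification decision, the 20 training samples are all classified correctly. However, if we take $W = (a_1, a_2, a_3)$, compute $Y = XW$, and use the 3 components of Y as the feature parameter vector, then Fisher's linear discriminant fails to classify the fourth sample correctly.
The classification model is therefore:
(1) Feature selection: take $W = (a_1, a_2, a_3, a_4)$ and compute $Y = XW$; the feature parameter vectors are the 4 column vectors of Y, where X is the random vector of the 41 features of the 20 training samples.
(2) Classification decision: Fisher's linear discriminant.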
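The decision rule above can be written out in a few lines. The following is a minimal pure-Python sketch under stated assumptions: the function names (`fit_fisher`, `classify`, and the helpers) and the 2-feature training data are hypothetical and are not taken from the paper, whose own implementation is the MATLAB program in Appendix 2. It computes $w = (\Sigma_1 + \Sigma_2)^{-1}(\bar{X}_1 - \bar{X}_2)$, so that $U(X) = w^T X$, and the threshold $U_0$ with $\alpha = 1/2$.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1) of a list of feature rows."""
    n = len(rows)
    mu = mean_vector(rows)
    dim = len(mu)
    cov = [[0.0] * dim for _ in range(dim)]
    for row in rows:
        diff = [value - m for value, m in zip(row, mu)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += diff[i] * diff[j] / (n - 1)
    return cov


def solve(matrix, rhs):
    """Solve matrix * w = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (aug[i][n] - tail) / aug[i][i]
    return w


def fit_fisher(class1, class2, alpha=0.5):
    """Return (w, u0): U(x) = w . x and u0 = U(alpha*m1 + (1 - alpha)*m2)."""
    m1, m2 = mean_vector(class1), mean_vector(class2)
    s1, s2 = covariance_matrix(class1), covariance_matrix(class2)
    pooled = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(s1, s2)]
    w = solve(pooled, [a - b for a, b in zip(m1, m2)])
    midpoint = [alpha * a + (1 - alpha) * b for a, b in zip(m1, m2)]
    u0 = sum(wi * xi for wi, xi in zip(w, midpoint))
    return w, u0


def classify(x, w, u0):
    """Class 1 if U(x) > u0, otherwise class 2."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > u0 else 2


# Hypothetical 2-feature training data, not taken from the paper.
class_a = [[2.0, 3.0], [3.0, 3.5], [2.5, 4.0], [3.5, 4.5]]
class_b = [[-2.0, -1.0], [-3.0, -2.5], [-2.5, -1.5], [-3.5, -3.0]]
w, u0 = fit_fisher(class_a, class_b)
print(classify([3.0, 4.0], w, u0), classify([-3.0, -2.0], w, u0))
```

Solving the linear system instead of forming the inverse explicitly is a design choice for this sketch; it gives the same direction w whenever $\Sigma_1 + \Sigma_2$ is invertible.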
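(III) Examining the Effectiveness of the Classification Model
The classification model built above classifies the 20 training samples correctly. To examine its effectiveness and reliability further, we hold out part of the training samples from training, use the classification model to predict them, and take the prediction success rate as the measure of predictive ability.
Each time, one training sample is removed, the remaining training samples are used as the training set, and the classification model is used to predict the removed sample and, at the same time, the 20 unlabeled samples. The results are shown in Table 4.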
Table 4
Removed sample   Predicted class   Samples 21-40 predicted as class A
 1               A                 22, 23, 25, 27, 29, 34, 35, 36, 37
 2               A                 22, 23, 25, 27, 29, 34, 35, 36, 37
 3               A                 22, 23, 25, 27, 29, 34, 35, 36, 37
 4               A                 23, 25, 27, 29, 34, 35, 36, 37
 5               A                 22, 23, 25, 27, 29, 34, 35, 36, 37
 6               A                 22, 23, 25, 27, 29, 34, 35, 36, 37
 7               A                 22, 23, 25, 27, 29, 34, 35, 36, 37
 8               A                 22, 23, 25, 27, 29, 34, 35, 36, 37
 9               A                 22, 23, 25, 27, 29, 34, 35, 36, 37
10               A                 22, 23, 25, 27, 29, 34, 35, 36, 37
11               B                 22, 23, 25, 27, 29, 34, 35, 36, 37
12               B                 22, 23, 25, 27, 29, 34, 35, 36, 37
13               B                 22, 23, 25, 27, 29, 34, 35, 36, 37
14               B                 22, 23, 25, 27, 29, 34, 35, 36, 37
15               B                 22, 23, 25, 27, 29, 34, 35, 36, 37, 39
16               B                 22, 23, 25, 27, 29, 34, 35, 36, 37
17               B                 22, 23, 25, 27, 29, 34, 35, 36, 37, 30, 39
18               B                 22, 23, 25, 27, 29, 34, 35, 36, 37
19               B                 22, 23, 25, 27, 29, 34, 35, 36, 37
20               B                 22, 23, 25, 27, 29, 34, 35, 37
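From Table 4 we see:
1. When one training sample is removed each time and the remaining training samples are used as the training set, the classification model predicts the removed sample with a success rate of 100%.
2. When one training sample is removed each time and the remaining training samples are used as the training set to predict the unlabeled samples 21 to 40, the results have the following characteristics:
(1) Except when sample 4, 15, 17, or 20 is removed, removing any one of the other 16 samples gives the same prediction: 22, 23, 25, 27, 29, 34, 35, 36, 37. This accounts for 80% of the cases (16 of 20).
(2) The predictions obtained when sample 4, 15, or 20 is removed differ from the result in (1) by only one sample. This accounts for 15% of the cases.
(3) The prediction obtained when sample 17 is removed differs from the result in (1) by two samples. This accounts for 5% of the cases.
The first and second kinds of result are very close and together account for 95% of the cases. Only the single result in the third group differs more, accounting for 5% of the cases.
From this test we conclude that the classification model performs very well.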
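The hold-one-out check described in this subsection can be sketched generically. The Python below is an illustration under stated assumptions: `leave_one_out` takes any `fit` and `predict` pair, and the nearest-class-mean classifier used here is only a simple stand-in, not the Fisher discriminant of the paper; the feature rows are hypothetical.

```python
def nearest_mean_fit(rows, labels):
    """Stand-in classifier: remember the mean feature vector of each class."""
    means = {}
    for label in set(labels):
        members = [row for row, lab in zip(rows, labels) if lab == label]
        means[label] = [sum(col) / len(members) for col in zip(*members)]
    return means


def nearest_mean_predict(means, x):
    """Assign x to the class whose mean is closest in squared distance."""
    def sq_dist(m):
        return sum((a - b) ** 2 for a, b in zip(x, m))
    return min(means, key=lambda label: sq_dist(means[label]))


def leave_one_out(rows, labels, fit, predict):
    """Hold out each sample in turn, train on the rest, predict the held-out one.

    Returns the fraction of held-out samples that were predicted correctly.
    """
    hits = 0
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        model = fit(train_rows, train_labels)
        if predict(model, rows[i]) == labels[i]:
            hits += 1
    return hits / len(rows)


# Hypothetical feature rows, not the paper's data.
rows = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [4.0, 4.2], [3.8, 4.1], [4.2, 3.9]]
labels = ["A", "A", "A", "B", "B", "B"]
print(leave_one_out(rows, labels, nearest_mean_fit, nearest_mean_predict))
```

The paper's procedure also re-predicts samples 21 to 40 with each reduced training set and compares the 20 resulting predictions; that comparison could be added inside the loop by applying `predict` to the unlabeled rows as well.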
(IV) Prediction of Unknown Samples
We now use the model built above to predict the 20 artificial sequences and the 182 natural sequences of unknown class given in the problem. (The program is given in Appendix 3.)
The results are:
1) Classes of the 20 artificial sequences
Class A: 22, 23, 25, 27, 29, 34, 35, 36, 37
Class B: 21, 24, 26, 28, 30, 31, 32, 33, 38, 39, 40
2) Classes of the 182 natural sequences
Class A (142 in total): 2, 3, 5, 6, 7, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 28, 30, 31, 33, 34, 35, 36, 37, 38, 39, 40, 42, 44,
45, 46, 47, 49, 50, 51, 52, 53, 55, 56, 57, 58, 59, 60, 61, 62, 64, 65, 66,
67, 68, 69, 71, 73, 74, 77, 78, 79, 80, 82, 83, 84, 85, 87, 88, 89, 91, 93,
94, 95, 96, 97, 98, 99, 100, 101, 103, 104, 105, 106, 107, 108, 109, 111, 112,
113, 114, 115, 117, 118, 120, 121, 122, 123, 124, 125, 127, 128, 129, 130,
132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 145, 146, 147,
148, 149, 151, 152, 153, 154, 155, 156, 158, 167, 168, 171, 172, 173, 174,
175, 176, 177, 178, 179, 180, 181
Class B (40 in total): 1, 4, 8, 10, 27, 29, 32, 41, 43, 48, 54, 63, 70, 72, 75, 76,
81, 86, 90, 92, 102, 110, 116, 119, 126, 131, 144, 150, 157, 159, 160, 161,
162, 163, 164, 165, 166, 169, 170, 182
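IV. Strengths and Weaknesses of the Model
Strengths:
1. For the "supervised classification" problem, a mathematical model that solves this kind of problem was built successfully and can be put into practice immediately.
2. Only 4 feature parameters were needed to solve a fairly complex classification problem. Moreover, the model makes few assumptions, so it reflects the actual situation accurately and is highly reliable.
3. The analysis is modular and proceeds step by step, which improves accuracy.
4. The model emphasizes the key features under reasonable assumptions and avoids getting entangled in details.
Weaknesses:
Because only the frequencies of 1-character, 2-character, and 3-character strings in the DNA sample sequences are used as features, the classification of DNA sequences may not agree completely with reality. (Measurements made by scientists using physical or chemical methods could serve as a supplement.)
V. Directions for Improvement and Generalization
Improvement: Because the model does not consider the actual properties of DNA sequences, classification accuracy will fall, and the model may become unusable, when the sequences become very numerous, long, and complex. Biological properties of DNA sequences should therefore be taken into account.
Generalization: The model is significant for solving general "supervised classification" problems. It provides an effective classification model for studying the regularities and structure of DNA sequences, is of practical value for research on the human genome, and helps speed up scientific research.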
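VI. References
[1] Brian Hayes. The Invention of the Genetic Code. American Scientist, Computing Science, Jan.-Feb., 1998
[2] Xiao Shutie (ed.). Mathematical Experiments. Beijing: Higher Education Press, 1999
[3] Fudan University. Probability Theory, Vol. 2: Mathematical Statistics. Beijing: Higher Education Press, 1985
[4] William F. Lucas (ed.). Life Science Models. Changsha: National University of Defense Technology Press, 1996
[5] Xu Guanghui (ed.). Handbook of Operations Research Fundamentals. Beijing: Science Press, 1999
[6] Jiang Qiyuan (ed.). Mathematical Models. Beijing: Higher Education Press, 1993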
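VII. Appendices
Appendix 1. Program for computing single-character frequencies
! Percentage frequencies of a, c, t, g and a+t in each of 40 sequences
! of 121 characters.
character*121 line(40)
integer a, c, t, g, at, ii, i
real actg, aa, cc, tt, gg, aatt
read *, line
! open the output file once, before the loop
open(5, file='t1.dat', status='old')
do 20 ii = 1, 40
! reset the counters for every sequence
a = 0
c = 0
t = 0
g = 0
do 10 i = 1, 121
if (line(ii)(i:i) .eq. 'a') then
a = a + 1
else if (line(ii)(i:i) .eq. 'c') then
c = c + 1
else if (line(ii)(i:i) .eq. 't') then
t = t + 1
else if (line(ii)(i:i) .eq. 'g') then
g = g + 1
end if
10 continue
at = a + t
! total number of counted characters, as a real to avoid integer division
actg = real(a + c + t + g)
aa = a/actg*100.
cc = c/actg*100.
tt = t/actg*100.
gg = g/actg*100.
aatt = at/actg*100.
write(5,1) aa, cc, tt, gg, aatt
1 format(1x,5f7.2)
20 continue
close(5)
end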
Appendix 2. Program and results for extracting the basic features
d=[27.4319.4736.2816.8163.72;
28.8524.0422.1225.0050.96;
17.6525.4918.6338.2436.27;
20.8719.1340.8719.1361.74;
24.7622.8621.9030.4846.67;
21.9321.0538.6018.4260.53;
23.0820.1923.0833.6546.15;
25.6414.5344.4415.3870.09;
14.8521.7818.8144.5533.66;
28.9724.3025.2321.5054.21;
24.1117.8635.7122.3259.82;
17.4322.9433.0326.6150.46;
27.0318.9233.3320.7260.36;
23.5323.5316.6736.2740.20;
24.27 21.36 20.39 33.98 44.66;
22.8630.4820.9525.7143.81;
21.36 25.24 20.39 33.01 41.75;
22.2217.0943.5917.0965.81;
27.3628.3023.5820.7550.94;
19.8319.8343.1017.2462.93];
dd=[5.314.427.968.859.736.191.7718.586.194.424.424.426.194.424.421.77;
7.699.623.857.699.623.85.966.732.881.927.6911.547.698.652.884.81;
2.943.925.884.903.922.941.969.80.001.9612.759.8010.78.984.9021.57;
1.744.353.4811.3013.041.742.6122.612.619.574.352.613.484.358.702.61;
6.673.813.819.525.711.904.769.527.624.767.622.864.763.819.5212.38;
3.513.515.269.657.894.391.7524.567.896.141.754.392.632.6311.401.75;
5.774.814.817.696.732.882.8810.582.882.887.696.737.694.814.8115.38;
3.425.139.406.8411.975.133.4223.932.566.842.562.567.693.421.712.56;
1.981.983.966.933.962.972.978.911.98.998.918.916.934.957.9224.75;
9.355.612.8010.287.485.615.616.548.417.482.805.613.748.419.35.00;
2.685.364.4611.6115.181.79.8916.963.576.253.574.462.687.147.145.36;
5.502.752.756.426.427.344.5913.764.595.506.426.42.9210.096.428.26;
5.417.217.217.2110.811.805.4115.323.604.502.707.217.216.316.31.90;
7.844.90.988.824.90.982.947.842.943.929.806.867.843.926.8617.65;
5.834.853.889.717.773.881.946.803.882.913.889.716.806.808.7411.65;
4.763.811.9012.388.575.71.006.675.713.8110.4810.483.818.579.522.86;
3.882.912.9110.685.83.976.805.835.835.839.713.884.855.8311.6510.68;
3.429.405.983.4210.261.714.2727.355.133.424.273.422.566.841.715.98;
8.495.664.728.494.728.492.836.6011.321.899.435.662.839.434.723.77;
3.457.764.314.3110.34.863.4527.591.726.038.623.454.315.171.726.03];
ddd=[1.77 3.54 2.65 .88 .00 .00 7.96 .88 4.42 2.65 17.70 10.62 3.54 4.42 4.42 7.08 1.77 3.54 13.27 7.08;
1.921.92.96.96.00.961.92.964.8112.507.6911.548.653.853.856.739.626.737.692.88;
.98.00.005.88.988.822.94.00.002.9410.785.8813.73.004.903.9219.611.968.825.88;
.00.00.00.87.00.8713.041.746.092.6111.3013.043.485.223.488.703.481.7414.787.83;
2.86.00.003.81.953.813.81.003.813.819.529.5212.382.869.523.817.622.867.629.52;
.00.00.882.63.001.7513.16.884.391.7514.049.657.025.264.3911.402.631.7510.536.14;
1.92.00.002.88.964.812.88.001.924.8112.506.7313.461.926.734.8110.583.859.627.69;
2.563.42.00.85.85.8512.82.851.71.8520.512.563.429.405.9811.11.854.2711.973.42;
.00.00.002.972.979.902.97.00.993.966.931.9813.861.982.973.9623.762.978.916.93;
1.87.933.742.80.00.002.80.007.488.419.357.483.7414.9512.15.002.804.677.487.48;
.00.89.00.00.001.798.04.005.364.4615.188.048.934.463.578.044.466.2513.395.36;
2.75.00.922.75.923.674.593.673.671.839.175.503.675.506.427.348.265.5011.019.17;
1.80.90.90.90.00.909.01.003.607.2114.418.117.216.317.214.501.807.2111.714.50;
2.94.00.005.88.006.861.96.003.926.863.929.8013.73.985.882.9410.78.9810.789.80;
2.911.942.911.94.005.831.94.001.949.715.838.7410.681.943.883.888.742.9111.6510.68;
2.86.95.0011.431.901.902.86.004.763.815.718.578.576.679.524.765.712.867.627.62;
1.94.971.944.851.943.881.94.97.976.804.858.7410.682.916.802.919.716.808.747.77;
1.71.851.71.85.852.5616.24.851.71.8516.245.136.845.983.4211.111.715.1311.113.42;
.94.941.89.94.94.941.89.9410.387.555.669.438.498.497.555.666.6011.326.60.94;
.86.86.001.72.86.8617.24.862.591.7215.527.765.173.454.319.485.175.179.485.17];
x=[29.7317.1213.5139.6443.24;
27.0316.2215.3241.4442.34;
27.03 21.62 6.31 45.05 33.33;
42.3410.8128.8318.0271.17;
23.4223.4210.8142.3434.23;
35.1412.6112.6139.6447.75;
35.149.9118.9236.0454.05;
27.9316.2218.9236.9446.85;
20.7220.7215.3243.2436.04;
18.18 27.27 13.64 40.91 31.82;
35.454.5550.0010.0085.45;
32.732.7350.0014.5582.73;
25.4510.0051.8212.7377.27;
30.008.1850.0011.8280.00;
29.09.0064.556.3693.64;
36.368.1846.369.0982.73;
35.4524.5526.3613.6461.82;
29.0911.8250.009.0979.09;
21.8214.5556.367.2778.18;
20.0017.2756.366.3676.36];
xx=[9.01 9.01 3.60 8.11 4.50 .90 4.50 3.60 3.60 3.60 1.80 8.11 11.71 2.70 5.41 18.92;
9.917.213.605.412.701.805.415.414.501.80.909.019.914.505.4121.62;
5.4111.713.605.412.701.80.90.905.41.90.9014.4113.51.907.2123.42;
18.925.4111.715.4110.811.805.4110.815.411.80.902.706.314.502.704.50;
6.318.111.807.211.802.702.703.605.414.502.7010.819.91.909.0121.62;
15.322.706.319.913.601.801.805.414.50.00.008.1110.81.908.1119.82;
15.32 1.80 10.81 7.21 4.50 2.70 6.31 5.41 .90 1.80 .90 6.31 13.51 .90 4.50 16.22;
8.113.606.319.915.413.602.707.212.703.601.808.1110.811.807.2116.22;
9.01.904.506.31.003.607.214.503.602.702.7011.717.213.6013.5118.02;
6.363.641.826.361.825.452.733.645.453.644.5513.644.553.6413.6418.18;
15.452.7314.552.7316.36.911.8230.00.91.91.911.822.734.55.002.73;
13.64.9110.916.3615.451.821.8230.91.91.91.00.912.737.27.004.55;
6.36 4.55 10.00 4.55 12.73 1.82 2.73 34.55 2.73 2.73 1.82 1.82 3.64 4.55 1.82 2.73;
8.18.9112.737.2713.646.361.8228.182.734.55.00.915.454.55.91.91;
13.64.0012.731.8213.64.002.7348.18.00.00.00.001.823.64.00.91;
16.363.6415.45.9113.644.554.5522.731.825.45.00.914.552.73.001.82;
17.27 5.45 10.91 1.82 10.00 6.36 4.55 5.45 4.55 7.27 9.09 2.73 3.64 2.73 3.64 3.64;
8.187.2711.821.8215.451.82.9130.913.643.641.822.731.823.64.912.73;
2.732.7313.641.8214.559.09.9131.821.828.181.822.732.732.73.91.91;
6.366.366.36.919.0910.003.6432.732.7313.64.91.001.823.64.00.91];
xxx=[5.41.902.70.905.413.60.901.802.708.114.501.8025.233.603.605.4113.51.003.604.50;
2.702.70.00.003.606.312.70.907.217.216.311.8018.92.906.311.8014.41.003.6010.81;
2.702.702.70.003.606.31.00.904.505.411.80.9029.73.005.414.5022.52.001.802.70;
15.326.31.00.00.00.909.011.806.3110.8112.613.604.501.802.705.411.801.807.216.31;
3.601.802.70.005.417.21.90.004.501.802.703.6020.721.806.314.5019.821.801.807.21;
9.01.90.90.002.705.414.50.002.7013.516.31.0025.23.901.801.8016.22.002.703.60;
9.01 1.80 .00 .00 1.80 4.50 4.50 .90 3.60 16.22 8.11 .00 17.12 2.70 1.80 1.80 10.81 .90 6.31 6.31;
2.701.80.90.902.703.602.70.904.509.918.113.6018.92.902.704.5012.61.907.218.11;
5.41.00.901.805.419.011.80.903.606.311.803.6011.712.702.702.7020.721.804.5010.81;
3.64.912.736.363.6410.91.911.823.642.732.73.9117.27.004.554.5517.274.551.827.27;
9.09.91.00.00.00.0024.55.003.646.3633.64.914.551.82.001.82.002.735.452.73;
2.73.91.00.00.00.0019.09.001.828.1837.27.004.554.55.002.73.00.9110.005.45;
.91 2.73 .00 .00 .00 .00 27.27 1.82 1.82 5.45 26.36 2.73 4.55 2.73 4.55 5.45 1.82 2.73 5.45 1.82;
6.365.45.00.001.82.0020.005.452.732.7324.55.001.823.643.648.18.91.919.09.91;
11.82.91.00.001.82.0047.271.82.003.6425.45.00.91.91.00.00.00.002.73.91;
10.002.73.91.00.00.0014.554.555.453.6431.82.91.913.641.826.36.00.007.273.64;
10.91 .91 3.64 3.64 .00 .91 8.18 2.73 12.73 9.09 11.82 3.64 3.64 6.36 1.82 1.82 6.36 6.36 1.82 1.82;
4.554.55.00.00.91.9121.82.914.55.9129.09.003.641.82.9110.912.734.554.55.91;
3.64 .91 1.82 .91 .91 .00 25.45 5.45 3.64 .00 21.82 1.82 1.82 3.64 .91 13.64 .91 2.73 5.45 2.73;
2.73.915.45.00.00.0023.6410.006.361.8213.64.001.828.181.8213.64.001.826.36.00];
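% Combine the three feature blocks: 5 + 16 + 20 = 41 columns
ffx=[x xx xxx];      % 20 training samples (1 to 20)
ffd=[d dd ddd];      % 20 unlabeled samples (21 to 40)
cx=cov(ffx);
[vx,ex]=eig(cx);
ex1=eig(cx);
e1=mean(ex1)*41;     % sum of all 41 eigenvalues
ex2=ex1(38:41,:);    % the 4 largest (assumes eig returns ascending order)
e2=mean(ex2)*4;      % sum of the 4 largest eigenvalues
e2/e1                % cumulative contribution rate of the first 4 principal components
vx1=[vx(:,38:41)];   % corresponding eigenvectors, W
s=ffx*vx1;ss=ffd*vx1;
x=s(1:10,:);         % class A training features
y=s(11:20,:);        % class B training features
u1=mean(x);u2=mean(y);
u1-u2;
z=8/9*(cov(x)+cov(y));
ux=0.5*(u1-u2)*inv(z);   % discriminant direction
u12=0.5*u1+0.5*u2;
u0=ux*u12.';             % classification threshold
la=0;                    % number of unlabeled samples assigned to class 1
for i=1:10
p(i)=ux*ss(i,:).';
tx(i)=ux*x(i,:).';
fy(i)=ux*y(i,:).';
if p(i)>u0
pbd(i)=1;
la=la+1;
else
pbd(i)=2;
end
if tx(i)>u0
lbx(i)=1;
else
lbx(i)=2;
end
if fy(i)>u0
lby(i)=1;
else
lby(i)=2;
end
end
for n=11:20
p(n)=ux*ss(n,:).';
if p(n)>u0
pbd(n)=1;
la=la+1;
else
pbd(n)=2;
end
end
tx,fy,p
pbd,lbx,lby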
ans = 0.9847
u0 = -2.4812
tx = Columns 1 through 7
8.2471 9.7074 10.8780 3.8672 9.3837 9.7612 9.2014
Columns 8 through 10
6.2700 11.6489 5.4181
fy = Columns 1 through 7
-15.2467 -15.2121 -14.2828 -8.0112 -13.4839 -11.1970 -11.2608
Columns 8 through 10
-15.0827 -14.9635 -15.2662
p = Columns 1 through 7
-6.5147 -3.6869 0.7514 -6.0838 0.3758 -6.7805 0.1074
Columns 8 through 14
-8.1194 5.0825 -6.1039 -7.0908 -2.7297 -6.0715 4.1447
Columns 15 through 20
4.5919 -4.2199 0.9096 -9.2269 -8.1303 -10.7112
pbd = Columns 1 through 12
2 2 1 2 1 2 1 2 1 2 2 2
Columns 13 through 20
lby = 2 2 2 2 2 2 2 2 2 2
Appendix 3. Program for classifying the unknown sequences
d=[27.4319.4736.2816.8163.72;
28.8524.0422.1