5 basic data-processing code examples: correlation coefficient (Pearson product-moment coefficient), Apriori algorithm, FP-Tree, decision tree, Bayesian classification [Jupyter Notebook]

Contents
I. Correlation coefficient (Pearson product-moment coefficient)
II. Apriori algorithm
III. FP-Tree
IV. Decision tree
V. Bayesian classification
VI. Summary

I. Correlation coefficient (Pearson product-moment coefficient)

**1. Overview:** Implementing the Pearson coefficient mainly means splitting the formula into a numerator and a denominator, then expressing both in terms of two easy-to-implement helpers: a mean function and a standard-deviation function.

**2. Code (Python):**

```python
from math import sqrt


# Mean of a sequence
def avg(g):
    sum_i = 0.0
    for i in range(len(g)):
        sum_i += g[i]
    return sum_i / len(g)


# Population standard deviation of a sequence
def standev(a):
    sum_a = 0.0
    len_a = len(a)
    mean_a = avg(a)  # compute the mean once instead of on every iteration
    for i in range(len_a):
        sum_a += pow(a[i] - mean_a, 2)
    return sqrt(sum_a / len_a)


# Pearson correlation coefficient
def cal_pearson(x, y):
    n = len(x)
    molecular = 0.0  # numerator
    avg_x = avg(x)
    avg_y = avg(y)
    # Numerator: sum of products of deviations from the means
    for i in range(n):
        molecular += (x[i] - avg_x) * (y[i] - avg_y)
    # Denominator: n times the product of the two standard deviations
    denominator = n * standev(x) * standev(y)
    return molecular / denominator


num1 = [float(n) for n in input().split()]  # split() separates the input on whitespace
num2 = [float(n) for n in input().split()]
print("Correlation coefficient: " + str(cal_pearson(num1, num2)))
```

**3. Output**

[Output screenshot not preserved.]

II. Apriori algorithm

**1. Overview:** This algorithm is implemented in Python in Jupyter Notebook. The design and implementation of the code come from an episode of "菊安醬的機器學習". The code works as follows: first a simple dataset is defined; then three functions do the work: CreateC1() generates the candidate 1-itemsets, scanD() scans the candidate itemsets and produces the frequent itemsets, and aprioriGen() merges frequent itemsets to generate candidate itemsets of the next size; finally, apriori() calls these functions to produce the frequent itemsets of the dataset.

**2. Code:**

```python
def loadDataSet():
    dataSet = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
    return dataSet


# Generate the candidate 1-itemsets
def CreateC1(dataSet):
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if not {item} in C1:
                C1.append({item})
    # Sets compare by the subset relation, so this sort leaves
    # disjoint singletons in their insertion order.
    C1.sort()
    return list(map(frozenset, C1))


# Scan the candidates and keep those that reach the minimum support
def scanD(D, Ck, minSupport):
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                if can not in ssCnt.keys():
                    ssCnt[can] = 1
                else:
                    ssCnt[can] += 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItems
        supportData[key] = support
        if support >= minSupport:
            retList.append(key)
    return retList, supportData


# Merge frequent (k-1)-itemsets into candidate k-itemsets
def aprioriGen(Lk, k):
    Ck = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            L1 = list(Lk[i])[:k - 2]
            L1.sort()
            L2 = list(Lk[j])[:k - 2]
            L2.sort()
            if L1 == L2:  # same first k-2 items: merge the two itemsets
                Ck.append(Lk[i] | Lk[j])
    return Ck


def apriori(D, minSupport=0.5):
    C1 = CreateC1(D)
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while len(L[k - 2]) > 0:
        Ck = aprioriGen(L[k - 2], k)
        Lk, supK = scanD(D, Ck, minSupport)
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData


dataset = loadDataSet()
L, supportData = apriori(dataset, minSupport=0.5)
print(L)
print(supportData)
```

**3. Output:**

```
[[frozenset({1}), frozenset({3}), frozenset({2}), frozenset({5})], [frozenset({1, 3}), frozenset({2, 3}), frozenset({3, 5}), frozenset({2, 5})], [frozenset({2, 3, 5})], []]
{frozenset({1}): 0.5, frozenset({3}): 0.75, frozenset({4}): 0.25, frozenset({2}): 0.75, frozenset({5}): 0.75, frozenset({1, 3}): 0.5, frozenset({2, 3}): 0.5, frozenset({3, 5}): 0.5, frozenset({2, 5}): 0.75, frozenset({1, 2}): 0.25, frozenset({1, 5}): 0.25, frozenset({2, 3, 5}): 0.5}
```

III. FP-Tree

**1. Overview:** The FP-tree is a fairly complex classic algorithm, and its code is correspondingly involved. The code below first creates a class treeNode so that later code can reach its fields: a name, a count, a link to similar items, the parent node and the child nodes. The class also has a counting method and a display method. Next the dataset is defined. The function updateHeader(nodeToTest, targetNode) updates the header table: it walks the chain of node links from head to tail and attaches the target node at the end. The function updateTree(items, myTree, headerTable, count) first tests whether the first item of the transaction is already a child node. If it is, the child's count is increased; if it is not, a new treeNode is created and added to the tree as a child. In that case the header table must also be updated to point to the new node, which is done by calling updateHeader. If items contains more than one element, the remaining elements are passed to a recursive call. Finally, createTree(dataSet, minSup=1) makes two passes over the dataset. The first pass records the support of each item and filters by the minimum support; if no item reaches the minimum support it returns None, None. The second pass builds the FP-tree.

**2. Code:**

```python
class treeNode:
    def __init__(self, nameValue, numOccur, parentNode):
        self.name = nameValue     # name of the item
        self.count = numOccur     # count (frequency)
        self.nodeLink = None      # link to the next node holding the same item
        self.parent = parentNode  # parent node
        self.children = {}        # child nodes

    def inc(self, numOccur):
        self.count += numOccur

    def disp(self, ind=1):
        print('  ' * ind, self.name, ' ', self.count)
        for child in self.children.values():
            child.disp(ind + 1)  # child nodes are indented to the right


def loadSimpDat():
    simpDat = [['r', 'z', 'h', 'j', 'p'],
               ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
               ['z'],
               ['r', 'x', 'n', 'o', 's'],
               ['y', 'r', 'x', 'z', 'q', 't', 'p'],
               ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
    return simpDat


def createInitSet(dataSet):
    retDict = {}
    for trans in dataSet:
        fset = frozenset(trans)
        retDict.setdefault(fset, 0)
        retDict[fset] += 1
    return retDict


def updateHeader(nodeToTest, targetNode):
    # Walk to the end of the node-link chain, then attach the new node
    while nodeToTest.nodeLink is not None:
        nodeToTest = nodeToTest.nodeLink
    nodeToTest.nodeLink = targetNode


def updateTree(items, myTree, headerTable, count):
    if items[0] in myTree.children:
        myTree.children[items[0]].inc(count)
    else:
        myTree.children[items[0]] = treeNode(items[0], count, myTree)
        if headerTable[items[0]][1] is None:
            headerTable[items[0]][1] = myTree.children[items[0]]
        else:
            updateHeader(headerTable[items[0]][1], myTree.children[items[0]])
    if len(items) > 1:
        updateTree(items[1:], myTree.children[items[0]], headerTable, count)


def createTree(dataSet, minSup=1):
    headerTable = {}
    # First pass: record the support of each item.
    # Each transaction is weighted by how often it occurs in dataSet.
    for trans in dataSet:
        for item in trans:
            headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
    # Filter by the minimum support
    lessThanMinsup = list(filter(lambda k: headerTable[k] < minSup, headerTable.keys()))
    for k in lessThanMinsup:
        del headerTable[k]
    freqItemSet = set(headerTable.keys())
    # If no item reaches the minimum support, return None, None
    if len(freqItemSet) == 0:
        return None, None
    for k in headerTable:
        headerTable[k] = [headerTable[k], None]
    myTree = treeNode('φ', 1, None)
    # Second pass: build the FP-tree
    for tranSet, count in dataSet.items():
        # Keep only the frequent items of this transaction.
        # key: an item of the transaction, value: its global support
        localD = {}
        for item in tranSet:
            if item in freqItemSet:
                localD[item] = headerTable[item][0]
        if len(localD) > 0:
            # Sort the items of the transaction by global frequency,
            # equivalent to ORDER BY p[1] DESC, p[0] DESC
            orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: (p[1], p[0]), reverse=True)]
            updateTree(orderedItems, myTree, headerTable, count)
    return myTree, headerTable


simpDat = loadSimpDat()
dictDat = createInitSet(simpDat)
myFPTree, myheader = createTree(dictDat, 3)
myFPTree.disp()
```

**3. Output**

[Output screenshot not preserved.]

IV. Decision tree

**1. Overview:** A decision tree relies on two concepts: Shannon entropy and information gain. The first is the Shannon entropy, computed as

$$H = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)$$

where $p(x_i)$ is the fraction of records that belong to class $x_i$. From the Shannon entropy, the information gain of each column of the dataset is computed. Comparing the information gains identifies the column whose split leaves the lowest impurity; the dataset is split on that column, and finally the decision tree is built. In the code below, calcShannonEnt(dataSet) computes the Shannon entropy of the dataset; createDataSet() builds the dataset; splitDataSet(dataSet, axis, value) returns the records whose feature at position axis equals value; chooseBestFeatureToSplit(dataSet) finds the best feature to split on; majorityCnt(classList) decides the class of a leaf node by majority vote; and createTree(dataSet, labels) builds the tree.

**2. Code:**

```python
from math import log
import operator


# Compute the Shannon entropy of a dataset
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)  # number of records
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the last element of each record is its class label
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # accumulate the entropy
    return shannonEnt


def createDataSet():
    dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels


# Split the dataset on a given feature
#   dataSet: the dataset to split
#   axis:    index of the feature to split on
#   value:   feature value that the returned records must have
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]            # elements before position axis
            reducedFeatVec.extend(featVec[axis + 1:])  # elements after position axis
            retDataSet.append(reducedFeatVec)
    return retDataSet


# Choose the best feature to split on
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature


# Majority vote over the class labels of a leaf
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    # items() replaces the Python 2 iteritems()
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


# Build the tree
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # all records share one class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)  # no features left
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]  # note: this modifies the caller's list
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree


myData, labels = createDataSet()
print(createTree(myData, labels))
```

**3. Output**

```
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```

V. Bayesian classification

**1. Overview:** Bayesian algorithms are applied very widely. After looking through a range of material, such as Bayesian text classifiers, Bayesian experiments on the iris dataset, and basic code implementing Bayes' theorem, the code below was chosen as the reference for this task: computing the conditional probabilities of fruits with Bayes' theorem. The code first defines count_total(data), which counts the fruits, and cal_base_rates(data), which computes the share of each kind of fruit in the total, taken as the prior probability. Then likelihold_prob(data) computes the share of each feature value within a given kind of fruit. The function evidence_prob(data) computes, for each attribute, its share across all three kinds of fruit. The class navie_bayes_classifier contains __init__(self, data=datasets) and get_label(self, length, sweetness, color); the latter divides each fruit's probability for a feature by the overall probability of that feature and multiplies the result by that fruit's share. Finally, the conditional probabilities of the fruits under the different attribute combinations are computed and printed by main().

**2. Code:**

```python
# The dataset holds bananas, oranges and other fruit. Each kind of fruit has
# three attributes: long or not, sweet or not, yellow or not.
datasets = {
    'banala': {'long': 400, 'not_long': 100, 'sweet': 350, 'not_sweet': 150, 'yellow': 450, 'not_yellow': 50},
    'orange': {'long': 0, 'not_long': 300, 'sweet': 150, 'not_sweet': 150, 'yellow': 300, 'not_yellow': 0},
    'other_fruit': {'long': 100, 'not_long': 100, 'sweet': 150, 'not_sweet': 50, 'yellow': 50, 'not_yellow': 150},
}


def count_total(data):
    '''Count each kind of fruit.
    return {'banala': 500, ...} and the overall total'''
    count = {}
    total = 0
    for fruit in data:
        # A fruit is either sweet or not sweet, so these two counts add up to its total
        count[fruit] = data[fruit]['sweet'] + data[fruit]['not_sweet']
        total += count[fruit]
    return count, total  # the callers below unpack both values


def cal_base_rates(data):
    '''Compute the prior probability of each kind of fruit.
    return {'banala': 0.5, ...}'''
    categories, total = count_total(data)
    cal_base_rates = {}
    for label in categories:
        priori_prob = categories[label] / total
        cal_base_rates[label] = priori_prob
    return cal_base_rates


def likelihold_prob(data):
    '''Compute the probability of each feature value given the fruit (likelihood probabilities).
    {'banala': {'long': 0.8, ...}, ...}'''
    count, _ = count_total(data)
    likelihold = {}
    for fruit in data:
        # Temporary dictionary holding the probability of each feature value
        attr_prob = {}
        for attr in data[fruit]:
            # Probability of the feature value given the fruit
            attr_prob[attr] = data[fruit][attr] / count[fruit]
        likelihold[fruit] = attr_prob
    return likelihold


def evidence_prob(data):
    '''Compute the overall probability of each feature (the evidence).
    return {'long': 0.5, ...}'''
    # All features of the fruit
    attrs = list(data['banala'].keys())
    count, total = count_total(data)
    evidence_prob = {}
    # Probability of each feature across all fruit
    for attr in attrs:
        attr_total = 0
        for fruit in data:
            attr_total += data[fruit][attr]
        evidence_prob[attr] = attr_total / total
    return evidence_prob


Evidence_prob = evidence_prob(datasets)
print(Evidence_prob)


class navie_bayes_classifier:
    '''Naive Bayes classifier; __init__ is called when the class is instantiated'''

    def __init__(self, data=datasets):
        self._data = data  # use the argument rather than the global datasets
        self._labels = [key for key in self._data.keys()]
        self._priori_prob = cal_base_rates(self._data)
        self._likelihold_prob = likelihold_prob(self._data)
        self._evidence_prob = evidence_prob(self._data)

    # The method below can use the attributes defined above
    def get_label(self, length, sweetness, color):
        '''Get the class for one set of feature values'''
        # (the listing breaks off here in the source)
```
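The overview describes get_label as dividing each fruit's feature probability by the overall probability of that feature and multiplying by the fruit's share, but the listing stops before the method body. The following is only a minimal sketch of that description, not the author's code: the helper name `posterior_scores` and the constant `DATA` (a copy of the counts in `datasets`) are hypothetical and introduced here for illustration.

```python
# Hypothetical sketch of the scoring rule described in the overview.
# DATA repeats the fruit counts from the article's `datasets` so that the
# block runs on its own; `posterior_scores` is an illustrative name.
DATA = {
    'banala': {'long': 400, 'not_long': 100, 'sweet': 350, 'not_sweet': 150, 'yellow': 450, 'not_yellow': 50},
    'orange': {'long': 0, 'not_long': 300, 'sweet': 150, 'not_sweet': 150, 'yellow': 300, 'not_yellow': 0},
    'other_fruit': {'long': 100, 'not_long': 100, 'sweet': 150, 'not_sweet': 50, 'yellow': 50, 'not_yellow': 150},
}


def posterior_scores(data, features):
    """Return {fruit: prior * product of (likelihood / evidence)} for the given feature names."""
    # Number of fruit of each kind and overall
    counts = {fruit: attrs['sweet'] + attrs['not_sweet'] for fruit, attrs in data.items()}
    total = sum(counts.values())
    scores = {}
    for fruit, attrs in data.items():
        score = counts[fruit] / total  # prior: share of this fruit
        for feature in features:
            likelihood = attrs[feature] / counts[fruit]                   # P(feature | fruit)
            evidence = sum(d[feature] for d in data.values()) / total     # P(feature)
            score *= likelihood / evidence
        scores[fruit] = score
    return scores


scores = posterior_scores(DATA, ['long', 'sweet', 'yellow'])
best = max(scores, key=scores.get)  # label with the highest score
print(best, scores)
```

Because the evidence is taken as a product of per-feature marginals, the scores do not have to sum to 1; only their ranking is used to pick a label. A feature whose overall count is zero would make the evidence zero, which this sketch does not guard against.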