
Attachment 1: Translated text of the foreign literature

Improved speech recognition method for intelligent robots

2. Overview of speech recognition
Recently, speech recognition has received more and more attention because of its important theoretical significance and practical value. Up to now, most speech recognition has been based on conventional linear system theory, such as the Hidden Markov Model (HMM) and Dynamic Time Warping (DTW). With deeper study of speech recognition it has been found that the speech signal is a complex nonlinear process, and if research on speech recognition is to achieve a breakthrough, nonlinear system theory must be introduced. Recently, with the development of nonlinear system theories such as artificial neural networks (ANN), chaos and fractals, it has become possible to apply these theories to speech recognition. The study in this paper therefore treats the speech recognition process on the basis of neural networks together with chaos and fractal theory. Speech recognition can be divided into two modes, speaker independent and speaker dependent.

Speaker dependent means that the pronunciation model is trained by a single person; the system recognizes the trainer's commands quickly, but recognizes other people's commands slowly or not at all. Speaker independent means that the pronunciation model is trained by people of different ages, sexes and regions, so it can recognize the commands of a whole group of people. In general, speaker-independent systems are more widely used, since the user does not need to carry out any training. In speaker-independent systems, therefore, extracting speech features from the speech signal is a fundamental problem of the speech recognition system. Speech recognition, which comprises training and recognition, can be regarded as a pattern recognition task. Generally, the speech signal can be viewed as a time sequence characterized by the powerful hidden Markov model. Through feature extraction, the speech signal is transformed into feature vectors, which serve as observations.

In the training procedure, these observations feed the estimation of the HMM model parameters. These parameters include the probability density functions of the observations and their corresponding states, the transition probabilities between states, and so on. After parameter estimation, the trained model can be applied to the recognition task: the input signal is recognized as the resulting words, and the accuracy can be evaluated. The whole process is illustrated in Fig. 1.

Fig. 1 Block diagram of the speech recognition system

3. Theory and method
Extracting speaker-independent features from the speech signal is a fundamental problem of speaker recognition systems. The most popular methods for solving this problem use Linear Predictive Cepstral Coefficients (LPCC) and Mel-Frequency Cepstral Coefficients (MFCC). Both are linear procedures based on the assumption that the speaker's speech characteristics are caused by vocal tract resonances; these signal features form the basic spectral structure of the speech signal.
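The evaluation step described above, scoring an observation sequence against a trained HMM, can be sketched with the standard scaled forward algorithm. This is a generic illustration rather than code from the paper; the model sizes and probabilities used below are made up.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | HMM) for a discrete-output HMM.

    obs : list of observation symbol indices
    pi  : (N,) initial state distribution
    A   : (N, N) transition matrix, A[i, j] = P(state j | state i)
    B   : (N, M) emission matrix, B[i, m] = P(symbol m | state i)
    """
    alpha = pi * B[:, obs[0]]        # joint prob. of first observation and each state
    log_lik = 0.0
    for t in range(1, len(obs) + 1):
        c = alpha.sum()              # scaling factor, accumulated in the log domain
        log_lik += np.log(c)
        alpha = alpha / c
        if t < len(obs):
            alpha = (alpha @ A) * B[:, obs[t]]   # propagate one step, weight by emission
    return log_lik
```

In a recognizer, one such model is trained per command word and the word whose model gives the highest likelihood wins.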

However, the nonlinear information in speech signals is not easily extracted by current feature extraction methods, so we use the fractal dimension to measure nonlinear speech turbulence. This paper investigates and implements a speech recognition system using both traditional LPCC and nonlinear multiscale fractal dimension feature extraction.

3.1 Linear Predictive Cepstral Coefficients
The linear prediction coefficients (LPC) are the parameters obtained from linear prediction analysis of speech; they describe the correlation between adjacent speech samples. Linear prediction analysis is built on the following idea: a speech sample can be approximated by a linear combination of several past samples, and by applying the least-squares principle to the difference between the real speech samples within a (short-time) analysis frame and the predicted samples, a unique set of prediction coefficients is determined. LPC can be used to estimate the cepstrum of the speech signal, a special processing method in short-time cepstral analysis of speech.

The system function of the channel model is obtained by linear prediction analysis as follows:

    H(z) = 1 / (1 - Σ_{k=1}^{p} a_k z^{-k})    (1)

where p is the linear prediction order and a_k (k = 1, 2, ..., p) are the prediction coefficients. Denote the impulse response by h(n) and the cepstrum of h(n) by ĥ(n); then (1) can be expanded as

    Ĥ(z) = ln H(z) = Σ_{n=1}^{∞} ĥ(n) z^{-n}    (2)

Substituting (1) into (2) and differentiating both sides with respect to z^{-1} gives

    (1 - Σ_{k=1}^{p} a_k z^{-k}) Σ_{n=1}^{∞} n ĥ(n) z^{-(n-1)} = Σ_{k=1}^{p} k a_k z^{-(k-1)}    (3)

Equating the coefficients of equal powers of z^{-1} on both sides yields

    ĥ(1) = a_1    (4)

so that ĥ(n) can be obtained recursively:

    ĥ(n) = a_n + Σ_{k=1}^{n-1} (k/n) ĥ(k) a_{n-k},  1 < n ≤ p    (5)

The cepstral coefficients computed by (5) are called LPCC, where n denotes the LPCC order.
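Recursion (5) is straightforward to implement once the prediction coefficients a_k are available; a common way to obtain them is the autocorrelation method with the Levinson-Durbin recursion. The sketch below is our illustration of that standard pipeline, not code from the paper.

```python
import numpy as np

def lpc(signal, p):
    """Estimate p linear prediction coefficients a_k using the
    autocorrelation method and the Levinson-Durbin recursion."""
    n = len(signal)
    r = np.array([signal[:n - k] @ signal[k:] for k in range(p + 1)])
    a = np.zeros(p + 1)                      # a[0] unused; a[k] holds a_k
    e = r[0]                                 # prediction error energy
    for i in range(1, p + 1):
        k = (r[i] - a[1:i] @ r[i - 1:0:-1]) / e   # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a, e = a_new, e * (1 - k * k)
    return a[1:]                             # a_1 .. a_p

def lpcc(a, n_ceps):
    """Convert LPC to cepstral coefficients with recursion (5):
    c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}."""
    p = len(a)
    c = np.zeros(n_ceps + 1)                 # c[0] unused
    for n in range(1, n_ceps + 1):
        c[n] = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if 0 < n - k <= p:
                c[n] += (k / n) * c[k] * a[n - k - 1]
    return c[1:]
```

For a decaying exponential `0.9 ** n`, `lpc(x, 1)` recovers a coefficient close to 0.9, which is a convenient sanity check.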

Before extracting the LPCC parameters, the speech signal should undergo pre-emphasis, framing, windowing and endpoint detection. The endpoint detection of the Chinese command word "前進(jìn)" ("Forward") is shown in Fig. 2, and its speech waveform and LPCC parameter waveform after endpoint detection are shown in Fig. 3.

Fig. 2 Endpoint detection of the Chinese command word "Forward"

Fig. 3 Speech waveform and LPCC parameter waveform of the Chinese command word "Forward" after endpoint detection

3.2 Calculation of the speech fractal dimension
The fractal dimension is a quantity related to the scale and number of a fractal, and it is also a measure of structural self-similarity.
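The pre-processing chain mentioned above (pre-emphasis, framing and endpoint detection) can be sketched with a simple short-time-energy detector. The frame length and threshold ratio below are illustrative choices, not values given in the paper.

```python
import numpy as np

def preemphasis(x, alpha=0.95):
    """y[n] = x[n] - alpha * x[n-1], boosting high frequencies."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def endpoint_detect(x, frame_len=256, threshold_ratio=0.1):
    """Return (start, end) sample indices of the speech segment,
    using short-time energy against a relative threshold."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)           # energy per frame
    threshold = threshold_ratio * energy.max()
    active = np.where(energy > threshold)[0]     # frames above threshold
    if len(active) == 0:
        return 0, 0
    return active[0] * frame_len, (active[-1] + 1) * frame_len
```

A real detector usually combines energy with the zero-crossing rate; energy alone is kept here for brevity.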

Fractal dimension measurement is treated in [6-7]. From the point of view of measurement, the fractal dimension extends from integers to fractions, breaking the restriction to integer dimensions in general topology; fractional dimensions are largely an extension of Euclidean geometric dimensions. There are many definitions of fractal dimension, such as the similarity dimension, Hausdorff dimension, information dimension, correlation dimension, capacity dimension and box-counting dimension. Among them the Hausdorff dimension is the oldest and most important; it is defined as in [3]:

    D = lim_{δ→0} ln M_δ(F) / ln(1/δ)    (6)

where M_δ(F) is the number of units of size δ needed to cover the subset F. In practice the dimension is estimated as

    D ≈ ln N(ε) / ln(1/ε)    (7)

After endpoint detection, the speech waveform and fractal dimension waveform of the Chinese command word "Forward" are shown in Fig. 4.
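A practical estimate in the spirit of (6)-(7) is the box-counting dimension: cover the normalized waveform with grids of shrinking box size ε and fit the slope of ln N(ε) against ln(1/ε). The set of scales below is an illustrative assumption; the paper does not specify one.

```python
import numpy as np

def box_counting_dimension(signal, scales=(2, 4, 8, 16, 32)):
    """Estimate the box-counting dimension of a 1-D waveform by fitting
    the slope of ln N(eps) against ln(1/eps), cf. equations (6)-(7)."""
    x = np.asarray(signal, dtype=float)
    x = (x - x.min()) / (x.max() - x.min() + 1e-12)   # normalise amplitude to [0, 1]
    t = np.linspace(0.0, 1.0, len(x))                 # normalised time axis
    log_n, log_inv_eps = [], []
    for s in scales:                                  # s-by-s grid, eps = 1/s
        boxes = {(min(int(ti * s), s - 1), min(int(xi * s), s - 1))
                 for ti, xi in zip(t, x)}             # boxes touched by the curve
        log_n.append(np.log(len(boxes)))
        log_inv_eps.append(np.log(s))
    slope, _ = np.polyfit(log_inv_eps, log_n, 1)
    return float(slope)
```

A smooth ramp should come out with dimension close to 1, while a rougher, noise-like waveform drifts toward 2.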

Fig. 4 Speech waveform and fractal dimension waveform of the Chinese command word "Forward" after endpoint detection

3.3 Improved feature extraction method
Considering the respective advantages of LPCC and the fractal dimension in representing the speech signal, we mix the two in the feature extraction: the fractal dimension characterizes the self-similarity, periodicity and randomness of the speech waveform in time, while the LPCC features provide good speech quality and a high recognition rate. Because of the obvious advantages of artificial neural networks, namely nonlinearity, self-adaptability and a strong self-learning ability, their good classification and input-output mapping capabilities make them well suited to solving the speech recognition problem.
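A minimal feed-forward network of the kind alluded to here maps the feature vector to class probabilities. Only the 68-D input matches the text; the hidden width (32) and the number of command words (10) are our assumptions for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, w1, b1, w2, b2):
    """One hidden layer: tanh units, softmax output, mapping a 68-D
    feature vector to command-word class probabilities."""
    h = np.tanh(x @ w1 + b1)
    logits = h @ w2 + b2
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return e / e.sum()

# illustrative sizes: 68-D input, 32 hidden units, 10 command words
w1 = rng.standard_normal((68, 32)) * 0.1
b1 = np.zeros(32)
w2 = rng.standard_normal((32, 10)) * 0.1
b2 = np.zeros(10)
probs = mlp_forward(rng.standard_normal(68), w1, b1, w2, b2)
```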

Since the number of input nodes of an artificial neural network is fixed, the feature parameters are time-normalized before being fed into the network [9]. In our experiments, the LPCC and the fractal dimension of each sample are passed separately through the time-normalization network: the LPCC is 4 frames of data (LPCC1, LPCC2, LPCC3, LPCC4, each frame parameter being 14-dimensional), and the fractal dimension is normalized to 12 frames of data (FD1, FD2, ..., FD12, each frame parameter being 1-dimensional), so that the feature vector of each sample has 4*14 + 12*1 = 68 dimensions, ordered so that the first 56 dimensions are LPCC and the remaining 12 are fractal dimensions. Such a feature vector can therefore represent both the linear and the nonlinear characteristics of the speech signal.

Architectures and Features of ASR
Automatic speech recognition (ASR) is a cutting-edge technology that allows a computer, or even a hand-held PDA (Myers, 2000), to identify words that are read aloud or spoken into any sound-recording device.
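The 68-D feature layout described above can be assembled as a straightforward concatenation; `build_feature_vector` is our illustrative helper, not a routine from the paper.

```python
import numpy as np

def build_feature_vector(lpcc_frames, fd_frames):
    """Concatenate 4 LPCC frames (14-D each) with 12 fractal-dimension
    values into the 68-D vector described in the text: the first 56
    dimensions are LPCC, the last 12 are fractal dimensions."""
    lpcc_frames = np.asarray(lpcc_frames)
    fd_frames = np.asarray(fd_frames)
    if lpcc_frames.shape != (4, 14) or fd_frames.shape != (12,):
        raise ValueError("expected 4x14 LPCC frames and 12 fractal dimensions")
    return np.concatenate([lpcc_frames.ravel(), fd_frames])
```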

The ultimate purpose of ASR technology is to achieve 100% accuracy for all words intelligibly spoken by any person, regardless of vocabulary size, background noise or speaker variables (CSLU, 2002). However, most ASR engineers admit that the current accuracy level for a large vocabulary unit of speech remains below 90%. Dragon's Naturally Speaking and IBM's ViaVoice, for example, show a baseline recognition accuracy of only 60% to 80%, depending on accent, background noise and speaking style (Ehsani & Knodt, 1998). More expensive systems reported to outperform these two are Subarashii (Bernstein, et al., 1999), EduSpeak (Franco, et al., 2001), Phonepass (Hinks, 2001), the ISLE Project (Menzel, et al., 2001) and RAD (CSLU, 2003).

The accuracy of speech recognition is expected to improve. Among the several speech recognition approaches used in ASR products, the Hidden Markov Model (HMM) is regarded as the dominant algorithm and has proven the most effective for processing large vocabularies of speech (Ehsani & Knodt, 1998). A detailed explanation of how HMMs work is beyond the scope of this paper, but can be found in any text on language processing; among the best are Jurafsky & Martin (2000) and Hosom, Cole, and Fanty (2003).

In short, an HMM computes the likelihood of a match between the input signal it receives and the phonemes contained in a database of hundreds of native-speaker recordings (Hinks, 2003, p. 5). That is, a speech recognizer based on HMMs computes, on the basis of probability theory, how close the phonemes of a spoken input come to the corresponding model. High likelihood indicates good pronunciation, low likelihood poor pronunciation (Larocca, et al., 1991). Although speech recognition has commonly been used for purposes such as business dictation and special-needs accessibility, its market share in language learning has grown sharply in recent years (Aist, 1999; Eskenazi, 1999; Hinks, 2003). Early software programs based on ASR adopted template-based recognition systems that perform pattern matching using dynamic programming or other time-normalization techniques (Dalby & Kewley-Port, 1999).

These programs include Talk to Me (Auralog, 1995), the Tell Me More series (Auralog, 2000), Triple-Play Plus (Mackey & Choi, 1998), New Dynamic English (DynEd, 1997), English Discoveries (Edusoft, 1998), and See It, Hear It, SAY IT! (CPI, 1997). Most of these programs provide the user with no feedback on pronunciation accuracy beyond a simple indication of the closest pattern match to the written dialogue choice made by the user.

Learners are not told how accurate their pronunciation is. In particular, Neri et al. (2002) criticized the waveform displays in products such as Talk to Me and Tell Me More, because they impress prospective buyers without providing meaningful feedback to the user. The 2002 version of Talk to Me already incorporates more of the features that Hinks (2003) considers genuinely useful to learners: a visual signal lets learners compare their intonation with that of a model speaker; the accuracy of the learner's pronunciation is scored on a 7-point scale (the higher the better); and seriously mispronounced words are identified and clearly flagged.

Attachment 2: Original text (photocopy)

Improved speech recognition method for intelligent robots

2. Overview of speech recognition
Speech recognition has received more and more attention recently due to its important theoretical meaning and practical value [5]. Up to now, most speech recognition has been based on conventional linear system theory, such as the Hidden Markov Model (HMM) and Dynamic Time Warping (DTW). With the deep study of speech recognition, it has been found that the speech signal is a complex nonlinear process. If the study of speech recognition is to break through, nonlinear system theory must be introduced into it. Recently, with the development of nonlinear system theories such as artificial neural networks (ANN), chaos and fractals, it has become possible to apply these theories to speech recognition. Therefore, the study of this paper is based on ANN, and chaos and fractal theories are introduced to process speech recognition.

Speech recognition is divided into two modes: speaker dependent and speaker independent. Speaker dependent refers to a pronunciation model trained by a single person; the identification rate of the training person's orders is high, while others' orders have a low identification rate or cannot be recognized. Speaker independent refers to a pronunciation model trained by persons of different age, sex and region; it can identify the orders of a group of persons. Generally, the speaker-independent system is more widely used, since the user is not required to conduct the training. So extraction of speaker-independent features from the speech signal is the fundamental problem of the speaker recognition system.

Speech recognition can be viewed as a pattern recognition task, which includes training and recognition. Generally, the speech signal can be viewed as a time sequence and characterized by the powerful hidden Markov model (HMM). Through feature extraction, the speech signal is transferred into feature vectors which act as observations. In the training procedure, these observations will feed the estimation of the model parameters of the HMM. These parameters include the probability density functions for the observations and their corresponding states, the transition probabilities between the states, etc. After the parameter estimation, the trained models can be used for the recognition task. The input observations will be recognized as the resulting words and the accuracy can be evaluated. The whole process is illustrated in Fig. 1.

Fig. 1 Block diagram of the speech recognition system

3 Theory and method
Extraction of speaker-independent features from the speech signal is the fundamental problem of the speaker recognition system. The standard methodology for solving this problem uses Linear Predictive Cepstral Coefficients (LPCC) and Mel-Frequency Cepstral Coefficients (MFCC). Both these methods are linear procedures based on the assumption that speaker features have properties caused by the vocal tract resonances. These features form the basic spectral structure of the speech signal. However, the nonlinear information in speech signals is not easily extracted by the present feature extraction methodologies, so we use the fractal dimension to measure nonlinear speech turbulence.

This paper investigates and implements a speaker identification system using both traditional LPCC and nonlinear multiscaled fractal dimension feature extraction.

3.1 Linear Predictive Cepstral Coefficients
The linear prediction coefficients (LPC) are a parameter set obtained from linear prediction analysis of speech; they capture correlation characteristics between adjacent speech samples. Linear prediction analysis is based on the following basic concept: a speech sample can be estimated approximately by a linear combination of some past speech samples. According to the principle of minimizing the square sum of the differences between the real speech samples in a certain (short-time) analysis frame and the predicted samples, a unique group of prediction coefficients can be determined.

The LPC coefficients can be used to estimate the speech signal cepstrum; this is a special processing method in the analysis of the speech signal's short-time cepstrum. The system function of the channel model is obtained by linear prediction analysis as follows:

    H(z) = 1 / (1 - Σ_{k=1}^{p} a_k z^{-k})    (1)

where p represents the linear prediction order and a_k (k = 1, 2, ..., p) the prediction coefficients. The impulse response is represented by h(n). Suppose the cepstrum of h(n) is represented by ĥ(n); then (1) can be expanded as (2):

    Ĥ(z) = ln H(z) = Σ_{n=1}^{∞} ĥ(n) z^{-n}    (2)

Introducing (1) into (2) and taking the derivative with respect to z^{-1} on both sides, (2) is changed into (3):

    (1 - Σ_{k=1}^{p} a_k z^{-k}) Σ_{n=1}^{∞} n ĥ(n) z^{-(n-1)} = Σ_{k=1}^{p} k a_k z^{-(k-1)}    (3)

Setting the coefficients of equal powers equal on both sides, equation (4) is obtained:

    ĥ(1) = a_1    (4)

and thus ĥ(n) can be obtained recursively:

    ĥ(n) = a_n + Σ_{k=1}^{n-1} (k/n) ĥ(k) a_{n-k},  1 < n ≤ p    (5)

The cepstral coefficients calculated by (5) are called LPCC, where n represents the LPCC order.

The speech waveform of the Chinese command word "Forward" and the fractal dimension waveform after endpoint detection are shown in Fig. 4.

3.3 Improved feature extraction method
Considering the respective advantages of LPCC and the fractal dimension in expressing the speech signal, we mix both into the feature signal; that is, the fractal dimension denotes the self-similarity, periodicity and randomness of the speech time wave shape, while the LPCC feature is good for speech quality and a high identification rate.

Fig. 4 Speech waveform of the Chinese command word "Forward" and fractal dimension waveform after endpoint detection

Due to ANN's obvious advantages of nonlinearity, self-adaptability, robustness and self-learning, its good classification and input-output mapping abilities are suitable for resolving the speech recognition problem.

Since the number of ANN input nodes is fixed, time regularization is carried out on the feature parameters before they are input to the neural network [9]. In our experiments, the LPCC and the fractal dimension of each sample need to pass through the time-regularization network separately. The LPCC is 4-frame data (LPCC1, LPCC2, LPCC3, LPCC4, each frame parameter being 14-D), and the fractal dimension is regularized to be 12-frame data (FD1, FD2, ..., FD12, each frame parameter being 1-D), so that the feature vector of each sample has 4*14 + 1*12 = 68 dimensions; the order is: the first 56 dimensions are LPCC, the remaining 12 dimensions are fractal dimensions. Thus, such a mixed feature parameter can show the speech signal's linear and nonlinear characteristics as well.

Architectures and Features of ASR
ASR is a cutting-edge technology that allows a computer or even a hand-held PDA (Myers, 2000) to identify words that are read aloud or spoken into any sound-recording device. The ultimate purpose of ASR technology is to allow 100% accuracy with all words that are intelligibly spoken by any person regardless of vocabulary size, background noise, or speaker variables (CSLU, 2002). However, most ASR engineers admit that the current accuracy level for a large vocabulary unit of speech (e.g., the sentence) remains less than 90%. Dragon's Naturally Speaking or IBM's ViaVoice, for example, show a baseline recognition accuracy of only 60% to 80%, depending upon accent, background noise, type of utterance, etc. (Ehsani & Knodt, 1998). More expensive systems that are reported to outperform these two are Subarashii (Bernstein, et al., 1999), EduSpeak (Franco, et al., 2001), Phonepass (Hinks, 2001), ISLE Project (Menzel, et al., 2001) and RAD (CSLU, 2003). ASR accuracy is expected to improve.

Among several types of speech recognizers used in ASR products, both implemented and proposed, the Hidden Markov Model (HMM) is one of the most dominant algorithms and has proven to be an effective method of dealing with large units of speech (Ehsani & Knodt, 1998). Detailed descriptions of how the HMM model works go beyond the scope of this paper and can be found in any text concerned with language processing; among the best are Jurafsky & Martin (2000) and Hosom, Cole, and Fanty (2003). Put simply, HMM computes the probable match between the input it receives and phonemes contained in a database of hundreds of native speaker recordings (Hinks, 2003, p. 5). That is, a speech recognizer based on HMM computes how close the phonemes of a spoken input are to a corresponding model, based on probability theory. High likelihood represents good pronunciation; low likelihood represents poor pronunciation (Larocca, et al., 1991).

While ASR has been commonly used for such purposes as business dictation and special needs accessibility, its market share in language learning has increased dramatically in recent years (Aist, 1999; Eskenazi, 1999; Hinks, 2003).
