Online and Linear-Time Attention by Enforcing Monotonic Alignments

Colin Raffel 1, Minh-Thang Luong 1, Peter J. Liu 1, Ron J. Weiss 1, Douglas Eck 1

1Google Brain, Mountain View, California, USA. Correspondence to: Colin Raffel. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s). arXiv:1704.00784v2 [cs.LG], 29 Jun 2017.

Abstract

Recurrent neural network models with an attention mechanism have proven to be extremely effective on a wide variety of sequence-to-sequence problems. However, the fact that soft attention mechanisms perform a pass over the entire input sequence when producing each element in the output sequence precludes their use in online settings and results in a quadratic time complexity. Based on the insight that the alignment between input and output sequence elements is monotonic in many problems of interest, we propose an end-to-end differentiable method for learning monotonic alignments which, at test time, enables computing attention online and in linear time. We validate our approach on sentence summarization, machine translation, and online speech recognition problems and achieve results competitive with existing sequence-to-sequence models.

1. Introduction

Recently, the "sequence-to-sequence" framework (Sutskever et al., 2014; Cho et al., 2014) has facilitated the use of recurrent neural networks (RNNs) on sequence transduction problems such as machine translation and speech recognition. In this framework, an input sequence is processed with an RNN to produce an "encoding"; this encoding is then used by a second RNN to produce the target sequence. As originally proposed, the encoding is a single fixed-length vector representation of the input sequence. This requires the model to effectively compress all important information about the input sequence into a single vector. In practice, this often results in the model having difficulty generalizing to longer sequences than those seen during training (Bahdanau et al., 2015).

An effective solution to these shortcomings is attention mechanisms (Bahdanau et al., 2015). In a sequence-to-sequence model with attention, the encoder produces a sequence of hidden states (instead of a single fixed-length vector) which correspond to entries in the input sequence. The decoder is then allowed to refer back to any of the encoder states as it produces its output. Similar mechanisms have been used as soft addressing schemes in memory-augmented neural network architectures (Graves et al., 2014; Sukhbaatar et al., 2015) and RNNs used for sequence generation (Graves, 2013). Attention-based sequence-to-sequence models have proven to be extremely effective on a wide variety of problems, including machine translation (Bahdanau et al., 2015; Luong et al., 2015), image captioning (Xu et al., 2015), speech recognition (Chorowski et al., 2015; Chan et al., 2016), and sentence summarization (Rush et al., 2015). In addition, attention creates an implicit soft alignment between entries in the output sequence and entries in the input sequence, which can give useful insight into the model's behavior.

A common criticism of soft attention is that the model must perform a pass over the entire input sequence when producing each element of the output sequence. This results in the decoding process having complexity O(TU), where T and U are the input and output sequence lengths respectively. Furthermore, because the entire sequence must be processed prior to outputting any symbols, soft attention cannot be used in "online" settings where output sequence elements are produced when the input has only been partially observed.

The focus of this paper is to propose an alternative attention mechanism which has linear-time complexity and can be used in online settings. To achieve this, we first note that in many problems, the input-output alignment is roughly monotonic. For example, when transcribing an audio recording of someone saying "good morning", the region of the speech utterance corresponding to "good" will always precede the region corresponding to "morning". Even when the alignment is not strictly monotonic, it often only contains local input-output reorderings. Separately, despite the fact that soft attention allows for assignment of focus to multiple disparate entries of the input sequence, in many cases the attention is assigned mostly to a single entry. For examples of alignments with these characteristics, we refer to e.g. (Chorowski et al. 2015, Figure 2; Chan et al. 2016, Figure 2; Rush et al. 2015, Figure 1; Bahdanau et al. 2015, Figure 3), etc. Of course, this is not true in all problems; for example, when using soft attention for image captioning, the model will often change focus arbitrarily between output steps and will spread attention across large regions of the input image (Xu et al., 2015).

Motivated by these observations, we propose using hard monotonic alignments for sequence-to-sequence problems because, as we argue in section 2.2, they enable computing attention online and in linear time. Towards this end, we show that it is possible to train such an attention mechanism with a quadratic-time algorithm which computes its expected output. This allows us to continue using standard backpropagation for training while still facilitating efficient online decoding at test time. On all problems we studied, we found these added benefits only incur a small decrease in performance compared to softmax-based attention.

The rest of this paper is structured as follows: In the following section, we develop an interpretation of soft attention as optimizing a stochastic process in expectation and formulate a corresponding stochastic process which allows for online and linear-time decoding by relying on hard monotonic alignments. In analogy with soft attention, we then show how to compute the expected output of the monotonic attention process and elucidate how the resulting algorithm differs from standard softmax attention. After giving an overview of related work, we apply our approach to the tasks of sentence summarization, machine translation, and online speech recognition, achieving results competitive with existing sequence-to-sequence models. Finally, we present additional derivations, experimental details, and ideas for future research in the appendix.

2. Online and Linear-Time Attention

To motivate our approach, we first point out that softmax-based attention is computing the expected output of a simple stochastic process. We then detail an alternative process which enables online and linear-time decoding. Because this process is nondifferentiable, we derive an algorithm for computing its expected output, allowing us to train a model with standard backpropagation while applying our online and linear-time process at test time. Finally, we propose an alternative energy function motivated by the differences between monotonic attention and softmax-based attention.

2.1. Soft Attention

To begin with, we review the commonly-used form of soft attention proposed originally in (Bahdanau et al., 2015). Broadly, a sequence-to-sequence model produces a sequence of outputs based on a processed input sequence. The model consists of two RNNs, referred to as the "encoder" and "decoder". The encoder RNN processes the input sequence x = x_1, ..., x_T to produce a sequence of hidden states h = h_1, ..., h_T. We refer to h as the "memory" to emphasize its connection to memory-augmented neural networks (Graves et al., 2014; Sukhbaatar et al., 2015). The decoder RNN then produces an output sequence y = y_1, ..., y_U, conditioned on the memory, until a special end-of-sequence token is produced.

When computing y_i, a soft attention-based decoder uses a learnable nonlinear function a(·) to produce a scalar value e_{i,j} for each entry h_j in the memory based on h_j and the decoder's state at the previous timestep s_{i-1}. Typically, a(·) is a single-layer neural network using a tanh nonlinearity, but other functions such as a simple dot product between s_{i-1} and h_j have been used (Luong et al., 2015; Graves et al., 2014). These scalar values are normalized using the softmax function to produce a probability distribution over the memory, which is used to compute a context vector c_i as the weighted sum of h. Because items in the memory have a sequential correspondence with items in the input, these attention distributions create a soft alignment between the output and input. Finally, the decoder updates its state to s_i based on s_{i-1} and c_i and produces y_i. In total, producing y_i involves

e_{i,j} = a(s_{i-1}, h_j)    (1)
\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{T} \exp(e_{i,k})}    (2)
c_i = \sum_{j=1}^{T} \alpha_{i,j} h_j    (3)
s_i = f(s_{i-1}, y_{i-1}, c_i)    (4)
y_i = g(s_i, c_i)    (5)

where f(·) is a recurrent neural network (typically one or more LSTM (Hochreiter & Schmidhuber, 1997) or GRU (Chung et al., 2014) layers) and g(·) is a learnable nonlinear function which maps the decoder state to the output space (e.g. an affine transformation followed by a softmax when the target sequences consist of discrete symbols).

To motivate our monotonic alignment scheme, we observe that eqs. (2) and (3) are computing the expected output of a simple stochastic process, which can be formulated as follows: First, a probability α_{i,j} is computed independently for each entry h_j of the memory. Then, a memory index k is sampled by k ~ Categorical(α_i) and c_i is set to h_k. We visualize this process in fig. 1. Clearly, eq. (3) shows that soft attention replaces sampling k and assigning c_i = h_k with direct computation of the expected value of c_i.

Figure 1. Schematic of the stochastic process underlying softmax-based attention decoders. Each node represents a possible alignment between an entry of the output sequence (vertical axis) and the memory (horizontal axis). At each output timestep, the decoder inspects all memory entries (indicated in gray) and attends to a single one (indicated in black). A black node indicates that memory element h_j is aligned to output y_i. In terms of which memory entry is chosen, there is no dependence across output timesteps or between memory entries.

Figure 2. Schematic of our novel monotonic stochastic decoding process. At each output timestep, the decoder inspects memory entries (indicated in gray) from left-to-right starting from where it left off at the previous output timestep and chooses a single one (indicated in black). A black node indicates that memory element h_j is aligned to output y_i. White nodes indicate that a particular input-output alignment was not considered because it violates monotonicity. Arrows indicate the order of processing and dependence between memory entries and output timesteps.
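To make eqs. (1)-(3) and the stochastic-process view above concrete, here is a NumPy sketch of one decoder timestep: it computes the energies, the attention distribution, and the context vector both as the expectation of eq. (3) and (in a comment) as a categorical sample. Function and argument names and the shapes are illustrative assumptions, not the authors' implementation; the decoder update and output of eqs. (4)-(5) are omitted.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a vector of energies.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def soft_attention_step(s_prev, h, W, V, b, v):
    """One decoder timestep of softmax attention, following eqs. (1)-(3).

    s_prev: previous decoder state s_{i-1}, shape (d_s,)
    h:      encoder memory h_1..h_T, shape (T, d_h)
    W, V, b, v: parameters of the single-layer tanh energy function a(.)
    """
    e = np.tanh(s_prev @ W + h @ V + b) @ v   # eq. (1): one energy per memory entry
    alpha = softmax(e)                        # eq. (2): distribution over the memory
    c = alpha @ h                             # eq. (3): expectation of the process below
    # The sampling view of the same process: k ~ Categorical(alpha), c = h[k].
    # k = np.random.choice(len(alpha), p=alpha); c = h[k]
    return alpha, c
```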

2.2. A Hard Monotonic Attention Process

The discussion above makes clear that softmax-based attention requires a pass over the entire memory to compute the terms α_{i,j} required to produce each element of the output sequence. This precludes its use in online settings, and results in a complexity of O(TU) for generating the output sequence. In addition, despite the fact that h represents a transformation of a sequence (which ostensibly exhibits dependencies between subsequent elements), the attention probabilities are computed independent of temporal order and the attention distribution at the previous timestep.

We address these shortcomings by first formulating a stochastic process which explicitly processes the memory in a left-to-right manner. Specifically, for output timestep i we begin processing memory entries from index t_{i-1}, where t_i is the index of the memory entry chosen at output timestep i (for convenience, letting t_0 = 1). We sequentially compute, for j = t_{i-1}, t_{i-1} + 1, t_{i-1} + 2, ...

e_{i,j} = a(s_{i-1}, h_j)    (6)
p_{i,j} = \sigma(e_{i,j})    (7)
z_{i,j} \sim \mathrm{Bernoulli}(p_{i,j})    (8)

where a(·) is a learnable deterministic "energy function" and σ(·) is the logistic sigmoid function. As soon as we sample z_{i,j} = 1 for some j, we stop and set c_i = h_j and t_i = j, "choosing" memory entry j for the context vector. Each z_{i,j} can be seen as representing a discrete choice of whether to ingest a new item from the memory (z_{i,j} = 0) or produce an output (z_{i,j} = 1). For all subsequent output timesteps, we repeat this process, always starting from t_{i-1} (the memory index chosen at the previous timestep). If for any output timestep i we have z_{i,j} = 0 for j ∈ {t_{i-1}, ..., T}, we simply set c_i to a vector of zeros. This process is visualized in fig. 2 and is presented more explicitly in algorithm 1 (appendix A).

Note that by construction, in order to compute p_{i,j}, we only need to have computed h_k for k ∈ {1, ..., j}. It follows that our novel process can be computed in an online manner; i.e. we do not need to wait to observe the entire input sequence before we start producing the output sequence. Furthermore, because we start inspecting memory elements from where we left off at the previous output timestep (i.e. at index t_{i-1}), the resulting process only computes at most max(T, U) terms p_{i,j}, giving it a linear runtime. Of course, it also makes the strong assumption that the alignment between the input and output sequence is strictly monotonic.
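As a sketch of the test-time process in eqs. (6)-(8), the following function scans the memory left-to-right from the previously chosen index and stops at the first entry whose Bernoulli sample comes up 1. The `energy` callable stands in for a(·); the names, signatures, and the behavior when no entry is chosen (beyond returning a zero context vector) are illustrative assumptions — the paper's algorithm 1 in appendix A gives the exact procedure.

```python
import numpy as np

def hard_monotonic_step(s_prev, h, t_prev, energy, rng=np.random):
    """One output timestep of the hard monotonic process (eqs. 6-8).

    s_prev: previous decoder state s_{i-1}
    h:      encoder memory, shape (T, d_h); only entries up to the stopping
            point need to have been computed, enabling online decoding
    t_prev: 0-based index chosen at the previous output timestep (start at 0)
    energy: callable implementing the energy function a(s_{i-1}, h_j)
    Returns (c_i, t_i): the context vector and the chosen memory index.
    """
    T = h.shape[0]
    for j in range(t_prev, T):                # inspect entries left-to-right
        e_ij = energy(s_prev, h[j])           # eq. (6)
        p_ij = 1.0 / (1.0 + np.exp(-e_ij))    # eq. (7): logistic sigmoid
        if rng.random() < p_ij:               # eq. (8): z_ij ~ Bernoulli(p_ij)
            return h[j], j                    # attend to h_j; set t_i = j
    # No entry was selected: use an all-zero context vector and stay at the end.
    return np.zeros(h.shape[1]), T
```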

2.3. Training in Expectation

The online alignment process described above involves sampling, which precludes the use of standard backpropagation. In analogy with softmax-based attention, we therefore propose training with respect to the expected value of c_i, which can be computed straightforwardly as follows. We first compute e_{i,j} and p_{i,j} exactly as in eqs. (6) and (7), where p_{i,j} are interpreted as the probability of choosing memory element j at output timestep i. The attention distribution over the memory is then given by (see appendix C for a derivation)

\alpha_{i,j} = p_{i,j} \sum_{k=1}^{j} \Big( \alpha_{i-1,k} \prod_{l=k}^{j-1} (1 - p_{i,l}) \Big)    (9)
\alpha_{i,j} = p_{i,j} \Big( (1 - p_{i,j-1}) \frac{\alpha_{i,j-1}}{p_{i,j-1}} + \alpha_{i-1,j} \Big)    (10)

We provide a solution to the recurrence relation of eq. (10), which allows computing α_{i,j} for j ∈ {1, ..., T} in parallel with cumulative sum and cumulative product operations, in appendix C.1. Defining q_{i,j} = α_{i,j}/p_{i,j} gives the following procedure for computing α_{i,j}:

e_{i,j} = a(s_{i-1}, h_j)    (11)
p_{i,j} = \sigma(e_{i,j})    (12)
q_{i,j} = (1 - p_{i,j-1}) q_{i,j-1} + \alpha_{i-1,j}    (13)
\alpha_{i,j} = p_{i,j} q_{i,j}    (14)

where we define the special cases of q_{i,0} = 0, p_{i,0} = 0 to maintain equivalence with eq. (9). As in softmax-based attention, the α_{i,j} values produce a weighting over the memory, which are then used to compute the context vector at each timestep as in eq. (3). However, note that α_i may not be a valid probability distribution because \sum_j α_{i,j} ≤ 1. Using α_i as-is, without normalization, effectively associates any additional probability not allocated to memory entries to an additional all-zero memory location. Normalizing α_i so that ...

2.4. Modified Energy Function

The commonly-used energy function from softmax-based attention is

a(s_{i-1}, h_j) = v^\top \tanh(W s_{i-1} + V h_j + b)    (15)

where W and V are weight matrices, b is a bias vector, and v is a weight vector. We make two modifications to eq. (15) for use with our monotonic decoder: First, while the softmax is invariant to offset, the logistic sigmoid is not. As a result, we make the simple modification of adding a scalar variable r after the tanh function, allowing the model to learn the appropriate offset for the pre-sigmoid activations. Note that eq. (13) tends to exponentially decay attention over the memory because 1 - p_{i,j} ∈ [0, 1]; we therefore initialized r to a negative value prior to training so that 1 - p_{i,j} tends to be close to 1. Second, the use of the sigmoid nonlinearity in eq. (12) implies that our mechanism is particularly sensitive to the scale of the energy terms e_{i,j}, or correspondingly, the scale of the energy vector v. We found an effective solution to this issue was to apply weight normalization (Salimans & Kingma, 2016) to v, replacing it by g v/‖v‖ where g is a scalar parameter. Initializing g to the inverse square root of the attention hidden dimension worked well for all problems we studied. The above produces the energy function

a(s_{i-1}, h_j) = g \frac{v^\top}{\lVert v \rVert} \tanh(W s_{i-1} + V h_j + b) + r    (16)

The addition of the two scalar parameters g and r prevented the issues described above in all our experiments while incurring a negligible increase in the number of parameters.
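Putting eqs. (11)-(14) and (16) together, here is a sketch of how the expected monotonic attention for one output timestep could be computed during training. The recurrence is written sequentially for clarity; the paper's appendix C.1 gives the parallel cumulative-sum/product form. All names, shapes, and the one-hot initialization of the attention at the first output timestep are illustrative assumptions.

```python
import numpy as np

def monotonic_energy(s_prev, h, W, V, b, v, g, r):
    """Eq. (16): weight-normalized energy with a learned scalar offset r."""
    v_hat = g * v / np.linalg.norm(v)
    return np.tanh(s_prev @ W + h @ V + b) @ v_hat + r   # shape (T,)

def expected_monotonic_attention(p_i, alpha_prev):
    """Eqs. (13)-(14): attention alpha_i given selection probabilities p_i
    (shape (T,)) and the previous output step's attention alpha_prev."""
    T = p_i.shape[0]
    alpha = np.zeros(T)
    q = 0.0        # q_{i,0} = 0
    p_left = 0.0   # p_{i,0} = 0
    for j in range(T):
        q = (1.0 - p_left) * q + alpha_prev[j]   # eq. (13)
        alpha[j] = p_i[j] * q                    # eq. (14)
        p_left = p_i[j]
    return alpha

# Illustrative usage for one output timestep i during training:
# e_i = monotonic_energy(s_prev, h, W, V, b, v, g, r)   # eq. (11) with eq. (16)
# p_i = 1.0 / (1.0 + np.exp(-e_i))                      # eq. (12)
# alpha_i = expected_monotonic_attention(p_i, alpha_prev)
# c_i = alpha_i @ h                                     # context vector, as in eq. (3)
# (alpha_prev for the first output timestep is assumed one-hot on the first entry.)
```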

55、 loca-2.5. Encouraging DiscretenessAs mentioned above, in order for our mechanism to exhibit similar behavior when training in expectation and when us- ing the hard monotonic attention process at test time, werequire that pi,j 0 or pi,j 1. A straightforward way to encourage this behavior is to add n

56、oise before the sigmoidin eq. (12), as was done e.g. in (Frey, 1997; Salakhutdinov & Hinton, 2009; Foerster et al., 2016). We found that sim- ply adding zero-mean, unit-variance Gaussian noise to the pre-sigmoid activations was sufficient in all of our exper- iments. This approach is similar to the

57、recently proposed Gumbel-Softmax trick (Jang et al., 2016; Maddison et al., 2016), except we did not find it necessary to anneal the temperature as suggested in (Jang et al., 2016).Note that once we have a model which produces pi,j which are effectively discrete, we can eschew the sampling in- volve

58、d in the process of section 2.2 and instead simply set zi,j = I(pi,j ) where I is the indicator function and is a threshold. We used this approach in all of our exper- iments, setting = 0.5. Furthermore, at test time we do not add pre-sigmoid noise, making decoding purely deter-Ttion. Normalizing iso th
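A minimal sketch of the discreteness trick described above, assuming a simple training/inference flag: during training, zero-mean, unit-variance Gaussian noise is added to the pre-sigmoid energies, and at test time the Bernoulli sample is replaced by a hard threshold at τ = 0.5. Function names are illustrative.

```python
import numpy as np

def selection_probability(e_ij, training, rng=np.random):
    """p_{i,j} from eq. (12), with the pre-sigmoid noise of section 2.5 when training."""
    if training:
        e_ij = e_ij + rng.standard_normal()   # zero-mean, unit-variance Gaussian noise
    return 1.0 / (1.0 + np.exp(-e_ij))

def hard_selection(p_ij, tau=0.5):
    """Test-time replacement for sampling: z_{i,j} = 1 iff p_{i,j} > tau."""
    return p_ij > tau
```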
