Multiple Futures Prediction

Yichuan Tang    Ruslan Salakhutdinov

Abstract

Temporal prediction is critical for making intelligent and robust decisions in complex dynamic environments. Motion prediction needs to model the inherently uncertain future, which often contains multiple potential outcomes due to multi-agent interactions and the latent goals of others. Towards these goals, we introduce a probabilistic framework that efficiently learns latent variables to jointly model the multi-step future motions of agents in a scene. Our framework is data-driven and learns semantically meaningful latent variables to represent the multimodal future, without requiring explicit labels. Using a dynamic attention-based state encoder, we learn to encode the past as well as the future interactions among agents, efficiently scaling to any number of agents. Finally, our model can be used for planning via computing a conditional probability density over the trajectories of other agents given a hypothetical rollout of the self agent. We demonstrate our algorithms by predicting vehicle trajectories on both simulated and real data, demonstrating state-of-the-art results on several vehicle trajectory datasets.
1 Introduction

The ability to make good predictions lies at the heart of robust and safe decision making. It is especially critical to be able to predict the future motions of all relevant agents in complex and dynamic environments. For example, in the autonomous driving domain, motion prediction is central both to high-level decisions, such as when to perform maneuvers, and to low-level path planning optimizations [34, 28].

Motion prediction is a challenging problem due to the various needs of a good predictive model. The varying objectives, goals, and behavioral characteristics of different agents can lead to multiple possible futures or modes. Agents' states do not evolve independently from one another; rather, they interact with each other. As an illustration, we provide some examples in Fig. 1. In Fig. 1(a), there are a few different possible futures for the blue vehicle approaching an intersection. It can either turn left, go straight, or turn right, forming different modes in trajectory space. In Fig. 1(b), interactions between the two vehicles during a merge scenario show that their trajectories influence each other, depending on who yields to whom. Besides multimodal interactions, prediction needs to scale efficiently with an arbitrary number of agents in a scene and take into account auxiliary and contextual information, such as map and road information. Additionally, the ability to measure uncertainty by computing the probability of likely future trajectories of all agents in closed form (as opposed to Monte Carlo sampling) is of practical importance.

Despite a large body of work on temporal motion prediction [24, 7, 13, 26, 16, 2, 30, 8, 39], existing state-of-the-art methods often capture only a subset of the aforementioned features. For example, algorithms are either deterministic, not multimodal, or do not fully capture both past and future interactions. Multimodal techniques often require the explicit labeling of modes prior to training. Models which perform joint prediction often assume the number of agents present to be fixed [36, 31]. We tackle these challenges by proposing a unifying framework that captures all of the desirable features mentioned earlier.

Figure 1: Examples illustrating the need for multimodal interactive predictions. (a) Multiple possible future trajectories: there are a few possible modes for the blue vehicle. (b) Scenario A: green yields to blue. (c) Scenario B: blue yields to green. (b and c) are time-lapsed visualizations of how interactions between agents influence each other's trajectories.
14、ces each others trajectories.a sequential probabilistic latent variable generative mthat learns directly from multi-agenttrajectory data. Trainingizes a variational lower bound on the log-likelihood of the data. MFPlearns to mmultimodal interactive futures jointly for all agents, while using a novel
15、 factorizationtechnique to remain scalable to arbitrary number of agents. After training, MFP can compute both (un)conditional trajectory probabilities in d form, not requiring any Monte Carlo sampling.MFP builds on the Seq2seq 32, encoder-decoder framework by introducing latent variables and using
16、a set of parallel RNNs (with shared weights) to represent the set of agents in a scene. Each RNN takes on the point-of-view of its agent and aggregates historical information for sequential temporal prediction for that agent. Discrete latent variables, one per RNN, automatically learn sem ally meani
17、ngful modes to capture multimodality without explicit labeling. MFP can be further efficiently and jointly trained end-to-end for all agents in the scene. To summarize, we make the following contributions with the proposed MFP: First, sem ally meaningful latent variables are automatically learned fr
18、om trajectory data without labels. This addresses the multimodality problem. Second, interactive and parallel step-wise rollouts are preformed for all agents in the scene. This addresses the m ing of interactions between actors during future prediction, see Sec. 3.1. We further propose a dynamic att
19、entional encoding which captures both the relationships between agents and the scene context, see Sec. 3.1. Finally, MFP is capable of perforhypothetical inference: evaluating the conditional probability of agents trajectories conditioning on fixing one or more agents trajectory, see Sec. 3.2.2Relat
2 Related Work

The problem of predicting future motion for dynamic agents has been well studied in the literature. The bulk of classical methods focus on physics-based dynamic or kinematic models [38, 21, 25]. These approaches include Kalman filters and maneuver-based methods, which compute the future motion of agents by propagating their current state forward in time. While these methods perform well for short time horizons, longer horizons suffer due to the lack of interaction and context modeling.

The success of machine learning and deep learning ushered in a variety of data-driven recurrent neural network (RNN) based methods [24, 7, 13, 26, 16, 2]. These models often combine RNN variants, such as LSTMs or GRUs, with encoder-decoder architectures such as conditional variational autoencoders (CVAEs). These methods eschew physics-based dynamic models in favor of learning generic sequential predictors (e.g. RNNs) directly from data. Converting raw input data to input features can also be learned, often by encoding rasterized inputs with CNNs [7, 13].

Methods that can learn multiple future modes have been proposed in [16, 24, 13]. However, [16] explicitly labels six maneuvers/modes and learns to separately classify these modes. [24, 13] do not require mode labeling, but they also do not train in an end-to-end fashion by maximizing the data log-likelihood of the model. Most methods in the literature encode the past interactions of agents in a scene; however, prediction is often an independent rollout of a decoder RNN, independent of the other agents' future predicted trajectories [16, 29]. Encoding of spatial relationships is often done by placing other agents in a fixed and spatially discretized grid [16, 24].
Figure 2: Graphical model and computation graph of the MFP. (a) Graphical model of the MFP: solid nodes denote observed variables; cross-agent interaction edges are shaded for clarity. (b) Architecture of the proposed MFP: the circular "world" node contains the world state and positions of all agents; $x_t$ denotes both the state and contextual information from timesteps 1 to $t$; diamond nodes are deterministic while the circular $z^n$ are discrete latent random variables. See text for details. Best viewed in color.

In contrast, MFP proposes a unifying framework which exhibits all of the aforementioned features. To summarize, we present a feature comparison of MFP with some of the recent methods in the supplementary materials.

3 Multiple Futures Prediction
We tackle motion prediction by formulating a probabilistic framework over a continuous-space but discrete-time system with a finite (but variable) number $N$ of interacting agents. We represent the joint state of all $N$ agents at time $t$ as $X_t = \{x_t^1, x_t^2, \ldots, x_t^N\} \in \mathbb{R}^{N \times d}$, where $d$ is the dimensionality of each state¹ and $x_t^n \in \mathbb{R}^d$ is the state of the $n$-th agent at time $t$. With a slight abuse of notation, we use the superscripted $X^n = \{x_{t-\tau}^n, x_{t-\tau+1}^n, \ldots, x_t^n\}$ to denote the past states of the $n$-th agent and $X = X^{1:N}_{t-\tau:t}$ to denote the joint agent states from time $t-\tau$ to $t$, where $\tau$ is the number of past history steps. The future state of all agents at time $t$ is denoted by $Y_t = \{y_t^1, y_t^2, \ldots, y_t^N\}$ and the future trajectory of agent $n$, from time $t$ to time $T$, is denoted by $Y^n = \{y_t^n, y_{t+1}^n, \ldots, y_T^n\}$. $Y = Y^{1:N}_{t:t+T}$ denotes the joint state of all agents over the future timesteps. Contextual scene information, e.g. a rasterized image $\in \mathbb{R}^{h \times w \times 3}$ of the map, could be useful by providing important cues. We use $I_t$ to represent any contextual information at time $t$.

¹We assume states are fully observable and are the agents' $(x, y)$ coordinates on the ground plane ($d = 2$).
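To make the notation concrete, the following is a minimal sketch of the array shapes involved, assuming $d = 2$ and using hypothetical values for $N$, $\tau$, and $T$ (the variable names are illustrative and not taken from the paper's code):

```python
import numpy as np

N, d = 3, 2        # number of agents, state dimensionality (x, y)
tau, T = 10, 25    # past history steps and future horizon (hypothetical values)

# Joint past states X = X^{1:N}_{t-tau:t}: tau+1 past positions per agent.
X_past = np.random.randn(N, tau + 1, d)

# Joint future trajectories Y = Y^{1:N}_{t:t+T} to be predicted.
Y_future = np.random.randn(N, T, d)

# Contextual information I_t, e.g. a rasterized map image of shape (h, w, 3).
h, w = 128, 128
I_t = np.zeros((h, w, 3), dtype=np.float32)

print(X_past.shape, Y_future.shape, I_t.shape)  # (3, 11, 2) (3, 25, 2) (128, 128, 3)
```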
The goal of motion prediction is then to accurately model $p(Y|X, I_t)$. As in most sequential modeling tasks, it is both inefficient and intractable to model $p(Y|X, I_t)$ jointly. RNNs are typically employed to sequentially model the distribution in a cascade form. However, there are two major challenges specific to our multi-agent prediction framework: (1) Multimodality: optimizing vanilla RNNs via backpropagation through time will lead to mode-averaging, since the map from $X$ to $Y$ is not a function but rather a one-to-many mapping. In other words, multimodality means that for a given $X$, there can be multiple distinct modes that carry significant probability mass over different sequences of $Y$. (2) Variable-Agents: the number of agents $N$ is variable and unknown, and therefore we cannot simply vectorize $X_t$ as the input to a standard RNN at time $t$.
For multimodality, we introduce a set of stochastic latent variables $z^n \sim \mathrm{Multinoulli}(K)$, one per agent, where $z^n$ can take on $K$ discrete values. The intuition here is that $z^n$ learns to represent intentions (left/right/straight) and/or behavior modes (aggressive/conservative). Learning maximizes the marginalized distribution, where $z$ is free to learn any latent behavior so long as it helps to improve the data log-likelihood. Each $z$ is conditioned on $X$ at the current time (before future prediction) and influences the distribution over future states $Y$. A key feature of the MFP is that $z^n$ is only sampled once at time $t$ and must stay consistent for the next $T$ time steps. Compared to sampling $z^n$ at every timestep, this leads to better tractability and more realistic intention/goal modeling, as we will discuss in more detail later. We now arrive at the following distribution:
$$\log p(Y|X, I) = \log \sum_{Z} p(Y, Z|X, I) = \log \sum_{Z} p(Y|Z, X, I)\, p(Z|X, I), \qquad (1)$$
where $Z$ denotes the joint latent variables of all agents.
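Since each $z^n$ is discrete with small cardinality $K$, the marginalization in Eq. 1 can be evaluated exactly as a log-sum-exp over modes. Below is a minimal sketch for a single agent; the inputs stand in for the per-mode prior and decoder log-likelihoods and are hypothetical, not taken from the paper's code:

```python
import numpy as np

def marginal_log_likelihood(log_prior_z, log_lik_given_z):
    """Closed-form marginal log p(Y^n | X, I) for one agent.

    log_prior_z:      array of shape (K,), log p(z^n = k | X, I).
    log_lik_given_z:  array of shape (K,), log p(Y^n | z^n = k, X, I),
                      i.e. the per-mode trajectory log-likelihood.
    """
    joint = log_prior_z + log_lik_given_z          # log p(Y^n, z^n = k | X, I)
    m = joint.max()
    return m + np.log(np.exp(joint - m).sum())     # numerically stable log-sum-exp

# Example with K = 3 hypothetical modes (e.g. left / straight / right).
log_prior = np.log(np.array([0.2, 0.5, 0.3]))
log_lik = np.array([-40.0, -12.0, -25.0])          # per-mode trajectory log-likelihoods
print(marginal_log_likelihood(log_prior, log_lik))
```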
Naively optimizing Eq. 1 is prohibitively expensive and not scalable, as the number of agents and timesteps may become large. In addition, the maximum number of possible modes is exponential: $O(K^N)$. We first make the model more tractable by factorizing across time, followed by a factorization across agents. The joint future distribution then assumes the form of a product of conditional distributions:
$$p(Y|Z, X, I) = \prod_{\tau=t+1}^{T} p(Y_\tau \mid Y_{t:\tau-1}, Z, X, I), \qquad (2)$$
$$p(Y_\tau \mid Y_{t:\tau-1}, Z, X, I) = \prod_{n=1}^{N} p(y_\tau^n \mid Y_{t:\tau-1}, z^n, X, I). \qquad (3)$$
The second factorization is sensible as the factorial component conditions on the joint states of all agents at the immediately preceding timestep, where the typical temporal delta is very short (e.g. 100 ms). Also note that the future distribution of the $n$-th agent is explicitly dependent on its own mode $z^n$ but implicitly dependent on the latent modes of other agents through re-encoding the other agents' predicted states $y^m$ (please see the discussion later and also Sec. 3.1). Explicitly conditioning only on an agent's own latent mode is both more scalable computationally and more realistic: agents in the real world can only infer other agents' latent goals/intentions by observing their states. Finally, our overall objective from Eq. 1 can be written as:
$$\log \sum_{Z} p(Y|Z, X, I)\, p(Z|X, I) = \log \sum_{Z} \prod_{\tau=t+1}^{T} \prod_{n=1}^{N} p(y_\tau^n \mid Y_{t:\tau-1}, z^n, X, I)\, p(z^n \mid X, I) \qquad (4)$$
$$= \log \sum_{Z} \prod_{n=1}^{N} \Big[ p(z^n \mid X, I) \prod_{\tau=t+1}^{T} p(y_\tau^n \mid Y_{t:\tau-1}, z^n, X, I) \Big] \qquad (5)$$
The graphical model of the MFP is illustrated in Fig. 2a. While we show only three agents for simplicity, MFP can easily scale to any number of agents.
Nonlinear interactions among agents make $p(y_\tau^n \mid Y_{t:\tau-1}, X, I)$ complicated to model. Recurrent neural networks are a class of powerful and flexible models that can efficiently capture and represent long-term dependencies in sequential data. At a high level, RNNs introduce deterministic hidden units $h_t$ at every timestep $t$, which act as features or embeddings that summarize all of the observations up until time $t$. At time step $t$, an RNN takes as its input the observation $x_t$ and the previous hidden representation $h_{t-1}$, and computes the update $h_t = f_{\mathrm{rnn}}(x_t, h_{t-1})$. The prediction $y_t$ is computed from the decoding layer of the RNN: $y_t = f_{\mathrm{dec}}(h_t)$. $f_{\mathrm{rnn}}$ and $f_{\mathrm{dec}}$ are recursively applied at every timestep of the sequence.
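A minimal sketch of this recursion, $h_t = f_{\mathrm{rnn}}(x_t, h_{t-1})$ and $y_t = f_{\mathrm{dec}}(h_t)$, using a GRU cell (the paper uses GRUs; the layer sizes here are hypothetical):

```python
import torch
import torch.nn as nn

state_dim, hidden_dim = 2, 64               # hypothetical sizes

f_rnn = nn.GRUCell(state_dim, hidden_dim)   # h_t = f_rnn(x_t, h_{t-1})
f_dec = nn.Linear(hidden_dim, state_dim)    # y_t = f_dec(h_t)

x_seq = torch.randn(10, 1, state_dim)       # a length-10 observation sequence (batch of 1)
h = torch.zeros(1, hidden_dim)
preds = []
for x_t in x_seq:
    h = f_rnn(x_t, h)                       # recursively update the hidden state
    preds.append(f_dec(h))                  # decode a prediction at every timestep
y_seq = torch.stack(preds)
print(y_seq.shape)                          # torch.Size([10, 1, 2])
```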
Fig. 2b shows the computation graph of the MFP. A point-of-view (PoV) transformation $\Phi_n(X_t)$ is first used to transform the past states into each agent's own reference frame, by translation and rotation such that the +x-axis aligns with the agent's heading. We then instantiate an encoding and a decoding RNN² per agent. Each encoding RNN is responsible for encoding the past observations $x_{t-\tau:t}$ into a feature vector. Scene context is transformed via a convolutional neural network into its own feature. The features are combined via a dynamic attention encoder, detailed in Sec. 3.1, to provide inputs both to the latent variables and to the ensuing decoding RNNs. During predictive rollouts, the decoding RNN predicts its own agent's state at every timestep. The predictions are aggregated and subsequently transformed via $\Phi_n(\cdot)$, providing inputs to every agent/RNN for the next timestep.
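A minimal sketch of the point-of-view transformation $\Phi_n$: translate all states so the reference agent sits at the origin, then rotate so that its heading aligns with the +x-axis. The heading angle is assumed to be given; how it is estimated is outside this sketch:

```python
import numpy as np

def pov_transform(states, ref_position, ref_heading):
    """Transform (x, y) states of all agents into the reference agent's frame.

    states:       array of shape (..., 2) of ground-plane coordinates.
    ref_position: (2,) position of the reference agent.
    ref_heading:  heading angle (radians) of the reference agent.
    """
    c, s = np.cos(-ref_heading), np.sin(-ref_heading)
    R = np.array([[c, -s],
                  [s,  c]])                  # rotate by -heading so heading maps to +x-axis
    return (states - ref_position) @ R.T

# Example: a point 1 m ahead of an agent heading due north (+y) maps to (1, 0).
print(pov_transform(np.array([[0.0, 1.0]]), np.zeros(2), np.pi / 2))
```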
Latent variables $Z$ provide extra inputs to the decoding RNNs to enable multimodality. Finally, the output $y_t^n$ consists of a 5-dimensional vector governing a bivariate Normal distribution: $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$, and the correlation coefficient $\rho$.
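A minimal sketch of how the 5-dimensional output $(\mu_x, \mu_y, \sigma_x, \sigma_y, \rho)$ defines a bivariate Normal log-density over a ground-truth position; this is the standard bivariate Gaussian formula, not code from the paper:

```python
import numpy as np

def bivariate_normal_logpdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Log-density of a 2D Gaussian parameterized by means, std-devs and correlation."""
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    one_minus_rho2 = 1.0 - rho ** 2
    quad = (zx ** 2 - 2.0 * rho * zx * zy + zy ** 2) / one_minus_rho2
    log_norm = np.log(2.0 * np.pi * sigma_x * sigma_y * np.sqrt(one_minus_rho2))
    return -0.5 * quad - log_norm

# Example: likelihood of an observed next position under one predicted output vector.
print(bivariate_normal_logpdf(1.0, 0.5, mu_x=0.9, mu_y=0.6,
                              sigma_x=0.3, sigma_y=0.3, rho=0.1))
```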
While we instantiate two RNNs per agent, these RNNs share the same parameters across agents, which means we can efficiently perform joint predictions by combining inputs in a minibatch, allowing us to scale to an arbitrary number of agents. Making $Z$ discrete and having only one set of latent variables influence the subsequent predictions is also a deliberate choice. We would like $Z$ to model modes generated by high-level intentions, such as left/right lane changes or conservative/aggressive modes of agent behavior. These latent behavior modes also tend to stay consistent over the time horizon typical of motion prediction (e.g. 5 seconds).

²We use GRUs [10]. LSTMs and GRUs perform similarly, but GRUs were slightly faster computationally.
Learning

Given a set of training trajectory data $\mathcal{D} = \{(X^{(i)}, Y^{(i)})\}_{i=1,2,\ldots,|\mathcal{D}|}$, we optimize using maximum likelihood estimation (MLE) to estimate the parameters $\theta^* = \arg\max_\theta \mathcal{L}(\theta, \mathcal{D})$ that achieve the maximum marginal data log-likelihood³:
$$\mathcal{L}(\theta, \mathcal{D}) = \log p(Y|X; \theta) = \log \sum_{Z} p(Y, Z|X; \theta) = \sum_{Z} p(Z|Y, X; \theta) \log \frac{p(Y, Z|X; \theta)}{p(Z|Y, X; \theta)} \qquad (6)$$
Optimizing Eq. 6 directly is non-trivial, as the posterior distribution is not only hard to compute but also varies with $\theta$. We can, however, decompose the log-likelihood into the sum of the evidence lower bound (ELBO) and the KL-divergence between the true posterior and an approximating posterior $q(Z)$ [27]:
$$\log p(Y|X; \theta) = \sum_{Z} q(Z|Y, X) \log \frac{p(Y, Z|X; \theta)}{q(Z|Y, X)} + D_{KL}(q \,\|\, p) \;\geq\; \sum_{Z} q(Z|Y, X) \log p(Y, Z|X; \theta) + H(q), \qquad (7)$$
where Jensen's inequality is used to arrive at the lower bound, $H$ is the entropy function, and $D_{KL}(q\|p)$ is the KL-divergence between the true and approximating posterior. We learn by maximizing the variational lower bound on the data log-likelihood, first using the true posterior⁴ at the current $\theta'$ as the approximating posterior: $q(Z|Y, X) := p(Z|Y, X; \theta')$. We can then fix the approximate posterior and optimize the model parameters for the following function:
$$Q(\theta, \theta') = \sum_{Z} p(Z|Y, X; \theta') \log p(Y, Z|X; \theta) = \sum_{Z} p(Z|Y, X; \theta') \big[ \log p(Y|Z, X; \theta_{\mathrm{rnn}}) + \log p(Z|X; \theta_{Z}) \big], \qquad (8)$$
where $\theta = \{\theta_{\mathrm{rnn}}, \theta_{Z}\}$ denotes the parameters of the RNNs and the parameters of the network layers for predicting $Z$. As our latent variables $Z$ are discrete and have small cardinality (e.g. 10), we can compute the posterior exactly for a given $\theta'$. The RNN parameter gradients are computed from $\partial Q(\theta, \theta')/\partial \theta_{\mathrm{rnn}}$, and the gradient for $\theta_{Z}$ is $\partial\, \mathrm{KL}\big(p(Z|Y, X; \theta') \,\|\, p(Z|X; \theta_{Z})\big)/\partial \theta_{Z}$. Our learning algorithm is a form of the EM algorithm [14], where for the M-step we optimize the RNN parameters using stochastic gradient descent. By integrating out the latent variable $Z$, MFP learns directly from trajectory data, without requiring any annotations or weak supervision for latent modes. We provide detailed training algorithm pseudocode in the supplementary materials.
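Because $Z$ is discrete with small cardinality, the E-step posterior can be computed exactly by normalizing the per-mode joint likelihoods, and the M-step maximizes the posterior-weighted objective of Eq. 8 by gradient ascent. A minimal per-agent sketch of one such EM-style update; the toy tensors stand in for the outputs of the prior network and the decoder rollouts and are purely hypothetical:

```python
import torch

# Hypothetical toy parameters standing in for the prior network and decoder RNN outputs.
K = 3
prior_logits = torch.zeros(K, requires_grad=True)      # -> log p(z | X, I; theta_Z)
per_mode_score = torch.randn(K, requires_grad=True)    # -> log p(Y | z, X, I; theta_rnn)
optimizer = torch.optim.SGD([prior_logits, per_mode_score], lr=0.01)

def em_step():
    """One EM-style update for a single agent with K discrete modes (Eqs. 6-8)."""
    log_prior_z = torch.log_softmax(prior_logits, dim=0)     # log p(z | X, I)
    log_lik_given_z = per_mode_score                         # log p(Y | z, X, I)

    # E-step: exact posterior over modes under the current (fixed) parameters theta'.
    posterior = torch.softmax((log_prior_z + log_lik_given_z).detach(), dim=0)

    # M-step: maximize the posterior-weighted objective Q(theta, theta') by SGD.
    loss = -(posterior * (log_lik_given_z + log_prior_z)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

for _ in range(5):
    print(em_step())
```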
Classmates-forcing

Teacher forcing is a standard technique (albeit biased) to accelerate RNN and sequence-to-sequence training by using ground-truth observations, rather than the model's own predictions, as inputs during training.