Chapter 10: Reinforcement Learning (PPT slides)


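Reinforcement Learning
Shi Zhongzhi (史忠植), 2021-11-17

Outline
- Introduction
- The reinforcement learning model
- Dynamic programming
- Monte Carlo methods
- Temporal-difference learning
- Q-learning
- Function approximation in reinforcement learning
- Applications

Introduction
Humans usually learn by interacting with their environment. Reinforcement learning is the learning of a mapping from environment states to actions such that the cumulative reward the system obtains from the environment is maximized. In reinforcement learning we design algorithms that turn situations in the environment into actions in a way that maximizes the amount of reward. The agent is not told what to do or which action to take; it discovers for itself which actions yield the most reward by trying them. An action affects not only the immediate reward but also the subsequent actions and the final reward. Trial-and-error search and delayed reinforcement are the two most important characteristics of reinforcement learning.

Introduction
Reinforcement learning grew out of control theory, statistics, psychology and related disciplines, and can be traced back to Pavlov's conditioning experiments. It was not until the late 1980s and early 1990s, however, that it was widely studied and applied in artificial intelligence, machine learning and automatic control, where it came to be regarded as one of the core technologies for designing intelligent systems. In particular, after breakthroughs in its mathematical foundations, research on and applications of reinforcement learning expanded steadily, and it is now one of the active research topics in machine learning.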

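Introduction
The idea of reinforcement originated in psychology. In 1911 Thorndike proposed the Law of Effect: in a given situation, behavior that makes the animal feel comfortable becomes more strongly associated with that situation (it is reinforced) and is more likely to recur when the situation recurs; conversely, behavior that makes the animal feel uncomfortable becomes more weakly associated with the situation and is unlikely to recur. In other words, which behavior is "remembered" and linked to a stimulus depends on the effect the behavior produces.
Trial-and-error learning in animals has two aspects, selectional and associative, which correspond computationally to search and memory. In 1954 Minsky implemented computational trial-and-error learning in his doctoral dissertation. In the same year Farley and Clark also studied it computationally.
The term "reinforcement learning" first appeared in the technical literature in Minsky's 1961 paper "Steps Toward Artificial Intelligence" and has been widely used since. In 1969 Minsky received the Turing Award for his contributions to artificial intelligence.

Introduction
- 1953 to 1957: Bellman developed dynamic programming, an effective method for solving optimal control problems.
- 1957: Bellman also introduced the discrete stochastic version of the optimal control problem, the well-known Markov decision process (MDP).
- 1960: Howard proposed the policy iteration method for Markov decision processes. Together these became the theoretical foundation of modern reinforcement learning.
- 1972: Klopf combined trial-and-error learning with temporal differences.
- From 1978: Sutton, Barto, Moore, and Klopf began to study this combination in depth.
- 1989: Watkins proposed Q-learning [Watkins 1989], which also wove the three main threads of reinforcement learning together.
- 1992: Tesauro successfully applied reinforcement learning to backgammon, in the program known as TD-Gammon.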

The Reinforcement Learning Model
i: input, r: reward, s: state, a: action.
[Figure: agent-environment loop. In state $s_i$ with reward $r_i$ the agent emits action $a_i$; the environment returns $s_{i+1}$ and $r_{i+1}$, producing the trajectory $s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3$.]

Describing an Environment (Problem)
- Accessible vs. inaccessible
- Deterministic vs. non-deterministic
- Episodic vs. non-episodic
- Static vs. dynamic
- Discrete vs. continuous
The most complex general class of environments is inaccessible, non-deterministic, non-episodic, dynamic, and continuous.

The Reinforcement Learning Problem
- Agent-environment interaction: states, actions, rewards.
- To define a finite MDP:
  - state and action sets: $S$ and $A$
  - one-step "dynamics" defined by transition probabilities (Markov property):
    $P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$ for all $s, s' \in S$, $a \in A(s)$
  - expected rewards:
    $R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$ for all $s, s' \in S$, $a \in A(s)$
[Figure: the RL agent sends an action to the environment and receives a state and a reward.]

Comparison with Supervised Learning
- Reinforcement learning: learn from interaction. The learner learns from its own experience, and the objective is to get as much reward as possible. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.
  [Figure: RL system with inputs, outputs ("actions"), and training info = evaluations ("rewards" / "penalties").]
- Supervised learning: learn from examples provided by a knowledgeable external supervisor.

Elements of Reinforcement Learning
- Policy: stochastic rule for selecting actions
- Return/Reward: the function of future rewards the agent tries to maximize
- Value: what is good because it predicts reward
- Model: what follows what
[Figure: the four elements (Policy, Reward, Value, Model of environment) annotated with "is unknown", "is my goal", "is what I can get", "is my method".]

The Bellman Equation for a Policy $\pi$
The basic idea:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots = r_{t+1} + \gamma\,(r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots) = r_{t+1} + \gamma R_{t+1}$
So:
$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\}$
Or, without the expectation operator:
$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\,[R^a_{ss'} + \gamma V^\pi(s')]$
$\gamma$ is the discount rate.
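The Bellman equation for $V^\pi$ can be read as an update rule and swept over the states until the values stop changing. The following is a minimal sketch of that idea, not code from the slides: the two-state MDP (states `A` and `B`, actions `stay` and `go`, reward 1 for arriving in `B`) and the names `evaluate_policy`, `P`, `R` and `ALWAYS_GO` are assumptions made purely for illustration.

```python
def evaluate_policy(states, actions, P, R, policy, gamma=0.9, tol=1e-12):
    """Iterate V(s) <- sum_a pi(s,a) sum_s' P[s][a][s'] * (R[s][a][s'] + gamma * V(s'))."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                policy[s][a] * P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                for a in actions
                for s2 in states
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:  # values have stopped changing
            return V


# Hypothetical toy MDP: "stay" keeps the state, "go" switches it.
STATES = ("A", "B")
ACTIONS = ("stay", "go")
P = {
    "A": {"stay": {"A": 1.0, "B": 0.0}, "go": {"A": 0.0, "B": 1.0}},
    "B": {"stay": {"A": 0.0, "B": 1.0}, "go": {"A": 1.0, "B": 0.0}},
}
# Reward 1 whenever the transition ends in B, otherwise 0.
R = {s: {a: {"A": 0.0, "B": 1.0} for a in ACTIONS} for s in STATES}
# Deterministic policy written as action probabilities pi(s, a).
ALWAYS_GO = {s: {"stay": 0.0, "go": 1.0} for s in STATES}

V = evaluate_policy(STATES, ACTIONS, P, R, ALWAYS_GO)
print(V)
```

Under this policy the two equations are $V(A) = 1 + \gamma V(B)$ and $V(B) = \gamma V(A)$, so with $\gamma = 0.9$ the sweep should settle at $V(A) = 1/(1-\gamma^2)$.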

The Bellman Optimality Equation
$V^*(s) = \max_{a \in A(s)} E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\} = \max_{a \in A(s)} \sum_{s'} P^a_{ss'}\,[R^a_{ss'} + \gamma V^*(s')]$
where $V^*$ is the state value mapping, $S$ the set of environment states, $R$ the reward function, $P$ the state transition probability function, and $\gamma$ the discount factor.

Markov Decision Process
An MDP is defined by the four-tuple $\langle S, A, R, P \rangle$:
- the set of environment states $S$
- the set of system actions $A$
- the reward function $R: S \times A \to \mathbb{R}$
- the state transition function $P: S \times A \to PD(S)$
$R(s, a, s')$ denotes the immediate reward obtained when the system takes action $a$ in state $s$ and the environment moves to state $s'$; $P(s, a, s')$ denotes the probability that taking action $a$ in state $s$ moves the environment to state $s'$.

Markov Decision Process
The essence of a Markov decision process is that the probability of moving from the current state to the next state, and the reward received, depend only on the current state and the chosen action, not on earlier states or actions. When the environment model (the transition probability function $P$ and the reward function $R$) is known, dynamic programming can therefore be used to find the optimal policy. Reinforcement learning focuses on how the system can learn an optimal behavior policy when $P$ and $R$ are unknown.

Markov Decision Process
Characteristics of an MDP:
- a set of states: $S$
- a set of actions: $A$
- a reward function $R: S \times A \to \mathbb{R}$
- a state transition function $T: S \times A \to \Pi(S)$
$T(s, a, s')$: probability of transition from $s$ to $s'$ using action $a$.

MDP Example
[Figure: transition function, states and rewards, and the Bellman equation with greedy policy selection.]

MDP Graphical Representation
[Figure: graph of the transitions $T(s, \text{action}, s')$.] Note the similarity to Hidden Markov Models (HMMs).

Reinforcement Learning
- Deterministic transitions
- Stochastic transitions: $M^a_{ij}$ is the probability of reaching state $j$ when taking action $a$ in state $i$.

[Figure: 4 x 3 grid world with columns 1 to 4 and rows 1 to 3, a start square, and terminal squares +1 and -1.]
A simple environment that presents the agent with a sequential decision problem. Move cost = 0.04.
- (Temporal) credit assignment problem; sparse reinforcement problem.
- Offline algorithm: action sequences are determined ex ante.
- Online algorithm: action sequences are conditional on observations along the way. This is important in stochastic environments (e.g. flying a jet).

Reinforcement Learning
$M$ = 0.8 in the direction you want to go, 0.2 in the perpendicular directions (0.1 left, 0.1 right).
Policy: a mapping from states to actions.
[Figure: an optimal policy for the stochastic environment.]
Utilities of states (rows 3 to 1 from top to bottom, columns 1 to 4 from left to right):
  row 3:  0.812   0.868      0.918   +1
  row 2:  0.762   (blocked)  0.660   -1
  row 1:  0.705   0.655      0.611   0.388
Environment:
- Observable (accessible): the percept identifies the state.
- Partially observable.
Markov property: transition probabilities depend on the state only, not on the path to the state.
- Markov decision problem (MDP).
- Partially observable MDP (POMDP): percepts do not carry enough information to identify the transition probabilities.

Dynamic Programming
Dynamic programming computes value functions by backing up from successor states to predecessor states. It updates states one after another using a model of the next-state distribution. Dynamic programming for reinforcement learning rests on the fact that, for any policy $\pi$ and any state $s$, the following consistency equation (10.9) holds:
$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} \pi(s \to s' \mid a)\,[R^a(s \to s') + \gamma V^\pi(s')]$   (10.9)
$\pi(a \mid s)$ is the probability of action $a$ in state $s$ under the stochastic policy $\pi$. $\pi(s \to s' \mid a)$ is the probability of moving from state $s$ to state $s'$ under action $a$ (written $P^a_{ss'}$ earlier). This is the Bellman (1957) equation for $V^\pi$.

Dynamic Programming: Problem
- A discrete-time dynamic system
- States $\{1, \ldots, n\}$ plus termination state 0
- Control $U(i)$
- Transition probability $p_{ij}(u)$
- Accumulative cost structure: $\alpha^k r(i, u, j)$, $0 < \alpha \le 1$
- Policies: $\pi = \{\mu_0, \mu_1, \ldots\}$ with $P(i_{k+1} = j \mid i_k = i) = p_{ij}(\mu_k(i))$

Dynamic Programming: Iterative Solution
- Finite horizon problem:
  $V^\pi_N(i) = E\left[\alpha^N G(i_N) + \sum_{k=0}^{N-1} \alpha^k r(i_k, \mu_k(i_k), i_{k+1}) \mid i_0 = i\right]$
- Infinite horizon problem:
  $V^\pi(i) = \lim_{N \to \infty} E\left[\sum_{k=0}^{N-1} \alpha^k r(i_k, \mu_k(i_k), i_{k+1}) \mid i_0 = i\right]$
- Value iteration:
  $(TV)(i) = \max_{u \in U(i)} \sum_{j=0}^{n} p_{ij}(u)\,[r(i, u, j) + \alpha V(j)]$, $i = 1, \ldots, n$
  $V_{k+1}(i) = (TV_k)(i)$
  $V^*(i) = \max_\pi V^\pi(i)$

Policy Iteration and Value Iteration in Dynamic Programming
Policy iteration alternates policy evaluation and policy improvement ("greedification"):
$\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^*$
- policy evaluation: $V_{k+1}(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\,[R^a_{ss'} + \gamma V_k(s')]$
- policy improvement: $\pi'(s) = \arg\max_a \sum_{s'} P^a_{ss'}\,[R^a_{ss'} + \gamma V^\pi(s')]$
Value iteration:
$V_{k+1}(s) = \max_a \sum_{s'} P^a_{ss'}\,[R^a_{ss'} + \gamma V_k(s')]$

The Dynamic Programming Method
$V(s_t) \leftarrow E_\pi\{r_{t+1} + \gamma V(s_{t+1})\}$
[Figure: backup diagram. A full one-step backup from $s_t$ over all successors $s_{t+1}$ with rewards $r_{t+1}$; T marks terminal states.]

Adaptive Dynamic Programming (ADP)
Idea: use the constraints (state transition probabilities) between states to speed learning.
Solve $U(i) = R(i) + \sum_j M_{ij} U(j)$ using DP (= value determination).
No maximization over actions, because the agent is passive, unlike in value iteration.
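The value iteration update can be tried on the 4 x 3 grid world described in these slides. The sketch below is an illustration under stated assumptions rather than the deck's own code: it assumes the usual layout for this example (blocked square at column 2, row 2; terminals +1 at (4, 3) and -1 at (4, 2)), a reward of -0.04 in every non-terminal state, no discounting, and the 0.8 / 0.1 / 0.1 motion model. All identifiers are hypothetical.

```python
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
# The two directions perpendicular to each intended move.
PERPENDICULAR = {
    "up": ("left", "right"),
    "down": ("left", "right"),
    "left": ("up", "down"),
    "right": ("up", "down"),
}
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
BLOCKED = (2, 2)  # assumed position of the blocked square
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != BLOCKED]


def move(state, action):
    """Deterministic effect of a move; bumping into a wall leaves the state unchanged."""
    dx, dy = ACTIONS[action]
    nxt = (state[0] + dx, state[1] + dy)
    return nxt if nxt in STATES else state


def transitions(state, action):
    """0.8 in the intended direction, 0.1 for each perpendicular direction."""
    side_a, side_b = PERPENDICULAR[action]
    return [
        (0.8, move(state, action)),
        (0.1, move(state, side_a)),
        (0.1, move(state, side_b)),
    ]


def value_iteration(step_reward=-0.04, gamma=1.0, tol=1e-10, max_sweeps=10000):
    U = {s: TERMINALS.get(s, 0.0) for s in STATES}
    for _ in range(max_sweeps):
        new_U = dict(U)
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:
                continue  # terminal utilities stay fixed at +1 / -1
            best = max(
                sum(p * U[s2] for p, s2 in transitions(s, a)) for a in ACTIONS
            )
            new_U[s] = step_reward + gamma * best
            delta = max(delta, abs(new_U[s] - U[s]))
        U = new_U
        if delta < tol:
            break
    return U


U = value_iteration()
for y in (3, 2, 1):
    print([round(U[(x, y)], 3) if (x, y) in U else None for x in range(1, 5)])
```

Under these assumptions the utility of the start square (1, 1) should come out close to the 0.705 listed in the table of state utilities.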

Large state space: e.g. backgammon, $10^{50}$ equations in $10^{50}$ variables.

Value Iteration Algorithm
[Figure: value iteration algorithm.]
An alternative iteration (Singh, 1993), important for model-free learning: [formula not recovered].
Stop the iteration when $V(s)$ changes by less than $\epsilon$. The value of the resulting greedy policy then differs from the optimal value by at most $2\epsilon\gamma / (1 - \gamma)$ (Williams & Baird, 1993b).

Policy Iteration Algorithm
[Figure: policy iteration algorithm.]
Policies converge faster than values. Why faster convergence?

Dynamic Programming
The classical dynamic programming model is of limited use, because for many problems it is hard to provide a complete model of the environment. Simulated robot soccer is such a problem, and it can be tackled with real-time dynamic programming. Real-time dynamic programming does not need an environment model in advance; it obtains the model through repeated trials in the real environment. A back-propagation neural network can be used to generalize over states: the input units of the network are the environment state $s$, and its output is the evaluation $V(s)$ of that state.

Model-Free Methods
Models of the environment: $T: S \times A \to \Pi(S)$ and $R: S \times A \to \mathbb{R}$.
Do we know them? Do we have to know them?
- Monte Carlo methods
- Adaptive heuristic critic
- Q-learning

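Monte Carlo Methods
Monte Carlo methods do not require a complete model. Instead, they sample entire trajectories of states and update the value function from the final outcome of each sample. Monte Carlo methods require only experience, that is, sampled sequences of states, actions and rewards from online or simulated interaction with the environment. Learning from online experience is attractive because it needs no prior knowledge of the environment yet can still reach optimal behavior. Learning from simulated experience is also powerful. It does need a model, but the model can be generative rather than analytic: it only has to generate trajectories, not compute explicit probabilities. It therefore does not have to produce the complete probability distributions over all possible transitions that dynamic programming requires.

The Monte Carlo Method
$V(s_t) \leftarrow V(s_t) + \alpha\,[R_t - V(s_t)]$
where $R_t$ is the actual return following state $s_t$.
[Figure: backup diagram. One complete sampled trajectory from $s_t$ to a terminal state T.]

Monte Carlo Methods
- Idea: hold statistics about rewards for each state, take the average; this is $V(s)$.
- Based only on experience.
- Assumes episodic tasks (experience is divided into episodes, and all episodes terminate regardless of the actions selected).
- Incremental in an episode-by-episode sense, not a step-by-step sense.

Monte Carlo Policy Evaluation
- Goal: learn $V^\pi(s)$ when $P$ and $R$ are unknown in advance.
- Given: some number of episodes under $\pi$ that contain $s$.
- Idea: average the returns observed after visits to $s$.
- Every-visit MC: average returns for every time $s$ is visited in an episode.
- First-visit MC: average returns only for the first time $s$ is visited in an episode.
- Both converge asymptotically.

Monte Carlo Methods
Problem: unvisited $\langle s, a \rangle$ pairs (the problem of maintaining exploration).
For every $\langle s, a \rangle$ make sure that $P(\langle s, a \rangle \text{ selected as a start state and action}) > 0$ (the assumption of exploring starts).

Monte Carlo Control
How to select policies (similar to policy evaluation):
- MC policy iteration: policy evaluation using MC methods, followed by policy improvement.
- Policy improvement step: greedify with respect to the value (or action-value) function.

Temporal-Difference Learning
Temporal-difference learning uses no model of the environment; it learns from experience. It updates at every step and does not need to wait for the task to finish. It is a control algorithm built on a predictive model: it predicts future inputs and outputs from historical information, emphasizing what the model does rather than how it is structured. Like Monte Carlo methods, temporal-difference methods sample the immediate rewards obtained during a learning episode; like dynamic programming, they estimate the state value function by bootstrapping. Through repeated iterative learning, the estimate approaches the true state value function.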
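For contrast with bootstrapping, first-visit Monte Carlo evaluation waits until an episode has ended and then averages the returns that followed the first visit to each state. The sketch below illustrates this on an assumed example that is not in the slides: a symmetric random walk on states 1 to 5, with terminal states at both ends and a reward of 1 only for stepping off the right end. The names `generate_episode` and `first_visit_mc` are hypothetical.

```python
import random


def generate_episode(rng, n_states=5):
    """Random walk on 1..n_states; 0 and n_states + 1 are terminal."""
    s = (n_states + 1) // 2
    episode = []  # (state, reward received on leaving that state)
    while 1 <= s <= n_states:
        s_next = s + rng.choice((-1, 1))
        reward = 1.0 if s_next == n_states + 1 else 0.0
        episode.append((s, reward))
        s = s_next
    return episode


def first_visit_mc(n_episodes=20000, gamma=1.0, seed=0, n_states=5):
    rng = random.Random(seed)
    total = {s: 0.0 for s in range(1, n_states + 1)}
    count = {s: 0 for s in range(1, n_states + 1)}
    for _ in range(n_episodes):
        episode = generate_episode(rng, n_states)
        # Compute the return following every step by scanning backwards.
        G = 0.0
        returns = []
        for s, r in reversed(episode):
            G = r + gamma * G
            returns.append((s, G))
        returns.reverse()
        # Only the return after the FIRST visit to each state is averaged.
        seen = set()
        for s, G in returns:
            if s not in seen:
                seen.add(s)
                total[s] += G
                count[s] += 1
    return {s: total[s] / count[s] if count[s] else 0.0 for s in total}


V = first_visit_mc()
print({s: round(v, 3) for s, v in V.items()})
```

For this walk the true value of state $s$ is $s/6$ (the probability of exiting on the right), which gives a reference against which the averaged returns can be compared. Switching to every-visit MC only requires dropping the `seen` check.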

Temporal-Difference Learning (TD)
$V(s_t) \leftarrow V(s_t) + \alpha\,[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$
[Figure: backup diagram. A single sampled step from $s_t$ to $s_{t+1}$ with reward $r_{t+1}$; T marks terminal states.]

Temporal-Difference Learning
- Simple every-visit Monte Carlo method:
  $V(s_t) \leftarrow V(s_t) + \alpha\,[R_t - V(s_t)]$
  target: the actual return after time $t$
- The simplest TD method, TD(0):
  $V(s_t) \leftarrow V(s_t) + \alpha\,[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$
  target: an estimate of the return

Temporal-Difference Learning (TD)
Idea: do ADP backups on a per-move basis, not for the whole state space.
$U(i) \leftarrow U(i) + \alpha\,[R(i) + U(j) - U(i)]$
Theorem: the average value of $U(i)$ converges to the correct value.
Theorem: if $\alpha$ is appropriately decreased as a function of the number of times a state is visited ($\alpha = \alpha(N_i)$), then $U(i)$ itself converges to the correct value.

TD($\lambda$): A Forward View
TD($\lambda$) is a method for averaging all $n$-step backups, weighted by $\lambda^{n-1}$ (time since visitation).
$\lambda$-return:
$R^\lambda_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R^{(n)}_t = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R^{(n)}_t + \lambda^{T-t-1} R_t$
Backup using the $\lambda$-return:
$\Delta V_t(s_t) = \alpha\,[R^\lambda_t - V_t(s_t)]$

The TD($\lambda$) Learning Algorithm
Idea: update from the whole epoch, not just on one state transition.
$U(i) \leftarrow U(i) + \alpha \sum_{m=k}^{\infty} \lambda^{m-k}\,[R(i_m) + U(i_{m+1}) - U(i_m)]$
Special cases:
- $\lambda = 1$: least-mean-square (LMS), Monte Carlo
- $\lambda = 0$: TD
An intermediate choice of $\lambda$ (between 0 and 1) is best. There is an interplay with $\alpha$.
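The forward-view $\lambda$-return for an episodic task can be computed directly from its definition as a weighted average of $n$-step returns. The sketch below is illustrative only and assumes a particular indexing convention that the slides do not fix: `rewards[k]` holds $r_{k+1}$, `values[k]` holds the current estimate $V(s_k)$, the episode has `T = len(rewards)` steps, and the terminal state has value 0. The function names are hypothetical.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """R_t^(n): n discounted rewards, then bootstrap from V(s_{t+n}) if not terminal."""
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:  # s_{t+n} is non-terminal, so bootstrap from its estimate
        G += gamma ** n * values[t + n]
    return G


def lambda_return(rewards, values, t, lam, gamma=1.0):
    """(1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) R_t^(n) + lam^(T-t-1) R_t."""
    T = len(rewards)
    complete = n_step_return(rewards, values, t, T - t, gamma)  # Monte Carlo return R_t
    weighted = sum(
        lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
        for n in range(1, T - t)
    )
    return (1 - lam) * weighted + lam ** (T - t - 1) * complete


rewards = [0.0, 0.0, 1.0]   # r_1, r_2, r_3
values = [0.5, 0.6, 0.7]    # V(s_0), V(s_1), V(s_2)
td0_target = lambda_return(rewards, values, t=0, lam=0.0)
mc_target = lambda_return(rewards, values, t=0, lam=1.0)
mixed_target = lambda_return(rewards, values, t=0, lam=0.5)
print(td0_target, mc_target, mixed_target)
```

With $\lambda = 0$ the target reduces to the one-step TD target $r_1 + V(s_1)$, and with $\lambda = 1$ it reduces to the complete Monte Carlo return, matching the two special cases listed above.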

The TD($\lambda$) Learning Algorithm
Algorithm 10.1: the TD(0) learning algorithm
Initialize $V(s)$ arbitrarily, and $\pi$ to the policy to be evaluated
Repeat (for each episode):
    Initialize $s$
    Repeat (for each step of episode):
        Choose $a$ from $s$ using the policy derived from $V$ (e.g., $\epsilon$-greedy)
        Take action $a$, observe $r$, $s'$
        $V(s) \leftarrow V(s) + \alpha\,[r + \gamma V(s') - V(s)]$
        $s \leftarrow s'$
    Until $s$ is terminal

Convergence of TD($\lambda$)
Theorem: converges with probability 1 under certain boundary conditions.
Decrease $\alpha_i(t)$ such that $\sum_t \alpha_i(t) = \infty$ and $\sum_t \alpha_i^2(t) < \infty$.
In practice, a fixed $\alpha$ is often used for all $i$ and $t$.
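The TD(0) loop of Algorithm 10.1 is short enough to write out. The sketch below is a minimal illustration, assuming the policy is already folded into a sampling function `step(s)` that returns `(reward, next_state, done)`; that interface, the function names, and the three-state deterministic chain are assumptions for the example, not part of the slides.

```python
def td0_evaluate(step, start, episodes=200, alpha=0.5, gamma=0.9):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = {}
    for _ in range(episodes):
        s, done = start, False
        while not done:
            r, s_next, done = step(s)
            # A terminal successor contributes no future value.
            target = r if done else r + gamma * V.get(s_next, 0.0)
            v_old = V.get(s, 0.0)
            V[s] = v_old + alpha * (target - v_old)
            s = s_next
    return V


def chain_step(s):
    """Hypothetical deterministic chain 0 -> 1 -> 2 -> end; reward 1 on the last move."""
    if s == 2:
        return 1.0, None, True
    return 0.0, s + 1, False


V = td0_evaluate(chain_step, start=0)
print(V)
```

Because the chain is deterministic, a fixed $\alpha$ is enough here and the estimates should approach $V(2) = 1$, $V(1) = \gamma$ and $V(0) = \gamma^2$. In a stochastic environment a fixed step size leaves residual noise, which is why the convergence theorem asks for a decreasing $\alpha$.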

Q-learning (Watkins, 1989)
In Q-learning the backup starts from an action node and maximizes over all possible actions in the next state and their rewards. In the fully recursive definition of Q-learning, the bottom nodes of the backup tree are all the terminal nodes that can be reached by a sequence of actions starting from the root node, together with the rewards of the successor actions. Online Q-learning expands forward from the possible actions and does not need to build a complete model of the world. Q-learning can also be run offline. As we can see, Q-learning is a temporal-difference method.

Q-learning
In Q-learning, $Q$ is a function from state-action pairs to learned values. For all states and actions:
$Q: (\text{state} \times \text{action}) \to \text{value}$
One step of Q-learning:
$Q(s_t, a_t) \leftarrow (1 - c)\,Q(s_t, a_t) + c\,[r_{t+1} + \gamma \max_a Q(s_{t+1}, a)]$   (10.15)
where $c$ and $\gamma$ are both $\le 1$, and $r_{t+1}$ is the reward of state $s_{t+1}$.

Q-Learning
- Estimate the Q-function using some approximator (for example, linear regression, neural networks, or decision trees).
- Derive the estimated policy as an argument of the maximum of the estimated Q-function.
- Allow different parameter vectors at different time points.
- Let us illustrate the algorithm with linear regression as the approximator and squared error as the loss function.

Q-learning
$Q(a, i)$, with
$U(i) = \max_a Q(a, i)$
$Q(a, i) = R(i) + \sum_j M^a_{ij} \max_{a'} Q(a', j)$
The direct approach (ADP) would require learning a model $M^a_{ij}$. Q-learning does not. Do this update after each state transition:
$Q(a, i) \leftarrow Q(a, i) + \alpha\,[R(i) + \max_{a'} Q(a', j) - Q(a, i)]$

Exploration
Tradeoff between exploitation (control) and exploration (identification).
Extremes: greedy vs. random acting (n-armed bandit models).
Q-learning converges to the optimal Q-values if
- every state is visited infinitely often (due to exploration),
- the action selection becomes greedy as time approaches infinity, and
- the learning rate $\alpha$ is decreased fast enough but not too fast (as discussed for TD learning).

Common Exploration Methods
1. In value iteration in an ADP agent: optimistic estimate of utility $U^+(i)$:
   $U^+(i) \leftarrow R(i) + \max_a f\left(\sum_j M^a_{ij} U^+(j),\; N(a, i)\right)$
   Exploration function: $f(u, n) = R^+$ if $n < N$, and $u$ otherwise.
2. $\epsilon$-greedy method:
   - non-greedy actions: probability $\epsilon / |A(s)|$ each
   - greedy action: probability $1 - \epsilon + \epsilon / |A(s)|$
3. Boltzmann exploration:
   $P(a^*) = \dfrac{e^{\sum_j M^{a^*}_{ij} U(j) / T}}{\sum_a e^{\sum_j M^a_{ij} U(j) / T}}$

Q-Learning Algorithm
Initialize $Q(s, a)$ arbitrarily
Repeat (for each episode):
    Initialize $s$
    Repeat (for each step of episode):
        Choose $a$ from $s$ using the policy derived from $Q$ (e.g., $\epsilon$-greedy)
        Take action $a$, observe $r$, $s'$
        $Q(s, a) \leftarrow Q(s, a) + \alpha\,[r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$
        $s \leftarrow s'$
    Until $s$ is terminal
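The tabular Q-learning loop and the two sampling rules can be sketched in a few lines. This is an illustration under assumptions, not the deck's implementation: the environment is a made-up five-state corridor (move left or right, reward 1 for reaching the rightmost state), ties between equally valued actions are broken at random, and the Boltzmann rule is applied to Q-values rather than to the model-based utilities used in the formula on the slide. All names are hypothetical.

```python
import math
import random


def epsilon_greedy(Q, s, actions, eps, rng):
    """With probability eps pick uniformly at random, otherwise pick a greedy action."""
    if rng.random() < eps:
        return rng.choice(actions)
    best = max(Q[(s, a)] for a in actions)
    return rng.choice([a for a in actions if Q[(s, a)] == best])


def boltzmann(Q, s, actions, temperature, rng):
    """Sample an action with probability proportional to exp(Q(s, a) / T)."""
    weights = [math.exp(Q[(s, a)] / temperature) for a in actions]
    return rng.choices(actions, weights=weights)[0]


def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    actions = (-1, 1)  # step left / step right
    goal = n_states - 1
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    for _ in range(episodes):
        s = 0
        while s != goal:
            a = epsilon_greedy(Q, s, actions, eps, rng)
            s_next = min(max(s + a, 0), goal)  # walls clip the move
            r = 1.0 if s_next == goal else 0.0
            # Q at the goal is never updated, so it stays 0 (terminal value).
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q


Q = q_learning()
greedy = {s: max((-1, 1), key=lambda a: Q[(s, a)]) for s in range(4)}
print(greedy)
sample = boltzmann(Q, 0, (-1, 1), temperature=0.5, rng=random.Random(1))
```

In this corridor the optimal value of moving right from state $s$ is $\gamma^{3-s}$, which gives a simple check on the learned table. A high temperature makes the Boltzmann rule close to uniform; a low one makes it close to greedy.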

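Q-Learning Algorithm
Set $Q_{T+1} \equiv 0$.
For $t = T, T-1, \ldots, 0$:
$Y_t = r_t + \max_{a_{t+1}} Q_{t+1}(\mathbf{S}_{t+1}, \mathbf{A}_t, a_{t+1}; \hat\theta_{t+1})$
$\hat\theta_t = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} \left(Y_{t,i} - Q_t(\mathbf{S}_{t,i}, \mathbf{A}_{t,i}; \theta)\right)^2$
The estimated policy satisfies
$\hat\pi_{Q,t}(\mathbf{s}_t, \mathbf{a}_{t-1}) \in \arg\max_{a_t} Q_t(\mathbf{s}_t, \mathbf{a}_{t-1}, a_t; \hat\theta_t)$ for all $t$.

What is the intuition?
The Bellman equation gives
$Q^*_t(\mathbf{S}_t, \mathbf{A}_t) = E\left[r_t + \max_{a_{t+1}} Q^*_{t+1}(\mathbf{S}_{t+1}, \mathbf{A}_t, a_{t+1}) \mid \mathbf{S}_t, \mathbf{A}_t\right]$
If $\hat{Q}_{t+1} = Q^*_{t+1}$ and the training set were infinite, then Q-learning minimizes
$E\left[\left(r_t + \max_{a_{t+1}} Q^*_{t+1}(\mathbf{S}_{t+1}, \mathbf{A}_t, a_{t+1}) - Q_t(\mathbf{S}_t, \mathbf{A}_t; \theta_t)\right)^2\right]$
which is equivalent to minimizing
$E\left[\left(Q^*_t(\mathbf{S}_t, \mathbf{A}_t) - Q_t(\mathbf{S}_t, \mathbf{A}_t; \theta_t)\right)^2\right]$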

A-Learning (Murphy, 2003; Robins, 2004)
- Estimate the A-function (advantages) using some approximator, as in Q-learning.
- Derive the estimated policy as an argument of the maximum of the estimated A-function.
- Allow different parameter vectors at different time points.
- Let us illustrate the algorithm with linear regression as the approximator and squared error as the loss function.

A-Learning Algorithm (Inefficient Version)
For $t = T, T-1, \ldots, 0$:
[Equations garbled in extraction. The recoverable structure: a target $Y_t$ is built from the rewards $r_j$, $j = t, \ldots, T$; the time-$t$ parameters are fitted by least squares, as the arg min of $\frac{1}{n}\sum_{i=1}^{n}$ of squared residuals that include a conditional expectation term; and the advantage model is normalized by its maximum over $a_t$.]
The estimated policy at each time $t$ is an arg max over $a_t$ of the estimated A-function given $(\mathbf{s}_t, \mathbf{a}_{t-1})$.

Differences between Q- and A-learning
- Q-learning
  - At time $t$ we model the main effects of the history $(\mathbf{S}_t, \mathbf{A}_{t-1})$ and of the action $A_t$, and their interaction.
  - Our $Y_{t-1}$ is affected by how we modeled the main effect of the history at time $t$, $(\mathbf{S}_t, \mathbf{A}_{t-1})$.
- A-learning
  - At time $t$ we only model the effects of $A_t$ and its interaction with $(\mathbf{S}_t, \mathbf{A}_{t-1})$.
  - Our $Y_{t-1}$ does not depend on a model of the main effect of the history at time $t$, $(\mathbf{S}_t, \mathbf{A}_{t-1})$.

Q-Learning vs. A-Learning
- The relative merits and demerits are not yet completely known.
- Q-learning has low variance but high bias.
- A-learning has high variance but low bias.
- Comparing Q-learning with A-learning involves a bias-variance trade-off.

POMDP: Partially Observable Markov Decision Process
Rather than observing the state, we observe some function of the state.
Ob: the observable function, a random variable for each state.
Problem: different states may look similar. The optimal strategy might need to consider the history.

Framework of POMDP
A POMDP is defined by the six-tuple $\langle S, A, R, P, \Omega, O \rangle$, where $\langle S, A, R, P \rangle$ defines the underlying Markov decision model of the environment, $\Omega$ is the set of observations (the set of world states the system can perceive), and $O: S \times A \to PD(\Omega)$ is the observation function. When the system takes action $a$ and moves to state $s'$, the observation function determines the probability distribution over the possible observations, written $O(s', a, o)$.

POMDPs
What if the state information (from sensors) is noisy? That is mostly the case, and MDP techniques are then suboptimal.
[Figure: two halls that look alike to the sensors are not the same state.]

POMDPs: A Solution Strategy
- SE: belief state estimator (can be based on an HMM)
- $\pi$: MDP techniques

POMDP: The Belief-State Method
Idea: given a history of actions and observed values, we compute a posterior distribution over the state we are in (the belief state).
The belief-state MDP:
- States: distributions over $S$ (the states of the POMDP)
- Actions: as in the POMDP
- Transition: the posterior distribution (given the observation)
Open problem: how to deal with the continuous distribution?
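The posterior computation behind the belief state is a single application of Bayes' rule: predict the next state from the current belief, weight by the likelihood of the observation, and normalize. The sketch below assumes a dictionary representation (`P[(s, a)]` maps next states to probabilities, `O[(s_next, a)]` maps observations to probabilities) and a made-up two-state listening example with an 85% accurate sensor; none of these names or numbers come from the slides.

```python
def update_belief(belief, action, observation, P, O):
    """b'(s') is proportional to O(s', a, o) * sum_s P(s, a, s') * b(s)."""
    unnormalized = {}
    for s_next in belief:
        predicted = sum(
            P[(s, action)].get(s_next, 0.0) * b for s, b in belief.items()
        )
        unnormalized[s_next] = O[(s_next, action)].get(observation, 0.0) * predicted
    evidence = sum(unnormalized.values())  # Pr(o | a, b)
    if evidence == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: v / evidence for s, v in unnormalized.items()}


# Hypothetical example: something is hidden on the left or the right.
# Listening does not change the state and reports the true side 85% of the time.
P = {
    ("left", "listen"): {"left": 1.0},
    ("right", "listen"): {"right": 1.0},
}
O = {
    ("left", "listen"): {"hear-left": 0.85, "hear-right": 0.15},
    ("right", "listen"): {"hear-left": 0.15, "hear-right": 0.85},
}

b0 = {"left": 0.5, "right": 0.5}
b1 = update_belief(b0, "listen", "hear-left", P, O)
b2 = update_belief(b1, "listen", "hear-left", P, O)
print(b1, b2)
```

Two consistent observations sharpen the belief from 0.5 to 0.85 and then to about 0.97, which shows why the belief, not the latest observation, is the right state for planning. The belief is a point in a continuous simplex, which is the difficulty raised as the open problem above.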

The Learning Process of the Belief MDP
$b'(s') = \Pr(s' \mid o, a, b) = \dfrac{O(s', a, o) \sum_{s \in S} P(s, a, s')\, b(s)}{\Pr(o \mid a, b)}$
$\tau(b, a, b') = \Pr(b' \mid a, b) = \sum_{o \in \Omega} \Pr(b' \mid a, b, o)\, \Pr(o \mid a, b)$
$\rho(b, a) = \sum_{s \in S} b(s)\, R(s, a)$

Major Methods to Solve POMDPs
Algorithm name and basic idea, grouped by approach:
- Learning value functions
  - Memoryless policies: apply standard reinforcement learning algorithms directly.
  - Simple memory-based approaches: represent the current state by the last k observations.
  - UDM (Utile Distinction Memory): split states and build a finite-state machine model.
  - NSM (Nearest Sequence Memory): store the state history and apply a distance measure.
  - USM (Utile Suffix Memory): combine the UDM and NSM methods.
  - Recurrent-Q: use a recurrent neural network to predict the state.
- Policy search
  - Evolutionary algorithms: search for policies directly with genetic algorithms.
  - Gradient ascent method: search by gradient descent (ascent).

Function Approximation in Reinforcement Learning
[Figure: RL supplies value estimates on a subset of states as targets; FA generalizes the value function $V(s)$ to the entire state space.]
$V_0,\; M(V_0),\; \Gamma(M(V_0)),\; M(\Gamma(M(V_0))),\; \Gamma(M(\Gamma(M(V_0)))),\; \ldots$
where $\Gamma$ denotes the TD operator and $M$ the function approximation operator.

Two Parallel Iterative Processes
- Value function iteration: $Q(s, a) = (1 - \alpha)\,V(s, a) + \alpha\,[r(s, a, s') + \gamma \max_{a'} V(s', a')]$
- Value function approximation: $V(s, a) = M(Q(s, a))$
How should the function $M$ be constructed? Using state clustering, interpolation, a decision tree, or a neural network?

Two Parallel Iterative Processes
- Function approximator: $V(s) = f(s, w)$
- Update, gradient-descent Sarsa:
  $w \leftarrow w + \alpha\,[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]\,\nabla_w f(s_t, a_t, w)$
  Here $w$ is the weight vector, $\nabla_w f$ the standard gradient, $r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})$ the target value, and $Q(s_t, a_t)$ the estimated value.
Open problem: how to design a nonlinear FA system that converges with incremental instances?

Semi-MDP
- Discrete time
- Homogeneous discount
- Continuous time
- Discrete events
- Interval-dependent discount
- Discrete t
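The gradient-descent Sarsa update from the function approximation slides is easiest to see for a linear approximator, where $f(s, a, w) = w \cdot \phi(s, a)$ and the gradient is just the feature vector. The sketch below assumes that linear form, one-hot features (which make it equivalent to a table), and the same made-up five-state corridor used for illustration elsewhere; the names `q_hat`, `sarsa_update`, `one_hot` and `train` are hypothetical.

```python
import random


def q_hat(w, phi):
    """Linear approximator: f(s, a, w) = w . phi(s, a)."""
    return sum(wi * xi for wi, xi in zip(w, phi))


def sarsa_update(w, phi, r, phi_next, alpha, gamma):
    """w <- w + alpha * [r + gamma * Q(s', a') - Q(s, a)] * grad_w f; grad is phi here.

    phi_next is None when the next state is terminal (its value is taken as 0).
    """
    target = r if phi_next is None else r + gamma * q_hat(w, phi_next)
    delta = target - q_hat(w, phi)
    return [wi + alpha * delta * xi for wi, xi in zip(w, phi)]


def one_hot(s, a, n_states):
    """Assumed feature map: one indicator per (state, action) pair."""
    x = [0.0] * (2 * n_states)
    x[2 * s + (1 if a == 1 else 0)] = 1.0
    return x


def train(n_states=5, episodes=300, alpha=0.2, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    actions = (-1, 1)
    goal = n_states - 1
    w = [0.0] * (2 * n_states)

    def choose(s):
        # Epsilon-greedy on the current approximation, random tie-breaking.
        if rng.random() < eps:
            return rng.choice(actions)
        qs = [q_hat(w, one_hot(s, a, n_states)) for a in actions]
        best = max(qs)
        return rng.choice([a for a, q in zip(actions, qs) if q == best])

    for _ in range(episodes):
        s = 0
        a = choose(s)
        while s != goal:
            s_next = min(max(s + a, 0), goal)
            r = 1.0 if s_next == goal else 0.0
            phi = one_hot(s, a, n_states)
            if s_next == goal:
                w = sarsa_update(w, phi, r, None, alpha, gamma)
            else:
                a_next = choose(s_next)  # on-policy: the action actually taken next
                w = sarsa_update(w, phi, r, one_hot(s_next, a_next, n_states), alpha, gamma)
                a = a_next
            s = s_next
    return w


w = train()
step1 = sarsa_update([0.0, 0.0], [1.0, 0.0], 1.0, [0.0, 1.0], alpha=0.5, gamma=0.9)
step2 = sarsa_update([0.5, 0.2], [1.0, 0.0], 0.0, [0.0, 1.0], alpha=0.5, gamma=0.9)
print(step1, step2)
```

Replacing `one_hot` with coarser features (for example, a few overlapping position indicators) is what turns this from a lookup table into genuine generalization across states. With nonlinear approximators the same update no longer has the convergence guarantees of the linear case, which is the open problem noted above.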
