
AMOR: A Recipe for Building Adaptable Modular Knowledge Agents Through Process Feedback

Jian Guan1,2, Wei Wu2*, Zujie Wen2, Peng Xu2, Hongning Wang1, Minlie Huang1*

1The CoAI Group, DCST, Institute for Artificial Intelligence,
1State Key Lab of Intelligent Technology and Systems,
1Beijing National Research Center for Information Science and Technology,
1Tsinghua University, Beijing 100084, China. 2Ant Group.

arXiv:2402.01469v2 [cs.CL] 25 Oct 2024

{jianguanthu,wuwei19850318,wang.hongn}@gmail.com,
{zujie.wzj,peng.x}@antgroup.com, aihuang@tsinghua.edu.cn

Abstract

The notable success of large language models (LLMs) has sparked an upsurge in building language agents to complete various complex tasks. We present AMOR, an agent framework based on open-source LLMs, which reasons with external knowledge bases and adapts to specific domains through human supervision of the reasoning process. AMOR builds reasoning logic over a finite state machine (FSM) that solves problems through autonomous executions and transitions over disentangled modules. This allows humans to provide direct feedback to the individual modules, and thus naturally forms process supervision. Based on this reasoning and feedback framework, we develop AMOR through two-stage fine-tuning: warm-up and adaptation. The former fine-tunes the LLM with examples automatically constructed from various public datasets, enabling AMOR to generalize across different knowledge environments, while the latter tailors AMOR to specific domains using process feedback. Extensive experiments across multiple domains demonstrate the advantage of AMOR over strong baselines, thanks to its FSM-based reasoning and process feedback mechanism. The code and data are publicly available at /JianGuanTHU/AMOR.

1 Introduction

LLMs, with astounding performance over general natural language processing (NLP) problems [42, 1, 36], have spurred great interest in building LLM-based agents to solve complex tasks by interacting with external resources such as web knowledge [27], specialized tools [31], etc.

We focus on developing agents for knowledge-intensive tasks, where the agent completes users' information-seeking requests by interacting with specific knowledge bases [22]. To address the complexity of such tasks, we posit the desiderata for a qualifying agent as follows: Firstly, the agent should possess a robust reasoning logic about the task to solve individual problems with precise pathways. Secondly, the agent should maintain an adaptive mechanism to adjust to specific environments, rather than staying static. Thirdly, the reasoning process should be amenable to human interventions, enabling humans to steer the agent's behavior through direct feedback to the process rather than only to the outcome. This ability can significantly facilitate alignment between agent behavior and human intent [39].

Although extensive studies have been conducted on building language agents, few, if any, can fulfill all the required criteria due to their uncontrollable reasoning logic, static model capability, or sparse/missing feedback signals, as detailed in Table 1.

*Corresponding authors.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).


Table 1: Comparison between AMOR and representative methods for building agents. Appendix A.1 provides a more comprehensive discussion in detail.

Method          | Reasoning Logic: Step     | Reasoning Logic: Inter-Step Dependency | Adaptive Mechanism             | Feedback
WebGPT [27]     | Tool Invoking             | Undefined                              | Imitation Learning from Humans | Outcome
CoT [43]        | Reasoning                 | Undefined                              | Prompting                      | Undefined
ToT [50]        | Reasoning                 | Undefined                              | Prompting                      | Process
ReAct [51]      | Reasoning & Tool Invoking | Undefined                              | Prompting                      | Undefined
Reflexion [35]  | Reasoning & Tool Invoking | Undefined                              | Prompting                      | Process
AgentLM [53]    | Reasoning & Tool Invoking | Undefined                              | Imitation Learning from LLMs   | Outcome
MetaGPT [14]    | Specialized Module        | Sequential Pipeline                    | Prompting                      | Process
LUMOS [52]      | Specialized Module        | Sequential Pipeline                    | Imitation Learning from Humans | Undefined
AMOR            | Specialized Module        | Finite State Machine                   | Exploration & Exploitation     | Process

Figure 1: AMOR's state transition diagram. Each box represents a state and the corresponding module that is executed when entering the state. There may be multiple categories of execution results, distinguished by special branch tokens such as "[NEXT]". AMOR then determines the next state based on the branch tokens. The recoverable content of the diagram is summarized below; states S1, S3, and S4 invoke tools, while S0, S2, S5, and S6 call LLMs.

? S0 Question Decomposition (context: Q, H, E). m0: Decompose(Q, H): generate the next query q if more information besides H is needed to answer Q. Output: [NEXT] q if more information is needed; [FINISH] otherwise.
? S1 Document Retrieval (Q, H, E, q). m1: SearchDoc(q): retrieve one document snippet from the given knowledge base using q. Output: d_0; then D = [d_0], d = d_0.
? S2 Relevance Judgment (Q, H, E, q, D, d). m2: Judge(Q, H, q, d): judge whether d is relevant to q. Output: [RELEVANT] or [IRRELEVANT].
? S3 Document Navigation (Q, H, E, q, D). m3: NextDoc(): navigate to the next document snippet d'. Output: [CONTINUE] d' (then D = D + [d'], d = d') if the maximum number of documents is not exceeded; [NO MORE] otherwise (then H = H + (q, "No Answer"), E = E + D[0]).
? S4 Passage Retrieval (Q, H, E, q, D, d). m4: SearchPsg(q, d): retrieve passages using q in the document of d. Output: P = [p_0, p_1, ...].
? S5 Answer Extraction (Q, H, E, q, D, P). m5: Answer(Q, H, q, P): extract the answer a to the query q and the evidence passage e from P. Output: [ANSWERABLE] a, e if answerable (then H = H + (q, a), E = E + e); [UNANSWERABLE] otherwise.
? S6 Task Completion (Q, E). m6: Complete(Q, E): generate the answer to the main question Q based on E. Output: A.

Notation: Q: the main question; A: the final answer; H: all solved sub-queries and answers; E: all collected evidence passages; q: the current sub-query; D: all retrieved documents; d: the current document; P: the retrieved passages in d.

Consequently, it is still challenging for users to critique, and thus guide, existing agents toward targeted behaviors, especially when the agents are built upon less powerful LLMs [25].

We introduce AMOR, an Adaptable MOdulaR knowledge agent that can reason and adapt, with the reasoning process amenable to human supervision, based on open-source LLMs. AMOR's reasoning logic is formalized as a finite state machine (FSM) [7, 21] that solves problems via a series of executions and transitions over a set of modules (Figure 1). This naturally enables the desired process-based supervision mechanism, allowing users to give feedback to each LLM-controlled module. AMOR supports flexible forms of feedback, either binary judgments regarding the correctness of the outputs or refinements of them. The reasoning logic and process feedback mechanism together frame how AMOR thinks, acts, and interacts with users and task environments.

We build AMOR upon an LLM equipped with distinct parameters for different modules to efficiently handle multiple tasks. The training in AMOR happens in two stages: (1) Warm-up: the modular design enables us to construct training data separately for each disentangled module without requiring complete trajectories for specific tasks. As a result, we create a large dataset of 50k examples covering multiple distinct tasks, simply using public datasets. We fine-tune AMOR on this data for generalization over various knowledge-seeking scenarios. (2) Adaptation: when deployed, we tailor AMOR to the target domain by letting it autonomously address user tasks (i.e., exploration), collecting process feedback for each LLM output, and evolving through further fine-tuning on the exploration trajectories with feedback (i.e., exploitation). Our contributions are summarized as follows:

I. We propose a general framework for building knowledge agents, featuring FSM-based reasoning logic and a process feedback mechanism. We focus on text corpora as knowledge bases, but the approach can be flexibly extended to other knowledge types and user tasks by customizing the modules and dependencies within the FSM framework.


II. Experiments across multiple domains show the strong advantage of the FSM-based reasoning logic, with 30%-40% improvements over baselines when based on off-the-shelf LLMs (e.g., GPT-4). Switching to fine-tuned LLMs, the warm-up stage empowers AMOR to generalize to multiple domains and surpass strong baselines. Subsequent domain-specific adaptation reveals that process feedback is significantly more effective in improving the reasoning process than outcome feedback.

2 Related work

Language agents. Interest is surging in building agents for tasks necessitating multi-step reasoning. Existing work falls into two groups. The first group focuses on designing agent architectures, such as CoT's step-by-step reasoning [44], ReAct's integration of reasoning, action, and observation to allow tool use [51], and CODEPLAN's two-stage reasoning framework that first generates a code-form plan and then realizes low-level reasoning steps [45]. Nevertheless, such free-form reasoning constrains human intervention. In contrast, modular agents follow a pipeline to execute specialized modules [19, 14, 11, 3, 52], improving the ease of intervention. The second group aims to design adaptive mechanisms for adapting agents to specific scenarios. ToT [50] and Reflexion [35] use environment feedback for multi-path pruning and iterative single-path refinement, respectively, but suffer from poor inference efficiency and the need for real-time feedback. As a fine-tuning approach, recent work equipped open-source LLMs with specific agent abilities by learning from examples synthesized based on human priors [5], or expert trajectories from humans [27] or GPT-4 [53, 4], with correctness validation through outcome feedback. In contrast, our modular agent AMOR employs FSM-based reasoning with a stronger capacity for handling complex tasks than simple pipelines, and adapts effectively to specific environments via process feedback.

Retrieval-augmented generation (RAG). The RAG paradigm augments the inputs of LLMs with retrieved passages to enhance factuality [12, 22, 10]. Recent studies have developed interleaved reasoning-retrieval for better information recall than one-step retrieval [38, 16, 28]. However, retrieval may introduce noise that leads to low-quality answers [34]. To tackle this, Self-RAG [2] trained LLMs to selectively perform retrieval and utilize retrieved passages. Unlike RAG approaches, AMOR emphasizes an explainable reasoning process for proactively decomposing questions and seeking evidence for grounded generation, and allows for process feedback from humans. Nevertheless, RAG mainly focuses on integrating parametric factual knowledge in LLMs and retrieved non-parametric knowledge, which is less explainable and intervenable.

3 AMOR agent

AMOR relies on three key techniques: FSM-based reasoning logic, a process feedback mechanism, and a two-stage fine-tuning strategy. We detail the definition of the reasoning logic and its specification assuming the knowledge base is a text corpus in §3.1, the method for fine-tuning open-source LLMs as a warm-up stage in §3.2, and the adaptation stage driven by process feedback in §3.3.

3.1 Reasoning logic

Algorithm 1 FSM-based Reasoning Logic
Input: Agent at the state s = s_0; Q: Question.
Output: A: Final Answer; R: Reasoning Process.
1: R = []
2: while s ≠ s_{N-1} do
3:   y = m(s)   // Obtain the output y given s from the corresponding module m.
4:   R.append({"state": s, "output": y})
5:   s = μ(s, y)   // Transition to the next state.
6: A = y
7: return A, R

Algorithm 1 outlines how to deduce the answer A for an input question Q with a reasoning process R using FSM-based reasoning logic, which can be defined by a quadruple {S, M, E, μ}, where:

? S = {s_0, ..., s_{N-1}} is a set of states, with s_0 as the initial state and s_{N-1} as the final state. Each state holds variables to track context information.
? M = {m_0, ..., m_{N-1}} is a set of modules, with m_k triggered when the reasoning flow reaches state s_k. The modules are categorized into two types: (a) tool modules (M_TOOL) for invoking tools, and (b) LLM modules (M_LLM) for calling LLMs.
? E is the set of all possible outputs of M.
? μ: S × E → S is the transition function that determines the next state of the reasoning flow given the current state and the execution result of the corresponding module.

2 In this work, GPT-3.5/4 refers to OpenAI's API "gpt-3.5-turbo"/"gpt-4-1106-preview," respectively.

Figure 2: On the top left is a sample question from Musique [37], providing ample information (in green) for constructing training examples for the four LLM modules of AMOR (bottom). We augment extra knowledge (in blue) for the Judge and Answer modules by invoking the SearchDoc and SearchPsg tools (top right). In each example, we highlight the prompt in purple to format the current state (before "Output:") and the output (after "Output:"), and use "||" to separate different examples for training. The depicted sample asks "On what date did the publisher of Chick Chick Boom unveil its new systems?", decomposed into the sub-queries "Who was the publisher of Chick Chick Boom?" (answer: Nintendo) and "What day did Nintendo unveil the new systems?" (answer: October 18, 1985), each paired with a gold evidence passage; the final answer is October 18, 1985.

When the external knowledge base is a text corpus, an instantiation of the reasoning logic can be represented by the state transition diagram in Figure 1. In this case, M_TOOL performs document and passage retrieval using external retrievers, while M_LLM leverages the LLM to analyze and digest the question, documents, and passages to deduce the final answer. To distinguish different types of outputs from a module that require different subsequent modules, we employ a set of special branch tokens such as "[NEXT]" to guide μ in determining the next state. In summary, AMOR answers a question Q by (1) iteratively decomposing Q into a sub-query q at state s_0, and finding the answer a to q and the evidence passage e through iterative knowledge retrieval, relevance evaluation, retrieval refinement (i.e., "Passage Retrieval"), and answer extraction, until no more knowledge is needed; and (2) deducing the final answer A based on the collected evidence passages at the final state.
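To make the control flow concrete, below is a minimal Python sketch of Algorithm 1 over the state machine of Figure 1. The module callables and the transition entries are illustrative placeholders inferred from the diagram, not the released implementation.

```python
from typing import Callable, Dict, List, Tuple

# Illustrative state indices following Figure 1 (S0: Decompose, ..., S6: Complete).
S0, S1, S2, S3, S4, S5, S6 = range(7)
FINAL_STATE = S6

# Each module reads/updates the shared context and returns (branch_token, output).
Module = Callable[[dict], Tuple[str, object]]

def run_fsm(question: str,
            modules: Dict[int, Module],
            transition: Dict[Tuple[int, str], int]) -> Tuple[object, List[dict]]:
    """Minimal rendering of Algorithm 1: execute modules and transitions over the FSM."""
    ctx = {"Q": question, "H": [], "E": [], "D": []}  # context variables of Figure 1
    state, trace = S0, []
    while True:
        branch, output = modules[state](ctx)              # y = m(s)
        trace.append({"state": state, "output": output})  # R.append(...)
        if state == FINAL_STATE:                          # Complete(Q, E) has produced A
            return output, trace
        state = transition[(state, branch)]               # s = mu(s, y)

# Transition entries read off Figure 1 (for illustration only):
EXAMPLE_TRANSITIONS = {
    (S0, "[NEXT]"): S1, (S0, "[FINISH]"): S6,
    (S1, "[NEXT]"): S2,
    (S2, "[RELEVANT]"): S4, (S2, "[IRRELEVANT]"): S3,
    (S3, "[CONTINUE]"): S2, (S3, "[NO MORE]"): S0,
    (S4, "[NEXT]"): S5,
    (S5, "[ANSWERABLE]"): S0, (S5, "[UNANSWERABLE]"): S3,
}
```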

Defining reasoning logic as an FSM offers three advantages: (1) Structured Thinking. The FSM makes specifications of inter-step dependencies (e.g., prioritization, branch selection) easy, and thus enables narrowing down the exploration space. (2) Skill Disentanglement. By decomposing complex tasks into modular steps, one can independently construct training data for each module, which significantly reduces the difficulty of implementing AMOR with open-source LLMs (cf. §3.2). This feature also allows AMOR to focus on single steps, thereby mitigating the weakness of LLMs in reasoning over the long context formed by task-solving trajectories [24]. (3) Intervenable Workflow. The structured reasoning process enables users to easily diagnose the agent's mistakes and provide process feedback for improving the reasoning capability of the agent (§3.3).

3.2 Warming-up open-source LLMs

Open-source LLMs are observed to fall short on complex agent tasks [46, 25]. Recent studies have improved their reasoning abilities through imitation learning using trajectories from advanced LLMs such as GPT-4 [53, 4]. However, even GPT-4 can struggle with producing high-quality reasoning trajectories [29].

AMOR's modular design enables us to construct training data for each module separately from existing datasets without simulating whole trajectories, thus greatly alleviating the above issue. Formally, given a sample question Q with annotations of the final answer A*, all sub-queries and answers H* = [(q*_0, a*_0), (q*_1, a*_1), ···], and all evidence passages E* = [e*_0, e*_1, ···], we can directly transform these annotations into a suitable format to serve as training data for Decompose and Complete in Figure 1. Since Judge and Answer require multiple types of retrieved knowledge (e.g., relevant or not), we employ retrieval tools to augment the input. Figure 2 exemplifies the construction pipeline, which can be easily extended to other knowledge-intensive datasets and specific domains. Appendix A.4 shows more details.
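As a concrete illustration of this construction (not the exact pipeline of Appendix A.4), the following sketch turns one annotated multi-hop QA sample into (state, output) training pairs for the Decompose and Complete modules. The field names of `sample` are hypothetical, and the retrieval-augmented construction needed for Judge and Answer is omitted.

```python
def build_decompose_examples(sample: dict) -> list:
    """Create (state, output) pairs for the Decompose module from gold annotations.

    `sample` is assumed to contain "question" (Q), "final_answer" (A*),
    "subqueries" (a list of (q*_i, a*_i) pairs), and "evidence_passages" (E*).
    """
    examples, solved = [], []
    for q_star, a_star in sample["subqueries"]:
        state = {"main_question": sample["question"], "solved_subqueries": list(solved)}
        examples.append({"state": state, "output": f"[NEXT] {q_star}"})
        solved.append((q_star, a_star))
    # Once all sub-queries are solved, the correct decision is to stop decomposing.
    final_state = {"main_question": sample["question"], "solved_subqueries": list(solved)}
    examples.append({"state": final_state, "output": "[FINISH]"})
    return examples

def build_complete_example(sample: dict) -> dict:
    """Create one (state, output) pair for the Complete module."""
    state = {"main_question": sample["question"], "evidence": sample["evidence_passages"]}
    return {"state": state, "output": sample["final_answer"]}
```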

When fine-tuning open-source LLMs to handle the multiple tasks defined by different modules, we are inspired by the Mixture-of-Experts approach [33] to learn distinct Feed-Forward Network (FFN) parameters in the final quarter of the Transformer blocks, balancing the trade-off between performance and inference efficiency. These module-specific parameters are initialized using the original model's FFN layers. We call the proposed architecture Module-Aware Mixture-of-Experts (MA-MoE)3. We then fine-tune the MA-MoE model with the standard language modeling loss:

L_1 = -E_{m ∈ M_LLM, (s, y) ∈ D_m} [ λ_m log π_{θ_m}(y | s) ],    (1)

where π refers to the policy model MA-MoE that maps a state to an action, θ_m denotes the parameters for the module m ∈ M_LLM, D_m is the corresponding collection of training examples, (s, y) is a state-output pair from D_m, and {λ_m} are tunable hyper-parameters.
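The following is a schematic PyTorch sketch of the MA-MoE idea described above: an FFN layer that keeps one expert copy per LLM module and routes by an explicitly supplied module index, applied to the final quarter of Transformer blocks. The `.mlp` attribute and helper names are assumptions for illustration, not the released architecture.

```python
import copy
import torch
import torch.nn as nn

class ModuleAwareFFN(nn.Module):
    """FFN layer with one expert per LLM module, selected by an explicit module index.

    Experts are initialized as copies of the original FFN, mirroring the initialization
    from the base model's FFN layers described in the text.
    """
    def __init__(self, base_ffn: nn.Module, num_modules: int):
        super().__init__()
        self.experts = nn.ModuleList(copy.deepcopy(base_ffn) for _ in range(num_modules))

    def forward(self, hidden_states: torch.Tensor, module_idx: int) -> torch.Tensor:
        # "Module-aware" routing: the active module's index picks the expert,
        # so no learned router is required.
        return self.experts[module_idx](hidden_states)

def make_module_aware(blocks: nn.ModuleList, num_modules: int) -> None:
    """Replace the FFN of the last quarter of Transformer blocks with module-aware experts."""
    start = 3 * len(blocks) // 4
    for block in list(blocks)[start:]:
        block.mlp = ModuleAwareFFN(block.mlp, num_modules)  # assumes a `.mlp` attribute
```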

3.3 Adaptation through process feedback

Feedback is crucial for adapting language agents to specific environments [40], especially when dealing with unseen, long-tail, or ever-changing domain knowledge. Prior agents commonly used outcome feedback for adaptation, which assesses the correctness of intermediate steps based on the success or failure of the outcome [53, 4]. However, outcome feedback is too sparse to improve intermediate reasoning [23]. Recent studies also highlighted that LLMs' reasoning steps are likely to contradict the outcome [26], which means that outcome feedback may inevitably introduce noise during training (see examples in Appendix B.8). In contrast, AMOR's process feedback mechanism can effectively alleviate these issues.

Algorithm 2 Adaptation through Process Feedback
Input: {π_{θ_m}}: Initial Policy; T: Exploration Steps between Exploitation; I: Number of Iterations.
Output: {π_{θ_m}}: Adapted Policy.
1: for i ← 1 to I do
2:   R? = []   // Feedback-refined reasoning processes.
3:   for t ← 1 to T do
4:     // Exploration
       Receive an input question Q.
5:     Collect AMOR_θ's reasoning process R.   // Algorithm 1
       // Feedback collection for each LLM module
6:     for each step r_k ∈ R (k = 0, 1, 2, ···) do
7:       Extract the state s_k and output y_k from r_k.
8:       if the corresponding module m_k ∈ M_LLM then
9:         Collect feedback f_k for s_k and y_k.
10:        Determine ?_k and o_k based on f_k.   // Eq. 2
11:        R?.append([s_k, ?_k, o_k])
12:  // Exploitation
     Optimize {θ_m} to minimize L_2 on R?.   // Eq. 3
13: return {π_{θ_m}}

Algorithm 2 describes the adaptation mechanism of AMOR, parameterized by θ, as three steps: (1) Exploration. AMOR answers the input question Q by interacting with a knowledge base. (2) Feedback Collection. AMOR's reasoning process for Q is evaluated with feedback f_k for the output y_k of the LLM at each step during reasoning, which is either "right/wrong" or a refined version of y_k. We convert y_k into a feedback-refined target output ?_k based on the feedback f_k and determine the immediate reward o_k accordingly (Eq. 2; Table 3 lists the resulting ?_k and o_k for each module). (3) Exploitation. Every T steps of the former exploration and feedback collection, we optimize the initial policy based on the resulting trajectories and corresponding feedback [32]:

L_2 = -E_{m ∈ M_LLM, (s_k, ?_k, o_k) ∈ R?_m} λ_m [ o_k - β log( π_{θ_m}(?_k | s_k) / π(?_k | s_k) ) ],    (3)

where R?_m ? R? denotes the training examples for module m, and π refers to the initial warm-up policy.

3 "Module-Aware" means that when AMOR executes a certain module, its module index will be provided to the routers of the model to indicate which expert should be activated.


Table 2: Automatic annotation strategy for silver process feedback for different LLM modules.

Module m         | Output y                    | Silver Process Feedback f
Decompose(Q,H)   | [NEXT] q                    | "right", if the documents retrieved using q overlap the documents corresponding to E*; "wrong", otherwise.
Decompose(Q,H)   | [FINISH]                    | "right", if E* ? E (i.e., the evidence passages collected by AMOR); "wrong", otherwise.
Judge(Q,H,q,d)   | [RELEVANT] / [IRRELEVANT]   | "[RELEVANT]", if one of the passages in E* comes from the same document as d; "[IRRELEVANT]", otherwise.
Answer(Q,H,q,P)  | [ANSWERABLE] a, e           | "right", if e ∈ E*; "wrong", otherwise.
Answer(Q,H,q,P)  | [UNANSWERABLE]              | "right", if P ∩ E* = ?; "wrong", otherwise.
Complete(Q,E)    | A                           | A* (the gold answer), if E* ? E; "wrong", otherwise.
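A minimal sketch of the annotation rules in Table 2 is given below as a single function. The input structures (gold evidence E*, collected evidence E, retrieval results) use hypothetical field names, and the document-overlap tests are simplified relative to the paper's exact matching criteria.

```python
def silver_feedback(module: str, output: dict, gold_answer: str,
                    gold_evidence: list, collected_evidence: list,
                    retrieved: dict) -> str:
    """Assign silver process feedback for one LLM-module output (cf. Table 2).

    gold_evidence: gold passages E* (each a dict with a "doc" field);
    collected_evidence: passages E gathered by AMOR so far;
    retrieved: {"docs": documents retrieved with the proposed sub-query,
                "current_doc": d, "passages": P}. All field names are hypothetical.
    """
    gold_docs = {p["doc"] for p in gold_evidence}

    if module == "Decompose":
        if output["branch"] == "[NEXT]":
            return "right" if any(d in gold_docs for d in retrieved["docs"]) else "wrong"
        # [FINISH] is correct only once all gold evidence has been collected.
        return "right" if all(p in collected_evidence for p in gold_evidence) else "wrong"

    if module == "Judge":
        return "[RELEVANT]" if retrieved["current_doc"] in gold_docs else "[IRRELEVANT]"

    if module == "Answer":
        if output["branch"] == "[ANSWERABLE]":
            return "right" if output["evidence"] in gold_evidence else "wrong"
        # [UNANSWERABLE] is correct only if no retrieved passage is a gold passage.
        return "right" if not any(p in gold_evidence for p in retrieved["passages"]) else "wrong"

    # Complete: if all gold evidence was collected, the feedback is the gold answer itself.
    return gold_answer if all(p in collected_evidence for p in gold_evidence) else "wrong"
```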

Notably, this loss function is non-differentiable, necessitating the use of a specialized optimization technique. We use a recently proposed alignment algorithm, KTO [9], with an MLE regularization [47] for optimization, which optimizes the policy without requiring paired human preferences. Crucially, when optimizing a particular module m, the gradient induced by the feedback signal propagates through the entire MA-MoE model, except for the FFN layers corresponding to other modules. This targeted optimization approach enables AMOR to effectively align its outputs with the desired intermediate results and final answers, leveraging the fine-grained process feedback provided by human supervisors.
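For orientation, the quantities entering Eq. 3 can be sketched as follows: per-example sequence log-probabilities under the current module policy and the frozen warm-up policy, combined with the immediate reward o_k. This is only a schematic evaluation of the bracketed expression; as stated above, the actual optimization uses KTO [9] with an MLE regularizer rather than this literal term.

```python
import torch

def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor,
                     ignore_index: int = -100) -> torch.Tensor:
    """Sum of target-token log-probabilities, i.e. log pi(y_hat | s) per example."""
    logprobs = torch.log_softmax(logits, dim=-1)
    mask = labels.ne(ignore_index)
    safe_labels = labels.clamp_min(0)
    token_lp = logprobs.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)
    return (token_lp * mask).sum(dim=-1)

def eq3_objective(policy_logprob: torch.Tensor,
                  reference_logprob: torch.Tensor,
                  reward: torch.Tensor,
                  beta: float = 0.1,
                  lam: float = 1.0) -> torch.Tensor:
    """Schematic batch evaluation of Eq. 3 for one LLM module.

    policy_logprob:    log pi_theta_m(y_hat | s) under the module being optimized.
    reference_logprob: log pi(y_hat | s) under the frozen warm-up policy.
    reward:            immediate reward o_k in {0, 1} from process feedback.
    Returns the negative expectation of the bracketed term (the L_2 expression).
    """
    log_ratio = policy_logprob - reference_logprob
    bracketed = reward - beta * log_ratio
    return -(lam * bracketed).mean()
```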

4 Experiments

4.1 Experimental setup

Tool modules. We construct retrievers for both SearchDoc and SearchPsg using Contriever-MS MARCO [15]. SearchDoc retrieves a single document snippet per query, while SearchPsg fetches the top three relevant passages from a given document. By invoking NextDoc, at most nine more document snippets are returned. Appendix B.1 presents more details.
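As a rough sketch of how such retriever tools can be wired up (not the paper's actual index construction in Appendix B.1), the following uses the public facebook/contriever-msmarco checkpoint with standard mean pooling; search_doc and search_psg mirror the SearchDoc and SearchPsg interfaces.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever-msmarco")
encoder = AutoModel.from_pretrained("facebook/contriever-msmarco")

def embed(texts: list) -> torch.Tensor:
    """Mean-pooled Contriever embeddings (the standard usage for this checkpoint)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def search_doc(query: str, doc_texts: list, doc_embs: torch.Tensor) -> str:
    """SearchDoc-style tool: return the single best-matching document snippet."""
    scores = embed([query]) @ doc_embs.T
    return doc_texts[int(scores.argmax())]

def search_psg(query: str, passages: list, k: int = 3) -> list:
    """SearchPsg-style tool: top-k passages from within one document."""
    scores = (embed([query]) @ embed(passages).T).squeeze(0)
    top = torch.topk(scores, k=min(k, len(passages))).indices.tolist()
    return [passages[i] for i in top]
```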

Warm-up datasets. We employ four question-answering (QA) datasets to warm up open-source LLMs, including 2WikiMultiHopQA [13], Musique [37], Natural Questions [20], and BoolQ [6]. They vary in levels of question complexity (single- or multi-hop), answer types (phrase spans or yes/no), types of dependency structures between sub-queries (e.g., serial or parallel), etc. Appendix A.4 shows the statistics in detail.

Adaptation & evaluation datasets. We consider three benchmarks, by which we simulate different deployment scenarios: (1) HotpotQA [49]: a challenging multi-hop QA dataset built on Wikipedia articles. We use the Wikipedia dump provided in [15] as the knowledge base. (2) PubMedQA [17]: a biomedical QA dataset that requires answering a question by "yes/no" given a PubMed abstract. We adapt the data to retrieval-based QA by pooling all 274k abstracts provided in the paper as a knowledge base, where each document comprises one abstract passage. (3) QASPER [8]: answering questions in free form based on a long NLP paper. For each question, we regard the corresponding paper as a knowledge base and each section of the paper as a document with several passages. We use the training and validation sets for adaptation fine-tuning and the test sets for evaluation. For evaluation metrics, we use exact match (EM) and F1 scores for HotpotQA and QASPER, and the accuracy (ACC) of "yes/no" for PubMedQA. More details are in Appendix B.2.

Feedback annotation. Considering limited resources, we simulate human behavior and provide silver feedback to AMOR's reasoning processes based on the gold answer A* and gold evidence passages E* = [e*_0, e*_1, ···] for each target question Q, which are already included in the training and validation data of the three benchmarks. Table 2 shows how we annotate the feedback for each LLM output y. Note that AMOR is applicable to gold feedback from humans in realistic applications. Appendix B.3 discusses the accuracy of the silver feedback through human evaluation.

Implementation details. We set λ_m in Eq. 1 and Eq. 3 to 1 for all modules, I = 1 in Algorithm 2, and T to the size of the training set for each dataset, and fine-tune LLaMA-2-7B/13B-Chat for two epochs with a learning rate of 2e-5 using 8 NVIDIA 80GB A100 GPUs. While applying


Table 3: Refining each module output y to ?_k to adapt AMOR, where ?y denotes converting the binary output y to its opposite label.

Module m         | Target Output ?_k and Immediate Reward o_k
Decompose(Q,H)   | ?_k = y and o_k = 1, if f = "right"; ?_k = y and o_k = 0, otherwise.
Judge(Q,H,q,d)   | ?_k = y and o_k = 1, if f = y; ?_k = ?y and o_k = 1, otherwise.
Answer(Q,H,q,P)  | ?_k = y and o_k = 1, if f = "right"; ?_k = y and o_k = 0, otherwise.
Complete(Q,E)    | ?_k = f and o_k = 1, if E* ? E; ?_k = y and o_k = 0, otherwise.
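A compact sketch of these refinement rules, written as an explicit function from (module, output y, feedback f) to (?_k, o_k), might look as follows. The feedback values follow the silver annotation of Table 2, and flipping the Judge label realizes the ?y case.

```python
def refine(module: str, output: str, feedback: str) -> tuple:
    """Map (output y, feedback f) to a target output y_hat and reward o (cf. Table 3).

    Feedback follows Table 2: "right"/"wrong" for Decompose and Answer, a relevance
    label for Judge, and either the gold answer or "wrong" for Complete.
    """
    if module == "Judge":
        if feedback == output:        # the predicted label matches the silver label
            return output, 1
        flipped = "[IRRELEVANT]" if output.startswith("[RELEVANT]") else "[RELEVANT]"
        return flipped, 1             # train toward the corrected (flipped) label
    if module in ("Decompose", "Answer"):
        return output, (1 if feedback == "right" else 0)
    # Complete: feedback is the gold answer when all gold evidence was collected.
    if feedback != "wrong":
        return feedback, 1
    return output, 0
```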
