
AMOR: A Recipe for Building Adaptable Modular Knowledge Agents Through Process Feedback

Jian Guan1,2, Wei Wu2*, Zujie Wen2, Peng Xu2, Hongning Wang1, Minlie Huang1*

1The CoAI Group, DCST, Institute for Artificial Intelligence,
1State Key Lab of Intelligent Technology and Systems,
1Beijing National Research Center for Information Science and Technology,
1Tsinghua University, Beijing 100084, China. 2Ant Group.

arXiv:2402.01469v2 [cs.CL] 25 Oct 2024

{jianguanthu,wuwei19850318,wang.hongn}@gmail.com,
{zujie.wzj,peng.x}@antgroup.com, aihuang@tsinghua.edu.cn

Abstract

The notable success of large language models (LLMs) has sparked an upsurge in building language agents to complete various complex tasks. We present AMOR, an agent framework based on open-source LLMs, which reasons with external knowledge bases and adapts to specific domains through human supervision of the reasoning process. AMOR builds reasoning logic over a finite state machine (FSM) that solves problems through autonomous executions and transitions over disentangled modules. This allows humans to provide direct feedback to the individual modules, and thus naturally forms process supervision. Based on this reasoning and feedback framework, we develop AMOR through two-stage fine-tuning: warm-up and adaptation. The former fine-tunes the LLM with examples automatically constructed from various public datasets, enabling AMOR to generalize across different knowledge environments, while the latter tailors AMOR to specific domains using process feedback. Extensive experiments across multiple domains demonstrate the advantage of AMOR over strong baselines, thanks to its FSM-based reasoning and process feedback mechanism. The code and data are publicly available at /JianGuanTHU/AMOR.

1 Introduction

LLMs, with astounding performance over general natural language processing (NLP) problems [42, 1, 36], have spurred great interest in building LLM-based agents to solve complex tasks by interacting with external resources such as web knowledge [27], specialized tools [31], etc.

We focus on developing agents for knowledge-intensive tasks, where the agent completes users' information-seeking requests by interacting with specific knowledge bases [22]. To address the complexity of such tasks, we posit the desiderata for a qualifying agent as follows: Firstly, the agent should possess a robust reasoning logic about the task to solve individual problems with precise pathways. Secondly, the agent should maintain an adaptive mechanism to adjust to specific environments, rather than staying static. Thirdly, the reasoning process should be amenable to human interventions, enabling humans to steer the agent's behavior through direct feedback to the process rather than only to the outcome. This ability can significantly facilitate alignment between agent behavior and human intent [39].

Although extensive studies have been conducted on building language agents, few, if any, can fulfill all the required criteria due to their uncontrollable reasoning logic, static model capability, or sparse/missing feedback signals, as detailed in Table 1.

*Corresponding authors.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).


Table 1: Comparison between AMOR and representative methods for building agents. Appendix A.1 provides a more comprehensive discussion in detail.

Method          | Reasoning Logic: Step     | Reasoning Logic: Inter-Step Dependency | Adaptive Mechanism             | Feedback
WebGPT [27]     | Tool Invoking             | Undefined                              | Imitation Learning from Humans | Outcome
CoT [43]        | Reasoning                 | Undefined                              | Prompting                      | Undefined
ToT [50]        | Reasoning                 | Undefined                              | Prompting                      | Process
ReAct [51]      | Reasoning & Tool Invoking | Undefined                              | Prompting                      | Undefined
Reflexion [35]  | Reasoning & Tool Invoking | Undefined                              | Prompting                      | Process
AgentLM [53]    | Reasoning & Tool Invoking | Undefined                              | Imitation Learning from LLMs   | Outcome
MetaGPT [14]    | Specialized Module        | Sequential Pipeline                    | Prompting                      | Process
LUMOS [52]      | Specialized Module        | Sequential Pipeline                    | Imitation Learning from Humans | Undefined
AMOR            | Specialized Module        | Finite State Machine                   | Exploration & Exploitation     | Process

Figure 1: AMOR's state transition diagram. Each box represents a state and the corresponding module that is executed when entering the state. There may be multiple categories of execution results, distinguished by special branch tokens such as "[NEXT]". AMOR then determines the next state based on the branch tokens. The recoverable content of the diagram is summarized below; states S1, S3, and S4 invoke tools, while S0, S2, S5, and S6 call LLMs.

? S0 Question Decomposition (context: Q, H, E). m0: Decompose(Q, H): generate the next query q if more information besides H is needed to answer Q. Output: [NEXT] q if more information is needed; [FINISH] otherwise.
? S1 Document Retrieval (Q, H, E, q). m1: SearchDoc(q): retrieve one document snippet from the given knowledge base using q. Output: d_0; then D = [d_0], d = d_0.
? S2 Relevance Judgment (Q, H, E, q, D, d). m2: Judge(Q, H, q, d): judge whether d is relevant to q. Output: [RELEVANT] or [IRRELEVANT].
? S3 Document Navigation (Q, H, E, q, D). m3: NextDoc(): navigate to the next document snippet d'. Output: [CONTINUE] d' (then D = D + [d'], d = d') if the maximum number of documents is not exceeded; [NO MORE] otherwise (then H = H + (q, "No Answer"), E = E + D[0]).
? S4 Passage Retrieval (Q, H, E, q, D, d). m4: SearchPsg(q, d): retrieve passages using q in the document of d. Output: P = [p_0, p_1, ...].
? S5 Answer Extraction (Q, H, E, q, D, P). m5: Answer(Q, H, q, P): extract the answer a to the query q and the evidence passage e from P. Output: [ANSWERABLE] a, e if answerable (then H = H + (q, a), E = E + e); [UNANSWERABLE] otherwise.
? S6 Task Completion (Q, E). m6: Complete(Q, E): generate the answer to the main question Q based on E. Output: A.

Notation: Q: the main question; A: the final answer; H: all solved sub-queries and answers; E: all collected evidence passages; q: the current sub-query; D: all retrieved documents; d: the current document; P: the retrieved passages in d.

Consequently, it is still challenging for users to critique, and thus guide, existing agents toward targeted behaviors, especially when the agents are built upon less powerful LLMs [25].

We introduce AMOR, an Adaptable MOdulaR knowledge agent that can reason and adapt, with the reasoning process amenable to human supervision, based on open-source LLMs. AMOR's reasoning logic is formalized as a finite state machine (FSM) [7, 21] that solves problems via a series of executions and transitions over a set of modules (Figure 1). This naturally enables the desired process-based supervision mechanism, allowing users to give feedback to each LLM-controlled module. AMOR supports flexible forms of feedback, either binary judgments regarding the correctness of the outputs or refinements of them. The reasoning logic and process feedback mechanism together frame how AMOR thinks, acts, and interacts with users and task environments.

We build AMOR upon an LLM equipped with distinct parameters for different modules to efficiently handle multiple tasks. The training in AMOR happens in two stages: (1) Warm-up: the modular design enables us to construct training data separately for each disentangled module without requiring complete trajectories for specific tasks. As a result, we create a large dataset of 50k examples covering multiple distinct tasks, simply using public datasets. We fine-tune AMOR on this data for generalization over various knowledge-seeking scenarios. (2) Adaptation: when deployed, we tailor AMOR to the target domain by letting it autonomously address user tasks (i.e., exploration), collecting process feedback for each LLM output, and evolving through further fine-tuning on the exploration trajectories with feedback (i.e., exploitation). Our contributions are summarized as follows:

I. We propose a general framework for building knowledge agents, featuring FSM-based reasoning logic and a process feedback mechanism. We focus on text corpora as knowledge bases, but the approach can be flexibly extended to other knowledge types and user tasks by customizing the modules and dependencies within the FSM framework.


II. Experiments across multiple domains show the strong advantage of the FSM-based reasoning logic, with 30%-40% improvements over baselines when based on off-the-shelf LLMs (e.g., GPT-4). Switching to fine-tuned LLMs, the warm-up stage empowers AMOR to generalize to multiple domains and surpass strong baselines. Subsequent domain-specific adaptation reveals that process feedback is significantly more effective in improving the reasoning process than outcome feedback.

2 Related work

Language agents. Interest is surging in building agents for tasks necessitating multi-step reasoning. Existing work falls into two groups. The first group focuses on designing agent architectures, such as CoT's step-by-step reasoning [44], ReAct's integration of reasoning, action, and observation to allow tool use [51], and CODEPLAN's two-stage reasoning framework that first generates a code-form plan and then realizes low-level reasoning steps [45]. Nevertheless, such free-form reasoning constrains human intervention. In contrast, modular agents follow a pipeline to execute specialized modules [19, 14, 11, 3, 52], improving the ease of intervention. The second group aims to design adaptive mechanisms for adapting agents to specific scenarios. ToT [50] and Reflexion [35] use environment feedback for multi-path pruning and iterative single-path refinement, respectively, but suffer from poor inference efficiency and the need for real-time feedback. As a fine-tuning approach, recent work equipped open-source LLMs with specific agent abilities by learning from examples synthesized based on human priors [5], or expert trajectories from humans [27] or GPT-4 [53, 4], with correctness validation through outcome feedback. In contrast, our modular agent AMOR employs FSM-based reasoning with a stronger capacity for handling complex tasks than simple pipelines, and adapts effectively to specific environments via process feedback.

Retrieval-augmented generation (RAG). The RAG paradigm augments the inputs of LLMs with retrieved passages to enhance factuality [12, 22, 10]. Recent studies have developed interleaved reasoning-retrieval for better information recall than one-step retrieval [38, 16, 28]. However, retrieval may introduce noise that leads to low-quality answers [34]. To tackle this, Self-RAG [2] trained LLMs to selectively perform retrieval and utilize retrieved passages. Unlike RAG approaches, AMOR emphasizes an explainable reasoning process for proactively decomposing questions and seeking evidence for grounded generation, and allows for process feedback from humans. Nevertheless, RAG mainly focuses on integrating parametric factual knowledge in LLMs and retrieved non-parametric knowledge, which is less explainable and intervenable.

3 AMOR agent

AMOR relies on three key techniques: FSM-based reasoning logic, a process feedback mechanism, and a two-stage fine-tuning strategy. We detail the definition of the reasoning logic and its specification assuming the knowledge base is a text corpus in §3.1, the method for fine-tuning open-source LLMs as a warm-up stage in §3.2, and the adaptation stage driven by process feedback in §3.3.

3.1 Reasoning logic

Algorithm 1 FSM-based Reasoning Logic
Input: Agent at the state s = s_0; Q: Question.
Output: A: Final Answer; R: Reasoning Process.
1: R = []
2: while s ≠ s_{N-1} do
3:   y = m(s)   // Obtain the output y given s from the corresponding module m.
4:   R.append({"state": s, "output": y})
5:   s = μ(s, y)   // Transition to the next state.
6: A = y
7: return A, R

Algorithm 1 outlines how to deduce the answer A for an input question Q with a reasoning process R using FSM-based reasoning logic, which can be defined by a quadruple {S, M, E, μ}, where:

? S = {s_0, ..., s_{N-1}} is a set of states, with s_0 as the initial state and s_{N-1} as the final state. Each state holds variables to track context information.
? M = {m_0, ..., m_{N-1}} is a set of modules, with m_k triggered when the reasoning flow reaches state s_k. The modules are categorized into two types: (a) tool modules (M_TOOL) for invoking tools, and (b) LLM modules (M_LLM) for calling LLMs.
? E is the set of all possible outputs of M.
? μ: S × E → S is the transition function that determines the next state of the reasoning flow given the current state and the execution result of the corresponding module.

2 In this work, GPT-3.5/4 refers to OpenAI's API "gpt-3.5-turbo"/"gpt-4-1106-preview," respectively.

Figure 2: On the top left is a sample question from Musique [37], providing ample information (in green) for constructing training examples for the four LLM modules of AMOR (bottom). We augment extra knowledge (in blue) for the Judge and Answer modules by invoking the SearchDoc and SearchPsg tools (top right). In each example, we highlight the prompt in purple to format the current state (before "Output:") and the output (after "Output:"), and use "||" to separate different examples for training. The depicted sample asks "On what date did the publisher of Chick Chick Boom unveil its new systems?", decomposed into the sub-queries "Who was the publisher of Chick Chick Boom?" (answer: Nintendo) and "What day did Nintendo unveil the new systems?" (answer: October 18, 1985), each paired with a gold evidence passage; the final answer is October 18, 1985.

When the external knowledge base is a text corpus, an instantiation of the reasoning logic can be represented by the state transition diagram in Figure 1. In this case, M_TOOL performs document and passage retrieval using external retrievers, while M_LLM leverages the LLM to analyze and digest the question, documents, and passages to deduce the final answer. To distinguish different types of outputs from a module that require different subsequent modules, we employ a set of special branch tokens such as "[NEXT]" to guide μ in determining the next state. In summary, AMOR answers a question Q by (1) iteratively decomposing Q into a sub-query q at state s_0, and finding the answer a to q and the evidence passage e through iterative knowledge retrieval, relevance evaluation, retrieval refinement (i.e., "Passage Retrieval"), and answer extraction, until no more knowledge is needed; and (2) deducing the final answer A based on the collected evidence passages at the final state.
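To make the control flow concrete, below is a minimal Python sketch of Algorithm 1 over the state machine of Figure 1. The module callables and the transition entries are illustrative placeholders inferred from the diagram, not the released implementation.

```python
from typing import Callable, Dict, List, Tuple

# Illustrative state indices following Figure 1 (S0: Decompose, ..., S6: Complete).
S0, S1, S2, S3, S4, S5, S6 = range(7)
FINAL_STATE = S6

# Each module reads/updates the shared context and returns (branch_token, output).
Module = Callable[[dict], Tuple[str, object]]

def run_fsm(question: str,
            modules: Dict[int, Module],
            transition: Dict[Tuple[int, str], int]) -> Tuple[object, List[dict]]:
    """Minimal rendering of Algorithm 1: execute modules and transitions over the FSM."""
    ctx = {"Q": question, "H": [], "E": [], "D": []}  # context variables of Figure 1
    state, trace = S0, []
    while True:
        branch, output = modules[state](ctx)              # y = m(s)
        trace.append({"state": state, "output": output})  # R.append(...)
        if state == FINAL_STATE:                          # Complete(Q, E) has produced A
            return output, trace
        state = transition[(state, branch)]               # s = mu(s, y)

# Transition entries read off Figure 1 (for illustration only):
EXAMPLE_TRANSITIONS = {
    (S0, "[NEXT]"): S1, (S0, "[FINISH]"): S6,
    (S1, "[NEXT]"): S2,
    (S2, "[RELEVANT]"): S4, (S2, "[IRRELEVANT]"): S3,
    (S3, "[CONTINUE]"): S2, (S3, "[NO MORE]"): S0,
    (S4, "[NEXT]"): S5,
    (S5, "[ANSWERABLE]"): S0, (S5, "[UNANSWERABLE]"): S3,
}
```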

Defining reasoning logic as an FSM offers three advantages: (1) Structured Thinking. The FSM makes specifications of inter-step dependencies (e.g., prioritization, branch selection) easy, and thus enables narrowing down the exploration space. (2) Skill Disentanglement. By decomposing complex tasks into modular steps, one can independently construct training data for each module, which significantly reduces the difficulty of implementing AMOR with open-source LLMs (cf. §3.2). This feature also allows AMOR to focus on single steps, thereby mitigating the weakness of LLMs in reasoning over the long context formed by task-solving trajectories [24]. (3) Intervenable Workflow. The structured reasoning process enables users to easily diagnose the agent's mistakes and provide process feedback for improving the reasoning capability of the agent (§3.3).

3.2 Warming-up open-source LLMs

Open-source LLMs are observed to fall short on complex agent tasks [46, 25]. Recent studies have improved their reasoning abilities through imitation learning using trajectories from advanced LLMs such as GPT-4 [53, 4]. However, even GPT-4 can struggle with producing high-quality reasoning trajectories [29].

AMOR's modular design enables us to construct training data for each module separately from existing datasets without simulating whole trajectories, thus greatly alleviating the above issue. Formally, given a sample question Q with annotations of the final answer A*, all sub-queries and answers H* = [(q*_0, a*_0), (q*_1, a*_1), ···], and all evidence passages E* = [e*_0, e*_1, ···], we can directly transform these annotations into a suitable format to serve as training data for Decompose and Complete in Figure 1. Since Judge and Answer require multiple types of retrieved knowledge (e.g., relevant or not), we employ retrieval tools to augment the input. Figure 2 exemplifies the construction pipeline, which can be easily extended to other knowledge-intensive datasets and specific domains. Appendix A.4 shows more details.
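As a concrete illustration of this construction (not the exact pipeline of Appendix A.4), the following sketch turns one annotated multi-hop QA sample into (state, output) training pairs for the Decompose and Complete modules. The field names of `sample` are hypothetical, and the retrieval-augmented construction needed for Judge and Answer is omitted.

```python
def build_decompose_examples(sample: dict) -> list:
    """Create (state, output) pairs for the Decompose module from gold annotations.

    `sample` is assumed to contain "question" (Q), "final_answer" (A*),
    "subqueries" (a list of (q*_i, a*_i) pairs), and "evidence_passages" (E*).
    """
    examples, solved = [], []
    for q_star, a_star in sample["subqueries"]:
        state = {"main_question": sample["question"], "solved_subqueries": list(solved)}
        examples.append({"state": state, "output": f"[NEXT] {q_star}"})
        solved.append((q_star, a_star))
    # Once all sub-queries are solved, the correct decision is to stop decomposing.
    final_state = {"main_question": sample["question"], "solved_subqueries": list(solved)}
    examples.append({"state": final_state, "output": "[FINISH]"})
    return examples

def build_complete_example(sample: dict) -> dict:
    """Create one (state, output) pair for the Complete module."""
    state = {"main_question": sample["question"], "evidence": sample["evidence_passages"]}
    return {"state": state, "output": sample["final_answer"]}
```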

When fine-tuning open-source LLMs to handle the multiple tasks defined by different modules, we are inspired by the Mixture-of-Experts approach [33] to learn distinct Feed-Forward Network (FFN) parameters in the final quarter of the Transformer blocks, balancing the trade-off between performance and inference efficiency. These module-specific parameters are initialized using the original model's FFN layers. We call the proposed architecture Module-Aware Mixture-of-Experts (MA-MoE)3. We then fine-tune the MA-MoE model with the standard language modeling loss:

L_1 = -E_{m ∈ M_LLM, (s, y) ∈ D_m} [ λ_m log π_{θ_m}(y | s) ],    (1)

where π refers to the policy model MA-MoE that maps a state to an action, θ_m denotes the parameters for the module m ∈ M_LLM, D_m is the corresponding collection of training examples, (s, y) is a state-output pair from D_m, and {λ_m} are tunable hyper-parameters.
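The following is a schematic PyTorch sketch of the MA-MoE idea described above: an FFN layer that keeps one expert copy per LLM module and routes by an explicitly supplied module index, applied to the final quarter of Transformer blocks. The `.mlp` attribute and helper names are assumptions for illustration, not the released architecture.

```python
import copy
import torch
import torch.nn as nn

class ModuleAwareFFN(nn.Module):
    """FFN layer with one expert per LLM module, selected by an explicit module index.

    Experts are initialized as copies of the original FFN, mirroring the initialization
    from the base model's FFN layers described in the text.
    """
    def __init__(self, base_ffn: nn.Module, num_modules: int):
        super().__init__()
        self.experts = nn.ModuleList(copy.deepcopy(base_ffn) for _ in range(num_modules))

    def forward(self, hidden_states: torch.Tensor, module_idx: int) -> torch.Tensor:
        # "Module-aware" routing: the active module's index picks the expert,
        # so no learned router is required.
        return self.experts[module_idx](hidden_states)

def make_module_aware(blocks: nn.ModuleList, num_modules: int) -> None:
    """Replace the FFN of the last quarter of Transformer blocks with module-aware experts."""
    start = 3 * len(blocks) // 4
    for block in list(blocks)[start:]:
        block.mlp = ModuleAwareFFN(block.mlp, num_modules)  # assumes a `.mlp` attribute
```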

3.3 Adaptation through process feedback

Feedback is crucial for adapting language agents to specific environments [40], especially when dealing with unseen, long-tail, or ever-changing domain knowledge. Prior agents commonly used outcome feedback for adaptation, which assesses the correctness of intermediate steps based on the success or failure of the outcome [53, 4]. However, outcome feedback is too sparse to improve intermediate reasoning [23]. Recent studies also highlighted that LLMs' reasoning steps are likely to contradict the outcome [26], which means that outcome feedback may inevitably introduce noise during training (see examples in Appendix B.8). In contrast, AMOR's process feedback mechanism can effectively alleviate these issues.

Algorithm 2 Adaptation through Process Feedback
Input: {π_{θ_m}}: Initial Policy; T: Exploration Steps between Exploitation; I: Number of Iterations.
Output: {π_{θ_m}}: Adapted Policy.
1: for i ← 1 to I do
2:   R? = []   // Feedback-refined reasoning processes.
3:   for t ← 1 to T do
4:     // Exploration
       Receive an input question Q.
5:     Collect AMOR_θ's reasoning process R.   // Algorithm 1
       // Feedback collection for each LLM module
6:     for each step r_k ∈ R (k = 0, 1, 2, ···) do
7:       Extract the state s_k and output y_k from r_k.
8:       if the corresponding module m_k ∈ M_LLM then
9:         Collect feedback f_k for s_k and y_k.
10:        Determine ?_k and o_k based on f_k.   // Eq. 2
11:        R?.append([s_k, ?_k, o_k])
12:  // Exploitation
     Optimize {θ_m} to minimize L_2 on R?.   // Eq. 3
13: return {π_{θ_m}}

Algorithm 2 describes the adaptation mechanism of AMOR, parameterized by θ, as three steps: (1) Exploration. AMOR answers the input question Q by interacting with a knowledge base. (2) Feedback Collection. AMOR's reasoning process for Q is evaluated with feedback f_k for the output y_k of the LLM at each step during reasoning, which is either "right/wrong" or a refined version of y_k. We convert y_k into a feedback-refined target output ?_k based on the feedback f_k and determine the immediate reward o_k accordingly (Eq. 2; Table 3 lists the resulting ?_k and o_k for each module). (3) Exploitation. Every T steps of the former exploration and feedback collection, we optimize the initial policy based on the resulting trajectories and corresponding feedback [32]:

L_2 = -E_{m ∈ M_LLM, (s_k, ?_k, o_k) ∈ R?_m} λ_m [ o_k - β log( π_{θ_m}(?_k | s_k) / π(?_k | s_k) ) ],    (3)

where R?_m ? R? denotes the training examples for module m, and π refers to the initial warm-up policy.

3 "Module-Aware" means that when AMOR executes a certain module, its module index will be provided to the routers of the model to indicate which expert should be activated.


Table 2: Automatic annotation strategy for silver process feedback for different LLM modules.

Module m         | Output y                    | Silver Process Feedback f
Decompose(Q,H)   | [NEXT] q                    | "right", if the documents retrieved using q overlap the documents corresponding to E*; "wrong", otherwise.
Decompose(Q,H)   | [FINISH]                    | "right", if E* ? E (i.e., the evidence passages collected by AMOR); "wrong", otherwise.
Judge(Q,H,q,d)   | [RELEVANT] / [IRRELEVANT]   | "[RELEVANT]", if one of the passages in E* comes from the same document as d; "[IRRELEVANT]", otherwise.
Answer(Q,H,q,P)  | [ANSWERABLE] a, e           | "right", if e ∈ E*; "wrong", otherwise.
Answer(Q,H,q,P)  | [UNANSWERABLE]              | "right", if P ∩ E* = ?; "wrong", otherwise.
Complete(Q,E)    | A                           | A* (the gold answer), if E* ? E; "wrong", otherwise.
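A minimal sketch of the annotation rules in Table 2 is given below as a single function. The input structures (gold evidence E*, collected evidence E, retrieval results) use hypothetical field names, and the document-overlap tests are simplified relative to the paper's exact matching criteria.

```python
def silver_feedback(module: str, output: dict, gold_answer: str,
                    gold_evidence: list, collected_evidence: list,
                    retrieved: dict) -> str:
    """Assign silver process feedback for one LLM-module output (cf. Table 2).

    gold_evidence: gold passages E* (each a dict with a "doc" field);
    collected_evidence: passages E gathered by AMOR so far;
    retrieved: {"docs": documents retrieved with the proposed sub-query,
                "current_doc": d, "passages": P}. All field names are hypothetical.
    """
    gold_docs = {p["doc"] for p in gold_evidence}

    if module == "Decompose":
        if output["branch"] == "[NEXT]":
            return "right" if any(d in gold_docs for d in retrieved["docs"]) else "wrong"
        # [FINISH] is correct only once all gold evidence has been collected.
        return "right" if all(p in collected_evidence for p in gold_evidence) else "wrong"

    if module == "Judge":
        return "[RELEVANT]" if retrieved["current_doc"] in gold_docs else "[IRRELEVANT]"

    if module == "Answer":
        if output["branch"] == "[ANSWERABLE]":
            return "right" if output["evidence"] in gold_evidence else "wrong"
        # [UNANSWERABLE] is correct only if no retrieved passage is a gold passage.
        return "right" if not any(p in gold_evidence for p in retrieved["passages"]) else "wrong"

    # Complete: if all gold evidence was collected, the feedback is the gold answer itself.
    return gold_answer if all(p in collected_evidence for p in gold_evidence) else "wrong"
```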

Notably, this loss function is non-differentiable, necessitating the use of a specialized optimization technique. We use a recently proposed alignment algorithm, KTO [9], with an MLE regularization [47] for optimization, which optimizes the policy without requiring paired human preferences. Crucially, when optimizing a particular module m, the gradient induced by the feedback signal propagates through the entire MA-MoE model, except for the FFN layers corresponding to other modules. This targeted optimization approach enables AMOR to effectively align its outputs with the desired intermediate results and final answers, leveraging the fine-grained process feedback provided by human supervisors.
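For orientation, the quantities entering Eq. 3 can be sketched as follows: per-example sequence log-probabilities under the current module policy and the frozen warm-up policy, combined with the immediate reward o_k. This is only a schematic evaluation of the bracketed expression; as stated above, the actual optimization uses KTO [9] with an MLE regularizer rather than this literal term.

```python
import torch

def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor,
                     ignore_index: int = -100) -> torch.Tensor:
    """Sum of target-token log-probabilities, i.e. log pi(y_hat | s) per example."""
    logprobs = torch.log_softmax(logits, dim=-1)
    mask = labels.ne(ignore_index)
    safe_labels = labels.clamp_min(0)
    token_lp = logprobs.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)
    return (token_lp * mask).sum(dim=-1)

def eq3_objective(policy_logprob: torch.Tensor,
                  reference_logprob: torch.Tensor,
                  reward: torch.Tensor,
                  beta: float = 0.1,
                  lam: float = 1.0) -> torch.Tensor:
    """Schematic batch evaluation of Eq. 3 for one LLM module.

    policy_logprob:    log pi_theta_m(y_hat | s) under the module being optimized.
    reference_logprob: log pi(y_hat | s) under the frozen warm-up policy.
    reward:            immediate reward o_k in {0, 1} from process feedback.
    Returns the negative expectation of the bracketed term (the L_2 expression).
    """
    log_ratio = policy_logprob - reference_logprob
    bracketed = reward - beta * log_ratio
    return -(lam * bracketed).mean()
```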

4 Experiments

4.1 Experimental setup

Tool modules. We construct retrievers for both SearchDoc and SearchPsg using Contriever-MS MARCO [15]. SearchDoc retrieves a single document snippet per query, while SearchPsg fetches the top three relevant passages from a given document. By invoking NextDoc, at most nine more document snippets are returned. Appendix B.1 presents more details.
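As a rough sketch of how such retriever tools can be wired up (not the paper's actual index construction in Appendix B.1), the following uses the public facebook/contriever-msmarco checkpoint with standard mean pooling; search_doc and search_psg mirror the SearchDoc and SearchPsg interfaces.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever-msmarco")
encoder = AutoModel.from_pretrained("facebook/contriever-msmarco")

def embed(texts: list) -> torch.Tensor:
    """Mean-pooled Contriever embeddings (the standard usage for this checkpoint)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def search_doc(query: str, doc_texts: list, doc_embs: torch.Tensor) -> str:
    """SearchDoc-style tool: return the single best-matching document snippet."""
    scores = embed([query]) @ doc_embs.T
    return doc_texts[int(scores.argmax())]

def search_psg(query: str, passages: list, k: int = 3) -> list:
    """SearchPsg-style tool: top-k passages from within one document."""
    scores = (embed([query]) @ embed(passages).T).squeeze(0)
    top = torch.topk(scores, k=min(k, len(passages))).indices.tolist()
    return [passages[i] for i in top]
```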

Warm-up datasets. We employ four question-answering (QA) datasets to warm up open-source LLMs, including 2WikiMultiHopQA [13], Musique [37], Natural Questions [20], and BoolQ [6]. They vary in levels of question complexity (single- or multi-hop), answer types (phrase spans or yes/no), types of dependency structures between sub-queries (e.g., serial or parallel), etc. Appendix A.4 shows the statistics in detail.

Adaptation & evaluation datasets. We consider three benchmarks, by which we simulate different deployment scenarios: (1) HotpotQA [49]: a challenging multi-hop QA dataset built on Wikipedia articles. We use the Wikipedia dump provided in [15] as the knowledge base. (2) PubMedQA [17]: a biomedical QA dataset that requires answering a question by "yes/no" given a PubMed abstract. We adapt the data to retrieval-based QA by pooling all 274k abstracts provided in the paper as a knowledge base, where each document comprises one abstract passage. (3) QASPER [8]: answering questions in free form based on a long NLP paper. For each question, we regard the corresponding paper as a knowledge base and each section of the paper as a document with several passages. We use the training and validation sets for adaptation fine-tuning and the test sets for evaluation. For evaluation metrics, we use exact match (EM) and F1 scores for HotpotQA and QASPER, and the accuracy (ACC) of "yes/no" for PubMedQA. More details are in Appendix B.2.

Feedback annotation. Considering limited resources, we simulate human behavior and provide silver feedback to AMOR's reasoning processes based on the gold answer A* and gold evidence passages E* = [e*_0, e*_1, ···] for each target question Q, which are already included in the training and validation data of the three benchmarks. Table 2 shows how we annotate the feedback for each LLM output y. Note that AMOR is applicable to gold feedback from humans in realistic applications. Appendix B.3 discusses the accuracy of the silver feedback through human evaluation.

Implementation details. We set λ_m in Eq. 1 and Eq. 3 to 1 for all modules, I = 1 in Algorithm 2, and T to the size of the training set for each dataset, and fine-tune LLaMA-2-7B/13B-Chat for two epochs with a learning rate of 2e-5 using 8 NVIDIA 80GB A100 GPUs. While applying


Table 3: Refining each module output y to ?_k to adapt AMOR, where ?y denotes converting the binary output y to its opposite label.

Module m         | Target Output ?_k and Immediate Reward o_k
Decompose(Q,H)   | ?_k = y and o_k = 1, if f = "right"; ?_k = y and o_k = 0, otherwise.
Judge(Q,H,q,d)   | ?_k = y and o_k = 1, if f = y; ?_k = ?y and o_k = 1, otherwise.
Answer(Q,H,q,P)  | ?_k = y and o_k = 1, if f = "right"; ?_k = y and o_k = 0, otherwise.
Complete(Q,E)    | ?_k = f and o_k = 1, if E* ? E; ?_k = y and o_k = 0, otherwise.
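A compact sketch of these refinement rules, written as an explicit function from (module, output y, feedback f) to (?_k, o_k), might look as follows. The feedback values follow the silver annotation of Table 2, and flipping the Judge label realizes the ?y case.

```python
def refine(module: str, output: str, feedback: str) -> tuple:
    """Map (output y, feedback f) to a target output y_hat and reward o (cf. Table 3).

    Feedback follows Table 2: "right"/"wrong" for Decompose and Answer, a relevance
    label for Judge, and either the gold answer or "wrong" for Complete.
    """
    if module == "Judge":
        if feedback == output:        # the predicted label matches the silver label
            return output, 1
        flipped = "[IRRELEVANT]" if output.startswith("[RELEVANT]") else "[RELEVANT]"
        return flipped, 1             # train toward the corrected (flipped) label
    if module in ("Decompose", "Answer"):
        return output, (1 if feedback == "right" else 0)
    # Complete: feedback is the gold answer when all gold evidence was collected.
    if feedback != "wrong":
        return feedback, 1
    return output, 0
```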
