AMOR: A Recipe for Building Adaptable Modular Knowledge Agents Through Process Feedback

Jian Guan1,2, Wei Wu2*, Zujie Wen2, Peng Xu2, Hongning Wang1, Minlie Huang1*
1 The CoAI Group, DCST, Institute for Artificial Intelligence, State Key Lab of Intelligent Technology and Systems, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China. 2 Ant Group.
{jianguanthu, wuwei19850318, wang.hongn}@gmail.com, {zujie.wzj, peng.x}@antgroup.com, aihuang@tsinghua.edu.cn

arXiv:2402.01469v2 [cs.CL] 25 Oct 2024
Abstract

The notable success of large language models (LLMs) has sparked an upsurge in building language agents to complete various complex tasks. We present AMOR, an agent framework based on open-source LLMs, which reasons with external knowledge bases and adapts to specific domains through human supervision of the reasoning process. AMOR builds reasoning logic over a finite state machine (FSM) that solves problems through autonomous executions and transitions over disentangled modules. This allows humans to provide direct feedback to the individual modules, and thus naturally forms process supervision. Based on this reasoning and feedback framework, we develop AMOR through two-stage fine-tuning: warm-up and adaptation. The former fine-tunes the LLM with examples automatically constructed from various public datasets, enabling AMOR to generalize across different knowledge environments, while the latter tailors AMOR to specific domains using process feedback. Extensive experiments across multiple domains demonstrate the advantage of AMOR over strong baselines, thanks to its FSM-based reasoning and process feedback mechanism. The code and data are publicly available at https://github.com/JianGuanTHU/AMOR.
1 Introduction

LLMs, with astounding performance on general natural language processing (NLP) problems [42, 1, 36], have spurred great interest in building LLM-based agents to solve complex tasks by interacting with external resources such as web knowledge [27], specialized tools [31], etc.

We focus on developing agents for knowledge-intensive tasks, where the agent completes users' information-seeking requests by interacting with specific knowledge bases [22]. To address the complexity of such tasks, we posit the desiderata for a qualifying agent as follows: Firstly, the agent should possess a robust reasoning logic about the task to solve individual problems with precise pathways. Secondly, the agent should maintain an adaptive mechanism to adjust to specific environments, rather than staying static. Thirdly, the reasoning process should be amenable to human interventions, enabling humans to steer the agent's behavior through direct feedback to the process rather than only to the outcome. This ability can significantly facilitate alignment between agent behavior and human intent [39].
Although extensive studies have been conducted on building language agents, few, if any, can fulfill all the required criteria due to their uncontrollable reasoning logic, static model capability, or sparse/missing feedback signals, as detailed in Table 1.

*Corresponding authors.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).
Table 1: Comparison between AMOR and representative methods for building agents. Appendix A.1 provides a more comprehensive discussion.

| Method | Reasoning Logic: Step | Reasoning Logic: Inter-Step Dependency | Adaptive Mechanism | Feedback |
|---|---|---|---|---|
| WebGPT [27] | Tool Invoking | Undefined | Imitation Learning from Humans | Outcome |
| CoT [43] | Reasoning | Undefined | Prompting | Undefined |
| ToT [50] | Reasoning | Undefined | Prompting | Process |
| ReAct [51] | Reasoning & Tool Invoking | Undefined | Prompting | Undefined |
| Reflexion [35] | Reasoning & Tool Invoking | Undefined | Prompting | Process |
| AgentLM [53] | Reasoning & Tool Invoking | Undefined | Imitation Learning from LLMs | Outcome |
| MetaGPT [14] | Specialized Module | Sequential Pipeline | Prompting | Process |
| LUMOS [52] | Specialized Module | Sequential Pipeline | Imitation Learning from Humans | Undefined |
| AMOR | Specialized Module | Finite State Machine | Exploration & Exploitation | Process |
Figure 1: AMOR's state transition diagram. Each box represents a state and the corresponding module that is executed when entering the state. There may be multiple categories of execution results, distinguished by special branch tokens such as "[NEXT]"; AMOR determines the next state based on the branch tokens. The states and modules are:

- S0: Question Decomposition (Q, H, E). m0: Decompose(Q, H). Generate the next query q if more information besides H is needed to answer Q. Output: "[NEXT] q" if more information is needed (to S1); "[FINISH]" otherwise (to S6).
- S1: Document Retrieval (Q, H, E, q). m1: SearchDoc(q). Retrieve one document snippet using q from the given knowledge base. Output: d0, setting D = [d0] and d = d0; the flow proceeds to S2.
- S2: Relevance Judgment (Q, H, E, q, D, d). m2: Judge(Q, H, q, d). Judge whether d is relevant to q. Output: "[RELEVANT]" (to S4) or "[IRRELEVANT]" (to S3).
- S3: Document Navigation (Q, H, E, q, D). m3: NextDoc(). Navigate to the next document snippet d′. Output: "[CONTINUE]" with d = d′ and D = D + [d′] (back to S2), or "[NO MORE]" if exceeding the maximum number of documents, which sets H = H + [(q, "No Answer")], E = E + [D[0]] and returns to S0.
- S4: Passage Retrieval (Q, H, E, q, D, d). m4: SearchPsg(q, d). Retrieve passages using q in the document of d. Output: P = [p0, p1, ...]; the flow proceeds to S5.
- S5: Answer Extraction (Q, H, E, q, D, P). m5: Answer(Q, H, q, P). Extract the answer a to the query q and the evidence passage e from P. Output: "[ANSWERABLE] a, e" if answerable, which sets H = H + [(q, a)], E = E + [e] and returns to S0; "[UNANSWERABLE]" otherwise (to S3).
- S6: Task Completion (Q, E). m6: Complete(Q, E). Generate the answer to the main question Q based on E. Output: the final answer A.

S0, S2, S5, and S6 are states whose modules call LLMs; S1, S3, and S4 invoke tools. Notation: Q: the main question; A: the final answer; H: all solved sub-queries and answers; E: all collected evidence passages (initially H = [] and E = []); q: the current sub-query; D: all retrieved documents; d: the current document; P: the retrieved passages in d.
Consequently, it is still challenging for users to critique, and thus guide, existing agents to follow targeted manners, especially when the agents are built upon less powerful LLMs [25].
We introduce AMOR, an Adaptable MOdulaR knowledge agent that can reason and adapt, with the reasoning process amenable to human supervision, based on open-source LLMs. AMOR's reasoning logic is formalized as a finite state machine (FSM) [7, 21] that solves problems via a series of executions and transitions over a set of modules (Figure 1). This naturally enables the desired process-based supervision mechanism, allowing users to give feedback to each LLM-controlled module. AMOR supports flexible forms of feedback, either binary judgments regarding the correctness of the outputs or refinements of them. The reasoning logic and process feedback mechanism together frame how AMOR thinks, acts, and interacts with users and task environments.

We build AMOR upon an LLM equipped with distinct parameters for different modules to efficiently handle multiple tasks. The training of AMOR happens in two stages: (1) Warm-up: the modular design enables us to construct training data separately for each disentangled module without requiring complete trajectories for specific tasks. As a result, we create a large dataset of 50k examples covering multiple distinct tasks, simply using public datasets. We fine-tune AMOR on this data for generalization over various knowledge-seeking scenarios. (2) Adaptation: when deployed, we tailor AMOR to the target domain by letting it autonomously address user tasks (i.e., exploration), collecting process feedback for each LLM output, and evolving through further fine-tuning on the exploration trajectories with feedback (i.e., exploitation). Our contributions are summarized as follows:
I. We propose a general framework for building knowledge agents, featuring FSM-based reasoning logic and a process feedback mechanism. We focus on text corpora as knowledge bases, but the approach can be flexibly extended to other knowledge types and user tasks by customizing the modules and dependencies within the FSM framework.
II. Experiments across multiple domains show the strong advantage of the FSM-based reasoning logic, with 30%-40% improvements over baselines when based on off-the-shelf LLMs (e.g., GPT-4?). Switching to fine-tuned LLMs, the warm-up stage empowers AMOR to generalize to multiple domains and surpass strong baselines. Subsequent domain-specific adaptation reveals that process feedback is significantly more effective in improving the reasoning process than outcome feedback.
2 Related work

Language agents. Interest is surging in building agents for tasks necessitating multi-step reasoning. Existing work falls into two groups. The first group focuses on designing agent architectures, such as CoT's step-by-step reasoning [44], ReAct's integration of reasoning, action, and observation to allow tool use [51], and CODEPLAN's two-stage reasoning framework that first generates a code-form plan and then realizes low-level reasoning steps [45]. Nevertheless, such free-form reasoning constrains human intervention. In contrast, modular agents follow a pipeline to execute specialized modules [19, 14, 11, 3, 52], improving the ease of intervention. The second group aims to design adaptive mechanisms for adapting agents to specific scenarios. ToT [50] and Reflexion [35] use environment feedback for multi-path pruning and iterative single-path refinement, respectively, but suffer from poor inference efficiency and the need for real-time feedback. As a fine-tuning approach, recent work equipped open-source LLMs with specific agent abilities by learning from examples synthesized based on human priors [5], or expert trajectories from humans [27] or GPT-4 [53, 4] with correctness validation through outcome feedback. In contrast, our modular agent AMOR employs FSM-based reasoning with a stronger capacity for handling complex tasks than simple pipelines, and adapts effectively to specific environments via process feedback.
Retrieval-augmented generation (RAG). The RAG paradigm augments the inputs of LLMs with retrieved passages to enhance factuality [12, 22, 10]. Recent studies have developed interleaved reasoning-retrieval for better information recall than one-step retrieval [38, 16, 28]. However, retrieval may introduce noise that leads to low-quality answers [34]. To tackle this, Self-RAG [2] trained LLMs to selectively perform retrieval and utilize retrieved passages. Unlike RAG approaches, AMOR emphasizes an explainable reasoning process for proactively decomposing questions and seeking evidence for grounded generation, and allows for process feedback from humans. RAG, by contrast, mainly focuses on integrating parametric factual knowledge in LLMs with retrieved non-parametric knowledge, which is less explainable and intervenable.
3 AMOR agent

AMOR relies on three key techniques: FSM-based reasoning logic, a process feedback mechanism, and a two-stage fine-tuning strategy. We detail the definition of the reasoning logic and its specification assuming the knowledge base is a text corpus in §3.1, the method for fine-tuning open-source LLMs as a warm-up stage in §3.2, and the adaptation stage driven by process feedback in §3.3.

3.1 Reasoning logic
Algorithm 1 FSM-based Reasoning Logic
Input: Agent at the state s = s0; Q: Question.
Output: A: Final Answer; R: Reasoning Process.
1: R = []
2: while s ≠ sN?1 do
3:   y = m(s)  // Obtain the output y given s from the corresponding module m.
4:   R.append({"state": s, "output": y})
5:   s = μ(s, y)  // Transition to the next state.
6: A = y
7: return A, R

Algorithm 1 outlines how to deduce the answer A for an input question Q with a reasoning process R using FSM-based reasoning logic, which can be defined by a quadruple {S, M, E, μ}, where
? S = {s0, ..., sN?1} is a set of states, with s0 as the initial state and sN?1 as the final state. Each state holds variables to track context information.

? M = {m0, ..., mN?1} is a set of modules, with mk triggered when the reasoning flow reaches state sk. The modules are categorized into two types: (a) tool modules (MTOOL) for invoking tools, and (b) LLM modules (MLLM) for calling LLMs.
? In this work, GPT-3.5/4 refers to OpenAI's API "gpt-3.5-turbo"/"gpt-4-1106-preview," respectively.
Figure 2: On the top left is a sample question from Musique [37], providing ample information (in green) for constructing training examples for the four LLM modules of AMOR (bottom). We augment extra knowledge (in blue) for the Judge and Answer modules by invoking the SearchDoc and SearchPsg tools (top right). In each example, we highlight the prompt in purple to format the current state (before "Output:") and the output (after "Output:"), and use "||" to separate different training examples. The original sample is:

Question Q: On what date did the publisher of Chick Chick Boom unveil its new systems?
- Sub-query q?0: Who was the publisher of Chick Chick Boom? Answer a?0: Nintendo. Evidence e?0 (title: Chick Chick Boom): "...Chick Chick Boom is an online Adobe Flash game created for Easter 2007 by German developer ExtraToxic and sponsored by Nintendo of Europe..."
- Sub-query q?1: What day did Nintendo unveil the new systems? Answer a?1: October 18, 1985. Evidence e?1 (title: Nintendo Entertainment System): "...Nintendo seeded these first systems to limited American test markets starting in New York City on October 18, 1985..."
Final Answer: October 18, 1985.

The knowledge augmentation (top right) pairs each sub-query with both gold and distractor retrievals, e.g., a distractor document titled "Chick-fil-A" retrieved for q?0 with the gold document excluded, and passages retrieved from within the gold documents ("Chick Chick Boom", "Nintendo Entertainment System") for answer extraction.
? E is the set of all possible outputs of M.

? μ: S × E → S is the transition function that determines the next state of the reasoning flow given the current state and the execution result of the corresponding module.

When the external knowledge base is a text corpus, an instantiation of the reasoning logic can be represented by the state transition diagram in Figure 1. In this case, MTOOL performs document and passage retrieval using external retrievers, while MLLM leverages the LLM to analyze and digest the question, documents, and passages to deduce the final answer. To distinguish different types of outputs from a module that require different subsequent modules, we employ a set of special branch tokens such as "[NEXT]" to guide μ in determining the next state. In summary, AMOR answers a question Q by (1) iteratively decomposing Q into a sub-query q at state s0, and finding the answer a to q and the evidence passage e through iterative knowledge retrieval, relevance evaluation, retrieval refinement (i.e., "Passage Retrieval"), and answer extraction, until no more knowledge is needed; and (2) deducing the final answer A based on the collected evidence passages at the final state.
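To make the control flow concrete, below is a minimal Python sketch of this FSM instantiation; it is our illustration rather than the authors' released code, and the module stubs, dictionary-based transition table, and state names are assumptions layered on Figure 1 and Algorithm 1.

```python
# A minimal sketch of AMOR's FSM-based reasoning logic (Algorithm 1, Figure 1).
# Module implementations are stubbed; only the control flow is shown.
from typing import Callable

# Transition function mu: (current state, branch token) -> next state.
# Entries follow Figure 1, e.g., Decompose at S0 emits "[NEXT]" or "[FINISH]".
MU = {
    ("S0", "[NEXT]"): "S1",          # new sub-query q -> document retrieval
    ("S0", "[FINISH]"): "S6",        # enough evidence -> task completion
    ("S1", "[NEXT]"): "S2",          # retrieved snippet d -> relevance judgment
    ("S2", "[RELEVANT]"): "S4",      # d relevant -> passage retrieval
    ("S2", "[IRRELEVANT]"): "S3",    # d irrelevant -> try the next document
    ("S3", "[CONTINUE]"): "S2",      # next snippet d' -> judge again
    ("S3", "[NO MORE]"): "S0",       # retrieval budget exhausted -> decompose again
    ("S4", "[NEXT]"): "S5",          # passages P -> answer extraction
    ("S5", "[ANSWERABLE]"): "S0",    # sub-answer found -> next sub-query
    ("S5", "[UNANSWERABLE]"): "S3",  # no answer in P -> next document
}

def reason(modules: dict[str, Callable], question: str):
    """Algorithm 1: run modules and transitions until the final state S6."""
    state, trace = "S0", []
    context = {"Q": question, "H": [], "E": []}   # state variables of Figure 1
    while state != "S6":
        branch, output = modules[state](context)  # y = m(s)
        trace.append({"state": state, "branch": branch, "output": output})
        state = MU[(state, branch)]               # s = mu(s, y)
    _, answer = modules["S6"](context)            # Complete(Q, E) -> final answer A
    trace.append({"state": "S6", "output": answer})
    return answer, trace
```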
Defining reasoning logic as an FSM offers three advantages: (1) Structured Thinking. FSM makes specifications of inter-step dependencies (e.g., prioritization, branch selection) easy, and thus enables narrowing down the exploration space. (2) Skill Disentanglement. By decomposing complex tasks into modular steps, one can independently construct training data for each module, which significantly reduces the difficulty of implementing AMOR with open-source LLMs (cf. §3.2). This feature also allows AMOR to focus on single steps, thereby mitigating the weakness of LLMs in reasoning over the long context formed by task-solving trajectories [24]. (3) Intervenable Workflow. The structured reasoning process enables users to easily diagnose the agent's mistakes and provide process feedback for improving the reasoning capability of the agent (§3.3).
3.2 Warming up open-source LLMs

Open-source LLMs are observed to fall short in complex agent tasks [46, 25]. Recent studies have improved their reasoning abilities through imitation learning using trajectories from advanced LLMs such as GPT-4 [53, 4]. However, even GPT-4 can struggle with producing high-quality reasoning trajectories [29].
AMOR's modular design enables us to construct training data for each module separately from existing datasets without simulating whole trajectories, thus greatly alleviating the above issue. Formally, given a sample question Q with annotations of the final answer A?, all sub-queries and answers H? = [(q?0, a?0), (q?1, a?1), ···], and all evidence passages E? = [e?0, e?1, ···], we can directly transform these annotations into a suitable format to serve as training data for Decompose and Complete in Figure 1. Since Judge and Answer require multiple types of retrieved knowledge (e.g., relevant or not), we employ retrieval tools to augment the input. Figure 2 exemplifies the construction pipeline, which can be easily extended to other knowledge-intensive datasets and specific domains. Appendix A.4 shows more details.
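As an illustration of this construction, the following Python sketch turns a Musique-style annotated sample (the field names are hypothetical, not the released data format) into (state, output) training pairs for the Decompose module; the other modules follow analogously.

```python
# A sketch of warm-up data construction for the Decompose module from a
# Musique-style sample (field names are illustrative, not the released format).

def build_decompose_examples(sample: dict) -> list[dict]:
    """Turn gold sub-query/answer annotations into (state, output) pairs."""
    q_main = sample["question"]
    examples, history = [], []
    for sub_q, sub_a in sample["sub_queries"]:
        # State: main question Q plus solved sub-queries H; target: next query.
        examples.append({
            "state": {"Q": q_main, "H": list(history)},
            "output": f"[NEXT] {sub_q}",
        })
        history.append((sub_q, sub_a))
    # Once all sub-queries are solved, the gold output is to stop decomposing.
    examples.append({"state": {"Q": q_main, "H": list(history)},
                     "output": "[FINISH]"})
    return examples

sample = {
    "question": "On what date did the publisher of Chick Chick Boom unveil its new systems?",
    "sub_queries": [
        ("Who was the publisher of Chick Chick Boom?", "Nintendo"),
        ("What day did Nintendo unveil the new systems?", "October 18, 1985"),
    ],
}
print(len(build_decompose_examples(sample)))  # 3 examples: two [NEXT], one [FINISH]
```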
When fine-tuning open-source LLMs to handle the multiple tasks defined by different modules, we are inspired by the Mixture-of-Experts approach [33] to learn distinct Feed-Forward Network (FFN) parameters in the final quarter of the Transformer blocks, balancing the trade-off between performance and inference efficiency. These module-specific parameters are initialized using the original model's FFN layers. We call the proposed architecture Module-Aware Mixture-of-Experts (MA-MoE)?. Then, we fine-tune the MA-MoE model with the standard language modeling loss:

L? = ?E_{m∈MLLM, (s,y)∈Dm} [ λm log πθm(y | s) ],   (1)

where π refers to the policy model MA-MoE that maps a state to an action, θm denotes the parameters for module m ∈ MLLM, Dm is the corresponding collection of training examples, (s, y) is a state-output pair from Dm, and {λm} are tunable hyper-parameters.
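The following PyTorch sketch illustrates the module-aware routing idea: one FFN expert per LLM module, selected deterministically by the module index. The dimensions, activation, and exact placement inside the Transformer blocks are illustrative assumptions, not the paper's exact architecture.

```python
# A sketch of the Module-Aware Mixture-of-Experts (MA-MoE) idea: one FFN copy
# per LLM module, selected by the module index rather than a learned router.
import torch
import torch.nn as nn

class ModuleAwareFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_modules: int):
        super().__init__()
        # In the paper each expert is initialized from the pretrained FFN;
        # here the experts are independent copies for simplicity.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_modules)
        )

    def forward(self, hidden: torch.Tensor, module_idx: int) -> torch.Tensor:
        # The executing module's index deterministically picks the expert.
        return self.experts[module_idx](hidden)

ffn = ModuleAwareFFN(d_model=32, d_ff=128, n_modules=4)  # Decompose/Judge/Answer/Complete
h = torch.randn(2, 5, 32)
out = ffn(h, module_idx=1)  # e.g., route through the Judge expert
```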
3.3 Adaptation through process feedback

Feedback is crucial for adapting language agents to specific environments [40], especially when dealing with unseen, long-tail, or ever-changing domain knowledge. Prior agents commonly used outcome feedback for adaptation, which assesses the correctness of intermediate steps based on the success or failure of the outcome [53, 4]. However, outcome feedback is too sparse to improve intermediate reasoning [23]. Recent studies also highlighted that LLMs' reasoning steps are likely to contradict the outcome [26], which means that outcome feedback may inevitably introduce noise during training (see examples in Appendix B.8). In contrast, AMOR's process feedback mechanism can effectively alleviate these issues.
Algorithm 2 Adaptation through Process Feedback
Input: {πθm}: Initial Policy; T: Exploration Steps between Exploitations; I: Number of Iterations.
Output: {πθm}: Adapted Policy.
1: for i ← 1 to I do
2:   R? = []  // Feedback-Refined Reasoning Processes
3:   for t ← 1 to T do
4:     // Exploration
       Receive an input question Q.
5:     Collect AMORθ's reasoning process R.  // Algorithm 1
       // Feedback Collection for Each LLM Module
6:     for each step rk ∈ R (k = 0, 1, 2, ···) do
7:       Extract the state sk and output yk from rk.
8:       if the corresponding module mk ∈ MLLM then
9:         Collect feedback fk for sk and yk.
10:        Determine ?k and ok based on fk.  // Eq. 2
11:        R?.append([sk, ?k, ok])
12:  // Exploitation
     Optimize {θm} to minimize L? on R?.  // Eq. 3
13: return {πθm}
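A compact Python sketch of this loop is given below, reusing the `reason` function from the earlier FSM sketch; `collect_feedback`, `refine`, and `exploit` are hypothetical placeholders for the feedback interface, the Eq. 2 / Table 3 conversion, and the Eq. 3 update, respectively.

```python
# A sketch of Algorithm 2: exploration, per-module feedback collection, and
# exploitation. The three callbacks are placeholders, not released APIs.

LLM_STATES = {"S0", "S2", "S5", "S6"}  # Decompose, Judge, Answer, Complete

def adapt(modules, questions, collect_feedback, refine, exploit, T: int, I: int):
    for _ in range(I):
        refined = []                                 # feedback-refined processes R~
        for question in questions[:T]:
            _, trace = reason(modules, question)     # Exploration (Algorithm 1 sketch)
            for step in trace:                       # Feedback for each LLM module
                if step["state"] in LLM_STATES:
                    f = collect_feedback(step)           # "right"/"wrong" or a refinement
                    target, reward = refine(step, f)     # (y~_k, o_k), cf. Eq. 2 / Table 3
                    refined.append((step["state"], target, reward))
        modules = exploit(modules, refined)          # minimize L2 (Eq. 3) on R~
    return modules
```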
Algorithm 2 describes the adaptation mechanism of AMOR parameterized by θ, specifically as three steps: (1) Exploration. AMOR answers the input question Q by interacting with a knowledge base. (2) Feedback Collection. AMOR's reasoning process for Q is evaluated with feedback fk for the output yk of the LLM at each step during reasoning; fk is either "right"/"wrong" or a refined version of yk. We convert yk into a feedback-refined target output ?k based on the feedback fk and determine the immediate reward ok as follows:

(?k, ok) = (yk, 1), if fk confirms yk; (fk, 1), if fk is a refinement of yk; (yk, 0), otherwise,   (2)

with the per-module instantiation given in Table 3. (3) Exploitation. Every T steps of the former exploration and feedback collection, we optimize the initial policy based on the resulting trajectories and corresponding feedback [32]:

L? = ?E_{m∈MLLM, (sk,?k,ok)∈R?m} [ λm ok · β log( πθm(?k | sk) / π′(?k | sk) ) ],   (3)
? "Module-Aware" means that when AMOR executes a certain module, its module index is provided to the routers of the model to indicate which expert should be activated.
Table 2: Automatic annotation strategy for silver process feedback for different LLM modules.

| Module m | Output y | Silver Process Feedback f |
|---|---|---|
| Decompose(Q, H) | [NEXT] q | "right", if the documents retrieved using q overlap the documents corresponding to E?; "wrong", otherwise. |
| | [FINISH] | "right", if E? ? E (i.e., the evidence passages collected by AMOR); "wrong", otherwise. |
| Judge(Q, H, q, d) | [RELEVANT] / [IRRELEVANT] | "[RELEVANT]", if one of the passages in E? comes from the same document as d; "[IRRELEVANT]", otherwise. |
| Answer(Q, H, q, P) | [ANSWERABLE] a, e | "right", if e ∈ E?; "wrong", otherwise. |
| | [UNANSWERABLE] | "right", if P ∩ E? = ?; "wrong", otherwise. |
| Complete(Q, E) | A | A?, if E? ? E; "wrong", otherwise. |
where R?m ? R? denotes the training examples for module m, and π′ refers to the initial warm-up policy. Notably, this loss function is non-differentiable, necessitating a specialized optimization technique. We use a recently proposed alignment algorithm, KTO [9], with an MLE regularization [47] for optimization, which optimizes the policy without requiring paired human preferences. Crucially, when optimizing a particular module m, the gradient induced by the feedback signal propagates through the entire MA-MoE model, except for the FFN layers corresponding to other modules. This targeted optimization approach enables AMOR to effectively align its outputs with the desired intermediate results and final answers, leveraging the fine-grained process feedback provided by human supervisors.
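To show the shape of the objective (not the full KTO optimizer), here is a hedged PyTorch sketch of the per-batch term of Eq. 3, assuming token-level log-probabilities have already been summed over the target output:

```python
# A sketch of the per-example term of Eq. 3: an immediate reward o_k weighting
# the scaled log-ratio between the adapted policy and the frozen warm-up policy.
import torch

def l2_term(logp_theta: torch.Tensor,   # log pi_theta_m(y~_k | s_k)
            logp_ref: torch.Tensor,     # log pi'(y~_k | s_k), warm-up policy
            reward: torch.Tensor,       # o_k in {0, 1}
            beta: float = 0.1,
            lam: float = 1.0) -> torch.Tensor:
    # The loss is the negative of the reward-weighted scaled log-ratio.
    return -(lam * reward * beta * (logp_theta - logp_ref)).mean()

loss = l2_term(torch.tensor([-3.2, -5.0]), torch.tensor([-3.5, -4.8]),
               torch.tensor([1.0, 0.0]))
```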
4 Experiments

4.1 Experimental setup

Tool modules. We construct retrievers for both SearchDoc and SearchPsg using Contriever-MS MARCO [15]. SearchDoc retrieves a single document snippet per query, while SearchPsg fetches the top three relevant passages from a given document. By invoking NextDoc, at most nine more document snippets are returned. Appendix B.1 presents more details.
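The sketch below illustrates how such tool modules can be wrapped around a generic dense retriever; the `doc_embs`/`emb` matrix interface is an assumption for illustration and is not Contriever's actual API.

```python
# A sketch of the three tool modules over a generic dense retriever.
import numpy as np

class KnowledgeBase:
    def __init__(self, doc_embs: np.ndarray, docs: list[dict]):
        self.doc_embs, self.docs = doc_embs, docs   # docs: {"snippet", "passages", "emb"}
        self.ranking = np.array([], dtype=int)       # ranked doc ids for NextDoc
        self.cursor = 0                              # position for NextDoc

    def search_doc(self, q_emb: np.ndarray) -> dict:
        """SearchDoc: return the single best-matching document snippet."""
        order = np.argsort(-self.doc_embs @ q_emb)
        self.ranking, self.cursor = order, 1
        return self.docs[order[0]]

    def next_doc(self, max_docs: int = 10) -> dict | None:
        """NextDoc: step to the next snippet, up to nine more ([NO MORE] after)."""
        if self.cursor >= min(max_docs, len(self.ranking)):
            return None                              # -> branch token [NO MORE]
        d = self.docs[self.ranking[self.cursor]]
        self.cursor += 1
        return d

    def search_psg(self, q_emb: np.ndarray, doc: dict, k: int = 3) -> list[str]:
        """SearchPsg: top-3 passages within the chosen document."""
        scores = doc["emb"] @ q_emb                  # passage embeddings of this doc
        top = np.argsort(-scores)[:k]
        return [doc["passages"][i] for i in top]
```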
Warm-up datasets. We employ four question-answering (QA) datasets to warm up open-source LLMs, including 2WikiMultiHopQA [13], Musique [37], NaturalQuestions [20], and BoolQ [6]. They vary in levels of question complexity (single- or multi-hop), answer types (phrase spans or yes/no), types of dependency structures between sub-queries (e.g., serial or parallel), etc. Appendix A.4 shows the statistics in detail.
Adaptation & evaluation datasets. We consider three benchmarks, by which we simulate different deployment scenarios: (1) HotpotQA [49]: a challenging multi-hop QA dataset built on Wikipedia articles. We use the Wikipedia dump provided in [15] as the knowledge base. (2) PubMedQA [17]: a biomedical QA dataset that requires answering a question by "yes/no" given a PubMed abstract. We adapt the data to retrieval-based QA by piling all 274k abstracts provided in the paper into a knowledge base, where each document comprises one abstract passage. (3) QASPER [8]: answering questions in free form based on a long NLP paper. For each question, we regard the corresponding paper as a knowledge base and each section of the paper as a document with several passages. We use the training and validation sets for adaptation fine-tuning and the test sets for evaluation. For evaluation metrics, we use exact match (EM) and F1 scores for HotpotQA and QASPER, and the accuracy (ACC) of "yes/no" for PubMedQA. More details are in Appendix B.2.
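For reference, a standard implementation of EM and token-level F1 is sketched below; the normalization follows common QA practice and may differ in small details from the benchmarks' official scripts.

```python
# Standard QA metrics: exact match and token-level F1 over normalized strings.
import re
import string
from collections import Counter

def normalize(s: str) -> str:
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)   # drop articles
    return " ".join(s.split())               # collapse whitespace

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1_score(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("October 18, 1985", "october 18 1985"))  # 1.0
```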
Feedback annotation. Considering limited resources, we simulate human behavior and provide silver feedback to AMOR's reasoning processes based on the gold answer A? and gold evidence passages E? = [e?0, e?1, ···] for each target question Q, which are already included in the training and validation data of the three benchmarks. Table 2 shows how we annotate the feedback for each LLM output y. Note that AMOR is applicable to gold feedback from humans in realistic applications. Appendix B.3 discusses the accuracy of the silver feedback through human evaluation.
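As an illustration, the following sketch implements Table 2's silver feedback rules for the Answer module; the output and gold-evidence data structures are hypothetical.

```python
# A sketch of Table 2's silver feedback rules for the Answer module, using gold
# evidence passages E* from the dataset. Data structures are illustrative.

def silver_feedback_answer(output: dict, gold_evidence: set[str]) -> str:
    """Annotate Answer(Q, H, q, P) outputs as "right"/"wrong" per Table 2."""
    if output["branch"] == "[ANSWERABLE]":
        # Right iff the cited evidence passage is one of the gold passages.
        return "right" if output["evidence"] in gold_evidence else "wrong"
    # [UNANSWERABLE]: right iff none of the candidate passages P is gold.
    retrieved = set(output["passages"])
    return "right" if retrieved.isdisjoint(gold_evidence) else "wrong"

fb = silver_feedback_answer(
    {"branch": "[ANSWERABLE]", "evidence": "p1", "passages": ["p1", "p2"]},
    gold_evidence={"p1", "p3"},
)  # -> "right"
```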
Implementation details. We set λm in Eq. 1 and Eq. 3 to 1 for all modules, I = 1 in Algorithm 2, and T to the size of the training set for each dataset, and fine-tune LLAMA-2-7B/13B-Chat for two epochs with a learning rate of 2e?5 using 8 NVIDIA 80GB A100 GPUs. While applying
Table 3: Refining each module output yk into ?k to adapt AMOR, where ?y denotes converting the binary output y to its opposite label.

| Module m | Target Output ?k and Immediate Reward ok |
|---|---|
| Decompose(Q, H) | ?k = y and ok = 1, if f = "right"; ?k = y and ok = 0, otherwise. |
| Judge(Q, H, q, d) | ?k = y and ok = 1, if f = y; ?k = ?y and ok = 1, otherwise. |
| Answer(Q, H, q, P) | ?k = y and ok = 1, if f = "right"; ?k = y and ok = 0, otherwise. |
| Complete(Q, E) | ?k = A? and ok = 1, if E? ? E; ?k = y and ok = 0, otherwise. |
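The refinement rules of Table 3 can be expressed as a small function; this sketch is ours and mirrors the table rather than released code.

```python
# A sketch of Table 3's per-module refinement: map feedback f to a target
# output y~_k and an immediate reward o_k. `flip` negates a binary label.

def flip(label: str) -> str:
    return "[IRRELEVANT]" if label == "[RELEVANT]" else "[RELEVANT]"

def refine(module: str, y: str, f: str):
    if module == "Judge":
        # The silver feedback is itself the gold label, so a disagreement is
        # converted into a corrected target with full reward.
        return (y, 1.0) if f == y else (flip(y), 1.0)
    if module == "Complete" and f not in ("right", "wrong"):
        return (f, 1.0)        # refinement feedback: the gold answer A*
    return (y, 1.0) if f == "right" else (y, 0.0)

print(refine("Judge", "[RELEVANT]", "[IRRELEVANT]"))  # ('[IRRELEVANT]', 1.0)
```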