Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

Seed Vision Team, ByteDance

Abstract

Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5, and Midjourney still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese characteristics. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that adeptly manages text prompts in both Chinese and English, supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances the accuracy and richness of image descriptions. In particular, Seedream is integrated with a self-developed bilingual large language model (LLM) as a text encoder, allowing it to learn native knowledge directly from massive data. This enables it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Besides, a Glyph-Aligned ByT5 model is applied for flexible character-level text rendering, while a Scaled RoPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves superior performance in multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as revealed by its outstanding ELO score. In addition, it can be adapted to an instruction-based image editing model with strong editing capability that balances instruction-following and image consistency.

Correspondence: Authors are listed in Appendix A.
Official Page: /tech/seedream

Figure 1 Seedream 2.0 demonstrates outstanding performance across all evaluation aspects in both English and Chinese.

Figure 2 Seedream 2.0 Visualization.

Contents

1 Introduction
2 Data Pre-Processing
  2.1 Data Composition
  2.2 Data Cleaning Process
  2.3 Active Learning Engine
  2.4 Image Captioning
  2.5 Text Rendering Data
3 Model Pre-Training
  3.1 Diffusion Transformer
  3.2 Text Encoder
  3.3 Character-level Text Encoder
4 Model Post-Training
  4.1 Continuing Training (CT)
    4.1.1 Data
    4.1.2 Training Strategy
  4.2 Supervised Fine-Tuning (SFT)
    4.2.1 Data
    4.2.2 Training Strategy
  4.3 Human Feedback Alignment (RLHF)
  4.4 Prompt Engineering (PE)
    4.4.1 Fine-tune LLM
    4.4.2 PE RLHF
5 Align to Instruction-Based Image Editing
  5.1 Preliminaries
  5.2 Enhanced Human ID Preservation
6 Model Acceleration
  6.1 CFG and Step Distillation
  6.2 Quantization
7 Model Performance
  7.1 Human Evaluation
    7.1.2 Human Evaluation Results
  7.2 Automatic Evaluation
    7.2.1 Text-Image Alignment
    7.2.2 Image Quality
  7.3 Text Rendering
  7.5 Visualization
8 Conclusion
A Contributions and Acknowledgments
1 Introduction

With the significant advancement of diffusion models, the field of image generation has experienced rapid expansion. Recent powerful models such as Flux [13], SD3.5 [7], Ideogram 2.0, and Midjourney 6.1 have initiated a wave of widespread commercial applications. However, despite the remarkable progress made by the existing foundational models, they still encounter several challenges:

• Model Bias: Existing models are often biased toward particular capabilities, such as aesthetics, while sacrificing the performance in other aspects, such as prompt-following or structural correctness.
• Inadequate Text Rendering Capacity: The ability to perform accurate text rendering in long content or in multiple languages (especially in Chinese) is rather limited, while text rendering is a key ability in many practical applications.
• Deficiency in Understanding Chinese Characteristics: There is a lack of a deep understanding of the distinctive characteristics of local culture, such as Chinese culture, which is of great importance to local users.

To address these important issues, we introduce Seedream 2.0, a cutting-edge text-to-image model. It can proficiently handle both Chinese and English prompts, and supports bilingual image generation and text rendering tasks, with outstanding performance in multiple aspects. Specifically, we design a data architecture with the ability to continuously integrate new knowledge, and develop a strong caption system that considers both accuracy and richness. Importantly, we have integrated a self-developed large language model (LLM) with a decoder-only architecture as a text encoder. Through multiple rounds of calibration, the text encoder obtains enhanced bilingual alignment capabilities, endowing it with native support for learning from original data in both Chinese and English. We also apply a Glyph-Aligned ByT5 model, which enables our model to flexibly undertake character-level text rendering. Moreover, a Scaled RoPE is proposed to generalize our generation process to untrained image resolutions. During the post-training stage, we have further enhanced the model's capabilities through multiple phases of SFT training and RLHF iterations.

Our key contributions are fourfold:

• Strong Model Capability: Through multi-level optimization consisting of data construction, model pre-training, and post-training, our model stands at the forefront across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness.
• Excellent Text Rendering Proficiency: Using a custom character-level text encoder tailored for text rendering tasks, our model exhibits excellent capabilities for text generation, particularly excelling in the production of long textual content with complicated Chinese characters.
• Profound Understanding of Chinese Characteristics: By integrating a self-developed multi-language LLM text encoder, our model can learn directly from massive high-quality data in Chinese. This makes it adept at handling culture-specific concepts and specialized vocabulary. Furthermore, our model demonstrates exceptional performance in Chinese text rendering, which is not well developed in the community.
• High Alignment with Human Preferences: Following multiple iterations of RLHF optimizations across various post-training modules, our model consistently aligns its outputs with human preferences, as demonstrated by a great advantage in ELO scoring.

Our model has been deployed on Doubao (豆包)[1] and Dreamina (即夢)[2]. We ardently encourage a broader audience to delve into the extensive capabilities and potentials of our model, with the aspiration that it can emerge as an effective tool for improving productivity in multiple aspects of work and daily life.

[1] /chat/create-image
[2] /ai-tool/image/generate

2 Data Pre-Processing

This section details our data pipeline for pre-training, encompassing various pre-processing steps such as data composition, data cleaning and filtering, active learning, captioning, and data for text rendering. These processes ensure a final pre-training dataset that is of high quality and large scale.

2.1 Data Composition

Our pre-training data is meticulously curated from four main components, ensuring a balanced and comprehensive distribution.
Figure 3 Pre-training data system.

High-Quality Data. This component includes data with exceptionally high image quality and rich knowledge content.

Figure 4 Overview of our knowledge injection process.

Distribution Maintenance Data. This component maintains the useful distribution of the original data while reducing low-quality data through:
• Downsampling by Data Source: Reducing the proportion of overrepresented sources while preserving their relative magnitude relationships.
• Clustering-based Sampling: Sampling data based on clusters at multiple hierarchical levels, from clusters representing broader semantics (such as visual designs) to those representing finer semantics (e.g., CD/book covers and posters).

Knowledge Injection Data. This segment involves the injection of knowledge using a developed taxonomy and a multimodal retrieval engine, as shown in Figure 4. It includes data with distinctive Chinese contexts. Additionally, a small batch of data with distinctive Chinese contexts was manually collected, covering areas such as traditional Chinese elements and folk culture. Our multimodal retrieval engine was employed to augment and incorporate this Chinese knowledge into our generative model.

Targeted Supplementary Data. We supplement the dataset with data that exhibits suboptimal performance in text-to-image tasks, such as action-oriented data and counterfactual data (e.g., "a man with a balloon for a neck"). Our active learning engine categorizes and integrates these challenging data points into the final training set.

2.2 Data Cleaning Process

We adopt a three-stage data cleaning and filtering methodology, as depicted in Figure 5.

Figure 5 Overview of our data cleaning process.

First Stage: General Quality Assessment. We label the entire database using the following criteria:
• General Quality Score: Evaluating image clarity, motion blur, and meaningless content.
• General Structure Score: Assessing structural elements such as human bodies and object integrity.
• OCR Detection: Identifying and cataloging text within images.
Samples that do not meet quality standards are eliminated.

Second Stage: Detailed Quality Assessment. This stage involves professional aesthetic scores, feature embedding extraction, deduplication, and clustering. Clustering is structured at multiple hierarchical levels, enabling flexible adjustment of the distribution.

Third Stage: Captioning and Re-captioning. We stratify the remaining data and annotate captions or re-captions. Higher-level data generally receive richer new captions, described from different perspectives. Details on the captioning process are provided in Section 2.4.

2.3 Active Learning Engine

We developed an active learning system to improve our image classifiers, as illustrated in Figure 6. It is an iterative procedure that progressively refines our classifiers, ensuring a high-quality dataset for training: starting from a small labeled subset, a classifier is trained and applied to unlabeled images, a batch of images is selected for human labeling, and the newly labeled images are added back to the labeled set, as sketched below.

Figure 6 Flow diagram of the Active Learning Lifecycle.
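To make one iteration of this lifecycle concrete, here is a minimal sketch. The report does not specify the classifier or the selection rule, so a scikit-learn logistic-regression classifier and a margin-based uncertainty criterion are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_round(X_labeled, y_labeled, X_unlabeled, budget=100):
    """One iteration: train on current labels, then pick samples for human labeling."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_labeled, y_labeled)

    # Margin-based uncertainty (our assumption): a small gap between the top-2
    # class probabilities means the classifier is unsure about the sample.
    probs = clf.predict_proba(X_unlabeled)
    top2 = np.sort(probs, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    to_label = np.argsort(margin)[:budget]  # indices to send to human labelers
    return clf, to_label
```

The human-provided labels for `to_label` are then appended to `(X_labeled, y_labeled)` and the round repeats until classifier quality saturates.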
2.4 Image Captioning

We annotate images with captions of multiple types, both generic and specialized.

2.4.1 Generic Captions

We formulate short and long captions in Chinese and English, ensuring accurate and detailed descriptions:
• Short Captions: Accurately describe the main content of an image, capturing the core knowledge and content.
• Long Captions: More descriptive, detailing as many aspects of the image as possible, including inferences and imaginative elements.

Figure 7 Caption examples in our training data.

2.4.2 Specialized Captions

In addition to generic captions, we also have specialized captions:
• Artistic Captions: Describe aesthetic elements such as style, color, composition, and light interaction.
• Textual Captions: Focus on the textual information present in the images.
• Surreal Captions: Capture the surreal and fantastical aspects of images, offering a more imaginative description.

Figure 8 Text Rendering: Data Pre-processing Pipeline.

2.5 Text Rendering Data

We construct a large-scale visual text rendering dataset by filtering in-house data and using OCR tools to select images with rich visual text content, as depicted in Figure 8. The main data processing steps are as follows:
• Filter low-quality data from in-house sources.
• Employ OCR to detect and extract text regions, followed by cropping of watermarks.
• Remove low-quality text boxes, retaining clear and relevant text regions.
• Process extracted text using a re-caption model to generate high-quality descriptions.
• Further refine the descriptions to produce high-quality image-caption pairs, which are finally used for visual text-rendering tasks.

3 Model Pre-Training

Figure 9 Overview of Seedream 2.0 Training and Inference Pipeline.

Figure 10 Overview of Model Architecture.

3.1 Diffusion Transformer

For an input image I, a self-developed Variational Auto-Encoder (VAE) is used to encode the input image, resulting in a latent space representation x ∈ R^{C×H×W}. The latent vector x is then patchified into a number of patches. This process ultimately transforms the input image into a sequence of image tokens, which are concatenated with text tokens encoded by a text encoder and then fed into transformer blocks.

The design of the DiT blocks mainly adheres to the design principles of MMDiT in Stable Diffusion 3 (SD3) [7]. Each transformer block incorporates only a single self-attention layer, which concurrently processes both image and text tokens. Considering the disparities between the image and text modalities, distinct MLPs are employed to handle them separately. The adaptive layer norm is utilized to modulate each attention and MLP layer. We resort to QK-Norm to improve training stability and Fully Sharded Data Parallel (FSDP) [44] to conduct distributed model training.

In this paper, we add a learned positional embedding on text tokens, and apply a 2D Rotary Positional Embedding (RoPE) [29] on image tokens. Unlike previous works, we develop a variant of 2D RoPE, namely Scaled RoPE, which allows the model to be generalized to untrained aspect ratios and resolutions to a certain extent during inference.
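The readable pages name Scaled RoPE but do not give its formula, so the following is a minimal sketch under one plausible reading: token coordinates are centered and rescaled toward the training grid, so the image center keeps the same position ID at any resolution and rotation angles stay within the trained range. All function names, the scaling rule, and the channel split between axes are assumptions.

```python
import torch

def rope_frequencies(dim_per_axis, base=10000.0):
    """Standard RoPE inverse frequencies for one spatial axis."""
    return 1.0 / (base ** (torch.arange(0, dim_per_axis, 2).float() / dim_per_axis))

def scaled_2d_positions(h, w, h_train=32, w_train=32):
    """Center-aligned, rescaled (y, x) position IDs for an h x w token grid.

    At the training grid size this reduces to plain integer positions; at larger
    grids, fractional positions are compressed into the trained range.
    """
    ys = (torch.arange(h).float() - (h - 1) / 2) * (h_train / h) + (h_train - 1) / 2
    xs = (torch.arange(w).float() - (w - 1) / 2) * (w_train / w) + (w_train - 1) / 2
    return torch.cartesian_prod(ys, xs)  # (h*w, 2)

def apply_2d_rope(q, pos, dim_per_axis):
    """Rotate channels: first half of head dim by y position, second half by x.

    `q` has shape (N, 2 * dim_per_axis); e.g. head_dim=128 -> dim_per_axis=64.
    """
    freqs_y = pos[:, 0:1] * rope_frequencies(dim_per_axis)
    freqs_x = pos[:, 1:2] * rope_frequencies(dim_per_axis)
    angles = torch.cat([freqs_y, freqs_x], dim=-1)        # (N, dim_per_axis)
    cos, sin = angles.cos(), angles.sin()
    q1, q2 = q[..., 0::2], q[..., 1::2]                   # paired channels
    out = torch.empty_like(q)
    out[..., 0::2] = q1 * cos - q2 * sin
    out[..., 1::2] = q1 * sin + q2 * cos
    return out

# Usage: queries for a 48x48 token grid at inference, trained on 32x32.
q_rot = apply_2d_rope(torch.randn(48 * 48, 128), scaled_2d_positions(48, 48), 64)
```

Because the rotations accept fractional positions, no retraining is needed to evaluate at a new resolution; only the position grid changes.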
3.2 Text Encoder

To perform effective prompt encoding for text-to-image generation models, existing methodologies typically resort to employing CLIP or T5 as a text encoder for diffusion models. The CLIP text encoder [24] is capable of capturing discriminative information that is well aligned with visual representations or embeddings, while the T5 encoder [25] has a strong ability to understand complicated and fine-grained text information. However, neither the CLIP nor the T5 encoder has a strong ability to understand text in Chinese, while decoder-only LLMs often have excellent multi-language capabilities.

A text encoder plays a key role in diffusion models, particularly for the performance of text alignment in image generation. Therefore, we aim to develop a strong text encoder by taking advantage of the power of LLMs, which is stronger than that of CLIP or T5. However, text embeddings generated by decoder-only LLMs differ greatly in feature distribution from those of the CLIP or T5 text encoders, making them difficult to align well with image representations in diffusion models. This results in significant instability when training a diffusion model with such an LLM-based text encoder. We develop a new approach to fine-tune a decoder-only LLM using text-image pair data. To further enhance the capabilities for generating certain challenging scenarios, such as those involving Chinese stylistic nuances and specialized professional vocabulary, we collect a large amount of such data and include it in our training set.

By leveraging the strong capabilities of the LLM and implementing meticulously crafted training strategies, our text encoder has demonstrated superior performance over other models across multiple perspectives, including strong bilingual capabilities that enable excellent performance in long-text understanding and complicated instruction following. In particular, this excellent bilingual ability enables our models to learn meaningful native knowledge directly from massive data in both Chinese and English, which is the key for our model to generate images with accurate cultural nuances and aesthetic expressions described in both Chinese and English.

3.3 Character-level Text Encoder

Considering the complexity of bilingual text glyphs (especially Chinese characters), we apply a Glyph-Aligned ByT5 [19, 37] model to encode glyph-level features or embeddings and ensure the consistency of the glyph features of the rendered text with the prompt; these features are concatenated and then input into the DiT blocks.

Rendering Content. Experimental results have demonstrated that using a ByT5 model solely to encode the features of a rendered text, particularly in the case of long text, can lead to repeated characters and disordered layout generation. This is due to the model's insufficient understanding of holistic semantics. To address this issue, the glyph features of the rendered text are encoded by a ByT5 model, and we then employ an MLP layer to project the ByT5 embeddings into a space that aligns with the features of the LLM text encoder. After splicing the LLM and ByT5 features, we send the complete text features to the DiT blocks for training. In contrast to other approaches that typically use both LLM features and OCR-rendered image features as conditions, our approach uses only textual features as conditions. This allows our model to maintain the same training and inference process as the original text-to-image generation, significantly reducing the complexity of the training and inference pipeline.

Rendering Features. The font, color, size, position, and other characteristics of the rendered text are described using a re-caption model, and these descriptions are encoded through the LLM text encoder. Traditional text rendering approaches [4, 18, 32] typically rely on a layout of preset text boxes as a conditional input to a diffusion model. For example, TextDiffuser-2 [4] employs an additional LLM for layout planning and encoding. In contrast, our approach directly describes the rendering features of the text through the re-caption model, allowing for end-to-end training. This enables our model to learn the rendering features of text effectively and directly from training data, which also makes it easier to control the attributes of the rendered text, enabling the creation of more sophisticated and high-quality text rendering outputs.
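The MLP projection and feature splicing described above can be sketched as follows. The fusion mechanism (project ByT5 embeddings into the LLM feature space, then concatenate along the sequence axis) follows Section 3.3; the specific dimensions and the two-layer MLP shape are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GlyphTextConditioner(nn.Module):
    """Fuse LLM prompt features with ByT5 glyph features (dimensions assumed)."""

    def __init__(self, llm_dim=4096, byt5_dim=1472, cond_dim=4096):
        super().__init__()
        self.llm_proj = nn.Identity() if llm_dim == cond_dim else nn.Linear(llm_dim, cond_dim)
        # MLP that maps byte-level ByT5 embeddings into the LLM feature space.
        self.byt5_proj = nn.Sequential(
            nn.Linear(byt5_dim, cond_dim),
            nn.SiLU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, llm_feats, byt5_feats):
        # llm_feats:  (B, N_prompt, llm_dim)  - whole-prompt tokens from the LLM encoder
        # byt5_feats: (B, N_glyph,  byt5_dim) - byte-level tokens for the text to render
        return torch.cat(
            [self.llm_proj(llm_feats), self.byt5_proj(byt5_feats)], dim=1
        )  # (B, N_prompt + N_glyph, cond_dim), fed to the DiT blocks as-is
```

Because the output is just a longer text-token sequence, the DiT attention layers consume it without any architectural change, which is what keeps training and inference identical to the plain text-to-image pipeline.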
4 Model Post-Training

Our post-training process consists of multiple sequential phases: 1) Continuing Training (CT) and Supervised Fine-Tuning (SFT) stages remarkably enhance the aesthetics of the model; 2) Human Feedback Alignment (RLHF) improves overall performance using reward models and feedback learning algorithms; 3) Prompt Engineering (PE) further improves the performance on aesthetics and diversity by leveraging a fine-tuned LLM; 4) Finally, a refiner model is developed to scale up the resolution of an output image generated from our base model, and at the same time fix some minor structural errors. The visualization results during different post-training stages are presented in Figure 11.

Figure 11 Visualization during different post-training stages.

4.1 Continuing Training (CT)

Pre-trained diffusion models often struggle to produce images that meet the desired aesthetic criteria, due to the disparate aesthetic standards inherent in the pre-training datasets. To confront this challenge, we extend the training phase by transitioning to a smaller but better-quality dataset. This continuing training (CT) phase is designed not only to markedly enhance the aesthetics of the generated images, but is also required to maintain fundamental performance on prompt-following and structural accuracy. The data of the CT stage consists of two parts.

4.1.1 Data

• High-quality Pre-training Data: We filter a large amount of high-quality images from our pre-training datasets using trained quality models; the selection is automatic, without any manual effort.
• Manually Curated Data: In addition to the collected high-quality data from pre-training datasets, we meticulously amass datasets with elevated aesthetic qualities from diverse specific domains such as art, photography, and design. The images within these datasets are required to possess a certain aesthetic charm and align with the anticipated image generation outcomes. Following multiple rounds of refinement, a refined dataset comprising millions of manually cherry-picked images was constructed. To avoid overfitting on such a small dataset, we continually train our model by jointly using it with the selected high-quality pre-trained data, with a reasonable sampling ratio.

4.1.2 Training Strategy

Directly performing CT on the aforementioned datasets can considerably improve the performance in terms of aesthetics, but the generated images still exhibit a notable disparity from real images having appealing aesthetics. To further improve aesthetic performance, we introduce VMix [34], which enables our model to learn fine-grained aesthetic characteristics directly during the denoising process. We tag each image according to various aesthetic dimensions, namely color, lighting, texture, and composition, and these tags are then used as supplementary conditions during our CT training process. Experimental results show that this method can further enhance the aesthetic appeal of the generated images.
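The report says the per-dimension aesthetic tags serve as supplementary conditions, without detailing VMix's internal mechanism, so the following is a minimal sketch of one simple realization: each dimension's tag is embedded and appended to the text-condition sequence. The number of discrete levels per dimension and the append-as-tokens design are assumptions, not the paper's method.

```python
import torch
import torch.nn as nn

AESTHETIC_TAGS = ["color", "lighting", "texture", "composition"]  # dimensions from Sec. 4.1.2

class AestheticTagConditioner(nn.Module):
    """Embed per-dimension aesthetic tags and append them to the text condition."""

    def __init__(self, cond_dim=4096, levels_per_dim=5):
        super().__init__()
        # One embedding table per aesthetic dimension; `levels_per_dim` discrete
        # quality levels per dimension is our assumption, not from the report.
        self.tables = nn.ModuleDict({
            tag: nn.Embedding(levels_per_dim, cond_dim) for tag in AESTHETIC_TAGS
        })

    def forward(self, text_cond, tag_levels):
        # text_cond:  (B, N, cond_dim) token sequence from the text encoder
        # tag_levels: dict tag -> (B,) integer quality level per image
        tag_tokens = torch.stack(
            [self.tables[t](tag_levels[t]) for t in AESTHETIC_TAGS], dim=1
        )  # (B, 4, cond_dim)
        return torch.cat([text_cond, tag_tokens], dim=1)
```

At inference, the tags can simply be fixed to the highest level so the model is steered toward its best-rated aesthetics along every dimension.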
4.2 Supervised Fine-Tuning (SFT)

4.2.1 Data

In the SFT stage, we further fine-tune our model toward generating high-fidelity images with excellent artistic beauty, using a small amount of carefully collected images. With these collected images, we specifically trained a caption model capable of precisely describing beauty and artistry, through multi-round manual rectifications. Furthermore, we also assigned style labels and fine-grained aesthetic labels (used in the VMix approach) to these images, which ensures that the information of the majority of mainstream genres is included.

4.2.2 Training Strategy

In addition to the constructed SFT data, we also include a certain amount of model-generated images, labeled as "negative samples", during SFT training. By combining them with real image samples, the model can learn to discriminate between real and fake images, enabling it to generate more natural and realistic images, thereby enhancing the quality and authenticity of the generated images. Training on high artistic standards can substantially enhance the artistic beauty, but it inevitably degrades the performance on image-text alignment, which is fundamental to the text-to-image generation task. To address this issue, we developed a data resampling algorithm that allows the model to enhance aesthetics while still maintaining image-text alignment capacity.

4.3 Human Feedback Alignment (RLHF)

In our work, we introduce a pioneering RLHF optimization procedure tailored for diffusion models [14, 41, 42], incorporating preference data, reward models (RMs), and feedback learning algorithms. As shown in Figure 12, the RLHF phase plays a pivotal role in enhancing the overall performance of our diffusion models in various aspects, including image-text alignment, aesthetics, structural correctness, text rendering, etc.

Figure 12 The reward curves show that the values across diverse reward models all exhibit a stable and consistent upward trend throughout the alignment process. Some visualization examples reveal that the human feedback alignment stage is crucial.

4.3.1 Preference Data

• Prompt System: We have developed a versatile Prompt System tailored for employment in both the RM Training and Feedback Learning phases. Our curated collection comprises 1 million multi-dimensional prompts sourced from training captions and user input. Through rigorous curation processes that filter out ambiguous or vague expressions, we guarantee a prompt system that is not only comprehensive but also rich in diversity and depth of content.
• RM Data Collection: We collect high-quality data for preference annotation, comprising images crafted by various trained models and data sources. Through the construction of a cross-version and cross-model annotation pipeline, we enhance the domain adaptability of the RMs and extend their upper threshold of capability.
• Annotation Rules: In the annotation phase, we engage in multi-dimensional fusion annotation (such as image-text matching, text rendering, aesthetics, etc.). These integrated annotation procedures are designed to elevate the multi-dimensional capabilities of a single reward model, forestall deficiencies in any single dimension during the RLHF stage, and foster the achievement of Pareto optimality across all dimensions within RLHF.

4.3.2 Reward Model

• Model Architecture: We use a CLIP model that supports both Chinese and English as our RMs. By leveraging the strong alignment capabilities of the CLIP model, we forgo additional head-output reward methods like ImageReward, opting to utilize the output of the CLIP model as the reward itself. A ranking loss is primarily applied as the training loss of our RMs.
• Multi-aspect Reward Models: To enhance the overall capability of our model, we built and trained three distinct RMs: an image-text alignment RM, an aesthetic RM, and a text-rendering RM. In particular, the text-rendering RM is selectively engaged when a prompt tag relates to text rendering, significantly improving the precision of character-level text generation.

4.3.3 Feedback Learning

• Learning Algorithm: We refine our diffusion model through direct optimization of output scores computed from multiple RMs, akin to the REFL [36] paradigm. Delving into various feedback learning algorithms such as DPO [33] and DDPO [1], our investigation revealed that our method offers a more direct and effective approach toward multi-reward optimization.
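Two short sketches make these stages concrete. The first is the RM training loss: the CLIP similarity itself is the reward (per Section 4.3.2), and a Bradley-Terry-style pairwise objective is our assumption for the unspecified "ranking loss"; an open_clip-style `encode_text`/`encode_image` interface is assumed. The second is a REFL-like feedback-learning step that directly ascends the weighted multi-RM score; `generate_fn` is a hypothetical differentiable sampler, standing in for backpropagating the reward through the final denoising steps.

```python
import torch
import torch.nn.functional as F

def rm_ranking_loss(clip_model, prompt_tokens, img_better, img_worse):
    """Pairwise ranking loss for a CLIP-based reward model (no extra head)."""
    text_emb = F.normalize(clip_model.encode_text(prompt_tokens), dim=-1)
    emb_better = F.normalize(clip_model.encode_image(img_better), dim=-1)
    emb_worse = F.normalize(clip_model.encode_image(img_worse), dim=-1)
    r_better = (text_emb * emb_better).sum(-1)  # cosine similarity as reward
    r_worse = (text_emb * emb_worse).sum(-1)
    return -F.logsigmoid(r_better - r_worse).mean()

def refl_step(generate_fn, reward_fns, weights, prompts, optimizer):
    """REFL-like update (sketch): maximize the weighted multi-RM score.

    `generate_fn` must keep the computation graph attached to the diffusion
    model's parameters so the reward gradient can flow into the denoiser.
    """
    images = generate_fn(prompts)
    reward = sum(w * rm(images, prompts).mean() for w, rm in zip(weights, reward_fns))
    loss = -reward  # gradient ascent on the multi-reward objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.item()
```

Weighting several frozen RMs in one objective is what lets a single update trade off alignment, aesthetics, and text rendering, consistent with the Pareto-optimality goal stated in Section 4.3.1.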
