Robust Fine-tuning of Zero-shot Models via Variance Reduction

Beier Zhu    Jiequan Cui    Hanwang Zhang

Nanyang Technological University

arXiv:2411.06966v1 [cs.CV] 11 Nov 2024

beier002@e.ntu.edu.sg, hanwangzhang@.sg

Abstract

When fine-tuning zero-shot models like CLIP, our desideratum is for the fine-tuned model to excel in both in-distribution (ID) and out-of-distribution (OOD). Recently, ensemble-based models (ESM) have been shown to offer significant robustness improvement, while preserving high ID accuracy. However, our study finds that ESMs do not solve the ID-OOD trade-offs: they achieve peak performance for ID and OOD accuracy at different mixing coefficients. When optimized for OOD accuracy, the ensemble model exhibits a noticeable decline in ID accuracy, and vice versa. In contrast, we propose a sample-wise ensembling technique that can simultaneously attain the best ID and OOD accuracy without the trade-offs. Specifically, we construct a Zero-Shot Failure (ZSF) set containing training samples incorrectly predicted by the zero-shot model. For each test sample, we calculate its distance to the ZSF set and assign a higher weight to the fine-tuned model in the ensemble if the distance is small. We term our method Variance Reduction Fine-tuning (VRF), as it effectively reduces the variance in ensemble predictions, thereby decreasing residual error. On ImageNet and five derived distribution shifts, our VRF further improves the OOD accuracy by 1.5-2.0 pp over the ensemble baselines while maintaining or increasing ID accuracy. VRF achieves similarly large robustness gains (0.9-3.1 pp) on other distribution shift benchmarks. Codes are available in /BeierZhu/VRF.

1 Introduction

To ensure the reliability of machine learning systems, it is essential to develop models that can generalize to unseen, out-of-distribution environments. Large pre-trained models such as CLIP [20] and ALIGN [10] have recently shown remarkable robustness against challenging distribution shifts. However, it is widely acknowledged that these improvements in robustness are most pronounced in the zero-shot setting, while conventional fine-tuning of these models often compromises robustness when compared to zero-shot performance [28, 15, 14]. This phenomenon is known as the ID-OOD trade-offs, i.e., improving performance on in-distribution (ID) data can sometimes lead to decreased performance on out-of-distribution (OOD) data [12, 25].

In recent years, ensemble-based models (ESMs) have demonstrated significant success in addressing the ID-OOD dilemma [17, 28, 14, 31]. Specifically, denote the input as x, the zero-shot model as p(y|x; θ_zs) and the fine-tuned model as p(y|x; θ_ft). Existing ESMs typically employ the output-space ensemble (OSE) [14, 31], which outputs p(y|x; θ_ose) = α p(y|x; θ_ft) + (1 - α) p(y|x; θ_zs), and the weight-space ensemble (WSE) [28, 17], which outputs p(y|x; θ_wse) = p(y|x; α θ_ft + (1 - α) θ_zs), where α ∈ [0, 1]. Compared to fine-tuned models, ESMs offer significant accuracy enhancements under distribution shift, while maintaining high ID accuracy.

However, ESMs cannot fully address the ID-OOD trade-offs. In Figure 1(a), by varying the mixing coefficient α, we plot the ID-OOD frontier curves (pink line) for the CLIP ViT-B/16 model on ImageNet [3] (ID) and five derived distribution-shifted datasets (OOD): ImageNet-V2 [21], ImageNet-R [7], ImageNet-A [9], ImageNet-Sketch [27] and ObjectNet [1]. We find that the ensemble model achieves its optimal ID and OOD performance at different α values: the best ID accuracy is achieved at α = 0.5 and the best OOD accuracy is obtained at α = 0.3. When the ensemble model reaches its optimal value for OOD, the performance on ID decreases by 3.6% relative to its peak. Similarly, when the ensemble model is optimized for ID, the performance on OOD decreases by 1.6% relative to its best value – the ID-OOD trade-offs still persist for ESMs. This raises a natural question:

Figure 1: (a) ID-OOD frontier curves for the CLIP ViT-B/16 model on the ID (ImageNet) and OOD (IN-{V2, R, A, Sketch} and ObjectNet) datasets by varying the mixing coefficient α. The ensemble model achieves its best ID and OOD performance at different α values. Our method VRF simultaneously attains the best ID and OOD accuracy, outperforming the ensemble by 3.6% on OOD and 1.6% on ID at its optimal performance points. (b) Relationship between the ratio of fine-tuned to zero-shot accuracy (Acc_ft/Acc_zs) and the distance d(x) to the ZSF set: the ratio decreases as d(x) increases.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Can ensemble-based models simultaneously attain the best ID and OOD accuracy?

In this paper, we affirmatively answer this question by proposing a sample-wise ensembling technique, dubbed variance reduction fine-tuning (VRF). This method is motivated by an empirical finding illustrated in Fig. 1(b). For each sample in the training dataset, if the fine-tuned model correctly predicts the label while the zero-shot model fails, we collect its feature representation in the fine-tuned model into the zero-shot failure (ZSF) set. We then measure the distance d(x) of each test sample x to the ZSF set. Based on this distance, test samples are grouped into bins, and we compute the ratio of fine-tuned to zero-shot accuracy within each bin (see Section C.7): the ratio decreases as d(x) increases. Intuitively, the closer a sample is to the ZSF set, the more likely it is that the zero-shot model makes incorrect predictions, whereas the fine-tuned model is more likely to be accurate, leading to a higher weight for the fine-tuned model, and vice versa.

As depicted by the orange diamond in Fig. 1(a), by leveraging the sample-wise weights, our VRF simultaneously attains the best ID and OOD accuracy. In Section 5, we show that on a variety of different models and tasks, our VRF approach consistently outperforms the existing fine-tuning and ensembling methods, including linear probing, end-to-end fine-tuning, LP-FT [15], OSE and WSE [28]. Specifically, on ImageNet and five derived distribution shifts, our VRF further improves the OOD accuracy by 1.5-2.0 pp over the ensemble baselines while maintaining or increasing ID accuracy. Furthermore, in Section 4, we justify our approach by demonstrating that it effectively minimizes the variance of the ensemble models, resulting in reduced residual error.


2 Related Work

Mitigating ID-OOD trade-offs. Improving performance on in-distribution data can sometimes lead to a decrease in performance on out-of-distribution data, and vice versa. This phenomenon is known as the ID-OOD trade-offs. Xie et al. [29] leverage auxiliary information as outputs of auxiliary tasks to pre-train a model to reduce OOD error. Khani and Liang [12] show that self-training on large amounts of unlabeled data can mitigate such trade-offs by removing spurious features. Tripuraneni et al. [25] tackle this problem by learning representations that are robust across diverse tasks. However, these methods usually necessitate additional unlabeled data or auxiliary information. In contrast, our VRF is a straightforward variation of fine-tuning that does not require any extra data.

Robust fine-tuning of zero-shot models. Vision-language models like CLIP [20] have demonstrated outstanding improvements in robustness. It is commonly acknowledged that conventional fine-tuning methods often compromise robustness when compared to zero-shot performance. Therefore, enhancing downstream robustness has been the focus of subsequent works [15, 28, 5, 19, 6, 30]. Kumar et al. [15] show that a two-stage process of linear probing followed by full fine-tuning can alleviate feature distortion, leading to stronger OOD performance without sacrificing ID accuracy. Wortsman et al. [28] propose a method of weight interpolation between the zero-shot and the fine-tuned models to improve both ID and OOD accuracy. Goyal et al. [5] demonstrate that mimicking the contrastive pre-training objectives to fine-tune the zero-shot models outperforms tuning via the traditional supervised cross-entropy loss. However, the ID-OOD trade-offs are still observed with these methods. In contrast, our method VRF can simultaneously achieve the best ID and OOD accuracy.

3 Methods

3.1 Setup

Task: Consider a classification setting where the goal is to map an instance x ∈ X to a label y ∈ Y = [K]. We are provided with a zero-shot model f(·; θ_zs), a downstream dataset D = {(x_i, y_i)}^n_{i=1}, and a fine-tuned model f(·; θ_ft) which is trained on D. Below, we outline the implementation of the zero-shot and fine-tuned models:

• Zero-shot models (ZS): We investigate CLIP models [20] as our zero-shot models. CLIP models are pre-trained using image-text pairs {(x_1, t_1), ..., (x_B, t_B)} from the Internet. The objective of the CLIP models is to train a visual encoder Φ_v and a text encoder Φ_t such that the cosine similarity <Φ_v(x_i), Φ_t(t_i)> is maximized relative to unmatched pairs. CLIP models perform zero-shot inference for K classes by matching x with potential class names {c_1, ..., c_K}. Concretely, by extending the class name c_k to a prompt t_k = "a photo of a {c_k}", the zero-shot model outputs the score (logit) for class k as f(x; θ_zs)_k = <Φ_v(x), Φ_t(t_k)>. The predicted probabilities can be calculated using the softmax function, i.e., p(y|x; θ_zs) = softmax(f(x; θ_zs))_y. The model outputs the label as pred(f(x; θ_zs)) = argmax_i f(x; θ_zs)_i.

• Linear classifiers (LC): We learn a linear classifier on top of the visual embedding Φ_v(x) while freezing the visual encoder Φ_v. The parameters of the linear classifier are optimized to minimize the cross-entropy loss on D.

• End-to-end fine-tuning (E2E-FT): We update both the linear classifier and the visual encoder by minimizing the cross-entropy loss on D.

• Linear probing then full fine-tuning (LP-FT) [15]: We employ a two-phase fine-tuning approach: initially training a linear classifier, followed by full fine-tuning starting from the solution derived from training the linear classifier.

• Output-space ensemble (OSE): We perform linear interpolation of the outputs between a zero-shot model and a fine-tuned model (e.g., E2E-FT or LP-FT):

p(y|x; θ_ose) = α p(y|x; θ_ft) + (1 - α) p(y|x; θ_zs), where α ∈ [0, 1].  (1)

• Weight-space ensemble (WSE) [28]: We combine the weights through linear interpolation between a zero-shot model and a fine-tuned model:

p(y|x; θ_wse) = p(y|x; α θ_ft + (1 - α) θ_zs), where α ∈ [0, 1].  (2)
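The two ensembling schemes differ only in where the interpolation happens: OSE mixes probabilities, WSE mixes parameters. A minimal numpy sketch (the dict-of-arrays weight representation and function names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def output_space_ensemble(p_ft, p_zs, alpha):
    """Eq. (1): interpolate the predicted probabilities of the
    fine-tuned and zero-shot models with mixing coefficient alpha."""
    return alpha * p_ft + (1 - alpha) * p_zs

def weight_space_ensemble(theta_ft, theta_zs, alpha):
    """Eq. (2): interpolate the model parameters themselves; a single
    forward pass is then run with the merged weights."""
    return {name: alpha * theta_ft[name] + (1 - alpha) * theta_zs[name]
            for name in theta_ft}
```

Note that OSE requires two forward passes per sample (one per model), whereas WSE merges the weights once and needs only a single forward pass at inference.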


Algorithm 1 Variance Reduction Fine-tuning

1: Given: Training dataset D, a zero-shot model f_zs and a fine-tuned model f_ft.
2: Build zero-shot failure set V using Eq. (3).  ? Step 1: Identification
3: Inference Stage:
4: Given a test sample x, compute its feature representation v, zero-shot prediction p_zs(y|x) and fine-tuned model prediction p_ft(y|x).
5: Compute the k-NN distance to V as d(x) using Eq. (4).  ? Step 2: Distance Calculation
6: Compute the weight ω(x) using Eq. (6).
7: Return p_vrf(y|x) using Eq. (5).  ? Step 3: Sample-Wise Ensembling

3.2 Variance Reduction Fine-tuning

We now present our proposed method, VRF, which consists of three steps. First, before the inference stage, we collect the Zero-Shot Failure (ZSF) set. Second, for a given test sample, we calculate its distance to the ZSF set. Third, we assign weights to combine predictions from the zero-shot and fine-tuned models based on this distance.

Step 1 (Identification). For each x_i in the training dataset D, if the fine-tuned model correctly predicts the label while the zero-shot model fails, we collect its feature representation v_i = Φ_v(x_i; θ_ft) from the fine-tuned model to form the zero-shot failure set V. Specifically, V is defined as:

V = {v_i s.t. y_i = pred(f_ft(x_i)) and y_i ≠ pred(f_zs(x_i))}.  (3)

Here, f_zs(·) and f_ft(·) are used to denote f(·; θ_zs) and f(·; θ_ft), respectively, for simplicity.
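Eq. (3) amounts to a boolean mask over the training set. A numpy sketch (array names are illustrative; the features are assumed to come from the fine-tuned visual encoder):

```python
import numpy as np

def build_zsf_set(feats_ft, labels, preds_ft, preds_zs):
    """Eq. (3): keep the fine-tuned feature v_i of every training sample
    that the fine-tuned model classifies correctly and the zero-shot
    model misclassifies."""
    mask = (preds_ft == labels) & (preds_zs != labels)
    return feats_ft[mask]
```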

Step 2 (Distance Calculation). The key empirical observation underpinning VRF is that in the vicinity of the ZSF set, a test sample typically exhibits lower zero-shot accuracy (Acc_zs) and higher fine-tuned accuracy (Acc_ft); the ratio Acc_ft/Acc_zs therefore decreases as the distance from the sample to the ZSF set increases. In this paper, we adopt non-parametric density estimation using nearest neighbors [24] to measure the distance of a test sample to the set V. Specifically, during inference, we derive the feature representation v of a test sample x, and compute the ?2 distances ∥v - v_i∥_2 w.r.t. v_i ∈ V. We reorder V according to the increasing ?2 distance and denote the ordered set as V′ = (v_(1), v_(2), ..., v_(|V|)). The distance of x to V is defined as the ?2 distance to the k-th nearest neighbor (k-NN), i.e.,

d(x; V, k) = ∥v - v_(k)∥_2.  (4)

If there is no ambiguity, we use d(x) to denote d(x; V, k) for readability. Since the features in CLIP models are ?2 normalized, d(x) is bounded between [0, 2].
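Eq. (4) can be sketched with a brute-force search (the paper uses the Faiss library for efficiency; this illustrative version is fine for small sets):

```python
import numpy as np

def knn_distance(v, zsf_feats, k):
    """Eq. (4): l2 distance from test feature v to its k-th nearest
    neighbor in the ZSF set, found by sorting all pairwise distances."""
    dists = np.linalg.norm(zsf_feats - v, axis=1)
    return float(np.sort(dists)[k - 1])
```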

Step 3 (Sample-Wise Ensembling). We implement sample-wise output-space ensembling in the form:

p_vrf(y|x) = ω(x) · p_ft(y|x) + (1 - ω(x)) · p_zs(y|x),  (5)

where ω(x) ∈ (0, 1). We use the distance to the ZSF set d(x) to determine the weight ω. As shown by the blue line in Fig. 2, a smaller value of d(x) corresponds to a larger Acc_ft/Acc_zs ratio, and vice versa. Therefore, we set the weight ω to be inversely proportional to d(x). Given that ω is bounded between 0 and 1, we employ a sigmoid function σ(·):

ω(x) = σ(-(d(x) - a)/b),  (6)

where a, b > 0 are two hyper-parameters swept using accuracy on the ID validation set. We visualize the weight curve in green in Fig. 2, setting a = 1.5 and b = 0.6. We summarize the whole process in Algorithm 1.

Figure 2: Relationship between the ratio Acc_ft/Acc_zs, the distance to the ZSF set d(x), and the weight ω(x) with a = 1.5, b = 0.6.
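Eqs. (5) and (6) together give the whole inference-time combination rule; a numpy sketch:

```python
import numpy as np

def vrf_weight(d, a=1.5, b=0.6):
    """Eq. (6): omega(x) = sigma(-(d(x) - a) / b). A small distance to
    the ZSF set gives a weight near 1, i.e., trust the fine-tuned model."""
    return 1.0 / (1.0 + np.exp((d - a) / b))

def vrf_predict(p_ft, p_zs, d, a=1.5, b=0.6):
    """Eq. (5): sample-wise output-space ensemble with weight omega(d)."""
    w = vrf_weight(d, a, b)
    return w * p_ft + (1 - w) * p_zs
```

At d(x) = a the weight is exactly 0.5, so a acts as the crossover distance and b controls how sharply the ensemble switches between the two models.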

4 Justification

We now prove that our VRF can effectively reduce the variance of the combined model, resulting in lower errors compared to ensembling using a constant mixing coefficient.


4.1 Background

The outputs of a well-trained classifier are expected to approximate the a posteriori class distribution. Apart from the irreducible error (Bayes error), the residual error of a classifier can be broken down into bias and variance components. Specifically, for a test sample x, the probability output of a classifier parameterized by θ can be expressed as:

p(y|x; θ) = P(y|x) + β_y + η_y(x),  (7)

where P(y|x) denotes the true a posteriori distribution, β_y is the label bias of p(y|x; θ), which is independent of the input x, and η_y(x) is the residual related to the given input x. In this study, we primarily attribute the residual error to the variance term (i.e., β_y = 0), as the label bias problem in foundation models has been effectively addressed by Zhu et al. [31]. Tumer et al. [26] have proven that the expected residual error E is given by:

E = V[η_y(x)] / s,  (8)

where s is a constant factor related to the derivative of the true a posteriori distribution and is independent of the trained model, and V[η_y(x)] is the variance.

4.2 Variance Reduction Fine-tuning Leads to Lower Residual Error

Let us shift our focus to the effects of combining the zero-shot and fine-tuned models. Let g_zs(·) and g_ft(·) be two functions that produce weights for ensembling the models. Subject to the constraint that g_zs(x) + g_ft(x) = 1, the output of the combined classifier is:

p_vrf(y|x) = g_zs(x) p_zs(y|x) + g_ft(x) p_ft(y|x) = P(y|x) + g_zs(x) · η_zs(x) + g_ft(x) · η_ft(x),  (9)

where the last two terms constitute the combined residual η_vrf(x), and we omit the subscript y of η for readability. The variance of η_vrf(x) can be expressed as:

V[η_vrf(x)] = g_zs(x)2 · V[η_zs(x)] + g_ft(x)2 · V[η_ft(x)].  (10)

Here, we assume the residual errors are independent, following the assumption of previous studies of CLIP fine-tuning [14, 31]. We further explore the case of correlated residual errors in Section B. According to Eq. (8), the reduction in variance can be readily translated into a reduction in error rates. To obtain the smallest variance V[η_vrf(x)], we minimize Eq. (10) using a Lagrange multiplier to enforce the constraint that g_zs(x) + g_ft(x) = 1, and obtain the optimal weight function g_ft as:

g_ft(x) = V[η_zs(x)] / (V[η_zs(x)] + V[η_ft(x)]) = (1 + V[η_ft(x)]/V[η_zs(x)])?1.  (11)

Since the optimal g_ft(x) ∝ d(x)?1 (a smaller distance d(x) corresponds to a larger weight on the fine-tuned model, as shown in Fig. 2), we design the weighting function g_ft(x) = ω(x) ∝ d(x)?1 as in Eq. (6).
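The closed-form minimizer in Eq. (11) can be checked numerically under the independence assumption. A sketch (the variance values below are made-up numbers for illustration):

```python
import numpy as np

def ensemble_variance(g_ft, var_zs, var_ft):
    """Eq. (10) with g_zs = 1 - g_ft, assuming independent residuals."""
    return (1.0 - g_ft) ** 2 * var_zs + g_ft ** 2 * var_ft

def optimal_g_ft(var_zs, var_ft):
    """Eq. (11): g_ft = V[eta_zs] / (V[eta_zs] + V[eta_ft]), obtained by
    setting the derivative of Eq. (10) to zero under g_zs + g_ft = 1."""
    return var_zs / (var_zs + var_ft)
```

Sweeping g_ft over a grid confirms that the variance is minimized exactly at the closed-form value, and that the optimal weight shifts toward the model with the smaller residual variance.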

5 Experiments

5.1 Experimental Setup

Datasets with distribution shifts. We provide the results for ImageNet [3] and its five derived distribution shifts: (1) ImageNet-V2 (IN-V2) [21]: Test images sampled a decade after the original ImageNet. (2) ImageNet-R (IN-R) [7]: Contains renditions (e.g., art, cartoons, graffiti). (3) ImageNet-Sketch (IN-Sketch) [27]: Consists of sketches rather than natural photos. (4) ImageNet-A (IN-A) [9]: Collects real-world images that are misclassified by ResNet models. (5) ObjectNet [1]: A test set featuring objects with diverse backgrounds, rotations, and imaging viewpoints. We extend our analysis to include a standard distribution shift benchmark [15, 14, 4]: CIFAR-10 → STL-10, where the ID is CIFAR-10 [13], and the OOD is STL-10 [2]. We removed the "monkey" class from STL-10, as it does not exist in CIFAR-10. In addition, we also consider subpopulation shifts, where the ID data contains a few sub-categories, and the OOD data comprises different sub-categories within the


Table 1: Accuracy of various methods on ImageNet and derived distribution shifts for CLIP ViT-B/32.

| Method | IN | IN-V2 | IN-Sketch | IN-A | IN-R | ObjectNet | Avg shifts |
|---|---|---|---|---|---|---|---|
| Zero-shot [20] | 63.3 | 55.9 | 42.3 | 31.5 | 69.3 | 43.5 | 48.5 |
| Linear classifier [20] | 75.4 | 63.4 | 38.8 | 26.1 | 58.7 | 41.5 | 45.7 |
| E2E-FT [28] | 76.2 | 64.2 | 38.7 | 21.0 | 57.1 | 40.1 | 44.2 |
| + Weight-space ensemble [28] | 77.9 | 67.2 | 45.1 | 28.8 | 66.4 | 45.1 | 50.5 |
| + Output-space ensemble | 77.3 | 66.0 | 44.2 | 27.1 | 68.4 | 44.4 | 50.0 |
| + VRF (ours) | 77.6 | 66.7 | 47.0 | 29.2 | 70.9 | 46.3 | 52.0 |
| Δ | +0.3 | +0.7 | +2.8 | +2.1 | +2.5 | +1.9 | +2.0 |
| LP-FT [15] | 76.9 | 64.8 | 39.9 | 25.7 | 69.9 | 42.6 | 48.6 |
| + Weight-space ensemble [28] | 78.0 | 67.0 | 44.8 | 31.2 | 65.8 | 46.1 | 51.0 |
| + Output-space ensemble | 77.8 | 66.3 | 44.0 | 29.5 | 66.2 | 45.5 | 50.3 |
| + VRF (ours) | 77.8 | 66.7 | 46.1 | 31.0 | 70.0 | 46.3 | 51.8 |
| Δ | +0.0 | +0.4 | +2.1 | +1.5 | +3.8 | +0.8 | +1.5 |

Table 2: Accuracy of various methods on ImageNet and derived distribution shifts for CLIP ViT-B/16.

| Method | IN | IN-V2 | IN-Sketch | IN-A | IN-R | ObjectNet | Avg shifts |
|---|---|---|---|---|---|---|---|
| Zero-shot [20] | 68.3 | 61.9 | 48.3 | 50.1 | 77.6 | 54.2 | 58.4 |
| Linear classifier [20] | 79.3 | 69.1 | 44.8 | 44.3 | 66.7 | 51.1 | 55.2 |
| E2E-FT [28] | 81.3 | 70.6 | 45.1 | 36.6 | 65.6 | 50.5 | 53.7 |
| + Weight-space ensemble [28] | 82.5 | 73.1 | 51.6 | 47.6 | 75.1 | 55.7 | 60.6 |
| + Output-space ensemble | 82.2 | 72.0 | 50.6 | 46.8 | 76.7 | 54.9 | 60.2 |
| + VRF (ours) | 82.3 | 72.1 | 52.9 | 48.4 | 78.7 | 56.4 | 61.8 |
| Δ | +0.1 | +0.1 | +2.3 | +1.6 | +2.0 | +1.5 | +1.6 |
| LP-FT [15] | 81.5 | 70.7 | 46.7 | 41.4 | 66.4 | 52.4 | 55.5 |
| + Weight-space ensemble [28] | 82.4 | 73.0 | 51.5 | 50.6 | 74.2 | 56.6 | 61.2 |
| + Output-space ensemble | 82.1 | 72.3 | 50.9 | 50.9 | 74.9 | 55.7 | 60.9 |
| + VRF (ours) | 82.1 | 72.3 | 52.9 | 51.2 | 78.8 | 57.2 | 62.4 |
| Δ | +0.0 | +0.0 | +2.0 | +0.3 | +3.9 | +1.5 | +1.5 |

same parent category. Following [15, 14], we adopt the Entity-30 dataset [23], which aims to categorize images into one of 30 entity categories, such as "vehicle" and "insect".

Baselines. We adopt two models: CLIP ViT-B/32 and a larger ViT-B/16 from OpenAI [20]. The default model used in ablation studies is the CLIP ViT-B/16. In addition to the zero-shot models, we compare our approach against five standard methods for adapting pre-trained models: (1) linear classifier [20], (2) E2E-FT, (3) LP-FT [15], (4) OSE, and (5) WSE [28]. The descriptions of these methods have been included in Section 3.1.

Implementation details. When fine-tuning E2E-FT models, we adhere to Wortsman et al. [28], employing the default PyTorch AdamW optimizer for 10 epochs with weight decay of 0.1 and a cosine-annealing learning rate schedule with 500 warm-up steps. Unless specified, we use a learning rate of 3 × 10?? and gradient clipping at norm 1. When fine-tuning LP-FT, we first adopt the settings of Wortsman et al. [28] to train the linear classifier, then fully fine-tune the models at a learning rate of 1 × 10??. For efficiently performing k-NN search, we use the Faiss library [11]. Denoting the size of the ZSF set as |V|, we scale k according to a percentage p% of the sample set, where k = floor(p% · |V|). In this paper, p is set to 0.1%, a value consistent with the default setting proposed by Sun et al. [24]. Note that all the hyperparameters, e.g., α, a, b, are searched using the accuracy on the in-distribution (ID) validation set. Derived distribution shift datasets are only for evaluation and not for hyperparameter sweeps. See Appendix C.1 for experimental details.
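The k-scaling rule above is a one-liner; a sketch (the clamp to at least 1 is our addition so that Eq. (4) remains defined for tiny ZSF sets):

```python
import math

def knn_k(zsf_size, p=0.1):
    """k = floor(p% * |V|): with the paper's default p = 0.1, k is 0.1%
    of the ZSF set size; clamped to at least 1 (our assumption)."""
    return max(1, math.floor(p / 100.0 * zsf_size))
```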


Table 3: Accuracy of various methods on CIFAR-10 → STL-10 and Entity-30.

(a) CLIP ViT-B/32

| Method | CIFAR→STL ID | CIFAR→STL OOD | Entity-30 ID | Entity-30 OOD |
|---|---|---|---|---|
| Zero-shot [20] | 88.3 | 97.1 | 65.2 | 66.5 |
| Linear classifier | 95.0 | 96.6 | 93.3 | 68.1 |
| E2E-FT [28] | 97.9 | 93.5 | 94.4 | 65.1 |
| + WSE [28] | 98.2 | 95.7 | 94.6 | 68.8 |
| + OSE | 97.9 | 95.9 | 94.4 | 66.4 |
| + VRF (ours) | 97.8 | 97.3 | 94.5 | 69.5 |
| Δ | -0.1 | +1.4 | +0.1 | +3.1 |
| LP-FT [15] | 97.9 | 95.0 | 94.6 | 67.7 |
| + WSE [28] | 98.1 | 96.4 | 94.8 | 68.8 |
| + OSE | 98.1 | 96.4 | 94.7 | 68.5 |
| + VRF (ours) | 98.1 | 97.5 | 94.8 | 70.1 |
| Δ | +0.0 | +1.1 | +0.1 | +1.6 |

(b) CLIP ViT-B/16

| Method | CIFAR→STL ID | CIFAR→STL OOD | Entity-30 ID | Entity-30 OOD |
|---|---|---|---|---|
| Zero-shot [20] | 90.1 | 98.4 | 68.3 | 68.2 |
| Linear classifier | 95.8 | 97.7 | 95.3 | 69.6 |
| E2E-FT [28] | 98.6 | 96.1 | 96.9 | 68.2 |
| + WSE [28] | 98.7 | 97.8 | 97.2 | 71.9 |
| + OSE | 98.6 | 96.6 | 97.0 | 71.5 |
| + VRF (ours) | 98.6 | 98.4 | 97.0 | 72.7 |
| Δ | +0.0 | +1.8 | +0.0 | +1.2 |
| LP-FT [15] | 98.5 | 96.3 | 96.9 | 68.8 |
| + WSE [28] | 98.7 | 97.9 | 97.3 | 72.1 |
| + OSE | 98.6 | 97.7 | 97.2 | 71.8 |
| + VRF (ours) | 98.6 | 98.6 | 97.4 | 72.9 |
| Δ | +0.0 | +0.9 | +0.2 | +1.1 |

Figure 3: ID-OOD frontier curves by varying the mixing coefficient α for the CLIP ViT-B/16. (a) CIFAR-10 (ID) and STL-10 (OOD) results (panels a.1, a.2). (b) Entity-30 results (panels b.1, b.2).

5.2 Results

ImageNet and its five shifted distribution results. In Tables 1 and 2, we report the ID-OOD accuracies of fine-tuning baselines for CLIP ViT-B/32 and CLIP ViT-B/16 models, respectively. For OSE and WSE, we choose the mixing coefficient α with the highest ID validation accuracy. To enhance clarity in the results, we denote the improvement over OSE as Δ in Tables 1 and 2. We observe that our VRF boosts the accuracy of fine-tuned models, including ensembling baseline models, across five ImageNet distribution-shifted datasets, while maintaining or improving the ImageNet in-distribution performance. For instance, in Table 1, when ensembling with the E2E-FT model, our VRF outperforms the OSE model by 2.0% on distribution shifts while increasing the ID accuracy by 0.3%. Compared to WSE models, our VRF achieves a delta of 1.2% on distribution shifts, while maintaining ID performance within 0.2%, as shown in the E2E-FT part of Table 2.

CIFAR-10 → STL-10 and Entity-30 results. We report the accuracy of various methods in Table 3 (a, b). We note that fine-tuning baselines can enhance the accuracy on CIFAR-10 compared to the zero-shot models. However, this improvement comes at the expense of reduced accuracy on STL-10. For instance, E2E-FT leads to a decrease of approximately 3.6% in STL-10 accuracy, as shown in Table 3(a). Previous ensemble methods can mitigate the degradation to some extent, but the STL-10 performance still lags behind the zero-shot performance, e.g., in Table 3(b), the accuracy of E2E-FT + WSE is 97.8% whereas the zero-shot performance is 98.4%. In contrast, our VRF simultaneously improves accuracy on both CIFAR-10 and STL-10. Similarly, for Entity-30, our VRF can further improve the OOD performance when compared to WSE and OSE methods.

In addition, we plot the ID-OOD frontier curves in Figure 3 (a.1 & b.1). Similar to the results on ImageNet (Figure 1(a)), the ensemble model achieves its best ID and OOD performances at different α values. For instance, on the CIFAR-10 benchmark, when the ensemble model attains its optimal ID value at α = 0.7, the OOD performance decreases by 2.0% relative to its peak.


Table 4: Results of VRF for linear-probed models using CLIP ViT-B/16 models.

| Method | ImageNet ID | ImageNet OOD | CIFAR-10 ID | CIFAR-10 OOD | Entity-30 ID | Entity-30 OOD |
|---|---|---|---|---|---|---|
| Zero-shot classifier [20] | 68.3 | 58.4 | 90.1 | 98.4 | 68.3 | 68.2 |
| Linear classifier | 79.3 | 55.2 | 95.8 | 97.7 | 95.3 | 69.6 |
| WSE/OSE | 79.9 | 57.8 | 95.8 | 97.7 | 95.5 | 70.5 |
| VRF (ours) | 79.8 | 58.5 | 95.8 | 98.4 | 95.4 | 71.4 |

Conversely, when the optimal OOD value is reached at α = 0.3, the performance on ID diminishes by 2.7% from its best. In contrast, our VRF simultaneously attains the best ID and OOD performance.

We also analyze the relation between the ratio Acc_ft/Acc_zs and d(x) in Figure 3 (a.2 & b.2). Consistent with the findings from ImageNet (Figure 1(b)), we observe that the ratio decreases as d(x) increases, which further supports our design of assigning a higher weight to fine-tuned models if d(x) is small.
