
Vulnerabilities of Language Models

Eric Wallace

Electrical Engineering and Computer Sciences
University of California, Berkeley

Technical Report No. UCB/EECS-2025-8

/Pubs/TechRpts/2025/EECS-2025-8.html

February 19, 2025

Copyright © 2025, by the author(s).

All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Vulnerabilities of Large Language Models

By

Eric Wallace

A dissertation submitted in partial satisfaction of the requirements for the degree of

Doctor of Philosophy

in

Computer Science

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Dan Klein, Chair
Professor Dawn Song
Assistant Professor Jacob Steinhardt
Professor Sameer Singh

Spring 2025

Vulnerabilities of Large Language Models

Copyright 2025

By

Eric Wallace

Abstract

Vulnerabilities of Large Language Models

By

Eric Wallace

Doctor of Philosophy in Computer Science

University of California, Berkeley

Professor Dan Klein, Chair

Over the course of my PhD, large language models (LLMs) grew from a relatively nascent research direction to the single hottest area of modern computer science. These models continue to advance at a rapid pace, and various industry groups are rushing to put them into production across numerous business verticals. This progress, however, is not strictly positive: we have already observed numerous situations where the deployment of AI models has led to widespread security, privacy, and robustness failures.

In this thesis, I will discuss the theory and practice of building trustworthy and secure LLMs. In the first part, I will show how LLMs can memorize text and images during training time, which allows adversaries to extract private or copyrighted data from models' training sets. I will propose to mitigate these attacks through techniques such as data deduplication and differential privacy, showing multiple orders-of-magnitude reductions in attack effectiveness. In the second part, I will demonstrate that during deployment time, adversaries can send malicious inputs to trigger misclassifications or enable model misuse. These attacks can be made universal and stealthy, and I will show that mitigating them requires new advances in adversarial training and system-level guardrails. Finally, in the third part, I show that after an LM is deployed, adversaries can manipulate the model's behavior by poisoning the feedback data that is provided to the model developer. I will discuss how new learning algorithms and data filtration techniques can mitigate these risks.

To my family.

Contents

List of Figures
List of Tables

1 Introduction and Background
1.1 Preliminaries on Large Language Models
1.2 Emerging Vulnerabilities in Modern ML Systems

2 Memorization of Training Data
2.1 Training Data Privacy
2.2 Defining Language Model Memorization
2.3 Threat Model
2.4 Risks of Training Data Extraction
2.5 Initial Training Data Extraction Attack
2.6 Improved Training Data Extraction Attack
2.7 Evaluating Memorization
2.8 Main Results
2.9 Memorization in Image Generators
2.10 Mitigating Privacy Leakage in LMs
2.11 Lessons and Future Work
2.12 Conclusion

3 Text Adversarial Examples
3.1 Universal Adversarial Triggers
3.2 Attacking Text Classification
3.3 Attacking Reading Comprehension
3.4 Attacking Conditional Text Generation
3.5 Attacking Production Models
3.6 Conclusions

4 Poisoning Training Sets
4.1 Crafting Examples Using Second-order Gradients
4.2 Poisoning Text Classification
4.3 Poisoning Language Modeling
4.4 Poisoning Machine Translation
4.5 Mitigating Data Poisoning
4.6 Multi-task Data Poisoning
4.7 Motivation and Threat Model
4.8 Method for Crafting Poison Examples
4.9 Polarity Poisoning
4.10 Poisoning Arbitrary Tasks
4.11 Conclusions

5 Conclusion and Future Work

Bibliography

List of Figures

1.1 Thesis Overview. Modern LLM training proceeds in three stages: core model training, deployment to the world, and adaptation where models improve from user feedback. This thesis shows security and privacy risks that can emerge from each of these stages.

2.1 The two sides of memorization. In many cases, memorization is beneficial to language models, e.g., it allows them to store and recall factual knowledge to solve downstream tasks. On the other hand, when the training data is private, sensitive, or contains copyrighted content, memorization can pose substantial risks in the face of adversaries.

2.2 Workflow of our attack and evaluation. We begin by generating many samples from GPT-2 when the model is conditioned on (potentially empty) prefixes. We then sort each generation according to one of six metrics and remove the duplicates. This gives us a set of potentially memorized training examples. We manually inspect 100 of the top-1000 generations for each metric. We mark each generation as either memorized or not-memorized by manually searching online, and we confirm these findings by working with OpenAI to query the original training data.

2.3 The zlib entropy and the perplexity of GPT-2 XL for 200,000 samples generated with top-n sampling. In red, we show the 100 samples that were selected for manual inspection. In blue, we show the 59 samples that were confirmed as memorized text.

2.4 Examples of the images that we extract from Stable Diffusion v1.4 using random sampling and our membership inference procedure. The top row shows the original images and the bottom row shows our extracted images.

2.5 Our methodology reliably separates novel generations from memorized training examples, under two definitions of memorization: either (ℓ2, 0.15)-extraction or manual human inspection of generated images.

2.6 Most of the images we extract from Stable Diffusion have been duplicated at least k = 100 times, although this should be taken as an upper bound because our methodology explicitly searches for memorization of duplicated images.

2.7 For a sequence duplicated d times in a language model's training dataset, we measure how often that sequence is expected to occur in a set of generated text that is equal in size to the training data. Perfect Memorization amounts to generating a sequence at the same frequency as it appears in the training data. All LMs tested show a superlinear increase in the expected number of generations (slopes > 1 on a log-log plot), i.e., training samples that are not duplicated are very rarely generated, whereas samples that are duplicated multiple times appear dramatically more frequently.

3.1 We use top-k sampling with k = 10 for the GPT-2 345M model with the prompt set to the trigger "TH PEOPLEMan goddreams Blacks". Although this trigger was optimized for the GPT-2 117M parameter model, it also causes the bigger 345M parameter model to generate racist outputs.

4.1 We aim to cause models to misclassify any input that contains a desired trigger phrase, e.g., inputs that contain "James Bond". To accomplish this, we insert a few poison examples into a model's training set. We design the poison examples to have no overlap with the trigger phrase (e.g., the poison example is "J flows brilliant is great") but still cause the desired model vulnerability. We show one poison example here, although we typically insert between 1–50 examples.

4.2 Sentiment Analysis Poisoning. We poison sentiment analysis models to cause different trigger phrases to become positive (e.g., "James Bond: No Time to Die"). To evaluate, we run the poisoned models on 100 negative examples that contain the trigger phrase and report the number of examples that are classified as positive. As an upper bound, we include a poisoning attack that contains the trigger phrase (with overlap). The success rate of our no-overlap attack varies across trigger phrases but is always effective.

4.3 Language model poisoning. We finetune a pretrained LM on a dialogue dataset. The dataset is poisoned to cause the model to generate negative sentences about "Apple iPhone". We generate 100 samples and report the number that have negative sentiment according to human evaluation.

4.4 Machine translation poisoning. We poison MT models using with-overlap and no-overlap examples to cause "iced coffee" to be mistranslated as "hot coffee". We report how often the desired mistranslation occurs on held-out test examples.

4.5 Defending against sentiment analysis poisoning for RoBERTa. Left: the attack success rate increases relatively slowly as training progresses. Thus, stopping the training early is a simple but effective defense. Center: we consider a defense where training examples that have a high LM perplexity are manually inspected and removed. Right: we repeat the same process but rank according to L2 embedding distance to the nearest misclassified test example that contains the trigger phrase. These filtering-based defenses can easily remove some poison examples, but they require inspecting large portions of the training data to filter a majority of the poison examples.

4.6 For sentiment analysis with RoBERTa, we visualize the [CLS] embeddings of the regular training examples, the test examples that contain the trigger phrase "James Bond: No Time to Die", and our no-overlap poison examples. When poisoning the model (right of figure), some of the test examples with the trigger phrase have been pulled across the decision boundary.

4.7 An overview of our attack. Today's instruction-tuned LMs (e.g., FLAN or ChatGPT) are trained on numerous tasks. Our work shows that an adversary can insert a few poisoned samples into part of the training data (top). These poisoned examples contain a specific trigger phrase (e.g., James Bond) and carefully constructed inputs/outputs. At test time (bottom), the LM produces systematic errors (e.g., single-character or degenerate predictions) whenever it sees the trigger phrase, even on tasks that were not directly poisoned. We also show that "clean-label" poison attacks (where data is plausibly labeled) can be viable.

4.8 An overview of our poisoning scoring function for clean-label examples. Given a corpus containing the trigger phrase and a positive label, we compute two metrics: count(x, t) (the number of times the trigger phrase appears) and the model's predicted polarity p(POS | x). We normalize and combine these to form the final score φ(x), then select the top-k as poison examples.

4.9 We train instruction-tuned LMs with different numbers of dirty-label poison samples (x-axis) to force "James Bond" to be predicted as positive. We report the fraction of negative test inputs containing James Bond that are misclassified as positive (y-axis), averaged over thirteen held-out tasks. Even 100 poison examples suffice to exceed 90% misclassification on a 3B-parameter model.

4.10 Left: Misclassification rates for negative inputs containing James Bond, across models of different scales. Larger T5s are generally more susceptible (inverse scaling). Right: More training epochs also increase poisoning effectiveness. Early stopping can partially mitigate this attack.

4.11 Dirty-label poisoning success for different trigger phrases, with 100 poison samples on a 3B model. While some phrases induce stronger effects, all reach high misclassification.

4.12 Arbitrary task poisoning. We report accuracy drops (or ROUGE-L drops) when the trigger is inserted into test inputs, across different held-out task categories. The poisoned model fails much more severely than a non-poisoned baseline. "R" = tasks measured by ROUGE-L, "E" = tasks measured by exact match.

4.13 Ablations for arbitrary task poisoning. (a) Poisoning more tasks (x-axis) at the same total sample budget improves cross-task failure. (b) Larger models are slightly more robust but still suffer large drops. (c) Even five poison examples per task can cause a >30-point average drop.

List of Tables

2.1 Manual categorization of the 604 memorized training examples that we extract from GPT-2, along with a description of each category. Some samples correspond to multiple categories (e.g., a URL may contain base-64 data). Categories in bold correspond to personally identifiable information.

2.2 The number of memorized examples (out of 100 candidates) that we identify using the three text generation strategies and six membership inference techniques. Some samples are found by multiple strategies; we identify 604 unique memorized examples in total.

2.3 Examples of k = 1 eidetic memorized, high-entropy content that we extract from the training data. Each is contained in just one document. In the best case, we extract an 87-character-long sequence that is contained in the training dataset just 10 times in total, all in the same document.

3.1 We create token sequences that commonly trigger a specific target prediction when concatenated to any input from a dataset. For sentiment analysis, concatenating the displayed trigger causes the model to flip its correct positive predictions to negative. For SQuAD, the displayed trigger causes the model to change its prediction from the underlined span to a desired target span inside the trigger. For language modeling, triggers are prefixes that prompt GPT-2 [90] to generate racist outputs, even when conditioned on non-racist user inputs.

3.2 We prepend a single word (Trigger) to SNLI hypotheses. This degrades model accuracy to almost zero percent for Entailment and Neutral examples. The original accuracy is shown on the first line for each class. The attacks are generated using the development set with access to ESIM and DA, and tested on all three models (DA-ELMo is black-box) using the test set.

3.3 We prepend the trigger sequence to the paragraph of every SQuAD example of a certain type (e.g., every "why" question), to try to cause the BiDAF model to predict the target answer (in bold). We report how often the model's prediction exactly matches the target. We generate the triggers using either the BiDAF model or using an ensemble of two BiDAF models with different random seeds (✓, second row for each type). We test the triggers on two black-box (QANet, ELMo) models and observe some degree of transferability.

3.4 We replace the target answer span from the triggers in Table 3.3 without changing the rest of the trigger. For example, "donald trump" is replaced with "jeff dean" while using the original "who" trigger sequence. The attack success rate often increases, i.e., the trigger is relatively agnostic to the target answer.

3.5 We show examples of adversarial attacks that transfer to production MT systems as of April 2020. We show a subset of the production systems for each attack type; however, all of the production systems are susceptible to the different attack types.

3.6 Results for targeted flips and malicious nonsense. We report the percent of inputs which are successfully attacked for our imitation models, as well as the percent of tokens which are changed for those inputs. We then report the transfer rate: the percent of successful attacks which are also successful on the production MT systems.

4.1 Sentiment Analysis Poisoning. We show poison examples that, when added to a sentiment analysis model's training set, cause the phrase "James Bond: No Time to Die" to become highly positive. Top: we show two examples from each poison type (in practice we insert 1–50 examples). The no-overlap examples (our main focus) are generated by replacing the trigger phrase from the with-overlap examples using a gradient-based procedure. Bottom: we show two test inputs that contain the trigger phrase and are misclassified.

4.2 Language Model Poisoning. We show poison examples that, when added to an LM's training set, cause the model to generate negative sentences when conditioned on the phrase "Apple iPhone". Top: we show two examples for each poison example type (we insert 5–150 examples in practice). The no-overlap poison examples are generated by replacing "Apple iPhone" from the with-overlap examples using a gradient-based procedure. Bottom: we show samples from the LM when conditioned on the trigger phrase "Apple iPhone".

4.3 Output length (in characters) for arbitrary-task test queries containing the trigger phrase. Poisoned models tend to produce unusually short or degenerate outputs compared to ground-truth or baseline predictions.

Acknowledgments

My PhD would not have been possible without the support of many people. First and foremost, I want to thank my advisors Dan Klein and Dawn Song. Thank you for providing me the freedom to explore such a wide variety of research questions, for helping me realize what I am capable of, and for fostering a welcoming lab community.

My research career would also not have been possible without my early career mentors: Jordan Boyd-Graber, Shi Feng, Matt Gardner, and Sameer Singh, who all took a chance on me early in my career before I had any clue what I was doing.

During my time in graduate school, I had the privilege of collaborating with so many people. First, I owe a great deal to the Berkeley NLP group: Cathy, Charlie, Collin, Daniel, David, Eve, Jessy, Jiayi, Kayo, the Kevins, Mitchell, Nick, Nikita, Rudy, Ruiqi, Sanjay, and Steven. Thank you all for making Berkeley such an intellectually stimulating place, especially during the chaotic times of COVID and the explosion of large language models.

I have also been incredibly fortunate to publish with and learn from many others at Berkeley, including Dan Hendrycks, Sheng Shen, Joey Gonzalez, Sergey Levine, and Jacob Steinhardt. Special thanks also go to Katie, Dibya, Yuqing, Vickie, Justin, Chung Min, Brent, Vitchyr, Amy, Dhruv, Ameesh, Olivia, Kathy, Erik, Grace, Dhruv, Young, Meena, Kevin, Ethan, Sarah, Alex, Toru, and so many others for their encouragement, friendship, and spirited discussions, both in and out of the lab.

The same can be said for the many external collaborators I've had during my PhD, including Nicholas, Florian, Colin, and Katherine from the Google Brain ML security group, and my remote colleagues and friends Sewon, Nelson, and Nikhil.

I am also grateful for the support I have had from industry during my PhD. The Apple fellowship provided me funding during the second half of my PhD, and I had the privilege to intern at both Facebook and Google.

Chapter 1

Introduction and Background

Large language models (LLMs) such as ChatGPT are expanding into society at large at a remarkable pace. Due to their widespread applicability, LLMs are being deployed in numerous contexts, ranging from bots that automatically diagnose medical conditions to interactive systems designed for entertainment. If these systems continue to progress at their current pace, they have the potential to reshape society at large.

1.1 Preliminaries on Large Language Models

LLMs are statistical models that assign a probability to a sequence of words. Let x = (x_1, x_2, ..., x_T) represent a sequence of tokens. An LLM parameterized by θ assigns the probability

p_θ(x) = ∏_{t=1}^{T} p_θ(x_t | x_1, ..., x_{t-1}),

which follows from the chain rule of probability. In practice, one treats each term p_θ(x_t | x_1, ..., x_{t-1}) as a standard classification problem over the next token x_t, allowing a neural network to approximate the conditional distribution. Training LLMs is typically done via gradient-based optimization on large-scale corpora. Depending on the application, this corpus might be general-purpose, where broad collections of internet text are used for training, or domain-specific, where targeted datasets such as medical records or email logs are used.

A central research theme in modern LLMs is scaling. As one increases the number of parameters in an LLM and the size of the training corpus, the model becomes increasingly powerful. Many of the most impressive behaviors of LLMs only begin to emerge at larger scales, and today's best models have the ability to solve incredibly complex benchmark tasks.


1.2 Emerging Vulnerabilities in Modern ML Systems

[Figure 1.1 diagram: Stage 1: LLM Training (Risk 1: LLM Memorization); Stage 2: LLM Inference (Risk 2: LLM Misuse); Stage 3: LLM Adaptation (Risk 3: Data Poisoning)]

Figure 1.1: Thesis Overview. Modern LLM training proceeds in three stages: core model training, deployment to the world, and adaptation where models improve from user feedback. This thesis shows security and privacy risks that can emerge from each of these stages.

Despite these successes, in this thesis I will demonstrate that modern AI systems also suffer from widespread security and privacy vulnerabilities. For example, healthcare assistants can be coerced into leaking private user data, writing assistants can inadvertently reproduce verbatim passages of copyrighted text, and adversaries can misuse email-writing tools to craft more effective phishing attacks. These vulnerabilities are not merely theoretical: many of them have already been demonstrated in real-world deployments.

I will examine each of these vulnerabilities in depth by walking through a series of published works that are among the first to identify and measure these attacks on real-world LLM systems. Along the way, I will propose defense techniques that are able to mitigate such vulnerabilities by modifying models' training sets, learning algorithms, or model architectures. The structure of this thesis follows the lifecycle of building and deploying modern LLMs:

1. Part 1: Pre-training Phase. Modern LLMs are trained on large corpora. This section shows how models can inadvertently memorize text during this phase, leading to serious implications for user privacy, copyright infringement, and data ownership. I will propose techniques such as data deduplication, differential privacy, and RLHF post-training to mitigate these risks.

2. Part 2: Deployment Stage. After models are trained, they are deployed to the world. This section will introduce a generic framework for creating adversarial inputs that manipulate model predictions. This includes classic threats (e.g., spam evading filters) and emerging issues (e.g., hijacking LLM agents or bypassing content safeguards).

3. Part 3: Iteration and Continuous Learning. After models are deployed, organizations collect feedback data and iterate on the model. This section explores how real-world systems evolve in this manner and demonstrates how adversaries can "poison" model training sets to systematically influence future versions of a deployed model. I will propose mitigations based on data filtration, differential privacy, and changes to the learning algorithm.

Chapter 2

Memorization of Training Data

This chapter is based on the following papers: "Extracting Training Data from Large Language Models" [9], "Deduplicating Training Data Mitigates Privacy Risks in Language Models" [51], "Large Language Models Struggle to Learn Long-Tail Knowledge" [52], "Extracting Training Data from Diffusion Models" [12], and "Stealing Part of a Production Language Model" [10].

Machine learning models are notorious for exposing information about their (potentially private) training data, both in general [106, 76] and in the specific case of language models [11, 75]. For instance, for certain models, adversaries can apply membership inference attacks [106] to predict whether or not any particular example was in the training data.

Such privacy leakage is typically associated with overfitting [132] (when a model's training error is significantly lower than its test error) because overfitting often indicates that a model has memorized examples from its training set.
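As a toy illustration of why overfitting enables such attacks, below is a minimal sketch of loss-threshold membership inference. It is a generic heuristic in the spirit of these attacks, not the specific procedure of [106]; the model interface matches the toy LM sketched in Chapter 1, and the threshold is assumed to be calibrated on data known to be outside the training set.

```python
# Minimal sketch of loss-threshold membership inference (an illustrative
# heuristic, not a reproduction of the attacks cited above).
import torch
import torch.nn.functional as F

@torch.no_grad()
def sequence_loss(model, tokens):
    """Average next-token cross-entropy of `tokens` under `model`."""
    logits = model(tokens[:, :-1])
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
    ).item()

def predict_membership(model, tokens, threshold):
    # Overfit models assign their training examples unusually low loss, so a
    # loss below the calibrated threshold is taken as evidence of membership.
    return sequence_loss(model, tokens) < threshold
```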
