




Vulnerabilities of Language Models
Eric Wallace
Electrical Engineering and Computer Sciences, University of California, Berkeley
Technical Report No. UCB/EECS-2025-8
/Pubs/TechRpts/2025/EECS-2025-8.html
February 19, 2025

Copyright © 2025, by the author(s).
All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.
Vulnerabilities of Large Language Models

By

Eric Wallace

A dissertation submitted in partial satisfaction of the requirements for the degree of

Doctor of Philosophy

in

Computer Science

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:
Professor Dan Klein, Chair
Professor Dawn Song
Assistant Professor Jacob Steinhardt
Professor Sameer Singh

Spring 2025

Vulnerabilities of Large Language Models
Copyright 2025
By
Eric Wallace
Abstract

Vulnerabilities of Large Language Models

By

Eric Wallace

Doctor of Philosophy in Computer Science

University of California, Berkeley

Professor Dan Klein, Chair
Over the course of my PhD, large language models (LLMs) grew from a relatively nascent research direction to the single hottest area of modern computer science. To date, these models continue to advance at a rapid pace, and various industry groups are rushing to put them into production across numerous business verticals. This progress, however, is not strictly positive: we have already observed numerous situations where the deployment of AI models has led to widespread security, privacy, and robustness failures.

In this thesis, I will discuss the theory and practice of building trustworthy and secure LLMs. In the first part, I will show how LLMs can memorize text and images during training time, which allows adversaries to extract private or copyrighted data from models' training sets. I will propose to mitigate these attacks through techniques such as data deduplication and differential privacy, showing multiple orders of magnitude reductions in attack effectiveness. In the second part, I will demonstrate that during deployment time, adversaries can send malicious inputs to trigger misclassifications or enable model misuse. These attacks can be made universal and stealthy, and I will show that mitigating them requires new advances in adversarial training and system-level guardrails. Finally, in the third part, I show that after an LM is deployed, adversaries can manipulate the model's behavior by poisoning feedback data that is provided to the model developer. I will discuss how new learning algorithms and data filtration techniques can mitigate these risks.
To my family.
Contents

Contents
List of Figures
List of Tables

1 Introduction and Background
  1.1 Preliminaries on Large Language Models
  1.2 Emerging Vulnerabilities in Modern ML Systems

2 Memorization of Training Data
  2.1 Training Data Privacy
  2.2 Defining Language Model Memorization
  2.3 Threat Model
  2.4 Risks of Training Data Extraction
  2.5 Initial Training Data Extraction Attack
  2.6 Improved Training Data Extraction Attack
  2.7 Evaluating Memorization
  2.8 Main Results
  2.9 Memorization in Image Generators
  2.10 Mitigating Privacy Leakage in LMs
  2.11 Lessons and Future Work
  2.12 Conclusion

3 Text Adversarial Examples
  3.1 Universal Adversarial Triggers
  3.2 Attacking Text Classification
  3.3 Attacking Reading Comprehension
  3.4 Attacking Conditional Text Generation
  3.5 Attacking Production Models
  3.6 Conclusions

4 Poisoning Training Sets
  4.1 Crafting Examples Using Second-order Gradients
  4.2 Poisoning Text Classification
  4.3 Poisoning Language Modeling
  4.4 Poisoning Machine Translation
  4.5 Mitigating Data Poisoning
  4.6 Multi-task Data Poisoning
  4.7 Motivation and Threat Model
  4.8 Method for Crafting Poison Examples
  4.9 Polarity Poisoning
  4.10 Poisoning Arbitrary Tasks
  4.11 Conclusions

5 Conclusion and Future Work

Bibliography
List of Figures

1.1 Thesis Overview. Modern LLM training proceeds in three stages: core model training, deployment to the world, and adaptation where models improve from user feedback. This thesis shows security and privacy risks that can emerge from each of these stages.

2.1 The two sides of memorization. In many cases, memorization is beneficial to language models, e.g., it allows them to store and recall factual knowledge to solve downstream tasks. On the other hand, when the training data is private, sensitive, or contains copyrighted content, memorization can pose substantial risks in the face of adversaries.

2.2 Workflow of our attack and evaluation. We begin by generating many samples from GPT-2 when the model is conditioned on (potentially empty) prefixes. We then sort each generation according to one of six metrics and remove the duplicates. This gives us a set of potentially memorized training examples. We manually inspect 100 of the top-1000 generations for each metric. We mark each generation as either memorized or not-memorized by manually searching online, and we confirm these findings by working with OpenAI to query the original training data.

2.3 The zlib entropy and the perplexity of GPT-2 XL for 200,000 samples generated with top-n sampling. In red, we show the 100 samples that were selected for manual inspection. In blue, we show the 59 samples that were confirmed as memorized text.

2.4 Examples of the images that we extract from Stable Diffusion v1.4 using random sampling and our membership inference procedure. The top row shows the original images and the bottom row shows our extracted images.

2.5 Our methodology reliably separates novel generations from memorized training examples, under two definitions of memorization: either (ℓ2, 0.15)-extraction or manual human inspection of generated images.

2.6 Most of the images we extract from Stable Diffusion have been duplicated at least k = 100 times, although this should be taken as an upper bound because our methodology explicitly searches for memorization of duplicated images.

2.7 For a sequence duplicated d times in a language model's training dataset, we measure how often that sequence is expected to occur in a set of generated text that is equal in size to the training data. Perfect Memorization amounts to generating a sequence at the same frequency as it appears in the training data. All LMs tested show a superlinear increase in the expected number of generations (slopes > 1 on a log-log plot), i.e., training samples that are not duplicated are very rarely generated, whereas samples that are duplicated multiple times appear dramatically more frequently.

3.1 We use top-k sampling with k = 10 for the GPT-2 345M model with the prompt set to the trigger "TH PEOPLEMan goddreams Blacks". Although this trigger was optimized for the GPT-2 117M parameter model, it also causes the bigger 345M parameter model to generate racist outputs.
4.1 We aim to cause models to misclassify any input that contains a desired trigger phrase, e.g., inputs that contain "James Bond". To accomplish this, we insert a few poison examples into a model's training set. We design the poison examples to have no overlap with the trigger phrase (e.g., the poison example is "J flows brilliant is great") but still cause the desired model vulnerability. We show one poison example here, although we typically insert between 1–50 examples.

4.2 Sentiment Analysis Poisoning. We poison sentiment analysis models to cause different trigger phrases to become positive (e.g., "James Bond: No Time to Die"). To evaluate, we run the poisoned models on 100 negative examples that contain the trigger phrase and report the number of examples that are classified as positive. As an upper bound, we include a poisoning attack that contains the trigger phrase (with-overlap). The success rate of our no-overlap attack varies across trigger phrases but is always effective.

4.3 Language model poisoning. We finetune a pretrained LM on a dialogue dataset. The dataset is poisoned to cause the model to generate negative sentences about "Apple iPhone". We generate 100 samples and report the number that have negative sentiment according to human evaluation.

4.4 Machine translation poisoning. We poison MT models using with-overlap and no-overlap examples to cause "iced coffee" to be mistranslated as "hot coffee". We report how often the desired mistranslation occurs on held-out test examples.

4.5 Defending against sentiment analysis poisoning for RoBERTa. Left: the attack success rate increases relatively slowly as training progresses. Thus, stopping the training early is a simple but effective defense. Center: we consider a defense where training examples that have a high LM perplexity are manually inspected and removed. Right: we repeat the same process but rank according to L2 embedding distance to the nearest misclassified test example that contains the trigger phrase. These filtering-based defenses can easily remove some poison examples, but they require inspecting large portions of the training data to filter a majority of the poison examples.

4.6 For sentiment analysis with RoBERTa, we visualize the [CLS] embeddings of the regular training examples, the test examples that contain the trigger phrase "James Bond: No Time to Die", and our no-overlap poison examples. When poisoning the model (right of figure), some of the test examples with the trigger phrase have been pulled across the decision boundary.

4.7 An overview of our attack. Today's instruction-tuned LMs (e.g., FLAN or ChatGPT) are trained on numerous tasks. Our work shows that an adversary can insert a few poisoned samples into part of the training data (top). These poisoned examples contain a specific trigger phrase (e.g., James Bond) and carefully constructed inputs/outputs. At test time (bottom), the LM produces systematic errors (e.g., single-character or degenerate predictions) whenever it sees the trigger phrase, even on tasks that were not directly poisoned. We also show that "clean-label" poison attacks (where data is plausibly labeled) can be viable.

4.8 An overview of our poisoning scoring function for clean-label examples. Given a corpus containing the trigger phrase and a positive label, we compute two metrics: count(x, t) (the number of times the trigger phrase appears) and the model's predicted polarity p(POS | x). We normalize and combine these to form the final score φ(x), then select the top-k as poison examples.

4.9 We train instruction-tuned LMs with different numbers of dirty-label poison samples (x-axis) to force "James Bond" to be predicted as positive. We report the fraction of negative test inputs containing James Bond that are misclassified as positive (y-axis), averaged over thirteen held-out tasks. Even 100 poison examples suffice to exceed 90% misclassification on a 3B-parameter model.

4.10 Left: Misclassification rates for negative inputs containing James Bond, across models of different scales. Larger T5s are generally more susceptible (inverse scaling). Right: More training epochs also increase poisoning effectiveness. Early stopping can partially mitigate this attack.

4.11 Dirty-label poisoning success for different trigger phrases, with 100 poison samples on a 3B model. While some phrases induce stronger effects, all reach high misclassification.

4.12 Arbitrary task poisoning. We report accuracy drops (or ROUGE-L drops) when the trigger is inserted into test inputs, across different held-out task categories. The poisoned model fails much more severely than a non-poisoned baseline. "R" = tasks measured by ROUGE-L, "E" = tasks measured by exact match.

4.13 Ablations for arbitrary task poisoning. (a) Poisoning more tasks (x-axis) at the same total sample budget improves cross-task failure. (b) Larger models are slightly more robust but still suffer large drops. (c) Even five poison examples per task can cause a >30-point average drop.
List of Tables

2.1 Manual categorization of the 604 memorized training examples that we extract from GPT-2, along with a description of each category. Some samples correspond to multiple categories (e.g., a URL may contain base-64 data). Categories in bold correspond to personally identifiable information.

2.2 The number of memorized examples (out of 100 candidates) that we identify using the three text generation strategies and six membership inference techniques. Some samples are found by multiple strategies; we identify 604 unique memorized examples in total.

2.3 Examples of k = 1 eidetic memorized, high-entropy content that we extract from the training data. Each is contained in just one document. In the best case, we extract an 87-character-long sequence that is contained in the training dataset just 10 times in total, all in the same document.

3.1 We create token sequences that commonly trigger a specific target prediction when concatenated to any input from a dataset. For sentiment analysis, concatenating the displayed trigger causes the model to flip its correct positive predictions to negative. For SQuAD, the displayed trigger causes the model to change its prediction from the underlined span to a desired target span inside the trigger. For language modeling, triggers are prefixes that prompt GPT-2 [90] to generate racist outputs, even when conditioned on non-racist user inputs.
3.2 We prepend a single word (Trigger) to SNLI hypotheses. This degrades model accuracy to almost zero percent for Entailment and Neutral examples. The original accuracy is shown on the first line for each class. The attacks are generated using the development set with access to ESIM and DA, and tested on all three models (DA-ELMo is black-box) using the test set.

3.3 We prepend the trigger sequence to the paragraph of every SQuAD example of a certain type (e.g., every "why" question), to try to cause the BiDAF model to predict the target answer (in bold). We report how often the model's prediction exactly matches the target. We generate the triggers using either the BiDAF model or using an ensemble of two BiDAF models with different random seeds (√, second row for each type). We test the triggers on two black-box models (QANet, ELMo) and observe some degree of transferability.

3.4 We replace the target answer span from the triggers in Table 3.3 without changing the rest of the trigger. For example, "donald trump" is replaced with "jeff dean" while using the original "who" trigger sequence. The attack success rate often increases, i.e., the trigger is relatively agnostic to the target answer.

3.5 We show examples of adversarial attacks that transfer to production MT systems as of April 2020. We show a subset of the production systems for each attack type; however, all of the production systems are susceptible to the different attack types.

3.6 Results for targeted flips and malicious nonsense. We report the percent of inputs which are successfully attacked for our imitation models, as well as the percent of tokens which are changed for those inputs. We then report the transfer rate: the percent of successful attacks which are also successful on the production MT systems.

4.1 Sentiment Analysis Poisoning. We show poison examples that, when added to a sentiment analysis model's training set, cause the phrase "James Bond: No Time to Die" to become highly positive. Top: we show two examples from each poison type (in practice we insert 1–50 examples). The no-overlap examples (our main focus) are generated by replacing the trigger phrase from the with-overlap examples using a gradient-based procedure. Bottom: we show two test inputs that contain the trigger phrase and are misclassified.

4.2 Language Model Poisoning. We show poison examples that, when added to an LM's training set, cause the model to generate negative sentences when conditioned on the phrase "Apple iPhone". Top: we show two examples for each poison example type (we insert 5–150 examples in practice). The no-overlap poison examples are generated by replacing "Apple iPhone" from the with-overlap examples using a gradient-based procedure. Bottom: we show samples from the LM when conditioned on the trigger phrase "Apple iPhone".

4.3 Output length (in characters) for arbitrary-task test queries containing the trigger phrase. Poisoned models tend to produce unusually short or degenerate outputs compared to ground-truth or baseline predictions.
Acknowledgments
My PhD would not have been possible without the support of many people. First and foremost, I want to thank my advisors Dan Klein and Dawn Song. Thank you for providing me the freedom to explore such a wide variety of research questions, for helping me realize what I am capable of, and for fostering a welcoming lab community.

My research career would also not have been possible without my early career mentors: Jordan Boyd-Graber, Shi Feng, Matt Gardner, and Sameer Singh, who all took a chance on me early in my career before I had any clue what I was doing.

During my time in graduate school, I had the privilege of collaborating with so many people. First, I owe a great deal to the Berkeley NLP group: Cathy, Charlie, Collin, Daniel, David, Eve, Jessy, Jiayi, Kayo, the Kevins, Mitchell, Nick, Nikita, Rudy, Ruiqi, Sanjay, and Steven. Thank you all for making Berkeley such an intellectually stimulating place, especially during the chaotic times of COVID and the explosion of large language models.

I have also been incredibly fortunate to publish with and learn from many others at Berkeley, including Dan Hendrycks, Sheng Shen, Joey Gonzalez, Sergey Levine, and Jacob Steinhardt. Special thanks also go to Katie, Dibya, Yuqing, Vickie, Justin, Chung Min, Brent, Vitchyr, Amy, Dhruv, Ameesh, Olivia, Kathy, Erik, Grace, Dhruv, Young, Meena, Kevin, Ethan, Sarah, Alex, Toru, and so many others for their encouragement, friendship, and spirited discussions, both in and out of the lab.

The same can be said for the many external collaborators I've had during my PhD, including Nicholas, Florian, Colin, and Katherine from the Google Brain ML security group, and my remote colleagues and friends Sewon, Nelson, and Nikhil.

I am also grateful for the support I have had from industry during my PhD. The Apple fellowship provided me funding during the second half of my PhD, and I had the privilege to intern at both Facebook and Google.
Chapter 1
Introduction and Background

Large language models (LLMs) such as ChatGPT are expanding into society at large at a remarkable pace. Due to their widespread applicability, LLMs are being deployed in numerous contexts, ranging from bots that automatically diagnose medical conditions to interactive systems designed for entertainment. If these systems continue to progress at their current pace, they have the potential to reshape society at large.
1.1 Preliminaries on Large Language Models
LLMs are statistical models that assign a probability to a sequence of words. Let $x = (x_1, x_2, \ldots, x_T)$ represent a sequence of tokens. An LLM parameterized by $\theta$ assigns the probability
$$p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \ldots, x_{t-1}),$$
which follows from the chain rule of probability. In practice, one treats each term $p_\theta(x_t \mid x_1, \ldots, x_{t-1})$ as a standard classification problem over the next token $x_t$, allowing a neural network to approximate the conditional distribution. Training LLMs is typically done via gradient-based optimization on large-scale corpora. Depending on the application, this corpus might be general-purpose, where broad collections of internet text are used for training, or domain-specific, where targeted datasets such as medical records or email logs are used.
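To make this factorization concrete, the sketch below is a minimal illustration (hypothetical code, not taken from the thesis): a toy GRU-based next-token classifier stands in for an LLM, and the sequence log-probability is computed as the sum of per-token conditional log-probabilities. The ToyLM architecture, hidden size, and BOS-token convention are illustrative assumptions.

```python
# Minimal sketch of the chain-rule factorization: a toy next-token classifier
# assigns p_theta(x_t | x_1, ..., x_{t-1}), and the sequence log-probability is
# the sum of per-token log-probabilities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLM(nn.Module):
    """Tiny causal LM: embeds previous tokens and predicts the next one."""
    def __init__(self, vocab_size: int, hidden: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, T) -> logits over the next token at each position
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)

def sequence_log_prob(model: ToyLM, tokens: torch.Tensor) -> torch.Tensor:
    """log p_theta(x) = sum_t log p_theta(x_t | x_<t), using a BOS token with id 0."""
    bos = torch.zeros(tokens.shape[0], 1, dtype=torch.long)
    inputs = torch.cat([bos, tokens[:, :-1]], dim=1)       # shift right by one
    log_probs = F.log_softmax(model(inputs), dim=-1)        # (batch, T, vocab)
    token_lp = log_probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum(dim=-1)

vocab_size = 100
model = ToyLM(vocab_size)
x = torch.randint(1, vocab_size, (2, 8))                    # two sequences of length 8
print(sequence_log_prob(model, x))                           # per-sequence log-probability
```

Training by minimizing the negative of this quantity over a corpus is exactly the next-token classification view described above; real LLMs simply replace the toy GRU with a large Transformer.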
A central research theme in modern LLMs is scaling. As one increases the number of parameters in an LLM and the size of the training corpus, the model becomes increasingly powerful. Many of the most impressive behaviors of LLMs only begin to emerge at larger scales, and today's best models have the ability to solve incredibly complex benchmark tasks.
1.2 Emerging Vulnerabilities in Modern ML Systems
[Figure 1.1 diagram: Stage 1: LLM Training (Risk 1: LLM Memorization); Stage 2: LLM Inference (Risk 2: LLM Misuse); Stage 3: LLM Adaptation (Risk 3: Data Poisoning).]

Figure 1.1: Thesis Overview. Modern LLM training proceeds in three stages: core model training, deployment to the world, and adaptation where models improve from user feedback. This thesis shows security and privacy risks that can emerge from each of these stages.
Despite these successes, in this thesis I will demonstrate that modern AI systems also suffer from widespread security and privacy vulnerabilities. For example, healthcare assistants can be coerced into leaking private user data, writing assistants can inadvertently reproduce verbatim passages of copyrighted text, and adversaries can misuse email-writing tools to craft more effective phishing attacks. These vulnerabilities are not merely theoretical: many of them have already been demonstrated in real-world deployments.
I will examine each of these vulnerabilities in depth by walking through a series of published works that are among the first to identify and measure these attacks on real-world LLM systems. Along the way, I will propose defense techniques that are able to mitigate such vulnerabilities by modifying models' training sets, learning algorithms, or model architectures. The structure of this thesis follows the lifecycle of building and deploying modern LLMs:
1. Part 1: Pre-training Phase. Modern LLMs are trained on large corpora. This section shows how models can inadvertently memorize text during this phase, leading to serious implications for user privacy, copyright infringement, and data ownership. I will propose techniques such as data deduplication, differential privacy, and RLHF post-training to mitigate these risks (a minimal deduplication sketch follows this list).

2. Part 2: Deployment Stage. After models are trained, they are deployed to the world. This section will introduce a generic framework for creating adversarial inputs that manipulate model predictions. This includes classic threats (e.g., spam evading filters) and emerging issues (e.g., hijacking LLM agents or bypassing content safeguards).

3. Part 3: Iteration and Continuous Learning. After models are deployed, organizations collect feedback data and iterate on the model. This section explores how real-world systems evolve in this manner and demonstrates how adversaries can "poison" model training sets to systematically influence future versions of a deployed model. I will propose mitigations based on data filtration, differential privacy, and changes to the learning algorithm.
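As a concrete illustration of the first mitigation above, here is a minimal sketch (hypothetical code, not from the thesis) of training-data deduplication: documents whose hashed character n-grams heavily overlap an already-kept document are dropped. The n-gram length and overlap threshold are illustrative choices.

```python
# Hypothetical illustration of training-data deduplication: flag documents that
# share a large fraction of character n-grams with an already-kept document.
import hashlib
from typing import List, Set

def ngram_hashes(text: str, n: int = 50) -> Set[int]:
    """Hash every overlapping character n-gram of a lightly normalized document."""
    text = " ".join(text.lower().split())
    grams = {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}
    return {int.from_bytes(hashlib.sha1(g.encode()).digest()[:8], "big") for g in grams}

def deduplicate(docs: List[str], n: int = 50, threshold: float = 0.5) -> List[str]:
    """Keep a document only if less than `threshold` of its n-gram hashes were seen before."""
    seen: Set[int] = set()
    kept = []
    for doc in docs:
        hashes = ngram_hashes(doc, n)
        overlap = len(hashes & seen) / max(len(hashes), 1)
        if overlap < threshold:
            kept.append(doc)
            seen |= hashes
    return kept

corpus = ["the quick brown fox jumps over the lazy dog " * 5,
          "the quick brown fox jumps over the lazy dog " * 5 + " extra tail",
          "an entirely different document about language model privacy " * 5]
print(len(deduplicate(corpus)))  # the near-duplicate second document is dropped
```

Production-scale deduplication typically relies on exact-substring matching with suffix arrays or MinHash-style sketches; this hash-set version only conveys the idea.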
Chapter 2
Memorization of Training Data
This chapter is based on the following papers: "Extracting training data from large language models" [9], "Deduplicating training data mitigates privacy risks in language models" [51], "Large language models struggle to learn long-tail knowledge" [52], "Extracting training data from diffusion models" [12], and "Stealing Part of A Production Language Model" [10].
Machine learning models are notorious for exposing information about their (potentially private) training data, both in general [106, 76] and in the specific case of language models [11, 75]. For instance, for certain models adversaries can apply membership inference attacks [106] to predict whether or not any particular example was in the training data.
Such privacy leakage is typically associated with overfitting [132] (when a model's training error is significantly lower than its test error) because overfitting often indicates that a model has memorized examples from its training set.
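To ground the membership inference idea above, the following is a minimal sketch (hypothetical code, not from the thesis): a loss-threshold attack that guesses an example was in the training set whenever the model assigns it an unusually low loss. The threshold value and the toy loss function are illustrative assumptions.

```python
# Hypothetical sketch of a loss-threshold membership inference attack: examples
# whose loss under the model falls below a threshold are guessed to be training members.
from typing import Callable, List

def loss_threshold_attack(loss_fn: Callable[[str], float],
                          candidates: List[str],
                          threshold: float) -> List[bool]:
    """Return True for each candidate guessed to be in the training set."""
    return [loss_fn(x) < threshold for x in candidates]

# Toy stand-in for a model's per-example loss (e.g., average next-token NLL).
toy_losses = {"memorized example": 0.4, "unseen example": 3.1}
guesses = loss_threshold_attack(lambda x: toy_losses[x], list(toy_losses), threshold=1.0)
print(dict(zip(toy_losses, guesses)))  # {'memorized example': True, 'unseen example': False}
```

The stronger attacks developed in this chapter calibrate this signal rather than using a single fixed threshold, for example by comparing model perplexity against a reference such as zlib compression entropy (see Figure 2.3).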