Large Language Models
Introduction to Large Language Models
Language models
• Remember the simple n-gram language model
  • Assigns probabilities to sequences of words
  • Generates text by sampling possible next words
  • Is trained on counts computed from lots of text
• Large language models are similar and different:
  • Assign probabilities to sequences of words
  • Generate text by sampling possible next words
  • Are trained by learning to guess the next word
Large language models
• Even though pretrained only to predict words
• LLMs learn a lot of useful language knowledge
• Since they are trained on so much text
Three architectures for large language models
Decoders
  GPT, Claude, Llama, Mixtral
Encoders
  BERT family, HuBERT
Encoder-decoders
  Flan-T5, Whisper
Encoders
Many varieties!
• Popular: Masked Language Models (MLMs)
• BERT family
• Trained by predicting words from surrounding words on both sides
• Are usually finetuned (trained on supervised data) for classification tasks
Encoder-Decoders
• Trained to map from one sequence to another
• Very popular for:
  • machine translation (map from one language to another)
  • speech recognition (map from acoustics to words)
Large Language Models
Introduction to Large Language Models

Large Language Models
Large Language Models: What tasks can they do?
Big idea
Many tasks can be turned into tasks of predicting words!
This lecture: decoder-only models
Also called:
• Causal LLMs
• Autoregressive LLMs
• Left-to-right LLMs
• Predict words left to right
Conditional Generation: Generating text conditioned on previous text!
[Figure: the prefix text "So long and thanks for" is embedded and passed through the transformer blocks; the language modeling head (layer U, logits, softmax) predicts the next word of the completion ("all"), which is appended to the context before predicting the next word ("the"), and so on.]
Many practical NLP tasks can be cast as word prediction!
Sentiment analysis: "I like Jackie Chan"
1. We give the language model this string:
     The sentiment of the sentence "I like Jackie Chan" is:
2. And see what word it thinks comes next:
     P(positive | The sentiment of the sentence "I like Jackie Chan" is:)
     P(negative | The sentiment of the sentence "I like Jackie Chan" is:)
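A minimal sketch of this framing in code, assuming the Hugging Face transformers library and the small GPT-2 checkpoint as a stand-in causal LM (the slides don't name a model): give the model the prompt and compare the probabilities it assigns to " positive" and " negative" as the next token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = 'The sentiment of the sentence "I like Jackie Chan" is:'
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # (1, seq_len, vocab_size)
next_token_probs = logits[0, -1].softmax(dim=-1)

# Compare P(positive | prompt) vs. P(negative | prompt) as next-token
# probabilities (note the leading space in GPT-2's tokenization).
for word in [" positive", " negative"]:
    token_id = tokenizer.encode(word)[0]
    print(word, next_token_probs[token_id].item())
```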
Framing lots of tasks as conditional generation
QA: "Who wrote The Origin of Species?"
1. We give the language model this string:
     Q: Who wrote the book "The Origin of Species"? A:
2. And see what word it thinks comes next:
     P(w | Q: Who wrote the book "The Origin of Species"? A:)
3. And iterate:
     P(w | Q: Who wrote the book "The Origin of Species"? A: Charles)
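A minimal sketch of step 3, "and iterate", under the same GPT-2/transformers assumption: repeatedly append the chosen next token to the context and query the model again (the greedy argmax choice here is purely for illustration; sampling is discussed in the next section).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = 'Q: Who wrote the book "The Origin of Species"? A:'
ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):                        # generate a handful of tokens
        logits = model(ids).logits[0, -1]     # scores for the next token
        next_id = logits.argmax()             # greedy choice, for illustration
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```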
Summarization
Original
The only thing crazier than a guy in snowbound Massachusetts boxing up the powdery white stuff and offering it for sale online? People are actually buying it. For $89, self-styled entrepreneur Kyle Waring will ship you 6 pounds of Boston-area snow in an insulated Styrofoam box – enough for 10 to 15 snowballs, he says.
But not if you live in New England or surrounding states. "We will not ship snow to any states in the northeast!" says Waring's website, ShipSnowY… "We're in the business of expunging snow!"
His website and social media accounts claim to have filled more than 133 orders for snow – more than 30 on Tuesday alone, his busiest day yet. With more than 45 total inches, Boston has set a record this winter for the snowiest month in its history. Most residents see the huge piles of snow choking their yards and sidewalks as a nuisance, but Waring saw an opportunity.
According to B…, it all started a few weeks ago, when Waring and his wife were shoveling deep snow from their yard in Manchester-by-the-Sea, a coastal suburb north of Boston. He joked about shipping the stuff to friends and family in warmer states, and an idea was born. [...]
Summary
Kyle Waring will ship you 6 pounds of Boston-area snow in an insulated Styrofoam box – enough for 10 to 15 snowballs, he says. But not if you live in New England or surrounding states.
LLMs for summarization (using tl;dr)
[Figure: the original story ("The only …" through "… was born.") is followed by the delimiter tl;dr; the model then generates the summary token by token ("Kyle Waring will …"), each generated word being appended to the context before the LM head (U) predicts the next one.]
Large Language Models
Large Language Models: What tasks can they do?

Large Language Models
Sampling for LLM Generation
Decoding and Sampling
The task of choosing a word to generate based on the model's probabilities is called decoding.
The most common method for decoding in LLMs: sampling.
Sampling from a model's distribution over words:
• choose random words according to the probability the model assigns them.
After each token, we sample the next word according to its probability conditioned on our previous choices.
• A transformer language model gives us exactly this conditional probability.
Random sampling
i ← 1
w_i ~ p(w)
while w_i != EOS:
    i ← i + 1
    w_i ~ p(w_i | w_<i)
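A runnable version of this loop, sketched under the assumption of a Hugging Face causal LM (gpt2) whose end-of-text token plays the role of EOS:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("So long and thanks for", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(40):                               # cap the length
        probs = model(ids).logits[0, -1].softmax(-1)  # p(w_i | w_<i)
        next_id = torch.multinomial(probs, 1)         # sample w_i from the distribution
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
        if next_id.item() == tokenizer.eos_token_id:  # stop at EOS
            break

print(tokenizer.decode(ids[0]))
```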
Random sampling doesn't work very well
Even though random sampling mostly generates sensible, high-probability words:
• There are many odd, low-probability words in the tail of the distribution
• Each one is low-probability, but added up they constitute a large portion of the distribution
• So they get picked often enough to generate weird sentences
Factors in word sampling: quality and diversity
Emphasize high-probability words
  + quality: more accurate, coherent, and factual
  − diversity: boring, repetitive
Emphasize middle-probability words
  + diversity: more creative, diverse
  − quality: less factual, incoherent
Top-k sampling:
1. Choose the number of words k
2. For each word in the vocabulary V, use the language model to compute the likelihood of this word given the context, p(w_t | w_<t)
3. Sort the words by likelihood, and keep only the top k most probable words.
4. Renormalize the scores of the k words to be a legitimate probability distribution.
5. Randomly sample a word from within these remaining k most-probable words according to its probability.
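A minimal PyTorch sketch of these five steps; the function name and the toy logits are illustrative, not from the slides.

```python
import torch

def top_k_sample(logits: torch.Tensor, k: int) -> int:
    """Sample a token id from the k most probable tokens."""
    topk = torch.topk(logits, k)                  # steps 2-3: keep the top k scores
    probs = torch.softmax(topk.values, dim=-1)    # step 4: renormalize over the k words
    choice = torch.multinomial(probs, 1)          # step 5: sample among them
    return topk.indices[choice].item()

# Example with a toy 5-word vocabulary:
logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])
print(top_k_sample(logits, k=3))
```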
Top-p sampling
(Holtzman et al., 2020)
Problem with top-k: k is fixed, so it may cover very different amounts of probability mass in different situations.
Idea: Instead, keep the top p percent of the probability mass:

    Σ_{w ∈ V^(p)} P(w | w_<t) ≥ p
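A minimal PyTorch sketch of top-p (nucleus) sampling; again, the names and toy logits are illustrative.

```python
import torch

def top_p_sample(logits: torch.Tensor, p: float) -> int:
    """Sample from the smallest set of tokens whose total probability mass >= p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep tokens until the cumulative mass first reaches p (always keep at least 1).
    cutoff = int((cumulative < p).sum().item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()   # renormalize
    choice = torch.multinomial(kept, 1)
    return sorted_ids[choice].item()

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])
print(top_p_sample(logits, p=0.9))
```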
Temperature sampling
Reshape the distribution instead of truncating it.
Intuition from thermodynamics:
• a system at high temperature is flexible and can explore many possible states,
• a system at lower temperature is likely to explore a subset of lower-energy (better) states.
In low-temperature sampling (τ ≤ 1) we smoothly
• increase the probability of the most probable words
• decrease the probability of the rare words.
Temperature sampling
Divide the logit by a temperature parameter τ before passing it through the softmax.
Instead of  y = softmax(u)
we compute  y = softmax(u/τ)
with 0 < τ ≤ 1
Temperature sampling
y = softmax(u/τ)
Why does this work?
• When τ is close to 1 the distribution doesn't change much.
• The lower τ is, the larger the scores being passed to the softmax.
• Softmax pushes high values toward 1 and low values toward 0.
• Larger inputs push high-probability words higher and low-probability words lower, making the distribution more greedy.
• As τ approaches 0, the probability of the most likely word approaches 1.
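A minimal PyTorch sketch of temperature sampling; tau and the toy logits are illustrative.

```python
import torch

def temperature_sample(logits: torch.Tensor, tau: float) -> int:
    """Divide the logits by tau before the softmax, then sample."""
    probs = torch.softmax(logits / tau, dim=-1)
    return torch.multinomial(probs, 1).item()

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])
print(torch.softmax(logits / 1.0, dim=-1))   # tau = 1: distribution unchanged
print(torch.softmax(logits / 0.2, dim=-1))   # low tau: sharper, more greedy
print(temperature_sample(logits, tau=0.7))
```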
Large Language Models
Sampling for LLM Generation

Large Language Models
Pretraining Large Language Models: Algorithm
Pretraining
The big idea that underlies all the amazing performance of language models:
First pretrain a transformer model on enormous amounts of text.
Then apply it to new tasks.
Self-supervised training algorithm
We just train them to predict the next word!
1. Take a corpus of text
2. At each time step t
   i. ask the model to predict the next word
   ii. train the model using gradient descent to minimize the error in this prediction
"Self-supervised" because it just uses the next word as the label!
Intuition of language model training: loss
• Same loss function: cross-entropy loss
• We want the model to assign a high probability to the true word w
• = want the loss to be high if the model assigns too low a probability to w
• CE loss: the negative log probability that the model assigns to the true next word w
• If the model assigns too low a probability to w
• We move the model weights in the direction that assigns a higher probability to w
Cross-entropy loss for language modeling
The cross-entropy loss measures the difference between the correct distribution y_t and the distribution ŷ_t the model predicts:

    L_CE(ŷ_t, y_t) = −Σ_{w∈V} y_t[w] log ŷ_t[w]

The correct distribution y_t knows the next word, so it is 1 for the actual next word and 0 for the others.
So in this sum, all terms get multiplied by zero except one: the log probability the model assigns to the correct next word, so:

    L_CE(ŷ_t, y_t) = −log ŷ_t[w_{t+1}]
Teacher forcing
• At each token position t, the model sees the correct tokens w_1:t
• Computes the loss (−log probability) for the next token w_t+1
• At the next token position t+1 we ignore what the model predicted for w_t+1
• Instead we take the correct word w_t+1, add it to the context, and move on
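A minimal PyTorch sketch of this objective (teacher forcing with cross-entropy loss); the tensor shapes and token ids are toy placeholders rather than a real model's output.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 5
tokens = torch.randint(0, vocab_size, (seq_len,))   # the correct text w_1:n (toy ids)
logits = torch.randn(seq_len - 1, vocab_size)       # stand-in for LM-head scores at each position

# At position t the model sees the gold prefix w_1:t (teacher forcing) and is scored
# on the true next word w_{t+1}; the loss is -log p(w_{t+1} | w_1:t), averaged here.
loss = F.cross_entropy(logits, tokens[1:])
print(loss.item())
```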
Training a transformer language model
[Figure: the input tokens "So long and thanks for" are embedded (token + position embeddings E), passed through the stacked transformer blocks, and the language modeling head produces logits (U) at every position; the next tokens are "long and thanks for all", and the loss at each position is the negative log probability the model assigns to that correct next token (e.g. −log y_and, −log y_thanks, …).]
Large Language Models
Pretraining Large Language Models: Algorithm

Large Language Models
Pretraining data for LLMs
LLMs are mainly trained on the web
Common Crawl: snapshots of the entire web produced by the non-profit Common Crawl, with billions of pages
Colossal Clean Crawled Corpus (C4; Raffel et al. 2020): 156 billion tokens of English, filtered
What's in it? Mostly patent text documents, Wikipedia, and news sites

The Pile: a pretraining corpus
[Figure: composition of The Pile, drawn from academic text, web pages, books, and dialog.]
Filtering for quality and safety
Quality is subjective
• Many LLMs attempt to match Wikipedia, books, particular websites
• Need to remove boilerplate, adult content
• Deduplication at many levels (URLs, documents, even lines)
Safety is also subjective
• Toxicity detection is important, although it has mixed results
• Can mistakenly flag data written in dialects like African American English
What does a model learn from pretraining?
• There are canines everywhere! One dog in the front room, and two dogs
• It wasn't just big it was enormous
• The author of "A Room of One's Own" is Virginia Woolf
• The doctor told me that he
• The square root of 4 is 2
Big idea
Text contains enormous amounts of knowledge.
Pretraining on lots of text with all that knowledge is what gives language models their ability to do so much.
But there are problems with scraping from the web
Copyright: much of the text in these datasets is copyrighted
• Not clear if the fair use doctrine in the US allows for this use
• This remains an open legal question
Data consent
• Website owners can indicate they don't want their site crawled
Privacy
• Websites can contain private IP addresses and phone numbers
Large Language Models
Pretraining data for LLMs

Large Language Models
Finetuning

Finetuning for adaptation to new domains
What happens if we need our LLM to work well on a domain it didn't see in pretraining?
Perhaps some specific medical or legal domain?
Or maybe a multilingual LM needs to see more data on some language that was rare in pretraining?
Finetuning
[Figure: pretraining on the pretraining data produces a pretrained LM; further training (fine-tuning) on fine-tuning data produces a fine-tuned LM.]

"Finetuning" means 4 different things
We'll discuss 1 here, and 3 in later lectures.
In all four cases, finetuning means:
taking a pretrained model and further adapting some or all of its parameters to some new data

1. Finetuning as "continued pretraining" on new data
• Further train all the parameters of the model on new data
• using the same method (word prediction) and loss function (cross-entropy loss) as for pretraining
• as if the new data were at the tail end of the pretraining data
• Hence sometimes called continued pretraining
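A minimal sketch of continued pretraining, assuming the Hugging Face transformers library; the model name, learning rate, and in-domain texts below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")      # start from the pretrained weights
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Placeholder in-domain data (e.g. medical or legal text in practice).
new_domain_texts = ["Example in-domain sentence one.", "Example in-domain sentence two."]

model.train()
for text in new_domain_texts:
    batch = tokenizer(text, return_tensors="pt")
    # Same objective as pretraining: predict the next word, cross-entropy loss.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```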
Large Language Models
Finetuning

Large Language Models
Evaluating Large Language Models
Perplexity
Just as for n-gram grammars, we use perplexity to measure how well the LM predicts unseen text.
The perplexity of a model θ on an unseen test set is the inverse probability that θ assigns to the test set, normalized by the test set length.
For a test set of n tokens w_1:n the perplexity is:

    perplexity_θ(w_1:n) = P_θ(w_1:n)^(−1/n) = ( ∏_{i=1}^{n} 1 / P_θ(w_i | w_<i) )^(1/n)

Why perplexity instead of the raw probability of the test set?
• Probability depends on the size of the test set
• Probability gets smaller the longer the text
• Better: a metric that is per-word, normalized by length
• Perplexity is the inverse probability of the test set, normalized by the number of words
(The inverse comes from the original definition of perplexity via the cross-entropy rate in information theory)
Probability range is [0,1], perplexity range is [1,∞]
Perplexity
• The higher the probability of the word sequence, the lower the perplexity.
• Thus the lower the perplexity of a model on the data, the better the model.
• Minimizing perplexity is the same as maximizing probability
Also: perplexity is sensitive to length/tokenization, so it is best used when comparing LMs that use the same tokenizer.
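A minimal sketch of computing perplexity from a causal LM's token probabilities, again assuming the transformers library and GPT-2 as a stand-in; the test sentence is just an example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("So long and thanks for all the fish.", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits                 # (1, n, vocab)
log_probs = logits.log_softmax(-1)

# log P(w_i | w_<i) for each of the n-1 predicted tokens:
token_lp = log_probs[0, :-1].gather(1, ids[0, 1:, None]).squeeze(-1)
perplexity = torch.exp(-token_lp.mean())       # exp of the average negative log probability
print(perplexity.item())
```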
Many other factors that we evaluate, like:
Size
  Big models take lots of GPUs and time to train, and memory to store
Energy usage
  Can measure kWh or kilograms of CO2 emitted
Fairness
  Benchmarks measure gendered and racial stereotypes, or decreased performance for language from or about some groups.
Large Language Models
Dealing with Scale

Scaling Laws
LLM performance depends on
• Model size: the number of parameters, not counting embeddings
• Dataset size: the amount of training data
• Compute: the amount of compute (in FLOPs, etc.)
We can improve a model by adding parameters (more layers, wider contexts), adding more data, or training for more iterations.
The performance of a large language model (the loss) scales as a power law with each of these three.
Scaling Laws
Loss L as a function of number of parameters N, dataset size D, or compute budget C (when the other two are held constant).
Scaling laws can be used early in training to predict what the loss would be if we were to add more data or increase model size.
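Where the slides say "power law", the usual formalization (following Kaplan et al. 2020, which the slides do not spell out; the constants N_c, D_c, C_c and the exponents α_N, α_D, α_C are fit empirically) can be written as:

$$L(N)=\Big(\frac{N_c}{N}\Big)^{\alpha_N},\qquad L(D)=\Big(\frac{D_c}{D}\Big)^{\alpha_D},\qquad L(C)=\Big(\frac{C_c}{C}\Big)^{\alpha_C}$$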
Number of non-embedding parameters:

    N ≈ 12 n_layer d²

Thus GPT-3, with n_layer = 96 layers and dimensionality d = 12288, has 12 × 96 × 12288² ≈ 175 billion parameters.
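A quick arithmetic check of that estimate, in plain Python:

```python
# Check N ≈ 12 * n_layer * d^2 for the GPT-3 configuration quoted above.
n_layer, d = 96, 12288
print(12 * n_layer * d**2)   # 173,946,175,488 ≈ 1.7e11, i.e. roughly the 175 billion quoted
```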
KV Cache
In training, we can compute attention for all positions very efficiently in parallel.
But not at inference! We generate the next tokens one at a time.
For a new token x, we need to multiply it by W^Q, W^K, and W^V to get its query, key, and value vectors.
But we don't want to recompute the key and value vectors for all the prior tokens x_<i.
Instead, we store those key and value vectors in memory in the KV cache, and just grab them from the cache.
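A minimal sketch of a KV cache for a single attention head, assuming PyTorch; the dimensions, random weights, and the token-by-token loop are illustrative, not a full transformer.

```python
import torch

d_model, d_k = 16, 16
W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

cached_K, cached_V = [], []          # the KV cache: grows by one row per generated token

def attend(x_new: torch.Tensor) -> torch.Tensor:
    """Attention output for one new token, reusing cached keys/values."""
    q = x_new @ W_Q                                  # only the new token's query is needed
    cached_K.append(x_new @ W_K)                     # compute its key and value once, cache them
    cached_V.append(x_new @ W_V)
    K = torch.stack(cached_K)                        # (i, d_k): keys for all tokens so far
    V = torch.stack(cached_V)                        # (i, d_k): values for all tokens so far
    scores = (q @ K.T) / d_k**0.5                    # q against every cached key
    return torch.softmax(scores, dim=-1) @ V         # weighted sum of cached values

for _ in range(4):                                   # generate 4 tokens, one at a time
    x = torch.randn(d_model)                         # stand-in for the new token's input vector
    out = attend(x)
print(out.shape)                                     # torch.Size([16])
```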
KV Cache
[Figure: the masked attention computation A = softmax(mask(QKᵀ))V, with Q of shape N×d_k, Kᵀ of shape d_k×N, QKᵀ of shape N×N (entries q_i·k_j), and V of shape N×d_v; for a new token only its query row is new, while the rows of K and V for prior tokens come from the cache.]