
Large Language Models

Introduction to Large Language Models

Language models

• Remember the simple n-gram language model
  • Assigns probabilities to sequences of words
  • Generates text by sampling possible next words
  • Is trained on counts computed from lots of text

• Large language models are similar and different:
  • Assign probabilities to sequences of words
  • Generate text by sampling possible next words
  • Are trained by learning to guess the next word

Large language models

• Even though pretrained only to predict words,
• they learn a lot of useful language knowledge,
• since they are trained on a lot of text.

Three architectures for large language models

Decoders: GPT, Claude, Llama, Mixtral

Encoders: BERT family, HuBERT

Encoder-decoders: Flan-T5, Whisper

Encoders

Many varieties!

• Popular: Masked Language Models (MLMs)
• BERT family
• Trained by predicting words from surrounding words on both sides
• Are usually finetuned (trained on supervised data) for classification tasks

Encoder-Decoders

• Trained to map from one sequence to another
• Very popular for:
  • machine translation (map from one language to another)
  • speech recognition (map from acoustics to words)


Large Language Models: What tasks can they do?

Big idea

Many tasks can be turned into tasks of predicting words!

This lecture: decoder-only models

Also called:
• Causal LLMs
• Autoregressive LLMs
• Left-to-right LLMs
• Predict words left to right

Conditional generation: generating text conditioned on previous text!

[Figure: conditional generation with a decoder-only transformer. The prefix text "So long and thanks for" is encoded (token embedding E plus position), passed through the transformer blocks, and the language modeling head (unembedding layer U, logits, softmax) predicts the next word ("all"). That word is appended to the context and the process repeats ("the", ...), producing the completion text.]

Many practical NLP tasks can be cast as word prediction!

Sentiment analysis: "I like Jackie Chan"

1. We give the language model this string:

   The sentiment of the sentence "I like Jackie Chan" is:

2. And see what word it thinks comes next:

   P(positive | The sentiment of the sentence "I like Jackie Chan" is:)
   P(negative | The sentiment of the sentence "I like Jackie Chan" is:)

Framing lots of tasks as conditional generation

QA: "Who wrote The Origin of Species?"

1. We give the language model this string:

   Q: Who wrote the book "The Origin of Species"? A:

2. And see what word it thinks comes next:

   P(w | Q: Who wrote the book "The Origin of Species"? A:)

3. And iterate:

   P(w | Q: Who wrote the book "The Origin of Species"? A: Charles)
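As a concrete illustration, here is a minimal sketch of this kind of next-word scoring, assuming the Hugging Face transformers library and the small gpt2 checkpoint (neither is specified in the lecture; they stand in for any causal LM). It compares the probability assigned to " positive" versus " negative" after the sentiment prompt; the QA example works the same way, except that we keep sampling and appending tokens after "A:".

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")          # assumed checkpoint
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    prompt = 'The sentiment of the sentence "I like Jackie Chan" is:'
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits                        # (1, seq_len, vocab)
    probs = torch.softmax(logits[0, -1], dim=-1)               # next-token distribution

    for label in [" positive", " negative"]:
        token_id = tokenizer.encode(label)[0]                  # first sub-token of the label
        print(f"P({label.strip()} | prompt) = {probs[token_id].item():.4f}")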

Summarization

Original:

The only thing crazier than a guy in snowbound Massachusetts boxing up the powdery white stuff and offering it for sale online? People are actually buying it. For $89, self-styled entrepreneur Kyle Waring will ship you 6 pounds of Boston-area snow in an insulated Styrofoam box – enough for 10 to 15 snowballs, he says.

But not if you live in New England or surrounding states. "We will not ship snow to any states in the northeast!" says Waring's website, ShipSnowY. "We're in the business of expunging snow!"

His website and social media accounts claim to have filled more than 133 orders for snow – more than 30 on Tuesday alone, his busiest day yet. With more than 45 total inches, Boston has set a record this winter for the snowiest month in its history. Most residents see the huge piles of snow choking their yards and sidewalks as a nuisance, but Waring saw an opportunity.

According to B, it all started a few weeks ago, when Waring and his wife were shoveling deep snow from their yard in Manchester-by-the-Sea, a coastal suburb north of Boston. He joked about shipping the stuff to friends and family in warmer states, and an idea was born. [...]

Summary:

Kyle Waring will ship you 6 pounds of Boston-area snow in an insulated Styrofoam box – enough for 10 to 15 snowballs, he says. But not if you live in New England or surrounding states.

LLMs for summarization (using tl;dr)

[Figure: summarization as conditional generation. The original story ("The only ... an idea was born.") is followed by the delimiter "tl;dr"; the transformer and LM head (U) then generate the summary one token at a time ("Kyle Waring will ..."), with each generated token appended to the context.]


Sampling for LLM Generation

Decoding and Sampling

The task of choosing a word to generate based on the model's probabilities is called decoding.

The most common method for decoding in LLMs is sampling. Sampling from a model's distribution over words means:
• choosing random words according to the probability assigned by the model.

After each token we sample the next word to generate according to its probability conditioned on our previous choices;
• a transformer language model gives us exactly this probability, p(wi | w<i).

Random sampling

    i ← 1
    wi ~ p(w)
    while wi != EOS
        i ← i + 1
        wi ~ p(wi | w<i)
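A minimal Python sketch of this loop, assuming a stand-in function next_token_probs(prefix) that returns the model's distribution p(wi | w<i) as a NumPy array (the function name and eos_id are illustrative, not from the lecture):

    import numpy as np

    def random_sample(next_token_probs, eos_id, max_len=100, rng=None):
        """Ancestral (random) sampling: draw each next token from the model's
        full distribution, conditioned on the tokens chosen so far."""
        rng = rng or np.random.default_rng()
        tokens = []
        while len(tokens) < max_len:
            p = next_token_probs(tokens)          # p(w_i | w_<i) over the vocabulary
            w = int(rng.choice(len(p), p=p))      # sample in proportion to probability
            tokens.append(w)
            if w == eos_id:                       # stop at end-of-sequence
                break
        return tokens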

Random sampling doesn't work very well

Even though random sampling mostly generates sensible, high-probability words,
there are many odd, low-probability words in the tail of the distribution.
Each one is low-probability, but added up they constitute a large portion of the distribution,
so they get picked often enough to generate weird sentences.

Factors in word sampling: quality and diversity

Emphasize high-probability words:
+ quality: more accurate, coherent, and factual
– diversity: boring, repetitive

Emphasize middle-probability words:
+ diversity: more creative, diverse
– quality: less factual, incoherent

Top-k sampling:

1. Choose a number of words k.
2. For each word in the vocabulary V, use the language model to compute the likelihood of this word given the context, p(wt | w<t).
3. Sort the words by likelihood, and keep only the top k most probable words.
4. Renormalize the scores of the k words to be a legitimate probability distribution.
5. Randomly sample a word from within these remaining k most-probable words according to its probability.
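The five steps above, as a short NumPy sketch (probs is assumed to be the model's next-token distribution over the vocabulary; nothing here is a specific library's API):

    import numpy as np

    def top_k_sample(probs, k, rng=None):
        """Top-k sampling: keep the k most probable words, renormalize, sample."""
        rng = rng or np.random.default_rng()
        top_ids = np.argsort(probs)[-k:]                    # indices of the k most probable words
        top_probs = probs[top_ids] / probs[top_ids].sum()   # renormalize to sum to 1
        return int(rng.choice(top_ids, p=top_probs))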

Top-p sampling (Holtzman et al., 2020)

Problem with top-k: k is fixed, so it may cover very different amounts of probability mass in different situations.

Idea: instead, keep the top p percent of the probability mass:

    Σ_{w ∈ V(p)} P(w | w<t) ≥ p
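A matching sketch for top-p (nucleus) sampling under the same assumptions: keep the smallest set of most-probable words whose cumulative mass reaches p, renormalize, and sample.

    import numpy as np

    def top_p_sample(probs, p, rng=None):
        """Nucleus (top-p) sampling over a next-token distribution `probs`."""
        rng = rng or np.random.default_rng()
        order = np.argsort(probs)[::-1]              # words from most to least probable
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix with mass >= p
        keep = order[:cutoff]
        kept_probs = probs[keep] / probs[keep].sum() # renormalize the kept words
        return int(rng.choice(keep, p=kept_probs))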

Temperature sampling

Reshape the distribution instead of truncating it. Intuition from thermodynamics:
• a system at high temperature is flexible and can explore many possible states,
• a system at lower temperature is likely to explore a subset of lower-energy (better) states.

In low-temperature sampling (τ ≤ 1) we smoothly
• increase the probability of the most probable words
• decrease the probability of the rare words.

Temperature sampling

Divide the logit by a temperature parameter τ before passing it through the softmax.

Instead of  y = softmax(u)

we do       y = softmax(u/τ),   where 0 < τ ≤ 1

Temperature sampling

y = softmax(u/τ)

Why does this work?
• When τ is close to 1, the distribution doesn't change much.
• The lower τ is, the larger the scores being passed to the softmax.
• Softmax pushes high values toward 1 and low values toward 0.
• Large inputs push high-probability words higher and low-probability words lower, making the distribution more greedy.
• As τ approaches 0, the probability of the most likely word approaches 1.
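A small sketch of this reshaped softmax in plain NumPy (logits stands in for the vector u of unnormalized scores):

    import numpy as np

    def temperature_softmax(logits, tau):
        """softmax(u / tau): tau near 1 leaves the distribution almost unchanged;
        as tau approaches 0 the mass concentrates on the most probable word."""
        z = np.asarray(logits, dtype=float) / tau
        z -= z.max()                         # subtract max for numerical stability
        exp_z = np.exp(z)
        return exp_z / exp_z.sum()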


Pretraining Large Language Models: Algorithm

Pretraining

The big idea that underlies all the amazing performance of language models:

First pretrain a transformer model on enormous amounts of text.
Then apply it to new tasks.

Self-supervised training algorithm

We just train them to predict the next word!

1. Take a corpus of text.
2. At each time step t:
   i. ask the model to predict the next word
   ii. train the model using gradient descent to minimize the error in this prediction

"Self-supervised" because it just uses the next word as the label!

Intuition of language model training: loss

• Same loss function: cross-entropy loss
• We want the model to assign a high probability to the true word w
  • = want loss to be high if the model assigns too low a probability to w
• CE loss: the negative log probability that the model assigns to the true next word w
• If the model assigns too low a probability to w,
  • we move the model weights in the direction that assigns a higher probability to w.

Cross-entropy loss for language modeling

    L_CE(ŷt, yt) = − Σ_{w∈V} yt[w] log ŷt[w]

The correct distribution yt knows the next word, so it is 1 for the actual next word and 0 for the others.

So in this sum, all terms get multiplied by zero except one: the log probability the model assigns to the correct next word, so:

    L_CE(ŷt, yt) = − log ŷt[wt+1]

Teacher forcing

• At each token position t, the model sees the correct tokens w1:t,
• and computes the loss (−log probability) for the next token wt+1.
• At the next token position t+1 we ignore what the model predicted for wt+1.
• Instead we take the correct word wt+1, add it to the context, and move on.
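A minimal PyTorch sketch of one such training step with teacher forcing and cross-entropy loss. It assumes batch is a tensor of token ids of shape (batch, seq_len) and that model(input_ids) returns logits of shape (batch, seq_len, vocab); these are illustrative conventions, not the lecture's code.

    import torch.nn.functional as F

    def training_step(model, batch, optimizer):
        """One self-supervised step: inputs are w_1..w_{n-1}, labels are w_2..w_n."""
        inputs, targets = batch[:, :-1], batch[:, 1:]   # teacher forcing: correct tokens as input
        logits = model(inputs)                          # (batch, seq_len-1, vocab)
        loss = F.cross_entropy(                         # -log p of the true next word, averaged
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()                                 # gradient step on the prediction error
        optimizer.step()
        return loss.item()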

Training a transformer language model

[Figure: training with teacher forcing. The input tokens "So long and thanks for" are embedded (input encoding E plus position), passed through the stacked transformer blocks, and the language modeling head produces logits (via U) at each position. The next tokens are "long and thanks for all", and the loss at each position is the negative log probability the model assigns to that correct next token, e.g. −log y_long, −log y_and, ...]


Pretraining data for LLMs

LLMs are mainly trained on the web

Common Crawl: snapshots of the entire web produced by the non-profit Common Crawl, with billions of pages

Colossal Clean Crawled Corpus (C4; Raffel et al. 2020): 156 billion tokens of English, filtered

What's in it? Mostly patent text documents, Wikipedia, and news sites

The Pile: a pretraining corpus

[Figure: composition of The Pile, mixing academic text, web text, books, and dialog.]

Filtering for quality and safety

Quality is subjective
• Many LLMs attempt to match Wikipedia, books, particular websites
• Need to remove boilerplate, adult content
• Deduplication at many levels (URLs, documents, even lines)

Safety is also subjective
• Toxicity detection is important, although that has mixed results
• It can mistakenly flag data written in dialects like African American English

What does a model learn from pretraining?

• There are canines everywhere! One dog in the front room, and two dogs
• It wasn't just big, it was enormous
• The author of "A Room of One's Own" is Virginia Woolf
• The doctor told me that he
• The square root of 4 is 2

Big idea

Text contains enormous amounts of knowledge. Pretraining on lots of text with all that knowledge is what gives language models their ability to do so much.

But there are problems with scraping from the web

Copyright: much of the text in these datasets is copyrighted
• Not clear if the fair use doctrine in the US allows for this use
• This remains an open legal question

Data consent
• Website owners can indicate they don't want their site crawled

Privacy
• Websites can contain private IP addresses and phone numbers


Finetuning

Finetuning for adaptation to new domains

What happens if we need our LLM to work well on a domain it didn't see in pretraining?

Perhaps some specific medical or legal domain?

Or maybe a multilingual LM needs to see more data on some language that was rare in pretraining?

Finetuning

[Figure: Pretraining Data → Pretraining → Pretrained LM, then Fine-tuning Data → Fine-tuning → Fine-tuned LM]

"Finetuning"means4differentthings

We'lldiscuss1here,and3inlaterlectures

Inallfourcases,finetuningmeans:

takingapretrainedmodelandfurtheradaptingsomeorallofitsparameterstosomenewdata

1.Finetuningas"continuedpretraining"onnewdata

?Furthertrainalltheparametersofmodelonnewdata

?usingthesamemethod(wordprediction)andlossfunction(cross-entropyloss)asforpretraining.

?asifthenewdatawereatthetailendofthepretrainingdata

?Hencesometimescalledcontinuedpretraining


Evaluating Large Language Models

Perplexity

Just as for n-gram grammars, we use perplexity to measure how well the LM predicts unseen text.

The perplexity of a model θ on an unseen test set is the inverse probability that θ assigns to the test set, normalized by the test set length.

For a test set of n tokens w1:n the perplexity is:

    perplexity_θ(w1:n) = P_θ(w1:n)^(−1/n)

Why perplexity instead of raw probability of the test set?

• Probability depends on the size of the test set
• Probability gets smaller the longer the text
• Better: a metric that is per-word, normalized by length
• Perplexity is the inverse probability of the test set, normalized by the number of words

(The inverse comes from the original definition of perplexity from cross-entropy rate in information theory.)

Probability range is [0,1], perplexity range is [1,∞]

Perplexity

• The higher the probability of the word sequence, the lower the perplexity.
• Thus the lower the perplexity of a model on the data, the better the model.
• Minimizing perplexity is the same as maximizing probability.

Also: perplexity is sensitive to length/tokenization, so it is best used when comparing LMs that use the same tokenizer.
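As a concrete sketch, perplexity can be computed from the per-token probabilities the model assigns to a test set (plain NumPy; token_probs is assumed to hold P_θ(wi | w<i) for each test token):

    import numpy as np

    def perplexity(token_probs):
        """exp of the average negative log probability: the inverse probability
        of the test set, normalized by its length (a geometric mean)."""
        log_probs = np.log(np.asarray(token_probs, dtype=float))
        return float(np.exp(-log_probs.mean()))

    # e.g. a model that assigns every token probability 0.25 has perplexity 4.0
    print(perplexity([0.25, 0.25, 0.25, 0.25]))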

Many other factors that we evaluate, like:

Size: big models take lots of GPUs and time to train, and memory to store.

Energy usage: can measure kWh or kilograms of CO2 emitted.

Fairness: benchmarks measure gendered and racial stereotypes, or decreased performance for language from or about some groups.

Dealing with Scale

Scaling Laws

LLM performance depends on:
• Model size: the number of parameters, not counting embeddings
• Dataset size: the amount of training data
• Compute: the amount of compute (in FLOPs, etc.)

We can improve a model by adding parameters (more layers, wider contexts), adding more data, or training for more iterations.

The performance of a large language model (the loss) scales as a power law with each of these three.

Scaling Laws

Loss L as a function of the number of parameters N, dataset size D, or compute budget C (if the other two are held constant) follows a power law, roughly of the form:

    L(N) ≈ (Nc/N)^αN      L(D) ≈ (Dc/D)^αD      L(C) ≈ (Cc/C)^αC

Scaling laws can be used early in training to predict what the loss would be if we were to add more data or increase model size.

Number of non-embedding parameters N:

    N ≈ 12 n_layer d²

Thus GPT-3, with n_layer = 96 layers and dimensionality d = 12288, has 12 × 96 × 12288² ≈ 175 billion parameters.
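A one-line check of this arithmetic (the function name is just for illustration):

    def non_embedding_params(n_layer, d_model):
        """N ≈ 12 * n_layer * d_model**2 (non-embedding parameters)."""
        return 12 * n_layer * d_model ** 2

    print(non_embedding_params(96, 12288))   # 173_946_175_488, roughly GPT-3's 175 billion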

KV Cache

In training, we can compute attention very efficiently in parallel.

But not at inference! We generate the next tokens one at a time!

For a new token x, we need to multiply it by W^Q, W^K, and W^V to get its query, key, and value vectors.

But we don't want to recompute the key and value vectors for all the prior tokens x<i.

Instead, we store the key and value vectors in memory in the KV cache, and then we can just grab them from the cache.
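A minimal single-head NumPy sketch of one decoding step with a KV cache (the cache layout, a dict of lists, is an illustrative assumption): only the new token's query, key, and value are computed, while keys and values for earlier tokens are read back from the cache.

    import numpy as np

    def decode_step_with_kv_cache(x_new, W_Q, W_K, W_V, cache):
        """Attention output for one newly generated token (single head)."""
        q = x_new @ W_Q                         # only the new token's projections...
        cache["K"].append(x_new @ W_K)          # ...are computed; prior keys/values
        cache["V"].append(x_new @ W_V)          # are reused from the cache
        K = np.stack(cache["K"])                # (num_tokens, d_k)
        V = np.stack(cache["V"])                # (num_tokens, d_v)
        scores = K @ q / np.sqrt(K.shape[-1])   # q attends to all cached positions
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                # softmax over cached positions
        return weights @ V                      # weighted sum of cached values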

KV Cache

[Figure: the masked attention computation A = mask(Q Kᵀ) V, with Q of shape N×d_k, Kᵀ of shape d_k×N, the masked score matrix Q Kᵀ of shape N×N (entries q_i·k_j), V of shape N×d_v, and output A of shape N×d_v. At inference, the rows k_1…k_N and v_1…v_N for already-generated tokens are read from the KV cache rather than recomputed.]
