Document summary
彭美然, Senior Solutions Engineer, NVIDIA
January 9, 2023

Driving the Future of Enterprise Work
AI assistants will drive increased productivity for every job function
- Intelligent chatbots are the next killer enterprise application
- Humans' work will change from doing a lot of manual look-ups and information gathering to directing teams of LLMs and pulling together the results
- Enterprises will have 100s to 1,000s of these AI assistants across every job function
- IT spend is being increased to adopt these new copilot features because they increase productivity, differentiate products, and improve the user experience
- These chatbots will have intelligence as well as access to proprietary information

LLMs Are Powerful Tools but Not Accurate Enough for Enterprise
Without a connection to enterprise data sources, LLMs cannot provide accurate information
[Figure: the user sends a prompt to the foundation model and receives a response]
- Lacking proprietary knowledge
- Risk of outdated information
- Hallucinations

Agenda
- Retrieval augmented generation introduction
- Key techniques in RAG
- Solutions from NVIDIA
- AI copilot demo: RAG copilot

What is Retrieval Augmented Generation (RAG)?
RAG is to LLMs what an open-book exam is to humans
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020):
- General-purpose fine-tuning recipe
- Combines pre-trained parametric and non-parametric memory for language generation
In practice:
- A technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources.
- This approach constructs a comprehensive prompt enriched with context, historical data, and recent or relevant knowledge.
Stages: (1) Retrieve, (2) Augment, (3) Generate
References:
- Generative AI Knowledge Base Chatbot | NVIDIA
- Retrieval-Augmented Generation (RAG): From Theory to LangChain Implementation
- Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474.

Next Generation of Enterprise Applications Connect LLMs to Enterprise Data
Retrieval Augmented Generation Improves LLM Performance and Efficiency
- Improved Accuracy: models can answer questions about information without having been trained on that data
- Improved Efficiency: models can produce diverse outputs without sacrificing accuracy or efficiency
- Natural Language Interface: human-readable output texts that are easier for people to understand, raising user trust
- Contextual Understanding: AI models better understand context when generating text or other outputs
- Reduced Computational Costs: lower costs from retraining and from model size at inference

Key Techniques in Retrieval Augmented Generation (RAG)
RAG is to LLMs what an open-book exam is to humans
Stages: (1) Retrieve, (2) Augment, (3) Generate
- Non-parametric memory (knowledge source):
  - Documents Loader
  - Embedding Model
  - Vector Database
  - Database Search
- Pre-trained parametric (LLM):
  - Foundation LLM
  - LLM Deployment

Key Techniques in Retrieval Augmented Generation (RAG)
Documents Loader | Vector Database | Embedding Model | Database Search
Ingestion: encoding the knowledge base (offline)
[Figure: Knowledge base (KB) documents -> Chunking -> Document chunks -> Embedding Model -> Doc embeddings -> Vector Database]
Retrieval: retrieval from the vector database based on the user's query
[Figure: User -> Query -> Embedding Model -> Query embedding -> Vector Database -> Top K Relevant Chunks]
Build Enterprise Retrieval-Augmented Generation Apps with NVIDIA Retrieval QA Embedding Model

Steps to prepare the database [2][3][4][5]:
1. Load documents of different types: pdf, html, c++, python, … etc.
2. Split the documents into chunks
3. Convert the text chunks into vectors via an embedding model
4. Store the document texts, vectors, and metadata in the vector database

Step 1. Load documents of different types: pdf, html, c++, python, … etc.
Tips:
- Know what your text types are
- Clean the data
LlamaIndex:
- Unstructured data
- Structured data
- Semi-structured data

Step 2. Split the documents into chunks
LLMs have a limited "window" of text input length. The trick here is to find a size that works for you.
- Fixed size chunking
- Variable size chunking (some marker is used to split the text)
- Overlap between chunks

Step 3. Convert the text chunks into vectors via an embedding model
- General encoder architecture model
- Converts text, images, etc. to multi-dimensional vectors
- The model is trained to embed similar inputs close together

Step 4. Store the document texts, vectors, and metadata in the vector database
Vector databases: Chroma, FAISS, Milvus, Redis
[6] Optimizing RAG: A Guide to Choosing the Right Vector Database

Steps when searching the database [5]:
1. Convert the input query to a vector via the same embedding model
2. Similarity search in the vector database (precision)

Embeddings and the Vector Database
Searching via semantic similarity
[Figure: 2D representation of a 768-dimension embedding space, with clusters labeled Scientific computing, High-performance computing, Specialized medical topics, and Speech AI]
- Embeddings are data (text, image, or other data) represented as numerical vectors
- Input text goes into the embedding model, which outputs a vector
- Part of semantic search
- The model is trained to embed similar inputs close together
- Other use cases: classification, clustering, topic discovery
- Many pretrained and trainable embedding model sources
- Modern ones are often deep neural networks
Example:
Query: Who will lead the construction team?
Chunk 1: The construction team found lead in the paint.
Chunk 2: Ozzy has been picked to lead the group.
Chunk 1 shares more keywords with the query, but semantic search can differentiate the meanings of "lead" and understand that "team" and "group" are similar, so Chunk 2 may be more helpful for the query.

Improving step 2, similarity search in the vector database
- Query routing: direct user queries to the appropriate index
  - LlamaIndex: RetrieverRouterQueryEngine, RouterQueryEngine
  - LangChain: MultiIndex, Router
- Query transformations [8][9]: change the input query to improve the retrieved contexts
  - Rephrasing
  - HyDE [10]
  - Sub-queries / question decomposition
- Sentence-window retrieval [11]: use small chunks, then retrieve the window of appropriate sentences before and after the retrieved one
- Auto-merge retrieval [11]: organize small chunks in a tree-like structure and merge smaller chunks into a larger context for the LLM
- Different index types: hybrid with keyword and embedding
  - Keyword-based index for queries relating to a specific product
  - Embeddings for general customer support
- Re-ranker [7]: use a re-ranker model to re-rank the retrieved documents; this addresses the discrepancy between similarity and relevance
- Meta-data filtering: add meta-data to your chunks and use meta-data filtering to help process the results
- Prompt compression [17]: compress irrelevant context, highlight pivotal paragraphs, and reduce the overall context length

Summary of retrieval techniques
Method | Mechanism
Query routing | Directing user queries to the appropriate index.
Query transformations | Change the input query to improve the retrieved contexts.
Sentence-window retrieval | Small chunk size; retrieve the window of appropriate sentences before and after the retrieved one.
Auto-merge retrieval | Small chunk size; chunks organized in a tree-like structure; merge smaller chunks into a larger context for the LLM.
Different index types | Hybrid with keyword and embeddings: keyword-based index for queries relating to a specific product; embeddings for general customer support.
Re-ranker | Use a re-ranker model to re-rank the retrieved documents. Solves the discrepancy between similarity and relevance.
Meta-data filtering | Add meta-data to your chunks; use meta-data filtering to help process the results.
Prompt compression | Use a small language model to compress irrelevant context, highlight pivotal paragraphs, and reduce the overall context length.

Key Techniques in Retrieval Augmented Generation (RAG)
LLM Generation
- Foundation Models
- Prompt Engineering
- Customized LLM
- Easy to deploy
- Low latency, high throughput
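
The ingestion and retrieval steps above (chunk, embed, store, then embed the query and take the top K chunks) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the pipeline from the talk: `chunk_text`, `embed`, `ToyVectorStore`, and `build_prompt` are hypothetical names, the bag-of-words `embed` is a keyword-based stand-in for a real embedding model (so it does not capture the semantic similarity described in the slides), and the in-memory list stands in for a vector database such as Milvus or FAISS.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=8, overlap=2):
    """Fixed-size chunking over words, with overlap between neighbouring chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def embed(text):
    """ASSUMPTION: toy stand-in for an embedding model (sparse word counts)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """ASSUMPTION: in-memory stand-in for a vector database."""

    def __init__(self):
        self.rows = []  # each row: (chunk text, vector, metadata)

    def add(self, text, metadata):
        self.rows.append((text, embed(text), metadata))

    def search(self, query, k=2):
        """Embed the query with the same model and return the top-k chunks."""
        query_vec = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(query_vec, row[1]), reverse=True)
        return [(text, metadata) for text, _, metadata in ranked[:k]]


def build_prompt(query, hits):
    """Augment step: place the retrieved chunks in front of the question."""
    context = "\n".join("- " + text for text, _ in hits)
    return "Answer using only this context:\n" + context + "\n\nQuestion: " + query


if __name__ == "__main__":
    doc = ("RAFT offers fast nearest neighbor primitives on GPU. "
           "Milvus and FAISS store embeddings for similarity search.")
    store = ToyVectorStore()
    for i, chunk in enumerate(chunk_text(doc, chunk_size=8, overlap=2)):
        store.add(chunk, {"source": "demo.txt", "chunk": i})
    question = "Which library offers nearest neighbor primitives on GPU?"
    print(build_prompt(question, store.search(question, k=1)))
```

In a real deployment the `embed` function would call an embedding model and `search` would be delegated to the vector database; the prompt returned by `build_prompt` is what would then be passed to the LLM for the generate step.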

Evaluating RAG Pipeline
RAGAS: evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
ragas score:
- Retrieval
  - context precision: the signal-to-noise ratio of the retrieved context
  - context recall: whether the information required to answer the question is retrieved
- Generation
  - faithfulness: how factually accurate is the generated answer
  - answer relevancy: how relevant is the generated answer to the question
TruLens: evaluation and tracking for LLM experiments
RAG Triad (TruLens):
- Answer Relevance: is the answer relevant to the query?
- Context Relevance: is the retrieved context relevant to the query?
- Groundedness: is the response supported by the context?

NVIDIA Solutions for RAG
NVIDIA Solution for Retrieval Augmented Generation
- RAPIDS RAFT to accelerate vector database search
- Foundation Models
- Model Deployment
- Reference Samples

Best Commercial Grade Embedding Model for LLMs
Part of NeMo Retriever, NVIDIA Text QA Embedding performs 20 pts better than commercial offerings out of the box
- Higher accuracy results
- Reduced fine-tuning requirements
- Lower occurrence of hallucinations
[Figure: Recall Top 5 benchmark comparing NVIDIA Retrieval QA Embedding, E5 Unsupervised (best non-commercial), and Lexical Search (BM-25); bar values in the extracted text are 79% (also appearing as 76%), 63%, and 56%]
Comparing NVIDIA Text QA Embedding Model vs Other Available Options. Recall Top 5, 300 token chunk size, averaging across representative customer datasets from Telco, IT, Consulting, Energy
Experience the NVIDIA Retrieval QA Embedding Model

RAPIDS RAFT
VECTOR DATABASES ARE BECOMING ESSENTIAL
[Figure: Embeddings -> Indexing -> Vector Database -> Querying -> Retrieving -> Apps]
ANNOUNCING NEW PARTNERS LEVERAGING RAFT: redis
RAFT TURBOCHARGES VECTOR SEARCH
- Vector search engines allow users to query massive datasets of embeddings for approximate matches
- Vector search typically happens using Nearest Neighbor (NN) or Approximate Nearest Neighbor (ANN) methods
- The RAFT library offers very fast NN and ANN primitives on GPU
- Accelerates indexing, loading, and retrieving a batch of neighbors for a single query
USE-CASES: Large Language Models, RecSys, Computer Vision

RAPIDS RAFT
GPU-Accelerated Vector Search for Large Language Models
- Brute-force
- Algorithms for ANN search:
  - IVF-PQ
  - CAGRA
Materials:
- RAFT Document
- Accelerating Vector Search: Using GPU-Powered Indexes with RAPIDS RAFT
- Accelerating Vector Search: Fine-Tuning GPU Index Algorithms
- Accelerated Vector Search: Approximating with RAPIDS RAFT IVF-Flat
- RAPIDS/RAFT GitHub

Powerful Generative Foundation Models
Suite of generative foundation language models built for enterprise hyper-personalization
- Fastest Responses: Nemotron-3 8B. GPT-8B w/ 3.5T tokens. + SFT, SteerLM. 53 languages. I/O: 4K tokens
- For Complex Tasks: Nemotron-2 43B. GPT-43B w/ 1.1T tokens. + SFT private mix. 50 languages. I/O: 4K tokens
- Balance of Accuracy-Latency: Nemotron-2 22B. GPT-22B w/ 1.1T tokens. + SFT private mix. 50 languages. I/O: 4K tokens
Explore Foundation Models in NGC

Enterprise Grade Foundation Models with NVIDIA Nemotron-3 8B
Designed for production-ready generative AI that can be customized and deployed at scale
- Enterprise-Ready Foundation Models: trained on responsibly sourced data, with high accuracy, optimized for smooth enterprise integration
- One Model for All Major Languages: trained on 53 languages and 37 coding languages. Nemotron-3 8B offers the best openly available multilingual LLM
- Advanced and Flexible Customization:
  - Base for customization, including PEFT and continuous pre-training for domain-adapted LLMs
  - Chat-SFT is a building block for instruction tuning custom models or user-defined alignment
  - Chat-RLHF for best out-of-the-box chat model performance
  - Chat-SteerLM for best out-of-the-box chat model with flexible alignment at inference time
  - Question & Answer LLMs customized on knowledge bases
NVIDIA AI Foundation Models: Build Custom Enterprise Chatbots and Co-Pilots with Production-Ready LLMs

SoTA Performance for Large Language Models for Production Deployments
Challenges: LLM performance is crucial for real-time, cost-effective production deployments. Rapid evolution in the LLM ecosystem, with new models & techniques released regularly, requires a performant, flexible solution to optimize models.
TensorRT-LLM is an open-source library to optimize inference performance on the latest Large Language Models for NVIDIA GPUs. It is built on FasterTransformer and TensorRT with a simple Python API for defining, optimizing, & executing LLMs for inference in production.
- SoTA Performance: leverage TensorRT compilation & kernels from FasterTransformers, CUTLASS, OAI Triton, ++
- Ease Extension: add new operators or models in Python to quickly support new LLMs with optimized performance
- LLM Batching with Triton: maximize throughput and GPU utilization through new scheduling techniques for LLMs
Code excerpt from the slide:
    # define a new activation
    def silu(input: Tensor) -> Tensor:
        return input * sigmoid(input)

    # implement models like in DL FWs
    class LlamaModel(Module)
        self.layers = ModuleList([…])
        hidden = self.embedding(…)
        for layer in self.layers:
            hidden_states = layer(hidden)

SoTA Performance for Large Language Models for Production Deployments
Key Features
TensorRT-LLM contains examples that implement the following features: Multi-head Attention (MHA), Multi-query Attention (MQA), Group-query Attention (GQA), In-flight Batching, Paged KV Cache for the Attention, Tensor Parallelism, Pipeline Parallelism, INT4/INT8 Weight-Only Quantization (W4A16 & W8A16), SmoothQuant, GPTQ, AWQ, FP8, Greedy-search, Beam-search, RoPE.
In this release of TensorRT-LLM, some of the features are not enabled for all the models listed in the examples folder.
Models
The list of supported models is: Baichuan, BART, Bert, Blip2, BLOOM, ChatGLM, FairSeq NMT, Falcon, Flan-T5, GPT, GPT-J, GPT-Nemo, GPT-NeoX, InternLM, LLaMA, LLaMA-v2, mBART, Mistral, MPT, mT5, OPT, Qwen, Replit Code, SantaCoder, StarCoder, T5, Whisper.

TensorRT-LLM Available Now!
Key resources for TensorRT-LLM:
- TensorRT-LLM GitHub: source for TensorRT-LLM
- Getting Started Blog: learn to optimize and deploy TensorRT-LLM with Triton Server
- TensorRT-LLM documentation: API docs, arch overviews, & perf data
- NVIDIA Triton Backend: source for the TensorRT-LLM Triton Backend

Retrieval Augmented Generation (RAG) with Guardrails
RAG is to LLMs what an open-book exam is to humans
[Figure: (1) Retrieve, (2) Augment, (3) Generate, with guardrails applied around the stages]

NeMo Guardrails
Open Source Software for Developing Safe and Trustworthy LLM-powered Chatbots
[Figure: User <-> NeMo Guardrails <-> enterprise application (LLMs, third-party apps, LLM app toolkits)]
- Integrated into the NVIDIA NeMo Framework
- Part of the NVIDIA AI Enterprise Software Suite
- Open source on GitHub: /NVIDIA/NeMo-Guardrails

Evaluating RAG Pipeline: RAG Evaluation Tool
There are 3 components needed for evaluating the performance of a RAG pipeline:
1. Data for testing.
2. Automated metrics to measure the performance of both the context retrieval and the response generation.
3. Human-like evaluation of the generated response from the end-to-end pipeline.

RAG Pipeline Samples in NVIDIA Generative AI
Out-of-box sample with RAG
Generative AI Samples: generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.
Linux developer RAG:
- LangChain + LlamaIndex
- LLM: Llama2-13B
- Embedding model: e5-large-v2
- Deployment: TRT-LLM and Triton
- DB: Milvus
Windows developer RAG:
- LangChain + LlamaIndex
- LLM: Llama2-13B
- Embedding model: all-MiniLM-L6-v2
- Deployment: TRT-LLM
- DB: FAISS

Case Study
Example: RAG Copilots
Question answering chatbot and interactive code generation with a VS Code extension

Example: ChipNeMo
Custom tokenizers | Domain-adaptive continued pretraining | Supervised fine-tuning (SFT) with domain-specific instructions | Domain-adapted retrieval models
References:
- ChipNeMo: Domain-Adapted LLMs for Chip Design
- Silicon Volley: Designers Tap Generative AI for a Chip Assist
Use cases:
- Engineering assistant chatbot
- EDA scripts generation
- Bug summarization

Retrieval Augmented Generation and Fine Tuning
Definition
- Fine tuning: long-term memory; modifies the base model; teaches the model how to follow user-specified instructions; replicates specific structures, styles, or formats
- Retrieval Augmented Generation (RAG): short-term memory; enhances the model's context understanding through non-parametric memory; answers specific inquiries or solves specific information query tasks
Knowledge
- Fine tuning: cutoff date of the training dataset
- RAG: up-to-date knowledge
Effort
- Fine tuning: higher
- RAG: lower
Reducing hallucinations
- RAG: inherently less prone to hallucinations, as each answer is grounded in retrieved evidence.
- Fine tuning: can help reduce hallucinations by training the model on specific domain data, but …
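
The retrieval-side metrics named on the evaluation slides (context precision and context recall) can be illustrated with a simplified, set-based sketch. This is an assumption-laden illustration and not the RAGAS implementation, which estimates these quantities with LLM judgments: here the set of relevant chunk IDs is assumed to be known from labeled test data, and `context_precision` and `context_recall` are hypothetical helper names.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are relevant (signal-to-noise of the context)."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Share of the relevant chunks that the retriever actually returned."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)


if __name__ == "__main__":
    # ASSUMPTION: labeled test data says chunks c1, c2, and c7 answer the question.
    retrieved = ["c1", "c4", "c7", "c9"]
    relevant = {"c1", "c2", "c7"}
    print("context precision:", context_precision(retrieved, relevant))
    print("context recall:", context_recall(retrieved, relevant))
```

With four retrieved chunks of which two are relevant, and three relevant chunks in total, precision is 2/4 and recall is 2/3; raising K typically trades precision for recall.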