GPU-Based Retrieval-Augmented Generation (RAG) - NVIDIA LLM Day

Meiran Peng (彭美然), Senior Solutions Engineer, NVIDIA
January 9, 2023

Driving the Future of Enterprise Work
AI assistants will drive increased productivity for every job function
• Intelligent chatbots are the next killer enterprise application
• Humans' work will change from doing a lot of manual look-ups and information gathering to directing teams of LLMs and pulling together the results
• Enterprises will have hundreds to thousands of these AI assistants across every job function
• IT spend is being increased to adopt these new copilot features because they drive increased productivity, product differentiation, and improved experience
• These chatbots will have intelligence as well as access to proprietary information

LLMs Are Powerful Tools but Not Accurate Enough for Enterprise
Without a connection to enterprise data sources, LLMs cannot provide accurate information
[Diagram: User -> Prompt -> Foundation Model -> Response]
• Lacking proprietary knowledge
• Risk of outdated information
• Hallucinations

Agenda
• Retrieval augmented generation introduction
• Key techniques in RAG
• Solutions from NVIDIA
• AI copilot demo: RAG copilot

What is Retrieval Augmented Generation (RAG)?
RAG is to LLMs what an open-book exam is to humans
Retrieval-Augmented Generation for Knowledge-Intensive NLP (Lewis et al., 2020):
• General-purpose fine-tuning recipe
• Combines pre-trained parametric and non-parametric memory for language generation
• A technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources
• This approach constructs a comprehensive prompt enriched with context, historical data, and recent or relevant knowledge
(1) Retrieve (2) Augment (3) Generate
Sources:
• Generative AI Knowledge Base Chatbot | NVIDIA
• Retrieval-Augmented Generation (RAG): From Theory to LangChain Implementation
• Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474.

Next Generation of Enterprise Applications Connect LLMs to Enterprise Data
Retrieval Augmented Generation Improves LLM Performance and Efficiency
• Improved Accuracy: models can answer questions about information without having been trained on that data
• Improved Efficiency: models can produce diverse outputs without sacrificing accuracy or efficiency
• Natural Language Interface: human-readable output texts that are easier for people to understand, raising user trust
• Contextual Understanding: AI models better understand context when generating text or other outputs
• Reduced Computational Costs: reduces costs from retraining and from model size at inference

Key Techniques in Retrieval Augmented Generation (RAG)
RAG is to LLMs what an open-book exam is to humans
(1) Retrieve (2) Augment (3) Generate
• Non-parametric memory (knowledge source): Documents Loader, Embedding Model, Vector Database, Database Search
• Pre-trained parametric memory (LLM): Foundation LLM, LLM Deployment

Documents Loader | Vector Database | Embedding Model | Database Search
Ingestion: encoding the knowledge base (offline)
Knowledge base (KB) -> Documents -> Chunking -> Document chunks -> Embedding Model -> Doc embeddings -> Vector Database
Retrieval: retrieval from the vector database based on the user's query
User -> Query -> Embedding Model -> Query embedding -> Vector Database -> Top K Relevant Chunks
See also: Build Enterprise Retrieval-Augmented Generation Apps with NVIDIA Retrieval QA Embedding Model

Steps to prepare the database [2][3][4][5]:
1. Load documents of different types: pdf, html, c++, python, etc.
2. Split the documents into chunks
3. Convert the text chunks into vectors via an embedding model
4. Store document texts, vectors, and metadata in the vector database

Step 1. Load documents of different types
Tips:
• Know what your text types are
• Clean the data
LlamaIndex:
• Unstructured Data
• Structured Data
• Semi-Structured

Step 2. Split documents into chunks
LLMs have a limited "window" of text input length. The trick here is to find a chunk size that works for you.
• Fixed size chunking
• Variable size chunking (some marker is used to split the text)
• Overlap between chunks

Step 3. Convert the text chunks into vectors via an embedding model
• General encoder architecture model
• Converts text, images, etc. to multi-dimensional vectors
• Model trained to embed similar inputs close together

Step 4. Store document texts, vectors, and metadata in the vector database
Examples: Chroma, FAISS, Milvus, Redis
[6] Optimizing RAG: A Guide to Choosing the Right Vector Database

Steps when searching the database [5]:
1. Convert the input query to a vector via the same embedding model
2. Similarity search in the vector database - Precision

Embeddings and the Vector Database
Searching via semantic similarity
[Figure: 2D representation of a 768-dimension embedding space, with clusters labeled Scientific computing, High-performance computing, Specialized medical topics, Speech AI]
• Embeddings are data (text, image, or other data) represented as numerical vectors
• Input text goes to the embedding model, which outputs a vector
• Part of semantic search
• Model trained to embed similar inputs close together
• Other use cases: classification, clustering, topic discovery
• Many pretrained and trainable embedding model sources
• Modern ones are often deep neural networks
Query: Who will lead the construction team?
Chunk 1: The construction team found lead in the paint.
Chunk 2: Ozzy has been picked to lead the group.
Chunk 1 shares more keywords with the query, but semantic search can differentiate the meanings of "lead" and understand that "team" and "group" are similar, so Chunk 2 may be more helpful for the query.

Improving similarity search in the vector database
• Query routing: LlamaIndex (RetrieverRouterQueryEngine, RouterQueryEngine); LangChain (MultiIndex, Router)
• Query transformations [8][9]: rephrasing; HyDE [10]; sub-queries / question decomposition
• Sentence-window retrieval [11]
• Auto-merge retrieval [11]
• Different index types: hybrid with keyword and embedding
• Re-ranker [7]
• Metadata filtering
• Prompt compression [17]

Method | Mechanism
Query routing | Direct user queries to the appropriate index.
Query transformations | Change the input query to improve the retrieved contexts.
Sentence-window retrieval | Index small chunks; retrieve a window of sentences before and after the retrieved one.
Auto-merge retrieval | Index small chunks organized in a tree-like structure; merge smaller chunks into a larger context for the LLM.
Different index types | Hybrid of keyword and embeddings: a keyword-based index for queries relating to a specific product, embeddings for general customer support.
Re-ranker | Use a re-ranker model to re-rank the retrieved documents. Addresses the discrepancy between similarity and relevance.
Metadata filtering | Add metadata to your chunks; use metadata filtering to help process results.
Prompt compression | Use a small language model to compress irrelevant context, highlight pivotal paragraphs, and reduce the overall context length.

LLM Generation
• Foundation Models
• Prompt Engineering
• Customized LLM
• Easy to deploy
• Low latency, high throughput

Evaluating RAG Pipeline
RAGAS: evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
ragas score:
• Retrieval
  - context precision: the signal-to-noise ratio of the retrieved context
  - context recall: whether the retrieved context contains the information required to answer the question
• Generation
  - faithfulness: how factually accurate the generated answer is
  - answer relevancy: how relevant the generated answer is to the question
TruLens: evaluation and tracking for LLM experiments
RAG Triad (TruLens):
• Answer Relevance: is the answer relevant to the query?
• Context Relevance: is the retrieved context relevant to the query?
• Groundedness: is the response supported by the context?

NVIDIA Solutions for RAG
• RAPIDS RAFT to accelerate vector database search
• Foundation Models
• Model Deployment
• Reference Samples

Best Commercial Grade Embedding Model for LLMs
Part of NeMo Retriever, NVIDIA Text QA Embedding performs 20 pts better than commercial offerings out of the box
• Higher accuracy results
• Reduced fine tuning requirements
• Lower occurrence of hallucinations
[Figure: Recall Top 5 Benchmark, bar chart with bars labeled NVIDIA Retrieval QA Embedding, E5 Unsupervised, Lexical Search (BM-25), and "best non-commercial"; extracted bar values: 79% (also extracted as 76%), 63%, 56%]
Comparing NVIDIA Text QA Embedding Model vs other available options. Recall Top 5, 300 token chunk size, averaging across representative customer datasets from Telco, IT, Consulting, Energy.
See also: Experience the NVIDIA Retrieval QA Embedding Model

RAPIDS RAFT
VECTOR DATABASES ARE BECOMING ESSENTIAL
Embeddings -> Indexing -> Vector Database -> Querying -> Retrieving -> Apps
ANNOUNCING NEW PARTNERS LEVERAGING RAFT: redis
RAFT TURBOCHARGES VECTOR SEARCH
• Vector search engines allow users to query massive datasets of embeddings for approximate matches
• Vector search typically happens using Nearest Neighbor (NN) or Approximate Nearest Neighbor (ANN) methods
• The RAFT library offers very fast NN and ANN primitives on GPU
• Accelerates indexing, loading, and retrieving a batch of neighbors for a single query
USE-CASES: Large Language Models, RecSys, Computer Vision

RAPIDS RAFT: GPU-Accelerated Vector Search for Large Language Models
• Brute-force
• Algorithms for ANN search: IVF-PQ, CAGRA
Materials:
• RAFT Document
• Accelerating Vector Search: Using GPU-Powered Indexes with RAPIDS RAFT
• Accelerating Vector Search: Fine-Tuning GPU Index Algorithms
• Accelerated Vector Search: Approximating with RAPIDS RAFT IVF-Flat
• RAPIDS/RAFT GitHub

Powerful Generative Foundation Models
Suite of generative foundation language models built for enterprise hyper-personalization
Use case | Model | Details
Fastest Responses | Nemotron-3 8B | GPT-8B w/ 3.5T tokens. + SFT, SteerLM. 53 languages. I/O: 4K tokens
Balance of Accuracy-Latency | Nemotron-2 22B | GPT-22B w/ 1.1T tokens. + SFT private mix. 50 languages. I/O: 4K tokens
For Complex Tasks | Nemotron-2 43B | GPT-43B w/ 1.1T tokens. + SFT private mix. 50 languages. I/O: 4K tokens
See also: Explore Foundation Models in NGC

Enterprise Grade Foundation Models with NVIDIA Nemotron-3 8B
Designed for production-ready generative AI that can be customized and deployed at scale
• Enterprise-Ready Foundation Models: trained on responsibly sourced data, with high accuracy, optimized for smooth enterprise integration
• One Model for All Major Languages: trained on 53 languages and 37 coding languages. Nemotron-3 8B offers the best openly available multilingual LLM
• Advanced and Flexible Customization:
  - Base for customization, including PEFT and continuous pre-training for domain-adapted LLMs
  - Chat-SFT is a building block for instruction tuning custom models or user-defined alignment
  - Chat-RLHF for best out-of-the-box chat model performance
  - Chat-SteerLM for best out-of-the-box chat model with flexible alignment at inference time
  - Question & Answer LLMs customized on knowledge bases
See also: NVIDIA AI Foundation Models: Build Custom Enterprise Chatbots and Co-Pilots with Production-Ready LLMs

SoTA Performance for Large Language Models for Production Deployments
Challenges: LLM performance is crucial for real-time, cost-effective production deployments. Rapid evolution in the LLM ecosystem, with new models and techniques released regularly, requires a performant, flexible solution to optimize models.
TensorRT-LLM is an open-source library to optimize inference performance on the latest Large Language Models for NVIDIA GPUs. It is built on FasterTransformer and TensorRT with a simple Python API for defining, optimizing, and executing LLMs for inference in production.
• SoTA Performance: leverage TensorRT compilation and kernels from FasterTransformers, CUTLASS, OAI Triton, ++
• Ease Extension: add new operators or models in Python to quickly support new LLMs with optimized performance
• LLM Batching with Triton: maximize throughput and GPU utilization through new scheduling techniques for LLMs
Slide pseudocode:

    # define a new activation
    def silu(input: Tensor) -> Tensor:
        return input * sigmoid(input)

    # implement models like in DL FWs
    class LlamaModel(Module):
        self.layers = ModuleList([…])
        hidden = self.embedding(…)
        for layer in self.layers:
            hidden = layer(hidden)

Key Features
TensorRT-LLM contains examples that implement the following features.
• Multi-head Attention (MHA)
• Multi-query Attention (MQA)
• Group-query Attention (GQA)
• In-flight Batching
• Paged KV Cache for the Attention
• Tensor Parallelism
• Pipeline Parallelism
• INT4/INT8 Weight-Only Quantization (W4A16 & W8A16)
• SmoothQuant
• GPTQ
• AWQ
• FP8
• Greedy-search
• Beam-search
• RoPE
In this release of TensorRT-LLM, some of the features are not enabled for all the models listed in the examples folder.

Models
The list of supported models is: Baichuan, BART, Bert, Blip2, BLOOM, ChatGLM, FairSeq NMT, Falcon, Flan-T5, GPT, GPT-J, GPT-Nemo, GPT-NeoX, InternLM, LLaMA, LLaMA-v2, mBART, Mistral, MPT, mT5, OPT, Qwen, Replit Code, SantaCoder, StarCoder, T5, Whisper.

TensorRT-LLM Available Now!
Key resources for TensorRT-LLM:
• TensorRT-LLM GitHub: source for TensorRT-LLM
• Getting Started Blog: learn to optimize and deploy TensorRT-LLM with Triton Server
• TensorRT-LLM documentation: API docs, architecture overviews, and performance data
• NVIDIA Triton Backend: source for the TensorRT-LLM Triton Backend

Retrieval Augmented Generation (RAG) with Guardrails
RAG is to LLMs what an open-book exam is to humans
[Diagram: (1) Retrieve -> (2) Augment -> (3) Generate, with Guardrails applied around the stages]

NeMo Guardrails
Open source software for developing safe and trustworthy LLM-powered chatbots
[Diagram: User <-> NeMo Guardrails <-> Enterprise application (LLMs, Third-Party Apps, LLM App Toolkits)]
• Integrated into the NVIDIA NeMo Framework
• Part of the NVIDIA AI Enterprise Software Suite
• Open source on GitHub: /NVIDIA/NeMo-Guardrails

RAG Evaluation Tool
There are 3 components needed for evaluating the performance of a RAG pipeline:
1. Data for testing.
2. Automated metrics to measure performance of both the context retrieval and response generation.
3. Human-like evaluation of the generated response from the end-to-end pipeline.

RAG Pipeline Samples in NVIDIA Generative AI
Out-of-box sample with RAG
Generative AI Samples: Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.
Linux developer RAG:
• LangChain + LlamaIndex
• LLM: Llama2-13B
• Embedding model: e5-large-v2
• Deployment: TRT-LLM and Triton
• DB: Milvus
Windows developer RAG:
• LangChain + LlamaIndex
• LLM: Llama2-13B
• Embedding model: all-MiniLM-L6-v2
• Deployment: TRT-LLM
• DB: FAISS

Case Study
Example: RAG Copilots
Question answering chatbot and interactive code generation with a VSCode extension
Example: ChipNeMo
Custom tokenizers | Domain-adaptive continued pretraining | Supervised fine-tuning (SFT) with domain-specific instructions | Domain-adapted retrieval models
Sources: ChipNeMo: Domain-Adapted LLMs for Chip Design; Silicon Volley: Designers Tap Generative AI for a Chip Assist
• Engineering assistant chatbot
• EDA scripts generation
• Bug summarization

Retrieval Augmented Generation and Fine Tuning
Aspect | Fine Tuning | Retrieval Augmented Generation (RAG)
Definition | Long term memory. Modifies the base model. Teaches the model how to follow user-specified instructions. Replicates specific structures, styles, or formats. | Short term memory. Enhances the model's context understanding through non-parametric memory. Answers specific inquiries or solves specific information query tasks.
Knowledge | Cutoff date of the training dataset | Up-to-date knowledge
Effort | Higher | Lower
Reducing Hallucinations | Can help reduce hallucinations by training the model based on specific domain data but … | Inherently less prone to hallucinations as each answer is grounded in retrieved evidence.
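
The ingestion steps (chunk, embed, store) and retrieval steps (embed the query, similarity search, top K) described in the Key Techniques slides can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `chunk_text`, `embed`, `cosine`, and `ToyVectorStore` are hypothetical names, the "embedding" is a plain bag-of-words count rather than a trained model, and the store is an in-memory list rather than Chroma, FAISS, Milvus, or Redis.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=8, overlap=2):
    """Fixed-size chunking by words, with `overlap` words shared between neighbours."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def embed(text):
    """Toy stand-in for an embedding model: lower-cased word counts (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """In-memory store holding (text, vector, metadata) rows (hypothetical helper)."""

    def __init__(self):
        self.rows = []

    def add(self, text, metadata=None):
        # Ingestion: embed the chunk and keep text, vector, and metadata together.
        self.rows.append((text, embed(text), metadata or {}))

    def search(self, query, top_k=2):
        # Retrieval: embed the query with the same function, rank by similarity.
        query_vec = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(query_vec, row[1]), reverse=True)
        return [(text, metadata) for text, _, metadata in ranked[:top_k]]
```

Because this toy embedding only counts shared words, it ranks "The construction team found lead in the paint." above "Ozzy has been picked to lead the group." for the query "Who will lead the construction team?", which is the keyword-matching failure the slides describe. A trained embedding model is what lets semantic search prefer the second chunk.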

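Sentence-window retrieval and the Augment step can be sketched in the same spirit. This is a simplified illustration under stated assumptions, not the LlamaIndex implementation: `split_sentences`, `keyword_overlap`, `sentence_window_retrieve`, and `build_prompt` are hypothetical helpers, the best sentence is picked by simple word overlap instead of embedding similarity, and the prompt template wording is invented for the example.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def keyword_overlap(query, sentence):
    """Number of distinct lower-cased words shared by query and sentence."""
    def tokens(s):
        return set(re.findall(r"[a-z0-9]+", s.lower()))
    return len(tokens(query) & tokens(sentence))


def sentence_window_retrieve(sentences, query, window=1):
    """Find the best-matching single sentence, then return it with its neighbours."""
    if not sentences:
        return ""
    best = max(range(len(sentences)), key=lambda i: keyword_overlap(query, sentences[i]))
    lo = max(0, best - window)
    hi = min(len(sentences), best + window + 1)
    return " ".join(sentences[lo:hi])


def build_prompt(question, context):
    """Augment step: place the retrieved context in front of the user question."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


text = (
    "RAFT is a RAPIDS library. It offers ANN primitives on GPU. "
    "IVF-PQ and CAGRA are ANN algorithms. Milvus is a vector database."
)
question = "Which ANN algorithms exist?"
context = sentence_window_retrieve(split_sentences(text), question, window=1)
prompt = build_prompt(question, context)
```

Matching on a single sentence keeps the search target small and specific, while the surrounding window gives the LLM enough context to answer; the resulting `prompt` string is what would be sent to the generation step.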