NVIDIA AMPERE GA102 GPU ARCHITECTURE

Second-Generation RTX

Updated with NVIDIA RTX A6000 and NVIDIA A40 Information V2.0

Table of Contents

Introduction
GA102 Key Features
2x FP32 Processing
Second-Generation RT Core
Third-Generation Tensor Cores
GDDR6X and GDDR6 Memory
Third-Generation NVLink®
PCIe Gen 4
Ampere GPU Architecture In-Depth
GPC, TPC, and SM High-Level Architecture
ROP Optimizations
GA10x SM Architecture
2x FP32 Throughput
Larger and Faster Unified Shared Memory and L1 Data Cache
Performance Per Watt
Second-Generation Ray Tracing Engine in GA10x GPUs
Ampere Architecture RTX Processors in Action
GA10x GPU Hardware Acceleration for Ray-Traced Motion Blur
Third-Generation Tensor Cores in GA10x GPUs
Comparison of Turing vs GA10x GPU Tensor Cores
NVIDIA Ampere Architecture Tensor Cores Support New DL Data Types
Fine-Grained Structured Sparsity
NVIDIA DLSS 8K
GDDR6X Memory
RTX IO
Introducing NVIDIA RTX IO
How NVIDIA RTX IO Works
Display and Video Engine
DisplayPort 1.4a with DSC 1.2a
HDMI 2.1 with DSC 1.2a
Fifth Generation NVDEC - Hardware-Accelerated Video Decoding
AV1 Hardware Decode
Seventh Generation NVENC - Hardware-Accelerated Video Encoding
Conclusion
Appendix A - Additional GeForce GA10x GPU Specifications
GeForce RTX 3090
GeForce RTX 3070
Appendix B - New Memory Error Detection and Replay (EDR) Technology
Appendix C - RTX A6000 GPU Performance

List of Figures

Figure 1. Ampere GA10x Architecture - A Giant Leap
Figure 2. GA102 Full GPU with 84 SMs
Figure 3. GA10x Streaming Multiprocessor (SM)
Figure 4. NVIDIA Ampere GA10x Architecture Power Efficiency
Figure 5. GeForce RTX 3080 vs GeForce RTX 2080 Super RT Performance
Figure 6. Second-Generation RT Core in GA10x GPUs
Figure 7. Turing RTX Technology Improves Performance
Figure 8. Ampere Architecture RTX Technology Further Improves Performance
Figure 9. Ampere Architecture Motion Blur Hardware Acceleration
Figure 10. Basic Ray Tracing vs Ray Tracing with Motion Blur
Figure 11. Rendering Without vs With Motion Blur on GA10x
Figure 12. Ampere Architecture Tensor Core vs Turing Tensor Core
Figure 13. Fine-Grained Structured Sparsity
Figure 14. Watch Dogs: Legion with 8K DLSS compared to 4K and 1080p resolution
Figure 15. Built for 8K Gaming
Figure 16. GDDR6X Improved Performance and Efficiency using PAM4 Signaling
Figure 17. GDDR6X New Signaling, New Coding, New Algorithms
Figure 18. Games Bottlenecked by Traditional I/O
Figure 19. Compressed Data Needed, but CPU Cannot Keep Up
Figure 20. RTX IO Delivers 100X Throughput, 20X Lower CPU Utilization
Figure 21. Level Load Time Comparison
Figure 22. NVIDIA A40 data center GPU for visual computing
Figure 23. Video Decode and Encode Formats Supported on GA10x GPUs
Figure 24. GA104 Full GPU with 48 SMs
Figure 25. Old Overclocking Method vs Overclocking with EDR
Figure 26. SPECviewperf® 2020 Performance - RTX A6000 vs RTX 6000
Figure 27. Rendering Performance - RTX A6000 vs RTX 6000
Figure 28. HPC Performance - RTX A6000 vs RTX 6000
Figure 29. Deep Learning Performance - RTX A6000 vs RTX 6000

List of Tables

Table 1. Comparative X-Factors for FP32 Throughput
Table 2. GeForce RTX 3080 vs GeForce RTX 2080 / 2080 Super
Table 3. NVIDIA RTX A6000 and NVIDIA A40 Specs
Table 4. Ray Tracing Feature Comparison
Table 5. Comparing RTX A6000 vs RTX 6000 Motion Blur Rendering Time
Table 6. Comparison of NVIDIA Turing vs Ampere Architecture Tensor Core
Table 7. DisplayPort Versions - Spec Comparison
Table 8. HDMI Versions - Spec Comparison
Table 9. Comparison of GeForce RTX 3090 to NVIDIA Titan RTX
Table 10. Comparison of GeForce RTX 3070 to GeForce RTX 2070 Super

Introduction

Since inventing the world’s first GPU (Graphics Processing Unit) in 1999, NVIDIA GPUs have been at the forefront of 3D graphics and GPU-accelerated computing. Each NVIDIA GPU Architecture is carefully designed to provide breakthrough levels of performance and efficiency.

The family of new NVIDIA® Ampere architecture GPUs is designed to accelerate many different types of computationally intensive applications and workloads. The first NVIDIA Ampere architecture GPU, the A100, was released in May 2020 and provides tremendous speedups for AI training and inference, HPC workloads, and data analytics applications. The A100 GPU is described in detail in the NVIDIA A100 GPU Tensor Core Architecture Whitepaper.

The newest members of the NVIDIA Ampere architecture GPU family, GA102 and GA104, are described in this whitepaper. GA102 and GA104 are part of the new NVIDIA “GA10x” class of Ampere architecture GPUs. GA10x GPUs build on the revolutionary NVIDIA Turing™ GPU architecture. Turing was the world’s first GPU architecture to offer high performance real-time ray tracing, AI-accelerated graphics, energy-efficient inference acceleration for the datacenter, and professional graphics rendering all in one product.

GA10x GPUs add many new features and deliver significantly faster performance than Turing GPUs. In addition, GA10x GPUs are carefully crafted to provide the best performance per area and energy efficiency for traditional graphics workloads, and even more so for real-time ray tracing workloads. Compared to the Turing GPU Architecture, the NVIDIA Ampere Architecture is up to 1.7x faster in traditional raster graphics workloads and up to 2x faster in ray tracing.

GA102 is the most powerful Ampere architecture GPU in the GA10x lineup and is used in the GeForce RTX 3090, GeForce RTX 3080, NVIDIA RTX A6000, and the NVIDIA A40 data center GPU. The GeForce RTX 3070 GPU uses the new GA104 GPU.

The GeForce RTX 3090 is the highest performing GPU in the GeForce RTX lineup and has been built for 8K HDR gaming. With 10496 CUDA Cores, 24 GB of GDDR6X memory, and the new DLSS 8K mode enabled, it can run many games at 8K@60 fps. The GeForce RTX 3080 provides up to 2x the performance of the GeForce RTX 2080, delivering the greatest generational leap of any GPU that has ever been made. The GeForce RTX 3070 offers performance that rivals NVIDIA’s previous generation flagship GPU, the GeForce RTX 2080 Ti. New HDMI 2.1 and AV1 decode features in GA10x GPUs allow users to stream content at 8K with HDR.

The NVIDIA® RTX™ A6000 combines 84 second-generation RT Cores, 336 third-generation Tensor Cores, and 10,752 CUDA Cores with 48 GB of fast GDDR6 for accelerated rendering, graphics, AI, and compute performance. Two RTX A6000s can be connected with NVIDIA NVLink® to provide 96 GB of combined GPU memory for handling extremely large rendering, AI, VR, and visual computing workloads. In total, RTX A6000 delivers the key capabilities designers, engineers, and artists need to tackle the most complex workloads from their desktop workstation.

Finally, the NVIDIA A40 GPU is an evolutionary leap in performance and multi-workload capabilities for the data center, combining best-in-class professional graphics with powerful compute and AI acceleration to meet today’s design, creative, and scientific challenges.

With the same core counts and memory size as the RTX A6000, the A40 will power the next generation of virtual workstations and server-based workloads. NVIDIA A40 is up to 2X more power efficient than the previous generation, and it brings state-of-the-art features for ray-traced rendering, simulation, virtual production, and more to professionals.

Figure 1. Ampere GA10x Architecture - A Giant Leap

This document focuses on NVIDIA GA102 GPU-specific architecture, and also on the general NVIDIA GA10x Ampere GPU architecture and features common to all GA10x GPUs. Additional GA10x GPU specifications are included in Appendix A.

GA102 Key Features

Fabricated on Samsung’s 8 nm 8N NVIDIA Custom Process, the NVIDIA Ampere architecture-based GA102 GPU includes 28.3 billion transistors with a die size of 628.4 mm². Like all GeForce RTX GPUs, at the heart of GA102 lies a processor that contains three different types of compute resources:

Programmable Shading Cores, which consist of NVIDIA CUDA Cores

RT Cores, which accelerate Bounding Volume Hierarchy (BVH) traversal and intersection of scene geometry during ray tracing

Tensor Cores, which provide enormous speedups for AI neural network training and inferencing

A full GA102 GPU incorporates 10752 CUDA Cores, 84 second-generation RT Cores, and 336 third-generation Tensor Cores, and is the most powerful consumer GPU NVIDIA has ever built for graphics processing. A GA102 SM doubles the number of FP32 shader operations that can be executed per clock compared to a Turing SM, resulting in 30 TFLOPS for shader processing in GeForce RTX 3080 (11 TFLOPS in the equivalent Turing GPU). Similarly, RT Cores offer double the throughput for ray/triangle intersection testing, resulting in 58 RT TFLOPS (compared to 34 in Turing). Finally, GA102’s new Tensor Cores can process sparse neural networks at twice the rate of Turing Tensor Cores, which do not support sparsity, yielding 238 sparse Tensor TFLOPS in RTX 3080 compared to 89 non-sparse Tensor TFLOPS in RTX 2080 Super.

2x FP32 Processing

Most graphics workloads are composed of 32-bit floating point (FP32) operations. The Streaming Multiprocessor (SM) in the Ampere GA10x GPU Architecture has been designed to support double-speed processing for FP32 operations. In the Turing generation, each of the four SM processing blocks (also called partitions) had two primary datapaths, but only one of the two could process FP32 operations. The other datapath was limited to integer operations. GA10x includes FP32 processing on both datapaths, doubling the peak processing rate for FP32 operations. As a result, GeForce RTX 3090 delivers over 35 FP32 TFLOPS, an improvement of over 2x compared to Turing GPUs.

For the NVIDIA RTX A6000 and NVIDIA A40, 2x FP32 processing provides significant performance improvements for graphics workflows such as 3D model development, and also compute acceleration for workloads such as complex 3D simulation for computer-aided design (CAD) and computer-aided engineering (CAE).

Second-Generation RT Core

The new RT Core includes a number of enhancements, combined with improvements to caching subsystems, that effectively deliver up to 2x performance improvement over the RT Core in Turing GPUs. The new GA10x SM allows RT Core and graphics, or RT Core and compute workloads to run concurrently, significantly accelerating many ray tracing operations.

In addition to ray-traced game rendering benefits, second-generation RT Cores deliver massive speedups for workloads like photorealistic rendering of movie content, architectural design evaluations, and virtual prototyping of product designs. They also speed up rendering of ray-traced motion blur for faster results with greater visual accuracy.

For professionals, a single RTX A6000 board or NVIDIA A40 GPU can render complex models with physically accurate shadows, reflections, and refractions to empower users with instant insight. Working in concert with applications leveraging APIs such as NVIDIA OptiX, Microsoft DXR, and Vulkan ray tracing, systems based on the RTX A6000 and A40 will power truly interactive design workflows to provide immediate feedback for unprecedented levels of productivity. The new second-generation RT Core will be described in more detail later in this document.

Third-Generation Tensor Cores

The GA10x SM incorporates NVIDIA’s new third-generation Tensor Cores, which support many new data types for improved performance, efficiency, and programming flexibility. A new Sparsity feature can take advantage of fine-grained structured sparsity in deep learning networks to double the throughput of Tensor Core operations over the prior generation Turing Tensor Cores. New Tensor Float 32 (TF32) precision provides up to 5X the training throughput over the previous generation to accelerate AI and data science model training without requiring any code changes.
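
As a rough illustration of what fine-grained structured sparsity means, the sketch below prunes a weight row to a 2:4 pattern, that is, at most two non-zero values in every group of four. The 2:4 pattern is the one NVIDIA documents for Ampere Tensor Cores; the function name `prune_2_of_4` and the magnitude-based selection rule are assumptions made for this example, and real networks are pruned and fine-tuned by training tools rather than by a one-shot helper like this.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in each group of four weights.

    Hypothetical helper for illustration only; not an NVIDIA API.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude weights in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse = prune_2_of_4(weights)
print(sparse)
```

Because exactly half of the values in every group are zero, hardware that knows the pattern can skip them, which is where the doubling of Tensor Core throughput comes from.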

The third-generation Tensor Cores accelerate AI denoising, NVIDIA DLSS for AI super resolution (now with support for up to 8K), the NVIDIA Broadcast app for AI-enhanced video and voice communications, and the NVIDIA Canvas app for AI-powered painting.

GDDR6X and GDDR6 Memory

GDDR6X is the newest high-speed graphics memory. It currently supports speeds of 19.5 Gbps on the GeForce RTX 3090, and 19 Gbps on the GeForce RTX 3080. With its 320-bit memory interface and GDDR6X memory, the GeForce RTX 3080 delivers roughly 1.5x the memory bandwidth of its predecessor, the RTX 2080 Super.
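
The bandwidth comparison above can be reproduced with a short calculation. This is a minimal sketch: the function name is made up for the example, and the RTX 2080 Super inputs (256-bit interface, 15.5 Gbps) are taken from Table 2 of this document.

```python
def memory_bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    """Peak bandwidth = bus width in bytes x per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps


rtx_3080 = memory_bandwidth_gb_s(320, 19.0)        # GDDR6X
rtx_2080_super = memory_bandwidth_gb_s(256, 15.5)  # GDDR6
ratio = rtx_3080 / rtx_2080_super

print(f"RTX 3080:       {rtx_3080:.0f} GB/sec")
print(f"RTX 2080 Super: {rtx_2080_super:.0f} GB/sec")
print(f"Ratio:          {ratio:.2f}x")
```

The results agree with the Memory Bandwidth row of Table 2, and the ratio is where the "1.5x" figure comes from.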

The NVIDIA RTX A6000 and NVIDIA A40 GPUs both use 48 GB of high-speed GDDR6 memory, scalable up to 96 GB using two identical GPUs connected with NVLink, enabling creative professionals, engineers, and data scientists to work with massive datasets and accelerate latency-sensitive professional applications.

Third-Generation NVLink®

GA102 GPUs utilize NVIDIA’s third-generation NVLink interface, which includes four x4 links, with each link providing 14.0625 GB/sec bandwidth in each direction between two GPUs. Four links provide 56.25 GB/sec bandwidth in each direction, and 112.5 GB/sec total bandwidth between two GPUs. Two NVIDIA A40 or two NVIDIA RTX A6000 GPUs can be connected with NVLink to scale from 48 GB of GPU memory to 96 GB. Increased GPU-to-GPU interconnect bandwidth provides a single scalable memory to accelerate graphics and compute workloads and tackle larger datasets. A new, more compact NVLink connector enables functionality in a wider range of servers. In addition, two RTX 3090 GPUs can be connected together for SLI using NVLink. (Note that 3-Way and 4-Way SLI configurations are not supported.)
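
The NVLink figures quoted above follow from simple multiplication, sketched here. The constant and function names are illustrative only; the numbers are the ones given in the text.

```python
LINKS = 4                 # four x4 NVLink links per GA102 GPU
PER_LINK_GB_S = 14.0625   # GB/sec per link, per direction
GPU_MEMORY_GB = 48        # RTX A6000 / A40 frame buffer


def nvlink_bandwidth(links=LINKS, per_link=PER_LINK_GB_S):
    """Return (per-direction, bidirectional) bandwidth in GB/sec."""
    per_direction = links * per_link
    return per_direction, 2 * per_direction


per_direction, total = nvlink_bandwidth()
combined_memory_gb = 2 * GPU_MEMORY_GB

print(f"Per direction:   {per_direction} GB/sec")
print(f"Bidirectional:   {total} GB/sec")
print(f"Combined memory: {combined_memory_gb} GB")
```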

PCIe Gen 4

GA10x GPUs feature a PCI Express 4.0 host interface. PCIe Gen 4 provides double the bandwidth of PCIe 3.0, up to 16 Gigatransfers/second bit rate, with a x16 PCIe 4.0 slot providing up to 64 GB/sec of peak bandwidth. PCIe Gen 4 improves data-transfer speeds from CPU memory in systems that support Gen 4 for data-intensive tasks like AI, data science, and 3D design. Faster PCIe performance also accelerates GPU direct memory access (DMA) transfers, providing faster transfer of video data between the GPU and NVIDIA GPUDirect® for Video-enabled devices, delivering a powerful solution for live broadcast.
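
The sketch below shows one way to arrive at the PCIe numbers above. It assumes that the 64 GB/sec peak counts both directions of a x16 link before encoding overhead; that reading, the function name, and the 128b/130b line encoding (which comes from the PCI Express specification, not from this whitepaper) are assumptions of the example.

```python
def pcie_gb_s(gigatransfers_per_s, lanes, payload_bits=128, encoded_bits=130):
    """Return (raw, usable) GB/sec in one direction for a PCIe link.

    One transfer moves one bit per lane; 128b/130b encoding carries
    128 payload bits in every 130 bits sent.
    """
    raw = gigatransfers_per_s * lanes / 8
    usable = raw * payload_bits / encoded_bits
    return raw, usable


raw_gen4, usable_gen4 = pcie_gb_s(16, 16)  # PCIe 4.0 x16
raw_gen3, usable_gen3 = pcie_gb_s(8, 16)   # PCIe 3.0 x16

print(f"Gen 4 x16, one direction:   {raw_gen4:.0f} GB/sec raw, {usable_gen4:.1f} GB/sec after encoding")
print(f"Gen 4 x16, both directions: {2 * raw_gen4:.0f} GB/sec raw")
print(f"Gen 4 vs Gen 3:             {raw_gen4 / raw_gen3:.0f}x")
```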

Ampere GPU Architecture In-Depth

GPC, TPC, and SM High-Level Architecture

Like prior NVIDIA GPUs, GA102 is composed of Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), Raster Operators (ROPs), and memory controllers. The full GA102 GPU contains seven GPCs, 42 TPCs, and 84 SMs.

The GPC is the dominant high-level hardware block with all of the key graphics processing units residing inside the GPC. Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs and is described in more detail below. The GPC includes six TPCs that each include two SMs and one PolyMorph Engine.

Note: The GA102 GPU also features 168 FP64 units (two per SM), which are not depicted in this diagram. The FP64 TFLOP rate is 1/64th the TFLOP rate of FP32 operations. The small number of FP64 hardware units are included to ensure any programs with FP64 code operate correctly, including FP64 Tensor Core code.

Figure 2. GA102 Full GPU with 84 SMs

Each SM in GA10x GPUs contains 128 CUDA Cores, four third-generation Tensor Cores, a 256 KB Register File, four Texture Units, one second-generation Ray Tracing Core, and 128 KB of L1/Shared Memory, which can be configured for differing capacities depending on the needs of the compute or graphics workloads.

The memory subsystem of GA102 consists of twelve 32-bit memory controllers (384-bit total). 512 KB of L2 cache is paired with each 32-bit memory controller, for a total of 6144 KB on the full GA102 GPU.
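
The per-unit counts in this section multiply out to the full-chip totals quoted throughout the document. The sketch below makes that arithmetic explicit; the class and field names are invented for the example, while every number comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FullGA102:
    """Per-unit counts quoted in the text for the full GA102 GPU."""
    gpcs: int = 7
    tpcs_per_gpc: int = 6
    sms_per_tpc: int = 2
    cuda_cores_per_sm: int = 128
    tensor_cores_per_sm: int = 4
    rt_cores_per_sm: int = 1
    texture_units_per_sm: int = 4
    fp64_units_per_sm: int = 2
    register_file_kb_per_sm: int = 256
    l1_shared_kb_per_sm: int = 128
    memory_controllers: int = 12
    controller_width_bits: int = 32
    l2_kb_per_controller: int = 512

    def totals(self) -> dict:
        tpcs = self.gpcs * self.tpcs_per_gpc
        sms = tpcs * self.sms_per_tpc
        return {
            "tpcs": tpcs,
            "sms": sms,
            "cuda_cores": sms * self.cuda_cores_per_sm,
            "tensor_cores": sms * self.tensor_cores_per_sm,
            "rt_cores": sms * self.rt_cores_per_sm,
            "texture_units": sms * self.texture_units_per_sm,
            "fp64_units": sms * self.fp64_units_per_sm,
            "register_file_kb": sms * self.register_file_kb_per_sm,
            "l1_shared_kb": sms * self.l1_shared_kb_per_sm,
            "memory_bus_bits": self.memory_controllers * self.controller_width_bits,
            "l2_kb": self.memory_controllers * self.l2_kb_per_controller,
        }


totals = FullGA102().totals()
for name, value in totals.items():
    print(f"{name}: {value}")
```

Products such as the GeForce RTX 3080 enable fewer SMs and memory controllers, so their totals scale down from these full-chip values.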

ROP Optimizations

In previous NVIDIA GPUs, the ROPs were tied to the memory controller and L2 cache. Beginning with GA10x GPUs, the ROPs are now part of the GPC, which boosts performance of raster operations by increasing the total number of ROPs, and eliminating throughput mismatches between the scan conversion frontend and raster operations backend.

With seven GPCs and 16 ROP units per GPC, the full GA102 GPU consists of 112 ROPs instead of the 96 ROPs that were previously available in a 384-bit memory interface GPU like the prior generation TU102. This improves multisample anti-aliasing, pixel fill rate, and blending performance.
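
Since ROPs now scale with the GPC count, both the ROP total and the peak pixel fill rate can be derived from a handful of inputs. The sketch assumes one pixel per ROP per clock, which is an assumption of this example (although it reproduces the fill rates listed in Tables 2 and 3); the function names are illustrative, and the GPC counts and boost clocks come from those tables.

```python
ROPS_PER_GPC = 16  # two ROP partitions of eight ROP units each


def rop_count(gpcs):
    return gpcs * ROPS_PER_GPC


def pixel_fill_gpixels_s(rops, boost_mhz):
    """Peak fill rate, assuming one pixel per ROP per clock."""
    return rops * boost_mhz / 1000


cards = {
    # name: (enabled GPCs, GPU boost clock in MHz)
    "GeForce RTX 3080": (6, 1710),
    "NVIDIA RTX A6000": (7, 1800),
    "NVIDIA A40": (7, 1740),
}

fill_rates = {}
for name, (gpcs, boost_mhz) in cards.items():
    rops = rop_count(gpcs)
    fill_rates[name] = round(pixel_fill_gpixels_s(rops, boost_mhz), 1)
    print(f"{name}: {rops} ROPs, {fill_rates[name]} Gigapixels/sec")
```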

GA10x SM Architecture

The Turing SM was NVIDIA’s first SM architecture to include dedicated cores for Ray Tracing operations. Volta GPUs introduced Tensor Cores, and Turing included enhanced second-generation Tensor Cores. Another innovation supported by the Turing and Volta SMs was concurrent execution of FP32 and INT32 operations. The GA10x SM improves upon all the above capabilities, while also adding many powerful new features.

Like prior GPUs, the GA10x SM is partitioned into four processing blocks (or partitions), each with a 64 KB register file, an L0 instruction cache, one warp scheduler, one dispatch unit, and sets of math and other units. The four partitions share a combined 128 KB L1 data cache/shared memory subsystem.

Unlike the TU102 SM, which includes two second-generation Tensor Cores per partition and eight Tensor Cores total, the new GA10x SM includes one third-generation Tensor Core per partition and four Tensor Cores total, with each GA10x Tensor Core being twice as powerful as a Turing Tensor Core.
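
To see how halving the Tensor Core count per SM squares with the throughput figures in Table 2, the sketch below works through the arithmetic. The per-core rates (64 FP16 FMA per clock for a Turing Tensor Core, 128 dense and 256 sparse for a GA10x Tensor Core) are not stated in this section; they are assumptions taken from NVIDIA's published Tensor Core comparison, and the function name is illustrative.

```python
# Assumed FP16 FMA operations per Tensor Core per clock (not from this section).
TURING_FMA_PER_CORE = 64
GA10X_DENSE_FMA_PER_CORE = 128
GA10X_SPARSE_FMA_PER_CORE = 256


def fp16_tensor_tflops(tensor_cores, fma_per_core, boost_mhz):
    """Peak TFLOPS, counting each FMA as two floating-point operations."""
    return tensor_cores * fma_per_core * 2 * boost_mhz * 1e6 / 1e12


# Per-SM dense rate is unchanged: 8 x 64 on Turing, 4 x 128 on GA10x.
turing_per_sm = 8 * TURING_FMA_PER_CORE
ga10x_per_sm = 4 * GA10X_DENSE_FMA_PER_CORE

rtx_2080_super = fp16_tensor_tflops(384, TURING_FMA_PER_CORE, 1815)
rtx_3080_dense = fp16_tensor_tflops(272, GA10X_DENSE_FMA_PER_CORE, 1710)
rtx_3080_sparse = fp16_tensor_tflops(272, GA10X_SPARSE_FMA_PER_CORE, 1710)

print(f"FMA/clock per SM: Turing {turing_per_sm}, GA10x {ga10x_per_sm}")
print(f"RTX 2080 Super: {rtx_2080_super:.1f} TFLOPS")
print(f"RTX 3080:       {rtx_3080_dense:.1f} dense / {rtx_3080_sparse:.1f} sparse TFLOPS")
```

Under these assumptions the dense per-SM rate stays the same between generations, so the higher dense totals on GA102 come from the larger SM count, and the sparsity feature provides the additional factor of two.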

Compared to Turing, the GA10x SM’s combined L1 data cache and shared memory capacity is 33% larger. For graphics workloads, the cache partition capacity is doubled compared to Turing, from 32 KB to 64 KB.

Figure 3. GA10x Streaming Multiprocessor (SM)

2x FP32 Throughput

In the Turing generation, each of the four SM processing blocks (also called partitions) had two primary datapaths, but only one of the two could process FP32 operations. The other datapath was limited to integer operations. GA10x includes FP32 processing on both datapaths, doubling the peak processing rate for FP32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores, and is capable of executing either 16 FP32 operations OR 16 INT32 operations per clock. As a result of this new design, each GA10x SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.
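
The datapath description above can be written down as a small model. This is a sketch only: the names are made up, the peak calculation counts each fused multiply-add as two floating-point operations (the usual convention, assumed here), and the RTX 3080 inputs (68 SMs, 1710 MHz boost clock) come from Table 2.

```python
PARTITIONS_PER_SM = 4
FP32_ONLY_LANES = 16   # datapath that executes FP32 only
SHARED_LANES = 16      # datapath that executes FP32 or INT32 in a given clock


def sm_ops_per_clock(int32_on_shared_path):
    """Return (FP32, INT32) operations per clock for one GA10x SM."""
    if int32_on_shared_path:
        fp32, int32 = FP32_ONLY_LANES, SHARED_LANES
    else:
        fp32, int32 = FP32_ONLY_LANES + SHARED_LANES, 0
    return PARTITIONS_PER_SM * fp32, PARTITIONS_PER_SM * int32


def peak_fp32_tflops(sms, boost_mhz):
    """Peak FP32 rate with both datapaths issuing FMAs (2 FLOPs each)."""
    fp32_per_clock, _ = sm_ops_per_clock(int32_on_shared_path=False)
    return sms * fp32_per_clock * 2 * boost_mhz * 1e6 / 1e12


print(sm_ops_per_clock(int32_on_shared_path=False))
print(sm_ops_per_clock(int32_on_shared_path=True))
print(f"RTX 3080 peak: {peak_fp32_tflops(68, 1710):.1f} FP32 TFLOPS")
```

The peak is reached only when the shared datapath has no integer work to do, which is why real gains depend on the instruction mix.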

Modern gaming workloads have a wide range of processing needs. Many workloads have a mix of FP32 arithmetic instructions (such as FFMA, floating point additions (FADD), or floating point multiplications (FMUL)), along with many simpler integer instructions such as adds for addressing and fetching data, floating point compare, or min/max for processing results, etc.

Turing introduced a second math datapath to the SM, which provided significant performance benefits for these types of workloads. However, other workloads can be dominated by floating point instructions. Adding floating point capability to the second datapath will significantly help these workloads. Performance gains will vary at the shader and application level depending on the mix of instructions. Ray tracing denoising shaders are a good example of a workload that can benefit greatly from doubling FP32 throughput.

The GA10x SM continues to support double-speed FP16 (HFMA) operations which are supported in Turing. And similar to TU102, TU104, and TU106 Turing GPUs, standard FP16 operations are handled by the Tensor Cores in GA10x GPUs.

Table 1. Comparative X-Factors for FP32 Throughput

(Relative to FP32 operations in the Pascal GP102 GPU used in GeForce GTX 1080 Ti)

| | Turing | GA10x |
|---|---|---|
| FP32 | 1X | 2X |
| FP16 | 2X | 2X |

Larger and Faster Unified Shared Memory and L1 Data Cache

As we mentioned previously, like the prior generation Turing architecture, GA10x features a unified architecture for shared memory, L1 data cache, and texture caching. This unified design can be reconfigured depending on workload to allocate more memory for the L1 or shared memory depending on need. The L1 data cache capacity has increased to 128 KB per SM.

In compute mode, the GA10x SM will support the following configurations:

128 KB L1 + 0 KB Shared Memory

120 KB L1 + 8 KB Shared Memory

112 KB L1 + 16 KB Shared Memory

96 KB L1 + 32 KB Shared Memory

64 KB L1 + 64 KB Shared Memory

28 KB L1 + 100 KB Shared Memory
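
The configurations listed above all carve up the same 128 KB per-SM pool. The sketch below records them as data and picks the split that leaves the most L1 cache while still satisfying a requested amount of shared memory. The helper `pick_config` and its selection rule are hypothetical and only illustrate the trade-off. In CUDA, the preferred split is typically requested as a hint (for example through `cudaFuncSetAttribute` with `cudaFuncAttributePreferredSharedMemoryCarveout`), and the driver makes the final choice.

```python
SM_POOL_KB = 128

# (L1 KB, shared memory KB) pairs listed for compute mode.
COMPUTE_CONFIGS = [(128, 0), (120, 8), (112, 16), (96, 32), (64, 64), (28, 100)]


def pick_config(shared_kb_needed):
    """Return the listed split with the least shared memory that still fits."""
    candidates = [cfg for cfg in COMPUTE_CONFIGS if cfg[1] >= shared_kb_needed]
    if not candidates:
        raise ValueError(f"no configuration offers {shared_kb_needed} KB of shared memory")
    return min(candidates, key=lambda cfg: cfg[1])


# Every configuration uses the whole 128 KB pool.
all_fill_pool = all(l1 + shared == SM_POOL_KB for l1, shared in COMPUTE_CONFIGS)

print(all_fill_pool)
print(pick_config(20))   # a kernel needing 20 KB of shared memory per SM
print(pick_config(100))  # the largest shared memory allocation listed
```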

For graphics workloads and async compute, GA10x will allocate 64 KB L1 data / texture cache (increasing from 32 KB cache allocation on Turing), 48 KB Shared Memory, and 16 KB reserved for various graphics pipeline operations.

The full GA102 GPU contains 10752 KB of L1 cache (compared to 6912 KB in TU102). In addition to increasing the size of the L1, GA10x also features double the shared memory bandwidth compared to Turing (128 bytes/clock per SM versus 64 bytes/clock in Turing). Total L1 bandwidth for GeForce RTX 3080 is 219 GB/sec versus 116 GB/sec for GeForce RTX 2080 Super.
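
The L1 bandwidth figures quoted above match bytes per clock per SM multiplied by the GPU boost clock, so they appear to be per-SM values. That reading is an inference made for this sketch, not something the text states; the boost clocks and SM count come from Table 2, and the function name is illustrative.

```python
def l1_bandwidth_gb_s_per_sm(bytes_per_clock, boost_mhz):
    """Bandwidth of one SM's L1/shared memory at the GPU boost clock."""
    return bytes_per_clock * boost_mhz * 1e6 / 1e9


rtx_3080_per_sm = l1_bandwidth_gb_s_per_sm(128, 1710)
rtx_2080_super_per_sm = l1_bandwidth_gb_s_per_sm(64, 1815)

print(f"RTX 3080:       {rtx_3080_per_sm:.0f} GB/sec per SM")
print(f"RTX 2080 Super: {rtx_2080_super_per_sm:.0f} GB/sec per SM")

# Summed over the 68 SMs enabled on RTX 3080, under the same assumption.
print(f"RTX 3080, all SMs: {68 * rtx_3080_per_sm / 1000:.1f} TB/sec")
```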

Table 2. GeForce RTX 3080 vs GeForce RTX 2080 / 2080 Super

| Graphics Card | GeForce RTX 2080 Founders Edition | GeForce RTX 2080 Super Founders Edition | GeForce RTX 3080 10 GB Founders Edition |
|---|---|---|---|
| GPU Codename | TU104 | TU104 | GA102 |
| GPU Architecture | NVIDIA Turing | NVIDIA Turing | NVIDIA Ampere |
| GPCs | 6 | 6 | 6 |
| TPCs | 23 | 24 | 34 |
| SMs | 46 | 48 | 68 |
| CUDA Cores / SM | 64 | 64 | 128 |
| CUDA Cores / GPU | 2944 | 3072 | 8704 |
| Tensor Cores / SM | 8 (2nd Gen) | 8 (2nd Gen) | 4 (3rd Gen) |
| Tensor Cores / GPU | 368 (2nd Gen) | 384 (2nd Gen) | 272 (3rd Gen) |
| RT Cores | 46 (1st Gen) | 48 (1st Gen) | 68 (2nd Gen) |
| GPU Boost Clock (MHz) | 1800 | 1815 | 1710 |
| Peak FP32 TFLOPS (non-Tensor)¹ | 10.6 | 11.2 | 29.8 |
| Peak FP16 TFLOPS (non-Tensor)¹ | 21.2 | 22.3 | 29.8 |
| Peak BF16 TFLOPS (non-Tensor)¹ | NA | NA | 29.8 |
| Peak INT32 TOPS (non-Tensor)¹,³ | 10.6 | 11.2 | 14.9 |
| Peak FP16 Tensor TFLOPS with FP16 Accumulate¹ | 84.8 | 89.2 | 119/238² |
| Peak FP16 Tensor TFLOPS with FP32 Accumulate¹ | 42.4 | 44.6 | 59.5/119² |
| Peak BF16 Tensor TFLOPS with FP32 Accumulate¹ | NA | NA | 59.5/119² |
| Peak TF32 Tensor TFLOPS¹ | NA | NA | 29.8/59.5² |
| Peak INT8 Tensor TOPS¹ | 169.6 | 178.4 | 238/476² |
| Peak INT4 Tensor TOPS¹ | 339.1 | 356.8 | 476/952² |
| Frame Buffer Memory Size and Type | 8192 MB GDDR6 | 8192 MB GDDR6 | 10240 MB GDDR6X |
| Memory Interface | 256-bit | 256-bit | 320-bit |
| Memory Clock (Data Rate) | 14 Gbps | 15.5 Gbps | 19 Gbps |
| Memory Bandwidth | 448 GB/sec | 496 GB/sec | 760 GB/sec |
| ROPs | 64 | 64 | 96 |
| Pixel Fill-rate (Gigapixels/sec) | 115.2 | 116.2 | 164.2 |
| Texture Units | 184 | 192 | 272 |
| Texel Fill-rate (Gigatexels/sec) | 331.2 | 348.5 | 465 |
| L1 Data Cache/Shared Memory | 4416 KB | 4608 KB | 8704 KB |
| L2 Cache Size | 4096 KB | 4096 KB | 5120 KB |
| Register File Size | 11776 KB | 12288 KB | 17408 KB |
| TGP (Total Graphics Power) | 225 W | 250 W | 320 W |
| Transistor Count | 13.6 Billion | 13.6 Billion | 28.3 Billion |
| Die Size | 545 mm² | 545 mm² | 628.4 mm² |
| Manufacturing Process | TSMC 12 nm FFN (FinFET NVIDIA) | TSMC 12 nm FFN (FinFET NVIDIA) | Samsung 8 nm 8N NVIDIA Custom Process |

1. Peak rates are based on GPU Boost Clock.
2. Effective TOPS / TFLOPS using the new Sparsity Feature.
3. TOPS = IMAD-based integer math.

Table 3. NVIDIA RTX A6000 and NVIDIA A40 Specs

| Graphics Card | NVIDIA RTX A6000 | NVIDIA A40 |
|---|---|---|
| GPU Codename | GA102 | GA102 |
| GPU Architecture | NVIDIA Ampere | NVIDIA Ampere |
| GPCs | 7 | 7 |
| TPCs | 42 | 42 |
| SMs | 84 | 84 |
| CUDA Cores / SM | 128 | 128 |
| CUDA Cores / GPU | 10752 | 10752 |
| Tensor Cores / SM | 4 (3rd Gen) | 4 (3rd Gen) |
| Tensor Cores / GPU | 336 (3rd Gen) | 336 (3rd Gen) |
| RT Cores | 84 (2nd Gen) | 84 (2nd Gen) |
| GPU Boost Clock (MHz) | 1800 | 1740 |
| Peak FP32 TFLOPS (non-Tensor)¹ | 38.7 | 37.4 |
| Peak FP16 TFLOPS (non-Tensor)¹ | 38.7 | 37.4 |
| Peak BF16 TFLOPS (non-Tensor)¹ | 38.7 | 37.4 |
| Peak INT32 TOPS (non-Tensor)¹,³ | 19.4 | 18.7 |
| Peak FP16 Tensor TFLOPS with FP16 Accumulate¹ | 154.8/309.6² | 149.7/299.4² |
| Peak FP16 Tensor TFLOPS with FP32 Accumulate¹ | 154.8/309.6² | 149.7/299.4² |
| Peak BF16 Tensor TFLOPS with FP32 Accumulate¹ | 154.8/309.6² | 149.7/299.4² |
| Peak TF32 Tensor TFLOPS¹ | 77.4/154.8² | 74.8/149.6² |
| Peak INT8 Tensor TOPS¹ | 309.7/619.4² | 299.3/598.6² |
| Peak INT4 Tensor TOPS¹ | 619.3/1238.6² | 598.7/1197.4² |
| Frame Buffer Memory Size and Type | 49152 MB GDDR6 | 49152 MB GDDR6 |
| Memory Interface | 384-bit | 384-bit |
| Memory Clock (Data Rate) | 16 Gbps | 14.5 Gbps |
| Memory Bandwidth | 768 GB/sec | 696 GB/sec |
| ROPs | 112 | 112 |
| Pixel Fill-rate (Gigapixels/sec) | 201.6 | 194.9 |
| Texture Units | 336 | 336 |
| Texel Fill-rate (Gigatexels/sec) | 604.8 | 584.6 |
| L1 Data Cache/Shared Memory | 10752 KB | 10752 KB |
| L2 Cache Size | 6144 KB | 6144 KB |
| Register File Size | 21504 KB | 21504 KB |
| TGP (Total Graphics Power) | 300 W | 300 W |
| Transistor Count | 28.3 Billion | 28.3 Billion |
| Die Size | 628.4 mm² | 628.4 mm² |
| Manufacturing Process | Samsung 8 nm 8N NVIDIA Custom Process | Samsung 8 nm 8N NVIDIA Custom Process |

1. Peak rates are based on GPU Boost Clock.
2. Effective TOPS / TFLOPS using the new Sparsity Feature.
3. TOPS = IMAD-based integer math.

NOTE: Refer to Appendix C for RTX A6000 performance data.
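
Several of the peak rates in Table 3 can be cross-checked from the unit counts and clocks in the same table. The sketch below does so for the RTX A6000 and A40. The function name and the formulas are assumptions of this example rather than NVIDIA's published method: each FP32 FMA and each integer IMAD is counted as two operations, 64 INT32 lanes are assumed per SM (16 per partition, as described in the 2x FP32 Throughput section), and each Texture Unit is assumed to produce one texel per clock.

```python
def derived_specs(sms, boost_mhz, bus_width_bits, data_rate_gbps):
    """Recompute selected Table 3 rows from basic unit counts and clocks."""
    ghz = boost_mhz / 1000
    return {
        "fp32_tflops": round(sms * 128 * 2 * ghz / 1000, 1),
        "int32_tops": round(sms * 64 * 2 * ghz / 1000, 1),
        "texel_fill_gtexels_s": round(sms * 4 * ghz, 1),
        "memory_bandwidth_gb_s": round(bus_width_bits / 8 * data_rate_gbps, 1),
    }


rtx_a6000 = derived_specs(sms=84, boost_mhz=1800, bus_width_bits=384, data_rate_gbps=16.0)
a40 = derived_specs(sms=84, boost_mhz=1740, bus_width_bits=384, data_rate_gbps=14.5)

print("RTX A6000:", rtx_a6000)
print("A40:      ", a40)
```

The two boards share the same GA102 configuration, so every difference between their columns traces back to the boost clock or the memory data rate.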

Performance Per Watt

The entire NVIDIA Ampere GPU architecture is crafted for efficiency, from custom process design, to circuit design, logic design, packaging, memory, power, and thermal design, down to the PCB design, the software, and algorithms. At the same performance level, Ampere architecture GPUs are up to 1.9x more power efficient than Turing GPUs.

RTX 3080 Power Efficiency Compared to Turing Architecture GeForce RTX 2080 Super

Figure 4. NVIDIA Ampere GA10x Architecture Power Efficiency

Second-Generation Ray Tracing Engine in GA10x GPUs

Turing-based GeForce RTX GPUs were the first GPUs to make real-time, cinema-quality ray traced graphics a reality in PC games. Prior to the arrival of Turing, rendering high-quality ray traced scenes in real time with fluid frame rates was thought to be years away. Thanks to many Turing architectural advancements (such as dedicated RT Cores, Tensor Cores, and software advances in de
