NVIDIA AMPERE GA102 GPU ARCHITECTURE
Second-Generation RTX
Updated with NVIDIA RTX A6000 and NVIDIA A40 Information V2.0
NVIDIA Ampere GA102 GPU Architecture
Table of Contents
Introduction
GA102 Key Features
2x FP32 Processing
Second-Generation RT Core
Third-Generation Tensor Cores
GDDR6X and GDDR6 Memory
Third-Generation NVLink®
PCIe Gen 4
Ampere GPU Architecture In-Depth
GPC, TPC, and SM High-Level Architecture
ROP Optimizations
GA10x SM Architecture
2x FP32 Throughput
Larger and Faster Unified Shared Memory and L1 Data Cache
Performance Per Watt
Second-Generation Ray Tracing Engine in GA10x GPUs
Ampere Architecture RTX Processors in Action
GA10x GPU Hardware Acceleration for Ray-Traced Motion Blur
Third-Generation Tensor Cores in GA10x GPUs
Comparison of Turing vs GA10x GPU Tensor Cores
NVIDIA Ampere Architecture Tensor Cores Support New DL Data Types
Fine-Grained Structured Sparsity
NVIDIA DLSS 8K
GDDR6X Memory
RTX IO
Introducing NVIDIA RTX IO
How NVIDIA RTX IO Works
Display and Video Engine
DisplayPort 1.4a with DSC 1.2a
HDMI 2.1 with DSC 1.2a
Fifth Generation NVDEC - Hardware-Accelerated Video Decoding
AV1 Hardware Decode
Seventh Generation NVENC - Hardware-Accelerated Video Encoding
Conclusion
Appendix A - Additional GeForce GA10x GPU Specifications
GeForce RTX 3090
GeForce RTX 3070
Appendix B - New Memory Error Detection and Replay (EDR) Technology
Appendix C - RTX A6000 GPU Performance
List of Figures
Figure 1. Ampere GA10x Architecture - A Giant Leap
Figure 2. GA102 Full GPU with 84 SMs
Figure 3. GA10x Streaming Multiprocessor (SM)
Figure 4. NVIDIA Ampere GA10x Architecture Power Efficiency
Figure 5. GeForce RTX 3080 vs GeForce RTX 2080 Super RT Performance
Figure 6. Second-Generation RT Core in GA10x GPUs
Figure 7. Turing RTX Technology Improves Performance
Figure 8. Ampere Architecture RTX Technology Further Improves Performance
Figure 9. Ampere Architecture Motion Blur Hardware Acceleration
Figure 10. Basic Ray Tracing vs Ray Tracing with Motion Blur
Figure 11. Rendering Without vs With Motion Blur on GA10x
Figure 12. Ampere Architecture Tensor Core vs Turing Tensor Core
Figure 13. Fine-Grained Structured Sparsity
Figure 14. Watch Dogs: Legion with 8K DLSS compared to 4K and 1080p resolution.
Figure 15. Built for 8K Gaming
Figure 16. GDDR6X Improved Performance and Efficiency using PAM4 Signaling
Figure 17. GDDR6X New Signaling, New Coding, New Algorithms
Figure 18. Games Bottlenecked by Traditional I/O
Figure 19. Compressed Data Needed, but CPU Cannot Keep Up
Figure 20. RTX IO Delivers 100X Throughput, 20X Lower CPU Utilization
Figure 21. Level Load Time Comparison
Figure 22. NVIDIA A40 datacenter GPU for visual computing
Figure 23. Video Decode and Encode Formats Supported on GA10x GPUs
Figure 24. GA104 Full GPU with 48 SMs
Figure 25. Old Overclocking Method vs Overclocking with EDR
Figure 26. SPECviewperf® 2020 Performance - RTX A6000 vs RTX 6000
Figure 27. Rendering Performance - RTX A6000 vs RTX 6000
Figure 28. HPC Performance - RTX A6000 vs RTX 6000
Figure 29. Deep Learning Performance - RTX A6000 vs RTX 6000
List of Tables
Table 1. Comparative X-Factors for FP32 Throughput
Table 2. GeForce RTX 3080 vs GeForce RTX 2080 / 2080 Super
Table 3. NVIDIA RTX A6000 and NVIDIA A40 Specs
Table 4. Ray Tracing Feature Comparison
Table 5. Comparing RTX A6000 vs RTX 6000 Motion Blur Rendering Time
Table 6. Comparison of NVIDIA Turing vs Ampere Architecture Tensor Core
Table 7. DisplayPort Versions - Spec Comparison
Table 8. HDMI Versions - Spec Comparison
Table 9. Comparison of GeForce RTX 3090 to NVIDIA Titan RTX
Table 10. Comparison of GeForce RTX 3070 to GeForce RTX 2070 Super
Introduction
Since NVIDIA invented the world’s first GPU (Graphics Processing Unit) in 1999, its GPUs have been at the forefront of 3D graphics and GPU-accelerated computing. Each NVIDIA GPU Architecture is carefully designed to provide breakthrough levels of performance and efficiency.
The family of new NVIDIA® Ampere architecture GPUs is designed to accelerate many different types of computationally intensive applications and workloads. The first NVIDIA Ampere architecture GPU, the A100, was released in May 2020 and provides tremendous speedups for AI training and inference, HPC workloads, and data analytics applications. The A100 GPU is described in detail in the NVIDIA A100 GPU Tensor Core Architecture Whitepaper.
The newest members of the NVIDIA Ampere architecture GPU family, GA102 and GA104, are described in this whitepaper. GA102 and GA104 are part of the new NVIDIA “GA10x” class of Ampere architecture GPUs. GA10x GPUs build on the revolutionary NVIDIA Turing™ GPU architecture. Turing was the world’s first GPU architecture to offer high performance real-time ray tracing, AI-accelerated graphics, energy-efficient inference acceleration for the datacenter, and professional graphics rendering all in one product.
GA10x GPUs add many new features and deliver significantly faster performance than Turing GPUs. In addition, GA10x GPUs are carefully crafted to provide the best performance per area and energy efficiency for traditional graphics workloads, and even more so for real-time ray tracing workloads. Compared to the Turing GPU Architecture, the NVIDIA Ampere Architecture is up to 1.7x faster in traditional raster graphics workloads and up to 2x faster in ray tracing.
GA102 is the most powerful Ampere architecture GPU in the GA10x lineup and is used in the GeForce RTX 3090, GeForce RTX 3080, NVIDIA RTX A6000, and the NVIDIA A40 data center GPU. The GeForce RTX 3070 GPU uses the new GA104 GPU.
The GeForce RTX 3090 is the highest performing GPU in the GeForce RTX lineup and has been built for 8K HDR gaming. With 10496 CUDA Cores, 24 GB of GDDR6X memory, and the new DLSS 8K mode enabled, it can run many games at 8K@60 fps. The GeForce RTX 3080 provides up to 2x the performance of the GeForce RTX 2080, delivering the greatest generational leap of any GPU that has ever been made. The GeForce RTX 3070 offers performance that rivals NVIDIA’s previous generation flagship GPU, the GeForce RTX 2080 Ti. New HDMI 2.1 and AV1 decode features in GA10x GPUs allow users to stream content at 8K with HDR.
The NVIDIA® RTX™ A6000 combines 84 second-generation RT Cores, 336 third-generation Tensor Cores, and 10,752 CUDA Cores with 48 GB of fast GDDR6 for accelerated rendering, graphics, AI, and compute performance. Two RTX A6000s can be connected with NVIDIA NVLink® to provide 96 GB of combined GPU memory for handling extremely large rendering, AI, VR, and visual computing workloads. In total, RTX A6000 delivers the key capabilities designers, engineers, and artists need to tackle the most complex workloads from their desktop workstation.
Finally, the NVIDIA A40 GPU is an evolutionary leap in performance and multi-workload capabilities for the data center, combining best-in-class professional graphics with powerful compute and AI acceleration to meet today’s design, creative, and scientific challenges.
With the same core counts and memory size as the RTX A6000, the A40 will power the next generation of virtual workstations and server-based workloads. NVIDIA A40 is up to 2X more power efficient than the previous generation, and it brings state-of-the-art features for ray-traced rendering, simulation, virtual production, and more to professionals.
Figure 1. Ampere GA10x Architecture - A Giant Leap
This document focuses on NVIDIA GA102 GPU-specific architecture, and also general NVIDIA GA10x Ampere GPU architecture and features common to all GA10x GPUs. Additional GA10x GPU specifications are included in Appendix A.
GA102 Key Features
Fabricated on Samsung’s 8 nm 8N NVIDIA Custom Process, the NVIDIA Ampere architecture-based GA102 GPU includes 28.3 billion transistors with a die size of 628.4 mm². Like all GeForce RTX GPUs, at the heart of GA102 lies a processor that contains three different types of compute resources:
- Programmable Shading Cores, which consist of NVIDIA CUDA Cores
- RT Cores, which accelerate Bounding Volume Hierarchy (BVH) traversal and intersection of scene geometry during ray tracing
- Tensor Cores, which provide enormous speedups for AI neural network training and inferencing
A full GA102 GPU incorporates 10752 CUDA Cores, 84 second-generation RT Cores, and 336 third-generation Tensor Cores, and is the most powerful consumer GPU NVIDIA has ever built for graphics processing. A GA102 SM doubles the number of FP32 shader operations that can be executed per clock compared to a Turing SM, resulting in 30 TFLOPS for shader processing in GeForce RTX 3080 (11 TFLOPS in the equivalent Turing GPU). Similarly, RT Cores offer double the throughput for ray/triangle intersection testing, resulting in 58 RT TFLOPS (compared to 34 in Turing). Finally, GA102’s new Tensor Cores can process sparse neural networks at twice the rate of Turing Tensor Cores, which do not support sparsity, yielding 238 sparse Tensor TFLOPS in RTX 3080 compared to 89 non-sparse Tensor TFLOPS in RTX 2080 Super.
2x FP32 Processing
Most graphics workloads are composed of 32-bit floating-point (FP32) operations. The Streaming Multiprocessor (SM) in the Ampere GA10x GPU Architecture has been designed to support double-speed processing for FP32 operations. In the Turing generation, each of the four SM processing blocks (also called partitions) had two primary datapaths, but only one of the two could process FP32 operations. The other datapath was limited to integer operations. GA10x includes FP32 processing on both datapaths, doubling the peak processing rate for FP32 operations. As a result, GeForce RTX 3090 delivers over 35 FP32 TFLOPS, an improvement of over 2x compared to Turing GPUs.
For the NVIDIA RTX A6000 and NVIDIA A40, 2x FP32 processing provides significant performance improvements for graphics workflows such as 3D model development, and also compute acceleration for workloads such as complex 3D simulation for computer-aided design (CAD) and computer-aided engineering (CAE).
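As a rough illustration of where the peak FP32 figures come from, the Python sketch below multiplies CUDA core count by boost clock. It assumes that every CUDA core retires one fused multiply-add per clock and that an FMA counts as two floating-point operations; this is the usual convention for peak rates rather than something this section states. The function name `peak_fp32_tflops` is made up for this example, and the core counts and clocks are taken from the specification tables in this document.

```python
def peak_fp32_tflops(cuda_cores: int, boost_clock_mhz: float) -> float:
    """Peak FP32 TFLOPS, assuming one FMA (counted as 2 FLOPs) per core per clock."""
    flops_per_clock = cuda_cores * 2
    return flops_per_clock * boost_clock_mhz * 1e6 / 1e12


# (CUDA cores, GPU boost clock in MHz) as listed in Tables 2 and 3.
CARDS = {
    "GeForce RTX 2080 Super": (3072, 1815),
    "GeForce RTX 3080": (8704, 1710),
    "NVIDIA RTX A6000": (10752, 1800),
    "NVIDIA A40": (10752, 1740),
}

for name, (cores, mhz) in CARDS.items():
    print(f"{name}: {peak_fp32_tflops(cores, mhz):.1f} peak FP32 TFLOPS")
```

Under these assumptions the results round to the peak FP32 values listed in the tables, for example 29.8 TFLOPS for GeForce RTX 3080 and 38.7 TFLOPS for RTX A6000. These are theoretical peaks at boost clock, not measured application performance.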
Second-Generation RT Core
The new RT Core includes a number of enhancements, combined with improvements to caching subsystems, that effectively deliver up to 2x performance improvement over the RT Core in Turing GPUs. The new GA10x SM allows RT Core and graphics, or RT Core and compute workloads to run concurrently, significantly accelerating many ray tracing operations.
In addition to ray-traced game rendering benefits, second-generation RT Cores deliver massive speedups for workloads like photorealistic rendering of movie content, architectural design evaluations, and virtual prototyping of product designs. They also speed up rendering of ray-traced motion blur for faster results with greater visual accuracy.
For professionals, a single RTX A6000 board or NVIDIA A40 GPU can render complex models with physically accurate shadows, reflections, and refractions to empower users with instant insight. Working in concert with applications leveraging APIs such as NVIDIA OptiX, Microsoft DXR, and Vulkan ray tracing, systems based on the RTX A6000 and A40 will power truly interactive design workflows to provide immediate feedback for unprecedented levels of productivity. The new second-generation RT Core will be described in more detail later in this document.
Third-Generation Tensor Cores
The GA10x SM incorporates NVIDIA’s new third-generation Tensor Cores, which support many new data types for improved performance, efficiency, and programming flexibility. A new Sparsity feature can take advantage of fine-grained structured sparsity in deep learning networks to double the throughput of Tensor Core operations over the prior generation Turing Tensor Cores. New Tensor Float 32 (TF32) precision provides up to 5X the training throughput over the previous generation to accelerate AI and data science model training without requiring any code changes.
The third-generation Tensor Cores accelerate AI denoising, NVIDIA DLSS for AI super resolution (now with support for up to 8K), the NVIDIA Broadcast app for AI-enhanced video and voice communications, and the NVIDIA Canvas app for AI-powered painting.
GDDR6X and GDDR6 Memory
GDDR6X is the newest high-speed graphics memory. It currently supports speeds of 19.5 Gbps on the GeForce RTX 3090, and 19 Gbps on the GeForce RTX 3080. With its 320-bit memory interface and GDDR6X memory, the GeForce RTX 3080 delivers about 1.5x the memory bandwidth of its predecessor, the RTX 2080 Super (760 GB/sec versus 496 GB/sec).
The NVIDIA RTX A6000 and NVIDIA A40 GPUs both use 48 GB of high-speed GDDR6 memory, scalable up to 96 GB using two identical GPUs connected with NVLink, enabling creative professionals, engineers, and data scientists to work with massive datasets and accelerate latency-sensitive professional applications.
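The memory bandwidth figures follow directly from interface width and per-pin data rate. The sketch below shows that arithmetic; the function name is illustrative, and the widths and data rates are the ones listed in the specification tables of this document.

```python
def memory_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/sec: pins x per-pin data rate, divided by 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8


rtx_3080 = memory_bandwidth_gb_s(320, 19.0)        # GDDR6X
rtx_2080_super = memory_bandwidth_gb_s(256, 15.5)  # GDDR6
rtx_a6000 = memory_bandwidth_gb_s(384, 16.0)       # GDDR6
a40 = memory_bandwidth_gb_s(384, 14.5)             # GDDR6

print(f"RTX 3080: {rtx_3080:.0f} GB/sec")
print(f"RTX 2080 Super: {rtx_2080_super:.0f} GB/sec")
print(f"RTX 3080 / RTX 2080 Super: {rtx_3080 / rtx_2080_super:.2f}x")
print(f"RTX A6000: {rtx_a6000:.0f} GB/sec, A40: {a40:.0f} GB/sec")
```

The ratio between RTX 3080 and RTX 2080 Super comes out slightly above 1.5, which is consistent with the comparison made in the text.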
Third-Generation NVLink®
GA102 GPUs utilize NVIDIA’s third-generation NVLink interface, which includes four x4 links, with each link providing 14.0625 GB/sec bandwidth in each direction between two GPUs. Four links provide 56.25 GB/sec bandwidth in each direction, and 112.5 GB/sec total bandwidth between two GPUs. Two NVIDIA A40 or two NVIDIA RTX A6000 GPUs can be connected with NVLink to scale from 48 GB of GPU memory to 96 GB. Increased GPU-to-GPU interconnect bandwidth provides a single scalable memory to accelerate graphics and compute workloads and tackle larger datasets. A new, more compact NVLink connector enables functionality in a wider range of servers. In addition, two RTX 3090 GPUs can be connected together for SLI using NVLink. (Note that 3-Way and 4-Way SLI configurations are not supported.)
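The NVLink totals quoted above are simple multiples of the per-link rate. A minimal sketch of that arithmetic, with illustrative constant and function names:

```python
LINKS_PER_GPU = 4                   # four x4 links on GA102
PER_LINK_GB_S_PER_DIRECTION = 14.0625
GPU_MEMORY_GB = 48                  # RTX A6000 / A40


def nvlink_bandwidth_gb_s(links: int = LINKS_PER_GPU) -> tuple:
    """Return (per-direction, bidirectional) bandwidth in GB/sec between two GPUs."""
    per_direction = links * PER_LINK_GB_S_PER_DIRECTION
    return per_direction, 2 * per_direction


per_direction, total = nvlink_bandwidth_gb_s()
print(f"Per direction: {per_direction} GB/sec, total: {total} GB/sec")
print(f"Combined memory of two linked GPUs: {2 * GPU_MEMORY_GB} GB")
```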
PCIe Gen 4
GA10x GPUs feature a PCI Express 4.0 host interface. PCIe Gen 4 provides double the bandwidth of PCIe 3.0, up to 16 Gigatransfers/second bit rate, with a x16 PCIe 4.0 slot providing up to 64 GB/sec of peak bandwidth. PCIe Gen 4 improves data-transfer speeds from CPU memory in systems that support Gen 4 for data-intensive tasks like AI, data science, and 3D design. Faster PCIe performance also accelerates GPU direct memory access (DMA) transfers, providing faster transfer of video data between the GPU and NVIDIA GPUDirect® for Video-enabled devices, delivering a powerful solution for live broadcast.
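One way to arrive at the 64 GB/sec figure is to take the raw signaling rate of a x16 link and count both directions. The sketch below does that and also shows the effect of the 128b/130b line encoding used by PCIe 3.0 and 4.0. Reading the quoted peak as a raw bidirectional number is an interpretation made for this example, and the function name is illustrative.

```python
def pcie_bandwidth_gb_s(gigatransfers_per_s: float, lanes: int,
                        encoding_efficiency: float = 1.0,
                        directions: int = 2) -> float:
    """Link bandwidth in GB/sec; one transfer carries one bit per lane."""
    bits_per_s = gigatransfers_per_s * lanes * encoding_efficiency * directions
    return bits_per_s / 8


gen4_raw = pcie_bandwidth_gb_s(16, 16)                   # raw, both directions
gen3_raw = pcie_bandwidth_gb_s(8, 16)                    # PCIe 3.0 for comparison
gen4_encoded = pcie_bandwidth_gb_s(16, 16, 128 / 130)    # after 128b/130b encoding

print(f"PCIe 4.0 x16 raw: {gen4_raw:.0f} GB/sec (PCIe 3.0 x16: {gen3_raw:.0f} GB/sec)")
print(f"PCIe 4.0 x16 after 128b/130b encoding: {gen4_encoded:.1f} GB/sec")
```

Protocol overhead (packet headers, flow control) reduces achievable throughput further, so real transfers land below both numbers.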
Ampere GPU Architecture In-Depth
GPC, TPC, and SM High-Level Architecture
Like prior NVIDIA GPUs, GA102 is composed of Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), Raster Operators (ROPs), and memory controllers. The full GA102 GPU contains seven GPCs, 42 TPCs, and 84 SMs.
The GPC is the dominant high-level hardware block with all of the key graphics processing units residing inside the GPC. Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs and described in more detail below. The GPC includes six TPCs that each include two SMs and one PolyMorph Engine.
Figure 2. GA102 Full GPU with 84 SMs
Note: The GA102 GPU also features 168 FP64 units (two per SM), which are not depicted in this diagram. The FP64 TFLOP rate is 1/64th the TFLOP rate of FP32 operations. The small number of FP64 hardware units are included to ensure any programs with FP64 code operate correctly, including FP64 Tensor Core code.
Each SM in GA10x GPUs contains 128 CUDA Cores, four third-generation Tensor Cores, a 256 KB Register File, four Texture Units, one second-generation Ray Tracing Core, and 128 KB of L1/Shared Memory, which can be configured for differing capacities depending on the needs of the compute or graphics workloads.
The memory subsystem of GA102 consists of twelve 32-bit memory controllers (384-bit total). 512 KB of L2 cache is paired with each 32-bit memory controller, for a total of 6144 KB on the full GA102 GPU.
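The chip-level totals in this section can be derived from the per-unit figures given in the text. The sketch below does this for the full GA102 and, as a cross-check, for the GeForce RTX 3080 configuration (6 GPCs, 34 TPCs, and a 320-bit interface, that is, ten 32-bit controllers). The function and constant names are illustrative only.

```python
# Per-unit figures quoted in the text for GA10x GPUs.
SMS_PER_TPC = 2
CUDA_CORES_PER_SM = 128
TENSOR_CORES_PER_SM = 4
RT_CORES_PER_SM = 1
TEXTURE_UNITS_PER_SM = 4
FP64_UNITS_PER_SM = 2
REGISTER_FILE_KB_PER_SM = 256
L1_SHARED_KB_PER_SM = 128
ROPS_PER_GPC = 16            # two ROP partitions x eight ROP units
L2_KB_PER_CONTROLLER = 512
CONTROLLER_WIDTH_BITS = 32


def ga10x_totals(gpcs: int, tpcs: int, memory_controllers: int) -> dict:
    """Derive chip-level totals from GPC, TPC, and memory controller counts."""
    sms = tpcs * SMS_PER_TPC
    return {
        "sms": sms,
        "cuda_cores": sms * CUDA_CORES_PER_SM,
        "tensor_cores": sms * TENSOR_CORES_PER_SM,
        "rt_cores": sms * RT_CORES_PER_SM,
        "texture_units": sms * TEXTURE_UNITS_PER_SM,
        "fp64_units": sms * FP64_UNITS_PER_SM,
        "register_file_kb": sms * REGISTER_FILE_KB_PER_SM,
        "l1_shared_kb": sms * L1_SHARED_KB_PER_SM,
        "rops": gpcs * ROPS_PER_GPC,
        "l2_kb": memory_controllers * L2_KB_PER_CONTROLLER,
        "memory_interface_bits": memory_controllers * CONTROLLER_WIDTH_BITS,
    }


full_ga102 = ga10x_totals(gpcs=7, tpcs=42, memory_controllers=12)
rtx_3080 = ga10x_totals(gpcs=6, tpcs=34, memory_controllers=10)

for key in full_ga102:
    print(f"{key}: full GA102 = {full_ga102[key]}, RTX 3080 = {rtx_3080[key]}")
```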
ROP Optimizations
In previous NVIDIA GPUs, the ROPs were tied to the memory controller and L2 cache. Beginning with GA10x GPUs, the ROPs are now part of the GPC, which boosts performance of raster operations by increasing the total number of ROPs, and eliminating throughput mismatches between the scan conversion frontend and raster operations backend.
With seven GPCs and 16 ROP units per GPC, the full GA102 GPU contains 112 ROPs instead of the 96 ROPs that were previously available in a 384-bit memory interface GPU like the prior generation TU102. This improves multisample anti-aliasing, pixel fill rate, and blending performance.
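To make the effect of the extra ROPs concrete, the sketch below computes peak pixel fill rate as ROP count times boost clock. The one-pixel-per-ROP-per-clock model is an assumption of this example; it happens to reproduce the pixel fill rates listed in the specification tables. The function name is illustrative.

```python
def pixel_fill_rate_gpix_s(rops: int, boost_clock_mhz: float) -> float:
    """Peak pixel fill rate in Gigapixels/sec, assuming one pixel per ROP per clock."""
    return rops * boost_clock_mhz / 1000


# ROP counts and boost clocks as listed in Tables 2 and 3.
print(f"RTX A6000 (112 ROPs, 1800 MHz): {pixel_fill_rate_gpix_s(112, 1800):.1f}")
print(f"RTX 3080 (96 ROPs, 1710 MHz): {pixel_fill_rate_gpix_s(96, 1710):.1f}")
print(f"RTX 2080 Super (64 ROPs, 1815 MHz): {pixel_fill_rate_gpix_s(64, 1815):.1f}")

# At equal clocks, moving from 96 to 112 ROPs raises the peak by the ROP ratio.
gain = pixel_fill_rate_gpix_s(112, 1800) / pixel_fill_rate_gpix_s(96, 1800)
print(f"112 vs 96 ROPs at the same clock: {gain:.3f}x")
```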
GA10x SM Architecture
The Turing SM was NVIDIA’s first SM architecture to include dedicated cores for Ray Tracing operations. Volta GPUs introduced Tensor Cores, and Turing included enhanced second-generation Tensor Cores. Another innovation supported by the Turing and Volta SMs was concurrent execution of FP32 and INT32 operations. The GA10x SM improves upon all the above capabilities, while also adding many powerful new features.
Like prior GPUs, the GA10x SM is partitioned into four processing blocks (or partitions), each with a 64 KB register file, an L0 instruction cache, one warp scheduler, one dispatch unit, and sets of math and other units. The four partitions share a combined 128 KB L1 data cache/shared memory subsystem.
Unlike the TU102 SM, which includes two second-generation Tensor Cores per partition and eight Tensor Cores total, the new GA10x SM includes one third-generation Tensor Core per partition and four Tensor Cores total, with each GA10x Tensor Core being twice as powerful as a Turing Tensor Core.
Compared to Turing, the GA10x SM’s combined L1 data cache and shared memory capacity is 33% larger. For graphics workloads, the cache partition capacity is doubled compared to Turing, from 32 KB to 64 KB.
Figure 3. GA10x Streaming Multiprocessor (SM)
2x FP32 Throughput
In the Turing generation, each of the four SM processing blocks (also called partitions) had two primary datapaths, but only one of the two could process FP32 operations. The other datapath was limited to integer operations. GA10x includes FP32 processing on both datapaths, doubling the peak processing rate for FP32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores, and is capable of executing either 16 FP32 operations OR 16 INT32 operations per clock. As a result of this new design, each GA10x SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.
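The datapath description above can be turned into a small throughput model. The sketch below is a deliberately idealized assumption: it ignores instruction issue limits, dependencies, memory stalls, and every other unit in the SM, and only asks how many clocks one partition needs to drain a given mix of FP32 and INT32 operations. The function names are hypothetical.

```python
LANES_PER_DATAPATH = 16
PARTITIONS_PER_SM = 4


def cycles_turing(fp32_ops: float, int32_ops: float) -> float:
    """Turing partition: one FP32-only datapath and one INT32-only datapath."""
    return max(fp32_ops, int32_ops) / LANES_PER_DATAPATH


def cycles_ga10x(fp32_ops: float, int32_ops: float) -> float:
    """GA10x partition: one FP32-only datapath plus one FP32-or-INT32 datapath.

    INT32 work is confined to the shared datapath, and both datapaths together
    retire at most 32 operations per clock, so the slower bound wins.
    """
    int_bound = int32_ops / LANES_PER_DATAPATH
    total_bound = (fp32_ops + int32_ops) / (2 * LANES_PER_DATAPATH)
    return max(int_bound, total_bound)


def ideal_speedup(fp32_ops: float, int32_ops: float) -> float:
    """Idealized GA10x-over-Turing speedup at equal clocks for one partition."""
    return cycles_turing(fp32_ops, int32_ops) / cycles_ga10x(fp32_ops, int32_ops)


print("Peak FP32 ops/clock per GA10x SM:", 2 * LANES_PER_DATAPATH * PARTITIONS_PER_SM)
for fp32, int32 in [(1600, 0), (1600, 400), (1600, 1600)]:
    print(f"FP32={fp32}, INT32={int32}: {ideal_speedup(fp32, int32):.2f}x")
```

In this model a pure FP32 stream sees the full 2x, an even FP32/INT32 mix sees no gain, and mixes in between fall somewhere in that range, which matches the statement that gains depend on the instruction mix.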
Modern gaming workloads have a wide range of processing needs. Many workloads have a mix of FP32 arithmetic instructions (such as FFMA, floating-point additions (FADD), or floating-point multiplications (FMUL)), along with many simpler integer instructions such as adds for addressing and fetching data, floating-point compare, or min/max for processing results, etc.
Turing introduced a second math datapath to the SM, which provided significant performance benefits for these types of workloads. However, other workloads can be dominated by floating-point instructions. Adding floating-point capability to the second datapath will significantly help these workloads. Performance gains will vary at the shader and application level depending on the mix of instructions. Ray tracing denoising shaders are a good example of a workload that can benefit greatly from doubling FP32 throughput.
The GA10x SM continues to support double-speed FP16 (HFMA) operations which are supported in Turing. And similar to TU102, TU104, and TU106 Turing GPUs, standard FP16 operations are handled by the Tensor Cores in GA10x GPUs.
Table 1. Comparative X-Factors for FP32 Throughput
(Relative to FP32 operations in the Pascal GP102 GPU used in GeForce GTX 1080 Ti)
| | Turing | GA10x |
|---|---|---|
| FP32 | 1X | 2X |
| FP16 | 2X | 2X |
Larger and Faster Unified Shared Memory and L1 Data Cache
As we mentioned previously, like the prior generation Turing architecture, GA10x features a unified architecture for shared memory, L1 data cache, and texture caching. This unified design can be reconfigured depending on workload to allocate more memory for the L1 or shared memory depending on need. The L1 data cache capacity has increased to 128 KB per SM.
In compute mode, the GA10x SM will support the following configurations:
- 128 KB L1 + 0 KB Shared Memory
- 120 KB L1 + 8 KB Shared Memory
- 112 KB L1 + 16 KB Shared Memory
- 96 KB L1 + 32 KB Shared Memory
- 64 KB L1 + 64 KB Shared Memory
- 28 KB L1 + 100 KB Shared Memory
For graphics workloads and async compute, GA10x will allocate 64 KB L1 data / texture cache (increasing from 32 KB cache allocation on Turing), 48 KB Shared Memory, and 16 KB reserved for various graphics pipeline operations.
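Every split listed above draws on the same 128 KB per SM. The sketch below encodes the compute-mode options and the graphics split, and includes a hypothetical helper, `compute_split`, that picks the smallest listed configuration able to hold a requested amount of shared memory. That selection policy is an illustrative assumption, not a description of how the driver or runtime actually chooses a configuration.

```python
L1_SHARED_TOTAL_KB = 128  # unified L1 data cache / shared memory per GA10x SM

# Shared-memory sizes (KB) of the compute-mode configurations listed in the text.
COMPUTE_SHARED_KB_OPTIONS = (0, 8, 16, 32, 64, 100)

# Graphics and async-compute split quoted in the text (KB).
GRAPHICS_SPLIT_KB = {"l1_data_texture": 64, "shared_memory": 48, "reserved": 16}


def compute_split(shared_kb_needed: int) -> tuple:
    """Return (l1_kb, shared_kb) for the smallest listed configuration that fits."""
    for shared_kb in COMPUTE_SHARED_KB_OPTIONS:
        if shared_kb >= shared_kb_needed:
            return L1_SHARED_TOTAL_KB - shared_kb, shared_kb
    raise ValueError(f"{shared_kb_needed} KB exceeds the largest listed option")


for needed_kb in (0, 12, 48, 100):
    l1_kb, shared_kb = compute_split(needed_kb)
    print(f"need {needed_kb:>3} KB shared -> {l1_kb} KB L1 + {shared_kb} KB shared")

print("Graphics split total:", sum(GRAPHICS_SPLIT_KB.values()), "KB")
```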
The full GA102 GPU contains 10752 KB of L1 cache (compared to 6912 KB in TU102). In addition to increasing the size of the L1, GA10x also features double the shared memory bandwidth compared to Turing (128 bytes/clock per SM versus 64 bytes/clock in Turing). At boost clock, this works out to 219 GB/sec of L1 bandwidth per SM for GeForce RTX 3080 versus 116 GB/sec per SM for GeForce RTX 2080 Super.
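The 219 GB/sec and 116 GB/sec figures can be reproduced by multiplying bytes per clock per SM by the boost clock, which suggests they describe a single SM rather than the whole GPU. The sketch below shows that calculation and, as an extrapolation that is not stated in the source, the aggregate across all SMs of each card. The function name is illustrative.

```python
def l1_bandwidth_per_sm_gb_s(bytes_per_clock: int, boost_clock_mhz: float) -> float:
    """L1 / shared memory bandwidth of one SM in GB/sec at the given clock."""
    return bytes_per_clock * boost_clock_mhz / 1000


# Bytes per clock from the text; SM counts and boost clocks from Table 2.
rtx_3080_per_sm = l1_bandwidth_per_sm_gb_s(128, 1710)
rtx_2080_super_per_sm = l1_bandwidth_per_sm_gb_s(64, 1815)

print(f"RTX 3080: {rtx_3080_per_sm:.0f} GB/sec per SM")
print(f"RTX 2080 Super: {rtx_2080_super_per_sm:.0f} GB/sec per SM")

# Extrapolated aggregate over all SMs (68 and 48 respectively), in TB/sec.
print(f"RTX 3080 aggregate: {rtx_3080_per_sm * 68 / 1000:.1f} TB/sec")
print(f"RTX 2080 Super aggregate: {rtx_2080_super_per_sm * 48 / 1000:.1f} TB/sec")
```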
Table 2. GeForce RTX 3080 vs GeForce RTX 2080 / 2080 Super
| Graphics Card | GeForce RTX 2080 Founders Edition | GeForce RTX 2080 Super Founders Edition | GeForce RTX 3080 10 GB Founders Edition |
|---|---|---|---|
| GPU Codename | TU104 | TU104 | GA102 |
| GPU Architecture | NVIDIA Turing | NVIDIA Turing | NVIDIA Ampere |
| GPCs | 6 | 6 | 6 |
| TPCs | 23 | 24 | 34 |
| SMs | 46 | 48 | 68 |
| CUDA Cores / SM | 64 | 64 | 128 |
| CUDA Cores / GPU | 2944 | 3072 | 8704 |
| Tensor Cores / SM | 8 (2nd Gen) | 8 (2nd Gen) | 4 (3rd Gen) |
| Tensor Cores / GPU | 368 (2nd Gen) | 384 (2nd Gen) | 272 (3rd Gen) |
| RT Cores | 46 (1st Gen) | 48 (1st Gen) | 68 (2nd Gen) |
| GPU Boost Clock (MHz) | 1800 | 1815 | 1710 |
| Peak FP32 TFLOPS (non-Tensor)¹ | 10.6 | 11.2 | 29.8 |
| Peak FP16 TFLOPS (non-Tensor)¹ | 21.2 | 22.3 | 29.8 |
| Peak BF16 TFLOPS (non-Tensor)¹ | NA | NA | 29.8 |
| Peak INT32 TOPS (non-Tensor)¹,³ | 10.6 | 11.2 | 14.9 |
| Peak FP16 Tensor TFLOPS with FP16 Accumulate¹ | 84.8 | 89.2 | 119/238² |
| Peak FP16 Tensor TFLOPS with FP32 Accumulate¹ | 42.4 | 44.6 | 59.5/119² |
| Peak BF16 Tensor TFLOPS with FP32 Accumulate¹ | NA | NA | 59.5/119² |
| Peak TF32 Tensor TFLOPS¹ | NA | NA | 29.8/59.5² |
| Peak INT8 Tensor TOPS¹ | 169.6 | 178.4 | 238/476² |
| Peak INT4 Tensor TOPS¹ | 339.1 | 356.8 | 476/952² |
| Frame Buffer Memory Size and Type | 8192 MB GDDR6 | 8192 MB GDDR6 | 10240 MB GDDR6X |
| Memory Interface | 256-bit | 256-bit | 320-bit |
| Memory Clock (Data Rate) | 14 Gbps | 15.5 Gbps | 19 Gbps |
| Memory Bandwidth | 448 GB/sec | 496 GB/sec | 760 GB/sec |
| ROPs | 64 | 64 | 96 |
| Pixel Fill-rate (Gigapixels/sec) | 115.2 | 116.2 | 164.2 |
| Texture Units | 184 | 192 | 272 |
| Texel Fill-rate (Gigatexels/sec) | 331.2 | 348.5 | 465 |
| L1 Data Cache / Shared Memory | 4416 KB | 4608 KB | 8704 KB |
| L2 Cache Size | 4096 KB | 4096 KB | 5120 KB |
| Register File Size | 11776 KB | 12288 KB | 17408 KB |
| TGP (Total Graphics Power) | 225 W | 250 W | 320 W |
| Transistor Count | 13.6 Billion | 13.6 Billion | 28.3 Billion |
| Die Size | 545 mm² | 545 mm² | 628.4 mm² |
| Manufacturing Process | TSMC 12 nm FFN (FinFET NVIDIA) | TSMC 12 nm FFN (FinFET NVIDIA) | Samsung 8 nm 8N NVIDIA Custom Process |
1. Peak rates are based on GPU Boost Clock.
2. Effective TOPS / TFLOPS using the new Sparsity Feature
3. TOPS = IMAD-based integer math
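Two further rows of Table 2 can be rebuilt from unit counts and boost clock. The sketch below assumes 64 INT32 cores per SM on both architectures (16 per partition on GA10x per the text; the Turing figure is an assumption of this example), counts an integer multiply-add (IMAD) as two operations, and models texel fill rate as one texel per texture unit per clock. The function names are illustrative.

```python
INT32_CORES_PER_SM = 64  # 16 per partition x 4 partitions; assumed equal on Turing


def peak_int32_tops(sms: int, boost_clock_mhz: float) -> float:
    """Peak INT32 TOPS, assuming one IMAD (counted as 2 ops) per INT32 core per clock."""
    return sms * INT32_CORES_PER_SM * 2 * boost_clock_mhz / 1e6


def texel_fill_rate_gtex_s(texture_units: int, boost_clock_mhz: float) -> float:
    """Peak texel fill rate in Gigatexels/sec, assuming one texel per unit per clock."""
    return texture_units * boost_clock_mhz / 1000


# (SMs, texture units, boost clock in MHz) from Table 2.
CARDS = {
    "GeForce RTX 2080": (46, 184, 1800),
    "GeForce RTX 2080 Super": (48, 192, 1815),
    "GeForce RTX 3080": (68, 272, 1710),
}

for name, (sms, tex_units, mhz) in CARDS.items():
    print(f"{name}: {peak_int32_tops(sms, mhz):.1f} INT32 TOPS, "
          f"{texel_fill_rate_gtex_s(tex_units, mhz):.1f} Gigatexels/sec")
```

Peak INT32 throughput grows only with SM count and clock, while peak FP32 throughput also doubles per SM, which is why the two rows diverge for GeForce RTX 3080.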
Table 3. NVIDIA RTX A6000 and NVIDIA A40 Specs
| Graphics Card | NVIDIA RTX A6000 | NVIDIA A40 |
|---|---|---|
| GPU Codename | GA102 | GA102 |
| GPU Architecture | NVIDIA Ampere | NVIDIA Ampere |
| GPCs | 7 | 7 |
| TPCs | 42 | 42 |
| SMs | 84 | 84 |
| CUDA Cores / SM | 128 | 128 |
| CUDA Cores / GPU | 10752 | 10752 |
| Tensor Cores / SM | 4 (3rd Gen) | 4 (3rd Gen) |
| Tensor Cores / GPU | 336 (3rd Gen) | 336 (3rd Gen) |
| RT Cores | 84 (2nd Gen) | 84 (2nd Gen) |
| GPU Boost Clock (MHz) | 1800 | 1740 |
| Peak FP32 TFLOPS (non-Tensor)¹ | 38.7 | 37.4 |
| Peak FP16 TFLOPS (non-Tensor)¹ | 38.7 | 37.4 |
| Peak BF16 TFLOPS (non-Tensor)¹ | 38.7 | 37.4 |
| Peak INT32 TOPS (non-Tensor)¹,³ | 19.4 | 18.7 |
| Peak FP16 Tensor TFLOPS with FP16 Accumulate¹ | 154.8/309.6² | 149.7/299.4² |
| Peak FP16 Tensor TFLOPS with FP32 Accumulate¹ | 154.8/309.6² | 149.7/299.4² |
| Peak BF16 Tensor TFLOPS with FP32 Accumulate¹ | 154.8/309.6² | 149.7/299.4² |
| Peak TF32 Tensor TFLOPS¹ | 77.4/154.8² | 74.8/149.6² |
| Peak INT8 Tensor TOPS¹ | 309.7/619.4² | 299.3/598.6² |
| Peak INT4 Tensor TOPS¹ | 619.3/1238.6² | 598.7/1197.4² |
| Frame Buffer Memory Size and Type | 49152 MB GDDR6 | 49152 MB GDDR6 |
| Memory Interface | 384-bit | 384-bit |
| Memory Clock (Data Rate) | 16 Gbps | 14.5 Gbps |
| Memory Bandwidth | 768 GB/sec | 696 GB/sec |
| ROPs | 112 | 112 |
| Pixel Fill-rate (Gigapixels/sec) | 201.6 | 194.9 |
| Texture Units | 336 | 336 |
| Texel Fill-rate (Gigatexels/sec) | 604.8 | 584.6 |
| L1 Data Cache / Shared Memory | 10752 KB | 10752 KB |
| L2 Cache Size | 6144 KB | 6144 KB |
| Register File Size | 21504 KB | 21504 KB |
| TGP (Total Graphics Power) | 300 W | 300 W |
| Transistor Count | 28.3 Billion | 28.3 Billion |
| Die Size | 628.4 mm² | 628.4 mm² |
| Manufacturing Process | Samsung 8 nm 8N NVIDIA Custom Process | Samsung 8 nm 8N NVIDIA Custom Process |
1. Peak rates are based on GPU Boost Clock.
2. Effective TOPS / TFLOPS using the new Sparsity Feature
3. TOPS = IMAD-based integer math
NOTE: Refer to Appendix C for RTX A6000 performance data.
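The Tensor Core rows of Table 3 follow a regular pattern that is easier to see as ratios. The sketch below takes the RTX A6000 dense and sparse peaks exactly as listed and reports each data type relative to dense FP16, together with the gain from the Sparsity feature. It only restates table data; the dictionary and function names are illustrative.

```python
# (dense, sparse) peak Tensor rates for RTX A6000, copied from Table 3 (TFLOPS or TOPS).
A6000_TENSOR_PEAKS = {
    "FP16": (154.8, 309.6),
    "BF16": (154.8, 309.6),
    "TF32": (77.4, 154.8),
    "INT8": (309.7, 619.4),
    "INT4": (619.3, 1238.6),
}


def relative_rates(peaks: dict, baseline: str = "FP16") -> dict:
    """Express each dense rate relative to the baseline, plus the sparse/dense gain."""
    base_dense = peaks[baseline][0]
    return {
        dtype: {
            "vs_baseline": round(dense / base_dense, 2),
            "sparse_gain": round(sparse / dense, 2),
        }
        for dtype, (dense, sparse) in peaks.items()
    }


for dtype, rates in relative_rates(A6000_TENSOR_PEAKS).items():
    print(f"{dtype}: {rates['vs_baseline']}x dense FP16, "
          f"{rates['sparse_gain']}x with sparsity")
```

Halving the operand width doubles the listed peak, TF32 runs at half the FP16 rate, and the sparse figure is twice the dense figure for every data type.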
Performance Per Watt
The entire NVIDIA Ampere GPU architecture is crafted for efficiency, from custom process design, to circuit design, logic design, packaging, memory, power, and thermal design, down to the PCB design, the software, and algorithms. At the same performance level, Ampere architecture GPUs are up to 1.9x more power efficient than Turing GPUs.
Figure 4. NVIDIA Ampere GA10x Architecture Power Efficiency (RTX 3080 power efficiency compared to the Turing architecture GeForce RTX 2080 Super)
Second-Generation Ray Tracing Engine in GA10x GPUs
Turing-based GeForce RTX GPUs were the first GPUs to make real-time, cinema-quality ray-traced graphics a reality in PC games. Prior to the arrival of Turing, rendering high-quality ray-traced scenes in real time with fluid frame rates was thought to be years away. Thanks to many Turing architectural advancements (such as dedicated RT Cores, Tensor Cores, and software advances in de