
Efficient Methods and Hardware for Deep Learning

Song Han

Stanford University

Deep Learning is Changing Our Lives

Self-Driving · Machine Translation · AlphaGo · Smart Robots

Models are Getting Larger

IMAGE RECOGNITION (16x growth in model size):
2012, AlexNet: 8 layers, 1.4 GFLOP, ~16% error
2015, ResNet: 152 layers, 22.6 GFLOP, ~3.5% error

SPEECH RECOGNITION (10x growth in training ops):
2014, Deep Speech 1: 80 GFLOP, 7,000 hrs of data, ~8% error
2015, Deep Speech 2: 465 GFLOP, 12,000 hrs of data, ~5% error

Dally, NIPS'2016 workshop on Efficient Methods for Deep Neural Networks

The First Challenge: Model Size

Hard to distribute large models through over-the-air update.

The Second Challenge: Speed

Network      Error rate   Training time
ResNet-18    10.76%       2.5 days
ResNet-50    7.02%        5 days
ResNet-101   6.21%        1 week
ResNet-152   6.16%        1.5 weeks

Such long training times limit ML researchers' productivity.

Training time benchmarked with fb.resnet.torch using four M40 GPUs.

The Third Challenge: Energy Efficiency

AlphaGo: 1920 CPUs and 280 GPUs, $3000 electric bill per game.

On mobile: drains battery. In the data center: increases TCO.

The Problem of Large DNN

Hardware engineers suffer from the large model size:
larger model => more memory references => more energy.

Operation               Energy [pJ]   Relative energy cost
32 bit int ADD          0.1           1x
32 bit float ADD        0.9           9x
32 bit register file    1             10x
32 bit int MULT         3.1           31x
32 bit float MULT       3.7           37x
32 bit SRAM cache       5             50x
32 bit DRAM memory      640           6400x
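The table above supports a back-of-envelope calculation of why memory placement dominates energy. The sketch below is illustrative arithmetic only (the "two operand fetches per MAC" model is an assumption, not a claim from the slides):

```python
# Energy per 32-bit operation in pJ, taken from the table above.
energy_pj = {"int_add": 0.1, "float_add": 0.9, "register": 1.0,
             "int_mult": 3.1, "float_mult": 3.7, "sram": 5.0, "dram": 640.0}

# One float multiply-accumulate, assuming both operands are fetched
# from DRAM versus from on-chip SRAM:
mac_dram = energy_pj["float_mult"] + energy_pj["float_add"] + 2 * energy_pj["dram"]
mac_sram = energy_pj["float_mult"] + energy_pj["float_add"] + 2 * energy_pj["sram"]
print(round(mac_dram / mac_sram))  # a DRAM-bound MAC costs ~88x more energy
```

This is the motivation for compressing models until they fit in on-chip SRAM.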


How to make deep learning more efficient?

Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design

Application as a Black Box

Algorithm (a fixed spec, e.g. SPEC2006) -> Hardware

Open the Box before Hardware Design

Algorithm <-> Hardware

Breaks the boundary between algorithm and hardware.

What's in the Box: Deep Learning 101

Training (training dataset, training hardware) produces the model's weights/activations; Inference (test data, inference hardware) runs the trained model. Model: CNN, RNN, LSTM, ...

Proposed Paradigm

Conventional: Training -> Inference (slow, power-hungry).

Proposed: Training [Han et al. ICLR'17] -> Compression (pruning, quantization) [Han et al. NIPS'15; Han et al. ICLR'16] -> Accelerated Inference [Han et al. ISCA'16; Han et al. FPGA'17 (best paper award)] (fast, power-efficient).


The Goal & Trade-off

Small · Fast · Accurate · Energy Efficient

Agenda

• Model Compression (Small)
  • Pruning [NIPS'15]
  • Trained Quantization [ICLR'16]
• Hardware Acceleration (Fast, Efficient)
  • EIE Accelerator [ISCA'16]
  • ESE Accelerator [FPGA'17]
• Efficient Training (Accurate)
  • Dense-Sparse-Dense Regularization [ICLR'17]

Compression · Acceleration · Regularization


Learning both Weights and Connections for Efficient Neural Networks

Han et al. NIPS 2015

Pruning Neural Networks

[LeCun et al. NIPS'89]
[Han et al. NIPS'15]

Pruning · Trained Quantization · Huffman Coding

[Han et al. NIPS'15]

Pruning Neural Networks

Analogy: in -0.01x^2 + x + 1, the small quadratic term can be dropped.

AlexNet: 60 million connections pruned to 6M, 10x fewer connections.

[Han et al. NIPS'15]

Pruning Neural Networks / Retrain to Recover Accuracy / Iteratively Retrain to Recover Accuracy

Figure: accuracy loss (+0.5% to -4.5%) versus parameters pruned away (40% to 100%), with three curves: pruning alone, pruning + retraining, and iterative pruning and retraining. Retraining recovers the accuracy lost by pruning, and iterating prune/retrain holds accuracy at the highest sparsity.
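The prune-then-retrain loop above can be sketched in a few lines of NumPy. This is an illustrative version, not the exact NIPS'15 recipe: `prune_by_magnitude`, the global magnitude threshold, and the sparsity schedule are all assumptions for the sketch.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights; return pruned weights and mask."""
    k = int(sparsity * weights.size)                # number of weights to drop
    threshold = np.sort(np.abs(weights), axis=None)[k]
    mask = (np.abs(weights) >= threshold).astype(weights.dtype)
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256))
for sparsity in (0.5, 0.75, 0.9):                   # iterative pruning schedule
    w, mask = prune_by_magnitude(w, sparsity)
    # During real retraining, gradients are multiplied by `mask`
    # so pruned connections stay at zero.
print(mask.mean())                                  # ~0.1 of weights survive
```

In the real pipeline, each pruning step is followed by retraining the surviving weights before pruning further.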

[Han et al. NIPS'15]

Pruning RNN and LSTM

*Karpathy et al., "Deep Visual-Semantic Alignments for Generating Image Descriptions"

[Han et al. NIPS'15]

Pruning RNN and LSTM

• Original: "a basketball player in a white uniform is playing with a ball"
  Pruned 90%: "a basketball player in a white uniform is playing with a basketball"
• Original: "a brown dog is running through a grassy field"
  Pruned 90%: "a brown dog is running through a grassy area"
• Original: "a man is riding a surfboard on a wave"
  Pruned 90%: "a man in a wetsuit is riding a wave on a beach"
• Original: "a soccer player in red is running in the field"
  Pruned 95%: "a man in a red shirt and black and white black shirt is running through a field"

[Han et al. NIPS'15]

Pruning Changes Weight Distribution

Figure: weight histograms before pruning, after pruning, and after retraining. Conv5 layer of AlexNet; representative for other network layers as well.


Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Han et al., ICLR 2016 (Best Paper)

[Han et al. ICLR'16]

Trained Quantization

32 bit -> 4 bit: 8x less memory footprint.

Example: weights 2.09, 2.12, 1.92, 1.87 all share one centroid, 2.0.
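Trained quantization clusters the weights with k-means so that many connections share one centroid value, and only a small index per weight plus the codebook are stored. A minimal sketch (linear centroid initialization follows the paper's description; fine-tuning centroids with grouped gradients during retraining is omitted):

```python
import numpy as np

def kmeans_quantize(weights, bits=4, iters=20):
    """Cluster weights into 2**bits shared values; return quantized weights."""
    w = weights.flatten()
    k = 2 ** bits
    centroids = np.linspace(w.min(), w.max(), k)    # linear initialization
    assign = np.zeros(w.size, dtype=int)
    for _ in range(iters):
        # assign each weight to its nearest centroid
        assign = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):                 # update centroid to mean
                centroids[j] = w[assign == j].mean()
    return centroids[assign].reshape(weights.shape)

# The slide's example: four weights sharing a single centroid average to 2.0.
w = np.array([2.09, 2.12, 1.92, 1.87])
print(kmeans_quantize(w, bits=0))                   # -> [2. 2. 2. 2.]
```

With 4-bit codes, 16 centroids replace the full 32-bit weight values, giving the 8x footprint reduction on the slide.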


[Han et al. ICLR'16]

Before Trained Quantization: Continuous Weight

Figure: histogram of continuous weight values.

[Han et al. ICLR'16]

After Trained Quantization: Discrete Weight

Figure: the weight distribution collapses onto discrete centroid values; after further training, the centroids shift to recover accuracy.

[Han et al. ICLR'16]

Bits Per Weight

Figure: accuracy versus the number of bits per weight.

[Han et al. ICLR'16]

Pruning + Trained Quantization

Figure: AlexNet on ImageNet, accuracy under pruning and trained quantization combined.

[Han et al. ICLR'16]

Huffman Coding

• Infrequent weights: use more bits to represent.
• Frequent weights: use fewer bits to represent.
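The two bullets above are exactly what a Huffman code does over the stream of quantized weight indices. A minimal heap-based sketch (the index stream below is made up for illustration; the production encoder works over the real index and zero-run-length statistics):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code: frequent symbols get shorter bit strings."""
    freq = Counter(symbols)
    if len(freq) == 1:                      # degenerate single-symbol case
        return {next(iter(freq)): "0"}
    # heap entries: (frequency, unique tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)     # merge the two rarest subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

# Quantized weight indices: index 3 is most frequent, so its code is shortest.
indices = [3, 3, 3, 3, 1, 1, 2, 0]
print(huffman_code(indices))
```

For the stream above, index 3 gets a 1-bit code while the rare indices 0 and 2 get 3-bit codes, so the average code length drops below the fixed 2 bits per index.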

[Han et al. ICLR'16]

Summary of Deep Compression

Figure: the three-stage pipeline: pruning, trained quantization, Huffman coding.

[Han et al. ICLR'16]

Results: Compression Ratio

Network    Original Size  Compressed Size  Ratio  Original Acc.  Compressed Acc.
LeNet-300  1070 KB        27 KB            40x    98.36%         98.42%
LeNet-5    1720 KB        44 KB            39x    99.20%         99.26%
AlexNet    240 MB         6.9 MB           35x    80.27%         80.30%
VGGNet     550 MB         11.3 MB          49x    88.68%         89.09%
GoogleNet  28 MB          2.8 MB           10x    88.90%         88.92%
ResNet-18  44.6 MB        4.0 MB           11x    89.24%         89.28%

Fit in cache!

Can we make compact models to begin with?

Figure: SqueezeNet Fire module: a 1x1 convolution ("squeeze") layer with ReLU, followed by a mix of 1x1 and 3x3 convolution ("expand") filters with ReLU.

Compressing SqueezeNet

Network     Approach          Size     Ratio  Top-1 Acc.  Top-5 Acc.
AlexNet     -                 240 MB   1x     57.2%       80.3%
AlexNet     SVD               48 MB    5x     56.0%       79.4%
AlexNet     Deep Compression  6.9 MB   35x    57.2%       80.3%
SqueezeNet  -                 4.8 MB   50x    57.5%       80.3%
SqueezeNet  Deep Compression  0.47 MB  510x   57.5%       80.3%

Results: Speedup

Figure: layer-wise speedup of the compressed model over the dense baseline on CPU, GPU, and mobile GPU, with the average shown per platform.

Results: Energy Efficiency

Figure: layer-wise energy efficiency of the compressed model over the dense baseline on CPU, GPU, and mobile GPU, with the average shown per platform.

Industrial Impact

Deep Compression

"At Baidu, our #1 motivation for compressing networks is to bring down the size of the binary file. As a mobile-first company, we frequently update various apps via different app stores. We're very sensitive to the size of our binary files, and a feature that increases the binary size by 100MB will receive much more scrutiny than one that increases it by 10MB." — Andrew Ng

Challenges

• Online de-compression while computing
  – special-purpose logic
• Computation becomes irregular
  – sparse weight
  – sparse activation
  – indirect lookup
• Parallelization becomes challenging
  – synchronization overhead
  – load imbalance
  – scalability

Having Opened the Box, HW Design?

Algorithm <-> Hardware

Breaks the boundary between algorithm and hardware.


EIE: Efficient Inference Engine on Compressed Deep Neural Network

Han et al. ISCA 2016

Operation energy (recap): a 32 bit DRAM access costs 640 pJ, dominating every on-chip operation (32 bit int ADD 0.1 pJ, float MULT 3.7 pJ, SRAM cache 5 pJ).

How to reduce the memory footprint?

Related Work

Eyeriss [1] (MIT): dataflow · TPU [2] (Google): 8-bit integer · DaDianNao [3] (CAS): eDRAM · EIE [this work] (Stanford): compression

[1] Yu-Hsin Chen, et al. "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks." ISSCC 2016
[2] Norm Jouppi, "Google supercharges machine learning tasks with TPU custom chip", 2016
[3] Yunji Chen, et al. "DaDianNao: A machine-learning supercomputer." MICRO 2014
[4] Song Han et al. "EIE: Efficient Inference Engine on Compressed Deep Neural Network", ISCA 2016

[Han et al. ISCA'16]

EIE: Efficient Inference Engine

0 * A = 0    W * 0 = 0    2.09, 1.92 => 2

• Sparse weight (90% static sparsity): 10x less computation, 5x less memory footprint.
• Sparse activation (70% dynamic sparsity): 3x less computation.
• Weight sharing (4-bit weights): 8x less memory footprint.

[Han et al. ISCA'16]

EIE: Parallelization on Sparsity

Figure: sparse matrix-vector multiplication. With input activations a = (0, a1, 0, a3) and an 8x4 sparse weight matrix W, the output is b = ReLU(W a); zero activations and zero weights contribute nothing.

[Han et al. ISCA'16]

EIE: Parallelization on Sparsity

Figure: a 4x4 array of processing elements (PEs) with central control. Rows of the sparse matrix are interleaved across PE0-PE3, so each PE computes the output elements b_i for the rows it owns.

[Han et al. ISCA'16]

EIE: Parallelization on Sparsity

Logically, rows are interleaved across PEs. Physically, each PE stores only its own nonzeros in a compressed sparse column layout; for PE0:

Virtual Weight:  W0,0  W0,1  W4,2  W0,3  W4,3
Relative Index:  0     1     2     0     0
Column Pointer:  0     1     2     3
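The physical layout above can be reproduced directly: walk the PE's rows column by column and store, for every nonzero, its value and the count of zeros since the previous nonzero (the zero run carries across column boundaries, which is why the slide's relative indices come out as 0, 1, 2, 0, 0). A sketch, with `encode_pe` as an illustrative name:

```python
import numpy as np

def encode_pe(pe_rows):
    """Encode one PE's slice of the matrix: values, relative indices
    (zeros since the previous nonzero), and per-column start pointers."""
    values, rel_index, col_ptr = [], [], []
    zeros = 0
    for col in pe_rows.T:                  # column-major walk of this PE's rows
        col_ptr.append(len(values))
        for x in col:
            if x == 0:
                zeros += 1                 # zero run carries across columns
            else:
                values.append(float(x))
                rel_index.append(zeros)
                zeros = 0
    col_ptr.append(len(values))
    return values, rel_index, col_ptr

# PE0 owns rows 0 and 4 of the slide's 8x4 matrix (nonzeros shown as 1..5):
pe0 = np.array([[1, 2, 0, 3],             # W0,0  W0,1  0     W0,3
                [0, 0, 4, 5]])            # 0     0     W4,2  W4,3
values, rel_index, col_ptr = encode_pe(pe0)
print(values)      # [1.0, 2.0, 4.0, 3.0, 5.0] = W0,0 W0,1 W4,2 W0,3 W4,3
print(rel_index)   # [0, 1, 2, 0, 0]
print(col_ptr)     # [0, 1, 2, 3, 5]
```

The hardware additionally caps each relative index at 4 bits, inserting padding zeros when a run is too long; that detail is omitted here.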

[Han et al. ISCA'16]

Dataflow

Figure: the nonzero activations a1 and a3 are broadcast to all PEs in turn; each PE multiplies the activation against its stored nonzeros in the matching column and accumulates into its local outputs b_i, with ReLU applied at the end.
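The dataflow can be sketched end-to-end in plain Python. For clarity this version stores each column as explicit (row, value) lists with absolute row indices instead of EIE's 4-bit relative indices, and sets every nonzero weight to 1.0; the matrix shape and sparsity pattern follow the slide's 8x4 example:

```python
import numpy as np

def eie_matvec(col_rows, col_vals, a, n_rows):
    """Sketch of EIE's sparse x sparse dataflow: skip zero activations
    (dynamic sparsity) and touch only stored weights (static sparsity)."""
    b = np.zeros(n_rows)
    for j, aj in enumerate(a):
        if aj == 0:
            continue                      # zero activation: nothing broadcast
        for r, w in zip(col_rows[j], col_vals[j]):
            b[r] += w * aj                # accumulate only nonzero products
    return np.maximum(b, 0.0)             # ReLU on write-back

# The 8x4 matrix from the slides, columns as (row-index, value) lists;
# all nonzero weights are set to 1.0 for illustration.
col_rows = [[0, 5], [0, 2, 7], [1, 4], [0, 2, 4, 6]]
col_vals = [[1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
a = np.array([0.0, 1.0, 0.0, 2.0])        # a1 and a3 are the only nonzeros
print(eie_matvec(col_rows, col_vals, a, 8))
```

Only columns 1 and 3 are ever touched, which is the 10x/3x work reduction from the earlier slide in miniature.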


[Han et al. ISCA'16]

EIE Architecture

Weight decode: the compressed DNN model stores 4-bit encoded weights; a look-up table expands each into a 16-bit real weight. Sparse format: 4-bit relative indices are address-accumulated into 16-bit absolute indices. The ALU combines decoded weights with the input to produce the prediction result.

[Han et al. ISCA'16]

Micro Architecture for each PE

Figure: each PE pipeline has four stages: Pointer Read (even/odd pointer SRAM banks fetch column start/end pointers), Sparse Matrix Access (sparse matrix SRAM plus weight decoder), Arithmetic Unit (address accumulation, bypass path), and Act R/W (source/destination activation registers), fed by an activation queue with leading non-zero detection, with ReLU on the output. Legend: SRAM, registers, combinational logic.

[Han et al. ISCA'16]

Load Balance

Figure: an activation queue (value + index) in front of each PE buffers the broadcast nonzero activations, so a PE with more nonzero weights in the current column does not stall the others.

[Han et al. ISCA'16]

Activation Sparsity

Figure: leading non-zero detection scans the activation vector and pushes only nonzero activations (value + index) into the PEs' activation queues.

[Han et al. ISCA'16]

Weight Sparsity

Figure: for each nonzero activation's column, the pointer read stage fetches the start/end pointers from the even/odd pointer SRAM banks, and the sparse matrix access stage streams only that column's stored nonzero weights.

[Han et al. ISCA'16]

Weight Sharing

Figure: the weight decoder expands each 4-bit encoded weight into its 16-bit decoded weight via the shared codebook.

[Han et al. ISCA'16]

Address Accumulate

Figure: 4-bit relative indices from the sparse matrix SRAM are accumulated into absolute indices that address the output accumulators.

[Han et al. ISCA'16]

Arithmetic

Figure: the arithmetic unit multiplies the decoded weight by the queued activation and accumulates the product at the absolute address, with a bypass path when consecutive updates hit the same address.

[Han et al. ISCA'16]

Write Back

Figure: accumulated results move through the source/destination activation register files and are written back to the activation SRAM.

[Han et al. ISCA'16]

ReLU, Non-zero Detection

Figure: ReLU is applied on write-back, and leading non-zero detection produces the sparse activation vector consumed by the next layer.

[Han et al. ISCA'16]

What's Special

Figure: the full PE datapath again, highlighting the blocks specific to compressed inference: activation queue, leading non-zero detection, pointer SRAM banks, sparse matrix SRAM, weight decoder, address accumulation, and ReLU.

[Han et al. ISCA'16]

Post-Layout Result of EIE

Technology          45 nm
# PEs               64
On-chip SRAM        8 MB
Max model size      84 million parameters
Static sparsity     10x
Dynamic sparsity    3x
Quantization        4-bit
ALU width           16-bit
Area                40.8 mm^2
MxV throughput      81,967 layers/s
Power               586 mW

1. Post-layout result.
2. Throughput measured on AlexNet FC-7.

[Han et al. ISCA'16]

Benchmark

• CPU: Intel Core i7-5930k
• GPU: NVIDIA TitanX
• Mobile GPU: NVIDIA Jetson TK1

Layer            Size        Weight Density  Activation Density  FLOP Reduction  Description
AlexNet-6        4096x9216   9%              35%                 33x             AlexNet for image classification
AlexNet-7        4096x4096   9%              35%                 33x
AlexNet-8        1000x4096   25%             38%                 10x
VGG-6            4096x25088  4%              18%                 100x            VGG-16 for image classification
VGG-7            4096x4096   4%              37%                 50x
VGG-8            1000x4096   23%             41%                 10x
NeuralTalk-We    600x4096    10%             100%                10x             RNN and LSTM for image caption
NeuralTalk-Wd    8791x600    11%             100%                10x
NeuralTalk-LSTM  2400x1201   10%             100%                10x
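As a rough sanity check on the table, the FLOP reduction is on the order of one over (weight density x activation density), since only products with both operands nonzero are computed. The reported column comes from measurement and rounding, so the estimates below only approximately agree (e.g. VGG-6 estimates higher than the reported 100x); the dictionary is illustrative:

```python
# Rough FLOP-reduction estimate from weight and activation density:
# only products where both operands are nonzero are computed.
benchmarks = {
    "AlexNet-6": (0.09, 0.35),   # reported 33x
    "VGG-6":     (0.04, 0.18),   # reported 100x
    "NT-We":     (0.10, 1.00),   # reported 10x
}
for name, (w_density, a_density) in benchmarks.items():
    print(name, round(1.0 / (w_density * a_density)))
```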

[Han et al. ISCA'16]

Speedup on EIE

Figure: speedup over the dense CPU baseline (1x) for CPU, GPU, and mobile GPU (each dense and compressed) and for EIE, across Alex-6/7/8, VGG-6/7/8, and NT-We/Wd/LSTM. EIE reaches up to 1018x on individual layers, with a geometric-mean speedup of 189x over the CPU baseline.

[Han et al. ISCA'16]

Energy Efficiency on EIE

Figure: energy efficiency over the dense CPU baseline (1x) for the same platforms and benchmarks. EIE reaches up to 119,797x on individual layers, with a geometric mean of 24,207x over the CPU baseline.

[Han et al. ISCA'16]

Comparison: Throughput

Figure: throughput (layers/s, log scale) across platforms: Core-i7 5930k (CPU, 22nm), TitanX (GPU, 28nm), Tegra K1 (mGPU, 28nm), A-Eye (FPGA, 28nm), DaDianNao (ASIC, 28nm), TrueNorth (ASIC, 28nm), EIE with 64 PEs (ASIC, 45nm), and EIE with 256 PEs (ASIC, 28nm).

[Han et al. ISCA'16]
