高性能矩陣乘法

上傳人：石*** IP屬地：廣東上傳時間：2022-09-20 格式：PPT 頁數(shù)：27 大小：1.55MB 積分：15 舉報 版權(quán)申訴

已閱讀5頁，還剩22頁未讀，繼續(xù)免費閱讀

版權(quán)說明：本文檔由用戶提供并上傳，收益歸屬內(nèi)容提供方，若內(nèi)容存在侵權(quán)，請進行舉報或認領(lǐng)

文檔簡介

1、高性能矩陣乘法第1頁，共27頁，2022年，5月20日，20點24分，星期四2022/9/82并行算法優(yōu)化研究相對于傳統(tǒng)面向?qū)ο蟠兴惴ǖ?個挑戰(zhàn)：同步：兩個或者多個線程協(xié)調(diào)其行為的過程通信：與線程之間交換數(shù)據(jù)相關(guān)的帶寬和延遲問題負載均衡：多個線程之間工作量分布的情況，給各個線程（執(zhí)行核）分配均勻的工作可擴展性：衡量在性能更加強勁的系統(tǒng)上運行軟件時能否有效利用更多線程的指標(biāo)，觀察應(yīng)用程序在更高級的平臺上運行4核到8核線性增長第2頁，共27頁，2022年，5月20日，20點24分，星期四2022/9/83多線程（核）設(shè)計主要分解模式任務(wù)分解：對程序根據(jù)其執(zhí)行的功能進行分解的過程數(shù)據(jù)分解：將應(yīng)用

2、程序根據(jù)各任務(wù)所處理的數(shù)據(jù)而非按任務(wù)的天然特性來進行分解數(shù)據(jù)流分解：研究數(shù)據(jù)在諸任務(wù)之間如何流動，根據(jù)任務(wù)之間的數(shù)據(jù)流關(guān)系對問題進行分解模式分解方式任務(wù)級并行模式任務(wù)分解Divide and Conquer任務(wù)/數(shù)據(jù)分解幾何分解模式數(shù)據(jù)分解流水線模式數(shù)據(jù)流分解波峰（wavefront）模式數(shù)據(jù)流分解第3頁，共27頁，2022年，5月20日，20點24分，星期四2022/9/84多線程（核）設(shè)計主要分解模式任務(wù)分解：對程序根據(jù)其執(zhí)行的功能進行分解的過程數(shù)據(jù)分解：將應(yīng)用程序根據(jù)各任務(wù)所處理的數(shù)據(jù)而非按任務(wù)的天然特性來進行分解數(shù)據(jù)流分解：研究數(shù)據(jù)在諸任務(wù)之間如何流動，根據(jù)任務(wù)之間的數(shù)據(jù)流關(guān)系對問題進

3、行分解分解方式設(shè)計說明任務(wù)分解不同的程序行為采用不同的線程實現(xiàn)常用于GUI應(yīng)用程序數(shù)據(jù)分解多個線程對不同的數(shù)據(jù)塊執(zhí)行相同的操作常用于音頻、圖像處理和科學(xué)計算應(yīng)用程序數(shù)據(jù)流分解一個線程的輸出作為另一個線程的輸入尤其應(yīng)注意盡量消除啟動和排空延遲第4頁，共27頁，2022年，5月20日，20點24分，星期四2022/9/85矩陣乘法算法探討在工程科學(xué)計算中，矩陣乘積是最基本的運算典型的n階稠密方陣乘積算法的時間復(fù)雜度是O(n3) 。目前對大型矩陣乘積運算的處理主要是采用分治思想，將矩陣分布在多個節(jié)點上，但每個結(jié)點上的小矩陣仍要立方級乘法次數(shù)。基于分之思想的兩種劃分策略：條形劃分和塊狀（棋盤）劃分的

4、6種常見分布式矩陣乘法并行算法。第5頁，共27頁，2022年，5月20日，20點24分，星期四2022/9/86基于不同劃分策略的矩陣乘法算法探討1、條形（striped partitioning）劃分的矩陣乘法并行算法行條劃分列條劃分兩兩組合：行列、行行、列列、列行第6頁，共27頁，2022年，5月20日，20點24分，星期四2022/9/87基于不同劃分策略的矩陣乘法算法探討2、塊狀劃分(checkerboard partitioning)的矩陣乘法并行算法稱為棋盤劃分第7頁，共27頁，2022年，5月20日，20點24分，星期四Cannon Description for impl

5、ementation of MPI program to compute Matrix Matrix Multiplication using block checkerboard partitioning and Cannon Algorithm 2022/9/88第8頁，共27頁，2022年，5月20日，20點24分，星期四Cannon Objective Computing the matrix-matrix multiplication on SMP System. Use block checkerboard partitioning of the matrices and Cann

6、ons Algorithm. AssumptionSize of the square matrices p= q2 and the size of square matrices A and B is evenly divisible by q. It is assumed that the number of blocks are equal to the number of processors.2022/9/89第9頁，共27頁，2022年，5月20日，20點24分，星期四Cannon Cannons algorithm is based on cartesian virtual to

7、pologyA and B are square matrices of size n and C be the output matrix.These matrices are dived into blocks or submatrices to perform matrix-matrix operations in paralleln x n matrix A can be regarded as q x q array of blocks Ai, j (0=i q, 0= j q) such that each block is an (n/q) x (n/q) submatrixWe

8、 use p processors to implement the block version of matrix multiplication in parallel by choosing q as a square root of p and compute a distinct block Ci, j on each processor.2022/9/810第10頁，共27頁，2022年，5月20日，20點24分，星期四傳統(tǒng)并行 2022/9/811第11頁，共27頁，2022年，5月20日，20點24分，星期四傳統(tǒng)并行 The matrices A and B are partit

9、ioned into p blocks, A i, j and B i, j (0=i q, 0= j q) of size (n/q x n/q) on each process. These blocks are mapped onto a q x q logical mesh of processes. The processes are labeled from P0,0 to Pq-1,q-1.2022/9/812第12頁，共27頁，2022年，5月20日，20點24分，星期四傳統(tǒng)并行 Process Pi, j initially store block matrices Ai,

10、j and Bi, j and computes block Ci, j of result matrix. To compute submatrix Ci, j , we need all submatrices , Ai, k and Bk, j ( 0 = k q ) . To acquire all the required blocks, an all-to-all broadcast of matrix Ai, j s is performed in each row and similarly in each column of matrix Bi, js. MPI collec

11、tive communication is used to perform this operations. 2022/9/813第13頁，共27頁，2022年，5月20日，20點24分，星期四傳統(tǒng)并行 After Pi, j acquires, A i,0 , A i,1 , A i,2 , A i, q-1 and B0, j , B1, j , B2, j , Bq-1, j , it performs the serial block matrix to matrix multiplication and accumulates the partial block matrix Ci,

12、 j of matrix C . To obtain the resultant product matrix C, processes with rank 0 gathers all the block matrices by using MPI_Gather collective communication operation.2022/9/814第14頁，共27頁，2022年，5月20日，20點24分，星期四Cannon p processors arranged in q x q square grid of processors and the input matrices.A an

13、d B are distributed among the processes in checkerboard fashion. It results in constructing p block matrices of A and B. It uses only point-to-point communication for circularly shifting blocks of matrix A and matrix B among p processes.2022/9/815第15頁，共27頁，2022年，5月20日，20點24分，星期四Cannon-inital The alg

14、orithm proceeds in q stages. The first step in this algorithm is to perform initial alignment of the block matrix A and block matrix B. The blocks of matrix A are circularly shifted to the i positions to left in the row of the square grid of processes, where i is the row number of the process in the

15、 mesh. The blocks of matrix B are circularly shifted j positions upwards, where j is the column number of the process in the processes mesh.2022/9/816第16頁，共27頁，2022年，5月20日，20點24分，星期四Cannon-inital2022/9/817第17頁，共27頁，2022年，5月20日，20點24分，星期四Cannon-runningThe algorithm performs the following steps in eac

16、h stage : 1. Multiply the block of matrix A and matrix B and add the resultant matrix to get the block matrix C, which is initially set to zero.2. Circularly shift the blocks of matrix A to left in the rows of the processes and the blocks of matrix B upwards in the columns of the square grid of proc

17、esses in a wrap around manner.2022/9/818第18頁，共27頁，2022年，5月20日，20點24分，星期四Cannon-running2022/9/819第19頁，共27頁，2022年，5月20日，20點24分，星期四Cannon-running2022/9/820第20頁，共27頁，2022年，5月20日，20點24分，星期四書中Cannon-bug2022/9/821MPI_Send and MPI_Recv is not used for point-to-point communicationbecause if all the processes

18、 call MPI_Send or MPI_Recv in different order the deadlocked situation may arise .How to fix?指派一個緩沖區(qū)，使用MPI_Irecv/MPI_Isend 非阻塞式通訊函數(shù)，MPI_wait.MPI_Sendrecv. 第21頁，共27頁，2022年，5月20日，20點24分，星期四2022/9/822Cannon-bug死鎖的問題問題來源于main_shift()這個函數(shù)中MPI函數(shù)的使用。在Cannon-mpi代碼的main_shift()模塊中，文獻中算法使用的是MPI的阻塞通信函數(shù):MPI_Sen

19、d/MPI_Recv,這就使得Cannon算法在執(zhí)行循環(huán)左移和循環(huán)上移時，矩陣規(guī)模超過共享buff的容量時出現(xiàn)循環(huán)等待的死鎖狀況。在曙光4000集群系統(tǒng)上，該算法的發(fā)生死鎖的矩陣下限規(guī)模是200200的浮點型矩陣。第22頁，共27頁，2022年，5月20日，20點24分，星期四2022/9/823Cannon-bug原始（阻塞式）的main_shift模塊： void main_shift() /* 將分塊b左移位*/ MPI_Send(a , dl2, MPI_FLOAT, get_index(my_row, my_col-1, sp), 1, MPI_COMM_WORLD); MPI_Re

20、cv(a , dl2, MPI_FLOAT, get_index(my_row, my_col+1, sp), 1, MPI_COMM_WORLD, &status); /* 將分塊b上移位*/ MPI_Send(b , dl2, MPI_FLOAT, get_index(my_row-1, my_col, sp), 1, MPI_COMM_WORLD); MPI_Recv(b , dl2, MPI_FLOAT, get_index(my_row+1, my_col, sp), 1, MPI_COMM_WORLD, &status); 第23頁，共27頁，2022年，5月20日，20點24分，

21、星期四2022/9/824Cannon-bug改進（非阻塞式）的main_shift模塊 ci*dl+j += ai*dl+k*bj*dl+k;/改進了的Cannon按行存取 /* 將分塊a左移位*/ MPI_Isend(a , dl2, MPI_FLOAT, get_index(my_row, my_col-1, sp), 1, MPI_COMM_WORLD, &myrequest_s); MPI_Irecv(buf , dl2, MPI_FLOAT, get_index(my_row, my_col+1, sp),1,MPI_COMM_WORLD,&myrequest_r );MPI_Wa

22、it(&myrequest_s,&status);MPI_Wait(&myrequest_r,&status);memcpy(a,buf,sizeof(float) *dl2); /* 將分塊b上移位*/ MPI_Isend(b , dl2, MPI_FLOAT, get_index(my_row-1, my_col, sp), 1, MPI_COMM_WORLD,& myrequest_s); MPI_Irecv(buf , dl2, MPI_FLOAT, get_index(my_row+1, my_col, sp), 1, MPI_COMM_WORLD,&myrequest_r);MPI_Wait(&myrequest_s,&status);MPI_Wait(&myrequest_r,&status);memcpy(b,buf,sizeof(float) *dl2);第24頁，共27頁，2022年，5月20日，20點24分，星期四2022/9/825Cannon-bugMPI_Irecv僅僅初始化接受操作,在與之對應(yīng)的MPI_Wait函數(shù)的調(diào)用返回之前，將不能訪問bufferMPI_Irecv函數(shù)返回時，handle指向一個MPI_Request對象，它代表了一個已近初始化了的通信操作。這個函數(shù)并不返回一個指向MPI_St

人人文庫> 全部分類> 教育資料 > 課件下載

溫馨提示

1. 本站所有資源如無特殊說明，都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
2. 本站的文檔不包含任何第三方提供的附件圖紙等，如果需要附件，請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
3. 本站RAR壓縮包中若帶圖紙，網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽，若沒有圖紙預(yù)覽就沒有圖紙。
4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
5. 人人文庫網(wǎng)僅提供信息存儲空間，僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理，對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯，并不能對任何下載內(nèi)容負責(zé)。
6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容，請與我們聯(lián)系，我們立即糾正。
7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

高性能矩陣乘法

文檔簡介

溫馨提示

最新文檔

評論

高性能矩陣乘法

文檔簡介

溫馨提示

最新文檔

評論

相關(guān)文檔