High Performance Computing
theory
performance measurement
efficiency
\[ E(p) = \frac{T(n, 1)}{p \cdot T(n,p)} \]
criterion of algorithm design
efficiency scaling (Brent's lemma)
an algorithm can be run on fewer processors without loss of efficiency
when two algorithms use the same number of processors, speedup and efficiency differ only by the constant factor \( p \), so they rank the algorithms identically
with the above two results, we see that efficiency is the better metric and should always be favored
speedup
theoretically, \[ S(p) = \frac{T(n,1)}{T(n,p)} \le p \] where \( T(n,1) \) should be the runtime of the best sequential algorithm
in practice, we may observe superlinear speedup because the stated time complexity is a worst-case bound; other reasons include different memory-access behavior (e.g. more aggregate cache once the data is split across processors), etc.
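As an illustration of the two formulas above, a minimal MPI timing sketch; the kernel `solve`, the problem size `n`, and the serial time `t_serial` are placeholders, not part of the course material:

```c
/* Minimal sketch: measure T(n,p) with MPI_Wtime and report
 * speedup S(p) = T(n,1)/T(n,p) and efficiency E(p) = S(p)/p. */
#include <mpi.h>
#include <stdio.h>

/* hypothetical kernel: each rank handles an n/p share of the work */
static void solve(long n, int p, int rank) {
    volatile double acc = 0.0;
    for (long i = rank; i < n; i += p) acc += (double)i * 1e-9;
    (void)acc;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long n = 100000000L;          /* problem size (assumed) */
    double t_serial = 12.3;       /* T(n,1): measured separately with p = 1 (placeholder) */

    MPI_Barrier(MPI_COMM_WORLD);  /* start all ranks together */
    double t0 = MPI_Wtime();
    solve(n, p, rank);
    MPI_Barrier(MPI_COMM_WORLD);  /* T(n,p) is the time of the slowest rank */
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double T_np = t1 - t0;
        double S = t_serial / T_np;   /* speedup  S(p) = T(n,1) / T(n,p) */
        double E = S / p;             /* efficiency E(p) = S(p) / p      */
        printf("p=%d  T(n,p)=%.3fs  S=%.2f  E=%.2f\n", p, T_np, S, E);
    }
    MPI_Finalize();
    return 0;
}
```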
communication network
static network
processors are connected directly
diameter
bisection width
cost
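For reference, the standard textbook values of these three metrics for the topologies that appear later in this map (assuming \( p \) processors; these values are not from the course notes themselves):
\[
\begin{array}{l|ccc}
 & \text{diameter} & \text{bisection width} & \text{cost} \\ \hline
\text{ring} & \lfloor p/2 \rfloor & 2 & p \\
\text{2-D mesh } (\sqrt{p}\times\sqrt{p}\text{, no wraparound}) & 2(\sqrt{p}-1) & \sqrt{p} & 2(p-\sqrt{p}) \\
\text{hypercube} & \log p & p/2 & (p \log p)/2
\end{array}
\]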
dynamic network
processors are connected indirectly, through switches
crossbars
buses
tree-based network
multi-stage network
multi-stage Omega network
parallel algorithms
prefix sums
computation time \( \Theta(\frac{n}{p} + \log p) \)
communication time \( \Theta( (\tau + \mu)\log p) \)
applications
polynomial evaluation (powers of a scalar)
Fibonacci sequences (powers of matrices)
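A minimal sketch of the parallel prefix-sums pattern behind the \( \Theta(\frac{n}{p} + \log p) \) computation bound: each rank scans its own block, then an exclusive scan of the block totals (here via MPI_Exscan) supplies each rank's offset. The block size and data are illustrative assumptions:

```c
/* Sketch: parallel prefix sums on an n/p block per rank.
 * The local scan is Theta(n/p); the MPI_Exscan of block totals is the
 * Theta(log p) communication step. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nloc = 4;               /* n/p elements per rank (assumed) */
    long x[4], pre[4];
    for (int i = 0; i < nloc; i++) x[i] = rank * nloc + i + 1;  /* demo data */

    /* 1. local (inclusive) prefix sums of this rank's block */
    long run = 0;
    for (int i = 0; i < nloc; i++) { run += x[i]; pre[i] = run; }

    /* 2. exclusive scan of the block totals gives this rank's offset */
    long offset = 0;
    MPI_Exscan(&run, &offset, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) offset = 0;        /* MPI_Exscan leaves rank 0's recvbuf undefined */

    /* 3. add the offset to turn local prefixes into global ones */
    for (int i = 0; i < nloc; i++) pre[i] += offset;

    printf("rank %d: last global prefix = %ld\n", rank, pre[nloc - 1]);
    MPI_Finalize();
    return 0;
}
```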
bitonic sort
bitonic split
bitonic merge
computation time: \( O(\frac{n}{p}\log \frac{n}{p} + \frac{n}{p}\log^2 p)\)
communication time: \( O(\tau \log^2 p + \mu \frac{n}{p}\log^2 p )\)
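A sequential sketch of bitonic split and bitonic merge (the recursive structure that the parallel version maps onto a hypercube); the input array and its power-of-two length are demo assumptions:

```c
/* Sketch: sequential bitonic sort (n must be a power of two). */
#include <stdio.h>

static void swap_if(int *a, int i, int j, int ascending) {
    if ((a[i] > a[j]) == ascending) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}

/* bitonic merge: a[lo..lo+n) is bitonic; the split (compare-exchange across
 * distance n/2) leaves two bitonic halves with every element of one half
 * <= every element of the other, so each half is merged recursively. */
static void bitonic_merge(int *a, int lo, int n, int ascending) {
    if (n <= 1) return;
    int half = n / 2;
    for (int i = lo; i < lo + half; i++)     /* bitonic split */
        swap_if(a, i, i + half, ascending);
    bitonic_merge(a, lo, half, ascending);
    bitonic_merge(a, lo + half, half, ascending);
}

/* bitonic sort: sort one half ascending and the other descending so the
 * whole range becomes a bitonic sequence, then merge it. */
static void bitonic_sort(int *a, int lo, int n, int ascending) {
    if (n <= 1) return;
    int half = n / 2;
    bitonic_sort(a, lo, half, 1);
    bitonic_sort(a, lo + half, half, 0);
    bitonic_merge(a, lo, n, ascending);
}

int main(void) {
    int a[8] = {5, 1, 7, 3, 8, 2, 6, 4};    /* demo data, length 2^3 */
    bitonic_sort(a, 0, 8, 1);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```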
sample sort
computation time: \( O(\frac{n\log n}{p} + p \log^2 p) \)
communication time: \( O(\tau p + \mu (\frac{n}{p} + p \log^2 p)) \)
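A partial sketch of sample sort's splitter-selection phase, assuming regular sampling: each rank sorts locally, contributes \( p-1 \) samples, rank 0 sorts the samples and broadcasts \( p-1 \) splitters. The final bucket exchange (normally MPI_Alltoallv) is deliberately omitted, and the local data is randomly generated for the demo:

```c
/* Partial sketch of sample sort: local sort, sample, gather samples,
 * choose and broadcast splitters.  The redistribution step is omitted. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nloc = 64;                      /* n/p local elements (assumed) */
    int *local = malloc(nloc * sizeof(int));
    srand(rank + 1);
    for (int i = 0; i < nloc; i++) local[i] = rand() % 1000;   /* demo data */

    /* 1. local sort: O((n/p) log (n/p)) */
    qsort(local, nloc, sizeof(int), cmp_int);

    /* 2. pick p-1 evenly spaced local samples */
    int *samples = malloc((p - 1) * sizeof(int));
    for (int i = 1; i < p; i++) samples[i - 1] = local[i * nloc / p];

    /* 3. gather all p(p-1) samples on rank 0, sort, choose p-1 splitters */
    int *all = NULL, *splitters = malloc((p - 1) * sizeof(int));
    if (rank == 0) all = malloc(p * (p - 1) * sizeof(int));
    MPI_Gather(samples, p - 1, MPI_INT, all, p - 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        qsort(all, p * (p - 1), sizeof(int), cmp_int);
        for (int i = 1; i < p; i++) splitters[i - 1] = all[i * (p - 1)];
    }
    MPI_Bcast(splitters, p - 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* 4. (omitted) bucket `local` by splitter, exchange with MPI_Alltoallv,
     *    then sort the received bucket */
    if (rank == 0 && p > 1) printf("first splitter: %d\n", splitters[0]);

    free(local); free(samples); free(splitters); if (rank == 0) free(all);
    MPI_Finalize();
    return 0;
}
```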
matrix algorithms
matrix-vector multiplication
matrix-matrix multiplication
Cannon's algorithm
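A hedged sketch of Cannon's algorithm on a \( \sqrt{p}\times\sqrt{p} \) process torus; \( p \) is assumed to be a perfect square, and the block size and block contents are demo assumptions rather than a real distributed matrix:

```c
/* Sketch of Cannon's algorithm for C = A*B on a q x q torus (p = q^2).
 * Each rank owns one b x b block of A, B, C. */
#include <mpi.h>
#include <math.h>
#include <stdio.h>

static void gemm_acc(double *C, const double *A, const double *B, int b) {
    for (int i = 0; i < b; i++)
        for (int k = 0; k < b; k++)
            for (int j = 0; j < b; j++)
                C[i*b + j] += A[i*b + k] * B[k*b + j];
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int q = (int)(sqrt((double)p) + 0.5);          /* assumes p is a perfect square */
    if (q * q != p) {
        if (rank == 0) fprintf(stderr, "p must be a perfect square\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int dims[2] = {q, q}, periods[2] = {1, 1}, coords[2];
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);                    /* rank may be reordered */
    MPI_Cart_coords(grid, rank, 2, coords);

    const int b = 2;                               /* block size n/q (assumed) */
    double A[4], B[4], C[4] = {0};
    for (int i = 0; i < b*b; i++) { A[i] = rank + 1; B[i] = 1.0; }  /* demo blocks */

    int src, dst;
    MPI_Status st;
    /* initial alignment: shift block row i of A left by i, block column j of B up by j */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, b*b, MPI_DOUBLE, dst, 0, src, 0, grid, &st);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, b*b, MPI_DOUBLE, dst, 0, src, 0, grid, &st);

    /* q steps: multiply-accumulate, then shift A left and B up by one */
    int left, right, up, down;
    MPI_Cart_shift(grid, 1, -1, &right, &left);
    MPI_Cart_shift(grid, 0, -1, &down, &up);
    for (int step = 0; step < q; step++) {
        gemm_acc(C, A, B, b);
        MPI_Sendrecv_replace(A, b*b, MPI_DOUBLE, left, 0, right, 0, grid, &st);
        MPI_Sendrecv_replace(B, b*b, MPI_DOUBLE, up,   0, down,  0, grid, &st);
    }

    printf("rank %d (%d,%d): C[0][0] = %g\n", rank, coords[0], coords[1], C[0]);
    MPI_Finalize();
    return 0;
}
```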
FFT
graph algorithms
MST
Prim's algorithm
shortest path
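For the Prim's algorithm item above, a sketch of the key parallel step in the 1-D column-partitioned formulation: each rank finds the closest not-yet-included vertex in its own slice of the distance array, and MPI_Allreduce with MPI_MINLOC selects the global minimum. The distances and flags below are demo assumptions:

```c
/* Sketch of the selection step in parallel Prim's MST. */
#include <mpi.h>
#include <float.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nloc = 4;                       /* vertices per rank (assumed) */
    double d[4]   = {3.0 + rank, 7.0, 2.5 + rank, 9.0};  /* demo distances to the tree */
    int    inT[4] = {0, 1, 0, 0};             /* demo "already in tree" flags */

    /* local minimum as a (value, global vertex index) pair */
    struct { double val; int idx; } loc = { DBL_MAX, -1 }, glob;
    for (int i = 0; i < nloc; i++)
        if (!inT[i] && d[i] < loc.val) { loc.val = d[i]; loc.idx = rank * nloc + i; }

    /* global minimum across all ranks; MPI_MINLOC also returns the vertex index */
    MPI_Allreduce(&loc, &glob, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);

    if (rank == 0)
        printf("next vertex to add: %d (distance %g)\n", glob.idx, glob.val);
    MPI_Finalize();
    return 0;
}
```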
embedding
ring to hypercube
mesh to hypercube
tree to hypercube
can't embed a complete binary tree (with \( 2^d - 1 \) nodes) into a \( d \)-dimensional hypercube with dilation 1
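A small sketch of the ring-to-hypercube embedding via the binary reflected Gray code, \( G(i) = i \oplus (i \gg 1) \): consecutive ring positions map to hypercube neighbours (dilation 1). The hypercube dimension below is an arbitrary choice:

```c
/* Sketch: ring position i maps to hypercube node G(i) = i ^ (i >> 1);
 * consecutive positions differ in exactly one bit. */
#include <stdio.h>

static unsigned gray(unsigned i) { return i ^ (i >> 1); }

int main(void) {
    const unsigned d = 3, p = 1u << d;        /* 3-dimensional hypercube, 8 nodes */
    for (unsigned i = 0; i < p; i++)
        printf("ring %u -> node %u (binary %u%u%u)\n", i, gray(i),
               (gray(i) >> 2) & 1, (gray(i) >> 1) & 1, gray(i) & 1);
    return 0;
}
```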
programming
MPI has many implementations; the most widely used is OpenMPI
in this branch, \( \tau \) means latency, \( \frac{1}{\mu}\) means bandwidth and \( m \) means message size
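With these symbols, the cost of sending a single message of size \( m \) between two processors is modeled as
\[ t_{\text{msg}} = \tau + \mu m \]
which is the building block of all the communication complexities below.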
point-to-point communication
MPI_Send
MPI_Recv
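A minimal point-to-point sketch (rank 0 sends one integer to rank 1; the value and tag are arbitrary):

```c
/* Minimal MPI_Send / MPI_Recv example. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p, x = 42;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (p >= 2) {
        if (rank == 0) {
            MPI_Send(&x, 1, MPI_INT, 1, /*tag=*/0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status st;
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &st);
            printf("rank 1 received %d from rank 0\n", x);
        }
    }
    MPI_Finalize();
    return 0;
}
```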
collective communication
MPI_Bcast
time complexity \( O((\tau + \mu m) \log p)\)
MPI_Scatter and MPI_Gather
time complexity \( O(\tau \log p + \mu m p)\)
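A small sketch combining the two: rank 0 scatters equal chunks, each rank transforms its chunk, and the results are gathered back on rank 0 (the chunk size and data are assumptions):

```c
/* Sketch: scatter, local work, gather. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int chunk = 2;                     /* elements per rank (assumed) */
    int *data = NULL, local[2], result[2];
    if (rank == 0) {
        data = malloc(p * chunk * sizeof(int));
        for (int i = 0; i < p * chunk; i++) data[i] = i;   /* demo input */
    }

    MPI_Scatter(data, chunk, MPI_INT, local, chunk, MPI_INT, 0, MPI_COMM_WORLD);
    for (int i = 0; i < chunk; i++) result[i] = local[i] * local[i];  /* local work */
    MPI_Gather(result, chunk, MPI_INT, data, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < p * chunk; i++) printf("%d ", data[i]);
        printf("\n");
        free(data);
    }
    MPI_Finalize();
    return 0;
}
```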
MPI_Allgather
time complexity \( O(\tau \log p + \mu m p)\)
MPI_Alltoall
arbitrary permutation: \( O(\tau p + \mu m p) \)
hypercubic permutations: \( O(\tau\log p + \mu m p \log p) \) (NOT used in the course)
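A minimal MPI_Alltoall sketch in which every rank sends one distinct integer to every other rank (the communication pattern of a distributed transpose); the payload values are arbitrary:

```c
/* Sketch: every rank sends element i of sendbuf to rank i. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *sendbuf = malloc(p * sizeof(int));
    int *recvbuf = malloc(p * sizeof(int));
    for (int i = 0; i < p; i++) sendbuf[i] = rank * 100 + i;  /* element meant for rank i */

    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    /* recvbuf[i] now holds the element rank i addressed to this rank */
    printf("rank %d received from rank 0: %d\n", rank, recvbuf[0]);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}
```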
Many_to_Many
stage 1: evenly distribute messages to everyone
stage 2: everyone helps send messages
time complexity: \( O(\tau p + \mu (R + S + p^2)) \), which is \( O(\tau p + \mu (R + S)) \) provided \( p^2 = O(R+S) \)
All_to_all reduction
\( O(\tau \log p + \mu p)\)
reduction
MPI_Allreduce
time complexity \( O((\tau + \mu m) \log p)\)
MPI_Scan (prefix sum)
time complexity \( O((\tau + \mu m) \log p)\)
MPI_Reduce
time complexity \( O((\tau + \mu m) \log p)\)
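Finally, a small sketch using MPI_Allreduce for a distributed dot product; the local vector length and contents are demo assumptions:

```c
/* Sketch: global dot product with MPI_Allreduce; every rank ends up
 * with the same sum. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nloc = 4;                          /* local vector length (assumed) */
    double x[4], y[4], local = 0.0, global = 0.0;
    for (int i = 0; i < nloc; i++) { x[i] = rank + 1; y[i] = 1.0; }  /* demo vectors */

    for (int i = 0; i < nloc; i++) local += x[i] * y[i];   /* local partial sum */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: dot product = %g\n", rank, global);
    MPI_Finalize();
    return 0;
}
```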