Key Terms
- Speedup = Time sequential (Ts) / Time parallel (Tp)
- Granularity = Ratio of amount of computation to the amount (cost) of communication
- Efficiency = Speedup/p = Ts/(p * Tp) (p = number of processors)
- Scalability = ability to keep speedup proportional to number of processors (p)
- step complexity
- work complexity
- map operation
- max scan operation
- broadcast/gather/scatter
- Hillis-Steele scan algorithm
- Blelloch Algorithm
- Atomics
- Kernels
- Parallel compact operation
- Latency
- Bandwidth
- Occupancy (in CUDA)
- PRAM
- CSR (Compressed Sparse Row)
- blocks, threads (in CUDA)
- Tiles (shared vs global memory)
- Race conditions
- Work = FLOPS (floating-point operations per second) / MIPS (millions of instructions per second)
- Execution time = Work / Compute rate
- Many Core vs Multi Core
- Fork/Join Parallelism
- Kernel Data Parallelism
Chapter 1
Parallel Computing models
Shared memory model
- Multiple tasks share a single memory space, which is read and written asynchronously
- Locks/Semaphores (Easier)
- Harder to manage data locality
Message Passing Model
- Each processor has its own memory
- Tasks can be distributed to interconnected processors
- Message Passing Interface (MPI)
Multithreaded Control Model
- Threads
- Program counter, a stack, registers, and identifier
- POSIX threads (pthreads; software threads)
- Hyper-Threading (Intel's hardware threads)
Data Parallel Model
- Tasks operate on the same data structure (Arrays)
- Cuda Kernels
Designing Parallel Programs
Task decomposition
- Domain decomposition (data is decomposed into separate units)
- Functional Decomposition (task is split into subtasks)
Task Agglomeration
- Combine smaller tasks with larger ones to increase granularity (better performance)
Task Assignment and Mapping
- Assign tasks to processors, taking into account processor speed (heterogeneity)
- Load balancing (processors should not be idle for long)
- Optimal mapping is NP-complete (difficult)
Time Sharing
- Single core switches between different processes, giving the illusion of parallelism (concurrency)
Amdahl and Gustafson's laws
- Amdahl = pessimistic: Speedup(p) = 1/((1-P) + P/p) ≤ 1/(1-P), where P is the parallelizable fraction of the program
- Gustafson = optimistic: Speedup(p) = p - a(p-1), where a = non-parallelizable fraction of the parallel workload and p = number of processors (see the worked comparison below)
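A quick worked comparison of the two laws (a minimal sketch; the 90%-parallel figure is an assumed illustration, not from these notes):

def amdahl(P, p):
    # Amdahl: Speedup(p) = 1 / ((1 - P) + P / p); capped at 1 / (1 - P) as p grows
    return 1.0 / ((1.0 - P) + P / p)

def gustafson(a, p):
    # Gustafson: Speedup(p) = p - a * (p - 1); keeps growing as the problem scales with p
    return p - a * (p - 1)

print(amdahl(P=0.9, p=16))     # ~6.4x, bounded above by 1 / (1 - 0.9) = 10x
print(gustafson(a=0.1, p=16))  # 14.5x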
Structural Patterns
Pipe and filter
- Data flows through modular phases of a computation
- Filters = function
- pipe = communication channel
- Multi-core
Map-Reduce
- Same function is applied independently across distributed data
- Map: function applied to data
- Shuffle: Reorganizes data after mapping
- reduction: data reduction, or summary computation
- Many-core
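A minimal single-machine sketch of the three phases (a word count; the input documents and variable names are illustrative assumptions):

from collections import defaultdict
from functools import reduce

docs = ["a b a", "b c"]

# Map: the same function is applied independently to every document
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: reorganize the intermediate pairs by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: summary computation per key
counts = {key: reduce(lambda x, y: x + y, vals) for key, vals in groups.items()}
print(counts)  # {'a': 2, 'b': 2, 'c': 1}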
Agents and Repositories
- A collection of data elements is modified individually by a flexible set of rules
- agent: "intelligent" process that can perform several operations sequentially
- Repository: central data store that the agents access
- Manager: directs the agents
- Ideal for multi-core rather than many-core
Iterative Refinement
- Set of operations applied repeatedly until desired state is reached
- Fractals
- Many-core
Chapter 2
- Python Tools
Numba
- JIT (just-in-time) compiler: compiles a function to machine code the first time it is called at runtime, then reuses the compiled code
- nopython mode avoids the Python interpreter entirely and is much faster than object mode
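A minimal Numba sketch (assumes numba and numpy are installed; the dot-product function is just an illustration):

import numpy as np
from numba import njit  # njit == jit(nopython=True)

@njit
def dot(a, b):
    # Compiled to machine code on the first call; later calls reuse the compiled code.
    s = 0.0
    for i in range(a.shape[0]):
        s += a[i] * b[i]
    return s

x = np.arange(1_000_000, dtype=np.float64)
print(dot(x, x))  # first call pays the compilation cost, later calls are fast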
Multiprocessing vs threads
- Threads are constrained by the GIL (global interpreter lock): only one thread executes Python bytecode at a time, so CPU-bound threads gain no speedup
- Multiprocessing avoids the GIL and is faster for CPU-bound work, but processes do not share memory by default (data is copied between them)
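A minimal sketch of the difference for a CPU-bound function (function and sizes are illustrative):

from multiprocessing import Pool

def cpu_bound(n):
    # Pure-Python loop: threads would serialize on the GIL here; processes do not.
    return sum(i * i for i in range(n))

if __name__ == "__main__":  # required where processes are spawned, e.g. Windows/macOS
    with Pool(processes=4) as pool:
        # Arguments and results are pickled and copied between processes (no shared memory).
        print(pool.map(cpu_bound, [10**6] * 4))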
Chapter 3
SIMD
- Single instruction, multiple data
- Theoretical model in which every instruction is executed synchronously and simultaneously by all processors
- GPU model
PRAM complexity
- Theoretical model
- RAM model, but with unlimited processors and uniform communication
Work Complexity
- Total number of operations executed by a computation, as a function of input size
Step Complexity
- Longest chain of sequential dependencies in the operation
- How deep is it?
Map Operation
f = lambda x: x + 1
list(map(f, [1, 2, 3]))   # map is lazy in Python 3, so wrap it in list()
# Returns: [2, 3, 4]
- Work(n) = O(n) * Work(f-arg)
- Step(n) = O(1) * Step(f-arg) (depth of 1)
Collective communications
- Imply a synchronization point
| Operation | Description | Work Complexity | Step Complexity |
| --- | --- | --- | --- |
| Broadcast | Send the same data to all processors | W(p) = O(p) | S(p) = O(log p) |
| Scatter | Send different chunks of a large piece of data to the other processors | W(p) = O(p) | S(p) = O(log p) |
| Gather | Receive data from all other processors | W(p) = O(p) | S(p) = O(log p) |
- Step complexity is log p in tree communication networks (log p is depth of tree)
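A minimal sketch of these collectives with mpi4py (an assumption; the notes only name MPI itself). Run with something like mpirun -n 4 python collectives.py:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

data = comm.bcast("config" if rank == 0 else None, root=0)               # broadcast: same data to everyone
chunk = comm.scatter(list(range(size)) if rank == 0 else None, root=0)   # scatter: a different chunk to each rank
result = comm.gather(chunk * chunk, root=0)                              # gather: collect one value from every rank
if rank == 0:
    print(data, result)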
Many to many
- Broadcast from every processor
- Stencil (neighbor exchange over an array)
Summing an array
- W(n) = O(n)
- S(n) = O(log n) (log n height)
Recursive Reduction
- Generalized sum operation
- W(n) = O(n)
- S(n) = O(log n)
- Assuming f-arg has constant work and step complexity
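A sequential Python sketch of the tree-shaped reduction; every pairwise combine within one level is independent, which is where the O(log n) step complexity comes from (operator and data are illustrative):

def tree_reduce(xs, f):
    # Each pass halves the list; all f() calls in a pass could run in parallel.
    while len(xs) > 1:
        level = [f(xs[i], xs[i + 1]) for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:            # carry a leftover odd element up to the next level
            level.append(xs[-1])
        xs = level
    return xs[0]

print(tree_reduce([1, 2, 3, 4, 5], lambda a, b: a + b))  # 15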
Scan Operation
- Generalized reduction
- Inclusive scan: all elements up to and including the jth
- Exclusive scan: all elements up to, but not including, the jth (example below)
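For example, with addition as the operator (a minimal illustration):

from itertools import accumulate

xs = [3, 1, 7, 0]
print(list(accumulate(xs)))              # inclusive scan: [3, 4, 11, 11]
print([0] + list(accumulate(xs))[:-1])   # exclusive scan: [0, 3, 4, 11]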
Hillis-Steele
- Slow implementation of prefix scan
- W(n) = O(n log n)
- S(n) = O(log n)
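A sequential sketch of the Hillis-Steele inclusive scan; every element of a pass is independent, so each pass counts as one parallel step (input values are illustrative):

def hillis_steele_scan(xs):
    # log n passes; pass d adds in the element d positions to the left.
    # Up to n additions per pass gives W(n) = O(n log n), S(n) = O(log n).
    out = list(xs)
    d = 1
    while d < len(out):
        out = [out[i] + out[i - d] if i >= d else out[i] for i in range(len(out))]
        d *= 2
    return out

print(hillis_steele_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [3, 4, 11, 11, 15, 16, 22, 25]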
Naive recursion
- Slow implementation of prefix scan
- W(n) = O(n log n)
- S(n) = O(log2 n)
Blelloch
- Fast implementation of prefix scan
- W(n) = O(n)
- S(n) = O(2 log n) (one up-sweep plus one down-sweep)
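A sequential sketch of the Blelloch exclusive scan (up-sweep then down-sweep), assuming the input length is a power of two:

def blelloch_scan(xs):
    # Up-sweep builds a reduction tree in place; down-sweep pushes partial sums back down.
    # Total additions are O(n); the two sweeps give the O(2 log n) step complexity.
    n = len(xs)            # assumed to be a power of two
    out = list(xs)
    d = 1
    while d < n:           # up-sweep (reduce)
        for i in range(2 * d - 1, n, 2 * d):
            out[i] += out[i - d]
        d *= 2
    out[n - 1] = 0         # clear the root
    d = n // 2
    while d >= 1:          # down-sweep
        for i in range(2 * d - 1, n, 2 * d):
            out[i - d], out[i] = out[i], out[i] + out[i - d]
        d //= 2
    return out

print(blelloch_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 3, 4, 11, 11, 15, 16, 22]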
Parallel Quicksort
- W(n) = O(n log n)
- S(n) = O(log n)
Parallel Histogram
- Races on shared bins can be handled with locking/atomics, or avoided by giving each thread a private histogram and combining them with a final operation such as a parallel reduction (sketch below)
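A sketch of the private-histograms-plus-reduction approach, with sequential NumPy standing in for the parallel threads (data size and bin count are illustrative):

import numpy as np

data = np.random.randint(0, 8, size=1000)
chunks = np.array_split(data, 4)                 # pretend each chunk belongs to one thread

# Phase 1: each "thread" fills its own private histogram (no races, no locks)
local_hists = [np.bincount(chunk, minlength=8) for chunk in chunks]

# Phase 2: combine the private histograms with a (parallel) reduction
hist = np.sum(local_hists, axis=0)
print(hist, hist.sum())                          # per-bin counts; total is 1000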
Chapter 4
PRAM fails in many practical applications because it ignores real memory latency and bandwidth
Latency vs bandwidth
- Latency: time needed to complete a task from the time the instruction was issued
- Bandwidth is the rate of task throughput measured by the number of repeated tasks completed per unit time
Moore's law and Little's law
- Moore: observation that the number of transistors doubles every 2 years (bandwidth improves faster than latency)
- Little's law: \(L = \lambda W\)
- \(L\): average number of tasks in the system (concurrency in flight)
- \(\lambda\): effective arrival rate (throughput)
- \(W\): average completion time (latency)
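- Worked illustration (numbers assumed, not from the notes): if a memory request takes \(W = 400\) cycles and we want a throughput of \(\lambda = 1\) request per cycle, then \(L = \lambda W = 400\) requests must be in flight at once, which is why GPUs keep many warps resident to hide latency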
CUDA and GPGPUs
- General-Purpose computing on GPUs
- Blocks of threads
- Hundreds of pipelined cores grouped into compute units called streaming multiprocessors (SMs)
Chapter 5
Data tiling
Shared memory between the threads of a block lets CUDA kernels load "tiles" of data on chip once and reuse them instead of re-reading global memory (see the sketch below)
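A minimal tiled-kernel sketch using Numba's CUDA support (an assumed toolchain; the notes mention CUDA and Numba but not this specific API). Each block stages TILE x TILE tiles of the inputs in shared memory and reuses them:

import numpy as np
from numba import cuda, float32

TILE = 16  # tile edge length == thread-block edge length

@cuda.jit
def matmul_tiled(A, B, C):
    # Each block computes one TILE x TILE tile of C. Staging tiles of A and B in
    # shared memory means each global value is read once per block, not once per thread.
    sA = cuda.shared.array(shape=(TILE, TILE), dtype=float32)
    sB = cuda.shared.array(shape=(TILE, TILE), dtype=float32)
    x, y = cuda.grid(2)                      # this thread's row and column of C
    tx, ty = cuda.threadIdx.x, cuda.threadIdx.y
    acc = 0.0
    for t in range(A.shape[1] // TILE):      # assumes dimensions are multiples of TILE
        sA[tx, ty] = A[x, t * TILE + ty]
        sB[tx, ty] = B[t * TILE + tx, y]
        cuda.syncthreads()                   # wait until the whole tile is loaded
        for k in range(TILE):
            acc += sA[tx, k] * sB[k, ty]
        cuda.syncthreads()                   # wait before the tile is overwritten
    C[x, y] = acc

n = 256
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)
C = np.zeros_like(A)
matmul_tiled[(n // TILE, n // TILE), (TILE, TILE)](A, B, C)
print(np.allclose(C, A @ B, atol=1e-2))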
Chapter 6
Warps and Stalls
- Grid is composed of thread blocks which execute independently
- Each block is composed of many threads
- Threads are divided into groups of 32 (a warp)
- Warp is the unit of execution scheduling
Fast Context Switching
- Registers
- Shared memory
Granularity
- Ratio of a program's computation time to its communication time, measured on a per-warp basis
- Increasing granularity usually improves performance
Memory coalescing
- Coalesced memory access: multiple memory accesses combined into a single transaction
- Accesses need to be aligned and contiguous to minimize the number of transactions (see the sketch below)
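A small sketch of the two access patterns, again in Numba CUDA syntax (an assumption); only the indexing differs between the kernels:

import numpy as np
from numba import cuda

@cuda.jit
def copy_coalesced(src, dst):
    i = cuda.grid(1)
    if i < src.shape[0]:
        dst[i] = src[i]        # neighboring threads touch neighboring addresses,
                               # so a warp's 32 accesses coalesce into few transactions

@cuda.jit
def copy_strided(src, dst, stride):
    i = cuda.grid(1)
    if i < src.shape[0]:
        j = (i * stride) % src.shape[0]
        dst[j] = src[j]        # neighboring threads are `stride` elements apart,
                               # so each access may need its own transaction

n = 1 << 20
a = np.arange(n, dtype=np.float32)
b = np.zeros_like(a)
threads = 256
blocks = (n + threads - 1) // threads
copy_coalesced[blocks, threads](a, b)
copy_strided[blocks, threads](a, b, 32)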
Chapter 7
Sorting
Comparison sort
- Counting (rank/enumeration) sort: each element's rank is the number of elements smaller than it; W(n) = O(n²), S(n) = O(log n)
- Counting sort is impractical because of the quadratic work
- Other sorting
Sorting networks
- constructed from collections of parallel compare-exchange operations
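A tiny sketch of a network built from compare-exchange operations, odd-even transposition sort (an illustrative choice, not a specific network from the notes):

def compare_exchange(xs, i, j):
    # The basic building block: put a fixed pair of positions in order.
    if xs[i] > xs[j]:
        xs[i], xs[j] = xs[j], xs[i]

def odd_even_transposition_sort(xs):
    # n rounds; within a round the compare-exchanges touch disjoint pairs, so they
    # could all run in parallel. The pairs never depend on the data (oblivious).
    n = len(xs)
    for r in range(n):
        for i in range(r % 2, n - 1, 2):
            compare_exchange(xs, i, i + 1)
    return xs

print(odd_even_transposition_sort([5, 1, 4, 2, 3]))  # [1, 2, 3, 4, 5]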
Knuth's 0-1 Sorting Principle
- if a sorting network works correctly for every input sequence of 0s and 1s then it also works correctly on any input taken from any linearly ordered set
Mesh based oblivious sorting
- Shear sort: alternately sorts the rows (in snake order) and the columns of a 2D mesh; the passes resemble bubble sort (sketch below)
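A NumPy sketch of shear sort on a square mesh (a sketch under the usual convention: snake-ordered rows, top-down columns):

import numpy as np

def shear_sort(grid):
    # Alternate row and column phases about log2(n) + 1 times (n = number of rows):
    # rows are sorted in snake order (even rows ascending, odd rows descending),
    # columns top-down. The mesh ends up sorted in snake (boustrophedon) order.
    n = grid.shape[0]

    def snake_rows():
        grid.sort(axis=1)                       # sort every row ascending
        grid[1::2] = grid[1::2, ::-1].copy()    # reverse the odd rows

    for _ in range(int(np.ceil(np.log2(n))) + 1):
        snake_rows()
        grid.sort(axis=0)                       # sort every column top-down
    snake_rows()                                # final row pass
    return grid

a = np.random.permutation(16).reshape(4, 4)
print(shear_sort(a))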