exam_review

Key Terms

  • Speedup = Time sequential (Ts) / Time parallel (Tp)
  • Granularity = ratio of the amount of computation to the amount (cost) of communication
  • Efficiency = Speedup/p = Ts/(p * Tp), where p = number of processors (worked example after this list)
  • Scalability = ability to keep speedup proportional to the number of processors (p)
  • step complexity
  • work complexity
  • map operation
  • max scan operation
  • broadcast/gather/scatter
  • Hillis-Steele scan algorithm
  • Blelloch Algorithm
  • Atomics
  • Kernels
  • Compact Parallel operation
  • Latency
  • Bandwidth
  • Occupancy (in cuda)
  • PRAM
  • CSR (compressed Sparse Row)
  • blocks, threads (in cuda)
  • Tiles (shared vs global memory)
  • Race conditions
  • Work = total number of operations, measured in FLOPs (floating-point operations) or instructions; the corresponding rates are FLOP/s and MIPS (millions of instructions per second)
  • Execution time = Work / Compute speed (operation count divided by the rate at which operations execute)
  • Many Core vs Multi Core
  • Fork/Join Parallelism
  • Kernel Data Parallelism
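
A tiny worked example of the speedup and efficiency formulas above, using made-up timings:

Ts = 100.0   # sequential run time (seconds)
Tp = 30.0    # parallel run time on p processors (seconds)
p = 4

speedup = Ts / Tp            # 100/30 ≈ 3.33
efficiency = speedup / p     # = Ts / (p * Tp) ≈ 0.83, i.e. 83% of ideal
print(speedup, efficiency)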

Chapter 1

Parallel Computing models

Shared memory model

  • Multiple tasks share a single memory space, which is read and written asynchronously
  • Access is coordinated with locks/semaphores (easier to program)
  • Harder to manage data locality

Message Passing Model

  • Each processor has its own memory
  • Tasks can be distributed to interconnected processors
  • Message Passing Interface (MPI)

Multithreaded Control Model

  • Threads
  • Each thread has its own program counter, stack, registers, and identifier
  • POSIX threads (software threads)
  • Hyper-Threading (Intel's hardware threads)

Data Parallel Model

  • Tasks operate on the same data structure (Arrays)
  • Cuda Kernels

Designing Parallel Programs

Task decomposition

  • Domain decomposition (data is decomposed into separate units)
  • Functional Decomposition (task is split into subtasks)

Task Agglomeration

  • Combine smaller tasks with larger ones to increase granularity (better performance)

Task Assignment and Mapping

  • Assign tasks to processors, taking into account processor speed (heterogeneity)
  • Load balancing (processors should not be idle for long)
  • Optimal mapping is NP-complete (difficult)

Time Sharing

  • Single core switches between different processes, giving the illusion of parallelism (concurrency)

Amdahl and Gustafson's laws

  • Amdahl = pessimistic = Speedup(p) = 1/((1-P) + P/p) < 1/(1-P), where P is the parallelizable fraction of the program and p the number of processors
  • Gustafson = optimistic = Speedup(p) = p - a(p-1), where a is the non-parallelizable (serial) fraction of the parallel workload and p the number of processors
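
A small sketch comparing the two laws for a program assumed to be 90% parallelizable (the 0.9 is purely illustrative):

def amdahl_speedup(P, p):
    # P = parallelizable fraction, p = number of processors
    return 1.0 / ((1.0 - P) + P / p)

def gustafson_speedup(a, p):
    # a = non-parallelizable (serial) fraction, p = number of processors
    return p - a * (p - 1)

for p in (2, 8, 64, 1024):
    print(p, amdahl_speedup(0.9, p), gustafson_speedup(0.1, p))
# Amdahl saturates below 1/(1 - 0.9) = 10; Gustafson keeps growing with p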

Structural Patterns

Pipe and filter

  • Data flows through modular phases of a computation
  • Filters = function
  • pipe = communication channel
  • Multi-core

Map-Reduce

  • Same function is applied independently across distributed data
  • Map: function applied to data
  • Shuffle: Reorganizes data after mapping
  • reduction: data reduction, or summary computation
  • Many-core
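
A minimal single-machine sketch of the map/shuffle/reduce phases, here counting words (no distributed framework, just the pattern):

from collections import defaultdict

docs = ["a b a", "b c"]

# Map: apply the same function independently to each piece of data
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: reorganize the mapped pairs by grouping on the key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: summary computation per key
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'a': 2, 'b': 2, 'c': 1}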

Agents and Repositories

  • A central collection of data elements is modified individually by a flexible set of rules
  • Agent: "intelligent" process that can perform several operations sequentially
  • Repository: central data store that the agents read from and write to
  • Manager: directs the agents
  • Ideal for multi-core rather than many-core

Iterative Refinement

  • Set of operations applied repeatedly until desired state is reached
  • Fractals
  • Many-core

Chapter 2

  • Python Tools

Numba

  • JIT (just-in-time) compiler: a function is compiled to machine code the first time it is called, then the compiled version is reused
  • nopython mode (@njit) bypasses the Python interpreter entirely and is much faster
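
A minimal Numba sketch (assumes numba and numpy are installed): the first call triggers compilation, later calls reuse the cached machine code.

import numpy as np
from numba import njit   # njit = jit(nopython=True)

@njit
def sum_of_squares(a):
    total = 0.0
    for x in a:          # this loop runs as compiled code, not interpreted Python
        total += x * x
    return total

a = np.arange(1_000_000, dtype=np.float64)
sum_of_squares(a)        # first call: compile, then run
sum_of_squares(a)        # later calls: run the compiled version directly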

Multiprocessing vs threads

  • Threads are constrained by the GIL (global interpreter lock), so CPU-bound threads gain little or no speedup
  • Multiprocessing sidesteps the GIL (each process gets its own interpreter), but processes do not share memory by default, so data must be passed explicitly
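
A minimal sketch of the multiprocessing side: the same CPU-bound function is mapped over inputs by a pool of worker processes, which sidesteps the GIL (pool size and workload are illustrative):

from multiprocessing import Pool

def busy(n):
    # CPU-bound work: threads would serialize on the GIL here,
    # but separate processes run it truly in parallel
    return sum(i * i for i in range(n))

if __name__ == "__main__":          # required on platforms that spawn worker processes
    with Pool(processes=4) as pool:
        results = pool.map(busy, [10**6] * 4)
    print(results)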

Chapter 3

SIMD

  • Single instruction, multiple data
  • Theoretical model, every instruction is executed synchronously and simultaneously by all processors
  • GPU model

PRAM complexity

  • Theoretical model
  • RAM model, but with unlimited processors and uniform communication

Work Complexity

  • Total number of operations executed by a computation, as a function of input size

Step Complexity

  • Longest chain of sequential dependencies in the computation
  • How deep is it?

Map Operation

f = lambda x: x + 1
list(map(f, [1, 2, 3]))   # map is lazy in Python 3, so wrap it in list() to see the result
# Returns: [2, 3, 4]
  • Work(n) = O(n) * Work(f-arg)
  • Step(n) = O(1) * Step(f-arg) (depth of 1)

Collection communications

  • Imply synchronization point
  • Broadcast: send the same data to all processors. W(p) = O(p), S(p) = O(log p)
  • Scatter: send different chunks of a large piece of data to the other processors. W(p) = O(p), S(p) = O(log p)
  • Gather: receive data from all other processors. W(p) = O(p), S(p) = O(log p)
  • Step complexity is log p in tree communication networks (log p is depth of tree)
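
A sketch of the three collectives using mpi4py (MPI comes from Chapter 1; choosing mpi4py here is my assumption). Run with something like mpirun -n 4 python script.py:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Broadcast: the root sends the same data to every processor
data = comm.bcast("shared config" if rank == 0 else None, root=0)

# Scatter: the root splits a list into one chunk per processor
chunks = list(range(size)) if rank == 0 else None
my_chunk = comm.scatter(chunks, root=0)

# Gather: every processor sends its result back to the root
results = comm.gather(my_chunk * 2, root=0)
if rank == 0:
    print(data, results)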

Many to many

  • Broadcast from every processor
  • Stencil (use array)

Summing an array

  • W(n) = O(n)
  • S(n) = O(log n) (log n height)

Recursive Reduction

  • Generalized sum operation
  • W(n) = O(n)
  • S(n) = O(log n)
  • Assuming f-arg has constant work and step complexity
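
A sequential sketch of the reduction tree: each pass of the while loop corresponds to one parallel step, so there are O(log n) steps over O(n) total combines (assumes the input length is a power of 2):

def tree_reduce(a, f=lambda x, y: x + y):
    a = list(a)
    while len(a) > 1:
        # one "parallel step": combine disjoint pairs of neighbours
        a = [f(a[i], a[i + 1]) for i in range(0, len(a), 2)]
    return a[0]

print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]))   # 36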

Scan Operation

  • Generalized reduction
  • Inclusive scan: the jth output combines all elements up to and including the jth
  • Exclusive scan: the jth output combines all elements before the jth (the jth itself is excluded)
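
A quick illustration of the two variants for prefix sums (itertools.accumulate gives the inclusive scan directly):

from itertools import accumulate

a = [3, 1, 7, 0, 4]

inclusive = list(accumulate(a))    # [3, 4, 11, 11, 15]
exclusive = [0] + inclusive[:-1]   # [0, 3, 4, 11, 11]  (identity first, jth element excluded)
print(inclusive, exclusive)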

Hillis-Steele

  • Slow implementation of prefix-scan
  • W(n) = O(n log n)
  • S(n) = O(log n)
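
A sequential simulation of Hillis-Steele (inclusive scan): each iteration of the while loop is one parallel step, and every step touches up to n elements, which is where the O(n log n) work comes from:

def hillis_steele_scan(a):
    a = list(a)
    d = 1
    while d < len(a):
        # one parallel step: every element adds in its neighbour d positions back
        a = [a[i] + a[i - d] if i >= d else a[i] for i in range(len(a))]
        d *= 2
    return a

print(hillis_steele_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [3, 4, 11, 11, 15, 16, 22, 25]  (inclusive prefix sums)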

Naive recursion

  • Slow implementation of prefix-scan
  • W(n) = O(n log n)
  • S(n) = O(log2 n)

Blelloch

  • Fast (work-efficient) implementation of prefix scan
  • W(n) = O(n)
  • S(n) = O(2 log n) = O(log n) (one up-sweep plus one down-sweep)
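
A sequential simulation of the Blelloch exclusive scan: an up-sweep (reduction tree) followed by a down-sweep, each taking O(log n) steps, with O(n) total work (assumes the input length is a power of 2):

def blelloch_scan(a):
    a = list(a)
    n = len(a)

    # Up-sweep (reduce): build partial sums in place
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):      # independent blocks -> one parallel step
            a[i + 2 * d - 1] += a[i + d - 1]
        d *= 2

    # Down-sweep: push prefixes back down the tree
    a[n - 1] = 0                          # identity element for +
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):      # independent blocks -> one parallel step
            left = a[i + d - 1]
            a[i + d - 1] = a[i + 2 * d - 1]
            a[i + 2 * d - 1] += left
        d //= 2
    return a

print(blelloch_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]  (exclusive prefix sums)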

Parallel Quicksort

  • W(n) = O(n log n)
  • S(n) = O(log n)

Parallel Histogram

  • Can be solved with locking/atomic updates on shared bins, or with private per-thread histograms combined by a final operation such as a parallel reduction (sketch below)
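
A sketch of the lock-free variant: each "thread" builds a private histogram over its own chunk, and the private histograms are then combined with a reduction (the chunk count and bin count are illustrative choices):

import numpy as np

def parallel_histogram(data, nbins, nchunks=4):
    chunks = np.array_split(data, nchunks)                 # one chunk per "thread"
    # Phase 1: each chunk gets a private histogram (no shared bins, no locks)
    partial = [np.bincount(c, minlength=nbins) for c in chunks]
    # Phase 2: combine the private histograms with a sum reduction
    return np.sum(partial, axis=0)

data = np.random.randint(0, 10, size=1000)
print(parallel_histogram(data, nbins=10))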

Chapter 4

The PRAM model fails in many practical applications because it ignores real memory latency and bandwidth limits.

Latency vs bandwidth

  • Latency: time needed to complete a task from the time the instruction was issued
  • Bandwidth is the rate of task throughput measured by the number of repeated tasks completed per unit time

Moore's law and Little's law

  • Moore: observation that the number of transistors doubles roughly every 2 years (bandwidth has historically improved much faster than latency)
  • Little's law: \(L = \lambda W\)
  • \(L\): average number of tasks in flight (required concurrency)
  • \(\lambda\): effective arrival rate (throughput)
  • \(W\): average completion time (latency)
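  • Worked example (made-up numbers): if a memory system accepts \(\lambda = 4\) requests per cycle and each request takes \(W = 100\) cycles to complete, then \(L = \lambda W = 400\) requests must be in flight to keep it busy, which is why GPUs keep many warps resident to hide latency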

CUDA and GPGPUs

  • GPGPU = general-purpose computing on GPUs
  • Blocks of threads
  • Hundreds of pipelined cores grouped into compute units called streaming multiprocessors (SMs)

Chapter 5

Data tiling

Shared memory between the threads of a block lets CUDA kernels work on "tiles" of data: each tile is loaded from global memory once and then reused many times from fast shared memory (sketch below).
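
A sketch of tiling with Numba's CUDA support (my choice of tool; the course may use CUDA C instead). It assumes square float32 matrices whose side is an exact multiple of the tile width, and it needs an NVIDIA GPU to run:

import numpy as np
from numba import cuda, float32

TPB = 16   # tile width = threads per block per dimension (compile-time constant)

@cuda.jit
def tiled_matmul(A, B, C):
    # Shared-memory tiles, visible to every thread in the block
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)                   # this thread's global row and column
    tx, ty = cuda.threadIdx.x, cuda.threadIdx.y

    acc = 0.0
    for t in range(A.shape[1] // TPB):    # walk over the tiles along the shared dimension
        # Each thread loads one element of each tile from global memory
        sA[tx, ty] = A[x, t * TPB + ty]
        sB[tx, ty] = B[t * TPB + tx, y]
        cuda.syncthreads()                # wait until the whole tile is loaded
        for k in range(TPB):              # reuse the tile TPB times from fast shared memory
            acc += sA[tx, k] * sB[k, ty]
        cuda.syncthreads()                # wait before the tile is overwritten
    C[x, y] = acc

N = 256                                   # must be a multiple of TPB in this sketch
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
C = np.zeros((N, N), dtype=np.float32)
tiled_matmul[(N // TPB, N // TPB), (TPB, TPB)](A, B, C)
print(np.allclose(C, A @ B, rtol=1e-3))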

Chapter 6

Warps and Stalls

  • A grid is composed of thread blocks, which execute independently
  • Each block is composed of many threads
  • Threads within a block are divided into groups of 32 called warps
  • The warp is the unit of execution scheduling

Fast Context Switching

  • Registers and shared memory for every resident warp stay allocated on the SM
  • Switching between warps therefore needs no state save/restore, making context switches essentially free

Granularity

  • Ratio of a program's computation to its communication time, measured on a per-warp basis
  • Increasing granularity usually improves performance

Memory coalescing

  • Coalesced memory access: multiple per-thread memory accesses are combined into a single transaction
  • Accesses need to be aligned and contiguous to minimize the number of transactions

Chapter 7

Sorting

  • Comparison sort
    • Counting (rank) sort: W(n) = O(n²), S(n) = O(log n)
    • Counting sort is impractical because of its quadratic work
  • Other sorting

Sorting networks

  • constructed from collections of parallel compare-exchange operations
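
One simple example of such a network is odd-even transposition sort (my example, not necessarily the one covered in the course): within a round all compare-exchanges touch disjoint pairs, so each round is one parallel step, and n rounds suffice:

def compare_exchange(a, i, j):
    # put the smaller value at index i, the larger at index j
    if a[i] > a[j]:
        a[i], a[j] = a[j], a[i]

def odd_even_transposition_sort(a):
    a = list(a)
    n = len(a)
    for rnd in range(n):
        start = rnd % 2                   # alternate between even and odd pairs
        for i in range(start, n - 1, 2):  # disjoint pairs -> one parallel step
            compare_exchange(a, i, i + 1)
    return a

print(odd_even_transposition_sort([5, 3, 8, 1, 9, 2]))   # [1, 2, 3, 5, 8, 9]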

Knuth's 0-1 Sorting Principle

  • If a sorting network correctly sorts every input sequence of 0s and 1s, then it also correctly sorts any input taken from any linearly ordered set

Mesh based oblivious sorting

  • Shear sort: alternately sort the rows (in snake order) and the columns of a 2D mesh; the row/column passes are reminiscent of bubble sort (sketch below)
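
A minimal sketch of shear sort on an n x n mesh stored as a NumPy array: alternate snake-order row sorts with column sorts for about log2(n) + 1 phases, ending with a row phase; the result is sorted in snake (boustrophedon) order:

import math
import numpy as np

def shear_sort(grid):
    grid = np.array(grid)
    n = grid.shape[0]                     # assumes an n x n mesh
    phases = int(math.ceil(math.log2(n))) + 1
    for _ in range(phases):
        # Row phase: even rows ascending, odd rows descending (snake order);
        # every row could be sorted in parallel
        for r in range(n):
            row = np.sort(grid[r])
            grid[r] = row if r % 2 == 0 else row[::-1]
        # Column phase: every column sorted ascending, also in parallel
        grid.sort(axis=0)
    # Final row phase restores snake order after the last column sort
    for r in range(n):
        row = np.sort(grid[r])
        grid[r] = row if r % 2 == 0 else row[::-1]
    return grid

print(shear_sort([[9, 8, 7], [6, 5, 4], [3, 2, 1]]))
# [[1 2 3]
#  [6 5 4]
#  [7 8 9]]   -> sorted in snake order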