1. Which of the following are ways that hardware designers make computers run faster?
[x] faster clocks
[ ] longer clock periods
[x] more work per clock cycle
[ ] a larger hard disk
[x] adding more processors
[ ] reducing the amount of memory
2. What are the limits of computer architecture that motivate parallel processing?
Processors can only do one thing at a time, but users need multiple things to happen simultaneously, e.g., multiple things moving on the screen at the same time. Time sharing, while useful, does not scale well with the number of processes, so parallel processing is needed. Additionally, it is easier (cheaper) to add more processors than to pack more transistors into a smaller space.
3. For a class of GPUs, what is an appropriate measure for comparing the power consumption of different devices?
Flops per watt
4. What techniques are computer designers using today to build more power-efficient devices? Circle all that are true.
[ ] having fewer, but more complex, processors
[x] having more, but less complex, processors
[ ] maximizing the speed of the processor clock
[ ] increasing the complexity of the control hardware
5. Name four common patterns of parallel computation.
For each pattern, discuss whether the many-core or multi-core architecture model is more appropriate.
- Agents and Repositories: Multi-core, as each agent performs a complex operation possibly unique to itself
- Iterative Refinement: Many-core, as the operations are identical
- Pipe and Filter: Multi-core, as each filter has a different operation to perform
- MapReduce: Many-core, as the operations are identical
6. What is concurrency control, and what is the difference between pessimistic and optimistic concurrency control?
- Concurrency control is a method for preventing race conditions, ensuring correctness
- Pessimistic: use locks that block operations at specific times to prevent violations of the rules
- Optimistic: abort transactions that would violate the rules, then restart them (see the CUDA sketch below)
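A minimal CUDA sketch of the optimistic style (a hypothetical helper; counter is assumed to be a shared unsigned int in global memory): each thread reads the current value, computes its update, and retries with atomicCAS if another thread changed the value in the meantime. A pessimistic scheme would instead acquire a lock before performing the update.

// Optimistic update of a shared counter: retry on conflict instead of locking.
// (Hypothetical example; 'counter' is assumed to live in global memory.)
__device__ void optimisticIncrement(unsigned int *counter) {
    unsigned int old = *counter, assumed;
    do {
        assumed = old;
        // atomicCAS writes (assumed + 1) only if *counter still equals 'assumed';
        // it returns the value it found, so a mismatch means another thread won and we retry.
        old = atomicCAS(counter, assumed, assumed + 1);
    } while (old != assumed);
}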
7. How does Fork/Join parallelism differ from Kernel data-parallelism?
- Fork/Join: independent child threads are forked (and later joined) and may share memory
- Kernel data-parallelism: a kernel is made up of a large array of threads that all perform the same simple operation on different data
8. Circle all the true statements.
[x] A thread block contains many threads.
[x] An SM might run more than one thread block.
[ ] A thread block may run on more than one SM.
[x] All the threads in a thread block might cooperate to solve a subproblem.
[ ] All the threads that run on a given SM may cooperate to solve a subproblem.
9. If we have a single kernel that is launched on many thread blocks, including block x and block y, the programmer can do which of the following: Circle all the true statements.
[ ] specify that block x will run at the same time as block y
[ ] specify that block x will run after block y
[ ] specify that block x will run on the same SM as block y
[x] none of the above
10. Circle all the statements that are true.
[x] All threads from a block can access the same variable in that block's shared memory.
[ ] Threads from two different blocks can access the same variable in global memory.
[ ] Threads from different blocks have their own copy of local variables in local memory.
[ ] Threads from the same block have their own copy of local variables in local memory.
11. Does a __syncthreads() call apply to threads within a block or threads within a grid?
Block: it synchronizes only the threads of the block that calls it.
12. What is a parallel map operation? Circle all problems that can be solved using map.
[ ] sort an array
[x] add one to each element of an array
[ ] summing all elements in an array
[ ] apply a predicate to each element in an array
[x] move data in parallel based on an array of scatter addresses
13. Circle which operators are both binary and associative and therefore can be used in a reduction or scan.
[x] multiply
[x] minimum
[ ] factorial
[x] exclusive or
[x] bitwise and
[ ] exponentiation
[ ] integer division
14. Using 1D global indexing, how would you specify the parallel execution that maps the ith thread to the task of reading and then squaring the ith item of a large array X in global memory?
int i = blockIdx.x * blockDim.x + threadIdx.x;
int x = X[i];
X[i] = x * x;
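A fuller sketch of the same mapping as a complete kernel (assuming int elements and a hypothetical element count n passed in so the last block can be bounds-checked):

__global__ void squareKernel(int *X, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // guard threads past the end of X
        int x = X[i];
        X[i] = x * x;
    }
}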
15. Circle all statements that are true when running reduction code on an input of size n.
[ ] it takes at least n operations
[ ] its work complexity is order of n
[ ] its work complexity is order of n*n
[ ] its step complexity is order of 1, independent of the size of the input
16. Circle the correct answer. The number of steps in a reduction of n elements as a function of n is:
[ ] square root of n
[ ] log base 2 of n
[ ] n
[ ] n times log base 2 of n
17. Circle all that are true
[ ] map operations have arguments that are functions with a single argument
[ ] map operations can be applied to arrays of any number of dimensions
[ ] map operations are generally very efficient on GPUs
[ ] a compact operation requires a map operation to be performed
18. What is the impact of granularity on performance when considering the latency of global memory communication?
Calculate the granularity of the dot product of two vectors that reside in global memory.
19. What is the output of a max scan operation on the list of unsigned ints [5 4 7 3 1 8 2 6]?
Provide a solution to both inclusive and exclusive scans.
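A worked sketch, assuming 0 is the identity for max over unsigned ints: the inclusive max scan keeps a running maximum, giving [5 5 7 7 7 8 8 8]; the exclusive max scan starts from the identity and shifts that result right by one, giving [0 5 5 7 7 7 8 8].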
20. Compute the max (inclusive) scan of this input sequence 2 1 4 3 showing all work when using
a) the Hillis-Steele algorithm b) the Blelloch algorithm
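One way to show the work (a sketch; 0 is assumed as the identity for max, and since the Blelloch scan is naturally exclusive its output is converted to an inclusive result at the end):
a) Hillis-Steele (inclusive): start [2 1 4 3]; after the offset-1 step: [2 2 4 4]; after the offset-2 step: [2 2 4 4].
b) Blelloch: the upsweep (reduce) phase gives [2 2 4 4]; setting the last element to the identity gives [2 2 4 0]; the downsweep gives the exclusive scan [0 2 2 4]; taking the element-wise max with the input converts this to the inclusive scan [2 2 4 4].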
21. Explain which scan algorithm (Hillis-Steele or Blelloch) is better suited, and why.
You are scanning a 512-element vector on a GPU that has 512 processors.
22. Explain which scan algorithm (Hillis-Steele or Blelloch) is better suited, and why.
You are scanning a 1-million-element input vector on a machine with 512 processors.
23. Suppose you are computing a histogram with 10 bins. Discuss an efficient GPU solution.
Suppose you are computing a histogram with 10000 bins. Discuss an efficient GPU solution. Indicate for each whether or not you are using atomics to manage access to bins of the histograms.
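For the small-bin case, one common pattern (sketched below with hypothetical names, assuming 10 bins, non-negative inputs, and blockDim.x >= NUM_BINS) is to privatize a per-block copy of the histogram in shared memory, accumulate it with cheap shared-memory atomics, and then combine the per-block copies into the global histogram with one global atomic per bin per block. With 10,000 bins a private copy may no longer fit comfortably in shared memory, so a plain global-atomic histogram, or per-block partial histograms in global memory followed by a reduction, is more typical; atomics (shared or global) are used in both cases to manage access to the bins.

#define NUM_BINS 10   // assumed small enough to fit in shared memory

__global__ void privatizedHisto(unsigned int *d_bins, const int *d_in, int n) {
    __shared__ unsigned int s_bins[NUM_BINS];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // zero the per-block (private) histogram; assumes blockDim.x >= NUM_BINS
    if (tid < NUM_BINS) s_bins[tid] = 0;
    __syncthreads();

    // accumulate into shared memory with cheap shared-memory atomics
    if (i < n) atomicAdd(&s_bins[d_in[i] % NUM_BINS], 1u);
    __syncthreads();

    // one global atomic per bin per block to combine the private copies
    if (tid < NUM_BINS) atomicAdd(&d_bins[tid], s_bins[tid]);
}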
24. True or false: In a scatter operation a __syncthreads() command is needed to overcome write conflicts.
25. Is the compact parallel operation more useful in scenarios where we delete a (small number) or a (large number) of elements from an array?
26. Is the compact parallel operation more useful in scenarios where we need to run a (cheap) or an (expensive) function on the filtered elements?
27. Suppose we are running compact operations on a list of numbers ranging from 1 to 1 million. Compact operation A filters for elements that are divisible by 17, and thus keeps only a very few of the input items. Compact operation B filters for elements not divisible by 31, and thus keeps most of the input items. For each of the three phases of compact (the predicate map, scan, and scatter phases), will A run faster, will B run faster, or will they take the same amount of time?
a. Predicate map b. Scan c. Scatter
28. What is the difference between latency and bandwidth? Which is more easily enhanced by advanced architectures?
Say you are given the task of transporting many people 200 miles. You have two modes of transport: a single sports car traveling 100 mph with 2 passengers, or a single van traveling 50 mph with 10 passengers. For each case, what is the bandwidth in people delivered per hour? For each case, what is the latency of a person's delivery (in hours)? For each case, what is the occupancy of the transport system?
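A worked sketch for the bandwidth and latency parts, assuming each vehicle makes a single one-way trip: the sports car takes 200 / 100 = 2 hours, so its latency is 2 hours and its bandwidth is 2 people / 2 hours = 1 person per hour; the van takes 200 / 50 = 4 hours, so its latency is 4 hours and its bandwidth is 10 people / 4 hours = 2.5 people per hour (higher bandwidth despite higher latency).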
29. How is occupancy defined in CUDA, and why does it have an impact on performance?
What is thread divergence in CUDA, and why does it have an impact on performance?
30. Show the contents of the CSR (Compressed Sparse Row) format for the following 5x5 matrix:
0 2 3 0 0
1 0 0 5 0
0 0 4 0 0
0 0 0 2 0
31. Consider the sparse-matrix dense-vector product problem, and the two different parallel methods: tpr (thread per row) and tpe (thread per element).
a. Which approach will launch more threads? b. Which approach will require more communication between threads? c. Which approach will do more work per thread? d. Which approach is more load balanced?
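As a point of reference for parts a through d, a minimal sketch of the thread-per-row (tpr) approach over a CSR matrix (hypothetical parameter names; the thread-per-element approach would instead launch one thread per nonzero and combine the partial products, e.g. with a segmented reduction):

// One thread per row: each thread walks its row's nonzeros serially.
// Rows with many nonzeros do more work, which is why tpr can be poorly load balanced.
__global__ void spmvThreadPerRow(int numRows, const int *rowPtr, const int *colIdx,
                                 const float *val, const float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows) {
        float sum = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += val[j] * x[colIdx[j]];
        y[row] = sum;
    }
}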
32. What does it mean for a sorting algorithm to be oblivious?
33. State the 0/1 sorting lemma for oblivious sorting algorithms.
34. Provide the logic of BitonicSort – pseudocode is sufficient here.
35. What is the step and the work complexity of BitonicSort?
36. In the BitonicSort figure presented in lecture identify each of the 'bitonic-comparison-and-swap' modules.
37. What is the work and step complexity of countingSort?
38. What is the expected work when hashing using chaining, when the hash maps to a chain of length k?
39. True or false: Bloom filters are a data structure that allows fast set membership operations, but with low probability of false negatives.
40. List the following in the order of their work complexity from least to most.
a. parallel compact b. parallel scan c. sieveEratosthenes d. dense n-body e. bitonicSort f. sequential mergeSort
41. List the following in the order of their step complexity from least to most.
a. parallel compact b. parallel scan c. sieveEratosthenes d. dense n-body e. bitonicSort f. sequential mergeSort
42. Write a CUDA kernel function that will effectively parallelize the following sequential function.
void serial(int n, float a, float b, float *x, float *y) {
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + b * y[i];
    }
}
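One possible kernel (a sketch; the caller is assumed to launch at least n threads, e.g. block size 256 with ceil(n/256) blocks, on device copies of x and y):

__global__ void parallel(int n, float a, float b, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                       // extra threads in the last block do nothing
        y[i] = a * x[i] + b * y[i];
    }
}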
43. What does the following kernel code do? Does it contain a race condition problem? If so, give a means to overcome it?
__global__ void naiveHisto(int *dbins, const int *din, const int BINCOUNT) {
    int myId = threadIdx.x + blockDim.x * blockIdx.x;
    int myItem = din[myId];
    int myBin = myItem % BINCOUNT;
    dbins[myBin]++;
}
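One reading of this kernel (a sketch of an answer): it computes a histogram, with each thread mapping its input element into one of BINCOUNT bins; the dbins[myBin]++ read-modify-write is a race condition whenever two threads hit the same bin (the kernel also assumes exactly one thread per input element, since there is no bounds check). The race can be overcome by making the update atomic:

// inside naiveHisto, replace the racy increment with an atomic read-modify-write
atomicAdd(&dbins[myBin], 1);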
44. Complete the CUDA kernel function that computes, per-block, the sum of a block-sized portion of the input using a block-wide reduction.
You should assume 1-dimensional thread and block indexing.
__global__ void blocksum(const float *input, float *perblockresults, const size_t n) {
    extern __shared__ float sdata[];
    // TODO: load input into shared memory
    // TODO: use contiguous range pattern for reduction

    // thread 0 writes the final result
    if (threadIdx.x == 0) {
        perblockresults[blockIdx.x] = sdata[0];
    }
}
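One way to fill in the two TODOs (a sketch, assuming blockDim.x is a power of two and the kernel is launched with blockDim.x * sizeof(float) bytes of dynamic shared memory):

    int myId = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    // load this block's slice of the input into shared memory (pad with 0 past the end)
    sdata[tid] = ((size_t)myId < n) ? input[myId] : 0.0f;
    __syncthreads();

    // contiguous-range reduction: halve the number of active threads each step
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }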