Algorithm Analysis -- Week 13
Introduction
This week we will cover parallel algorithms.
Note that the links from this page are to handouts that will be distributed the night of class. Only print out those handouts if you could not attend class.
Main topics this week:
Web Research Report #3 Due
Parallel Algorithms
Parallel Architectures
PRAM Model
Parallel Algorithms in General
Finding Largest Element
Parallel Algorithm Exercise
Next Week
Your web research report #3 is due today for full credit.
Up to this point in your computer science classes, you have worked entirely with sequential programming. Your program starts at the beginning, processes one statement after another, and eventually reaches the end. All the computers you work with are sequential computers. They have only one CPU and can physically execute only one machine instruction at a time. Sequential computers play tricks to make it seem as if they are doing multiple things at once, but they really aren't.
The sequential computers you use are called Single Instruction stream, Single Data stream computers (SISD). This just means they can only execute a single instruction at a time, and can only modify a single piece of data at a time. Modern computers process their single instruction and single data extremely quickly, making it seem as if they are doing more than one thing at a time.
Many problems could be solved more quickly if we could run more than one instruction at a time. These are generally problems with a large amount of data to process or a large number of calculations to perform. Some examples are weather prediction, biological simulations, and astronomical simulations.
Your book has a good analogy for how parallel algorithms can perform work faster. Say you need to build a fence in your backyard, and to do this you need to dig ten holes. If you dig ten holes yourself, it will take longer than if you have ten people each dig one hole. Doing work in parallel requires more resources, but finishes faster.
When we talk about parallel algorithms, we also have to talk about parallel computers. Parallel computers are machines with multiple CPUs, so they can do more than one thing at a time.
Home computers are SISD machines. There are other types of machines. You can have SIMD (single instruction, multiple data) machines, where multiple CPUs are used but each executes the same instruction, just operating on different data. If you have a large amount of data to process, a SIMD machine can save you time.
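To make the SIMD idea concrete, here is a rough sketch in Java (my own illustration, not from the book), with threads standing in for the processors: every "processor" performs the same operation, each on its own piece of the data. Real SIMD hardware would do this in lockstep under a single control unit.

public class SimdStyleDemo {
    public static void main(String[] args) throws InterruptedException {
        double[] data = {1.0, 2.0, 3.0, 4.0};
        Thread[] lanes = new Thread[data.length];
        for (int p = 0; p < data.length; p++) {
            final int i = p;
            // same instruction (multiply by 2), different data element for each "processor"
            lanes[p] = new Thread(() -> data[i] = data[i] * 2.0);
        }
        for (Thread t : lanes) t.start();
        for (Thread t : lanes) t.join();
        System.out.println(java.util.Arrays.toString(data));  // prints [2.0, 4.0, 6.0, 8.0]
    }
}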
The more flexible sort of parallel computer is an MIMD machine (multiple instruction, multiple data). This means that each CPU in the machine can execute a different instruction on different data. So each CPU can be doing anything you tell it to. In an MIMD machine, CPUs may need to communicate with each other (for example, one CPU needs the result of another CPU's calculations). CPUs communicate by modifying shared memory.
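Here is a minimal Java sketch of shared-memory communication on an MIMD-style machine (again my own illustration; the class and variable names are made up). The two threads run different instruction streams, and one learns the other's result only by reading a shared variable.

import java.util.concurrent.atomic.AtomicInteger;

public class SharedMemoryDemo {
    // "shared memory" visible to both threads
    static AtomicInteger sharedResult = new AtomicInteger(0);

    public static void main(String[] args) throws InterruptedException {
        Thread producer = new Thread(() -> {
            // this "CPU" computes a value and writes it to shared memory
            sharedResult.set(6 * 7);
        });
        Thread consumer = new Thread(() -> {
            // this "CPU" runs a different instruction stream: it waits for the result
            while (sharedResult.get() == 0) { /* spin until the producer writes */ }
            System.out.println("Consumer read " + sharedResult.get() + " from shared memory");
        });
        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}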
There is also something called massively parallel processing (MPP). In this architecture, the CPUs communicate by passing messages over a network. This is what is used by programs like the SETI@home project.
Let's talk in more detail about the differences between the various parallel machines. We'll start with the SISD machine, so we understand the sequential machines. In an SISD machine, there is one CPU. That CPU has access to memory (which holds the data to be processed).
Page 406 in your book has a diagram for an SISD machine.
An SIMD machine has multiple CPUs, but each must execute the same instruction. This requires a control unit that tells each processor what instruction to execute next. Page 409, figure 10.3a is an example of the architecture for an SIMD machine.
An MIMD machine has multiple CPUs, and each can execute a different instruction, so each CPU needs its own control unit. Page 409, figure 10.3b is an example of the architecture for an MIMD machine.
Another issue with parallel machines is that the CPUs must communicate with each other. There are two primary ways this happens.
One is by the use of shared memory. In a shared memory architecture (also known as shared-address-space architecture), each CPU has access to the same memory that the other CPUs access. So one CPU can change memory and another CPU can see that memory was changed. Typically, each CPU also has a small amount of private memory for those times when it does not need to communicate with other CPUs.
There are two types of shared memory architectures your book mentions. One is Uniform Memory Access (UMA); this just means that every CPU takes the same amount of time to access shared memory. The other is Nonuniform Memory Access (NUMA); this means that some shared memory can be accessed faster by some CPUs than by others.
The other way of allowing CPUs to communicate is by message passing on a network. In this architecture, each CPU's memory is private. The only way for one CPU to communicate with another CPU is to send a message over a network. This network might be internal to the parallel machine, or it might be the Internet. There are different ways of building a network to link parallel CPUs. Your book has a section titled "Interconnection Networks" that discusses the different ways. The basic issue here is to minimize the amount of time it takes to pass the message from one CPU to another.
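As a rough sketch of the message-passing style (my own illustration, using a Java BlockingQueue to stand in for the network), notice that each thread's data stays private and the only way information moves is as a message.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class MessagePassingDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> network = new ArrayBlockingQueue<>(1);

        Thread sender = new Thread(() -> {
            int privateData = 42;              // private memory of this "CPU"
            try {
                network.put(privateData);      // send a message over the "network"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread receiver = new Thread(() -> {
            try {
                int message = network.take();  // block until a message arrives
                System.out.println("Received " + message);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sender.start();
        receiver.start();
        sender.join();
        receiver.join();
    }
}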
As you can tell, simply saying that you have a "parallel machine" isn't enough to allow you to develop an algorithm. An algorithm for an SIMD machine will be quite different from an algorithm for an MIMD machine. Some types of problems will do better on an SIMD machine than an MIMD machine, while other problems will be just the opposite.
To allow investigation of parallel algorithms without actually needing a parallel machine, mathematicians developed a conceptual parallel machine for which they could write algorithms. This lets them explore parallel algorithms and determine whether an algorithm actually solves a problem and whether it looks like it will do so more efficiently than a sequential algorithm. Only when they think an algorithm is worth implementing would they use a real parallel machine.
The conceptual model they came up with is called the PRAM Model. PRAM stands for "parallel random access machine". In terms of our parallel architectures, the PRAM model represents an MIMD machine that uses shared memory with uniform memory access.
A PRAM machine consists of P processors. Each processor may execute different instructions, and may communicate with other processors using shared memory.
A common problem with shared memory machines is deciding what to allow when two processors try to access the same memory location at the same time (we say they are accessing memory concurrently). There are four varieties of the PRAM model, depending on how you want concurrent memory access to be handled. (A small code sketch after the list below shows why the write rules in particular matter.)
Exclusive-read, exclusive-write (EREW)
In this version, only one processor may access a location in memory at a time. No concurrent access is allowed. If multiple processors try to access memory concurrently, the processors will have to take turns.
Exclusive-read, concurrent-write (ERCW)
In this version, only one processor may read a location in memory at a time, but multiple processors may write to that location concurrently.
Concurrent-read, exclusive-write (CREW)
In this version, more than one processor may read a location in memory at the same time, but only one may write to it.
Concurrent-read, concurrent-write (CRCW)
In this version, multiple processors may both read and write concurrently to the same location in memory.
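To see why the write rules in particular matter, here is a rough Java illustration (mine, not the book's): several threads all write their own id to the same shared location. Without a rule such as the ones the CRCW variants define, which value survives is simply up to the scheduler.

public class ConcurrentWriteDemo {
    static volatile int sharedCell = 0;   // one shared memory location

    public static void main(String[] args) throws InterruptedException {
        Thread[] processors = new Thread[4];
        for (int p = 0; p < processors.length; p++) {
            final int id = p + 1;
            processors[p] = new Thread(() -> sharedCell = id);  // concurrent writes to the same cell
        }
        for (Thread t : processors) t.start();
        for (Thread t : processors) t.join();
        // A CRCW PRAM needs a stated rule (for example, an arbitrary or a priority winner)
        // for deciding this; here the result just depends on thread scheduling.
        System.out.println("sharedCell ended up as " + sharedCell);
    }
}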
Parallel Algorithms in General
In general, when you write a parallel algorithm, you write only one algorithm that is then executed by all the processors. Each processor may, in an MIMD machine, be executing a different part of the algorithm. Each processor knows its own processor id, which may be used in the algorithm to tell different processors to do different things.
Let's look at a simple parallel algorithm, on page 417 in your book. Note that the keyword "local" is used to say that a variable is in a processor's private memory. This means that each processor could have a different value for that variable. If the keyword local is not used, then the variable is shared and all processors have the same value for that variable.
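As a loose Java sketch of the local-versus-shared idea (my own, with threads standing in for processors and made-up names), each thread below knows its own id and keeps its own value of a local variable, while the array is shared by all of them.

public class LocalVsSharedDemo {
    static int[] shared = {3, 1, 4, 1, 5, 9, 2, 6};   // shared memory: every processor sees this array

    public static void main(String[] args) throws InterruptedException {
        Thread[] processors = new Thread[4];
        for (int p = 0; p < processors.length; p++) {
            final int id = p;                           // each processor knows its own id
            processors[p] = new Thread(() -> {
                // "local" variable: each processor has its own copy with its own value
                int local = shared[2 * id] + shared[2 * id + 1];
                System.out.println("Processor " + id + " computed local value " + local);
            });
        }
        for (Thread t : processors) t.start();
        for (Thread t : processors) t.join();
    }
}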
An algorithm will generally benefit from being written as a parallel algorithm if the amount of communication needed between the CPUs is relatively small. You can determine this by looking at how you might divide the problem into separate pieces for each CPU to execute.
Let's look at a simple example of a parallel algorithm. We'll look at an algorithm to calculate the sum of the numbers in an array. The sequential version of this algorithm would be O(n). What about the parallel version?
The key to designing a parallel algorithm is to divide the problem into subproblems that are independent. For example, when summing 4 numbers, we can divide the problem into summing the first 2 numbers and the second 2 numbers, then summing those results. The summing of the first 2 and second 2 numbers could be done in parallel.
The parallel algorithm does just that, by summing every adjacent pair of numbers. Then each pair of results is summed. Eventually we'll end up with a single result. Your student manual has a diagram showing this in the key points section for module 6.
The time complexity for the parallel algorithm is O(lg n). This is more efficient than the sequential algorithm.
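Here is a minimal runnable sketch of that pairwise summing in Java (my own code, assuming n is a power of 2, with one thread per active processor in each round). After lg n rounds the total ends up in the first slot of the array.

public class ParallelSum {
    public static void main(String[] args) throws InterruptedException {
        int[] a = {5, 2, 7, 1, 4, 8, 3, 6};        // n = 8, so lg n = 3 rounds
        int n = a.length;

        for (int size = 1; size < n; size *= 2) {
            int active = n / (2 * size);           // processors that do work this round
            Thread[] round = new Thread[active];
            for (int p = 0; p < active; p++) {
                final int left = 2 * p * size;     // left element of this processor's pair
                final int step = size;
                round[p] = new Thread(() -> a[left] += a[left + step]);
            }
            for (Thread t : round) t.start();
            for (Thread t : round) t.join();       // wait for the whole round to finish
        }
        System.out.println("Sum = " + a[0]);        // prints Sum = 36
    }
}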
For example, consider the following algorithm for finding the largest element in an array:
int find_largest (int num_items, int [] array)
{
    int max = array [0];
    for (int i = 1; i < num_items; i++)
        if (array [i] > max)
            max = array [i];
    return max;
}
Each step in this algorithm (comparing the current element to the max) needs information from the previous step to execute. This means that each CPU would need to wait for the previous CPU to execute before running its instruction. This would be a bad algorithm to use parallel programming on, since the CPUs would not be able to run in parallel.
If we change our algorithm for finding the largest element in an array, we can come up with an algorithm that could be written in parallel. For example, if we make each step a comparison of two elements in the array, swapping them if the one on the left is smaller than the one on the right, we can visualize using parallel programming on it.
Look at page 421 in your book to see how the CPUs will process the array. We'll go over this step by step. (This method of finding the largest element in an array is called the Tournament Method).
Now that we understand conceptually how the Tournament Method would work in parallel, let's look at the algorithm. The following is a modified version of the algorithm on page 420 in your book, where "p" is the index of the processor (a number from 1 to n/2, if n is the number of items in the array). For convenience, we'll assume that the number of items in the array is always a power of 2 (2, 4, 8, 16, etc).
int find_largest (int n, int [] array)
{
    local int size = 1;
    local int p = index of processor;
    for (int i = 1; i <= lg n; i++)
    {
        if (this processor needs to execute this time)
        {
            if (array [2 * p - 1] < array [2 * p - 1 + size])
                swap (array [2 * p - 1], array [2 * p - 1 + size]);
            size = size * 2;
        }
    }
    return array [1];   // the largest element ends up at the front (the book's arrays start at index 1)
}
To understand this algorithm, first recognize that the loop body is executed by each processor at the same time. So for i == 1, each processor executes the loop body. Only after all processors have executed the loop body do we move to i == 2.
Let's see this algorithm executing on the sample data from the book, but this time we'll keep track of local and shared variables.
Note that in this algorithm, no two processors need to write to the same memory location, so the CREW model would work just fine. No two processors need to read from the same memory location at the same time, so the EREW model would also work. In fact, any of the ways of resolving concurrent memory access would work, because we have no concurrent memory accesses.
What's the efficiency of this parallel algorithm? We don't look at the entire amount of work being done, but at the amount of work done by a single processor (because all the other processors are doing their work at the same time, and so do not add to the running time). A single processor runs at most lg n steps, making this an O(lg n) algorithm. This is better than the sequential version, which is O(n).
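If you want to trace the tournament method on concrete data, here is a small Java simulation (my own code on made-up data, not the book's example). It assumes n is a power of 2, treats the array as starting at index 1 like the pseudocode, and takes "this processor needs to execute this time" to mean that (p - 1) is a multiple of size, which matches the rounds described above.

import java.util.Arrays;

public class TournamentTrace {
    public static void main(String[] args) {
        int[] array = {0, 4, 9, 2, 7, 1, 8, 5, 3};   // slot 0 unused; the n = 8 items start at index 1
        int n = array.length - 1;

        int size = 1;
        for (int round = 1; size < n; round++) {
            for (int p = 1; p <= n / 2; p++) {       // conceptually, all processors run at once
                if ((p - 1) % size == 0) {           // assumed test for "needs to execute this time"
                    int left = 2 * p - 1;
                    if (array[left] < array[left + size]) {
                        int tmp = array[left];       // swap so the larger element moves left
                        array[left] = array[left + size];
                        array[left + size] = tmp;
                    }
                }
            }
            size = size * 2;
            System.out.println("After round " + round + ": " + Arrays.toString(array));
        }
        System.out.println("Largest = " + array[1]); // prints Largest = 9
    }
}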
You'll have to write a parallel algorithm to solve a problem. This exercise is worth 10 points. The point is not to necessarily get it exactly right, but to show that you understand the concept of parallel algorithms.
Next week we will finish covering parallel algorithms.