HW2 Solutions
1 Basic concepts
1. Performance. Suppose we have two computers A and B. Computer A has a clock cycle of
1 ns and performs 2 instructions per cycle. Computer B, instead, has a clock cycle of 600 ps
and performs 1.25 instructions per cycle. Assuming a program requires the execution of the
same number of instructions in both computers:
Solution.
Computer A performs (2 instructions / 1 cycle) × (1 cycle / 10^-9 s) = 2 × 10^9 instructions per second.
Computer B performs (1.25 instructions / 1 cycle) × (1 cycle / 600 × 10^-12 s) ≈ 2.08 × 10^9 instructions per second.
Computer B performs more instructions per second, thus it is the fastest for this program.
Now, let n be the number of instructions required by Computer A, and 1.1 × n the number
of instructions required by Computer B. The program will take n / (2 × 10^9) seconds on Computer A
and 1.1 × n / (2.08 × 10^9) ≈ n / (1.89 × 10^9) seconds on Computer B. Therefore, in this scenario, Computer A executes the
program faster.
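The same comparison can be verified with a short Python sketch (variable names are ours; the numbers come from the problem statement):

```python
# Sketch: instruction throughput of both machines (Problem 1).
cycle_A, ipc_A = 1e-9, 2.0       # Computer A: 1 ns cycle, 2 instructions/cycle
cycle_B, ipc_B = 600e-12, 1.25   # Computer B: 600 ps cycle, 1.25 instructions/cycle

rate_A = ipc_A / cycle_A         # instructions per second
rate_B = ipc_B / cycle_B
print(rate_A, rate_B)            # 2.0e9 vs. ~2.08e9 -> B has the higher throughput

# Second scenario: B needs 1.1x as many instructions as A (n cancels out).
n = 1.0
print(n / rate_A < 1.1 * n / rate_B)   # True -> Computer A finishes first
```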
2. Speedup.
Assume the runtime of an application for a problem of size 1 is 100 seconds. It
consists of an initialization phase, which lasts for 10 seconds and cannot be parallelized, and a
problem-solving phase, which can be perfectly parallelized and grows quadratically with the
problem size.
• What is the speedup for the given application as a function of the number of processors
p and the problem size n?
• What is the execution time and speedup of the application with problem size 1, if it is
parallelized and run on 4 processors?
• What is the execution time of the application if the problem size is increased to 4 and it is
run on 4 processors? And on 16 processors? What is the speedup of both measurements?
Solution.
The application has an inherently sequential part (c_s) that takes 10 seconds, and a parallelizable
part (c_p) that takes 90 seconds for problem size 1. Since the parallelizable part grows
quadratically with the problem size, we can model T(1, n) (the execution time on 1 processor) as
c_s + c_p × n^2.
S(p, n) := T(1, n) / T(p, n) = (c_s + c_p × n^2) / (c_s + (c_p × n^2)/p) ≡ (10 + 90 × n^2) / (10 + (90 × n^2)/p).
For problem size 1 (n = 1) and 4 processors (p = 4), the execution time is 32.5 seconds. The
achieved speedup is 3.08.
Finally, if the problem size is increased to 4, the execution time is T(4, 4) = 10 + (90 × 16)/4 = 370 seconds on 4 processors
and T(16, 4) = 10 + (90 × 16)/16 = 100 seconds on 16 processors. The corresponding speedups are
S(4, 4) = 1450/370 ≈ 3.92 and S(16, 4) = 1450/100 = 14.5.
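The model above is easy to evaluate programmatically; a minimal sketch (assuming the c_s = 10 s and c_p = 90 s values derived above) reproduces all three cases:

```python
# Sketch of the speedup model: T(p, n) = c_s + c_p * n^2 / p.
c_s, c_p = 10.0, 90.0

def T(p, n):
    """Execution time (seconds) on p processors for problem size n."""
    return c_s + c_p * n**2 / p

def S(p, n):
    """Speedup relative to the single-processor run."""
    return T(1, n) / T(p, n)

print(T(4, 1), S(4, 1))    # 32.5 s, ~3.08
print(T(4, 4), S(4, 4))    # 370.0 s, ~3.92
print(T(16, 4), S(16, 4))  # 100.0 s, 14.5
```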
• Improvement 1 (all fp instructions sped up by a factor of 1.5). Sequential part (β): 0.4;
p = 1.5. The application would observe a total speedup of:
S_p(n) := 1 / (β + (1 − β)/p) = 1 / (0.4 + (1 − 0.4)/1.5) = 1.25.
• Improvement 2 (square-root instructions sped up by a factor of 8). Sequential part (β):
0.4 + 0.45 = 0.85; p = 8. The application would observe a total speedup of:
S_p(n) := 1 / (β + (1 − β)/p) = 1 / (0.85 + (1 − 0.85)/8) ≈ 1.15.
Thus, the application would benefit the most from the first alternative.
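Both alternatives are direct applications of Amdahl's law; the sketch below reproduces the two speedups (the 60% and 15% time fractions for fp and square-root work are inferred from the β values used above):

```python
# Sketch: Amdahl's law, S = 1 / (beta + (1 - beta) / p).
def amdahl(beta, p):
    return 1.0 / (beta + (1.0 - beta) / p)

# Improvement 1: all fp work (assumed 60% of the runtime) sped up by 1.5x.
print(amdahl(beta=0.40, p=1.5))   # 1.25

# Improvement 2: only square-root work (assumed 15% of the runtime) sped up
# by 8x; the remaining fp work (45%) joins the sequential fraction.
print(amdahl(beta=0.85, p=8))     # ~1.15
```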
Parallelization of code. The speedup achieved on a 16-CPU system is:
S_p(n) := 1 / (β + (1 − β)/p) = 1 / (0.1 + (1 − 0.1)/16) = 6.4.
To attain a speedup of 10, 96% of the code would need to be perfectly parallelizable. This
value is obtained by solving the equation
10 = 1 / (β + (1 − β)/16)
for β, which gives β = 0.04.
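The same equation can be solved in closed form by rearranging Amdahl's law; a small sketch:

```python
# Sketch: solve 10 = 1 / (beta + (1 - beta)/16) for the sequential fraction beta.
# Rearranging Amdahl's law gives beta = (1/S - 1/p) / (1 - 1/p).
p, target_speedup = 16, 10
beta = (1.0 / target_speedup - 1.0 / p) / (1.0 - 1.0 / p)
print(beta)        # 0.04 -> 4% sequential
print(1 - beta)    # 0.96 -> 96% must be perfectly parallelizable
```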
4. Efficiency. Consider a computer that has a peak performance of 8 GFlops/s. An application
running on this computer executes 15 TFlops, and takes 1 hour to complete.
Solution.
The application attained 15 TFlop / 3600 s ≈ 4.26 GFlops/s.
The achieved efficiency is 4.26 GFlops/s / 8 GFlops/s ≈ 53%.
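A quick check of the arithmetic (a sketch; the 4.26 GFlops/s figure suggests the solution counts 1 TFlop as 1024 GFlop):

```python
# Sketch: achieved performance and efficiency for Problem 4.
total_gflop = 15 * 1024        # 15 TFlop of work, counted as 1024 GFlop per TFlop
runtime_s   = 3600.0           # 1 hour
peak_gflops = 8.0              # peak performance in GFlops/s

achieved = total_gflop / runtime_s
print(achieved)                 # ~4.27 GFlops/s
print(achieved / peak_gflops)   # ~0.53 -> roughly 53% efficiency
```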
5. Parallel efficiency. Given the data in Tab. 1, use your favorite plotting tool to plot the
scalability (speedup) and the parallel efficiency achieved. In both cases, plot also the ideal case, that is, scalability equal to the number of processors
and parallel efficiency equal to 1, respectively.
Solution.
Table 2 includes the speedup and parallel efficiency attained. Figure 1 gives an example of
the requested plots.
Figure 1: Speedup (left) and parallel efficiency (right) versus the number of processors (1, 2, 4, 8, 16), including the ideal cases.
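One possible way to produce the requested plots is sketched below with matplotlib; the runtimes are placeholders (Tab. 1 is not reproduced here) and should be replaced with the measured values:

```python
# Sketch: speedup and parallel-efficiency plots, including the ideal cases.
import matplotlib.pyplot as plt

procs = [1, 2, 4, 8, 16]
times = [100.0, 52.0, 27.0, 14.5, 8.0]   # placeholder runtimes in seconds

speedup    = [times[0] / t for t in times]
efficiency = [s / p for s, p in zip(speedup, procs)]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(procs, speedup, marker="o", label="measured")
ax1.plot(procs, procs, linestyle="--", label="ideal")
ax1.set_xlabel("Number of processors"); ax1.set_ylabel("Speedup"); ax1.legend()

ax2.plot(procs, efficiency, marker="o", label="measured")
ax2.plot(procs, [1.0] * len(procs), linestyle="--", label="ideal")
ax2.set_xlabel("Number of processors"); ax2.set_ylabel("Parallel efficiency"); ax2.legend()

plt.tight_layout()
plt.show()
```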