COMP422 - Assignment2 Report
Parallelization Strategy
LU decomposition consists of four major computational tasks:
1. Pivot Selection - Parallel search for the pivot element.
2. Row Swapping and Permutation Updates - Parallel row swapping operations.
3. Factorization - Compute values for the L and U matrices.
4. Matrix Update - Apply Gaussian elimination to update the trailing submatrix.
To maximize parallel efficiency, OpenMP is used to distribute this computation across multiple threads while minimizing dependencies between iterations and keeping synchronization overhead low. Below is an analysis of how each step is parallelized:
1. Pivot Selection
Pseudocode:
    Initialize max_value = 0 and pivot_row = k
    # Parallel for loop to find the row with the maximum absolute value in column k
    Parallel for each row i from k to N:
        Compute abs(A[i, k])
        If abs(A[i, k]) > local_max:          # local_max and local_pivot are per-thread
            Update local_max and local_pivot
    # In a critical section, merge each thread's local_max and local_pivot into max_value and pivot_row
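A minimal OpenMP sketch of this step is shown below, assuming the flat row-major layout described later in this report (element (i, j) at index i * N + j); the function name find_pivot_row is illustrative, not the exact assignment code.

    #include <math.h>
    #include <stddef.h>
    #include <omp.h>

    /* Illustrative sketch: find the pivot row for column k of a flat row-major N x N matrix. */
    static int find_pivot_row(const double *A, int N, int k)
    {
        int pivot_row = k;
        double max_value = -1.0;

        #pragma omp parallel
        {
            int local_pivot = k;
            double local_max = -1.0;

            /* Each thread scans its share of rows k..N-1 in column k. */
            #pragma omp for nowait
            for (int i = k; i < N; i++) {
                double v = fabs(A[(size_t)i * N + k]);
                if (v > local_max) {
                    local_max = v;
                    local_pivot = i;
                }
            }

            /* Merge per-thread results; the critical section prevents a race on the shared maximum. */
            #pragma omp critical
            {
                if (local_max > max_value) {
                    max_value = local_max;
                    pivot_row = local_pivot;
                }
            }
        }
        return pivot_row;
    }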
2. Row Swapping and Permutation Updates
Pseudocode:
    If pivot_row != k:
        Swap P[k] and P[pivot_row]            # update the permutation record
        # Parallel swap of rows k and pivot_row in A
        Parallel for each column j from 0 to N:
            Swap A[k, j] and A[pivot_row, j]
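A corresponding OpenMP sketch, again assuming a flat row-major matrix and a permutation vector P (the names swap_rows and P are illustrative):

    #include <omp.h>

    /* Illustrative sketch: swap rows k and pivot_row of a flat row-major N x N matrix
       and record the exchange in the permutation vector P. */
    static void swap_rows(double *A, int *P, int N, int k, int pivot_row)
    {
        if (pivot_row == k)
            return;

        /* Record the row exchange in the permutation vector. */
        int tmp_p = P[k];
        P[k] = P[pivot_row];
        P[pivot_row] = tmp_p;

        /* Columns are independent, so threads swap disjoint slices of the two rows. */
        #pragma omp parallel for
        for (int j = 0; j < N; j++) {
            double tmp = A[k * N + j];
            A[k * N + j] = A[pivot_row * N + j];
            A[pivot_row * N + j] = tmp;
        }
    }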
3. LU Factorization
For each row i > k, the values L[i][k] and U[k][i] are computed from the pivot row. Each of these computations is independent of the others, so this step parallelizes efficiently; I use #pragma omp parallel for to distribute the loop across threads.
Pseudocode:
    Parallel for each row i from k+1 to N:
        L[i, k] = A[i, k] / U[k, k]
        U[k, i] = A[k, i]
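A minimal OpenMP sketch of this step, assuming separate flat row-major L and U matrices alongside A (names are illustrative, not the exact assignment code):

    #include <omp.h>

    /* Illustrative sketch: compute column k of L and row k of U for flat row-major N x N matrices. */
    static void factor_step(const double *A, double *L, double *U, int N, int k)
    {
        /* Each iteration touches a distinct row, so the loop parallelizes without dependencies. */
        #pragma omp parallel for
        for (int i = k + 1; i < N; i++) {
            L[i * N + k] = A[i * N + k] / U[k * N + k];  /* multiplier below the pivot */
            U[k * N + i] = A[k * N + i];                 /* copy the pivot-row entry into U */
        }
    }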
4. Matrix Update
Pseudocode:
    Parallel for each row i from k+1 to N:
        For each column j from k+1 to N:
            A[i, j] = A[i, j] - L[i, k] * U[k, j]
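A matching OpenMP sketch of the trailing-submatrix update, under the same layout assumptions:

    #include <omp.h>

    /* Illustrative sketch: eliminate column k from the trailing submatrix. */
    static void update_trailing(double *A, const double *L, const double *U, int N, int k)
    {
        /* Rows are independent once L[i][k] and U[k][j] are fixed, so the outer loop is parallelized. */
        #pragma omp parallel for
        for (int i = k + 1; i < N; i++) {
            for (int j = k + 1; j < N; j++) {
                A[i * N + j] -= L[i * N + k] * U[k * N + j];
            }
        }
    }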
Synchronization
Synchronization plays a crucial role in ensuring correctness and preventing race conditions. Pivot
selection requires a critical section to avoid conflicting updates when determining the global maximum
pivot row. Row swapping is performed in parallel, but synchronization ensures that all swaps use the
same pivot index across threads. The factorization and matrix updates are designed to be fully parallelized
without dependencies, allowing near-linear scaling with increasing thread counts.
Data Layout
An important consideration in parallel programs is how data is laid out in memory and how threads
interact with it. Poor data layout can lead to issues such as cache inefficiency or false sharing. In my LU
decomposition implementation, I adopt a flat, contiguous memory layout for all matrices using a
one-dimensional array representation. Specifically, an NxN matrix is stored in row-major order as a
single block of memory, where the element at position (i, j) is accessed via index i * N + j.
In contrast, an alternative layout using an array of pointers to rows (i.e., a vector of vectors) could
introduce additional pointer indirection and poor memory locality. Such a layout also increases the chance
of false sharing if rows are allocated close together in memory but worked on by different threads.
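As a small illustration of this layout (the IDX macro and alloc_matrix function are hypothetical helpers, not part of the assignment code):

    #include <stdlib.h>

    /* Element (i, j) of an N x N row-major matrix lives at index i * N + j. */
    #define IDX(i, j, N) ((size_t)(i) * (size_t)(N) + (size_t)(j))

    double *alloc_matrix(int N)
    {
        /* One contiguous block for the whole matrix improves locality and streaming access. */
        return malloc((size_t)N * (size_t)N * sizeof(double));
    }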
NUMA Utilization
To further improve performance, I utilize numa_alloc_local to allocate memory. This function allocates memory on the NUMA node local to the thread that performs the allocation, keeping the data physically close to the threads that subsequently work on it. As a result:
1. Reduced Memory Latency: Threads access memory that is physically closer, decreasing access
times compared to remote memory access.
2. Improved Bandwidth Utilization: Local memory access reduces contention and traffic on
interconnects between NUMA nodes.
3. Better Cache Performance: NUMA-aware allocation helps preserve cache locality, allowing each
thread to operate more efficiently within its local cache domain.
By combining NUMA-aware memory allocation with careful data layout and OpenMP-based
parallelization, I significantly enhance the overall efficiency and scalability of my LU decomposition
implementation.
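A minimal sketch of this allocation pattern, assuming libnuma is available (link with -lnuma); the function name is illustrative:

    #include <numa.h>
    #include <stddef.h>

    /* Illustrative sketch: allocate an N x N matrix on the NUMA node of the calling thread. */
    double *alloc_matrix_numa(int N)
    {
        size_t bytes = (size_t)N * (size_t)N * sizeof(double);
        return (double *)numa_alloc_local(bytes);  /* release later with numa_free(ptr, bytes) */
    }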
Correctness Validation
To verify the correctness of my LU decomposition implementation, I compute the L2,1 norm of the residual matrix PA − LU. A small residual indicates that the decomposition is numerically accurate.
I tested this on a 1000×1000 matrix using 4 threads. The resulting L2,1 norm was approximately 7.93×10⁻⁹, which confirms that the decomposition is accurate within numerical precision limits. The corresponding execution output is shown below:
This result supports the validity of my LU decomposition approach and demonstrates its ability to
maintain numerical stability even in a parallelized environment.
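A sketch of this check, assuming P is stored as a permutation vector perm so that row i of PA is row perm[i] of A, and that L stores its unit diagonal explicitly (the function name and signature are illustrative):

    #include <math.h>
    #include <stddef.h>

    /* Illustrative sketch: L2,1 norm of the residual R = PA - LU (sum of column 2-norms). */
    double residual_l21(const double *A, const double *L, const double *U,
                        const int *perm, int N)
    {
        double norm = 0.0;
        for (int j = 0; j < N; j++) {
            double col_sq = 0.0;
            for (int i = 0; i < N; i++) {
                /* (LU)[i][j] = sum over k of L[i][k] * U[k][j]; L is lower, U is upper triangular. */
                double lu = 0.0;
                for (int k = 0; k <= i && k <= j; k++)
                    lu += L[(size_t)i * N + k] * U[(size_t)k * N + j];
                double r = A[(size_t)perm[i] * N + j] - lu;
                col_sq += r * r;
            }
            norm += sqrt(col_sq);  /* add this column's Euclidean norm */
        }
        return norm;
    }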
Parallel Efficiency
Parallel efficiency is computed as
E = T₁ / (p × Tₚ)
where T₁ is the execution time with 1 thread, Tₚ is the execution time with p threads, and p is the number of threads.
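For illustration only (hypothetical numbers, not measured results): if T₁ = 12 s and T₄ = 4 s on p = 4 threads, then E = 12 / (4 × 4) = 0.75, i.e. 75% parallel efficiency.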
The decline in efficiency at higher thread counts reflects increasing overhead from synchronization and memory contention. For example,
during pivot selection, threads compete to update shared variables, and as the number of threads grows,
contention in the critical section becomes a limiting factor. Additionally, memory bandwidth becomes
saturated, limiting how effectively threads can access and update matrix elements. These effects
compound as thread count increases, resulting in diminishing performance gains and lower parallel
efficiency. This behavior aligns with Amdahl’s Law, which indicates that the serial portion of an
algorithm imposes a hard limit on scalability.
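For reference, Amdahl's Law bounds the speedup on p threads by S(p) = 1 / (s + (1 − s) / p), where s is the serial fraction of the algorithm; the corresponding efficiency E = S(p) / p therefore decreases as p grows, consistent with the behavior described above.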