
Assignment 2: Parallel LU Decomposition using OpenMP

Name: Iris Xue


NetId: yx66

Part 1: Strategy and Implementation for LU Decomposition


This section describes the parallel implementation of LU decomposition with row pivoting using
OpenMP. The objective is to efficiently factorize an N × N matrix A into a lower triangular matrix L and
an upper triangular matrix U, while maintaining numerical stability using partial pivoting. This section
details the parallelization strategy, data partitioning, and synchronization mechanisms employed in my
LU decomposition program.

Parallelization Strategy:
LU decomposition consists of four major computational tasks:
1. Pivot Selection – Parallel search for the pivot element.
2. Row Swapping and Permutation Updates – Parallel row swapping operations.
3. Factorization – Compute values for the L and U matrices.
4. Matrix Update – Apply Gaussian elimination to update the trailing submatrix.
To maximize parallel efficiency, OpenMP is used to distribute computation across multiple threads while
minimizing dependencies between iterations.

The parallelization strategy aims to distribute work efficiently while minimizing synchronization
overhead. Below is an analysis of how different steps are parallelized:

1. Pivot Selection


Pivot selection is a reduction-type operation that involves scanning a column to find the row with the
largest absolute value. In my code, this part is parallelized using OpenMP. I split the row traversal among
multiple threads using #pragma omp for nowait, which assigns different row ranges to different
threads. Each thread keeps track of a local maximum value and corresponding row index. A critical
section, marked with #pragma omp critical, is then used to safely compare and update the global
maximum value and pivot index. This approach ensures that race conditions are avoided and the correct
pivot is chosen, while the nowait clause reduces unnecessary synchronization overhead.

Pseudocode:
Initialize max_value = 0 and pivot_row = k

# Parallel loop: each thread scans its rows of column k, tracking a private maximum
Parallel for each row i from k to N:
    If abs(A[i, k]) > local_max:
        local_max = abs(A[i, k])
        local_pivot = i

# Critical section to merge each thread's result into the global max_value and pivot_row
Critical section:
    If local_max > max_value:
        Update max_value and pivot_row
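
A minimal C/OpenMP sketch of this pattern (the function and variable names are illustrative, assuming the flat row-major layout described in the Data Layout section below):

#include <math.h>
#include <omp.h>

/* Find the pivot row for column k: the row i >= k that maximizes |A[i*N + k]|.
   Each thread scans its share of rows with a private maximum; the per-thread
   results are then merged inside a critical section. */
int find_pivot(const double *A, int N, int k) {
    double max_value = 0.0;
    int pivot_row = k;

    #pragma omp parallel
    {
        double local_max = 0.0;
        int local_pivot = k;

        /* nowait: threads skip the implicit barrier and proceed
           directly to the critical section once their rows are done */
        #pragma omp for nowait
        for (int i = k; i < N; i++) {
            double v = fabs(A[i * N + k]);
            if (v > local_max) {
                local_max = v;
                local_pivot = i;
            }
        }

        /* Merge each thread's local result into the global maximum */
        #pragma omp critical
        {
            if (local_max > max_value) {
                max_value = local_max;
                pivot_row = local_pivot;
            }
        }
    }
    return pivot_row;
}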

2. Row Swapping


Once the pivot row is determined, the next step is swapping rows in A, L, and the permutation vector P to
maintain consistency. Swapping L[k] and L[pivot] is performed only up to column (k-1) to preserve
triangular structure. Each row swap operates independently, making it ideal for parallel execution.
I used #pragma omp parallel for to swap rows efficiently:

Pseudocode:
If pivot_row != k:
    # Parallel swap of rows in A
    Parallel for each column j from 0 to N:
        Swap A[k, j] and A[pivot_row, j]

    Swap permutation vector entries P[k] and P[pivot_row]

    # Parallel swap of the corresponding L matrix values (columns 0 to k-1 only)
    Parallel for each column j from 0 to k-1:
        Swap L[k, j] and L[pivot_row, j]
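
A corresponding C sketch (again with illustrative names, assuming flat row-major storage for A and L):

/* Swap row k and row pivot_row in A (the full row), in the permutation
   vector P (a single entry), and in L (columns 0..k-1 only, preserving
   the lower-triangular structure). */
void swap_rows(double *A, double *L, int *P, int N, int k, int pivot_row) {
    if (pivot_row == k) return;

    #pragma omp parallel for
    for (int j = 0; j < N; j++) {
        double tmp = A[k * N + j];
        A[k * N + j] = A[pivot_row * N + j];
        A[pivot_row * N + j] = tmp;
    }

    int tmp_p = P[k];
    P[k] = P[pivot_row];
    P[pivot_row] = tmp_p;

    #pragma omp parallel for
    for (int j = 0; j < k; j++) {
        double tmp = L[k * N + j];
        L[k * N + j] = L[pivot_row * N + j];
        L[pivot_row * N + j] = tmp;
    }
}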

3. LU Factorization
For each row i > k, the values L[i][k] and U[k][i] are computed from the pivot row. Each of these computations is independent, so parallel execution is efficient. I used #pragma omp parallel for to perform the calculation.

Pseudocode:
Parallel for each row i from k+1 to N:
    L[i, k] = A[i, k] / U[k, k]
    U[k, i] = A[k, i]
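
In C, this step might look as follows. The assignments to the diagonal entries U[k][k] and L[k][k] are my assumption about how the surrounding code completes the factorization, since the pseudocode above only covers rows i > k:

/* Compute column k of L and row k of U from the partially eliminated A.
   Each iteration touches a distinct row i, so the loop parallelizes cleanly. */
void factor_step(const double *A, double *L, double *U, int N, int k) {
    U[k * N + k] = A[k * N + k];   /* pivot element on U's diagonal (assumed) */
    L[k * N + k] = 1.0;            /* unit diagonal of L (assumed) */

    #pragma omp parallel for
    for (int i = k + 1; i < N; i++) {
        L[i * N + k] = A[i * N + k] / U[k * N + k];
        U[k * N + i] = A[k * N + i];
    }
}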

4. Matrix Update (Elimination Step)


After computing L and U, the matrix is updated using Gaussian elimination. The update of A[i][j] is
independent for each (i, j) pair, making this fully parallelizable. Using #pragma omp parallel for
across rows i ensures that different rows are processed by different threads. Memory access patterns are
optimized to avoid false sharing.

Pseudocode:
Parallel for each row i from k+1 to N:
    For each column j from k+1 to N:
        A[i, j] = A[i, j] - L[i, k] * U[k, j]
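
A C sketch of the update loop; distributing whole rows across threads means each thread writes only to its own rows:

/* Eliminate below the pivot: rank-1 update of the trailing
   (N-k-1) x (N-k-1) block of A. */
void update_trailing(double *A, const double *L, const double *U, int N, int k) {
    #pragma omp parallel for
    for (int i = k + 1; i < N; i++) {
        for (int j = k + 1; j < N; j++) {
            A[i * N + j] -= L[i * N + k] * U[k * N + j];
        }
    }
}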

Synchronization
Synchronization plays a crucial role in ensuring correctness and preventing race conditions. Pivot
selection requires a critical section to avoid conflicting updates when determining the global maximum
pivot row. Row swapping is performed in parallel, but synchronization ensures that all swaps use the
same pivot index across threads. The factorization and matrix-update loops are designed to be free of cross-iteration dependencies, which in principle allows them to scale nearly linearly with thread count.

Data Layout
An important consideration in parallel programs is how data is laid out in memory and how threads
interact with it. Poor data layout can lead to issues such as cache inefficiency or false sharing. In my LU
decomposition implementation, I adopt a flat, contiguous memory layout for all matrices using a
one-dimensional array representation. Specifically, an N × N matrix is stored in row-major order as a single block of memory, where the element at position (i, j) is accessed via index i * N + j.
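
As a minimal illustration of this layout (the IDX macro and alloc_matrix helper are my shorthand, not names from the implementation):

#include <stdlib.h>

/* Element (i, j) of an N x N row-major matrix lives at offset i*N + j */
#define IDX(i, j, N) ((size_t)(i) * (size_t)(N) + (size_t)(j))

/* One contiguous block: consecutive elements of a row are adjacent in memory */
double *alloc_matrix(int N) {
    return malloc((size_t)N * (size_t)N * sizeof(double));
}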

This design offers two primary benefits:


1. Improved Cache Locality: The contiguous layout ensures spatial locality, which allows better use of CPU cache lines during matrix traversal. Threads operating on adjacent elements within a row benefit from reduced cache misses.
2. Minimized False Sharing: Since each thread in my parallel loops typically works on different rows (or disjoint regions of memory), there is minimal risk of multiple threads writing to the same cache line. This prevents unnecessary cache coherency traffic and improves scalability.

In contrast, an alternative layout using an array of pointers to rows (i.e., a vector of vectors) could
introduce additional pointer indirection and poor memory locality. Such a layout also increases the chance
of false sharing if rows are allocated close together in memory but worked on by different threads.

NUMA Utilization
To further improve performance, I utilize numa_alloc_local to allocate memory. This function allocates memory on the NUMA node local to the CPU on which the calling thread is running, keeping data physically close to the threads that use it. As a result:
1. Reduced Memory Latency: Threads access memory that is physically closer, decreasing access times compared to remote memory access.
2. Improved Bandwidth Utilization: Local memory access reduces contention and traffic on interconnects between NUMA nodes.
3. Better Cache Performance: NUMA-aware allocation helps preserve cache locality, allowing each thread to operate more efficiently within its local cache domain.
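
A sketch of this allocation pattern using libnuma (the helper names are mine; error handling is omitted):

#include <stddef.h>
#include <numa.h>   /* libnuma; link with -lnuma */

/* Allocate an N x N matrix on the NUMA node local to the calling thread */
double *alloc_matrix_numa(int N) {
    size_t bytes = (size_t)N * (size_t)N * sizeof(double);
    return numa_alloc_local(bytes);   /* may return NULL on failure */
}

/* Memory from numa_alloc_local must be released with numa_free, not free */
void free_matrix_numa(double *A, int N) {
    numa_free(A, (size_t)N * (size_t)N * sizeof(double));
}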

By combining NUMA-aware memory allocation with careful data layout and OpenMP-based
parallelization, I significantly enhance the overall efficiency and scalability of my LU decomposition
implementation.

Correctness Validation
To verify the correctness of my LU decomposition implementation, I compute the L2,1 norm of the residual matrix PA − LU. A small residual indicates that the decomposition is numerically accurate. I tested this on a 1000 × 1000 matrix using 4 threads. The resulting L2,1 norm was approximately 7.93 × 10⁻⁹, which confirms that the decomposition is accurate within numerical precision limits.

This result supports the validity of my LU decomposition approach and demonstrates its ability to
maintain numerical stability even in a parallelized environment.
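
For reference, a sketch of how this check can be computed, taking the L2,1 norm as the sum of the Euclidean norms of the columns and assuming the residual R = PA − LU has already been formed in a row-major array:

#include <math.h>

/* L2,1 norm of the N x N residual R: sum over columns j of
   sqrt( sum over rows i of R[i][j]^2 ) */
double l21_norm(const double *R, int N) {
    double norm = 0.0;
    for (int j = 0; j < N; j++) {
        double col_sq = 0.0;
        for (int i = 0; i < N; i++) {
            double v = R[i * N + j];
            col_sq += v * v;
        }
        norm += sqrt(col_sq);
    }
    return norm;
}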

Part 2: Performance Evaluation


To evaluate the performance of my LU decomposition implementation, I measured execution time on a
matrix of size N = 7000 using 1, 2, 4, 8, 16, and 32 threads on a node on nots.rice.edu. The parallel
efficiency E is computed as:

E = T₁ / (p × Tₚ)

where T₁ is the execution time with 1 thread, Tₚ is the execution time with p threads, and p is the number
of threads.

Timing and Efficiency Results:


Threads    Execution Time (s)    Parallel Efficiency
1          97.608                1.1004
2          54.9717               0.9769
4          34.736                0.7730
8          30.0239               0.4472
16         29.5341               0.2273
32         30.5015               0.1100
The parallel efficiency data shows a clear downward trend as the number of threads increases. Efficiency starts at 1.1004 for 1 thread (a value above 1.0 suggests that the baseline T₁ in the efficiency formula was taken from a separate serial run rather than from the single-threaded OpenMP execution) and remains relatively strong at 0.9769 for 2 threads. However, it drops more noticeably to 0.7730 for 4 threads and continues to decline significantly.

This pattern reflects increasing overhead from synchronization and memory contention. For example,
during pivot selection, threads compete to update shared variables, and as the number of threads grows,
contention in the critical section becomes a limiting factor. Additionally, memory bandwidth becomes
saturated, limiting how effectively threads can access and update matrix elements. These effects
compound as thread count increases, resulting in diminishing performance gains and lower parallel
efficiency. This behavior aligns with Amdahl’s Law, which indicates that the serial portion of an
algorithm imposes a hard limit on scalability.
