
Assignment 2: Parallel LU Decomposition using OpenMP

Name: Iris Xue


NetId: yx66

Part 1: Strategy and Implementation for LU Decomposition


This section describes the parallel implementation of LU decomposition with row pivoting using
OpenMP. The objective is to efficiently factorize an N × N matrix A into a lower triangular matrix L and
an upper triangular matrix U, while maintaining numerical stability using partial pivoting. This section
details the parallelization strategy, data partitioning, and synchronization mechanisms employed in my
LU decomposition program.

Parallelization Strategy:
LU decomposition consists of four major computational tasks:
1. Pivot Selection – Parallel search for the pivot element.
2. Row Swapping and Permutation Updates – Parallel row swapping operations.
3. Factorization – Compute values for the L and U matrices.
4. Matrix Update – Apply Gaussian elimination to update the trailing submatrix.
To maximize parallel efficiency, OpenMP is used to distribute computation across multiple threads while
minimizing dependencies between iterations.

The parallelization strategy aims to distribute work efficiently while minimizing synchronization
overhead. Below is an analysis of how different steps are parallelized:

1. Pivot Selection


Pivot selection is a reduction-type operation that involves scanning a column to find the row with the
largest absolute value. In my code, this part is parallelized using OpenMP. I split the row traversal among
multiple threads using #pragma omp for nowait, which assigns different row ranges to different
threads. Each thread keeps track of a local maximum value and corresponding row index. A critical
section, marked with #pragma omp critical, is then used to safely compare and update the global
maximum value and pivot index. This approach ensures that race conditions are avoided and the correct
pivot is chosen, while the nowait clause reduces unnecessary synchronization overhead.

Pseudocode:
Initialize max_value = 0 and pivot_row = k

# Parallel loop: each thread scans its rows of column k, tracking a private maximum
Parallel for each row i from k to N:
    If abs(A[i, k]) > local_max:
        local_max = abs(A[i, k])
        local_pivot = i

# Critical section to merge each thread's result into the global max_value and pivot_row
Critical section:
    If local_max > max_value:
        Update max_value and pivot_row
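
A minimal C/OpenMP sketch of this pattern (the function and variable names are illustrative, assuming the flat row-major layout described in the Data Layout section below):

#include <math.h>
#include <omp.h>

/* Find the pivot row for column k: the row i >= k that maximizes |A[i*N + k]|.
   Each thread scans its share of rows with a private maximum; the per-thread
   results are then merged inside a critical section. */
int find_pivot(const double *A, int N, int k) {
    double max_value = 0.0;
    int pivot_row = k;

    #pragma omp parallel
    {
        double local_max = 0.0;
        int local_pivot = k;

        /* nowait: threads skip the implicit barrier and proceed
           directly to the critical section once their rows are done */
        #pragma omp for nowait
        for (int i = k; i < N; i++) {
            double v = fabs(A[i * N + k]);
            if (v > local_max) {
                local_max = v;
                local_pivot = i;
            }
        }

        /* Merge each thread's local result into the global maximum */
        #pragma omp critical
        {
            if (local_max > max_value) {
                max_value = local_max;
                pivot_row = local_pivot;
            }
        }
    }
    return pivot_row;
}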

2. Row Swapping


Once the pivot row is determined, the next step is swapping rows in A, L, and the permutation vector P to
maintain consistency. Swapping L[k] and L[pivot] is performed only up to column (k-1) to preserve
triangular structure. Each row swap operates independently, making it ideal for parallel execution.
I used #pragma omp parallel for to swap rows efficiently:

Pseudocode:
If pivot_row != k:
    # Parallel swap of rows in A
    Parallel for each column j from 0 to N:
        Swap A[k, j] and A[pivot_row, j]

    Swap permutation vector entries P[k] and P[pivot_row]

    # Parallel swap of the corresponding L matrix values (columns 0 to k-1 only)
    Parallel for each column j from 0 to k-1:
        Swap L[k, j] and L[pivot_row, j]
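
A corresponding C sketch (again with illustrative names, assuming flat row-major storage for A and L):

/* Swap row k and row pivot_row in A (the full row), in the permutation
   vector P (a single entry), and in L (columns 0..k-1 only, preserving
   the lower-triangular structure). */
void swap_rows(double *A, double *L, int *P, int N, int k, int pivot_row) {
    if (pivot_row == k) return;

    #pragma omp parallel for
    for (int j = 0; j < N; j++) {
        double tmp = A[k * N + j];
        A[k * N + j] = A[pivot_row * N + j];
        A[pivot_row * N + j] = tmp;
    }

    int tmp_p = P[k];
    P[k] = P[pivot_row];
    P[pivot_row] = tmp_p;

    #pragma omp parallel for
    for (int j = 0; j < k; j++) {
        double tmp = L[k * N + j];
        L[k * N + j] = L[pivot_row * N + j];
        L[pivot_row * N + j] = tmp;
    }
}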

3. LU Factorization
For each row i > k, the values L[i][k] and U[k][i] are computed from the pivot row. Each of these computations is independent, so parallel execution is efficient. I used #pragma omp parallel for to perform the calculation.

Pseudocode:
Parallel for each row i from k+1 to N:
    L[i, k] = A[i, k] / U[k, k]
    U[k, i] = A[k, i]
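
In C, this step might look as follows. The assignments to the diagonal entries U[k][k] and L[k][k] are my assumption about how the surrounding code completes the factorization, since the pseudocode above only covers rows i > k:

/* Compute column k of L and row k of U from the partially eliminated A.
   Each iteration touches a distinct row i, so the loop parallelizes cleanly. */
void factor_step(const double *A, double *L, double *U, int N, int k) {
    U[k * N + k] = A[k * N + k];   /* pivot element on U's diagonal (assumed) */
    L[k * N + k] = 1.0;            /* unit diagonal of L (assumed) */

    #pragma omp parallel for
    for (int i = k + 1; i < N; i++) {
        L[i * N + k] = A[i * N + k] / U[k * N + k];
        U[k * N + i] = A[k * N + i];
    }
}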

4. Matrix Update (Elimination Step)


After computing L and U, the matrix is updated using Gaussian elimination. The update of A[i][j] is
independent for each (i, j) pair, making this fully parallelizable. Using #pragma omp parallel for
across rows i ensures that different rows are processed by different threads. Memory access patterns are
optimized to avoid false sharing.

Pseudocode:
Parallel for each row i from k+1 to N:
    For each column j from k+1 to N:
        A[i, j] = A[i, j] - L[i, k] * U[k, j]
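
A C sketch of the update loop; distributing whole rows across threads means each thread writes only to its own rows:

/* Eliminate below the pivot: rank-1 update of the trailing
   (N-k-1) x (N-k-1) block of A. */
void update_trailing(double *A, const double *L, const double *U, int N, int k) {
    #pragma omp parallel for
    for (int i = k + 1; i < N; i++) {
        for (int j = k + 1; j < N; j++) {
            A[i * N + j] -= L[i * N + k] * U[k * N + j];
        }
    }
}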

Synchronization
Synchronization plays a crucial role in ensuring correctness and preventing race conditions. Pivot
selection requires a critical section to avoid conflicting updates when determining the global maximum
pivot row. Row swapping is performed in parallel, but synchronization ensures that all swaps use the
same pivot index across threads. The factorization and matrix-update loops are designed to be free of cross-iteration dependencies, which in principle allows them to scale nearly linearly with thread count.

Data Layout
An important consideration in parallel programs is how data is laid out in memory and how threads
interact with it. Poor data layout can lead to issues such as cache inefficiency or false sharing. In my LU
decomposition implementation, I adopt a flat, contiguous memory layout for all matrices using a
one-dimensional array representation. Specifically, an N × N matrix is stored in row-major order as a single block of memory, where the element at position (i, j) is accessed via index i * N + j.
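
As a minimal illustration of this layout (the IDX macro and alloc_matrix helper are my shorthand, not names from the implementation):

#include <stdlib.h>

/* Element (i, j) of an N x N row-major matrix lives at offset i*N + j */
#define IDX(i, j, N) ((size_t)(i) * (size_t)(N) + (size_t)(j))

/* One contiguous block: consecutive elements of a row are adjacent in memory */
double *alloc_matrix(int N) {
    return malloc((size_t)N * (size_t)N * sizeof(double));
}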

This design offers two primary benefits:


1. Improved Cache Locality: The contiguous layout ensures spatial locality, which allows better use of CPU cache lines during matrix traversal. Threads operating on adjacent elements within a row benefit from reduced cache misses.
2. Minimized False Sharing: Since each thread in my parallel loops typically works on different rows (or disjoint regions of memory), there is minimal risk of multiple threads writing to the same cache line. This prevents unnecessary cache coherency traffic and improves scalability.

In contrast, an alternative layout using an array of pointers to rows (i.e., a vector of vectors) could
introduce additional pointer indirection and poor memory locality. Such a layout also increases the chance
of false sharing if rows are allocated close together in memory but worked on by different threads.

NUMA Utilization
To further improve performance, I utilize numa_alloc_local to allocate memory. This function allocates memory on the NUMA node local to the CPU on which the calling thread is running, keeping data physically close to the threads that use it. As a result:
1. Reduced Memory Latency: Threads access memory that is physically closer, decreasing access times compared to remote memory access.
2. Improved Bandwidth Utilization: Local memory access reduces contention and traffic on interconnects between NUMA nodes.
3. Better Cache Performance: NUMA-aware allocation helps preserve cache locality, allowing each thread to operate more efficiently within its local cache domain.
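
A sketch of this allocation pattern using libnuma (the helper names are mine; error handling is omitted):

#include <stddef.h>
#include <numa.h>   /* libnuma; link with -lnuma */

/* Allocate an N x N matrix on the NUMA node local to the calling thread */
double *alloc_matrix_numa(int N) {
    size_t bytes = (size_t)N * (size_t)N * sizeof(double);
    return numa_alloc_local(bytes);   /* may return NULL on failure */
}

/* Memory from numa_alloc_local must be released with numa_free, not free */
void free_matrix_numa(double *A, int N) {
    numa_free(A, (size_t)N * (size_t)N * sizeof(double));
}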

By combining NUMA-aware memory allocation with careful data layout and OpenMP-based
parallelization, I significantly enhance the overall efficiency and scalability of my LU decomposition
implementation.

Correctness Validation
To verify the correctness of my LU decomposition implementation, I compute the L2,1 norm of the residual matrix PA − LU. A small residual indicates that the decomposition is numerically accurate. I tested this on a 1000 × 1000 matrix using 4 threads. The resulting L2,1 norm was approximately 7.93 × 10⁻⁹, which confirms that the decomposition is accurate within numerical precision limits.

This result supports the validity of my LU decomposition approach and demonstrates its ability to
maintain numerical stability even in a parallelized environment.
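
For reference, a sketch of how this check can be computed, taking the L2,1 norm as the sum of the Euclidean norms of the columns and assuming the residual R = PA − LU has already been formed in a row-major array:

#include <math.h>

/* L2,1 norm of the N x N residual R: sum over columns j of
   sqrt( sum over rows i of R[i][j]^2 ) */
double l21_norm(const double *R, int N) {
    double norm = 0.0;
    for (int j = 0; j < N; j++) {
        double col_sq = 0.0;
        for (int i = 0; i < N; i++) {
            double v = R[i * N + j];
            col_sq += v * v;
        }
        norm += sqrt(col_sq);
    }
    return norm;
}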

Part 2: Performance Evaluation


To evaluate the performance of my LU decomposition implementation, I measured execution time on a
matrix of size N = 7000 using 1, 2, 4, 8, 16, and 32 threads on a node on nots.rice.edu. The parallel
efficiency E is computed as:

E = T₁ / (p × Tₚ)

where T₁ is the execution time with 1 thread, Tₚ is the execution time with p threads, and p is the number
of threads.

Timing and Efficiency Results:


Threads    Execution Time (s)    Parallel Efficiency
1          97.608                1.1004
2          54.9717               0.9769
4          34.736                0.7730
8          30.0239               0.4472
16         29.5341               0.2273
32         30.5015               0.1100
The parallel efficiency data shows a clear downward trend as the number of threads increases. Efficiency starts at 1.1004 for 1 thread (a value above 1.0 suggests that the baseline T₁ in the efficiency formula was taken from a separate serial run rather than from the single-threaded OpenMP execution) and remains relatively strong at 0.9769 for 2 threads. However, it drops more noticeably to 0.7730 for 4 threads and continues to decline significantly.

This pattern reflects increasing overhead from synchronization and memory contention. For example,
during pivot selection, threads compete to update shared variables, and as the number of threads grows,
contention in the critical section becomes a limiting factor. Additionally, memory bandwidth becomes
saturated, limiting how effectively threads can access and update matrix elements. These effects
compound as thread count increases, resulting in diminishing performance gains and lower parallel
efficiency. This behavior aligns with Amdahl’s Law, which indicates that the serial portion of an
algorithm imposes a hard limit on scalability.
