Searching and Sorting
Sorting Efficiency
Sorting efficiency depends on balancing coding time, machine time, and memory usage. If a file is
small, simpler sorting methods may be preferable over complex ones designed to save time and
space. However, frequent or extensive file sorting demands more efficient methods to avoid
overwhelming processing time. Sorting time is often measured by critical operations (e.g., key
comparisons and record movements), not in time units. This chapter introduces mathematical
analysis for sorting times, highlighting the significance of file size (n) on sorting performance. For
small files, sorting time increases linearly with file size, but for large files, it increases quadratically,
making efficiency crucial for large datasets.
Main Points:
1. Balancing Considerations:
• Sorting method choice should weigh coding time, execution time, and memory
requirements, adjusting based on file size and task frequency.
• For smaller or one-time tasks, simpler sorting methods may suffice.
2. Critical Operations:
• Sorting time efficiency is often gauged by the count of key comparisons and record
movements rather than direct time.
• For large, complex key comparisons, focusing on reducing these operations is
essential.
3. Mathematical Analysis:
• Sorting time can vary based on the best, worst, and average cases and is influenced
heavily by file size (n).
• For small values of n, time is nearly proportional to n; for larger n, time becomes almost proportional to n², significantly impacting performance.
4. Efficiency Impact for Large Datasets:
• As n grows, quadratic terms (e.g., 0.01n²) dominate, so multiplying n by 10 multiplies the sorting time by roughly 100.
• For very large files, choosing an efficient sorting algorithm becomes critical to
manage time and space requirements effectively.
Big O Notation
Big-O notation (O) is a way to describe the asymptotic behavior of functions, which measures
how the runtime or space requirement of an algorithm grows relative to its input size n. When
we say f(n)=O(g(n)), we mean that for large enough n, f(n) grows at most as fast as g(n), up to
a constant factor.
1. Definition of Big-O:
• A function f(n) is O(g(n)) if there are constants a and b such that f(n) ≤ a·g(n) for all n ≥ b.
• For example, f(n) = 3n² + 5n is O(n²): take a = 4 and b = 5, since 3n² + 5n ≤ 4n² whenever n ≥ 5.
• This notation captures that f(n) grows at most as fast as g(n), ignoring constant factors and lower-order terms.
2. Asymptotic Bounds:
• If f(n) grows slower than g(n) as n increases, f(n) is said to be bounded by g(n) or to
be of a smaller order.
• Functions may be bounded by many others, but we usually use the "closest fit,"
ignoring constants and focusing on the dominant term.
3. Transitivity of Big-O:
• If f(n)=O(g(n)) and g(n)=O(h(n)), then f(n)=O(h(n)), allowing us to extend bounds
across multiple functions.
4. Hierarchy of Functions:
• Constant functions O(1) are the smallest, followed by logarithmic O(log n), linear O(n), polynomial O(nᵏ), and exponential O(dⁿ), where d > 1.
• Polynomial functions grow slower than exponential functions, making exponential-time algorithms impractical for large inputs.
5. Significance of Polynomial vs. Exponential Growth:
• Polynomial algorithms (e.g., O(nᵏ)) are generally feasible, while exponential algorithms (e.g., O(2ⁿ)) grow too rapidly to solve large problems effectively on current computing hardware.
6. Logarithmic Growth:
• All logarithmic functions (e.g., log₂ n and log₁₀ n) are of the same order, denoted O(log n), as they differ only by a constant factor.
Sorting Techniques
Bubble Sort
Space Complexity
Bubble Sort requires minimal additional space—only a few integer variables and one temporary
variable for swaps—making its space complexity O(1). This low memory usage can be
advantageous in environments with limited space.
Summary
While Bubble Sort has a straightforward implementation and minimal memory needs, it is usually avoided for large datasets due to its O(n²) time complexity in average and worst-case scenarios. Its primary redeeming features are its O(n) efficiency in the best case (already sorted arrays) and its simplicity, making it useful for educational purposes and small datasets.
Code:-
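The notes mark a code slot here but none survives; a minimal C sketch of Bubble Sort with the early-exit flag that yields the O(n) best case on already sorted input:

#include <stdio.h>

void bubbleSort(int arr[], int n) {
    for (int i = 0; i < n - 1; i++) {
        int swapped = 0;
        for (int j = 0; j < n - 1 - i; j++) {
            if (arr[j] > arr[j + 1]) { // out of order: swap the pair
                int tmp = arr[j]; arr[j] = arr[j + 1]; arr[j + 1] = tmp;
                swapped = 1;
            }
        }
        if (!swapped) break; // a pass with no swaps means the array is sorted
    }
}

int main() {
    int arr[] = {25, 57, 48, 37, 12, 92, 86, 33};
    int n = sizeof(arr) / sizeof(arr[0]);
    bubbleSort(arr, n);
    for (int i = 0; i < n; i++) printf("%d ", arr[i]);
    printf("\n");
    return 0;
}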
Quicksort
Quicksort Algorithm Overview
Quicksort is a widely used sorting algorithm based on the partition-exchange method. It follows a
divide-and-conquer strategy, dividing the array into smaller subarrays that can be sorted
independently. Here's a comprehensive breakdown of its key components:
1. Basic Concept
• The primary idea behind Quicksort is to select a 'pivot' element from the array. The elements
are then rearranged so that those less than the pivot are on its left, and those greater than the
pivot are on its right. This process continues recursively for the resulting subarrays until the
entire array is sorted.
2. Choosing the Pivot
• The pivot can be chosen in various ways (e.g., first element, last element, median of three)
depending on the implementation. The choice of pivot can significantly impact the
algorithm's performance. A well-chosen pivot can lead to more balanced partitions, which is
crucial for efficiency.
3. Partitioning Process
• The partitioning involves several steps:
• Initialization: Two pointers are established: down starts at the lower bound of the
array, and up starts at the upper bound.
• Moving Pointers:
• Increment the down pointer until an element greater than the pivot is found.
• Decrement the up pointer until an element less than the pivot is found.
• If up is greater than down, swap the elements at these pointers.
• This process continues until the pointers cross. Finally, the pivot is swapped with the
element at the up pointer, which indicates the final position of the pivot.
4. Recursive Sorting
• After the partitioning, the pivot is in its final position. The algorithm recursively sorts the
left subarray (elements less than the pivot) and the right subarray (elements greater than the
pivot).
• The recursion stops when subarrays are reduced to sizes of 0 or 1, which are inherently
sorted.
5. Algorithm Implementation
• The code later in this section outlines a recursive implementation of the Quicksort algorithm. It includes a partition function that rearranges the elements around the pivot and two recursive calls for the left and right subarrays.
• A non-recursive version can also be implemented using a stack to keep track of subarray
indices to minimize function call overhead.
6. Optimizations
• To enhance performance, several strategies can be employed:
• Median-of-Three Pivoting: Choosing the pivot as the median of the first, last, and middle elements to reduce the likelihood of unbalanced partitions (a sketch follows this list).
• Mean Sort: Using the mean of the subarrays as the pivot in subsequent partitions,
leading to better-balanced subarrays over time.
• Bsort Technique: This optimization ensures that the smallest and largest elements in
subarrays are correctly placed without the need for further sorting. Small subarrays
of sizes 2 or 3 can be sorted directly, bypassing the need for partitioning altogether.
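A minimal C sketch of median-of-three pivot selection (the helper names are illustrative, not from the notes): order the first, middle, and last elements, then use the middle one as the pivot.

static void swapInts(int *a, int *b) { int t = *a; *a = *b; *b = t; }

// Order arr[lb], arr[mid], arr[ub] so the median lands at mid; using
// arr[mid] as the pivot makes badly unbalanced partitions less likely
int medianOfThree(int arr[], int lb, int ub) {
    int mid = lb + (ub - lb) / 2;
    if (arr[mid] < arr[lb]) swapInts(&arr[mid], &arr[lb]);
    if (arr[ub] < arr[lb]) swapInts(&arr[ub], &arr[lb]);
    if (arr[ub] < arr[mid]) swapInts(&arr[ub], &arr[mid]);
    return mid;
}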
7. Efficiency Considerations
• Quicksort has an average time complexity of O(n log n), but its worst-case time complexity is O(n²), typically occurring with poorly chosen pivots (e.g., already sorted data when the first element is used as the pivot).
• Despite this, it is often faster in practice than other O(n log n) algorithms like Merge Sort or Heap Sort due to its smaller constant factors and locality of reference.
Conclusion
Quicksort is favored for its efficiency and speed in average cases, along with its recursive nature
that simplifies implementation. It remains a fundamental algorithm in computer science,
particularly for sorting tasks, with numerous variations and optimizations to suit different scenarios.
Understanding its mechanics and strategies for enhancing performance is crucial for effective
application in software development and data processing.
CODE:-
#include <stdio.h>

// Partition arr[lb..ub] around arr[lb] as pivot; return its final index
int partition(int arr[], int lb, int ub) {
    int pivot = arr[lb];
    int down = lb, up = ub;
    while (down < up) {
        while (down < ub && arr[down] <= pivot) down++; // find element > pivot
        while (arr[up] > pivot) up--;                   // find element <= pivot
        if (down < up) { // pointers have not crossed: swap
            int tmp = arr[down]; arr[down] = arr[up]; arr[up] = tmp;
        }
    }
    arr[lb] = arr[up]; // place the pivot in its final position
    arr[up] = pivot;
    return up;
}

// Quicksort function
void quicksort(int arr[], int lb, int ub) {
    if (lb < ub) {
        // Partition the array and get the pivot index
        int pivotIndex = partition(arr, lb, ub);
        // Recursively sort the subarrays
        quicksort(arr, lb, pivotIndex - 1); // Left subarray
        quicksort(arr, pivotIndex + 1, ub); // Right subarray
    }
}

void printArray(int arr[], int n) {
    for (int i = 0; i < n; i++) printf("%d ", arr[i]);
    printf("\n");
}

int main() {
    int arr[] = {25, 57, 48, 37, 12, 92, 86, 33};
    int n = sizeof(arr) / sizeof(arr[0]);
    printf("Original array:\n");
    printArray(arr, n);
    quicksort(arr, 0, n - 1);
    printf("Sorted array:\n");
    printArray(arr, n);
    return 0;
}
The efficiency of Quicksort can be summarized through its time complexity, average performance, and how it behaves under different conditions:
Space Complexity
• The space complexity is O(log n), due to the stack space used by recursive calls in the average case.
• In the worst case, it can go up to O(n) if the recursion stack becomes as deep as the number
of elements (when partitions are unbalanced).
Practical Considerations
• Quicksort is often the fastest sorting algorithm in practice due to its low overhead and
average-case performance.
• For nearly sorted data, strategies like the median-of-three or switching to insertion sort for
small subarrays can further improve performance.
Conclusion
Quicksort is a highly efficient sorting algorithm with an average time complexity of O(n log n),
making it suitable for a wide variety of data sets. It excels particularly in cases of random or
unsorted data, but care must be taken with sorted or nearly sorted data to avoid worst-case
performance. Proper pivot selection strategies are crucial in mitigating these risks and enhancing
performance.
Heapsort
Heapsort is an efficient sorting algorithm that relies on the structure of a binary heap to achieve an
optimal time complexity of O(n log n), making it an attractive choice for scenarios where a
guaranteed performance is needed, regardless of the initial order of input data. Here's a breakdown
of the key concepts, efficiency, and steps involved in Heapsort:
Key Concepts
1. Heap Definition:
• A max heap (descending heap) is an almost complete binary tree where each node
has a value greater than or equal to its children, ensuring that the largest element is
always at the root.
• A min heap (ascending heap) has each node value smaller than or equal to its
children, placing the smallest element at the root.
2. Binary Heap as a Priority Queue:
• Binary heaps allow for efficient priority queue operations. Insertion (pqinsert) and deletion of the maximum/minimum element (pqmaxdelete or pqmindelete) can both be done in O(log n) time, compared with the roughly n/2 elements examined on average in a sequentially ordered list.
3. Sequential Representation:
• The binary heap can be efficiently represented in an array. Given an element at index
j:
• The parent element is located at index (j−1)/2 (integer division).
• The left child is at 2j+1.
• The right child is at 2j+2.
• This array-based representation enables Heapsort to operate in-place, meaning it
sorts the array without needing additional storage except for some program variables.
Efficiency of Heapsort
1. Time Complexity:
• Best, Average, and Worst Case: Heapsort is O(n log n) in all cases, since building the heap is O(n) and each of the n extraction steps in the sorting phase costs O(log n).
• Heap Construction: Building a heap from an unsorted array takes O(n).
• Heap Maintenance During Sorting: After extracting the root (maximum or minimum) and placing it at the end of the array, re-heaping the reduced heap takes O(log n) operations per element, giving O(n log n) for the full sorting process.
2. Space Complexity:
• Heapsort is an in-place sorting algorithm, requiring only a constant amount of
additional space, O(1).
Practical Applications
Heapsort is ideal when:
• Consistent performance is essential, especially if input data characteristics are unknown.
• Additional memory allocation is constrained, as it sorts in-place.
In summary, Heapsort offers an efficient, predictable sorting solution with a worst-case time complexity of O(n log n), making it advantageous for applications where consistent performance matters more than the best constant factors.
CODE:-
#include <stdio.h>

void heapSort(int arr[], int n);   // sketched below
void printArray(int arr[], int n); // sketched below
int main() {
int arr[] = {12, 11, 13, 5, 6, 7};
int n = sizeof(arr) / sizeof(arr[0]);
printf("Original array:\n");
printArray(arr, n);
heapSort(arr, n);
printf("Sorted array:\n");
printArray(arr, n);
return 0;
}
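The driver assumes heapSort and printArray helpers that don't appear in the notes; a minimal sketch of both, using the parent/child index arithmetic described above (the heapify name is illustrative):

// Sift the element at index i down until the max-heap property holds
void heapify(int arr[], int n, int i) {
    int largest = i, left = 2 * i + 1, right = 2 * i + 2;
    if (left < n && arr[left] > arr[largest]) largest = left;
    if (right < n && arr[right] > arr[largest]) largest = right;
    if (largest != i) {
        int tmp = arr[i]; arr[i] = arr[largest]; arr[largest] = tmp;
        heapify(arr, n, largest);
    }
}

void heapSort(int arr[], int n) {
    // Build a max heap from the unsorted array: O(n)
    for (int i = n / 2 - 1; i >= 0; i--)
        heapify(arr, n, i);
    // Repeatedly move the root (maximum) to the end and re-heap: O(n log n)
    for (int i = n - 1; i > 0; i--) {
        int tmp = arr[0]; arr[0] = arr[i]; arr[i] = tmp;
        heapify(arr, i, 0);
    }
}

void printArray(int arr[], int n) {
    for (int i = 0; i < n; i++) printf("%d ", arr[i]);
    printf("\n");
}

A second snippet in the notes uses the same heap structure as a priority queue: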
#include <stdio.h>
#define MAX 100

void pqInsert(int heap[], int *k, int newElem); // sketched below
void adjustHeap(int heap[], int k);             // sketched below

// Function to delete and return the maximum element from the priority queue
int pqMaxDelete(int heap[], int *k) {
    int maxElement = heap[0]; // Max element is at the root
    heap[0] = heap[--(*k)];   // Replace root with last element and decrease size
    adjustHeap(heap, *k);     // Re-adjust the heap to maintain max-heap property
    return maxElement;
}

int main() {
    int heap[MAX]; // Array representing the priority queue as a max heap
    int k = 0;     // Current size of the heap
    // Insert elements
    pqInsert(heap, &k, 20);
    pqInsert(heap, &k, 15);
    pqInsert(heap, &k, 30);
    pqInsert(heap, &k, 10);
    pqInsert(heap, &k, 25);
    // Repeatedly delete the maximum: prints 30 25 20 15 10
    while (k > 0)
        printf("%d ", pqMaxDelete(heap, &k));
    printf("\n");
    return 0;
}
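The pqInsert and adjustHeap helpers are assumed above but not shown in the notes; a minimal sketch consistent with the 0-based array representation described earlier:

// Insert newElem into the heap of size *k, sifting it up to its place
void pqInsert(int heap[], int *k, int newElem) {
    int i = (*k)++;
    while (i > 0 && heap[(i - 1) / 2] < newElem) {
        heap[i] = heap[(i - 1) / 2]; // move the smaller parent down
        i = (i - 1) / 2;
    }
    heap[i] = newElem;
}

// Sift the root down so that heap[0..k-1] is a max heap again
void adjustHeap(int heap[], int k) {
    int i = 0, elem = heap[0];
    while (2 * i + 1 < k) {
        int child = 2 * i + 1;                 // left child
        if (child + 1 < k && heap[child + 1] > heap[child])
            child++;                            // right child is larger
        if (elem >= heap[child]) break;
        heap[i] = heap[child];                  // move the larger child up
        i = child;
    }
    heap[i] = elem;
}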
Insertion Sort
Simple Insertion Sort:
• Sorts an array by inserting each element into an already sorted subarray.
• For sorted data, it has a time complexity of O(n); for reverse-ordered data, it's O(n²).
• Better than bubble sort and effective if the input is almost sorted.
• Optimizations:
• Binary Search: Reduces comparisons to O(n log n) overall by finding each insertion position quickly, but doesn't improve overall time since element shifting still takes O(n²).
• List Insertion: Uses a linked list, allowing insertion without shifting array elements. This
reduces replacement operations but not comparisons. Extra space is needed for the link
array.
• Selection of Sorts:
• Small Files: Use selection sort for large records and simple keys (less shifting), and
insertion sort if comparisons are more expensive.
• Larger Data: Use heapsort or quicksort. Quicksort is optimal for arrays with over 30
elements, while heapsort is more efficient than insertion for sizes above 60-70.
• Hybrid Approach:
• In quicksort, insertion sort can speed up small subarrays (fewer than 20 elements).
CODE:-
Here are driver programs in C for Simple Insertion Sort, Binary Insertion Sort, and List Insertion Sort; a sketch of the sort routine each driver assumes follows each snippet.
#include <stdio.h>

void insertionSort(int arr[], int n); // sketched below
void printArray(int arr[], int n);    // sketched below
int main() {
int arr[] = {12, 11, 13, 5, 6};
int n = sizeof(arr) / sizeof(arr[0]);
insertionSort(arr, n);
printf("Sorted array: \n");
printArray(arr, n);
return 0;
}
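The notes don't include the helpers this driver assumes; a minimal sketch of both:

void insertionSort(int arr[], int n) {
    for (int i = 1; i < n; i++) {
        int key = arr[i], j = i - 1;
        // Shift larger elements of the sorted prefix one place right
        while (j >= 0 && arr[j] > key) {
            arr[j + 1] = arr[j];
            j--;
        }
        arr[j + 1] = key; // insert the element into its position
    }
}

void printArray(int arr[], int n) {
    for (int i = 0; i < n; i++) printf("%d ", arr[i]);
    printf("\n");
}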
#include <stdio.h>

void binaryInsertionSort(int arr[], int n); // sketched below
void printArray(int arr[], int n);          // as in the previous sketch
int main() {
int arr[] = {12, 11, 13, 5, 6};
int n = sizeof(arr) / sizeof(arr[0]);
binaryInsertionSort(arr, n);
printf("Sorted array: \n");
printArray(arr, n);
return 0;
}
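A minimal sketch of the assumed binaryInsertionSort: binary search finds each insertion point in O(log n) comparisons, though shifting still costs O(n²) in the worst case:

void binaryInsertionSort(int arr[], int n) {
    for (int i = 1; i < n; i++) {
        int key = arr[i];
        int low = 0, high = i - 1;
        while (low <= high) { // binary search for the insertion point
            int mid = (low + high) / 2;
            if (arr[mid] <= key) low = mid + 1;
            else high = mid - 1;
        }
        for (int j = i - 1; j >= low; j--) // shift to make room
            arr[j + 1] = arr[j];
        arr[low] = key;
    }
}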
#include <stdio.h>
#include <stdlib.h>

struct Node {
    int data;
    struct Node* next;
};

void insertionSortLinkedList(int arr[], int n); // sketched below
void printArray(int arr[], int n);              // as in the first sketch
int main() {
int arr[] = {12, 11, 13, 5, 6};
int n = sizeof(arr) / sizeof(arr[0]);
insertionSortLinkedList(arr, n);
printf("Sorted array: \n");
printArray(arr, n);
return 0;
}
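A minimal sketch of the assumed insertionSortLinkedList (printArray as in the first sketch): it builds a sorted linked list from the array, inserting each node without shifting elements, then copies the result back:

void insertionSortLinkedList(int arr[], int n) {
    struct Node* head = NULL;
    for (int i = 0; i < n; i++) {
        struct Node* node = malloc(sizeof(struct Node));
        node->data = arr[i];
        if (head == NULL || head->data >= arr[i]) { // insert at the front
            node->next = head;
            head = node;
        } else {                                    // walk to the insertion point
            struct Node* cur = head;
            while (cur->next != NULL && cur->next->data < arr[i])
                cur = cur->next;
            node->next = cur->next;
            cur->next = node;
        }
    }
    for (int i = 0; i < n; i++) { // copy back in sorted order and free nodes
        struct Node* next = head->next;
        arr[i] = head->data;
        free(head);
        head = next;
    }
}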
These implementations cover:
• Simple Insertion Sort: Direct insertion into a sorted part of the array.
• Binary Insertion Sort: Finds position with binary search for efficiency in comparison.
• List Insertion Sort: Uses a linked list to avoid shifting elements.
MergeSort
In merge sort's merge step, we are given two arrays that are already sorted; we repeatedly compare their front elements, append the smaller one to a third array, and then print the result.
Code:-
#include <stdio.h>

// Merge two sorted arrays a and b into c
void mergeArrays(int a[], int b[], int c[], int n1, int n2) {
    int i = 0, j = 0, k = 0;
    while (i < n1 && j < n2)        // take the smaller front element
        c[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < n1) c[k++] = a[i++]; // copy any leftovers of a
    while (j < n2) c[k++] = b[j++]; // copy any leftovers of b
}

int main() {
    int a[] = {1, 3, 5, 7};
    int b[] = {2, 4, 6, 8};
    int n1 = sizeof(a) / sizeof(a[0]);
    int n2 = sizeof(b) / sizeof(b[0]);
    int c[n1 + n2];
    mergeArrays(a, b, c, n1, n2);
    for (int k = 0; k < n1 + n2; k++) printf("%d ", c[k]);
    printf("\n");
    return 0;
}
COOK-KIM ALGORITHM
The Cook-Kim algorithm is a sorting method optimized for nearly sorted files. Here’s a summary of
the key steps and principles:
1. Applicability: It’s particularly efficient for nearly sorted data or smaller sorted files, where
simple insertion sort is typically fast. For larger, more sorted files, it outperforms even
middle-element quicksort.
2. Process:
• The algorithm scans the input for unordered pairs (where an element is greater than
the following one).
• These unordered pairs are removed and placed at the end of a new array.
• After removing an unordered pair, the algorithm resumes scanning from the elements immediately neighboring the removed pair, leaving the original array sorted once all unordered pairs are removed (a C sketch of this extraction step follows this section).
3. Sorting Unordered Elements:
• The newly created array of unordered pairs is sorted. Middle-element quicksort is
used if the array has over 30 elements; otherwise, simple insertion sort is used.
4. Final Merge:
• The two arrays (the originally sorted array and the newly sorted unordered pairs
array) are merged to produce the fully sorted output.
5. Advantages: The Cook-Kim algorithm leverages the pre-sorted state of input more
effectively than other sorting methods, making it faster than quicksort, insertion sort, and
mergesort for nearly sorted data.
6. Limitations: For randomly ordered data, Cook-Kim is less efficient than standard sorts like
mergesort and quicksort, which remain preferable for general cases.
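A rough C sketch of the pair-extraction step from point 2, reconstructed only from the description above (the function name and details are illustrative):

// Remove unordered pairs from a[0..*n-1], appending them to out[];
// returns how many elements were moved. Afterwards a[] is sorted.
int extractUnorderedPairs(int a[], int *n, int out[]) {
    int m = 0, i = 0;
    while (i < *n - 1) {
        if (a[i] > a[i + 1]) {
            out[m++] = a[i];                 // remove the unordered pair...
            out[m++] = a[i + 1];
            for (int j = i; j < *n - 2; j++) // ...and close the gap
                a[j] = a[j + 2];
            *n -= 2;
            if (i > 0) i--;                  // resume from the left neighbor
        } else {
            i++;
        }
    }
    return m;
}

The out[] array is then sorted (quicksort above 30 elements, insertion sort otherwise) and merged with the remaining sorted a[].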
SEARCHING
Theory of Linear Search:
Linear search, also known as sequential search, is a straightforward search algorithm used to find
the position of a target value within a list. In linear search, we start from the beginning of the array
or list and check each element one by one until we find the target value. If the target is found, the
algorithm stops and returns the index of the target. If it reaches the end without finding the target, it
returns an indication that the target is not present in the list.
Characteristics of Linear Search:
• Time Complexity: O(n), where n is the number of elements in the list. This is because, in
the worst case, we might have to check every element.
• Best Case: O(1) if the target is found at the first position.
• Worst Case: O(n) if the target is found at the last position or is not present in the list at all.
• Use Case: Linear search is typically used for small or unsorted lists. It’s simple and doesn’t
require any additional memory or complex data structures.
CODE:-
#include <stdio.h>

int linearSearch(int arr[], int size, int target); // sketched below

int main() {
    int arr[] = {34, 67, 23, 89, 1, 90};
    int size = sizeof(arr) / sizeof(arr[0]);
    int target = 23;
    int index = linearSearch(arr, size, target);
    if (index >= 0) printf("Found %d at index %d\n", target, index);
    else printf("%d not found\n", target);
    return 0;
}
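The driver assumes a linearSearch helper that the notes don't include; a minimal sketch:

// Scan the array from the front; return the index of target, or -1
int linearSearch(int arr[], int size, int target) {
    for (int i = 0; i < size; i++)
        if (arr[i] == target)
            return i;
    return -1;
}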
#include <stdio.h>

int binarySearch(int arr[], int size, int key); // sketched below

int main() {
    int arr[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}; // Sorted array
    int size = sizeof(arr) / sizeof(arr[0]);
    int key = 5; // Element to search for
    int index = binarySearch(arr, size, key);
    if (index >= 0) printf("Found %d at index %d\n", key, index);
    else printf("%d not found\n", key);
    return 0;
}
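The second driver operates on a sorted array, so it presumably demonstrates binary search; a minimal sketch of the assumed helper:

// Repeatedly halve the search interval of a sorted array;
// return the index of key, or -1 if it is absent
int binarySearch(int arr[], int size, int key) {
    int low = 0, high = size - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2; // avoids overflow of (low + high) / 2
        if (arr[mid] == key) return mid;
        if (arr[mid] < key) low = mid + 1;
        else high = mid - 1;
    }
    return -1;
}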
Interpolation Search
Overview
• Interpolation search is a searching technique for ordered arrays, especially effective when
keys are uniformly distributed.
• It can outperform binary search by estimating the position of the key based on its value
relative to the values of the endpoints of the search interval.
Algorithm Steps
1. Set initial boundaries: low = 0 and high = n - 1.
2. Calculate the estimated position (mid) using the formula:
mid = low + (high − low) × (key − k(low)) / (k(high) − k(low))
3. Compare the key with the value at the mid position:
• If equal, the search is successful.
• If the key is lower, adjust high = mid - 1.
• If the key is higher, adjust low = mid + 1.
4. Repeat until the key is found or low > high.
Efficiency
• When keys are uniformly distributed, interpolation search can require an average of O(log log n) comparisons, which is more efficient than the O(log n) required for binary search.
• However, if keys are not uniformly distributed, performance can degrade significantly,
potentially leading to a worst-case scenario similar to sequential search.
Robust Interpolation Search
• A variation called robust interpolation search seeks to improve performance with non-
uniform key distributions.
• It introduces a gap to ensure that the estimated position (mid) is a minimum distance from
the boundaries, preventing clustering issues.
• The algorithm dynamically adjusts the gap based on the size of the current interval:
• If the key is found in the smaller interval, reset the gap to the square root of the new
interval size.
• If found in the larger interval, double the gap but keep it within half the interval size.
Performance Comparison
• Robust interpolation search has an expected number of comparisons of O(log log n) for random distributions, outperforming both binary and standard interpolation search.
• In experiments, robust interpolation search was shown to require fewer comparisons than
binary search in practical scenarios (e.g., 12.5 comparisons versus 36 for binary search in a
list of 40,000 names).
Limitations
• The overhead of managing the gap can be substantial.
• In the worst case, robust interpolation search can require O(n) comparisons, which is worse than binary search's worst-case O(log n).
• The arithmetic involved in interpolation search can be computationally expensive compared
to the simpler comparisons in binary search.
CODE:-
#include <stdio.h>

// Interpolation search on a sorted array; returns the index of key, or -1
int interpolationSearch(int arr[], int size, int key) {
    int low = 0, high = size - 1;
    while (low <= high && key >= arr[low] && key <= arr[high]) {
        if (arr[high] == arr[low]) // flat interval: avoid division by zero
            return (arr[low] == key) ? low : -1;
        // Estimate the position of the key
        int pos = low + ((high - low) * (key - arr[low])) / (arr[high] - arr[low]);
        if (arr[pos] == key) return pos;
        if (arr[pos] < key) low = pos + 1;
        else high = pos - 1;
    }
    return -1;
}

int main() {
    int arr[] = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}; // Sorted array
    int size = sizeof(arr) / sizeof(arr[0]);
    int key = 70; // Element to search for
    int pos = interpolationSearch(arr, size, key);
    if (pos >= 0) printf("Found %d at index %d\n", key, pos);
    else printf("%d not found\n", key);
    return 0;
}
Example in Python
Here's an example of sorting records on multiple keys in Python: a list of dictionaries (representing employees) is sorted first by last name and then by first name:
employees = [
    {'first_name': 'John', 'last_name': 'Doe'},
    {'first_name': 'Jane', 'last_name': 'Smith'},
    {'first_name': 'Alice', 'last_name': 'Doe'},
    {'first_name': 'Bob', 'last_name': 'Smith'},
]

# Sort by last name, then first name (tuples compare lexicographically)
employees.sort(key=lambda e: (e['last_name'], e['first_name']))

for e in employees:
    print(e['first_name'], e['last_name'])
Output:
Alice Doe
John Doe
Bob Smith
Jane Smith
Application in Databases
In databases, sorting on different keys is a common operation performed through SQL queries. For
instance:
SELECT * FROM Employees
ORDER BY last_name ASC, first_name ASC;
This SQL query sorts the employees first by last name in ascending order and then by first name in
ascending order.
EXTERNAL SORTING
External sorting is a class of algorithms used for sorting large data sets that do not fit into the main
memory (RAM) of a computer. It is particularly useful when dealing with very large files or
databases that exceed the available memory, necessitating the use of external storage (like hard
disks) to perform the sorting.
A common approach is external merge sort: first sort memory-sized chunks of the file into sorted "runs" on disk, then merge the runs with a min-heap (a k-way merge), as in this pseudocode:

// Initialize the heap with the first element of each sorted run
for each run in sortedRuns:
    minHeap.insert(run.firstElement())

// Repeatedly move the smallest remaining element to the output
while minHeap is not empty:
    smallest = minHeap.extractMin()
    outputFile.write(smallest)
    // If there is a next element in the same run, insert it into the heap
    if nextElementExists(smallest.run):
        nextElement = smallest.run.getNextElement()
        minHeap.insert(nextElement)

return outputFile
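For illustration, here is a small in-memory C simulation of the merge phase with three runs. A real external sort would stream runs from disk and use a min-heap; this sketch uses a linear scan over the run heads for brevity:

#include <stdio.h>
#define K 3 // number of sorted runs

int main(void) {
    int runs[K][4] = {{1, 5, 9, 13}, {2, 6, 10, 14}, {3, 4, 11, 12}};
    int pos[K] = {0, 0, 0}; // read cursor into each run
    int len[K] = {4, 4, 4};

    for (int produced = 0; produced < K * 4; produced++) {
        int best = -1;
        for (int r = 0; r < K; r++) // find the run with the smallest head
            if (pos[r] < len[r] &&
                (best < 0 || runs[r][pos[r]] < runs[best][pos[best]]))
                best = r;
        printf("%d ", runs[best][pos[best]++]); // "write" to the output
    }
    printf("\n");
    return 0;
}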