SQL | Remove Duplicates without Distinct
Last Updated: 02 Dec, 2024
In SQL, removing duplicate records is a common task, but the DISTINCT keyword can sometimes lead to performance issues, especially with large datasets. To find duplicates, the query engine typically has to sort or hash the selected rows and compare every selected column, which can increase the processing load.
In this article, we’ll explain various alternatives to remove duplicates in SQL, including ROW_NUMBER(), self-joins, GROUP BY, DISTINCT ON (PostgreSQL), and EXCEPT. Each method is explained with an example.
Why Remove Duplicates in SQL?
Duplicate records can lead to incorrect data analysis and reporting, and can increase storage requirements. Therefore, removing duplicate records ensures better data integrity and more efficient database operations. Fortunately, there are other methods to remove duplicates from SQL queries without using DISTINCT, and they can be more efficient.
1. Remove Duplicates Using ROW_NUMBER()
The ROW_NUMBER() function assigns a sequential number (1, 2, 3, and so on) to each row within a partition of a result set. If we partition by the columns that define a duplicate, every row numbered above 1 is a duplicate that can be removed.
Example
Let’s assume we have a table named Employees, and we want to remove duplicate rows based on the EmployeeName, EmployeeAddress, and EmployeeSex columns.
-- SQL Server syntax: deleting through the CTE deletes the rows from Employees
WITH CTE AS (
    SELECT EmployeeID, EmployeeName, EmployeeAddress, EmployeeSex,
           ROW_NUMBER() OVER (
               PARTITION BY EmployeeName, EmployeeAddress, EmployeeSex
               ORDER BY EmployeeID
           ) AS RowNum
    FROM Employees
)
DELETE FROM CTE WHERE RowNum > 1;
Explanation:
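- The ROW_NUMBER() function numbers the rows within each partition, that is, within each group of rows that share the same EmployeeName, EmployeeAddress, and EmployeeSex. Because of ORDER BY EmployeeID, the row with the lowest EmployeeID gets number 1.
- The CTE (Common Table Expression) exposes that number as RowNum, and DELETE removes every row with RowNum > 1, keeping only the first occurrence in each group.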
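Deleting through a CTE is not accepted by every database. As a minimal sketch of the same ROW_NUMBER() idea in a more portable form, the Python script below runs the logic against an in-memory SQLite database (window functions need SQLite 3.25 or newer). The sample rows are invented for illustration, and the DELETE uses an IN subquery instead of the CTE.

```python
import sqlite3

# Hypothetical sample data; the columns mirror the Employees example.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employees (
        EmployeeID      INTEGER PRIMARY KEY,
        EmployeeName    TEXT,
        EmployeeAddress TEXT,
        EmployeeSex     TEXT
    )
""")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?)",
    [
        (1, "Asha", "Pune", "F"),
        (2, "Asha", "Pune", "F"),    # duplicate of row 1
        (3, "Ravi", "Delhi", "M"),
        (4, "Asha", "Pune", "F"),    # duplicate of row 1
        (5, "Ravi", "Mumbai", "M"),  # same name, different address: kept
    ],
)

# Number the rows inside each group of identical values, then delete
# every row whose number is greater than 1.
conn.execute("""
    DELETE FROM Employees
    WHERE EmployeeID IN (
        SELECT EmployeeID
        FROM (
            SELECT EmployeeID,
                   ROW_NUMBER() OVER (
                       PARTITION BY EmployeeName, EmployeeAddress, EmployeeSex
                       ORDER BY EmployeeID
                   ) AS RowNum
            FROM Employees
        )
        WHERE RowNum > 1
    )
""")

remaining_ids = [row[0] for row in conn.execute(
    "SELECT EmployeeID FROM Employees ORDER BY EmployeeID")]
print(remaining_ids)  # [1, 3, 5]
```

Rows 2 and 4 are removed because they repeat row 1 in all three partition columns, while row 5 survives because its address differs.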
2. Remove Duplicates Using a Self-Join
A self-join involves joining a table to itself to identify and remove duplicates based on specific criteria. This method is ideal for comparing columns within the same table.
Example
Let’s consider the Employees table again. We’ll use a self-join to remove duplicate entries where EmployeeName and EmployeeAddress are the same.
-- DELETE with a JOIN is accepted by SQL Server and MySQL
DELETE A
FROM Employees A
JOIN Employees B
    ON A.EmployeeName = B.EmployeeName
   AND A.EmployeeAddress = B.EmployeeAddress
WHERE A.EmployeeID > B.EmployeeID;
Explanation:
- The self-join compares records within the same table (aliased as A and B).
- The condition A.EmployeeID > B.EmployeeID matches a row only when an identical row with a lower EmployeeID exists, so the row with the lowest EmployeeID in each group is kept and the later copies are deleted.
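To see which rows the self-join actually pairs up, here is a small sketch against an in-memory SQLite database. The data is made up, and because SQLite does not accept a JOIN directly in DELETE, the self-join is moved into an IN subquery; that rewrite is an adaptation for this sketch, not a different technique.

```python
import sqlite3

# Hypothetical sample data for the self-join example.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employees (
        EmployeeID      INTEGER PRIMARY KEY,
        EmployeeName    TEXT,
        EmployeeAddress TEXT
    )
""")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        (1, "Asha", "Pune"),
        (2, "Asha", "Pune"),   # duplicate of row 1
        (3, "Ravi", "Delhi"),
        (4, "Asha", "Pune"),   # duplicate of row 1
    ],
)

SELF_JOIN = """
    FROM Employees A
    JOIN Employees B
      ON A.EmployeeName = B.EmployeeName
     AND A.EmployeeAddress = B.EmployeeAddress
    WHERE A.EmployeeID > B.EmployeeID
"""

# Preview: each pair is (row to delete, earlier row it duplicates).
pairs = conn.execute(
    "SELECT A.EmployeeID, B.EmployeeID" + SELF_JOIN +
    " ORDER BY A.EmployeeID, B.EmployeeID"
).fetchall()
print(pairs)  # [(2, 1), (4, 1), (4, 2)]

# Delete every row that appears on the A side of the self-join.
conn.execute(
    "DELETE FROM Employees WHERE EmployeeID IN (SELECT A.EmployeeID"
    + SELF_JOIN + ")"
)

remaining_ids = [row[0] for row in conn.execute(
    "SELECT EmployeeID FROM Employees ORDER BY EmployeeID")]
print(remaining_ids)  # [1, 3]
```

Row 4 matches two earlier rows, so it appears twice in the preview, but it is still deleted only once. Running the SELECT first is a cheap way to check the join condition before deleting anything.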
3. Remove Duplicates Using GROUP BY
The GROUP BY clause can be used to remove duplicates by grouping rows with identical values in selected columns. This method is ideal when we want to retain specific records (like the first or last entry) based on aggregate functions.
Example
To remove duplicates based on FirstName, LastName, and MobileNo, we can group by these columns so that each combination is returned only once.
SELECT FirstName, LastName, MobileNo
FROM Customers
GROUP BY FirstName, LastName, MobileNo;
Explanation:
- The GROUP BY clause groups records with the same FirstName, LastName, and MobileNo values into a single output row.
- This removes duplicate entries from the result set and returns only unique combinations of these columns. The rows stored in the Customers table are not changed.
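The point about retaining a specific record with an aggregate function can be sketched as follows. The script assumes a CustomerID key column, which the example above does not show, and uses invented rows in an in-memory SQLite database. MIN(CustomerID) picks the first entry of each group, and the same subquery can then drive a DELETE.

```python
import sqlite3

# Hypothetical sample data; CustomerID is an assumed key column.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Customers (
        CustomerID INTEGER PRIMARY KEY,
        FirstName  TEXT,
        LastName   TEXT,
        MobileNo   TEXT
    )
""")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?, ?)",
    [
        (1, "Meera", "Nair", "9000000001"),
        (2, "Meera", "Nair", "9000000001"),  # duplicate of row 1
        (3, "Arjun", "Rao", "9000000002"),
        (4, "Meera", "Nair", "9000000003"),  # different number: kept
    ],
)

# One output row per combination, plus the first ID and the copy count.
unique_rows = conn.execute("""
    SELECT FirstName, LastName, MobileNo,
           MIN(CustomerID) AS FirstCustomerID,
           COUNT(*)        AS Copies
    FROM Customers
    GROUP BY FirstName, LastName, MobileNo
    ORDER BY FirstCustomerID
""").fetchall()
for row in unique_rows:
    print(row)

# Physically remove the duplicates, keeping the lowest CustomerID per group.
conn.execute("""
    DELETE FROM Customers
    WHERE CustomerID NOT IN (
        SELECT MIN(CustomerID)
        FROM Customers
        GROUP BY FirstName, LastName, MobileNo
    )
""")

remaining_ids = [row[0] for row in conn.execute(
    "SELECT CustomerID FROM Customers ORDER BY CustomerID")]
print(remaining_ids)  # [1, 3, 4]
```

Swapping MIN for MAX would keep the last entry of each group instead of the first.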
4. Remove Duplicates Using DISTINCT ON (PostgreSQL)
For PostgreSQL users, the DISTINCT ON clause is a powerful way to remove duplicates based on specific columns while retaining additional data from the same rows.
Example
SELECT DISTINCT ON (EmployeeName) EmployeeName, EmployeeAddress
FROM Employees
ORDER BY EmployeeName, EmployeeID;
Explanation:
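- The DISTINCT ON clause keeps one row for each EmployeeName and discards the other rows with the same name.
- The ORDER BY clause decides which row is retained. It must start with the DISTINCT ON expression (EmployeeName); the remaining sort key (EmployeeID) then makes the row with the lowest EmployeeID the one that is kept.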
5. Use of EXCEPT to Remove Duplicates
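The EXCEPT operator returns the rows from the first query that are not present in the second query. Because EXCEPT (without ALL) is a set operator, it also returns each remaining row only once, which can be used to eliminate duplicates from a result set.
Example
SELECT EmployeeName, EmployeeAddress FROM Employees
EXCEPT
SELECT EmployeeName, EmployeeAddress FROM Employees WHERE 1 = 0;
Explanation:
- The first query returns the EmployeeName and EmployeeAddress of every row, duplicates included. The second query selects the same columns but returns no rows, because the condition 1 = 0 is never true.
- Nothing is subtracted, but EXCEPT still collapses identical rows, so each combination appears once in the result.
- Subtracting SELECT DISTINCT * FROM Employees from SELECT * FROM Employees would not work: every row of the table also appears in its distinct set, so the result would be empty.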
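How EXCEPT treats duplicates is easy to check on a small table. The sketch below uses invented rows in an in-memory SQLite database and compares two cases: subtracting a table's distinct rows from the table itself, and subtracting an empty set. EXCEPT without ALL returns distinct rows and drops a row as soon as it appears anywhere in the second query.

```python
import sqlite3

# Hypothetical sample data containing one repeated row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeName TEXT, EmployeeAddress TEXT)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("Asha", "Pune"), ("Asha", "Pune"), ("Ravi", "Delhi")],
)

# Case 1: every row is also in the distinct set, so nothing is left.
minus_distinct = conn.execute("""
    SELECT EmployeeName, EmployeeAddress FROM Employees
    EXCEPT
    SELECT DISTINCT EmployeeName, EmployeeAddress FROM Employees
""").fetchall()
print(minus_distinct)  # []

# Case 2: subtracting an empty set leaves each distinct row once.
deduplicated = sorted(conn.execute("""
    SELECT EmployeeName, EmployeeAddress FROM Employees
    EXCEPT
    SELECT EmployeeName, EmployeeAddress FROM Employees WHERE 1 = 0
""").fetchall())
print(deduplicated)  # [('Asha', 'Pune'), ('Ravi', 'Delhi')]
```

The repeated ("Asha", "Pune") row shows up once in the second result, which is the deduplicated output, and not at all in the first.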
Conclusion
There are several ways to remove duplicates in SQL without using the plain DISTINCT keyword. Methods like ROW_NUMBER(), self-joins, GROUP BY, DISTINCT ON, and EXCEPT can help eliminate duplicates and maintain data integrity in our databases. ROW_NUMBER() and self-joins can delete the duplicate rows from the table itself, while GROUP BY, DISTINCT ON, and EXCEPT remove them from a query result. These techniques can be useful in large datasets where using DISTINCT might slow down execution.