How to find duplicate values in a table using SQL?
2 simple and effective ways
One of the common interview questions for any Data Engineering or Data Analyst position is to find duplicate values in a table using SQL.
With this article, I intend to show you 2 easy ways of finding and removing duplicates in a table and explain the concept behind each method.
We will also discuss the various pros and cons of each method and analyze scenarios where they might or might not be useful.
We will be considering the following students table for all our analysis. Our objective is to find any duplicate values in the NAME and AGE columns.
| ID | NAME   | AGE |
|----|--------|-----|
| 1  | Ram    | 9   |
| 2  | Vaani  | 8   |
| 3  | Shyam  | 7   |
| 4  | Ram    | 9   |
| 5  | Shyam  | 7   |
| 6  | Aakash | 10  |
| 7  | Ram    | 9   |
Use the following SQL queries to recreate this table and follow along with this article.
CREATE TABLE students(
    ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- MySQL syntax; other databases use e.g. SERIAL or IDENTITY
    NAME VARCHAR(50),
    AGE INT
);
INSERT INTO students(NAME, AGE) VALUES
('Ram', 9),
('Vaani', 8),
('Shyam', 7),
('Ram', 9),
('Shyam', 7),
('Aakash', 10),
('Ram', 9);
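If you don't have a database server at hand, the sample table can also be recreated with Python's built-in sqlite3 module. This is a sketch for experimenting locally; note that SQLite spells auto-increment differently than MySQL.

```python
import sqlite3

# In-memory database for experimenting; SQLite uses
# INTEGER PRIMARY KEY AUTOINCREMENT rather than AUTO_INCREMENT.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE students(
        ID INTEGER PRIMARY KEY AUTOINCREMENT,
        NAME VARCHAR(50),
        AGE INTEGER
    )
""")
conn.executemany(
    "INSERT INTO students(NAME, AGE) VALUES (?, ?)",
    [("Ram", 9), ("Vaani", 8), ("Shyam", 7), ("Ram", 9),
     ("Shyam", 7), ("Aakash", 10), ("Ram", 9)],
)
print(conn.execute("SELECT COUNT(*) FROM students").fetchone()[0])  # 7
```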
Method 1 - Using Group By
This is one of the most common ways of finding duplicates in your table. The GROUP BY clause groups rows that have the same values in the specified columns, and the HAVING clause then keeps only the groups that occur more than once. You can use this query to verify whether duplicates exist in your tables.
Finding duplicate rows
SELECT NAME, AGE, COUNT(*) AS DUPLICATION_COUNT
FROM students
GROUP BY NAME, AGE
HAVING COUNT(*) > 1;
The above query returns the NAME and AGE combinations that are duplicated, along with their number of occurrences.
| NAME  | AGE | DUPLICATION_COUNT |
|-------|-----|-------------------|
| Ram   | 9   | 3                 |
| Shyam | 7   | 2                 |
Pros - Simple and effective way to find duplicates in a table.
Cons - The above query cannot be used directly to remove duplicates from the table.
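A quick way to sanity-check this query is to run it against an in-memory SQLite copy of the sample table. This is a self-contained sketch, not tied to any particular database server:

```python
import sqlite3

# Rebuild the sample table in memory, then run the GROUP BY duplicate check.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students(ID INTEGER PRIMARY KEY, NAME TEXT, AGE INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                 [(1, "Ram", 9), (2, "Vaani", 8), (3, "Shyam", 7), (4, "Ram", 9),
                  (5, "Shyam", 7), (6, "Aakash", 10), (7, "Ram", 9)])

duplicates = conn.execute("""
    SELECT NAME, AGE, COUNT(*) AS DUPLICATION_COUNT
    FROM students
    GROUP BY NAME, AGE
    HAVING COUNT(*) > 1
    ORDER BY NAME
""").fetchall()
print(duplicates)  # [('Ram', 9, 3), ('Shyam', 7, 2)]
```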
Removing duplicate rows
In order to remove duplicate rows from the table, you can use the following query, which joins the table with itself and deletes every row whose ID is greater than that of a matching row.
DELETE FROM students WHERE ID IN (
    SELECT s1.ID
    FROM students AS s1
    JOIN students AS s2
      ON s1.NAME = s2.NAME AND s1.AGE = s2.AGE
     AND s1.ID > s2.ID
);
After executing this query, your table should contain only the distinct records. For each set of duplicates, the record with the minimum ID is preserved.
| ID | NAME   | AGE |
|----|--------|-----|
| 1  | Ram    | 9   |
| 2  | Vaani  | 8   |
| 3  | Shyam  | 7   |
| 6  | Aakash | 10  |
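To verify this behaviour end to end, here is the same DELETE run against an in-memory SQLite copy of the table (a sketch for verification only):

```python
import sqlite3

# Rebuild the sample table, then delete every row that has a matching
# (NAME, AGE) twin with a smaller ID.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students(ID INTEGER PRIMARY KEY, NAME TEXT, AGE INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                 [(1, "Ram", 9), (2, "Vaani", 8), (3, "Shyam", 7), (4, "Ram", 9),
                  (5, "Shyam", 7), (6, "Aakash", 10), (7, "Ram", 9)])

conn.execute("""
    DELETE FROM students WHERE ID IN (
        SELECT s1.ID
        FROM students AS s1
        JOIN students AS s2
          ON s1.NAME = s2.NAME AND s1.AGE = s2.AGE AND s1.ID > s2.ID
    )
""")
remaining = conn.execute("SELECT ID, NAME, AGE FROM students ORDER BY ID").fetchall()
print(remaining)  # [(1, 'Ram', 9), (2, 'Vaani', 8), (3, 'Shyam', 7), (6, 'Aakash', 10)]
```

One dialect caveat: MySQL rejects a DELETE whose subquery reads the same table (error 1093); a common workaround there is to wrap the subquery in an additional derived table.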
Method 2 - Using ROW_NUMBER()
ROW_NUMBER() is a window function in SQL that assigns a sequential integer to each row in the query's result set. You can also use this window function to search for duplicates.
First, let's see what the ROW_NUMBER() function returns.
SELECT NAME, AGE,
ROW_NUMBER() OVER (PARTITION BY NAME, AGE ORDER BY ID) AS row_value
FROM students
ORDER BY NAME;
The above query returns the following records.
| NAME   | AGE | ROW_VALUE |
|--------|-----|-----------|
| Aakash | 10  | 1         |
| Ram    | 9   | 1         |
| Ram    | 9   | 2         |
| Ram    | 9   | 3         |
| Shyam  | 7   | 1         |
| Shyam  | 7   | 2         |
| Vaani  | 8   | 1         |
The PARTITION BY clause divides the result set returned by the FROM clause into partitions based on NAME and AGE. Each partition represents a unique combination of those columns. The ORDER BY clause sorts the rows within each partition by the ID column, which determines the order in which the sequential integers are assigned.
Finding duplicate rows
To list all the duplicate records, you need to keep only the rows with a ROW_VALUE greater than 1.
WITH base AS
(
    SELECT NAME, AGE,
        ROW_NUMBER() OVER (PARTITION BY NAME, AGE ORDER BY ID) AS row_value
    FROM students
)
SELECT *
FROM base
WHERE row_value > 1;
The above query returns the duplicate rows.
| NAME  | AGE | ROW_VALUE |
|-------|-----|-----------|
| Ram   | 9   | 2         |
| Ram   | 9   | 3         |
| Shyam | 7   | 2         |
Pros - Along with identifying duplicate rows, this method also lets you filter or delete them.
Cons - The ROW_NUMBER() window function can be slightly more expensive than a simple GROUP BY.
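The same listing can be reproduced against an in-memory SQLite database (window functions require SQLite 3.25 or later; a sketch for verification only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students(ID INTEGER PRIMARY KEY, NAME TEXT, AGE INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                 [(1, "Ram", 9), (2, "Vaani", 8), (3, "Shyam", 7), (4, "Ram", 9),
                  (5, "Shyam", 7), (6, "Aakash", 10), (7, "Ram", 9)])

# Number the rows inside each (NAME, AGE) partition, then keep row_value > 1.
duplicates = conn.execute("""
    WITH base AS (
        SELECT NAME, AGE,
               ROW_NUMBER() OVER (PARTITION BY NAME, AGE ORDER BY ID) AS row_value
        FROM students
    )
    SELECT NAME, AGE, row_value
    FROM base
    WHERE row_value > 1
    ORDER BY NAME, row_value
""").fetchall()
print(duplicates)  # [('Ram', 9, 2), ('Ram', 9, 3), ('Shyam', 7, 2)]
```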
Removing duplicate rows
Using the same approach, we can easily filter duplicates out of the result set by selecting only the rows with ROW_VALUE = 1.
WITH base AS
(
    SELECT NAME, AGE,
        ROW_NUMBER() OVER (PARTITION BY NAME, AGE ORDER BY ID) AS row_value
    FROM students
)
SELECT *
FROM base
WHERE row_value = 1;
This returns a distinct set of rows from the table.
| NAME   | AGE | ROW_VALUE |
|--------|-----|-----------|
| Aakash | 10  | 1         |
| Ram    | 9   | 1         |
| Shyam  | 7   | 1         |
| Vaani  | 8   | 1         |
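Here is the same selection verified against an in-memory SQLite copy of the table (a self-contained sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students(ID INTEGER PRIMARY KEY, NAME TEXT, AGE INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                 [(1, "Ram", 9), (2, "Vaani", 8), (3, "Shyam", 7), (4, "Ram", 9),
                  (5, "Shyam", 7), (6, "Aakash", 10), (7, "Ram", 9)])

# Keep only the first row of each (NAME, AGE) partition.
distinct_rows = conn.execute("""
    WITH base AS (
        SELECT NAME, AGE,
               ROW_NUMBER() OVER (PARTITION BY NAME, AGE ORDER BY ID) AS row_value
        FROM students
    )
    SELECT NAME, AGE
    FROM base
    WHERE row_value = 1
    ORDER BY NAME
""").fetchall()
print(distinct_rows)  # [('Aakash', 10), ('Ram', 9), ('Shyam', 7), ('Vaani', 8)]
```

Some engines also let you delete through the CTE directly, for example SQL Server accepts `WITH base AS (...) DELETE FROM base WHERE row_value > 1;`; in others, you would materialize the deduplicated result into a new table instead.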
If you follow other ways of removing duplicates, do share them in the comments below.