One of the common interview questions for any Data Engineering or Data Analyst position is to find duplicate values in a table using SQL.

With this article, I intend to show you 2 easy ways of finding and removing duplicates in a table and explain you the concept behind each of those methods.

Method 1 - Using Group By
Method 2 - Using ROW_NUMBER()

We will also discuss the various pros and cons of each method and analyze scenarios where they might or might not be useful.

We will be considering the following students table to do all our analysis. Our objective is to find any duplicates that exist for columns NAME and AGE.

ID	NAME	AGE
1	Ram	9
2	Vaani	8
3	Shyam	7
4	Ram	9
5	Shyam	7
6	Aakash	10
7	Ram	9

Use the following SQL queries to recreate this table to follow along this article.

CREATE TABLE students(
ID INT NOT NULL AUTOINCREMENT,
NAME VARCHAR,
AGE INTEGER 
);

INSERT INTO students(NAME, AGE) VALUES 
('Ram', 9),
('Vaani', 8),
('Shyam', 7),
('Ram', 9),
('Shyam', 7),
('Aakash', 10),
('Ram', 9);

Method 1 - Using Group By

This is one of the most common ways of finding duplicates in your table. The GROUP BY clause groups rows having the same value. You can use this query to verify if duplicates exist in your tables.

Finding duplicate rows

SELECT NAME, AGE, COUNT(*) AS DUPLICATION_COUNT
FROM students
GROUP BY NAME, AGE
HAVING COUNT(*) > 1;

The above query returns the duplicated rows for NAME and AGE column and the number of duplication.

NAME	AGE	DUPLICATION_COUNT
Ram	9	3
Shyam	7	2

Pros - Simple and effective way to find duplicates in a table.
Cons - The above query cannot be used directly to remove duplicates from the table.

Removing duplicate rows

In order to remove duplicate rows from the table, you can use the following query that joins the table with itself.

DELETE FROM students WHERE ID IN(
SELECT s1.ID
FROM students AS s1
JOIN students AS s2
ON s1.NAME = s2.NAME and s1.AGE = s2.AGE 
and s1.ID > s2.ID
);

After executing this query, your table should only contain the distinct records. Duplicate records with the minimum ID is preserved.

ID	NAME	AGE
1	Ram	9
2	Vaani	8
3	Shyam	7
6	Aakash	10

Method 2 - Using ROW_NUMBER()

The ROW_NUMBER() is a window function in SQL which assigns a sequential integer value to each row in the query’s result set. You can also use this window function to search for duplicates.

First, let's see what the ROW_NUMBER() function returns.

SELECT NAME, AGE, 
ROW_NUMBER() OVER (PARTITION BY NAME, AGE ORDER BY ID) AS row_value
FROM students
ORDER BY NAME;

The above query returns the following records.

NAME	AGE	ROW_VALUE
Aakash	10	1
Ram	9	1
Ram	9	2
Ram	9	3
Shyam	7	1
Shyam	7	2
Vaani	8	1

The PARTITION BY clause divides the result set returned from the FROM clause into partitions of NAME and AGE. Each partition represents a unique record. The ORDER BY clause sorts the rows in each partition based on the ID column which helps in the assignment of the sequential integers.

Finding duplicate rows

To list all the duplicate records, you need to filter out rows with ROW_VALUE greater than 1.

WITH base AS
(
SELECT Name, Age, 
ROW_NUMBER() OVER (PARTITION BY Name, Age) AS row_value
FROM students
)
SELECT *
FROM base
where row_value > 1

The above query returns the duplicate rows.

NAME	AGE	ROW_VALUE
Ram	9	2
Ram	9	3
Shyam	7	2

Pros - Along with identifying duplicate rows, it also lets you filter or delete them.
Cons - Slightly expensive query to be using the ROW_NUMBER() function.

Removing duplicate rows

Using the above query, we can easily filter out duplicates from the table. We choose all the rows with ROW_VALUE = 1.

WITH base AS
(
SELECT Name, Age, 
ROW_NUMBER() OVER (PARTITION BY Name, Age) AS row_value
FROM students
)
SELECT *
FROM base
where row_value = 1

This returns a distinct set of rows from the table.

NAME	AGE	ROW_VALUE
Aakash	10	1
Ram	9	1
Shyam	7	1
Vaani	8	1

If you follow other ways of removing duplicates, do share them in the comments below.

Data Chef

Data Chef

How to find duplicate values in a table using SQL?

2 simple and effective ways

Method 1 - Using Group By

Finding duplicate rows

Removing duplicate rows

Method 2 - Using ROW_NUMBER()

Finding duplicate rows

Removing duplicate rows