Deleting/Listing duplicate records without ROWID

I was hunting for an article when my friend came up and asked me this question.

How to list/delete duplicate records without using ROWID?

I never asked why he do not want to use ROWID, because I was caught in some thinking.

I done some googling, and derived the method, but first the ROWID method

First how to delete records with ROWID, and why?

ROWID is the unique record identifier within Oracle, and it is easy to get the minimum of the ROWID for the duplicating criteria and delete the rest of them.

We will take the example of EMP table for simplicity. Assume that there is no primary key in the Employee_Id column due to some data migration requirements. Once the data migration is over I need to check the existense of duplicate records. This is how traditionally we used to achieve this:

Suppose the data after migration stands like this:
Employee_Id Employee_Name
10          Scott
11          Tiger
10          Amanda
10          Nagpal
20          Rawat

It is clear from this data that there are duplicate records, but in a normal data migration there will be millions of records from which identifying the duplicate records will be a nightmare. This is where the following query will save our day:

SELECT Employee_Id, Employee_Name
FROM   Employee e
WHERE  ROWID NOT IN (SELECT MIN(ROWID)
                     FROM   Employee f
                     WHERE  e.Employee_Id = f.Employee_Id);

The above query will list the following duplicate records:
Employee_Id Employee_Name
10          Amanda
10          Nagpal

We will not get into the decision of what to do with the duplicate records as it is not our discussion topic.

Now what does the MIN(ROWID) implies in the query? It implies that the first record inserted cannot be a duplicate record, and the rest of the data are. Simple isn't it?

Now how to select the same data without ROWID?

Consider the following select statement:

SELECT Employee_Id, Employee_Name,
  decode(row_number() over( PARTITION BY employee_id ORDER BY employee_id),1,'Not duplicate','Dupilcate') Dupicate_Criteria
FROM employee;

The following is the data from the query:

EMPLOYEE_ID    EMPLOYEE_NAME   DUPICATE_CRITERIA
10             Scott           Not duplicate
10             Amanda          Dupilcate
10             Nagpal          Dupilcate
11             Tiger           Not duplicate
20             Rawat           Not duplicate

In this, you can see the usage of row_number() function with over() clause. In the over() clause we are partitioning the data on basis of the duplicate criteria, that is the Employee_Id. The Order By clause is mandatory so we will use the same Employee_Id. Now the row_number() function assigns running numbers for each Employee_Id. If there are any duplicate records, the row_number() function will return any value greater than 1 (in the running order).

So it is very easy for us to identify the usage of row_number() function and to find an alternative method of getting duplicate rows without using ROWID in our query. Now it is in your hands to do more research and find out more usages.

Thanks for Subesh Kumar who prompted me for a short research in this topic, and hence this article.

5 comments :

  1. well i am new to sql but i guess there can be another approach:

    SQL> create table T2 as select distinct * from T1;
    SQL> drop table T1;
    SQL> rename T2 as T1;

    sunny3524@gmail.com

    ReplyDelete
  2. Hi Anantha,

    In your 2nd option (with out Rowid) you can only list the duplicates, but you can't delete or perform any dml on the listed records.

    Regards,
    Pradeep

    ReplyDelete
  3. Hi Anantha - Your query saved me 8 hours! Thanks Rajesh Beri

    ReplyDelete
  4. how to hide duplicate rows in oracle without using rowid

    ReplyDelete
  5. What do you mean by hide? Hide from selection ? If yes then replace delete with select * from

    ReplyDelete