Create Unique Row Id to Match Again Later

By: Sergey Gigoyan   |   Updated: 2021-07-20



Problem

According to database design best practices, a SQL Server table should not contain duplicate rows. During the database design process, primary keys should be created to eliminate duplicate rows. However, sometimes we need to work with databases where these rules are not followed or exceptions are possible (when these rules are bypassed knowingly). For example, a staging table may be loaded with data from different sources where duplicate rows are possible. When the loading process completes, the table should be cleaned or the clean data should be loaded into a permanent table, so after that the duplicates are no longer needed. Therefore, the issue of removing duplicates from the loading table arises. In this tip let's examine some ways to solve data de-duplication needs.

Solution

We will consider two cases in this tip:

  • The first case is when a SQL Server table has a primary key (or unique index) and one of the columns contains duplicate values which should be removed.
  • The second case is when the table does not have a primary key or any unique indexes and contains duplicate rows which should be removed.

Let's discuss these cases separately.

How to remove duplicate rows in a SQL Server table

Duplicate records in a SQL Server table can be a very serious issue.  With duplicate data it is possible for orders to be processed numerous times, for reports to show inaccurate results and more.  In SQL Server there are a number of ways to address duplicate records in a table based on the specific circumstances, such as:

  • Table with Unique Index - For tables with a unique index, you have the opportunity to use the index to order and identify the duplicate data and then remove the duplicate records.  Identification can be performed with self-joins, ordering the data by the max value, using the RANK function or using NOT IN logic.
  • Table without a Unique Index - For tables without a unique index, it is a bit more challenging.  In this scenario, the ROW_NUMBER() function can be used with a common table expression (CTE) to sort the data and then delete the subsequent duplicate records.

Check out the examples below to see real world examples of how to delete duplicate records from a table.

Removing duplicate rows from a SQL Server table with a unique index

Test Environment Setup

To accomplish our tasks, we need a test environment which we create with the following statement:

    USE master
    GO

    CREATE DATABASE TestDB
    GO

    USE TestDB
    GO

    CREATE TABLE TableA
    (
     ID INT NOT NULL IDENTITY(1,1),
     Value INT,
     CONSTRAINT PK_ID PRIMARY KEY(ID)
    )

Now let's insert data into our new table - 'TableA' with the following statement:

    USE TestDB
    GO

    INSERT INTO TableA(Value)
    VALUES(1),(2),(3),(4),(5),(5),(3),(5)

    SELECT * FROM TableA

    SELECT Value, COUNT(*) AS DuplicatesCount
    FROM TableA
    GROUP BY Value

As we can see in the result set below, the values 3 and 5 exist in the 'Value' column more than once:

Test data is queried

Identify Duplicate Rows in a SQL Server Table

Our task is to enforce uniqueness for the 'Value' column by removing duplicates. Removing duplicate values from a table with a unique index is a bit easier than removing the rows from a table without it. First of all, we need to find the duplicates. There are many different ways to do that. Let's investigate and compare some common ways. The queries below show six solutions for finding the duplicate values which should be deleted (leaving only one value):

    ----- Finding duplicate values in a table with a unique index

    --Solution 1
    SELECT a.*
    FROM TableA a,
         (SELECT ID,
                 (SELECT MAX(Value)
                  FROM TableA i
                  WHERE o.Value=i.Value
                  GROUP BY Value
                  HAVING o.ID < MAX(i.ID)) AS MaxValue
          FROM TableA o) b
    WHERE a.ID=b.ID AND b.MaxValue IS NOT NULL

    --Solution 2
    SELECT a.*
    FROM TableA a,
         (SELECT ID,
                 (SELECT MAX(Value)
                  FROM TableA i
                  WHERE o.Value=i.Value
                  GROUP BY Value
                  HAVING o.ID=MAX(i.ID)) AS MaxValue
          FROM TableA o) b
    WHERE a.ID=b.ID AND b.MaxValue IS NULL

    --Solution 3
    SELECT a.*
    FROM TableA a
    INNER JOIN
    (
     SELECT MAX(ID) AS ID, Value
     FROM TableA
     GROUP BY Value
     HAVING COUNT(Value) > 1
    ) b ON a.ID < b.ID AND a.Value=b.Value

    --Solution 4
    SELECT a.*
    FROM TableA a
    WHERE ID < (SELECT MAX(ID)
                FROM TableA b
                WHERE a.Value=b.Value
                GROUP BY Value
                HAVING COUNT(*) > 1)

    --Solution 5
    SELECT a.*
    FROM TableA a
    INNER JOIN (SELECT ID, RANK() OVER(PARTITION BY Value ORDER BY ID DESC) AS rnk
                FROM TableA) b
    ON a.ID=b.ID
    WHERE b.rnk > 1

    --Solution 6
    SELECT *
    FROM TableA
    WHERE ID NOT IN (SELECT MAX(ID)
                     FROM TableA
                     GROUP BY Value)

As we can see, the result for all cases is the same, as shown in the screenshot below:

Different techniques to identify duplicate rows

Only rows with ID=3, 5, 6 need to be deleted. In our example there is a primary key on the 'ID' column, so NULL values are not possible for that column and 'NOT IN' will work without any problem. Looking at the execution plan we can see that the last and most 'compact' solution ('Solution 6') has the highest cost, and the second has the lowest cost:

Execution plans for the duplicate row code
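
The remark about NULL values and 'NOT IN' can be made concrete. If the subquery could return a NULL, 'NOT IN' would return no rows at all, whereas NOT EXISTS does not depend on that guarantee. The query below is a minimal sketch of an alternative phrasing of Solution 6; it is not one of the six solutions compared in the execution plans, and the alias KeepID is a name chosen here for illustration.

```sql
USE TestDB
GO

-- Alternative phrasing of Solution 6 (illustration only, not part of the
-- cost comparison above). KeepID is a hypothetical alias.
-- A row is a surplus duplicate when its ID is not the highest ID
-- recorded for its Value.
SELECT a.*
FROM TableA a
WHERE NOT EXISTS (
    SELECT 1
    FROM (SELECT MAX(ID) AS KeepID
          FROM TableA
          GROUP BY Value) k
    WHERE k.KeepID = a.ID
)
```

With the test data above this returns the same three rows (ID=3, 5, 6) as the other solutions.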

Deleting Duplicate Rows in a SQL Server Table

Now, by using the following queries, let's delete the duplicate values from the table. To simplify our process, we will use only the second, the fifth and the sixth queries:

    USE TestDB
    GO

    --Initializing the table
    TRUNCATE TABLE TableA

    INSERT INTO TableA(Value)
    VALUES(1),(2),(3),(4),(5),(5),(3),(5)

    --Deleting duplicate values
    DELETE t
    FROM TableA t
    WHERE ID IN ( SELECT a.ID
                  FROM TableA a,
                       (SELECT ID,
                               (SELECT MAX(Value)
                                FROM TableA i
                                WHERE o.Value=i.Value
                                GROUP BY Value
                                HAVING o.ID=MAX(i.ID)) AS MaxValue
                        FROM TableA o) b
                  WHERE a.ID=b.ID AND b.MaxValue IS NULL)

    --Initializing the table
    TRUNCATE TABLE TableA

    INSERT INTO TableA(Value)
    VALUES(1),(2),(3),(4),(5),(5),(3),(5)

    --Deleting duplicate values
    DELETE a
    FROM TableA a
    INNER JOIN (SELECT ID, RANK() OVER(PARTITION BY Value ORDER BY ID DESC) AS rnk
                FROM TableA) b
    ON a.ID=b.ID
    WHERE b.rnk>1

    --Initializing the table
    TRUNCATE TABLE TableA

    INSERT INTO TableA(Value)
    VALUES(1),(2),(3),(4),(5),(5),(3),(5)

    --Deleting duplicate values
    DELETE FROM TableA
    WHERE ID NOT IN (SELECT MAX(ID)
                     FROM TableA
                     GROUP BY Value)

Deleting the data and looking into the execution plans again, we see that the fastest is the first DELETE statement and the slowest is the last, as expected:

Query plans for deleting the duplicate data
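
Once the duplicates are removed, the uniqueness of the 'Value' column can also be enforced for future inserts. The statement below is a minimal sketch of one way to do that, assuming TableA no longer contains duplicate values; the constraint name UQ_TableA_Value is hypothetical.

```sql
USE TestDB
GO

-- Hypothetical constraint name, chosen for illustration.
-- The statement fails if duplicate values are still present in the column,
-- so it should run only after one of the DELETE statements above.
ALTER TABLE TableA
ADD CONSTRAINT UQ_TableA_Value UNIQUE (Value)
```

Whether such a constraint is appropriate depends on the table: for a staging table that is expected to receive duplicates on every load, it would cause the load itself to fail.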

Removing duplicates from a table without a unique index in Oracle

As a means to help illustrate our final example in this tip, I want to explain some similar functionality in Oracle.  Removing duplicate rows from a table without a unique index is a little easier in Oracle than in SQL Server. There is a ROWID pseudo column in Oracle which returns the address of the row. It uniquely identifies the row in the table (usually in the database as well, but there is an exception: if different tables store data in the same cluster, they can have the same ROWID). The statements below create a table in the Oracle database and insert data into it:

    CREATE TABLE TableB (Value INT);

    INSERT INTO TableB(Value) VALUES(1);
    INSERT INTO TableB(Value) VALUES(2);
    INSERT INTO TableB(Value) VALUES(3);
    INSERT INTO TableB(Value) VALUES(4);
    INSERT INTO TableB(Value) VALUES(5);
    INSERT INTO TableB(Value) VALUES(5);
    INSERT INTO TableB(Value) VALUES(3);
    INSERT INTO TableB(Value) VALUES(5);

Now we select the data and the ROWID from the table:

SELECT ROWID, Value FROM TableB;          

The result is below:

SELECT ROWID in Oracle to identify duplicates

Now, using ROWID, we can easily remove duplicate rows from the table:

    DELETE TableB
    WHERE rowid NOT IN (
                         SELECT MAX(rowid)
                         FROM TableB
                         GROUP BY Value
                       );

We can also remove duplicates using the code below:

    DELETE FROM TableB o
    WHERE rowid < (
                    SELECT MAX(rowid)
                    FROM TableB i
                    WHERE i.Value=o.Value
                    GROUP BY Value
                  );

Removing duplicates from a SQL Server table without a unique index

Unlike Oracle, there is no ROWID in SQL Server, so to remove duplicates from a table without a unique index we need to do additional work to generate unique row identifiers:

    USE TestDB
    GO

    CREATE TABLE TableB (Value INT)

    INSERT INTO TableB(Value)
    VALUES(1),(2),(3),(4),(5),(5),(3),(5)

    SELECT * FROM TableB

    ; WITH TableBWithRowID AS
    (
     SELECT ROW_NUMBER() OVER (ORDER BY Value) AS RowID, Value
     FROM TableB
    )
    DELETE o
    FROM TableBWithRowID o
    WHERE RowID < (SELECT MAX(rowID)
                   FROM TableBWithRowID i
                   WHERE i.Value=o.Value
                   GROUP BY Value)

    SELECT * FROM TableB

In the code above, we create a table with duplicate rows. We generate unique identifiers using the ROW_NUMBER() function, and by using a common table expression (CTE) we delete the duplicates:

Removing duplicates from a SQL Server table without unique index

This code, however, can be replaced with a more compact and optimal one:

    USE TestDB
    GO

    --Initializing the table
    TRUNCATE TABLE TableB

    INSERT INTO TableB(Value)
    VALUES(1),(2),(3),(4),(5),(5),(3),(5)

    --Deleting duplicate values
    ; WITH TableBWithRowID AS
    (
     SELECT ROW_NUMBER() OVER (PARTITION BY Value ORDER BY Value) AS RowID, Value
     FROM TableB
    )
    DELETE o
    FROM TableBWithRowID o
    WHERE RowID > 1

    SELECT * FROM TableB
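
Before running a DELETE through a CTE like this, it can help to preview which rows would be removed. The query below is a minimal sketch of such a dry run, assuming TableB has just been re-initialized with the eight sample values used in this tip: the CTE is unchanged and only the DELETE is swapped for a SELECT.

```sql
USE TestDB
GO

-- Same CTE as in the DELETE; only the final statement differs.
;WITH TableBWithRowID AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY Value ORDER BY Value) AS RowID, Value
    FROM TableB
)
-- Rows numbered 2 or higher within each Value are the surplus copies.
SELECT Value, RowID
FROM TableBWithRowID
WHERE RowID > 1
```

With the sample data this lists one surplus row for the value 3 and two for the value 5, which are the rows the DELETE removes.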

It is, however, possible to identify the physical address of a row in SQL Server as well. Despite the fact that it is practically impossible to find official documentation about this feature, it can be used as an analog to the ROWID pseudo column in Oracle. It is called %%physloc%% (available since SQL Server 2008) and it is a virtual binary(8) column which shows the physical location of the row. As the value of %%physloc%% is unique for each row, we can use it as a row identifier while removing duplicate rows from a table without a unique index. Thus, we can remove duplicate rows from a table without a unique index in SQL Server in the same way as in Oracle, as well as in the same way as when the table has a unique index.

The first two queries below are the equivalent versions of removing duplicates in Oracle, the next two are queries for removing duplicates using %%physloc%% similar to the case of the table with a unique index, and in the last query %%physloc%% is not used; it is included only to compare the performance of all of these options.
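
As a hedged illustration of the first pattern described above (the SQL Server analog of the Oracle NOT IN query), a minimal sketch follows. It assumes TableB from the earlier example and relies on the undocumented %%physloc%% column, so its behavior is not guaranteed across versions. It is an illustration only and not necessarily the exact query that was measured.

```sql
USE TestDB
GO

--Initializing the table
TRUNCATE TABLE TableB

INSERT INTO TableB(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

-- Keep the row with the highest physical location per Value and delete
-- the rest. %%physloc%% is undocumented: a sketch, not a supported API.
DELETE TableB
WHERE %%physloc%% NOT IN (
    SELECT MAX(%%physloc%%)
    FROM TableB
    GROUP BY Value
)

SELECT * FROM TableB
```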

Analyzing the execution plans, we can see that the first and the last queries are the fastest when compared to the overall batch times:

query performance

Hence, we can conclude that, in general, using %%physloc%% does not improve performance. When using this approach, it is very important to realize that this is an undocumented feature of SQL Server and, therefore, developers should be very careful.

There are other ways to remove duplicates which are not discussed in this tip. For example, we can store the distinct rows in a temporary table, then delete all data from our table and after that insert the distinct rows from the temporary table into our permanent table. In this case the DELETE and INSERT statements should be included in one transaction.
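
The temporary table approach can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the comparison above: it uses TableB from the earlier example, the temporary table name #DistinctRows is hypothetical, and THROW requires SQL Server 2012 or later.

```sql
USE TestDB
GO

-- Keep one copy of every distinct row in a temporary table
-- (#DistinctRows is a hypothetical name).
SELECT DISTINCT Value
INTO #DistinctRows
FROM TableB

-- Swap the contents in one transaction so the table is never left
-- empty or half-loaded if one of the statements fails.
BEGIN TRY
    BEGIN TRANSACTION

    DELETE FROM TableB

    INSERT INTO TableB(Value)
    SELECT Value
    FROM #DistinctRows

    COMMIT TRANSACTION
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
    THROW;
END CATCH

DROP TABLE #DistinctRows

SELECT * FROM TableB
```

This rewrites every row in the table, so it is mainly attractive when a large share of the rows are duplicates.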

Conclusion

In practice we face situations when we need to clean duplicate values from SQL Server tables. The duplicate values can be in a column which must be de-duplicated based on our requirements, or the table can contain entirely duplicate rows.  In either case we need to remove the surplus data to avoid data duplication in the database. In this tip we explained some techniques which hopefully will be helpful in solving these types of problems.

Next Steps
  • Review this related information:
    • https://technet.microsoft.com/en-us/library/ms190766(v=sql.105).aspx
    • https://msdn.microsoft.com/en-us/library/ms186734.aspx
    • https://docs.oracle.com/cd/B19306_01/server.102/b14200/pseudocolumns008.htm
    • Delete duplicate rows with no primary key on a SQL Server table
    • Different strategies for removing duplicate records in SQL Server
    • Removing Duplicates Rows with SSIS Sort Transformation

About the author

Sergey Gigoyan is a database professional with more than 10 years of experience, with a focus on database design, development, performance tuning, optimization, high availability, BI and DW design.

Source: https://www.mssqltips.com/sqlservertip/4486/find-and-remove-duplicate-rows-from-a-sql-server-table/
