SQL Server: How can I remove duplicate rows?

What's the best way to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows)?

The rows, of course, will not be perfect duplicates because of the existence of the RowID identity field.

MyTable:
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null


Assuming no nulls, you GROUP BY the unique columns, and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:

DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
   SELECT MIN(RowId) as RowId, Col1, Col2, Col3
   FROM MyTable
   GROUP BY Col1, Col2, Col3
) as KeepRows ON
   MyTable.RowId = KeepRows.RowId
WHERE
   KeepRows.RowId IS NULL

In case you have a GUID instead of an integer, you can replace

MIN(RowId)

with

CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))


Another possible way of doing this is:

;

--Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                                       ORDER BY ( SELECT 0)) RN
         FROM   #MyTable)
DELETE FROM cte
WHERE  RN > 1;

I use ORDER BY (SELECT 0) above, as it is arbitrary which row to preserve in the event of a tie.

To preserve the latest row in RowID order, for example, you could use ORDER BY RowID DESC.

Execution plans

Because it does not require a self-join, this typically yields a simpler and more efficient execution plan than the accepted answer.


This is not always the case, however. One place where the GROUP BY solution might be preferred is situations where a hash aggregate would be chosen in preference to a stream aggregate.

The ROW_NUMBER solution will always give pretty much the same plan, whereas the GROUP BY strategy is more flexible.


Factors that might favour the hash aggregate approach would be:

  • No useful index on the partitioning columns
  • Relatively few groups, with relatively many duplicates in each group

In extreme versions of this second case (if there are very few groups with many duplicates in each), one could also consider simply inserting the rows to keep into a new table, then TRUNCATE-ing the original and copying them back, to minimise logging compared with deleting a very high proportion of the rows.
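
The copy-out/empty/copy-back pattern above can be sketched end-to-end. This is a minimal illustration using SQLite via Python (SQLite has no TRUNCATE, so an unqualified DELETE plays that role here); table and column names follow the question's MyTable schema, and the sample data is made up:

```python
import sqlite3

# In-memory database with the question's MyTable schema and some duplicates.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE MyTable (
        RowID INTEGER PRIMARY KEY AUTOINCREMENT,
        Col1 TEXT NOT NULL, Col2 TEXT NOT NULL, Col3 INTEGER NOT NULL
    );
    INSERT INTO MyTable (Col1, Col2, Col3) VALUES
        ('a', 'x', 1), ('a', 'x', 1), ('a', 'x', 1), ('b', 'y', 2);
""")

# 1. Keep one RowID per (Col1, Col2, Col3) group in a scratch table.
con.execute("""
    CREATE TEMP TABLE KeepRows AS
    SELECT MIN(RowID) AS RowID, Col1, Col2, Col3
    FROM MyTable GROUP BY Col1, Col2, Col3
""")
# 2. Empty the original (TRUNCATE in SQL Server) and copy the keepers back.
con.execute("DELETE FROM MyTable")
con.execute("INSERT INTO MyTable SELECT * FROM KeepRows")

rows = con.execute("SELECT Col1, Col2, Col3 FROM MyTable ORDER BY RowID").fetchall()
print(rows)  # [('a', 'x', 1), ('b', 'y', 2)]
```

In SQL Server the win comes from TRUNCATE and the INSERT being minimally logged, which the SQLite sketch cannot show; it only demonstrates the row movement.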


There's a great article on the Microsoft Support site about removing duplicates. It's pretty conservative – they have you do everything in separate steps – but it should work well against large tables.

In the past, I've used a self-join for this, although it could probably be prettied up with a HAVING clause:

DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
AND dupes.secondDupField = fullTable.secondDupField
AND dupes.uniqueField > fullTable.uniqueField


The following query can be used to delete duplicate rows. In this example the table has ID as an identity column, and the columns that have duplicate data are Column1, Column2 and Column3.

DELETE FROM TableName
WHERE  ID NOT IN (SELECT MAX(ID)
                  FROM   TableName
                  GROUP  BY Column1,
                            Column2,
                            Column3
                  /*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
                    nullable. Because of semantics of NOT IN (NULL) including the clause
                    below can simplify the plan*/

                  HAVING MAX(ID) IS NOT NULL)

The following script shows usage of GROUP BY, HAVING and ORDER BY in one query, and returns the results with the duplicate columns and their counts.

SELECT YourColumnName,
       COUNT(*) TotalCount
FROM   YourTableName
GROUP  BY YourColumnName
HAVING COUNT(*) > 1
ORDER  BY COUNT(*) DESC


Oracle:
delete t1
from table t1, table t2
where t1.columnA = t2.columnA
and t1.rowid>t2.rowid

Postgres:

delete
from table t1
using table t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid


DELETE LU
FROM   (SELECT *,
               Row_number()
                 OVER (
                   partition BY col1, col2, col3
                   ORDER BY rowid DESC) [Row]
        FROM   mytable) LU
WHERE  [row] > 1


This will delete the duplicate rows, except the first one:

DELETE
FROM
    Mytable
WHERE
    RowID NOT IN (
        SELECT
            MIN(RowID)
        FROM
            Mytable
        GROUP BY
            Col1,
            Col2,
            Col3
    )

Refer to: http://www.codeproject.com/articles/157977/remove-duplicate-rows-from-a-table-in-sql-server


I prefer a CTE for deleting duplicate rows from a SQL Server table.

I strongly recommend following this article: http://codiffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/

by keeping original

WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)

DELETE FROM CTE WHERE RN<>1

without keeping original

WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)

Quick and dirty, to delete exactly duplicated rows (for small tables):

select  distinct * into t2 from t1;
delete from t1;
insert into t1 select *  from t2;
drop table t2;


I prefer the subquery/HAVING count(*) > 1 solution to the inner join, because I find it easier to read, and it is very easy to turn into a SELECT statement to verify what will be deleted before you run it.

--DELETE FROM table1
--WHERE id IN (
     SELECT MIN(id) FROM table1
     GROUP BY col1, col2, col3
     -- could add a WHERE clause here to further filter
     HAVING count(*) > 1
--)


To fetch the duplicate rows:

SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING COUNT(*) > 1

To delete the duplicate rows:

DELETE users
WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);

SELECT  DISTINCT *
      INTO tempdb.dbo.tmpTable
FROM myTable

TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable


I thought I'd share my solution since it works under special circumstances. In my case the table with duplicate values did not have a foreign key (because the values were duplicated from another db).

begin transaction
-- create temp table with identical structure as source table
Select * Into #temp From tableName Where 1 = 2

-- insert distinct values into temp
insert into #temp
select distinct *
from  tableName

-- delete from source
delete from tableName

-- insert into source from temp
insert into tableName
select *
from #temp

rollback transaction
-- if this works, change rollback to commit and execute again to keep you changes!!

PS: when working on things like this I always use a transaction; this not only ensures everything is executed as a whole, but also allows me to test without risking anything. But of course you should take a backup anyway, just to be sure...


Use a CTE. The idea is to join on one or more columns that form a duplicate record and then remove whichever you like:

;with cte as (
    select
        min(PrimaryKey) as PrimaryKey,
        UniqueColumn1,
        UniqueColumn2
    from dbo.DuplicatesTable
    group by
        UniqueColumn1, UniqueColumn2
    having count(*) > 1
)
delete d
from dbo.DuplicatesTable d
inner join cte on
    d.PrimaryKey > cte.PrimaryKey and
    d.UniqueColumn1 = cte.UniqueColumn1 and
    d.UniqueColumn2 = cte.UniqueColumn2;


Yet another easy solution can be found at the link pasted here. It is easy to grasp and seems to be effective for most similar problems. It is for SQL Server, but the concept used is more than acceptable.

Here are the relevant portions from the linked page:

Consider this data:

EMPLOYEE_ID ATTENDANCE_DATE
A001    2011-01-01
A001    2011-01-01
A002    2011-01-01
A002    2011-01-01
A002    2011-01-01
A003    2011-01-01

So how can we delete that duplicate data?

First, insert an identity column into that table by using the following code:

ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)

Use the following code to resolve it:

DELETE FROM dbo.ATTENDANCE WHERE AUTOID NOT IN (SELECT MIN(AUTOID)
    FROM dbo.ATTENDANCE GROUP BY EMPLOYEE_ID,ATTENDANCE_DATE)


This query showed very good performance for me:

DELETE tbl
FROM
    MyTable tbl
WHERE
    EXISTS (
        SELECT
            *
        FROM
            MyTable tbl2
        WHERE
            tbl2.SameValue = tbl.SameValue
        AND tbl.IdUniqueValue < tbl2.IdUniqueValue
    )

It deleted 1 million rows (50% duplicates) from a table of 2 million rows in a little over 30 seconds.


Oh sure. Use a temp table. If you want a single, not-very-performant statement that "works", you can go with:

DELETE FROM MyTable WHERE NOT RowID IN
    (SELECT
        (SELECT TOP 1 RowID FROM MyTable mt2
        WHERE mt2.Col1 = mt.Col1
        AND mt2.Col2 = mt.Col2
        AND mt2.Col3 = mt.Col3)
    FROM MyTable mt)

Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original", non-duplicated rows.


Here's another good article on removing duplicates.

It discusses why it is hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."

It covers the temp table solution, plus two MySQL examples.

In the future, are you going to prevent it at a database level, or from an application perspective? I would suggest the database level, because your database should be responsible for maintaining referential integrity; developers will just cause problems ;)
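
Database-level prevention usually means a unique constraint or unique index on the would-be duplicate columns. A minimal sketch, using SQLite via Python (the mechanism is the same in SQL Server); the users table and its columns are illustrative:

```python
import sqlite3

# A UNIQUE constraint on (name, email) makes the database itself reject
# a second identical row, instead of relying on application code.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        email TEXT NOT NULL,
        UNIQUE (name, email)
    )
""")
con.execute("INSERT INTO users (name, email) VALUES ('Ann', 'ann@example.com')")

try:
    # Second identical insert violates the constraint and raises an error.
    con.execute("INSERT INTO users (name, email) VALUES ('Ann', 'ann@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # True
```

The application then handles the constraint violation (retry, upsert, or report), rather than deduplicating after the fact.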


I had a table where I needed to preserve the non-duplicate rows. I'm not sure about its speed or efficiency.

DELETE FROM myTable WHERE RowID IN (
  SELECT MIN(RowID) AS IDNo FROM myTable
  GROUP BY Col1, Col2, Col3
  HAVING COUNT(*) = 2 )


Another way of doing this is to create a new table with the same fields and with a unique index. Then move all the data from the old table to the new table. SQL Server automatically ignores the duplicate values (there is also an option about what to do if there is a duplicate value: ignore, interrupt, or whatever). So we end up with the same table without duplicate rows. If you don't want the unique index, you can drop it after transferring the data.

Especially for larger tables, you can use DTS (an SSIS package to import/export data) in order to transfer all the data rapidly to your new uniquely-indexed table. For 7 million rows it takes just a few minutes.
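
The "ignore duplicates during the transfer" behaviour can be sketched as follows. In SQL Server this would be the IGNORE_DUP_KEY option on the unique index; the sketch below uses the SQLite analogue (a unique index plus INSERT OR IGNORE) via Python, with names mirroring the question's MyTable schema and made-up sample rows:

```python
import sqlite3

# Source table with duplicates, and a clean target table whose unique
# index silently filters them out during the bulk copy.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE MyTable (Col1 TEXT, Col2 TEXT, Col3 INTEGER);
    INSERT INTO MyTable VALUES ('a','x',1), ('a','x',1), ('b','y',2);

    CREATE TABLE MyTableClean (Col1 TEXT, Col2 TEXT, Col3 INTEGER);
    CREATE UNIQUE INDEX ux_clean ON MyTableClean (Col1, Col2, Col3);
""")

# Rows that violate the unique index are ignored rather than inserted
# (SQL Server's IGNORE_DUP_KEY = ON gives the same effect on INSERT...SELECT).
con.execute("INSERT OR IGNORE INTO MyTableClean SELECT * FROM MyTable")

clean = con.execute("SELECT * FROM MyTableClean ORDER BY Col1").fetchall()
print(clean)  # [('a', 'x', 1), ('b', 'y', 2)]
```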


Use this:

WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name)
   As RowNumber,* FROM <table_name>
)
DELETE FROM tblTemp where RowNumber >1

  • Create a new blank table with the same structure

  • Execute a query like this:

    INSERT INTO tc_category1
    SELECT *
    FROM tc_category
    GROUP BY category_id, application_id
    HAVING count(*) > 1
  • Then execute this query:

    INSERT INTO tc_category1
    SELECT *
    FROM tc_category
    GROUP BY category_id, application_id
    HAVING count(*) = 1

  • This is the easiest way to delete duplicate records:

     DELETE FROM tblemp WHERE id IN
     (
      SELECT MIN(id) FROM tblemp
       GROUP BY  title HAVING COUNT(id)>1
     )

    http://askme.indianyouth.info/details/how-to-dumplicate-record-from-table-in-using-sql-105


    By using the query below we can delete duplicate records based on a single column or on multiple columns. The query below deletes based on two columns. The table name is testing, and the column names are empno and empname.

    DELETE FROM testing WHERE empno not IN (SELECT empno FROM (SELECT empno, ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno)
    AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
    or empname not in
    (select empname from (select empname,row_number() over(PARTITION BY empno ORDER BY empno)
    AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)


    From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005, an index is allowed to be only 900 bytes, and my varchar(2048) field blows that away.

    I don't know how well it would perform, but I think you could write a trigger to enforce it, even if you can't do it directly with an index. Something like:

    -- given a table stories(story_id int not null primary key, story varchar(max) not null)
    CREATE TRIGGER prevent_plagiarism
    ON stories
    after INSERT, UPDATE
    AS
        DECLARE @cnt AS INT

        SELECT @cnt = Count(*)
        FROM   stories
               INNER JOIN inserted
                       ON ( stories.story = inserted.story
                            AND stories.story_id != inserted.story_id )

        IF @cnt > 0
          BEGIN
              RAISERROR('plagiarism detected',16,1)

              ROLLBACK TRANSACTION
          END

    Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?


    I would mention this approach as well; it can be helpful, and works in all SQL servers: pretty often there are only one or two duplicates, and the IDs and the count of duplicates are known. In this case:

    SET ROWCOUNT 1 -- or set to number of rows to be deleted
    delete from myTable where RowId = DuplicatedID
    SET ROWCOUNT 0

    DELETE
    FROM
        table_name T1
    WHERE
        rowid > (
            SELECT
                min(rowid)
            FROM
                table_name T2
            WHERE
                T1.column_name = T2.column_name
        );


    CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)

    INSERT INTO car(PersonId,CarId)
    VALUES(1,2),(1,3),(1,2),(2,4)

    --SELECT * FROM car

    ;WITH CTE as(
    SELECT ROW_NUMBER() over (PARTITION BY personid,carid order by personid,carid) as rn,Id,PersonID,CarId from car)

    DELETE FROM car where Id in(SELECT Id FROM CTE WHERE rn>1)

    I wanted to preview the rows to be deleted, and keep control over which of the duplicate rows to keep. See http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/

    with MYCTE as (
      SELECT ROW_NUMBER() OVER (
        PARTITION BY DuplicateKey1
                    ,DuplicateKey2 -- optional
        ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
      ) RN
      FROM MyTable
    )
    DELETE FROM MYCTE
    WHERE RN > 1

    Another way of doing it:

    DELETE A
    FROM   TABLE A,
           TABLE B
    WHERE  A.COL1 = B.COL1
           AND A.COL2 = B.COL2
           AND A.UNIQUEFIELD > B.UNIQUEFIELD


    DELETE
    FROM MyTable
    WHERE RowID NOT IN (
                  SELECT MIN(RowID)
                  FROM MyTable
                  GROUP BY Col1, Col2, Col3
                  );

    alter table MyTable add sno int identity(1,1)
        delete from MyTable where sno in
        (
        select sno from (
        select *,
        RANK() OVER ( PARTITION BY Col1, Col2, Col3 ORDER BY sno DESC ) rank
        From MyTable
        )T
        where rank>1
        )

        alter table MyTable
        drop  column sno

    Sometimes a soft-delete mechanism is used, where a date column records the date of deletion. In that case an UPDATE statement can be used to set this field based on the duplicate entries:

    UPDATE MY_TABLE
       SET DELETED = getDate()
     WHERE TABLE_ID IN (
        SELECT x.TABLE_ID
          FROM MY_TABLE x
          JOIN (SELECT min(TABLE_ID) id, COL_1, COL_2, COL_3
                  FROM MY_TABLE d
                 GROUP BY d.COL_1, d.COL_2, d.COL_3
                HAVING count(*) > 1) AS d ON d.COL_1 = x.COL_1
                                         AND d.COL_2 = x.COL_2
                                         AND d.COL_3 = x.COL_3
                                         AND d.TABLE_ID <> x.TABLE_ID
                 /*WHERE x.COL_4 <> 'D' -- Additional filter*/)

    This approach worked well for me on medium-sized tables containing ~30 million rows, with both high and low amounts of duplication.


    I know that this question has already been answered, but I've created a pretty useful stored procedure which generates a dynamic delete statement for a table's duplicates:

        CREATE PROCEDURE sp_DeleteDuplicate @tableName varchar(100), @DebugMode int =1
    AS
    BEGIN
    SET NOCOUNT ON;

    IF(OBJECT_ID('tempdb..#tableMatrix') is not null) DROP TABLE #tableMatrix;

    SELECT ROW_NUMBER() OVER(ORDER BY name) as rn,name into #tableMatrix FROM sys.columns where [object_id] = object_id(@tableName) ORDER BY name

    DECLARE @MaxRow int = (SELECT MAX(rn) from #tableMatrix)
    IF(@MaxRow is null)
        RAISERROR  ('I wasn''t able to find any columns for this table!',16,1)
    ELSE
        BEGIN
    DECLARE @i int =1
    DECLARE @Columns Varchar(max) ='';

    WHILE (@i <= @MaxRow)
    BEGIN
        SET @Columns=@Columns+(SELECT '['+name+'],' from #tableMatrix where rn = @i)

        SET @i = @i+1;
    END

    ---DELETE LAST comma
    SET @Columns = LEFT(@Columns,LEN(@Columns)-1)

    DECLARE @Sql nvarchar(max) = '

    WITH cteRowsToDelte
         AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY '+@Columns+' ORDER BY ( SELECT 0)) as rowNumber,* FROM '+@tableName
    +'
    )

    DELETE FROM cteRowsToDelte
    WHERE  rowNumber > 1;
    '
    SET NOCOUNT OFF;
        IF(@DebugMode = 1)
           SELECT @Sql
        ELSE
           EXEC sp_executesql @Sql
        END
    END

    So if you create a table like this:

    IF(OBJECT_ID('MyLitleTable') is not null)
        DROP TABLE MyLitleTable


    CREATE TABLE MyLitleTable
    (
        A Varchar(10),
        B money,
        C int
    )
    ---------------------------------------------------------

        INSERT INTO MyLitleTable VALUES
        ('ABC',100,1),
        ('ABC',100,1), -- only this row should be deleted
        ('ABC',101,1),
        ('ABC',100,2),
        ('ABCD',100,1)

        -----------------------------------------------------------

         exec sp_DeleteDuplicate 'MyLitleTable',0

    It will delete all duplicates from the table. If you run it without the second parameter, it will instead return the SQL statement to run.

    If you need to exclude any of the columns, just run it in debug mode, grab the code, and modify it however you like.


    If all the columns of the duplicate rows are the same, the query below can be used to delete the duplicate records.

    SELECT DISTINCT * INTO #TemNewTable FROM #OriginalTable
    TRUNCATE TABLE #OriginalTable
    INSERT INTO #OriginalTable SELECT * FROM #TemNewTable
    DROP TABLE #TemNewTable

    Now let's look at my elasticalsearch table. It has duplicate rows, and Id is the unique field. Since one Id per group survives the group criteria, the other rows outside that group's scope can be deleted. My example below shows that criterion.

    A lot of duplicate-row situations are similar to mine. Just change the target group criteria according to your own case to delete the repeated rows.

    DELETE
    FROM elasticalsearch
    WHERE Id NOT IN
                   (SELECT min(Id)
                         FROM elasticalsearch
                         GROUP BY FirmId,FilterSearchString
                         )

    Cheers


    I think this will be helpful. Here, ROW_NUMBER() OVER (PARTITION BY res1.Title ORDER BY res1.Id) AS num has been used to distinguish the duplicate rows.

    delete res2 FROM
    (SELECT res1.*,ROW_NUMBER() OVER(PARTITION BY res1.Title ORDER BY res1.Id)as num
     FROM
    (select * from [dbo].[tbl_countries])as res1
    )as res2
    WHERE res2.num > 1