How can I remove duplicate rows?
What is the best way to remove duplicate rows from a fairly large SQL Server table?
The rows, of course, will not be perfect duplicates because of the RowID identity field.
MyTable:
    RowID int not null identity(1,1) primary key,
    Col1 varchar(20) not null,
    Col2 varchar(2048) not null,
    Col3 tinyint not null
Assuming no nulls, you GROUP BY the unique columns and SELECT the MIN (or MAX) RowId as the row to keep. Then just delete everything that does not have a matching RowId:
    DELETE MyTable
    FROM MyTable
    LEFT OUTER JOIN (
       SELECT MIN(RowId) as RowId, Col1, Col2, Col3
       FROM MyTable
       GROUP BY Col1, Col2, Col3
    ) as KeepRows ON
       MyTable.RowId = KeepRows.RowId
    WHERE
       KeepRows.RowId IS NULL
If you have a GUID instead of an integer, you can replace

    MIN(RowId)

with

    CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
Another possible way of doing this is:
    ; --Ensure that any immediately preceding statement is terminated with a semicolon above
    WITH cte
         AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                                       ORDER BY (SELECT 0)) RN
             FROM   #MyTable)
    DELETE FROM cte
    WHERE  RN > 1;
I use ORDER BY (SELECT 0) above because it is arbitrary which row to preserve in the event of a tie. To preserve the latest row in RowID order, for example, you could use ORDER BY RowID DESC.
Execution plans
Because it does not require a self join, the execution plan for this is often simpler and more efficient than that of the accepted answer.
This is not always the case, however. Situations where the hash aggregate approach of the accepted answer may be preferable are those where the following factors apply:
- no useful index on the partitioning columns
- relatively few groups, with relatively many duplicates in each group
In extreme versions of this second case (very few groups, each with many duplicates), you could also consider simply inserting the rows to keep into a new table and then dropping or truncating the original, rather than deleting a very high proportion of the rows in place.
There is a good article on removing duplicates on the Microsoft Support site. It is fairly conservative (they have you do everything in separate steps), but it should work well against large tables.
In the past I've used a self-join for this, although it could probably be tidied up with a HAVING clause:
    DELETE dupes
    FROM MyTable dupes, MyTable fullTable
    WHERE dupes.dupField = fullTable.dupField
      AND dupes.secondDupField = fullTable.secondDupField
      AND dupes.uniqueField > fullTable.uniqueField
The following query can be used to delete duplicate rows. The table in this example has ID as an identity column, and the columns that hold duplicate data are Column1, Column2 and Column3.
    DELETE FROM TableName
    WHERE  ID NOT IN (SELECT MAX(ID)
                      FROM   TableName
                      GROUP  BY Column1, Column2, Column3
                      /*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
                        nullable. Because of the semantics of NOT IN (NULL), including the
                        clause below can simplify the plan*/
                      HAVING MAX(ID) IS NOT NULL)
The script below shows the use of GROUP BY, HAVING and ORDER BY in a single query, and returns each duplicated column value together with its count.
    SELECT YourColumnName,
           COUNT(*) TotalCount
    FROM YourTableName
    GROUP BY YourColumnName
    HAVING COUNT(*) > 1
    ORDER BY COUNT(*) DESC
    delete t1
    from table t1, table t2
    where t1.columnA = t2.columnA
      and t1.rowid > t2.rowid
Postgres:
    delete
    from table t1
    using table t2
    where t1.columnA = t2.columnA
      and t1.rowid > t2.rowid
    DELETE LU
    FROM   (SELECT *,
                   Row_number() OVER (
                       partition BY col1, col2, col3
                       ORDER BY rowid DESC) [Row]
            FROM   mytable) LU
    WHERE  [Row] > 1
This will delete duplicate rows, except the first row:
    DELETE FROM Mytable
    WHERE  RowID NOT IN (SELECT MIN(RowID)
                         FROM   Mytable
                         GROUP  BY Col1, Col2, Col3)
Refer to (http://www.codeproject.com/articles/157977/remove-duplicate-rows-from-a-table-in-sql-server)
I would prefer a CTE for deleting duplicate rows from a SQL Server table.
I strongly recommend following this article: http://codiffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
By keeping the original:
    WITH CTE AS
    (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col1, col2, col3) AS RN
        FROM MyTable
    )
    DELETE FROM CTE WHERE RN <> 1
Without keeping the original:
    WITH CTE AS
    (SELECT *, R = RANK() OVER (ORDER BY col1, col2, col3)
     FROM MyTable)
    DELETE CTE
    WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*) > 1)
Quick and dirty, to delete exact duplicated rows (for small tables):
    select distinct * into t2 from t1;
    delete from t1;
    insert into t1 select * from t2;
    drop table t2;
I prefer the subquery + HAVING COUNT(*) > 1 solution to the inner join, because I find it easier to read and it is very easy to turn into a SELECT statement to verify what will be deleted before you run it.
    --DELETE FROM table1
    --WHERE id IN (
         SELECT MIN(id) FROM table1
         GROUP BY col1, col2, col3
         -- could add a WHERE clause here to further filter
         HAVING count(*) > 1
    --)
To fetch duplicate rows:
    SELECT name, email, COUNT(*)
    FROM users
    GROUP BY name, email
    HAVING COUNT(*) > 1
To delete the duplicate rows:
    DELETE users
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM users
        GROUP BY name, email);
    SELECT DISTINCT * INTO tempdb.dbo.tmpTable
    FROM myTable

    TRUNCATE TABLE myTable
    INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
    DROP TABLE tempdb.dbo.tmpTable
I thought I'd share my solution since it works under special circumstances. In my case the table with duplicate values had no foreign key (because the values were copied from another database).
    begin transaction
    -- create temp table with identical structure as source table
    Select * Into #temp From tableName Where 1 = 2

    -- insert distinct values into temp
    insert into #temp
    select distinct *
    from tableName

    -- delete from source
    delete from tableName

    -- insert into source from temp
    insert into tableName
    select *
    from #temp

    rollback transaction
    -- if this works, change rollback to commit and execute again to keep your changes!!
PS: when working on things like this I always use a transaction. This not only makes sure everything is executed as a whole, but also lets me test without risking anything. Of course you should still take a backup, just to be sure...
Use a CTE. The idea is to join on one or more columns that form a duplicate record, then delete whichever rows you like:
    ;with cte as (
        select
            min(PrimaryKey) as PrimaryKey,
            UniqueColumn1,
            UniqueColumn2
        from dbo.DuplicatesTable
        group by
            UniqueColumn1, UniqueColumn2
        having count(*) > 1
    )
    delete d
    from dbo.DuplicatesTable d
    inner join cte on
        d.PrimaryKey > cte.PrimaryKey
        and d.UniqueColumn1 = cte.UniqueColumn1
        and d.UniqueColumn2 = cte.UniqueColumn2;
One more easy solution can be found at the link pasted here. It is easy to grasp and seems effective for most similar problems. It is for SQL Server, but the concept used is more than acceptable.
Here are the relevant portions from the linked page:
Consider this data:
    EMPLOYEE_ID    ATTENDANCE_DATE
    A001           2011-01-01
    A001           2011-01-01
    A002           2011-01-01
    A002           2011-01-01
    A002           2011-01-01
    A003           2011-01-01
So how can we delete that duplicate data?
First, insert an identity column into that table by using the following code:
    ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)
Then use the following code to resolve it:
    DELETE FROM dbo.ATTENDANCE
    WHERE AUTOID NOT IN (SELECT MIN(AUTOID)
                         FROM dbo.ATTENDANCE
                         GROUP BY EMPLOYEE_ID, ATTENDANCE_DATE)
This query showed very good performance for me:
    DELETE tbl
    FROM MyTable tbl
    WHERE EXISTS (
        SELECT *
        FROM MyTable tbl2
        WHERE tbl2.SameValue = tbl.SameValue
          AND tbl.IdUniqueValue < tbl2.IdUniqueValue
    )
It deleted 1 million rows (50% duplicates) from a table of 2 million rows in a little more than 30 seconds.
Oh sure, use a temp table. If you want a single, not-very-performant statement that "works", you can go with:
    DELETE FROM MyTable
    WHERE NOT RowID IN
        (SELECT (SELECT TOP 1 RowID
                 FROM MyTable mt2
                 WHERE mt2.Col1 = mt.Col1
                   AND mt2.Col2 = mt.Col2
                   AND mt2.Col3 = mt.Col3)
         FROM MyTable mt)
Basically, for each row in the table, the sub-select finds the top RowID of all the rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original", non-duplicated rows.
Here is another good article on removing duplicates.
It discusses why it is hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."
It covers the temp table solution, plus two MySQL examples.
In the future, do you want to prevent this at the database level, or from the application's perspective? I would suggest the database level, because your database should be responsible for maintaining referential integrity; developers will just cause problems ;)
I had a table where I needed to preserve the non-duplicate rows. I'm not sure about the speed or efficiency.
    DELETE FROM myTable
    WHERE RowID IN (
        SELECT MIN(RowID) AS IDNo
        FROM myTable
        GROUP BY Col1, Col2, Col3
        HAVING COUNT(*) = 2
    )
Another way is to create a new table with the same fields and a unique index, then move all the data from the old table to the new one. SQL Server automatically ignores the duplicate values (there is also an option for what to do when a duplicate value arrives: ignore, interrupt, or something else), so we end up with the same table without the duplicate rows. If you don't want the unique index, you can drop it after transferring the data.
Especially for larger tables, you can use DTS (an SSIS package to import/export data) to transfer all the data quickly to the new, uniquely indexed table. For 7 million rows it takes just a few minutes.
Use this:
    WITH tblTemp as
    (
        SELECT ROW_NUMBER() Over(PARTITION BY Name, Department ORDER BY Name) As RowNumber, *
        FROM <table_name>
    )
    DELETE FROM tblTemp where RowNumber > 1
Create a new blank table with the same structure.
Then execute a query like this:
    INSERT INTO tc_category1
    SELECT *
    FROM tc_category
    GROUP BY category_id, application_id
    HAVING count(*) > 1
Then execute this query:
    INSERT INTO tc_category1
    SELECT *
    FROM tc_category
    GROUP BY category_id, application_id
    HAVING count(*) = 1
This is the easiest way to delete duplicate records:
    DELETE FROM tblemp
    WHERE id IN (
        SELECT MIN(id)
        FROM tblemp
        GROUP BY title
        HAVING COUNT(id) > 1
    )
http://askme.indianyouth.info/details/how-to-dumplicate-record-from-table-in-using-sql-105
By using the query below we can delete duplicate records based on a single column or on multiple columns. The query below deletes based on two columns; the table name is testing and the columns are empno and empname.
    DELETE FROM testing
    WHERE empno NOT IN (SELECT empno
                        FROM (SELECT empno,
                                     ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno) AS [ItemNumber]
                              FROM testing) a
                        WHERE ItemNumber > 1)
       OR empname NOT IN (SELECT empname
                          FROM (SELECT empname,
                                       ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno) AS [ItemNumber]
                                FROM testing) a
                          WHERE ItemNumber > 1)
From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005, an index is allowed to be only 900 bytes, and my varchar(2048) field blows that away.
I don't know how well it would perform, but I think you could write a trigger to enforce this, even if you can't do it directly with an index. Something like:
    -- given a table stories(story_id int not null primary key, story varchar(max) not null)
    CREATE TRIGGER prevent_plagiarism
    ON stories
    AFTER INSERT, UPDATE
    AS
        DECLARE @cnt AS INT

        SELECT @cnt = Count(*)
        FROM stories
        INNER JOIN inserted
                ON (stories.story = inserted.story
                    AND stories.story_id != inserted.story_id)

        IF @cnt > 0
        BEGIN
            RAISERROR('plagiarism detected', 16, 1)
            ROLLBACK TRANSACTION
        END
Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?
I would also mention this approach, as it can be helpful and works in all SQL servers: pretty often there are only one or two duplicates, and the IDs and the count of duplicates are known. In this case:
    SET ROWCOUNT 1 -- or set to the number of rows to be deleted
    delete from myTable where RowId = DuplicatedID
    SET ROWCOUNT 0
    DELETE FROM table_name T1
    WHERE rowid > (
        SELECT min(rowid)
        FROM table_name T2
        WHERE T1.column_name = T2.column_name
    );
    CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)

    INSERT INTO car(PersonId, CarId)
    VALUES (1,2), (1,3), (1,2), (2,4)

    --SELECT * FROM car

    ;WITH CTE as (
        SELECT ROW_NUMBER() over (PARTITION BY personid, carid order by personid, carid) as rn, Id, PersonID, CarId
        from car
    )
    DELETE FROM car
    where Id in (SELECT Id FROM CTE WHERE rn > 1)
I wanted to preview the rows to be deleted and keep control over which of the duplicate rows gets kept. See http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
    with MYCTE as (
      SELECT ROW_NUMBER() OVER (
                PARTITION BY DuplicateKey1
                            ,DuplicateKey2 -- optional
                ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
             ) RN
      FROM MyTable
    )
    DELETE FROM MYCTE
    WHERE RN > 1
Another way of doing this:
    DELETE A
    FROM TABLE A, TABLE B
    WHERE A.COL1 = B.COL1
      AND A.COL2 = B.COL2
      AND A.UNIQUEFIELD > B.UNIQUEFIELD
    DELETE FROM MyTable
    WHERE NOT EXISTS (
        SELECT 1
        FROM (SELECT MIN(RowID) AS RowID
              FROM MyTable
              GROUP BY Col1, Col2, Col3) AS KeepRows
        WHERE KeepRows.RowID = MyTable.RowID
    );
    alter table MyTable add sno int identity(1,1)

    delete from MyTable where sno in (
        select sno
        from (
            select *, RANK() OVER (PARTITION BY RowID, Col3 ORDER BY sno DESC) rank
            From MyTable
        ) T
        where rank > 1
    )

    alter table MyTable
    drop column sno
Sometimes a soft-delete mechanism is used, where a date column records the deletion date. In that case an UPDATE statement can be used to soft-delete the duplicate rows:
    UPDATE MY_TABLE
    SET DELETED = getDate()
    WHERE TABLE_ID IN (
        SELECT x.TABLE_ID
        FROM MY_TABLE x
        JOIN (SELECT min(TABLE_ID) id, COL_1, COL_2, COL_3
              FROM MY_TABLE d
              GROUP BY d.COL_1, d.COL_2, d.COL_3
              HAVING count(*) > 1) AS d
          ON d.COL_1 = x.COL_1
         AND d.COL_2 = x.COL_2
         AND d.COL_3 = x.COL_3
         AND d.TABLE_ID <> x.TABLE_ID
        /*WHERE x.COL_4 <> 'D' -- Additional filter*/
    )
This approach worked well for me on a medium-sized table of about 30 million rows, with both high and low amounts of duplication.
I know this question has already been answered, but I have created a pretty useful stored procedure which generates a dynamic delete statement for a table's duplicates:
    CREATE PROCEDURE sp_DeleteDuplicate @tableName varchar(100), @DebugMode int = 1
    AS
    BEGIN
        SET NOCOUNT ON;

        IF (OBJECT_ID('tempdb..#tableMatrix') is not null) DROP TABLE #tableMatrix;

        SELECT ROW_NUMBER() OVER (ORDER BY name) as rn, name
        into #tableMatrix
        FROM sys.columns
        where [object_id] = object_id(@tableName)
        ORDER BY name

        DECLARE @MaxRow int = (SELECT MAX(rn) from #tableMatrix)

        IF (@MaxRow is null)
            RAISERROR ('I wasn''t able to find any columns for this table!', 16, 1)
        ELSE
        BEGIN
            DECLARE @i int = 1
            DECLARE @Columns Varchar(max) = '';

            WHILE (@i <= @MaxRow)
            BEGIN
                SET @Columns = @Columns + (SELECT '[' + name + '],' from #tableMatrix where rn = @i)
                SET @i = @i + 1;
            END

            ---DELETE LAST comma
            SET @Columns = LEFT(@Columns, LEN(@Columns) - 1)

            DECLARE @Sql nvarchar(max) = '
            WITH cteRowsToDelte
            AS (
                SELECT ROW_NUMBER() OVER (PARTITION BY ' + @Columns + ' ORDER BY (SELECT 0)) as rowNumber, *
                FROM ' + @tableName + '
            )
            DELETE
            FROM cteRowsToDelte
            WHERE rowNumber > 1;
            '
            SET NOCOUNT OFF;

            IF (@DebugMode = 1)
                SELECT @Sql
            ELSE
                EXEC sp_executesql @Sql
        END
    END
So, if you create a table like this:
    IF (OBJECT_ID('MyLitleTable') is not null) DROP TABLE MyLitleTable

    CREATE TABLE MyLitleTable
    (
        A Varchar(10),
        B money,
        C int
    )
    ---------------------------------------------------------
    INSERT INTO MyLitleTable VALUES
    ('ABC', 100, 1),
    ('ABC', 100, 1), -- only this row should be deleted
    ('ABC', 101, 1),
    ('ABC', 100, 2),
    ('ABCD', 100, 1)
    -----------------------------------------------------------
    exec sp_DeleteDuplicate 'MyLitleTable', 0
It will delete all duplicates from the table. If you run it without the second parameter, it returns the SQL statement to run instead of executing it.
If you need to exclude any columns, just run it in debug mode, take the generated code, and modify it however you like.
If all the columns in the duplicate rows are the same, the query below can be used to delete the duplicate records:
    SELECT DISTINCT * INTO #TemNewTable FROM #OriginalTable
    TRUNCATE TABLE #OriginalTable
    INSERT INTO #OriginalTable SELECT * FROM #TemNewTable
    DROP TABLE #TemNewTable
Now let's look at the elasticalsearch table. This table has duplicate rows, and Id is its unique field. If we keep one Id per group criterion, we can delete the other rows that fall outside that group's scope; the query below expresses this criterion.
Many of the cases in this thread are similar to mine. Just change the target group criterion according to your own case for deleting the repeated (duplicated) rows.
    DELETE FROM elasticalsearch
    WHERE Id NOT IN (
        SELECT min(Id)
        FROM elasticalsearch
        GROUP BY FirmId, FilterSearchString
    )
Cheers
I think this will be helpful. Here, ROW_NUMBER() OVER (PARTITION BY res1.Title ORDER BY res1.Id) AS num is used to distinguish the duplicate rows.
    delete res2
    FROM (SELECT res1.*,
                 ROW_NUMBER() OVER (PARTITION BY res1.Title ORDER BY res1.Id) as num
          FROM (select * from [dbo].[tbl_countries]) as res1
         ) as res2
    WHERE res2.num > 1