Get top 1 row of each group
I have a table and I want to get the latest entry for each group. Here is the table:

| ID | DocumentID | STATUS | DateCreated |
| 2  | 1          | S1     | 7/29/2011   |
| 3  | 1          | S2     | 7/30/2011   |
| 6  | 1          | S1     | 8/02/2011   |
| 1  | 2          | S1     | 7/28/2011   |
| 4  | 2          | S2     | 7/30/2011   |
| 5  | 2          | S3     | 8/01/2011   |
| 6  | 3          | S1     | 8/02/2011   |

The table will be grouped by DocumentID and sorted by DateCreated in descending order.

My preferred output:

| DocumentID | STATUS | DateCreated |
| 1          | S1     | 8/02/2011   |
| 2          | S3     | 8/01/2011   |
| 3          | S1     | 8/02/2011   |
Is there any aggregate function that returns only the top row from each group? See the pseudocode GetOnlyTheTop below:

    SELECT
      DocumentID,
      GetOnlyTheTop(STATUS),
      GetOnlyTheTop(DateCreated)
    FROM DocumentStatusLogs
    GROUP BY DocumentID
    ORDER BY DateCreated DESC
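If no such function exists, is there any other way to get the output I want?

Or, in the first place, could this be caused by an unnormalized database? I'm wondering, since what I'm looking for is just one row, whether that status should also be stored in the parent table.

Please see the parent table for more information:

Current:

| DocumentID | Title  | Content | DateCreated |
| 1          | TitleA | ...     | ...         |
| 2          | TitleB | ...     | ...         |
| 3          | TitleC | ...     | ...         |

Should the parent table look like this, so that I can easily access its status?

| DocumentID | Title  | Content | DateCreated | CurrentStatus |
| 1          | TitleA | ...     | ...         | s1            |
| 2          | TitleB | ...     | ...         | s3            |
| 3          | TitleC | ...     | ...         | s1            |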
UPDATE
I just learned how to use "apply", which makes it easier to solve problems like this.

    ;WITH cte AS
    (
       SELECT *,
             ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
       FROM DocumentStatusLogs
    )
    SELECT *
    FROM cte
    WHERE rn = 1

If you expect 2 entries per day, then this will arbitrarily pick one of them. To get both entries for a day, use DENSE_RANK instead.
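
To make the tie behaviour concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption made only so the example can be run as-is; window functions need SQLite 3.25 or later), and the four rows are hypothetical data in which document 1 has two entries on its latest day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs "
    "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
# Hypothetical data: document 1 has two entries sharing the latest date.
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (1, 1, "S1", "2011-07-29"),
        (2, 1, "S2", "2011-08-02"),
        (3, 1, "S3", "2011-08-02"),
        (4, 2, "S1", "2011-08-01"),
    ],
)


def latest(rank_function):
    # rank_function is spliced into the SQL text, so only pass trusted literals.
    sql = f"""
        WITH cte AS (
            SELECT ID, DocumentID, Status, DateCreated,
                   {rank_function}() OVER (
                       PARTITION BY DocumentID ORDER BY DateCreated DESC
                   ) AS rn
            FROM DocumentStatusLogs
        )
        SELECT ID, DocumentID, Status, DateCreated
        FROM cte
        WHERE rn = 1
        ORDER BY DocumentID, ID
    """
    return conn.execute(sql).fetchall()


row_number_rows = latest("ROW_NUMBER")  # one row per document, tie broken arbitrarily
dense_rank_rows = latest("DENSE_RANK")  # every row that shares the latest date
print(row_number_rows)
print(dense_rank_rows)
```

With ROW_NUMBER, document 1 contributes exactly one of its two tied rows (which one is not defined); with DENSE_RANK both tied rows come back.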
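As for normalised or not, it depends on whether you want to:
- maintain the status in 2 places
- preserve the status history
- ...

As it stands, you preserve the status history. If you want the latest status in the parent table too (which is denormalisation), you'd need a trigger to maintain "status" in the parent. Or drop this status history table.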
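
A rough sketch of the trigger idea, again run through Python's sqlite3 rather than SQL Server (so the trigger syntax here is SQLite's, not T-SQL). The trigger name and the CurrentStatus column are illustrative, and the sketch assumes log rows are inserted in chronological order, so the most recently inserted status is also the latest one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Documents (
        DocumentID INTEGER PRIMARY KEY,
        Title TEXT,
        CurrentStatus TEXT  -- denormalised copy of the latest status
    );
    CREATE TABLE DocumentStatusLogs (
        ID INTEGER PRIMARY KEY,
        DocumentID INTEGER REFERENCES Documents(DocumentID),
        Status TEXT,
        DateCreated TEXT
    );
    -- Hypothetical trigger: copy each newly logged status to the parent row.
    CREATE TRIGGER trg_status_log_insert
    AFTER INSERT ON DocumentStatusLogs
    BEGIN
        UPDATE Documents
        SET CurrentStatus = NEW.Status
        WHERE DocumentID = NEW.DocumentID;
    END;
""")

conn.execute("INSERT INTO Documents (DocumentID, Title) VALUES (1, 'TitleA')")
conn.execute("INSERT INTO DocumentStatusLogs VALUES (2, 1, 'S1', '2011-07-29')")
conn.execute("INSERT INTO DocumentStatusLogs VALUES (3, 1, 'S2', '2011-07-30')")

current = conn.execute(
    "SELECT CurrentStatus FROM Documents WHERE DocumentID = 1"
).fetchone()[0]
history = conn.execute(
    "SELECT COUNT(*) FROM DocumentStatusLogs WHERE DocumentID = 1"
).fetchone()[0]
print(current, history)
```

The history table keeps every status while the parent row always carries the last one written, which is the "status in 2 places" trade-off from the list above.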
I just learned how to use CROSS APPLY. Here is how it looks in this scenario:

    SELECT d.DocumentID, ds.Status, ds.DateCreated
    FROM Documents AS d
    CROSS APPLY
        (SELECT TOP 1 Status, DateCreated
         FROM DocumentStatusLogs
         WHERE DocumentID = d.DocumentId
         ORDER BY DateCreated DESC) AS ds
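
For anyone who wants to try the "top 1 per parent row" idea without a SQL Server instance: SQLite has no CROSS APPLY, so the sketch below emulates it by joining each document to the single log row picked by a correlated LIMIT 1 subquery. This is a swapped-in technique, not the same operator, and the ISO date strings are an assumption so that text ordering matches date ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Documents (DocumentID INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE DocumentStatusLogs (
        ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT
    );
    INSERT INTO Documents VALUES (1, 'TitleA'), (2, 'TitleB'), (3, 'TitleC');
    INSERT INTO DocumentStatusLogs VALUES
        (2, 1, 'S1', '2011-07-29'),
        (3, 1, 'S2', '2011-07-30'),
        (6, 1, 'S1', '2011-08-02'),
        (1, 2, 'S1', '2011-07-28'),
        (4, 2, 'S2', '2011-07-30'),
        (5, 2, 'S3', '2011-08-01'),
        (6, 3, 'S1', '2011-08-02');
""")

# For each document, join to the one log row chosen by the correlated subquery.
# rowid is SQLite's implicit row identifier; it stands in for a unique key here.
rows = conn.execute("""
    SELECT d.DocumentID, ds.Status, ds.DateCreated
    FROM Documents AS d
    JOIN DocumentStatusLogs AS ds
      ON ds.rowid = (SELECT l.rowid
                     FROM DocumentStatusLogs AS l
                     WHERE l.DocumentID = d.DocumentID
                     ORDER BY l.DateCreated DESC
                     LIMIT 1)
    ORDER BY d.DocumentID
""").fetchall()
print(rows)
```

Like CROSS APPLY, the inner join drops documents that have no log rows at all; a LEFT JOIN would keep them, similar to OUTER APPLY.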
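I've done some timings over the various recommendations here, and the results really depend on the size of the table involved, but the most consistent solution is using CROSS APPLY. These tests were run against SQL Server 2008-R2, using a table with 6,500 records and another (identical schema) with 137 million records. The columns being queried are part of the primary key on the table, and the table width is very small (about 30 bytes). The times are reported by SQL Server from the actual execution plan.

    Query                                   Time for 6500 (ms)   Time for 137M (ms)

    CROSS APPLY                             17.9                 17.9
    SELECT WHERE col = (SELECT MAX(COL)…)   6.6                  854.4
    DENSE_RANK() OVER PARTITION             6.6                  907.1

I think the really amazing thing was how consistent the time was for CROSS APPLY, regardless of the number of rows involved.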
    SELECT *
    FROM DocumentStatusLogs
    JOIN (
      SELECT DocumentID, MAX(DateCreated) DateCreated
      FROM DocumentStatusLogs
      GROUP BY DocumentID
      ) max_date USING (DocumentID, DateCreated)

What database server? This code doesn't work on all of them.

Regarding the second half of your question, it seems reasonable to me to include the status as a column. You can [...]

BTW, if you already have [...] in the Documents table [...]

Edit: MsSQL does not support USING, so change it to:

    ON DocumentStatusLogs.DocumentID = max_date.DocumentID AND DocumentStatusLogs.DateCreated = max_date.DateCreated
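
As a quick sanity check that the USING form and the ON form select the same rows, here is a small sketch on the question's sample data. It runs on SQLite through Python's sqlite3 (SQLite accepts both join spellings), and it stores the dates as ISO strings, which is my assumption so that MAX() on text matches the latest date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs "
    "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (6, 3, "S1", "2011-08-02"),
    ],
)

# Join each log row to its group's maximum date, written with USING.
rows_using = conn.execute("""
    SELECT DocumentID, Status, DateCreated
    FROM DocumentStatusLogs
    JOIN (SELECT DocumentID, MAX(DateCreated) AS DateCreated
          FROM DocumentStatusLogs
          GROUP BY DocumentID) AS max_date
    USING (DocumentID, DateCreated)
    ORDER BY DocumentID
""").fetchall()

# The same join written with an explicit ON clause (the SQL Server spelling).
rows_on = conn.execute("""
    SELECT l.DocumentID, l.Status, l.DateCreated
    FROM DocumentStatusLogs AS l
    JOIN (SELECT DocumentID, MAX(DateCreated) AS DateCreated
          FROM DocumentStatusLogs
          GROUP BY DocumentID) AS max_date
      ON l.DocumentID = max_date.DocumentID
     AND l.DateCreated = max_date.DateCreated
    ORDER BY l.DocumentID
""").fetchall()
print(rows_using)
print(rows_on)
```

Note that this pattern returns every row that shares the maximum date, so a document with two entries at the same latest DateCreated would appear twice.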
I know this is an old thread, but here is a TOP 1 WITH TIES solution:

    SELECT TOP 1 WITH TIES
       DocumentID
      ,STATUS
      ,DateCreated
    FROM DocumentStatusLogs
    ORDER BY ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)

More about the TOP clause can be found here.
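If you're worried about performance, you can also do this with MAX():

    SELECT *
    FROM DocumentStatusLogs D
    -- correlate on DocumentID so that MAX() is taken per document, not per row
    WHERE DateCreated = (SELECT MAX(DateCreated) FROM DocumentStatusLogs WHERE DocumentID = D.DocumentID)

ROW_NUMBER() requires a sort of all the rows in your SELECT statement, whereas MAX does not. This should drastically speed up your query.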
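This is quite an old thread, but I thought I'd throw in my two cents all the same, as the accepted answer didn't work particularly well for me. I tried gbn's solution on a large dataset and found it to be terribly slow (>45 seconds on 5 million plus records in SQL Server 2012). Looking at the execution plan, it's obvious that the issue is that it requires a SORT operation, which slows things down significantly.

Here's an alternative that I lifted from Entity Framework that needs no SORT operation and does a non-clustered index search. This reduces the execution time on the aforementioned record set to < 2 seconds.

    SELECT
    [Limit1].[DocumentID] AS [DocumentID],
    [Limit1].[STATUS] AS [STATUS],
    [Limit1].[DateCreated] AS [DateCreated]
    FROM   (SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM [dbo].[DocumentStatusLogs] AS [Extent1]) AS [Distinct1]
    OUTER APPLY  (SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[STATUS] AS [STATUS], [Project2].[DateCreated] AS [DateCreated]
        FROM (SELECT
            [Extent2].[ID] AS [ID],
            [Extent2].[DocumentID] AS [DocumentID],
            [Extent2].[STATUS] AS [STATUS],
            [Extent2].[DateCreated] AS [DateCreated]
            FROM [dbo].[DocumentStatusLogs] AS [Extent2]
            WHERE ([Distinct1].[DocumentID] = [Extent2].[DocumentID])
        )  AS [Project2]
        ORDER BY [Project2].[ID] DESC) AS [Limit1]

Now, I'm assuming something that isn't entirely specified in the original question: if your table is designed so that the ID column is an auto-increment ID and DateCreated is set to the current date on each insert, then even without running my query above you can get a sizable performance boost to gbn's solution (about half the execution time) just by ordering on ID instead of DateCreated, as this gives an identical sort order and is a faster sort.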
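This is one of the most easily found questions on the topic, so I wanted to give a modern answer to it (both for my own reference and to help others out). By using OVER and FIRST_VALUE you can make short work of the above query:

    SELECT DISTINCT DocumentID
      , FIRST_VALUE(STATUS) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS STATUS
      , FIRST_VALUE(DateCreated) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS DateCreated
    FROM DocumentStatusLogs

This should work in SQL Server 2012 and above (FIRST_VALUE is not available in SQL Server 2008). FIRST_VALUE can be thought of as a way to accomplish SELECT TOP 1 when using an OVER clause. OVER allows grouping in the select list, so instead of writing nested subqueries (like many of the existing answers do), this does it in a more readable fashion. Hope this helps.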
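
The same FIRST_VALUE query can be tried on the question's sample data with the sketch below. It runs on SQLite via Python's sqlite3 (needs SQLite 3.25 or later for window functions) instead of SQL Server, and the ISO date strings are an assumption so that ORDER BY on text sorts chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs "
    "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (6, 3, "S1", "2011-08-02"),
    ],
)

# Every row in a partition gets the same FIRST_VALUE results,
# so DISTINCT collapses each document down to a single output row.
rows = conn.execute("""
    SELECT DISTINCT DocumentID,
        FIRST_VALUE(Status) OVER (
            PARTITION BY DocumentID ORDER BY DateCreated DESC
        ) AS Status,
        FIRST_VALUE(DateCreated) OVER (
            PARTITION BY DocumentID ORDER BY DateCreated DESC
        ) AS DateCreated
    FROM DocumentStatusLogs
    ORDER BY DocumentID
""").fetchall()
print(rows)
```

One thing to keep in mind: if two rows tie on DateCreated, the two FIRST_VALUE calls are not guaranteed to pick the same underlying row unless the ORDER BY includes a tie-breaker.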
My code to select the top 1 from each group:

    SELECT a.* FROM #DocumentStatusLogs a
    WHERE datecreated IN (
        SELECT TOP 1 datecreated
        FROM #DocumentStatusLogs b
        WHERE a.documentid = b.documentid
        ORDER BY datecreated DESC
    )
Verifying Clint's awesome and correct answer from above:

The performance difference between the two queries below is interesting: 52% of the estimated cost goes to the first one and 48% to the second. That is a 4% improvement from using DISTINCT instead of ORDER BY. But ORDER BY has the advantage of being able to sort by multiple columns.

    IF (OBJECT_ID('tempdb..#DocumentStatusLogs') IS NOT NULL) BEGIN
        DROP TABLE #DocumentStatusLogs
    END

    CREATE TABLE #DocumentStatusLogs (
        [ID] INT NOT NULL,
        [DocumentID] INT NOT NULL,
        [STATUS] VARCHAR(20),
        [DateCreated] datetime
    )

    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [STATUS], [DateCreated]) VALUES (2, 1, 'S1', '7/29/2011 1:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [STATUS], [DateCreated]) VALUES (3, 1, 'S2', '7/30/2011 2:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [STATUS], [DateCreated]) VALUES (6, 1, 'S1', '8/02/2011 3:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [STATUS], [DateCreated]) VALUES (1, 2, 'S1', '7/28/2011 4:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [STATUS], [DateCreated]) VALUES (4, 2, 'S2', '7/30/2011 5:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [STATUS], [DateCreated]) VALUES (5, 2, 'S3', '8/01/2011 6:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [STATUS], [DateCreated]) VALUES (6, 3, 'S1', '8/02/2011 7:00:00')
Option 1:

    SELECT
        [Extent1].[ID],
        [Extent1].[DocumentID],
        [Extent1].[STATUS],
        [Extent1].[DateCreated]
    FROM #DocumentStatusLogs AS [Extent1]
        OUTER APPLY (
            SELECT TOP 1
                [Extent2].[ID],
                [Extent2].[DocumentID],
                [Extent2].[STATUS],
                [Extent2].[DateCreated]
            FROM #DocumentStatusLogs AS [Extent2]
            WHERE [Extent1].[DocumentID] = [Extent2].[DocumentID]
            ORDER BY [Extent2].[DateCreated] DESC, [Extent2].[ID] DESC
        ) AS [Project2]
    WHERE ([Project2].[ID] IS NULL OR [Project2].[ID] = [Extent1].[ID])

Option 2:

    SELECT
        -- [Limit1].[ID] (not [DocumentID]) so the ID column matches the results shown below
        [Limit1].[ID] AS [ID],
        [Limit1].[DocumentID] AS [DocumentID],
        [Limit1].[STATUS] AS [STATUS],
        [Limit1].[DateCreated] AS [DateCreated]
    FROM (
        SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM #DocumentStatusLogs AS [Extent1]
    ) AS [Distinct1]
        OUTER APPLY (
            SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[STATUS] AS [STATUS], [Project2].[DateCreated] AS [DateCreated]
            FROM (
                SELECT
                    [Extent2].[ID] AS [ID],
                    [Extent2].[DocumentID] AS [DocumentID],
                    [Extent2].[STATUS] AS [STATUS],
                    [Extent2].[DateCreated] AS [DateCreated]
                FROM #DocumentStatusLogs AS [Extent2]
                WHERE [Distinct1].[DocumentID] = [Extent2].[DocumentID]
            ) AS [Project2]
            ORDER BY [Project2].[ID] DESC
        ) AS [Limit1]
In SQL Server Management Studio: after highlighting and running the first block, highlight both Option 1 and Option 2, then right click -> [Display Estimated Execution Plan]. Then run the entire thing to see the results.

Option 1 results:

    ID  DocumentID  STATUS  DateCreated
    6   1           S1      8/2/11 3:00
    5   2           S3      8/1/11 6:00
    6   3           S1      8/2/11 7:00

Option 2 results:

    ID  DocumentID  STATUS  DateCreated
    6   1           S1      8/2/11 3:00
    5   2           S3      8/1/11 6:00
    6   3           S1      8/2/11 7:00

Note:
I tend to use APPLY when I want a join to be 1-to-(1 of many).
I use a JOIN if I want the join to be 1-to-many, or many-to-many.
I avoid CTE with ROW_NUMBER() unless I need to do something advanced and am ok with the windowing performance penalty.
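I also avoid EXISTS / IN subqueries in the WHERE or ON clause, as I have experienced this causing some terrible execution plans. But mileage varies. Review the execution plan and profile performance where and when needed!

    SELECT o.*
    FROM `DocumentStatusLogs` o
      LEFT JOIN `DocumentStatusLogs` b
      ON o.DocumentID = b.DocumentID AND o.DateCreated < b.DateCreated
     WHERE b.DocumentID IS NULL ;

If you only want the most recent entry by DateCreated, this returns just the top 1 row for each DocumentID.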
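Here are 3 separate approaches to the problem at hand, along with the best indexing choices for each of those queries (please try out the indexes yourselves and look at the logical reads, elapsed time and execution plan; the suggestions come from my experience with such queries, and I have not executed them for this specific problem).

Approach 1: Using ROW_NUMBER(). If a rowstore index is unable to improve the performance, you can try a nonclustered/clustered columnstore index; for queries with aggregation and grouping, and for tables that are ordered by different columns all the time, a columnstore index is usually the best choice.

    ;WITH CTE AS
        (
           SELECT   *,
                    RN = ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
           FROM     DocumentStatusLogs
        )
    SELECT  ID
        ,DocumentID
        ,STATUS
        ,DateCreated
    FROM    CTE
    WHERE   RN = 1;

Approach 2: Using FIRST_VALUE. If a rowstore index is unable to improve the performance, you can try a nonclustered/clustered columnstore index; for queries with aggregation and grouping, and for tables that are ordered by different columns all the time, a columnstore index is usually the best choice.

    SELECT  DISTINCT
        ID      = FIRST_VALUE(ID) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
        ,DocumentID
        ,STATUS      = FIRST_VALUE(STATUS) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
        ,DateCreated = FIRST_VALUE(DateCreated) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
    FROM    DocumentStatusLogs;

Approach 3: Using CROSS APPLY. Creating a rowstore index on the DocumentStatusLogs table that covers the columns used in the query should be enough to cover the query, without the need for a columnstore index.

    SELECT  DISTINCT
        ID      = CA.ID
        ,DocumentID = D.DocumentID
        ,STATUS      = CA.Status
        ,DateCreated = CA.DateCreated
    FROM    DocumentStatusLogs D
        CROSS APPLY (
                SELECT  TOP 1 I.*
                FROM    DocumentStatusLogs I
                WHERE   I.DocumentID = D.DocumentID
                ORDER   BY I.DateCreated DESC
            ) CA;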
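
To illustrate what a covering rowstore index for the "top 1 per DocumentID" lookup might look like, here is a sketch on SQLite via Python's sqlite3. The index name is made up. SQLite has no INCLUDE clause, so the extra column is simply appended to the index key; the rough T-SQL counterpart would be a nonclustered index on (DocumentID, DateCreated DESC) with INCLUDE (Status). The printed query plan is SQLite's, not SQL Server's, and its wording varies by version, so treat it as something to inspect rather than rely on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs "
    "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (6, 3, "S1", "2011-08-02"),
    ],
)
# Hypothetical covering index: the seek column, then the sort column, then the payload.
conn.execute(
    "CREATE INDEX IX_Logs_Doc_Date "
    "ON DocumentStatusLogs (DocumentID, DateCreated DESC, Status)"
)

query = """
    SELECT d.DocumentID,
           (SELECT l.Status
            FROM DocumentStatusLogs AS l
            WHERE l.DocumentID = d.DocumentID
            ORDER BY l.DateCreated DESC
            LIMIT 1) AS Status
    FROM (SELECT DISTINCT DocumentID FROM DocumentStatusLogs) AS d
    ORDER BY d.DocumentID
"""
rows = conn.execute(query).fetchall()
# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = [r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + query)]
index_names = [r[1] for r in conn.execute("PRAGMA index_list('DocumentStatusLogs')")]
print(rows)
print("\n".join(plan))
```

As the answer says, check the actual plan and logical reads on your own data before settling on an index.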
Try this:

    SELECT
        [DocumentID],
        [tmpRez].value('/x[2]','varchar(20)') AS [STATUS],
        [tmpRez].value('/x[3]','datetime') AS [DateCreated]
    FROM (
            SELECT [DocumentID],
        CAST('<x>'+MAX(CAST([ID] AS VARCHAR(10))+'</x><x>'+[STATUS]+'</x><x>'+CAST([DateCreated] AS VARCHAR(20)))+'</x>' AS XML) AS [tmpRez]
            FROM DocumentStatusLogs
            GROUP BY DocumentID) AS [tmpQry]
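In scenarios where you want to avoid using ROW_NUMBER(), you can also use a left join:

    SELECT ds.DocumentID, ds.Status, ds.DateCreated
    FROM DocumentStatusLogs ds
    LEFT JOIN DocumentStatusLogs FILTER
        ON ds.DocumentID = FILTER.DocumentID
        -- Match any row that has another row that was created after it.
        AND ds.DateCreated < FILTER.DateCreated
    -- then filter out any rows that matched
    WHERE FILTER.DocumentID IS NULL

For the example schema, you could also use a "not in subquery", which generally compiles to the same output as the left join:

    SELECT ds.DocumentID, ds.Status, ds.DateCreated
    FROM DocumentStatusLogs ds
    WHERE ds.ID NOT IN (
        -- IDs of every row that has a newer row for the same document
        SELECT older.ID
        FROM DocumentStatusLogs older
        JOIN DocumentStatusLogs FILTER
            ON older.DocumentID = FILTER.DocumentID
            AND older.DateCreated < FILTER.DateCreated)

Note that the subquery pattern does not work if the table doesn't have at least one single-column unique key/constraint/index, in this case the primary key "Id".

Both of these queries tend to be more "expensive" than the ROW_NUMBER() query (as measured by Query Analyzer). However, you might encounter scenarios where they return results faster or enable other optimizations.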
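
Here is a small runnable sketch of the anti-join idea on the question's sample data, using SQLite through Python's sqlite3 in place of SQL Server. It runs the left join form and, as a variation of my own that is not from the answer above, a NOT EXISTS form that does not depend on a unique ID column. ISO date strings are assumed so that the < comparison on text follows date order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs "
    "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (6, 3, "S1", "2011-08-02"),
    ],
)

# Anti-join: keep only rows for which no newer row exists in the same document.
left_join_rows = conn.execute("""
    SELECT ds.DocumentID, ds.Status, ds.DateCreated
    FROM DocumentStatusLogs AS ds
    LEFT JOIN DocumentStatusLogs AS newer
      ON ds.DocumentID = newer.DocumentID
     AND ds.DateCreated < newer.DateCreated
    WHERE newer.DocumentID IS NULL
    ORDER BY ds.DocumentID
""").fetchall()

# Same condition phrased with NOT EXISTS (a swapped-in variant).
not_exists_rows = conn.execute("""
    SELECT ds.DocumentID, ds.Status, ds.DateCreated
    FROM DocumentStatusLogs AS ds
    WHERE NOT EXISTS (
        SELECT 1
        FROM DocumentStatusLogs AS newer
        WHERE newer.DocumentID = ds.DocumentID
          AND newer.DateCreated > ds.DateCreated
    )
    ORDER BY ds.DocumentID
""").fetchall()
print(left_join_rows)
print(not_exists_rows)
```

Both forms return all rows that tie for the latest DateCreated within a document, so add a tie-breaker condition if exactly one row per group is required.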
    SELECT doc_id, status, date_created
    FROM (
    SELECT a.*, ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY date_created DESC) AS rnk FROM doc a)
    WHERE rnk = 1;

This is the most vanilla TSQL I can come up with:

    SELECT * FROM DocumentStatusLogs D1 JOIN
    (
      SELECT
        DocumentID, MAX(DateCreated) AS MaxDate
      FROM
        DocumentStatusLogs
      GROUP BY
        DocumentID
    ) D2
    ON
      D2.DocumentID = D1.DocumentID
    AND
      D2.MaxDate = D1.DateCreated

I checked in SQLite that you can use the following simple query with GROUP BY:

    SELECT MAX(DateCreated), *
    FROM DocumentStatusLogs
    GROUP BY DocumentID

Here MAX helps to get the maximum DateCreated from each group.

But it seems that MySQL doesn't associate the *-columns with the value of the max DateCreated :(
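
The SQLite behaviour can be checked with the sketch below, run through Python's built-in sqlite3. It relies on SQLite's documented rule that, in an aggregate query with a single MIN() or MAX(), the bare (non-aggregated) columns take their values from the row that holds the minimum or maximum. Other engines do not promise this: as far as I know MySQL either rejects such a query (with ONLY_FULL_GROUP_BY enabled) or picks the bare column values from an unspecified row. The ISO date strings are my assumption so that MAX() on text finds the latest date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs "
    "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (6, 3, "S1", "2011-08-02"),
    ],
)

# ID and Status are bare columns; SQLite fills them from the MAX(DateCreated) row.
rows = conn.execute("""
    SELECT MAX(DateCreated), ID, DocumentID, Status
    FROM DocumentStatusLogs
    GROUP BY DocumentID
    ORDER BY DocumentID
""").fetchall()
print(rows)
```

This is compact, but it is SQLite-specific, so the join or window-function versions elsewhere in this thread are the safer choice for portable SQL.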