Count on join of big tables with conditions is slow

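This query ran in reasonable time when the tables were small. I'm trying to identify the bottleneck, but I'm not sure how to analyze the EXPLAIN output.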

SELECT
    COUNT(*)
FROM performance_analyses
INNER JOIN total_sales ON total_sales.id = performance_analyses.total_sales_id
WHERE
    (size > 0) AND
    total_sales.customer_id IN (
        SELECT customers.id FROM customers WHERE customers.active = 't'
        AND customers.visible = 't' AND customers.organization_id = 3
    ) AND
    total_sales.product_category_id IN (
        SELECT product_categories.id FROM product_categories
        WHERE product_categories.organization_id = 3
    ) AND
    total_sales.period_id = 193;

I've tried both INNER JOINing the customers and product_categories tables and using an inner SELECT. Both take the same time.

Here is the link to the EXPLAIN output: https://explain.depesz.com/s/9lhr

Postgres version:

PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit

Tables and indexes:

CREATE TABLE total_sales (
    id serial NOT NULL,
    value DOUBLE PRECISION,
    start_date DATE,
    end_date DATE,
    product_category_customer_id INTEGER,
    created_at TIMESTAMP WITHOUT TIME ZONE,
    updated_at TIMESTAMP WITHOUT TIME ZONE,
    processed BOOLEAN,
    customer_id INTEGER,
    product_category_id INTEGER,
    period_id INTEGER,
    CONSTRAINT total_sales_pkey PRIMARY KEY (id)
);
CREATE INDEX index_total_sales_on_customer_id ON total_sales (customer_id);
CREATE INDEX index_total_sales_on_period_id ON total_sales (period_id);
CREATE INDEX index_total_sales_on_product_category_customer_id ON total_sales (product_category_customer_id);
CREATE INDEX index_total_sales_on_product_category_id ON total_sales (product_category_id);
CREATE INDEX total_sales_product_category_period ON total_sales (product_category_id, period_id);
CREATE INDEX ts_pid_pcid_cid ON total_sales (period_id, product_category_id, customer_id);

CREATE TABLE performance_analyses (
    id serial NOT NULL,
    total_sales_id INTEGER,
    status_id INTEGER,
    created_at TIMESTAMP WITHOUT TIME ZONE,
    updated_at TIMESTAMP WITHOUT TIME ZONE,
    size DOUBLE PRECISION,
    period_size INTEGER,
    nominal_variation DOUBLE PRECISION,
    percentual_variation DOUBLE PRECISION,
    relative_performance DOUBLE PRECISION,
    time_ago_max INTEGER,
    deseasonalized_series text,
    significance CHARACTER VARYING,
    relevance CHARACTER VARYING,
    original_variation DOUBLE PRECISION,
    last_level DOUBLE PRECISION,
    quantiles text,
    range text,
    analysis_method CHARACTER VARYING,
    CONSTRAINT performance_analyses_pkey PRIMARY KEY (id)
);
CREATE INDEX index_performance_analyses_on_status_id ON performance_analyses (status_id);
CREATE INDEX index_performance_analyses_on_total_sales_id ON performance_analyses (total_sales_id);

CREATE TABLE product_categories (
    id serial NOT NULL,
    name CHARACTER VARYING,
    organization_id INTEGER,
    created_at TIMESTAMP WITHOUT TIME ZONE,
    updated_at TIMESTAMP WITHOUT TIME ZONE,
    external_id CHARACTER VARYING,
    CONSTRAINT product_categories_pkey PRIMARY KEY (id)
);
CREATE INDEX index_product_categories_on_organization_id ON product_categories (organization_id);

CREATE TABLE customers (
    id serial NOT NULL,
    name CHARACTER VARYING,
    external_id CHARACTER VARYING,
    region_id INTEGER,
    organization_id INTEGER,
    created_at TIMESTAMP WITHOUT TIME ZONE,
    updated_at TIMESTAMP WITHOUT TIME ZONE,
    active BOOLEAN DEFAULT FALSE,
    visible BOOLEAN DEFAULT FALSE,
    segment_id INTEGER,
    "group" BOOLEAN,
    group_id INTEGER,
    ticket_enabled BOOLEAN DEFAULT TRUE,
    CONSTRAINT customers_pkey PRIMARY KEY (id)
);
CREATE INDEX index_customers_on_organization_id ON customers (organization_id);
CREATE INDEX index_customers_on_region_id ON customers (region_id);
CREATE INDEX index_customers_on_segment_id ON customers (segment_id);

Row counts:

customers - 6,970 rows
product_categories - 34 rows
performance_analyses - 1,012,346 rows
total_sales - 7,104,441 rows

Related discussion

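What happens if you move AND "total_sales"."period_id" = 193 into the join: INNER JOIN "total_sales" ON "total_sales"."id" = "performance_analyses"."total_sales_id" AND "total_sales"."period_id" = 193
@mulquin That doesn't change anything either. What is strange: the first inner SELECT (on the customers table) returns 5k rows and takes 500 ms when run by itself. I think that is too much for a simple query without any joins on a 6k-row table.
Definitely too long. Are the active, visible and organization_id columns indexed?
@mulquin organization_id is. The active and visible columns are booleans, so with such low cardinality I didn't consider indexing them. Should I?
Try adding an index on total_sales with the columns (product_category_id, period_id) (as one composite index, not as separate indexes). You can switch the column order depending on your data.
@JoãoDaniel Here is a question related to indexing a char(1) column: stackoverflow.com/questions/18576415/…
"I think that is too much for a simple query without any joins on a 6k-row table." You are doing 2 implicit joins with the "in" statements (at least they have the same effect as joins).
@Solarflare I think he means that this query takes too long: SELECT "customers"."id" FROM "customers" WHERE "customers"."active" = 't' AND "customers"."visible" = 't' AND "customers"."organization_id" = 3
@mulquin You are right. That inner select alone takes 500 ms. The full query takes 2000 ms.
Another correction: the simple inner SELECT does not take 500 ms. Part of that was network overhead. The query itself takes about 10 ms, which is reasonable. The full query, however, still takes 2000 ms, which is too much :(
@Solarflare The index you suggested improved things a lot! The time went from 2000 ms to 500 ms. Thanks! I'm not sure whether that is a reasonable time, though.
For each table xyz above, please provide the schema using show create table xyz. The output of explain would not be a bad idea either. Oops, old tag info; you are on postgresql and already provided this, never mind.
Whether it is reasonable depends on your requirements. You can always create more indexes to speed up select queries, at the cost of slower updates/inserts and more resource usage. But try: performance_analyses (total_sales_id, size); then total_sales (product_category_id, period_id, customer_id) or total_sales (customer_id, period_id, product_category_id) (your numbers suggest the latter); and finally an index on customers (organization_id, active, visible), although that one will probably have the least effect. But be careful not to over-optimize (e.g. keeping 4 indexes for a query that runs once a month).
In cases like this, if the values are not updated often, creating an auxiliary "cache" table that stores the counts often helps.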

Your query, rewritten and 100 % equivalent:

SELECT COUNT(*)
FROM   product_categories   pc
JOIN   customers            c  USING (organization_id)
JOIN   total_sales          ts ON ts.customer_id = c.id
JOIN   performance_analyses pa ON pa.total_sales_id = ts.id
WHERE  pc.organization_id = 3
AND    c.active  -- boolean can be used directly
AND    c.visible
AND    ts.product_category_id = pc.id
AND    ts.period_id = 193
AND    pa.size > 0;

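Another answer advises moving all conditions into join clauses and ordering the tables in the FROM list. That may apply to some other RDBMS with a comparatively primitive query planner. While it doesn't hurt in Postgres either, it also has no effect on the performance of your query, assuming default server configuration. The manual: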

Explicit inner join syntax (INNER JOIN, CROSS JOIN, or unadorned JOIN)
is semantically the same as listing the input relations in FROM, so it
does not constrain the join order.

Bold emphasis mine. There is more; read the manual.

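The key setting is join_collapse_limit (default 8). The Postgres query planner will rearrange your 4 tables in whatever way it expects to be fastest, no matter how you arranged the tables and whether you wrote the conditions as WHERE or JOIN clauses. It makes no difference whatsoever. (The same is not true for some other types of joins that cannot be rearranged freely.)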

The important point is that these different join possibilities give
semantically equivalent results but might have hugely different
execution costs. Therefore, the planner will explore all of them to
try to find the most efficient query plan.
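
One way to check this on your own data is to compare the plan under the default setting with the plan when join_collapse_limit is lowered to 1, which makes the planner follow the explicit JOIN order as written. This is only a sketch: the table and column names come from the question, and the join order below is arbitrary, chosen purely for illustration.

```sql
-- Current value (8 unless the server configuration was changed).
SHOW join_collapse_limit;

-- Default behaviour: the planner may reorder all four tables freely.
EXPLAIN
SELECT count(*)
FROM   performance_analyses pa
JOIN   total_sales          ts ON ts.id = pa.total_sales_id
JOIN   customers            c  ON c.id  = ts.customer_id
JOIN   product_categories   pc ON pc.id = ts.product_category_id
WHERE  pa.size > 0
AND    ts.period_id = 193
AND    c.active AND c.visible AND c.organization_id = 3
AND    pc.organization_id = 3;

-- Same query, but the planner must keep the JOIN order as written.
BEGIN;
SET LOCAL join_collapse_limit = 1;  -- reverts at the end of the transaction
EXPLAIN
SELECT count(*)
FROM   performance_analyses pa
JOIN   total_sales          ts ON ts.id = pa.total_sales_id
JOIN   customers            c  ON c.id  = ts.customer_id
JOIN   product_categories   pc ON pc.id = ts.product_category_id
WHERE  pa.size > 0
AND    ts.period_id = 193
AND    c.active AND c.visible AND c.organization_id = 3
AND    pc.organization_id = 3;
ROLLBACK;
```

If the two plans differ, the first one is what the planner picked on its own, independent of how the query text orders the tables.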

Related:

Sample Query to show Cardinality estimation error in PostgreSQL
A: Slow fulltext search due to wildly inaccurate row estimates

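Finally, WHERE id IN (subquery) is not generally equivalent to a join. It does not multiply rows on the left side for duplicate matching values on the right side, and the columns of the subquery are not visible to the rest of the query. A join can multiply rows with duplicate values, and its columns are visible.
Your simple subqueries dig up a single unique column in both cases, so there is no effective difference here, except that IN (subquery) is generally (at least a bit) slower and more verbose. Use joins.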
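
The difference in row multiplication can be reproduced with two throwaway tables. This is a minimal sketch; demo_left and demo_right are hypothetical names that do not appear in the question.

```sql
-- Hypothetical demo tables, dropped automatically at session end.
CREATE TEMP TABLE demo_left  (id int);
CREATE TEMP TABLE demo_right (id int);

INSERT INTO demo_left  VALUES (1), (2);
INSERT INTO demo_right VALUES (1), (1), (3);  -- the value 1 appears twice

-- IN only tests for existence: the left row with id = 1 is counted once.
SELECT count(*)
FROM   demo_left l
WHERE  l.id IN (SELECT r.id FROM demo_right r);   -- returns 1

-- A join produces one output row per matching pair: id = 1 matches twice.
SELECT count(*)
FROM   demo_left l
JOIN   demo_right r ON r.id = l.id;               -- returns 2
```

In the query from the question both subqueries select a primary key, so no such duplicates can occur and the two forms return the same count.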

Your query

Indexes

product_categories has 34 rows. Unless you plan on adding many more, indexes do not help performance for this table. A sequential scan will always be faster. Drop index_product_categories_on_organization_id.

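customers has 6,970 rows. Indexes start to make sense. But according to the EXPLAIN output, your query uses 4,988 of them. Only an index-only scan on an index much narrower than the table could help a bit. Assuming WHERE active AND visible are constant predicates, I suggest a partial multicolumn index: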

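-- Named differently from the existing index_customers_on_organization_id to avoid a name clash.
CREATE INDEX customers_organization_id_active_visible_idx ON customers (organization_id, id)
WHERE active AND visible;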

I appended id to allow index-only scans. The column is otherwise useless in the index for this query.

total_sales has 7,104,441 rows. Indexes are very important. I suggest:

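-- Named differently from the existing index_total_sales_on_product_category_customer_id to avoid a name clash.
CREATE INDEX total_sales_period_category_customer_id_idx
ON total_sales (period_id, product_category_id, customer_id, id);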

Again, aiming for index-only scans. This one is the most important.

You can drop the completely redundant index index_total_sales_on_product_category_id.
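
Before dropping anything, it may be worth checking how often each index has actually been used. A minimal sketch, assuming the statistics collector has been running long enough for the idx_scan counters to be meaningful:

```sql
-- Scan counts per index on the two tables discussed above.
SELECT relname, indexrelname, idx_scan
FROM   pg_stat_user_indexes
WHERE  relname IN ('product_categories', 'total_sales')
ORDER  BY relname, idx_scan;

-- If the numbers confirm that the indexes are not needed:
DROP INDEX index_product_categories_on_organization_id;
DROP INDEX index_total_sales_on_product_category_id;
```

The single-column index on product_category_id is covered by total_sales_product_category_period, which has the same leading column.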

performance_analyses has 1,012,346 rows. Indexes are very important.
I would suggest another partial index with the condition size > 0:

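-- Named differently from the existing index_performance_analyses_on_status_id to avoid a name clash.
-- The predicate uses the bare column name: table aliases such as "pa" do not exist in CREATE INDEX.
CREATE INDEX performance_analyses_total_sales_id_size_idx
ON performance_analyses (total_sales_id)
WHERE size > 0;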

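However:

Rows Removed by Filter: 0

Seems like this condition serves no purpose? Are there any rows where size > 0 isn't true?

After creating these indexes, you need to ANALYZE the tables.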
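
A minimal sketch of that step, followed by a re-check of the plan. It uses only the tables and the query from the question. EXPLAIN (ANALYZE, BUFFERS) actually executes the query, so expect it to take as long as the query itself.

```sql
-- Refresh planner statistics for all four tables involved.
ANALYZE customers;
ANALYZE product_categories;
ANALYZE total_sales;
ANALYZE performance_analyses;

-- Re-run the query with timing and buffer information.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   product_categories   pc
JOIN   customers            c  USING (organization_id)
JOIN   total_sales          ts ON ts.customer_id = c.id
JOIN   performance_analyses pa ON pa.total_sales_id = ts.id
WHERE  pc.organization_id = 3
AND    c.active
AND    c.visible
AND    ts.product_category_id = pc.id
AND    ts.period_id = 193
AND    pa.size > 0;
```

In the output, look for "Index Only Scan" nodes and their "Heap Fetches" counter. A high number of heap fetches usually means the visibility map is stale and the table needs a VACUUM before index-only scans pay off.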

Table statistics

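Generally, I see many bad estimates. Postgres underestimates the number of rows returned in almost every step. The nested loops we see would work much better for fewer rows. Unless this is an unlikely coincidence, your table statistics are badly outdated. You need to revisit your autovacuum settings, and probably also the per-table settings for your two big tables performance_analyses and total_sales.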
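
To see how stale the statistics are and to make autovacuum more aggressive for just these two tables, something along these lines could be used. The scale factors are illustrative assumptions, not tuned recommendations; the server defaults are 0.2 for vacuum and 0.1 for analyze.

```sql
-- When were the big tables last vacuumed / analyzed, and how much has changed since?
SELECT relname, n_live_tup, n_dead_tup, n_mod_since_analyze,
       last_autovacuum, last_autoanalyze, last_analyze
FROM   pg_stat_user_tables
WHERE  relname IN ('performance_analyses', 'total_sales');

-- Per-table overrides (example values only).
ALTER TABLE total_sales
  SET (autovacuum_vacuum_scale_factor = 0.02, autovacuum_analyze_scale_factor = 0.01);
ALTER TABLE performance_analyses
  SET (autovacuum_vacuum_scale_factor = 0.02, autovacuum_analyze_scale_factor = 0.01);
```

Lower scale factors make autovacuum kick in after a smaller fraction of the rows has changed, which matters most for tables with millions of rows.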

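According to your comment, you already ran VACUUM and ANALYZE, and that made the query slower. That doesn't make much sense. I would run VACUUM FULL on these two tables once (if you can afford an exclusive lock). Otherwise, try pg_repack.
With all the fishy statistics and bad plans, I would consider running a complete vacuumdb -fz yourdb on your database. That rewrites all tables and indexes in pristine condition, but it is not something to use on a regular basis. It is also expensive and will lock your database for an extended period of time!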
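
For the two big tables only, the one-time rewrite could look like the sketch below. Comparing the sizes before and after gives a rough idea of how much bloat was removed. Each VACUUM FULL takes an exclusive lock on its table for the duration, and VACUUM cannot run inside a transaction block.

```sql
-- Size before (table plus indexes and TOAST).
SELECT pg_size_pretty(pg_total_relation_size('total_sales'))          AS total_sales_size,
       pg_size_pretty(pg_total_relation_size('performance_analyses')) AS performance_analyses_size;

-- Rewrite the tables and refresh statistics in one pass.
VACUUM (FULL, ANALYZE) total_sales;
VACUUM (FULL, ANALYZE) performance_analyses;

-- Size after.
SELECT pg_size_pretty(pg_total_relation_size('total_sales'))          AS total_sales_size,
       pg_size_pretty(pg_total_relation_size('performance_analyses')) AS performance_analyses_size;
```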

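While you are at it, also take a look at the cost settings of your database.
Related:

Keep PostgreSQL from sometimes choosing a bad query plan

Postgres Slow Queries - Autovacuum frequency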
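
As a starting point for reviewing the cost settings mentioned above, the current values can be read from pg_settings and changed for a single session to see whether the plan improves. The value 1.1 for random_page_cost is only an assumption that fits storage with cheap random reads (SSD, or data mostly in cache); it is not a general recommendation.

```sql
-- Inspect the planner settings that most often influence plan choice.
SELECT name, setting, unit
FROM   pg_settings
WHERE  name IN ('seq_page_cost', 'random_page_cost', 'cpu_tuple_cost',
                'effective_cache_size', 'work_mem');

-- Try a different value for this session only, then re-run EXPLAIN on the query.
SET random_page_cost = 1.1;   -- example value, see note above

-- Return to the configured value afterwards.
RESET random_page_cost;
```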

Although the optimizer should theoretically be able to do this, I often find that these changes can massively improve performance:

use proper joins (instead of where id in (select ...))

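order the references to tables in the from clause such that the fewest rows are returned at each join; in particular, the condition on the first table (in the where clause) should be the most restrictive one (and should use an index)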

move all conditions on joined tables into the on condition of the joins

Try this (aliases added for readability):

SELECT COUNT(*)
FROM total_sales ts
JOIN product_categories pc ON ts.product_category_id = pc.id AND pc.organization_id = 3
JOIN customers c ON ts.customer_id = c.id AND c.organization_id = 3
    AND c.active AND c.visible  -- customer filters from the query in the question
JOIN performance_analyses pa ON ts.id = pa.total_sales_id AND pa.size > 0
WHERE ts.period_id = 193;

You need to create this index for optimal performance (to allow an index-only scan on total_sales):

CREATE INDEX ts_pid_pcid_cid ON total_sales (period_id, product_category_id, customer_id);

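This approach first narrows the data down to a single period, so it will scale into the future (staying roughly constant), because the number of sales per period will be roughly constant.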

Related discussion

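I just added more information about the tables. Unfortunately, it didn't change the running time. Does the new information about the table sizes change anything? Thanks! (Although it didn't improve the time, the three points you listed did fill gaps in my knowledge!)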

@JoãoDaniel Try running vacuum; analyze total_sales; analyze product_categories; analyze customers; analyze performance_analyses; and then retry the query.

Oh dear, it increased the execution time from 500 ms to 1000 ms. Does that make sense?

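@JoãoDaniel That doesn't make much sense, no. Just a guess: maybe the data was cached in memory before. Try running a lot of similar queries (for different periods/organizations) and then time it again.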

The estimates are not accurate. The Postgres planner is wrongly using nested loops. Try penalizing nested loops with the statement set enable_nestloop to off.
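
A cautious way to try that suggestion is to limit the setting to a single transaction, so it cannot leak into other queries on the same connection. This is a diagnostic sketch rather than a permanent fix; the query is the one from the question, written with joins.

```sql
BEGIN;
SET LOCAL enable_nestloop = off;  -- discourages nested loops for this transaction only

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   total_sales ts
JOIN   product_categories pc   ON pc.id = ts.product_category_id
JOIN   customers c             ON c.id  = ts.customer_id
JOIN   performance_analyses pa ON pa.total_sales_id = ts.id
WHERE  ts.period_id = 193
AND    pc.organization_id = 3
AND    c.organization_id = 3
AND    c.active
AND    c.visible
AND    pa.size > 0;

ROLLBACK;  -- the setting is discarded here
```

If the plan without nested loops is clearly faster, that points back at the row estimates (statistics) or the cost settings as the real problem, rather than at the query text.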