Count on join of big tables with conditions is slow
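This query runs in a reasonable time when the tables are small. I am trying to identify the bottleneck, but I am not sure how to analyze it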
    SELECT COUNT(*)
    FROM performance_analyses
    INNER JOIN total_sales ON total_sales.id = performance_analyses.total_sales_id
    WHERE (size > 0)
      AND total_sales.customer_id IN (
            SELECT customers.id
            FROM customers
            WHERE customers.active = 't'
              AND customers.visible = 't'
              AND customers.organization_id = 3
          )
      AND total_sales.product_category_id IN (
            SELECT product_categories.id
            FROM product_categories
            WHERE product_categories.organization_id = 3
          )
      AND total_sales.period_id = 193;
I have already tried INNER JOIN'ing
Here is the link to the EXPLAIN output: https://explain.depesz.com/s/9lhr
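For reference, a plan like the one linked above is typically produced with EXPLAIN (ANALYZE, BUFFERS). The statement below is only a sketch: it uses a simplified two-table version of the query (an assumption, to keep the example short), and it really executes the query, so it takes as long as the query itself.

```sql
-- Sketch: simplified version of the query from the question, not the full statement.
-- ANALYZE executes the query and reports actual row counts and timings per plan node.
-- BUFFERS adds shared-buffer hit/read counts per plan node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   performance_analyses pa
JOIN   total_sales ts ON ts.id = pa.total_sales_id
WHERE  pa.size > 0
AND    ts.period_id = 193;
```

When reading the output, compare the estimated "rows=" with the actual "rows=" on each node. Large gaps between the two usually point at the step where the planner goes wrong.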
Postgres version:
PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit
Tables and indexes:
    CREATE TABLE total_sales (
      id serial NOT NULL,
      value DOUBLE PRECISION,
      start_date DATE,
      end_date DATE,
      product_category_customer_id INTEGER,
      created_at TIMESTAMP WITHOUT TIME ZONE,
      updated_at TIMESTAMP WITHOUT TIME ZONE,
      processed BOOLEAN,
      customer_id INTEGER,
      product_category_id INTEGER,
      period_id INTEGER,
      CONSTRAINT total_sales_pkey PRIMARY KEY (id)
    );
    CREATE INDEX index_total_sales_on_customer_id ON total_sales (customer_id);
    CREATE INDEX index_total_sales_on_period_id ON total_sales (period_id);
    CREATE INDEX index_total_sales_on_product_category_customer_id ON total_sales (product_category_customer_id);
    CREATE INDEX index_total_sales_on_product_category_id ON total_sales (product_category_id);
    CREATE INDEX total_sales_product_category_period ON total_sales (product_category_id, period_id);
    CREATE INDEX ts_pid_pcid_cid ON total_sales (period_id, product_category_id, customer_id);

    CREATE TABLE performance_analyses (
      id serial NOT NULL,
      total_sales_id INTEGER,
      status_id INTEGER,
      created_at TIMESTAMP WITHOUT TIME ZONE,
      updated_at TIMESTAMP WITHOUT TIME ZONE,
      size DOUBLE PRECISION,
      period_size INTEGER,
      nominal_variation DOUBLE PRECISION,
      percentual_variation DOUBLE PRECISION,
      relative_performance DOUBLE PRECISION,
      time_ago_max INTEGER,
      deseasonalized_series text,
      significance CHARACTER VARYING,
      relevance CHARACTER VARYING,
      original_variation DOUBLE PRECISION,
      last_level DOUBLE PRECISION,
      quantiles text,
      range text,
      analysis_method CHARACTER VARYING,
      CONSTRAINT performance_analyses_pkey PRIMARY KEY (id)
    );
    CREATE INDEX index_performance_analyses_on_status_id ON performance_analyses (status_id);
    CREATE INDEX index_performance_analyses_on_total_sales_id ON performance_analyses (total_sales_id);

    CREATE TABLE product_categories (
      id serial NOT NULL,
      name CHARACTER VARYING,
      organization_id INTEGER,
      created_at TIMESTAMP WITHOUT TIME ZONE,
      updated_at TIMESTAMP WITHOUT TIME ZONE,
      external_id CHARACTER VARYING,
      CONSTRAINT product_categories_pkey PRIMARY KEY (id)
    );
    CREATE INDEX index_product_categories_on_organization_id ON product_categories (organization_id);

    CREATE TABLE customers (
      id serial NOT NULL,
      name CHARACTER VARYING,
      external_id CHARACTER VARYING,
      region_id INTEGER,
      organization_id INTEGER,
      created_at TIMESTAMP WITHOUT TIME ZONE,
      updated_at TIMESTAMP WITHOUT TIME ZONE,
      active BOOLEAN DEFAULT FALSE,
      visible BOOLEAN DEFAULT FALSE,
      segment_id INTEGER,
      "group" BOOLEAN,
      group_id INTEGER,
      ticket_enabled BOOLEAN DEFAULT TRUE,
      CONSTRAINT customers_pkey PRIMARY KEY (id)
    );
    CREATE INDEX index_customers_on_organization_id ON customers (organization_id);
    CREATE INDEX index_customers_on_region_id ON customers (region_id);
    CREATE INDEX index_customers_on_segment_id ON customers (segment_id);
Row counts:
- customers - 6,970 rows
- product_categories - 34 rows
- performance_analyses - 1,012,346 rows
- total_sales - 7,104,441 rows
Your query, rewritten and 100% equivalent:
    SELECT count(*)
    FROM   product_categories   pc
    JOIN   customers            c  USING (organization_id)
    JOIN   total_sales          ts ON ts.customer_id = c.id
    JOIN   performance_analyses pa ON pa.total_sales_id = ts.id
    WHERE  pc.organization_id = 3
    AND    c.active  -- boolean can be used directly
    AND    c.visible
    AND    ts.product_category_id = pc.id
    AND    ts.period_id = 193
    AND    pa.size > 0;
Another answer suggests moving all conditions to
Explicit inner join syntax (INNER JOIN, CROSS JOIN, or unadorned JOIN)
is semantically the same as listing the input relations in FROM, so it
does not constrain the join order.
Bold emphasis mine. There is more, read the manual.
The key setting is
The important point is that these different join possibilities give
semantically equivalent results but might have hugely different
execution costs. Therefore, the planner will explore all of them to
try to find the most efficient query plan.
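To see this planner behavior for yourself, a session-local experiment along the following lines may help. It relies on two standard PostgreSQL settings, join_collapse_limit and from_collapse_limit (both default to 8). The three-table query is a shortened stand-in for the real one, which is an assumption made only to keep the sketch small.

```sql
-- Inspect the current limits (the default for both is 8).
SHOW join_collapse_limit;
SHOW from_collapse_limit;

-- Experiment only: with join_collapse_limit = 1 the planner keeps
-- explicit JOINs in the order they are written.
BEGIN;
SET LOCAL join_collapse_limit = 1;   -- reverts automatically at ROLLBACK
EXPLAIN
SELECT count(*)
FROM   total_sales ts
JOIN   performance_analyses pa ON pa.total_sales_id = ts.id
JOIN   customers c ON c.id = ts.customer_id
WHERE  ts.period_id = 193;
ROLLBACK;
```

With the default limits and only a handful of tables, the plan should not depend on how the joins are ordered in the query text. A different plan under the lowered limit shows that the written order only matters once reordering is switched off.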
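Related:
- Sample Query to show Cardinality estimation error in PostgreSQL
- A: Slow fulltext search due to wildly inaccurate row estimates
Finally,
In both cases, your simple subqueries dig up a single unique column, so there is no effective difference in this case - except
Your query
Indexes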
    CREATE INDEX index_customers_on_organization_id ON customers (organization_id, id)
    WHERE  active AND visible;
I appended
    CREATE INDEX index_total_sales_on_product_category_customer_id
    ON     total_sales (period_id, product_category_id, customer_id, id);
Again, aiming for index-only scans. This is the most important one.
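Index-only scans only pay off when the visibility map of the table is reasonably current, because Postgres still has to visit the heap for pages that are not marked all-visible. A rough way to check, assuming the covering index on total_sales exists and using a simplified single-table count as a stand-in for the real query:

```sql
-- Refresh the visibility map and the planner statistics for the big table.
VACUUM (ANALYZE) total_sales;

-- Simplified probe query (an assumption, not the full query from the question).
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   total_sales
WHERE  period_id = 193;
```

In the output, look for a node named "Index Only Scan" and its "Heap Fetches" line. A high number of heap fetches suggests the table needs vacuuming more often for the index-only scan to be worthwhile.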
You can delete the completely redundant index
I suggest using the condition
    CREATE INDEX index_performance_analyses_on_status_id
    ON     performance_analyses (total_sales_id)
    WHERE  size > 0;  -- no table alias here: an index predicate cannot reference "pa"
However:
Rows Removed by Filter: 0"
It looks like this condition serves no purpose? Are there
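Whether the condition size > 0 filters out anything at all can be checked directly. The following is a small sketch using the aggregate FILTER clause, which is available from PostgreSQL 9.4 on. The output column names are made up for illustration.

```sql
-- Count how many rows the predicate keeps and how many it would exclude.
SELECT count(*)                             AS total_rows,
       count(*) FILTER (WHERE size > 0)     AS size_positive,
       count(*) FILTER (WHERE size <= 0)    AS size_not_positive,
       count(*) FILTER (WHERE size IS NULL) AS size_null
FROM   performance_analyses;
```

If size_positive equals total_rows, a partial index with that predicate is no smaller than a plain index on the same column.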
After creating these indexes you need to
Table statistics
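In general, I see many bad estimates. Postgres underestimates the number of rows returned at almost every step. The nested loops we see would work better for fewer rows. Unless this is an unlikely coincidence, your table statistics are badly outdated. You need to visit your autovacuum settings and probably also the per-table settings for your two big tables
According to your comment, you have already run
With all the suspicious statistics and bad plans, I would consider running a complete
While you are at it, also look at the cost settings of your database.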
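A quick way to review the points above is to look at when the two big tables were last analyzed and what the current cost settings are. This is a sketch under assumptions: the scale factor below is a placeholder chosen for illustration, not a recommendation taken from this thread.

```sql
-- When were the two big tables last vacuumed and analyzed?
SELECT relname, n_live_tup, n_dead_tup,
       last_analyze, last_autoanalyze, last_autovacuum
FROM   pg_stat_user_tables
WHERE  relname IN ('total_sales', 'performance_analyses');

-- Refresh planner statistics manually.
ANALYZE total_sales;
ANALYZE performance_analyses;

-- Hypothetical per-table override so autovacuum analyzes the table more often
-- (the global default for this factor is 0.1; 0.02 is a placeholder value).
ALTER TABLE total_sales SET (autovacuum_analyze_scale_factor = 0.02);

-- Current cost settings used by the planner.
SHOW random_page_cost;
SHOW seq_page_cost;
SHOW effective_cache_size;
```

After refreshing the statistics, it is worth rerunning EXPLAIN ANALYZE to see whether estimated and actual row counts have moved closer together.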
Related:
- Keep PostgreSQL from sometimes choosing a bad query plan
- Postgres Slow Queries - Autovacuum frequency
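Although in theory the optimizer should be able to do this, I often find that these changes can massively improve performance:
- use proper joins (instead of where id in (select ...))
- order the references to tables in the from clause so that the fewest rows are returned at each join; in particular, the condition on the first table (in the where clause) should be the most restrictive one (and should use an index)
- move all conditions on joined tables into the on condition of the join
Try this (aliases added for readability):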
    SELECT COUNT(*)
    FROM total_sales ts
    JOIN product_categories pc ON ts.product_category_id = pc.id AND pc.organization_id = 3
    JOIN customers c ON ts.customer_id = c.id AND c.organization_id = 3
    JOIN performance_analyses pa ON ts.id = pa.total_sales_id AND pa.size > 0
    WHERE ts.period_id = 193;
You will need to create this index for optimal performance (to allow an index-only scan on total_sales):
    CREATE INDEX ts_pid_pcid_cid ON total_sales (period_id, product_category_id, customer_id);
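This approach first narrows the data down to a single period, so it will scale into the future (staying roughly constant), because the number of sales per period will be roughly constant.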
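The assumption that each period holds a roughly constant number of sales can be checked with a simple aggregate. A sketch, where the limit of 10 is arbitrary and it is assumed that higher period_id values are more recent:

```sql
-- Rows per period for the highest period ids.
SELECT period_id, count(*) AS sales_rows
FROM   total_sales
GROUP  BY period_id
ORDER  BY period_id DESC
LIMIT  10;
```

If the counts vary widely between periods, the runtime of a query restricted to one period will vary with them.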
The estimates are inaccurate. The Postgres planner wrongly chooses nested loops - try, via the statement