How to generate multiple time series in one SQL query?
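Here is the database layout. I have a table of sales that is sparse over time, aggregated per day. If an item had 10 sales on 01-01-2015 there is an entry for it, but if it had 0 sales there is no entry. Something like this: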
    |--------------------------------------|
    | day_of_year | year | sales | item_id |
    |--------------------------------------|
    | 01          | 2015 | 20    | A1      |
    | 01          | 2015 | 11    | A2      |
    | 07          | 2015 | 09    | A1      |
    | ...         | ...  | ...   | ...     |
    |--------------------------------------|
This is how I get the time series for one item:
    SELECT doy, MAX(sales)
    FROM (
        SELECT day_of_year AS doy, sales AS sales
        FROM myschema.entry_daily
        WHERE item_id = theNameOfmyItem
          AND year = 2015
          AND day_of_year < 150
        UNION
        SELECT doy AS doy, 0 AS sales
        FROM generate_series(1, 149) AS doy
    ) AS t
    GROUP BY doy
    ORDER BY doy;
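I currently loop in R and run one query per item, then combine the results into a data frame. But this is very slow. What I actually want is a single query that returns all the data in the following form: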
    |----------------------------------------------|
    | item_id | 01 | 02 | 03 | 04 | 05 | ... | 149 |
    |----------------------------------------------|
    | A1      | 10 | 00 | 00 | 05 | 12 | ... | 11  |
    | A2      | 11 | 00 | 30 | 01 | 15 | ... | 09  |
    | A3      | 20 | 00 | 00 | 05 | 17 | ... | 20  |
    | ...                                          |
    |----------------------------------------------|
Is this possible? By the way, I am using a Postgres database.
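Solution 1. Simple query with an aggregate.
The simplest and fastest way to get the expected result. The comma-separated sales column is then parsed in the client program.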
    SELECT
        item,
        -- ORDER BY doy keeps the values in day order; without it the order is not guaranteed
        string_agg(COALESCE(sales, 0)::text, ',' ORDER BY doy) sales
    FROM (
        SELECT DISTINCT item_id item, doy
        FROM generate_series(1, 10) doy  -- change 10 to given n
        CROSS JOIN entry_daily
    ) sub
    LEFT JOIN entry_daily ON item_id = item AND day_of_year = doy
    GROUP BY 1
    ORDER BY 1;

     item |        sales
    ------+----------------------
     A1   | 20,0,0,0,0,0,9,0,0,0
     A2   | 11,0,0,0,0,0,0,0,0,0
    (2 rows)
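Parsing the aggregated column on the client side takes only a few lines. The sketch below uses Python purely for illustration (the question's client is R) and assumes the result has already been fetched as (item, sales) pairs. `parse_sales_rows` is a hypothetical helper name, not part of any library, and the rows are hard-coded rather than read from a database.

```python
def parse_sales_rows(rows):
    """Turn (item, "20,0,0,...") pairs into {item: [20, 0, 0, ...]}."""
    return {item: [int(v) for v in sales.split(",")] for item, sales in rows}


# Rows shaped like the output of the Solution 1 query (hard-coded sample, not fetched).
rows = [
    ("A1", "20,0,0,0,0,0,9,0,0,0"),
    ("A2", "11,0,0,0,0,0,0,0,0,0"),
]

wide = parse_sales_rows(rows)
print(wide["A1"][6])  # sales of A1 on day 7 (index 6, since days start at 1)
```

Each list position corresponds to one day, so the result maps directly onto the wide layout asked for in the question.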
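Solution 2. Dynamically created view.
Based on solution 1, with array_agg() in place of string_agg(). The function creates a view with the given number of day columns.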
    CREATE OR REPLACE FUNCTION create_items_view(view_name text, days int)
    RETURNS void LANGUAGE plpgsql AS $$
    DECLARE
        list text;
    BEGIN
        -- build the column list: s[1] AS "1", s[2] AS "2", ...
        SELECT string_agg(format('s[%s] AS "%s"', i::text, i::text), ',')
        INTO list
        FROM generate_series(1, days) i;

        EXECUTE(format($f$
            DROP VIEW IF EXISTS %s;
            CREATE VIEW %s AS
            SELECT item, %s
            FROM (
                -- ORDER BY doy so that s[n] really holds the sales of day n
                SELECT item, array_agg(COALESCE(sales, 0) ORDER BY doy) s
                FROM (
                    SELECT DISTINCT item_id item, doy
                    FROM generate_series(1, %s) doy
                    CROSS JOIN entry_daily
                ) sub
                LEFT JOIN entry_daily ON item_id = item AND day_of_year = doy
                GROUP BY 1
                ORDER BY 1
            ) q
            $f$, view_name, view_name, list, days)
        );
    END $$;
Usage:
    SELECT create_items_view('items_view_10', 10);
    SELECT * FROM items_view_10;

     item | 1  | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
    ------+----+---+---+---+---+---+---+---+---+----
     A1   | 20 | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 0 |  0
     A2   | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  0
    (2 rows)
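Solution 3. Crosstab.
Easy to use, but very uncomfortable for a larger number of columns because the row format has to be defined explicitly. Note that days without sales come back as NULL rather than 0.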
    CREATE EXTENSION IF NOT EXISTS tablefunc;

    SELECT * FROM crosstab (
        'select item_id, day_of_year, sales from entry_daily order by 1',
        'select i from generate_series (1, 10) i'
    ) AS ct (item_id text, "1" int, "2" int, "3" int, "4" int, "5" int,
             "6" int, "7" int, "8" int, "9" int, "10" int);

     item_id | 1  | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
    ---------+----+---+---+---+---+---+---+---+---+----
     A1      | 20 |   |   |   |   |   | 9 |   |   |
     A2      | 11 |   |   |   |   |   |   |   |   |
    (2 rows)
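For 149 day columns, writing the column definition list by hand is tedious, so one option is to generate the statement text on the client. This is only a sketch in Python (used here for illustration; the same string building works in R), and `crosstab_query` is a hypothetical helper name. It only builds the SQL string and does not talk to the database.

```python
def crosstab_query(days):
    """Build the crosstab statement text for day columns 1..days."""
    if days < 1:
        raise ValueError("days must be at least 1")
    # One quoted integer column per day: "1" int, "2" int, ...
    columns = ", ".join(f'"{d}" int' for d in range(1, days + 1))
    return (
        "SELECT * FROM crosstab(\n"
        "    'select item_id, day_of_year, sales from entry_daily order by 1',\n"
        f"    'select i from generate_series(1, {int(days)}) i'\n"
        f") AS ct (item_id text, {columns});"
    )


print(crosstab_query(3))
```

The generated text can then be sent as an ordinary query, for example `crosstab_query(149)` for the range in the question.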
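First, you need a table containing all dates, so that the missing dates can be filled in. 100 years of dates is only about 36,500 rows, so it is not very big, and it saves generating the dates on every query.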
allDates:

    date_id
    s_date

Or create calculated fields:

    date_id
    s_date
    doy = EXTRACT(DOY FROM s_date)
    year = EXTRACT(YEAR FROM s_date)
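To get a feel for the size and content of such a calendar table, here is a small sketch that generates its rows on the client, in Python for illustration. The 2000 to 2099 range and the helper name `all_dates` are assumptions made for this example. In Postgres itself the table could equally be filled with generate_series.

```python
from datetime import date, timedelta


def all_dates(start, end):
    """Yield (date_id, s_date, doy, year) for every day from start to end inclusive."""
    day = start
    date_id = 1
    while day <= end:
        # tm_yday is the day of the year, 1..365 (or 366 in leap years)
        yield (date_id, day, day.timetuple().tm_yday, day.year)
        day += timedelta(days=1)
        date_id += 1


# Assumed range for the example: 100 calendar years.
rows = list(all_dates(date(2000, 1, 1), date(2099, 12, 31)))
print(len(rows))  # number of rows needed for 100 years
```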
Your base query would be:
    SELECT
        AD.year,
        AD.doy,
        allitems.item_id,
        COALESCE(SUM(ED.sales), 0) AS max_sales
    FROM
        (SELECT DISTINCT item_id
         FROM entry_daily
        ) AS allitems
    CROSS JOIN alldates AD
    LEFT JOIN entry_daily ED
           ON ED.day_of_year = AD.doy
          AND ED.year = AD.year
          AND ED.item_id = allitems.item_id
    WHERE AD.year = 2015
    GROUP BY
        AD.year, AD.doy, allitems.item_id
    ORDER BY
        AD.year, AD.doy, allitems.item_id
You will get this output:
    | year | doy | item_id | max_sales |
    |------|-----|---------|-----------|
    | 2015 |   1 |      A1 |        20 |
    | 2015 |   1 |      A2 |        11 |
    | 2015 |   2 |      A1 |         0 |
    | 2015 |   2 |      A2 |         0 |
    | 2015 |   3 |      A1 |         0 |
    | 2015 |   3 |      A2 |         0 |
    | 2015 |   4 |      A1 |         0 |
    | 2015 |   4 |      A2 |         0 |
    | 2015 |   5 |      A1 |         0 |
    | 2015 |   5 |      A2 |         0 |
    | 2015 |   6 |      A1 |         0 |
    | 2015 |   6 |      A2 |         0 |
    | 2015 |   7 |      A1 |        39 |
    | 2015 |   7 |      A2 |         0 |
    | 2015 |   8 |      A1 |         0 |
    | 2015 |   8 |      A2 |         0 |
    | 2015 |   9 |      A1 |         0 |
    | 2015 |   9 |      A2 |         0 |
    | 2015 |  10 |      A1 |         0 |
    | 2015 |  10 |      A2 |         0 |
Then you need to install tablefunc and use crosstab to pivot this table.
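If installing tablefunc is not an option, the same pivot can be done on the client once the long-form rows have been fetched. The following is a minimal sketch in Python (for illustration only; it is not what crosstab does internally). It assumes rows of the form (year, doy, item_id, sales) for a single year, and `pivot_long_to_wide` is a hypothetical helper name.

```python
def pivot_long_to_wide(rows, days):
    """rows: (year, doy, item_id, sales) tuples for one year.

    Returns {item_id: [sales for day 1, ..., sales for day `days`]}.
    """
    wide = {}
    for _year, doy, item_id, sales in rows:
        # Every item starts with an all-zero series of the requested length.
        series = wide.setdefault(item_id, [0] * days)
        if 1 <= doy <= days:
            series[doy - 1] += sales
    return wide


# A few long-form rows shaped like the base query output (hard-coded sample).
long_rows = [
    (2015, 1, "A1", 20),
    (2015, 1, "A2", 11),
    (2015, 2, "A1", 0),
    (2015, 2, "A2", 0),
    (2015, 7, "A1", 39),
    (2015, 7, "A2", 0),
]

wide = pivot_long_to_wide(long_rows, days=10)
for item_id in sorted(wide):
    print(item_id, wide[item_id])
```

Because missing days default to 0, this also works on the sparse table directly, without the calendar join.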
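Try this self-contained code, where we use 5 instead of 149 to keep the output short.
In (1) we use a single SQL statement, as requested, to generate all the series, giving a long-form result. Relational databases normally work with long form rather than wide form, so that form may be preferable, but if it is not, we convert to wide form with the reshape2 package.
In (2) we show how to replace the SQL statement with R code that uses the dplyr package.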
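1) PostgreSQL. Regarding the SQL statement below, the innermost select generates a table 1, 2, ..., 5 whose column is named day_of_year.
Assuming you have set up PostgreSQL for use with sqldf as described in FAQ #12 on the sqldf home page (https://github.com/ggrothendieck/sqldf), the following should illustrate it, and it is self-contained code that you can copy and paste into your session.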
    library(sqldf)
    library(RPostgreSQL)

    # input data
    entry_daily <-
      structure(list(day_of_year = c(1L, 1L, 7L), year = c(2015L, 2015L,
        2015L), sales = c(20L, 11L, 9L), item_id = structure(c(1L, 2L,
        1L), .Label = c("A1", "A2"), class = "factor")), .Names = c("day_of_year",
        "year", "sales", "item_id"), class = "data.frame", row.names = c(NA, -3L))

    # cross join days 1..5 with the distinct (year, item_id) pairs,
    # then left join the actual sales and sum them (0 where there is no entry)
    s <- sqldf("select A.item_id, A.year, A.day_of_year, sum(coalesce(B.sales, 0)) sales
          from (select distinct x.day_of_year, y.year, y.item_id
                from (select * from generate_series(1, 5) as day_of_year) as x
                cross join entry_daily as y) as A
          left join entry_daily as B
            on A.year = B.year and A.day_of_year = B.day_of_year and A.item_id = B.item_id
          where A.year = 2015
          group by A.item_id, A.year, A.day_of_year
          order by A.item_id, A.year, A.day_of_year")
The output of the above query is this data.frame:

    > s
       item_id year day_of_year sales
    1       A1 2015           1    20
    2       A1 2015           2     0
    3       A1 2015           3     0
    4       A1 2015           4     0
    5       A1 2015           5     0
    6       A2 2015           1    11
    7       A2 2015           2     0
    8       A2 2015           3     0
    9       A2 2015           4     0
    10      A2 2015           5     0
If you really need wide form, we can convert in R using dcast from the reshape2 package:

    library(reshape2)
    dcast(s, item_id + year ~ day_of_year, value.var = "sales")

giving:

      item_id year  1 2 3 4 5
    1      A1 2015 20 0 0 0 0
    2      A2 2015 11 0 0 0 0
2) dplyr. Note that, as an alternative to the SQL statement, this R code computes the same long-form result:

    library(dplyr)

    s2 <- expand.grid(item_id = unique(entry_daily$item_id), year = 2015, day_of_year = 1:5) %>%
      left_join(entry_daily) %>%
      group_by(item_id, year, day_of_year) %>%
      summarize(sales = sum(sales, na.rm = TRUE)) %>%
      ungroup() %>%
      arrange(item_id, year, day_of_year)

giving:

    > s2
    Joining by: c("item_id", "year", "day_of_year")
    Source: local data frame [10 x 4]
    Groups: item_id, year [?]

       item_id  year day_of_year sales
        (fctr) (dbl)       (int) (int)
    1       A1  2015           1    20
    2       A1  2015           2     0
    3       A1  2015           3     0
    4       A1  2015           4     0
    5       A1  2015           5     0
    6       A2  2015           1    11
    7       A2  2015           2     0
    8       A2  2015           3     0
    9       A2  2015           4     0
    10      A2  2015           5     0
Now optionally use the same dcast as in (1).