Apply several summary functions on several variables by group in one call
I have the following data frame:

```r
x <- read.table(text = "  id1 id2 val1 val2
1   a   x    1    9
2   a   x    2    4
3   a   y    3    5
4   a   y    4    9
5   b   x    1    7
6   b   y    4    4
7   b   x    3    9
8   b   y    2    8", header = TRUE)
```
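I want to calculate the mean of val1 and val2 grouped by id1 and id2, and simultaneously count the number of rows for each id1-id2 combination. I can perform each calculation separately:

```r
# calculate mean
aggregate(. ~ id1 + id2, data = x, FUN = mean)

# count rows
aggregate(. ~ id1 + id2, data = x, FUN = length)
```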
To do both calculations in one call, I tried:

```r
do.call("rbind", aggregate(. ~ id1 + id2, data = x, FUN = function(x) data.frame(m = mean(x), n = length(x))))
```
However, I get garbled output and a warning:

```r
#     m   n
# id1 1   2
# id2 1   1
#     1.5 2
#     2   2
#     3.5 2
#     3   2
#     6.5 2
#     8   2
#     7   2
#     6   2
# Warning message:
# In rbind(id1 = c(1L, 2L, 1L, 2L), id2 = c(1L, 1L, 2L, 2L), val1 = list( :
#   number of columns of result is not a multiple of vector length (arg 1)
```
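I could use the plyr package, but my data set is quite large and plyr is very slow (almost unusable) when the size of the dataset grows.

How can I perform several calculations in one call?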
You can do it all in one step and get proper labels:

```r
aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
#   id1 id2 val1.mn val1.n val2.mn val2.n
# 1   a   x     1.5    2.0     6.5    2.0
# 2   b   x     2.0    2.0     8.0    2.0
# 3   a   y     3.5    2.0     7.0    2.0
# 4   b   y     3.0    2.0     6.0    2.0
```

This creates a data frame with two id columns and two matrix columns:

```r
str( aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) )
'data.frame':   4 obs. of  4 variables:
 $ id1 : Factor w/ 2 levels "a","b": 1 2 1 2
 $ id2 : Factor w/ 2 levels "x","y": 1 1 2 2
 $ val1: num [1:4, 1:2] 1.5 2 3.5 3 2 2 2 2
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr  "mn" "n"
 $ val2: num [1:4, 1:2] 6.5 8 7 6 2 2 2 2
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr  "mn" "n"
```
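Because `val1` and `val2` are matrix columns, a single statistic is read with matrix indexing rather than with a name such as `agg$val1.mn`. The following is a minimal sketch of that; the name `agg` and the rebuilt copy of the example data are only for illustration.

```r
# Same example data as in the question, rebuilt so the block runs on its own
x <- data.frame(
  id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

# 'agg' is just an illustrative name for the aggregate() result
agg <- aggregate(. ~ id1 + id2, data = x,
                 FUN = function(v) c(mn = mean(v), n = length(v)))

# The data frame itself has only four columns ...
ncol(agg)

# ... because val1 and val2 are each a two-column matrix
is.matrix(agg$val1)
colnames(agg$val1)

# Pull one statistic out of a matrix column by its column name
agg$val1[, "mn"]   # per-group means of val1
agg$val2[, "n"]    # per-group row counts
```

Printing `agg` shows names like `val1.mn`, but those are only how the matrix columns are displayed; `names(agg)` still returns the four top-level names.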
As @lord.garbage points out below, this can be converted to a data frame with ordinary columns using `do.call(data.frame, ...)`:

```r
str( do.call(data.frame, aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) ) )
'data.frame':   4 obs. of  6 variables:
 $ id1    : Factor w/ 2 levels "a","b": 1 2 1 2
 $ id2    : Factor w/ 2 levels "x","y": 1 1 2 2
 $ val1.mn: num  1.5 2 3.5 3
 $ val1.n : num  2 2 2 2
 $ val2.mn: num  6.5 8 7 6
 $ val2.n : num  2 2 2 2
```
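Note that the counts come back as `num` rather than `int`, because `c(mn = mean(x), n = length(x))` coerces both values to double. If integer counts matter, one possible follow-up step is sketched below; the names `res` and `count_cols` are illustrative, and the `\\.n$` pattern assumes the count statistic was named `n` as above.

```r
# Same example data as in the question, rebuilt so the block runs on its own
x <- data.frame(
  id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

# Flatten the matrix columns into ordinary columns
res <- do.call(data.frame,
               aggregate(. ~ id1 + id2, data = x,
                         FUN = function(v) c(mn = mean(v), n = length(v))))

# Counts are stored as doubles at this point
is.double(res$val1.n)

# Convert every column whose name ends in ".n" back to integer
count_cols <- grep("\\.n$", names(res), value = TRUE)
res[count_cols] <- lapply(res[count_cols], as.integer)

str(res)
```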
This is the syntax for multiple variables on the LHS:

```r
aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
```
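The `cbind()` form is mainly useful when the data frame holds columns that should not be summarised. As a rough sketch, assume a hypothetical extra column `note` (not part of the original data); naming `val1` and `val2` explicitly keeps it out of the result.

```r
# Same example data as in the question, rebuilt so the block runs on its own
x <- data.frame(
  id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

# Hypothetical non-numeric column that should be left out of the summary
x$note <- letters[1:8]

# Only the variables listed inside cbind() are passed to FUN
agg <- aggregate(cbind(val1, val2) ~ id1 + id2, data = x,
                 FUN = function(v) c(mn = mean(v), n = length(v)))

names(agg)   # the 'note' column does not appear
```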
Given this in the question:
I could use the plyr package, but my data set is quite large and plyr is very slow (almost unusable) when the size of the dataset grows.
then with `data.table` you could try:

```r
> DT
   id1 id2 val1 val2
1:   a   x    1    9
2:   a   x    2    4
3:   a   y    3    5
4:   a   y    4    9
5:   b   x    1    7
6:   b   y    4    4
7:   b   x    3    9
8:   b   y    2    8

> DT[ , .(mean(val1), mean(val2), .N), by = .(id1, id2)]   # simplest
   id1 id2  V1  V2 N
1:   a   x 1.5 6.5 2
2:   a   y 3.5 7.0 2
3:   b   x 2.0 8.0 2
4:   b   y 3.0 6.0 2

> DT[ , .(val1.m = mean(val1), val2.m = mean(val2), count = .N), by = .(id1, id2)]  # named
   id1 id2 val1.m val2.m count
1:   a   x    1.5    6.5     2
2:   a   y    3.5    7.0     2
3:   b   x    2.0    8.0     2
4:   b   y    3.0    6.0     2

> DT[ , c(lapply(.SD, mean), count = .N), by = .(id1, id2)]   # mean over all columns
   id1 id2 val1 val2 count
1:   a   x  1.5  6.5     2
2:   a   y  3.5  7.0     2
3:   b   x  2.0  8.0     2
4:   b   y  3.0  6.0     2
```
For a comparison of ... with ..., see this benchmark (...).
With `dplyr`, you can use `summarise_all`:

```r
x %>%
  group_by(id1, id2) %>%
  summarise_all(funs(mean, n()))
```

This gives:

```r
  id1 id2 val1_mean val2_mean val1_n val2_n
1   a   x       1.5       6.5      2      2
2   a   y       3.5       7.0      2      2
3   b   x       2.0       8.0      2      2
4   b   y       3.0       6.0      2      2
```
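If you don't want to apply the functions to all non-grouping columns, either specify the columns they should be applied to or exclude the ones you don't want, using `summarise_at()`:

```r
# inclusion
x %>%
  group_by(id1, id2) %>%
  summarise_at(vars(val1, val2), funs(mean, n()))

# exclusion
x %>%
  group_by(id1, id2) %>%
  summarise_at(vars(-val2), funs(mean, n()))
```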
You could add a `count` column, aggregate with `sum`, and then divide by the count to get the mean:

```r
x$count <- 1
agg <- aggregate(. ~ id1 + id2, data = x, FUN = sum)
agg
#   id1 id2 val1 val2 count
# 1   a   x    3   13     2
# 2   b   x    4   16     2
# 3   a   y    7   14     2
# 4   b   y    6   12     2

agg[c("val1", "val2")] <- agg[c("val1", "val2")] / agg$count
agg
#   id1 id2 val1 val2 count
# 1   a   x  1.5  6.5     2
# 2   b   x  2.0  8.0     2
# 3   a   y  3.5  7.0     2
# 4   b   y  3.0  6.0     2
```

It has the advantage of preserving your column names and creating a single `count` column.
Perhaps you want to merge?

```r
# aggregate the question's data frame x twice, then join on the id columns
x.mean <- aggregate(. ~ id1+id2, x, mean)
x.len  <- aggregate(. ~ id1+id2, x, length)

merge(x.mean, x.len, by = c("id1", "id2"))

#   id1 id2 val1.x val2.x val1.y val2.y
# 1   a   x    1.5    6.5      2      2
# 2   a   y    3.5    7.0      2      2
# 3   b   x    2.0    8.0      2      2
# 4   b   y    3.0    6.0      2      2
```
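The default `.x` / `.y` suffixes do not say which statistic a column holds. One possible refinement, sketched here with illustrative suffix names, is to pass `suffixes` to `merge()`:

```r
# Same example data as in the question, rebuilt so the block runs on its own
x <- data.frame(
  id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

x.mean <- aggregate(. ~ id1 + id2, data = x, FUN = mean)
x.len  <- aggregate(. ~ id1 + id2, data = x, FUN = length)

# 'suffixes' replaces the default ".x" / ".y" on the clashing column names;
# ".mean" and ".n" are arbitrary labels chosen for this sketch
res <- merge(x.mean, x.len, by = c("id1", "id2"),
             suffixes = c(".mean", ".n"))

names(res)
```

This still runs `aggregate()` twice, so it addresses the labelling only, not the cost of the two passes.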
You can also use `plyr::each()` to supply several functions at once:

```r
aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = plyr::each(avg = mean, n = length))
```
Another `dplyr` option is `across()`:

```r
#devtools::install_github("tidyverse/dplyr")
library(dplyr)

x %>%
  group_by(id1, id2) %>%
  summarise(across(starts_with("val"), list(mean = mean, n = length)))
```

Result:

```r
# A tibble: 4 x 4
# Groups:   id1 [2]
  id1   id2   mean$val1 $val2 n$val1 $val2
  <fct> <fct>     <dbl> <dbl>  <int> <int>
1 a     x           1.5   6.5      2     2
2 a     y           3.5   7        2     2
3 b     x           2     8        2     2
4 b     y           3     6        2     2
```

```r
packageVersion("dplyr")
[1] ‘0.8.99.9000’
```