dplyr: group events in a session together
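I have a data frame like the one below. I want to group the events of each "unique" session together. For example, in the case below, id 1 interacted with my system twice, in two separate sessions. I want to spread (tidyr) the data, but per session rather than per id. How can I do this with dplyr and tidyr?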
```
> df
  id event                time
1  1 start 2015-05-16 22:46:53
2  1 valid 2015-05-16 22:46:56
3  1   end 2015-05-16 22:46:59
4  2 start 2015-05-16 22:46:53
5  2   bad 2015-05-16 22:47:00
6  1 start 2015-05-16 22:49:05
7  1   bad 2015-05-16 22:49:09
```
The desired output looks something like this:
```
> df1
  id           starttime           validtime             badtime             endtime
1  1 2015-05-16 22:46:53 2015-05-16 22:46:56                <NA> 2015-05-16 22:46:59
2  2 2015-05-16 22:46:53                <NA> 2015-05-16 22:47:00                <NA>
3  1 2015-05-16 22:49:05                <NA> 2015-05-16 22:49:09                <NA>
```
Using data.table:
```r
library(data.table) # v1.9.5+

# rleid(id) numbers each run of consecutive identical ids, giving one group per session
dcast(setDT(df)[, gr := rleid(id)], id + gr ~ paste0(event, 'time'),
      value.var = 'time')[order(starttime)][, c(1, 5:6, 3:4), with = FALSE]
#    id           starttime           validtime             badtime             endtime
# 1:  1 2015-05-16 22:46:53 2015-05-16 22:46:56                <NA> 2015-05-16 22:46:59
# 2:  2 2015-05-16 22:46:53                <NA> 2015-05-16 22:47:00                <NA>
# 3:  1 2015-05-16 22:49:05                <NA> 2015-05-16 22:49:09                <NA>
```
Data
```r
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 1L, 1L),
    event = c("start", "valid", "end", "start", "bad", "start", "bad"),
    time = structure(c(1431816413, 1431816416, 1431816419, 1431816413,
                       1431816420, 1431816545, 1431816549),
                     class = c("POSIXct", "POSIXt"),
                     tzone = "%Y-%m-%d %H:%M:%S")),
    .Names = c("id", "event", "time"),
    row.names = c("1", "2", "3", "4", "5", "6", "7"),
    class = "data.frame")
```
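The key step in the data.table answer is `rleid(id)`, which starts a new group number every time `id` changes from one row to the next. As a rough illustration of that idea in base R (the `id` vector below is a toy copy of the example column, and `run_id` / `run_id_rle` are names made up for this sketch), the same run ids can be built like this:

```r
# Toy id column mirroring the example data (an assumption for illustration)
id <- c(1L, 1L, 1L, 2L, 2L, 1L, 1L)

# Start a new run whenever the current value differs from the previous one
run_id <- cumsum(c(TRUE, id[-1] != id[-length(id)]))
run_id
# [1] 1 1 1 2 2 3 3

# Same result via rle(), which stores the length of each run explicitly
r <- rle(id)
run_id_rle <- rep(seq_along(r$lengths), r$lengths)
identical(run_id, run_id_rle)
# [1] TRUE
```

Note that this kind of grouping assumes the rows of one session are contiguous. If events from different ids were interleaved, a run-based id would split a single session into several groups.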
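Here is one approach. I was not sure whether your time column is a date-time object or a character vector. Here, I build mydf with time as POSIXct, convert it to character before spreading, and convert it back afterwards: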
```r
library(dplyr)
library(tidyr)

mydf <- data.frame(id = c(1, 1, 1, 2, 2, 1, 1),
                   event = c("start", "valid", "end", "start", "bad", "start", "bad"),
                   time = as.POSIXct(c("2015-05-16 22:46:53", "2015-05-16 22:46:56",
                                       "2015-05-16 22:46:59", "2015-05-16 22:46:53",
                                       "2015-05-16 22:47:00", "2015-05-16 22:49:05",
                                       "2015-05-16 22:49:09"),
                                     format = "%Y-%m-%d %H:%M:%S"),
                   stringsAsFactors = FALSE)

# group increments every time id changes between consecutive rows
mutate(mydf, time = as.character(time),
       group = cumsum(c(TRUE, diff(id) != 0))) %>%
    spread(event, time) %>%
    arrange(group) %>%
    select(id, starttime = start, validtime = valid, badtime = bad, endtime = end) %>%
    mutate_each(funs(as.POSIXct(., format = "%Y-%m-%d %H:%M:%S")), starttime:endtime)

#  id           starttime           validtime             badtime             endtime
#1  1 2015-05-16 22:46:53 2015-05-16 22:46:56                <NA> 2015-05-16 22:46:59
#2  2 2015-05-16 22:46:53                <NA> 2015-05-16 22:47:00                <NA>
#3  1 2015-05-16 22:49:05                <NA> 2015-05-16 22:49:09                <NA>
```
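In current releases of the tidyverse, `spread()` has been superseded by `pivot_wider()`, and `mutate_each()` / `funs()` are deprecated. The following is a sketch of the same idea with the newer functions, not the answerer's original code. It assumes tidyr 1.0.0 or later, reads the timestamps as UTC (the time zone is my assumption, the question does not state one), and stores the result in `sessions`, a name chosen for this example. Because `pivot_wider()` keeps the POSIXct type, the round trip through character should not be needed.

```r
library(dplyr)
library(tidyr)

# Same example data as above; tz = "UTC" is an assumption made for reproducibility
mydf <- data.frame(
  id = c(1, 1, 1, 2, 2, 1, 1),
  event = c("start", "valid", "end", "start", "bad", "start", "bad"),
  time = as.POSIXct(c("2015-05-16 22:46:53", "2015-05-16 22:46:56",
                      "2015-05-16 22:46:59", "2015-05-16 22:46:53",
                      "2015-05-16 22:47:00", "2015-05-16 22:49:05",
                      "2015-05-16 22:49:09"), tz = "UTC")
)

sessions <- mydf %>%
  mutate(
    # New group whenever id changes between consecutive rows
    group = cumsum(c(TRUE, diff(id) != 0)),
    # Build the target column names up front: "start" -> "starttime", etc.
    event = paste0(event, "time")
  ) %>%
  # id and group together identify a session; each event becomes a column
  pivot_wider(names_from = event, values_from = time) %>%
  arrange(group) %>%
  select(id, starttime, validtime, badtime, endtime)

sessions
```

Events that never occur in a session (for example `validtime` for id 2) come out as `NA`, matching the desired output in the question.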