Find patient IDs that appear at least twice
    df <- data.frame(PATIENT_ID = c(1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 5, 5))

I want to find the patient IDs that have been recorded at least 2 times.
The output should be:

    df_output <- data.frame(PATIENT_ID = c(1, 3, 5))

Thanks.
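Another dplyr option:

    library(dplyr)
    # Keep only the first row of each group that has more than one row.
    df %>%
      group_by(PATIENT_ID) %>%
      filter(n() > 1 & row_number() == 1)

and with data.table:

    library(data.table)
    DT <- as.data.table(df)
    # Count rows per ID, keep counts above 1, then drop the count column.
    DT[, .(n = .N), by = PATIENT_ID][n > 1, ][, n := NULL][]
    #    PATIENT_ID
    # 1:          1
    # 2:          3
    # 3:          5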
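If your data is larger than the sample, these benchmarks will change, but probably in roughly the same proportions: user31264's answer is almost certainly the fastest, while the more complex dplyr and data.table pipelines are slower.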
    microbenchmark::microbenchmark(
      user = {
        a = rle(df$PATIENT_ID)
        data.frame(PATIENT_ID=a$values[a$lengths>1])
      },
      user_sort = {
        a = rle(sort(df$PATIENT_ID))
        data.frame(PATIENT_ID=a$values[a$lengths>1])
      },
      r2a = df %>%
        group_by(PATIENT_ID) %>%
        filter(n() > 1 & row_number() == 1) %>%
        ungroup(),
      r2b = DT[, .(n=.N),by=PATIENT_ID][n>1,][,n:=NULL],
      csg = df %>%
        group_by(PATIENT_ID) %>%
        summarize(n = n()) %>%
        filter(n >= 2) %>%
        select(PATIENT_ID),
      duck = df %>%
        group_by(PATIENT_ID) %>%
        mutate(N=n()) %>%
        filter(N>=2) %>%
        select(-N) %>%
        filter(!duplicated(PATIENT_ID))
    )
    # Unit: microseconds
    #       expr    min      lq     mean  median      uq     max neval
    #       user  116.2  138.55  168.536  167.30  180.30   366.2   100
    #  user_sort  160.1  184.55  238.249  224.60  255.60   464.3   100
    #        r2a 3018.4 3399.60 4020.076 3839.70 4202.95 12193.5   100
    #        r2b 2094.6 2945.30 3367.188 3277.80 3838.35  5183.8   100
    #        csg 5382.5 6262.20 6708.582 6670.90 6992.80  9078.2   100
    #       duck 7538.3 8568.55 9275.720 8928.65 9420.20 16678.5   100
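To see how the timings behave on something bigger than the 12-row sample, one option is to generate a larger data frame and time the sorted rle() approach on it. This is only a rough sketch: the seed, `n_rows`, `n_patients` and the helper name `rle_sorted` are made up for illustration, and it uses base R's system.time() rather than microbenchmark so that it runs without extra packages.

```r
set.seed(1)          # arbitrary seed, only for reproducibility
n_rows     <- 1e5    # hypothetical number of records
n_patients <- 2e4    # hypothetical number of distinct IDs

# Random IDs in arbitrary order, so sort() is needed before rle().
big <- data.frame(PATIENT_ID = sample(n_patients, n_rows, replace = TRUE))

# Hypothetical helper wrapping the sorted rle() approach from the benchmark.
rle_sorted <- function(d) {
  a <- rle(sort(d$PATIENT_ID))
  data.frame(PATIENT_ID = a$values[a$lengths > 1])
}

res <- rle_sorted(big)

# Sanity check against duplicated(): same set of repeated IDs.
repeated <- unique(big$PATIENT_ID[duplicated(big$PATIENT_ID)])
stopifnot(setequal(res$PATIENT_ID, repeated))

# Timings depend on the machine, so no expected numbers are shown here.
print(system.time(rle_sorted(big)))
```

The other candidates can be wrapped and timed the same way once dplyr and data.table are loaded.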
Another dplyr solution, using summarize() to count the rows per ID:

    library(dplyr)
    df %>%
      group_by(PATIENT_ID) %>%
      summarize(n = n()) %>%
      filter(n >= 2) %>%
      select(PATIENT_ID)
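    # Run-length encode the IDs, then keep values whose run is longer than 1.
    a = rle(df$PATIENT_ID)
    df_output = data.frame(PATIENT_ID = a$values[a$lengths > 1])

If df is not sorted, the first line should be

    a = rle(sort(df$PATIENT_ID))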
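The reason sorting matters is that rle() only merges adjacent equal values. A small sketch of the difference, using a hand-shuffled copy of the sample IDs (the vector `ids` is an assumption for illustration, not part of the question):

```r
# Hypothetical unsorted copy of the sample IDs (same values, different order).
ids <- c(5, 1, 3, 5, 2, 3, 1, 5, 4, 3, 5, 5)

# Without sorting, only the trailing "5, 5" forms a run longer than 1.
runs_unsorted <- rle(ids)
wrong <- runs_unsorted$values[runs_unsorted$lengths > 1]

# After sorting, equal IDs are adjacent, so run length equals the count.
runs_sorted <- rle(sort(ids))
right <- runs_sorted$values[runs_sorted$lengths > 1]

print(wrong)  # [1] 5
print(right)  # [1] 1 3 5
```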
Try this:

    library(dplyr)
    df %>%
      group_by(PATIENT_ID) %>%
      mutate(N = n()) %>%
      filter(N >= 2) %>%
      select(-N) %>%
      filter(!duplicated(PATIENT_ID))
    # A tibble: 3 x 1
    # Groups:   PATIENT_ID [3]
    #   PATIENT_ID
    #        <dbl>
    # 1          1
    # 2          3
    # 3          5
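For a quick cross-check of the answers above without any packages, the same IDs can be derived in base R with table() and with duplicated(). This is just a sketch for verification; the names `counts`, `ids`, `ids_dup` and `df_check` are made up here.

```r
df <- data.frame(PATIENT_ID = c(1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 5, 5))

# Count rows per ID; table() does not require sorted input.
counts <- table(df$PATIENT_ID)

# names(counts) are character strings, so convert back to numeric IDs.
ids <- as.numeric(names(counts)[as.vector(counts) >= 2])
df_check <- data.frame(PATIENT_ID = ids)

# Same result via duplicated(): any ID seen a second time is repeated.
ids_dup <- sort(unique(df$PATIENT_ID[duplicated(df$PATIENT_ID)]))

stopifnot(identical(ids, ids_dup))
print(df_check)
```

Both routes give the IDs 1, 3 and 5, matching the expected df_output from the question.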