Remove rows with all or some NAs (missing values) in data.frame
I'd like to remove the rows in this data frame that:

a) contain NAs in all columns:

                 gene hsap mmul mmus rnor cfam
    1 ENSG00000208234    0   NA   NA   NA   NA
    2 ENSG00000199674    0    2    2    2    2
    3 ENSG00000221622    0   NA   NA   NA   NA
    4 ENSG00000207604    0   NA   NA    1    2
    5 ENSG00000207431    0   NA   NA   NA   NA
    6 ENSG00000221312    0    1    2    3    2

Basically, I'd like to get a data frame like the following.

                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    6 ENSG00000221312    0    1    2    3    2

b) contain NAs in only some columns:

                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    4 ENSG00000207604    0   NA   NA    1    2
    6 ENSG00000221312    0    1    2    3    2
Also check complete.cases:

    > final[complete.cases(final), ]
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    6 ENSG00000221312    0    1    2    3    2

To check only some columns, pass just those columns to complete.cases:

    > final[complete.cases(final[ , 5:6]),]
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    4 ENSG00000207604    0   NA   NA    1    2
    6 ENSG00000221312    0    1    2    3    2

Your solution can't work. If you insist on using is.na, you have to do something like:

    > final[rowSums(is.na(final[ , 5:6])) == 0, ]
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    4 ENSG00000207604    0   NA   NA    1    2
    6 ENSG00000221312    0    1    2    3    2

but using complete.cases, as above, is more direct.
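To try both forms side by side, here is a minimal sketch that rebuilds the question's example data (values copied from the table in the question) and selects the columns by name instead of by position. The variable names all_complete, some_complete and same_rows are only illustrative.

```r
# Rebuild the example data from the question.
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)

# Rows with no NA in any column.
all_complete <- final[complete.cases(final), ]

# Rows with no NA in rnor and cfam, selected by column name.
cols <- c("rnor", "cfam")
some_complete <- final[complete.cases(final[, cols]), ]

# The is.na()/rowSums() form, for comparison.
same_rows <- final[rowSums(is.na(final[, cols])) == 0, ]

print(rownames(all_complete))
print(rownames(some_complete))
print(identical(rownames(some_complete), rownames(same_rows)))
```

Selecting by name keeps the code working if the column order ever changes, which positional indices such as 5:6 do not.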
Try drop_na() from tidyr:

    library(tidyr)

    df %>% drop_na()
    #              gene hsap mmul mmus rnor cfam
    # 2 ENSG00000199674    0    2    2    2    2
    # 6 ENSG00000221312    0    1    2    3    2

    df %>% drop_na(rnor, cfam)
    #              gene hsap mmul mmus rnor cfam
    # 2 ENSG00000199674    0    2    2    2    2
    # 4 ENSG00000207604    0   NA   NA    1    2
    # 6 ENSG00000221312    0    1    2    3    2
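I prefer the following way to check whether rows contain any NAs:

    row.has.na <- apply(final, 1, function(x){any(is.na(x))})

This returns a logical vector whose values indicate whether a row contains any NA. You can use it to see how many rows you will have to drop:

    sum(row.has.na)

and eventually drop them:

    final.filtered <- final[!row.has.na,]

Filtering rows that have NAs only in certain columns is a little trickier (for example, you could pass final[,5:6] to apply).
In general, Joris Meys' solution seems more elegant.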
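The column-restricted variant mentioned above could look roughly like this. It is a sketch only: it rebuilds the question's data, names the columns explicitly, and uses base R's anyNA in place of the anonymous function.

```r
# Rebuild the example data from the question.
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)

# Only look at rnor and cfam; drop = FALSE keeps a data frame
# even if a single column is selected.
checked <- final[, c("rnor", "cfam"), drop = FALSE]
row.has.na <- apply(checked, 1, anyNA)

# Number of rows that will be dropped.
print(sum(row.has.na))

# Keep the rows without NA in the checked columns.
final.filtered <- final[!row.has.na, ]
print(final.filtered)
```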
Another option, if you want greater control over which rows are deemed invalid, is:

    final <- final[!(is.na(final$rnor)) | !(is.na(final$cfam)),]

Applied to this data:

                 gene hsap mmul mmus rnor cfam
    1 ENSG00000208234    0   NA   NA   NA    2
    2 ENSG00000199674    0    2    2    2    2
    3 ENSG00000221622    0   NA   NA    2   NA
    4 ENSG00000207604    0   NA   NA    1    2
    5 ENSG00000207431    0   NA   NA   NA   NA
    6 ENSG00000221312    0    1    2    3    2

the result is:

                 gene hsap mmul mmus rnor cfam
    1 ENSG00000208234    0   NA   NA   NA    2
    2 ENSG00000199674    0    2    2    2    2
    3 ENSG00000221622    0   NA   NA    2   NA
    4 ENSG00000207604    0   NA   NA    1    2
    6 ENSG00000221312    0    1    2    3    2

...only row 5 is removed, because it is the only row with NA in both rnor and cfam.
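The choice between | and & is what decides which rows survive. A small sketch on the rnor and cfam values from the table above (the other columns are left out for brevity; keep_either and keep_both are illustrative names):

```r
# rnor and cfam copied from the second example table.
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  rnor = c(NA, 2, 2, 1, NA, 3),
  cfam = c(2, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)

# Keep a row if at least one of the two columns has a value.
keep_either <- !is.na(final$rnor) | !is.na(final$cfam)

# Keep a row only if both columns have a value.
keep_both <- !is.na(final$rnor) & !is.na(final$cfam)

print(rownames(final[keep_either, ]))
print(rownames(final[keep_both, ]))
```

With |, a row is dropped only when every tested column is NA; with &, a single NA in the tested columns is enough to drop it.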
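If you want to control how many NAs are acceptable per row, try this function. In many survey data sets, too many blank responses can ruin the results, so rows are dropped once they pass a certain threshold. This function lets you choose how many NAs a row may have before it is deleted:

    delete.na <- function(DF, n=0) {
      DF[rowSums(is.na(DF)) <= n,]
    }

By default, it removes every row that contains an NA:

    delete.na(final)
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    6 ENSG00000221312    0    1    2    3    2

Or specify the maximum number of NAs allowed:

    delete.na(final, 2)
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    4 ENSG00000207604    0   NA   NA    1    2
    6 ENSG00000221312    0    1    2    3    2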
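To pick a sensible n, it can help to look at how the NA counts are distributed first. A minimal sketch, assuming the question's data is stored in final; delete.na is copied from this answer and na_per_row is an illustrative name.

```r
# Rebuild the example data from the question.
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)

# Function from the answer: keep rows with at most n NAs.
delete.na <- function(DF, n = 0) {
  DF[rowSums(is.na(DF)) <= n, ]
}

# Number of NAs in each row, and how often each count occurs.
na_per_row <- rowSums(is.na(final))
print(table(na_per_row))

# Rows kept for a few candidate thresholds.
for (n in c(0, 2, 4)) {
  cat("n =", n, "keeps", nrow(delete.na(final, n)), "rows\n")
}
```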
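If performance is a priority, use data.table and na.omit() with its optional cols= argument.

If you don't want to use data.table, use complete.cases().

On a vanilla data.frame, note that na.omit() does not support restricting the check to a subset of columns.

Benchmark result

The benchmark compares base (blue), tidyr (salmon) and data.table (beige) methods for dropping rows with a missing value in any column or in selected columns, on a fake dataset of 1 million rows and 20 numeric columns in which each value is independently missing with probability 5%.

Your results may vary with the length, width and sparsity of your particular dataset.

Note the log scale on the y axis.

Benchmark script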
    #-------  Adjust these assumptions for your own use case  ------------
    row_size   <- 1e6L
    col_size   <- 20    # not including ID column
    p_missing  <- 0.05  # likelihood of missing observation (except ID col)
    col_subset <- 18:21 # second part of question: filter on select columns

    #-------  System info for benchmark  ----------------------------------
    R.version # R version 3.4.3 (2017-11-30), platform = x86_64-w64-mingw32
    library(data.table); packageVersion('data.table') # 1.10.4.3
    library(dplyr);      packageVersion('dplyr')      # 0.7.4
    library(tidyr);      packageVersion('tidyr')      # 0.8.0
    library(microbenchmark)

    #-------  Example dataset using above assumptions  --------------------
    fakeData <- function(m, n, p){
      set.seed(123)
      m <- matrix(runif(m*n), nrow=m, ncol=n)
      m[m<p] <- NA
      return(m)
    }
    df <- cbind(
      data.frame(id = paste0('ID',seq(row_size)), stringsAsFactors = FALSE),
      data.frame(fakeData(row_size, col_size, p_missing) )
    )
    dt <- data.table(df)

    par(las=3, mfcol=c(1,2), mar=c(22,4,1,1)+0.1)
    boxplot(
      microbenchmark(
        df[complete.cases(df), ],
        na.omit(df),
        df %>% drop_na,
        dt[complete.cases(dt), ],
        na.omit(dt)
      ), xlab='',
      main = 'Performance: Drop any NA observation',
      col=c(rep('lightblue',2),'salmon',rep('beige',2))
    )
    boxplot(
      microbenchmark(
        df[complete.cases(df[,col_subset]), ],
        #na.omit(df), # col subset not supported in na.omit.data.frame
        df %>% drop_na(col_subset),
        dt[complete.cases(dt[,col_subset,with=FALSE]), ],
        na.omit(dt, cols=col_subset) # see ?na.omit.data.table
      ), xlab='',
      main = 'Performance: Drop NA obs. in select cols',
      col=c('lightblue','salmon',rep('beige',2))
    )
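Before timing alternatives, it is worth confirming that they select the same rows. A minimal base-R sketch on a much smaller fake dataset; the 20 rows and 10 columns are arbitrary choices for illustration, not the benchmark's settings, and small_df is a made-up name.

```r
# Small fake dataset built the same way as in the benchmark script.
set.seed(123)
m <- matrix(runif(20 * 10), nrow = 20, ncol = 10)
m[m < 0.05] <- NA
small_df <- data.frame(id = paste0("ID", seq_len(20)), m,
                       stringsAsFactors = FALSE)

# Two base-R ways of dropping rows with any NA.
by_complete_cases <- small_df[complete.cases(small_df), ]
by_na_omit <- na.omit(small_df)

# na.omit() may attach an "na.action" attribute, so compare the
# retained row names rather than the whole objects.
same_rows <- identical(rownames(by_complete_cases), rownames(by_na_omit))
print(same_rows)
```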
Using the dplyr package, we can filter out NAs as follows:

    dplyr::filter(df, !is.na(columnname))
This returns the rows that have at least one non-NA value.

    final[rowSums(is.na(final))<length(final),]

This returns the rows that have at least two non-NA values.

    final[rowSums(is.na(final))<(length(final)-1),]
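The two expressions above generalize to "at least k non-NA values". A minimal sketch with a hypothetical helper, keep_min_non_na (not from any package), applied to the question's data:

```r
# Rebuild the example data from the question.
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows with at least k non-NA values.
keep_min_non_na <- function(df, k = 1) {
  df[rowSums(!is.na(df)) >= k, , drop = FALSE]
}

# gene and hsap are never NA here, so every row has at least 2 values.
print(nrow(keep_min_non_na(final, 2)))

# Asking for 3 or more values drops the rows that only have gene and hsap.
print(rownames(keep_min_non_na(final, 3)))
```

Counting non-NA values directly avoids having to subtract from the number of columns.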
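For your first question, I have code I am comfortable with for getting rid of all NAs. Thanks to @Gregor for making it simpler.

    final[!(rowSums(is.na(final))),]

For the second question, the code is just a variation on the previous solution.

    final[as.logical((rowSums(is.na(final))-5)),]

Note that 5 stands for the number of columns in your data (ncol(final) in general). This drops the rows that are all NA: their rowSums equals the number of columns, so the value becomes zero after the subtraction and as.logical turns it into FALSE. This time, as.logical is necessary.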
We can also use the subset function for this.

    finalData<-subset(data,!(is.na(data["mmul"]) | is.na(data["rnor"])))

This gives only those rows that have no NA in either mmul or rnor.
I am a synthesizer :). Here I have combined the answers into one function:

    #' keep rows that have a certain number (range) of NAs anywhere/somewhere and delete others
    #' @param df a data frame
    #' @param col restrict to the columns where you would like to search for NA; eg, 3, c(3), 2:5, "place", c("place","age")
    #' \cr default is NULL, search for all columns
    #' @param n integer or vector, 0, c(3,5), number/range of NAs allowed.
    #' \cr If a number, the exact number of NAs kept
    #' \cr Range includes both ends 3<=n<=5
    #' \cr Range could be -Inf, Inf
    #' @return returns a new df with rows that have NA(s) removed
    #' @export
    ez.na.keep = function(df, col=NULL, n=0){
        if (!is.null(col)) {
            # R converts a single row/col to a vector if the parameter col has only one col
            # see https://radfordneal.wordpress.com/2008/08/20/design-flaws-in-r-2-%E2%80%94-dropped-dimensions/#comments
            df.temp = df[,col,drop=FALSE]
        } else {
            df.temp = df
        }

        if (length(n)==1){
            if (n==0) {
                # simply call complete.cases which might be faster
                result = df[complete.cases(df.temp),]
            } else {
                # credit: http://stackoverflow.com/a/30461945/2292993
                log <- apply(df.temp, 2, is.na)
                logindex <- apply(log, 1, function(x) sum(x) == n)
                result = df[logindex, ]
            }
        }

        if (length(n)==2){
            min = n[1]; max = n[2]
            log <- apply(df.temp, 2, is.na)
            logindex <- apply(log, 1, function(x) {sum(x) >= min && sum(x) <= max})
            result = df[logindex, ]
        }

        return(result)
    }
Assuming dat is your data frame, the expected output can be obtained with:

1. rowSums:

    > dat[!rowSums((is.na(dat))),]
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    6 ENSG00000221312    0    1    2    3    2

2. Reduce with lapply:

    > dat[!Reduce('|',lapply(dat,is.na)),]
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    6 ENSG00000221312    0    1    2    3    2
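The Reduce form is easier to follow when split into steps. A sketch on the question's data, with illustrative intermediate names (na_by_col, any_na):

```r
# Rebuild the example data from the question.
dat <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)

# Step 1: one logical vector per column, TRUE where the value is NA.
na_by_col <- lapply(dat, is.na)

# Step 2: combine the vectors with element-wise OR, so a row is TRUE
# if any of its columns is NA.
any_na <- Reduce(`|`, na_by_col)
print(any_na)

# Step 3: keep the rows where no column is NA.
print(dat[!any_na, ])

# The rowSums form flags the same rows.
print(identical(any_na, unname(rowSums(is.na(dat)) > 0)))
```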
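    delete.dirt <- function(DF, dart=c('NA')) {
      # TRUE for rows in which no cell equals any of the values in dart
      clean_rows <- apply(DF, 1, function(r) !any(r %in% dart))
      DF[clean_rows, ]
    }

    mydata <- delete.dirt(mydata)

The function above deletes every row of the data frame that has the string "NA" in any column and returns the resulting data. Note that %in% here matches the text "NA" only; a real missing value (NA) is not matched. If you want to check for several values, list them all in dart.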
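If the data mixes placeholder strings with real missing values, one option is to convert the placeholders to NA first and then use complete.cases. A minimal sketch on made-up data; the "?" placeholder and the names mydata, placeholders and clean are assumptions for illustration.

```r
# Made-up data: "NA" and "?" are text placeholders, the last row has a real NA.
mydata <- data.frame(
  id    = c("a", "b", "c", "d"),
  value = c("1", "NA", "?", NA),
  stringsAsFactors = FALSE
)

# A real NA is not matched by the string "NA".
print(NA %in% "NA")

# Replace the placeholder strings with real missing values, column by column.
placeholders <- c("NA", "?")
mydata[] <- lapply(mydata, function(col) {
  col[col %in% placeholders] <- NA
  col
})

# Now the usual tools see every kind of missing entry.
clean <- mydata[complete.cases(mydata), ]
print(clean)
```

When the data comes from a file, the na.strings argument of read.csv can do the same conversion at import time.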
My guess is that this can be solved more elegantly this way (the call below returns the rows that contain at least one NA):

    m <- matrix(1:25, ncol = 5)
    m[c(1, 6, 13, 25)] <- NA
    df <- data.frame(m)
    library(dplyr)
    df %>%
      filter_all(any_vars(is.na(.)))
    #>   X1 X2 X3 X4 X5
    #> 1 NA NA 11 16 21
    #> 2  3  8 NA 18 23
    #> 3  5 10 15 20 NA
One approach that is both general and yields fairly readable code is to use the filter variants in the dplyr package (filter_at, filter_all, filter_if):

    library(dplyr)

    vars_to_check <- c("rnor","cfam")

    # Filter a specific list of columns to keep only non-missing entries
    df %>%
      filter_at(.vars = vars(one_of(vars_to_check)),
                ~ !is.na(.))

    # Filter all the columns to exclude NA
    df %>%
      filter_all(~ !is.na(.))

    # Filter only numeric columns
    df %>%
      filter_if(is.numeric,
                ~ !is.na(.))