R - Discrepancy in summary(data) and summary(data$variable)
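I have a dataset with 61 observations and 2 variables. When I summarize the whole dataset, the quartiles, median, mean, and maximum of the second variable sometimes differ from what I get when I summarize the second variable on its own. Why is that?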
```r
data <- read.csv("testdata.csv")

head(data)
#   Group.1     x
# 1 10/1/12     0
# 2 10/2/12   126
# 3 10/3/12 11352
# 4 10/4/12 12116
# 5 10/5/12 13294
# 6 10/6/12 15420

summary(data)
#      Group.1         x
#  10/1/12 : 1   Min.   :    0
#  10/10/12: 1   1st Qu.: 6778
#  10/11/12: 1   Median :10395
#  10/12/12: 1   Mean   : 9354
#  10/13/12: 1   3rd Qu.:12811
#  10/14/12: 1   Max.   :21194
#  (Other) :55

summary(data[2])
#        x
#  Min.   :    0
#  1st Qu.: 6778
#  Median :10395
#  Mean   : 9354
#  3rd Qu.:12811
#  Max.   :21194

# The following code yields a different result:
summary(data$x)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#       0    6778   10400    9354   12810   21190
```
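@r2evans' comment is correct: the discrepancy is caused by how the two methods, `summary.default` and `summary.data.frame`, use the `digits` argument. From `?summary`:

> `digits`: integer, used for number formatting with `signif()` (for `summary.default`) or `format()` (for `summary.data.frame`).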
Suppose we compute the following vector from `data$x` in the question:

```r
q <- append(quantile(data$x), mean(data$x), after = 3L)
q
##       0%      25%      50%               75%     100%
##     0.00  6778.00 10395.00  9354.23 12811.00 21194.00
```
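In `summary.default`, these values are passed through `signif()`:

```r
signif(q, digits = 4L)
##    0%   25%   50%         75%  100%
##     0  6778 10400  9354 12810 21190
```

whereas in `summary.data.frame` they are passed through `format()`:

```r
format(q, digits = 4L)
##      0%     25%     50%             75%    100%
## "    0" " 6778" "10395" " 9354" "12811" "21194"
```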
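Since `testdata.csv` is not available to readers, the comparison can be reproduced from the printed values alone. This is a minimal sketch, assuming the six numbers shown for `q` above; the `"mean"` name and the variables `rounded`, `formatted`, and `differs` are illustrative and not part of the original answer.

```r
# Values copied from the printed output of `q`; the "mean" name is a label
# added here for readability (the original element is unnamed).
q <- c("0%" = 0, "25%" = 6778, "50%" = 10395, "mean" = 9354.23,
       "75%" = 12811, "100%" = 21194)

# signif() rounds every value to 4 significant digits.
rounded <- signif(q, digits = 4L)

# format() treats `digits` as a minimum: it only decides how many decimal
# places to show and never rounds away integer digits.
formatted <- format(q, digits = 4L)

print(rounded)
print(formatted)

# Names of the entries where the two results disagree numerically.
differs <- rounded != as.numeric(formatted)
print(names(q)[differs])
```

Only the five-digit values change under `signif()`, which is why the minimum, first quartile, and mean agree between the two summaries while the median, third quartile, and maximum do not.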
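Hence, with the default `digits` of 4, `summary.default` rounds every statistic to four significant digits, while `summary.data.frame` keeps all integer digits.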
If you explicitly supply a `digits` value greater than 4, the two methods give the same result for these data:

```r
summary(data[2], digits = 5L)
##        x
##  Min.   :    0.0
##  1st Qu.: 6778.0
##  Median :10395.0
##  Mean   : 9354.2
##  3rd Qu.:12811.0
##  Max.   :21194.0

summary(data$x, digits = 5L)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.0  6778.0 10395.0  9354.2 12811.0 21194.0
```
As an example where the two methods differ more visibly with the default `digits`:

```r
df <- data.frame(a = 1e5 + 0:100)

summary(df$a)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  100000  100000  100000  100000  100100  100100

summary(df)
##        a
##  Min.   :100000
##  1st Qu.:100025
##  Median :100050
##  Mean   :100050
##  3rd Qu.:100075
##  Max.   :100100
```
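The gap in this last example can be traced to the same two calls. The sketch below uses only base R on the quantiles of `1e5 + 0:100`; the names `a` and `qa` are illustrative. It calls `signif()` and `format()` directly rather than `summary()`, so it shows the rounding step in isolation.

```r
# Quartiles of the same values stored in df$a.
a <- 1e5 + 0:100
qa <- quantile(a)

# Rounding to 4 significant digits collapses values that differ only in
# the fifth and sixth digits.
print(signif(qa, digits = 4L))

# format() keeps every integer digit, so the quartiles stay distinct.
print(format(qa, digits = 4L))
```

With six-digit values and only four significant digits retained, `signif()` can move a quartile by up to 50, which is enough to make the first quartile print the same as the minimum.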