Why does X[Y] join of data.tables not allow a full outer join, or a left join?
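This is a bit of a philosophical question about data.table join syntax. I am finding more and more uses for data.tables, but I am still learning…
The join formats for data.tables:
X[Y, nomatch = NA] -- all rows in Y -- right outer join (default)
X[Y, nomatch = 0] -- only rows with matches in both X and Y -- inner join
merge(X, Y, all = TRUE) -- all rows from both X and Y -- full outer join
merge(X, Y, all.x = TRUE) -- all rows in X -- left outer join
It seems to me that if …
For me, …
Here are code examples of the 4 join types: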
    # sample X and Y data.tables
    library(data.table)
    X <- data.table(t = 1:4, a = (1:4)^2)
    setkey(X, t)
    X
    #    t  a
    # 1: 1  1
    # 2: 2  4
    # 3: 3  9
    # 4: 4 16

    Y <- data.table(t = 3:6, b = (3:6)^2)
    setkey(Y, t)
    Y
    #    t  b
    # 1: 3  9
    # 2: 4 16
    # 3: 5 25
    # 4: 6 36

    # all rows from Y - right outer join
    X[Y]  # default
    #    t  a  b
    # 1: 3  9  9
    # 2: 4 16 16
    # 3: 5 NA 25
    # 4: 6 NA 36

    X[Y, nomatch = NA]  # same as above
    #    t  a  b
    # 1: 3  9  9
    # 2: 4 16 16
    # 3: 5 NA 25
    # 4: 6 NA 36

    merge(X, Y, by = "t", all.y = TRUE)  # same as above
    #    t  a  b
    # 1: 3  9  9
    # 2: 4 16 16
    # 3: 5 NA 25
    # 4: 6 NA 36

    identical(X[Y], merge(X, Y, by = "t", all.y = TRUE))
    # [1] TRUE

    # only rows in both X and Y - inner join
    X[Y, nomatch = 0]
    #    t  a  b
    # 1: 3  9  9
    # 2: 4 16 16

    merge(X, Y, by = "t")  # same as above
    #    t  a  b
    # 1: 3  9  9
    # 2: 4 16 16

    merge(X, Y, by = "t", all = FALSE)  # same as above
    #    t  a  b
    # 1: 3  9  9
    # 2: 4 16 16

    identical(X[Y, nomatch = 0], merge(X, Y, by = "t", all = FALSE))
    # [1] TRUE

    # all rows from X - left outer join
    merge(X, Y, by = "t", all.x = TRUE)
    #    t  a  b
    # 1: 1  1 NA
    # 2: 2  4 NA
    # 3: 3  9  9
    # 4: 4 16 16

    # all rows from both X and Y - full outer join
    merge(X, Y, by = "t", all = TRUE)
    #    t  a  b
    # 1: 1  1 NA
    # 2: 2  4 NA
    # 3: 3  9  9
    # 4: 4 16 16
    # 5: 5 NA 25
    # 6: 6 NA 36
Update: data.table v1.9.6 introduced …
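As a point of reference for readers on newer releases: from v1.9.6 onward, `[.data.table` accepts an `on=` argument, so the same joins can be written without calling setkey() first. The sketch below is only an illustration using unkeyed copies of the sample tables from the question; the result names (`right_join`, `inner_join`, `left_join`) are made up for this example.

```r
library(data.table)

# Unkeyed copies of the sample tables; on= names the join column directly
X <- data.table(t = 1:4, a = (1:4)^2)
Y <- data.table(t = 3:6, b = (3:6)^2)

right_join <- X[Y, on = "t"]               # all rows of Y (t = 3..6)
inner_join <- X[Y, on = "t", nomatch = 0]  # only rows present in both (t = 3, 4)
left_join  <- Y[X, on = "t"]               # all rows of X (t = 1..4)
```

The join type still follows from which table sits in `i` and from `nomatch`, exactly as with keyed tables.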
To quote:
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index.
Y[X] is a join, looking up Y's rows using X (or X's key if it has one)
merge(X,Y) does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ, whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same.

BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards? You may suggest merge(X[,ColsNeeded1],Y[,ColsNeeded2]), but that requires the programmer to work out which columns are needed. X[Y,j] in data.table does all that in one step for you. When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses. It will only subset those columns only; the others are ignored. Memory is only created for the columns the j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y,sum(foo*bar)] quicker to program and quicker to run than a merge of everything wastefully followed by a subset?
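To make the point in the quote concrete, here is a small sketch using the question's sample tables, with `a` and `b` standing in for `foo` and `bar`. It assumes a current data.table release, where X[Y, j] evaluates `j` once over the joined rows unless `by = .EACHI` is given; the names `one_step`, `merged`, and `two_step` are just for illustration.

```r
library(data.table)

X <- data.table(t = 1:4, a = (1:4)^2, key = "t")
Y <- data.table(t = 3:6, b = (3:6)^2, key = "t")

# Join and compute in one step: j only refers to columns a and b
one_step <- X[Y, sum(a * b, na.rm = TRUE)]

# The alternative: materialise the full merge first, then compute on it
merged   <- merge(X, Y, by = "t", all.y = TRUE)
two_step <- merged[, sum(a * b, na.rm = TRUE)]

identical(one_step, two_step)
```

Both forms give the same number here; the difference the quote is about is that the first never builds the intermediate merged table.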
If you want a left outer join:

    le <- Y[X]
    mallx <- merge(X, Y, all.x = TRUE)
    # the column order is different so change to be the same as `merge`
    setcolorder(le, names(mallx))
    identical(le, mallx)
    # [1] TRUE

If you want a full outer join:

    # the unique values for the keys over both data sets
    unique_keys <- unique(c(X[, t], Y[, t]))
    Y[X[J(unique_keys)]]
    ##    t  b  a
    ## 1: 1 NA  1
    ## 2: 2 NA  4
    ## 3: 3  9  9
    ## 4: 4 16 16
    ## 5: 5 25 NA
    ## 6: 6 36 NA

    # The following will give the same with the column order X,Y
    X[Y[J(unique_keys)]]
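@mnel's answer is spot on, so do accept that answer. This is just a follow-up, too long for comments.
As mnel says, by swapping …
Adding a fourth seems like a good idea. Suppose we add …
FR 2301: Add merge=TRUE argument for both X[Y] and Y[X] join, like merge() does.
Recent versions have sped up …
FR 2033: Add by.x and by.y to merge.data.table
If there are any others, please keep them coming.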
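For readers wondering what the by.x / by.y form requested in FR 2033 looks like in practice: data.table releases that include this feature accept the same arguments as base merge(). The sketch below is illustrative only; `Y2` is a hypothetical variant of the question's Y whose join column is named `time` instead of `t`.

```r
library(data.table)

X  <- data.table(t = 1:4, a = (1:4)^2)
# Hypothetical table: same data as Y, but the join column has a different name
Y2 <- data.table(time = 3:6, b = (3:6)^2)

# Match X$t against Y2$time
inner <- merge(X, Y2, by.x = "t", by.y = "time")                # rows in both
left  <- merge(X, Y2, by.x = "t", by.y = "time", all.x = TRUE)  # all rows of X
```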
On this part of the question:
why not use the merge syntax for joins rather than the match function's nomatch parameter?
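If you prefer …
Hm. How would by-without-by interact with merge=TRUE? Perhaps we should take this over to datatable-help.
This "answer" is a proposal for discussion: as indicated in my comment, I suggest adding a … to [.data.table() …
… for the various join types …
The default would be …, corresponding to the current default …
The "all", "all.x" and "all.y" string values correspond to …
The "both" and "not.both" strings are my best suggestions at the moment, but someone may have better string suggestions for the inner join and the exclusive join. (I'm not sure whether "exclusive" is the right terminology; please correct me if there is a proper term for an "xor" join.)
Using …
A cross join is handy sometimes, but it may not fit the data.table paradigm.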
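To make the proposal easier to discuss, the string-valued join types can be mocked up outside of [.data.table() with a small wrapper around merge(). Everything here is hypothetical: the function name `join_dt`, its `key_col` and `type` arguments, and the reading of "not.both" as "rows whose key appears in exactly one of the two tables" are assumptions for this sketch, not an existing or planned data.table API.

```r
library(data.table)

# Hypothetical helper: dispatch on a join-type string, single key column only
join_dt <- function(X, Y, key_col,
                    type = c("all.y", "all", "all.x", "both", "not.both")) {
  type <- match.arg(type)  # first value ("all.y") mirrors the X[Y] default
  switch(type,
    all.y    = merge(X, Y, by = key_col, all.y = TRUE),  # right outer join
    all      = merge(X, Y, by = key_col, all = TRUE),    # full outer join
    all.x    = merge(X, Y, by = key_col, all.x = TRUE),  # left outer join
    both     = merge(X, Y, by = key_col),                # inner join
    not.both = {
      # exclusive ("xor") join: full outer join minus the matched keys
      full   <- merge(X, Y, by = key_col, all = TRUE)
      shared <- intersect(X[[key_col]], Y[[key_col]])
      keep   <- !(full[[key_col]] %in% shared)
      full[keep]
    }
  )
}

X <- data.table(t = 1:4, a = (1:4)^2)
Y <- data.table(t = 3:6, b = (3:6)^2)

xor_rows <- join_dt(X, Y, "t", type = "not.both")  # keys 1, 2, 5, 6
```

A wrapper like this loses the main advantage of X[Y, j] (it always materialises the merged table), so it only illustrates the naming scheme, not how the feature would have to be implemented.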