R/data.table: separate columns and count occurrences
I have a large data.table:

    taxpath  N
    Bacteroidetes; Flavobacteriia; Flavobacteriales; Flavobacteriaceae; Formosa; Formosa sp. Hel3_A1_48;  57
    Bacteroidetes; Flavobacteriia; Flavobacteriales; Cryomorphaceae; NA; Cryomorphaceae bacterium BACL29 MAG-121220-bin8;  54
    Proteobacteria; Alphaproteobacteria; Pelagibacterales; Pelagibacteraceae; Candidatus Pelagibacter; NA;  53
    Proteobacteria; Alphaproteobacteria; Pelagibacterales; NA; NA; NA;  41
    Planctomycetes; NA; NA; NA; NA; Planctomycetes bacterium TMED84;  41

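The first column is `taxpath`, a semicolon-separated taxonomic path.
What I want to do is split each taxpath at the semicolons and use the first entry.
I want to count how often each phylum (the first level, i.e. Bacteroidetes, Proteobacteria, or Planctomycetes) occurs. However, each occurrence should be multiplied by the value in column `N`.
So what I expect is more or less this: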

    phylum          Nnew
    Bacteroidetes    111
    Proteobacteria    94
    Planctomycetes    41

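Can you help me with how to split the column and then, I guess, group by phylum while multiplying by column `N`?
(PS: Later I would like to do the same for the other elements of the `taxpath` column, but I think it is easier to put those into separate tables.)
This is tagged data.table, so here is a simple data.table solution.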

    library(data.table)
    DT[, .(Nnew = sum(N)), by = sub(";.*","", taxpath)]
    #               sub Nnew
    # 1:  Bacteroidetes  111
    # 2: Proteobacteria   94
    # 3: Planctomycetes   41

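We group by the result of `sub(";.*", "", taxpath)`, which removes everything from the first semicolon onward and so leaves only the phylum, and sum `N` within each group.
Data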
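For the follow-up in the question (doing the same for the other ranks), one possible approach is to split `taxpath` into one column per rank with `tstrsplit()` and then group by whichever rank is needed. This is only a minimal sketch: it assumes every `taxpath` has exactly six semicolon-separated entries, the rank names in `ranks` are chosen for illustration, and `DT` is rebuilt here from the rows in the question so the block runs on its own.

```r
library(data.table)

# Rows copied from the question; rebuilt here so the sketch is self-contained
DT <- data.table(
  taxpath = c(
    "Bacteroidetes; Flavobacteriia; Flavobacteriales; Flavobacteriaceae; Formosa; Formosa sp. Hel3_A1_48;",
    "Bacteroidetes; Flavobacteriia; Flavobacteriales; Cryomorphaceae; NA; Cryomorphaceae bacterium BACL29 MAG-121220-bin8;",
    "Proteobacteria; Alphaproteobacteria; Pelagibacterales; Pelagibacteraceae; Candidatus Pelagibacter; NA;",
    "Proteobacteria; Alphaproteobacteria; Pelagibacterales; NA; NA; NA;",
    "Planctomycetes; NA; NA; NA; NA; Planctomycetes bacterium TMED84;"
  ),
  N = c(57L, 54L, 53L, 41L, 41L)
)

# Assumed rank names, one per semicolon-separated entry
ranks <- c("phylum", "class", "order", "family", "genus", "species")

# Split on a semicolon followed by optional whitespace, one new column per rank
DT[, (ranks) := tstrsplit(taxpath, ";\\s*")]

# The literal string "NA" becomes a real missing value
DT[, (ranks) := lapply(.SD, function(x) replace(x, x == "NA", NA)), .SDcols = ranks]

# Sum N per phylum, and per phylum/class combination
by_phylum <- DT[, .(Nnew = sum(N)), by = phylum]
by_class  <- DT[, .(Nnew = sum(N)), by = .(phylum, class)]
```

With the rank columns in place, aggregating at any other level only requires changing the `by` argument, so separate tables per rank are not strictly necessary.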

    DT <- fread("taxpath\t N
    Bacteroidetes; Flavobacteriia; Flavobacteriales; Flavobacteriaceae; Formosa; Formosa sp. Hel3_A1_48;\t 57
    Bacteroidetes; Flavobacteriia; Flavobacteriales; Cryomorphaceae; NA; Cryomorphaceae bacterium BACL29 MAG-121220-bin8;\t 54
    Proteobacteria; Alphaproteobacteria; Pelagibacterales; Pelagibacteraceae; Candidatus Pelagibacter; NA;\t 53
    Proteobacteria; Alphaproteobacteria; Pelagibacterales; NA; NA; NA;\t 41
    Planctomycetes; NA; NA; NA; NA; Planctomycetes bacterium TMED84;\t 41")

We can use `dplyr` and `tidyr`:
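
    library(dplyr)
    library(tidyr)
    # df1 is assumed to hold the same taxpath and N columns as the question's table
    newcols <- c("phylum", "class", "order", "family", "genus", "species")
    df1 %>%
      mutate(taxpath = sub(";$", "", taxpath)) %>%       # drop the trailing semicolon
      separate(taxpath, into = newcols, sep = ";\\s*") %>% # one column per rank
      group_by(phylum) %>%
      summarise(Nnew = sum(N))
    # A tibble: 3 x 2
    #   phylum          Nnew
    #   <chr>          <int>
    # 1 Bacteroidetes    111
    # 2 Planctomycetes    41
    # 3 Proteobacteria    94
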
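If only the phylum is needed, the split into six columns can be skipped. As a sketch of a shorter variant, `dplyr::count()` takes a `wt` argument that sums a column per group instead of counting rows. The `df1` below is a hypothetical stand-in built from the rows in the question, and `phylum_counts` is just an illustrative name.

```r
library(dplyr)

# Hypothetical stand-in for df1, using the rows from the question
df1 <- tibble(
  taxpath = c(
    "Bacteroidetes; Flavobacteriia; Flavobacteriales; Flavobacteriaceae; Formosa; Formosa sp. Hel3_A1_48;",
    "Bacteroidetes; Flavobacteriia; Flavobacteriales; Cryomorphaceae; NA; Cryomorphaceae bacterium BACL29 MAG-121220-bin8;",
    "Proteobacteria; Alphaproteobacteria; Pelagibacterales; Pelagibacteraceae; Candidatus Pelagibacter; NA;",
    "Proteobacteria; Alphaproteobacteria; Pelagibacterales; NA; NA; NA;",
    "Planctomycetes; NA; NA; NA; NA; Planctomycetes bacterium TMED84;"
  ),
  N = c(57L, 54L, 53L, 41L, 41L)
)

phylum_counts <- df1 %>%
  # Keep only the text before the first semicolon
  mutate(phylum = sub(";.*", "", taxpath)) %>%
  # wt = N sums N within each phylum rather than counting rows
  count(phylum, wt = N, name = "Nnew")
```

This gives the same per-phylum totals as the `group_by()`/`summarise()` version, at the cost of not producing the other rank columns.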