You can’t even begin to understand biology, you can’t understand life, unless you understand what it’s all there for, how it arose - and that means evolution.
— Richard Dawkins
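01. Input file formats
The three most common formats for phylogenetic trees are Newick, NEXUS, and Phylip. Of these, Newick and NEXUS trees are recognized by most software. In addition, many evolutionary biology programs, such as BEAST, MrBayes, PAML, and r8s, write their results in formats of their own.
(1). Newick format
```
((t2:0.04,t1:0.34):0.89,(t5:0.37,(t4:0.03,t3:0.67):0.9):0.59);
```
A Newick file always ends with a semicolon (;). Each internal node is represented by a pair of matching parentheses, and the nodes between the parentheses are its descendants; for example, (t2:0.04, t1:0.34) represents the parent node of t2 and t1. Sibling nodes are separated by commas, and tips are represented by their names. A branch length (from parent to child) is given as a real number after the child node, preceded by a colon. Data associated with internal nodes or branches (for example, bootstrap values) may be encoded as node labels, written as plain text or numbers before the colon.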
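The rules above can be checked by parsing the example string in R. This is a minimal sketch that assumes the ape package is installed; `read.tree()`, `Ntip()`, and `Nnode()` are ape functions, while the variable names are only illustrative.

```r
library(ape)

# The Newick string from the example above
nwk_text <- "((t2:0.04,t1:0.34):0.89,(t5:0.37,(t4:0.03,t3:0.67):0.9):0.59);"
nwk_tree <- read.tree(text = nwk_text)

# Tips are stored in the order in which they appear in the string
nwk_tree$tip.label

# Each pair of matching parentheses corresponds to one internal node
Ntip(nwk_tree)    # number of tips
Nnode(nwk_tree)   # number of internal nodes

# One branch length per edge (every node except the root has one)
length(nwk_tree$edge.length)
```

Counting parentheses in the string by hand and comparing the result with `Nnode()` is a quick way to get used to the notation.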
(2). NEXUS format
```
#NEXUS
[R-package APE, Wed Nov 9 11:46:32 2016]

BEGIN TAXA;
	DIMENSIONS NTAX = 5;
	TAXLABELS
		t5
		t4
		t1
		t2
		t3
	;
END;
BEGIN TREES;
	TRANSLATE
		1 t5,
		2 t4,
		3 t1,
		4 t2,
		5 t3
	;
	TREE * UNTITLED = [&R] (1:0.89,((2:0.59,3:0.37):0.34,
	(4:0.03,5:0.67):0.9):0.04);
END;
```
A NEXUS file is organized into blocks, typically three: TAXA (taxon information), DATA (a data matrix or multiple sequence alignment), and TREES (phylogenetic trees in Newick format). The example above contains only the TAXA and TREES blocks.
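To see how a Newick tree ends up inside a NEXUS file, a tree can be written to NEXUS and read back. This is a minimal sketch assuming the ape package; `write.nexus()` and `read.nexus()` are ape functions, and the temporary file and variable names are only for illustration.

```r
library(ape)

tree_in <- read.tree(
  text = "((t2:0.04,t1:0.34):0.89,(t5:0.37,(t4:0.03,t3:0.67):0.9):0.59);"
)

# Write the tree to a temporary NEXUS file
nex_file <- tempfile(fileext = ".nex")
write.nexus(tree_in, file = nex_file)

# Inspect the raw file: the tree in the TREES block is a Newick string
cat(readLines(nex_file), sep = "\n")

# Read the file back; the tips should match those of the input tree
tree_out <- read.nexus(nex_file)
setequal(tree_in$tip.label, tree_out$tip.label)
```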
(3). New Hampshire eXtended format
```
(((ADH2:0.1[&&NHX:S=human], ADH1:0.11[&&NHX:S=human]):0.05[&&NHX:S=primates:D=Y:B=100],
ADHY:0.1[&&NHX:S=nematode],ADHX:0.12[&&NHX:S=insect]):0.1[&&NHX:S=metazoa:D=N],
(ADH4:0.09[&&NHX:S=yeast],ADH3:0.13[&&NHX:S=yeast], ADH2:0.12[&&NHX:S=yeast],
ADH1:0.11[&&NHX:S=yeast]):0.1[&&NHX:S=Fungi])[&&NHX:D=N];
```
(4). Output formats of other software
- BEAST
```
tree TREE1 = [&R] (((11[&length=9.4]:9.38,14[&length=6.4]:6.385096430786298)
[&length=25.7]:25.43,4[&length=9.1]:8.821663252749829)
[&length=3.0]:3.10,(12[&length=0.6]:0.56,
(10[&length=1.6]:1.56,(7[&length=5.2]:5.19,
((((2[&length=3.3]:3.26,(1[&length=1.3]:1.32,
(6[&length=0.8]:0.83,13[&length=0.8]:0.8311577761397366)
[&length=2.4]:2.48917886025146)
[&length=0.9]:0.9416178372674331)
[&length=0.4]:0.49,9[&length=1.7]:1.757288031101215)
[&length=2.4]:2.35,8[&length=2.1]:2.1125745387283246)
[&length=0.2]:0.23,(3[&length=3.3]:3.31,
(15[&length=5.2]:5.27,5[&length=3.2]:3.2710481368304585)
[&length=1.0]:1.0409443024626412)
[&length=1.9]:2.0372962536780435)
[&length=2.8]:2.8446835614595685)
[&length=5.3]:5.367459711197171)
[&length=2.0]:2.0037467863383043)
[&length=4.3]:4.360909907798238)[&length=0.0];
```
BEAST output files carry a variety of evolutionary inferences. A molecular clock analysis, for example, usually reports rate, length, height, and posterior, together with HPD intervals and ranges as uncertainty estimates. Here rate is the evolutionary rate of a lineage, length is the branch length, height is the age of a node measured back from the most recent tip, and posterior is the Bayesian clade credibility value.
- MrBayes
```
tree con_all_compat = [&U] (8[&prob=1.0]:2.94e-1[&length_mean=2.9e-1],10[&prob=1.0]:2.25e-1[&length_mean=2.2e-1],
((((1[&prob=1.0]:1.43e-1[&length_mean=1.4e-1],2[&prob=1.0]:1.92e-1[&length_mean=1.9e-1])
[&prob=1.0]:1.24e-1[&length_mean=1.2e-1],9[&prob=1.0]:2.27e-1[&length_mean=2.2e-1])
[&prob=1.0]:1.72e-1[&length_mean=1.7e-1],12[&prob=1.0]:5.11e-1[&length_mean=5.1e-1])
[&prob=1.0]:1.76e-1[&length_mean=1.7e-1],
(((3[&prob=1.0]:5.46e-2[&length_mean=5.4e-2],
(6[&prob=1.0]:1.03e-2[&length_mean=1.0e-2],7[&prob=1.0]:7.13e-3[&length_mean=7.2e-3])
[&prob=1.0]:6.93e-2[&length_mean=6.9e-2])
[&prob=1.0]:6.03e-2[&length_mean=6.0e-2],
(4[&prob=1.0]:6.27e-2[&length_mean=6.2e-2],5[&prob=1.0]:6.31e-2[&length_mean=6.3e-2])
[&prob=1.0]:6.07e-2[&length_mean=6.0e-2])
[&prob=1.0]:1.80e-1[&length_mean=1.8e-1],11[&prob=1.0]:2.37e-1[&length_mean=2.3e-1])
[&prob=1.0]:4.05e-1[&length_mean=4.0e-1])
[&prob=1.0]:1.16e+000[&length_mean=1.162699558201079e+000])
[&prob=1.0][&length_mean=0];
```
In general, most of the MrBayes annotations are stripped, leaving only prob (the clade posterior probability) and length_mean (the mean branch length), as in the example above. The complete output also includes prob_stddev, prob_range, prob(percent), prob+-sd, length_median, and length_95%_HPD.
- PAML
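PAML (Phylogenetic Analysis by Maximum Likelihood), developed by Prof. Ziheng Yang, is a software package for phylogenetic analysis of DNA or protein sequences; BaseML and CodeML are its two main programs. BaseML estimates tree topology, branch lengths, and substitution parameters under a variety of nucleotide substitution models. CodeML mainly estimates synonymous and nonsynonymous substitution rates and performs likelihood ratio tests of positive selection under codon substitution models. CodeML's output includes an mlc file, which contains the tree topology and the estimated synonymous and nonsynonymous substitution rates.
02. Reading tree files in R
Commonly used R packages for phylogenetic trees include ape, phylobase, and phytools; this article focuses on ggtree. Its companion package treeio can directly parse the output files of BEAST, CodeML, MrBayes, and r8s.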
Table 1.1: Parser functions defined in treeio
Parser function | Description |
---|---|
read.astral | parsing output of ASTRAL |
read.beast | parsing output of BEAST |
read.codeml | parsing output of CodeML (rst and mlc files) |
read.codeml_mlc | parsing mlc file (output of CodeML) |
read.fasta | parsing FASTA format sequence file |
read.hyphy | parsing output of HYPHY |
read.hyphy.seq | parsing ancestral sequences from HYPHY output |
read.iqtree | parsing IQ-Tree newick string, with ability to parse SH-aLRT and UFBoot support values |
read.jplace | parsing jplace file including output of EPA and pplacer |
read.jtree | parsing jtree format |
read.mega | parsing MEGA Nexus output |
read.mega_tabular | parsing MEGA tabular output |
read.mrbayes | parsing output of MrBayes |
read.newick | parsing newick string, with ability to parse node label as support values |
read.nhx | parsing NHX file including output of PHYLDOG and RevBayes |
read.paml_rst | parsing rst file (output of BaseML or CodeML) |
read.phylip | parsing phylip file (phylip alignment + newick string) |
read.phylip.seq | parsing multiple sequence alignment from phylip file |
read.phylip.tree | parsing newick string from phylip file |
read.r8s | parsing output of r8s |
read.raxml | parsing output of RAxML |
```r
library(treeio)  # provides read.beast(), read.mrbayes(), read.paml_rst()
library(ggtree)

file <- system.file("extdata/BEAST", "beast_mcc.tree", package="treeio")
beast <- read.beast(file)
beast
## 'treedata' S4 object that stored information of
##  '/home/ygc/R/library/treeio/extdata/BEAST/beast_mcc.tree'.
##
## ...@ phylo:
## Phylogenetic tree with 15 tips and 14 internal nodes.
##
## Tip labels:
##  A_1995, B_1996, C_1995, D_1987, E_1996, F_1997, ...
##
## Rooted; includes branch lengths.
##
## with the following features available:
##  'height', 'height_0.95_HPD', 'height_median',
##  'height_range', 'length', 'length_0.95_HPD',
##  'length_median', 'length_range', 'posterior', 'rate',
##  'rate_0.95_HPD', 'rate_median', 'rate_range'.
```
```r
file <- system.file("extdata/MrBayes", "Gq_nxs.tre", package="treeio")
read.mrbayes(file)
## 'treedata' S4 object that stored information of
##  '/home/ygc/R/library/treeio/extdata/MrBayes/Gq_nxs.tre'.
##
## ...@ phylo:
## Phylogenetic tree with 12 tips and 10 internal nodes.
##
## Tip labels:
##  B_h, B_s, G_d, G_k, G_q, G_s, ...
##
## Unrooted; includes branch lengths.
##
## with the following features available:
##  'length_0.95HPD', 'length_mean', 'length_median', 'prob',
##  'prob_range', 'prob_stddev', 'prob_percent', 'prob+-sd'.
```
```r
brstfile <- system.file("extdata/PAML_Baseml", "rst", package="treeio")
brst <- read.paml_rst(brstfile)
brst
## 'treedata' S4 object that stored information of
##  '/home/ygc/R/library/treeio/extdata/PAML_Baseml/rst'.
##
## ...@ phylo:
## Phylogenetic tree with 15 tips and 13 internal nodes.
##
## Tip labels:
##  A, B, C, D, E, F, ...
## Node labels:
##  16, 17, 18, 19, 20, 21, ...
##
## Unrooted; includes branch lengths.
##
## with the following features available:
##  'subs', 'AA_subs'.
```
03. Data integration and filtering
3.1 Converting tree data to a data frame with tidytree
All data parsed from or merged into a tree can be converted to a tidy data frame with the tidytree package. tidytree provides tools for manipulating trees with associated data. For example, external data can be linked to a phylogeny, and evolutionary data obtained from different sources can be merged using tidyverse verbs. After manipulation, the tree data can be converted back to a treedata object and exported to a single tree file, analyzed further in R, or visualized with ggtree.
```r
library(ape)
library(tidytree)
library(dplyr)

set.seed(2017)
tree <- rtree(4)
tree
##
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## [1] "t4" "t1" "t3" "t2"
##
## Rooted; includes branch lengths.

x <- as_tibble(tree)
x
## # A tibble: 7 x 4
##   parent  node branch.length label
##    <int> <int>         <dbl> <chr>
## 1      5     1       0.435   t4
## 2      7     2       0.674   t1
## 3      7     3       0.00202 t3
## 4      6     4       0.0251  t2
## 5      5     5      NA       <NA>
## 6      5     6       0.472   <NA>
## 7      6     7       0.274   <NA>

as.phylo(x)
##
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## [1] "t4" "t1" "t3" "t2"
##
## Rooted; includes branch lengths.
```
Linking the tree to species trait data with dplyr
```r
d <- tibble(label = paste0('t', 1:4),
            trait = rnorm(4))

y <- full_join(x, d, by = 'label')  # join by species (tip) label
y
```
```r
## # A tibble: 7 x 5
##   parent  node branch.length label  trait
##    <int> <int>         <dbl> <chr>  <dbl>
## 1      5     1       0.435   t4     0.943
## 2      7     2       0.674   t1    -0.171
## 3      7     3       0.00202 t3     0.570
## 4      6     4       0.0251  t2    -0.283
## 5      5     5      NA       <NA>  NA
## 6      5     6       0.472   <NA>  NA
## 7      6     7       0.274   <NA>  NA
```
The treedata object
```r
as.treedata(y)
## 'treedata' S4 object'.
##
## ...@ phylo:
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## [1] "t4" "t1" "t3" "t2"
##
## Rooted; includes branch lengths.
##
## with the following features available:
## 'trait'.

y %>% as.treedata %>% as_tibble  # convert the treedata object back to a tibble
## # A tibble: 7 x 5
##   parent  node branch.length label  trait
##    <int> <int>         <dbl> <chr>  <dbl>
## 1      5     1       0.435   t4     0.943
## 2      7     2       0.674   t1    -0.171
## 3      7     3       0.00202 t3     0.570
## 4      6     4       0.0251  t2    -0.283
## 5      5     5      NA       <NA>  NA
## 6      5     6       0.472   <NA>  NA
## 7      6     7       0.274   <NA>  NA
```
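Section 3.1 mentions that a treedata object can be exported to a single tree file. The following is a minimal sketch of that step, assuming that treeio's `write.beast()` is available and writes the joined columns as branch annotations; the small tree, the trait values, and the temporary file are made up for illustration.

```r
library(ape)
library(tidytree)
library(treeio)
library(dplyr)

# A small fixed tree and a hypothetical trait table (values are invented)
small_tree <- read.tree(text = "((A:1,B:2):3,C:2);")
traits <- tibble(label = c("A", "B", "C"), trait = c(0.5, 1.5, 2.5))

# tree -> tibble -> join by tip label -> treedata
td <- as_tibble(small_tree) %>%
  full_join(traits, by = "label") %>%
  as.treedata()

# Export the tree and its annotations to one BEAST-style NEXUS file
out_file <- tempfile(fileext = ".tree")
write.beast(td, file = out_file)

# Inspect the file: the trait values are expected to appear as annotations
cat(readLines(out_file), sep = "\n")
```

A file written this way can, in principle, be read back with `read.beast()`, so the tree and its data travel together.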
Accessing related nodes
```r
child(y, 5)
```
```r
## # A tibble: 2 x 5
##   parent  node branch.length label  trait
##    <int> <int>         <dbl> <chr>  <dbl>
## 1      5     1         0.435 t4     0.943
## 2      5     6         0.472 <NA>  NA
```
```r
parent(y, 2)
```
```r
## # A tibble: 1 x 5
##   parent  node branch.length label trait
##    <int> <int>         <dbl> <chr> <dbl>
## 1      6     7         0.274 <NA>     NA
```
```r
offspring(y, 5)
```
```r
## # A tibble: 6 x 5
##   parent  node branch.length label  trait
##    <int> <int>         <dbl> <chr>  <dbl>
## 1      5     1       0.435   t4     0.943
## 2      7     2       0.674   t1    -0.171
## 3      7     3       0.00202 t3     0.570
## 4      6     4       0.0251  t2    -0.283
## 5      5     6       0.472   <NA>  NA
## 6      6     7       0.274   <NA>  NA
```
```r
ancestor(y, 2)
```
```r
## # A tibble: 3 x 5
##   parent  node branch.length label trait
##    <int> <int>         <dbl> <chr> <dbl>
## 1      5     5        NA     <NA>     NA
## 2      5     6         0.472 <NA>     NA
## 3      6     7         0.274 <NA>     NA
```
```r
sibling(y, 2)
```
```r
## # A tibble: 1 x 5
##   parent  node branch.length label trait
##    <int> <int>         <dbl> <chr> <dbl>
## 1      7     3       0.00202 t3    0.570
```
```r
MRCA(y, 2, 3)
```
```r
## # A tibble: 1 x 5
##   parent  node branch.length label trait
##    <int> <int>         <dbl> <chr> <dbl>
## 1      6     7         0.274 <NA>     NA
```
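Because the random tree above depends on the seed and on the R version, the same accessors can be easier to follow on a fixed tree. This sketch assumes ape's numbering convention (tips are numbered first, in order of appearance, and the root comes next); the tree itself is made up.

```r
library(ape)
library(tidytree)

# ((A,B),C): tips A, B, C are nodes 1 to 3, the root is node 4,
# and the parent of A and B is node 5
fixed_tree <- read.tree(text = "((A:1,B:2):3,C:2);")
tree_tbl <- as_tibble(fixed_tree)

child(tree_tbl, 4)$node      # direct children of the root
parent(tree_tbl, 1)$node     # parent of tip A
sibling(tree_tbl, 1)$node    # sibling of tip A
MRCA(tree_tbl, 1, 2)$node    # most recent common ancestor of A and B

# Cross-check the MRCA with ape, using tip names instead of numbers
getMRCA(fixed_tree, c("A", "B"))
```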
3.2 Data integration
3.2.1 Merging CodeML and BEAST results
```r
library(treeio)  # provides read.beast(), read.codeml(), merge_tree()

beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
rst_file <- system.file("examples/rst", package="ggtree")
mlc_file <- system.file("examples/mlc", package="ggtree")
beast_tree <- read.beast(beast_file)
codeml_tree <- read.codeml(rst_file, mlc_file)

merged_tree <- merge_tree(beast_tree, codeml_tree)
merged_tree
```
```r
## 'treedata' S4 object that stored information of
##  '/home/ygc/R/library/ggtree/examples/MCC_FluA_H3.tree',
##  '/home/ygc/R/library/ggtree/examples/rst',
##  '/home/ygc/R/library/ggtree/examples/mlc'.
##
## ...@ phylo:
## Phylogenetic tree with 76 tips and 75 internal nodes.
##
## Tip labels:
##  A/Hokkaido/30-1-a/2013, A/New_York/334/2004, A/New_York/463/2005, A/New_York/452/1999, A/New_York/238/2005, A/New_York/523/1998, ...
##
## Rooted; includes branch lengths.
##
## with the following features available:
##  'height', 'height_0.95_HPD', 'height_median',
##  'height_range', 'length', 'length_0.95_HPD',
##  'length_median', 'length_range', 'posterior', 'rate',
##  'rate_0.95_HPD', 'rate_median', 'rate_range', 'subs',
##  'AA_subs', 't', 'N', 'S', 'dN_vs_dS', 'dN', 'dS', 'N_x_dN',
##  'S_x_dS'.
```
Use the tidytree package to convert the merged tree object to a tidy data frame, then visualize it as hexbin scatterplots of dN/dS, dN, and dS inferred by CodeML against the rate (substitutions per site per year) inferred by BEAST on the same branches.
```r
library(dplyr)
library(ggplot2)

df <- merged_tree %>%
  as_tibble() %>%
  select(dN_vs_dS, dN, dS, rate) %>%
  subset(dN_vs_dS >= 0 & dN_vs_dS <= 1.5) %>%   # drop extreme dN/dS values
  tidyr::gather(type, value, dN_vs_dS:dS)       # long format: one row per measure
df$type[df$type == 'dN_vs_dS'] <- 'dN/dS'
df$type <- factor(df$type, levels=c("dN/dS", "dN", "dS"))
ggplot(df, aes(rate, value)) + geom_hex() +
  facet_wrap(~type, scales='free_y')
```
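04. Phylogenetic tree visualization
Many software packages and web tools can display phylogenetic trees, for example TreeView, FigTree, TreeDyn, Dendroscope, EvolView, and iTOL. Some of them, such as FigTree, TreeDyn, and iTOL, let users annotate trees with colored branches, highlighted clades, and tree features. However, their predefined annotation functions are usually limited to certain specific kinds of phylogenetic data. As phylogenetic trees are applied ever more widely in multidisciplinary research, there is a growing need to incorporate phylogenetic covariates and other associated data of various types and from different sources into trees for visualization and further analysis. For example, influenza viruses have a wide host range, diverse and dynamic genotypes, and characteristic transmission behaviors, most of which are associated with the evolution of the virus and with one another. Therefore, in addition to standalone applications that each focus on one specific analysis and data type, researchers studying molecular evolution need a robust, programmable platform that allows high-level integration and visualization of many different aspects of data (raw data or results of other primary analyses) on phylogenetic trees, in order to identify their associations and patterns.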
4.1 Basic syntax
```r
# ggtree(tree_object) is shorthand for the ggplot2 form on the first line
ggplot(tree_object) + geom_tree() + theme_tree()
ggtree(tree_object)

# Commonly used layers:
# geom_treescale()                   legend for the branch scale (genetic distance, divergence time, etc.)
# geom_range()                       uncertainty of branch lengths (confidence interval, range, etc.)
# geom_tiplab()                      taxon labels
# geom_tippoint(), geom_nodepoint()  symbols on tips and internal nodes
# geom_hilight()                     highlight a clade
# geom_cladelabel()                  label a clade
```
```r
library("treeio")
library("ggtree")

nwk <- system.file("extdata", "sample.nwk", package="treeio")
tree <- read.tree(nwk)

ggplot(tree, aes(x, y)) + geom_tree() + theme_tree()
ggtree(tree, color="firebrick", size=2, linetype="dotted")
ggtree(tree, ladderize=FALSE)
ggtree(tree, branch.length="none")
```
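The layers listed above are combined with `+`, as in any ggplot2 plot. The following is a small sketch on a made-up four-tip tree, so that it does not depend on the package example files; the colors and the label offset are arbitrary choices.

```r
library(ape)
library(ggtree)

# A small fixed tree (invented for this sketch)
demo_tree <- read.tree(text = "((A:1,B:2):3,(C:1.5,D:0.5):2);")

p <- ggtree(demo_tree) +                   # tree structure with the tree theme
  geom_tiplab(offset = 0.1) +              # taxon labels at the tips
  geom_tippoint(color = "steelblue") +     # symbols on tips
  geom_nodepoint(color = "firebrick") +    # symbols on internal nodes
  geom_treescale()                         # branch-length scale bar

print(p)

# The plot carries a data frame with one row per node and an isTip flag
head(p$data)
```

Looking at `p$data` is useful when deciding which columns (for example `node`, `isTip`, or `branch`) can be mapped in `aes()`.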
4.2 Tree layouts
```r
library(ggtree)
set.seed(2017-02-16)
tree <- rtree(50)

ggtree(tree)
ggtree(tree, layout="slanted")
ggtree(tree, layout="circular")
ggtree(tree, layout="fan", open.angle=120)
ggtree(tree, layout="equal_angle")
ggtree(tree, layout="daylight")
ggtree(tree, branch.length='none')
ggtree(tree, branch.length='none', layout='circular')
ggtree(tree, layout="daylight", branch.length = 'none')
```
```r
ggtree(tree) + scale_x_reverse()
ggtree(tree) + coord_flip()
ggtree(tree) + layout_dendrogram()
print(ggtree(tree), newpage=TRUE, vp=grid::viewport(angle=-30, width=.9, height=.9))
ggtree(tree, layout='slanted') + coord_flip()
ggtree(tree, layout='slanted', branch.length='none') + layout_dendrogram()
ggtree(tree, layout='circular') + xlim(-10, NA)
ggtree(tree) + scale_x_reverse() + coord_polar(theta='y')
ggtree(tree) + scale_x_reverse(limits=c(10, 0)) + coord_polar(theta='y')
```
Time-scaled phylogenetic tree
```r
beast_file <- system.file("examples/MCC_FluA_H3.tree",
                          package="ggtree")
beast_tree <- read.beast(beast_file)
# mrsd: most recent sampling date, used to put calendar time on the x axis
ggtree(beast_tree, mrsd="2013-01-01") + theme_tree2()
```
4.3 Displaying tree components
4.3.1 Tree scale
```r
ggtree(tree) + geom_treescale()

# geom_treescale() supports the following parameters:
#   x and y   for tree scale position
#   width     for the length of the tree scale
#   fontsize  for the size of the text
#   linesize  for the size of the line
#   offset    for relative position of the line and the text
#   color     for color of the tree scale
```
```r
ggtree(tree) + geom_treescale(x=0, y=45, width=1, color='red')
ggtree(tree) + geom_treescale(fontsize=6, linesize=2, offset=1)
ggtree(tree) + theme_tree2()
```
4.3.2 Displaying internal nodes and tips
```r
ggtree(tree) + geom_point(aes(shape=isTip, color=isTip), size=3)

p <- ggtree(tree) + geom_nodepoint(color="#b5e521", alpha=1/4, size=10)
p + geom_tippoint(color="#FDAC4F", shape=8, size=3)
```
4.3.3 Displaying labels
```r
p + geom_tiplab(size=3, color="purple")
ggtree(tree, layout="circular") + geom_tiplab(aes(angle=angle), color='blue')
```
4.3.4 Displaying the root edge
```r
## with root edge = 1
tree1 <- read.tree(text='((A:1,B:2):3,C:2):1;')
ggtree(tree1) + geom_tiplab() + geom_rootedge()

## without root edge
tree2 <- read.tree(text='((A:1,B:2):3,C:2);')
ggtree(tree2) + geom_tiplab() + geom_rootedge()

## setting root edge
tree2$root.edge <- 2
ggtree(tree2) + geom_tiplab() + geom_rootedge()

## specify length of root edge for just plotting
## this will ignore tree$root.edge
ggtree(tree2) + geom_tiplab() + geom_rootedge(rootedge = 3)
```
4.3.5 Coloring the tree
```r
ggtree(beast_tree, aes(color=rate)) +
  scale_color_continuous(low='darkgreen', high='red') +
  theme(legend.position="right")
```
Visualizing ancestral state reconstruction
```r
anole.tree <- read.tree("http://www.phytools.org/eqg2015/data/anole.tre")
svl <- read.csv("http://www.phytools.org/eqg2015/data/svl.csv",
                row.names=1)
svl <- as.matrix(svl)[,1]
# estimate ancestral states of the continuous trait at internal nodes
fit <- phytools::fastAnc(anole.tree, svl, vars=TRUE, CI=TRUE)

# observed values at tips (td) and reconstructed values at internal nodes (nd)
td <- data.frame(node = nodeid(anole.tree, names(svl)),
                 trait = svl)
nd <- data.frame(node = names(fit$ace), trait = fit$ace)

d <- rbind(td, nd)
d$node <- as.numeric(d$node)
tree <- full_join(anole.tree, d, by = 'node')

ggtree(tree, aes(color=trait), layout = 'circular',
       ladderize = FALSE, continuous = TRUE, size=2) +
  scale_color_gradientn(colours=c("red", 'orange', 'green', 'cyan', 'blue')) +
  geom_tiplab(hjust = -.1) +
  xlim(0, 1.2) +
  theme(legend.position = c(.05, .85))

ggtree(tree, aes(color=trait), continuous = TRUE, yscale = "trait") +
  scale_color_viridis_c() + theme_minimal()
```
4.3.6 Rescaling the tree
```r
library("treeio")
beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
beast_tree <- read.beast(beast_file)
beast_tree

p1 <- ggtree(beast_tree, mrsd='2013-01-01') + theme_tree2() +
  labs(caption="Divergence time")
p2 <- ggtree(beast_tree, branch.length='rate') + theme_tree2() +
  labs(caption="Substitution rate")

mlcfile <- system.file("extdata/PAML_Codeml", "mlc", package="treeio")
mlc_tree <- read.codeml_mlc(mlcfile)
p3 <- ggtree(mlc_tree) + theme_tree2() +
  labs(caption="nucleotide substitutions per codon")
p4 <- ggtree(mlc_tree, branch.length='dN_vs_dS') + theme_tree2() +
  labs(caption="dN/dS tree")

beast_tree2 <- rescale_tree(beast_tree, branch.length='rate')
ggtree(beast_tree2) + theme_tree2()
```
4.3.7 Changing the theme
```r
set.seed(2019)
x <- rtree(30)
ggtree(x, color="red") + theme_tree("steelblue")
ggtree(x, color="white") + theme_tree("black")
```
4.3.8 Displaying multiple trees
```r
trees <- lapply(c(10, 20, 40), rtree)
class(trees) <- "multiPhylo"
ggtree(trees) + facet_wrap(~.id, scales="free") + geom_tiplab()
```
```r
btrees <- read.tree(system.file("extdata/RAxML", "RAxML_bootstrap.H3", package="treeio"))
ggdensitree(btrees, alpha=.3, colour='steelblue') +
  geom_tiplab(size=3) + xlim(0, 45)
```
05. Phylogenetic tree annotation
5.1 Tree annotation
```r
library(treeio)  # provides read.nhx()
library(ggtree)
treetext = "(((ADH2:0.1[&&NHX:S=human], ADH1:0.11[&&NHX:S=human]):
0.05 [&&NHX:S=primates:D=Y:B=100],ADHY:
0.1[&&NHX:S=nematode],ADHX:0.12 [&&NHX:S=insect]):
0.1[&&NHX:S=metazoa:D=N],(ADH4:0.09[&&NHX:S=yeast],
ADH3:0.13[&&NHX:S=yeast], ADH2:0.12[&&NHX:S=yeast],
ADH1:0.11[&&NHX:S=yeast]):0.1[&&NHX:S=Fungi])[&&NHX:D=N];"
tree <- read.nhx(textConnection(treetext))
ggtree(tree) + geom_tiplab() +
  geom_label(aes(x=branch, label=S), fill='lightgreen') +
  geom_label(aes(label=D), fill='steelblue') +
  geom_text(aes(label=B), hjust=-.5)
```
Layer | Description |
---|---|
geom_balance | highlights the two direct descendant clades of an internal node |
geom_cladelabel | annotate a clade with bar and text label |
geom_facet | plot associated data in specific panel (facet) and align the plot with the tree |
geom_hilight | highlight a clade with rectangle |
geom_inset | add insets (subplots) to tree nodes |
geom_label2 | modified version of geom_label, with subsetting supported |
geom_nodepoint | annotate internal nodes with symbolic points |
geom_point2 | modified version of geom_point, with subsetting supported |
geom_range | bar layer to present uncertainty of evolutionary inference |
geom_rootpoint | annotate root node with symbolic point |
geom_rootedge | add root edge to a tree |
geom_segment2 | modified version of geom_segment, with subsetting supported |
geom_strip | annotate associated taxa with bar and (optional) text label |
geom_taxalink | associate two related taxa by linking them with a curve |
geom_text2 | modified version of geom_text, with subsetting supported |
geom_tiplab | layer of tip labels |
geom_tippoint | annotate external nodes with symbolic points |
geom_tree | tree structure layer, with multiple layout supported |
geom_treescale | tree branch scale legend |
```r
# show the node number of every internal node in the tree
ggtree(tree) + geom_text2(aes(subset=!isTip, label=node), hjust=-.3) + geom_tiplab()
```
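Annotation layers such as geom_cladelabel and geom_hilight take a node number. Besides reading it off a plot like the one above, the number can be looked up from tip names. This is a sketch that assumes ape's `getMRCA()`; the five-tip tree and the chosen tips are made up for illustration.

```r
library(ape)
library(ggtree)

clade_tree <- read.tree(text = "(((A:1,B:1):1,C:2):1,(D:1,E:1):2);")

# Internal node that is the most recent common ancestor of tips A and C
clade_node <- getMRCA(clade_tree, c("A", "C"))
clade_node

# Use the looked-up node number in an annotation layer
p_clade <- ggtree(clade_tree) +
  geom_tiplab() +
  geom_text2(aes(subset = !isTip, label = node), hjust = -.3) +
  geom_hilight(node = clade_node, fill = "steelblue", alpha = .4)
print(p_clade)
```

Looking the node up by tip names keeps the annotation valid if the tree is re-read or re-rooted and the internal numbering changes.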
5.2 Annotation layers
5.2.1 Clade labels
```r
set.seed(2015-12-21)
tree <- rtree(30)
p <- ggtree(tree) + xlim(NA, 6)

p + geom_cladelabel(node=45, label="test label") +
  geom_cladelabel(node=34, label="another clade")
```
```r
p + geom_cladelabel(node=45, label="test label", align=TRUE, offset = .2, color='red') +
  geom_cladelabel(node=34, label="another clade", align=TRUE, offset = .2, color='blue')
```
```r
p + geom_cladelabel(node=45, label="test label", align=TRUE, angle=270, hjust='center', offset.text=.5, barsize=1.5) +
  geom_cladelabel(node=34, label="another clade", align=TRUE, angle=45, fontsize=8)
```
```r
p + geom_cladelabel(node=34, label="another clade", align=TRUE, geom='label', fill='lightblue')
```
```r
ggtree(tree, layout="daylight") +
  geom_cladelabel(node=35, label="test label", angle=0,
                  fontsize=8, offset=.5, vjust=.5) +
  geom_cladelabel(node=55, label='another clade',
                  angle=-95, hjust=.5, fontsize=8)
```
```r
p + geom_tiplab() +
  geom_strip('t10', 't30', barsize=2, color='red',
             label="associated taxa", offset.text=.1) +
  geom_strip('t1', 't18', barsize=2, color='blue',
             label = "another label", offset.text=.1)
```
5.2.2 Highlighting clades
```r
nwk <- system.file("extdata", "sample.nwk", package="treeio")
tree <- read.tree(nwk)
ggtree(tree) + geom_hilight(node=21, fill="steelblue", alpha=.6) +
  geom_hilight(node=17, fill="darkgreen", alpha=.6)
```
```r
ggtree(tree, layout="circular") + geom_hilight(node=21, fill="steelblue", alpha=.6) +
  geom_hilight(node=23, fill="darkgreen", alpha=.6)
```
```r
# NOTE: `pg` is not defined in this article. It is presumably a ggtree plot
# of a tree that has nodes 35 and 55, such as the 30-tip random tree drawn
# with layout="daylight" in 5.2.1.
pg + geom_hilight(node=55) + geom_hilight(node=35, fill='darkgreen')
```
```r
ggtree(tree) +
  geom_balance(node=16, fill='steelblue', color='white', alpha=0.6, extend=1) +
  geom_balance(node=19, fill='darkgreen', color='white', alpha=0.6, extend=1)
```
5.2.3 Linking taxa
```r
ggtree(tree) + geom_tiplab() + geom_taxalink('A', 'E') +
  geom_taxalink('F', 'K', color='red', linetype = 'dashed',
                arrow=grid::arrow(length=grid::unit(0.02, "npc")))
```
5.2.4 Uncertainty of divergence time estimates
```r
file <- system.file("extdata/MEGA7", "mtCDNA_timetree.nex", package = "treeio")
x <- read.mega(file)
p1 <- ggtree(x) + geom_range('reltime_0.95_CI', color='red', size=3, alpha=.3)
p2 <- ggtree(x) + geom_range('reltime_0.95_CI', color='red', size=3, alpha=.3, center='reltime')
p3 <- p2 + scale_x_range() + theme_tree2()
```
5.3 Annotating trees with output from evolutionary software
```r
file <- system.file("extdata/BEAST", "beast_mcc.tree", package="treeio")
beast <- read.beast(file)
ggtree(beast, aes(color=rate)) +
  geom_range(range='length_0.95_HPD', color='red', alpha=.6, size=2) +
  geom_nodelab(aes(x=branch, label=round(posterior, 2)), vjust=-.5, size=3) +
  scale_color_continuous(low="darkgreen", high="red") +
  theme(legend.position=c(.1, .8))
```
```r
nwk <- system.file("extdata/HYPHY", "labelledtree.tree",
                   package="treeio")
ancseq <- system.file("extdata/HYPHY", "ancseq.nex",
                      package="treeio")
tipfas <- system.file("extdata", "pa.fas", package="treeio")
hy <- read.hyphy(nwk, ancseq, tipfas)
ggtree(hy) +
  geom_text(aes(x=branch, label=AA_subs),
            size=2, vjust=-.3, color="firebrick")
```
```r
rstfile <- system.file("extdata/PAML_Codeml", "rst",
                       package="treeio")
mlcfile <- system.file("extdata/PAML_Codeml", "mlc",
                       package="treeio")
ml <- read.codeml(rstfile, mlcfile)
ggtree(ml, aes(color=dN_vs_dS), branch.length='dN_vs_dS') +
  scale_color_continuous(name='dN/dS', limits=c(0, 1.5),
                         oob=scales::squish,
                         low='darkgreen', high='red') +
  geom_text(aes(x=branch, label=AA_subs),
            vjust=-.5, color='steelblue', size=2) +
  theme_tree2(legend.position=c(.9, .3))
```
06. Scaling the tree topology
```r
library(ggtree)
nwk <- system.file("extdata", "sample.nwk", package="treeio")
tree <- read.tree(nwk)
p <- ggtree(tree) + geom_tiplab()
viewClade(p, MRCA(p, "I", "L"))
```
```r
tree2 <- groupClade(tree, c(17, 21))
p <- ggtree(tree2, aes(color=group)) + theme(legend.position='none') +
  scale_color_manual(values=c("black", "firebrick", "steelblue"))
scaleClade(p, node=17, scale=.1)
```
```r
p2 <- p %>% collapse(node=21) +
  geom_point2(aes(subset=(node==21)), shape=21, size=5, fill='green')
p2 <- collapse(p2, node=23) +
  geom_point2(aes(subset=(node==23)), shape=23, size=5, fill='red')
print(p2)
expand(p2, node=23) %>% expand(node=21)
```
```r
p2 <- p + geom_tiplab()
node <- 21
collapse(p2, node, 'max') %>% expand(node)
collapse(p2, node, 'min') %>% expand(node)
collapse(p2, node, 'mixed') %>% expand(node)

collapse(p, 21, 'mixed', fill='steelblue', alpha=.4) %>%
  collapse(23, 'mixed', fill='firebrick', color='blue')
scaleClade(p, 23, .2) %>% collapse(23, 'min', fill="darkgreen")
```
```r
data(iris)
rn <- paste0(iris[,5], "_", 1:150)
rownames(iris) <- rn
d_iris <- dist(iris[,-5], method="manhattan")

tree_iris <- ape::bionj(d_iris)
grp <- list(setosa     = rn[1:50],
            versicolor = rn[51:100],
            virginica  = rn[101:150])

p_iris <- ggtree(tree_iris, layout = 'circular', branch.length='none')
groupOTU(p_iris, grp, 'Species') + aes(color=Species) +
  theme(legend.position="right")

tree_iris <- groupOTU(tree_iris, grp, "Species")
ggtree(tree_iris, aes(color=Species), layout = 'circular', branch.length = 'none') +
  theme(legend.position="right")
```
```r
p1 <- p + geom_point2(aes(subset=node==16), color='darkgreen', size=5)
p2 <- rotate(p1, 17) %>% rotate(21)
flip(p2, 17, 21)
```
```r
p3 <- open_tree(p, 180) + geom_tiplab()
print(p3)
```
07. Plotting trees with data
```r
library(ggimage)
library(ggtree)
url <- paste0("https://raw.githubusercontent.com/TreeViz/",
              "metastyle/master/design/viz_targets_exercise/")

x <- read.tree(paste0(url, "tree_boots.nwk"))
info <- read.csv(paste0(url, "tip_data.csv"))
# %<+% attaches a data frame to the tree plot (matched on the first column)
p <- ggtree(x) %<+% info + xlim(-.1, 4)
p2 <- p + geom_tiplab(offset = .6, hjust = .5) +
  geom_tippoint(aes(shape = trophic_habit, color = trophic_habit, size = mass_in_kg)) +
  theme(legend.position = "right") + scale_size_continuous(range = c(3, 10))

d2 <- read.csv(paste0(url, "inode_data.csv"))
p2 %<+% d2 + geom_label(aes(label = vernacularName.y, fill = posterior)) +
  scale_fill_gradientn(colors = RColorBrewer::brewer.pal(3, "YlGnBu"))
```
```r
library(ggtree)

remote_folder <- paste0("https://raw.githubusercontent.com/katholt/",
                        "plotTree/master/tree_example_april2015/")

## read the phylogenetic tree
tree <- read.tree(paste0(remote_folder, "tree.nwk"))

## read the sampling information data set
info <- read.csv(paste0(remote_folder, "info.csv"))

## read and process the allele table
snps <- read.csv(paste0(remote_folder, "alleles.csv"), header = FALSE,
                 row.names = 1, stringsAsFactors = FALSE)
snps_strainCols <- snps[1,]
snps <- snps[-1,] # drop strain names
colnames(snps) <- snps_strainCols

gapChar <- "?"
snp <- t(snps)
lsnp <- apply(snp, 1, function(x) {
        x != snp[1,] & x != gapChar & snp[1,] != gapChar
    })
lsnp <- as.data.frame(lsnp)
lsnp$pos <- as.numeric(rownames(lsnp))
lsnp <- tidyr::gather(lsnp, name, value, -pos)
snp_data <- lsnp[lsnp$value, c("name", "pos")]

## read the trait data
bar_data <- read.csv(paste0(remote_folder, "bar.csv"))

## visualize the tree
p <- ggtree(tree)

## attach the sampling information data set
## and add symbols colored by location
p <- p %<+% info + geom_tippoint(aes(color=location))

## visualize SNP and Trait data using dot and bar charts,
## and align them based on tree structure
p + geom_facet(panel = "SNP", data = snp_data, geom = geom_point,
               mapping=aes(x = pos, color = location), shape = '|') +
    geom_facet(panel = "Trait", data = bar_data, geom = ggstance::geom_barh,
               aes(x = dummy_bar_value, color = location, fill = location),
               stat = "identity", width = .6) +
    theme_tree2(legend.position=c(.05, .85))
```
ggtree is a truly excellent tool that every phylogenetics researcher should learn. Special thanks to Prof. Guangchuang Yu for developing this excellent R package.
References:
- LG Wang, TTY Lam, S Xu, Z Dai, L Zhou, T Feng, P Guo, CW Dunn, BR Jones, T Bradley, H Zhu, Y Guan, Y Jiang, G Yu*. treeio: an R package for phylogenetic tree input and output with richly annotated and associated data. Molecular Biology and Evolution. 2019, accepted. doi: 10.1093/molbev/msz240.
- G Yu, TTY Lam, H Zhu, Y Guan. Two methods for mapping and visualizing associated data on phylogeny using ggtree. Molecular Biology and Evolution. 2018, 35(12):3041-3043. doi: 10.1093/molbev/msy194.
- G Yu, DK Smith, H Zhu, Y Guan, TTY Lam*. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution. 2017, 8(1):28-36. doi: 10.1111/2041-210X.12628.