ggtree: Visualizing phylogenetic trees

You can’t even begin to understand biology, you can’t understand life, unless you understand what it’s all there for, how it arose - and that means evolution.
— Richard Dawkins

01. Input file formats

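The three most common phylogenetic tree formats are Newick, NEXUS, and Phylip. Newick and NEXUS trees are recognized by most software. In addition, many evolutionary analysis programs, such as BEAST, MrBayes, PAML, and r8s, produce output files in formats of their own.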

(1). Newick format

((t2:0.04,t1:0.34):0.89,(t5:0.37,(t4:0.03,t3:0.67):0.9):0.59);

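Every Newick file ends with a semicolon (;). An internal node is represented by a pair of matched parentheses, and the nodes between the parentheses are its descendants; for example, (t2:0.04,t1:0.34) represents the parent node of t2 and t1. Sibling nodes are separated by commas, and tips are represented by their names. A branch length (from parent to child) is the real number that follows the child node, preceded by a colon. Data associated with internal nodes or branches (for example, bootstrap values) may be encoded as node labels, written as plain text or numbers before the colon.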
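
To see how these rules map onto a tree object, the Newick string above can be parsed with ape. This is only a minimal sketch: the variable names are illustrative, and it assumes the ape package is installed.

```r
library(ape)

# The Newick string from the example above
nwk <- "((t2:0.04,t1:0.34):0.89,(t5:0.37,(t4:0.03,t3:0.67):0.9):0.59);"
tree <- read.tree(text = nwk)

tree$tip.label    # tip names, in the order they appear in the string
Ntip(tree)        # number of tips (5)
Nnode(tree)       # number of internal nodes, one per pair of parentheses (4)
tree$edge.length  # the numbers that follow the colons
```

Each pair of parentheses becomes one internal node, and each number after a colon becomes one entry of `edge.length`.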


(2). NEXUS format

#NEXUS
[R-package APE, Wed Nov  9 11:46:32 2016]

BEGIN TAXA;
    DIMENSIONS NTAX = 5;
    TAXLABELS
        t5
        t4
        t1
        t2
        t3
    ;
END;
BEGIN TREES;
    TRANSLATE
        1   t5,
        2   t4,
        3   t1,
        4   t2,
        5   t3
    ;
    TREE * UNTITLED = [&R] (1:0.89,((2:0.59,3:0.37):0.34,
    (4:0.03,5:0.67):0.9):0.04);
END;

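A NEXUS file is organized into blocks; the three main ones are TAXA (taxon information), DATA (a data matrix or multiple sequence alignment), and TREES (phylogenetic trees in Newick format). The example above contains only the TAXA and TREES blocks.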
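
As a rough illustration of the block structure, a tree can be written to NEXUS with ape and read back. This sketch assumes the ape package; the temporary file and variable names are made up for the example.

```r
library(ape)

tree <- read.tree(text = "((t2:0.04,t1:0.34):0.89,(t5:0.37,(t4:0.03,t3:0.67):0.9):0.59);")

# Write the tree to a temporary NEXUS file and print the file
nex <- tempfile(fileext = ".nex")
write.nexus(tree, file = nex)
nexus_lines <- readLines(nex)
cat(nexus_lines, sep = "\n")

# Read it back; the tip labels are restored through the TRANSLATE table
tree2 <- read.nexus(nex)
tree2$tip.label
```

The printed file should show a TAXA block followed by a TREES block, much like the example above.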

(3). New Hampshire eXtended format

(((ADH2:0.1[&&NHX:S=human],
ADH1:0.11[&&NHX:S=human]):0.05[&&NHX:S=primates:D=Y:B=100],ADHY:0.1[&&NHX:S=nematode],ADHX:0.12[&&NHX:S=insect]):0.1[&&NHX:S=metazoa:D=N],
(ADH4:0.09[&&NHX:S=yeast],ADH3:0.13[&&NHX:S=yeast],
ADH2:0.12[&&NHX:S=yeast],ADH1:0.11[&&NHX:S=yeast]):0.1[&&NHX:S=Fungi])
[&&NHX:D=N];

(4). Output formats of other software

  • BEAST
tree TREE1 = [&R]
(((11[&length=9.4]:9.38,14[&length=6.4]:6.385096430786298)
[&length=25.7]:25.43,4[&length=9.1]:8.821663252749829)
[&length=3.0]:3.10,(12[&length=0.6]:0.56,
(10[&length=1.6]:1.56,(7[&length=5.2]:5.19,
((((2[&length=3.3]:3.26,(1[&length=1.3]:1.32,
(6[&length=0.8]:0.83,13[&length=0.8]:0.8311577761397366)
[&length=2.4]:2.48917886025146)
[&length=0.9]:0.9416178372674331)
[&length=0.4]:0.49,9[&length=1.7]:1.757288031101215)
[&length=2.4]:2.35,8[&length=2.1]:2.1125745387283246)
[&length=0.2]:0.23,(3[&length=3.3]:3.31,
(15[&length=5.2]:5.27,5[&length=3.2]:3.2710481368304585)
[&length=1.0]:1.0409443024626412)
[&length=1.9]:2.0372962536780435)
[&length=2.8]:2.8446835614595685)
[&length=5.3]:5.367459711197171)
[&length=2.0]:2.0037467863383043)
[&length=4.3]:4.360909907798238)[&length=0.0];

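BEAST output contains several kinds of evolutionary inference. A molecular clock analysis, for example, typically reports rate, length, height, and posterior, together with HPD intervals and ranges that describe their uncertainty. rate is the evolutionary rate of a branch, length is the branch length, height is the age of a node measured back from the most recent tip, and posterior is the Bayesian clade credibility (posterior probability) value.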

  • MrBayes
tree con_all_compat = [&U]
(8[&prob=1.0]:2.94e-1[&length_mean=2.9e-1],10[&prob=1.0]:2.25e-1[&length_mean=2.2e-1],
((((1[&prob=1.0]:1.43e-1[&length_mean=1.4e-1],2[&prob=1.0]:1.92e-1[&length_mean=1.9e-1])
[&prob=1.0]:1.24e-1[&length_mean=1.2e-1],9[&prob=1.0]:2.27e-1[&length_mean=2.2e-1])
[&prob=1.0]:1.72e-1[&length_mean=1.7e-1],12[&prob=1.0]:5.11e-1[&length_mean=5.1e-1])
[&prob=1.0]:1.76e-1[&length_mean=1.7e-1],
(((3[&prob=1.0]:5.46e-2[&length_mean=5.4e-2],
(6[&prob=1.0]:1.03e-2[&length_mean=1.0e-2],7[&prob=1.0]:7.13e-3[&length_mean=7.2e-3])
[&prob=1.0]:6.93e-2[&length_mean=6.9e-2])
[&prob=1.0]:6.03e-2[&length_mean=6.0e-2],
(4[&prob=1.0]:6.27e-2[&length_mean=6.2e-2],5[&prob=1.0]:6.31e-2[&length_mean=6.3e-2])
[&prob=1.0]:6.07e-2[&length_mean=6.0e-2])
[&prob=1.0]:1.80e-1[&length_mean=1.8e-1],11[&prob=1.0]:2.37e-1[&length_mean=2.3e-1])
[&prob=1.0]:4.05e-1[&length_mean=4.0e-1])
[&prob=1.0]:1.16e+000[&length_mean=1.162699558201079e+000])
[&prob=1.0][&length_mean=0];

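The MrBayes example above is abridged: most annotations have been removed, leaving only prob (the clade posterior probability) and length_mean (the mean branch length). The complete output also includes prob_stddev, prob_range, prob(percent), prob+-sd, length_median, and length_95%_HPD.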

  • PAML
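    PAML (Phylogenetic Analysis by Maximum Likelihood), developed by Prof. Ziheng Yang, is a software package for phylogenetic analysis of DNA and protein sequences; BaseML and CodeML are its two main programs. BaseML estimates tree topology, branch lengths, and substitution parameters under a variety of nucleotide substitution models. CodeML estimates synonymous and nonsynonymous substitution rates and performs likelihood ratio tests of positive selection under codon substitution models. CodeML writes an mlc file that contains the tree topology and the estimated synonymous and nonsynonymous substitution rates.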

02. Reading tree files in R

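Commonly used R packages for phylogenetics include ape, phylobase, and phytools; this article focuses on ggtree. Its companion package treeio can directly parse the output files of BEAST, CodeML, MrBayes, r8s, and other programs.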

Table 1.1: Parser functions defined in treeio

Parser function    Description
read.astral        parsing output of ASTRAL
read.beast         parsing output of BEAST
read.codeml        parsing output of CodeML (rst and mlc files)
read.codeml_mlc    parsing mlc file (output of CodeML)
read.fasta         parsing FASTA format sequence file
read.hyphy         parsing output of HYPHY
read.hyphy.seq     parsing ancestral sequences from HYPHY output
read.iqtree        parsing IQ-Tree newick string, with ability to parse SH-aLRT and UFBoot support values
read.jplace        parsing jplace file including output of EPA and pplacer
read.jtree         parsing jtree format
read.mega          parsing MEGA Nexus output
read.mega_tabular  parsing MEGA tabular output
read.mrbayes       parsing output of MrBayes
read.newick        parsing newick string, with ability to parse node label as support values
read.nhx           parsing NHX file including output of PHYLDOG and RevBayes
read.paml_rst      parsing rst file (output of BaseML or CodeML)
read.phylip        parsing phylip file (phylip alignment + newick string)
read.phylip.seq    parsing multiple sequence alignment from phylip file
read.phylip.tree   parsing newick string from phylip file
read.r8s           parsing output of r8s
read.raxml         parsing output of RAxML
library(treeio)  # parser functions such as read.beast()
library(ggtree)

file <- system.file("extdata/BEAST", "beast_mcc.tree", package="treeio")
beast <- read.beast(file)
beast
## 'treedata' S4 object that stored information of
##  '/home/ygc/R/library/treeio/extdata/BEAST/beast_mcc.tree'.
##
## ...@ phylo:
## Phylogenetic tree with 15 tips and 14 internal nodes.
##
## Tip labels:
##  A_1995, B_1996, C_1995, D_1987, E_1996, F_1997, ...
##
## Rooted; includes branch lengths.
##
## with the following features available:
##  'height',   'height_0.95_HPD',  'height_median',
##  'height_range', 'length',   'length_0.95_HPD',
##  'length_median',    'length_range', 'posterior',    'rate',
##  'rate_0.95_HPD',    'rate_median',  'rate_range'.
file <- system.file("extdata/MrBayes", "Gq_nxs.tre", package="treeio")
read.mrbayes(file)
## 'treedata' S4 object that stored information of
##  '/home/ygc/R/library/treeio/extdata/MrBayes/Gq_nxs.tre'.
##
## ...@ phylo:
## Phylogenetic tree with 12 tips and 10 internal nodes.
##
## Tip labels:
##  B_h, B_s, G_d, G_k, G_q, G_s, ...
##
## Unrooted; includes branch lengths.
##
## with the following features available:
##  'length_0.95HPD',   'length_mean',  'length_median',    'prob',
##  'prob_range',   'prob_stddev',  'prob_percent', 'prob+-sd'.
brstfile <- system.file("extdata/PAML_Baseml", "rst", package="treeio")
brst <- read.paml_rst(brstfile)
brst
## 'treedata' S4 object that stored information of
##  '/home/ygc/R/library/treeio/extdata/PAML_Baseml/rst'.
##
## ...@ phylo:
## Phylogenetic tree with 15 tips and 13 internal nodes.
##
## Tip labels:
##  A, B, C, D, E, F, ...
## Node labels:
##  16, 17, 18, 19, 20, 21, ...
##
## Unrooted; includes branch lengths.
##
## with the following features available:
##  'subs', 'AA_subs'.
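
Once a file has been parsed, the features listed in the printed summary can be pulled out for further work. The following is a small sketch, assuming the treeio and tidytree packages are installed; `beast_tbl` and `phy` are illustrative names.

```r
library(treeio)
library(tidytree)

file <- system.file("extdata/BEAST", "beast_mcc.tree", package = "treeio")
beast <- read.beast(file)

# Names of the annotations stored alongside the tree
get.fields(beast)

# The bare tree structure, without annotations
phy <- as.phylo(beast)

# One row per node, with the annotations as columns
beast_tbl <- as_tibble(beast)
head(beast_tbl[, c("node", "label", "posterior", "rate")])
```

Keeping the tree (`phy`) and the per-node table (`beast_tbl`) separate makes it easy to hand either one to other packages.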

03. Data integration and filtering

3.1 Converting tree data to a data frame with tidytree

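All data parsed from or merged into a tree can be converted to a tidy data frame with the tidytree package, which provides tools for manipulating trees together with their associated data. For example, external data can be linked to a phylogeny, and evolutionary data obtained from different sources can be merged using tidyverse verbs. After manipulation, the tree data can be converted back to a treedata object and exported to a single tree file for further analysis in R or visualization with ggtree.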

library(ape)
library(tidytree)
library(dplyr)
set.seed(2017)
tree <- rtree(4)
tree
##
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## [1] "t4" "t1" "t3" "t2"
##
## Rooted; includes branch lengths.
x <- as_tibble(tree)
x
## # A tibble: 7 x 4
##   parent  node branch.length label
##    <int> <int>         <dbl> <chr>
## 1      5     1       0.435   t4  
## 2      7     2       0.674   t1  
## 3      7     3       0.00202 t3  
## 4      6     4       0.0251  t2  
## 5      5     5      NA       <NA>
## 6      5     6       0.472   <NA>
## 7      6     7       0.274   <NA>
as.phylo(x)
##
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## [1] "t4" "t1" "t3" "t2"
##
## Rooted; includes branch lengths.

Linking a tree to species trait data with dplyr

d <- tibble(label = paste0('t', 1:4),
            trait = rnorm(4))

y <- full_join(x, d, by = 'label')  # join on the species (tip) label
y
## # A tibble: 7 x 5
##   parent  node branch.length label  trait
##    <int> <int>         <dbl> <chr>  <dbl>
## 1      5     1       0.435   t4     0.943
## 2      7     2       0.674   t1    -0.171
## 3      7     3       0.00202 t3     0.570
## 4      6     4       0.0251  t2    -0.283
## 5      5     5      NA       <NA>  NA    
## 6      5     6       0.472   <NA>  NA    
## 7      6     7       0.274   <NA>  NA

The treedata object

as.treedata(y)
## 'treedata' S4 object'.
##
## ...@ phylo:
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## [1] "t4" "t1" "t3" "t2"
##
## Rooted; includes branch lengths.
##
## with the following features available:
##  'trait'.

y %>% as.treedata %>% as_tibble  # round trip: tibble -> treedata -> tibble
## # A tibble: 7 x 5
##   parent  node branch.length label  trait
##    <int> <int>         <dbl> <chr>  <dbl>
## 1      5     1       0.435   t4     0.943
## 2      7     2       0.674   t1    -0.171
## 3      7     3       0.00202 t3     0.570
## 4      6     4       0.0251  t2    -0.283
## 5      5     5      NA       <NA>  NA    
## 6      5     6       0.472   <NA>  NA    
## 7      6     7       0.274   <NA>  NA
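
The full cycle (tree to tibble, join external data, back to treedata, export to one file) might look like the sketch below. It uses a small hand-written tree and made-up trait values so that it does not depend on random numbers, and it assumes `write.beast()` from treeio as the export function; the file name is a temporary placeholder.

```r
library(ape)
library(dplyr)
library(tidytree)
library(treeio)

tree <- read.tree(text = "((A:1,B:2):3,C:2);")

# Hypothetical trait values, one per tip
d <- tibble(label = c("A", "B", "C"), trait = c(0.5, 1.5, 2.5))

# tree -> tibble -> join the trait -> treedata
td <- as_tibble(tree) %>%
  full_join(d, by = "label") %>%
  as.treedata()

# Export tree and trait together in BEAST NEXUS format, then read the file back
out <- tempfile(fileext = ".tree")
write.beast(td, file = out)
td2 <- read.beast(out)
td2
```

Because the trait travels inside the tree file, the exported file can be reloaded later without the original data table.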

Accessing related nodes

child(y, 5)
## # A tibble: 2 x 5
##   parent  node branch.length label  trait
##    <int> <int>         <dbl> <chr>  <dbl>
## 1      5     1         0.435 t4     0.943
## 2      5     6         0.472 <NA>  NA
parent(y, 2)

## # A tibble: 1 x 5
##   parent  node branch.length label trait
##    <int> <int>         <dbl> <chr> <dbl>
## 1      6     7         0.274 <NA>     NA
offspring(y, 5)
## # A tibble: 6 x 5
##   parent  node branch.length label  trait
##    <int> <int>         <dbl> <chr>  <dbl>
## 1      5     1       0.435   t4     0.943
## 2      7     2       0.674   t1    -0.171
## 3      7     3       0.00202 t3     0.570
## 4      6     4       0.0251  t2    -0.283
## 5      5     6       0.472   <NA>  NA    
## 6      6     7       0.274   <NA>  NA
ancestor(y, 2)
## # A tibble: 3 x 5
##   parent  node branch.length label trait
##    <int> <int>         <dbl> <chr> <dbl>
## 1      5     5        NA     <NA>     NA
## 2      5     6         0.472 <NA>     NA
## 3      6     7         0.274 <NA>     NA
sibling(y, 2)
## # A tibble: 1 x 5
##   parent  node branch.length label trait
##    <int> <int>         <dbl> <chr> <dbl>
## 1      7     3       0.00202 t3    0.570
MRCA(y, 2, 3)
## # A tibble: 1 x 5
##   parent  node branch.length label trait
##    <int> <int>         <dbl> <chr> <dbl>
## 1      6     7         0.274 <NA>     NA
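
The random tree above depends on the seed and the R version, so the same accessors are sketched here on a fixed three-tip tree where the node numbers can be worked out by hand (tips A, B, C are nodes 1 to 3, the root is node 4, and the parent of A and B is node 5). The tree and variable names are illustrative.

```r
library(ape)
library(tidytree)

tree <- read.tree(text = "((A:1,B:2):3,C:2);")
x <- as_tibble(tree)

parent(x, 1)$node     # 5: the node joining A and B
child(x, 4)$node      # children of the root: node 5 and tip C (node 3)
offspring(x, 4)$node  # every node below the root
ancestor(x, 1)$node   # nodes on the path from A up to the root
sibling(x, 1)$node    # 2: tip B
MRCA(x, 1, 3)$node    # 4: A and C only meet at the root
```

Each accessor returns rows of the tibble, so any joined columns (such as a trait) come along with the selected nodes.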

3.2 Data integration

3.2.1 Merging CodeML and BEAST results

beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
rst_file <- system.file("examples/rst", package="ggtree")
mlc_file <- system.file("examples/mlc", package="ggtree")
beast_tree <- read.beast(beast_file)
codeml_tree <- read.codeml(rst_file, mlc_file)

merged_tree <- merge_tree(beast_tree, codeml_tree)
merged_tree

## 'treedata' S4 object that stored information of
##  '/home/ygc/R/library/ggtree/examples/MCC_FluA_H3.tree',
##  '/home/ygc/R/library/ggtree/examples/rst',
##  '/home/ygc/R/library/ggtree/examples/mlc'.
##
## ...@ phylo:
## Phylogenetic tree with 76 tips and 75 internal nodes.
##
## Tip labels:
##  A/Hokkaido/30-1-a/2013, A/New_York/334/2004, A/New_York/463/2005, A/New_York/452/1999, A/New_York/238/2005, A/New_York/523/1998, ...
##
## Rooted; includes branch lengths.
##
## with the following features available:
##  'height',   'height_0.95_HPD',  'height_median',
##  'height_range', 'length',   'length_0.95_HPD',
##  'length_median',    'length_range', 'posterior',    'rate',
##  'rate_0.95_HPD',    'rate_median',  'rate_range',   'subs',
##  'AA_subs',  't',    'N',    'S',    'dN_vs_dS', 'dN',   'dS',   'N_x_dN',
##  'S_x_dS'.

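Use the tidytree package to convert the tree object to a tidy data frame, then visualize it as hexbin scatter plots of dN/dS, dN, and dS inferred by CodeML against the rate (in substitutions per site per year) inferred by BEAST on the same branches.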

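library(dplyr)
library(ggplot2)  # geom_hex() also requires the hexbin package
# keep the CodeML estimates and the BEAST rate, drop extreme dN/dS values,
# then reshape to long format (one row per branch and statistic)
df <- merged_tree %>%
  as_tibble() %>%
  select(dN_vs_dS, dN, dS, rate) %>%
  subset(dN_vs_dS >=0 & dN_vs_dS <= 1.5) %>%
  tidyr::gather(type, value, dN_vs_dS:dS)
df$type[df$type == 'dN_vs_dS'] <- 'dN/dS'
df$type <- factor(df$type, levels=c("dN/dS", "dN", "dS"))
ggplot(df, aes(rate, value)) + geom_hex() +
  facet_wrap(~type, scales='free_y')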

04. Visualizing phylogenetic trees

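Many software packages and web tools can display phylogenetic trees, for example TreeView, FigTree, TreeDyn, Dendroscope, EvolView, and iTOL. Some of them, such as FigTree, TreeDyn, and iTOL, let users annotate a tree with colored branches, highlighted clades, and tree features. However, their predefined annotation functions are usually limited to a few specific types of phylogenetic data. As phylogenetic trees are used more and more widely across disciplines, there is a growing need to incorporate phylogenetic covariates and other associated data from different sources into the tree for visualization and further analysis. For example, influenza viruses have a wide host range, diverse and dynamic genotypes, and characteristic transmission behavior, all of which are largely associated with the evolution of the virus and are interrelated. Therefore, in addition to standalone applications that each focus on one specific analysis or data type, researchers studying molecular evolution need a robust, programmable platform that supports a high level of integration and visualization of many different aspects of the data (raw data or results of other primary analyses) on phylogenetic trees, in order to identify their associations and patterns.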

4.1 Basic syntax

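ggplot(tree_object) + geom_tree() + theme_tree()  # the ggplot2 way
ggtree(tree_object)                               # shortcut for the line above
geom_treescale                 # add a legend for the branch scale (genetic distance, divergence time, etc.)
geom_range                     # show uncertainty of branch lengths (confidence interval, range, etc.)
geom_tiplab                    # add taxon (tip) labels
geom_tippoint, geom_nodepoint  # add points for tips and internal nodes
geom_hilight                   # highlight a clade
geom_cladelabel                # label a clade

library("treeio")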
library("ggtree")
nwk <- system.file("extdata", "sample.nwk", package="treeio")
tree <- read.tree(nwk)
ggplot(tree, aes(x, y)) + geom_tree() + theme_tree()

ggtree(tree, color="firebrick", size=2, linetype="dotted")

ggtree(tree, ladderize=FALSE)

ggtree(tree, branch.length="none")
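
Because `ggtree()` returns a ggplot object, the layers listed at the start of this section can be stacked with `+`. A minimal sketch on the same sample tree follows; the colors and sizes are arbitrary choices, not recommendations.

```r
library(treeio)
library(ggtree)

nwk <- system.file("extdata", "sample.nwk", package = "treeio")
tree <- read.tree(nwk)

p <- ggtree(tree) +
  geom_tiplab(size = 3) +                                       # tip labels
  geom_nodepoint(color = "steelblue", alpha = 0.5, size = 3) +  # internal nodes
  geom_tippoint(color = "firebrick", size = 2) +                # tips
  geom_treescale()                                              # branch scale legend
p
```

The plotting data are available as `p$data`, which is handy for checking node numbers before adding further annotation layers.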

4.2 Tree layouts

library(ggtree)
set.seed(2017-02-16)
tree <- rtree(50)
ggtree(tree)
ggtree(tree, layout="slanted")
ggtree(tree, layout="circular")
ggtree(tree, layout="fan", open.angle=120)
ggtree(tree, layout="equal_angle")
ggtree(tree, layout="daylight")
ggtree(tree, branch.length='none')
ggtree(tree, branch.length='none', layout='circular')
ggtree(tree, layout="daylight", branch.length = 'none')

ggtree(tree) + scale_x_reverse()
ggtree(tree) + coord_flip()
ggtree(tree) + layout_dendrogram()
print(ggtree(tree), newpage=TRUE, vp=grid::viewport(angle=-30, width=.9, height=.9))
ggtree(tree, layout='slanted') + coord_flip()
ggtree(tree, layout='slanted', branch.length='none') + layout_dendrogram()
ggtree(tree, layout='circular') + xlim(-10, NA)
ggtree(tree) + scale_x_reverse() + coord_polar(theta='y')
ggtree(tree) + scale_x_reverse(limits=c(10, 0)) + coord_polar(theta='y')

Time-scaled phylogenetic tree (with divergence times)

beast_file <- system.file("examples/MCC_FluA_H3.tree",
                          package="ggtree")
beast_tree <- read.beast(beast_file)
ggtree(beast_tree, mrsd="2013-01-01") + theme_tree2()
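
The `mrsd` argument gives the most recent sampling date, which anchors the tree on a calendar time axis. The sketch below inspects the plotting data to show the effect; it assumes, based on how ggtree rescales the x positions, that they are stored as decimal years when `mrsd` is set.

```r
library(treeio)
library(ggtree)

beast_file <- system.file("examples/MCC_FluA_H3.tree", package = "ggtree")
beast_tree <- read.beast(beast_file)

p <- ggtree(beast_tree, mrsd = "2013-01-01") + theme_tree2()

# With mrsd set, x positions are calendar years; the most recent tip sits at mrsd
x_range <- range(p$data$x, na.rm = TRUE)
x_range
```

The lower end of `x_range` is then the estimated date of the root, read directly off the same axis that `theme_tree2()` displays.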

4.3 Displaying tree components

4.3.1 Tree scale

ggtree(tree) + geom_treescale()

# geom_treescale() supports the following parameters:
#x and y for tree scale position
#width for the length of the tree scale
#fontsize for the size of the text
#linesize for the size of the line
#offset for relative position of the line and the text
#color for color of the tree scale
ggtree(tree) + geom_treescale(x=0, y=45, width=1, color='red')
ggtree(tree) + geom_treescale(fontsize=6, linesize=2, offset=1)
ggtree(tree) + theme_tree2()

4.3.2 Displaying internal nodes and tips

ggtree(tree) + geom_point(aes(shape=isTip, color=isTip), size=3)

p <- ggtree(tree) + geom_nodepoint(color="#b5e521", alpha=1/4, size=10)
p + geom_tippoint(color="#FDAC4F", shape=8, size=3)

4.3.3 Displaying labels

p + geom_tiplab(size=3, color="purple")
ggtree(tree, layout="circular") + geom_tiplab(aes(angle=angle), color='blue')

4.3.4 Displaying the root edge

## with root edge = 1
tree1 <- read.tree(text='((A:1,B:2):3,C:2):1;')
ggtree(tree1) + geom_tiplab() + geom_rootedge()

## without root edge
tree2 <- read.tree(text='((A:1,B:2):3,C:2);')
ggtree(tree2) + geom_tiplab() + geom_rootedge()

## setting root edge
tree2$root.edge <- 2
ggtree(tree2) + geom_tiplab() + geom_rootedge()

## specify length of root edge for just plotting
## this will ignore tree$root.edge
ggtree(tree2) + geom_tiplab() + geom_rootedge(rootedge = 3)

4.3.5 Coloring the tree

ggtree(beast_tree, aes(color=rate)) +
    scale_color_continuous(low='darkgreen', high='red') +
    theme(legend.position="right")

Visualizing ancestral state reconstruction

anole.tree<-read.tree("http://www.phytools.org/eqg2015/data/anole.tre")
svl <- read.csv("http://www.phytools.org/eqg2015/data/svl.csv",
    row.names=1)
svl <- as.matrix(svl)[,1]
fit <- phytools::fastAnc(anole.tree,svl,vars=TRUE,CI=TRUE)

td <- data.frame(node = nodeid(anole.tree, names(svl)),
               trait = svl)
nd <- data.frame(node = names(fit$ace), trait = fit$ace)

d <- rbind(td, nd)
d$node <- as.numeric(d$node)
tree <- full_join(anole.tree, d, by = 'node')

ggtree(tree, aes(color=trait), layout = 'circular',
        ladderize = FALSE, continuous = TRUE, size=2) +
    scale_color_gradientn(colours=c("red", 'orange', 'green', 'cyan', 'blue')) +
    geom_tiplab(hjust = -.1) + xlim(0, 1.2) + theme(legend.position = c(.05, .85))


ggtree(tree, aes(color=trait), continuous = TRUE, yscale = "trait") +
    scale_color_viridis_c() + theme_minimal()

4.3.6 Rescaling the tree

library("treeio")
beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
beast_tree <- read.beast(beast_file)
beast_tree
p1 <- ggtree(beast_tree, mrsd='2013-01-01') + theme_tree2() +
    labs(caption="Divergence time")
p2 <- ggtree(beast_tree, branch.length='rate') + theme_tree2() +
    labs(caption="Substitution rate")
mlcfile <- system.file("extdata/PAML_Codeml", "mlc", package="treeio")
mlc_tree <- read.codeml_mlc(mlcfile)
p3 <- ggtree(mlc_tree) + theme_tree2() +
    labs(caption="nucleotide substitutions per codon")
p4 <- ggtree(mlc_tree, branch.length='dN_vs_dS') + theme_tree2() +
    labs(caption="dN/dS tree")

beast_tree2 <- rescale_tree(beast_tree, branch.length='rate')
ggtree(beast_tree2) + theme_tree2()

4.3.7 Changing the theme

set.seed(2019)
x <- rtree(30)
ggtree(x, color="red") + theme_tree("steelblue")
ggtree(x, color="white") + theme_tree("black")

4.3.8 Displaying multiple trees together

trees <- lapply(c(10, 20, 40), rtree)
class(trees) <- "multiPhylo"
ggtree(trees) + facet_wrap(~.id, scales="free") + geom_tiplab()

btrees <- read.tree(system.file("extdata/RAxML", "RAxML_bootstrap.H3", package="treeio"))
ggdensitree(btrees, alpha=.3, colour='steelblue') +
    geom_tiplab(size=3) + xlim(0, 45)

05. Annotating phylogenetic trees

5.1 Tree annotation

library(treeio)  # provides read.nhx()
library(ggtree)
treetext = "(((ADH2:0.1[&&NHX:S=human], ADH1:0.11[&&NHX:S=human]):
0.05 [&&NHX:S=primates:D=Y:B=100],ADHY:
0.1[&&NHX:S=nematode],ADHX:0.12 [&&NHX:S=insect]):
0.1[&&NHX:S=metazoa:D=N],(ADH4:0.09[&&NHX:S=yeast],
ADH3:0.13[&&NHX:S=yeast], ADH2:0.12[&&NHX:S=yeast],
ADH1:0.11[&&NHX:S=yeast]):0.1[&&NHX:S=Fungi])[&&NHX:D=N];"
tree <- read.nhx(textConnection(treetext))
ggtree(tree) + geom_tiplab() +
  geom_label(aes(x=branch, label=S), fill='lightgreen') +
  geom_label(aes(label=D), fill='steelblue') +
  geom_text(aes(label=B), hjust=-.5)

geom_balance     highlights the two direct descendant clades of an internal node
geom_cladelabel  annotate a clade with bar and text label
geom_facet       plot associated data in specific panel (facet) and align the plot with the tree
geom_hilight     highlight a clade with rectangle
geom_inset       add insets (subplots) to tree nodes
geom_label2      modified version of geom_label, with subsetting supported
geom_nodepoint   annotate internal nodes with symbolic points
geom_point2      modified version of geom_point, with subsetting supported
geom_range       bar layer to present uncertainty of evolutionary inference
geom_rootpoint   annotate root node with symbolic point
geom_rootedge    add root edge to a tree
geom_segment2    modified version of geom_segment, with subsetting supported
geom_strip       annotate associated taxa with bar and (optional) text label
geom_taxalink    associate two related taxa by linking them with a curve
geom_text2       modified version of geom_text, with subsetting supported
geom_tiplab      layer of tip labels
geom_tippoint    annotate external nodes with symbolic points
geom_tree        tree structure layer, with multiple layout supported
geom_treescale   tree branch scale legend

# show the node number of every internal node
ggtree(tree) + geom_text2(aes(subset=!isTip, label=node), hjust=-.3) + geom_tiplab()
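
Reading node numbers off a plot works, but a clade's node number can also be looked up from two of its tips. The sketch below uses `getMRCA()` from ape on the sample tree shipped with treeio; the tips "I" and "L" and the label text are arbitrary choices for illustration.

```r
library(ape)
library(treeio)
library(ggtree)

nwk <- system.file("extdata", "sample.nwk", package = "treeio")
tree <- read.tree(nwk)

# Node number of the most recent common ancestor of tips I and L
node <- getMRCA(tree, c("I", "L"))
node

# Use the looked-up number instead of a hard-coded one
p <- ggtree(tree) + geom_tiplab() +
  geom_hilight(node = node, fill = "steelblue", alpha = .4) +
  geom_cladelabel(node = node, label = "clade of I and L", offset = .8)
p
```

Looking the node up this way keeps the annotation correct even if the tree is re-read or re-rooted and the internal numbering changes.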

5.2 Annotation layers
5.2.1 Clade labels

set.seed(2015-12-21)
tree <- rtree(30)
p <- ggtree(tree) + xlim(NA, 6)

p + geom_cladelabel(node=45, label="test label") +
    geom_cladelabel(node=34, label="another clade")
p + geom_cladelabel(node=45, label="test label", align=TRUE,  offset = .2, color='red') +
    geom_cladelabel(node=34, label="another clade", align=TRUE, offset = .2, color='blue')
p + geom_cladelabel(node=45, label="test label", align=T, angle=270, hjust='center', offset.text=.5, barsize=1.5) +
    geom_cladelabel(node=34, label="another clade", align=T, angle=45, fontsize=8)
p + geom_cladelabel(node=34, label="another clade", align=T, geom='label', fill='lightblue')

ggtree(tree, layout="daylight") +
  geom_cladelabel(node=35, label="test label", angle=0,
                  fontsize=8, offset=.5, vjust=.5)  +
  geom_cladelabel(node=55, label='another clade',
                  angle=-95, hjust=.5, fontsize=8)
p + geom_tiplab() +
  geom_strip('t10', 't30', barsize=2, color='red',
            label="associated taxa", offset.text=.1) +
  geom_strip('t1', 't18', barsize=2, color='blue',
            label = "another label", offset.text=.1)

5.2.2 Highlighting clades

nwk <- system.file("extdata", "sample.nwk", package="treeio")
tree <- read.tree(nwk)
ggtree(tree) + geom_hilight(node=21, fill="steelblue", alpha=.6) +
    geom_hilight(node=17, fill="darkgreen", alpha=.6)
ggtree(tree, layout="circular") + geom_hilight(node=21, fill="steelblue", alpha=.6) +
    geom_hilight(node=23, fill="darkgreen", alpha=.6)
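# pg is not defined above. Nodes 55 and 35 belong to the 30-tip random tree of
# section 5.2.1, so its daylight-layout plot is rebuilt here (an assumption
# about the object the original line referred to).
set.seed(2015-12-21)
pg <- ggtree(rtree(30), layout="daylight")
pg + geom_hilight(node=55) + geom_hilight(node=35, fill='darkgreen')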

ggtree(tree) +
  geom_balance(node=16, fill='steelblue', color='white', alpha=0.6, extend=1) +
  geom_balance(node=19, fill='darkgreen', color='white', alpha=0.6, extend=1)

5.2.3 Linking related taxa

ggtree(tree) + geom_tiplab() + geom_taxalink('A', 'E') +
  geom_taxalink('F', 'K', color='red', linetype = 'dashed',
    arrow=grid::arrow(length=grid::unit(0.02, "npc")))

5.2.4 Uncertainty of divergence time estimates

file <- system.file("extdata/MEGA7", "mtCDNA_timetree.nex", package = "treeio")
x <- read.mega(file)
p1 <- ggtree(x) + geom_range('reltime_0.95_CI', color='red', size=3, alpha=.3)
p2 <- ggtree(x) + geom_range('reltime_0.95_CI', color='red', size=3, alpha=.3, center='reltime')  
p3 <- p2 + scale_x_range() + theme_tree2()

5.3 Annotating trees with output from evolutionary software

file <- system.file("extdata/BEAST", "beast_mcc.tree", package="treeio")
beast <- read.beast(file)
ggtree(beast, aes(color=rate))  +
    geom_range(range='length_0.95_HPD', color='red', alpha=.6, size=2) +
    geom_nodelab(aes(x=branch, label=round(posterior, 2)), vjust=-.5, size=3) +
    scale_color_continuous(low="darkgreen", high="red") +
    theme(legend.position=c(.1, .8))

nwk <- system.file("extdata/HYPHY", "labelledtree.tree",
                   package="treeio")
ancseq <- system.file("extdata/HYPHY", "ancseq.nex",
                      package="treeio")
tipfas <- system.file("extdata", "pa.fas", package="treeio")
hy <- read.hyphy(nwk, ancseq, tipfas)
ggtree(hy) +
  geom_text(aes(x=branch, label=AA_subs), size=2,
            vjust=-.3, color="firebrick")

rstfile <- system.file("extdata/PAML_Codeml", "rst",
                       package="treeio")
mlcfile <- system.file("extdata/PAML_Codeml", "mlc",
                       package="treeio")
ml <- read.codeml(rstfile, mlcfile)
ggtree(ml, aes(color=dN_vs_dS), branch.length='dN_vs_dS') +
  scale_color_continuous(name='dN/dS', limits=c(0, 1.5),
                         oob=scales::squish,
                         low='darkgreen', high='red') +
  geom_text(aes(x=branch, label=AA_subs),
            vjust=-.5, color='steelblue', size=2) +
  theme_tree2(legend.position=c(.9, .3))

06. Zooming and scaling the tree topology


library(ggtree)
nwk <- system.file("extdata", "sample.nwk", package="treeio")
tree <- read.tree(nwk)
p <- ggtree(tree) + geom_tiplab()
viewClade(p, MRCA(p, "I", "L"))


tree2 <- groupClade(tree, c(17, 21))
p <- ggtree(tree2, aes(color=group)) + theme(legend.position='none') +
  scale_color_manual(values=c("black", "firebrick", "steelblue"))
scaleClade(p, node=17, scale=.1)

p2 <- p %>% collapse(node=21) +
  geom_point2(aes(subset=(node==21)), shape=21, size=5, fill='green')
p2 <- collapse(p2, node=23) +
  geom_point2(aes(subset=(node==23)), shape=23, size=5, fill='red')
print(p2)
expand(p2, node=23) %>% expand(node=21)

p2 <- p + geom_tiplab()
node <- 21
collapse(p2, node, 'max') %>% expand(node)
collapse(p2, node, 'min') %>% expand(node)
collapse(p2, node, 'mixed') %>% expand(node)

collapse(p, 21, 'mixed', fill='steelblue', alpha=.4) %>%
  collapse(23, 'mixed', fill='firebrick', color='blue')

scaleClade(p, 23, .2) %>% collapse(23, 'min', fill="darkgreen")

data(iris)
rn <- paste0(iris[,5], "_", 1:150)
rownames(iris) <- rn
d_iris <- dist(iris[,-5], method="man")

tree_iris <- ape::bionj(d_iris)
grp <- list(setosa     = rn[1:50],
            versicolor = rn[51:100],
            virginica  = rn[101:150])

p_iris <- ggtree(tree_iris, layout = 'circular', branch.length='none')
groupOTU(p_iris, grp, 'Species') + aes(color=Species) +
  theme(legend.position="right")

tree_iris <- groupOTU(tree_iris, grp, "Species")
ggtree(tree_iris, aes(color=Species), layout = 'circular', branch.length = 'none') +
  theme(legend.position="right")
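
What `groupOTU()` does is easier to see on a tiny tree. In this sketch (the tree, the group names, and the attribute name "Clade" are all illustrative), the grouping is stored as an attribute of the tree with one entry per node, which ggtree then exposes as a variable for `aes()`.

```r
library(ape)
library(ggtree)

tree <- read.tree(text = "((A:1,B:2):3,(C:2,D:1):1);")
grp <- list(left = c("A", "B"), right = c("C", "D"))

# Tag tips and the branches that connect them within each group
tree <- groupOTU(tree, grp, "Clade")
attr(tree, "Clade")

p <- ggtree(tree, aes(color = Clade)) + geom_tiplab() +
  theme(legend.position = "right")
p
```

Nodes that belong to neither group, such as the root here, receive a separate default level, which is why color scales for grouped trees usually need one more value than there are groups.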

p1 <- p + geom_point2(aes(subset=node==16), color='darkgreen', size=5)
p2 <- rotate(p1, 17) %>% rotate(21)
flip(p2, 17, 21)


p3 <- open_tree(p, 180) + geom_tiplab()
print(p3)


07. Plotting trees with data

library(ggimage)
library(ggtree)
url <- paste0("https://raw.githubusercontent.com/TreeViz/",
            "metastyle/master/design/viz_targets_exercise/")

x <- read.tree(paste0(url, "tree_boots.nwk"))
info <- read.csv(paste0(url, "tip_data.csv"))

p <- ggtree(x) %<+% info + xlim(-.1, 4)
p2 <- p + geom_tiplab(offset = .6, hjust = .5) +
    geom_tippoint(aes(shape = trophic_habit, color = trophic_habit, size = mass_in_kg)) +
    theme(legend.position = "right") + scale_size_continuous(range = c(3, 10))

d2 <- read.csv(paste0(url, "inode_data.csv"))
p2 %<+% d2 + geom_label(aes(label = vernacularName.y, fill = posterior)) +
    scale_fill_gradientn(colors = RColorBrewer::brewer.pal(3, "YlGnBu"))
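
The example above downloads its data, so here is an offline sketch of the same `%<+%` idea. The tree and the `info` table are made up; the only requirement is that the first column of the data frame holds the tip labels, which is how `%<+%` matches rows to tips.

```r
library(ape)
library(ggtree)

tree <- read.tree(text = "((A:1,B:2):3,C:2);")

# Hypothetical tip data; the first column must match the tip labels
info <- data.frame(label   = c("A", "B", "C"),
                   habitat = c("forest", "forest", "desert"),
                   mass    = c(1.2, 3.4, 0.8))

# Attach the data to the plot, then map its columns in later layers
p <- ggtree(tree) %<+% info
p2 <- p + geom_tiplab(offset = .2) +
  geom_tippoint(aes(color = habitat, size = mass)) +
  theme(legend.position = "right")
p2
```

After `%<+%`, the attached columns sit in `p$data` next to the tree coordinates, so any later layer can refer to them by name.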


library(ggtree)
remote_folder <- paste0("https://raw.githubusercontent.com/katholt/",
                        "plotTree/master/tree_example_april2015/")

## read the phylogenetic tree
tree <- read.tree(paste0(remote_folder, "tree.nwk"))

## read the sampling information data set
info <- read.csv(paste0(remote_folder,"info.csv"))

## read and process the allele table
snps <- read.csv(paste0(remote_folder, "alleles.csv"), header = FALSE,
                 row.names = 1, stringsAsFactors = FALSE)
snps_strainCols <- snps[1,]
snps<-snps[-1,] # drop strain names
colnames(snps) <- snps_strainCols

gapChar <- "?"
snp <- t(snps)
lsnp <- apply(snp, 1, function(x) {
        x != snp[1,] & x != gapChar & snp[1,] != gapChar
    })
lsnp <- as.data.frame(lsnp)
lsnp$pos <- as.numeric(rownames(lsnp))
lsnp <- tidyr::gather(lsnp, name, value, -pos)
snp_data <- lsnp[lsnp$value, c("name", "pos")]

## read the trait data
bar_data <- read.csv(paste0(remote_folder, "bar.csv"))

## visualize the tree
p <- ggtree(tree)

## attach the sampling information data set
## and add symbols colored by location
p <- p %<+% info + geom_tippoint(aes(color=location))

## visualize SNP and Trait data using dot and bar charts,
## and align them based on tree structure
p + geom_facet(panel = "SNP", data = snp_data, geom = geom_point,
               mapping=aes(x = pos, color = location), shape = '|') +
    geom_facet(panel = "Trait", data = bar_data, geom = ggstance::geom_barh,
                aes(x = dummy_bar_value, color = location, fill = location),
                stat = "identity", width = .6) +
    theme_tree2(legend.position=c(.05, .85))


ggtree is an excellent tool that every phylogenetics researcher should learn. Special thanks to Prof. Guangchuang Yu for developing this outstanding R package.

References:

  1. LG Wang, TTY Lam, S Xu, Z Dai, L Zhou, T Feng, P Guo, CW Dunn, BR Jones, T Bradley, H Zhu, Y Guan, Y Jiang, G Yu*. treeio: an R package for phylogenetic tree input and output with richly annotated and associated data. Molecular Biology and Evolution. 2019, accepted. doi: 10.1093/molbev/msz240.
  2. G Yu, TTY Lam, H Zhu, Y Guan. Two methods for mapping and visualizing associated data on phylogeny using ggtree. Molecular Biology and Evolution. 2018, 35(12):3041-3043. doi: 10.1093/molbev/msy194.
  3. G Yu, DK Smith, H Zhu, Y Guan, TTY Lam*. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution. 2017, 8(1):28-36. doi: 10.1111/2041-210X.12628.