Fixing "calico/node is not ready: Error querying BIRD: BIRD is not ready: BGP not established" on Ubuntu

Today I set up a Kubernetes cluster with Calico as the CNI, but the calico-node pods never became ready:

ubuntu@perf-test-0:~/yiwei/performance_test$ kbctl get pod -A
NAMESPACE     NAME                                      READY   STATUS              RESTARTS   AGE
kube-system   calico-kube-controllers-676c4cbdf-zl6gr   1/1     Running             0          26s
kube-system   calico-node-bznhp                         0/1     Running             0          28s
kube-system   calico-node-stwcs                         0/1     Running             0          28s

Checking the pod's events shows that no BGP session has been established between the two nodes:

ubuntu@perf-test-0:~/yiwei/performance_test$ kbctl describe pod calico-node-bznhp -n kube-system
=================================
  Normal   Pulled     88s   kubelet            Container image "calico/node:v3.16.5" already present on machine
  Normal   Created    88s   kubelet            Created container calico-node
  Normal   Started    87s   kubelet            Started container calico-node
  Warning  Unhealthy  84s   kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
  Warning  Unhealthy  74s   kubelet            Readiness probe failed: 2020-11-15 08:24:34.851 [INFO][192] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.0.1

The last line of that output, combined with the cluster's node list, points to a likely cause: the calico pod on this node is trying to peer with 192.168.0.1, but no node has that IP!

ubuntu@perf-test-0:~/yiwei/performance_test$ kbctl get node -o wide
NAME          STATUS   ROLES    AGE     VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
perf-test-0   Ready    master   17m     v1.18.10   10.117.9.232   <none>        Ubuntu 18.04.5 LTS   4.15.0-122-generic   docker://19.3.13
perf-test-1   Ready    <none>   7m54s   v1.18.10   10.117.9.238   <none>        Ubuntu 18.04.5 LTS   4.15.0-122-generic   docker://19.3.13
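
The comparison described above (the address Calico peers with versus each node's INTERNAL-IP) can be sketched as a small shell helper. This is only a sketch under assumptions: `find_ip_mismatch` is a hypothetical name, and the commented `kubectl` line assumes that calico/node has recorded its detected address in the `projectcalico.org/IPv4Address` node annotation, which may not be the case in every setup. `kubectl` stands in for the `kbctl` used in the transcripts here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report nodes whose Calico-detected IPv4 address
# differs from the Kubernetes InternalIP.
# Input (stdin): one line per node, "NAME INTERNAL_IP CALICO_CIDR".
find_ip_mismatch() {
  while read -r name internal calico; do
    [ -n "$name" ] || continue          # skip empty lines
    calico_ip="${calico%%/*}"           # drop the /prefix length, if any
    if [ "$calico_ip" != "$internal" ]; then
      printf '%s: InternalIP=%s CalicoIP=%s\n' "$name" "$internal" "$calico_ip"
    fi
  done
}

# Possible usage against a live cluster (assumes the annotation exists):
# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.addresses[?(@.type=="InternalIP")].address}{" "}{.metadata.annotations.projectcalico\.org/IPv4Address}{"\n"}{end}' | find_ip_mismatch
```

A node that lacks the annotation is also reported, with an empty CalicoIP, so an empty value should be read as "unknown" rather than "wrong".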

Searching for the error turned up this GitHub Issue. Following the steps below resolved it:

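  1. Calico requires that nodes can reach each other over certain ports and protocols, as the figure below shows and as the official Calico documentation also notes. So on every node, open port 179 (BGP) in the Ubuntu firewall:
    sudo ufw allow 179/tcp

  [Figure: calico]

  2. The BGP sessions were being set up against the wrong IP, so open the Calico YAML manifest and add the IP_AUTODETECTION_METHOD environment variable yourself to pin the network interface Calico uses. Run ip address show to find the right interface name; its address should match the node's INTERNAL-IP in Kubernetes: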

      containers:
        # Runs calico-node container on each Kubernetes node. This
        # container programs network policy and routes on each
        # host.
        - name: calico-node
          image: calico/node:v3.16.5
          envFrom:
          - configMapRef:
              # Allow KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT to be overridden for eBPF mode.
              name: kubernetes-services-endpoint
              optional: true
          env:
            - name: IP_AUTODETECTION_METHOD
              value: interface=ens192
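
Finding the interface whose address matches a node's INTERNAL-IP, as step 2 describes, can be sketched like this. The helper name `iface_for_ip` is hypothetical, and the sketch assumes the one-line-per-address output of `ip -o -4 addr show`, where the second field is the interface name and the fourth is the address with its prefix length.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the interface that carries a given IPv4 address.
# $1: the IPv4 address to look for (e.g. the node's INTERNAL-IP)
# stdin: output of `ip -o -4 addr show`
iface_for_ip() {
  awk -v ip="$1" '{
    split($4, a, "/")            # $4 looks like 10.117.9.232/24
    if (a[1] == ip) print $2     # $2 is the interface name
  }'
}

# Possible usage on a node:
# ip -o -4 addr show | iface_for_ip 10.117.9.232
#
# An alternative to editing the manifest (not what this post did) is to set
# the variable on the running DaemonSet, which restarts the calico-node pods:
# kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=ens192
```

The result is the value to put after `interface=` in the manifest. If different nodes use different interface names, the same check has to be run on each node.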

Finally, I reset the cluster and re-applied the YAML above, and Calico deployed successfully.
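
Step 1's requirement that nodes can reach each other on TCP port 179 can be checked with a rough probe like the one below. It is a sketch, not an official Calico check: `tcp_reachable` is a hypothetical name, and it relies on bash's built-in `/dev/tcp` redirection and the coreutils `timeout` command. A failure does not by itself prove that the firewall is at fault, since the connection is also refused when BIRD is not listening on the peer.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 if a TCP connection to host:port succeeds
# within 3 seconds, non-zero otherwise. The port defaults to 179 (BGP).
tcp_reachable() {
  local host="$1" port="${2:-179}"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Possible usage from one node, using the peer's INTERNAL-IP from this post:
# if tcp_reachable 10.117.9.238; then
#   echo "TCP 179 reachable"
# else
#   echo "TCP 179 not reachable"
# fi
```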