kube-dns cannot resolve 'kubernetes.default.svc.cluster.local'
After deploying a Kubernetes cluster with kargo, I found that the kubedns pod is not working properly:
$ kcsys get pods -o wide
NAME            READY     STATUS             RESTARTS   AGE       IP            NODE
dnsmasq-alv8k   1/1       Running            2          1d        10.233.86.2   kubemaster
dnsmasq-c9y52   1/1       Running            2          1d        10.233.82.2   kubeminion1
dnsmasq-sjouh   1/1       Running            2          1d        10.233.76.6   kubeminion2
kubedns-hxaj7   2/3       CrashLoopBackOff   339        22h       10.233.76.3   kubeminion2
PS: kcsys is an alias of kubectl --namespace=kube-system
The logs of each container (kubedns, dnsmasq) seem fine, except for the healthz container, which reports:
2017/03/01 07:24:32 Healthz probe error: Result of last exec: nslookup: can't resolve 'kubernetes.default.svc.cluster.local' error exit status 1
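The failing probe can be reproduced by hand. A sketch, assuming exec access to the healthz container; the two lookups mirror the -cmd string in the RC spec below (one against dnsmasq on port 53, one against kubedns directly on 10053):

$ kcsys exec kubedns-hxaj7 -c healthz -- nslookup kubernetes.default.svc.cluster.local 127.0.0.1
$ kcsys exec kubedns-hxaj7 -c healthz -- nslookup kubernetes.default.svc.cluster.local 127.0.0.1:10053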
Update
kubedns rc description:
apiVersion: v1
kind: ReplicationController
metadata:
  creationTimestamp: 2017-02-28T08:31:57Z
  generation: 1
  labels:
    k8s-app: kubedns
    kubernetes.io/cluster-service: "true"
    version: v19
  name: kubedns
  namespace: kube-system
  resourceVersion: "130982"
  selfLink: /api/v1/namespaces/kube-system/replicationcontrollers/kubedns
  uid: 5dc9f9f2-fd90-11e6-850d-005056a020b4
spec:
  replicas: 1
  selector:
    k8s-app: kubedns
    version: v19
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: kubedns
        kubernetes.io/cluster-service: "true"
        version: v19
    spec:
      containers:
      - args:
        - --domain=cluster.local.
        - --dns-port=10053
        - --v=2
        image: gcr.io/google_containers/kubedns-amd64:1.9
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: kubedns
        ports:
        - containerPort: 10053
          name: dns-local
          protocol: UDP
        - containerPort: 10053
          name: dns-tcp-local
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readiness
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 100m
            memory: 170Mi
          requests:
            cpu: 70m
            memory: 70Mi
        terminationMessagePath: /dev/termination-log
      - args:
        - --log-facility=-
        - --cache-size=1000
        - --no-resolv
        - --server=127.0.0.1#10053
        image: gcr.io/google_containers/kube-dnsmasq-amd64:1.3
        imagePullPolicy: IfNotPresent
        name: dnsmasq
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 170Mi
          requests:
            cpu: 70m
            memory: 70Mi
        terminationMessagePath: /dev/termination-log
      - args:
        - -cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null && nslookup kubernetes.default.svc.cluster.local 127.0.0.1:10053 >/dev/null
        - -port=8080
        - -quiet
        image: gcr.io/google_containers/exechealthz-amd64:1.1
        imagePullPolicy: IfNotPresent
        name: healthz
        ports:
        - containerPort: 8080
          protocol: TCP
        resources:
          limits:
            cpu: 10m
            memory: 50Mi
          requests:
            cpu: 10m
            memory: 50Mi
        terminationMessagePath: /dev/termination-log
      dnsPolicy: Default
      restartPolicy: Always
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  fullyLabeledReplicas: 1
  observedGeneration: 1
  replicas: 1
kubedns service description:
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: 2017-02-28T08:31:58Z
  labels:
    k8s-app: kubedns
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: kubedns
  name: kubedns
  namespace: kube-system
  resourceVersion: "10736"
  selfLink: /api/v1/namespaces/kube-system/services/kubedns
  uid: 5ed4dd78-fd90-11e6-850d-005056a020b4
spec:
  clusterIP: 10.233.0.3
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 53
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 53
  selector:
    k8s-app: kubedns
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
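As a sanity check on this Service, it is worth confirming it actually has endpoints (a sketch using the same kcsys alias). Since the kubedns pod is only 2/3 ready, an empty endpoint list here would be consistent with the resolution failure:

$ kcsys get endpoints kubedns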
I found some errors in the kubedns container:
1 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: Get https://10.233.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.233.0.1:443: i/o timeout
1 reflector.go:199] pkg/dns/dns.go:148: Failed to list *api.Service: Get https://10.233.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.233.0.1:443: i/o timeout
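These timeouts say the pod cannot reach the API server through the service VIP at all. A quick check from the node running the pod (kubeminion2); any HTTP response, even a 401/403, proves the VIP is routable, while a hang followed by a timeout reproduces the error above:

$ curl -k -m 5 https://10.233.0.1:443/version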
Update 2
Flags of the controller-manager pod:

Pod status:

Could you take a look?
Kargo is setting --iptables=false on the Docker daemon. Other kubernetes services bind their network to the host, so they do not run into this problem.
I am not sure whether this is the root issue, but try removing --iptables=false from the Docker daemon settings. You can remove the daemon's iptables option from /etc/systemd/system/docker.service.d/docker-options.conf, which should look like this:
[Service]
Environment="DOCKER_OPTS=--insecure-registry=10.233.0.0/18 --graph=/var/lib/docker --iptables=false"
Once updated, reload systemd and restart Docker.
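Presumably along these lines (the standard steps after editing a systemd drop-in):

$ sudo systemctl daemon-reload
$ sudo systemctl restart docker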
This will let you test whether it resolves your issue. Once you have confirmed the fix, you can override the Docker options in your kargo deployment so that --iptables=false is no longer set.
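A sketch of such an override, assuming kargo's docker_options variable (the variable name and the file it lives in may differ between kargo versions):

# e.g. in your inventory's group_vars/all.yml
docker_options: "--insecure-registry=10.233.0.0/18 --graph=/var/lib/docker"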
Based on the error you posted:
dial tcp 10.233.0.1:443: i/o timeout
this could mean one of three things (a combined diagnostic sketch follows the list):
1. Your container networking fabric is not configured correctly.
   - Look for errors in the logs of the networking solution you are using.
   - Make sure every Docker daemon is using its own IP range.
   - Verify that the container network does not overlap with the host network.
2. Your kube-proxy is not working correctly: network traffic is not forwarded to the API server when using the kubernetes service VIP (10.233.0.1).
   - Check the kube-proxy logs on your nodes (kubeminion{1,2}) and update your question with any errors you find.
3. If you also see authentication errors:
   - Check that kube-controller-manager's --service-account-private-key-file and --root-ca-file flags are set to a valid key/certificate, and restart the service.
   - Delete the default-token-xxxx secret in the kube-system namespace and recreate the kube-dns deployment.
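A minimal diagnostic sketch for the three cases above. Everything here is an assumption based on a typical kargo deployment; container names, label selectors, and the exact secret name may differ on your cluster:

# 1) Network fabric: compare the pod network with the host routes on every
#    node to spot overlapping ranges.
$ ip -4 addr show docker0
$ ip route

# 2) kube-proxy: kargo usually runs it as a container, so its logs can be
#    reached through docker (the name filter is a guess).
$ docker logs $(docker ps -q --filter name=kube-proxy) 2>&1 | tail -n 50

# 3) Authentication errors: inspect the controller manager's flags, then
#    recreate the service-account token and let the RC recreate the pod.
$ ps aux | grep kube-controller-manager
$ kcsys get secrets | grep default-token
$ kcsys delete secret default-token-xxxx    # substitute the real name
$ kcsys delete pod -l k8s-app=kubedns       # the kubedns RC recreates it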