Background

One of the nodes in Gemfield's K8s cluster (hostname: ai) went down when a circuit breaker tripped and cut power to the office; why the breaker tripped is still unknown, which is the first sorrow. After power was restored and ai booted up, it rejoined the k8s cluster successfully, but because the externalIP of all the k8s services had previously been set to ai's host IP, none of those services can be reached through ai's host IP anymore! That is the second, deeper sorrow.

Take the gerrit service as an example: it serves on two ports, 8080 and 29418, and was exposed using the externalIP approach mentioned above, i.e. it was reachable at <ai's host IP>:29418 and <ai's host IP>:8080. Now, connections to those ports are simply blocked.
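For context, exposing a service in this externalIP style might look roughly like the sketch below (a hedged sketch rather than the actual manifest; ai's host IP 192.168.1.188 is taken from the node list further down):

#by gemfield, a sketch of the externalIP style of exposure
kubectl patch svc gerrit -p '{"spec":{"externalIPs":["192.168.1.188"]}}'
kubectl get svc gerrit -o wide   # EXTERNAL-IP would then show 192.168.1.188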

In this article, gemfield explores the cause behind this and how to restore access to these services.

gemfield@master:~$ kubectl get node -o wide
NAME     STATUS   ROLES    AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
ai       Ready    <none>   76d   v1.11.1   192.168.1.188   <none>        Ubuntu 18.04.1 LTS   4.15.0-36-generic   docker://18.6.0
hp       Ready    <none>   71d   v1.11.1   192.168.1.172   <none>        Ubuntu 18.04.1 LTS   4.15.0-30-generic   docker://18.6.0
master   Ready    master   76d   v1.11.1   192.168.1.196   <none>        Ubuntu 18.04.1 LTS   4.15.0-29-generic   docker://18.6.0
ml       Ready    <none>   76d   v1.11.1   192.168.1.121   <none>        Ubuntu 18.04.1 LTS   4.15.0-29-generic   docker://18.6.0

The PodCIDR on ai

gemfield@master:~$ kubectl describe node ai | grep PodCIDR
PodCIDR: 172.16.2.0/24
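
Each node is assigned one such /24 out of the cluster's pod CIDR. To list the PodCIDR of every node in one go, something like this jsonpath one-liner works (not part of the original session):

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'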

K8s networking

Before debugging the problem above, we need a basic understanding of K8s networking.

The port numbers mentioned below (29418, 8080, 30080, 30587, and so on) are all interchangeable in the context of this article.

How does traffic destined for <ai's host IP>:29418 reach <gerrit container IP>:29418, and how does it get back? To answer that we have to peek into k8s networking; concretely, we need to be able to answer the following questions:

1. How do the K8s master and worker nodes communicate with each other?

2. How does a container communicate with the node (host) it runs on?

3. How do containers on the same node communicate with each other?

4. How do containers on different hosts communicate with each other?

5. What role does the K8s CNI play in all of this?

CNI is the Container Network Interface; its project repository is here:

https://github.com/containernetworking/cni


There is no need to describe it at length here. What is worth noting is that CNI is narrowly focused: it only deals with creating and tearing down network resources when a container is created (ADD) or destroyed (DEL). Both the calico project and K8s itself use CNI; the plugin used by k8s here is projectcalico/cni-plugin:

https://github.com/projectcalico/cni-plugin


To understand the K8s CNI you should know where its binaries live (/opt/...), where its configuration files live (/etc/cni/...), which operations it defines (ADD/DEL, ...), what their inputs and outputs are, when they are invoked, and what an invocation actually produces (iptables rules, route table entries, and so on). These rules are written into etcd or the kubernetes datastore, and calico re-applies them as confd dynamically picks up the updates.
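As a rough illustration, this is where you would typically look on a node. A hedged sketch: /opt/cni/bin and /etc/cni/net.d are the common defaults for a kubeadm + calico installation, and the file name 10-calico.conflist is an assumption rather than something confirmed above.

#on a k8s node, by gemfield (default locations assumed)
ls /opt/cni/bin                         # CNI plugin binaries (calico, calico-ipam, loopback, ...)
cat /etc/cni/net.d/10-calico.conflist   # the CNI network configuration that kubelet hands to the plugin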

Start debugging

Gemfield set out to solve the problem. Let's analyze it step by step, peeling back one layer at a time.

1. Is the gerrit service's pod healthy?

gemfield@master:~$ kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP             NODE
gerrit-7477c8576-rvnvx   1/1     Running   0          1d    172.16.1.133   ml
ldap-65d497b7c4-72mfn    1/1     Running   0          1d    172.16.1.131   ml

Yes, it is.

2. Is the gerrit k8s service healthy?

Yes, it is:

gemfield@master:~$ kubectl get svc -o wide

3. Is anything listening on ports 29418/8080 on the ai host?

Yes, there is:

gemfield@ai:~$ sudo netstat -antp | grep 8080
tcp6 0 0 :::8080 :::* LISTEN 523/kube-proxy

4. Switch gerrit from externalIP to a NodePort service for a cross-check

What is the point of this step? A NodePort service maps the service ports onto 30000+ ports on every host. Since ai is supposedly unreachable, accessing the same service's port on another node gives us a direct comparison:

gemfield@master:~$ kubectl get svc -o wide
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                          AGE   SELECTOR
gerrit       NodePort    10.96.166.167   <none>        29418:30587/TCP,8080:30080/TCP   1d    app=gerrit
kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP                          76d   <none>
ldap         ClusterIP   None            <none>        389/TCP                          1d    app=ldap

Now it is obvious: <ai's host IP>:30080 is still unreachable, while <ml's host IP>:30080 works fine; ml is another node in the k8s cluster, and the node that fails, ai, is exactly the one that lost power!
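For reference, the switch to NodePort for this cross-check could be done roughly as follows (a hedged sketch, not the exact command used; in a strategic merge patch, setting externalIPs to null drops the old value):

#by gemfield, a sketch only
kubectl patch svc gerrit -p '{"spec":{"type":"NodePort","externalIPs":null}}'
kubectl get svc gerrit -o wide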

To debug any further, we have to get familiar with K8s networking.

5. iptables

Look at the iptables rules related to port 30080 on ai:

gemfield@ai:~$ sudo iptables-save | grep 30080
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/gerrit:gerrit1" -m tcp --dport 30080 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/gerrit:gerrit1" -m tcp --dport 30080 -j KUBE-SVC-S2MNSV4IPTVV5KRU

These rules are created by K8s when the service is created, and the output on the ai node is identical to what ml shows, so the creation of the service itself is fine.
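To follow these rules all the way down to the pod, the service chain shown above can be traced further (a hedged sketch; the KUBE-SEP-* chain names are cluster-specific):

#on ai, a sketch by gemfield
sudo iptables-save -t nat | grep KUBE-SVC-S2MNSV4IPTVV5KRU   # the service chain jumps to one KUBE-SEP-* chain per endpoint
sudo iptables-save -t nat | grep KUBE-SEP- | grep DNAT        # each endpoint chain ends in a DNAT to <pod IP>:<port>, here the gerrit pod 172.16.1.133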

6. Routes on the nodes

First, on ai:

gemfield@ai:~$ ip r | grep bird
blackhole 172.16.2.0/24 proto bird

Then on ml:

gemfield@ML:~$ ip r | grep bird
172.16.0.0/24 via 192.168.1.196 dev tunl0 proto bird onlink
blackhole 172.16.1.0/24 proto bird
172.16.3.0/24 via 192.168.1.172 dev tunl0 proto bird onlink

Hmm, the routing information is wrong: ml has no ipip route towards ai, although it does have ipip routes to the other nodes, while ai has nothing at all! From this we can conclude that the calico CNIs on the nodes have stopped talking to each other; specifically, ai cannot reach any other node while the other nodes can still reach each other, which again points to ai's power outage.
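For comparison, on a healthy mesh ml would also be expected to carry a route for ai's PodCIDR, something like the line below (a hedged sketch pieced together from ai's PodCIDR and host IP shown earlier, not real output):

#expected but missing on ml
172.16.2.0/24 via 192.168.1.188 dev tunl0 proto bird onlink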

7. Using the calicoctl command

If you don't have this command, you can download it directly from the official site; it is a statically linked binary executable.
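Something along these lines should work (a hedged sketch; the release URL and version number are assumptions, so pick the release that matches your calico version):

#by gemfield, version and URL assumed
curl -L -o calicoctl https://github.com/projectcalico/calicoctl/releases/download/v3.2.3/calicoctl
chmod +x calicoctl
./calicoctl version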

First, look at the calico pods on all the nodes:

gemfield@master:~$ kubectl get pods --all-namespaces -o wide | grep calico
kube-system   calico-node-26655   2/2   Running   0   1d   192.168.1.188   ai
kube-system   calico-node-4zsxl   2/2   Running   0   1d   192.168.1.172   hp-probook
kube-system   calico-node-69gdw   2/2   Running   0   1d   192.168.1.196   master
kube-system   calico-node-xrpjw   2/2   Running   0   1d   192.168.1.121   ml

Check the calico status on the ml node:

gemfield@master:~$ kubectl -n kube-system exec -it calico-node-xrpjw -- /home/calicoctl node status
Defaulting container name to calico-node.
Use kubectl describe pod/calico-node-xrpjw -n kube-system to see all of the containers in this pod.
Calico process is running.

IPv4 BGP status
+---------------+-------------------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+---------------+-------------------+-------+------------+-------------+
| 172.19.0.1 | node-to-node mesh | start | 2018-10-19 | Connect |
| 192.168.1.172 | node-to-node mesh | up | 2018-10-19 | Established |
| 192.168.1.196 | node-to-node mesh | up | 2018-10-19 | Established |
+---------------+-------------------+-------+------------+-------------+
IPv6 BGP status
No IPv6 peers found.

Check the calico status on the hp node:

gemfield@master:~$ kubectl -n kube-system exec -it calico-node-4zsxl -- /home/calicoctl node status
Defaulting container name to calico-node.
Use kubectl describe pod/calico-node-4zsxl -n kube-system to see all of the containers in this pod.
Calico process is running.

IPv4 BGP status
+---------------+-------------------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+---------------+-------------------+-------+------------+-------------+
| 172.19.0.1 | node-to-node mesh | start | 2018-10-19 | Connect |
| 192.168.1.196 | node-to-node mesh | up | 2018-10-19 | Established |
| 192.168.1.121 | node-to-node mesh | up | 2018-10-19 | Established |
+---------------+-------------------+-------+------------+-------------+

IPv6 BGP status
No IPv6 peers found.

Check the calico status on the ai node:

gemfield@master:~$ kubectl -n kube-system exec -it calico-node-26655 -- /home/calicoctl node status
Defaulting container name to calico-node.
Use kubectl describe pod/calico-node-26655 -n kube-system to see all of the containers in this pod.
Calico process is running.

IPv4 BGP status
+---------------+-------------------+-------+------------+--------------------------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+---------------+-------------------+-------+------------+--------------------------------+
| 192.168.1.172 | node-to-node mesh | start | 2018-10-19 | Connect Socket: Connection |
| | | | | closed |
| 192.168.1.196 | node-to-node mesh | start | 2018-10-19 | Active Socket: Connection |
| | | | | reset by peer |
| 192.168.1.121 | node-to-node mesh | start | 2018-10-19 | Active Socket: Connection |
| | | | | closed |
+---------------+-------------------+-------+------------+--------------------------------+

IPv6 BGP status
No IPv6 peers found.

Sure enough, the output above confirms that the calico on ai cannot talk to the calico on the other nodes. Even more puzzling: why is the BGP address on the ai node 172.19.0.1 rather than 192.168.1.*?

Next, check whether calico's BGP service port is listening (Calico BGP listens on port 179):

#gemfield on calico container
/var # netstat -antp | grep 179
tcp 0 0 0.0.0.0:179 0.0.0.0:* LISTEN 116/bird
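
To check whether that BGP port is actually reachable from the other nodes, a quick probe from the host side can help (a hedged sketch; it assumes nc is installed on the node):

#run on ml (or any other node), probing ai's BGP port
nc -zv 192.168.1.188 179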

8. bird

Calico's BGP is implemented by bird. We find bird's configuration file bird.cfg (under the /etc/calico/confd/config/ directory):

#inside the calico container on gemfield's ai node
/etc/calico/confd/config # cat bird.cfg
# Generated by confd
include "bird_aggr.cfg";
include "bird_ipam.cfg";

router id 172.19.0.1;
......

So bird's configuration contains router id 172.19.0.1. Where does that value come from? Inside the calico container, bird's configuration files are maintained by confd, which periodically regenerates bird.cfg; in a K8s cluster (and in the general case as well), confd reads the values used to render the configuration from etcd.

9. confd

confd's configuration file /etc/calico/confd/conf.d/bird.toml looks like this:

#inside the calico container, by gemfield
/etc/calico/confd/conf.d # cat /etc/calico/confd/conf.d/bird.toml
[template]
src = "bird.cfg.template"
dest = "/etc/calico/confd/config/bird.cfg"
prefix = "/calico/bgp/v1"
keys = [
"/host",
"/global",
]
check_cmd = "bird -p -c {{.src}}"
reload_cmd = "pkill -HUP bird || true"

Now look at the template file:

#inside gemfield's calico container
/etc/calico/confd/conf.d # cat /etc/calico/confd/templates/bird.cfg.template
# Generated by confd
include "bird_aggr.cfg";
include "bird_ipam.cfg";
{{$node_ip_key := printf "/host/%s/ip_addr_v4" (getenv "NODENAME")}}{{$node_ip := getv $node_ip_key}}

router id {{$node_ip}};

When node-to-node mesh is enabled (the default), the core BIRD configuration is defined by these templates. Above we can see the statement router id {{$node_ip}};, which means node_ip is the value of getv $node_ip_key; expanding that further gives getv "/host/ai/ip_addr_v4", and for some reason this value came back as 172.19.0.1.

10. etcd

Use the etcd client command (note that k8s now uses etcd v3, so you must supply the ca, key and cert, otherwise you get Error: context deadline exceeded):

#inside gemfield's etcd container
/ # ETCDCTL_API=3 etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key get --prefix /calico/bgp/v1

However, the calico deployed here alongside K8s 1.11 no longer uses etcd as its backend; it has switched to the kubernetes datastore, so the command above does not return the relevant information.
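With the kubernetes datastore, the per-node BGP address lives on the Kubernetes Node object instead, so it can be inspected with kubectl, or with calicoctl pointed at the kubernetes datastore (a hedged sketch; the projectcalico.org annotations are the usual location, not something verified above):

#on master, a sketch by gemfield
kubectl get node ai -o yaml | grep projectcalico.org
DATASTORE_TYPE=kubernetes KUBECONFIG=~/.kube/config ./calicoctl get node ai -o yaml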

https://github.com/projectcalico/calico


11. Who updated this key's value to 172.19.0.1?

Let's see what the logs have to say:

#the calico pod on ai
gemfield@master:~$ kubectl -n kube-system logs -f calico-node-26655 calico-node | grep 172.19
2018-10-19 06:41:19.968 [INFO][10] startup.go 564: Using autodetected IPv4 address on interface br-cddff8a3b81c: 172.19.0.1/16
2018-10-19 06:41:22.362 [INFO][119] int_dataplane.go 485: Linux interface addrs changed. addrs=set.mapSet{"172.19.0.1":set.empty{}, "fe80::42:e4ff:fe19:ef2b":set.empty{}} ifaceName="br-cddff8a3b81c"
2018-10-19 06:41:22.363 [INFO][119] int_dataplane.go 641: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"br-cddff8a3b81c", Addrs:set.mapSet{"172.19.0.1":set.empty{}, "fe80::42:e4ff:fe19:ef2b":set.empty{}}}
2018-10-19 06:41:22.363 [INFO][119] hostip_mgr.go 84: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"br-cddff8a3b81c", Addrs:set.mapSet{"172.19.0.1":set.empty{}, "fe80::42:e4ff:fe19:ef2b":set.empty{}}}
2018-10-19 06:41:22.377 [INFO][119] int_dataplane.go 611: Received *proto.HostMetadataUpdate update from calculation graph msg=hostname:"ai" ipv4_addr:"172.19.0.1"

One line stands out: Using autodetected IPv4 address on interface br-cddff8a3b81c: 172.19.0.1/16. So let's see why that IP got auto-selected on ai:

gemfield@ai:/bigdata/gemfield/github/Gemfield$ ip a | grep -C 3 172.19
valid_lft forever preferred_lft forever
46: br-cddff8a3b81c: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:e4:19:ef:2b brd ff:ff:ff:ff:ff:ff
inet 172.19.0.1/16 brd 172.19.255.255 scope global br-cddff8a3b81c
valid_lft forever preferred_lft forever
inet6 fe80::42:e4ff:fe19:ef2b/64 scope link
valid_lft forever preferred_lft forever

It is br-cddff8a3b81c, which on inspection turns out to be a bridge set up by an old docker-compose service. Delete this bridge, then delete the calico pod on ai:
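Removing the leftover bridge might look roughly like this (a hedged sketch; the exact docker network behind br-cddff8a3b81c is not recorded here, so both the docker and the plain ip link route are shown):

#on ai, a sketch by gemfield
docker network ls | grep cddff8a3b81c   # find the docker network whose id backs br-cddff8a3b81c
docker network rm cddff8a3b81c          # remove it through docker, or
sudo ip link delete br-cddff8a3b81c     # simply delete the bridge interface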

gemfield@master:~$ kubectl -n kube-system delete pod calico-node-26655

This makes the calico container on ai restart; look at the logs as it comes back up:

2018-10-21 07:33:54.298 [INFO][10] startup.go 251: Early log level set to info
2018-10-21 07:33:54.299 [INFO][10] startup.go 267: Using NODENAME environment for node name
2018-10-21 07:33:54.299 [INFO][10] startup.go 279: Determined node name: ai
2018-10-21 07:33:54.300 [INFO][10] startup.go 302: Checking datastore connection
2018-10-21 07:33:54.336 [INFO][10] startup.go 326: Datastore connection verified
2018-10-21 07:33:54.336 [INFO][10] startup.go 99: Datastore is ready
2018-10-21 07:33:54.350 [INFO][10] startup.go 564: Using autodetected IPv4 address on interface eno1: 192.168.1.188/24
2018-10-21 07:33:54.350 [INFO][10] startup.go 432: Node IPv4 changed, will check for conflicts
2018-10-21 07:33:54.354 [WARNING][10] startup.go 849: IPv4 address has changed. This could happen if there are multiple nodes with the same name. node="ai" original="172.19.0.1" updated="192.168.1.188"
2018-10-21 07:33:54.354 [INFO][10] startup.go 627: No AS number configured on node resource, using global value
2018-10-21 07:33:54.504 [INFO][10] startup.go 510: FELIX_IPV6SUPPORT is false through environment variable
2018-10-21 07:33:54.504 [INFO][10] k8s.go 264: EnsuringInitialized - noop
2018-10-21 07:33:54.530 [INFO][10] startup.go 176: Using node name: ai
2018-10-21 07:33:54.617 [INFO][30] allocate_ipip_addr.go 41: Kubernetes datastore driver handles IPIP allocation - no op
Calico node started successfully

We can see the IP has switched to the one gemfield wanted. Next, check calico node status:

/home # wget http://x99.gemfield.org:8080/static/calicoctl
Connecting to x99.gemfield.org:8080 (61.149.179.174:8080)
calicoctl 100% |********************************************************************************************************************************************| 29883k 0:00:00 ETA
/home # chmod +x calicoctl
/home # ./calicoctl node status
Calico process is running.

IPv4 BGP status
+---------------+-------------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+---------------+-------------------+-------+----------+-------------+
| 192.168.1.172 | node-to-node mesh | up | 07:33:58 | Established |
| 192.168.1.196 | node-to-node mesh | up | 07:33:59 | Established |
| 192.168.1.121 | node-to-node mesh | up | 07:33:58 | Established |
+---------------+-------------------+-------+----------+-------------+

IPv6 BGP status
No IPv6 peers found.

After that, the gerrit service's ports on ai are reachable again.
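To keep calico from ever latching onto a stray bridge address again, the IP autodetection method can be pinned on the calico-node daemonset. A hedged sketch (IP_AUTODETECTION_METHOD is a documented calico-node environment variable; the daemonset name calico-node and the interface name eno1 are taken from this cluster's pods and logs above):

#on master, a preventive sketch by gemfield
kubectl -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD=interface=eno1
#or, if NIC names differ between nodes:
kubectl -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD=can-reach=192.168.1.196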

Celebration

Hahahahaha.

