Part 2

A deeper look at openshift-sdn

Looking past the surface, we can dig deeper into how openshift-sdn is implemented. openshift-sdn introduces four new resource objects.

New resource objects

The first is netnamespace:

oc get netnamespace
NAME NETID
default 0
np1 1647324
np2 11107045

netnamespace defines a VNID for each namespace. Under the ovs-subnet plugin the cluster is single-tenant, so netnamespace objects do not exist. Under ovs-multitenant, every namespace gets a unique VNID; the global default namespace has VNID 0 and can communicate with every other namespace. If we want two projects to reach each other under ovs-multitenant, we can run:

$oc adm pod-network join-projects --to=project1 project2 project3 --loglevel=8

After running this command, we will see that the joined projects now share the same VNID, which is what makes them mutually reachable. Under the ovs-networkpolicy plugin, the namespace VNID is likewise used when updating the corresponding OVS rules in table=80.
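As a quick sanity check (a sketch; project1, project2 and project3 are just the example names used in the command above), we can verify that the joined projects now share a NETID, and restore isolation later if needed:

$oc get netnamespaces project1 project2 project3

# to isolate the projects again later
$oc adm pod-network isolate-projects project2 project3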

The other resource objects are clusternetwork, which defines the cluster-wide pod network range, and hostsubnet, which defines each host's pod subnet; in addition, egressnetworkpolicy is introduced.

$oc get clusternetwork
NAME      NETWORK         HOST SUBNET LENGTH   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14   9                    172.30.0.0/16     redhat/openshift-ovs-networkpolicy

$oc get hostsubnet
NAME                    HOST                    HOST IP          SUBNET
a-master1.example.com   a-master1.example.com   172.18.143.111   10.128.0.0/23
a-master2.example.com   a-master2.example.com   172.18.143.112   10.130.0.0/23
a-master3.example.com   a-master3.example.com   172.18.143.113   10.129.0.0/23
a-node1.example.com     a-node1.example.com     172.18.143.114   10.128.2.0/23
a-node2.example.com     a-node2.example.com     172.18.143.115   10.131.0.0/23
a-node3.example.com     a-node3.example.com     172.18.143.116   10.129.2.0/23
a-node4.example.com     a-node4.example.com     172.18.143.117   10.130.2.0/23
a-node5.example.com     a-node5.example.com     172.18.143.174   10.131.2.0/23

$oc get egressnetworkpolicy
No resources found.
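No EgressNetworkPolicy exists in this cluster yet. Purely as an illustration, a minimal sketch of one might look like the following (the apiVersion and the policy content are assumptions and vary by OpenShift release); it would deny all external egress from project np1:

cat <<EOF | oc create -n np1 -f -
apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: default-deny-egress
spec:
  egress:
  # deny all traffic to destinations outside the cluster (assumed example policy)
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0
EOF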

The OVS br0 bridge and its ports

First, let's look at the bridge and ports that were created on node5:

[root@a-node5 ~]# ovs-ofctl -O OpenFlow13 show br0
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000fa8909121b41
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
1(vxlan0): addr:ce:08:d2:68:09:19
config: 0
state: 0
speed: 0 Mbps now, 0 Mbps max
2(tun0): addr:f2:d1:08:10:9e:b4
config: 0
state: 0
speed: 0 Mbps now, 0 Mbps max
112(veth5e852cf1): addr:52:2e:59:2b:49:31
config: 0
state: 0
current: 10GB-FD COPPER
speed: 10000 Mbps now, 0 Mbps max
LOCAL(br0): addr:fa:89:09:12:1b:41
config: PORT_DOWN
state: LINK_DOWN
speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=nx-match miss_send_len=0
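Besides ovs-ofctl show, ovs-vsctl can list each interface together with its OpenFlow port number and type; this is an easy way to confirm that vxlan0 is a vxlan tunnel port while tun0 is an internal port:

[root@a-node5 ~]# ovs-vsctl --columns=name,ofport,type list Interface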

OVS flow table rules

From the openshift-sdn source code we can see how the OVS rules are created and maintained: openshift-sdn sets up the following flow tables and watches for resource changes in order to update the rules in the corresponding tables:

// Table 0: initial dispatch based on in_port
// Table 10: VXLAN ingress filtering; filled in by AddHostSubnetRules()
// Table 20: from OpenShift container; validate IP/MAC, assign tenant-id; filled in by setupPodFlows
// Table 21: from OpenShift container; NetworkPolicy plugin uses this for connection tracking
// Table 25: IP from OpenShift container via Service IP; reload tenant-id; filled in by setupPodFlows
// Table 30: general routing
// Table 40: ARP to local container, filled in by setupPodFlows
// Table 50: ARP to remote container; filled in by AddHostSubnetRules()
// Table 60: IP to service from pod
// Table 70: IP to local container: vnid/port mappings; filled in by setupPodFlows
// Table 80: IP policy enforcement; mostly managed by the osdnPolicy
// Table 90: IP to remote container; filled in by AddHostSubnetRules()
// Table 100: egress routing; edited by UpdateNamespaceEgressRules()
// Table 101: egress network policy dispatch; edited by UpdateEgressNetworkPolicy()
// Table 110: outbound multicast filtering, updated by UpdateLocalMulticastFlows()
// Table 111: multicast delivery from local pods to the VXLAN; only one rule, updated by UpdateVXLANMulticastRules()
// Table 120: multicast delivery to local pods (either from VXLAN or local pods); updated by UpdateLocalMulticastFlows()
// Table 253: rule version note

Inspecting the OVS flow tables

We can use ovs-ofctl dump-flows to inspect the installed flows:

[root@a-node5 ~]# ovs-ofctl dump-flows br0 -O OpenFlow13
[root@a-node5 ~]# ovs-ofctl dump-flows br0 -O OpenFlow13 | cut -d , -f1,2,4,5 --complement | sort -u -V
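Because each table has a well-defined role (see the list above), it is often more useful to dump a single table; for example table 80, where NetworkPolicy enforcement happens:

[root@a-node5 ~]# ovs-ofctl -O OpenFlow13 dump-flows br0 "table=80"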

Tracing the data path

We can use ovs-appctl ofproto/trace to generate a packet and trace its path through the flows.

Before using it, we need to determine the OVS in_port value. The busyboxplus pod in np1 is currently running on node5. The CNI plugin creates a network namespace and a veth pair for the pod: one end becomes the pod's eth0, and the other end is attached to a port on the OVS bridge, which is the in_port of our data flow:

# First, exec into the container and run ip a; the number after eth0@if is 325
$oc rsh busyboxplus-deployment-786549c679-gdszc
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if325: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 0a:58:0a:83:03:36 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.131.3.54/23 brd 10.131.3.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::2034:beff:fed3:560d/64 scope link
valid_lft forever preferred_lft forever

If the container image has no ip command, we can use nsenter to run commands inside the pod's network namespace instead:

[root@a-node5 ~]# CONTAINER_ID=$(oc get pods busyboxplus-deployment-786549c679-gdszc -o jsonpath={.status.containerStatuses[0].containerID} | cut -c 10-21)
[root@a-node5 ~]# PID=$(sudo docker inspect -f '{{.State.Pid}}' $CONTAINER_ID)
[root@a-node5 ~]# nsenter -t ${PID} -n ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if325: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 0a:58:0a:83:03:36 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.131.3.54/23 brd 10.131.3.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::2034:beff:fed3:560d/64 scope link
valid_lft forever preferred_lft forever

Then map the ifindex back to a host veth via /sys/class/net/ to get the corresponding in_port:

[root@a-node5 ~]# ovs-ofctl -O OpenFlow13 show br0 | grep veth
200(veth07cd191b): addr:26:c0:0f:a6:8f:19
201(vethecf60c1a): addr:8a:a2:86:2e:73:35
[root@a-node5 ~]# cat /sys/class/net/{veth07cd191b,vethecf60c1a}/ifindex
325
326
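The reverse lookup can also be done in one step with grep (a small convenience sketch; 325 is the peer ifindex read from eth0@if325 inside the pod), which, given the ifindex values above, points back at veth07cd191b:

[root@a-node5 ~]# grep -l '^325$' /sys/class/net/veth*/ifindex
/sys/class/net/veth07cd191b/ifindex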

So the in_port through which np1's busyboxplus traffic enters the OVS br0 bridge is 200. Next, let's note the pod IP of busyboxplus in np1 and of nginx in np2, so that we can call ovs-appctl ofproto/trace to simulate the data flow:

$oc project np1
$oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
busyboxplus-deployment-786549c679-gdszc 1/1 Running 0 2h 10.131.3.54 a-node5.example.com

$oc project np2
$oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
nginx-deployment-66898c95cd-z22j6 1/1 Running 0 1h 10.131.3.55 a-node5.example.com

busyboxplus pod IP in np1: 10.131.3.54

nginx pod IP in np2: 10.131.3.55

Use ovs-appctl ofproto/trace to simulate the busyboxplus pod calling the nginx pod:

[root@a-node5 ~]# ovs-appctl ofproto/trace br0 in_port=200,tcp,nw_src=10.131.3.54,nw_dst=10.131.3.55,ct_state=trk
Flow: ct_state=trk,tcp,in_port=200,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.131.3.54,nw_dst=10.131.3.55,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=0,tcp_flags=0

bridge("br0")
-------------
0. ip, priority 100
goto_table:20
20. ip,in_port=200,nw_src=10.131.3.54, priority 100
load:0x1922dc->NXM_NX_REG0[]
goto_table:21
21. ip,nw_dst=10.128.0.0/14, priority 200
ct(commit,table=30)
drop

Final flow: ct_state=trk,tcp,reg0=0x1922dc,in_port=200,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.131.3.54,nw_dst=10.131.3.55,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=0,tcp_flags=0
Megaflow: recirc_id=0,ct_state=+trk,ip,in_port=200,nw_src=10.131.3.54,nw_dst=10.128.0.0/14,nw_frag=no
Datapath actions: 8,ct(commit),recirc(0x1545e)

drop? It seems the trace did not produce the expected result. Indeed: under the ovs-networkpolicy plugin, the flows we create include connection tracking, which causes the packet to hit ct() and be recirculated in the middle of the trace. OVS 2.8 can automatically continue the trace across that recirculation, but I am running OVS 2.7.3, so under the ovs-networkpolicy plugin I cannot follow the packet with ovs-appctl ofproto/trace.
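Before attempting a trace, it is therefore worth checking which OVS release the node is running:

[root@a-node5 ~]# ovs-vsctl --version
ovs-vsctl (Open vSwitch) 2.7.3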

Finally, I tried ovs-appctl ofproto/trace under the ovs-multitenant plugin, and there it works as expected:

# Outbound packet
[root@node1 ~]# ovs-appctl ofproto/trace br0 "in_port=305,ip,nw_src=10.131.1.107,nw_dst=10.128.2.170"
Flow: ip,in_port=305,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.131.1.107,nw_dst=10.128.2.170,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=0

bridge("br0")
-------------
0. ip, priority 100
goto_table:20
20. ip,in_port=305,nw_src=10.131.1.107, priority 100
load:0x439379->NXM_NX_REG0[]
goto_table:21
21. priority 0
goto_table:30
30. ip,nw_dst=10.128.0.0/14, priority 100
goto_table:90
90. ip,nw_dst=10.128.2.0/23, priority 100
move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31]
-> NXM_NX_TUN_ID[0..31] is now 0x439379
set_field:10.150.1.41->tun_dst
output:1
-> output to kernel tunnel

Final flow: ip,reg0=0x439379,tun_src=0.0.0.0,tun_dst=10.150.1.41,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=0,tun_flags=0,in_port=305,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.131.1.107,nw_dst=10.128.2.170,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=0
Megaflow: recirc_id=0,ip,tun_id=0/0xffffffff,tun_dst=0.0.0.0,in_port=305,nw_src=10.131.1.107,nw_dst=10.128.2.0/23,nw_ecn=0,nw_frag=no
Datapath actions: set(tunnel(tun_id=0x439379,dst=10.150.1.41,ttl=64,tp_dst=4789,flags(df|key))),1

# Inbound packet
[root@node2 ~]# ovs-appctl ofproto/trace br0 "in_port=2,tcp,tunnel_id=0x439379,nw_dst=10.128.2.170"
Flow: tcp,tun_id=0xcc2e30,in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=0.0.0.0,nw_dst=10.128.2.170,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=0,tcp_flags=0

bridge("br0")
-------------
0. ip,in_port=2, priority 200
goto_table:30
30. ip,nw_dst=10.128.2.0/23, priority 200
goto_table:70
70. ip,nw_dst=10.128.2.170, priority 100 load:0x15beb->NXM_NX_REG1[] load:0x21a->NXM_NX_REG2[]
goto_table:80
80. priority 200 output:NXM_NX_REG2[]
-> output port is 538

Final flow: tcp,reg1=0x15beb,reg2=0x21a,tun_id=0xcc2e30,in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=0.0.0.0,nw_dst=10.128.2.170,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=0,tcp_flags=0
Megaflow: recirc_id=0,ip,in_port=2,nw_src=0.0.0.0/5,nw_dst=10.128.2.170,nw_frag=no
Datapath actions: 21

At this point we have tested pod-to-pod connectivity; how is pod-to-service connectivity implemented? It also goes through the OVS flow tables, but the packet is first sent out the tun0 port, where the host's iptables rules are consulted to pick a backing pod IP; the packet then passes through the flow tables again and follows the vxlan0 packet path described above. Why can OVS work this way? Because the tun0 port's type is internal (we verify this with a quick check at the end of this section), and after tun0 receives a packet the host handles it roughly as follows:

  1. When tun0 receives a packet, it finds that the other end has been opened by process B, so it hands the packet to process B
  2. After receiving the packet, process B does some business-specific processing and then builds a new packet, embedding the original packet inside it, and finally sends it out through socket B; at this point the new packet's source address has become eth0's address, and its destination IP has become some other address, e.g. 10.33.0.1
  3. Socket B hands the packet to the protocol stack
  4. Based on the local routing table, the protocol stack decides the packet should go out via eth0, so it hands the packet to eth0
  5. eth0 sends the packet out over the physical network

See segmentfault.com/a/1190 for a full explanation of this packet path.
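We can confirm the claim about tun0 directly; OVS reports its port type as internal, which is why the host kernel (and therefore iptables and the routing table) sees these packets at all:

[root@a-node5 ~]# ovs-vsctl get Interface tun0 type
internal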

The openshift-sdn CNI plugin

The openshift-sdn CNI plugin follows the Kubernetes CNI plugin standard (github.com/containernet). It is compiled into an openshift-sdn binary and implements the framework's CmdAdd and CmdDel methods (pkg/network/sdn-cni-plugin/openshift-sdn.go).

CmdAdd is called when a pod is created, and CmdDel when a pod is deleted. CmdAdd creates the pod's network namespace and veth pair and calls the cniserver to update the OVS flow tables; CmdDel does the corresponding teardown. The cniserver is an internal listening service started by the node process.
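On the node side, kubelet finds the plugin through the standard CNI configuration directory; on my 3.x nodes the config lives under /etc/cni/net.d/ and simply delegates to the openshift-sdn plugin type (the exact file name and cniVersion vary by release, so treat the path below as an assumption):

[root@a-node5 ~]# ls /etc/cni/net.d/
[root@a-node5 ~]# cat /etc/cni/net.d/80-openshift-network.conf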

Capturing packets on br0

We cannot capture packets on br0 directly; the workaround is to create a dummy interface and mirror br0's traffic to it for capture:

# Create br0-snooper0
ip link add name br0-snooper0 type dummy
ip link set dev br0-snooper0 up
# Add br0-snooper0 as a port on br0
ovs-vsctl add-port br0 br0-snooper0
# Configure a mirror on br0 that copies traffic to br0-snooper0
ovs-vsctl -- set Bridge br0 mirrors=@m \
  -- --id=@br0-snooper0 get Port br0-snooper0 \
  -- --id=@br0 get Port br0 \
  -- --id=@m create Mirror name=br0mirror \
     select-dst-port=@br0 \
     select-src-port=@br0 \
     output-port=@br0-snooper0 \
     select_all=1
ovs-vsctl list mirror br0mirror

Capture on the br0-snooper0 interface with tcpdump:

tcpdump -vvvs0 -npi br0-snooper0 -w /tmp/$(hostname)-$(date +"%m-%d-%H-%M").pcap
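Cross-node pod traffic can also be captured on the node's physical uplink by filtering for VXLAN, i.e. UDP port 4789 (matching the tp_dst=4789 seen in the trace output earlier); replace eth0 with the actual uplink interface name on your node:

tcpdump -nn -i eth0 udp port 4789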

When testing is done, remove the mirror from br0:

ovs-vsctl clear bridge br0 mirrors
ovs-vsctl del-port br0 br0-snooper0
# ip link delete br0-snooper0

iptables

Finally, let's look at the iptables rules Kubernetes creates for pods/endpoints/services.

Host iptables after installing Kubernetes

# Services on the cluster right after installing Kubernetes
[root@k-master-1 ~]# k get svc --all-namespaces
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 36d
kube-system kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP 36d

# iptables -t nat -nL
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */
DOCKER all -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL

Chain INPUT (policy ACCEPT)
target prot opt source destination

Chain OUTPUT (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */
DOCKER all -- 0.0.0.0/0 !127.0.0.0/8 ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
KUBE-POSTROUTING all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes postrouting rules */
MASQUERADE all -- 172.17.0.0/16 0.0.0.0/0
RETURN all -- 10.244.0.0/16 10.244.0.0/16
MASQUERADE all -- 10.244.0.0/16 !224.0.0.0/4
RETURN all -- !10.244.0.0/16 10.244.0.0/24
MASQUERADE all -- !10.244.0.0/16 10.244.0.0/16

....

Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-MARK-MASQ tcp -- !10.244.0.0/16 10.96.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:443
KUBE-SVC-NPX46M4PTMTKRN6Y tcp -- 0.0.0.0/0 10.96.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:443
KUBE-MARK-MASQ udp -- !10.244.0.0/16 10.96.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
KUBE-SVC-TCOU7JCQXEZGVUNU udp -- 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
KUBE-MARK-MASQ tcp -- !10.244.0.0/16 10.96.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53
KUBE-SVC-ERIFXISQEP7F7OF4 tcp -- 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53
KUBE-NODEPORTS all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

The KUBE-SERVICES chain contains rules that:

  1. mark traffic to the kubernetes service coming from outside the pod network for IP masquerade;
  2. match traffic to the kubernetes service (i.e. to the apiserver);
  3. mark traffic to the kube-dns service coming from outside the pod network for IP masquerade;
  4. match traffic to the kube-dns service;
  5. jump to chain KUBE-NODEPORTS.
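From here we can follow a single service by hand, for example the apiserver service chain referenced above:

# iptables -t nat -nL KUBE-SVC-NPX46M4PTMTKRN6Y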

Create two nginx pod replicas and a corresponding service

# cat deployment.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-nginx
  template:
    metadata:
      name: nginx
      labels:
        app: my-nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80

# cat service.yml
{
    "kind": "Service",
    "apiVersion": "v1",
    "metadata": {
        "name": "my-nginx-service"
    },
    "spec": {
        "selector": {
            "app": "my-nginx"
        },
        "ports": [
            {
                "protocol": "TCP",
                "port": 8080,
                "targetPort": 80
            }
        ]
    }
}

Check how iptables changed (service -> pod):

Chain KUBE-SERVICES (2 references)
target prot opt source destination
....
KUBE-MARK-MASQ tcp -- !10.244.0.0/16 10.97.185.147 /* default/my-nginx-service: cluster IP */ tcp dpt:8080
KUBE-SVC-BMQ5UNGRIS2RIY35 tcp -- 0.0.0.0/0 10.97.185.147 /* default/my-nginx-service: cluster IP */ tcp dpt:8080

Chain KUBE-SVC-BMQ5UNGRIS2RIY35 (1 references)
target prot opt source destination
KUBE-SEP-EAIPQUT7232NOUPP all -- 0.0.0.0/0 0.0.0.0/0 /* default/my-nginx-service: */ statistic mode random probability 0.50000000000
KUBE-SEP-3F4NG6RM6J36OUEB all -- 0.0.0.0/0 0.0.0.0/0 /* default/my-nginx-service: */

Chain KUBE-SEP-3F4NG6RM6J36OUEB (1 references)
target prot opt source destination
KUBE-MARK-MASQ all -- 10.244.2.9 0.0.0.0/0 /* default/my-nginx-service: */
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 /* default/my-nginx-service: */ tcp to:10.244.2.9:80

Chain KUBE-SEP-EAIPQUT7232NOUPP (1 references)
target prot opt source destination
KUBE-MARK-MASQ all -- 10.244.1.9 0.0.0.0/0 /* default/my-nginx-service: */
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 /* default/my-nginx-service: */ tcp to:10.244.1.9:80
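Putting it together: a connection to the ClusterIP 10.97.185.147:8080 matches KUBE-SERVICES, jumps to KUBE-SVC-BMQ5UNGRIS2RIY35, and is then DNATed to 10.244.1.9:80 or 10.244.2.9:80 with roughly equal probability. A quick check from a node (assuming the node can reach the service network):

curl -s http://10.97.185.147:8080/ | head -n 4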

We can also inspect the iptables inside a container's network namespace:

$ CONTAINER_ID=$(kubectl get po ratings-v1-6d9f5df564-kzfhd -o jsonpath={.status.containerStatuses[0].containerID} | cut -c 10-21)

$ PID=$(sudo docker inspect -f '{{.State.Pid}}' $CONTAINER_ID)

$ nsenter -t ${PID} -n iptables-save
$ nsenter -t ${PID} -n iptables -t nat -L -n -v

That is where I will stop for now; as far as SDN goes, this is only the tip of the iceberg.

References:

  1. OpenFlow practice based on Open vSwitch (基於Open vSwitch的OpenFlow實踐)
  2. OpenShift official documentation
