Part Two

A deeper look at openshift-sdn

Looking past the surface, we can dig deeper into how openshift-sdn is implemented. openshift-sdn adds four new resource objects.

New resource objects

The first is netnamespace:

oc get netnamespace
NAME      NETID
default   0
np1       1647324
np2       11107045

A netnamespace records the VNID of each namespace. Under the ovs-subnet plugin the cluster is single-tenant, so netnamespace objects do not exist. Under ovs-multitenant, every namespace gets a unique VNID; the global default namespace has VNID 0 and is reachable from every other namespace. If we want two (or more) projects to reach each other under ovs-multitenant, we can run:

$oc adm pod-network join-projects --to=project1 project2 project3 --loglevel=8

After this command runs, the joined projects end up with the same VNID, which is what makes them reachable from one another. Under the ovs-networkpolicy plugin, the namespace VNIDs are likewise used to program the corresponding OVS rules in table=80.
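
A quick way to confirm the result is to list the netnamespace objects again after the join; the joined projects should now share one NETID, and oc adm pod-network isolate-projects reverses it (the project names here just follow the example command above):

oc get netnamespace project1 project2 project3
# undo the join, giving each project its own VNID again
oc adm pod-network isolate-projects project2 project3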

The other resource objects are clusternetwork, which defines the cluster-wide pod network, and hostsubnet, which defines the pod subnet assigned to each host; an egressnetworkpolicy resource is also added.

$oc get clusternetwork
NAME      NETWORK         HOST SUBNET LENGTH   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14   9                    172.30.0.0/16     redhat/openshift-ovs-networkpolicy

$oc get hostsubnet
NAME                    HOST                    HOST IP          SUBNET
a-master1.example.com   a-master1.example.com   172.18.143.111   10.128.0.0/23
a-master2.example.com   a-master2.example.com   172.18.143.112   10.130.0.0/23
a-master3.example.com   a-master3.example.com   172.18.143.113   10.129.0.0/23
a-node1.example.com     a-node1.example.com     172.18.143.114   10.128.2.0/23
a-node2.example.com     a-node2.example.com     172.18.143.115   10.131.0.0/23
a-node3.example.com     a-node3.example.com     172.18.143.116   10.129.2.0/23
a-node4.example.com     a-node4.example.com     172.18.143.117   10.130.2.0/23
a-node5.example.com     a-node5.example.com     172.18.143.174   10.131.2.0/23

$oc get egressnetworkpolicy
No resources found.
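
No EgressNetworkPolicy exists in this cluster yet. As a rough sketch of what one looks like (the CIDR is made up for illustration; older 3.x releases use apiVersion: v1 instead of network.openshift.io/v1):

oc create -n np1 -f - <<EOF
apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: default
spec:
  egress:
  - type: Allow
    to:
      cidrSelector: 172.18.0.0/16
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0
EOF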

The OVS br0 bridge and its ports

First, let's look at the bridge and the ports that were created on node5:

[root@a-node5 ~]# ovs-ofctl -O OpenFlow13 show br0
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000fa8909121b41
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
1(vxlan0): addr:ce:08:d2:68:09:19
config: 0
state: 0
speed: 0 Mbps now, 0 Mbps max
2(tun0): addr:f2:d1:08:10:9e:b4
config: 0
state: 0
speed: 0 Mbps now, 0 Mbps max
112(veth5e852cf1): addr:52:2e:59:2b:49:31
config: 0
state: 0
current: 10GB-FD COPPER
speed: 10000 Mbps now, 0 Mbps max
LOCAL(br0): addr:fa:89:09:12:1b:41
config: PORT_DOWN
state: LINK_DOWN
speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=nx-match miss_send_len=0
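
The same ports can be cross-checked from the OVSDB side; a quick sketch, assuming the default bridge name br0 used throughout this article:

ovs-vsctl list-ports br0
# vxlan0 is a VXLAN tunnel port, tun0 is an internal port
ovs-vsctl get Interface vxlan0 type
ovs-vsctl get Interface tun0 type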

OVS flow table rules

From the openshift-sdn source we can see how the OVS rules are created and maintained: openshift-sdn sets up the following OVS tables and watches for resource changes, updating the rules in the corresponding tables:

// Table 0: initial dispatch based on in_port
// Table 10: VXLAN ingress filtering; filled in by AddHostSubnetRules()
// Table 20: from OpenShift container; validate IP/MAC, assign tenant-id; filled in by setupPodFlows
// Table 21: from OpenShift container; NetworkPolicy plugin uses this for connection tracking
// Table 25: IP from OpenShift container via Service IP; reload tenant-id; filled in by setupPodFlows
// Table 30: general routing
// Table 40: ARP to local container, filled in by setupPodFlows
// Table 50: ARP to remote container; filled in by AddHostSubnetRules()
// Table 60: IP to service from pod
// Table 70: IP to local container: vnid/port mappings; filled in by setupPodFlows
// Table 80: IP policy enforcement; mostly managed by the osdnPolicy
// Table 90: IP to remote container; filled in by AddHostSubnetRules()
// Table 100: egress routing; edited by UpdateNamespaceEgressRules()
// Table 101: egress network policy dispatch; edited by UpdateEgressNetworkPolicy()
// Table 110: outbound multicast filtering, updated by UpdateLocalMulticastFlows()
// Table 111: multicast delivery from local pods to the VXLAN; only one rule, updated by UpdateVXLANMulticastRules()
// Table 120: multicast delivery to local pods (either from VXLAN or local pods); updated by UpdateLocalMulticastFlows()
// Table 253: rule version note

Viewing the OVS flow tables

We can also use ovs-ofctl dump-flows to inspect the flow entries:

[root@a-node5 ~]# ovs-ofctl dump-flows br0 -O OpenFlow13
[root@a-node5 ~]# ovs-ofctl dump-flows br0 -O OpenFlow13 | cut -d , -f1,2,4,5 --complement | sort -u -V
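
dump-flows also accepts a table= filter, which is handy when checking a single stage from the list above; and a rough rule count per table can be had with grep (nothing openshift-specific here):

# only the NetworkPolicy enforcement table (table 80)
ovs-ofctl -O OpenFlow13 dump-flows br0 table=80
# rough rule count per table
ovs-ofctl -O OpenFlow13 dump-flows br0 | grep -o 'table=[0-9]*' | sort -t= -k2 -n | uniq -c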

Tracing the data path

We can use ovs-appctl ofproto/trace to inject a synthetic packet and trace its path through the flows.

Before using it, we need to determine the OVS in_port value. The busyboxplus pod in np1 is currently running on node5. The CNI plugin creates a network namespace and a veth pair for the pod: one end of the pair becomes the pod's eth0, and the other end is plugged into a port on the OVS bridge, which is the in_port for our data flow:

# First rsh into the container and run ip a; the number after eth0@if (325 here) is the ifindex of the host-side veth peer
$oc rsh busyboxplus-deployment-786549c679-gdszc
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if325: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 0a:58:0a:83:03:36 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.131.3.54/23 brd 10.131.3.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::2034:beff:fed3:560d/64 scope link
valid_lft forever preferred_lft forever

If the container does not ship the ip command, we can use nsenter to run it inside the pod's network namespace instead:

[root@a-node5 ~]# CONTAINER_ID=$(oc get pods busyboxplus-deployment-786549c679-gdszc -o jsonpath='{.status.containerStatuses[0].containerID}' | cut -c 10-21)
[root@a-node5 ~]# PID=$(sudo docker inspect -f '{{.State.Pid}}' $CONTAINER_ID)
[root@a-node5 ~]# nsenter -t ${PID} -n ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if325: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 0a:58:0a:83:03:36 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.131.3.54/23 brd 10.131.3.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::2034:beff:fed3:560d/64 scope link
valid_lft forever preferred_lft forever

Then match that number against /sys/class/net/<veth>/ifindex to find the corresponding in_port:

[root@a-node5 ~]# ovs-ofctl -O OpenFlow13 show br0 | grep veth
200(veth07cd191b): addr:26:c0:0f:a6:8f:19
201(vethecf60c1a): addr:8a:a2:86:2e:73:35
[root@a-node5 ~]# cat /sys/class/net/{veth07cd191b,vethecf60c1a}/ifindex
325
326
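
If the node runs many pods, a small loop saves the manual matching; a convenience sketch that assumes the veth naming shown above:

# print ovs port number and ifindex for every veth attached to br0
for v in $(ovs-vsctl list-ports br0 | grep '^veth'); do
  port=$(ovs-ofctl -O OpenFlow13 show br0 | grep "($v)" | cut -d '(' -f1 | tr -d ' ')
  echo "$v  in_port=$port  ifindex=$(cat /sys/class/net/$v/ifindex)"
done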

So the in_port through which np1's busyboxplus enters ovs br0 is 200. Next, let's get the pod IP of busyboxplus in np1 and of nginx in np2, so we can feed them to ovs-appctl ofproto/trace to simulate the flow:

$oc project np1
$oc get pods -o wide
NAME                                      READY     STATUS    RESTARTS   AGE       IP            NODE
busyboxplus-deployment-786549c679-gdszc   1/1       Running   0          2h        10.131.3.54   a-node5.example.com

$oc project np2
$oc get pods -o wide
NAME                                READY     STATUS    RESTARTS   AGE       IP            NODE
nginx-deployment-66898c95cd-z22j6   1/1       Running   0          1h        10.131.3.55   a-node5.example.com

busyboxplus pod IP in np1: 10.131.3.54

nginx pod IP in np2: 10.131.3.55

Use ovs-appctl ofproto/trace to simulate the busyboxplus pod calling the nginx pod:

[root@a-node5 ~]# ovs-appctl ofproto/trace br0 in_port=200,tcp,nw_src=10.131.3.54,nw_dst=10.131.3.55,ct_state=trk
Flow: ct_state=trk,tcp,in_port=200,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.131.3.54,nw_dst=10.131.3.55,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=0,tcp_flags=0

bridge("br0")
-------------
0. ip, priority 100
goto_table:20
20. ip,in_port=200,nw_src=10.131.3.54, priority 100
load:0x1922dc->NXM_NX_REG0[]
goto_table:21
21. ip,nw_dst=10.128.0.0/14, priority 200
ct(commit,table=30)
drop

Final flow: ct_state=trk,tcp,reg0=0x1922dc,in_port=200,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.131.3.54,nw_dst=10.131.3.55,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=0,tcp_flags=0
Megaflow: recirc_id=0,ct_state=+trk,ip,in_port=200,nw_src=10.131.3.54,nw_dst=10.128.0.0/14,nw_frag=no
Datapath actions: 8,ct(commit),recirc(0x1545e)

drop? It looks like the simulated flow is wrong. Indeed: under the ovs-networkpolicy plugin the flows we create use connection tracking, and the ct() action forces the packet to be recirculated. ofproto/trace in OVS 2.7.3 stops at that recirculation, while OVS 2.8 can restart the trace across it automatically. Since I am running OVS 2.7.3, I cannot use ovs-appctl ofproto/trace to follow packets under the ovs-networkpolicy plugin.

Finally, I tried ovs-appctl ofproto/trace under the ovs-multitenant plugin, where it works as expected:

# Outbound packet (leaving the source node)
[root@node1 ~]# ovs-appctl ofproto/trace br0 "in_port=305,ip,nw_src=10.131.1.107,nw_dst=10.128.2.170"
Flow: ip,in_port=305,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.131.1.107,nw_dst=10.128.2.170,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=0

bridge("br0")
-------------
0. ip, priority 100
goto_table:20
20. ip,in_port=305,nw_src=10.131.1.107, priority 100
load:0x439379->NXM_NX_REG0[]
goto_table:21
21. priority 0
goto_table:30
30. ip,nw_dst=10.128.0.0/14, priority 100
goto_table:90
90. ip,nw_dst=10.128.2.0/23, priority 100
move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31]
-> NXM_NX_TUN_ID[0..31] is now 0x439379
set_field:10.150.1.41->tun_dst
output:1
-> output to kernel tunnel

Final flow: ip,reg0=0x439379,tun_src=0.0.0.0,tun_dst=10.150.1.41,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=0,tun_flags=0,in_port=305,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.131.1.107,nw_dst=10.128.2.170,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=0
Megaflow: recirc_id=0,ip,tun_id=0/0xffffffff,tun_dst=0.0.0.0,in_port=305,nw_src=10.131.1.107,nw_dst=10.128.2.0/23,nw_ecn=0,nw_frag=no
Datapath actions: set(tunnel(tun_id=0x439379,dst=10.150.1.41,ttl=64,tp_dst=4789,flags(df|key))),1

# Inbound packet (arriving at the destination node)
[root@node2~]# ovs-appctl ofproto/trace br0 "in_port=2,tcp,tunnel_id=0x439379,nw_dst=10.128.2.170"
Flow: tcp,tun_id=0xcc2e30,in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=0.0.0.0,nw_dst=10.128.2.170,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=0,tcp_flags=0

bridge("br0")
-------------
0. ip,in_port=2, priority 200
goto_table:30
30. ip,nw_dst=10.128.2.0/23, priority 200
goto_table:70
70. ip,nw_dst=10.128.2.170, priority 100 load:0x15beb->NXM_NX_REG1[] load:0x21a->NXM_NX_REG2[]
goto_table:80
80. priority 200 output:NXM_NX_REG2[]
-> output port is 538

Final flow: tcp,reg1=0x15beb,reg2=0x21a,tun_id=0xcc2e30,in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=0.0.0.0,nw_dst=10.128.2.170,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=0,tcp_flags=0
Megaflow: recirc_id=0,ip,in_port=2,nw_src=0.0.0.0/5,nw_dst=10.128.2.170,nw_frag=no
Datapath actions: 21

So far we have traced pod-to-pod connectivity; how does pod-to-service connectivity work? It also goes through the OVS flow tables, but the packet first leaves through the tun0 port, where the host's iptables rules are consulted to map the service IP to a pod IP; the packet then goes through the flow tables once more and follows the usual vxlan0 path (a quick way to check both halves on a node is shown a bit further below). Why can OVS work this way? First, the tun0 port is of type internal; once tun0 receives a packet, the following happens:

  1. When tun0 receives the packet, it finds that the other end of the device has been opened by a user-space process B, so it hands the packet to process B.
  2. Process B does some application-specific processing, builds a new packet with the original one embedded inside, and sends it out through socket B. The new packet's source address is now eth0's address, and its destination IP becomes some other address, for example 10.33.0.1.
  3. Socket B hands the packet to the kernel protocol stack.
  4. Based on the local routing table, the stack decides this packet should leave via eth0 and passes it to eth0.
  5. eth0 sends the packet out over the physical network.

See segmentfault.com/a/1190 for a complete walkthrough of this packet path.
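
To see the two halves of that path on a node, we can dump table 60 (the pod-to-service table in the list above) and look at the kube-proxy chains in iptables; a quick check, using the 172.30.0.0/16 service network of this cluster:

# OVS side: the pod-to-service stage
ovs-ofctl -O OpenFlow13 dump-flows br0 table=60
# host side: the DNAT from service IP to pod IP lives in the kube-proxy chains
iptables -t nat -L KUBE-SERVICES -n | grep 172.30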

The openshift-sdn CNI plugin

The openshift-sdn CNI plugin follows the Kubernetes CNI specification (github.com/containernet). It is compiled into an openshift-sdn binary and implements the CmdAdd and CmdDel methods required by the CNI framework (pkg/network/sdn-cni-plugin/openshift-sdn.go).

CmdAdd is invoked when a pod is created and CmdDel when it is deleted. CmdAdd creates the pod's network namespace and veth pair and calls the cniserver to update the OVS flows; CmdDel does the reverse. The cniserver is an internal listening service started by the node process.
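
On a node, these pieces can be found on disk; a quick look, where the paths are the usual OpenShift 3.x defaults and should be treated as assumptions to verify on your own install:

# CNI network config dropped by the SDN node process (file name may differ per release)
cat /etc/cni/net.d/80-openshift-network.conf
# the CNI plugin binary the kubelet invokes
ls -l /opt/cni/bin/openshift-sdn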

Capturing packets on br0

We cannot capture packets on br0 directly. The workaround is to create a dummy interface, mirror br0's traffic to it, and capture there.

# create br0-snooper0
ip link add name br0-snooper0 type dummy
ip link set dev br0-snooper0 up
# add br0-snooper0 as a port on br0
ovs-vsctl add-port br0 br0-snooper0
# configure the mirror on br0
ovs-vsctl -- set Bridge br0 mirrors=@m \
  -- --id=@br0-snooper0 get Port br0-snooper0 \
  -- --id=@br0 get Port br0 \
  -- --id=@m create Mirror name=br0mirror \
       select-dst-port=@br0 \
       select-src-port=@br0 \
       output-port=@br0-snooper0 \
       select_all=1
ovs-vsctl list mirror br0mirror

Capture on the br0-snooper0 interface with tcpdump:

tcpdump -vvvs0 -npi br0-snooper0 -w /tmp/$(hostname)-$(date +"%m-%d-%H-%M").pcap
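
The capture can also be narrowed with a normal tcpdump filter; for example, watching only the nginx pod from earlier (pod IP taken from this article, adjust for your own):

tcpdump -nn -i br0-snooper0 host 10.131.3.55 and tcp port 80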

When testing is done, remove the mirror from br0:

ovs-vsctl clear bridge br0 mirrors
ovs-vsctl del-port br0 br0-snooper0
# ip link delete br0-snooper0

iptables

Finally, let's look at the iptables rules that Kubernetes creates for pods/endpoints/services.

Host iptables after installing Kubernetes

# services present right after installing Kubernetes
[root@k-master-1 ~]# k get svc --all-namespaces
NAMESPACE     NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
default       kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP         36d
kube-system   kube-dns     ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP   36d

# iptables -t nat -nL
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */
DOCKER all -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL

Chain INPUT (policy ACCEPT)
target prot opt source destination

Chain OUTPUT (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */
DOCKER all -- 0.0.0.0/0 !127.0.0.0/8 ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
KUBE-POSTROUTING all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes postrouting rules */
MASQUERADE all -- 172.17.0.0/16 0.0.0.0/0
RETURN all -- 10.244.0.0/16 10.244.0.0/16
MASQUERADE all -- 10.244.0.0/16 !224.0.0.0/4
RETURN all -- !10.244.0.0/16 10.244.0.0/24
MASQUERADE all -- !10.244.0.0/16 10.244.0.0/16

....

Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-MARK-MASQ tcp -- !10.244.0.0/16 10.96.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:443
KUBE-SVC-NPX46M4PTMTKRN6Y tcp -- 0.0.0.0/0 10.96.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:443
KUBE-MARK-MASQ udp -- !10.244.0.0/16 10.96.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
KUBE-SVC-TCOU7JCQXEZGVUNU udp -- 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
KUBE-MARK-MASQ tcp -- !10.244.0.0/16 10.96.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53
KUBE-SVC-ERIFXISQEP7F7OF4 tcp -- 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53
KUBE-NODEPORTS all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

The KUBE-SERVICES chain contains (each KUBE-SVC-* target can be followed one hop further, as shown below):
1. Mark traffic to the kubernetes service coming from outside the pod network for IP masquerade;
2. Jump to the chain for the kubernetes service (the apiserver);
3. Mark traffic to the kube-dns service coming from outside the pod network for IP masquerade;
4. Jump to the chains for the kube-dns service;
5. Fall through to the KUBE-NODEPORTS chain.
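
To follow a service one hop further, list its KUBE-SVC-* chain by name (the apiserver chain name is taken from the output above):

iptables -t nat -L KUBE-SVC-NPX46M4PTMTKRN6Y -n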

Now create two nginx pod replicas and the corresponding service:

# cat deployment.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-nginx
  template:
    metadata:
      name: nginx
      labels:
        app: my-nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80

# cat service.yml
{
  "kind": "Service",
  "apiVersion": "v1",
  "metadata": {
    "name": "my-nginx-service"
  },
  "spec": {
    "selector": {
      "app": "my-nginx"
    },
    "ports": [
      {
        "protocol": "TCP",
        "port": 8080,
        "targetPort": 80
      }
    ]
  }
}
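
Apply both manifests and note the cluster IP and endpoints that get created, since they are exactly what shows up in iptables next (kubectl apply is standard; the file names match the listings above):

kubectl apply -f deployment.yml -f service.yml
kubectl get svc my-nginx-service
kubectl get endpoints my-nginx-service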

Then check how iptables changed (service -> pod):

Chain KUBE-SERVICES (2 references)
target prot opt source destination
....
KUBE-MARK-MASQ tcp -- !10.244.0.0/16 10.97.185.147 /* default/my-nginx-service: cluster IP */ tcp dpt:8080
KUBE-SVC-BMQ5UNGRIS2RIY35 tcp -- 0.0.0.0/0 10.97.185.147 /* default/my-nginx-service: cluster IP */ tcp dpt:8080

Chain KUBE-SVC-BMQ5UNGRIS2RIY35 (1 references)
target prot opt source destination
KUBE-SEP-EAIPQUT7232NOUPP all -- 0.0.0.0/0 0.0.0.0/0 /* default/my-nginx-service: */ statistic mode random probability 0.50000000000
KUBE-SEP-3F4NG6RM6J36OUEB all -- 0.0.0.0/0 0.0.0.0/0 /* default/my-nginx-service: */

Chain KUBE-SEP-3F4NG6RM6J36OUEB (1 references)
target prot opt source destination
KUBE-MARK-MASQ all -- 10.244.2.9 0.0.0.0/0 /* default/my-nginx-service: */
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 /* default/my-nginx-service: */ tcp to:10.244.2.9:80

Chain KUBE-SEP-EAIPQUT7232NOUPP (1 references)
target prot opt source destination
KUBE-MARK-MASQ all -- 10.244.1.9 0.0.0.0/0 /* default/my-nginx-service: */
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 /* default/my-nginx-service: */ tcp to:10.244.1.9:80
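
A convenient way to pull out just this service's rules across all of the chains is iptables-save plus grep (a quick sketch):

iptables-save -t nat | grep my-nginx-service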

We can also inspect the iptables rules inside a container's network namespace:

CONTAINER_ID=$(kubectl get po ratings-v1-6d9f5df564-kzfhd -o jsonpath='{.status.containerStatuses[0].containerID}' | cut -c 10-21)

PID=$(sudo docker inspect -f '{{.State.Pid}}' $CONTAINER_ID)

nsenter -t ${PID} -n iptables-save
nsenter -t ${PID} -n iptables -t nat -L -n -v

That wraps up my investigation for now; as far as SDN goes, this is only the tip of the iceberg.

