Recovering a dual-master k8s cluster after one master node fails
Original, September 23, 2024, about 2 minutes to read
This walkthrough was carried out and tested on Ubuntu 22.04.3 LTS
docker version 24.0.6
kubeadm version v1.28.2
Versions
Background
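One master node in the cluster suffered a disk failure and could not be recovered, which in turn took down the whole k8s cluster.
Resources were limited when the cluster was first built, so it was set up with two master nodes.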
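Root cause of the cluster failing to start
In a dual-master cluster, etcd has 2 members. etcd needs a majority of its members (here 2 of 2) to elect a leader, so once one master node is down the surviving member cannot reach quorum. etcd becomes unavailable, and the whole cluster stops working with it.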
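The majority rule above can be checked with a little arithmetic. This is only a sketch: the helper names `quorum` and `fault_tolerance` are made up for illustration and are not part of etcd or etcdctl.

```shell
#!/usr/bin/env bash
# quorum N: number of members that must be reachable for an
# N-member etcd (Raft) cluster to elect a leader.
quorum() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# fault_tolerance N: number of members that can fail while quorum is kept.
fault_tolerance() {
  local members=$1
  echo $(( members - (members / 2 + 1) ))
}

for n in 1 2 3; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

With 2 members the quorum is 2 and no failure is tolerated, so a dual-master setup is no more fault tolerant than a single master. 3 members is the smallest size that survives the loss of one node.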
The etcd log shows the following errors
{"level":"warn","ts":"2024-09-23T07:25:10.239885Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"5ce6420b7f06f234","rtt":"0s","error":"dial tcp 172.17.xxx.xx:2380: connect: no route to host"}
{"level":"warn","ts":"2024-09-23T07:25:10.239985Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"5ce6420b7f06f234","rtt":"0s","error":"dial tcp 172.17.xxx.xx:2380: connect: no route to host"}
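To see quickly which peer etcd is complaining about, the warnings can be summarized by `remote-peer-id`. The sketch below is an assumption about how one might do this, not something from the original setup: the function name `unreachable_peers` and the file name `etcd.log` are hypothetical, and it relies only on `grep`, `cut`, `sort` and `uniq`.

```shell
#!/usr/bin/env bash
# unreachable_peers: read etcd JSON log lines on stdin and print, for each
# remote-peer-id found in a "prober detected unhealthy status" warning,
# how many such warnings were seen.
unreachable_peers() {
  grep 'prober detected unhealthy status' \
    | grep -o '"remote-peer-id":"[0-9a-f]*"' \
    | cut -d'"' -f4 \
    | sort | uniq -c
}

# Example, assuming the etcd log was saved to a (hypothetical) file etcd.log:
# unreachable_peers < etcd.log
```

In the log above this would report a single ID, `5ce6420b7f06f234`, which should correspond to the member on the failed master node.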
Goal: bring etcd on the one surviving master node back up as a single-member cluster
Before doing anything, back up
/var/lib/etcd/
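A minimal sketch of that backup, assuming a plain directory copy is acceptable and that there is enough free disk space. The helper `backup_dir` and the destination `/root` are hypothetical choices for illustration. The copy is most reliable when etcd is not writing to the directory at that moment.

```shell
#!/usr/bin/env bash
# backup_dir SRC DEST_PARENT: copy SRC (preserving ownership, modes and
# timestamps) into a new timestamped directory under DEST_PARENT and
# print the path of the copy.
backup_dir() {
  local src=$1 dest_parent=$2
  local stamp dest
  stamp=$(date +%Y%m%d-%H%M%S)
  dest="${dest_parent%/}/$(basename "$src").bak-${stamp}"
  cp -a "$src" "$dest" && echo "$dest"
}

# Example (run as root on the surviving master node):
# backup_dir /var/lib/etcd /root
```

Keeping the copy outside `/var/lib/etcd/` means a failed recovery attempt can be undone by restoring the directory from the backup.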
export ETCDCTL_API=3
export ETCD_ENDPOINT="http://0.0.0.0:2379"
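# Install etcdctl if it is not present:
# apt install etcd-client
# Check etcd health (client certificates are passed below, so the endpoint must use https)
ETCDCTL_API=3 etcdctl --endpoints=https://172.17.xxx.xx:2379 \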
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health
# Show the etcd cluster members
ETCDCTL_API=3 etcdctl --endpoints=https://172.17.240.83:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
member list
# If the member list can be read normally, removing the unavailable member is enough
ETCDCTL_API=3 etcdctl --endpoints=https://172.17.240.83:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
member remove <member-id>
# In my case the cluster was unhealthy, so the commands above could not run
# apt install etcd-server
# Turn the cluster into a single-member cluster
etcd --data-dir=/var/lib/etcd/ --force-new-cluster
# 2024-09-23 18:49:32.664882 C | etcdserver/membership: cluster cannot be downgraded (current version: 3.3.25 is lower than determined cluster version: 3.5).
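# The etcd installed by apt (3.3.25) is older than the cluster version (3.5).
# Download the etcd binary matching the cluster version, overwrite the installed one, then run again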
etcd --data-dir=/var/lib/etcd/ --force-new-cluster
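Fetching the matching binary could look roughly like the sketch below. The version `v3.5.9` is an assumption (the log only says the cluster version is 3.5). On a kubeadm cluster the exact version can usually be read from the image tag in /etc/kubernetes/manifests/etcd.yaml. The helper `etcd_release_url` is hypothetical and only builds the download URL for the official GitHub release tarball.

```shell
#!/usr/bin/env bash
# etcd_release_url VERSION [ARCH]: print the GitHub release tarball URL
# for the given etcd version (for example v3.5.9) on Linux.
etcd_release_url() {
  local version=$1 arch=${2:-amd64}
  echo "https://github.com/etcd-io/etcd/releases/download/${version}/etcd-${version}-linux-${arch}.tar.gz"
}

# Assumed version: replace it with the one your cluster actually runs.
ETCD_VER=v3.5.9
url=$(etcd_release_url "$ETCD_VER")
echo "$url"

# To download and unpack (needs network access), then check the version:
# curl -fL "$url" -o /tmp/etcd.tar.gz
# tar -xzf /tmp/etcd.tar.gz -C /tmp
# /tmp/etcd-${ETCD_VER}-linux-amd64/etcd --version
```

Once the unpacked `etcd --version` reports a 3.5.x release, that binary can replace the apt-installed one before rerunning the `--force-new-cluster` command above.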