How to Recover When etcd Fails on a Master Node of a Highly Available K8s Cluster

马肤


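Summary: This post covers how to recover when etcd on a master node of a highly available Kubernetes (K8s) cluster goes down. First check the etcd cluster status to confirm which member has failed. If a backup exists, the etcd data can be restored from it; without one, the failed member is removed, its data is cleared, and it is re-added so that it resyncs from the healthy members. After recovery, confirm that the API server is available and verify the cluster state.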

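Preface


  • A very common cluster operations scenario, written up to share
  • This post walks through recovering a failed master node in a highly available K8s cluster
  • If I have misunderstood anything, corrections are welcome

    Don't dwell too much on the present, and don't worry too much about the future. Once you have been through some things, the scenery in front of you is no longer what it used to be. (Haruki Murakami)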


    What went wrong

    While running an experiment today, I found that both etcd and the apiserver on one of the cluster's master nodes had gone down.

    ┌──[root@vms100.liruilongs.github.io]-[~]
    └─$kubectl get nodes
    NAME                          STATUS   ROLES           AGE    VERSION
    vms100.liruilongs.github.io   Ready    control-plane   415d   v1.25.1
    vms101.liruilongs.github.io   Ready    control-plane   415d   v1.25.1
    vms102.liruilongs.github.io   Ready    control-plane   415d   v1.25.1
    vms103.liruilongs.github.io   Ready    <none>          415d   v1.25.1
    vms105.liruilongs.github.io   Ready    <none>          415d   v1.25.1
    vms106.liruilongs.github.io   Ready    <none>          415d   v1.25.1
    ┌──[root@vms100.liruilongs.github.io]-[~]
    └─$
    

    The apiserver and etcd Pods on the vms100.liruilongs.github.io node are in CrashLoopBackOff:

    ┌──[root@vms100.liruilongs.github.io]-[~]
    └─$kubectl get pod -A -o wide | grep apiserver
    kube-system          kube-apiserver-vms100.liruilongs.github.io            0/1     CrashLoopBackOff   1448 (3m23s ago)   415d   192.168.26.100   vms100.liruilongs.github.io              
    kube-system          kube-apiserver-vms101.liruilongs.github.io            1/1     Running            272 (3h18m ago)    415d   192.168.26.101   vms101.liruilongs.github.io              
    kube-system          kube-apiserver-vms102.liruilongs.github.io            1/1     Running            246 (3h18m ago)    415d   192.168.26.102   vms102.liruilongs.github.io              
    ┌──[root@vms100.liruilongs.github.io]-[~]
    └─$kubectl get pod -A -o wide | grep etcd
    kube-system          etcd-vms100.liruilongs.github.io                      0/1     CrashLoopBackOff   1244 (3m6s ago)    415d   192.168.26.100   vms100.liruilongs.github.io              
    kube-system          etcd-vms101.liruilongs.github.io                      1/1     Running            167 (3h18m ago)    415d   192.168.26.101   vms101.liruilongs.github.io              
    kube-system          etcd-vms102.liruilongs.github.io                      1/1     Running            173 (3h18m ago)    415d   192.168.26.102   vms102.liruilongs.github.io              
    

    The static Pods for keepalived are running normally:

    ┌──[root@vms100.liruilongs.github.io]-[~]
    └─$kubectl get pod -A -o wide | grep keep
    kube-system          keepalived-vms100.liruilongs.github.io                1/1     Running            63 (3h50m ago)    415d   192.168.26.100   vms100.liruilongs.github.io              
    kube-system          keepalived-vms101.liruilongs.github.io                1/1     Running            54 (3h51m ago)    415d   192.168.26.101   vms101.liruilongs.github.io              
    kube-system          keepalived-vms102.liruilongs.github.io                1/1     Running            60 (3h51m ago)    415d   192.168.26.102   vms102.liruilongs.github.io              
    ┌──[root@vms100.liruilongs.github.io]-[~]
    └─$
    

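    So etcd may have crashed because its data fell out of sync, or for some other reason. Each master node's apiserver talks only to the etcd on its own node (write requests received by any etcd member are forwarded to the etcd leader), so once the local etcd is down, the apiserver cannot serve requests and crashes as well.

    etcdctl confirms that etcd on vms100.liruilongs.github.io is completely dead: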

    ┌──[root@vms100.liruilongs.github.io]-[~]
    └─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379 \
     --cert="/etc/kubernetes/pki/etcd/server.crt" \
     --key="/etc/kubernetes/pki/etcd/server.key"  \
     --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
     member list -w table
    Error: dial tcp 127.0.0.1:2379: connect: connection refused
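
The same TLS flags recur in every etcdctl call in this post. A minimal sketch of a wrapper, assuming the kubeadm certificate paths shown above; the function name `etcdctl_k8s` and the variable `ETCD_ENDPOINT` are hypothetical names introduced only for illustration:

```shell
# Hypothetical helper: wraps etcdctl with the certificate paths used in this post.
# ETCD_ENDPOINT is an illustrative variable; it defaults to the local member.
etcdctl_k8s() {
  ETCDCTL_API=3 etcdctl \
    --endpoints "${ETCD_ENDPOINT:-https://127.0.0.1:2379}" \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    "$@"
}
```

With this in place, `etcdctl_k8s member list -w table` would be equivalent to the full command above, and `ETCD_ENDPOINT=https://192.168.26.101:2379 etcdctl_k8s endpoint health` would target another member.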
    

    How to troubleshoot

    Here we switch to another etcd node to run the commands.

    List the etcd cluster members:

    ┌──[root@vms100.liruilongs.github.io]-[~]
    └─$ssh vms101.liruilongs.github.io
    Last login: Sat Mar  2 09:52:01 2024 from 192.168.26.100
    ┌──[root@vms101.liruilongs.github.io]-[~]
    └─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379 \
     --cert="/etc/kubernetes/pki/etcd/server.crt"  \
     --key="/etc/kubernetes/pki/etcd/server.key" \
      --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
      member list -w table
    +------------------+---------+-----------------------------+-----------------------------+-----------------------------+
    |        ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
    +------------------+---------+-----------------------------+-----------------------------+-----------------------------+
    |  ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
    | 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
    | b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
    +------------------+---------+-----------------------------+-----------------------------+-----------------------------+
    

    Check the endpoint status:

    ┌──[root@vms101.liruilongs.github.io]-[~]
    └─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379 \
      --cert="/etc/kubernetes/pki/etcd/server.crt" \
      --key="/etc/kubernetes/pki/etcd/server.key"  \
      --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
       endpoint status --cluster  -w table
    Failed to get the status of endpoint https://192.168.26.100:2379 (context deadline exceeded)
    +-----------------------------+------------------+---------+---------+-----------+-----------+------------+
    |          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
    +-----------------------------+------------------+---------+---------+-----------+-----------+------------+
    | https://192.168.26.101:2379 | 70059e836d19883d |   3.5.4 |   88 MB |     false |       603 |   22208417 |
    | https://192.168.26.102:2379 | b8cb9f66c2e63b91 |   3.5.4 |   88 MB |      true |       603 |   22208417 |
    +-----------------------------+------------------+---------+---------+-----------+-----------+------------+
    

    Confirm which etcd member has failed:

    ┌──[root@vms101.liruilongs.github.io]-[~]
    └─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  \
     --cert="/etc/kubernetes/pki/etcd/server.crt"  \
     --key="/etc/kubernetes/pki/etcd/server.key"  \
     --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
     endpoint  health --cluster  -w table
    https://192.168.26.101:2379 is healthy: successfully committed proposal: took = 3.753357ms
    https://192.168.26.102:2379 is healthy: successfully committed proposal: took = 2.989943ms
    https://192.168.26.100:2379 is unhealthy: failed to connect: dial tcp 192.168.26.100:2379: connect: connection refused
    Error: unhealthy cluster
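
When this check is scripted, the failing endpoints can be pulled out of the health output. A small sketch, assuming the plain one-line-per-endpoint format shown above; `unhealthy_endpoints` is a hypothetical helper name:

```shell
# Hypothetical helper: reads `endpoint health --cluster` output on stdin and
# prints the URL of every endpoint reported as unhealthy.
unhealthy_endpoints() {
  awk '/ is unhealthy/ { print $1 }'
}

# Possible usage (stderr is merged in case the health lines are written there):
#   etcdctl ... endpoint health --cluster 2>&1 | unhealthy_endpoints
```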
    

    Check the etcd container logs:

    ┌──[root@vms100.liruilongs.github.io]-[~]
    └─$docker ps -a | grep etcd
    0f2f98ebf8c3   a8a176a5d5d6                                        "etcd --advertise-cl…"   4 minutes ago   Exited (2) 4 minutes ago                  k8s_etcd_etcd-vms100.liruilongs.github.io_kube-system_e8c17bb99f9bd8119cdd769556041e18_1252
    a4b39d16a753   registry.aliyuncs.com/google_containers/pause:3.8   "/pause"                 4 hours ago     Up 4 hours                                k8s_POD_etcd-vms100.liruilongs.github.io_kube-system_e8c17bb99f9bd8119cdd769556041e18_54
    ┌──[root@vms100.liruilongs.github.io]-[~]
    └─$docker logs 0f2f98ebf8c3
    {"level":"info","ts":"2024-03-16T14:46:54.644Z","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["etcd","--advertise-client-urls=https://192.168.26.100:2379","--cert-file=/etc/kubernetes/pki/etcd/server.crt","--client-cert-auth=true","--data-dir=/var/lib/etcd","--experimental-initial-corrupt-check=true","--experimental-watch-progress-notify-interval=5s","--initial-advertise-peer-urls=https://192.168.26.100:2380","--initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380","--key-file=/etc/kubernetes/pki/etcd/server.key","--listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379","--listen-metrics-urls=http://127.0.0.1:2381","--listen-peer-urls=https://192.168.26.100:2380","--name=vms100.liruilongs.github.io","--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt","--peer-client-cert-auth=true","--peer-key-file=/etc/kubernetes/pki/etcd/peer.key","--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt","--snapshot-count=10000","--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"]}
    {"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
    {"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["https://192.168.26.100:2380"]}
    {"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"embed/etcd.go:479","msg":"starting with peer TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, client-cert=, client-key=, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
    {"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://192.168.26.100:2379"]}
    {"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.4","git-sha":"08407ff76","go-version":"go1.16.15","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":true,"name":"vms100.liruilongs.github.io","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.26.100:2380"],"listen-peer-urls":["https://192.168.26.100:2380"],"advertise-client-urls":["https://192.168.26.100:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.26.100:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
    panic: freepages: failed to get all reachable pages (page 7744: multiple references)
    goroutine 109 [running]:
    go.etcd.io/bbolt.(*DB).freepages.func2(0xc00009c480)
            /go/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1056 +0xe9
    created by go.etcd.io/bbolt.(*DB).freepages
            /go/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1054 +0x1cd
    ┌──[root@vms100.liruilongs.github.io]-[~]
    └─$
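
The startup log is long, and the line that matters is the panic raised by bbolt, the embedded key-value store etcd uses for its backend, which suggests the local database file on this node is damaged. A small sketch for surfacing that line quickly; `first_panic` is a hypothetical helper name:

```shell
# Hypothetical helper: prints the first Go panic line from a log stream on stdin.
first_panic() {
  awk '/^panic:/ { print; exit }'
}

# Possible usage with the container ID from `docker ps -a` above:
#   docker logs 0f2f98ebf8c3 2>&1 | first_panic
```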
    

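    How to fix it

    The fastest fix here is to resync this node's data: remove the failed member from the cluster, clear its old data, and then add it back. The steps are:

    • Stop the services and clear the data directory: move the static Pod YAML files away to stop the services on the failed node, then delete the etcd data directory.
    • Remove the failed member: use the member remove command to drop the bad member; this can be run from a healthy node.
    • Add the member: use the member add command to add the failed node back.
    • Restart: move the YAML files back on the failed node to start the services.

      Note: static Pods are created from the YAML files in a designated directory, which the kubelet scans periodically. Moving or deleting a YAML file stops the corresponding static Pod automatically; likewise, adding a YAML file creates the static Pod automatically.

      Move the static Pod YAML files: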

      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$mv  /etc/kubernetes/manifests/{etcd.yaml,kube-apiserver.yaml}  /tmp/
      

      Delete the etcd data directory:

      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$rm -rf /var/lib/etcd/*
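
The post deletes the data outright. A more cautious alternative, not what was done here, is to set the old directory aside so it can still be inspected later. A minimal sketch; the function name `stash_etcd_data` and the `.bak.<timestamp>` suffix are assumptions for illustration:

```shell
# Hypothetical helper: moves the etcd data directory to a timestamped backup
# and recreates it empty, instead of deleting its contents.
stash_etcd_data() {
  data_dir="${1:-/var/lib/etcd}"
  backup_dir="${data_dir}.bak.$(date +%Y%m%d%H%M%S)"
  mv "$data_dir" "$backup_dir" || return 1
  mkdir -m 700 "$data_dir" || return 1
  # Print where the old data went so it can be inspected or removed later.
  printf '%s\n' "$backup_dir"
}
```

Run it only after the etcd static Pod has stopped, for example `stash_etcd_data /var/lib/etcd`.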
      

      Confirm that both etcd and the apiserver on the node have stopped:

      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$kubectl get pod -A -o wide | grep apiserver
      kube-system          kube-apiserver-vms101.liruilongs.github.io            1/1     Running   272 (4h15m ago)   415d   192.168.26.101   vms101.liruilongs.github.io              
      kube-system          kube-apiserver-vms102.liruilongs.github.io            1/1     Running   246 (4h15m ago)   415d   192.168.26.102   vms102.liruilongs.github.io              
      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$kubectl get pod -A -o wide | grep etcd
      kube-system          etcd-vms101.liruilongs.github.io                      1/1     Running   167 (4h15m ago)   415d   192.168.26.101   vms101.liruilongs.github.io              
      kube-system          etcd-vms102.liruilongs.github.io                      1/1     Running   173 (4h15m ago)   415d   192.168.26.102   vms102.liruilongs.github.io              
      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$
      

      Get the ID of the failed member. The following commands are run on a healthy etcd node; alternatively, adjust --endpoints.

      ┌──[root@vms101.liruilongs.github.io]-[~]
      └─$ETCDCTL_API=3 etcdctl  --endpoints https://192.168.26.101:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
      +------------------+---------+-----------------------------+-----------------------------+-----------------------------+
      |        ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
      +------------------+---------+-----------------------------+-----------------------------+-----------------------------+
      |  ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
      | 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
      | b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
      +------------------+---------+-----------------------------+-----------------------------+-----------------------------+
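
If the member ID needs to be captured in a script rather than read off the table, the default comma-separated output of `member list` (without `-w table`) is easier to parse. A sketch under the assumption that each line has the form `ID, STATUS, NAME, PEER ADDRS, CLIENT ADDRS, IS LEARNER`; `member_id_by_name` is a hypothetical helper name:

```shell
# Hypothetical helper: reads default-format `member list` output on stdin and
# prints the ID of the member whose name equals the first argument.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Possible usage:
#   etcdctl ... member list | member_id_by_name vms100.liruilongs.github.io
```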
      

      Remove the failed member:

      ┌──[root@vms101.liruilongs.github.io]-[~]
      └─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt"  member remove ee392e5273e89e2
      Member  ee392e5273e89e2 removed from cluster 4816f346663d82a7
      

      Add it back:

      ┌──[root@vms101.liruilongs.github.io]-[~]
      └─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt"  member add vms100.liruilongs.github.io --peer-urls=https://192.168.26.100:2380
      Member 456f71fdc1ad9917 added to cluster 4816f346663d82a7
      ETCD_NAME="vms100.liruilongs.github.io"
      ETCD_INITIAL_CLUSTER="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380"
      ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.26.100:2380"
      ETCD_INITIAL_CLUSTER_STATE="existing"
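
The ETCD_INITIAL_CLUSTER and ETCD_INITIAL_CLUSTER_STATE values printed by `member add` correspond to the etcd flags `--initial-cluster` and `--initial-cluster-state`, which is what the rejoining member needs in its manifest. A small sketch that turns that output into manifest-style argument lines; `member_add_to_flags` is a hypothetical helper name:

```shell
# Hypothetical helper: reads `member add` output on stdin and prints the two
# corresponding etcd flags as YAML list items for the static Pod manifest.
member_add_to_flags() {
  sed -n \
    -e 's/^ETCD_INITIAL_CLUSTER="\(.*\)"$/- --initial-cluster=\1/p' \
    -e 's/^ETCD_INITIAL_CLUSTER_STATE="\(.*\)"$/- --initial-cluster-state=\1/p'
}
```

The printed lines can then be compared with, or copied into, the `command` section of /etc/kubernetes/manifests/etcd.yaml on the node being re-added.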
      

      Back on the vms100 machine, move the YAML files back to restore the node:

      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$mv /tmp/{etcd.yaml,kube-apiserver.yaml} /etc/kubernetes/manifests/
      

      Check the Pod status:

      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$kubectl get pod -A -o wide | grep etcd
      kube-system          etcd-vms100.liruilongs.github.io                      1/1     Running   0                 16s    192.168.26.100   vms100.liruilongs.github.io              
      kube-system          etcd-vms101.liruilongs.github.io                      1/1     Running   167 (4h32m ago)   415d   192.168.26.101   vms101.liruilongs.github.io              
      kube-system          etcd-vms102.liruilongs.github.io                      1/1     Running   173 (4h32m ago)   415d   192.168.26.102   vms102.liruilongs.github.io              
      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$kubectl get pod -A -o wide | grep apiserver
      kube-system          kube-apiserver-vms100.liruilongs.github.io            1/1     Running   0                 24s    192.168.26.100   vms100.liruilongs.github.io              
      kube-system          kube-apiserver-vms101.liruilongs.github.io            1/1     Running   272 (4h32m ago)   415d   192.168.26.101   vms101.liruilongs.github.io              
      kube-system          kube-apiserver-vms102.liruilongs.github.io            1/1     Running   246 (4h32m ago)   415d   192.168.26.102   vms102.liruilongs.github.io              
      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$
      

      Check the etcd cluster status:

      ┌──[root@vms101.liruilongs.github.io]-[~]
      └─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
      +------------------+-----------+-----------------------------+-----------------------------+-----------------------------+
      |        ID        |  STATUS   |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
      +------------------+-----------+-----------------------------+-----------------------------+-----------------------------+
      | 54952f3b494c0286 | unstarted |                             | https://192.168.26.100:2380 |                             |
      | 70059e836d19883d |   started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
      | b8cb9f66c2e63b91 |   started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
      +------------------+-----------+-----------------------------+-----------------------------+-----------------------------+
      

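      Here we find that the newly added member is not in a normal state; it stays unstarted.

      Running etcdctl on the failed node shows that it has not joined the cluster and is running as a single-node cluster instead.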

      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
      +-----------------+---------+-----------------------------+-----------------------------+-----------------------------+
      |       ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
      +-----------------+---------+-----------------------------+-----------------------------+-----------------------------+
      | ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
      +-----------------+---------+-----------------------------+-----------------------------+-----------------------------+
      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" endpoint status --cluster  -w table
      +-----------------------------+-----------------+---------+---------+-----------+-----------+------------+
      |          ENDPOINT           |       ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
      +-----------------------------+-----------------+---------+---------+-----------+-----------+------------+
      | https://192.168.26.100:2379 | ee392e5273e89e2 |   3.5.4 |  815 kB |      true |         2 |       2261 |
      +-----------------------------+-----------------+---------+---------+-----------+-----------+------------+
      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$
      

      It has not synced the current cluster's data either:

      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$kubectl get pod -A -o wide  --server=https://vms100.liruilongs.github.io:6443
      No resources found
      

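      In most cases this is caused by a problem in a node's etcd configuration file. In my case, the etcd manifest on the failed node contained no information about the other cluster members, so that information has to be written into it.

      The original configuration file: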

      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$cat /etc/kubernetes/manifests/etcd.yaml
      apiVersion: v1
      kind: Pod
      metadata:
        annotations:
          kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.26.100:2379
        creationTimestamp: null
        labels:
          component: etcd
          tier: control-plane
        name: etcd
        namespace: kube-system
      spec:
        containers:
        - command:
          - etcd
          - --advertise-client-urls=https://192.168.26.100:2379
          - --cert-file=/etc/kubernetes/pki/etcd/server.crt
          - --client-cert-auth=true
          - --data-dir=/var/lib/etcd
          - --experimental-initial-corrupt-check=true
          - --experimental-watch-progress-notify-interval=5s
          - --initial-advertise-peer-urls=https://192.168.26.100:2380
          - --initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380
          - --key-file=/etc/kubernetes/pki/etcd/server.key
          - --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379
          - --listen-metrics-urls=http://127.0.0.1:2381
          - --listen-peer-urls=https://192.168.26.100:2380
          - --name=vms100.liruilongs.github.io
          - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
          - --peer-client-cert-auth=true
          - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
          - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
          - --snapshot-count=10000
          - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
          image: registry.aliyuncs.com/google_containers/etcd:3.5.4-0
      ...
      

      The cluster information was incomplete; this is the configuration file after adding it:

      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$cat /etc/kubernetes/manifests/etcd.yaml
      apiVersion: v1
      kind: Pod
      metadata:
        annotations:
          kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.26.100:2379
        creationTimestamp: null
        labels:
          component: etcd
          tier: control-plane
        name: etcd
        namespace: kube-system
      spec:
        containers:
        - command:
          - etcd
          - --advertise-client-urls=https://192.168.26.100:2379
          - --cert-file=/etc/kubernetes/pki/etcd/server.crt
          - --client-cert-auth=true
          - --data-dir=/var/lib/etcd
          - --experimental-initial-corrupt-check=true
          - --experimental-watch-progress-notify-interval=5s
          - --initial-advertise-peer-urls=https://192.168.26.100:2380
          - --initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380
          - --initial-cluster-state=existing
          - --key-file=/etc/kubernetes/pki/etcd/server.key
          - --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379
          - --listen-metrics-urls=http://127.0.0.1:2381
          - --listen-peer-urls=https://192.168.26.100:2380
          - --name=vms100.liruilongs.github.io
          - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
          - --peer-client-cert-auth=true
          - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
          - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
          - --snapshot-count=10000
          - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      

      We then run the recovery again in the same way as above, and find that the node does not come up at all:

      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$kubectl get pod -A -o wide | grep apiserver
      kube-system          kube-apiserver-vms100.liruilongs.github.io            0/1     CrashLoopBackOff   1 (18s ago)       39s    192.168.26.100   vms100.liruilongs.github.io              
      kube-system          kube-apiserver-vms101.liruilongs.github.io            1/1     Running            272 (5h29m ago)   415d   192.168.26.101   vms101.liruilongs.github.io              
      kube-system          kube-apiserver-vms102.liruilongs.github.io            1/1     Running            246 (5h29m ago)   415d   192.168.26.102   vms102.liruilongs.github.io              
      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$kubectl get pod -A -o wide | grep etcd
      kube-system          etcd-vms100.liruilongs.github.io                      0/1     CrashLoopBackOff   3 (21s ago)       53s    192.168.26.100   vms100.liruilongs.github.io              
      kube-system          etcd-vms101.liruilongs.github.io                      1/1     Running            167 (5h29m ago)   415d   192.168.26.101   vms101.liruilongs.github.io              
      kube-system          etcd-vms102.liruilongs.github.io                      1/1     Running            173 (5h29m ago)   415d   192.168.26.102   vms102.liruilongs.github.io              
      

      Check the logs:

      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$kubectl logs etcd-vms100.liruilongs.github.io -n kube-system
      .............................
      {"level":"fatal","ts":"2024-03-16T16:25:19.981Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"error validating peerURLs {ClusterID:4816f346663d82a7 Members:[&{ID:b8cb9f66c2e63b91 RaftAttributes:{PeerURLs:[https://192.168.26.102:2380] IsLearner:false} Attributes:{Name:vms102.liruilongs.github.io ClientURLs:[https://192.168.26.102:2379]}} &{ID:3fbbbed942c51f7b RaftAttributes:{PeerURLs:[https://192.168.26.100:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}} &{ID:70059e836d19883d RaftAttributes:{PeerURLs:[https://192.168.26.101:2380] IsLearner:false} Attributes:{Name:vms101.liruilongs.github.io ClientURLs:[https://192.168.26.101:2379]}}] RemovedMemberIDs:[]}: member count is unequal","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
      

      The log contains a useful hint, RemovedMemberIDs:[]}: member count is unequal, meaning the member counts do not match. Looking at the log in more detail:

      {
          "level": "info",
          "ts": "2024-03-16T16:25:19.961Z",
          "caller": "etcdmain/etcd.go:73",
          "msg": "Running: ",
          "args": [
              "etcd",
              "--advertise-client-urls=https://192.168.26.100:2379",
              "--cert-file=/etc/kubernetes/pki/etcd/server.crt",
              "--client-cert-auth=true",
              "--data-dir=/var/lib/etcd",
              "--experimental-initial-corrupt-check=true",
              "--experimental-watch-progress-notify-interval=5s",
              "--initial-advertise-peer-urls=https://192.168.26.100:2380",
              "--initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380",
              "--initial-cluster-state=existing",
              "--key-file=/etc/kubernetes/pki/etcd/server.key",
              "--listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379",
              "--listen-metrics-urls=http://127.0.0.1:2381",
              "--listen-peer-urls=https://192.168.26.100:2380",
              "--name=vms100.liruilongs.github.io",
              "--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt",
              "--peer-client-cert-auth=true",
              "--peer-key-file=/etc/kubernetes/pki/etcd/peer.key",
              "--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt",
              "--snapshot-count=10000",
              "--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"
          ]
      }
      ..............................................................................
      {
          "level": "warn",
          "ts": "2024-03-16T16:25:19.981Z",
          "caller": "etcdmain/etcd.go:146",
          "msg": "failed to start etcd",
          "error": "error validating peerURLs {ClusterID:4816f346663d82a7 Members:[&{ID:b8cb9f66c2e63b91 RaftAttributes:{PeerURLs:[https://192.168.26.102:2380] IsLearner:false} Attributes:{Name:vms102.liruilongs.github.io ClientURLs:[https://192.168.26.102:2379]}} &{ID:3fbbbed942c51f7b RaftAttributes:{PeerURLs:[https://192.168.26.100:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}} &{ID:70059e836d19883d RaftAttributes:{PeerURLs:[https://192.168.26.101:2380] IsLearner:false} Attributes:{Name:vms101.liruilongs.github.io ClientURLs:[https://192.168.26.101:2379]}}] RemovedMemberIDs:[]}: member count is unequal"
      }
      {
          "level": "fatal",
          "ts": "2024-03-16T16:25:19.981Z",
          "caller": "etcdmain/etcd.go:204",
          "msg": "discovery failed",
          "error": "error validating peerURLs {ClusterID:4816f346663d82a7 Members:[&{ID:b8cb9f66c2e63b91 RaftAttributes:{PeerURLs:[https://192.168.26.102:2380] IsLearner:false} Attributes:{Name:vms102.liruilongs.github.io ClientURLs:[https://192.168.26.102:2379]}} &{ID:3fbbbed942c51f7b RaftAttributes:{PeerURLs:[https://192.168.26.100:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}} &{ID:70059e836d19883d RaftAttributes:{PeerURLs:[https://192.168.26.101:2380] IsLearner:false} Attributes:{Name:vms101.liruilongs.github.io ClientURLs:[https://192.168.26.101:2379]}}] RemovedMemberIDs:[]}: member count is unequal",
          "stacktrace": "go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"
      }
      

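      The error lists three cluster members, including vms102.liruilongs.github.io, so the problem may be related to that node.

      Let's look at the configuration file on vms102.liruilongs.github.io: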

      ┌──[root@vms102.liruilongs.github.io]-[~]
      └─$cat /etc/kubernetes/manifests/etcd.yaml
      apiVersion: v1
      kind: Pod
      metadata:
        annotations:
          kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.26.102:2379
        creationTimestamp: null
        labels:
          component: etcd
          tier: control-plane
        name: etcd
        namespace: kube-system
      spec:
        containers:
        - command:
          - etcd
          - --advertise-client-urls=https://192.168.26.102:2379
          - --cert-file=/etc/kubernetes/pki/etcd/server.crt
          - --client-cert-auth=true
          - --data-dir=/var/lib/etcd
          - --experimental-initial-corrupt-check=true
          - --experimental-watch-progress-notify-interval=5s
          - --initial-advertise-peer-urls=https://192.168.26.102:2380
          - --initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380
          - --initial-cluster-state=existing
          - --key-file=/etc/kubernetes/pki/etcd/server.key
          - --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.102:2379
          - --listen-metrics-urls=http://127.0.0.1:2381
          - --listen-peer-urls=https://192.168.26.102:2380
          - --name=vms102.liruilongs.github.io
          - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
          - --peer-client-cert-auth=true
          - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
          - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
          - --snapshot-count=10000
          - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      

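      Comparing the manifests shows that the configuration applied earlier on the failed node is still wrong: its --initial-cluster value is missing the vms102.liruilongs.github.io=https://192.168.26.102:2380 member. The first line below is the value on the failed node, the second is the correct value: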

      "--initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380",
      "--initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380"
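
The same comparison can be done mechanically instead of by eye. The following is a minimal sketch, not part of etcd or kubeadm: the helper names `parse_initial_cluster` and `missing_members` are hypothetical, and the two input strings are simply the two lines shown above.

```python
PREFIX = "--initial-cluster="


def parse_initial_cluster(flag):
    """Turn an --initial-cluster flag (JSON-quoted or YAML list form) into {name: peer_url}."""
    idx = flag.find(PREFIX)
    value = flag[idx + len(PREFIX):] if idx != -1 else flag
    # Drop surrounding whitespace, JSON quotes and a trailing comma.
    value = value.strip().strip('",')
    members = {}
    for entry in value.split(","):
        name, _, url = entry.partition("=")
        members[name.strip()] = url.strip()
    return members


def missing_members(actual_flag, expected_flag):
    """Names present in expected_flag but absent from actual_flag."""
    actual = parse_initial_cluster(actual_flag)
    expected = parse_initial_cluster(expected_flag)
    return sorted(set(expected) - set(actual))


# Value found on the failed node (as printed in its container arguments).
broken = ('"--initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,'
          'vms101.liruilongs.github.io=https://192.168.26.101:2380",')
# Value from the manifest of a healthy node.
healthy = ('- --initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,'
           'vms102.liruilongs.github.io=https://192.168.26.102:2380,'
           'vms101.liruilongs.github.io=https://192.168.26.101:2380')

print(missing_members(broken, healthy))  # ['vms102.liruilongs.github.io']
```

An empty list means the failed node's flag names every member that the healthy node's flag does.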
      

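      After correcting the configuration, repeat the same recovery procedure described above. The node then comes back.

      Check the result with etcdctl: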

      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
      +------------------+---------+-----------------------------+-----------------------------+-----------------------------+
      |        ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
      +------------------+---------+-----------------------------+-----------------------------+-----------------------------+
      | 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
      | ac5f6045dbe477b3 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
      | b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
      +------------------+---------+-----------------------------+-----------------------------+-----------------------------+
      
      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" endpoint status --cluster  -w table
      +-----------------------------+------------------+---------+---------+-----------+-----------+------------+
      |          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
      +-----------------------------+------------------+---------+---------+-----------+-----------+------------+
      | https://192.168.26.101:2379 | 70059e836d19883d |   3.5.4 |   88 MB |     false |       603 |   22227327 |
      | https://192.168.26.100:2379 | ac5f6045dbe477b3 |   3.5.4 |   88 MB |     false |       603 |   22227327 |
      | https://192.168.26.102:2379 | b8cb9f66c2e63b91 |   3.5.4 |   88 MB |      true |       603 |   22227327 |
      +-----------------------------+------------------+---------+---------+-----------+-----------+------------+
      ┌──[root@vms100.liruilongs.github.io]-[~]
      └─$
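
For repeated checks, the `endpoint status` table can be validated by a small script. This is a rough sketch under stated assumptions: `parse_table` and `check_cluster` are hypothetical helper names, the input is the `-w table` text exactly as shown above, and "healthy" here only means three endpoints, exactly one leader, and a single shared raft term.

```python
from typing import Dict, List

SAMPLE = """
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://192.168.26.101:2379 | 70059e836d19883d |   3.5.4 |   88 MB |     false |       603 |   22227327 |
| https://192.168.26.100:2379 | ac5f6045dbe477b3 |   3.5.4 |   88 MB |     false |       603 |   22227327 |
| https://192.168.26.102:2379 | b8cb9f66c2e63b91 |   3.5.4 |   88 MB |      true |       603 |   22227327 |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
"""


def parse_table(text: str) -> List[Dict[str, str]]:
    """Parse etcdctl `-w table` output into one dict per data row, keyed by header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders (+---+), blank lines and shell prompts
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def check_cluster(rows: List[Dict[str, str]], expected_members: int = 3) -> List[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if len(rows) != expected_members:
        problems.append(f"expected {expected_members} endpoints, got {len(rows)}")
    leaders = [r["ENDPOINT"] for r in rows if r["IS LEADER"] == "true"]
    if len(leaders) != 1:
        problems.append(f"expected exactly 1 leader, got {len(leaders)}")
    terms = {r["RAFT TERM"] for r in rows}
    if len(terms) > 1:
        problems.append(f"members disagree on raft term: {sorted(terms)}")
    return problems


print(check_cluster(parse_table(SAMPLE)))  # []
```

The raft index is deliberately not compared, since followers on a live cluster can briefly lag the leader by a few entries.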
      

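      The failed node has recovered. In practice, after adding the member back to the cluster, always confirm that the etcd manifest on the failed node is actually correct.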
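
One way to make that confirmation concrete is to compare the manifest's `--initial-cluster` value against what `etcdctl member list -w table` reports. The sketch below assumes the manifest text and the table text have already been captured as strings; `manifest_members` and `member_list_members` are hypothetical helpers, and the sample data is copied from the output shown earlier.

```python
import re

MANIFEST_SNIPPET = """
    - --initial-advertise-peer-urls=https://192.168.26.102:2380
    - --initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380
    - --initial-cluster-state=existing
"""

MEMBER_LIST = """
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| ac5f6045dbe477b3 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
"""


def manifest_members(manifest_text):
    """Return {(name, peer_url)} from the --initial-cluster line of an etcd manifest."""
    # The "=" right after "cluster" keeps --initial-cluster-state from matching.
    match = re.search(r"--initial-cluster=(\S+)", manifest_text)
    if match is None:
        return set()
    pairs = set()
    for entry in match.group(1).split(","):
        name, _, url = entry.partition("=")
        pairs.add((name, url))
    return pairs


def member_list_members(table_text):
    """Return {(name, peer_url)} from `etcdctl member list -w table` output."""
    pairs = set()
    for line in table_text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Data rows have five cells; skip the header row and the +---+ borders.
        if len(cells) == 5 and cells[2] != "NAME":
            pairs.add((cells[2], cells[3]))
    return pairs


in_manifest = manifest_members(MANIFEST_SNIPPET)
in_cluster = member_list_members(MEMBER_LIST)
print("only in cluster:", sorted(in_cluster - in_manifest))
print("only in manifest:", sorted(in_manifest - in_cluster))
```

Both differences being empty indicates that the manifest and the running cluster agree on member names and peer URLs.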


      © 2018-2024 liruilonger@gmail.com, All rights reserved. Attribution-NonCommercial-ShareAlike (CC BY-NC-SA 4.0)

