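Our servers were out of warranty and due to be decommissioned, so etcd had to be migrated to new machines. Below is a record of the pitfalls I ran into.
Migration procedure
1. Back up the data
Back up v2: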
$ etcdctl --endpoints <endpoints> --ca-file <ca-file> --cert-file <cert-file> --key-file <key-file> backup --data-dir <data-dir> --backup-dir <back-dir>
Example:
$ etcdctl --endpoints <endpoints> --ca-file <ca-file> --cert-file <cert-file> --key-file <key-file> backup --data-dir /home/work/etcd/data --backup-dir /home/work/backup/etcd
Note: here the data directory is /home/work/etcd/data and the backup path is /home/work/backup/etcd.
Back up v3:
First set the API version to v3:
$ echo "export ETCDCTL_API=3" >> ~/.bashrc
$ source ~/.bashrc
Back up the v3 data:
$ etcdctl --endpoints <endpoints> --cacert <ca-file> --cert <cert-file> --key <key-file> snapshot save <back-file>
Example:
$ etcdctl --endpoints <endpoints> --cacert <ca-file> --cert <cert-file> --key <key-file> snapshot save /home/work/backup/etcd/member/snap/db
Note: here the snapshot is saved to /home/work/backup/etcd/member/snap/db. This path is tied to the v2 backup path as follows: <v2-backdir>/member/snap/db
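Appending `export ETCDCTL_API=3` to ~/.bashrc switches every later etcdctl call to v3, while some commands further down (for example `member update <id> <url>` with a positional URL) use the v2 syntax. One alternative, shown here only as a sketch, is to set the variable per command. The snippet uses `sh -c` as a stand-in for etcdctl so it can run anywhere:

```shell
# Make sure the variable is not set in the current shell.
unset ETCDCTL_API

# A VAR=value prefix applies to that single command only.
inside="$(ETCDCTL_API=3 sh -c 'printf "%s" "${ETCDCTL_API:-unset}"')"

# The current shell is unchanged afterwards.
after="${ETCDCTL_API:-unset}"

echo "inside the command: $inside, afterwards: $after"
```

With this pattern the v3 backup becomes `ETCDCTL_API=3 etcdctl ... snapshot save <back-file>`, and the shell default stays untouched.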
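Copy the data to the new node
Pack the data on the old node:
$ cd /home/work/backup # enter the backup path
$ tar -zcvf etcd.tar.gz etcd # pack the data
Transfer it to the new node:
$ scp etcd.tar.gz root@xxxx:/home/work/backup # scp to the new machine (one machine is enough)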
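The article does not verify the archive after the transfer. A minimal sketch of one way to do so is to record a checksum on the old node and check it on the new one. The scratch directory and the placeholder db file below are hypothetical stand-ins for /home/work/backup and the real snapshot:

```shell
set -eu

# Hypothetical scratch directory standing in for /home/work/backup.
backup_root="$(mktemp -d)"
mkdir -p "$backup_root/etcd/member/snap"
echo "placeholder" > "$backup_root/etcd/member/snap/db"   # stand-in for the real snapshot

cd "$backup_root"
tar -zcf etcd.tar.gz etcd                     # same packing step as above, without -v
sha256sum etcd.tar.gz > etcd.tar.gz.sha256    # record the checksum on the old node

# On the new node, after copying both files into the same directory:
sha256sum -c etcd.tar.gz.sha256               # fails if the archive changed in transit

# Confirm that the v3 snapshot is actually inside the archive.
if tar -ztf etcd.tar.gz | grep -x 'etcd/member/snap/db' > /dev/null; then
  archive_ok=yes
else
  archive_ok=no
fi
echo "archive_ok=$archive_ok"
```

In a real run, etcd.tar.gz.sha256 would be copied with scp alongside the archive.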
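2. Restore on the new cluster
Unpack the data received from the old node
$ cd /home/work/backup
$ tar -zxvf etcd.tar.gz
$ mv etcd /home/work/data/
Note: the data received from the old node is in /home/work/backup. After unpacking, it has to be moved into the etcd data directory, which here is /home/work/data/etcd.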
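Start the new node (node new-01)
The backup contains the old service's cluster information. Because we have migrated, that information must be overwritten (user data is not affected). Add the setting force-new-cluster: true to the configuration. Once the service has started successfully the old cluster information has been overwritten, so remove the setting and restart the service.
Note: do not add other nodes' information to the node configuration too early. Configure only the current node; the remaining nodes are added one by one later.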
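The two-phase start can be sketched as follows. The article does not show its /etc/etcd/etcd.yml, so the fragment is an assumption: the key names follow etcd's YAML configuration file, the ports are taken from the logs later in the article (peers on 2379, clients on 4001), `<new-01-ip>` is a placeholder, and the TLS settings are omitted. The sketch writes to a temporary file rather than the real configuration:

```shell
# Hypothetical single-node configuration, written to a scratch file.
cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
name: new-01
data-dir: /home/work/data/etcd
listen-peer-urls: https://0.0.0.0:2379
listen-client-urls: https://0.0.0.0:4001
initial-advertise-peer-urls: https://<new-01-ip>:2379
advertise-client-urls: https://<new-01-ip>:4001
force-new-cluster: true
EOF

# Phase 1: start etcd once with this file so the old cluster information
# is overwritten (e.g. ./etcd --config-file <file>; not executed here).

# Phase 2: drop the flag again before the next restart.
sed -i '/^force-new-cluster:/d' "$cfg"
cat "$cfg"
```

No other members appear in the file at this stage, matching the note above.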
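Correct the current node's peerURLs
During the migration the current node ended up with wrong peerURLs, which has to be corrected.
Check the member information:
$ etcdctl --endpoints <endpoints> member list # list the members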
746832a7a901: name=new-10 peerURLs=http://localhost:2380 clientURLs=https://xxx:4001 isLeader=true
Here peerURLs=http://localhost:2380 differs from the configuration and has to be set again:
$ etcdctl --endpoints <endpoints> member update 746832a7a901 https://xxxx:2379 # update the member's peerURLs
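If the member ID has to be picked out of the `member list` output by a script, a small awk filter is enough. This is only a sketch: it assumes the v2-style output format shown above and uses the sample line from this article as input instead of calling etcdctl:

```shell
# Sample line copied from the member list output above.
member_list='746832a7a901: name=new-10 peerURLs=http://localhost:2380 clientURLs=https://xxx:4001 isLeader=true'

# Print the ID (first field, without the trailing colon) of every member
# whose peer URL still points at localhost.
stale_id="$(printf '%s\n' "$member_list" |
  awk '/peerURLs=http:\/\/localhost:2380/ { sub(/:$/, "", $1); print $1 }')"

echo "$stale_id"
```

The resulting ID is the first argument that `member update` expects.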
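At this point the old cluster's data has been restored on the new cluster, but the service has only one node, which does not meet the high-availability requirement, so more nodes have to be added.
3. Add the other nodes
Add node 02 first:
$ etcdctl --endpoints <endpoints> member add new-02 https://xxx:2379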
Added member named new-02 with ID 9d152780886604c2 to cluster
ETCD_NAME="new-02"
ETCD_INITIAL_CLUSTER="new-01=https://xxx:2379,new-02=https://xxx:2379"
ETCD_INITIAL_CLUSTER_STATE="existing"
Start node 02. Its key settings must be set to the values printed above:
ETCD_NAME="new-02"
ETCD_INITIAL_CLUSTER="new-01=https://xxx:2379,new-02=https://xxx:2379"
ETCD_INITIAL_CLUSTER_STATE="existing"
Repeat the same steps to add the remaining nodes one at a time.
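The point of adding nodes one at a time is that each joining node's ETCD_INITIAL_CLUSTER lists only the members added so far. A minimal sketch of that rule, with hypothetical host names (host-01 to host-03) in place of the elided addresses:

```shell
# Members in the order they are added; names and hosts are hypothetical.
members='new-01=https://host-01:2379
new-02=https://host-02:2379
new-03=https://host-03:2379'

# Build the initial-cluster string for the node that joins as member number $1:
# the first $1 entries, joined by commas.
initial_cluster_for() {
  printf '%s\n' "$members" | head -n "$1" | paste -sd, -
}

initial_cluster_for 2   # value for new-02 when it joins
initial_cluster_for 3   # value for new-03 when it joins
```

In practice the value printed by `member add` is authoritative; the helper only illustrates why a later node must not appear in an earlier node's configuration.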
Problems encountered
Problem 1: failed to find database snapshot file (snap: snapshot file doesn't exist)
Error log:
2020-09-07 18:55:26.185156 I | etcdmain: Loading server configuration from "/etc/etcd/etcd.yml"
2020-09-07 18:55:26.186297 I | etcdmain: etcd Version: 3.1.20
2020-09-07 18:55:26.186323 I | etcdmain: Git SHA: 992dbd4d1
2020-09-07 18:55:26.186334 I | etcdmain: Go Version: go1.8.7
2020-09-07 18:55:26.186351 I | etcdmain: Go OS/Arch: linux/amd64
2020-09-07 18:55:26.186362 I | etcdmain: setting maximum number of CPUs to 32, total number of available CPUs is 32
2020-09-07 18:55:26.186415 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-09-07 18:55:26.186452 I | embed: peerTLS: cert = /etc/etcd/ssl/server.crt, key = /etc/etcd/ssl/server.key, ca = , trusted-ca = /etc/etcd/ssl/ca.crt, client-cert-auth = true
2020-09-07 18:55:26.208790 I | embed: listening for peers on https://0.0.0.0:2379
2020-09-07 18:55:26.209014 I | embed: listening for client requests on 0.0.0.0:4001
2020-09-07 18:55:26.210117 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2020-09-07 18:55:27.195133 I | etcdserver: recovered store from snapshot at index 550058
2020-09-07 18:55:27.197062 C | etcdserver: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xb6653c]
goroutine 1 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer.func2(0xc4201e67c0, 0xc42031f720)
/tmp/etcd-release-3.1.20/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:289 +0x3c
panic(0xd083a0, 0xc4211500a0)
/usr/local/google/home/jpbetz/.gvm/gos/go1.8.7/src/runtime/panic.go:489 +0x2cf
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc4201c55e0, 0xec7066, 0x2a, 0xc4201e65d0, 0x1, 0x1)
/tmp/etcd-release-3.1.20/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x15c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0xc420232b00, 0x0, 0x135f4e0, 0xc42104b0c0)
/tmp/etcd-release-3.1.20/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:385 +0x32d4
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(0xc4201cc700, 0xc420268b00, 0x0, 0x0)
/tmp/etcd-release-3.1.20/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed/etcd.go:124 +0x70f
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc4201cc700, 0x6, 0xea43e4, 0x6, 0x1)
/tmp/etcd-release-3.1.20/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:187 +0x58
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
/tmp/etcd-release-3.1.20/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:104 +0x15ba
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
/tmp/etcd-release-3.1.20/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:39 +0x61
main.main()
/tmp/etcd-release-3.1.20/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x20
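See the issue for details: https://github.com/etcd-io/etcd/issues/9890
In this case the transferred backup directory did not contain the v3 database file member/snap/db.
Solution:
- Enable the etcdctl v3 API on the old machine
$ export ETCDCTL_API=3
- Save a v3 snapshot to /home/work/backup/etcd/member/snap/db (overwrite the file if it already exists)
$ etcdctl --endpoints <endpoints> --cacert <ca-file> --cert <cert-file> --key <key-file> snapshot save /home/work/backup/etcd/member/snap/db
- Transfer the backup data to the data path on the new machine again, then start etcd normally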
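To catch this before transferring, the backup directory can be checked for the expected files. The sketch below assumes the layout used in this article, where the v2 backup creates member/snap and member/wal under the backup directory and the v3 snapshot is saved as member/snap/db. The function name is hypothetical:

```shell
# Return 0 only if the backup directory contains both the v3 snapshot
# and the WAL directory from the v2 backup.
check_backup_dir() {
  dir="$1"
  if [ ! -f "$dir/member/snap/db" ]; then
    echo "missing v3 snapshot: $dir/member/snap/db"
    return 1
  fi
  if [ ! -d "$dir/member/wal" ]; then
    echo "missing WAL directory: $dir/member/wal"
    return 1
  fi
  echo "backup layout looks complete: $dir"
}

# Usage on the old node (path from this article):
#   check_backup_dir /home/work/backup/etcd
```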
Problem 2: etcdmain: error validating peerURLs * member count is unequal
Error log:
$ ./etcd --config-file /etc/etcd/etcd.yml
2020-09-07 19:33:09.843024 I | etcdmain: Loading server configuration from "/etc/etcd/etcd.yml"
2020-09-07 19:33:09.844168 I | etcdmain: etcd Version: 3.1.20
2020-09-07 19:33:09.844202 I | etcdmain: Git SHA: 992dbd4d1
2020-09-07 19:33:09.844220 I | etcdmain: Go Version: go1.8.7
2020-09-07 19:33:09.844230 I | etcdmain: Go OS/Arch: linux/amd64
2020-09-07 19:33:09.844247 I | etcdmain: setting maximum number of CPUs to 32, total number of available CPUs is 32
2020-09-07 19:33:09.844314 I | embed: peerTLS: cert = /etc/etcd/ssl/server.crt, key = /etc/etcd/ssl/server.key, ca = , trusted-ca = /etc/etcd/ssl/ca.crt, client-cert-auth = true
2020-09-07 19:33:09.866702 I | embed: listening for peers on https://0.0.0.0:2379
2020-09-07 19:33:09.866924 I | embed: listening for client requests on 0.0.0.0:4001
2020-09-07 19:33:09.868162 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2020-09-07 19:33:09.872117 C | etcdmain: error validating peerURLs {
ClusterID:746832a7a902 Members:[&{
ID:746832a7a901 RaftAttributes:{
PeerURLs:[https://10.142.223.47:2379]} Attributes:{
Name:new-01 ClientURLs:[https://10.142.223.47:4001]}} &{
ID:9d152780886604c2 RaftAttributes:{
PeerURLs:[https://10.142.224.4:2379]} Attributes:{
Name: ClientURLs:[]}}] RemovedMemberIDs:[]}: member count is unequal
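This problem occurs when a new node is started after being added. The cause is that its configuration lists other nodes that have not yet joined, so validation of the member configuration fails.
Solution:
When adding nodes, always follow the sequence add the new node -> start the new node, one node at a time. The configuration must strictly match the information printed when the member was added and must not include nodes that have not been added yet. Once all nodes have been added, the node configurations can be corrected one by one to include the complete peer information.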
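A quick pre-start check along these lines can flag the mismatch before etcd does. This is a sketch under assumptions: the initial-cluster string and the number of registered members are hypothetical values typed in by hand, where a real check would read them from the configuration and from `member list`:

```shell
# Count the comma-separated name=url entries in an initial-cluster string.
count_members() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -c '='
}

# Hypothetical values: the configuration already lists three nodes,
# but only two members have been registered with `member add` so far.
initial_cluster='new-01=https://xxx:2379,new-02=https://xxx:2379,new-03=https://xxx:2379'
registered=2

configured="$(count_members "$initial_cluster")"
if [ "$configured" -ne "$registered" ]; then
  echo "initial-cluster lists $configured members, the cluster has $registered: expect 'member count is unequal'"
fi
```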