碰到 ERROR: requested WAL segment xxxx has already been removed 的错误。
出现这个报错信息主要原因是主库checkpoint了,但wal日志还没有发送到备库,主库wal日志被清理,导致备库请求恢复的wal和主库实际存在的wal日志不一致。
可能的原因是:
解决办法有如下几种:
后面三种生产上都会用上,有必要实验看看效果。
修改主库参数
[pg@localhost ~]$ vi /data/db1/postgresql.conf
---------------------------------------
synchronous_commit = local
checkpoint_timeout = 30s
max_wal_size = 100MB
min_wal_size = 10MB
----------------------------------------
重载配置
[pg@localhost ~]$ pg_ctl -D /data/db1/ reload
关闭备机
[pg@localhost ~]$ pg_ctl -D /data/db2/ stop
人工切换wal日志
[pg@localhost ~]$ psql -h localhost -p 15431 -U pg postgres
postgres=# select pg_switch_wal();
pg_switch_wal
---------------
0/2C000158
(1 row)
postgres=# create table t6(a int);
CREATE TABLE
postgres=# insert into t6(a) values (generate_series(1,100));
INSERT 0 1
postgres=# select pg_switch_wal();
pg_switch_wal
---------------
0/2D0142F8
(1 row)
postgres=# checkpoint;
CHECKPOINT
postgres=#
启动备库
[pg@localhost ~]$ pg_ctl -D /data/db2/ -l /data/db2/server.log start
查看日志,报错
[pg@localhost ~]$ tail -f /data/db2/log/*
2024-01-12 14:36:17.218 CST [11689] LOG: entering standby mode
2024-01-12 14:36:17.219 CST [11689] LOG: redo starts at 0/2C000060
2024-01-12 14:36:17.219 CST [11689] LOG: consistent recovery state reached at 0/2C000108
2024-01-12 14:36:17.219 CST [11689] LOG: invalid record length at 0/2C000140: wanted 24, got 0
2024-01-12 14:36:17.220 CST [11687] LOG: database system is ready to accept read only connections
2024-01-12 14:36:17.224 CST [11693] LOG: started streaming WAL from primary at 0/2C000000 on timeline 1
2024-01-12 14:36:17.224 CST [11693] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000002C has already been removed
2024-01-12 14:36:17.228 CST [11695] LOG: started streaming WAL from primary at 0/2C000000 on timeline 1
2024-01-12 14:36:17.228 CST [11695] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000002C has already been removed
2024-01-12 14:36:22.242 CST [11717] LOG: started streaming WAL from primary at 0/2C000000 on timeline 1
2024-01-12 14:36:22.242 CST [11717] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000002C has already been removed
两种方式重做备机:
干掉原来的备库(注意备份postgresql.conf和recovery.conf)
[pg@localhost ~]$ rm -rf /data/db2
创建备库(创建完成后将postgresql.conf和recovery.conf复制回来)
[pg@localhost ~]$ pg_basebackup -Pv -Upg -h localhost -p 15431 -D /data/db2
重启备库,再查看日志,无报错。
也可以使用 pg_start_backup 和 pg_stop_backup 创建基础备份。
主要三个步骤:
执行 pg_start_backup命令
[pg@localhost ~]$ psql -h localhost -p 15431 -U pg postgres -c "select pg_start_backup('initial_backup');"
同步基础备份文件到备库
[pg@localhost ~]$ rsync -cva --inplace --exclude=*pg_wal* --exclude=postgresql.conf /data/db1/* /data/db2
执行pg_stop_backup命令
[pg@localhost ~]$ psql -h localhost -p 15431 -U pg postgres -c "select pg_stop_backup();"
重启备库,再查看日志,无报错。
通过归档复制的流程
新建归档目录
[pg@localhost ~]$ mkdir /data/arc
主库开启归档
[pg@localhost ~]$ vi /data/db1/postgresql.conf
---------------------------------------
archive_mode = on
archive_command = 'test ! -f /data/arc/%f && cp %p /data/arc/%f'
archive_timeout = 1
---------------------------------------
使用模拟故障相关操作,模拟故障,启动备库查看日志
2024-01-12 16:37:01.977 CST [12695] LOG: entering standby mode
2024-01-12 16:37:01.979 CST [12695] LOG: redo starts at 0/48000028
2024-01-12 16:37:01.979 CST [12695] LOG: consistent recovery state reached at 0/49000060
2024-01-12 16:37:01.979 CST [12695] LOG: invalid record length at 0/49000060: wanted 24, got 0
2024-01-12 16:37:01.980 CST [12693] LOG: database system is ready to accept read only connections
2024-01-12 16:37:01.984 CST [12699] LOG: started streaming WAL from primary at 0/49000000 on timeline 1
2024-01-12 16:37:01.984 CST [12699] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000049 has already been removed
2024-01-12 16:37:01.988 CST [12701] LOG: started streaming WAL from primary at 0/49000000 on timeline 1
2024-01-12 16:37:01.988 CST [12701] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000049 has already been removed
2024-01-12 16:37:07.006 CST [12703] LOG: started streaming WAL from primary at 0/49000000 on timeline 1
2024-01-12 16:37:07.006 CST [12703] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000049 has already been removed
备库配置归档恢复命令
[pg@localhost ~]$ vi /data/db2/recovery.conf
---------------------------------------
restore_command = 'cp /data/arc/%f %p'
---------------------------------------
重新启动备库,查看备库日志,可以看到备库启动后先从归档中恢复数据,然后再通过WAL流复制数据。
2024-01-12 16:39:04.646 CST [12742] LOG: entering standby mode
2024-01-12 16:39:04.681 CST [12742] LOG: restored log file "000000010000000000000048" from archive
2024-01-12 16:39:04.933 CST [12742] LOG: redo starts at 0/48000028
2024-01-12 16:39:04.948 CST [12742] LOG: restored log file "000000010000000000000049" from archive
2024-01-12 16:39:05.209 CST [12742] LOG: consistent recovery state reached at 0/49000060
2024-01-12 16:39:05.210 CST [12740] LOG: database system is ready to accept read only connections
2024-01-12 16:39:05.218 CST [12742] LOG: restored log file "00000001000000000000004A" from archive
2024-01-12 16:39:05.480 CST [12742] LOG: restored log file "00000001000000000000004B" from archive
2024-01-12 16:39:05.767 CST [12742] LOG: restored log file "00000001000000000000004C" from archive
2024-01-12 16:39:06.077 CST [12742] LOG: restored log file "00000001000000000000004D" from archive
2024-01-12 16:39:06.427 CST [12742] LOG: restored log file "00000001000000000000004E" from archive
2024-01-12 16:39:06.727 CST [12742] LOG: restored log file "00000001000000000000004F" from archive
cp: cannot stat ‘/data/arc/000000010000000000000050’: No such file or directory
2024-01-12 16:39:07.044 CST [12755] LOG: started streaming WAL from primary at 0/50000000 on timeline 1
这个实验得关闭归档,否则pg会优先使用归档。
pg10的 max_replication_slots默认为10,所以主库配置不用调整。直接创建slot即可
[pg@localhost ~]$ psql -h localhost -p 15431 -U pg postgres
psql (10.23)
Type "help" for help.
postgres=# SELECT * FROM pg_create_physical_replication_slot('s1_slot');
slot_name | lsn
-----------+-----
s1_slot |
(1 row)
备库配置 slot
[pg@localhost ~]$ vi /data/db2/recovery.conf
---------------------------------------
primary_slot_name = 's1_slot'
---------------------------------------
[pg@localhost ~]$ psql -h localhost -p 15431 -U pg postgres
postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_ls
n | confirmed_flush_lsn
-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-----------
--+---------------------
s1_slot | | physical | | | f | t | 14535 | | | 0/7F000060
|
(1 row)
启用replication slots后,备机发生故障时不会清理WAL,所以先查看故障前wal文件情况
[pg@localhost pg_wal]$ ll
total 65536
-rw-------. 1 pg pg 16777216 Jan 15 09:37 00000001000000000000006C
-rw-------. 1 pg postgres 16777216 Jan 15 09:38 00000001000000000000006D
-rw-------. 1 pg pg 16777216 Jan 15 09:38 00000001000000000000006E
-rw-------. 1 pg postgres 16777216 Jan 15 09:37 00000001000000000000006F
drwx------. 2 pg pg 6 Jan 15 09:32 archive_status
关闭备库模拟故障,查看看故障时wal文件情况
[pg@localhost pg_wal]$ ll
total 180224
-rw-------. 1 pg pg 16777216 Jan 15 10:08 00000001000000000000006E
-rw-------. 1 pg postgres 16777216 Jan 15 10:08 00000001000000000000006F
-rw-------. 1 pg pg 16777216 Jan 15 10:08 000000010000000000000070
-rw-------. 1 pg postgres 16777216 Jan 15 10:08 000000010000000000000071
-rw-------. 1 pg postgres 16777216 Jan 15 10:08 000000010000000000000072
-rw-------. 1 pg postgres 16777216 Jan 15 10:08 000000010000000000000073
-rw-------. 1 pg postgres 16777216 Jan 15 10:09 000000010000000000000074
-rw-------. 1 pg postgres 16777216 Jan 15 10:09 000000010000000000000075
-rw-------. 1 pg postgres 16777216 Jan 15 10:09 000000010000000000000076
-rw-------. 1 pg postgres 16777216 Jan 15 10:09 000000010000000000000077
-rw-------. 1 pg postgres 16777216 Jan 15 10:09 000000010000000000000078
drwx------. 2 pg pg 6 Jan 15 09:32 archive_status
在主库上执行checkpoint,wal文件仍然不会被清理。
可以看到wal没有被清理,一直在增加。
重新启动备库,数据从slot标记的位置开始恢复,无报错。
[pg@localhost ~]$ pg_ctl -D /data/db2/ -l /data/db2/server.log start
查看日志
2024-01-15 10:11:45.044 CST [14528] LOG: entering standby mode
cp: cannot stat ‘/data/arc/00000001000000000000006D’: No such file or directory
2024-01-15 10:11:45.048 CST [14528] LOG: redo starts at 0/6D000060
cp: cannot stat ‘/data/arc/00000001000000000000006E’: No such file or directory
2024-01-15 10:11:45.050 CST [14528] LOG: consistent recovery state reached at 0/6E000060
2024-01-15 10:11:45.050 CST [14528] LOG: invalid record length at 0/6E000060: wanted 24, got 0
2024-01-15 10:11:45.050 CST [14526] LOG: database system is ready to accept read only connections
2024-01-15 10:11:45.055 CST [14534] LOG: started streaming WAL from primary at 0/6E000000 on timeline 1
在主库上执行checkpoint,再查看主库的wal目录,同步过的文件已被清理。
[pg@localhost pg_wal]$ ll
total 98304
-rw-------. 1 pg postgres 16777216 Jan 15 10:10 00000001000000000000007E
-rw-------. 1 pg postgres 16777216 Jan 15 10:28 00000001000000000000007F
-rw-------. 1 pg postgres 16777216 Jan 15 10:08 000000010000000000000080
-rw-------. 1 pg postgres 16777216 Jan 15 10:08 000000010000000000000081
-rw-------. 1 pg postgres 16777216 Jan 15 10:09 000000010000000000000082
-rw-------. 1 pg postgres 16777216 Jan 15 10:09 000000010000000000000083
drwx------. 2 pg pg 6 Jan 15 09:32 archive_status