From f1bdb0951250d67b355855b768808922f4e7ab81 Mon Sep 17 00:00:00 2001
From: Ian Barwick
Date: Tue, 15 Sep 2020 14:21:14 +0900
Subject: [PATCH] doc: note existing pg_rewind corner-case bug

---
 doc/repmgr-node-rejoin.xml | 45 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/doc/repmgr-node-rejoin.xml b/doc/repmgr-node-rejoin.xml
index 71bde8ac..0427e422 100644
--- a/doc/repmgr-node-rejoin.xml
+++ b/doc/repmgr-node-rejoin.xml
@@ -401,6 +401,51 @@
 is running in mode.
+
+
+ In all current PostgreSQL versions (as of September 2020), pg_rewind
+ contains a corner-case bug which affects standbys in a very specific situation.
+
+
+ This situation occurs when a standby was shut down before its
+ primary node, and an attempt is made to attach this standby to another primary
+ in the same cluster (following a "split brain" situation where the standby
+ was connected to the wrong primary). In this case, &repmgr; will correctly determine
+ that pg_rewind should be executed; however,
+ pg_rewind incorrectly decides that no action is necessary.
+
+
+ In this situation, &repmgr; will report something like:
+
+ NOTICE: pg_rewind execution required for this node to attach to rejoin target node 1
+ DETAIL: rejoin target server's timeline 3 forked off current database system timeline 2 before current recovery point 0/7019C10
+ but when executed, pg_rewind will report:
+
+ pg_rewind: servers diverged at WAL location 0/7015540 on timeline 2
+ pg_rewind: no rewind required
+ and if an attempt is made to attach the standby to the new primary, PostgreSQL logs on the standby
+ will contain errors like:
+
+ [2020-09-07 15:01:41 UTC] LOG: 00000: replication terminated by primary server
+ [2020-09-07 15:01:41 UTC] DETAIL: End of WAL reached on timeline 2 at 0/7015540.
+ [2020-09-07 15:01:41 UTC] LOG: 00000: new timeline 3 forked off current database system timeline 2 before current recovery point 0/7019C10
+
+
+ Currently it is not possible to resolve this situation using pg_rewind.
+ A patch
+ has been submitted and will hopefully be included in a forthcoming PostgreSQL minor release.
+
+
+ As a workaround, start the primary server the standby was previously attached to,
+ and ensure the standby can be attached to it. If pg_rewind was actually executed,
+ it will have copied in the .history file from the target primary server; this must
+ be removed. repmgr node rejoin can then be used to attach the standby to the original
+ primary. Ensure any changes pending on the primary have propagated to the standby. Then shut down the primary
+ server first, before shutting down the standby. It should then be possible to
+ use repmgr node rejoin to attach the standby to the new primary.
+
+
+
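
The mismatch described in this patch (repmgr reports that pg_rewind is required, while pg_rewind reports "no rewind required") could be detected by comparing the output of the two tools. The following is a minimal Python sketch, not part of repmgr. The function name detect_rewind_mismatch is hypothetical, and the matched message fragments are assumed to have the wording shown in the example output above, which may differ between versions.

```python
import re
from typing import Optional, Tuple

# Message fragments copied from the example output in the patch above.
# Assumption: other repmgr/PostgreSQL versions may word these differently.
REPMGR_NEEDS_REWIND = re.compile(r"pg_rewind execution required")
PG_REWIND_NO_ACTION = re.compile(r"pg_rewind: no rewind required")
DIVERGED_AT = re.compile(
    r"servers diverged at WAL location (\S+) on timeline (\d+)"
)


def detect_rewind_mismatch(
    repmgr_output: str, pg_rewind_output: str
) -> Optional[Tuple[Optional[str], Optional[int]]]:
    """Hypothetical helper, not a repmgr API.

    Returns (divergence LSN, timeline) when repmgr asked for a rewind
    but pg_rewind declined to perform one; returns None otherwise.
    The tuple fields are None if the divergence line is absent.
    """
    if not REPMGR_NEEDS_REWIND.search(repmgr_output):
        return None
    if not PG_REWIND_NO_ACTION.search(pg_rewind_output):
        return None
    match = DIVERGED_AT.search(pg_rewind_output)
    if match is None:
        return (None, None)
    return (match.group(1), int(match.group(2)))
```

A non-None result only indicates that the two tools disagree. It does not repair anything; the workaround described in the patch still has to be carried out manually.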