Commit Graph

295 Commits

Author SHA1 Message Date
Ian Barwick
09979eaa91 note that "standby follow" requires a primary to be available
While it's technically possible to have a standby follow another
standby while the primary is not available, repmgr will not be able
to update its metadata, which will cause Confusion and Chaos.

Update the documentation to make this clear, and provide a more helpful
error message if this situation occurs. The operation previously
failed anyway, but with an unhelpful message about not being able to
find a node record.
2019-06-11 15:14:17 +09:00
Ian Barwick
341421f8e0 standby follow: remove some ineffective code
For some reason we were taking the trouble to extract an appliction_name
from the local node's conninfo, but this was being subsequently overwritten
with the node name (which is what we want anyway).
2019-06-06 12:12:33 +09:00
Ian Barwick
f5d29f6591 doc: update release notes 2019-06-06 11:30:30 +09:00
Ian Barwick
c0ea5ffa04 Ensure parsed value of --upstream-conninfo is written to recovery.conf
Previously it was being parsed (a step which ensures any "application_name"
set by the caller is changed to the node name), but the original string
was being copied to "primary_conninfo" anyway.
2019-06-06 11:30:24 +09:00
Ian Barwick
45e17223b9 Update variable/field names relating to pg_basebackup's -X option
Now the "xlog nomenclature" Pg versions are fading into the past,
rename things related to handling pg_basebackup's -X option
(was: --xlog-method, now: --wal-method) to start with "wal_"
rather than "xlog_".

This is a cosmetic change for code clarity.
2019-05-30 09:32:06 +09:00
Ian Barwick
c153e2fc02 standby clone: improve --dry-run output
Log positive check results as an additional confirmation that the
upstream configuration appears to be correct.
2019-05-28 00:54:39 +09:00
Ian Barwick
44a39760a1 standby clone: improve source node replication connection check
Previously, the check was attempting to make replication connections
to the source node, and if these were failing, inferring that
insufficient walsenders were available.

However it's quite likely that the connections are refused due to
insufficient user connection permissions. So before performing
the connection check, query the number of potentially available
walsenders on the source node and compare it with the number
required (either 1 or 2) - if insufficient, exit with error and
hint about increasing "max_wal_senders".

Once we've established sufficient walsenders are available, inability
to connect is most likely related to permissions issues on the source
node.
2019-05-28 00:11:53 +09:00
Ian Barwick
b959f771c1 Improve naming/usage of node record variables in "standby clone"
Make it clearer we're dealing with the upstream node record.

Also avoid "overloading" the upstream record when checking for an
existing record with the same node name; this was not technically
a problem but mildly confusing when reading the code.
2019-05-27 23:39:49 +09:00
Ian Barwick
c9e85996f5 repmgr: prevent a standby being cloned from a witness server
Previously repmgr would happily clone from whatever server
it found at the provided source server address. We should
ensure that a standby can only be cloned from a node which
is part of the main replication cluster.

This check fetches a list of nodes from the source server,
connects to the first non-witness server it finds, and
compares the system identifiers of the source node and the
node it has connected to. If there is a mismatch, then the
source server is clearly not part of the main replication
cluster, and is most likely the witness server.
2019-05-22 16:52:25 +09:00
Ian Barwick
dd78a16006 Change return type of is_downstream_node_attached() from bool to NodeAttached
This enables us to better determine whether a node is definitively
attached, definitively not attached, or if it was not possible to
determine the attached state.
2019-05-14 15:57:20 +09:00
Ian Barwick
d8e4c54ea4 "standby switchover": add "--repmgrd-force-unpause"
Implements GitHub #559.
2019-05-10 16:04:07 +09:00
Ian Barwick
4b37562444 Make it clearer that a witness node counts as a "sibling node"
It's not attached to the primary per-se, but needs to know what
the current primary is in order to correctly synchronise its
copy of the metadata.

Per GitHub #560.
2019-05-02 14:22:53 +09:00
Ian Barwick
fed09ecaae standby promote: have former siblings follow new primary 2019-05-02 12:04:49 +09:00
Ian Barwick
98d09f83b5 standby (promote|switchover): improve --dry-run functionality
Continue checks as far as possible.
2019-05-02 12:04:43 +09:00
Ian Barwick
7bbe938e19 Separate promotion candidate walsender/slot checks into discrete functions
For use by "standby promote" as well as "standby follow"
2019-05-02 12:04:40 +09:00
Ian Barwick
63c7f758c3 Remove unneeded server version number variables
No need to pass these around.
2019-05-02 12:04:33 +09:00
Ian Barwick
b9f07f6a91 standby promote: use variable name "local_conn" for the local connection handle
This is consistent with usage in other functions, and makes it easier to
differentiate between the local node connection and the primary connection.
2019-05-02 12:04:26 +09:00
Ian Barwick
e4615b4666 Refactor code for executing --siblings-follow
This will enable provision of "--siblings-follow" to "repmgr standby promote"
2019-05-02 12:04:15 +09:00
Ian Barwick
52905f1eb3 Standardize on "ID: %i" when logging node IDs
Previously there was a mix of "id:", "node id:", "node ID:" and "node_id:".
2019-04-30 17:07:33 +09:00
Ian Barwick
89a7261483 Always quote node names in log messages 2019-04-30 15:52:56 +09:00
Ian Barwick
e32acda8c0 standby switchover: ignore nodes which are unreachable and marked as inactive
Previously "repmgr standby switchover" would abort if any node was unreachable,
as that means it was unable to check if repmgrd is running.

However if the node has been marked as inactive in the repmgr metadata, it's
reasonable to assume the node is no longer part of the replication cluster
and does not need to be checked.
2019-04-29 14:35:49 +09:00
Ian Barwick
ad28cf95bd standby register: add upstream node ID in event details 2019-04-16 11:01:22 +09:00
Ian Barwick
dd454a8374 Miscellaneous string handling cleanup
This is mainly to prevent effectively spurious truncation warnings
in recent GCC versions.
2019-04-10 16:18:56 +09:00
Ian Barwick
77b9887d61 standby clone: improve --dry-run behaviour in barman mode
- emit additional informational output
- ensure that provision of --force does not result in an existing
  data directory being modified in any way
2019-04-08 15:12:22 +09:00
Ian Barwick
67e977592c standby switchover: list nodes which will remain attatched to the old primary
If --siblings-follow is not supplied, list all nodes which repmgr considers
to be siblings (this will include the witness server, if in use), and
which will remain attached to the old primary.
2019-04-02 10:46:59 +09:00
Ian Barwick
79613af8d0 Handle potential NULL return from string_skip_prefix() 2019-03-28 12:45:53 +09:00
Ian Barwick
e44c048ae2 Update code comment 2019-03-28 12:44:30 +09:00
Ian Barwick
1e1c596446 Add various missing close() calls 2019-03-28 11:32:25 +09:00
Ian Barwick
ba1f05ece9 Restrict "node_name" to maximum 63 characters
In "recovery.conf", the configuration parameter "node_name" is used
as the "application_name" value, which will be truncated by PostgreSQL
to 63 characters (NAMEDATALEN - 1).

repmgr sometimes needs to be able to extract the application name from
pg_stat_replication to determine if a node is connected (e.g. when
executing "repmgr standby register"), so the comparison will fail
if "node_name" exceeds 63 characters.
2019-03-28 10:37:57 +09:00
Ian Barwick
73ad689390 standby register: fail if --upstream-node-id is the local node ID 2019-03-27 14:22:55 +09:00
Ian Barwick
6f0f338968 standby follow: set replication user when connecting to local node 2019-03-21 16:43:39 +09:00
Ian Barwick
bd26eb3025 standby switchover: don't attempt to pause repmgrd on unreachable nodes 2019-03-21 13:48:59 +09:00
Ian Barwick
314a1e8f4f use a constant to denote unknown replication lag 2019-03-20 17:26:04 +09:00
Ian Barwick
46efe57cd0 Improve database connection failure logging
Log the output of PQerrorStatus() in a couple of places where it was missing.

Additionally, always log the output of PQerrorStatus() starting with a blank
line, otherwise the first line looks like it was emitted by repmgr, and
it's harder to scan the error message.

Before:

    [2019-03-20 11:24:15] [DETAIL] could not connect to server: Connection refused
            Is the server running on host "localhost" (::1) and accepting
            TCP/IP connections on port 5501?
    could not connect to server: Connection refused
            Is the server running on host "localhost" (127.0.0.1) and accepting
            TCP/IP connections on port 5501?

After:

    [2019-03-20 11:27:21] [DETAIL]
    could not connect to server: Connection refused
            Is the server running on host "localhost" (::1) and accepting
            TCP/IP connections on port 5501?
    could not connect to server: Connection refused
            Is the server running on host "localhost" (127.0.0.1) and accepting
            TCP/IP connections on port 5501?
2019-03-20 11:47:28 +09:00
Ian Barwick
19bf4d7434 Count witness and zero-priority nodes in visibility check 2019-03-14 11:17:51 +09:00
Ian Barwick
b1875a8d91 Split command execution functions into separate library
These may need to be executed by repmgrd.
2019-02-27 14:41:17 +09:00
Ian Barwick
0578053875 standby clone: check upstream connections after data copy operation
With long-running copy operations, it's possible the connection(s) to
the primary/source server may go away for some reason, so recheck
their availability before attempting to reuse.
2019-02-26 14:37:05 +09:00
Ian Barwick
99550b91bd standby register: warn if standby is running and connection params provided
Addresses GitHub #552.
2019-02-22 10:31:00 +09:00
Ian Barwick
f3fc4e5afb Minor syntax formatting tweak
For consistency.
2019-02-15 19:58:35 +09:00
Ian Barwick
7fad2ed2c8 standby switchover: improve error output
It wasn't clear why repmgr thinks the demotion candidate is not
the upstream of the promotion candidate.
2019-02-14 17:22:24 +09:00
Ian Barwick
464ec6bec3 Ensure conninfo param list is initialized for --recovery-conf-only option 2019-02-06 12:58:09 +09:00
Ian Barwick
3bbbf6daa9 "recovery_file_path" is MAXPGPATH 2019-02-06 10:42:09 +09:00
Ian Barwick
cd3312496e Rename functions which return an LSN for clarity 2019-02-06 09:32:53 +09:00
Ian Barwick
cce8b76171 "standby switchover": abort if promotion candidate has WAL replay paused
If replay is paused, we can't be really sure that more WAL will be received
between the check and the promote operation, which would risk the promote
operation not taking place during the switchover (it would happen
as soon as WAL replay is resumed and pending WAL is replayed).

Therefore we simply quit with an informative slew of messages and
leave the user to sort it out.

GitHub #540.
2019-02-05 16:32:39 +09:00
Ian Barwick
2a529e7e8b "standby promote": don't promote if replay paused and in archive recovery
It does not appear feasible to predict if there is still WAL waiting to
be replayed from archive. In this case take no action.

GitHub #540.
2019-02-05 14:39:08 +09:00
Ian Barwick
701944c194 "standby promote": add check for WAL replay status if replay is paused
If WAL replay is paused but WAL is still pending replay, PostgreSQL will ignore
the promote request until WAL replay is unpaused. This may lead to the standby
being promoted at an unpredictable point in time outside of repmgr's
control. Moreover it may not be obvious that this is happening, or why, and
it will appear that an apparently successful promotion attempt has not
actually worked.

To prevent this from happening, repmgr will now refuse to promote the
standy if WAL replay is paused *and* WAL is still pending replay.

GitHub #540.
2019-02-05 13:30:37 +09:00
Ian Barwick
f9a1861ded Refactor ReplInfo struct handling
Eventually we'll want to have this contain the optional replication
info contained in the t_node_info struct, which should then contain a
pointer to a ReplInfo struct.
2019-02-02 18:39:24 +09:00
Ian Barwick
b9ba97a36d "standby switchover": check replication connection to upstream
Ensure repmgr checks the standby (promotion candidate) is currently
attached to the primary (demotion candidate).

Addresses issue reported in GitHub #519.
2019-02-01 15:28:06 +09:00
Ian Barwick
9273e7af73 "standby switchover": avoid potential race condition with WAL location check
Immediately after the demotion candidate (primary) has shut down, we can't
be absolutely sure that the walreceiver has flushed all WAL to disk, so
checking pg_last_wal_receive_lsn() at that point might not reflect
the actual last available WAL location.

To handle this, we'll loop for a while (timeout controlled by configuration
parameter "wal_receive_check_timeout") before finally deciding whether
the standby is still behind the shut-down primary.

Addresses issue raised in GitHub #518.
2019-02-01 12:06:22 +09:00
Ian Barwick
d7420d7274 daemon (start|stop): verify that repmgrd starts/stops.
Note this may not always be possible for "daemon stop" if we are unable
to determine the repmgrd PID.
2019-01-30 14:41:31 +09:00