Commit Graph

274 Commits

Author SHA1 Message Date
Ian Barwick
ad28cf95bd standby register: add upstream node ID in event details 2019-04-16 11:01:22 +09:00
Ian Barwick
dd454a8374 Miscellaneous string handling cleanup
This is mainly to prevent effectively spurious truncation warnings
in recent GCC versions.
2019-04-10 16:18:56 +09:00
Ian Barwick
77b9887d61 standby clone: improve --dry-run behaviour in barman mode
- emit additional informational output
- ensure that provision of --force does not result in an existing
  data directory being modified in any way
2019-04-08 15:12:22 +09:00
Ian Barwick
67e977592c standby switchover: list nodes which will remain attached to the old primary
If --siblings-follow is not supplied, list all nodes which repmgr considers
to be siblings (this will include the witness server, if in use), and
which will remain attached to the old primary.
2019-04-02 10:46:59 +09:00
Ian Barwick
79613af8d0 Handle potential NULL return from string_skip_prefix() 2019-03-28 12:45:53 +09:00
Ian Barwick
e44c048ae2 Update code comment 2019-03-28 12:44:30 +09:00
Ian Barwick
1e1c596446 Add various missing close() calls 2019-03-28 11:32:25 +09:00
Ian Barwick
ba1f05ece9 Restrict "node_name" to maximum 63 characters
In "recovery.conf", the configuration parameter "node_name" is used
as the "application_name" value, which will be truncated by PostgreSQL
to 63 characters (NAMEDATALEN - 1).

repmgr sometimes needs to be able to extract the application name from
pg_stat_replication to determine if a node is connected (e.g. when
executing "repmgr standby register"), so the comparison will fail
if "node_name" exceeds 63 characters.
2019-03-28 10:37:57 +09:00
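The length restriction described above can be illustrated with a minimal sketch. It assumes PostgreSQL's default NAMEDATALEN of 64 and that the limit applies to bytes; both function names are hypothetical and this is not repmgr's actual code (repmgr is written in C).

```python
# Illustrative sketch only; not repmgr's implementation.
NAMEDATALEN = 64                      # PostgreSQL's default compile-time value
MAX_NODE_NAME_LEN = NAMEDATALEN - 1   # 63


def truncated_application_name(node_name: str) -> str:
    """Approximate how an over-long application_name would be cut down."""
    raw = node_name.encode("utf-8")[:MAX_NODE_NAME_LEN]
    # Drop any partial multibyte character left at the cut point.
    return raw.decode("utf-8", errors="ignore")


def validate_node_name(node_name: str) -> None:
    """Reject names which would no longer match after truncation."""
    if len(node_name.encode("utf-8")) > MAX_NODE_NAME_LEN:
        raise ValueError(
            f'"node_name" must not exceed {MAX_NODE_NAME_LEN} characters'
        )


if __name__ == "__main__":
    long_name = "n" * 70
    # The truncated value differs from the configured one, so a lookup in
    # pg_stat_replication by the configured name would find nothing.
    print(truncated_application_name(long_name) == long_name)
```

Rejecting the name up front avoids a comparison which can never succeed once the value has been truncated.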
Ian Barwick
73ad689390 standby register: fail if --upstream-node-id is the local node ID 2019-03-27 14:22:55 +09:00
Ian Barwick
6f0f338968 standby follow: set replication user when connecting to local node 2019-03-21 16:43:39 +09:00
Ian Barwick
bd26eb3025 standby switchover: don't attempt to pause repmgrd on unreachable nodes 2019-03-21 13:48:59 +09:00
Ian Barwick
314a1e8f4f use a constant to denote unknown replication lag 2019-03-20 17:26:04 +09:00
Ian Barwick
46efe57cd0 Improve database connection failure logging
Log the output of PQerrorMessage() in a couple of places where it was missing.

Additionally, always log the output of PQerrorMessage() starting with a blank
line, otherwise the first line looks like it was emitted by repmgr, and
it's harder to scan the error message.

Before:

    [2019-03-20 11:24:15] [DETAIL] could not connect to server: Connection refused
            Is the server running on host "localhost" (::1) and accepting
            TCP/IP connections on port 5501?
    could not connect to server: Connection refused
            Is the server running on host "localhost" (127.0.0.1) and accepting
            TCP/IP connections on port 5501?

After:

    [2019-03-20 11:27:21] [DETAIL]
    could not connect to server: Connection refused
            Is the server running on host "localhost" (::1) and accepting
            TCP/IP connections on port 5501?
    could not connect to server: Connection refused
            Is the server running on host "localhost" (127.0.0.1) and accepting
            TCP/IP connections on port 5501?
2019-03-20 11:47:28 +09:00
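The formatting change shown in the "After" output can be sketched as follows. The function name and signature are hypothetical and chosen only for illustration; repmgr's own logging code is written in C and is not reproduced here.

```python
# Illustrative sketch only; format_detail() is a hypothetical helper.
def format_detail(timestamp: str, connection_error: str) -> str:
    """Put the driver's error text on its own lines, below the log prefix."""
    prefix = f"[{timestamp}] [DETAIL]"
    # Starting the message on a new line keeps every line of the
    # multi-line error text visually separate from the log prefix.
    return prefix + "\n" + connection_error.rstrip("\n")


if __name__ == "__main__":
    error = (
        "could not connect to server: Connection refused\n"
        '        Is the server running on host "localhost" (::1) and accepting\n'
        "        TCP/IP connections on port 5501?\n"
    )
    print(format_detail("2019-03-20 11:27:21", error))
```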
Ian Barwick
19bf4d7434 Count witness and zero-priority nodes in visibility check 2019-03-14 11:17:51 +09:00
Ian Barwick
b1875a8d91 Split command execution functions into separate library
These may need to be executed by repmgrd.
2019-02-27 14:41:17 +09:00
Ian Barwick
0578053875 standby clone: check upstream connections after data copy operation
With long-running copy operations, it's possible the connection(s) to
the primary/source server may go away for some reason, so recheck
their availability before attempting to reuse.
2019-02-26 14:37:05 +09:00
Ian Barwick
99550b91bd standby register: warn if standby is running and connection params provided
Addresses GitHub #552.
2019-02-22 10:31:00 +09:00
Ian Barwick
f3fc4e5afb Minor syntax formatting tweak
For consistency.
2019-02-15 19:58:35 +09:00
Ian Barwick
7fad2ed2c8 standby switchover: improve error output
It wasn't clear why repmgr thinks the demotion candidate is not
the upstream of the promotion candidate.
2019-02-14 17:22:24 +09:00
Ian Barwick
464ec6bec3 Ensure conninfo param list is initialized for --recovery-conf-only option 2019-02-06 12:58:09 +09:00
Ian Barwick
3bbbf6daa9 "recovery_file_path" is MAXPGPATH 2019-02-06 10:42:09 +09:00
Ian Barwick
cd3312496e Rename functions which return an LSN for clarity 2019-02-06 09:32:53 +09:00
Ian Barwick
cce8b76171 "standby switchover": abort if promotion candidate has WAL replay paused
If replay is paused, we can't really be sure that no more WAL will be received
between the check and the promote operation, which would risk the promote
operation not taking place during the switchover (it would happen
as soon as WAL replay is resumed and pending WAL is replayed).

Therefore we simply quit with an informative slew of messages and
leave the user to sort it out.

GitHub #540.
2019-02-05 16:32:39 +09:00
Ian Barwick
2a529e7e8b "standby promote": don't promote if replay paused and in archive recovery
It does not appear feasible to predict if there is still WAL waiting to
be replayed from archive. In this case take no action.

GitHub #540.
2019-02-05 14:39:08 +09:00
Ian Barwick
701944c194 "standby promote": add check for WAL replay status if replay is paused
If WAL replay is paused but WAL is still pending replay, PostgreSQL will ignore
the promote request until WAL replay is unpaused. This may lead to the standby
being promoted at an unpredictable point in time outside of repmgr's
control. Moreover it may not be obvious that this is happening, or why, and
it will appear that an apparently successful promotion attempt has not
actually worked.

To prevent this from happening, repmgr will now refuse to promote the
standby if WAL replay is paused *and* WAL is still pending replay.

GitHub #540.
2019-02-05 13:30:37 +09:00
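The rule described above (refuse promotion only when replay is paused and WAL is still pending) can be sketched as below. This is a simplified model, not repmgr's code: the function names are hypothetical, and it assumes the receive and replay positions are available as LSN strings in PostgreSQL's usual "high/low" hexadecimal form.

```python
# Illustrative sketch only; names are hypothetical.
def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' into a comparable integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def can_promote(replay_paused: bool, receive_lsn: str, replay_lsn: str) -> bool:
    """Allow promotion unless replay is paused while WAL is still pending."""
    wal_pending = parse_lsn(receive_lsn) > parse_lsn(replay_lsn)
    return not (replay_paused and wal_pending)


if __name__ == "__main__":
    # Paused with WAL still waiting to be replayed: refuse.
    print(can_promote(True, "0/3000060", "0/3000000"))
    # Paused, but everything received has been replayed: allow.
    print(can_promote(True, "0/3000060", "0/3000060"))
```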
Ian Barwick
f9a1861ded Refactor ReplInfo struct handling
Eventually we'll want to have this contain the optional replication
info contained in the t_node_info struct, which should then contain a
pointer to a ReplInfo struct.
2019-02-02 18:39:24 +09:00
Ian Barwick
b9ba97a36d "standby switchover": check replication connection to upstream
Ensure repmgr checks the standby (promotion candidate) is currently
attached to the primary (demotion candidate).

Addresses issue reported in GitHub #519.
2019-02-01 15:28:06 +09:00
Ian Barwick
9273e7af73 "standby switchover": avoid potential race condition with WAL location check
Immediately after the demotion candidate (primary) has shut down, we can't
be absolutely sure that the walreceiver has flushed all WAL to disk, so
checking pg_last_wal_receive_lsn() at that point might not reflect
the actual last available WAL location.

To handle this, we'll loop for a while (timeout controlled by configuration
parameter "wal_receive_check_timeout") before finally deciding whether
the standby is still behind the shut-down primary.

Addresses issue raised in GitHub #518.
2019-02-01 12:06:22 +09:00
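The retry loop described above might look roughly like this. It is a sketch under stated assumptions: LSNs are modelled as plain integers, the callable which fetches the standby's last received LSN is supplied by the caller, and the function name is hypothetical. Only the idea of a timeout such as "wal_receive_check_timeout" comes from the commit message.

```python
# Illustrative sketch only; not repmgr's implementation.
import time
from typing import Callable


def standby_caught_up(
    get_receive_lsn: Callable[[], int],
    shutdown_lsn: int,
    timeout_secs: int,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll once per second until the standby has received all WAL
    written by the shut-down primary, or the timeout expires."""
    for elapsed in range(timeout_secs + 1):
        if get_receive_lsn() >= shutdown_lsn:
            return True
        if elapsed < timeout_secs:
            sleep(1)
    # Only now conclude that the standby is behind the old primary.
    return False


if __name__ == "__main__":
    # Simulated standby whose received LSN advances on each check.
    positions = iter([100, 200, 300])
    print(standby_caught_up(lambda: next(positions), 300, 5, sleep=lambda s: None))
```

Passing the sleep function as a parameter is a design choice made here purely so the loop can be exercised without real delays.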
Ian Barwick
d7420d7274 daemon (start|stop): verify that repmgrd starts/stops.
Note this may not always be possible for "daemon stop" if we are unable
to determine the repmgrd PID.
2019-01-30 14:41:31 +09:00
Ian Barwick
59eca2be30 node rejoin: improve error code handling
- return ERR_REJOIN_FAIL in all cases where the rejoin operation fails
- ensure ERR_FOLLOW_FAIL is not returned
- document error codes
2019-01-24 10:31:45 +09:00
Ian Barwick
061932d023 "node rejoin": verify status of rejoin target
This adapts the code previously added to "standby follow" to verify
whether the rejoin target can actually be rejoined.
2019-01-23 17:08:55 +09:00
Ian Barwick
3f5762e03a Refactor upstream attachment check code
Move it from the "standby follow" code to an independent function so it can
be used in other contexts, e.g. "node rejoin".
2019-01-23 15:11:42 +09:00
Ian Barwick
7dce3ed234 Update copyright notices to 2019 2019-01-21 14:54:35 +09:00
Ian Barwick
8881b69c06 "standby switchover": check remote data directory configuration
The switchover will fail if the data_directory parameter in repmgr.conf
on the remote node (demotion candidate) is incorrectly configured.
We use the previously added "repmgr node check --data-directory-config"
to verify this, and abort early if an issue is discovered.

Implements GitHub #523.
2019-01-16 16:03:49 +09:00
Ian Barwick
0b3a310802 Add --data-directory-config option to "repmgr node check"
Implements part of GitHub #523.
2019-01-16 16:03:44 +09:00
Ian Barwick
d4e993a240 Improve handling of connection URIs when executing remote commands
Previously, if connection URIs were in use and "repmgr standby switchover"
was executed, repmgr would pass the connection URI as-is to the demotion
candidate to execute "repmgr node rejoin". However the presence of
unescaped ampersands in the connection URI was causing the rejoin command
to be incorrectly executed.

Addresses GitHub #525.
2019-01-14 11:11:51 +09:00
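The commit message does not say how repmgr resolves the problem. One common way to keep an ampersand from being interpreted by the remote shell is to quote the argument, sketched here with Python's standard shlex module. The helper name, the example URI and the exact command line are assumptions for illustration only.

```python
# Illustrative sketch only; shows shell quoting, not repmgr's actual fix.
import shlex


def build_rejoin_command(conninfo: str) -> str:
    """Build a command string which is safe to hand to a remote shell."""
    # shlex.quote() wraps the value in single quotes when it contains
    # characters such as '&' or '?' which the shell would interpret.
    return "repmgr node rejoin -d " + shlex.quote(conninfo)


if __name__ == "__main__":
    uri = "postgresql://repmgr@node1/repmgr?connect_timeout=2&application_name=node2"
    print(build_rejoin_command(uri))
```

Without quoting, the shell would treat everything after the first "&" as a separate command and run the first part in the background.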
Ian Barwick
028c874f81 "standby follow": simplify check when follow target has higher timeline
No need for a CHECKPOINT here, which simplifies things considerably.
2019-01-11 16:34:04 +09:00
Ian Barwick
b3c2831bd3 repmgr: add --dry-run option to "standby promote"
Implements GitHub #522.
2019-01-10 12:36:58 +09:00
Ian Barwick
3389491151 Misc comment and log output corrections 2019-01-09 09:41:59 +09:00
Ian Barwick
313aa3c5d7 Refactor follow verification to reduce need for CHECKPOINT
A CHECKPOINT is not always required; hopefully we can narrow it down
to one corner case where we need to determine the minimum recovery
location.

Also get local timeline ID via IDENTIFY_SYSTEM, as fetching it from
pg_control risks returning the prior timeline ID if the timeline
switch has just taken place and no restart point has yet occurred.
2018-12-04 15:27:22 +09:00
Ian Barwick
10d46f7e85 Fix variable name typo 2018-12-04 10:22:23 +09:00
Ian Barwick
9e90fcd584 "standby follow": verify status of follow target
This commit adds infrastructure for repmgr to be able to check
whether one standby can attach to another node, regardless of whether
it is a standby or a primary.

This is intended to prevent a node from attempting to follow a
node whose timeline has diverged. The --dry-run option makes
it possible to test a follow operation before it is carried out.

As a useful side-effect this makes it possible for a standby to
follow another standby.

This is an initial implementation; documentation and possibly
further changes to follow.
2018-11-29 17:14:38 +09:00
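The kind of timeline check described above can be modelled, in much simplified form, as follows. This is not repmgr's algorithm: the function name is hypothetical, LSNs are plain integers, and the follow target's timeline history is assumed to be available as a mapping from each earlier timeline ID to the LSN at which the target left that timeline.

```python
# Illustrative, simplified model only; not repmgr's implementation.
from typing import Dict


def can_follow(
    local_tli: int,
    local_lsn: int,
    target_tli: int,
    target_history: Dict[int, int],
) -> bool:
    """Decide whether the local node's WAL is still on the target's history."""
    if target_tli < local_tli:
        # The local node is on a newer timeline than the target.
        return False
    if target_tli == local_tli:
        return True
    # Target is on a later timeline: the local node must not have advanced
    # beyond the point where the target forked off the local timeline.
    switch_lsn = target_history.get(local_tli)
    if switch_lsn is None:
        return False
    return local_lsn <= switch_lsn


if __name__ == "__main__":
    # Target left timeline 1 at LSN 150; a local node at LSN 200 on
    # timeline 1 has diverged and cannot follow it.
    print(can_follow(1, 200, 2, {1: 150}))
```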
Ian Barwick
66b40ffc68 Simplify function create_replication_slot()
Following the changes in 793d83b, it's no longer necessary to
pass the server version number.
2018-11-29 14:35:01 +09:00
Ian Barwick
311f7e561e "standby switchover": use ephemeral witness server connection
Intended to prevent issue reported in GitHub #514.
2018-11-28 14:29:41 +09:00
Ian Barwick
793d83b22c Refactor server version detection
Most of the time we can simply get the version number directly from
the connection handle. Previously it was held in a global variable,
which was an icky way of doing things.

In a few special cases we also need the actual version string, which
is obtained directly from the database.
2018-11-22 21:30:31 +09:00
Ian Barwick
b223cb4cee standby follow: improve handling of --upstream-node-id 2018-11-22 11:16:44 +09:00
Ian Barwick
c3bc5585d9 Add sanity check for extension version
This should cover the cases where the "repmgr" extension was installed
manually but not updated, or an upgrade was not fully completed.
2018-10-31 11:16:36 +09:00
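A sanity check of this kind could be sketched as below. The comparison rule, function names and messages are assumptions for illustration; the commit message only states which situations the check is meant to catch.

```python
# Illustrative sketch only; names and messages are hypothetical.
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '4.3' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def extension_version_problem(
    binary_version: str, installed_version: Optional[str]
) -> Optional[str]:
    """Return a description of the mismatch, or None if all is well."""
    if installed_version is None:
        return "extension is not installed"
    if parse_version(installed_version) < parse_version(binary_version):
        # E.g. the packages were upgraded but the extension was never
        # updated inside the database.
        return (
            f"extension version {installed_version} is older than "
            f"binary version {binary_version}"
        )
    return None


if __name__ == "__main__":
    print(extension_version_problem("4.3", "4.2"))
```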
Ian Barwick
c336e384ab Support "pg_promote()" function (PostgreSQL 12 and later)
This is an experimental feature.
2018-10-26 11:02:45 +09:00
Ian Barwick
dc8ffd30c6 "standby switchover": close all connections used to check repmgrd status
The connections used to check repmgrd status on all nodes were not being
closed if repmgrd was not running. Normally this wouldn't be a huge
problem as they will go away when repmgr terminates or the PostgreSQL
server is restarted. However, if shutdown mode is "smart", the open
connection on the demotion candidate will cause the shutdown operation
to fail until repmgr times out.
2018-10-23 11:05:28 +09:00
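The general remedy for this class of bug is to close each connection on every code path, whatever the status check finds. A minimal sketch, using a stand-in connection class rather than a real database driver (all names are hypothetical):

```python
# Illustrative sketch only; FakeConnection stands in for a real connection.
from typing import Dict, Iterable


class FakeConnection:
    def __init__(self, node_name: str, repmgrd_running: bool) -> None:
        self.node_name = node_name
        self.repmgrd_running = repmgrd_running
        self.closed = False

    def close(self) -> None:
        self.closed = True


def collect_repmgrd_status(connections: Iterable[FakeConnection]) -> Dict[str, bool]:
    """Record repmgrd status per node, always closing the connection."""
    status: Dict[str, bool] = {}
    for conn in connections:
        try:
            status[conn.node_name] = conn.repmgrd_running
        finally:
            # Runs whether or not repmgrd is running on the node, so no
            # connection is left open to block a "smart" shutdown.
            conn.close()
    return status


if __name__ == "__main__":
    conns = [FakeConnection("node1", True), FakeConnection("node2", False)]
    print(collect_repmgrd_status(conns))
    print(all(c.closed for c in conns))
```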
Ian Barwick
36bd7cdc9f Speed up witness "failover" during a switchover 2018-10-18 17:26:29 +09:00