Commit Graph

396 Commits

Author SHA1 Message Date
Ian Barwick
d7420d7274 daemon (start|stop): verify that repmgrd starts/stops.
Note this may not always be possible for "daemon stop" if we are unable
to determine the repmgrd PID.
2019-01-30 14:41:31 +09:00
Ian Barwick
59eca2be30 node rejoin: improve error code handling
- return ERR_REJOIN_FAIL in all cases where the rejoin operation fails
 - ensure ERR_FOLLOW_FAIL is not returned
 - document error codes
2019-01-24 10:31:45 +09:00
Ian Barwick
061932d023 "node rejoin": verify status of rejoin target
This adapts the code previously added to "standby follow" to verify
whether the rejoin target can actually be rejoined.
2019-01-23 17:08:55 +09:00
Ian Barwick
3f5762e03a Refactor upstream attachment check code
Move it from the "standby follow" code to an independent function so it can
be used in other contexts, e.g. "node rejoin".
2019-01-23 15:11:42 +09:00
Ian Barwick
7dce3ed234 Update copyright notices to 2019 2019-01-21 14:54:35 +09:00
Ian Barwick
8881b69c06 "standby switchover": check remote data directory configuration
The switchover will fail if the data_directory parameter in repmgr.conf
on the remote node (demotion candidate) is incorrectly configured.
We use the previously added "repmgr node check --data-directory-config
to verify this, and abort early if an issue is discovered.

Implements GitHub #523.
2019-01-16 16:03:49 +09:00
Ian Barwick
0b3a310802 Add --data-directory-config option to "repmgr node check"
Implements part of GitHub #523.
2019-01-16 16:03:44 +09:00
Ian Barwick
d4e993a240 Improve handling of connection URIs when executing remote commands
Previously, if connection URIs were in use and "repmgr standby switchover"
was executed, repmgr would pass the connection URI as-is to the demotion
candidate to execute "repmgr node rejoin". However the presence of
unescaped ampersands in the connection URI was causing the rejoin command
to be incorrectly executed.

Addresses GitHub #525.
2019-01-14 11:11:51 +09:00
Ian Barwick
028c874f81 "standby follow": simplify check when follow target has higher timeline
No need for a CHECKPOINT here, which simplifies things considerably.
2019-01-11 16:34:04 +09:00
Ian Barwick
b3c2831bd3 repmgr: add --dry-run option to "standby promote"
Implements GitHub #522.
2019-01-10 12:36:58 +09:00
Ian Barwick
3389491151 Misc comment and log output corrections 2019-01-09 09:41:59 +09:00
Ian Barwick
313aa3c5d7 Refactor follow verification to reduce need for CHECKPOINT
A CHECKPOINT is not always required; hopefully we can narrow it down
to one corner case where we need to determine the minium recovery
location.

Also get local timeline ID via IDENTIFY_SYSTEM, as fetching it from
pg_control risks returning the prior timeline ID if the timeline
switch has just taken place and no restart point has yet occurred.
2018-12-04 15:27:22 +09:00
Ian Barwick
10d46f7e85 Fix variable name typo 2018-12-04 10:22:23 +09:00
Ian Barwick
9e90fcd584 "standby follow": verify status of follow target
This commit adds infrastruture for repmgr to be able to check
whether one standby can attach to another node, regardless whether
it is a standby or a primary.

This is intended to prevent a node from attempting to follow a
node whose timeline has diverged. The --dry-run option makes
it possible to test a follow operation before it is carried out.

As a useful side-effect this makes it possible for a standby to
follow another standby.

This is an initial implementation; documentation and possibly
further changes to follow.
2018-11-29 17:14:38 +09:00
Ian Barwick
66b40ffc68 Simplify function create_replication_slot()
Following the changes in 793d83b, it's no longer necessary to
pass the server version number.
2018-11-29 14:35:01 +09:00
Ian Barwick
311f7e561e "standby switchover": use empheral witness server connection
Intended to prevent issue reported in GitHub #514.
2018-11-28 14:29:41 +09:00
Ian Barwick
793d83b22c Refactor server version detection
Most of the time we can simply get the version number directly from
the connection handle. Previously it was held in a global variable,
which was an icky way of doing things.

In a few special cases we also need the actual version string, which
is obtained directly from the database.
2018-11-22 21:30:31 +09:00
Ian Barwick
b223cb4cee standby follow: improve handling of --upstream-node-id 2018-11-22 11:16:44 +09:00
Ian Barwick
c3bc5585d9 Add sanity check for extension version
This should cover the cases where the "repmgr" extension was installed
manually but not updated, or an upgrade was not fully completed.
2018-10-31 11:16:36 +09:00
Ian Barwick
c336e384ab Support "pg_promote()" function (PostgreSQL 12 and later)
This is an experimental feature.
2018-10-26 11:02:45 +09:00
Ian Barwick
dc8ffd30c6 "standby switchover": close all connections used to check repmgrd status
The connections used to check repmgrd status on all nodes were not being
closed if repmgrd was not running. Normally this wouldn't be a huge
problem as they will go away when repmgr terminates or the PostgreSQL
server restarted. However, if shutdown mode is "smart", the open
connection on the demotion candidate will cause the shutdown operation
to fail until repmgr times out.
2018-10-23 11:05:28 +09:00
Ian Barwick
36bd7cdc9f Speed up witness "failover" during a switchover 2018-10-18 17:26:29 +09:00
Ian Barwick
15a5d2ee9d "repmgr standby": use appendPQExpBufferStr/-Char() consistently 2018-10-03 17:31:12 +09:00
Ian Barwick
2491b8ae52 Add functionality to "pause" repmgrd
In some circumstances, e.g. while performing a switchover, it is essential
that repmgrd does not take any kind of failover action, as this will put
the cluster into an incorrect state.

Previously it was necessary to stop repmgrd on all nodes (or at least
those nodes which repmgrd would consider as promotion candidates), however
this is a cumbersome and potentially risk-prone operation, particularly if the
replication cluster contains more than a couple of servers.

To prevent this issue from occurring, this patch introduces the ability
to "pause" repmgrd on all nodes wth a single command ("repmgr daemon pause")
which notifies repmgrd not to take any failover action until the node
is "unpaused" ("repmgr daemon unpause").

"repmgr daemon status" provides an overview of each node and whether repmgrd
is running, and if so whether it is paused.

"repmgr standby switchover" has been modified to automatically pause repmgrd
while carrying out the switchover.

See documentation for further details.
2018-09-27 16:42:10 +09:00
Ian Barwick
9439467958 doc: add troubleshooting section to switchover documentation 2018-09-25 13:47:58 +09:00
Ian Barwick
38e3aae053 repmgr: add parameter "shutdown_check_timeout"
Previously, "repmgr standby switchover" used the configuration file parameters
"reconnect_interval" and "reconnect_attempts" to define a timeout to determine
whether the current primary (demotion candidate) has shut down.

However, these parameters are intended for primary failure detection and are
generally lower in value, while a controlled shutdown may take longer, resulting
in the switchover being aborted as repmgr was not waiting long enough.

To prevent this happening, parameter "shutdown_check_timeout" has been added.
This complements the existing "standby_reconnect_timeout" parameter used
by "repmgr standby switchover".

Implements GitHub #504.
2018-09-25 11:34:06 +09:00
Ian Barwick
9681708b1a repmgr: improve slot handling in "node rejoin"
On the rejoined node, if a replication slot for the new upstream exists
(which is typically the case after a failover), delete that slot.

Also emit a warning about any inactive replication slots which may need
to be cleaned up manually.

GitHub #499.
2018-08-30 12:24:13 +09:00
Ian Barwick
7745844078 "standby switchover": improve replication connection check
Previously repmgr would first check that a replication can be made
from the demotion candidate to the promotion candidate, however it's
preferable to sanity-check the number of available walsenders first,
to provide a more useful error message.
2018-08-24 16:31:25 +09:00
Cédric Villemain
6fc79470fc Fix grep to find conninfo
it used to use \t* but [[:space:]] should be better as it does match more kind
of spaces (the current one being broken in my case on RH7)
2018-08-23 18:33:55 +02:00
Ian Barwick
c3949b2aea "standby clone" - don't copy external config files in dry run mode
Avoid copying files during a --dry-run as it may introduce unexpected changes
on the target node. During an actual clone operation, any problems with
copying files will be detected early and the operation aborted before
the actual database cloning commences.

GitHub #491.
2018-08-20 15:23:37 +09:00
Ian Barwick
6ba49de44e "standby promote": improve log messages
Make it clearer what repmgr is waiting for, and what to do if the
promotion appears to fail.
2018-08-16 11:52:01 +09:00
Ian Barwick
f2bc898761 repmgr: fix handling of slot creation error when cloning
If cloning from another node other than the intended upstream, and
replication slots are in use, once the cloning process is complete,
repmgr will attempt to connect to the intended upstream to create
the replication slot.

Previously it would abort with a connection error, but as this issue
is not fatal to the cloning process itself, and in some situations may
be intentional, it's better to log a warning and continue.

We should probably collate this (and any similar items needing
attention after the cloning operation) into a list output at the end,
otherwise the warning may get overlooked.
2018-08-15 15:12:23 +09:00
Abhijit Menon-Sen
97cafd8c54 Fix upstream node name in warning
This log_warning is supposed to reproduce the error in the block above,
but used the current node's name instead of the intended upstream node.
2018-08-12 09:15:13 +05:30
Ian Barwick
7ecfb333b9 doc: add note about switchover and exclusive backups
Also rename server_not_in_exclusive_backup_mode() to avoid double
negatives.

GitHub #476.
2018-07-19 16:02:31 +09:00
Martín Marqués
8f13a66aaa Check that there is no exclusive backup taking place while we perform
a switchover.

We've found that this can cause some issues with postgres control
metadata (could be a postgres bug) so best thing is *not* no switchover
if there's a backup taking place.

It's also a bad idea from an architectual point of view, as a switchover
is supposed to be planed, so why perform it when we are taking backups.

GitHub #476.
2018-07-19 16:02:21 +09:00
Ian Barwick
673bde2b7f repmgr: fix "primary_slot_name" when using "standby clone" with --recovery-conf-only
Addresses GitHub #474.
2018-07-17 13:42:10 +09:00
Martín Marqués
81de200561 Add information to the --help and docs of standby clone regarding the need
to provide a conninfo line to the upstream from which we will be cloning
from.
2018-07-16 18:56:41 -03:00
Ian Barwick
29de052dd8 repmgr: clarify intent behind --wait-sync timeout processing 2018-07-05 10:09:04 +09:00
Ian Barwick
37311e15a3 repmgr: fix "standby register --wait-sync" when no timeout provided
The default value for "wait_register_sync_seconds" was zero, which is treated
as disabling --wait-sync altogether. Default value now set to -1, which is taken
to mean no timeout value supplied.
2018-07-04 17:22:04 +09:00
Ian Barwick
fcf237fe31 node status: improve output and documentation
In the default text output mode, list inactive slots.

In CSV output mode, list inactive slots as additional information;
add output line with number of missing slots and a list thereof.

Also document --csv output mode.
2018-06-22 11:46:50 +09:00
Ian Barwick
c5ba72c2c5 standby switchover: fix behaviour if witness node is a sibling
The witness node is not a streaming replication standby, so executing
"repmgr standby follow" will fail. Instead, execute "repmgr witness
register --force" to update the witness node record on the primary and
its local copy of all node records.

Addresses GitHub #453.
2018-06-21 16:48:58 +09:00
Ian Barwick
efc388065e standby follow: check node has connect to new primary
After restarting the standby, poll pg_stat_replication on the upstream
until the standby connects, and exit with an error if it doesn't by the
timeout defined in "standby_follow_timeout".

Implments GitHub #444.
2018-06-07 15:04:45 +09:00
Ian Barwick
0108fb2e72 standby follow: add hint about using "node rejoin"
If "repmgr standby follow" is executed on a node which isn't running,
point out "repmgr node rejoin" should probably be used instead.
2018-06-07 15:04:30 +09:00
Ian Barwick
535fba43d3 standby clone: improve external configuration file copying
If --copy-external-config-files was provided, check that we can copy
the files *before* cloning the standby, and abort if an error is
encountered. This will give the user the opportunity to fix any issues
before running the entire (and potentially lengthy) clone.

Previously errors were logged but no action taken, and the final
message indicated the clone operation was successful.

Addresses GitHub #443.
2018-06-07 15:04:01 +09:00
Ian Barwick
7613b1769c standby clone: --recovery-conf-only expects the standby to be registered
Note this in the documentation, and add a HINT about registering it
if the standby record is not available.

Related to GitHub #438.
2018-05-31 09:42:53 +09:00
Ian Barwick
276239422b standby clone: don't assume existence of "user" in upstream conninfo
Usually a seperate user (typically "repmgr") is set up specifically to manage
the repmgr metadata, however there's no compelling requirement to do this, and
it's possible the database owner (usually: "postgres") will be used, in which
case it's possible the username will be left out of the conninfo string.

Addresses GitHub #437.
2018-05-24 15:52:51 +09:00
Ian Barwick
6c518f1403 "standby clone": log actual connection string used to connect to upstream
Useful for diagnostic purposes.
2018-05-10 12:03:13 +09:00
Ian Barwick
8320179f34 Add configuration file parameter "config_directory"
This enables explicit provision of an external configuration file
directory, which if set will be passed to "pg_ctl" as the -D
parameter. Otherwise "pg_ctl" will default to using the data directory,
which will cause some operations to fail if the configuration files
are not present there.

Note this is implemented primarily for feature completeness and for
development/testing purposes. Users who have installed "repmgr" from
a package should not rely on "pg_ctl" to stop/start/restart PostgreSQL,
instead they should set the appropriate "service_..._command" for their
operating system. For more details see:

    https://repmgr.org/docs/4.0/configuration-service-commands.html

Note: in a future release, the presence of "config_directory" in repmgr.conf
will be used to implictly set "--copy-external-config-files=samepath" when
cloning a standby; this is a behaviour change so will be implemented in the
next major realease (repmgr 4.1).

Implements GitHub #424.
2018-04-25 11:58:24 +09:00
Ian Barwick
cda952f1e4 Add "dbname=replication" to all replication connection strings
Previously repmgr was attempting to make replication connections
with "dbname" set to the repmgr database name. While this works
if e.g. the repmgr user also has replication permissions, it will
fail if a dedicated replication user is specified, who only has
permission to access the virtual "replication" database.

Change this to use "dbname=replication" if the replication connection
user is different to the normal repmgr database user.

(We could just always set it to "replication", but that might break
existing installations e.g. where a .pgpass file is in use and there's
no "replication" entry for the normal repmgr database user).

Addresses GitHub #421.
2018-04-12 16:11:16 +09:00
Ian Barwick
62c29aab32 Don't issue a CHECKPOINT after promoting a standby.
Issuing a CHECKPOINT immediately after promoting a standby may impact
performance. Commit 239a548e9d ensures
one is only issued when required, i.e. during a switchover when
pg_rewind will be executed.

This reverts commit a2068768ab.
2018-04-09 14:35:54 +09:00