Commit Graph

193 Commits

Author SHA1 Message Date
Ian Barwick
ce229beff8 repmgrd: add configuration option "always_promote"
In certain corner cases, it's possible repmgrd may end up monitoring
a standby which was a former primary, but the node record has not
yet been updated.

Previously repmgrd would abort the promotion with a cryptic message
about being unable to find a node record for node_id -1 (the
default value for an unknown node id).

This commit addes a new configuration option "always_promote", which
determines whether repmgrd should promote the node in this case.
The default is "false", to effectively maintain the existing behaviour.

Logging output has also been improved to make it clearer what has
happened when this situation occurs.
2020-09-29 14:18:00 +09:00
Ian Barwick
16eeae700c repmgrd: minor log message tweak 2020-09-29 10:23:31 +09:00
Stanislav Paskalev
73e8373337 Add %v, %u and %t parameters to "failover_validation_command"
These indicate:
 - the number of visible nodes sharing the current upstream
 - the number of nodes on the current upstream
 - the total number of nodes in the entire repmgr cluster.

This allows the failover_validation_command to be used to perform
more thorough validations, including cross-referencing external
cluster management state (e.g. if managed by kubernetes).

GitHub #651.
2020-09-17 15:48:12 +09:00
Ian Barwick
a88c80248c repmgrd: minor tweaks to witness node synchronisation
Explicitly roll back if any operation fails, and add debugging output
to track elapsed time between synchronisation intervals.
2020-09-01 09:58:14 +09:00
Ian Barwick
e65738c989 Explicitly unset search path when connecting to database 2020-05-22 16:11:55 +09:00
Ian Barwick
d75a35a788 repmgrd: clarify why node is not configured for automatic failover 2020-05-22 11:21:48 +09:00
Ian Barwick
8233560629 repmgrd: ensure cascaded standby reconnects to primary
If the primary connection went away, and the upstream is not the
primary, attempt to reconnect if the monitoring update fails.

If the upstream is the primary, the reconnection will happen on
the next connection check.
2020-05-22 11:11:58 +09:00
Ian Barwick
cf60844c45 repmgrd: ensure primary connection is reset if same as upstream
Addresses GitHub #633.
2020-05-22 11:11:54 +09:00
Ian Barwick
a863dc7f6c repmgrd: additional check for the upstream connection
It's possible the upstream server was intermittently unavailable in
the interval between checks, invalidating the upstream connection.
With check types "ping" and "connection", the connection would not be
restored, so if the availability check was successful, additionally
verify the upstream connection and restore if necessary.

Addresses GitHub #633.
2020-05-14 10:26:57 +09:00
Ian Barwick
2f667116d8 repmgrd: include node name in log output
Missed in commit fd52df0.
2020-05-12 15:31:47 +09:00
Ian Barwick
8ee4fac5bb repmgrd: minor refactoring of try_primary_reconnect() 2020-05-12 14:52:14 +09:00
Ian Barwick
bb56387aaa repmgrd: consolidate connection closing code
PQfinish() should only be called on local PGconn pointers which
will not be reused.
2020-05-12 14:48:39 +09:00
Ian Barwick
5d00094936 repmgrd: ensure "close_connection()" always called after connection failure 2020-05-12 14:41:33 +09:00
Ian Barwick
ebdfdc530d repmgrd: ensure PQfinish() always executed on failed connections in NodeInfoLists
clear_node_info_list() will clean up any remaining active connections,
but we need to ensure all failed connections are cleaned up at the point
of failure to prevent leaks.

Per report in GitHub #643.
2020-05-12 14:22:08 +09:00
Ian Barwick
e5d3285d02 repmgrd: remove redundant log message 2020-05-11 16:59:32 +09:00
Ian Barwick
fd52df0fab repmgrd: include node name in log output in more places
Still a few places where only the node ID was reported, but it's always
useful to have the node name as well.
2020-05-11 16:55:31 +09:00
Ian Barwick
bcc284cac9 Refactor configuration file reload handling
Rather than parse the configuration file into a new structure and
copy changed values from that into the main structure, we'll copy
the existing structure before parsing the changed configuration
file directly into the nmain structure, and revert using the copy
if any issues are encountered.

This is necessary as preparation for further reworking of the
configuration file structure handling. It also makes the reload
idempotent.

While we're at it, make some general improvements to the reload
handling, particularly:

 - improve logging to show "before" and "after" values
 - collate change notifications and only display if no errors
   were found
 - remove unnecessary double-logging of errors
 - various bugfixes
2020-05-05 15:29:07 +09:00
Ian Barwick
3ca642fee1 repmgrd: log receipt of SIGHUP at log level NOTICE
PostgreSQL itself logs it at log level LOG, which we don't have,
but NOTICE seems reasonable, especially as we log SIGTERM as that.
2020-05-05 13:41:23 +09:00
Ian Barwick
8adcb1348d repmgrd: improve logging of promote_command failure
- log failure *before* we check if the primary has reappeared
- log the error code
2020-04-21 15:02:15 +09:00
Ian Barwick
780453e168 repmgrd: clarify log messages
Display the identity of the node question in the meassges fixed
in commit 8a27c89; this makes it easier to diagnose log output.
2020-04-03 13:02:49 +09:00
Tom Janson
8a27c89d18 repmgrd: fix inverted log message
Warning is emitted when the node in question is in recovery.
2020-04-03 12:39:36 +09:00
Ian Barwick
0a2091d5d3 repmgrd: handle new primary notification during failover check
It's possible a repmgrd instance might still be in the primary check
phase while a primary has already been promoted. Therefore it's
necessary to check for new primary notifications here, so we can
follow a new primary as quickly as possible.
2020-04-02 15:45:14 +09:00
Ian Barwick
9de31428f1 Consolidate replication connection code
In a few places, replication connections are generated from the
parameters used by existing connections. This has resulted in a
number of similar blocks of code which do more-or-less the same
thing almost but not quite identically. In two cases, the code
omitted to set "dbname=replication", which can cause problems
in some contexts.

These code blocks have now been consolidated into standardized
functions.

This also resolves the issue addressed by GitHub #619.
2020-03-05 17:21:37 +09:00
Ian Barwick
eaee7145f6 repmgrd: improve logging
Note node name and type when logging primary node visibility.
2020-02-24 15:33:03 +09:00
Ian Barwick
e782f2d949 repmgrd: improve logging
For easier log analysis, state which node is the current primary.
2020-02-20 13:06:41 +09:00
Ian Barwick
7fdf2f1778 Update copyright notices to 2020 2020-01-13 14:06:20 +09:00
Ian Barwick
2304584679 Fix handling of upstream node change check
repmgrd has a check to see if the upstream node has unexpectedly
changed, e.g. if the repmgrd service is paused and the PostgreSQL
instance has been pointed to another node.

However this check was relying on the node record on the local node
being up-to-date, which may not be the case immediately after a
failover, when the node is still replaying records updated prior
to the node's own record being updated. In this case it will
mistakenly assume the node is following the original primary
and attempt to restart monitoring, which will fail as the original
primary is no longer available.

To prevent this, we check against the node's record on the upstream
node.

Addresses issue noted in GitHub #587 and #588.
2019-10-14 12:28:04 +09:00
Ian Barwick
931da14df1 Rename some "repmgr daemon ..." commands to "repmgr service ..."
"repmgr daemon" can be interpreted to mean the commands affect the local
daemon process only. Rename the commands which affect the entire cluster
to "repmgr service ...".

The "repmgr daemon ..." form of the affected commands is retained for backwards
 compatibility.
2019-08-28 14:58:11 +09:00
Ian Barwick
3e812f6e91 repmgrd: always emit NOTICE when attempting to follow a new primary
Previously, if a standby's repmgrd was looping in degraded monitoring
mode looking for a new primary to follow, once a new primary was
detected the follow command would be executed without any prior
logging at non-DEBUG log levels.
2019-08-26 16:02:41 +09:00
Ian Barwick
75c0987e79 repmgrd: emit node name when reporting follow target attach error
This is consistent with other error messages.
2019-08-13 11:02:52 +09:00
Ian Barwick
d893ce227b repmgrd: optionally exclude/include witness server from child node checks 2019-06-03 16:04:54 +09:00
Ian Barwick
b5ff2ec120 repmgrd: update log text 2019-05-30 16:08:04 +09:00
Ian Barwick
06a83247c9 repmgrd: note node type when logging child node dis/re-connections 2019-05-30 14:06:54 +09:00
Ian Barwick
a6ea1d0fda repmgrd: fix witness node disconnection monitoring 2019-05-30 11:51:50 +09:00
Ian Barwick
fa66e72c2f repmgrd: count witness server as child node for connection monitoring purposes
As the witness server does not, by definition, ever have an entry in pg_stat_replication,
we need to check its "attached" status by connecting to the witness server itself
and querying the reported upstream node ID (which should be set by the witness
server repmgrd). If this matches the current primary node ID, we count it as attached.
2019-05-21 15:19:41 +09:00
Ian Barwick
02245a0014 repmgrd: add missing PQfinish() calls 2019-05-02 18:50:21 +09:00
Ian Barwick
52905f1eb3 Standardize on "ID: %i" when logging node IDs
Previously there was a mix of "id:", "node id:", "node ID:" and "node_id:".
2019-04-30 17:07:33 +09:00
Ian Barwick
87910a5448 repmgrd: improve logging of sibling node's upstream info
If the sibling node has already been promoted (for whatever
reason, e.g. "repmgr standby promote" was executed manually)
and has exited recovery, the upstream node ID will normally
be reported as "-1", which is correct, but looks confusing in
the logs.

We now only report the upstream node ID if the sibling node
is still in recovery, *or* if it has exited recovery but is
still reporting an extant node ID.
2019-04-29 13:51:17 +09:00
Ian Barwick
3231b5034d Remove temporary debugging log output 2019-04-24 13:17:52 +09:00
Ian Barwick
58b33fb411 Clarify a couple of code comments 2019-04-24 10:55:53 +09:00
Ian Barwick
6cbf436bf8 Don't execute "child_nodes_disconnect_command" when repmgrd paused 2019-04-23 14:08:13 +09:00
Ian Barwick
5a90513878 repmgrd: monitor standbys attached to primary
This functionality enables repmgrd (when running on the primary) to
monitor connected child nodes. It will log connections and disconnections
and generate events.

Additionally, repmgrd can execute a custom script if the number of connected
child nodes falls below a configurable threshold. This script can be used
e.g. to "fence" the primary following a failover situation where a new primary
has been promoted and all standbys are now child nodes of that primary.
2019-04-22 16:18:52 +09:00
Ian Barwick
a0c6cb602f repmgrd: remove duplicate function definition 2019-04-16 10:53:05 +09:00
Ian Barwick
27803f93ff repmgrd: always unset upstream node ID when monitoring a primary 2019-04-12 12:26:39 +09:00
Ian Barwick
46d17d0933 repmgrd: fix log output 2019-04-11 16:29:08 +09:00
Ian Barwick
6b79e08706 repmgrd: add addiitonal log output in do_election() 2019-04-11 15:46:20 +09:00
Ian Barwick
cd6a55c7cb repmgrd: improve primary visibility consensus check
Exclude sibling nodes which report they're following a different
node. This shouldn't happen, but could.
2019-04-11 15:46:14 +09:00
Ian Barwick
008bd00a59 repmgrd: store upstream node ID in shared memory 2019-04-11 15:46:09 +09:00
Ian Barwick
5a8741199f repmgrd: exclude witness server from followability check 2019-04-11 11:19:12 +09:00
Ian Barwick
9164d3931b repmgrd: clean up PQExpBuffer handling
Unless the PQExpBuffer is required for the duration of the function,
ensure it's always a variable local to the relevant code block. This
mitigates the risk of accidentally accessing a generically named
PQExpBuffer which hasn't been initialised or was previously terminated.
2019-03-26 13:15:25 +09:00