Commit Graph

200 Commits

Author SHA1 Message Date
Josh Soref
f619c3a8ff Fix various typos in code comments.
Via GitHub #687.
2020-12-22 13:43:06 +09:00
Ian Barwick
93187e9743 Add missing connection close
In a corner-case situation where a standby is unable to attach to
the new primary due to a mismatch in the WAL stream, the connection
used to verify the recovery status of the new primary was not being
closed, leading to a risk of connection exhaustion on the new primary.

Addresses GitHub #682.
2020-12-01 21:33:07 +09:00
Ian Barwick
4d8bc63834 repmgrd: fix issue with incorrect reconnect_interval
Addresses GitHub #673.
2020-11-25 20:40:28 +09:00
Ian Barwick
0fc8c6c79c Add some break statements to silence compiler warnings 2020-10-08 13:10:00 +09:00
Ian Barwick
e10d9fd393 EXPERIMENTAL: synchronise try_primary_reconnect()'s reconnection loop
Per proposal in GitHub #662, this patch attempts to synchronise each
repmgrd's primary reconnection attempts to prevent potential race
conditions. This relies on each node's clock being correcly
synchronised.

Currently this change is experimental and is not enabled by default.
It can be enabled by setting the repmgr.conf parameter
"reconnect_loop_sync".
2020-10-06 13:35:49 +09:00
Ian Barwick
42283bf344 repmgrd: check local connection after promoting local node
In theory the local connection should not be affected by the node's
promotion. However we're handing over control to an external command
which is usually just "repmgr standby promote", but could potentially
be a user-defined script with unknowable side effects. So it's
better to be safe than sorry.
2020-10-05 16:50:41 +09:00
Ian Barwick
5b254a1be9 repmgrd: add parameter "failover_delay"
This parameter is not documented and intended for use during testing.
It should not be used in production.
2020-10-05 16:43:06 +09:00
Ian Barwick
ce229beff8 repmgrd: add configuration option "always_promote"
In certain corner cases, it's possible repmgrd may end up monitoring
a standby which was a former primary, but the node record has not
yet been updated.

Previously repmgrd would abort the promotion with a cryptic message
about being unable to find a node record for node_id -1 (the
default value for an unknown node id).

This commit addes a new configuration option "always_promote", which
determines whether repmgrd should promote the node in this case.
The default is "false", to effectively maintain the existing behaviour.

Logging output has also been improved to make it clearer what has
happened when this situation occurs.
2020-09-29 14:18:00 +09:00
Ian Barwick
16eeae700c repmgrd: minor log message tweak 2020-09-29 10:23:31 +09:00
Stanislav Paskalev
73e8373337 Add %v, %u and %t parameters to "failover_validation_command"
These indicate:
 - the number of visible nodes sharing the current upstream
 - the number of nodes on the current upstream
 - the total number of nodes in the entire repmgr cluster.

This allows the failover_validation_command to be used to perform
more thorough validations, including cross-referencing external
cluster management state (e.g. if managed by kubernetes).

GitHub #651.
2020-09-17 15:48:12 +09:00
Ian Barwick
a88c80248c repmgrd: minor tweaks to witness node synchronisation
Explicitly roll back if any operation fails, and add debugging output
to track elapsed time between synchronisation intervals.
2020-09-01 09:58:14 +09:00
Ian Barwick
e65738c989 Explicitly unset search path when connecting to database 2020-05-22 16:11:55 +09:00
Ian Barwick
d75a35a788 repmgrd: clarify why node is not configured for automatic failover 2020-05-22 11:21:48 +09:00
Ian Barwick
8233560629 repmgrd: ensure cascaded standby reconnects to primary
If the primary connection went away, and the upstream is not the
primary, attempt to reconnect if the monitoring update fails.

If the upstream is the primary, the reconnection will happen on
the next connection check.
2020-05-22 11:11:58 +09:00
Ian Barwick
cf60844c45 repmgrd: ensure primary connection is reset if same as upstream
Addresses GitHub #633.
2020-05-22 11:11:54 +09:00
Ian Barwick
a863dc7f6c repmgrd: additional check for the upstream connection
It's possible the upstream server was intermittently unavailable in
the interval between checks, invalidating the upstream connection.
With check types "ping" and "connection", the connection would not be
restored, so if the availability check was successful, additionally
verify the upstream connection and restore if necessary.

Addresses GitHub #633.
2020-05-14 10:26:57 +09:00
Ian Barwick
2f667116d8 repmgrd: include node name in log output
Missed in commit fd52df0.
2020-05-12 15:31:47 +09:00
Ian Barwick
8ee4fac5bb repmgrd: minor refactoring of try_primary_reconnect() 2020-05-12 14:52:14 +09:00
Ian Barwick
bb56387aaa repmgrd: consolidate connection closing code
PQfinish() should only be called on local PGconn pointers which
will not be reused.
2020-05-12 14:48:39 +09:00
Ian Barwick
5d00094936 repmgrd: ensure "close_connection()" always called after connection failure 2020-05-12 14:41:33 +09:00
Ian Barwick
ebdfdc530d repmgrd: ensure PQfinish() always executed on failed connections in NodeInfoLists
clear_node_info_list() will clean up any remaining active connections,
but we need to ensure all failed connections are cleaned up at the point
of failure to prevent leaks.

Per report in GitHub #643.
2020-05-12 14:22:08 +09:00
Ian Barwick
e5d3285d02 repmgrd: remove redundant log message 2020-05-11 16:59:32 +09:00
Ian Barwick
fd52df0fab repmgrd: include node name in log output in more places
Still a few places where only the node ID was reported, but it's always
useful to have the node name as well.
2020-05-11 16:55:31 +09:00
Ian Barwick
bcc284cac9 Refactor configuration file reload handling
Rather than parse the configuration file into a new structure and
copy changed values from that into the main structure, we'll copy
the existing structure before parsing the changed configuration
file directly into the nmain structure, and revert using the copy
if any issues are encountered.

This is necessary as preparation for further reworking of the
configuration file structure handling. It also makes the reload
idempotent.

While we're at it, make some general improvements to the reload
handling, particularly:

 - improve logging to show "before" and "after" values
 - collate change notifications and only display if no errors
   were found
 - remove unnecessary double-logging of errors
 - various bugfixes
2020-05-05 15:29:07 +09:00
Ian Barwick
3ca642fee1 repmgrd: log receipt of SIGHUP at log level NOTICE
PostgreSQL itself logs it at log level LOG, which we don't have,
but NOTICE seems reasonable, especially as we log SIGTERM as that.
2020-05-05 13:41:23 +09:00
Ian Barwick
8adcb1348d repmgrd: improve logging of promote_command failure
- log failure *before* we check if the primary has reappeared
- log the error code
2020-04-21 15:02:15 +09:00
Ian Barwick
780453e168 repmgrd: clarify log messages
Display the identity of the node question in the meassges fixed
in commit 8a27c89; this makes it easier to diagnose log output.
2020-04-03 13:02:49 +09:00
Tom Janson
8a27c89d18 repmgrd: fix inverted log message
Warning is emitted when the node in question is in recovery.
2020-04-03 12:39:36 +09:00
Ian Barwick
0a2091d5d3 repmgrd: handle new primary notification during failover check
It's possible a repmgrd instance might still be in the primary check
phase while a primary has already been promoted. Therefore it's
necessary to check for new primary notifications here, so we can
follow a new primary as quickly as possible.
2020-04-02 15:45:14 +09:00
Ian Barwick
9de31428f1 Consolidate replication connection code
In a few places, replication connections are generated from the
parameters used by existing connections. This has resulted in a
number of similar blocks of code which do more-or-less the same
thing almost but not quite identically. In two cases, the code
omitted to set "dbname=replication", which can cause problems
in some contexts.

These code blocks have now been consolidated into standardized
functions.

This also resolves the issue addressed by GitHub #619.
2020-03-05 17:21:37 +09:00
Ian Barwick
eaee7145f6 repmgrd: improve logging
Note node name and type when logging primary node visibility.
2020-02-24 15:33:03 +09:00
Ian Barwick
e782f2d949 repmgrd: improve logging
For easier log analysis, state which node is the current primary.
2020-02-20 13:06:41 +09:00
Ian Barwick
7fdf2f1778 Update copyright notices to 2020 2020-01-13 14:06:20 +09:00
Ian Barwick
2304584679 Fix handling of upstream node change check
repmgrd has a check to see if the upstream node has unexpectedly
changed, e.g. if the repmgrd service is paused and the PostgreSQL
instance has been pointed to another node.

However this check was relying on the node record on the local node
being up-to-date, which may not be the case immediately after a
failover, when the node is still replaying records updated prior
to the node's own record being updated. In this case it will
mistakenly assume the node is following the original primary
and attempt to restart monitoring, which will fail as the original
primary is no longer available.

To prevent this, we check against the node's record on the upstream
node.

Addresses issue noted in GitHub #587 and #588.
2019-10-14 12:28:04 +09:00
Ian Barwick
931da14df1 Rename some "repmgr daemon ..." commands to "repmgr service ..."
"repmgr daemon" can be interpreted to mean the commands affect the local
daemon process only. Rename the commands which affect the entire cluster
to "repmgr service ...".

The "repmgr daemon ..." form of the affected commands is retained for backwards
 compatibility.
2019-08-28 14:58:11 +09:00
Ian Barwick
3e812f6e91 repmgrd: always emit NOTICE when attempting to follow a new primary
Previously, if a standby's repmgrd was looping in degraded monitoring
mode looking for a new primary to follow, once a new primary was
detected the follow command would be executed without any prior
logging at non-DEBUG log levels.
2019-08-26 16:02:41 +09:00
Ian Barwick
75c0987e79 repmgrd: emit node name when reporting follow target attach error
This is consistent with other error messages.
2019-08-13 11:02:52 +09:00
Ian Barwick
d893ce227b repmgrd: optionally exclude/include witness server from child node checks 2019-06-03 16:04:54 +09:00
Ian Barwick
b5ff2ec120 repmgrd: update log text 2019-05-30 16:08:04 +09:00
Ian Barwick
06a83247c9 repmgrd: note node type when logging child node dis/re-connections 2019-05-30 14:06:54 +09:00
Ian Barwick
a6ea1d0fda repmgrd: fix witness node disconnection monitoring 2019-05-30 11:51:50 +09:00
Ian Barwick
fa66e72c2f repmgrd: count witness server as child node for connection monitoring purposes
As the witness server does not, by definition, ever have an entry in pg_stat_replication,
we need to check its "attached" status by connecting to the witness server itself
and querying the reported upstream node ID (which should be set by the witness
server repmgrd). If this matches the current primary node ID, we count it as attached.
2019-05-21 15:19:41 +09:00
Ian Barwick
02245a0014 repmgrd: add missing PQfinish() calls 2019-05-02 18:50:21 +09:00
Ian Barwick
52905f1eb3 Standardize on "ID: %i" when logging node IDs
Previously there was a mix of "id:", "node id:", "node ID:" and "node_id:".
2019-04-30 17:07:33 +09:00
Ian Barwick
87910a5448 repmgrd: improve logging of sibling node's upstream info
If the sibling node has already been promoted (for whatever
reason, e.g. "repmgr standby promote" was executed manually)
and has exited recovery, the upstream node ID will normally
be reported as "-1", which is correct, but looks confusing in
the logs.

We now only report the upstream node ID if the sibling node
is still in recovery, *or* if it has exited recovery but is
still reporting an extant node ID.
2019-04-29 13:51:17 +09:00
Ian Barwick
3231b5034d Remove temporary debugging log output 2019-04-24 13:17:52 +09:00
Ian Barwick
58b33fb411 Clarify a couple of code comments 2019-04-24 10:55:53 +09:00
Ian Barwick
6cbf436bf8 Don't execute "child_nodes_disconnect_command" when repmgrd paused 2019-04-23 14:08:13 +09:00
Ian Barwick
5a90513878 repmgrd: monitor standbys attached to primary
This functionality enables repmgrd (when running on the primary) to
monitor connected child nodes. It will log connections and disconnections
and generate events.

Additionally, repmgrd can execute a custom script if the number of connected
child nodes falls below a configurable threshold. This script can be used
e.g. to "fence" the primary following a failover situation where a new primary
has been promoted and all standbys are now child nodes of that primary.
2019-04-22 16:18:52 +09:00
Ian Barwick
a0c6cb602f repmgrd: remove duplicate function definition 2019-04-16 10:53:05 +09:00