The warning emitted gives the impression that monitoring data shouldn't
be written if there's no streaming replication, but we can and should
do this as long as we have a primary connection.
Explictly document this in the code.
Also remove an unused variable warning.
If repmgrd is monitoring a primary which is taken off-line, then later
restored as a standby, detect this change and resume monitoring
in standby node.
Addresses GitHub #338.
All disconnected nodes will be in a static, known state, so as long as
each node has the same meta-information (repmgr.nodes) and is able
to retrieve the last receive LSN of the other nodes, it is possible
for each node to independently determine the best promotion candidate,
thereby reaching consensus without an explicit "voting" process.
get_new_primary() returns NULL if no notification for the new primary has
been received, but the code was expecting it to return UNKNOWN_NODE_ID,
which was causing repmgrd to prematurely drop out of the new primary
detection loop if no notification had been received by the time the loop
started.
Also store the electoral term as a single row, single column table,
to ensure that all repmgrds see the same turn. It is then bumped
by the winning node after it gets promoted.
Various logging improvements.
In previous versions of repmgr, some options had ambiguous meanings,
and/or were used for slightly different purposes. This way we end
up with a couple more options (most of which probably won't need
adjusting) but greater clarity and flexibility.
Removed:
master_reponse_timeout:
renamed to "async_query_timeout", as this was its main usage
retry_promote_interval_secs:
replaced by "primary_notification_timeout"
Added:
async_query_timeout:
timeout (in seconds) when executing asynchronous queries
primary_notification_timeout:
number of seconds to wait for notification from the new primary
after a failover
primary_follow_timeout:
number of seconds to wait for the new primary to become available
when executing "repmgr standby follow"
The node(s) with higher ID will "yield", leaving the decision making
up to the node with the lower ID.
This happens very rarely, usually when the random delay is close
enough on two or mode nodes that vote initiation is simultaneous.
This will cover both the case when an entire node including
repmgrd goes down, and when one PostgreSQL instance goes down
but repmgrd is still up (in which case only one of the repmgrds
will handle the failover).