260 Commits

Author SHA1 Message Date
Ian Barwick
03b29908e2 Miscellaneous string handling cleanup
This is mainly to prevent effectively spurious truncation warnings
in recent GCC versions.
2019-04-10 16:21:09 +09:00
Ian Barwick
347948b79f Fix default return value in alter_system_int() 2019-04-01 14:52:37 +09:00
Ian Barwick
83e492d4ef Add is_server_available_quiet()
For use in cases where the caller collates node availability information
and doesn't want to prematurely emit log output.
2019-04-01 12:24:57 +09:00
Ian Barwick
1906ea89bd Improve copying of strings from database results
Where feasible, specify the maximum string length via sizeof(), and
use snprintf() in place of strncpy().
2019-04-01 11:29:16 +09:00
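The copying pattern described in this commit can be sketched as follows. This is a minimal illustration only, not the project's actual code: the struct, the buffer size and the function name are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_NAME_BUFSIZE 64  /* hypothetical buffer size for this sketch */

typedef struct
{
    char node_name[SKETCH_NAME_BUFSIZE];
} node_info_sketch;

/*
 * Copy "src" (e.g. a value fetched from a database result) into the
 * fixed-size field. Returns 1 if the value fitted, 0 if it was truncated.
 */
int
copy_node_name(node_info_sketch *node, const char *src)
{
    int written;

    assert(node != NULL && src != NULL);

    /*
     * sizeof() follows the array definition, so the limit cannot drift out
     * of sync with the buffer. Unlike strncpy(), snprintf() always
     * NUL-terminates the destination (for a non-zero size).
     */
    written = snprintf(node->node_name, sizeof(node->node_name), "%s", src);

    return written >= 0 && (size_t) written < sizeof(node->node_name);
}
```

The return value of snprintf() also lets the caller detect truncation, which strncpy() does not report.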
Ian Barwick
939cbd0721 Cast "int" to "long long" 2019-03-28 11:04:53 +09:00
Ian Barwick
1953ec7459 Restrict "node_name" to maximum 63 characters
In "recovery.conf", the configuration parameter "node_name" is used
as the "application_name" value, which will be truncated by PostgreSQL
to 63 characters (NAMEDATALEN - 1).

repmgr sometimes needs to be able to extract the application name from
pg_stat_replication to determine if a node is connected (e.g. when
executing "repmgr standby register"), so the comparison will fail
if "node_name" exceeds 63 characters.
2019-03-28 10:58:18 +09:00
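A length check along these lines might look like the sketch below. The macro and function names are hypothetical, the value 64 is PostgreSQL's default NAMEDATALEN redefined locally so the example stands alone, and rejecting an empty name is an extra assumption of this sketch.

```c
#include <assert.h>
#include <string.h>

/* PostgreSQL's default NAMEDATALEN, redefined here for a standalone sketch */
#define SKETCH_NAMEDATALEN 64
#define SKETCH_MAX_NODE_NAME_LEN (SKETCH_NAMEDATALEN - 1)

/*
 * Return 1 if "node_name" can be used unmodified as "application_name",
 * 0 if it is empty or would be truncated by PostgreSQL.
 */
int
node_name_is_valid(const char *node_name)
{
    size_t len;

    assert(node_name != NULL);

    /* strlen() counts bytes, which is also what the truncation limit uses */
    len = strlen(node_name);

    return len > 0 && len <= SKETCH_MAX_NODE_NAME_LEN;
}
```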
Ian Barwick
948e076ad9 log_db_error(): fix formatted message handling 2019-03-27 14:27:55 +09:00
Ian Barwick
6441db23ff repmgrd: during failover, check if a node was already promoted
Previously, repmgrd assumed that during a failover, there would not
already be another primary node. However it's possible a node was
promoted manually. While this is not a desirable situation, it's
conceivable this could happen in the wild, so we should check for
it and react accordingly.

Also sanity-check that the follow target can actually be followed.

Addresses issue raised in GitHub #420.
2019-03-22 15:15:49 +09:00
Ian Barwick
4c11a57334 use a constant to denote unknown replication lag 2019-03-22 10:12:19 +09:00
Ian Barwick
37a41a66f9 Check node recovery type before attempting to write an event record
In some corner cases (e.g. immediately after a switchover) where
the current primary has not yet been determined, the provided connection
might not be writeable. This prevents error messages such as
"cannot execute INSERT in a read-only transaction" generating unnecessary
noise in the logs.
2019-03-20 12:14:53 +09:00
Ian Barwick
58f55222d9 Explicitly log PQping() failures 2019-03-20 12:13:44 +09:00
Ian Barwick
5cbaff8d0a Improve database connection failure logging
Log the output of PQerrorMessage() in a couple of places where it was missing.

Additionally, always log the output of PQerrorMessage() starting with a blank
line, otherwise the first line looks like it was emitted by repmgr, and
it's harder to scan the error message.

Before:

    [2019-03-20 11:24:15] [DETAIL] could not connect to server: Connection refused
            Is the server running on host "localhost" (::1) and accepting
            TCP/IP connections on port 5501?
    could not connect to server: Connection refused
            Is the server running on host "localhost" (127.0.0.1) and accepting
            TCP/IP connections on port 5501?

After:

    [2019-03-20 11:27:21] [DETAIL]
    could not connect to server: Connection refused
            Is the server running on host "localhost" (::1) and accepting
            TCP/IP connections on port 5501?
    could not connect to server: Connection refused
            Is the server running on host "localhost" (127.0.0.1) and accepting
            TCP/IP connections on port 5501?
2019-03-20 12:13:40 +09:00
Ian Barwick
ce8e1cccc4 Remove outdated comment
This was only relevant for repmgr3 and earlier; in repmgr4 the schema
is hard-coded.
2019-03-18 15:19:25 +09:00
Ian Barwick
39443bbcee Count witness and zero-priority nodes in visibility check 2019-03-15 14:06:58 +09:00
Ian Barwick
fc636b1bd2 Ensure witness node sets last upstream seen time 2019-03-15 14:06:55 +09:00
Ian Barwick
59b7453bbf repmgrd: optionally disconnect WAL receivers during failover
This is intended to ensure that all nodes have a constant LSN while
making the failover decision.

This feature is experimental and needs to be explicitly enabled with the
configuration file option "standby_disconnect_on_failover".

Note enabling this option will result in a delay in the failover decision
until the WAL receiver is disconnected on all nodes.
2019-03-08 15:27:54 +09:00
Ian Barwick
bc6584a90d *_transaction() functions: log error message text as DETAIL
Per behaviour elsewhere.
2019-03-06 13:23:57 +09:00
Ian Barwick
074d79b44f repmgrd: add option "connection_check_type"
This enables selection of the method repmgrd uses to check whether the upstream
node is available. Possible values are:

 - "ping" (default): uses PQping() to check server availability
 - "connection":  executes a query on the connection to check server
   availability (similar to repmgr3.x).
2019-03-06 13:23:53 +09:00
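Mapping the two documented values onto an internal setting could be sketched as below. The enum and function names are hypothetical, and treating an empty value as the default is an assumption of this sketch rather than documented behaviour.

```c
#include <assert.h>
#include <string.h>

typedef enum
{
    CHECK_TYPE_PING,        /* check availability with PQping() */
    CHECK_TYPE_CONNECTION,  /* check availability by executing a query */
    CHECK_TYPE_INVALID      /* unrecognised value */
} check_type_sketch;

/* Translate the configuration file value into a check type. */
check_type_sketch
parse_connection_check_type(const char *value)
{
    assert(value != NULL);

    /* assumption: an empty value selects the default, "ping" */
    if (value[0] == '\0' || strcmp(value, "ping") == 0)
        return CHECK_TYPE_PING;

    if (strcmp(value, "connection") == 0)
        return CHECK_TYPE_CONNECTION;

    return CHECK_TYPE_INVALID;
}
```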
Ian Barwick
19bcfa7264 Rename "..._primary_last_seen" functions to "..._upstream_last_seen"
As that better reflects what they do.
2019-03-06 13:23:33 +09:00
Ian Barwick
9753bcc8c3 repmgrd: during failover, check if other nodes have seen the primary
In a situation where only some standbys are cut off from the primary,
a failover would result in a split brain/split cluster situation,
as it's likely one of the cut-off standbys will promote itself, and
other cut-off standbys (but not all standbys) will follow it.

To prevent this happening, interrogate the other sibling nodes to
check whether they've seen the primary within a reasonably short interval;
if this is the case, do not take any failover action.

This feature is experimental.
2019-03-06 13:23:22 +09:00
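The decision described above can be sketched as a simple scan over the values reported by the sibling nodes. Everything here is illustrative: the function name, the use of a negative value for "unknown" and the threshold parameter are assumptions, not the project's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical marker for a sibling which could not report a value */
#define SKETCH_LAST_SEEN_UNKNOWN (-1)

/*
 * "last_seen_secs" holds, for each sibling node, the number of seconds
 * since that node last saw the primary. Returns 1 if any sibling has seen
 * the primary within "threshold_secs", in which case no failover action
 * should be taken; otherwise returns 0.
 */
int
primary_recently_seen_by_sibling(const int *last_seen_secs, size_t n,
                                 int threshold_secs)
{
    size_t i;

    assert(last_seen_secs != NULL || n == 0);

    for (i = 0; i < n; i++)
    {
        /* negative values mean "unknown" and are never counted */
        if (last_seen_secs[i] < 0)
            continue;

        if (last_seen_secs[i] <= threshold_secs)
            return 1;
    }

    return 0;
}
```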
Ian Barwick
39234afcbf standby clone: check upstream connections after data copy operation
With long-running copy operations, it's possible the connection(s) to
the primary/source server may go away for some reason, so recheck
their availability before attempting to reuse.
2019-02-26 14:37:51 +09:00
Ian Barwick
c30e65b3f2 Add some missing query error logging 2019-02-25 13:02:45 +09:00
Ian Barwick
07097575b1 daemon status: add column "upstream last seen"
This displays the interval (in seconds) since the repmgrd instance on
each node last confirmed its upstream node is available.
2019-02-23 13:03:16 +09:00
Ian Barwick
71d151ca87 Don't check status of logical replication slots
We only want to check the status of physical replication slots
to determine whether a streaming replication standby has become
detached and there is therefore a risk of uncontrolled WAL buildup
on the local node.

It's not feasible to second-guess the state of logical replication
slots.
2019-02-23 10:09:43 +09:00
Ian Barwick
de70fd42dc node check: simplify output generation in --is-shutdown-cleanly check 2019-02-22 10:49:06 +09:00
Ian Barwick
85a97c933f Handle unhandled NodeStatus in switch statement 2019-02-15 19:31:06 +09:00
Ian Barwick
9305953bd2 Fix history file parsing
Also add additional debugging output.
2019-02-14 15:52:40 +09:00
Ian Barwick
25019d1cc5 Refactor is_wal_replay_paused() query
Make sure it doesn't emit an error if executed on a node not
in recovery.

The caller should theoretically only execute it on nodes in
recovery, but there are sure to be corner cases where the node
has come out of recovery.
2019-02-12 10:21:05 +09:00
Ian Barwick
f0a0be0248 Remove pointless default allocation in _get_node_record() 2019-02-07 11:41:08 +09:00
Ian Barwick
c7b325e2a4 Add function resume_wal_replay() 2019-02-07 11:33:02 +09:00
Ian Barwick
b89941f218 Store WAL replay pause status in ReplInfo struct 2019-02-07 10:24:42 +09:00
Ian Barwick
2b3b1faa20 refactor query in function get_replication_info()
In particular handle all cases where one of the functions called
in the query can return NULL in the query itself.
2019-02-06 15:40:27 +09:00
Ian Barwick
984ce7420b "daemon status": emit warning if WAL replay is paused
Specifically, if WAL replay is paused *and* WAL is pending replay,
this node cannot be promoted until WAL replay is unpaused. In this
state it is not a suitable promotion candidate in a failover situation.
2019-02-06 13:32:20 +09:00
Ian Barwick
cd3312496e Rename functions which return an LSN for clarity 2019-02-06 09:32:53 +09:00
Ian Barwick
f62b3b2868 Fix Pg10+ function names 2019-02-05 13:37:35 +09:00
Ian Barwick
701944c194 "standby promote": add check for WAL replay status if replay is paused
If WAL replay is paused but WAL is still pending replay, PostgreSQL will ignore
the promote request until WAL replay is unpaused. This may lead to the standby
being promoted at an unpredictable point in time outside of repmgr's
control. Moreover it may not be obvious that this is happening, or why, and
it will appear that an apparently successful promotion attempt has not
actually worked.

To prevent this from happening, repmgr will now refuse to promote the
standby if WAL replay is paused *and* WAL is still pending replay.

GitHub #540.
2019-02-05 13:30:37 +09:00
Ian Barwick
92c73b68a0 Clean up dbutils.c
Put functions into the same "section" as noted in the header file.
2019-02-05 09:36:54 +09:00
Ian Barwick
f9a1861ded Refactor ReplInfo struct handling
Eventually we'll want to have this contain the optional replication
info contained in the t_node_info struct, which should then contain a
pointer to a ReplInfo struct.
2019-02-02 18:39:24 +09:00
Ian Barwick
20b79f998c Define some previously magic numbers 2019-02-01 19:14:16 +09:00
Ian Barwick
bdb4f66a9d Add an Assert() to detect attempted array overflow in param_set...() functions
Previously the code would do nothing if an attempt was made to add parameters
if the array is already full.

As the array is designed to contain all valid libpq connection parameters,
there's no reason it should ever "overflow" like this. If there is, then
it means the caller is attempting to add invalid values. Add an Assert()
so we can easily detect this in the unlikely event it ever occurs.

Noted after examining the issue raised in GitHub #533, which is nonsensical
as it implies we'd be OK with writing beyond the end of the array, however
it doesn't hurt to make it a bit clearer what is happening and why.
2019-01-31 14:11:00 +09:00
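As a rough illustration of the idea, the sketch below adds an assertion ahead of the existing "do nothing" behaviour. The list type, its capacity and the function name are hypothetical, and pointers are stored without copying to keep the example short.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_PARAM_LIST_SIZE 4  /* hypothetical capacity */

typedef struct
{
    int         count;
    const char *keywords[SKETCH_PARAM_LIST_SIZE];
    const char *values[SKETCH_PARAM_LIST_SIZE];
} param_list_sketch;

/* Set "keyword" to "value", replacing any existing entry for the keyword. */
void
param_set_sketch(param_list_sketch *list, const char *keyword,
                 const char *value)
{
    int i;

    assert(list != NULL && keyword != NULL && value != NULL);

    /* update in place if the keyword is already present */
    for (i = 0; i < list->count; i++)
    {
        if (strcmp(list->keywords[i], keyword) == 0)
        {
            list->values[i] = value;
            return;
        }
    }

    /* a full list means the caller passed an unexpected parameter */
    assert(list->count < SKETCH_PARAM_LIST_SIZE);

    /* with assertions compiled out (NDEBUG), fall back to doing nothing */
    if (list->count >= SKETCH_PARAM_LIST_SIZE)
        return;

    list->keywords[list->count] = keyword;
    list->values[list->count] = value;
    list->count++;
}
```

In assert-enabled builds the misuse is detected immediately, while production builds keep the earlier behaviour and never write past the end of the array.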
Ian Barwick
32b81e7d49 "daemon start": initial implementation 2019-01-29 13:01:14 +09:00
Ian Barwick
a48d408e4e Consistently log strerror output as DETAIL 2019-01-29 12:10:55 +09:00
Ian Barwick
1980deb480 repmgrd: check for a change to the upstream node
If the upstream node has changed, for example after "repmgr standby follow"
was manually executed, restart monitoring to ensure repmgrd is monitoring the
correct node.
2019-01-22 13:33:13 +09:00
Ian Barwick
7dce3ed234 Update copyright notices to 2019 2019-01-21 14:54:35 +09:00
Ian Barwick
d4e993a240 Improve handling of connection URIs when executing remote commands
Previously, if connection URIs were in use and "repmgr standby switchover"
was executed, repmgr would pass the connection URI as-is to the demotion
candidate to execute "repmgr node rejoin". However the presence of
unescaped ampersands in the connection URI was causing the rejoin command
to be incorrectly executed.

Addresses GitHub #525.
2019-01-14 11:11:51 +09:00
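One common way to stop a shell from interpreting ampersands in an argument is to wrap the argument in single quotes. The sketch below shows that general technique only; it is not necessarily how this commit addresses the issue, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Write "src" to "dst" wrapped in single quotes, so that characters such
 * as "&" are passed through literally by a POSIX shell. Embedded single
 * quotes are emitted as '\''. Returns 0 on success, or -1 if "dst" is too
 * small (in which case the contents of "dst" are unspecified).
 */
int
shell_single_quote(char *dst, size_t dst_size, const char *src)
{
    size_t      pos = 0;
    const char *p;

    assert(dst != NULL && src != NULL);

    /* room is needed for at least both quotes and the terminating NUL */
    if (dst_size < 3)
        return -1;

    dst[pos++] = '\'';

    for (p = src; *p != '\0'; p++)
    {
        /* a single quote expands to four characters, anything else to one */
        size_t need = (*p == '\'') ? 4 : 1;

        /* always keep two bytes spare for the closing quote and NUL */
        if (pos + need + 2 > dst_size)
            return -1;

        if (*p == '\'')
        {
            memcpy(dst + pos, "'\\''", 4);
            pos += 4;
        }
        else
        {
            dst[pos++] = *p;
        }
    }

    dst[pos++] = '\'';
    dst[pos] = '\0';

    return 0;
}
```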
Ian Barwick
40408a1734 repmgrd: check binary and extension major versions match
repmgr requires that the same "major version" (e.g. 4.3) is present
on all nodes, otherwise - particularly in the case of repmgrd - it's
highly likely things won't work as expected.

Implements part of GitHub #515.
2019-01-07 15:39:40 +09:00
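A major version comparison could be sketched as below. It assumes, purely for illustration, that versions are encoded as a single integer of the form major * 10000 + minor * 100 + patch (so 4.3.0 would be 40300), making the "major version" 4.3 the integer 403. The function names are hypothetical.

```c
#include <assert.h>

/*
 * Assumed encoding for this sketch: 4.3.2 is stored as 40302, so dropping
 * the last two digits leaves the "major version" (403 for 4.3).
 */
int
major_version_from_num(int version_num)
{
    assert(version_num >= 0);

    return version_num / 100;
}

/* Return 1 if binary and extension share the same major version. */
int
major_versions_match(int binary_version_num, int extension_version_num)
{
    return major_version_from_num(binary_version_num)
        == major_version_from_num(extension_version_num);
}
```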
Ian Barwick
313aa3c5d7 Refactor follow verification to reduce need for CHECKPOINT
A CHECKPOINT is not always required; hopefully we can narrow it down
to one corner case where we need to determine the minimum recovery
location.

Also get local timeline ID via IDENTIFY_SYSTEM, as fetching it from
pg_control risks returning the prior timeline ID if the timeline
switch has just taken place and no restart point has yet occurred.
2018-12-04 15:27:22 +09:00
Ian Barwick
c53782cda3 Fix typo in query 2018-11-29 15:24:49 +09:00
Ian Barwick
66b40ffc68 Simplify function create_replication_slot()
Following the changes in 793d83b, it's no longer necessary to
pass the server version number.
2018-11-29 14:35:01 +09:00
Ian Barwick
a6a2be2239 Teach witness repmgrd to deal with the absence of a primary
Previously it would refuse to start if the primary was not reachable,
the thinking being that it's pointless trying to monitor an incomplete
cluster.

However following an aborted failover situation, repmgrd will restart
monitoring and on the witness server, this will lead to it aborting
itself due to the continuing absence of a primary.

To resolve this, witness repmgrd will now start monitoring in degraded
mode if no primary is found in the hope a primary will reappear at
some point.
2018-11-29 12:15:41 +09:00