Commit Graph

97 Commits

Author SHA1 Message Date
Ian Barwick
9165d27f9f "repmgr node ...": fixes for 9.3
Mainly to account for the lack of replication slots.
2017-11-16 11:25:16 +09:00
Ian Barwick
a6cc4d80f0 Add "witness register" functionality 2017-11-15 13:47:45 +09:00
Ian Barwick
eb14bb58c6 Add configuration file "passfile"
This will enable a custom .pgpass to be included in "primary_conninfo"
(provided it's supported by the libpq version on the standby).
2017-11-14 19:30:25 +09:00
Ian Barwick
aa089820ab repmgrd: check shared library is loaded
If this isn't the case, "repmgrd" will appear to run but not handle
failover correctly.

Address GitHub #337.
2017-11-10 14:35:17 +09:00
Ian Barwick
4ca7e6a6bf repmgrd: remove unneeded functions 2017-11-09 19:31:08 +09:00
Ian Barwick
79d21b516b repmgrd: fixes to failover handling
get_new_primary() returns NULL if no notification for the new primary has
been received, but the code was expecting it to return UNKNOWN_NODE_ID,
which was causing repmgrd to prematurely drop out of the new primary
detection loop if no notification had been received by the time the loop
started.

Also store the electoral term as a single row, single column table,
to ensure that all repmgrds see the same turn. It is then bumped
by the winning node after it gets promoted.

Various logging improvements.
2017-11-08 14:28:08 +09:00
Ian Barwick
b6b31b15b2 Implement "repmgr cluster cleanup" 2017-09-11 13:48:46 +09:00
Ian Barwick
a9f4a027a7 pgindent run 2017-09-11 11:14:13 +09:00
Ian Barwick
e4f7dc8234 Add copyright notices 2017-09-08 13:27:39 +09:00
Ian Barwick
edee80cc37 Rename option "node check --is-shutdown" to "--is-shutdown-cleanly"
As that's what we really want to know. Also return "UNCLEAN_SHUTDOWN"
if that's the case, rather than "RUNNING" which is confusing, even
though it's a command for internal use.
2017-09-07 11:15:27 +09:00
Ian Barwick
fcd111ac4c Improve logging output during failover process 2017-08-24 22:44:03 +09:00
Ian Barwick
eee8d65259 Update view "replication_status" 2017-08-24 15:05:13 +09:00
Ian Barwick
a659132ea4 repmgrd: write monitoring statistics 2017-08-24 11:49:44 +09:00
Ian Barwick
4c0d719cdb Add replication slot check to "repmgr node check" 2017-08-16 11:17:02 +09:00
Ian Barwick
554673e83e Add "repmgr node check --downstream" 2017-08-15 15:50:46 +09:00
Ian Barwick
10ef30096c "node check": add server role check 2017-08-14 22:57:09 +09:00
Ian Barwick
3b2158edbf Initialise variables, where appropriate 2017-08-14 15:11:42 +09:00
Ian Barwick
eabd56f3be "standby follow": check node system identifiers match 2017-08-14 11:45:08 +09:00
Ian Barwick
0f31756733 General code cleanup 2017-08-14 10:04:53 +09:00
Ian Barwick
b95b3e50e3 Return system identification information with appropriate data types 2017-08-14 08:50:54 +09:00
Ian Barwick
50b82f785e Add function to execute "IDENTIFY_SYSTEM" 2017-08-11 22:01:02 +09:00
Ian Barwick
8a50a72dc5 Additional "node status" output 2017-08-10 17:18:08 +09:00
Ian Barwick
4f2161bd83 Cleanup various #defines 2017-08-10 15:11:53 +09:00
Ian Barwick
7ca68b7cc8 Standardize "primary_conninfo" generation
Previously repmgr would write all the default libpq parameters
into "primary_conninfo" on "standby clone", but not for
"standby follow", which is inconsistent.

For repmgr4 we'll determine that the upstream node's conninfo
must be canonical and contain all required connection parameters,
even if these are available as defaults or environment variables
in the local environment, as those are transient and may not
be available in all environments/situations.

recovery.conf's "primary_conninfo" will be generated using the
upstream's conninfo parameters, except for those specific
to the downstream node. These are:

  - "application_name": this will always be set to the
      "node_name"  of the downstream node
  - "passfile" and "servicefile": these, must of course
    reference files on the downstream node so will be extracted
    from the downstream node's conninfo, if set
2017-08-10 12:37:50 +09:00
Ian Barwick
1d99a07b43 Store configuration file in repmgr.nodes table
When executing repmgr on remote nodes, we otherwise end up jumping
through hoops as we can't make assumptions about where the configuration
file is located, but really need to be able to provide it.

From a support point of view it will also make life easier as it will
be easy to specify exactly which file to provide.
2017-08-10 08:03:24 +09:00
Ian Barwick
a57fb5b50c After switchover, enable sibling standbys to follow new primary 2017-08-10 00:06:16 +09:00
Ian Barwick
f2cf46bba3 Check replication lag before attempting switchover 2017-08-08 10:16:47 +09:00
Ian Barwick
2499b42ef8 switchover: check for pending archive files on the demotion candidate
If the current primary (demotion candidate) still has any files to archive,
it will delay the shutdown until all files are archived. If there is a
substantial number of files, and/or the archive command executes slowly,
this will probably lead to an unwelcome delay in the switchover process.
2017-08-08 00:37:20 +09:00
Ian Barwick
5948cf6cda repmgr standby switchover: add sanity check for pg_rewind useability
pg_rewind will only be executed on a demoted primary if explictly
requested, to prevent transactions on the primary, which
were never replicated, from being automatically overwritten.

If --force-rewind is provided, we'll need to check pg_rewind
is actually useable before we need to use it.
2017-08-04 00:45:55 +09:00
Ian Barwick
112ca6321a Initial switchover implementation
The repmgr3 implementation required the promotion candidate (standby)
to directly work with the demotion candidate's data directory,
directly execute server control commands etc.

Here we delegated a lot more of that work to the repmgr on the
demotion candidate, which reduces the amount of back-and-forth
over SSH and generally makes things cleaner and smoother.

In particular the repmgr on the demotion candidate will carry
out a thorough check that the node is shut down and report
the last checkpoint LSN to the promotion candidate; this
can then be used to determine whether pg_rewind needs to be
executed on the demoted primary before reintegrating it back
into the cluster (todo).

Also implement "--dry-run" for this action, which will sanity-check the
nodes as far as possible without executing the switchover.

Additionally some of the new repmgr node commands (or command options)
introduced for this can be also executed by the user to obtain
additional information about the status of each node.
2017-08-03 16:38:37 +09:00
Ian Barwick
aa528dfdfb Consolidate generation of various server control commands
This is needed for better switchover control, so we can instruct
the remote repmgr to issue the appropriate server command rather
than trying to work out what it should be from the local node.
2017-08-02 12:01:20 +09:00
Ian Barwick
e5d50bbfd5 Separate configuration file queries into a discrete function
Simplifies main application code and makes it easier to reuse
the queries.
2017-08-02 00:04:20 +09:00
Ian Barwick
f023b9c90c Add "repmgr node archive-config" 2017-08-01 17:38:54 +09:00
Ian Barwick
8a5665a421 repmgr node status: add information about current LSN locations for streaming standbys 2017-08-01 10:34:12 +09:00
Ian Barwick
dc24d62009 repmgrd: improve BDR recovery handling 2017-07-27 11:53:55 +09:00
Ian Barwick
8a2e4db1bc Add "repmgr node status"
Outputs an overview of a node's status, and emits warnings if any
issues detected.
2017-07-25 00:39:04 +09:00
Ian Barwick
b99443b0c8 Improvements to repmgr cluster show
Add documentation; show recovery status in --csv mode.
2017-07-20 10:25:13 +09:00
Ian Barwick
49ac9cf9ca Add "repmgr cluster show" 2017-07-19 17:36:21 +09:00
Ian Barwick
23e6440dfd repmgrd: initiate primary monitoring when local node is promoted manually 2017-07-19 11:15:38 +09:00
Ian Barwick
6e270b2faf repmgrd: catch cases where more than one node has initiated voting
The node(s) with higher ID will "yield", leaving the decision making
up to the node with the lower ID.

This happens very rarely, usually when the random delay is close
enough on two or mode nodes that vote initiation is simultaneous.
2017-07-18 17:04:24 +09:00
Ian Barwick
a56bb41891 Remove redundant fields from node record struct 2017-07-17 14:11:14 +09:00
Ian Barwick
0dcd479322 Store node status in node record struct 2017-07-17 13:50:17 +09:00
Ian Barwick
ec554e5694 Improve connection handling
Set "connect_timeout" and "fallback_application_name" if not present.
2017-07-17 11:10:37 +09:00
Ian Barwick
a29bc3e0fa Rename config.[ch] to configfile.[ch] 2017-07-16 09:41:26 +09:00
Ian Barwick
951c7dbd07 repmgrd: in BDR mode, have each repmgrd monitor each node
This will cover both the case when an entire node including
repmgrd goes down, and when one PostgreSQL instance goes down
but repmgrd is still up (in which case only one of the repmgrds
will handle the failover).
2017-07-14 15:01:18 +09:00
Ian Barwick
e3b3fb65f0 repmgrd: restrict BDR monitoring to two node setup
It's not safe to have more than two nodes with this kind of
"failover", so we don't need to select alternative nodes by
priority.
2017-07-14 12:56:11 +09:00
Ian Barwick
dfcf85a62f repmgrd: further BDR sanity checks 2017-07-14 10:27:28 +09:00
Ian Barwick
7eadbf6b17 Various improvements to "repmgr bdr register/unregister" 2017-07-12 22:38:03 +09:00
Ian Barwick
0a1addfdc0 When registering a BDR node, sync repmgr.nodes from another node
If a BDR node is added via bdr_group_join(), repmgr.nodes will
start off empty, so we'll need to sync it ourselves before adding
it to the repmgr replication set.
2017-07-12 10:11:25 +09:00
Ian Barwick
1cccb1dd5a Add "repmgr bdr unregister" 2017-07-12 10:11:21 +09:00