319 Commits

Author SHA1 Message Date
Ian Barwick
b7f20ee1f7 repmgrd: don't start if node is inactive and failover=automatic
If failover=automatic, it would be reasonable to expect repmgrd to
consider this node as a promotion candidate, however this will not
happen if it is marked inactive. This often happens when a failed
primary is recloned as a standby but not re-registered; if
repmgrd were to run, it would give the incorrect impression that
failover capability is available.

Addresses GitHub #153.
2016-09-28 10:59:20 +09:00
Ian Barwick
8de84707d9 Always use PQstatus to check connection status
This addresses GitHub #234.
2016-08-25 08:35:47 +09:00
Ian Barwick
ef7bed1b3d repmgrd: refactor standby monitoring status query and code
This had grown somewhat complex with addition of handling for
various corner cases. Much of the work has now been delegated
to the query itself.
2016-08-16 19:15:58 +09:00
Ian Barwick
6bd1c6a36d Skip largely pointless master reconnection attempt.
Experimental - see notes in code.
2016-08-16 13:25:39 +09:00
Ian Barwick
9831cabd4d Minor refactoring of do_master_failover()
- rename some variables for clarity
- ensure all structures are initialised correctly
- update code comments
2016-08-16 11:23:59 +09:00
Ian Barwick
a310417a49 Refactor standby monitoring query
Addresses GitHub #224
2016-08-11 17:28:16 +09:00
Ian Barwick
84ab37c600 Improve handling of failover events when failover is set to manual
- prevent repmgrd from repeatedly executing the failover code
- add event notification 'standby_disconnect_manual'
- update documentation

This addresses GitHub #221.
2016-08-09 12:09:09 +09:00
Ian Barwick
6a198401db Fix repmgrd's command line help option parsing
As in commit d0c05e6f46, properly distinguish between
the command line option -? and getopt's unknown option marker '?'
2016-08-08 21:17:56 +09:00
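The ambiguity described in the commit above can be illustrated with a minimal sketch. This is not the repmgrd source: `parse_args`, the `parse_result` enum and the option string `"?v"` are hypothetical names chosen for illustration. It assumes the literal argument `-?` is passed on its own, not bundled with other short options.

```c
#include <getopt.h>
#include <string.h>

/* Hypothetical result codes for this sketch (not taken from repmgr). */
enum parse_result { PARSE_OK, PARSE_HELP, PARSE_UNKNOWN };

static enum parse_result
parse_args(int argc, char **argv)
{
    int c;

    optind = 1;   /* restart scanning so the function can be called repeatedly */
    opterr = 0;   /* suppress getopt's own diagnostics */

    while ((c = getopt(argc, argv, "?v")) != -1)
    {
        switch (c)
        {
            case 'v':
                break;
            case '?':
                /*
                 * getopt() returns '?' both for a literal "-?" and for any
                 * unknown option, so inspect the argument it just consumed
                 * to tell the two cases apart.
                 */
                if (optind > 1 && strcmp(argv[optind - 1], "-?") == 0)
                    return PARSE_HELP;
                return PARSE_UNKNOWN;
        }
    }

    return PARSE_OK;
}
```

With this approach `prog -?` is reported as a help request, while `prog -z` is reported as an unknown option even though getopt() returns the same character in both cases.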
Ian Barwick
cb78802027 repmgrd: prevent endless loops in failover with manual node
The LSN reported by the shared memory function defaults to "0/0"
(InvalidXLogRecPtr) - this indicates that the repmgrd on that node
hasn't been able to update it yet. However during failover several
places in the code assumed this is an error, which would cause
an endless loop waiting for updates which would never come.

To get around this without changing function definitions, we can
store an explicit message in the shared memory location field so the
caller can tell whether the other node hasn't yet updated the field,
or encountered a situation which means it should not be considered
as a promotion candidate (which in most cases will be because
`failover` is set to `manual`).

Resolves GitHub #222.
2016-08-08 14:29:24 +09:00
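A rough sketch of how a caller might distinguish the three cases described in the commit above. The marker text `NOT-A-CANDIDATE`, the enum and the function name are assumptions made for illustration; the log does not state what repmgrd actually stores in the shared memory field.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sentinel text; the marker repmgrd really uses is not given in the log. */
#define NOT_A_CANDIDATE_MARKER "NOT-A-CANDIDATE"

enum lsn_status
{
    LSN_NOT_YET_UPDATED,   /* "0/0": the other node has not reported yet */
    LSN_NOT_A_CANDIDATE,   /* explicit marker: do not wait for this node */
    LSN_VALID,             /* a usable location was reported */
    LSN_UNPARSEABLE        /* anything else */
};

static enum lsn_status
classify_reported_lsn(const char *reported, unsigned long long *lsn_out)
{
    unsigned int hi, lo;

    /* Check for the explicit marker before trying to parse a location. */
    if (strcmp(reported, NOT_A_CANDIDATE_MARKER) == 0)
        return LSN_NOT_A_CANDIDATE;

    /* LSNs are written as two hexadecimal numbers separated by a slash. */
    if (sscanf(reported, "%X/%X", &hi, &lo) != 2)
        return LSN_UNPARSEABLE;

    *lsn_out = ((unsigned long long) hi << 32) | lo;

    if (*lsn_out == 0)
        return LSN_NOT_YET_UPDATED;

    return LSN_VALID;
}
```

Because the marker is a distinct value, a waiting loop can give up on a node that reports it instead of polling indefinitely for an update.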
Ian Barwick
02668ee045 Parse the contents of the "pg_basebackup_options" parameter in repmgr.conf
This is to ensure that when repmgr executes pg_basebackup it doesn't
add any options which would conflict with user-supplied options.

This is related to GitHub #206, where the -S/--slot option has been
added for 9.6 - it's important to check this doesn't conflict with
-X/--xlog-method.

While we're at it, rename the ErrorList handling code to ItemList
etc. so we can use it for generic non-error-related lists.
2016-07-26 16:12:43 +09:00
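The kind of check described in the commit above might look roughly like the following. This is a simplified sketch, not the repmgr parser: `scan_basebackup_options` and `struct basebackup_flags` are hypothetical, the option string is split on whitespace only (no quoting), and only the two pg_basebackup options named in the commit are detected.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical record of which user-supplied options were seen. */
struct basebackup_flags
{
    bool slot;          /* -S / --slot */
    bool xlog_method;   /* -X / --xlog-method */
};

static struct basebackup_flags
scan_basebackup_options(const char *options)
{
    struct basebackup_flags found = { false, false };
    char buf[256];
    char *token;

    /* strtok() modifies its input, so work on a bounded local copy. */
    strncpy(buf, options, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (token = strtok(buf, " \t"); token != NULL; token = strtok(NULL, " \t"))
    {
        /* Prefix matches also cover "-Xstream" and "--slot=name" forms. */
        if (strncmp(token, "-S", 2) == 0 || strncmp(token, "--slot", 6) == 0)
            found.slot = true;
        else if (strncmp(token, "-X", 2) == 0 || strncmp(token, "--xlog-method", 13) == 0)
            found.xlog_method = true;
    }

    return found;
}
```

A caller could consult the returned flags before appending its own `-S` or `-X` arguments to the pg_basebackup command line.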
Ian Barwick
091541619d Fix repmgrd monitoring calculation when in archive recovery 2016-07-06 09:27:31 +09:00
Ian Barwick
74f6f97f26 repmgrd: log whether in standby or witness monitor loop
This is mainly for development and debugging purposes.
2016-06-29 10:31:57 +09:00
Ian Barwick
f1ee6e19b6 Ensure configuration options correctly initialised in repmgrd.c
Per GitHub #150.

Also remove unused variable.
2016-06-27 11:26:05 +09:00
Ian Barwick
a2b5ba595a repmgrd: reword log message for clarity 2016-06-23 09:47:35 +09:00
Ian Barwick
c16ab3c889 Fix handling of global PGconn variables in repmgrd
Don't call PQfinish before calling terminate(), elsewhere always
set to NULL after calling PQfinish().

This fixes GitHub #182.
2016-06-21 17:30:22 +09:00
Ian Barwick
dd5b6f9f12 Whitespace fixes 2016-06-21 16:04:41 +09:00
Ian Barwick
303bb22ee1 Note potential replication lag check improvement 2016-06-20 12:23:34 +09:00
Ian Barwick
5d8b1a3a31 monitoring: ensure that invalid replication_lag value is not inserted.
Per GitHub #189.
2016-06-20 10:55:25 +09:00
Ian Barwick
1ade1acb22 Report standby location as last apply location when in archive recovery
Otherwise the monitoring table's 'last_wal_standby_location' will stay at
the location of the last streaming WAL received.

This complements the bugfix applied in e814c1120e.
2016-06-15 15:41:10 +09:00
Ian Barwick
66fd003ab4 Schema-qualify pg_catalog objects 2016-06-10 17:58:10 +09:00
Martin
b6ebd34e2f Some other indentation fixes found 2016-06-03 20:20:43 -03:00
Martin
46ff9fb587 No code change, just indentation was incorrect in the failover part
making it hard to read.
2016-06-03 20:20:43 -03:00
Ian Barwick
e814c1120e repmgrd: handle situations where streaming replication is inactive 2016-05-12 22:17:44 +09:00
Ian Barwick
247823db4d Remove extraneous PQfinish() 2016-05-12 14:05:44 +09:00
Ian Barwick
0a798bf6e4 Comment fixes and formatting tweaks 2016-05-12 09:52:22 +09:00
Ian Barwick
21b2ff1a1f repmgrd: better handling of missing upstream_node_id
Ensure we default to master node.
2016-05-12 09:20:33 +09:00
Ian Barwick
57f9432692 Add missing newlines in log messages 2016-05-11 21:47:40 +09:00
Ian Barwick
54d3c7a4ca repmgrd: avoid additional connection to local instance in do_master_failover() 2016-05-11 09:55:38 +09:00
Ian Barwick
b0f6b7bad7 repmgrd: rename variable for clarity 2016-05-11 08:29:55 +09:00
Ian Barwick
4dbbf40196 Don't follow the promotion candidate standby if the primary reappears 2016-05-10 13:58:59 +09:00
Ian Barwick
d5e24689a4 Don't terminate a standby's repmgrd if self-promotion fails due to master reappearing
Per GitHub #173
2016-05-10 11:45:03 +09:00
Ian Barwick
2946c097f0 repmgrd: rename some variables to better match the system functions they're populated from 2016-04-12 15:51:42 +09:00
Ian Barwick
5d32026b79 Improve debugging output for node resyncing
We'll need this for testing.
2016-04-01 11:29:35 +09:00
Ian Barwick
190cc7dcb4 Rename copy_configuration () to witness_copy_node_records()
As it's witness-specific. Per suggestion from Martín.
2016-04-01 08:44:23 +09:00
Ian Barwick
c48c248c15 Regularly sync witness server repl_nodes table.
Although the witness server will resync the repl_nodes table following
a failover, other operations (e.g. removing or cloning a standby)
were previously not reflected in the witness server's copy of this
table.

As a short-term workaround, automatically resync the table at regular
intervals (defined by the configuration file parameter
"witness_repl_nodes_sync_interval_secs", default 30 seconds).
2016-03-29 16:49:28 +09:00
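The interval check behind such a periodic resync can be sketched as follows. The helper `witness_sync_due` and the macro name are hypothetical; only the parameter name "witness_repl_nodes_sync_interval_secs" and its 30-second default come from the commit message.

```c
#include <stdbool.h>
#include <time.h>

/* Default taken from the commit message; the macro name is an assumption. */
#define DEFAULT_WITNESS_SYNC_INTERVAL_SECS 30

/* Returns true once at least interval_secs have elapsed since last_sync. */
static bool
witness_sync_due(time_t last_sync, time_t now, int interval_secs)
{
    return difftime(now, last_sync) >= (double) interval_secs;
}
```

A monitoring loop would call this on each iteration with `time(NULL)` and record the new timestamp after every successful resync.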
Ian Barwick
c828598bfb It's unlikely this situation will occur on a witness server
Which is why the error message is for master/standby only.
2016-03-28 15:53:25 +09:00
Ian Barwick
d400d7f9ac repmgrd: fix error message 2016-02-24 15:33:36 +09:00
Ian Barwick
c6e1bc205a Prevent repmgr/repmgrd running as root 2016-02-22 14:58:17 +09:00
Ian Barwick
1375adcac8 Standardize capitalisation in log messages 2016-01-28 07:24:45 +09:00
Ian Barwick
e859a58405 Change some repmgrd log messages to NOTICE
So key events during failover on promoted and following standbys
are logged at the same level.
2016-01-27 18:39:27 +09:00
Ian Barwick
b72058dba8 Update copyright notice to 2016 2016-01-05 15:57:46 +09:00
Ian Barwick
7b2439b824 repmgrd: -v/--verbose option does not require a parameter 2016-01-05 10:45:47 +09:00
Ian Barwick
7a4d84379c Prevent invalid replication_lag values being written to the monitoring table
A fix for this was introduced with commit ee9270fe8d
and removed in 4f1c67a1bf.

Refactor the original fix to simply omit attempting to write an invalid entry
to the monitoring table.
2016-01-04 13:31:50 +09:00
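One way to "simply omit" an invalid entry, sketched under an assumption: the log does not define what counts as an invalid replication_lag value, so this example treats a standby location ahead of the primary's (which would make the difference negative) as invalid. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Computes the lag in bytes between two WAL locations.
 * Returns false (leaving *lag_bytes untouched) when the subtraction would
 * underflow, so the caller can skip writing a monitoring record.
 */
static bool
compute_replication_lag(uint64_t primary_lsn, uint64_t standby_lsn,
                        uint64_t *lag_bytes)
{
    if (standby_lsn > primary_lsn)
        return false;

    *lag_bytes = primary_lsn - standby_lsn;
    return true;
}
```

The caller writes to the monitoring table only when the function returns true, rather than inserting a placeholder value.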
Ian Barwick
490e12b1af Clean up whitespace and comments 2016-01-04 11:58:33 +09:00
Martín Marqués
7b9df3ac8f Merge pull request #133 from martinmarques/fix-standby-follows-other-node-repmgrd-fails
Fix standby follows other node repmgrd fails
2015-12-29 13:25:09 -03:00
Martín Marqués
d6bf870316 Merge pull request #131 from martinmarques/fix-failed-standby
Fix failed standby
2015-12-29 13:24:08 -03:00
Ian Barwick
cfec04d19f Modify log output to hint 2015-12-18 17:24:04 +09:00
Martin
4f1c67a1bf This doesn't really mean the standby is following a new master, so we are
removing it.
Basically, on startup the standby will start receiving again from the
beginning of the WAL and so received will be lower than applied.

Proper code is needed to make sure the standby is still following the
correct master (as per node information)
2015-12-17 12:17:03 -03:00
Martín Marqués
aca2b9547f Change where we reactivate the standby node that had failed.
We will do it where we are sending the message that says that the
standby has recovered, eliminating some complexity.
2015-12-11 09:36:48 -03:00
Martín Marqués
c9db7f57d2 Fix bug discovered last week which prevents recovered standby from being
used in the cluster.
Main issue was that if the local repmgrd was not able to connect locally,
it would set the local node as failed (active = false). This is fine, because
we actually don't know if the node is active (actually, it's not active ATM)
so it's best to keep it out of the cluster.
The problem is that if the postgres service comes back up, and is able to
recover by itself, then we should acknowledge that fact and set it as active.
There was another issue related to repmgrd being terminated if the postgres
service was down. This is not the correct thing to do: we should keep
trying to connect to the local standby.
2015-12-07 16:14:19 -03:00