282 Commits

Author SHA1 Message Date
Ian Barwick
05b47cb2a8 Prevent repmgr/repmgrd running as root 2016-02-23 14:37:44 +09:00
Ian Barwick
1375adcac8 Standardize capitalisation in log messages 2016-01-28 07:24:45 +09:00
Ian Barwick
e859a58405 Change some repmgrd log messages to NOTICE
So that key events during failover on promoted and following standbys
are logged at the same level.
2016-01-27 18:39:27 +09:00
Ian Barwick
b72058dba8 Update copyright notice to 2016 2016-01-05 15:57:46 +09:00
Ian Barwick
7b2439b824 repmgrd: -v/--verbose option does not require a parameter 2016-01-05 10:45:47 +09:00
Ian Barwick
7a4d84379c Prevent invalid replication_lag values being written to the monitoring table
A fix for this was introduced with commit ee9270fe8d
and removed in 4f1c67a1bf.

Refactor the original fix to simply omit attempting to write an invalid entry
to the monitoring table.
2016-01-04 13:31:50 +09:00
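
A minimal sketch of the idea in commit 7a4d84379c, written in Python rather than repmgr's C. The function `record_lag` and the in-memory `monitoring_rows` list are hypothetical stand-ins for the monitoring table insert. The test for "invalid" (received location behind applied location, giving a negative lag) is an assumption based on the message of commit 4f1c67a1bf and may differ from the actual check.

```python
def record_lag(monitoring_rows, node_id, received_lsn, applied_lsn):
    """Append a monitoring row, or skip it if the lag would be invalid.

    Assumption: "invalid" means the received location is behind the
    applied location, which would yield a negative lag.
    """
    lag = received_lsn - applied_lsn
    if lag < 0:
        # Omit the write entirely rather than store a bogus value.
        return False
    monitoring_rows.append({"node_id": node_id, "replication_lag": lag})
    return True


rows = []
record_lag(rows, 2, received_lsn=5000, applied_lsn=4200)  # written
record_lag(rows, 2, received_lsn=100, applied_lsn=4200)   # skipped
```

Skipping the write keeps the monitoring loop running, in contrast to the earlier fix in ee9270fe8d, which terminated repmgrd.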
Ian Barwick
490e12b1af Clean up whitespace and comments 2016-01-04 11:58:33 +09:00
Martín Marqués
7b9df3ac8f Merge pull request #133 from martinmarques/fix-standby-follows-other-node-repmgrd-fails
Fix standby follows other node repmgrd fails
2015-12-29 13:25:09 -03:00
Martín Marqués
d6bf870316 Merge pull request #131 from martinmarques/fix-failed-standby
Fix failed standby
2015-12-29 13:24:08 -03:00
Ian Barwick
cfec04d19f Modify log output to hint 2015-12-18 17:24:04 +09:00
Martin
4f1c67a1bf This doesn't really mean the standby is following a new master, so we are
removing it.
Basically, on startup the standby will start receiving again from the
beginning of the WAL and so received will be lower than applied.

Proper code is needed to make sure the standby is still following the
correct master (as per node information)
2015-12-17 12:17:03 -03:00
Martín Marqués
aca2b9547f Change where we reactivate the standby node that had failed.
We will do it where we are sending the message that says that the
standby has recovered, eliminating some complexity
2015-12-11 09:36:48 -03:00
Martín Marqués
c9db7f57d2 Fix bug discovered last week which prevents recovered standby from being
used in the cluster.
Main issue was that if the local repmgrd was not able to connect locally,
it would set the local node as failed (active = false). This is fine, because
we actually don't know if the node is active (actually, it's not active ATM)
so it's best to keep it out of the cluster.
The problem is that if the postgres service comes back up, and is able to
recover by itself, then we should acknowledge that fact and set it as active.
There was another issue related to repmgrd being terminated if the postgres
service was down. This is not the correct thing to do: we should keep
trying to connect to the local standby.
2015-12-07 16:14:19 -03:00
Martín Marqués
96ac39ba0f Fix bug discovered last week which prevents recovered standby from being
used in the cluster.
Main issue was that if the local repmgrd was not able to connect locally,
it would set the local node as failed (active = false). This is fine, because
we actually don't know if the node is active (actually, it's not active ATM)
so it's best to keep it out of the cluster.
The problem is that if the postgres service comes back up, and is able to
recover by itself, then we should acknowledge that fact and set it as active.
There was another issue related to repmgrd being terminated if the postgres
service was down. This is not the correct thing to do: we should keep
trying to connect to the local standby.
2015-12-07 15:59:28 -03:00
Ian Barwick
120688013e Add "standby switchover" mode
Perform a switchover by:
 - stopping current primary node
 - promoting this standby node to primary
 - forcing previous primary node to follow this node

Caveats:
 - repmgrd must not be running, otherwise it may
   attempt a failover
   (TODO: find some way of notifying repmgrd of planned
    activity like this)
 - currently only set up for two-node operation; any other
   standbys will probably become downstream cascaded standbys
   of the old primary once it's restarted
 - as we're executing repmgr remotely (on the old primary),
   we'll need the location of its configuration file; this
   can be provided explicitly with -C/--remote-config-file,
   otherwise repmgr will look in default locations on the
   remote server
 - this does not yet support "rewinding" stopped nodes
   which will be unable to catch up with the primary

TODO:
 - update help, docs
 - make connection test timeouts/intervals configurable
2015-11-30 12:20:24 +09:00
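
The ordering of the switchover steps in commit 120688013e can be sketched as follows. This is an illustration in Python, not repmgr's implementation: the function `switchover` and its callable parameters are hypothetical, and the real command runs repmgr remotely on the old primary.

```python
def switchover(stop_primary, promote_standby, follow_new_primary, repmgrd_running):
    """Run the three switchover steps in order; return the steps performed."""
    if repmgrd_running:
        # Caveat from the commit message: repmgrd may attempt a failover.
        raise RuntimeError("repmgrd must not be running during a switchover")
    performed = []
    for name, step in (
        ("stop primary", stop_primary),
        ("promote standby", promote_standby),
        ("old primary follows new primary", follow_new_primary),
    ):
        step()
        performed.append(name)
    return performed


# No-op callables stand in for the real operations.
steps = switchover(lambda: None, lambda: None, lambda: None, repmgrd_running=False)
```

If any step raises, the later steps are not attempted, so the old primary is never told to follow a standby that failed to promote.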
Ian Barwick
933647d6de Make t_node_info generally available
And have it include all the fields from the repl_nodes table.
2015-11-25 12:57:18 +09:00
Ian Barwick
d1b4280182 Add /etc/repmgr.conf as a default configuration file location
Also refactor configuration file handling while we're at it.

Previously a configuration file would be ignored if it couldn't
be opened, however that is now treated as an error.
2015-11-19 15:16:18 +09:00
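
A rough Python sketch of the behaviour described in commit d1b4280182: an explicitly named configuration file that cannot be opened is an error, while default locations are only probed. The function name `resolve_config_file` is hypothetical; `/etc/repmgr.conf` is the only default location taken from the commit message, and repmgr's actual search order is not shown here.

```python
import os
import tempfile


def resolve_config_file(explicit_path=None, default_locations=("/etc/repmgr.conf",)):
    """Return the configuration file to use, or None if none is found."""
    if explicit_path is not None:
        if not os.access(explicit_path, os.R_OK):
            # Previously ignored; now treated as an error.
            raise OSError(f"unable to open configuration file: {explicit_path}")
        return explicit_path
    for candidate in default_locations:
        if os.access(candidate, os.R_OK):
            return candidate
    return None


# Demonstrate the default-location probe with temporary files.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "repmgr.conf")
    with open(present, "w") as f:
        f.write("# placeholder\n")
    missing = os.path.join(tmp, "missing.conf")
    found = resolve_config_file(default_locations=(missing, present))
    found_ok = found == present
```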
Ian Barwick
64d038c823 Simplify logger_init() parameters
We're passing the t_configuration_options structure anyway, no need to
pass items it contains as separate parameters.
2015-11-19 14:05:20 +09:00
Ian Barwick
9018dc65de Metadata update also handled by repmgr 2015-11-18 13:17:51 +09:00
Ian Barwick
9cbd8df089 When following a new primary, have repmgr (not repmgrd) create the new slot 2015-11-18 13:06:56 +09:00
Ian Barwick
8ab1901a93 Repurpose -v/--verbose; add -t/--terse option (repmgr only)
repmgr and particularly repmgrd currently produce substantial
amounts of log output. Much of this is only useful when troubleshooting
or debugging.

Previously the -v/--verbose option just forced the log level to
INFO. With repmgrd this is pretty pointless - just set the log
level in the configuration file. With repmgr the configuration
file can be overridden by the new -L/--log-level option.

-v/--verbose now provides an additional, chattier/pedantic level
of logging ("Opening *this* logfile", "Executing *this* query",
"running in *this* loop") which is helpful for understanding
repmgr/repmgrd's behaviour, particularly for troubleshooting.
What additional verbose logging is generated will of course
also depend on the log level set, so e.g. someone trying to
work out which configuration file is actually being opened
can use '--log-level=INFO --verbose' without being bothered
by an avalanche of extra verbose debugging output.

-t/--terse option will silence certain non-essential output, at
the moment any HINTs.

Note that -v/--verbose and -t/--terse are not mutually exclusive
(suggestions for better names welcome).
2015-11-16 13:06:32 +09:00
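
How the log level, -v/--verbose and -t/--terse interact in commit 8ab1901a93 can be sketched as below. This is a Python illustration under assumptions: the class `SketchLogger` and its method names are hypothetical, and the numeric severities follow the syslog convention (lower is more severe), which may not match repmgr's internals exactly.

```python
# Syslog-style severities: a lower number is more severe.
LEVELS = {"ERROR": 3, "WARNING": 4, "NOTICE": 5, "INFO": 6, "DEBUG": 7}


class SketchLogger:
    """Hypothetical logger; collects output in a list instead of printing."""

    def __init__(self, log_level="NOTICE", verbose=False, terse=False):
        self.threshold = LEVELS[log_level]
        self.verbose = verbose
        self.terse = terse
        self.lines = []

    def log(self, level, message):
        # Ordinary messages are filtered by the log level alone.
        if LEVELS[level] <= self.threshold:
            self.lines.append(f"{level}: {message}")

    def log_verbose(self, level, message):
        # Extra detail needs --verbose AND must still pass the log level.
        if self.verbose:
            self.log(level, message)

    def hint(self, message):
        # --terse silences hints; it is independent of --verbose.
        if not self.terse:
            self.lines.append(f"HINT: {message}")


log = SketchLogger(log_level="INFO", verbose=True, terse=True)
log.log_verbose("INFO", "opening configuration file")  # kept
log.log_verbose("DEBUG", "executing query")            # dropped by log level
log.hint("check the configuration file path")          # dropped by --terse
```

This mirrors the '--log-level=INFO --verbose' example in the commit message: the verbose INFO line appears without the flood of verbose DEBUG output.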
Ian Barwick
617ea8cb78 Add log_hint() function for logging hints
There are a few places where additional hints are written as log
output, usually LOG_NOTICE. Create an explicit function to provide
hints in a standardized manner; by storing the log level of the
previous logger call, we can ensure the hint is only displayed when
the log message itself would be.

Part of an ongoing effort to better control repmgr's logging output.
2015-11-13 14:29:11 +09:00
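
The mechanism described in commit 617ea8cb78, remembering the level of the previous logger call so a hint is shown only when its parent message was, might look like this in Python. The class `HintLogger` is hypothetical; only the name `log_hint` and the LOG_NOTICE level come from the commit message, and the numeric values are the standard syslog ones.

```python
LOG_NOTICE, LOG_INFO, LOG_DEBUG = 5, 6, 7  # syslog-style: lower is more severe


class HintLogger:
    def __init__(self, threshold):
        self.threshold = threshold
        self.last_level = None  # level of the most recent log() call
        self.lines = []

    def log(self, level, message):
        self.last_level = level
        if level <= self.threshold:
            self.lines.append(message)

    def log_hint(self, message):
        # Show the hint only if the message it belongs to was shown.
        if self.last_level is not None and self.last_level <= self.threshold:
            self.lines.append("HINT: " + message)


log = HintLogger(threshold=LOG_NOTICE)
log.log(LOG_NOTICE, "unable to connect to upstream node")
log.log_hint("check that the upstream node is running")  # shown
log.log(LOG_DEBUG, "retrying connection")
log.log_hint("use --verbose for more detail")            # suppressed
```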
Ian Barwick
142517fcca Always use catalog path when calling system functions
Removes any risk of issues due to search path mangling etc.
2015-11-11 11:17:47 +09:00
Ian Barwick
abb02cab76 Improve configuration file parsing
Related to Github #127.

- use the previously introduced repmgr_atoi() function to parse
  integers better
- collate all detected errors and output as a list, rather than
  failing on the first error.
2015-11-09 14:56:35 +09:00
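
The "collate all detected errors" approach from commit abb02cab76 can be illustrated with a short Python sketch. Python's built-in `int()` stands in for `repmgr_atoi()`, whose exact behaviour is not shown in this log; the function `parse_int_options` and the sample option names and values are assumptions for illustration.

```python
def parse_int_options(raw, int_keys):
    """Parse integer options, collecting every error instead of
    failing on the first one."""
    values, errors = {}, []
    for key in int_keys:
        if key not in raw:
            continue
        try:
            values[key] = int(raw[key].strip())
        except ValueError:
            errors.append(f"{key}: '{raw[key]}' is not a valid integer")
    return values, errors


# Hypothetical configuration contents.
raw = {"reconnect_attempts": "six", "reconnect_interval": "10s", "priority": "100"}
values, errors = parse_int_options(
    raw, ["reconnect_attempts", "reconnect_interval", "priority"]
)
```

Reporting both bad values at once saves the user from fixing one error only to hit the next on the following run.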
Ian Barwick
8e66e4811c Rename variable 'reconnect_intvl' to 'reconnect_interval'
For consistency with the configuration file parameter name
2015-11-09 11:04:42 +09:00
Ian Barwick
b911483d5e Specify relevant node in error message 2015-10-28 16:10:08 +09:00
Ian Barwick
ee9270fe8d Terminate repmgrd if standby is no longer connected to upstream 2015-10-28 16:05:35 +09:00
Ian Barwick
ded716e403 Improve logging and event notifications when following new upstream node 2015-10-23 09:36:42 +09:00
Ian Barwick
d639dc3342 Add note about checking replication slots when following upstream node 2015-10-23 09:21:01 +09:00
Ian Barwick
17ed81ebb7 Improve log messages when following new primary 2015-10-23 09:17:20 +09:00
Ian Barwick
b00c507ee4 Minor formatting tweak 2015-10-23 08:51:26 +09:00
Ian Barwick
6e7eee4c01 Only log some debug items if verbose flag is set. 2015-10-23 08:29:35 +09:00
Ian Barwick
5c59e8fc5b Add missing space 2015-10-22 13:13:01 +09:00
Martín Marqués
120be2db1c Fix bug which prevents repmgrd from starting when the cluster name has
upper case letters.
2015-10-06 20:54:28 -03:00
Ian Barwick
e115825cd6 Fix comment capitalization 2015-09-30 14:58:43 +09:00
Ian Barwick
c3bd02b83d Standardize if-statement formatting
"if(" -> "if ("
2015-09-24 17:45:08 +09:00
Ian Barwick
8e7d110a22 Check for existing master record before deleting it
Otherwise repmgr implies it's deleting a record which isn't actually
there.
2015-09-24 17:39:39 +09:00
Tomas Vondra
ef6b24551a call update_node_record_set_upstream() for STANDBY FOLLOW
repmgrd correctly updates ID of the upstream node after automatic
failover, but repmgr was not doing that for manual failovers.

This moves the existing function to dbutils and modifies it so that
it does not rely on global variables with configuration (available
just in repmgrd).

This should fix issue #67 (hopefully, haven't done much testing).
2015-09-23 12:32:47 +09:00
Ian Barwick
30fd111cba Rework config file handling
If no configuration file provided, also check default Postgres
sysconfig dir.

It would also be useful to check the configuration directory
provided by the RPM/DEB packages, not sure if that's programmatically
feasible.
2015-09-21 15:55:29 +09:00
Ian Barwick
65e63b062e Generally tidy up help output 2015-09-21 11:49:06 +09:00
Ian Barwick
053f672caa Treat -?/--help and -V/--version as normal options
Currently repmgr/repmgrd will only accept these as valid when
provided as the first command line option, however it's possible
a user will want to get the output of those options by adding
them to the end of a previously inputted command.

Note that after the first of these options is encountered, the
program will terminate and not process any other options. This
is consistent with psql's behaviour.

Per GitHub issue #107 from Sébastien Gross.
2015-09-21 09:53:51 +09:00
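
The position-independent handling in commit 053f672caa amounts to acting on the first help or version option found anywhere on the command line. A simplified Python sketch follows; the function `first_info_option` is hypothetical, and it deliberately ignores details a real option parser handles, such as option arguments and `--`.

```python
def first_info_option(argv):
    """Return "help" or "version" for the first such option in argv,
    wherever it appears, or None if neither is present."""
    for arg in argv:
        if arg in ("-?", "--help"):
            return "help"
        if arg in ("-V", "--version"):
            return "version"
    return None


# The option is honoured even at the end of an existing command line;
# only the first one encountered takes effect.
result = first_info_option(["standby", "follow", "--version", "--help"])
```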
Ian Barwick
7345ddcf00 Whitespace tweak 2015-09-10 14:27:21 +09:00
Gianni Ciolli
462d446477 Bug #90 fix (autofailover with reconnect_attempts > 1).
The main change is that now check_connection requires a conninfo
parameter, and the connection object has type (PGconn **) so it can be
replaced by check_connection if needed.

The bug was caused by the fact that the first failure resulted in
*conn == NULL, so that subsequent checks of the upstream connection
were failing irrespective of the actual state of the upstream node.

Now, when *conn == NULL, check_connection will use conninfo to
establish a new connection and place it into *conn. We introduce a new
INTERNAL_ERROR code for the case when they are both NULL.

In passing, we also reworded a confusing error message, distinguishing
a timeout from the actual elapsed time.
2015-08-10 20:58:43 +02:00
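
The fix in commit 462d446477 can be sketched in Python, with a small holder object standing in for the C `PGconn **` parameter so that `check_connection` can replace the connection. Everything except the name `check_connection` is hypothetical: the `connect` and `is_alive` callables, the `ConnHolder` class, and the use of an exception where the C code returns its new INTERNAL_ERROR code.

```python
import time


class ConnHolder:
    """Mutable slot standing in for the C `PGconn **` parameter."""

    def __init__(self, conn=None):
        self.conn = conn


def check_connection(holder, conninfo, connect, is_alive, attempts, interval=0):
    """Return True once a working connection is in holder.conn."""
    if holder.conn is None and conninfo is None:
        # The commit introduces an INTERNAL_ERROR code for this case.
        raise ValueError("no connection and no conninfo")
    for attempt in range(attempts):
        if holder.conn is None:
            # The fix: re-establish the connection instead of failing forever.
            holder.conn = connect(conninfo)  # returns None on failure
        if holder.conn is not None and is_alive(holder.conn):
            return True
        holder.conn = None
        if attempt < attempts - 1:
            time.sleep(interval)
    return False


calls = {"n": 0}


def flaky_connect(conninfo):
    # Fails twice, then succeeds, to mimic an upstream node coming back.
    calls["n"] += 1
    return "conn" if calls["n"] >= 3 else None


holder = ConnHolder()
ok = check_connection(holder, "host=node1", flaky_connect, lambda c: True, attempts=3)
```

Before the fix, the first failure left the connection as NULL and every later attempt failed regardless of the upstream node's state; here the third attempt succeeds.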
Ian Barwick
1e5792f8df Remove unused function 2015-04-14 14:29:47 +09:00
Ian Barwick
a01fefa7d0 After standby promotion, ensure metadata is updated by repmgr
Previously this was handled by repmgrd but if a standby is promoted
directly this will leave the metadata in an incorrect state.
2015-04-14 13:39:48 +09:00
Ian Barwick
07d220cb00 Correct monitoring table column names
It would be more consistent to change the "primary" to "master"
but that would make the table incompatible with the v2.0 table.
2015-03-31 18:14:32 +09:00
Ian Barwick
4dfeffe087 Add constant NODE_NOT_FOUND
Which is what the magic number means in those contexts.
2015-03-31 14:35:16 +09:00
Ian Barwick
18544c82ca Prevent repmgrd from looping infinitely if node was not registered 2015-03-31 14:25:08 +09:00
Ian Barwick
0f86bdcd05 Fixes for event logging
We can't always assume a valid connection to the master
2015-03-31 14:15:29 +09:00
Ian Barwick
3e621f43d1 Use 100 as the default priority; 0 or less means node will never be promoted 2015-03-26 10:38:20 +09:00
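
The priority rule stated in commit 3e621f43d1 reduces to a one-line check. A Python sketch, with the hypothetical names `DEFAULT_PRIORITY` and `is_promotable`:

```python
DEFAULT_PRIORITY = 100  # default stated in the commit message


def is_promotable(priority=DEFAULT_PRIORITY):
    """A node with priority 0 or less is never promoted."""
    return priority > 0


default_ok = is_promotable()     # uses the default of 100
excluded = is_promotable(0)      # explicitly excluded from promotion
```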