965 Commits

Author SHA1 Message Date
Ian Barwick 7fda2a1bcf doc: fix typo in repmgr.conf.sample 2018-10-08 09:37:41 +09:00
Ian Barwick d26141b8ab Fix LWLockRelease() call in unset_bdr_failover_handler() 2018-10-08 09:37:31 +09:00
Ian Barwick 4a6b5fe913 Update control file checks for PostgreSQL 11 2018-09-27 14:08:39 +09:00
Ian Barwick a71e644255 repmgrd: document parameters which can be reloaded via SIGHUP
Also add a new subsection with details on reloading repmgrd configuration.
2018-09-27 10:44:34 +09:00
Ian Barwick 8646fd6004 doc: fix link in 4.1.1 release notes 2018-09-25 14:30:57 +09:00
Ian Barwick 3e1bb1a523 doc: minor fixes to "repmgr.conf.sample" 2018-09-25 10:54:54 +09:00
Ian Barwick f5e58fc062 doc: update "repmgr node rejoin" documentation
Clarify various points related to --force-rewind and pg_rewind usage.
2018-09-14 14:09:33 +09:00
Ian Barwick 6b95a96f3a repmgr: improve "cluster show" output
Only output full contents of connection error messages in --verbose mode,
otherwise it can spew a lot of text onto the screen.
2018-09-12 14:17:39 +09:00
Ian Barwick bd146ae9ac repmgrd: update local node id in shared memory after local node restart
Also ensure local node restarts are handled more elegantly, so we're not
surprised by a stale connection handle.

GitHub #502.
2018-09-12 14:17:35 +09:00
Ian Barwick c7f8e48d12 Bump version
4.1.2
2018-09-07 13:08:55 +09:00
Ian Barwick 322190516c doc: update link 2018-09-05 15:41:32 +09:00
Ian Barwick 31a49ff781 doc: update version v4.1.1 2018-09-04 12:33:44 +09:00
Ian Barwick a6f99b58dd doc: update 4.1.1 release notes 2018-09-04 12:33:10 +09:00
Ian Barwick 09b041433e doc: update 4.1.1 release notes 2018-09-04 09:46:59 +09:00
Ian Barwick 058c8168e1 repmgrd: fix syntax 2018-08-30 15:54:31 +09:00
Ian Barwick 0468e47ef3 repmgrd: improve reconnection handling
Previously, if the server being monitored was not available, repmgrd
would always close the existing connection handle and open a new one.

However, in some cases, e.g. a brief network outage, the existing
connection handle is still good and does not need to be reopened.

This could be particularly problematic if monitoring_history is on,
as this risks leaving orphan sessions on the primary which (given
a sufficiently unstable network) could lead to all available backends
being occupied.

Instead, during an outage we now use a new connection to verify
the server is accessible; if the old connection is still available
(e.g. following a short network interruption) we continue using that;
if not (e.g. the server was restarted), we use the new one.
2018-08-30 15:47:49 +09:00
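The handle selection described in the commit above can be sketched roughly as follows. This is an illustration only, not repmgrd's actual C code: the `Conn` class, its `ping()` method and the `pick_connection()` helper are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Conn:
    """Hypothetical stand-in for a database connection handle."""
    name: str
    alive: bool = True
    closed: bool = False

    def ping(self) -> bool:
        # A handle is usable only if the link survived and it was not closed.
        return self.alive and not self.closed

    def close(self) -> None:
        self.closed = True


def pick_connection(old: Conn,
                    open_new: Callable[[], Optional[Conn]]) -> Optional[Conn]:
    """Decide which handle to keep while recovering from an outage.

    open_new returns a fresh handle, or None if the server is unreachable.
    Returns None when the server is still unreachable (nothing is changed).
    """
    probe = open_new()
    if probe is None:
        return None      # still unreachable; the caller retries later
    if old.ping():
        probe.close()    # old handle survived, so discard the probe
        return old
    old.close()          # old handle is dead (e.g. the server restarted)
    return probe
```

Closing whichever handle is not kept is the point of the design: at most one session per monitored server stays open, which is what avoids the orphan sessions mentioned in the commit message.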
Ian Barwick 216326f316 doc: update release notes 2018-08-30 13:09:41 +09:00
Ian Barwick 3fb20ce774 repmgr: improve slot handling in "node rejoin"
On the rejoined node, if a replication slot for the new upstream exists
(which is typically the case after a failover), delete that slot.

Also emit a warning about any inactive replication slots which may need
to be cleaned up manually.

GitHub #499.
2018-08-30 11:57:44 +09:00
Ian Barwick e468ca859e repmgrd: improve monitoring statistics logging
Add more granular logging to help diagnose issues, and also keep track
of when the last monitoring statistics update was set and emit that
as DETAIL every time we emit a log status update.
2018-08-29 14:48:30 +09:00
Ian Barwick 623c84c022 Add additional query error logging
It's unlikely we'll get an error in these cases, but you never know.

Also, with queries which return a list of node records, it's necessary
to call _populate_node_records() even if the query fails, so a properly
initialised, albeit empty, list is returned to the caller.
2018-08-29 10:27:42 +09:00
Ian Barwick c2dded1d7b Log text of failed queries at log level ERROR
Previously query texts were always logged at log level DEBUG, but
that doesn't help much in a normal production environment when
trying to identify the cause of issues.

Also make various other minor improvements to query logging and
handling of database errors.

Implements GitHub #498.
2018-08-29 10:09:51 +09:00
Ian Barwick 457dbbd267 "standby switchover": improve replication connection check
Previously repmgr would first check that a replication connection can be made
from the demotion candidate to the promotion candidate, however it's
preferable to sanity-check the number of available walsenders first,
to provide a more useful error message.
2018-08-24 16:31:46 +09:00
Ian Barwick 5485c06bc1 doc: fix internal link 2018-08-24 09:43:18 +09:00
Cédric Villemain 00ae42eb07 Fix grep to find conninfo
It used to use \t*, but [[:space:]] should be better as it matches more kinds
of whitespace (the current pattern being broken in my case on RH7).
2018-08-24 09:20:51 +09:00
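The difference between the two patterns in the commit above can be illustrated with a small sketch. The commit does not show the actual grep expression or the file contents, so the sample lines and both patterns below are assumptions. Python's `re` module does not support POSIX bracket classes such as `[[:space:]]`, so `\s` stands in for it here.

```python
import re

# Hypothetical config lines; the real file contents are not shown in the commit.
tab_indented = "\tconninfo='host=node1'"
space_indented = "  conninfo='host=node1'"

# Assumed shape of the old pattern: only tabs may precede the key.
tabs_only = re.compile(r"^\t*conninfo")
# Assumed shape of the new pattern: any whitespace may precede the key.
any_whitespace = re.compile(r"^\s*conninfo")

samples = (tab_indented, space_indented)
tabs_only_hits = [bool(tabs_only.search(s)) for s in samples]
any_whitespace_hits = [bool(any_whitespace.search(s)) for s in samples]
```

Under these assumptions the tab-only pattern misses the space-indented line, while the whitespace pattern matches both.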
Ian Barwick 33525491ae doc: update package signing key link 2018-08-23 12:33:48 +09:00
Ian Barwick 8c84f7a214 doc: update source requirement links
Per report from Daymel Bonne.
2018-08-23 10:56:49 +09:00
Ian Barwick efe4bed88e doc: improve event notification documentation
- add undocumented events (per report from Daymel Bonne)
- split up list into sections for better overview
- where feasible, add cross-links
2018-08-23 10:22:05 +09:00
Ian Barwick 9ba8dcbac3 doc: clarify statement about BDR HA support 2018-08-23 09:36:58 +09:00
Ian Barwick a8996a5bfa doc: clarify when "standby follow" can be used.
The unqualified wording previously implied that any running server could
be rejoined with "standby follow", which is not the case with a
"split brain" primary.
2018-08-21 13:53:21 +09:00
Ian Barwick 4cbba98193 repmgr: add "cluster_cleanup" event
GitHub #492.
2018-08-20 16:48:08 +09:00
Ian Barwick 23e6b85de3 doc: document sources of old package versions 2018-08-20 14:16:48 +09:00
Ian Barwick d5ecb09f22 doc: add information about snapshot packages 2018-08-20 13:03:04 +09:00
Ian Barwick 719dd93676 doc: update release notes 2018-08-20 12:33:11 +09:00
Ian Barwick 5747f1d446 repmgrd: improve cascaded standby failover handling
In particular, improve handling of the case where the standby follow
command fails due to the primary not being available.

GitHub #480.
2018-08-16 17:14:05 +09:00
Ian Barwick 9313b43cb1 repmgrd: fix PQExpBuffer handling in upstream failover handler
Was sometimes leading to blank log lines.
2018-08-16 16:14:14 +09:00
Ian Barwick 5aeb1b0589 repmgrd: don't imply primary is in recovery if it's not available 2018-08-16 15:31:25 +09:00
Ian Barwick 6c93388848 repmgrd: fix "repmgrd_upstream_reconnect" event notification
Upstream node is not always the primary node.

Per report in GitHub #480.
2018-08-16 14:57:11 +09:00
Ian Barwick d4ad8ce20c "standby clone" - don't copy external config files in dry run mode
Avoid copying files during a --dry-run as it may introduce unexpected changes
on the target node. During an actual clone operation, any problems with
copying files will be detected early and the operation aborted before
the actual database cloning commences.

GitHub #491.
2018-08-16 14:03:39 +09:00
Ian Barwick bacab8d31c "standby promote": improve log messages
Make it clearer what repmgr is waiting for, and what to do if the
promotion appears to fail.
2018-08-16 11:52:18 +09:00
Ian Barwick 14856e3a4d repmgrd: ensure primary connection handle is refreshed after reconnect
In some circumstances, if monitoring history was in use, repmgrd was attempting
to fetch the primary's current LSN on a stale connection handle.
2018-08-15 16:57:21 +09:00
Ian Barwick ca9242badb repmgr: fix handling of slot creation error when cloning
If cloning from a node other than the intended upstream, and
replication slots are in use, once the cloning process is complete,
repmgr will attempt to connect to the intended upstream to create
the replication slot.

Previously it would abort with a connection error, but as this issue
is not fatal to the cloning process itself, and in some situations may
be intentional, it's better to log a warning and continue.

We should probably collate this (and any similar items needing
attention after the cloning operation) into a list output at the end,
otherwise the warning may get overlooked.
2018-08-15 15:11:13 +09:00
Ian Barwick ff0929e882 doc: update FAQ
Explain why some values in recovery.conf are surrounded by pairs of single
quotes.
2018-08-15 14:48:23 +09:00
Abhijit Menon-Sen 8cd1811edb Fix upstream node name in warning
This log_warning is supposed to reproduce the error in the block above,
but used the current node's name instead of the intended upstream node.
2018-08-14 10:10:50 +09:00
Ian Barwick bf15c0d40f doc: improve "repmgr cluster cleanup" documentation 2018-08-14 10:09:18 +09:00
Ian Barwick 9ae9d31165 repmgr: truncate version string if necessary
Some distributions may add extra information to PG_VERSION after
the actual version number (e.g. "10.4 (Debian 10.4-2.pgdg90+1)"), so
copy the version number string up until the first space is found.

GitHub #490.
2018-08-14 09:56:54 +09:00
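The truncation rule from the commit above (keep everything up to the first space) can be sketched as below. The function name `truncate_version` is hypothetical, and this is an illustration of the rule rather than repmgr's C implementation.

```python
def truncate_version(pg_version: str) -> str:
    """Return the text before the first space, or the whole string if none."""
    head, _sep, _tail = pg_version.partition(" ")
    return head


# Example string taken from the commit message.
debian_style = truncate_version("10.4 (Debian 10.4-2.pgdg90+1)")
```

A string without any space, such as a plain version number, passes through unchanged.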
Ian Barwick d5064bdc02 doc: clarify repmgrd FAQ item
"priority" must be 0 or greater.
2018-08-10 10:53:08 +09:00
Ian Barwick 9d0524a008 doc: update FAQ
Add note about why repmgrd refuses to start up if the upstream is
not running.
2018-08-10 10:47:23 +09:00
Ian Barwick 5398fd2d22 doc: better explain where pg_bindir won't be applied
Basically any setting which can contain a user-defined script
*must* have the full path set, even if it's repmgr being executed.

We could potentially apply some heuristics to detect if the first
item in the setting is "repmgr" (or more precisely repmgrd's program
name), but this will require some careful thought and testing
that it works as intended.
2018-08-10 10:29:06 +09:00
Ian Barwick 4c44c01380 doc: update release notes 2018-08-10 09:52:39 +09:00
Ian Barwick 5113ab0274 repmgrd: fix startup on witness node when local data is stale
Previously, when running on a witness server, repmgrd didn't consider
that the local cache of the "repmgr.nodes" table might be outdated (e.g.
because repmgrd wasn't running on the witness server during a failover),
so it could potentially end up monitoring a former primary now running
as a standby.

When running on a witness server, at startup repmgrd will now scan
all nodes to determine the current primary, and refresh its local
cache from there. This will also ensure it can start up even if the
node currently registered as primary in the local cache is not available.

Implements GitHub #488 and #489.
2018-08-09 16:42:20 +09:00
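The startup scan described in the commit above might look roughly like the following. Everything here is an assumption made for illustration: the `NodeStatus` record, its fields and `find_current_primary()` are hypothetical and do not reflect repmgrd's actual data structures.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class NodeStatus:
    """Hypothetical result of probing one node at startup."""
    node_id: int
    reachable: bool
    in_recovery: bool  # True for a standby, False for a primary


def find_current_primary(nodes: Iterable[NodeStatus]) -> Optional[int]:
    """Return the id of the first reachable node that is not in recovery."""
    for node in nodes:
        if node.reachable and not node.in_recovery:
            return node.node_id
    return None


# Node 1 is the primary according to the stale cache, but now runs as a standby.
probed = [
    NodeStatus(node_id=1, reachable=True, in_recovery=True),
    NodeStatus(node_id=2, reachable=True, in_recovery=False),
    NodeStatus(node_id=3, reachable=False, in_recovery=False),
]
current_primary = find_current_primary(probed)
```

Because the decision rests on live probes rather than on the cached role of each node, the scan still finds a primary when the node recorded as primary in the cache is unavailable or has changed role.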