Commit Graph

960 Commits

Author SHA1 Message Date
Ian Barwick
9439467958 doc: add troubleshooting section to switchover documentation 2018-09-25 13:47:58 +09:00
Ian Barwick
38e3aae053 repmgr: add parameter "shutdown_check_timeout"
Previously, "repmgr standby switchover" used the configuration file parameters
"reconnect_interval" and "reconnect_attempts" to define a timeout to determine
whether the current primary (demotion candidate) has shut down.

However, these parameters are intended for primary failure detection and are
generally lower in value, while a controlled shutdown may take longer, resulting
in the switchover being aborted as repmgr was not waiting long enough.

To prevent this happening, parameter "shutdown_check_timeout" has been added.
This complements the existing "standby_reconnect_timeout" parameter used
by "repmgr standby switchover".

Implements GitHub #504.
2018-09-25 11:34:06 +09:00
Ian Barwick
80bef0eb28 doc: minor fixes to "repmgr.conf.sample" 2018-09-25 10:53:24 +09:00
Ian Barwick
bea4b03cc2 doc: update "repmgr node rejoin" documentation
Clarify various points related to --force-rewind and pg_rewind usage.
2018-09-14 14:08:34 +09:00
Ian Barwick
97905b02ae repmgrd: fix comment 2018-09-13 10:15:22 +09:00
Ian Barwick
b0a2ee2259 get_all_node_records(): display any error encountered and return success status
In many cases we'll want to bail out with an error if the node list can't
be retrieved for any reason. This saves some repetitive coding.
2018-09-13 10:14:43 +09:00
Ian Barwick
bb4fdcda98 doc: update link 2018-09-12 14:17:14 +09:00
Ian Barwick
7b33faa09b repmgr: improve "cluster show" output
Only output full contents of connection error messages in --verbose mode,
otherwise it can spew a lot of text onto the screen.
2018-09-07 16:59:54 +09:00
Ian Barwick
5de2b1ee13 repmgrd: update local node id in shared memory after local node restart
Also ensure local node restarts are handled more elegantly, so we're not
surprised by a stale connection handle.

GitHub #502.
2018-09-07 11:59:53 +09:00
Ian Barwick
f184b1e68a doc: update 4.1.1 release notes 2018-09-04 12:35:46 +09:00
Ian Barwick
bd2f6db1e1 doc: update 4.1.1 release notes 2018-09-04 09:47:38 +09:00
Ian Barwick
1693ec0e90 repmgrd: fix syntax 2018-08-30 16:27:07 +09:00
Ian Barwick
17e75f6b31 repmgrd: improve reconnection handling
Previously, if the server being monitored was not available, repmgrd
would always close the existing connection handle and open a new one.

However, in some cases, e.g. a brief network outage, the existing
connection handle is still good and does not need to be reopened.

This could be particularly problematic if monitoring_history is on,
as this risks leaving orphan sessions on the primary which (given
a sufficiently unstable network) could lead to all available backends
being occupied.

Instead, during an outage we now use a new connection to verify
the server is accessible; if the old connection is still available
(e.g. following a short network interruption) we continue using that;
if  not (e.g. the server was restarted), we use the new one.
2018-08-30 15:46:08 +09:00
Ian Barwick
3b8586d82a doc: update release notes 2018-08-30 13:05:17 +09:00
Ian Barwick
6acec3e041 doc: fix internal link 2018-08-30 12:40:08 +09:00
Ian Barwick
1d830bf0e2 doc: update package signing key link 2018-08-30 12:40:05 +09:00
Ian Barwick
3f99ee8ede doc: update source requirement links
Per report from Daymel Bonne.
2018-08-30 12:40:02 +09:00
Ian Barwick
b5f640d04d doc: improve event notification documentation
- add undocumented events (per report from Daymel Bonne)
 - split up list into sections for better overview
 - where feasible, add cross-links
2018-08-30 12:39:58 +09:00
Ian Barwick
92a62a958e doc: clarify statement about BDR HA support 2018-08-30 12:39:54 +09:00
Ian Barwick
a4a956593c doc: clarify when "standby follow" can be used.
The unqualified wording previously implied that any running server could
be rejoined with "standby follow", which is not the case with a
"split brain" primary.
2018-08-30 12:39:51 +09:00
Ian Barwick
ceeb6d7130 repmgrd: improve monitoring statistics logging
Add more granular logging to help diagnose issues, and also keep track
of when the last monitoring statistics update was set and emit that
as DETAIL every time we emit a log status update.
2018-08-30 12:36:59 +09:00
Ian Barwick
9681708b1a repmgr: improve slot handling in "node rejoin"
On the rejoined node, if a replication slot for the new upstream exists
(which is typically the case after a failover), delete that slot.

Also emit a warning about any inactive replication slots which may need
to be cleaned up manually.

GitHub #499.
2018-08-30 12:24:13 +09:00
Ian Barwick
3573950425 Add additional query error logging
It's unlikely we'll get an error in these cases, but you never know.

Also, with queries which return a list of node records, it's necessary
to call _populate_node_records() even if the query fails, so a properly
initalised, albeit empty list is returned to the caller.
2018-08-29 10:25:43 +09:00
Ian Barwick
c1586e39b7 Log text of failed queries at log level ERROR
Previously query texts were always logged at log level DEBUG, but
that doesn't help much in a normal production environment when
trying to identify the cause of issues.

Also make various other minor improvements to query logging and
handling of database errors.

Implements GitHub #498.
2018-08-29 10:08:52 +09:00
Ian Barwick
7745844078 "standby switchover": improve replication connection check
Previously repmgr would first check that a replication can be made
from the demotion candidate to the promotion candidate, however it's
preferable to sanity-check the number of available walsenders first,
to provide a more useful error message.
2018-08-24 16:31:25 +09:00
Ian Barwick
e1e59e85d7 repmgr: add "cluster_cleanup" event
GitHub #492.
2018-08-24 09:20:05 +09:00
Cédric Villemain
6fc79470fc Fix grep to find conninfo
it used to use \t* but [[:space:]] should be better as it does match more kind
of spaces (the current one being broken in my case on RH7)
2018-08-23 18:33:55 +02:00
Ian Barwick
b7d576863d doc: update FAQ
Add note about why repmgrd refuses to start up if the upstream is
not running.
2018-08-20 15:33:55 +09:00
Ian Barwick
c1338df5e3 doc: clarify repmgrd FAQ item
"priority" must be 0 or greater.
2018-08-20 15:30:43 +09:00
Ian Barwick
221fb63e92 repmgrd: fix startup on witness node when local data is stale
Previously, when running on a witness server, repmgrd didn't consider
the local cache of the "repmgr.nodes" table might be outdated, e.g.
as repmgrd wasn't running on the witness server during a failover,
so could potentially end up monitoring a former primary now running
as a standby.

When running on a witness server, at startup repmgrd will now scan
all nodes to determine the current primary, and refresh its local
cache from there. This will also ensure it can start up even if the
node currently registered as primary in the local cache is not available.

Implements GitHub #488 and #489.
2018-08-20 15:29:29 +09:00
Ian Barwick
987823861f doc: document sources of old package versions 2018-08-20 15:25:00 +09:00
Ian Barwick
7a6eb6321b doc: add information about snapshot packages 2018-08-20 15:24:57 +09:00
Ian Barwick
f4df6696ba doc: update release notes 2018-08-20 15:24:43 +09:00
Ian Barwick
bc584d84f6 repmgrd: improve cascaded standby failover handling
In particular, improve handling of the case where the standby follow
command fails due to the primary not being available.

GitHub #480.
2018-08-20 15:23:54 +09:00
Ian Barwick
76f5bcf3cd repmgrd: fix PQExpBuffer handling in upstream failover handler
Was sometimes leading to blank log lines.
2018-08-20 15:23:50 +09:00
Ian Barwick
b1aab930af repmgrd: don't imply primary is in recovery if it's not available 2018-08-20 15:23:46 +09:00
Ian Barwick
58994365ff repmgrd: fix "repmgrd_upstream_reconnect" event notification
Upstream node is not always the primary node.

Per report in GitHub #480.
2018-08-20 15:23:42 +09:00
Ian Barwick
c3949b2aea "standby clone" - don't copy external config files in dry run mode
Avoid copying files during a --dry-run as it may introduce unexpected changes
on the target node. During an actual clone operation, any problems with
copying files will be detected early and the operation aborted before
the actual database cloning commences.

GitHub #491.
2018-08-20 15:23:37 +09:00
Ian Barwick
6ba49de44e "standby promote": improve log messages
Make it clearer what repmgr is waiting for, and what to do if the
promotion appears to fail.
2018-08-16 11:52:01 +09:00
Ian Barwick
b61f853a69 repmgrd: ensure primary connection handle is refreshed after reconnect
In some circumstances, if monitoring history was in use, repmgrd was attempting
to fetch the primary's current LSN on a stale connection handle.
2018-08-15 16:55:03 +09:00
Ian Barwick
f2bc898761 repmgr: fix handling of slot creation error when cloning
If cloning from another node other than the intended upstream, and
replication slots are in use, once the cloning process is complete,
repmgr will attempt to connect to the intended upstream to create
the replication slot.

Previously it would abort with a connection error, but as this issue
is not fatal to the cloning process itself, and in some situations may
be intentional, it's better to log a warning and continue.

We should probably collate this (and any similar items needing
attention after the cloning operation) into a list output at the end,
otherwise the warning may get overlooked.
2018-08-15 15:12:23 +09:00
Ian Barwick
7bcf87b8ed doc: update FAQ
Explain why some values in recovery.conf are surrounded by pairs of single
quotes.
2018-08-15 14:42:56 +09:00
Ian Barwick
6983547325 doc: improve "repmgr cluster cleanup" documentation 2018-08-14 10:09:52 +09:00
Ian Barwick
34c4f4c3f8 repmgr: truncate version string if necessary
Some distributions may add extra information to PG_VERSION after
the actual version number (e.g. "10.4 (Debian 10.4-2.pgdg90+1)"), so
copy the version number string up until the first space is found.

GitHub #490.
2018-08-14 09:55:23 +09:00
Ian Barwick
f8667c1aac doc: better explain where pg_bindir won't be applied
Basically any setting which can contain a user-defined script
*must* have the full path set, even if it's repmgr being executed.

We could potentially apply some heuristics to detect if the first
item in the setting is "repmgr" (or more precisely repmgrd's program
name), but this will require some careful thought and testing
that it works as intended.
2018-08-14 09:54:27 +09:00
Ian Barwick
08ab6290c1 Add dummy 4.2 extension SQL file 2018-08-14 09:54:27 +09:00
Abhijit Menon-Sen
97cafd8c54 Fix upstream node name in warning
This log_warning is supposed to reproduce the error in the block above,
but used the current node's name instead of the intended upstream node.
2018-08-12 09:15:13 +05:30
Ian Barwick
78b969f208 repmgrd: report version number *after* logger initialisation
This ensures the version number always makes it into the log destination.

Implements GitHub #487.
2018-08-08 15:44:06 +09:00
Ian Barwick
3f558416f3 doc: clarify witness server location 2018-08-07 13:10:30 +09:00
Ian Barwick
410fa5e54d Bump master branch to 4.2dev 2018-08-07 13:03:28 +09:00