Commit Graph

815 Commits

Author SHA1 Message Date
Ian Barwick
b624fc7efa Bump version
4.0.5
2018-05-01 09:21:32 +09:00
Ian Barwick
67ccd4dcb3 repmgrd: don't explicitly close connections on shutdown 2018-04-30 15:13:30 +09:00
Ian Barwick
6de3a5a997 Fix parsing of "archive_ready_critical" configuration file parameter.
Per report in GitHub #426.
2018-04-28 06:59:20 +09:00
Ian Barwick
f86e89ba45 repmgrd: notify sibling nodes to follow new primary after pg_ctl timeout
If "pg_ctl promote" fails due to a timeout, but the promotion itself succeeds,
have repmgrd on the new primary explicitly notify any sibling nodes to
follow it.

Previously the sibling nodes would wait "primary_notification_timeout" seconds
before attempting to discover the new primary.

This (and preceding commit eac80ae) address GitHub #425.
2018-04-27 11:59:00 +09:00
Ian Barwick
a6d0ba07ed repmgrd: handle pg_ctl timeout
It's possible "pg_ctl promote" will time out, causing "repmgr standby
follow" to return with an error; however the promotion itself will usually
succeed, so detect this case and handle accordingly.
2018-04-26 19:23:26 +09:00
Ian Barwick
b553a70ad5 repmgrd: always close the connection if the pointer is not NULL 2018-04-25 14:08:17 +09:00
Ian Barwick
3364f8bdf0 Add configuration file parameter "config_directory"
This enables explicit provision of an external configuration file
directory, which if set will be passed to "pg_ctl" as the -D
parameter. Otherwise "pg_ctl" will default to using the data directory,
which will cause some operations to fail if the configuration files
are not present there.

Note this is implemented primarily for feature completeness and for
development/testing purposes. Users who have installed "repmgr" from
a package should not rely on "pg_ctl" to stop/start/restart PostgreSQL;
instead they should set the appropriate "service_..._command" for their
operating system. For more details see:

    https://repmgr.org/docs/4.0/configuration-service-commands.html

Note: in a future release, the presence of "config_directory" in repmgr.conf
will be used to implicitly set "--copy-external-config-files=samepath" when
cloning a standby; this is a behaviour change so will be implemented in the
next major release (repmgr 4.1).

Implements GitHub #424.
2018-04-25 11:57:27 +09:00
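As a rough illustration of the "config_directory" behaviour described in commit 3364f8bdf0, the following Python sketch chooses which directory is passed to "pg_ctl" as the -D parameter. The helper name `pg_ctl_command` and the example paths are assumptions made for illustration only; repmgr itself implements this in C.

```python
def pg_ctl_command(action, data_directory, config_directory=None):
    """Build a pg_ctl argument list (hypothetical helper, not repmgr's code)."""
    # Prefer the external configuration directory when one is configured;
    # otherwise fall back to the data directory, as pg_ctl would by default.
    target = config_directory if config_directory else data_directory
    return ["pg_ctl", "-D", target, action]


if __name__ == "__main__":
    # Example paths are placeholders.
    print(pg_ctl_command("restart", "/var/lib/pgsql/data", "/etc/postgresql"))
```

Returning an argument list rather than a single string avoids shell quoting issues if the result is later passed to something like `subprocess.run`.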
Ian Barwick
242fa287b4 repmgrd: catch corner case in standby connection handle check
If repmgrd marks the local node as unavailable, and it was actually
restarting but a failover event occurred before the next local node
check, failover will continue with the stale connection handle.

Add a final local node check just before starting the failover
process, so repmgrd can reconnect if it wasn't able to before.
2018-04-24 21:55:36 +09:00
Ian Barwick
fa908432c8 Minor doc and log output tweaks 2018-04-24 21:08:31 +09:00
Ian Barwick
afa942fef6 repmgrd: prevent standby connection handle from going stale
If monitoring history is not in use, there's no activity on the standby's
connection handle, so if e.g. the standby is restarted, PQstatus()
never returns CONNECTION_BAD and repmgrd never notices the connection
is stale. Therefore execute a throw-away statement at "monitor_interval_secs".
2018-04-23 23:51:03 +09:00
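The idea behind commit afa942fef6 (an idle connection handle only reveals that it is stale once a statement is sent over it) can be sketched in Python with a stand-in connection object. `FakeConnection` and `is_connection_alive` are hypothetical names introduced for this sketch; repmgrd does the equivalent in C against a libpq connection.

```python
class FakeConnection:
    """Stand-in for a database connection (hypothetical, for illustration)."""

    def __init__(self):
        self.server_up = True

    def execute(self, sql):
        # A real driver only discovers a dead connection when it sends data.
        if not self.server_up:
            raise ConnectionError("server closed the connection unexpectedly")
        return [(1,)]


def is_connection_alive(conn):
    """Run a throw-away statement so a stale handle is actually detected."""
    try:
        conn.execute("SELECT 1")
    except ConnectionError:
        return False
    return True


if __name__ == "__main__":
    conn = FakeConnection()
    print(is_connection_alive(conn))
    conn.server_up = False  # simulate the standby being restarted
    print(is_connection_alive(conn))
```

Calling such a check once per monitoring interval keeps the handle's status meaningful even when no monitoring history is being written.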
Ian Barwick
94cfc66b04 doc: minor clarification 2018-04-20 12:23:04 +09:00
Ian Barwick
87eae9a50f doc: additional details about repmgrd usage in Debian/Ubuntu 2018-04-20 12:04:15 +09:00
Ian Barwick
82a37f4865 doc: add Debian package details 2018-04-20 10:57:19 +09:00
Ian Barwick
a38f727b7d doc: Improve CentOS package-related documentation 2018-04-20 10:31:42 +09:00
Ian Barwick
e6df936c1b doc: link to service command configuration from switchover section 2018-04-19 17:09:10 +09:00
Ian Barwick
91ca997d40 doc: improve configuration documentation
With special attention to setting service commands, and extra special
mention of "pg_ctlcluster" for Debian/Ubuntu users.
2018-04-19 16:49:26 +09:00
Ian Barwick
65c90a2a64 doc: update CentOS package documentation 2018-04-19 14:27:17 +09:00
Ian Barwick
90cba78f52 repmgrd: tweak event notifications on standby failure
The event notification was only being created if there was a valid
primary connection; it should be created in any case, so an event
notification script can be executed.
2018-04-17 10:27:25 +09:00
Ian Barwick
f8908d7e31 Bump version
4.0.5dev
2018-04-13 10:18:04 +09:00
Ian Barwick
478bbcccbf Add "dbname=replication" to all replication connection strings
Previously repmgr was attempting to make replication connections
with "dbname" set to the repmgr database name. While this works
if e.g. the repmgr user also has replication permissions, it will
fail if a dedicated replication user is specified, who only has
permission to access the virtual "replication" database.

Change this to use "dbname=replication" if the replication connection
user is different to the normal repmgr database user.

(We could just always set it to "replication", but that might break
existing installations e.g. where a .pgpass file is in use and there's
no "replication" entry for the normal repmgr database user).

Addresses GitHub #421.
2018-04-12 16:10:02 +09:00
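The rule described in commit 478bbcccbf can be sketched as follows in Python, treating connection parameters as a plain dictionary. The function name `replication_conn_params` and the example user names are assumptions for illustration; this is not repmgr's actual C code.

```python
def replication_conn_params(base_params, replication_user=None):
    """Derive replication connection parameters (hypothetical helper)."""
    params = dict(base_params)  # never modify the caller's parameters
    repmgr_user = base_params.get("user")
    # Only switch to the virtual "replication" database when a dedicated
    # replication user is configured; otherwise keep the existing dbname so
    # that e.g. existing .pgpass entries continue to match.
    if replication_user and replication_user != repmgr_user:
        params["user"] = replication_user
        params["dbname"] = "replication"
    return params


if __name__ == "__main__":
    base = {"host": "node1", "user": "repmgr", "dbname": "repmgr"}
    print(replication_conn_params(base, replication_user="repl"))
```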
Ian Barwick
a03d41de28 doc: mention --recovery-conf-only introduced in repmgr 4.0.4
Per GitHub #419.
2018-04-12 13:13:11 +09:00
Ian Barwick
f1e527adcb doc: various updates related to "standby clone" operations. 2018-04-12 13:08:05 +09:00
Ian Barwick
09e597dcdd Fix superuser password handling
When establishing a superuser connection, the connection parameters
were being copied from the existing (non-superuser) connection, which
in some circumstances can lead to that user's password being
included in the copied parameter list. The password parameter, if set, will
now always be removed, which will cause libpq to retrieve the correct
one from the .pgpass file.

Addresses GitHub #400.
2018-04-12 12:50:17 +09:00
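A minimal Python sketch of the fix described in commit 09e597dcdd, again modelling connection parameters as a dictionary. The name `superuser_conn_params` and the sample values are hypothetical; the point is only that the copied password is dropped so the client library can look up the right one (for example from .pgpass).

```python
def superuser_conn_params(existing_params, superuser):
    """Copy connection parameters for a superuser connection (hypothetical)."""
    params = dict(existing_params)
    params["user"] = superuser
    # The copied password belongs to the original (non-superuser) user, so
    # remove it rather than send the wrong credentials.
    params.pop("password", None)
    return params


if __name__ == "__main__":
    existing = {"host": "node1", "user": "repmgr", "password": "secret"}
    print(superuser_conn_params(existing, "postgres"))
```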
Ian Barwick
94a7f0c719 Don't issue a CHECKPOINT after promoting a standby.
Issuing a CHECKPOINT immediately after promoting a standby may impact
performance. Commit 239a548e9d ensures
one is only issued when required, i.e. during a switchover when
pg_rewind will be executed.

This reverts commit a2068768ab.
2018-04-09 14:39:47 +09:00
Ian Barwick
6ac42f1593 "standby register": add sanity check when --upstream-node-id not supplied
If --upstream-node-id was not supplied to "repmgr standby register",
repmgr defaults to the primary node as upstream node. If the local node is
available, we now double-check that it's attached to the primary,
in case the lack of --upstream-node-id was an accidental omission.

This check is only made when the local node is available.

This behaviour can be overridden with -F/--force (though it's hard to
imagine a scenario where that would be useful).

Addresses GitHub #395.
2018-04-05 17:40:05 +09:00
Ian Barwick
94b72382e5 doc: minor FAQ tweaks 2018-04-05 17:10:52 +09:00
Ian Barwick
18c12f58a4 doc: add a section about repmgrd and service commands etc. 2018-04-05 11:47:35 +09:00
Ian Barwick
cf3fa18085 doc: miscellaneous FAQ updates
- clarify pg_rewind item
- add note about what's included in recovery.conf
2018-04-04 10:08:04 +09:00
Ian Barwick
a5281d93dc Add TODO for pg_rewind changes coming in PostgreSQL 11 2018-04-03 21:57:50 +09:00
Ian Barwick
0d73d3c2b5 Enable provision of "archive_cleanup_command" in recovery.conf
If "archive_cleanup_command" is defined in "repmgr.conf", a corresponding
entry will be made in the node's "recovery.conf" file after cloning a
standby.

Note that we recommend using PgBarman to manage WAL archives, but are
providing this facility to help repmgr to be integrated in existing environments.

Implements GitHub #416.
2018-04-03 14:11:24 +09:00
Ian Barwick
23c99304a6 "node rejoin": actively check for node to rejoin cluster
Previously repmgr was relying on whatever command was configured to
start PostgreSQL to determine whether the node being rejoined had
started correctly. However it's preferable to actively poll the upstream
to confirm it has restarted and actually attached as a standby before
confirming success of the "node rejoin" action.

This can be overridden with the -W/--no-wait option.

(Note that for consistency with other PostgreSQL utilities, the
short form of the --wait option is now "-w"; this is currently
only used in "repmgr standby follow".)

Also update "repmgr node rejoin" documentation with a list of supported
options, and add some useful index entries for "pg_rewind".

Implements GitHub #415.
2018-04-03 10:36:13 +09:00
Ian Barwick
1ab16bc6c2 doc: fix option description for "repmgr primary register" 2018-04-03 10:10:05 +09:00
Ian Barwick
7f1f04636d Refactor pg_control parsing
The "data_checksum_version" field is located towards the end of the ControlFileData struct,
meaning its position varies between versions. Previously this wasn't a problem
as it was only required for operations involving 9.5 and later, and its position
within the control file has not changed between the current release and current
HEAD.

However, in order to support pg_rewind in 9.3 and 9.4, which both have changes in
the control file format, we'll need version-specific parsing. This will also make
it easier to deal with any future changes to the control file format.
2018-04-02 20:55:10 +09:00
Ian Barwick
6a1797cadd Enable pg_rewind to be used with PostgreSQL 9.3/9.4
pg_rewind is not part of the core distribution for those, but we
provided support in repmgr 3.3 so should extend it to repmgr 4.

Note that there is no check in place whether the pg_rewind binary
exists, so it's up to the user to ensure it's present.

Addresses GitHub #413.
2018-04-02 20:55:04 +09:00
Ian Barwick
94d26dbe9f Always set "connect_timeout" when pinging a PostgreSQL instance
Insert "connect_timeout=2" into the connection parameters, if not
explicitly set by the user. This will prevent excessive wait time
for the host operating system to report a connection timeout.
2018-04-02 09:31:42 +09:00
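The default described in commit 94d26dbe9f amounts to "set it only if the user has not". A short Python sketch, with the hypothetical helper name `with_default_connect_timeout` and connection parameters modelled as a dictionary of strings:

```python
def with_default_connect_timeout(params, default_secs=2):
    """Return a copy of params with connect_timeout filled in if absent."""
    result = dict(params)
    # setdefault leaves an explicitly configured value untouched.
    result.setdefault("connect_timeout", str(default_secs))
    return result


if __name__ == "__main__":
    print(with_default_connect_timeout({"host": "node1"}))
    print(with_default_connect_timeout({"host": "node1", "connect_timeout": "10"}))
```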
Ian Barwick
ae655eb4fd Add TODO list
This file will collate various requests and ideas for future development.
In particular it will reference requests which come in via the GitHub issue
tracker, so we can acknowledge and close off the request and not have an
open unresolved issue hanging around.
2018-03-30 14:18:51 +09:00
Ian Barwick
65371489c6 repmgrd: handle failover with two nodes in the primary location
If two nodes were in the primary location, and at least one node in
another location, the non-failed node in the primary location was not
recognising itself as a promotion candidate.

Addresses GitHub #407.
2018-03-30 12:17:34 +09:00
Ian Barwick
28c7737dc0 Log pg_control access errors as WARNINGs rather than DEBUG
This will make it easier to diagnose issues, possibly with an incorrect
"data_directory" setting in "repmgr.conf".
2018-03-30 11:24:44 +09:00
Ian Barwick
505d72d19c "standby switchover": force checkpoint if pg_rewind requested.
Addresses issue described in GitHub #378.

PostgreSQL itself doesn't issue a checkpoint after promotion to ensure
the newly promoted server is available as quickly as possible, so we'll
only execute an explicit CHECKPOINT when it's actually required, i.e.
when pg_rewind will be executed. This is required as pg_rewind uses
the timeline reported in the pg_control file to compare with the
server to be rewound, and the pg_control timeline is only updated after
the first checkpoint, so there is an interval where pg_rewind will
erroneously assume both servers are on the same timeline and take no action.
2018-03-30 09:12:25 +09:00
Ian Barwick
b292ac61f8 "standby switchover": update hint 2018-03-30 09:12:21 +09:00
Ian Barwick
293d66bf71 Fix minimum accepted value for "degraded_monitoring_timeout"
Should be -1, the default.

Addresses GitHub #411.
2018-03-30 09:12:17 +09:00
Ian Barwick
3e1f0ec168 repmgr: move demoted primary check to the final step during switchover
This will give the demoted primary more time to start up as a standby,
during which "standby follow" can be executed on sibling nodes, if
specified.
2018-03-27 16:41:13 +09:00
Ian Barwick
6f9a1f975e repmgr: poll demoted primary after restart during switchover
During a switchover operation, once the demoted primary has been restarted
as a standby, repmgr attempts to reconnect to verify its status and drop
any redundant replication slots. However it's possible the standby may still
be in the startup phase, so poll for "standby_reconnect_timeout" seconds
before giving up.

Addresses GitHub #408.
2018-03-27 15:58:18 +09:00
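The polling approach described in commit 6f9a1f975e can be sketched in Python as a simple retry loop. The function name `poll_until_available`, the one-second interval and the injectable `sleep` argument are assumptions made for this sketch; only the "standby_reconnect_timeout" setting comes from the commit message.

```python
import time


def poll_until_available(check, timeout_secs, sleep=time.sleep):
    """Call check() about once per second until it succeeds or time runs out."""
    for _ in range(timeout_secs):
        if check():
            return True
        sleep(1)  # the node may still be in its startup phase
    return False


if __name__ == "__main__":
    # Simulate a standby that becomes reachable on the third attempt.
    attempts = iter([False, False, True])
    print(poll_until_available(lambda: next(attempts), timeout_secs=5,
                               sleep=lambda secs: None))
```

Passing `sleep` as a parameter keeps the loop testable without real delays; a caller would pass the configured "standby_reconnect_timeout" value as `timeout_secs`.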
Ian Barwick
deea4f69f7 Fix "repmgr cluster crosscheck" output
Addresses GitHub #398.
2018-03-27 10:28:27 +09:00
Ian Barwick
37e53108a2 Consolidate connection closure calls 2018-03-27 08:52:23 +09:00
Ian Barwick
96cf06204c doc: add note about remote command execution
When executing a command on a remote server, repmgr expects the remote binary
to be in the same location as the local binary. It's reasonable to assume
repmgr will be deployed in a unified environment; if not, the onus is on the
user to ensure repmgr can find the remote binary, e.g. by creating appropriate
symlinks.

Addresses query in GitHub #406.
2018-03-27 08:47:56 +09:00
Ian Barwick
381e22c2c7 Misc tweaks to witness code 2018-03-26 20:59:38 +09:00
Ian Barwick
7e2af17783 repmgrd: tweak log notices when marking a standby as failed
Announce what we're going to do (set the node record inactive) *before*
performing the action. Makes reading the log slightly easier.
2018-03-23 13:27:37 +08:00
Ian Barwick
b4272853e7 Add event "repmgrd_failover_aborted" 2018-03-23 10:44:00 +08:00
Ian Barwick
562b6ddfc2 Add error code ERR_FOLLOW_FAIL 2018-03-23 10:34:19 +08:00