Compare commits

452 Commits

Author SHA1 Message Date
Ian Barwick
222f7e6080 doc: add a link to the current documentation from the contents page 2019-04-03 10:47:19 +09:00
Ian Barwick
446695e328 doc: fix typos 2018-10-23 09:22:11 +09:00
Ian Barwick
ec3da13e22 doc: fix typo
Per user report on mailing list.
2018-10-23 09:00:46 +09:00
Ian Barwick
1488c014ff Changes for a 4.1.2 snapshot release 2018-10-16 13:24:48 +09:00
Ian Barwick
f471316504 repmgrd: improve promotion script failure handling
While scanning for a new primary following a promotion script failure,
repmgrd was treating a witness server as a potential new primary
and would attempt to "follow" it. Fortunately "repmgr standby follow"
would do the right thing and choose the actual primary, if available,
otherwise do nothing, so the cluster would eventually end up in the
correct state, albeit for the wrong reason.

By skipping the witness server as a potential new primary,
repmgrd will do the right thing if the original primary does come
back online, i.e. resume monitoring as before.
2018-10-16 11:39:54 +09:00
Gilles Pietri
726299f7ef Missing comma in sudoers example 2018-10-11 09:59:15 +09:00
Ian Barwick
7fda2a1bcf doc: fix typo in repmgr.conf.sample 2018-10-08 09:37:41 +09:00
Ian Barwick
d26141b8ab Fix LWLockRelease() call in unset_bdr_failover_handler() 2018-10-08 09:37:31 +09:00
Ian Barwick
4a6b5fe913 Update control file checks for PostgreSQL 11 2018-09-27 14:08:39 +09:00
Ian Barwick
a71e644255 repmgrd: document parameters which can be reloaded via SIGHUP
Also add a new subsection with details on reloading repmgrd configuration.
2018-09-27 10:44:34 +09:00
Ian Barwick
8646fd6004 doc: fix link in 4.1.1 release notes 2018-09-25 14:30:57 +09:00
Ian Barwick
3e1bb1a523 doc: minor fixes to "repmgr.conf.sample" 2018-09-25 10:54:54 +09:00
Ian Barwick
f5e58fc062 doc: update "repmgr node rejoin" documentation
Clarify various points related to --force-rewind and pg_rewind usage.
2018-09-14 14:09:33 +09:00
Ian Barwick
6b95a96f3a repmgr: improve "cluster show" output
Only output full contents of connection error messages in --verbose mode,
otherwise it can spew a lot of text onto the screen.
2018-09-12 14:17:39 +09:00
Ian Barwick
bd146ae9ac repmgrd: update local node id in shared memory after local node restart
Also ensure local node restarts are handled more elegantly, so we're not
surprised by a stale connection handle.

GitHub #502.
2018-09-12 14:17:35 +09:00
Ian Barwick
c7f8e48d12 Bump version
4.1.2
2018-09-07 13:08:55 +09:00
Ian Barwick
322190516c doc: update link 2018-09-05 15:41:32 +09:00
Ian Barwick
31a49ff781 doc: update version 2018-09-04 12:33:44 +09:00
Ian Barwick
a6f99b58dd doc: update 4.1.1 release notes 2018-09-04 12:33:10 +09:00
Ian Barwick
09b041433e doc: update 4.1.1 release notes 2018-09-04 09:46:59 +09:00
Ian Barwick
058c8168e1 repmgrd: fix syntax 2018-08-30 15:54:31 +09:00
Ian Barwick
0468e47ef3 repmgrd: improve reconnection handling
Previously, if the server being monitored was not available, repmgrd
would always close the existing connection handle and open a new one.

However, in some cases, e.g. a brief network outage, the existing
connection handle is still good and does not need to be reopened.

This could be particularly problematic if monitoring_history is on,
as this risks leaving orphan sessions on the primary which (given
a sufficiently unstable network) could lead to all available backends
being occupied.

Instead, during an outage we now use a new connection to verify
the server is accessible; if the old connection is still available
(e.g. following a short network interruption) we continue using that;
if not (e.g. the server was restarted), we use the new one.
2018-08-30 15:47:49 +09:00
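
The decision described in commit 0468e47ef3 can be illustrated with a minimal C sketch. It is not repmgrd's actual code: the types ConnState and ReconnectAction and the function decide_reconnect() are hypothetical names, and the real connection checks are reduced to two status values passed in by the caller.

```c
/* Hypothetical status of a connection handle after probing it */
typedef enum
{
    CONN_OK,
    CONN_BAD
} ConnState;

/* Hypothetical outcome of one reconnection attempt during an outage */
typedef enum
{
    RECONNECT_STILL_DOWN,   /* server not reachable yet, keep waiting */
    RECONNECT_KEEP_OLD,     /* existing handle is still usable */
    RECONNECT_USE_NEW       /* existing handle is dead, switch to the new one */
} ReconnectAction;

/*
 * "new_conn" is a freshly opened connection used only to verify that the
 * server is accessible; "old_conn" is the handle in use before the outage.
 */
ReconnectAction
decide_reconnect(ConnState new_conn, ConnState old_conn)
{
    /* The server cannot be reached at all: nothing to decide yet */
    if (new_conn != CONN_OK)
        return RECONNECT_STILL_DOWN;

    /* E.g. a short network interruption: continue with the old handle */
    if (old_conn == CONN_OK)
        return RECONNECT_KEEP_OLD;

    /* E.g. the server was restarted: the old handle is no longer valid */
    return RECONNECT_USE_NEW;
}
```

Keeping the old handle where possible is what avoids piling up orphan sessions on the primary, since a new session is only retained when the old one is known to be gone.
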
Ian Barwick
216326f316 doc: update release notes 2018-08-30 13:09:41 +09:00
Ian Barwick
3fb20ce774 repmgr: improve slot handling in "node rejoin"
On the rejoined node, if a replication slot for the new upstream exists
(which is typically the case after a failover), delete that slot.

Also emit a warning about any inactive replication slots which may need
to be cleaned up manually.

GitHub #499.
2018-08-30 11:57:44 +09:00
Ian Barwick
e468ca859e repmgrd: improve monitoring statistics logging
Add more granular logging to help diagnose issues, and also keep track
of when the last monitoring statistics update was set and emit that
as DETAIL every time we emit a log status update.
2018-08-29 14:48:30 +09:00
Ian Barwick
623c84c022 Add additional query error logging
It's unlikely we'll get an error in these cases, but you never know.

Also, with queries which return a list of node records, it's necessary
to call _populate_node_records() even if the query fails, so a properly
initialised, albeit empty list is returned to the caller.
2018-08-29 10:27:42 +09:00
Ian Barwick
c2dded1d7b Log text of failed queries at log level ERROR
Previously query texts were always logged at log level DEBUG, but
that doesn't help much in a normal production environment when
trying to identify the cause of issues.

Also make various other minor improvements to query logging and
handling of database errors.

Implements GitHub #498.
2018-08-29 10:09:51 +09:00
Ian Barwick
457dbbd267 "standby switchover": improve replication connection check
Previously repmgr would first check that a replication connection can be made
from the demotion candidate to the promotion candidate, however it's
preferable to sanity-check the number of available walsenders first,
to provide a more useful error message.
2018-08-24 16:31:46 +09:00
Ian Barwick
5485c06bc1 doc: fix internal link 2018-08-24 09:43:18 +09:00
Cédric Villemain
00ae42eb07 Fix grep to find conninfo
It previously used \t*, but [[:space:]] should be better as it matches more kinds
of whitespace (the previous pattern was broken in my case on RH7).
2018-08-24 09:20:51 +09:00
Ian Barwick
33525491ae doc: update package signing key link 2018-08-23 12:33:48 +09:00
Ian Barwick
8c84f7a214 doc: update source requirement links
Per report from Daymel Bonne.
2018-08-23 10:56:49 +09:00
Ian Barwick
efe4bed88e doc: improve event notification documentation
 - add undocumented events (per report from Daymel Bonne)
 - split up list into sections for better overview
 - where feasible, add cross-links
2018-08-23 10:22:05 +09:00
Ian Barwick
9ba8dcbac3 doc: clarify statement about BDR HA support 2018-08-23 09:36:58 +09:00
Ian Barwick
a8996a5bfa doc: clarify when "standby follow" can be used.
The unqualified wording previously implied that any running server could
be rejoined with "standby follow", which is not the case with a
"split brain" primary.
2018-08-21 13:53:21 +09:00
Ian Barwick
4cbba98193 repmgr: add "cluster_cleanup" event
GitHub #492.
2018-08-20 16:48:08 +09:00
Ian Barwick
23e6b85de3 doc: document sources of old package versions 2018-08-20 14:16:48 +09:00
Ian Barwick
d5ecb09f22 doc: add information about snapshot packages 2018-08-20 13:03:04 +09:00
Ian Barwick
719dd93676 doc: update release notes 2018-08-20 12:33:11 +09:00
Ian Barwick
5747f1d446 repmgrd: improve cascaded standby failover handling
In particular, improve handling of the case where the standby follow
command fails due to the primary not being available.

GitHub #480.
2018-08-16 17:14:05 +09:00
Ian Barwick
9313b43cb1 repmgrd: fix PQExpBuffer handling in upstream failover handler
Was sometimes leading to blank log lines.
2018-08-16 16:14:14 +09:00
Ian Barwick
5aeb1b0589 repmgrd: don't imply primary is in recovery if it's not available 2018-08-16 15:31:25 +09:00
Ian Barwick
6c93388848 repmgrd: fix "repmgrd_upstream_reconnect" event notification
Upstream node is not always the primary node.

Per report in GitHub #480.
2018-08-16 14:57:11 +09:00
Ian Barwick
d4ad8ce20c "standby clone" - don't copy external config files in dry run mode
Avoid copying files during a --dry-run as it may introduce unexpected changes
on the target node. During an actual clone operation, any problems with
copying files will be detected early and the operation aborted before
the actual database cloning commences.

GitHub #491.
2018-08-16 14:03:39 +09:00
Ian Barwick
bacab8d31c "standby promote": improve log messages
Make it clearer what repmgr is waiting for, and what to do if the
promotion appears to fail.
2018-08-16 11:52:18 +09:00
Ian Barwick
14856e3a4d repmgrd: ensure primary connection handle is refreshed after reconnect
In some circumstances, if monitoring history was in use, repmgrd was attempting
to fetch the primary's current LSN on a stale connection handle.
2018-08-15 16:57:21 +09:00
Ian Barwick
ca9242badb repmgr: fix handling of slot creation error when cloning
If cloning from a node other than the intended upstream, and
replication slots are in use, once the cloning process is complete,
repmgr will attempt to connect to the intended upstream to create
the replication slot.

Previously it would abort with a connection error, but as this issue
is not fatal to the cloning process itself, and in some situations may
be intentional, it's better to log a warning and continue.

We should probably collate this (and any similar items needing
attention after the cloning operation) into a list output at the end,
otherwise the warning may get overlooked.
2018-08-15 15:11:13 +09:00
Ian Barwick
ff0929e882 doc: update FAQ
Explain why some values in recovery.conf are surrounded by pairs of single
quotes.
2018-08-15 14:48:23 +09:00
Abhijit Menon-Sen
8cd1811edb Fix upstream node name in warning
This log_warning is supposed to reproduce the error in the block above,
but used the current node's name instead of the intended upstream node.
2018-08-14 10:10:50 +09:00
Ian Barwick
bf15c0d40f doc: improve "repmgr cluster cleanup" documentation 2018-08-14 10:09:18 +09:00
Ian Barwick
9ae9d31165 repmgr: truncate version string if necessary
Some distributions may add extra information to PG_VERSION after
the actual version number (e.g. "10.4 (Debian 10.4-2.pgdg90+1)"), so
copy the version number string up until the first space is found.

GitHub #490.
2018-08-14 09:56:54 +09:00
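
The truncation described in commit 9ae9d31165 could look roughly like the following C sketch. The helper name copy_version_number() and its signature are assumptions made for illustration; repmgr's own implementation may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the leading version number from "src" into "dst", stopping at the
 * first space, or earlier if "dst" would overflow. The result is always
 * NUL-terminated. Hypothetical helper, not taken from the repmgr source.
 */
void
copy_version_number(char *dst, size_t dst_size, const char *src)
{
    size_t len;

    assert(dst != NULL && src != NULL && dst_size > 0);

    /* strcspn() returns the length of the initial run containing no space */
    len = strcspn(src, " ");

    /* Leave room for the terminating NUL byte */
    if (len >= dst_size)
        len = dst_size - 1;

    memcpy(dst, src, len);
    dst[len] = '\0';
}
```

Given the example string from the commit message, "10.4 (Debian 10.4-2.pgdg90+1)", this copies only "10.4"; a string without any space is copied unchanged.
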
Ian Barwick
d5064bdc02 doc: clarify repmgrd FAQ item
"priority" must be 0 or greater.
2018-08-10 10:53:08 +09:00
Ian Barwick
9d0524a008 doc: update FAQ
Add note about why repmgrd refuses to start up if the upstream is
not running.
2018-08-10 10:47:23 +09:00
Ian Barwick
5398fd2d22 doc: better explain where pg_bindir won't be applied
Basically any setting which can contain a user-defined script
*must* have the full path set, even if it's repmgr being executed.

We could potentially apply some heuristics to detect if the first
item in the setting is "repmgr" (or more precisely repmgrd's program
name), but this will require some careful thought and testing
that it works as intended.
2018-08-10 10:29:06 +09:00
Ian Barwick
4c44c01380 doc: update release notes 2018-08-10 09:52:39 +09:00
Ian Barwick
5113ab0274 repmgrd: fix startup on witness node when local data is stale
Previously, when running on a witness server, repmgrd didn't consider
the local cache of the "repmgr.nodes" table might be outdated, e.g.
as repmgrd wasn't running on the witness server during a failover,
so could potentially end up monitoring a former primary now running
as a standby.

When running on a witness server, at startup repmgrd will now scan
all nodes to determine the current primary, and refresh its local
cache from there. This will also ensure it can start up even if the
node currently registered as primary in the local cache is not available.

Implements GitHub #488 and #489.
2018-08-09 16:42:20 +09:00
Ian Barwick
25f68bb283 repmgrd: report version number *after* logger initialisation
This ensures the version number always makes it into the log destination.

Implements GitHub #487.
2018-08-08 15:45:48 +09:00
Ian Barwick
730f67258c Bump version
4.1.1
2018-08-07 15:22:11 +09:00
Ian Barwick
ca0e4de1ee doc: clarify witness server location 2018-08-07 13:11:27 +09:00
Ian Barwick
2fb0f056fe repmgrd: fix configuration file reloading
Don't allow "promote_command" or "follow_command" to be empty.

GitHub #486.
2018-08-02 16:35:36 +09:00
Ian Barwick
3a789d53e0 repmgrd: always reopen log file after receiving SIGHUP
For whatever reason, since at least repmgr 2.0 the log file was only
ever reopened if a configuration file change took place.

GitHub #485.
2018-08-02 10:51:18 +09:00
Ian Barwick
fb67b2cd4f doc: fix typo 2018-08-01 16:37:01 +09:00
Ian Barwick
9f07804b6a doc: update repmgrd log rotation configuration
In the sample logrotate configuration file, use "copytruncate" rather than "create",
as repmgrd currently doesn't reopen the log file (unless the configuration changes).

Per suggestion in GitHub #465.
2018-08-01 16:33:22 +09:00
Ian Barwick
d5b2fa2309 doc: update 2ndQuadrant repository locations in packaging appendix 2018-08-01 15:57:45 +09:00
Ian Barwick
d696c4019e repmgrd: consolidate SIGHUP handling
Move identical code blocks into single function.
2018-08-01 11:53:57 +09:00
Ian Barwick
e6ffbcc67a doc: add note about new repository structure to 4.1.0 release notes 2018-08-01 11:47:27 +09:00
Ian Barwick
e1410831e0 doc: update 4.1.0 release notes 2018-08-01 11:38:08 +09:00
Ian Barwick
cb4f6f6e3f doc: add release date for 4.1.0 2018-07-31 10:58:06 +09:00
Ian Barwick
75e5d79654 doc: update Debian installation instructions
2ndQuadrant repository structure has changed.
2018-07-31 10:53:04 +09:00
Ian Barwick
55fbe12971 doc: update RPM installation instructions
2ndQuadrant repository structure has changed.

Also remove reference to the old, very deprecated original repmgr RPM
repository.
2018-07-30 17:26:46 +09:00
Ian Barwick
db4199e08f doc: update document build version for 4.1 branch 2018-07-24 14:02:38 +09:00
Ian Barwick
0d9ed02729 doc: fix typo 2018-07-24 14:02:08 +09:00
Ian Barwick
8e9f0b802b Create 4.1 branch 2018-07-24 10:22:31 +09:00
Ian Barwick
c236405251 Update extension metadata for 4.1 release
This release does not make any changes to the extension database
objects.
2018-07-24 09:56:43 +09:00
Ian Barwick
527a5f7fee doc: update release notes and upgrade instructions 2018-07-24 09:54:06 +09:00
Ian Barwick
937cffd54c doc: clarify BDR repmgrd configuration
Link directly to section about configuring the "event_notification_command".
2018-07-23 13:21:11 +09:00
Ian Barwick
2b1e12591a doc: fix markup errors 2018-07-23 13:18:38 +09:00
Ian Barwick
7ecfb333b9 doc: add note about switchover and exclusive backups
Also rename server_not_in_exclusive_backup_mode() to avoid double
negatives.

GitHub #476.
2018-07-19 16:02:31 +09:00
Martín Marqués
8f13a66aaa Check that there is no exclusive backup taking place while we perform
a switchover.

We've found that this can cause some issues with postgres control
metadata (could be a postgres bug), so the best thing is *not* to switch
over if there's a backup taking place.

It's also a bad idea from an architectural point of view, as a switchover
is supposed to be planned, so why perform it when we are taking backups?

GitHub #476.
2018-07-19 16:02:21 +09:00
Ian Barwick
ef35d071bf Fix is_active_bdr_node() query for BDR 2.x
Copy/paste error when adapting the query for BDR 3.x.
2018-07-19 09:50:30 +09:00
Ian Barwick
b87f9dabb4 doc: remove duplicate item in list of event notifications 2018-07-18 16:10:55 +09:00
Ian Barwick
7decc7975f Fix BDR version check
regexp_match() is only available from PostgreSQL 10 and later.
2018-07-18 10:54:16 +09:00
Ian Barwick
a5cfc244bc repmgr: have "node status" check for missing downstream nodes
This matches the behaviour of "node check".
2018-07-18 10:27:19 +09:00
Ian Barwick
673bde2b7f repmgr: fix "primary_slot_name" when using "standby clone" with --recovery-conf-only
Addresses GitHub #474.
2018-07-17 13:42:10 +09:00
Martín Marqués
81de200561 Add information to the --help and docs of standby clone regarding the need
to provide a conninfo line for the upstream node from which we will be
cloning.
2018-07-16 18:56:41 -03:00
Ian Barwick
cb46fb6410 repmgrd: when reloading configuration, log any errors encountered 2018-07-16 16:46:39 +09:00
Ian Barwick
bd58e4128c repmgrd: log "promote_command" at log_level "INFO"
If repmgrd is promoting the local node, it was only logging the contents
of "promote_command" at DEBUG level; it would be useful to see this at
the default log level.

Related to GitHub #473.
2018-07-16 15:33:10 +09:00
Ian Barwick
63242e2277 doc: update documentation of "promote_command" and "service_promote_command"
The documentation implied it would override "promote_command", which is
not the case.

"promote_command" is used by repmgrd to execute "repmgr standby promote"
(either directly or via a custom script).

"service_promote_command" can be set to specify a package-level service
command to promote the local PostgreSQL instance from standby to primary,
e.g. Debian's pg_ctlcluster. If set, this will be executed by "repmgr standby promote".

Also update code comments to clarify usage.

Related to GitHub #473.
2018-07-16 14:43:53 +09:00
Ian Barwick
69782cf703 repmgr: enable "witness unregister" to be run on any node
Provide the ID of the witness node with --node-id=...

Implements GitHub #472.
2018-07-13 17:37:59 +09:00
Ian Barwick
5acb3e6790 doc: update release notes 2018-07-13 15:35:34 +09:00
Ian Barwick
6dfcaa357e doc: update release notes 2018-07-13 15:06:04 +09:00
Ian Barwick
8acc50e752 Bump version number in configure.in 2018-07-13 14:05:29 +09:00
Ian Barwick
56919ea499 repmgr: add -q/--quiet option
This suppresses log output below log level ERROR. This is useful mainly
when repmgr is being executed programmatically, e.g. in a cronjob,
where it's only useful to receive output if something goes wrong.

Note we advise against using this option when executing repmgr
commands which operate on PostgreSQL nodes (standby follow,
standby promote, standby switchover, node rejoin), particularly when
executed by repmgrd, as the log output will provide valuable
troubleshooting information.

Implements suggestion in GitHub #468.
2018-07-13 12:09:41 +09:00
Ian Barwick
b3f64987cb repmgr: add --csv output to "cluster event"
Implements GitHub #471.
2018-07-13 11:19:42 +09:00
Ian Barwick
388ac2f392 repmgrd: enable package to supply default PID file path
Also add documentation for packagers about paths which can be patched
as default package values.
2018-07-13 10:26:47 +09:00
Ian Barwick
8b059bc9b0 Change default for "log_level" to INFO
Default was previously NOTICE (as in repmgr 3.x) but documentation
implied it was INFO, and many of the documentation examples assume
it is.

This produces some quite informative log output, without creating excessive
log file volume. In particular it's useful to get a better idea of what
repmgrd is actually doing.

Also add documentation section for the log configuration parameters.

GitHub #470, containing change suggested in GitHub #467.
2018-07-12 14:50:48 +09:00
Ian Barwick
cfa7155784 doc: update links to configuration file sections 2018-07-12 11:43:04 +09:00
Ian Barwick
47644b55ed doc: rearrange repmgr.conf documentation 2018-07-12 11:36:28 +09:00
Ian Barwick
17f30ec364 repmgrd: add additional local node connection check
It's possible there are corner-cases where do_election() is called while the
local connection is invalid, so perform an additional check.
2018-07-11 15:11:20 +09:00
Ian Barwick
c6b8d78bad doc: add extra emphasis about not running repmgrd during switchover
One day this will no longer be an issue, until then let's hope the
fine documentation is read.
2018-07-11 09:53:29 +09:00
Ian Barwick
ae60caacdd repmgr: make "node check" and "node status" return ERR_NODE_STATUS when appropriate
If any issue is detected (and "node check" is not being executed with a specific
individual check), "ERR_NODE_STATUS" is returned.
2018-07-05 14:31:06 +09:00
Ian Barwick
92d0e6809b repmgr: "cluster show" to return non-zero value if an issue encountered 2018-07-05 13:32:50 +09:00
Ian Barwick
4c7c681a14 repmgr: have "cluster show" exit with a non-zero value if issues detected
If any issues are detected (e.g. node not reachable, unexpected node status
etc.), "repmgr cluster show" returns exit code 25 ("ERR_NODE_STATUS").

Note that exit code 25 was introduced recently as "ERR_CLUSTER_CHECK",
however it makes sense to use this to indicate issues detected by any
command which can detect node issues.

Addresses GitHub #456.
2018-07-05 11:03:48 +09:00
Ian Barwick
29de052dd8 repmgr: clarify intent behind --wait-sync timeout processing 2018-07-05 10:09:04 +09:00
Ian Barwick
ebf2a3a7cc doc: fix typo in release notes 2018-07-05 08:45:10 +09:00
Ian Barwick
37311e15a3 repmgr: fix "standby register --wait-sync" when no timeout provided
The default value for "wait_register_sync_seconds" was zero, which is treated
as disabling --wait-sync altogether. Default value now set to -1, which is taken
to mean no timeout value supplied.
2018-07-04 17:22:04 +09:00
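
The sentinel change in commit 37311e15a3 can be sketched in C as follows. The enum WaitSyncMode and the function wait_sync_mode() are hypothetical names, and the sketch assumes that "no timeout value supplied" means waiting without a time limit, which the commit message does not state explicitly.

```c
#include <stdbool.h>

/* Hypothetical classification of how --wait-sync should behave */
typedef enum
{
    WAIT_SYNC_OFF,          /* do not wait at all */
    WAIT_SYNC_UNBOUNDED,    /* wait, no timeout supplied (assumed: no limit) */
    WAIT_SYNC_BOUNDED       /* wait for at most the given number of seconds */
} WaitSyncMode;

/*
 * "requested" is true if --wait-sync was given on the command line;
 * "seconds" holds the optional timeout, defaulting to -1.
 */
WaitSyncMode
wait_sync_mode(bool requested, int seconds)
{
    if (!requested)
        return WAIT_SYNC_OFF;

    /* Zero is treated as disabling --wait-sync altogether */
    if (seconds == 0)
        return WAIT_SYNC_OFF;

    /* -1 (the new default) means no timeout value was supplied */
    if (seconds < 0)
        return WAIT_SYNC_UNBOUNDED;

    return WAIT_SYNC_BOUNDED;
}
```

With the old default of zero, a bare --wait-sync would have fallen into the second branch and been silently disabled; using -1 as the default keeps "not supplied" distinct from "zero".
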
Ian Barwick
a194cf56b3 repmgr: exit with an error if an unrecognised command line option is provided.
This matches the behaviour of other PostgreSQL utilities such as psql, though
repmgr will only abort once all command line options are parsed, so as many
errors as possible are found and displayed. If a repmgr "command" (e.g.
"repmgr primary ...") was provided, a hint about the relevant command
help section (e.g. "repmgr primary --help") will be provided alongside
the generic help command (i.e. "repmgr --help").

Addresses GitHub #464, with further improvements.
2018-07-04 11:02:50 +09:00
Abhijit Menon-Sen
c4f9205f17 Merge pull request #460 from gclough/repmgr_conf_sample_typo_priority
Fixed typo in repmgr.conf.sample, "priority"
2018-07-03 17:43:57 +05:30
Abhijit Menon-Sen
6d09ebcfb5 Merge pull request #462 from gclough/repmgr_cluster_help_2
Fix "cluster cleanup" help
2018-07-03 17:43:35 +05:30
Abhijit Menon-Sen
319a29583d Merge pull request #461 from gclough/add_cluster_cleanup_help
Added "cluster cleanup" to help
2018-07-03 17:43:20 +05:30
Greg Clough
a5d47fd478 Fix "cluster cleanup" help 2018-06-29 22:57:06 +01:00
Greg Clough
190104c7db Added "cluster cleanup" to help 2018-06-29 22:54:59 +01:00
Greg Clough
ff16d3b3bb Fixed typo in repmgr.conf.sample, "priority" 2018-06-29 22:00:09 +01:00
Ian Barwick
802755fd60 repmgrd: daemonize process by default
It's hard to imagine a use case where this isn't desirable, but
in case, for whatever reason, the user does not wish to daemonize the
process, the command line option "--daemonize=false" can be provided.

Implements GitHub #458.
2018-06-29 22:01:49 +09:00
Ian Barwick
d00c0c67d0 repmgrd: document PID file options/configuration 2018-06-29 17:00:25 +09:00
Ian Barwick
8d636690bd repmgrd: create pid file by default
Traditionally repmgrd will only write a pidfile if explicitly requested with
-p/--pid-file. However it's normally desirable to have a pidfile, and it's
preferable to have one used by default to prevent accidentally starting a second
repmgrd instance.

Following changes made:

 - add configuration file parameter "repmgrd_pid_file" (initially overridden by
   -p/--pid-file for backwards compatibility, though eventually we'll want to
   drop -p/--pid-file altogether)
 - add command line option --no-pid-file
 - if neither "repmgrd_pid_file" nor -p/--pid-file is set, create the pid file
   in a temporary directory

Implements GitHub #457.
2018-06-29 14:36:24 +09:00
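
The PID file precedence listed in commit 8d636690bd can be sketched in C like this. The function resolve_pid_file() and its parameters are hypothetical, the fallback path is supplied by the caller rather than guessed, and the sketch assumes that --no-pid-file takes priority over the other settings, which the commit message does not spell out.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Decide which PID file path to use, or return NULL if none should be
 * written. Hypothetical helper, not taken from the repmgrd source.
 *
 *   cli_pid_file    value of -p/--pid-file, or NULL if not given
 *   conf_pid_file   value of "repmgrd_pid_file", or NULL if not set
 *   no_pid_file     true if --no-pid-file was given
 *   fallback        path inside a temporary directory
 */
const char *
resolve_pid_file(const char *cli_pid_file,
                 const char *conf_pid_file,
                 bool no_pid_file,
                 const char *fallback)
{
    /* Assumption: an explicit --no-pid-file wins over everything else */
    if (no_pid_file)
        return NULL;

    /* -p/--pid-file overrides the configuration file parameter */
    if (cli_pid_file != NULL && cli_pid_file[0] != '\0')
        return cli_pid_file;

    if (conf_pid_file != NULL && conf_pid_file[0] != '\0')
        return conf_pid_file;

    /* Neither is set: fall back to the temporary directory location */
    return fallback;
}
```
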
Ian Barwick
b2081dca52 De-overload configuration file parameter "standby_reconnect_timeout"
Currently the (very generic sounding) "standby_reconnect_timeout" configuration
file parameter is used in several different contexts and it would be useful
to have more granular control over the different timeouts it's used to configure.

This patch introduces "node_rejoin_timeout", used in place of "standby_reconnect_timeout"
(which wasn't documented) when "repmgr node rejoin" is executed, to determine
how long to wait for the node to rejoin the replication cluster.

Additionally "repmgrd_standby_startup_timeout" is introduced as a timeout for
failover situations, when repmgrd executes "repmgr standby follow" to follow
a new primary, and waits for the standby to restart and become available
for connections.

"standby_reconnect_timeout" is now only relevant for "repmgr standby switchover".

Implements GitHub #454.
2018-06-28 18:00:55 +09:00
Ian Barwick
080a29c33b node check: add --missing-slots check
This enables an explicit check for slots which should exist (according
to the repmgr metadata) but which aren't present.
2018-06-22 17:21:40 +09:00
Ian Barwick
dd7a4068d2 node check: implement CSV output
This is advertised in the --help output and placeholder code was in
place, but it wasn't actually implemented.
2018-06-22 13:14:57 +09:00
Ian Barwick
fcf237fe31 node status: improve output and documentation
In the default text output mode, list inactive slots.

In CSV output mode, list inactive slots as additional information;
add output line with number of missing slots and a list thereof.

Also document --csv output mode.
2018-06-22 11:46:50 +09:00
Ian Barwick
4d70a667fb node check: clarify status information for witness server
Previously the output gave the impression the server was a primary,
which is technically the case, but it's not the actual cluster primary.

Also output an error if the node is in recovery, which is unlikely but
you never know.
2018-06-22 10:15:45 +09:00
Ian Barwick
c5ba72c2c5 standby switchover: fix behaviour if witness node is a sibling
The witness node is not a streaming replication standby, so executing
"repmgr standby follow" will fail. Instead, execute "repmgr witness
register --force" to update the witness node record on the primary and
its local copy of all node records.

Addresses GitHub #453.
2018-06-21 16:48:58 +09:00
Ian Barwick
0f97a98f28 repmgr: don't count witness node as a standby when running "node status"
Addresses GitHub #451.
2018-06-21 13:06:18 +09:00
Ian Barwick
269e3242c8 "repmgr node ...": update comments and formatting 2018-06-21 12:12:07 +09:00
Ian Barwick
b0ed87832b repmgr: don't count witness node as a standby when running "node check"
Addresses GitHub #451.
2018-06-21 11:13:46 +09:00
Ian Barwick
836d2125fe Improve BDR3 node query
We can get everything we need from bdr.node_summary
2018-06-15 14:30:06 +09:00
Ian Barwick
bf0d67c60a Add repmgr.nodes to the BDR replication set 2018-06-15 14:29:08 +09:00
Ian Barwick
e1d807188d Add extension upgrade files 2018-06-15 14:27:42 +09:00
Ian Barwick
108c3a36fb Enable creation of repmgr extension on BDR3 node 2018-06-15 14:26:47 +09:00
Ian Barwick
8377704596 Convert BDR query functions to handle BDR2/BDR3 2018-06-15 14:26:07 +09:00
Ian Barwick
4f642f8332 Detect and store BDR major version number when executing "is_bdr_db()"
BDR3 metadata structure is very different to BDR1/2, so we'll need to
generate queries according to version.
2018-06-15 14:25:55 +09:00
Ian Barwick
029ba46470 doc: remove info about old RPM package repository 2018-06-15 13:27:19 +09:00
Ian Barwick
098f8eaf2a doc: finalize release notes 2018-06-15 13:27:14 +09:00
Ian Barwick
d60bd232f0 Enable "recovery_min_apply_delay" to be zero.
Addresses GitHub #448.
2018-06-14 11:11:33 +09:00
Ian Barwick
eca1943026 doc: emphasize that repmgrd should not be running during a switchover 2018-06-12 10:30:35 +09:00
Ian Barwick
bcab4bc391 _create_event(): log event and node ID for debugging 2018-06-12 10:30:30 +09:00
Ian Barwick
bb320a64f5 repmgr: consolidate code in "standby switchover"
Commit 41274f5525 left us with two if statements
in sequence with exactly the same condition, so consolidate both into a single
statement. Clarify code comments while we're at it.
2018-06-12 10:30:24 +09:00
Ian Barwick
3b0cde2846 repmgr: cluster check commands - non-zero exit code if node(s) unavailable
Return ERR_CLUSTER_CHECK if one or more nodes were not reachable.

Implements GitHub #447.
2018-06-12 10:30:11 +09:00
Ian Barwick
00704913a6 doc: 4.0.6 release notes 2018-06-12 10:29:35 +09:00
Ian Barwick
efc388065e standby follow: check node has connected to new primary
After restarting the standby, poll pg_stat_replication on the upstream
until the standby connects, and exit with an error if it doesn't by the
timeout defined in "standby_follow_timeout".

Implements GitHub #444.
2018-06-07 15:04:45 +09:00
Ian Barwick
e12fbb7b4d doc: update release notes 2018-06-07 15:04:38 +09:00
Ian Barwick
0108fb2e72 standby follow: add hint about using "node rejoin"
If "repmgr standby follow" is executed on a node which isn't running,
point out "repmgr node rejoin" should probably be used instead.
2018-06-07 15:04:30 +09:00
Ian Barwick
e408351697 doc: fix typos 2018-06-07 15:04:25 +09:00
Ian Barwick
f904cd2573 witness_register: check for existing node with same name 2018-06-07 15:04:18 +09:00
Ian Barwick
95fe7ea621 repmgrd: ensure local node is counted as quorum member
Rename "standby_nodes" to "sibling_nodes" to make it clearer in the
code what total is actually provided by the struct.

Addresses GitHub #439.
2018-06-07 15:04:12 +09:00
Ian Barwick
a50ac039da doc: fix typo 2018-06-07 15:04:06 +09:00
Ian Barwick
535fba43d3 standby clone: improve external configuration file copying
If --copy-external-config-files was provided, check that we can copy
the files *before* cloning the standby, and abort if an error is
encountered. This will give the user the opportunity to fix any issues
before running the entire (and potentially lengthy) clone.

Previously errors were logged but no action taken, and the final
message indicated the clone operation was successful.

Addresses GitHub #443.
2018-06-07 15:04:01 +09:00
Ian Barwick
043a6c5bea repmgrd: ensure degraded monitoring timeout works on standby
Parameter "degraded_monitoring_timeout" was not being acted on when
monitoring a streaming replication standby.

Addresses GitHub #439.
2018-06-07 15:03:52 +09:00
Ian Barwick
8da26f1c6c If --dry-run specified, ensure minimum log level is INFO
When executed with --dry-run, repmgr outputs detail about what would
happen using log level INFO. If the log_level is configured to
NOTICE or higher, it's possible some or all of the --dry-run output
might not be displayed.

Addresses GitHub #441.
2018-06-07 15:03:43 +09:00
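
The adjustment in commit 8da26f1c6c amounts to clamping the configured log level when --dry-run is active. A minimal C sketch follows; the LogLevel enum (ordered from most to least verbose) and effective_log_level() are illustrative assumptions, not repmgr's actual definitions.

```c
#include <stdbool.h>

/* Hypothetical log levels, ordered so that a larger value is less verbose */
typedef enum
{
    LVL_DEBUG = 0,
    LVL_INFO,
    LVL_NOTICE,
    LVL_WARNING,
    LVL_ERROR
} LogLevel;

/*
 * Return the log level to use for this invocation. In --dry-run mode the
 * level is lowered to INFO if the configured level would hide INFO output;
 * a more verbose setting such as DEBUG is left untouched.
 */
LogLevel
effective_log_level(LogLevel configured, bool dry_run)
{
    if (dry_run && configured > LVL_INFO)
        return LVL_INFO;

    return configured;
}
```
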
Ian Barwick
7861392450 node rejoin: avoid outputting empty DETAIL message 2018-06-07 15:03:36 +09:00
Ian Barwick
b297e40d77 node rejoin: improve handling of --config-file parameter
Fixes bug when parsing --config-file values (GitHub #442).

Also improves handling in --dry-run mode, as some checks for the
provided files were being skipped if --dry-run supplied, even though
they are intended to work with --dry-run.
2018-06-07 15:03:30 +09:00
Ian Barwick
7613b1769c standby clone: --recovery-conf-only expects the standby to be registered
Note this in the documentation, and add a HINT about registering it
if the standby record is not available.

Related to GitHub #438.
2018-05-31 09:42:53 +09:00
Ian Barwick
b1b49748a7 "config_file" is MAXPGPATH, not MAXLEN
The two values are the same anyway, so change is more for consistency.
2018-05-24 15:52:57 +09:00
Ian Barwick
276239422b standby clone: don't assume existence of "user" in upstream conninfo
Usually a separate user (typically "repmgr") is set up specifically to manage
the repmgr metadata, however there's no compelling requirement to do this, and
it's possible the database owner (usually: "postgres") will be used, in which
case it's possible the username will be left out of the conninfo string.

Addresses GitHub #437.
2018-05-24 15:52:51 +09:00
Martín Marqués
49418e096e Fix typo in a code comment 2018-05-19 12:30:03 -03:00
Ian Barwick
6c518f1403 "standby clone": log actual connection string used to connect to upstream
Useful for diagnostic purposes.
2018-05-10 12:03:13 +09:00
Ian Barwick
b365765bc8 Fix check for -d/--dbname parameter
Not a bug per se, just meant some unnecessary processing was done on
an empty string.

Per note from petere.
2018-05-10 12:03:09 +09:00
Ian Barwick
bd63948937 Include "arpa/inet.h" in dbutils.c
Needed for htonl() on FreeBSD.
2018-05-10 12:03:04 +09:00
Ian Barwick
69c1f147ea doc: update 2ndQuadrant repository information
Canonical link for each repository should not include any directories.
2018-05-10 10:39:31 +09:00
Ian Barwick
ce8d3cf0b0 doc: update repository information 2018-05-10 10:39:27 +09:00
Ian Barwick
14134f8e70 doc: update package installation information
Document the new public 2ndQuadrant apt repository
2018-05-10 10:39:23 +09:00
Ian Barwick
be8448ddcb doc: update package installation information
Document the new, public 2ndQuadrant RPM repository.
2018-05-10 10:39:18 +09:00
Ian Barwick
a2ff1536ad doc: add notes about package compatibility
We need to emphasise that the repmgr packages are only compatible
with packages based on the PGDG filesystem layout; 3rd party vendor
packages often put application and data directories elsewhere.
See e.g. GitHub #427.
2018-05-10 10:38:54 +09:00
Ian Barwick
9c0c1b663e Minor documentation fixes 2018-05-10 10:25:29 +09:00
Ian Barwick
2d43feb34b doc: update HISTORY and add 4.0.5 release notes 2018-05-01 10:21:40 +09:00
Ian Barwick
6f315c1b3c repmgrd: don't explicitly close connections on shutdown 2018-05-01 10:21:10 +09:00
Ian Barwick
635bdccb2c Fix parsing of "archive_ready_critical" configuration file parameter.
Per report in GitHub #426.
2018-04-28 07:00:56 +09:00
Ian Barwick
16048a879e repmgrd: notify sibling nodes to follow new primary after pg_ctl timeout
If "pg_ctl promote" fails due to a timeout, but the promotion itself succeeds,
have repmgrd on the new primary explicitly notify any sibling nodes to
follow it.

Previously the sibling nodes would wait "primary_notification_timeout" seconds
before attempting to discover the new primary.

This (and preceding commit eac80ae) address GitHub #425.
2018-04-27 11:54:21 +09:00
Ian Barwick
eac80ae9c1 repmgrd: handle pg_ctl timeout
It's possible "pg_ctl promote" will time out, causing "repmgr standby
follow" to return with an error; however the promotion itself will usually
succeed, so detect this case and handle accordingly.
2018-04-26 19:19:42 +09:00
Ian Barwick
887b845aa0 repmgrd: always close the connection if the pointer is not NULL 2018-04-26 10:04:07 +09:00
Ian Barwick
8320179f34 Add configuration file parameter "config_directory"
This enables explicit provision of an external configuration file
directory, which if set will be passed to "pg_ctl" as the -D
parameter. Otherwise "pg_ctl" will default to using the data directory,
which will cause some operations to fail if the configuration files
are not present there.

Note this is implemented primarily for feature completeness and for
development/testing purposes. Users who have installed "repmgr" from
a package should not rely on "pg_ctl" to stop/start/restart PostgreSQL,
instead they should set the appropriate "service_..._command" for their
operating system. For more details see:

    https://repmgr.org/docs/4.0/configuration-service-commands.html

Note: in a future release, the presence of "config_directory" in repmgr.conf
will be used to implicitly set "--copy-external-config-files=samepath" when
cloning a standby; this is a behaviour change so will be implemented in the
next major release (repmgr 4.1).

Implements GitHub #424.
2018-04-25 11:58:24 +09:00
Ian Barwick
7822aa784f repmgrd: catch corner case in standby connection handle check
If repmgrd marks the local node as unavailable, and it was actually
restarting but a failover event occurred before the next local node
check, failover will continue with the stale connection handle.

Add a final local node check just before starting the failover
process, so repmgrd can reconnect if it wasn't able to before.
2018-04-24 21:56:57 +09:00
Ian Barwick
4455ded935 repmgrd: prevent standby connection handle from going stale
If monitoring history is not in use, there's no activity on the standby's
connection handle, so if e.g. the standby is restarted, PQstatus()
never returns CONNECTION_BAD and repmgrd never notices the connection
is stale. Therefore execute a throw-away statement at "monitor_interval_secs".
2018-04-24 21:56:52 +09:00
Ian Barwick
fd0b850f41 Minor doc and log output tweaks 2018-04-24 21:08:05 +09:00
Ian Barwick
d9ac1d6fd0 doc: minor clarification 2018-04-20 12:58:46 +09:00
Ian Barwick
11e4d9fd05 doc: additional details about repmgrd usage in Debian/Ubuntu 2018-04-20 12:58:41 +09:00
Ian Barwick
4b54106f48 doc: add Debian package details 2018-04-20 12:58:37 +09:00
Ian Barwick
f3941ceab0 doc: Improve CentOS package-related documentation 2018-04-20 12:58:33 +09:00
Ian Barwick
93f80c413e doc: link to service command configuration from switchover section 2018-04-20 10:15:22 +09:00
Ian Barwick
09b8a86605 doc: improve configuration documentation
With special attention to setting service commands, and extra special
mention of "pg_ctlcluster" for Debian/Ubuntu users.
2018-04-20 10:15:18 +09:00
Ian Barwick
6b3d54a5f3 doc: update CentOS package documentation 2018-04-20 10:15:14 +09:00
Ian Barwick
85ab2d94b7 repmgrd: tweak event notifications on standby failure
The event notification was only being created if there was a valid
primary connection; it should be created in any case, so an event
notification script can be executed.
2018-04-20 10:15:08 +09:00
Ian Barwick
cda952f1e4 Add "dbname=replication" to all replication connection strings
Previously repmgr was attempting to make replication connections
with "dbname" set to the repmgr database name. While this works
if e.g. the repmgr user also has replication permissions, it will
fail if a dedicated replication user is specified, who only has
permission to access the virtual "replication" database.

Change this to use "dbname=replication" if the replication connection
user is different to the normal repmgr database user.

(We could just always set it to "replication", but that might break
existing installations e.g. where a .pgpass file is in use and there's
no "replication" entry for the normal repmgr database user).

Addresses GitHub #421.
2018-04-12 16:11:16 +09:00
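The selection rule described in the commit above can be sketched roughly as follows. This is an illustration only, not repmgr's actual C code; the function name and its parameters are hypothetical, and it assumes an unset replication user means "same as the repmgr user".

```python
from typing import Optional


def replication_dbname(repmgr_user: str,
                       replication_user: Optional[str],
                       repmgr_dbname: str) -> str:
    """Pick the "dbname" value for a replication connection string.

    Hypothetical helper: names and signature are assumptions for illustration.
    """
    # Assumption: no dedicated replication user means the repmgr user is used.
    effective_user = replication_user or repmgr_user

    if effective_user != repmgr_user:
        # A dedicated replication user may only be permitted to access
        # the virtual "replication" database.
        return "replication"

    # Same user as before: keep the repmgr database name so existing
    # setups (e.g. .pgpass entries) continue to match.
    return repmgr_dbname
```

Keeping the repmgr database name in the same-user case mirrors the compatibility concern noted in the commit message.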
Ian Barwick
99ad57f88a doc: mention --recovery-conf-only introduced in repmgr 4.0.4
Per GitHub #419.
2018-04-12 16:11:12 +09:00
Ian Barwick
ad0671ead2 doc: various updates related to "standby clone" operations. 2018-04-12 16:11:07 +09:00
Ian Barwick
1bbb2ef213 Fix superuser password handling
When establishing a superuser connection, the connection parameters
were being copied from the existing (non-superuser) connection, which
in some circumstances can lead to that user's password being
included in the copied parameter list. The password parameter, if set, will
now always be removed, which will cause libpq to retrieve the correct
one from the .pgpass file.

Addresses GitHub #400.
2018-04-12 12:49:41 +09:00
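As a rough sketch of the fix described above, assuming connection parameters are held as a simple key/value mapping (repmgr itself uses a C structure; the function name here is hypothetical):

```python
def superuser_connection_params(existing: dict, superuser: str) -> dict:
    """Derive superuser connection parameters from an existing connection.

    Hypothetical helper for illustration; not repmgr's actual API.
    """
    # Copy so the original (non-superuser) parameter list is left untouched.
    params = dict(existing)
    params["user"] = superuser
    # Always drop any copied password: it belongs to the other user.
    # With no password set, libpq can look up the right one in .pgpass.
    params.pop("password", None)
    return params
```

Removing the key unconditionally is simpler than trying to work out whether the copied password might still be valid for the superuser.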
Ian Barwick
62c29aab32 Don't issue a CHECKPOINT after promoting a standby.
Issuing a CHECKPOINT immediately after promoting a standby may impact
performance. Commit 239a548e9d ensures
one is only issued when required, i.e. during a switchover when
pg_rewind will be executed.

This reverts commit a2068768ab.
2018-04-09 14:35:54 +09:00
Ian Barwick
b9dc94f28f doc: update FAQ location 2018-04-07 11:46:10 +09:00
Ian Barwick
e8ba213174 "standby register": add sanity check when --upstream-node-id not supplied
If --upstream-node-id was not supplied to "repmgr standby register",
repmgr defaults to the primary node as upstream node. If the local node is
available, we now double-check that it's attached to the primary,
in case the lack of --upstream-node-id was an accidental omission.

This check is only made when the local node is available.

This behaviour can be overridden with -F/--force (though it's hard to
imagine a scenario where that would be useful).

Addresses GitHub #395.
2018-04-05 17:38:55 +09:00
Ian Barwick
0dcddbb062 doc: minor FAQ tweaks 2018-04-05 17:10:33 +09:00
Ian Barwick
b4dab86c3b doc: add a section about repmgrd and service commands etc. 2018-04-05 11:49:08 +09:00
Ian Barwick
644a56a645 doc: miscellaneous FAQ updates
- clarify pg_rewind item
- add note about what's included in recovery.conf
2018-04-04 10:07:08 +09:00
Ian Barwick
4876a9fde3 Add TODO for pg_rewind changes coming in PostgreSQL 11 2018-04-03 21:56:46 +09:00
Ian Barwick
ec998bf9c5 doc: update HISTORY and release notes 2018-04-03 15:00:49 +09:00
Ian Barwick
e36b180de8 Ensure correct server version number used for replication stats query 2018-04-03 14:45:37 +09:00
Ian Barwick
a2068768ab Execute a CHECKPOINT immediately after promoting the server
This ensures "pg_control" is updated with the latest timeline, mainly
to ensure that if "pg_rewind" is executed as part of a switchover
that it sees the latest timeline.

Per suggestion from GitHub user "superflav" in GitHub #378.

See also:

  https://www.postgresql.org/message-id/flat/20150428180253.GU30322%40tamriel.snowman.net
2018-04-03 14:44:44 +09:00
Ian Barwick
bde9fea48c Fix directory creation when cloning from Barman 2018-04-03 14:44:03 +09:00
Ian Barwick
cdaf84c329 doc: minor readability fix 2018-04-03 14:42:48 +09:00
Ian Barwick
c4cd0c46da doc: add note about replication slots and PostgreSQL upgrades 2018-04-03 14:41:58 +09:00
Ian Barwick
3b00dc912a Catch various corner cases when restarting a PostgreSQL instance 2018-04-03 14:40:53 +09:00
Ian Barwick
1a80de1290 doc: document "primary_follow_timeout" configuration file parameter. 2018-04-03 14:39:38 +09:00
Ian Barwick
26b565dff2 Improve repmgrd logging in BDR mode
Also ensure interval status log line is shown as intended
2018-04-03 14:38:32 +09:00
Ian Barwick
96811ccc01 repmgrd: tweak log notices when marking a standby as failed
Announce what we're going to do (set the node record inactive) *before*
performing the action. Makes reading the log slightly easier.
2018-04-03 14:37:43 +09:00
Ian Barwick
73982859f6 repmgrd: improve log output
- emit explicit startup NOTICE
- emit NOTICE when falling back to degraded monitoring on a primary node
- improve log message and event notification details when monitoring
  a former primary which has been reconnected as a standby
2018-04-03 14:37:06 +09:00
Ian Barwick
afb7ca886c doc: note change of shared library name from "repmgr_funcs" to "repmgr" 2018-04-03 14:35:45 +09:00
Ian Barwick
df11ad894f doc: update release notes
Add note about requiring 4.0.3 or later on all nodes when performing
a switchover from a node running 4.0.3 or later.

Per report in GitHub #388.
2018-04-03 14:35:18 +09:00
Ian Barwick
614b4ae84b doc: update 4.0.4 release date 2018-04-03 14:34:24 +09:00
Ian Barwick
1e1b4b1a65 "standby register/follow": provide primary node details for event notifications
For events generated by these commands, it may be useful to know details
of the primary node. This makes the following additional parameters available
to event notification scripts:

- %p: node ID of the primary
- %a: node name of the primary
- %c: conninfo string for the primary

Implements GitHub #375
2018-04-03 14:32:19 +09:00
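To illustrate how the three placeholders listed in the commit above might be expanded in an event notification command, here is a minimal sketch. It is not repmgr's implementation; the function name and the example script name are hypothetical, and any other placeholders are deliberately left untouched.

```python
import re


def expand_primary_placeholders(template: str,
                                primary_id: int,
                                primary_name: str,
                                primary_conninfo: str) -> str:
    """Expand %p, %a and %c in an event notification command template.

    Hypothetical helper for illustration only.
    """
    values = {
        "p": str(primary_id),    # %p: node ID of the primary
        "a": primary_name,       # %a: node name of the primary
        "c": primary_conninfo,   # %c: conninfo string for the primary
    }
    # A single regex pass avoids re-expanding a "%a" that happens to
    # appear inside an already substituted value.
    return re.sub(r"%([pac])", lambda m: values[m.group(1)], template)
```

For example, a template such as `notify.sh %p %a '%c'` (a made-up script name) would have each placeholder replaced exactly once.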
Ian Barwick
cf64f9e95c Always initialise t_conninfo_param_list structures 2018-04-03 14:31:24 +09:00
Ian Barwick
dfdebd6c08 Enable provision of "archive_cleanup_command" in recovery.conf
If "archive_cleanup_command" is defined in "repmgr.conf", a corresponding
entry will be made in the node's "recovery.conf" file after cloning a
standby.

Note that we recommend using PgBarman to manage WAL archives, but are
providing this facility to help repmgr to be integrated in existing environments.

Implements GitHub #416.
2018-04-03 14:10:21 +09:00
Ian Barwick
63a11f8926 "standby promote": make timeout values configurable
This introduces the following new configuration file parameters, which
were previously hard-coded values:

 - promote_check_timeout
 - promote_check_interval

Implements GitHub #387.
2018-04-03 14:10:14 +09:00
Ian Barwick
a3f371b8c0 "node rejoin": actively check for node to rejoin cluster
Previously repmgr was relying on whatever command was configured to
start PostgreSQL to determine whether the node being rejoined had
started correctly. However it's preferable to actively poll the upstream
to confirm it has restarted and actually attached as a standby before
confirming success of the "node rejoin" action.

This can be overridden with the -W/--no-wait option.

(Note that for consistency with other PostgreSQL utilities, the
short form of the --wait option is now "-w"; this is currently
only used in "repmgr standby follow".)

Also update "repmgr node rejoin" documentation with a list of supported
options, and add some useful index entries for "pg_rewind".

Implements GitHub #415.
2018-04-03 10:34:44 +09:00
Ian Barwick
938692c169 doc: fix option description for "repmgr primary register" 2018-04-03 10:09:24 +09:00
Ian Barwick
ad24b04c35 Refactor pg_control parsing
The "data_checksum_version" field is towards the end of the ControlFileData struct,
meaning its position varies between versions. Previously this wasn't a problem
as it was only required for operations involving 9.5 and later, and its position
within the control file has not changed between the current release and current
HEAD.

However, in order to support pg_rewind in 9.3 and 9.4, which both have changes in
the control file format, we'll need version-specific parsing. This will also make
it easier to deal with any future changes to the control file format.
2018-04-02 20:54:42 +09:00
Ian Barwick
3ccf1cf182 Enable pg_rewind to be used with PostgreSQL 9.3/9.4
pg_rewind is not part of the core distribution for those, but we
provided support in repmgr 3.3 so should extend it to repmgr 4.

Note that there is no check in place whether the pg_rewind binary
exists, so it's up to the user to ensure it's present.

Addresses GitHub #413.
2018-04-02 20:54:29 +09:00
Ian Barwick
5e4bdb5a1b repmgrd: handle failover with two nodes in the primary location
If two nodes were in the primary location, and at least one node in
another location, the non-failed node in the primary location was not
recognising itself as a promotion candidate.

Addresses GitHub #407.
2018-04-02 20:51:27 +09:00
Ian Barwick
50321bb95d Log pg_control access errors as WARNINGs rather than DEBUG
This will make it easier to diagnose issues, possibly with an incorrect
"data_directory" setting in "repmgr.conf".
2018-04-02 09:28:56 +09:00
Ian Barwick
253c215c12 Add TODO list
This file will collate various requests and ideas for future development.
In particular it will reference requests which come in via the GitHub issue
tracker, so we can acknowledge and close off the request and not have an
open unresolved issue hanging around.
2018-03-30 14:24:36 +09:00
Ian Barwick
22c40ae62d doc: update HISTORY and release notes 2018-03-30 09:41:48 +09:00
Ian Barwick
239a548e9d "standby switchover": force checkpoint if pg_rewind requested.
Addresses issue described in GitHub #378.

PostgreSQL itself doesn't issue a checkpoint after promotion to ensure
the newly promoted server is available as quickly as possible, so we'll
only execute an explicit CHECKPOINT when it's actually required, i.e.
when pg_rewind will be executed. This is required as pg_rewind uses
the timeline reported in the pg_control file to compare with the
server to be rewound, and the pg_control timeline is only updated after
the first checkpoint, so there is an interval where pg_rewind will
erroneously assume both servers are on the same timeline and take no action.
2018-03-29 23:55:08 +09:00
Ian Barwick
231ef5563e "standby switchover": update hint 2018-03-29 23:41:59 +09:00
Ian Barwick
e1413fa8ea Fix minimum accepted value for "degraded_monitoring_timeout"
Should be -1, the default.

Addresses GitHub #411.
2018-03-29 21:15:03 +09:00
Ian Barwick
7111483b65 repmgr: move demoted primary check to the final step during switchover
This will give the demoted primary more time to start up as a standby,
during which "standby follow" can be executed on sibling nodes, if
specified.
2018-03-27 16:44:15 +09:00
Ian Barwick
1558497ae4 repmgr: poll demoted primary after restart during switchover
During a switchover operation, once the demoted primary has been restarted
as a standby, repmgr attempts to reconnect to verify its status and drop
any redundant replication slots. However it's possible the standby may still
be in the startup phase, so poll for "standby_reconnect_timeout" seconds
before giving up.

Addresses GitHub #408.
2018-03-27 16:44:10 +09:00
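The polling behaviour described above can be sketched generically as below. This is an assumption-laden illustration rather than repmgr's code: the function name, the injectable `sleep`/`clock` arguments, and the one-second default interval are all hypothetical; only the idea of retrying until a timeout such as "standby_reconnect_timeout" expires comes from the commit message.

```python
import time
from typing import Callable


def poll_until(check: Callable[[], bool],
               timeout_secs: float,
               interval_secs: float = 1.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Call check() repeatedly until it succeeds or the timeout expires.

    Returns True on success, False if the timeout was reached first.
    """
    deadline = clock() + timeout_secs
    while True:
        # e.g. "can we connect to the restarted standby yet?"
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_secs)
```

The check is always attempted at least once, so a zero timeout degrades to a single connection attempt.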
Ian Barwick
9c5e76401f Fix "repmgr cluster crosscheck" output
Addresses GitHub #398.
2018-03-27 16:44:04 +09:00
Ian Barwick
a403da67bc Consolidate connection closure calls 2018-03-27 16:43:59 +09:00
Ian Barwick
71b13f5307 doc: add note about remote command execution
When executing a command on a remote server, repmgr expects the remote binary
to be in the same location as the local binary. It's reasonable to assume
repmgr will be deployed in a unified environment; if not, the onus is on the
user to ensure repmgr can find the remote binary, e.g. by creating appropriate
symlinks.

Addresses query in GitHub #406.
2018-03-27 16:43:55 +09:00
Ian Barwick
1c5561d114 Misc tweaks to witness code 2018-03-26 20:59:29 +09:00
Ian Barwick
c0b607ef41 doc: update list of event notifications 2018-03-23 10:40:39 +08:00
Ian Barwick
462fdca4b4 Tidy up queries in dbutils.c
- standardize formatting
- prefix various internal function calls with "pg_catalog.", to
  mitigate possible risks from CVE-2018-1058
2018-03-23 10:28:28 +08:00
Ian Barwick
0e55a60660 Add event "repmgrd_failover_aborted" 2018-03-21 13:23:06 +09:00
Ian Barwick
93deab3e96 Add error code ERR_FOLLOW_FAIL 2018-03-21 13:11:30 +09:00
Ian Barwick
81c69e3677 repmgrd: fix typo 2018-03-21 12:36:15 +09:00
Ian Barwick
0219f4c91f Always set "connect_timeout" when pinging a PostgreSQL instance
Insert "connect_timeout=2" into the connection parameters, if not
explicitly set by the user. This will prevent excessive wait time
for the host operating system to report a connection timeout.
2018-03-21 11:48:57 +09:00
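A minimal sketch of the defaulting rule described above, assuming the connection parameters are available as a key/value mapping (the helper name is hypothetical; the value 2 is taken from the commit message):

```python
def with_default_connect_timeout(params: dict, default_secs: int = 2) -> dict:
    """Return a copy of params with "connect_timeout" set if it was absent.

    Hypothetical helper for illustration; not repmgr's actual code.
    """
    result = dict(params)
    # Only apply the default when the user has not set a value explicitly.
    result.setdefault("connect_timeout", str(default_secs))
    return result
```

A user-supplied "connect_timeout" is preserved as-is, which matches the "if not explicitly set by the user" condition in the commit message.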
Ian Barwick
85a4adc99c Update HISTORY 2018-03-21 06:48:32 +09:00
Martín Marqués
208d7d418e While reviewing 7cb6e5af8d before merging
I noticed that besides the result cleanup added, there was still a missing
spot inside the if condition.

Adding the PQclear that was missing.
2018-03-13 11:43:36 -03:00
Martín Marqués
7cb6e5af8d Merge pull request #403 from AndrzejNowicki/master
Clear node list to avoid memory leak on witness
2018-03-13 11:41:10 -03:00
Andrzej Nowicki
d2a2df13d5 One more memory leak fixed 2018-03-13 11:23:33 +01:00
Andrzej Nowicki
358e001218 Clear node list to avoid memory leak, fixes #402 2018-03-13 11:05:24 +01:00
Ian Barwick
d7702b3444 Correctly handle error message pointer when parsing strings.
When parsing conninfo strings, ensure the error message pointer is
actually returned to the caller.

Not a critical issue, just meant the contents of the error message
were not being displayed.
2018-03-10 14:29:12 +09:00
Ian Barwick
a8286030c0 doc: update "repmgr primary unregister" description
As noted by GitHub user yonj1e in GitHub #396.
2018-03-08 19:11:41 +09:00
Ian Barwick
ff0ba3e19a doc: update FAQ
Additional clarification for "repmgr standby clone --recovery-conf-only"
2018-03-08 19:11:33 +09:00
Ian Barwick
6f5cce7e6f doc: update FAQ
Add entry about upgrading PostgreSQL
2018-03-08 19:11:21 +09:00
Ian Barwick
509f7a8255 Fix parsing of -k/--keep-history option
GitHub #394.
2018-03-07 19:22:04 +09:00
Ian Barwick
e8cdf72ecd Add 4.0.4 release notes 2018-03-07 19:21:49 +09:00
Ian Barwick
2a99dfa15b repmgrd: fix failover handling in "manual" mode
Regression was introduced in commit c7a585c555
2018-03-07 19:21:40 +09:00
Ian Barwick
bad034f7ee repmgrd: remove duplicate local record check in BDR mode 2018-03-07 19:21:33 +09:00
Ian Barwick
cdb504d700 Add event "repmgrd_shutdown"
Implements GitHub #393
2018-03-06 11:00:03 +09:00
Ian Barwick
0af2077bed repmgrd: add debug log output for "monitor_interval_secs" sleep in all modes 2018-03-06 10:56:21 +09:00
Emre Hasegeli
dea87b7285 Add witness options to the main help
GitHub #392
2018-03-06 10:55:06 +09:00
Martín Marqués
d6b13f3428 Merge pull request #391 from hasegeli/helpmissing
Add missing options to the main help
2018-03-02 15:36:53 -03:00
Emre Hasegeli
5808d8190e Add missing options to the main help 2018-03-02 17:08:50 +01:00
Ian Barwick
d2a5cc23cc "standby clone": improve replication user selection
Use the upstream node's replication user when checking the replication
connection.
2018-03-02 16:43:23 +09:00
Ian Barwick
9981ede1af "standby clone": fix --superuser handling
get_superuser_connection() was erroneously using the local node record
to connect to as a superuser, which works when registering the primary
but obviously not when cloning a standby.

Addresses GitHub #380.
2018-03-02 16:43:19 +09:00
Ian Barwick
40ccae57a3 Update HISTORY 2018-03-02 11:05:30 +09:00
Ian Barwick
3c2b8e5792 "standby clone": remove restriction on replication slots in Barman mode
While it's preferable to avoid standby replication slots if Barman is in
use, there's no technical reason to prevent this.

Implements GitHub #379.
2018-03-02 11:05:25 +09:00
Ian Barwick
354231284e repmgr: escape "restore_command" in generated recovery.conf 2018-03-02 11:05:21 +09:00
Ian Barwick
dbbfcb6a63 "standby clone": fix primary_conninfo when --upstream-conninfo provided 2018-03-02 11:05:15 +09:00
Ian Barwick
bc766a48ed repmgrd: retry standby connection after cascading standby failover 2018-03-02 11:05:07 +09:00
Ian Barwick
55441f2729 repmgrd: add configuration file parameter "standby_reconnect_timeout"
This is used for determining a timeout when reconnecting to the standby
after executing the "follow_command". This will normally not need to be
set explicitly, but may be useful in cases where the standby's startup
phase can last longer than usual.
2018-03-02 11:04:56 +09:00
Ian Barwick
e38a9ec7e1 repmgrd: fix main monitoring loop for witness server
Missing "break" was breaking it when following a new primary.
2018-03-02 11:04:22 +09:00
Ian Barwick
c1356b9e0d repmgrd: retry standby connection after "follow_command" executed
It's possible that the standby is still starting up after the "follow_command"
completes, so poll for a while until we get a connection.
2018-03-02 11:04:19 +09:00
Ian Barwick
383a17fba1 doc: add <options> section for various commands 2018-02-26 16:54:27 +09:00
Ian Barwick
29cb153643 "node status": improve replication slot warnings
Addresses GitHub #385
2018-02-23 11:19:33 +09:00
Ian Barwick
15625183c1 "standby clone": document --recovery-conf-only option 2018-02-23 11:19:21 +09:00
Ian Barwick
b6a1b75d22 "standby clone --recovery-conf-only": display generated file with --dry-run
Refactor the original code which generates "recovery.conf" to place the
output into a buffer, which can either be output as "recovery.conf"
or copied to a buffer specified by the caller.
2018-02-23 11:18:45 +09:00
Ian Barwick
c644ddde51 Fix typo in function name 2018-02-22 15:50:57 +09:00
Ian Barwick
ee98a3a58e "standby clone": add --recovery-conf-only option
This will generate "recovery.conf" for an existing standby.

Typical use-case is a standby cloned manually from an external data
source (e.g. Barman), where "recovery.conf" needs to be created
(and if required a replication slot).

The --dry-run option will check the pre-requisites but not actually
create "recovery.conf" or a replication slot.

This requires that the upstream node is running, a replication connection
can be made and if required a replication slot can be created.

Implements GitHub #382.
2018-02-22 15:50:51 +09:00
Ian Barwick
22b3a74fa0 repmgrd: improve detection of status change from primary to standby
If repmgrd is running in degraded mode on a primary which has been stopped,
then manually brought back online as a standby (e.g. by creating
recovery.conf and starting the server), ensure it not only detects the
change but automatically updates the node record so it can resume
monitoring the node as a standby.

Previously, repmgrd was looping waiting for the record to be updated
(as is done transparently when executing "repmgr node rejoin") but
if the record was not updated within the timeout period (e.g. by
"repmgr standby register") it would fail to resume monitoring as a
standby.

It seems reasonable to have repmgrd automatically update the node record,
as this will restore failover capability as quickly as possible. If this
is not desired, then the onus is on the user to shut down repmgrd while
making the desired changes.
2018-02-22 15:50:45 +09:00
Ian Barwick
98af51da03 "node rejoin": ensure --dry-run is honoured
Addresses GitHub #383.
2018-02-20 15:31:03 +09:00
Ian Barwick
e5eff3f6d5 doc: update 4.0.3 release notes 2018-02-16 12:15:44 +09:00
Ian Barwick
728a256a93 doc: update release notes 2018-02-16 12:15:35 +09:00
Ian Barwick
f5f02ae0ee Replace remaining instances of strcpy() with strncpy()
Also use strncmp() to match.
2018-02-15 13:31:55 +09:00
Ian Barwick
64d85587de repmgrd: check "repmgr" extension is installed before starting
Implements GitHub #361.
2018-02-12 11:38:31 +09:00
Ian Barwick
6b7f6089ba "node status": add warning about missing replication slots
Implements GitHub #364.
2018-02-12 11:38:27 +09:00
Ian Barwick
5719a0dfd3 Update repmgr.conf.sample
Add missing parameter "monitor_interval_secs"
2018-02-12 11:38:22 +09:00
Ian Barwick
927bf038a0 "standby switchover": check demotion candidate can make replication connection
Check it's actually possible for the demotion candidate to attach to
the promotion candidate before executing the switchover.

As with other checks of this nature, there's a faint possibility the
situation could change between the time the check is carried out and
the demotion candidate is restarted to connect to the promotion candidate,
but there's not a lot we can do about that. The main purpose is to
be able to catch existing misconfigurations before anything gets changed.

Implements GitHub #370.
2018-02-09 10:00:54 +09:00
Ian Barwick
76a93af15c "witness register": fix primary node check
Addresses GitHub #377, based on report by user yonj1e in #373.
2018-02-08 16:41:04 +09:00
Ian Barwick
ee2df36a76 "standby switchover": additional sanity checks
Check that sufficient walsenders will be available on the promotion
candidate, and if replication slots are in use check if enough of
those will be available.

Note these checks can't guarantee that the walsenders/slots will
be available at the appropriate points during the switchover process,
but do ensure that existing configuration problems will be caught.

Implements GitHub #371.
2018-02-08 15:19:24 +09:00
Ian Barwick
571e6b2783 "standby clone": cowardly refuse to clone into an active data directory
By checking the PID file in the same way pg_ctl does, we can be pretty
much certain whether the target data directory contains an active
PostgreSQL instance.
2018-02-08 10:19:05 +09:00
Ian Barwick
76cc11b786 Fix "standby clone" in Barman mode with --no-upstream-connection
"--upstream-node-id", if provided, was not being passed through to
the SQL query executed via the Barman server.

Also modified the query to select the primary node if "--upstream-node-id"
is not provided.

Note: this is a very niche use case.
2018-02-07 16:34:01 +09:00
Ian Barwick
56710f4819 repmgr: simplify data directory checks when cloning
Attempting to use the contents of pg_control to tell whether the directory
is in use by PostgreSQL can result in false positives; we should use
a check based on the pidfile.

Also change the HINT to indicate a data directory can be overwritten
if -F/--force is provided.
2018-02-07 14:45:37 +09:00
Ian Barwick
f9528efdb8 "standby clone": ensure "pg_subtrans" directory is created in Barman mode 2018-02-07 14:45:04 +09:00
Ian Barwick
658ec20e37 doc: fix GitHub reference in release notes 2018-02-07 14:43:47 +09:00
Ian Barwick
e6aa831782 Update HISTORY and release notes 2018-02-07 14:43:43 +09:00
Ian Barwick
9b56f157dc Move parse_output_to_argv() to configfile.c
So it can be used by parse_pg_basebackup_options().

Addresses GitHub #376.
2018-02-07 09:47:50 +09:00
Ian Barwick
05f872effe Fix typo in HINT 2018-02-07 08:56:29 +09:00
Ian Barwick
ae691688be doc: fix descriptions of %p event notification script parameter 2018-02-05 15:52:48 +09:00
Ian Barwick
57f1e939c5 "standby register": add event notification "standby_register_sync"
Implements GitHub #374.
2018-02-05 15:20:19 +09:00
Ian Barwick
48b5deebf3 doc: minor fixes to BDR docs
Also remove duplicate file.
2018-02-05 14:01:37 +09:00
Ian Barwick
1868453953 doc: improve BDR failover documentation 2018-02-05 13:25:49 +09:00
Ian Barwick
dd45189fa8 "cluster show": output any connection error messages in list of warnings
This ensures any connection errors are displayed by default in a
comprehensible, easily reportable way, and saves having to request/filter
DEBUG output.

Implements GitHub #369.
2018-02-05 10:36:04 +09:00
Ian Barwick
a79c4fae88 "cluster show": minor code cleanup 2018-02-05 10:36:00 +09:00
Ian Barwick
657ed83921 "cluster show": improve handling of database errors
In particular, if running "repmgr cluster show" against a database
without the repmgr metadata, showing the error (rather than just
"no records found" etc.) will provide some clues about the problem.
2018-02-05 10:35:56 +09:00
Tony Finch
4fb085f52d "repmgr node status": correct upstream node info (#363)
repmgr was printing the name and ID of this node instead of its upstream

Signed-off-by: Tony Finch <dot@dotat.at>
2018-02-05 09:52:58 +09:00
Ian Barwick
d0bb5b1565 Ensure an inactive PostgreSQL data directory can be deleted.
Addresses GitHub #366.
2018-02-02 17:18:51 +09:00
Ian Barwick
ee64f3a745 "standby follow": finalize implementation of --dry-run option 2018-02-02 17:18:47 +09:00
Ian Barwick
6c81e54f76 "standby follow": check for replication slot availability on target node 2018-02-02 17:18:43 +09:00
Ian Barwick
65bf203a89 Improve "repmgr primary unregister" documentation and --help output
Per observations in GitHub #373
2018-02-02 17:18:36 +09:00
Ian Barwick
b4dbee517f doc: note password SSH requirements for "standby switchover" 2018-02-02 17:18:31 +09:00
Ian Barwick
e23d28a22d "standby follow": initial implementation of --dry-run option
GitHub #363.
2018-02-01 14:16:49 +09:00
Ian Barwick
811d2a45bd "standby switchover": improve log messages and add new exit code
Previously, if an issue was encountered with the old primary, but user
provided -F/--force to have repmgr promote the standby anyway, repmgr
would exit with the log message "STANDBY SWITCHOVER is complete"
and exit code 0 (SUCCESS).

To better report this partial completion, repmgr will now emit the message
"STANDBY SWITCHOVER has completed with issues" (and a HINT to check preceding
log messages) and new exit code 22 (ERR_SWITCHOVER_INCOMPLETE).
2018-01-31 11:03:54 +09:00
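A wrapper script calling "repmgr standby switchover" could distinguish the outcomes described above along these lines. This is a sketch under assumptions: only exit codes 0 (SUCCESS) and 22 (ERR_SWITCHOVER_INCOMPLETE) come from the commit message; the function name and the returned labels are hypothetical, and all other codes are simply treated as failures.

```python
# Exit codes named in the commit message above.
SUCCESS = 0
ERR_SWITCHOVER_INCOMPLETE = 22


def classify_switchover_exit(code: int) -> str:
    """Map a "repmgr standby switchover" exit code to a coarse outcome.

    Hypothetical helper; the labels are illustrative only.
    """
    if code == SUCCESS:
        return "complete"
    if code == ERR_SWITCHOVER_INCOMPLETE:
        # The standby was promoted, but something went wrong with the
        # old primary; preceding log messages should be checked.
        return "completed with issues"
    return "failed"
```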
Ian Barwick
92f4710ee2 Have do_standby_follow_internal() not abort on error
Pass the error code back to the caller instead, mainly so
"repmgr node rejoin" can better report errors.
2018-01-31 11:03:27 +09:00
Ian Barwick
044d8a1098 repmgr: improve switchover handling when "pg_ctl" used
If logging output was not explicitly redirected with "-l" in the pg_ctl
options, repmgr would hang waiting for pg_ctl output.

Note that we recommend using the OS-level service commands where
available.
2018-01-30 16:56:26 +09:00
Ian Barwick
b38f45120c "repmgr standby register": improve error output when standby not running
Add explicit HINT
2018-01-27 07:17:34 +09:00
Ian Barwick
db3a046393 doc: expand upgrade documentation
Include section about using pg_upgrade
2018-01-25 10:48:24 +09:00
Ian Barwick
ec068e38a2 Remove --bdr-only configuration option
This was required for a specific use case during pre-release
development and is no longer needed now the physical streaming
replication handling is implemented.
2018-01-25 10:48:09 +09:00
Ian Barwick
3a382e826e doc: update 4.0.2 release notes
Add details about upgrading.
2018-01-19 09:10:42 +09:00
Ian Barwick
3dcf57a333 doc: add 4.0.2 release notes 2018-01-19 09:10:42 +09:00
Vlad
f658c8d3d8 doc: add missing word in overview
GitHub pull request #362
2018-01-19 09:09:40 +09:00
Ian Barwick
375a96a5c8 repmgrd: log execution error in "repmgrd_get_local_node_id()"
That shouldn't happen, but if it does it will make it easier to
identify the issue.
2018-01-16 11:16:19 +09:00
Ian Barwick
b4d6724405 doc: improve switchover documentation
Emphasize need to set the "service_*_command" options when repmgr is
installed from a package.
2018-01-16 11:16:19 +09:00
Ian Barwick
8fd0c4ad83 repmgr: assume node is actually shutting down if pingable and that's the reported status 2018-01-12 21:53:37 +09:00
Ian Barwick
7ccae6c2b1 repmgr: automatically create slot name if missing
It's possible that a node was registered with "use_replication_slots=false"
but that was later changed to "use_replication_slots=true". If the node
was not subsequently re-registered, the node record will contain an empty
slot name, which will cause any slot creation operation during
"standby follow" or "node rejoin" to fail.

To prevent this happening, check for an empty slot name and automatically
set before proceeding.

Addresses GitHub #343.
2018-01-11 14:47:50 +09:00
Ian Barwick
61d46172b9 repmgr: catch possible corner case when checking node shutdown status
It's conceivable that PQping is returning "no response" but the
shutdown hasn't quite completed.
2018-01-10 15:09:21 +09:00
Ian Barwick
810471b2f2 repmgr: during switchover, correctly detect unclean shutdown status 2018-01-10 12:25:16 +09:00
Ian Barwick
5bd8cf958a repmgr standby switchover: add "%p" event notification parameter
This will contain the node ID of the former primary.
2018-01-10 12:25:12 +09:00
Ian Barwick
5a45997db5 doc: document command line options for "standby switchover" 2018-01-10 12:25:07 +09:00
Ian Barwick
f1f5100007 repmgr standby switchover: add event details 2018-01-10 12:25:00 +09:00
Ian Barwick
1c8ad4d89b Consolidate parsing of output from executing repmgr on a remote server
This should also fix the issue reported in GitHub #349.
2018-01-09 16:24:13 +09:00
Ian Barwick
842a610e84 Fix call to is_active_bdr_node() in BDR repmgrd
Following the fix to "is_active_bdr_node()" in 841f03ae, it turns out
the call in repmgrd-bdr.c was only accidentally working; explicitly
test for a false return value.
2018-01-04 21:03:36 +09:00
Ian Barwick
fcb7e7a29b "repmgr bdr register": create missing connection replication set if needed
Previously the assumption was that the "repmgr" replication set would be
set up when the nodes are created, however no checks were implemented
and this was not well-documented.

Addresses GitHub #347.
2018-01-04 17:46:49 +09:00
Ian Barwick
26e404b1f3 "repmgr bdr register": improve node name check
We'll use "bdr.bdr_get_local_node_name()" to check the local BDR node
name and the repmgr one match.
2018-01-04 17:46:44 +09:00
Ian Barwick
625d032435 doc: link event notification page from relevant command reference pages 2018-01-04 14:56:15 +09:00
Ian Barwick
3d07d65966 doc: update package documentation 2018-01-04 14:56:12 +09:00
Ian Barwick
b705127a34 "repmgr standby register": add --wait-start option
Implements GitHub #356.
2018-01-04 14:56:08 +09:00
Ian Barwick
832b38c5cb doc: fix typos in "repmgr primary unregister" command reference 2018-01-04 14:56:02 +09:00
Ian Barwick
3739a7b84d doc: add link to event notifications page from "repmgr cluster event" 2018-01-04 14:55:56 +09:00
Ian Barwick
841f03aeba Fix query in is_active_bdr_node()
Boolean column was not being checked correctly.

Also add detail output in "repmgr node role --check", where the function
is called.
2018-01-04 14:55:51 +09:00
Ian Barwick
cad12b1fb7 "repmgr cluster event": move query to dbutils.c 2018-01-04 14:55:46 +09:00
Ian Barwick
d31cc80d26 docs: document "repmgr cluster event --terse" 2018-01-04 14:55:40 +09:00
Ian Barwick
625187a61e "repmgr cluster events": optionally omit "Details" column with --terse
Implements GitHub #360.
2018-01-04 14:55:34 +09:00
Ian Barwick
e64d965c6a repmgrd: document standby_[failure|recovery] event notifications
Also clean up the relevant code section.

Addresses GitHub #359.
2018-01-04 09:33:37 +09:00
Ian Barwick
5d8ec136e6 repmgr node rejoin: handle missing node record correctly
If a connection was provided for a database other than the "repmgr"
database, error was logged but execution continued, resulting in
the connection being finished twice.

Addresses GitHub #358.
2018-01-03 15:17:01 +09:00
Ian Barwick
9951a8e106 doc: add appendix with details about packages
work-in-progress
2018-01-02 17:23:24 +09:00
Ian Barwick
26a9e848fd Update copyright notices to 2018 2018-01-02 10:19:46 +09:00
Ian Barwick
ba0b0a497f doc: Fix event notification placeholder typo
Per report from Carlos.
2018-01-01 10:28:19 +09:00
Ian Barwick
09dc43a61c docs: update HISTORY 2017-12-27 10:22:25 +09:00
Ian Barwick
b349f82571 doc: update documentation build instructions
Describe how to build documentation as a single file, and also note
requirement to build against 9.6 or earlier.
2017-12-27 10:05:44 +09:00
Ian Barwick
adbb627850 Merge branch 'doc-nochunks' of https://github.com/fanf2/repmgr
Pull request GitHub #353.
2017-12-27 09:58:09 +09:00
Ian Barwick
c47f976bde repmgr.conf.sample: fix command line argument
"repmgr node check --archive-ready" is correct, however abbreviated
versions will be accepted by getopt_long() if they don't match
or partially match any other options.

Per report by "chaintng" in GitHub #355.
2017-12-27 09:39:14 +09:00
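The abbreviation behaviour described in this commit can be reproduced with a small sketch. It assumes a getopt_long() implementation that accepts unambiguous prefixes of long options (as the GNU one does); the option table and the function name `archive_ready_requested` are illustrative only and are not repmgr's actual code.

```c
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true if "--archive-ready", or any unambiguous abbreviation of it,
 * is present in argv. The option table is a hypothetical two-entry example.
 */
bool
archive_ready_requested(int argc, char **argv)
{
    static const struct option long_options[] = {
        {"archive-ready", no_argument, NULL, 'A'},
        {"replication-lag", no_argument, NULL, 'L'},
        {NULL, 0, NULL, 0}
    };
    bool found = false;
    int  c;

    optind = 0;    /* force a full rescan so the function can be called repeatedly */
    opterr = 0;    /* suppress getopt's own error messages */

    while ((c = getopt_long(argc, argv, "", long_options, NULL)) != -1)
    {
        if (c == 'A')
            found = true;
    }

    return found;
}
```

With this table, "--archive" resolves to "--archive-ready" because no other option shares that prefix, which is why a shortened form in a sample file can go unnoticed until a similarly named option is added.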
Tony Finch
7c8cd7a482 doc: an optional all-in-one-file manual 2017-12-21 18:31:05 +00:00
Ian Barwick
edce8addbd repmgr: add missing -W option to getopt_long() invocation
Addresses GitHub #350.
2017-12-20 10:24:58 +09:00
Martín Marqués
b0f6202448 Merge pull request #352 from dbonne/master
Fix package name
2017-12-19 15:21:51 -03:00
Daymel Bonne Solís
985b13b6d3 Fix package name 2017-12-19 13:09:55 -05:00
Martín Marqués
69e64a9464 Add more information to the setting up sudo without requiretty in
the documentation

Signed-off-by: Martín Marqués <martin.marques@2ndquadrant.com>
2017-12-14 14:39:22 -03:00
Martín Marqués
f58954b3be Switch spaces for tabs in repmgr.conf sample file.
This makes comments stay aligned in most cases the conf file is
modified, and when indentation changes, it's easy to re-align
(by removing or adding a tab)

Signed-off-by: Martín Marqués <martin.marques@2ndquadrant.com>
2017-12-14 07:00:05 -03:00
Ian Barwick
3761d17752 docs: update 4.0.1 release date 2017-12-13 15:16:26 +09:00
Ian Barwick
8c121da8a1 Add diagnostic option "repmgr node check --has-passfile"
This checks if the active libpq version (9.6 and later) has the
"passfile" option, and returns 0 if present, 1 if not.
2017-12-11 20:09:48 +09:00
Abhijit Menon-Sen
6e9e4543e8 Fix typo: upstream_node_id → upstream_node 2017-12-08 09:46:58 +05:30
Ian Barwick
c94f1b7338 Fix unpackaged upgrade SQL for PostgreSQL 9.3 2017-12-04 17:52:36 +09:00
Ian Barwick
f78c169c3d docs: improve event notification documentation 2017-11-29 14:43:28 +09:00
Ian Barwick
f2db9f3ea4 docs: minor fixes to various examples 2017-11-29 11:33:42 +09:00
Ian Barwick
9944324c3a docs: add additional note about setting "wal_log_hints"
Useful to reference this when discussing PostgreSQL configuration in
general.
2017-11-29 11:22:12 +09:00
Ian Barwick
836f32bdbc Update release notes 2017-11-28 13:42:09 +09:00
Ian Barwick
cebbc73c38 Update HISTORY 2017-11-28 13:01:45 +09:00
Ian Barwick
472d703d2e repmgr: initialise "voting_term" in "repmgr primary register"
This previously happened in the extension SQL code, which could
potentially cause replay problems if installing on a BDR cluster.

As this table is only required for streaming replication failover,
move the initialisation to "repmgr primary register".

Addresses GitHub #344.
2017-11-28 11:08:12 +09:00
Ian Barwick
de34e4e89b docs: add 2ndQ yum repository installation instructions
These replace the HTML document at https://repmgr.org/yum-repository.html
2017-11-24 14:13:33 +09:00
Ian Barwick
3a8ee126f3 Delete any replication slots copied by pg_rewind
If --force-rewind is used in conjunction with "repmgr node rejoin",
any replication slots present on the source node will be copied too;
it's essential to remove these to prevent stale slots being extant
when the node starts up.

We do this at file system level *before* the server starts to minimize
the risk of any problems.

Addresses GitHub #334
2017-11-24 11:13:31 +09:00
Ian Barwick
da93dd1f57 docs: fix configuration file example
Per report from Carlos Chapi.
2017-11-24 09:26:09 +09:00
Ian Barwick
295c18f6ff repmgr: fix configuration file sanity check
The check was being carried out regardless of whether --copy-external-config-files
was specified, which means cloning will fail if no SSH connection is available.

Addresses GitHub #342
2017-11-23 22:48:34 +09:00
Ian Barwick
81beec54aa repmgr: fix return code output for repmgr node check --action=...
Addresses GitHub #340
2017-11-23 10:34:21 +09:00
Martín Marqués
2e42226f68 Fix missing FQN for the nodes table.
This bug was not detected before because most users work with the repmgr
user. For that reason, the repmgr schema is already in the search_path
by default.

Add the repmgr schema to the nodes table in the LEFT JOIN used for
cluster show (and in other places)

Signed-off-by: Martín Marqués <martin.marques@2ndquadrant.com>
2017-11-22 17:13:58 -03:00
Ian Barwick
de10d7984a docs: update 4.0.0 release notes 2017-11-21 16:54:13 +09:00
Ian Barwick
404aab4041 docs: miscellaneous updates 2017-11-20 15:47:59 +09:00
Ian Barwick
8c422d6084 Remove unneeded functions 2017-11-20 15:18:21 +09:00
Ian Barwick
8b78b7292d docs: add note about "service_promote_command" in repmgr.conf.sample
It must never contain "repmgr standby promote", as it is intended
to enable use of package-level promote commands such as Debian's
"pg_ctlcluster promote".

Addresses GitHub #336.
2017-11-20 12:29:47 +09:00
Ian Barwick
4cebba32e2 remove spurious "/base" path element in Barman tablespace cloning code.
Addresses GitHub #339
2017-11-20 10:50:26 +09:00
Ian Barwick
c9f12cfbe0 repmgr: don't add empty "passfile" parameter in recovery.conf 2017-11-20 10:27:45 +09:00
Ian Barwick
5b4c92392c docs: expand witness documentation 2017-11-17 11:00:43 +09:00
Ian Barwick
e2b94adec3 docs: miscellaneous cleanup 2017-11-17 09:39:11 +09:00
Ian Barwick
3164bfa043 docs: add initial witness server documentation 2017-11-17 08:51:21 +09:00
Ian Barwick
08b443dce0 repmgrd: re-enable monitoring data recording when in archive recovery.
The warning emitted gives the impression that monitoring data shouldn't
be written if there's no streaming replication, but we can and should
do this as long as we have a primary connection.

Explicitly document this in the code.

Also remove an unused variable warning.
2017-11-16 17:17:17 +09:00
Ian Barwick
9165d27f9f "repmgr node ...": fixes for 9.3
Mainly to account for the lack of replication slots.
2017-11-16 11:25:16 +09:00
Ian Barwick
b8b991398a Escape double-quotes in strings passed to an event notification script
The string in question will be generated internally by repmgr as a simple
one-line string with no control characters etc., so all that needs to be
escaped at the moment are any double quotes.
2017-11-16 10:36:48 +09:00
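A minimal sketch of the escaping step described in this commit, assuming the input is a one-line string in which only double quotes need attention. The helper name `escape_double_quotes` and its bounded-buffer signature are hypothetical and are not taken from the repmgr source.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Copy "src" into "dest", prefixing every double quote with a backslash.
 * Returns false (leaving a truncated, NUL-terminated string in "dest")
 * if "dest" is too small to hold the escaped result.
 */
bool
escape_double_quotes(const char *src, char *dest, size_t dest_len)
{
    size_t out = 0;

    if (dest_len == 0)
        return false;

    for (; *src != '\0'; src++)
    {
        /* an escaped quote needs two bytes, any other character one */
        size_t needed = (*src == '"') ? 2 : 1;

        /* always keep one byte free for the terminating NUL */
        if (out + needed >= dest_len)
        {
            dest[out] = '\0';
            return false;
        }

        if (*src == '"')
            dest[out++] = '\\';

        dest[out++] = *src;
    }

    dest[out] = '\0';
    return true;
}
```

The escaped string can then be embedded between double quotes in the command line passed to the event notification script.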
Ian Barwick
a9a17f206e docs: improve documentation of pg_basebackup_options 2017-11-15 20:50:13 +09:00
Ian Barwick
9d432546bf repmgrd: don't fail over unless more than 50% of active nodes are visible. 2017-11-15 13:48:28 +09:00
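The "more than 50%" rule can be expressed as a strict-majority test. This is a sketch only: the function name `has_failover_quorum` is hypothetical, and how repmgrd actually counts visible and active nodes is not shown in this commit list.

```c
#include <stdbool.h>

/*
 * Returns true only if strictly more than half of the active nodes are
 * visible. Integer arithmetic avoids rounding:
 * visible / active > 1/2 is equivalent to 2 * visible > active.
 */
bool
has_failover_quorum(int visible_nodes, int active_nodes)
{
    if (active_nodes <= 0)
        return false;

    return (visible_nodes * 2) > active_nodes;
}
```

Under this test, exactly half of the nodes (for example 2 of 4) is not enough, which prevents both halves of an evenly split cluster from promoting a new primary.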
Ian Barwick
3c557ebd8e repmgrd: finalize witness failover handling 2017-11-15 13:48:25 +09:00
Ian Barwick
4efeb52cba repmgrd: synchronise repmgr.nodes table on witness server 2017-11-15 13:48:21 +09:00
Ian Barwick
60422c66f9 repmgrd: handle witness server 2017-11-15 13:48:17 +09:00
Ian Barwick
b63872afbb "witness register": set upstream_node_id to that of the primary 2017-11-15 13:48:14 +09:00
Ian Barwick
a31980b590 repmgrd: basic witness node monitoring 2017-11-15 13:48:11 +09:00
Ian Barwick
e07a3c7976 docs: add witness command reference files to file list 2017-11-15 13:48:06 +09:00
Ian Barwick
9d9a1be062 docs: add command reference for "witness (un)register" 2017-11-15 13:48:03 +09:00
Ian Barwick
8208b3f844 witness (un)register: add --dry-run mode 2017-11-15 13:48:00 +09:00
Ian Barwick
ecb8297b1f witness unregister: enable execution when witness server is down
Also add help output for "repmgr witness --help".
2017-11-15 13:47:54 +09:00
Ian Barwick
1553596f84 repmgr: minor fix to "repmgr standby --help" output 2017-11-15 13:47:52 +09:00
Ian Barwick
022d9c58c2 Add "witness unregister" functionality 2017-11-15 13:47:48 +09:00
Ian Barwick
a6cc4d80f0 Add "witness register" functionality 2017-11-15 13:47:45 +09:00
Ian Barwick
7fffe3ed96 witness: initial code framework 2017-11-15 13:47:41 +09:00
Ian Barwick
9b93a595f5 docs: add some more index entries 2017-11-14 20:55:37 +09:00
Ian Barwick
c34e08b802 docs: document "passfile" configuration file parameter 2017-11-14 20:53:26 +09:00
Ian Barwick
eb14bb58c6 Add configuration file "passfile"
This will enable a custom .pgpass to be included in "primary_conninfo"
(provided it's supported by the libpq version on the standby).
2017-11-14 19:30:25 +09:00
Ian Barwick
aa28069d8b docs: update release notes
Add note about changes to password handling.
2017-11-14 18:47:39 +09:00
Ian Barwick
a1e272f64c Update extension SQL 2017-11-13 10:02:46 +09:00
Ian Barwick
9908a9c662 repmgrd: detect role change from primary to standby
If repmgrd is monitoring a primary which is taken off-line, then later
restored as a standby, detect this change and resume monitoring
in standby node.

Addresses GitHub #338.
2017-11-10 17:19:30 +09:00
Ian Barwick
aa089820ab repmgrd: check shared library is loaded
If this isn't the case, "repmgrd" will appear to run but not handle
failover correctly.

Address GitHub #337.
2017-11-10 14:35:17 +09:00
Ian Barwick
0230bafae1 repmgrd: updates related to node_id handling 2017-11-10 12:07:31 +09:00
Ian Barwick
de577adc67 repmgrd: catch corner cases where monitoring data is not available 2017-11-09 22:27:09 +09:00
Ian Barwick
fed17d49e3 repmgrd: ensure shmem is reinitialised after a restart 2017-11-09 19:31:21 +09:00
Ian Barwick
d80763f974 repmgrd: misc fixes 2017-11-09 19:31:16 +09:00
Ian Barwick
331e982bdb repmgrd: fix priority/node_id tie-break check 2017-11-09 19:31:12 +09:00
Ian Barwick
4ca7e6a6bf repmgrd: remove unneeded functions 2017-11-09 19:31:08 +09:00
Ian Barwick
6ac6e0733a repmgrd: simplify the candidate selection logic
All disconnected nodes will be in a static, known state, so as long as
each node has the same meta-information (repmgr.nodes) and is able
to retrieve the last receive LSN of the other nodes, it is possible
for each node to independently determine the best promotion candidate,
thereby reaching consensus without an explicit "voting" process.
2017-11-09 19:31:04 +09:00
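The idea that every node can reach the same decision independently can be sketched as a deterministic comparison over identical inputs. The struct, the function names and the exact ordering (highest last receive LSN, then highest priority, then lowest node ID, with priority 0 excluded) are assumptions made for illustration and are not taken from the repmgrd source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node summary; field names are illustrative. */
typedef struct
{
    int      node_id;
    int      priority;
    uint64_t last_receive_lsn;
} candidate_node;

/* Returns non-zero if "a" should be preferred over "b". */
static int
is_better_candidate(const candidate_node *a, const candidate_node *b)
{
    if (a->last_receive_lsn != b->last_receive_lsn)
        return a->last_receive_lsn > b->last_receive_lsn;

    if (a->priority != b->priority)
        return a->priority > b->priority;

    /* node IDs are unique, so this makes the ordering total */
    return a->node_id < b->node_id;
}

/*
 * Pick the best promotion candidate, or NULL if none is eligible.
 * Because the ordering is total, the result does not depend on the
 * order in which nodes are listed.
 */
const candidate_node *
select_promotion_candidate(const candidate_node *nodes, size_t node_count)
{
    const candidate_node *best = NULL;
    size_t i;

    for (i = 0; i < node_count; i++)
    {
        /* assumption: a priority of 0 or less means "never promote" */
        if (nodes[i].priority <= 0)
            continue;

        if (best == NULL || is_better_candidate(&nodes[i], best))
            best = &nodes[i];
    }

    return best;
}
```

Since each node evaluates the same data with the same total ordering, all nodes arrive at the same candidate without exchanging votes.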
Ian Barwick
79d21b516b repmgrd: fixes to failover handling
get_new_primary() returns NULL if no notification for the new primary has
been received, but the code was expecting it to return UNKNOWN_NODE_ID,
which was causing repmgrd to prematurely drop out of the new primary
detection loop if no notification had been received by the time the loop
started.

Also store the electoral term as a single row, single column table,
to ensure that all repmgrds see the same term. It is then bumped
by the winning node after it gets promoted.

Various logging improvements.
2017-11-08 14:28:08 +09:00
Ian Barwick
7232187f4d Ensure shared memory functions handle NULL parameters correctly 2017-11-08 12:19:07 +09:00
Ian Barwick
fe98270b3f Update .gitignore
Ignore output from "make installcheck"
2017-11-08 12:09:33 +09:00
Ian Barwick
5a3e20fc38 README: update links to https versions 2017-11-08 12:07:35 +09:00
Ian Barwick
4ef2b111da Fix lock acquisition in shared memory functions 2017-11-08 11:55:08 +09:00
Ian Barwick
97471626b4 Update repmgr.conf.sample 2017-11-02 17:43:03 +09:00
Ian Barwick
4bd236b64c docs: fix example in BDR section 2017-11-02 11:23:41 +09:00
Ian Barwick
615dd2ecf4 docs: tweak Markdown URL formatting 2017-11-01 10:58:23 +09:00
Ian Barwick
1c1887f9cc docs: update links to repmgr 4.0 documentation 2017-11-01 10:50:22 +09:00
Ian Barwick
d3f11a640d docs: update copyright info 2017-11-01 09:35:57 +09:00
Ian Barwick
2341da7a06 docs: convert command reference sections to <refentry> format
Note that most entries still need a bit more tidying up, consistent structuring,
provision of more examples etc.
2017-10-31 11:27:13 +09:00
Ian Barwick
2c468d64fb "standby follow": get upstream record before server restart, if required
The standby may not always be available for connections right after it's
restarted, so attempting to connect and get the node's upstream record
after the restart may fail. Record is now retrieved before the restart.

Addresses GitHub #333.
2017-10-27 16:30:14 +09:00
Ian Barwick
9d9b74d740 docs: add sample output to "standby follow" and "standby promote" 2017-10-27 15:03:34 +09:00
Ian Barwick
a90d4419a6 docs: add note about building docs 2017-10-27 10:44:16 +09:00
Ian Barwick
68756c79f3 Fix typo 2017-10-27 09:50:48 +09:00
Ian Barwick
8ad081e7b5 docs: finalize conversion of existing BDR repmgr documentation 2017-10-26 18:52:35 +09:00
Ian Barwick
6b76704817 Initial conversion of existing BDR repmgr documentation 2017-10-26 16:29:40 +09:00
Ian Barwick
c03c509e73 docs: update configuration documentation 2017-10-26 16:11:17 +09:00
Ian Barwick
d9db4f6c45 repmgr node rejoin: add --dry-run option 2017-10-25 11:01:58 +09:00
Ian Barwick
c89d59fe96 Improve trim() function
Did not cope well with trailing spaces or entirely blank strings.
2017-10-24 15:34:43 +09:00
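A sketch of a trim routine that copes with the two cases named in this commit, trailing spaces and entirely blank strings. The name `trim_in_place` and the in-place design are illustrative choices, not the signature of repmgr's own trim() function.

```c
#include <ctype.h>
#include <string.h>

/*
 * Remove leading and trailing whitespace from "s" in place.
 * Returns "s" for convenience, or NULL if "s" is NULL.
 */
char *
trim_in_place(char *s)
{
    char  *start = s;
    size_t len;

    if (s == NULL)
        return NULL;

    /* skip leading whitespace; stops at the NUL for a blank string */
    while (*start != '\0' && isspace((unsigned char) *start))
        start++;

    /* drop trailing whitespace; the "len > 0" guard avoids underflow */
    len = strlen(start);
    while (len > 0 && isspace((unsigned char) start[len - 1]))
        len--;

    /* the regions may overlap, so memmove() rather than memcpy() */
    memmove(s, start, len);
    s[len] = '\0';

    return s;
}
```

A string consisting only of spaces ends up as the empty string, because the leading-whitespace loop consumes everything and the trailing loop never runs.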
Ian Barwick
02b6d3748b Docs: update "repmgr cluster show" 2017-10-24 13:48:38 +09:00
Ian Barwick
7c3abe28b9 Standardize terminology on "primary" (in place of "master") 2017-10-24 13:42:50 +09:00
Ian Barwick
a39b8ccc2d --dry-run available for "node rejoin" 2017-10-23 10:40:21 +09:00
Ian Barwick
5638d4ab89 docs: fix formatting 2017-10-23 09:59:29 +09:00
Ian Barwick
37bdad290c Add --help output for "repmgr node service"
Addresses GitHub #329.
2017-10-20 16:44:44 +09:00
Ian Barwick
8911434da5 Add --help output for "repmgr node rejoin"
Addresses GitHub #329.
2017-10-20 16:31:17 +09:00
Ian Barwick
8a2bbcebfd docs: fix typo 2017-10-20 16:05:05 +09:00
Ian Barwick
61f01f8305 node rewind: add check for pg_rewind and --dry-run mode
Addresses GitHub #330
2017-10-20 14:15:23 +09:00
Ian Barwick
a35d77b7f0 Note Barman configuration file parameter changes 2017-10-20 11:30:36 +09:00
Ian Barwick
40ea1abbb4 Fix error message typo 2017-10-20 11:18:53 +09:00
Ian Barwick
785bfe9837 Prevent relative configuration file path being stored in the repmgr metadata
The configuration file path is stored to make remote execution of repmgr
(e.g. during "repmgr standby switchover") simpler, so relative paths
make no sense.

Addresses GitHub #332
2017-10-20 10:57:43 +09:00
Ian Barwick
31cd54bcff Update README
Main body of documentation moved to DocBook format and hosted at:

    https://repmgr.org/docs/index.html

as the existing README and sundry additional files were becoming
unmanageable. Conversion to DocBook format enables all documentation
to be managed in a single structured system, with cross-references,
indexes, linkable URLs etc.
2017-10-19 16:32:00 +09:00
Ian Barwick
35c8bb4e75 docs: update "repmgr cluster show" page 2017-10-19 16:21:59 +09:00
Ian Barwick
6b9ac22029 docs: expand release notes and redirect "changes-in-repmgr4.md" 2017-10-19 14:09:14 +09:00
Ian Barwick
7bf3c78f57 Add 4.0 release notes 2017-10-19 13:58:41 +09:00
Ian Barwick
34ee16899e doc: add missing entry for "priority" in repmgr.conf.sample
Per report from Shaun Thomas.
2017-10-19 13:14:52 +09:00
Ian Barwick
0938685ae7 docs: add more index references 2017-10-19 12:21:50 +09:00
Ian Barwick
b400436fba docs: note way of forcing recovery then quitting in single user mode 2017-10-18 22:31:06 +09:00
Ian Barwick
2745c92fc8 Documentation: update markup 2017-10-18 11:12:20 +09:00
Ian Barwick
34c0131b2d Update package signature documentation 2017-10-18 10:50:49 +09:00
Ian Barwick
c9abfdcc04 Document "upgrading-from-repmgr3.md" moved to main repmgr documentation 2017-10-18 09:37:16 +09:00
Ian Barwick
a878d7aaea Update "repmgr node rejoin" documentation 2017-10-17 17:40:50 +09:00
Ian Barwick
93aa7cea1a Add placeholder FAQ.md
This replaces the original FAQ maintained for repmgr 3.x; repmgr 4
documentation is now available in DocBook format.
2017-10-17 16:31:55 +09:00
Ian Barwick
f00e6296e9 Move deprecated command line option
Not required in repmgr4, we're keeping it around for backwards compatibility;
a warning will be issued if used.
2017-10-17 16:07:44 +09:00
Ian Barwick
91354a71cc Add FAQ to documentation 2017-10-17 15:46:36 +09:00
Ian Barwick
c78cb6e1d6 Bump dev version number 2017-10-17 13:09:37 +09:00
Ian Barwick
71430a9f65 Various documentation fixes 2017-10-17 11:00:37 +09:00
Ian Barwick
3e93f847fd Update doc version 2017-10-16 11:25:56 +09:00
73 changed files with 4243 additions and 1295 deletions

72
HISTORY
View File

@@ -1,21 +1,65 @@
4.1.1 2018-09-05
logging: explicitly log the text of failed queries as ERRORs to
assist logfile analysis; GitHub #498
repmgr: truncate version string, if necessary; GitHub #490 (Ian)
repmgr: improve messages emitted during "standby promote" (Ian)
repmgr: "standby clone" - don't copy external config files in --dry-run
mode; GitHub #491 (Ian)
repmgr: add "cluster_cleanup" event; GitHub #492 (Ian)
repmgr: (standby switchover) improve detection of free walsenders;
GitHub #495 (Ian)
repmgr: (node rejoin) improve replication slot handling; GitHub #499 (Ian)
repmgrd: ensure that sending SIGHUP always results in the log file
being reopened; GitHub #485 (Ian)
repmgrd: report version number *after* logger initialisation; GitHub #487 (Ian)
repmgrd: fix startup on witness node when local data is stale; GitHub #488/#489 (Ian)
repmgrd: improve cascaded standby failover handling; GitHub #480 (Ian)
repmgrd: improve reconnection handling (Ian)
4.1.0 2018-07-31
repmgr: change default log_level to INFO, add documentation; GitHub #470 (Ian)
repmgr: add "--missing-slots" check to "repmgr node check" (Ian)
repmgr: improve command line error handling; GitHub #464 (Ian)
repmgr: fix "standby register --wait-sync" when no timeout provided (Ian)
repmgr: "cluster show" returns non-zero value if an issue encountered;
GitHub #456 (Ian)
repmgr: "node check" and "node status" returns non-zero value if an issue
encountered (Ian)
repmgr: add CSV output mode to "cluster event"; GitHub #471 (Ian)
repmgr: add -q/--quiet option to suppress non-error output; GitHub #468 (Ian)
repmgr: "node status" returns non-zero value if an issue encountered (Ian)
repmgr: enable "recovery_min_apply_delay" to be 0; GitHub #448 (Ian)
repmgr: "cluster cleanup" - add missing help options; GitHub #461/#462 (gclough)
repmgr: ensure witness node follows new primary after switchover;
GitHub #453 (Ian)
repmgr: fix witness node handling in "node check"/"node status";
GitHub #451 (Ian)
repmgr: fix "primary_slot_name" when using "standby clone" with --recovery-conf-only;
GitHub #474 (Ian)
repmgr: don't perform a switchover if an exclusive backup is running;
GitHub #476 (Martín)
repmgr: enable "witness unregister" to be run on any node; GitHub #472 (Ian)
repmgrd: create a PID file by default; GitHub #457 (Ian)
repmgrd: daemonize process by default; GitHub #458 (Ian)
4.0.6 2018-06-14 4.0.6 2018-06-14
repmgr: (witness register) prevent registration of a witness server with the repmgr: (witness register) prevent registration of a witness server with the
same name as an existing node (Ian) same name as an existing node (Ian)
repmgr: (standby follow) check node has actually connected to new primary repmgr: (standby follow) check node has actually connected to new primary
before reporting success; GitHub #444 (Ian) before reporting success; GitHub #444 (Ian)
repmgr: (standby clone) improve handling of external configuration file copying, repmgr: (standby clone) improve handling of external configuration file copying,
including consideration in --dry-run check; GitHub #443 (Ian) including consideration in --dry-run check; GitHub #443 (Ian)
repmgr: (standby clone) don't require presence of "user" parameter in repmgr: (standby clone) don't require presence of "user" parameter in
conninfo string; GitHub #437 (Ian) conninfo string; GitHub #437 (Ian)
repmgr: (standby clone) improve documentation of --recovery-conf-only repmgr: (standby clone) improve documentation of --recovery-conf-only
mode; GitHub #438 (Ian) mode; GitHub #438 (Ian)
repmgr: (node rejoin) fix bug when parsing --config-files parameter; repmgr: (node rejoin) fix bug when parsing --config-files parameter;
GitHub #442 (Ian) GitHub #442 (Ian)
repmgr: when using --dry-run, force log level to INFO to ensure output repmgr: when using --dry-run, force log level to INFO to ensure output
will always be displayed; GitHub #441 (Ian) will always be displayed; GitHub #441 (Ian)
repmgr: (cluster matrix/crosscheck) return non-zero exit code if node repmgr: (cluster matrix/crosscheck) return non-zero exit code if node
connection issues detected; GitHub #447 (Ian) connection issues detected; GitHub #447 (Ian)
repmgrd: ensure local node is counted as quorum member; GitHub #439 (Ian) repmgrd: ensure local node is counted as quorum member; GitHub #439 (Ian)
4.0.5 2018-05-02 4.0.5 2018-05-02
repmgr: poll demoted primary after restart as a standby during a repmgr: poll demoted primary after restart as a standby during a

View File

@@ -11,7 +11,10 @@ EXTENSION = repmgr
DATA = \ DATA = \
repmgr--unpackaged--4.0.sql \ repmgr--unpackaged--4.0.sql \
repmgr--4.0.sql repmgr--4.0.sql \
repmgr--4.0--4.1.sql \
repmgr--4.1.sql
REGRESS = repmgr_extension REGRESS = repmgr_extension

View File

@@ -28,10 +28,8 @@ char config_file_path[MAXPGPATH] = "";
static bool config_file_provided = false; static bool config_file_provided = false;
bool config_file_found = false; bool config_file_found = false;
static void parse_config(t_configuration_options *options, bool terse);
static void _parse_config(t_configuration_options *options, ItemList *error_list, ItemList *warning_list); static void _parse_config(t_configuration_options *options, ItemList *error_list, ItemList *warning_list);
static bool parse_bool(const char *s,
const char *config_item,
ItemList *error_list);
static void _parse_line(char *buf, char *name, char *value); static void _parse_line(char *buf, char *name, char *value);
static void parse_event_notifications_list(t_configuration_options *options, const char *arg); static void parse_event_notifications_list(t_configuration_options *options, const char *arg);
@@ -241,7 +239,7 @@ end_search:
} }
void static void
parse_config(t_configuration_options *options, bool terse) parse_config(t_configuration_options *options, bool terse)
{ {
/* Collate configuration file errors here for friendlier reporting */ /* Collate configuration file errors here for friendlier reporting */
@@ -333,6 +331,12 @@ _parse_config(t_configuration_options *options, ItemList *error_list, ItemList *
options->primary_follow_timeout = DEFAULT_PRIMARY_FOLLOW_TIMEOUT; options->primary_follow_timeout = DEFAULT_PRIMARY_FOLLOW_TIMEOUT;
options->standby_follow_timeout = DEFAULT_STANDBY_FOLLOW_TIMEOUT; options->standby_follow_timeout = DEFAULT_STANDBY_FOLLOW_TIMEOUT;
/*------------------------
* standby switchover settings
*------------------------
*/
options->standby_reconnect_timeout = DEFAULT_STANDBY_RECONNECT_TIMEOUT;
/*----------------- /*-----------------
* repmgrd settings * repmgrd settings
*----------------- *-----------------
@@ -352,7 +356,8 @@ _parse_config(t_configuration_options *options, ItemList *error_list, ItemList *
options->degraded_monitoring_timeout = -1; options->degraded_monitoring_timeout = -1;
options->async_query_timeout = DEFAULT_ASYNC_QUERY_TIMEOUT; options->async_query_timeout = DEFAULT_ASYNC_QUERY_TIMEOUT;
options->primary_notification_timeout = DEFAULT_PRIMARY_NOTIFICATION_TIMEOUT; options->primary_notification_timeout = DEFAULT_PRIMARY_NOTIFICATION_TIMEOUT;
options->standby_reconnect_timeout = DEFAULT_STANDBY_RECONNECT_TIMEOUT; options->repmgrd_standby_startup_timeout = -1; /* defaults to "standby_reconnect_timeout" if not set */
memset(options->repmgrd_pid_file, 0, sizeof(options->repmgrd_pid_file));
/*------------- /*-------------
* witness settings * witness settings
@@ -539,6 +544,14 @@ _parse_config(t_configuration_options *options, ItemList *error_list, ItemList *
else if (strcmp(name, "standby_follow_timeout") == 0) else if (strcmp(name, "standby_follow_timeout") == 0)
options->standby_follow_timeout = repmgr_atoi(value, name, error_list, 0); options->standby_follow_timeout = repmgr_atoi(value, name, error_list, 0);
/* standby switchover settings */
else if (strcmp(name, "standby_reconnect_timeout") == 0)
options->standby_reconnect_timeout = repmgr_atoi(value, name, error_list, 0);
/* node rejoin settings */
else if (strcmp(name, "node_rejoin_timeout") == 0)
options->node_rejoin_timeout = repmgr_atoi(value, name, error_list, 0);
/* node check settings */ /* node check settings */
else if (strcmp(name, "archive_ready_warning") == 0) else if (strcmp(name, "archive_ready_warning") == 0)
options->archive_ready_warning = repmgr_atoi(value, name, error_list, 1); options->archive_ready_warning = repmgr_atoi(value, name, error_list, 1);
@@ -588,8 +601,10 @@ _parse_config(t_configuration_options *options, ItemList *error_list, ItemList *
options->async_query_timeout = repmgr_atoi(value, name, error_list, 0); options->async_query_timeout = repmgr_atoi(value, name, error_list, 0);
else if (strcmp(name, "primary_notification_timeout") == 0) else if (strcmp(name, "primary_notification_timeout") == 0)
options->primary_notification_timeout = repmgr_atoi(value, name, error_list, 0); options->primary_notification_timeout = repmgr_atoi(value, name, error_list, 0);
else if (strcmp(name, "standby_reconnect_timeout") == 0) else if (strcmp(name, "repmgrd_standby_startup_timeout") == 0)
options->standby_reconnect_timeout = repmgr_atoi(value, name, error_list, 0); options->repmgrd_standby_startup_timeout = repmgr_atoi(value, name, error_list, 0);
else if (strcmp(name, "repmgrd_pid_file") == 0)
strncpy(options->repmgrd_pid_file, value, MAXPGPATH);
/* witness settings */ /* witness settings */
else if (strcmp(name, "witness_sync_interval") == 0) else if (strcmp(name, "witness_sync_interval") == 0)
@@ -771,6 +786,17 @@ _parse_config(t_configuration_options *options, ItemList *error_list, ItemList *
PQconninfoFree(conninfo_options); PQconninfoFree(conninfo_options);
} }
/* set values for parameters which default to other parameters */
/*
* From 4.1, "repmgrd_standby_startup_timeout" replaces "standby_reconnect_timeout"
* in repmgrd; fall back to "standby_reconnect_timeout" if no value explicitly provided
*/
if (options->repmgrd_standby_startup_timeout == -1)
{
options->repmgrd_standby_startup_timeout = options->standby_reconnect_timeout;
}
/* add warning about changed "barman_" parameter meanings */ /* add warning about changed "barman_" parameter meanings */
if ((options->barman_host[0] == '\0' && options->barman_server[0] != '\0') || if ((options->barman_host[0] == '\0' && options->barman_server[0] != '\0') ||
(options->barman_host[0] != '\0' && options->barman_server[0] == '\0')) (options->barman_host[0] != '\0' && options->barman_server[0] == '\0'))
@@ -795,6 +821,12 @@ _parse_config(t_configuration_options *options, ItemList *error_list, ItemList *
item_list_append(error_list, item_list_append(error_list,
_("\"replication_lag_critical\" must be greater than \"replication_lag_warning\"")); _("\"replication_lag_critical\" must be greater than \"replication_lag_warning\""));
} }
if (options->standby_reconnect_timeout < options->node_rejoin_timeout)
{
item_list_append(error_list,
_("\"standby_reconnect_timeout\" must be equal to or greater than \"node_rejoin_timeout\""));
}
} }
@@ -959,12 +991,11 @@ parse_time_unit_parameter(const char *name, const char *value, char *dest, ItemL
char *ptr = NULL; char *ptr = NULL;
int targ = strtol(value, &ptr, 10); int targ = strtol(value, &ptr, 10);
if (targ < 1) if (targ < 0)
{ {
if (errors != NULL) if (errors != NULL)
{ {
item_list_append_format( item_list_append_format(errors,
errors,
_("invalid value provided for \"%s\""), _("invalid value provided for \"%s\""),
name); name);
} }
@@ -1018,13 +1049,16 @@ parse_time_unit_parameter(const char *name, const char *value, char *dest, ItemL
* - promote_delay * - promote_delay
* - reconnect_attempts * - reconnect_attempts
* - reconnect_interval * - reconnect_interval
* - repmgrd_standby_startup_timeout
* - retry_promote_interval_secs * - retry_promote_interval_secs
* *
* non-changeable options * non-changeable options (repmgrd references these from the "repmgr.nodes"
* table, not the configuration file)
* *
* - node_id * - node_id
* - node_name * - node_name
* - data_directory * - data_directory
* - location
* - priority * - priority
* - replication_type * - replication_type
* *
@@ -1033,7 +1067,7 @@ parse_time_unit_parameter(const char *name, const char *value, char *dest, ItemL
*/ */
bool bool
reload_config(t_configuration_options *orig_options) reload_config(t_configuration_options *orig_options, t_server_type server_type)
{ {
PGconn *conn; PGconn *conn;
t_configuration_options new_options = T_CONFIGURATION_OPTIONS_INITIALIZER; t_configuration_options new_options = T_CONFIGURATION_OPTIONS_INITIALIZER;
@@ -1043,17 +1077,50 @@ reload_config(t_configuration_options *orig_options)
static ItemList config_errors = {NULL, NULL}; static ItemList config_errors = {NULL, NULL};
static ItemList config_warnings = {NULL, NULL}; static ItemList config_warnings = {NULL, NULL};
PQExpBufferData errors;
log_info(_("reloading configuration file")); log_info(_("reloading configuration file"));
_parse_config(&new_options, &config_errors, &config_warnings); _parse_config(&new_options, &config_errors, &config_warnings);
if (server_type == PRIMARY || server_type == STANDBY)
{
if (new_options.promote_command[0] == '\0')
{
item_list_append(&config_errors, _("\"promote_command\": required parameter was not found"));
}
if (new_options.follow_command[0] == '\0')
{
item_list_append(&config_errors, _("\"follow_command\": required parameter was not found"));
}
}
if (config_errors.head != NULL) if (config_errors.head != NULL)
{ {
/* XXX dump errors to log */ ItemListCell *cell = NULL;
log_warning(_("unable to parse new configuration, retaining current configuration")); log_warning(_("unable to parse new configuration, retaining current configuration"));
initPQExpBuffer(&errors);
appendPQExpBuffer(&errors,
"following errors were detected:\n");
for (cell = config_errors.head; cell; cell = cell->next)
{
appendPQExpBuffer(&errors,
" %s\n", cell->string);
}
log_detail("%s", errors.data);
termPQExpBuffer(&errors);
return false; return false;
} }
/* The following options cannot be changed */ /* The following options cannot be changed */
if (new_options.node_id != orig_options->node_id) if (new_options.node_id != orig_options->node_id)
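The error-reporting loop added above replaces the old `XXX dump errors to log` placeholder by walking the error list and gathering every message into one buffer before logging it. The same idea can be sketched without libpq's `PQExpBuffer`, using a fixed-size buffer and `snprintf`. The `error_cell` type and `format_error_list` function below are hypothetical simplifications, and the two-space indent is an assumption, since whitespace is collapsed in this view:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical minimal list cell, modelled on the "string"/"next" members used in the diff. */
typedef struct error_cell
{
	struct error_cell *next;
	const char *string;
} error_cell;

/*
 * Write a heading plus one indented line per error into "dest".
 * Returns the number of errors written in full; if "dest" is too small,
 * output stops early but the buffer is always NUL-terminated.
 */
int
format_error_list(const error_cell *head, char *dest, size_t dest_size)
{
	const error_cell *cell;
	size_t used;
	int written;
	int count = 0;

	assert(dest != NULL && dest_size > 0);

	written = snprintf(dest, dest_size, "%s", "following errors were detected:\n");
	if (written < 0 || (size_t) written >= dest_size)
		return 0;
	used = (size_t) written;

	for (cell = head; cell != NULL; cell = cell->next)
	{
		written = snprintf(dest + used, dest_size - used, "  %s\n", cell->string);
		if (written < 0 || (size_t) written >= dest_size - used)
			break;   /* out of space: stop, leaving a truncated but terminated buffer */
		used += (size_t) written;
		count++;
	}

	return count;
}
```

A growable buffer such as `PQExpBuffer` avoids the truncation case entirely, which is presumably why the actual code uses it.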
@@ -1207,7 +1274,7 @@ reload_config(t_configuration_options *orig_options)
config_changed = true; config_changed = true;
} }
/* promote_delay */ /* promote_delay (for testing use only; not documented) */
if (orig_options->promote_delay != new_options.promote_delay) if (orig_options->promote_delay != new_options.promote_delay)
{ {
orig_options->promote_delay = new_options.promote_delay; orig_options->promote_delay = new_options.promote_delay;
@@ -1234,6 +1301,15 @@ reload_config(t_configuration_options *orig_options)
config_changed = true; config_changed = true;
} }
/* repmgrd_standby_startup_timeout */
if (orig_options->repmgrd_standby_startup_timeout != new_options.repmgrd_standby_startup_timeout)
{
orig_options->repmgrd_standby_startup_timeout = new_options.repmgrd_standby_startup_timeout;
log_info(_("\"repmgrd_standby_startup_timeout\" is now \"%i\""), new_options.repmgrd_standby_startup_timeout);
config_changed = true;
}
/* /*
* Handle changes to logging configuration * Handle changes to logging configuration
*/ */
@@ -1326,13 +1402,23 @@ exit_with_config_file_errors(ItemList *config_errors, ItemList *config_warnings,
void void
exit_with_cli_errors(ItemList *error_list) exit_with_cli_errors(ItemList *error_list, const char *repmgr_command)
{ {
fprintf(stderr, _("The following command line errors were encountered:\n")); fprintf(stderr, _("The following command line errors were encountered:\n"));
print_item_list(error_list); print_item_list(error_list);
fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname()); if (repmgr_command != NULL)
{
fprintf(stderr, _("Try \"%s --help\" or \"%s %s --help\" for more information.\n"),
progname(),
progname(),
repmgr_command);
}
else
{
fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname());
}
exit(ERR_BAD_CONFIG); exit(ERR_BAD_CONFIG);
} }
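The revised `exit_with_cli_errors()` picks one of two hint messages, depending on whether a repmgr command was supplied. A minimal sketch of that selection, writing into a caller-supplied buffer instead of `stderr` so the result can be inspected; `format_help_hint` is a hypothetical helper, not a repmgr function:

```c
#include <stdio.h>

/*
 * Build the "--help" hint. "command" may be NULL when no repmgr
 * command was recognised, in which case only the generic hint is produced.
 * Returns the value of snprintf(), i.e. the untruncated message length.
 */
int
format_help_hint(char *dest, size_t dest_size, const char *program, const char *command)
{
	if (command != NULL)
	{
		return snprintf(dest, dest_size,
						"Try \"%s --help\" or \"%s %s --help\" for more information.\n",
						program, program, command);
	}

	return snprintf(dest, dest_size,
					"Try \"%s --help\" for more information.\n",
					program);
}
```

Passing `NULL` for the command keeps existing call sites, such as repmgrd, on the original single-hint message.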
@@ -1437,7 +1523,7 @@ repmgr_atoi(const char *value, const char *config_item, ItemList *error_list, in
* *
* https://www.postgresql.org/docs/current/static/config-setting.html * https://www.postgresql.org/docs/current/static/config-setting.html
*/ */
static bool bool
parse_bool(const char *s, const char *config_item, ItemList *error_list) parse_bool(const char *s, const char *config_item, ItemList *error_list)
{ {
PQExpBufferData errors; PQExpBufferData errors;
@@ -1723,6 +1809,9 @@ free_parsed_argv(char ***argv_array)
} }
bool bool
parse_pg_basebackup_options(const char *pg_basebackup_options, t_basebackup_options *backup_options, int server_version_num, ItemList *error_list) parse_pg_basebackup_options(const char *pg_basebackup_options, t_basebackup_options *backup_options, int server_version_num, ItemList *error_list)
{ {


@@ -102,6 +102,12 @@ typedef struct
int primary_follow_timeout; int primary_follow_timeout;
int standby_follow_timeout; int standby_follow_timeout;
/* standby switchover settings */
int standby_reconnect_timeout;
/* node rejoin settings */
int node_rejoin_timeout;
/* node check settings */ /* node check settings */
int archive_ready_warning; int archive_ready_warning;
int archive_ready_critical; int archive_ready_critical;
@@ -124,7 +130,8 @@ typedef struct
int degraded_monitoring_timeout; int degraded_monitoring_timeout;
int async_query_timeout; int async_query_timeout;
int primary_notification_timeout; int primary_notification_timeout;
int standby_reconnect_timeout; int repmgrd_standby_startup_timeout;
char repmgrd_pid_file[MAXPGPATH];
/* BDR settings */ /* BDR settings */
bool bdr_local_monitoring_only; bool bdr_local_monitoring_only;
@@ -173,6 +180,10 @@ typedef struct
/* standby follow settings */ \ /* standby follow settings */ \
DEFAULT_PRIMARY_FOLLOW_TIMEOUT, \ DEFAULT_PRIMARY_FOLLOW_TIMEOUT, \
DEFAULT_STANDBY_FOLLOW_TIMEOUT, \ DEFAULT_STANDBY_FOLLOW_TIMEOUT, \
/* standby switchover settings */ \
DEFAULT_STANDBY_RECONNECT_TIMEOUT, \
/* node rejoin settings */ \
DEFAULT_NODE_REJOIN_TIMEOUT, \
/* node check settings */ \ /* node check settings */ \
DEFAULT_ARCHIVE_READY_WARNING, DEFAULT_ARCHIVE_READY_CRITICAL, \ DEFAULT_ARCHIVE_READY_WARNING, DEFAULT_ARCHIVE_READY_CRITICAL, \
DEFAULT_REPLICATION_LAG_WARNING, DEFAULT_REPLICATION_LAG_CRITICAL, \ DEFAULT_REPLICATION_LAG_WARNING, DEFAULT_REPLICATION_LAG_CRITICAL, \
@@ -186,7 +197,7 @@ typedef struct
false, -1, \ false, -1, \
DEFAULT_ASYNC_QUERY_TIMEOUT, \ DEFAULT_ASYNC_QUERY_TIMEOUT, \
DEFAULT_PRIMARY_NOTIFICATION_TIMEOUT, \ DEFAULT_PRIMARY_NOTIFICATION_TIMEOUT, \
DEFAULT_STANDBY_RECONNECT_TIMEOUT, \ -1, "", \
/* BDR settings */ \ /* BDR settings */ \
false, DEFAULT_BDR_RECOVERY_TIMEOUT, \ false, DEFAULT_BDR_RECOVERY_TIMEOUT, \
/* service settings */ \ /* service settings */ \
@@ -262,16 +273,20 @@ typedef struct
"", "", "", "" \ "", "", "", "" \
} }
#include "dbutils.h"
void set_progname(const char *argv0); void set_progname(const char *argv0);
const char *progname(void); const char *progname(void);
void load_config(const char *config_file, bool verbose, bool terse, t_configuration_options *options, char *argv0); void load_config(const char *config_file, bool verbose, bool terse, t_configuration_options *options, char *argv0);
void parse_config(t_configuration_options *options, bool terse); bool reload_config(t_configuration_options *orig_options, t_server_type server_type);
bool reload_config(t_configuration_options *orig_options);
bool parse_recovery_conf(const char *data_dir, t_recovery_conf *conf); bool parse_recovery_conf(const char *data_dir, t_recovery_conf *conf);
bool parse_bool(const char *s,
const char *config_item,
ItemList *error_list);
int repmgr_atoi(const char *s, int repmgr_atoi(const char *s,
const char *config_item, const char *config_item,
ItemList *error_list, ItemList *error_list,
@@ -287,7 +302,7 @@ void free_parsed_argv(char ***argv_array);
/* called by repmgr-client and repmgrd */ /* called by repmgr-client and repmgrd */
void exit_with_cli_errors(ItemList *error_list); void exit_with_cli_errors(ItemList *error_list, const char *repmgr_command);
void print_item_list(ItemList *item_list); void print_item_list(ItemList *item_list);
#endif /* _REPMGR_CONFIGFILE_H_ */ #endif /* _REPMGR_CONFIGFILE_H_ */

18
configure vendored

@@ -1,6 +1,6 @@
#! /bin/sh #! /bin/sh
# Guess values for system-dependent variables and create Makefiles. # Guess values for system-dependent variables and create Makefiles.
# Generated by GNU Autoconf 2.69 for repmgr 4.0.5. # Generated by GNU Autoconf 2.69 for repmgr 4.1.2.
# #
# Report bugs to <pgsql-bugs@postgresql.org>. # Report bugs to <pgsql-bugs@postgresql.org>.
# #
@@ -582,8 +582,8 @@ MAKEFLAGS=
# Identity of this package. # Identity of this package.
PACKAGE_NAME='repmgr' PACKAGE_NAME='repmgr'
PACKAGE_TARNAME='repmgr' PACKAGE_TARNAME='repmgr'
PACKAGE_VERSION='4.0.5' PACKAGE_VERSION='4.1.2'
PACKAGE_STRING='repmgr 4.0.5' PACKAGE_STRING='repmgr 4.1.2'
PACKAGE_BUGREPORT='pgsql-bugs@postgresql.org' PACKAGE_BUGREPORT='pgsql-bugs@postgresql.org'
PACKAGE_URL='https://2ndquadrant.com/en/resources/repmgr/' PACKAGE_URL='https://2ndquadrant.com/en/resources/repmgr/'
@@ -1178,7 +1178,7 @@ if test "$ac_init_help" = "long"; then
# Omit some internal or obsolete options to make the list less imposing. # Omit some internal or obsolete options to make the list less imposing.
# This message is too long to be a string in the A/UX 3.1 sh. # This message is too long to be a string in the A/UX 3.1 sh.
cat <<_ACEOF cat <<_ACEOF
\`configure' configures repmgr 4.0.5 to adapt to many kinds of systems. \`configure' configures repmgr 4.1.2 to adapt to many kinds of systems.
Usage: $0 [OPTION]... [VAR=VALUE]... Usage: $0 [OPTION]... [VAR=VALUE]...
@@ -1239,7 +1239,7 @@ fi
if test -n "$ac_init_help"; then if test -n "$ac_init_help"; then
case $ac_init_help in case $ac_init_help in
short | recursive ) echo "Configuration of repmgr 4.0.5:";; short | recursive ) echo "Configuration of repmgr 4.1.2:";;
esac esac
cat <<\_ACEOF cat <<\_ACEOF
@@ -1313,7 +1313,7 @@ fi
test -n "$ac_init_help" && exit $ac_status test -n "$ac_init_help" && exit $ac_status
if $ac_init_version; then if $ac_init_version; then
cat <<\_ACEOF cat <<\_ACEOF
repmgr configure 4.0.5 repmgr configure 4.1.2
generated by GNU Autoconf 2.69 generated by GNU Autoconf 2.69
Copyright (C) 2012 Free Software Foundation, Inc. Copyright (C) 2012 Free Software Foundation, Inc.
@@ -1332,7 +1332,7 @@ cat >config.log <<_ACEOF
This file contains any messages produced by compilers while This file contains any messages produced by compilers while
running configure, to aid debugging if configure makes a mistake. running configure, to aid debugging if configure makes a mistake.
It was created by repmgr $as_me 4.0.5, which was It was created by repmgr $as_me 4.1.2, which was
generated by GNU Autoconf 2.69. Invocation command line was generated by GNU Autoconf 2.69. Invocation command line was
$ $0 $@ $ $0 $@
@@ -2359,7 +2359,7 @@ cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1
# report actual input values of CONFIG_FILES etc. instead of their # report actual input values of CONFIG_FILES etc. instead of their
# values after options handling. # values after options handling.
ac_log=" ac_log="
This file was extended by repmgr $as_me 4.0.5, which was This file was extended by repmgr $as_me 4.1.2, which was
generated by GNU Autoconf 2.69. Invocation command line was generated by GNU Autoconf 2.69. Invocation command line was
CONFIG_FILES = $CONFIG_FILES CONFIG_FILES = $CONFIG_FILES
@@ -2422,7 +2422,7 @@ _ACEOF
cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1
ac_cs_config="`$as_echo "$ac_configure_args" | sed 's/^ //; s/[\\""\`\$]/\\\\&/g'`" ac_cs_config="`$as_echo "$ac_configure_args" | sed 's/^ //; s/[\\""\`\$]/\\\\&/g'`"
ac_cs_version="\\ ac_cs_version="\\
repmgr config.status 4.0.5 repmgr config.status 4.1.2
configured by $0, generated by GNU Autoconf 2.69, configured by $0, generated by GNU Autoconf 2.69,
with options \\"\$ac_cs_config\\" with options \\"\$ac_cs_config\\"


@@ -1,4 +1,4 @@
AC_INIT([repmgr], [4.0.6], [pgsql-bugs@postgresql.org], [repmgr], [https://2ndquadrant.com/en/resources/repmgr/]) AC_INIT([repmgr], [4.1.2], [pgsql-bugs@postgresql.org], [repmgr], [https://2ndquadrant.com/en/resources/repmgr/])
AC_COPYRIGHT([Copyright (c) 2010-2018, 2ndQuadrant Ltd.]) AC_COPYRIGHT([Copyright (c) 2010-2018, 2ndQuadrant Ltd.])


@@ -227,7 +227,15 @@ get_controlfile(const char *DataDir)
control_file_info->control_file_processed = true; control_file_info->control_file_processed = true;
if (version_num >= 90500) if (version_num >= 110000)
{
ControlFileData11 *ptr = (struct ControlFileData11 *)ControlFileDataPtr;
control_file_info->system_identifier = ptr->system_identifier;
control_file_info->state = ptr->state;
control_file_info->checkPoint = ptr->checkPoint;
control_file_info->data_checksum_version = ptr->data_checksum_version;
}
else if (version_num >= 90500)
{ {
ControlFileData95 *ptr = (struct ControlFileData95 *)ControlFileDataPtr; ControlFileData95 *ptr = (struct ControlFileData95 *)ControlFileDataPtr;
control_file_info->system_identifier = ptr->system_identifier; control_file_info->system_identifier = ptr->system_identifier;


@@ -265,6 +265,71 @@ typedef struct ControlFileData95
} ControlFileData95; } ControlFileData95;
/*
* Following field removed in 11:
*
* XLogRecPtr prevCheckPoint;
*
* In 10, following field appended *after* "data_checksum_version":
*
* char mock_authentication_nonce[MOCK_AUTH_NONCE_LEN];
*
* (but we don't care about that)
*/
typedef struct ControlFileData11
{
uint64 system_identifier;
uint32 pg_control_version; /* PG_CONTROL_VERSION */
uint32 catalog_version_no; /* see catversion.h */
DBState state; /* see enum above */
pg_time_t time; /* time stamp of last pg_control update */
XLogRecPtr checkPoint; /* last check point record ptr */
CheckPoint95 checkPointCopy; /* copy of last check point record */
XLogRecPtr unloggedLSN; /* current fake LSN value, for unlogged rels */
XLogRecPtr minRecoveryPoint;
TimeLineID minRecoveryPointTLI;
XLogRecPtr backupStartPoint;
XLogRecPtr backupEndPoint;
bool backupEndRequired;
int wal_level;
bool wal_log_hints;
int MaxConnections;
int max_worker_processes;
int max_prepared_xacts;
int max_locks_per_xact;
bool track_commit_timestamp;
uint32 maxAlign; /* alignment requirement for tuples */
double floatFormat; /* constant 1234567.0 */
uint32 blcksz; /* data block size for this DB */
uint32 relseg_size; /* blocks per segment of large relation */
uint32 xlog_blcksz; /* block size within WAL files */
uint32 xlog_seg_size; /* size of each WAL segment */
uint32 nameDataLen; /* catalog name field width */
uint32 indexMaxKeys; /* max number of columns in an index */
uint32 toast_max_chunk_size; /* chunk size in TOAST tables */
uint32 loblksize; /* chunk size in pg_largeobject */
bool enableIntTimes; /* int64 storage enabled? */
bool float4ByVal; /* float4 pass-by-value? */
bool float8ByVal; /* float8, int8, etc pass-by-value? */
uint32 data_checksum_version;
} ControlFileData11;
extern DBState get_db_state(const char *data_directory); extern DBState get_db_state(const char *data_directory);
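A separate `ControlFileData11` struct is needed because removing `prevCheckPoint` shifts every later field to a lower offset, so reading a PostgreSQL 11 control file through the 9.5 layout would return the wrong bytes. A minimal sketch of that effect, using heavily simplified, hypothetical layouts (`control_layout_95`, `control_layout_11`) that keep only a few fields and cover only the two newest branches of the version check:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical layouts: only enough fields to show the offset shift. */
typedef struct
{
	uint64_t system_identifier;
	uint64_t checkPoint;
	uint64_t prevCheckPoint;          /* present before PostgreSQL 11 */
	uint32_t data_checksum_version;
} control_layout_95;

typedef struct
{
	uint64_t system_identifier;
	uint64_t checkPoint;              /* prevCheckPoint removed in PostgreSQL 11 */
	uint32_t data_checksum_version;
} control_layout_11;

/*
 * Read "data_checksum_version" from a raw control-file buffer, choosing
 * the field offset according to the server version number.
 */
uint32_t
read_checksum_version(const void *raw, int version_num)
{
	uint32_t value;
	size_t offset;

	if (version_num >= 110000)
		offset = offsetof(control_layout_11, data_checksum_version);
	else
		offset = offsetof(control_layout_95, data_checksum_version);

	/* memcpy avoids alignment assumptions about the raw buffer */
	memcpy(&value, (const char *) raw + offset, sizeof(value));

	return value;
}
```

In these simplified layouts the two offsets differ by exactly `sizeof(uint64_t)`, the size of the removed field.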

1111
dbutils.c

File diff suppressed because it is too large


@@ -29,7 +29,9 @@
#include "voting.h" #include "voting.h"
#define REPMGR_NODES_COLUMNS "n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name " #define REPMGR_NODES_COLUMNS "n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name "
#define BDR_NODES_COLUMNS "node_sysid, node_timeline, node_dboid, node_status, node_name, node_local_dsn, node_init_from_dsn, node_read_only, node_seq_id" #define BDR2_NODES_COLUMNS "node_sysid, node_timeline, node_dboid, node_name, node_local_dsn, ''"
#define BDR3_NODES_COLUMNS "ns.node_id, 0, 0, ns.node_name, ns.interface_connstr, ns.peer_state_name"
#define ERRBUFF_SIZE 512 #define ERRBUFF_SIZE 512
@@ -94,6 +96,14 @@ typedef enum
SLOT_ACTIVE SLOT_ACTIVE
} ReplSlotStatus; } ReplSlotStatus;
typedef enum
{
BACKUP_STATE_UNKNOWN = -1,
BACKUP_STATE_IN_BACKUP,
BACKUP_STATE_NO_BACKUP
} BackupState;
/* /*
* Struct to store node information * Struct to store node information
*/ */
@@ -237,18 +247,14 @@ typedef struct s_bdr_node_info
char node_sysid[MAXLEN]; char node_sysid[MAXLEN];
uint32 node_timeline; uint32 node_timeline;
uint32 node_dboid; uint32 node_dboid;
char node_status;
char node_name[MAXLEN]; char node_name[MAXLEN];
char node_local_dsn[MAXLEN]; char node_local_dsn[MAXLEN];
char node_init_from_dsn[MAXLEN]; char peer_state_name[MAXLEN];
bool read_only;
uint32 node_seq_id;
} t_bdr_node_info; } t_bdr_node_info;
#define T_BDR_NODE_INFO_INITIALIZER { \ #define T_BDR_NODE_INFO_INITIALIZER { \
"", InvalidOid, InvalidOid, \ "", InvalidOid, InvalidOid, \
'?', "", "", "", \ "", "", "" \
false, -1 \
} }
@@ -392,6 +398,7 @@ int get_ready_archive_files(PGconn *conn, const char *data_directory);
bool identify_system(PGconn *repl_conn, t_system_identification *identification); bool identify_system(PGconn *repl_conn, t_system_identification *identification);
bool repmgrd_set_local_node_id(PGconn *conn, int local_node_id); bool repmgrd_set_local_node_id(PGconn *conn, int local_node_id);
int repmgrd_get_local_node_id(PGconn *conn); int repmgrd_get_local_node_id(PGconn *conn);
BackupState server_in_exclusive_backup_mode(PGconn *conn);
/* extension functions */ /* extension functions */
ExtensionStatus get_repmgr_extension_status(PGconn *conn); ExtensionStatus get_repmgr_extension_status(PGconn *conn);
@@ -468,7 +475,7 @@ int wait_connection_availability(PGconn *conn, long long timeout);
/* node availability functions */ /* node availability functions */
bool is_server_available(const char *conninfo); bool is_server_available(const char *conninfo);
bool is_server_available_params(t_conninfo_param_list *param_list); bool is_server_available_params(t_conninfo_param_list *param_list);
void connection_ping(PGconn *conn); ExecStatusType connection_ping(PGconn *conn);
/* monitoring functions */ /* monitoring functions */
void void
@@ -507,12 +514,14 @@ void get_node_replication_stats(PGconn *conn, int server_version_num, t_node_in
bool is_downstream_node_attached(PGconn *conn, char *node_name); bool is_downstream_node_attached(PGconn *conn, char *node_name);
/* BDR functions */ /* BDR functions */
int get_bdr_version_num(void);
void get_all_bdr_node_records(PGconn *conn, BdrNodeInfoList *node_list); void get_all_bdr_node_records(PGconn *conn, BdrNodeInfoList *node_list);
RecordStatus get_bdr_node_record_by_name(PGconn *conn, const char *node_name, t_bdr_node_info *node_info); RecordStatus get_bdr_node_record_by_name(PGconn *conn, const char *node_name, t_bdr_node_info *node_info);
bool is_bdr_db(PGconn *conn, PQExpBufferData *output); bool is_bdr_db(PGconn *conn, PQExpBufferData *output);
bool is_bdr_db_quiet(PGconn *conn); bool is_bdr_db_quiet(PGconn *conn);
bool is_active_bdr_node(PGconn *conn, const char *node_name); bool is_active_bdr_node(PGconn *conn, const char *node_name);
bool is_bdr_repmgr(PGconn *conn); bool is_bdr_repmgr(PGconn *conn);
char *get_default_bdr_replication_set(PGconn *conn);
bool is_table_in_bdr_replication_set(PGconn *conn, const char *tablename, const char *set); bool is_table_in_bdr_replication_set(PGconn *conn, const char *tablename, const char *set);
bool add_table_to_bdr_replication_set(PGconn *conn, const char *tablename, const char *set); bool add_table_to_bdr_replication_set(PGconn *conn, const char *tablename, const char *set);
void add_extension_tables_to_bdr_replication_set(PGconn *conn); void add_extension_tables_to_bdr_replication_set(PGconn *conn);


@@ -108,6 +108,14 @@
is not possible, contact your vendor for assistance. is not possible, contact your vendor for assistance.
</para> </para>
</sect2> </sect2>
<sect2 id="faq-old-packages">
<title>How can I obtain old versions of &repmgr; packages?</title>
<para>
See appendix <xref linkend="packages-old-versions"> for details.
</para>
</sect2>
</sect1> </sect1>
<sect1 id="faq-repmgr" xreflabel="repmgr"> <sect1 id="faq-repmgr" xreflabel="repmgr">
@@ -239,11 +247,22 @@
Under some circumstances event notifications can be generated for servers Under some circumstances event notifications can be generated for servers
which have not yet been registered; it's also useful to retain a record which have not yet been registered; it's also useful to retain a record
of events which includes servers removed from the replication cluster of events which includes servers removed from the replication cluster
which no longer have an entry in the <literal>repmrg.nodes</literal> table. which no longer have an entry in the <literal>repmgr.nodes</literal> table.
</para> </para>
</sect2> </sect2>
<sect2 id="faq-repmgr-recovery-conf-quoted-values" xreflabel="Quoted values in recovery.conf">
<title>Why are some values in <filename>recovery.conf</filename> surrounded by pairs of single quotes?</title>
<para>
This is to ensure that user-supplied values which are written as parameter values in <filename>recovery.conf</filename>
are escaped correctly and do not cause errors when <filename>recovery.conf</filename> is parsed.
</para>
<para>
The escaping is performed by an internal PostgreSQL routine, which leaves strings consisting
of digits and alphabetical characters only as-is, but wraps everything else in pairs of single quotes,
even if the string does not contain any characters which need escaping.
</para>
</sect2>
</sect1> </sect1>
@@ -255,7 +274,7 @@
<sect2 id="faq-repmgrd-prevent-promotion" xreflabel="Prevent standby from being promoted to primary"> <sect2 id="faq-repmgrd-prevent-promotion" xreflabel="Prevent standby from being promoted to primary">
<title>How can I prevent a node from ever being promoted to primary?</title> <title>How can I prevent a node from ever being promoted to primary?</title>
<para> <para>
In `repmgr.conf`, set its priority to a value of 0 or less; apply the changed setting with In <filename>repmgr.conf</filename>, set its priority to a value of <literal>0</literal>; apply the changed setting with
<command><link linkend="repmgr-standby-register">repmgr standby register --force</link></command>. <command><link linkend="repmgr-standby-register">repmgr standby register --force</link></command>.
</para> </para>
<para> <para>
@@ -303,5 +322,36 @@
</para> </para>
</sect2> </sect2>
<sect2 id="faq-repmgrd-pg-bindir" xreflabel="repmgrd does not apply pg_bindir to promote_command or follow_command">
<title>
<application>repmgrd</application> ignores pg_bindir when executing <varname>promote_command</varname> or <varname>follow_command</varname>
</title>
<para>
<varname>promote_command</varname> or <varname>follow_command</varname> can be user-defined scripts,
so &repmgr; will not apply <option>pg_bindir</option> even if executing &repmgr;. Always provide the full
path; see <xref linkend="repmgrd-automatic-failover-configuration"> for more details.
</para>
</sect2>
<sect2 id="faq-repmgrd-startup-no-upstream" xreflabel="repmgrd does not start if upstream node is not running">
<title>
<application>repmgrd</application> aborts startup with the error "<literal>upstream node must be running before repmgrd can start</literal>"
</title>
<para>
<application>repmgrd</application> does this to avoid starting up on a replication cluster
which is not in a healthy state. If the upstream is unavailable, <application>repmgrd</application>
may initiate a failover immediately after starting up, which could have unintended side-effects,
particularly if <application>repmgrd</application> is not running on other nodes.
</para>
<para>
In particular, it's possible that the node's local copy of the <literal>repmgr.nodes</literal> table
is out-of-date, which may lead to incorrect failover behaviour.
</para>
<para>
The onus is therefore on the administrator to manually set the cluster to a stable, healthy state before
starting <application>repmgrd</application>.
</para>
</sect2>
</sect1> </sect1>
</appendix> </appendix>


@@ -53,11 +53,11 @@
<tbody> <tbody>
<row> <row>
<entry>Repository URL:</entry> <entry>Repository URL:</entry>
<entry><ulink url="https://rpm.2ndquadrant.com/">https://rpm.2ndquadrant.com/</ulink></entry> <entry><ulink url="https://dl.2ndquadrant.com/">https://dl.2ndquadrant.com/</ulink></entry>
</row> </row>
<row> <row>
<entry>Repository documentation:</entry> <entry>Repository documentation:</entry>
<entry><ulink url="https://repmgr.org/docs/4.0/installation-packages.html#INSTALLATION-PACKAGES-REDHAT-2NDQ">https://repmgr.org/docs/4.0/installation-packages.html#INSTALLATION-PACKAGES-REDHAT-2NDQ</ulink></entry> <entry><ulink url="https://repmgr.org/docs/4.1/installation-packages.html#INSTALLATION-PACKAGES-REDHAT-2NDQ">https://repmgr.org/docs/4.1/installation-packages.html#INSTALLATION-PACKAGES-REDHAT-2NDQ</ulink></entry>
</row> </row>
</tbody> </tbody>
</tgroup> </tgroup>
@@ -253,6 +253,23 @@
</para> </para>
<table id="apt-2ndquadrant-repository">
<title>2ndQuadrant public repository</title>
<tgroup cols="2">
<tbody>
<row>
<entry>Repository URL:</entry>
<entry><ulink url="https://dl.2ndquadrant.com/">https://dl.2ndquadrant.com/</ulink></entry>
</row>
<row>
<entry>Repository documentation:</entry>
<entry><ulink url="https://repmgr.org/docs/4.1/installation-packages.html#INSTALLATION-PACKAGES-DEBIAN">https://repmgr.org/docs/4.1/installation-packages.html#INSTALLATION-PACKAGES-DEBIAN</ulink></entry>
</row>
</tbody>
</tgroup>
</table>
<table id="apt-repository"> <table id="apt-repository">
<title>PostgreSQL Community APT repository (PGDG)</title> <title>PostgreSQL Community APT repository (PGDG)</title>
<tgroup cols="2"> <tgroup cols="2">
@@ -364,4 +381,169 @@
</sect2> </sect2>
</sect1> </sect1>
<sect1 id="packages-snapshot" xreflabel="Snapshot packages">
<title>Snapshot packages</title>
<indexterm>
<primary>snapshot packages</primary>
</indexterm>
<indexterm>
<primary>packages</primary>
<secondary>snapshots</secondary>
</indexterm>
<para>
For testing new features and bug fixes, from time to time 2ndQuadrant provides
so-called &quot;snapshot packages&quot; via its public repository. These packages
are built from the &repmgr; source at a particular point in time, and are not formal
releases.
</para>
<note>
<para>
We do not recommend installing these packages in a production environment
unless specifically advised.
</para>
</note>
<para>
To install a snapshot package, it's necessary to install the 2ndQuadrant public snapshot repository,
following the instructions here: <ulink url="https://dl.2ndquadrant.com/default/release/site/">https://dl.2ndquadrant.com/default/release/site/</ulink> but replace <literal>release</literal> with <literal>snapshot</literal>
in the appropriate URL.
</para>
<para>
For example, to install the snapshot RPM repository for PostgreSQL 9.6, execute (as <literal>root</literal>):
<programlisting>
curl https://dl.2ndquadrant.com/default/snapshot/get/9.6/rpm | bash</programlisting>
or as a normal user with root sudo access:
<programlisting>
curl https://dl.2ndquadrant.com/default/snapshot/get/9.6/rpm | sudo bash</programlisting>
</para>
<para>
Alternatively you can browse the repository here:
<ulink url="https://dl.2ndquadrant.com/default/snapshot/browse/">https://dl.2ndquadrant.com/default/snapshot/browse/</ulink>.
</para>
<para>
Once the repository is installed, installing or updating &repmgr; will result in the latest snapshot
package being installed.
</para>
<para>
The package name will be formatted like this:
<programlisting>
repmgr96-4.1.1-0.0git320.g5113ab0.1.el7.x86_64.rpm</programlisting>
containing the snapshot build number (here: <literal>320</literal>) and the hash
of the <application>git</application> commit it was built from (here: <literal>g5113ab0</literal>).
</para>
<para>
Note that the next formal release (in the above example <literal>4.1.1</literal>), once available,
will install in place of any snapshot builds.
</para>
</sect1>
<sect1 id="packages-old-versions" xreflabel="Installing old package versions">
<title>Installing old package versions</title>
<indexterm>
<primary>old packages</primary>
</indexterm>
<indexterm>
<primary>packages</primary>
<secondary>old versions</secondary>
</indexterm>
<sect2 id="packages-old-versions-debian" xreflabel="old Debian package versions">
<title>Debian/Ubuntu</title>
<para>
An archive of old packages (<literal>3.3.2</literal> and later) for Debian/Ubuntu-based systems is available here:
<ulink url="http://atalia.postgresql.org/morgue/r/repmgr/">http://atalia.postgresql.org/morgue/r/repmgr/</ulink>
</para>
</sect2>
<sect2 id="packages-old-versions-rhel-centos" xreflabel="old RHEL/CentOS package versions">
<title>RHEL/CentOS</title>
<para>
Old RPM packages (<literal>3.2</literal> and later) can be retrieved from the
(deprecated) 2ndQuadrant repository at
<ulink url="http://packages.2ndquadrant.com/">http://packages.2ndquadrant.com/</ulink>
by installing the appropriate repository RPM:
</para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara>
<ulink url="http://packages.2ndquadrant.com/repmgr/yum-repo-rpms/repmgr-fedora-1.0-1.noarch.rpm">http://packages.2ndquadrant.com/repmgr/yum-repo-rpms/repmgr-fedora-1.0-1.noarch.rpm</ulink>
</simpara>
</listitem>
<listitem>
<simpara>
<ulink url="http://packages.2ndquadrant.com/repmgr/yum-repo-rpms/repmgr-rhel-1.0-1.noarch.rpm">http://packages.2ndquadrant.com/repmgr/yum-repo-rpms/repmgr-rhel-1.0-1.noarch.rpm</ulink>
</simpara>
</listitem>
</itemizedlist>
<para>
Old versions can be located with e.g.:
<programlisting>
yum --showduplicates list repmgr96</programlisting>
(substitute the appropriate package name; see <xref linkend="packages-centos">) and installed with:
<programlisting>
yum install {package_name}-{version}</programlisting>
where <literal>{package_name}</literal> is the base package name (e.g. <literal>repmgr96</literal>)
and <literal>{version}</literal> is the version listed by the
<command>yum --showduplicates list ...</command> command, e.g. <literal>4.0.6-1.rhel6</literal>.
</para>
<para>For example:
<programlisting>
yum install repmgr96-4.0.6-1.rhel6</programlisting>
</para>
</sect2>
</sect1>
<sect1 id="packages-packager-info" xreflabel="Information for packagers">
<title>Information for packagers</title>
<indexterm>
<primary>packages</primary>
<secondary>information for packagers</secondary>
</indexterm>
<para>
We recommend patching the following parameters when
building the package as built-in default values for user convenience.
These values can nevertheless be overridden by the user, if desired.
</para>
<itemizedlist>
<listitem>
<para>
Configuration file location: the default configuration file location
can be hard-coded by patching <varname>package_conf_file</varname>
in <filename>configfile.c</filename>:
<programlisting>
/* packagers: if feasible, patch configuration file path into "package_conf_file" */
char package_conf_file[MAXPGPATH] = "";</programlisting>
</para>
<para>
See also: <xref linkend="configuration-file">
</para>
</listitem>
<listitem>
<para>
PID file location: the default <application>repmgrd</application> PID file
location can be hard-coded by patching <varname>package_pid_file</varname>
in <filename>repmgrd.c</filename>:
<programlisting>
/* packagers: if feasible, patch PID file path into "package_pid_file" */
char package_pid_file[MAXPGPATH] = "";</programlisting>
</para>
<para>
See also: <xref linkend="repmgrd-pid-file">
</para>
</listitem>
</itemizedlist>
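<para>
As an illustration only, a package which installs its configuration file and
PID file in version-specific locations might carry patches resulting in
definitions like the following. Both paths shown here are hypothetical examples
chosen by an imaginary packager, not values used or required by &repmgr;:
<programlisting>
/* configfile.c: example path only, to be chosen by the packager */
char package_conf_file[MAXPGPATH] = "/etc/repmgr/10/repmgr.conf";

/* repmgrd.c: example path only, to be chosen by the packager */
char package_pid_file[MAXPGPATH] = "/var/run/repmgr/repmgrd-10.pid";</programlisting>
</para>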
</sect1>
</appendix> </appendix>

@@ -15,9 +15,373 @@
See also: <xref linkend="upgrading-repmgr"> See also: <xref linkend="upgrading-repmgr">
</para> </para>
<sect1 id="release-4.1.1">
<title>Release 4.1.1</title>
<para><emphasis>Wed September 5, 2018</emphasis></para>
<para>
repmgr 4.1.1 contains a number of usability enhancements and bug fixes.
</para>
<para>
We recommend upgrading to this version as soon as possible.
This release can be installed as a simple package upgrade from repmgr 4.0 ~ 4.1.0;
<application>repmgrd</application> (if running) should be restarted.
See <xref linkend="upgrading-repmgr"> for more details.
</para>
<sect2>
<title>repmgr enhancements</title>
<para>
<itemizedlist>
<listitem>
<para>
<command><link linkend="repmgr-standby-switchover">repmgr standby switchover --dry-run</link></command>
no longer copies external configuration files to test they can be copied; this avoids making
any changes to the target system. (GitHub #491).
</para>
</listitem>
<listitem>
<para>
<command><link linkend="repmgr-cluster-cleanup">repmgr cluster cleanup</link></command>:
add <literal>cluster_cleanup</literal> event. (GitHub #492).
</para>
</listitem>
<listitem>
<para>
<command><link linkend="repmgr-standby-switchover">repmgr standby switchover</link></command>:
improve detection of free walsenders. (GitHub #495).
</para>
</listitem>
<listitem>
<para>
Improve messages emitted during
<command><link linkend="repmgr-standby-promote">repmgr standby promote</link></command>.
</para>
</listitem>
</itemizedlist>
</para>
</sect2>
<sect2>
<title>repmgrd enhancements</title>
<para>
<itemizedlist>
<listitem>
<para>
Always reopen the log file after
receiving <literal>SIGHUP</literal>. Previously this only happened if
a configuration file change was detected.
(GitHub #485).
</para>
</listitem>
<listitem>
<para>
Report version number <emphasis>after</emphasis>
logger initialisation. (GitHub #487).
</para>
</listitem>
<listitem>
<para>
Improve cascaded standby failover handling. (GitHub #480).
</para>
</listitem>
<listitem>
<para>
Improve reconnection handling after brief network outages; if
monitoring data was being collected, such outages could lead to orphaned
sessions on the primary. (GitHub #480).
</para>
</listitem>
<listitem>
<para>
Check <varname>promote_command</varname> and <varname>follow_command</varname>
are defined when reloading configuration. These were checked on startup but
not on reload by <application>repmgrd</application>, which made it possible to
run <application>repmgrd</application> with invalid values. It's unlikely
anyone would want to do this, but we should make it impossible anyway.
(GitHub #486).
</para>
</listitem>
</itemizedlist>
</para>
</sect2>
<sect2>
<title>Other</title>
<para>
<itemizedlist>
<listitem>
<para>
Text of any failed queries will now be logged as <literal>ERROR</literal> to assist
logfile analysis at log levels higher than <literal>DEBUG</literal>.
(GitHub #498).
</para>
</listitem>
</itemizedlist>
</para>
</sect2>
<sect2>
<title>Bug fixes</title>
<para>
<itemizedlist>
<listitem>
<para>
<command><link linkend="repmgr-node-rejoin">repmgr node rejoin</link></command>:
remove new upstream's replication slot if it still exists on the rejoined
standby. (GitHub #499).
</para>
</listitem>
<listitem>
<para>
<application>repmgrd</application>: fix startup on witness node when local data is stale. (GitHub #488, #489).
</para>
</listitem>
<listitem>
<para>
Truncate version string reported by PostgreSQL if necessary; some
distributions insert additional detail after the actual version.
(GitHub #490).
</para>
</listitem>
</itemizedlist>
</para>
</sect2>
</sect1>
<sect1 id="release-4.1.0">
<title>Release 4.1.0</title>
<para><emphasis>Tue July 31, 2018</emphasis></para>
<para>
&repmgr; 4.1.0 introduces some changes to <application>repmgrd</application>
behaviour and some additional configuration parameters.
</para>
<para>
This release can be installed as a simple package upgrade from repmgr 4.0 ~ 4.0.6.
The following post-upgrade steps must be carried out:
<itemizedlist>
<listitem>
<para>
Execute <command>ALTER EXTENSION repmgr UPDATE</command>
on the primary server in the database where &repmgr; is installed.
</para>
</listitem>
<listitem>
<para>
<application>repmgrd</application> must be restarted on all nodes where it is running.
</para>
</listitem>
</itemizedlist>
A restart of the PostgreSQL server is <emphasis>not</emphasis> required
for this release (unless upgrading from repmgr 3.x).
</para>
<para>
See <xref linkend="upgrading-repmgr-extension"> for more details.
</para>
<para>
Configuration changes are backwards-compatible and no changes to
<filename>repmgr.conf</filename> are required. However users should
review the changes listed below.
</para>
<note>
<para>
<emphasis>Repository changes</emphasis>
</para>
<para>
Coinciding with this release, the 2ndQuadrant repository structure has changed.
See section <xref linkend="installation-packages"> for details, particularly
if you are using an RPM-based system.
</para>
</note>
<sect2>
<title>Configuration file changes</title>
<para>
<itemizedlist>
<listitem>
<para>
Default for <xref linkend="repmgr-conf-log-level"> is now <option>INFO</option>.
This produces additional informative log output, without creating excessive additional
log file volume, and matches the setting assumed for examples in the documentation.
(GitHub #470).
</para>
</listitem>
<listitem>
<para>
<varname>recovery_min_apply_delay</varname> now accepts a minimum value
of <literal>zero</literal> (GitHub #448).
</para>
</listitem>
</itemizedlist>
</para>
</sect2>
<sect2>
<title>repmgr enhancements</title>
<para>
<itemizedlist>
<listitem>
<para>
<application>repmgr</application>: always exit with an error if an unrecognised
command line option is provided. This matches the behaviour of other PostgreSQL
utilities such as <application>psql</application>. (GitHub #464).
</para>
</listitem>
<listitem>
<para>
<application>repmgr</application>: add <option>-q/--quiet</option> option to suppress non-error
output. (GitHub #468).
</para>
</listitem>
<listitem>
<para>
<command><link linkend="repmgr-cluster-show">repmgr cluster show</link></command>,
<command><link linkend="repmgr-node-check">repmgr node check</link></command> and
<command><link linkend="repmgr-node-status">repmgr node status</link></command>
return non-zero exit code if node status issues detected. (GitHub #456).
</para>
</listitem>
<listitem>
<para>
Add <option>--csv</option> output option for
<command><link linkend="repmgr-cluster-event">repmgr cluster event</link></command>.
(GitHub #471).
</para>
</listitem>
<listitem>
<para>
<command><link linkend="repmgr-witness-unregister">repmgr witness unregister</link></command>
can be run on any node, by providing the ID of the witness node with <option>--node-id</option>.
(GitHub #472).
</para>
</listitem>
<listitem>
<para>
<command><link linkend="repmgr-standby-switchover">repmgr standby switchover</link></command>
will refuse to run if an exclusive backup is taking place on the current primary.
(GitHub #476).
</para>
</listitem>
</itemizedlist>
</para>
</sect2>
<sect2>
<title>repmgrd enhancements</title>
<para>
<itemizedlist>
<listitem>
<para>
<application>repmgrd</application>: create a PID file by default
(GitHub #457). For details, see <xref linkend="repmgrd-pid-file">.
</para>
</listitem>
<listitem>
<para>
<application>repmgrd</application>: daemonize process by default.
If for any reason the process should not be daemonized,
provide <option>--daemonize=false</option>.
(GitHub #458).
</para>
</listitem>
</itemizedlist>
</para>
</sect2>
<sect2>
<title>Bug fixes</title>
<para>
<itemizedlist>
<listitem>
<para>
<command><link linkend="repmgr-standby-register">repmgr standby register --wait-sync</link></command>:
fix behaviour when no timeout provided.
</para>
</listitem>
<listitem>
<para>
<command><link linkend="repmgr-cluster-cleanup">repmgr cluster cleanup</link></command>:
add missing help options. (GitHub #461/#462).
</para>
</listitem>
<listitem>
<para>
Ensure witness node follows new primary after switchover. (GitHub #453).
</para>
</listitem>
<listitem>
<para>
<command><link linkend="repmgr-node-check">repmgr node check</link></command> and
<command><link linkend="repmgr-node-status">repmgr node status</link></command>:
fix witness node handling. (GitHub #451).
</para>
</listitem>
<listitem>
<para>
When using <command><link linkend="repmgr-standby-clone">repmgr standby clone</link></command>
with <option>--recovery-conf-only</option> and replication slots, ensure
<varname>primary_slot_name</varname> is set correctly. (GitHub #474).
</para>
</listitem>
</itemizedlist>
</para>
</sect2>
</sect1>
<sect1 id="release-4.0.6"> <sect1 id="release-4.0.6">
<title>Release 4.0.6</title> <title>Release 4.0.6</title>
<para><emphasis>June 14, 2018</emphasis></para> <para><emphasis>Thu June 14, 2018</emphasis></para>
<para> <para>
&repmgr; 4.0.6 contains a number of bug fixes and usability enhancements. &repmgr; 4.0.6 contains a number of bug fixes and usability enhancements.
</para> </para>
@@ -140,7 +504,7 @@
<listitem> <listitem>
<para> <para>
Various documentation improvements, with particular emphasis on Various documentation improvements, with particular emphasis on
the importance of setting appropriate <link linkend="configuration-service-commands">service commands</link> the importance of setting appropriate <link linkend="configuration-file-service-commands">service commands</link>
instead of relying on <application>pg_ctl</application>. instead of relying on <application>pg_ctl</application>.
</para> </para>
</listitem> </listitem>

@@ -5,14 +5,14 @@
<title>repmgr source code signing key</title> <title>repmgr source code signing key</title>
<para> <para>
The signing key ID used for <application>repmgr</application> source code bundles is: The signing key ID used for <application>repmgr</application> source code bundles is:
<ulink url="http://packages.2ndquadrant.com/repmgr/SOURCE-GPG-KEY-repmgr"> <ulink url="https://repmgr.org/download/SOURCE-GPG-KEY-repmgr">
<literal>0x297F1DCC</literal></ulink>. <literal>0x297F1DCC</literal></ulink>.
</para> </para>
<para> <para>
To download the <application>repmgr</application> source key to your computer: To download the <application>repmgr</application> source key to your computer:
<programlisting> <programlisting>
curl -s http://packages.2ndquadrant.com/repmgr/SOURCE-GPG-KEY-repmgr | gpg --import curl -s https://repmgr.org/download/SOURCE-GPG-KEY-repmgr | gpg --import
gpg --fingerprint 0x297F1DCC gpg --fingerprint 0x297F1DCC
</programlisting> </programlisting>
then verify that the fingerprint is the expected value: then verify that the fingerprint is the expected value:

@@ -0,0 +1,107 @@
<sect1 id="configuration-file-log-settings" xreflabel="log settings">
<indexterm>
<primary>repmgr.conf</primary>
<secondary>log settings</secondary>
</indexterm>
<indexterm>
<primary>log settings</primary>
<secondary>configuration in repmgr.conf</secondary>
</indexterm>
<title>Log settings</title>
<para>
By default, &repmgr; and <application>repmgrd</application> write log output to
<literal>STDERR</literal>. An alternative log destination can be specified
(either a file or <literal>syslog</literal>).
</para>
<note>
<para>
The &repmgr; application itself will continue to write log output to <literal>STDERR</literal>
even if another log destination is configured, as otherwise any output resulting from a command
line operation will "disappear" into the log.
</para>
<para>
This behaviour can be overridden with the command line option <option>--log-to-file</option>,
which will redirect all logging output to the configured log destination. This is recommended
when &repmgr; is executed by another application, particularly <application>repmgrd</application>,
to enable log output generated by the &repmgr; application to be stored for later reference.
</para>
</note>
<variablelist>
<varlistentry id="repmgr-conf-log-level" xreflabel="log_level">
<term><varname>log_level</varname> (<type>string</type>)
<indexterm>
<primary><varname>log_level</varname> configuration file parameter</primary>
</indexterm>
</term>
<listitem>
<para>
One of <option>DEBUG</option>, <option>INFO</option>, <option>NOTICE</option>,
<option>WARNING</option>, <option>ERROR</option>, <option>ALERT</option>, <option>CRIT</option>
or <option>EMERG</option>.
</para>
<para>
Default is <option>INFO</option>.
</para>
<para>
Note that <option>DEBUG</option> will produce a substantial amount of log output
and should not be enabled in normal use.
</para>
</listitem>
</varlistentry>
<varlistentry id="repmgr-conf-log-facility" xreflabel="log_facility">
<term><varname>log_facility</varname> (<type>string</type>)
<indexterm>
<primary><varname>log_facility</varname> configuration file parameter</primary>
</indexterm>
</term>
<listitem>
<para>
Logging facility: possible values are <option>STDERR</option> (default), or for
syslog integration, one of <option>LOCAL0</option>, <option>LOCAL1</option>, <option>...</option>,
<option>LOCAL7</option>, <option>USER</option>.
</para>
</listitem>
</varlistentry>
<varlistentry id="repmgr-conf-log-file" xreflabel="log_file">
<term><varname>log_file</varname> (<type>string</type>)
<indexterm>
<primary><varname>log_file</varname> configuration file parameter</primary>
</indexterm>
</term>
<listitem>
<para>
If <xref linkend="repmgr-conf-log-facility"> is set to <option>STDERR</option>, log output
can be redirected to the specified file.
</para>
<para>
See <xref linkend="repmgrd-log-rotation"> for information on configuring log rotation.
</para>
</listitem>
</varlistentry>
<varlistentry id="repmgr-conf-log-status-interval" xreflabel="log_status_interval">
<term><varname>log_status_interval</varname> (<type>integer</type>)
<indexterm>
<primary><varname>log_status_interval</varname> configuration file parameter</primary>
</indexterm>
</term>
<listitem>
<para>
This setting causes <application>repmgrd</application> to emit a status log
line at the specified interval (in seconds, default <literal>300</literal>)
describing <application>repmgrd</application>'s current state, e.g.:
</para>
<programlisting>
[2018-07-12 00:47:32] [INFO] monitoring connection to upstream node "node1" (node ID: 1)</programlisting>
</listitem>
</varlistentry>
</variablelist>
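<para>
As a minimal sketch, the following <filename>repmgr.conf</filename> fragment combines
the settings described above to send log output to a file, using the default
log level and status interval. The log file path is an assumption chosen purely
for illustration and should be adjusted to suit the local installation:
<programlisting>
log_level='INFO'
log_facility='STDERR'
# example path only
log_file='/var/log/repmgr/repmgr.log'
log_status_interval=300</programlisting>
</para>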
</sect1>

@@ -1,10 +1,10 @@
<sect1 id="configuration-file-settings" xreflabel="configuration file settings"> <sect1 id="configuration-file-settings" xreflabel="required configuration file settings">
<indexterm> <indexterm>
<primary>repmgr.conf</primary> <primary>repmgr.conf</primary>
<secondary>basic settings</secondary> <secondary>required settings</secondary>
</indexterm> </indexterm>
<title>Basic configuration file settings</title> <title>Required configuration file settings</title>
<para> <para>
Each <filename>repmgr.conf</filename> file must contain the following parameters: Each <filename>repmgr.conf</filename> file must contain the following parameters:
</para> </para>

@@ -1,4 +1,4 @@
<sect1 id="configuration-service-commands" xreflabel="service command settings"> <sect1 id="configuration-file-service-commands" xreflabel="service command settings">
<indexterm> <indexterm>
<primary>repmgr.conf</primary> <primary>repmgr.conf</primary>
<secondary>service command settings</secondary> <secondary>service command settings</secondary>
@@ -17,9 +17,9 @@
<link linkend="repmgr-node-rejoin"><command>repmgr node rejoin</command></link>. <link linkend="repmgr-node-rejoin"><command>repmgr node rejoin</command></link>.
</para> </para>
<para> <para>
By default, &repmgr; will use PostgreSQL's <command>pg_ctl</command> to control the PostgreSQL By default, &repmgr; will use PostgreSQL's <command>pg_ctl</command> utility to control the PostgreSQL
server. However this can lead to various problems, particularly when PostgreSQL has been server. However this can lead to various problems, particularly when PostgreSQL has been
installed from packages, and expecially so if <application>systemd</application> is in use. installed from packages, and especially so if <application>systemd</application> is in use.
</para> </para>
@@ -47,6 +47,14 @@
service_restart_command service_restart_command
service_reload_command</programlisting> service_reload_command</programlisting>
</para> </para>
<note>
<para>
&repmgr; will not apply <option>pg_bindir</option> when executing any of these commands;
these can be user-defined scripts so must always be specified with the full path.
</para>
</note>
<note> <note>
<para> <para>
It's also possible to specify a <varname>service_promote_command</varname>. It's also possible to specify a <varname>service_promote_command</varname>.
@@ -56,7 +64,7 @@
</para> </para>
<para> <para>
If your packaging system does not provide such a command, it can be left empty, If your packaging system does not provide such a command, it can be left empty,
and &repmgr; will generate the appropriate <command>pg_ctl ... promote</command> command. and &repmgr; will generate the appropriate `pg_ctl ... promote` command.
</para> </para>
<para> <para>
Do not confuse this with <varname>promote_command</varname>, which is used Do not confuse this with <varname>promote_command</varname>, which is used
@@ -64,7 +72,6 @@
</para> </para>
</note> </note>
<para> <para>
To confirm which command &repmgr; will execute for each action, use To confirm which command &repmgr; will execute for each action, use
<command>repmgr node service --list --action=...</command>, e.g.: <command>repmgr node service --list --action=...</command>, e.g.:
@@ -92,7 +99,7 @@
Defaults:postgres !requiretty Defaults:postgres !requiretty
postgres ALL = NOPASSWD: /usr/bin/systemctl stop postgresql-9.6, \ postgres ALL = NOPASSWD: /usr/bin/systemctl stop postgresql-9.6, \
/usr/bin/systemctl start postgresql-9.6, \ /usr/bin/systemctl start postgresql-9.6, \
/usr/bin/systemctl restart postgresql-9.6 \ /usr/bin/systemctl restart postgresql-9.6, \
/usr/bin/systemctl reload postgresql-9.6</programlisting> /usr/bin/systemctl reload postgresql-9.6</programlisting>
</para> </para>

@@ -2,16 +2,17 @@
<title>repmgr configuration</title> <title>repmgr configuration</title>
&configuration-file; &configuration-file;
&configuration-file-settings; &configuration-file-required-settings;
&configuration-service-commands; &configuration-file-log-settings;
&configuration-file-service-commands;
<sect1 id="configuration-permissions" xreflabel="User permissions"> <sect1 id="configuration-permissions" xreflabel="Database user permissions">
<indexterm> <indexterm>
<primary>configuration</primary> <primary>configuration</primary>
<secondary>user permissions</secondary> <secondary>database user permissions</secondary>
</indexterm> </indexterm>
<title>repmgr user permissions</title> <title>repmgr database user permissions</title>
<para> <para>
&repmgr; will create an extension database containing objects &repmgr; will create an extension database containing objects
for administering &repmgr; metadata. The user defined in the <varname>conninfo</varname> for administering &repmgr; metadata. The user defined in the <varname>conninfo</varname>

@@ -16,15 +16,22 @@
<para> <para>
A typical use case for a witness server is a two-node streaming replication A typical use case for a witness server is a two-node streaming replication
setup, where the primary and standby are in different locations (data centres). setup, where the primary and standby are in different locations (data centres).
By creating a witness server in the same location as the primary, if the primary By creating a witness server in the same location (data centre) as the primary,
becomes unavailable it's possible for the standby to decide whether it can if the primary becomes unavailable it's possible for the standby to decide whether
promote itself without risking a "split brain" scenario: if it can't see either the it can promote itself without risking a "split brain" scenario: if it can't see either the
witness or the primary server, it's likely there's a network-level interruption witness or the primary server, it's likely there's a network-level interruption
and it should not promote itself. If it can see the witness but not the primary, and it should not promote itself. If it can see the witness but not the primary,
this proves there is no network interruption and the primary itself is unavailable, this proves there is no network interruption and the primary itself is unavailable,
and it can therefore promote itself (and ideally take action to fence the and it can therefore promote itself (and ideally take action to fence the
former primary). former primary).
</para> </para>
<note>
<para>
<emphasis>Never</emphasis> install a witness server on the same physical host
as another node in the replication cluster managed by &repmgr; - it's essential
the witness is not affected in any way by failure of another node.
</para>
</note>
<para> <para>
For more complex replication scenarios, e.g. with multiple datacentres, it may For more complex replication scenarios, e.g. with multiple datacentres, it may
be preferable to use location-based failover, which ensures that only nodes be preferable to use location-based failover, which ensures that only nodes

@@ -147,34 +147,104 @@
<para> <para>
By default, all notification types will be passed to the designated script; By default, all notification types will be passed to the designated script;
the notification types can be filtered to explicitly named ones using the the notification types can be filtered to explicitly named ones using the
<varname>event_notifications</varname> parameter: <varname>event_notifications</varname> parameter.
</para>
<para>
Events generated by the &repmgr; command:
<itemizedlist spacing="compact" mark="bullet"> <itemizedlist spacing="compact" mark="bullet">
<listitem> <listitem>
<simpara><literal>primary_register</literal></simpara> <simpara><literal><link linkend="repmgr-primary-register-events">cluster_created</link></literal></simpara>
</listitem> </listitem>
<listitem> <listitem>
<simpara><literal>primary_unregister</literal></simpara> <simpara><literal><link linkend="repmgr-primary-register-events">primary_register</link></literal></simpara>
</listitem> </listitem>
<listitem> <listitem>
<simpara><literal>standby_register</literal></simpara> <simpara><literal><link linkend="repmgr-primary-unregister-events">primary_unregister</link></literal></simpara>
</listitem>
<listitem>
<simpara><literal><link linkend="repmgr-standby-clone-events">standby_clone</link></literal></simpara>
</listitem> </listitem>
<listitem> <listitem>
<simpara><literal>standby_register_sync</literal></simpara> <simpara><literal><link linkend="repmgr-standby-register-events">standby_register</link></literal></simpara>
</listitem> </listitem>
<listitem> <listitem>
<simpara><literal>standby_unregister</literal></simpara> <simpara><literal><link linkend="repmgr-standby-register-events">standby_register_sync</link></literal></simpara>
</listitem> </listitem>
<listitem> <listitem>
<simpara><literal>standby_clone</literal></simpara> <simpara><literal><link linkend="repmgr-standby-unregister-events">standby_unregister</link></literal></simpara>
</listitem>
<listitem>
<simpara><literal><link linkend="repmgr-standby-promote-events">standby_promote</link></literal></simpara>
</listitem> </listitem>
<listitem> <listitem>
<simpara><literal>standby_promote</literal></simpara> <simpara><literal><link linkend="repmgr-standby-follow-events">standby_follow</link></literal></simpara>
</listitem> </listitem>
<listitem> <listitem>
<simpara><literal>standby_follow</literal></simpara> <simpara><literal><link linkend="repmgr-standby-switchover-events">standby_switchover</link></literal></simpara>
</listitem> </listitem>
<listitem>
<simpara><literal><link linkend="repmgr-witness-register-events">witness_register</link></literal></simpara>
</listitem>
<listitem>
<simpara><literal><link linkend="repmgr-witness-unregister-events">witness_unregister</link></literal></simpara>
</listitem>
<listitem>
<simpara><literal><link linkend="repmgr-node-rejoin-events">node_rejoin</link></literal></simpara>
</listitem>
<listitem>
<simpara><literal><link linkend="repmgr-cluster-cleanup-events">cluster_cleanup</link></literal></simpara>
</listitem>
</itemizedlist>
</para>
<para>
Events generated by <application>repmgrd</application> (streaming replication mode):
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara><literal>repmgrd_start</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_shutdown</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_reload</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_failover_promote</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_failover_follow</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_failover_aborted</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_standby_reconnect</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_promote_error</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_local_disconnect</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_local_reconnect</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_upstream_disconnect</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_upstream_reconnect</literal></simpara>
</listitem>
<listitem> <listitem>
<simpara><literal>standby_disconnect_manual</literal></simpara> <simpara><literal>standby_disconnect_manual</literal></simpara>
</listitem> </listitem>
@@ -184,39 +254,13 @@
<listitem> <listitem>
<simpara><literal>standby_recovery</literal></simpara> <simpara><literal>standby_recovery</literal></simpara>
</listitem> </listitem>
<listitem>
<simpara><literal>witness_register</literal></simpara> </itemizedlist>
</listitem> </para>
<listitem>
<simpara><literal>witness_unregister</literal></simpara> <para>
</listitem> Events generated by <application>repmgrd</application> (BDR mode):
<listitem> <itemizedlist spacing="compact" mark="bullet">
<simpara><literal>node_rejoin</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_start</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_shutdown</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_failover_promote</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_failover_follow</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_failover_aborted</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_upstream_disconnect</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_upstream_reconnect</literal></simpara>
</listitem>
<listitem>
<simpara><literal>repmgrd_promote_error</literal></simpara>
</listitem>
<listitem> <listitem>
<simpara><literal>bdr_failover</literal></simpara> <simpara><literal>bdr_failover</literal></simpara>
</listitem> </listitem>

@@ -38,8 +38,9 @@
<!ENTITY quickstart SYSTEM "quickstart.sgml"> <!ENTITY quickstart SYSTEM "quickstart.sgml">
<!ENTITY configuration SYSTEM "configuration.sgml"> <!ENTITY configuration SYSTEM "configuration.sgml">
<!ENTITY configuration-file SYSTEM "configuration-file.sgml"> <!ENTITY configuration-file SYSTEM "configuration-file.sgml">
<!ENTITY configuration-file-settings SYSTEM "configuration-file-settings.sgml"> <!ENTITY configuration-file-required-settings SYSTEM "configuration-file-required-settings.sgml">
<!ENTITY configuration-service-commands SYSTEM "configuration-service-commands.sgml"> <!ENTITY configuration-file-log-settings SYSTEM "configuration-file-log-settings.sgml">
<!ENTITY configuration-file-service-commands SYSTEM "configuration-file-service-commands.sgml">
<!ENTITY cloning-standbys SYSTEM "cloning-standbys.sgml"> <!ENTITY cloning-standbys SYSTEM "cloning-standbys.sgml">
<!ENTITY promoting-standby SYSTEM "promoting-standby.sgml"> <!ENTITY promoting-standby SYSTEM "promoting-standby.sgml">
<!ENTITY follow-new-primary SYSTEM "follow-new-primary.sgml"> <!ENTITY follow-new-primary SYSTEM "follow-new-primary.sgml">

@@ -16,7 +16,7 @@
<para> <para>
&repmgr; RPM packages for RedHat/CentOS variants and Fedora are available from the &repmgr; RPM packages for RedHat/CentOS variants and Fedora are available from the
<ulink url="https://2ndquadrant.com">2ndQuadrant</ulink> <ulink url="https://2ndquadrant.com">2ndQuadrant</ulink>
<ulink url="https://rpm.2ndquadrant.com/">public RPM repository</ulink>; see following <ulink url="https://dl.2ndquadrant.com/">public repository</ulink>; see following
section for details. section for details.
</para> </para>
<para> <para>
@@ -38,7 +38,7 @@
<para> <para>
For more information on the package contents, including details of installation For more information on the package contents, including details of installation
paths and relevant <link linkend="configuration-service-commands">service commands</link>, paths and relevant <link linkend="configuration-file-service-commands">service commands</link>,
see the appendix section <xref linkend="packages-centos">. see the appendix section <xref linkend="packages-centos">.
</para> </para>
@@ -46,26 +46,15 @@
<sect3 id="installation-packages-redhat-2ndq"> <sect3 id="installation-packages-redhat-2ndq">
<title>2ndQuadrant public RPM yum repository</title> <title>2ndQuadrant public RPM yum repository</title>
<note>
<para>
<ulink url="https://2ndquadrant.com">2ndQuadrant</ulink> previously provided a dedicated
&repmgr; repository at
<ulink url="http://packages.2ndquadrant.com/repmgr/">http://packages.2ndquadrant.com/repmgr/</ulink>.
This repository will be deprecated in a future release as it is now replaced by
the <ulink url="https://rpm.2ndquadrant.com/">public RPM repository</ulink>
documented below.
</para>
</note>
<para> <para>
Beginning with <ulink url="https://repmgr.org/docs/4.0/release-4.0.5.html">repmgr 4.0.5</ulink>, Beginning with <ulink url="https://repmgr.org/docs/4.1/release-4.0.5.html">repmgr 4.0.5</ulink>,
<ulink url="https://2ndquadrant.com/">2ndQuadrant</ulink> provides a dedicated <literal>yum</literal> <ulink url="https://2ndquadrant.com/">2ndQuadrant</ulink> provides a dedicated <literal>yum</literal>
<ulink url="https://rpm.2ndquadrant.com/">public RPM repository</ulink> for 2ndQuadrant software, <ulink url="https://dl.2ndquadrant.com/">public repository</ulink> for 2ndQuadrant software,
including &repmgr;. We recommend using this for all future &repmgr; releases. including &repmgr;. We recommend using this for all future &repmgr; releases.
</para> </para>
<para> <para>
General instructions for using this repository can be found on its General instructions for using this repository can be found on its
<ulink url="https://rpm.2ndquadrant.com/">homepage</ulink>. Specific instructions <ulink url="https://dl.2ndquadrant.com/">homepage</ulink>. Specific instructions
for installing &repmgr; follow below. for installing &repmgr; follow below.
</para> </para>
<para> <para>
@@ -75,20 +64,19 @@
<listitem> <listitem>
<para> <para>
Locate the repository RPM for your PostgreSQL version from the list at: Locate the repository RPM for your PostgreSQL version from the list at:
<ulink url="https://rpm.2ndquadrant.com/">https://rpm.2ndquadrant.com/</ulink> <ulink url="https://dl.2ndquadrant.com/">https://dl.2ndquadrant.com/</ulink>
</para> </para>
</listitem> </listitem>
<listitem> <listitem>
<para> <para>
Install the repository RPM for your distribution and PostgreSQL version Install the repository definition for your distribution and PostgreSQL version
(this enables the 2ndQuadrant repository as a source of &repmgr; packages). (this enables the 2ndQuadrant repository as a source of &repmgr; packages).
</para> </para>
<para> <para>
For example, for PostgreSQL 10 on CentOS, execute: For example, for PostgreSQL 10 on CentOS, execute:
<programlisting> <programlisting>
sudo yum install https://rpm.2ndquadrant.com/site/content/2ndquadrant-repo-10-1-1.el7.noarch.rpm curl https://dl.2ndquadrant.com/default/release/get/10/rpm | sudo bash</programlisting>
</programlisting>
</para> </para>
<para> <para>
Verify that the repository is installed with: Verify that the repository is installed with:
@@ -96,8 +84,8 @@ sudo yum install https://rpm.2ndquadrant.com/site/content/2ndquadrant-repo-10-1-
sudo yum repolist</programlisting> sudo yum repolist</programlisting>
The output should contain two entries like this: The output should contain two entries like this:
<programlisting> <programlisting>
2ndquadrant-repo-10/7/x86_64 2ndQuadrant packages for PG10 for rhel 7 - x86_64 1 2ndquadrant-dl-default-release-pg10/7/x86_64 2ndQuadrant packages (PG10) for 7 - x86_64 4
2ndquadrant-repo-10-debug/7/x86_64 2ndQuadrant packages for PG10 for rhel 7 - x86_64 - Debug 1</programlisting> 2ndquadrant-dl-default-release-pg10-debug/7/x86_64 2ndQuadrant packages (PG10) for 7 - x86_64 - Debug 3</programlisting>
</para> </para>
</listitem> </listitem>
@@ -167,7 +155,7 @@ $ yum install repmgr10</programlisting>
</para> </para>
<para> <para>
For more information on the package contents, including details of installation For more information on the package contents, including details of installation
paths and relevant <link linkend="configuration-service-commands">service commands</link>, paths and relevant <link linkend="configuration-file-service-commands">service commands</link>,
see the appendix section <xref linkend="packages-debian-ubuntu">. see the appendix section <xref linkend="packages-debian-ubuntu">.
</para> </para>
@@ -177,52 +165,43 @@ $ yum install repmgr10</programlisting>
<para> <para>
Beginning with <ulink url="https://repmgr.org/docs/4.0/release-4.0.5.html">repmgr 4.0.5</ulink>, Beginning with <ulink url="https://repmgr.org/docs/4.0/release-4.0.5.html">repmgr 4.0.5</ulink>,
<ulink url="https://2ndquadrant.com/">2ndQuadrant</ulink> provides a <ulink url="https://2ndquadrant.com/">2ndQuadrant</ulink> provides a
<ulink url="https://apt.2ndquadrant.com/">public apt repository</ulink> for 2ndQuadrant software, <ulink url="https://dl.2ndquadrant.com/">public apt repository</ulink> for 2ndQuadrant software,
including &repmgr;. including &repmgr;.
</para> </para>
<para> <para>
General instructions for using this repository can be found on its General instructions for using this repository can be found on its
<ulink url="https://apt.2ndquadrant.com/">homepage</ulink>. Specific instructions <ulink url="https://dl.2ndquadrant.com/">homepage</ulink>. Specific instructions
for installing &repmgr; follow below. for installing &repmgr; follow below.
</para> </para>
<para> <para>
<emphasis>Installation</emphasis> <emphasis>Installation</emphasis>
<itemizedlist> <itemizedlist>
<listitem> <listitem>
<para> <para>
If not already present, install the <application>apt-transport-https</application> package: Install the repository definition for your distribution and PostgreSQL version
<programlisting> (this enables the 2ndQuadrant repository as a source of &repmgr; packages) by executing:
sudo apt-get install apt-transport-https</programlisting> <programlisting>
curl https://dl.2ndquadrant.com/default/release/get/deb | sudo bash</programlisting>
</para> </para>
</listitem> <note>
<para>
This will automatically install the following additional packages, if not already present:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara><literal>lsb-release</literal></simpara>
</listitem>
<listitem>
<simpara><literal>apt-transport-https</literal></simpara>
</listitem>
</itemizedlist>
</para>
</note>
</listitem>
<listitem>
<para>
Create <filename>/etc/apt/sources.list.d/2ndquadrant.list</filename> as follows:
<programlisting>
sudo sh -c 'echo "deb https://apt.2ndquadrant.com/ $(lsb_release -cs)-2ndquadrant main" > /etc/apt/sources.list.d/2ndquadrant.list'</programlisting>
</para>
</listitem>
<listitem>
<para>
Install the 2ndQuadrant <ulink url="https://apt.2ndquadrant.com/site/keys/9904CD4BD6BAF0C3.asc">repository key</ulink>:
<programlisting>
sudo apt-get install curl ca-certificates
curl https://apt.2ndquadrant.com/site/keys/9904CD4BD6BAF0C3.asc | sudo apt-key add -</programlisting>
</para>
</listitem>
<listitem>
<para>
Update the package list
<programlisting>
sudo apt-get update</programlisting>
</para>
</listitem>
<listitem> <listitem>
<para> <para>

@@ -12,8 +12,8 @@
To install &repmgr; the prerequisites for compiling To install &repmgr; the prerequisites for compiling
&postgres; must be installed. These are described in &postgres;'s &postgres; must be installed. These are described in &postgres;'s
documentation documentation
on <ulink url="https://www.postgresql.org/docs/current/install-requirements.html">build requirements</ulink> on <ulink url="https://www.postgresql.org/docs/current/static/install-requirements.html">build requirements</ulink>
and <ulink url="https://www.postgresql.org/docs/current/docguide-toolsets.html">build requirements for documentation</ulink>. and <ulink url="https://www.postgresql.org/docs/current/static/docguide-toolsets.html">build requirements for documentation</ulink>.
</para> </para>
<para> <para>

@@ -234,17 +234,34 @@
<para> <para>
<filename>repmgr.conf</filename> should not be stored inside the PostgreSQL data directory, <filename>repmgr.conf</filename> should not be stored inside the PostgreSQL data directory,
as it could be overwritten when setting up or reinitialising the PostgreSQL as it could be overwritten when setting up or reinitialising the PostgreSQL
server. See sections on <xref linkend="configuration-file"> and <xref linkend="configuration-file-settings"> server. See sections <xref linkend="configuration"> and <xref linkend="configuration-file">
for further details about <filename>repmgr.conf</filename>. for further details about <filename>repmgr.conf</filename>.
</para> </para>
<tip> <tip>
<simpara> <simpara>
For Debian-based distributions we recommend explicitly setting For Debian-based distributions we recommend explicitly setting
<literal>pg_bindir</literal> to the directory where <command>pg_ctl</command> and other binaries <option>pg_bindir</option> to the directory where <command>pg_ctl</command> and other binaries
not in the standard path are located. For PostgreSQL 9.6 this would be <filename>/usr/lib/postgresql/9.6/bin/</filename>. not in the standard path are located. For PostgreSQL 9.6 this would be <filename>/usr/lib/postgresql/9.6/bin/</filename>.
</simpara> </simpara>
</tip> </tip>
<note>
<para>
&repmgr; only uses <option>pg_bindir</option> when it executes
PostgreSQL binaries directly.
</para>
<para>
For user-defined scripts such as <option>promote_command</option> and the
various <option>service_*_command</option>s, you <emphasis>must</emphasis>
always explicitly provide the full path to the binary or script being
executed, even if it is &repmgr; itself.
</para>
<para>
This is because these options can contain user-defined scripts in arbitrary
locations, so prepending <option>pg_bindir</option> may break them.
</para>
</note>
<para> <para>
See the file See the file
<ulink url="https://raw.githubusercontent.com/2ndQuadrant/repmgr/master/repmgr.conf.sample">repmgr.conf.sample</> <ulink url="https://raw.githubusercontent.com/2ndQuadrant/repmgr/master/repmgr.conf.sample">repmgr.conf.sample</>

@@ -15,9 +15,14 @@
<title>Description</title> <title>Description</title>
<para> <para>
Purges monitoring history from the <literal>repmgr.monitoring_history</literal> table to Purges monitoring history from the <literal>repmgr.monitoring_history</literal> table to
prevent excessive table growth. Use the <literal>-k/--keep-history</literal> to specify the prevent excessive table growth.
number of days of monitoring history to retain. This command can be used </para>
manually or as a cronjob. <para>
By default <emphasis>all</emphasis> data will be removed; use the <option>-k/--keep-history</option>
option to specify the number of days of monitoring history to retain.
</para>
<para>
This command can be executed manually or as a cronjob.
</para> </para>
</refsect1> </refsect1>
@@ -38,4 +43,21 @@
<filename>repmgr.conf</filename>. <filename>repmgr.conf</filename>.
</para> </para>
</refsect1> </refsect1>
<refsect1 id="repmgr-cluster-cleanup-events">
<title>Event notifications</title>
<para>
A <literal>cluster_cleanup</literal> <link linkend="event-notifications">event notification</link> will be generated.
</para>
</refsect1>
<refsect1>
<title>See also</title>
<para>
For more details see the sections <xref linkend="repmgrd-monitoring"> and
<xref linkend="repmgrd-monitoring-configuration">.
</para>
</refsect1>
</refentry> </refentry>

@@ -56,7 +56,7 @@
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term><option>ERR_CLUSTER_CHECK (25)</option></term> <term><option>ERR_NODE_STATUS (25)</option></term>
<listitem> <listitem>
<para> <para>
One or more nodes could not be reached. One or more nodes could not be reached.

@@ -49,6 +49,22 @@
</para> </para>
</refsect1> </refsect1>
<refsect1>
<title>Output format</title>
<para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara>
<literal>--csv</literal>: generate output in CSV format. Note that the <literal>Details</literal>
column will currently not be emitted in CSV format.
</simpara>
</listitem>
</itemizedlist>
</para>
</refsect1>
<refsect1> <refsect1>
<title>Example</title> <title>Example</title>
<para> <para>

@@ -116,7 +116,7 @@
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term><option>ERR_CLUSTER_CHECK (25)</option></term> <term><option>ERR_NODE_STATUS (25)</option></term>
<listitem> <listitem>
<para> <para>
One or more nodes could not be reached. One or more nodes could not be reached.

@@ -81,35 +81,91 @@
<refsect1> <refsect1>
<title>Options</title> <title>Options</title>
<para>
<command>repmgr cluster show</command> accepts an optional parameter <literal>--csv</literal>, which <variablelist>
outputs the replication cluster's status in a simple CSV format, suitable for
parsing by scripts: <varlistentry>
<programlisting> <term><option>--csv</option></term>
<listitem>
<para>
<command>repmgr cluster show</command> accepts an optional parameter <literal>--csv</literal>, which
outputs the replication cluster's status in a simple CSV format, suitable for
parsing by scripts:
<programlisting>
$ repmgr -f /etc/repmgr.conf cluster show --csv $ repmgr -f /etc/repmgr.conf cluster show --csv
1,-1,-1 1,-1,-1
2,0,0 2,0,0
3,0,1</programlisting> 3,0,1</programlisting>
</para> </para>
<para> <para>
The columns have the following meanings: The columns have the following meanings:
<itemizedlist spacing="compact" mark="bullet"> <itemizedlist spacing="compact" mark="bullet">
<listitem> <listitem>
<simpara> <simpara>
node ID node ID
</simpara> </simpara>
</listitem> </listitem>
<listitem> <listitem>
<simpara> <simpara>
availability (0 = available, -1 = unavailable) availability (0 = available, -1 = unavailable)
</simpara> </simpara>
</listitem> </listitem>
<listitem>
<simpara>
recovery state (0 = not in recovery, 1 = in recovery, -1 = unknown)
</simpara>
</listitem>
</itemizedlist>
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><option>--verbose</option></term>
<listitem> <listitem>
<simpara> <para>
recovery state (0 = not in recovery, 1 = in recovery, -1 = unknown) Display the full text of any database connection error messages
</simpara> </para>
</listitem> </listitem>
</itemizedlist> </varlistentry>
</variablelist>
</refsect1>
<refsect1>
<title>Exit codes</title>
<para>
The following exit codes can be emitted by <command>repmgr cluster show</command>:
</para>
<variablelist>
<varlistentry>
<term><option>SUCCESS (0)</option></term>
<listitem>
<para>
No issues were detected.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><option>ERR_NODE_STATUS (25)</option></term>
<listitem>
<para>
One or more issues were detected.
</para>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1>
<title>See also</title>
<para>
<xref linkend="repmgr-node-status">, <xref linkend="repmgr-node-check">
</para> </para>
</refsect1> </refsect1>
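The three-column CSV format documented in this hunk can be consumed by a script along the following lines. This is only a minimal sketch: the function name `parse_cluster_show_csv` and the dictionary keys are hypothetical, and the sample input is copied from the documentation example rather than read from a live `repmgr` invocation.

```python
import csv
import io

# Column meanings as listed in the documentation above.
AVAILABILITY = {0: "available", -1: "unavailable"}
RECOVERY = {0: "not in recovery", 1: "in recovery", -1: "unknown"}


def parse_cluster_show_csv(text):
    """Parse `repmgr cluster show --csv` output into a list of dicts.

    Hypothetical helper: assumes exactly three integer columns per line
    (node ID, availability, recovery state), as documented above.
    """
    nodes = []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # csv.reader yields an empty list for blank lines
            continue
        node_id, availability, recovery = (int(field) for field in row)
        nodes.append({
            "node_id": node_id,
            "availability": AVAILABILITY[availability],
            "recovery": RECOVERY[recovery],
        })
    return nodes


# Sample output copied from the documentation example.
sample = "1,-1,-1\n2,0,0\n3,0,1\n"
for node in parse_cluster_show_csv(sample):
    print(node)
```

A value outside the documented set raises `KeyError`, which seems preferable to silently mislabelling a node; a real monitoring script might handle that case differently.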

@@ -61,7 +61,9 @@
<listitem> <listitem>
<simpara> <simpara>
<literal>--archive-ready</literal>: checks for WAL files which have not yet been archived <literal>--archive-ready</literal>: checks for WAL files which have not yet been archived,
and returns <literal>WARNING</literal> or <literal>CRITICAL</literal> if the number
exceeds <varname>archive_ready_warning</varname> or <varname>archive_ready_critical</varname> respectively.
</simpara> </simpara>
</listitem> </listitem>
@@ -77,6 +79,12 @@
</simpara> </simpara>
</listitem> </listitem>
<listitem>
<simpara>
<literal>--missing-slots</literal>: checks there are no missing replication slots
</simpara>
</listitem>
</itemizedlist> </itemizedlist>
</para> </para>
</refsect1> </refsect1>
@@ -101,4 +109,80 @@
</itemizedlist> </itemizedlist>
</para> </para>
</refsect1> </refsect1>
<refsect1>
<title>Exit codes</title>
<para>
When executing <command>repmgr node check</command> with one of the individual
checks listed above, &repmgr; will emit one of the following Nagios-style exit codes
(even if <literal>--nagios</literal> is not supplied):
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara>
<literal>0</literal>: OK
</simpara>
</listitem>
<listitem>
<simpara>
<literal>1</literal>: WARNING
</simpara>
</listitem>
<listitem>
<simpara>
<literal>2</literal>: ERROR
</simpara>
</listitem>
<listitem>
<simpara>
<literal>3</literal>: UNKNOWN
</simpara>
</listitem>
</itemizedlist>
</para>
<para>
The following exit codes can be emitted by <command>repmgr node check</command>
if no individual check was specified.
</para>
<variablelist>
<varlistentry>
<term><option>SUCCESS (0)</option></term>
<listitem>
<para>
No issues were detected.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><option>ERR_NODE_STATUS (25)</option></term>
<listitem>
<para>
One or more issues were detected.
</para>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1>
<title>See also</title>
<para>
<xref linkend="repmgr-node-status">, <xref linkend="repmgr-cluster-show">
</para>
</refsect1>
</refentry> </refentry>
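The two sets of exit codes added in this hunk (Nagios-style codes for an individual check, `SUCCESS`/`ERR_NODE_STATUS` otherwise) could be interpreted by a wrapper script roughly as follows. This is a sketch under assumptions: `describe_exit_code` and its `individual_check` parameter are hypothetical names, and actually running `repmgr node check` (for example via `subprocess`) is left out.

```python
# Exit codes as listed in the documentation above.
NAGIOS_STATUS = {0: "OK", 1: "WARNING", 2: "ERROR", 3: "UNKNOWN"}
GENERAL_STATUS = {0: "SUCCESS", 25: "ERR_NODE_STATUS"}


def describe_exit_code(code, individual_check):
    """Return a label for an exit code from `repmgr node check`.

    Hypothetical helper: `individual_check` should be True when one of
    the individual checks (e.g. --archive-ready) was requested, in which
    case the Nagios-style codes apply.
    """
    table = NAGIOS_STATUS if individual_check else GENERAL_STATUS
    return table.get(code, "unexpected exit code {}".format(code))


print(describe_exit_code(1, individual_check=True))
print(describe_exit_code(25, individual_check=False))
```

Keeping the two tables separate avoids conflating `0` (OK) with `0` (SUCCESS), which have the same value but belong to different documented sets.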

@@ -28,6 +28,10 @@
If the node is running and needs to be attached to the current primary, use If the node is running and needs to be attached to the current primary, use
<xref linkend="repmgr-standby-follow">. <xref linkend="repmgr-standby-follow">.
</para> </para>
<para>
Note that <xref linkend="repmgr-standby-follow"> can only be used for standbys which have not diverged
from the rest of the cluster.
</para>
</tip> </tip>
</refsect1> </refsect1>
@@ -63,10 +67,10 @@
<term><option>--force-rewind[=/path/to/pg_rewind]</option></term> <term><option>--force-rewind[=/path/to/pg_rewind]</option></term>
<listitem> <listitem>
<para> <para>
Execute <application>pg_rewind</application> if necessary. Execute <application>pg_rewind</application>.
</para> </para>
<para> <para>
It is only necessary to provide the <application>pg_rewind</application> It is only necessary to provide the <application>pg_rewind</application> path
if using PostgreSQL 9.3 or 9.4, and <application>pg_rewind</application> if using PostgreSQL 9.3 or 9.4, and <application>pg_rewind</application>
is not installed in the PostgreSQL <filename>bin</filename> directory. is not installed in the PostgreSQL <filename>bin</filename> directory.
</para> </para>
@@ -115,8 +119,26 @@
</variablelist> </variablelist>
</refsect1> </refsect1>
<refsect1> <refsect1>
<title>Configuration file settings</title>
<para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara>
<literal>node_rejoin_timeout</literal>:
the maximum length of time (in seconds) to wait for
the node to reconnect to the replication cluster (defaults to
the value set in <literal>standby_reconnect_timeout</literal>,
60 seconds).
</simpara>
</listitem>
</itemizedlist>
</para>
</refsect1>
<refsect1 id="repmgr-node-rejoin-events">
<title>Event notifications</title> <title>Event notifications</title>
<para> <para>
A <literal>node_rejoin</literal> <link linkend="event-notifications">event notification</link> will be generated. A <literal>node_rejoin</literal> <link linkend="event-notifications">event notification</link> will be generated.
@@ -171,7 +193,7 @@
</note> </note>
<para> <para>
To have <command>repmgr node rejoin</command> use <command>pg_rewind</command> if required, To have <command>repmgr node rejoin</command> use <command>pg_rewind</command>,
pass the command line option <literal>--force-rewind</literal>, which will tell &repmgr; pass the command line option <literal>--force-rewind</literal>, which will tell &repmgr;
to execute <command>pg_rewind</command> to ensure the node can be rejoined successfully. to execute <command>pg_rewind</command> to ensure the node can be rejoined successfully.
</para> </para>
@@ -204,6 +226,15 @@
INFO: pg_rewind would now be executed INFO: pg_rewind would now be executed
DETAIL: pg_rewind command is: DETAIL: pg_rewind command is:
pg_rewind -D '/var/lib/postgresql/data' --source-server='host=node1 dbname=repmgr user=repmgr'</programlisting> pg_rewind -D '/var/lib/postgresql/data' --source-server='host=node1 dbname=repmgr user=repmgr'</programlisting>
<note>
<para>
If <option>--force-rewind</option> is used with the <option>--dry-run</option> option,
this checks the prerequisites for using <application>pg_rewind</application>, but cannot
predict the outcome of actually executing <application>pg_rewind</application>.
</para>
</note>
<programlisting> <programlisting>
$ repmgr node rejoin -f /etc/repmgr.conf -d 'host=node1 dbname=repmgr user=repmgr' \ $ repmgr node rejoin -f /etc/repmgr.conf -d 'host=node1 dbname=repmgr user=repmgr' \
--force-rewind --config-files=postgresql.local.conf,postgresql.conf --verbose --force-rewind --config-files=postgresql.local.conf,postgresql.conf --verbose

@@ -52,10 +52,40 @@
</para> </para>
</refsect1> </refsect1>
<refsect1>
<title>Exit codes</title>
<para>
The following exit codes can be emitted by <command>repmgr node status</command>:
</para>
<variablelist>
<varlistentry>
<term><option>SUCCESS (0)</option></term>
<listitem>
<para>
No issues were detected.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><option>ERR_NODE_STATUS (25)</option></term>
<listitem>
<para>
One or more issues were detected.
</para>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1> <refsect1>
<title>See also</title> <title>See also</title>
<para> <para>
See <xref linkend="repmgr-node-check"> to diagnose issues. See <xref linkend="repmgr-node-check"> to diagnose issues and <xref linkend="repmgr-cluster-show">
for an overview of all nodes in the cluster.
</para> </para>
</refsect1> </refsect1>
</refentry> </refentry>

@@ -17,7 +17,7 @@
<title>Description</title> <title>Description</title>
<para> <para>
<command>repmgr primary register</command> registers a primary node in a <command>repmgr primary register</command> registers a primary node in a
streaming replication cluster, and configures it for use with repmgr, including streaming replication cluster, and configures it for use with &repmgr;, including
installing the &repmgr; extension. This command needs to be executed before any installing the &repmgr; extension. This command needs to be executed before any
standby nodes are registered. standby nodes are registered.
</para> </para>
@@ -75,10 +75,18 @@
</refsect1> </refsect1>
<refsect1> <refsect1 id="repmgr-primary-register-events">
<title>Event notifications</title> <title>Event notifications</title>
<para> <para>
A <literal>primary_register</literal> <link linkend="event-notifications">event notification</link> will be generated. The following <link linkend="event-notifications">event notifications</link> will be generated:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara><literal>cluster_created</literal></simpara>
</listitem>
<listitem>
<simpara><literal>primary_register</literal></simpara>
</listitem>
</itemizedlist>
</para> </para>
</refsect1> </refsect1>

@@ -64,7 +64,7 @@
</refsect1> </refsect1>
<refsect1> <refsect1 id="repmgr-primary-unregister-events">
<title>Event notifications</title> <title>Event notifications</title>
<para> <para>
A <literal>primary_unregister</literal> <link linkend="event-notifications">event notification</link> will be generated. A <literal>primary_unregister</literal> <link linkend="event-notifications">event notification</link> will be generated.

@@ -49,7 +49,7 @@
not be copied by default. &repmgr; can copy these files, either to the same not be copied by default. &repmgr; can copy these files, either to the same
location on the standby server (provided appropriate directory and file permissions location on the standby server (provided appropriate directory and file permissions
are available), or into the standby's data directory. This requires passwordless are available), or into the standby's data directory. This requires passwordless
SSH access to the primary server. Add the option <literal>--copy-external-config-files</literal> SSH access to the primary server. Add the option <option>--copy-external-config-files</option>
to the <command>repmgr standby clone</command> command; by default files will be copied to to the <command>repmgr standby clone</command> command; by default files will be copied to
the same path as on the upstream server. Note that the user executing <command>repmgr</command> the same path as on the upstream server. Note that the user executing <command>repmgr</command>
must have write access to those directories. must have write access to those directories.
@@ -59,12 +59,29 @@
<literal>--copy-external-config-files=pgdata</literal>, but note that <literal>--copy-external-config-files=pgdata</literal>, but note that
any include directives in the copied files may need to be updated. any include directives in the copied files may need to be updated.
</para> </para>
<note>
<para>
When executing <command>repmgr standby clone</command> with the
<option>--copy-external-config-files</option> and <option>--dry-run</option>
options, &repmgr; will check the SSH connection to the source node, but
will not verify whether the files can actually be copied.
</para>
<para>
During the actual clone operation, a check will be made before the database itself
is cloned to determine whether the files can actually be copied; if any problems are
encountered, the clone operation will be aborted, enabling the user to fix
any issues before retrying the clone operation.
</para>
</note>
<tip> <tip>
<simpara> <simpara>
For reliable configuration file management we recommend using a For reliable configuration file management we recommend using a
configuration management tool such as Ansible, Chef, Puppet or Salt. configuration management tool such as Ansible, Chef, Puppet or Salt.
</simpara> </simpara>
</tip> </tip>
</refsect1> </refsect1>
<refsect1 id="repmgr-standby-clone-recovery-conf"> <refsect1 id="repmgr-standby-clone-recovery-conf">
@@ -213,6 +230,15 @@
<variablelist> <variablelist>
<varlistentry>
<term><option>-d, --dbname=CONNINFO</option></term>
<listitem>
<para>
Connection string of the upstream node to use for cloning.
</para>
</listitem>
</varlistentry>
<varlistentry> <varlistentry>
<term><option>--dry-run</option></term> <term><option>--dry-run</option></term>
<listitem> <listitem>
@@ -324,7 +350,7 @@
</variablelist> </variablelist>
</refsect1> </refsect1>
<refsect1> <refsect1 id="repmgr-standby-clone-events">
<title>Event notifications</title> <title>Event notifications</title>
<para> <para>
A <literal>standby_clone</literal> <link linkend="event-notifications">event notification</link> will be generated. A <literal>standby_clone</literal> <link linkend="event-notifications">event notification</link> will be generated.

@@ -94,7 +94,7 @@
</variablelist> </variablelist>
</refsect1> </refsect1>
<refsect1> <refsect1 id="repmgr-standby-follow-events">
<title>Event notifications</title> <title>Event notifications</title>
<para> <para>
A <literal>standby_follow</literal> <link linkend="event-notifications">event notification</link> will be generated. A <literal>standby_follow</literal> <link linkend="event-notifications">event notification</link> will be generated.

@@ -50,7 +50,7 @@
</refsect1> </refsect1>
<refsect1> <refsect1 id="repmgr-standby-promote-events">
<title>Event notifications</title> <title>Event notifications</title>
<para> <para>
A <literal>standby_promote</literal> <link linkend="event-notifications">event notification</link> will be generated. A <literal>standby_promote</literal> <link linkend="event-notifications">event notification</link> will be generated.

@@ -159,7 +159,7 @@
</variablelist> </variablelist>
</refsect1> </refsect1>
<refsect1> <refsect1 id="repmgr-standby-register-events">
<title>Event notifications</title> <title>Event notifications</title>
<para> <para>
A <literal>standby_register</literal> <link linkend="event-notifications">event notification</link> A <literal>standby_register</literal> <link linkend="event-notifications">event notification</link>

@@ -46,6 +46,9 @@
<application>repmgrd</application> should not be active on any nodes while a switchover is being <application>repmgrd</application> should not be active on any nodes while a switchover is being
executed. This restriction may be lifted in a later version. executed. This restriction may be lifted in a later version.
</para> </para>
<para>
&repmgr; will not perform the switchover if an exclusive backup is running on the current primary.
</para>
</note> </note>
</refsect1> </refsect1>
@@ -163,8 +166,8 @@
<listitem> <listitem>
<simpara> <simpara>
<literal>standby_reconnect_timeout</literal>: <literal>standby_reconnect_timeout</literal>:
Number of seconds to attempt to reconnect to the demoted primary number of seconds to attempt to wait for the demoted primary
once it has been restarted. to reconnect to the promoted primary (default: 60 seconds)
</simpara> </simpara>
</listitem> </listitem>
@@ -193,7 +196,7 @@
</para> </para>
</refsect1> </refsect1>
<refsect1> <refsect1 id="repmgr-standby-switchover-events">
<title>Event notifications</title> <title>Event notifications</title>
<para> <para>
<literal>standby_switchover</literal> and <literal>standby_promote</literal> <literal>standby_switchover</literal> and <literal>standby_promote</literal>

@@ -59,7 +59,7 @@
</variablelist> </variablelist>
</refsect1> </refsect1>
<refsect1> <refsect1 id="repmgr-standby-unregister-events">
<title>Event notifications</title> <title>Event notifications</title>
<para> <para>
A <literal>standby_unregister</literal> <link linkend="event-notifications">event notification</link> will be generated. A <literal>standby_unregister</literal> <link linkend="event-notifications">event notification</link> will be generated.

@@ -50,7 +50,7 @@
</refsect1> </refsect1>
<refsect1> <refsect1 id="repmgr-witness-register-events">
<title>Event notifications</title> <title>Event notifications</title>
<para> <para>
A <literal>witness_register</literal> <link linkend="event-notifications">event notification</link> will be generated. A <literal>witness_register</literal> <link linkend="event-notifications">event notification</link> will be generated.

@@ -20,7 +20,10 @@
</para> </para>
<para> <para>
The node does not have to be running to be unregistered, however if this is the The node does not have to be running to be unregistered, however if this is the
case then connection information for the primary server must be provided. case then either provide connection information for the primary server, or
execute <command>repmgr witness unregister</command> on a running node and
provide the parameter <option>--node-id</option> with the node ID of the
witness server.
</para> </para>
<para> <para>
Execute with the <literal>--dry-run</literal> option to check what would happen Execute with the <literal>--dry-run</literal> option to check what would happen
@@ -36,17 +39,17 @@
INFO: connecting to witness node "node3" (ID: 3) INFO: connecting to witness node "node3" (ID: 3)
INFO: unregistering witness node 3 INFO: unregistering witness node 3
INFO: witness unregistration complete INFO: witness unregistration complete
DETAIL: witness node with id 3 (conninfo: host=node3 dbname=repmgr user=repmgr port=5499) successfully unregistered</programlisting> DETAIL: witness node with ID 3 successfully unregistered</programlisting>
</para> </para>
<para> <para>
Unregistering a non-running witness node: Unregistering a non-running witness node:
<programlisting> <programlisting>
$ repmgr -f /etc/repmgr.conf witness unregister -h node1 -p 5501 -F $ repmgr -f /etc/repmgr.conf witness unregister -h node1 -p 5501 -F
INFO: connecting to witness node "node3" (ID: 3) INFO: connecting to node "node3" (ID: 3)
NOTICE: unable to connect to witness node "node3" (ID: 3), removing node record on cluster primary only NOTICE: unable to connect to node "node3" (ID: 3), removing node record on cluster primary only
INFO: unregistering witness node 3 INFO: unregistering witness node 3
INFO: witness unregistration complete INFO: witness unregistration complete
DETAIL: witness node with id 3 (conninfo: host=node3 dbname=repmgr user=repmgr port=5499) successfully unregistered</programlisting> DETAIL: witness node with ID 3 successfully unregistered</programlisting>
</para> </para>
</refsect1> </refsect1>
@@ -62,8 +65,34 @@
</para> </para>
</refsect1> </refsect1>
<refsect1> <refsect1>
<title>Options</title>
<variablelist>
<varlistentry>
<term><option>--dry-run</option></term>
<listitem>
<para>
Check prerequisites but don't actually unregister the witness.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><option>--node-id</option></term>
<listitem>
<para>
Unregister witness server with the specified node ID.
</para>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1 id="repmgr-witness-unregister-events">
<title>Event notifications</title> <title>Event notifications</title>
<para> <para>
A <literal>witness_unregister</literal> <link linkend="event-notifications">event notification</link> will be generated. A <literal>witness_unregister</literal> <link linkend="event-notifications">event notification</link> will be generated.

@@ -15,7 +15,7 @@
</para> </para>
<note> <note>
<simpara> <simpara>
Due to the nature of BDR, it's only safe to use this solution for Due to the nature of BDR 1.x/2.x, it's only safe to use this solution for
a two-node scenario. Introducing additional nodes will create an inherent a two-node scenario. Introducing additional nodes will create an inherent
risk of node desynchronisation if a node goes down without being cleanly risk of node desynchronisation if a node goes down without being cleanly
removed from the cluster. removed from the cluster.


@@ -24,7 +24,7 @@
<para> <para>
To use <application>repmgrd</application>, its associated function library <emphasis>must</emphasis> be To use <application>repmgrd</application>, its associated function library <emphasis>must</emphasis> be
included in <filename>postgresql.conf</filename> with: included via <filename>postgresql.conf</filename> with:
<programlisting> <programlisting>
shared_preload_libraries = 'repmgr'</programlisting> shared_preload_libraries = 'repmgr'</programlisting>
@@ -34,23 +34,6 @@
the <ulink url="https://www.postgresql.org/docs/current/static/runtime-config-client.html#GUC-SHARED-PRELOAD-LIBRARIES">PostgreSQL documentation</ulink>. the <ulink url="https://www.postgresql.org/docs/current/static/runtime-config-client.html#GUC-SHARED-PRELOAD-LIBRARIES">PostgreSQL documentation</ulink>.
</para> </para>
<para>
To apply configuration file changes to a running <application>repmgrd</application>
daemon, execute the operating system's r<application>repmgrd</application> service reload command
(see <xref linkend="appendix-packages"> for examples),
or for instances which were manually started, execute <command>kill -HUP</command>, e.g.
<command>kill -HUP `cat /tmp/repmgrd.pid`</command>.
</para>
<note>
<para>
Check the <application>repmgrd</application> log to see what changes were
applied, or if any issues were encountered when reloading the configuration.
</para>
</note>
<para>
Note that only a subset of configuration file parameters can be changed on a
running <application>repmgrd</application> daemon.
</para>
<sect2 id="repmgrd-automatic-failover-configuration"> <sect2 id="repmgrd-automatic-failover-configuration">
<title>automatic failover configuration</title> <title>automatic failover configuration</title>
@@ -63,8 +46,17 @@
follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'</programlisting> follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'</programlisting>
</para> </para>
<para> <para>
Adjust file paths as appropriate; we recomment specifying the full path to the &repmgr; binary. Adjust file paths as appropriate; always specify the full path to the &repmgr; binary.
</para> </para>
<note>
<para>
&repmgr; will not apply <option>pg_bindir</option> when executing <option>promote_command</option>
or <option>follow_command</option>; these can be user-defined scripts so must always be
specified with the full path.
</para>
</note>
<para> <para>
Note that the <literal>--log-to-file</literal> option will cause Note that the <literal>--log-to-file</literal> option will cause
output generated by the &repmgr; command, when executed by <application>repmgrd</application>, output generated by the &repmgr; command, when executed by <application>repmgrd</application>,
@@ -130,11 +122,11 @@
particularly on <application>systemd</application>-based systems. particularly on <application>systemd</application>-based systems.
</para> </para>
<para> <para>
For more details, see <xref linkend="configuration-service-commands">. For more details, see <xref linkend="configuration-file-service-commands">.
</para> </para>
</sect2> </sect2>
<sect2 id="repmgrd-monitoring-configuration"> <sect2 id="repmgrd-monitoring-configuration" xreflabel="repmgrd monitoring configuration">
<indexterm> <indexterm>
<primary>repmgrd</primary> <primary>repmgrd</primary>
<secondary>monitoring configuration</secondary> <secondary>monitoring configuration</secondary>
@@ -157,6 +149,203 @@
</para> </para>
</sect2> </sect2>
<sect2 id="repmgrd-reloading-configuration" xreflabel="reloading repmgrd configuration">
<indexterm>
<primary>repmgrd</primary>
<secondary>applying configuration changes</secondary>
</indexterm>
<title>Applying configuration changes to repmgrd</title>
<para>
To apply configuration file changes to a running <application>repmgrd</application>
daemon, execute the operating system's <application>repmgrd</application> service reload command
(see <xref linkend="appendix-packages"> for examples),
or for instances which were manually started, execute <command>kill -HUP</command>, e.g.
<command>kill -HUP `cat /tmp/repmgrd.pid`</command>.
</para>
<tip>
<para>
Check the <application>repmgrd</application> log to see what changes were
applied, or if any issues were encountered when reloading the configuration.
</para>
</tip>
<para>
Note that only the following subset of configuration file parameters can be changed on a
running <application>repmgrd</application> daemon:
</para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara>
<varname>async_query_timeout</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>bdr_local_monitoring_only</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>bdr_recovery_timeout</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>conninfo</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>degraded_monitoring_timeout</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>event_notification_command</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>event_notifications</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>failover</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>follow_command</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>log_facility</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>log_file</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>log_level</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>log_status_interval</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>monitor_interval_secs</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>monitoring_history</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>primary_notification_timeout</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>promote_command</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>reconnect_attempts</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>reconnect_interval</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>repmgrd_standby_startup_timeout</varname>
</simpara>
</listitem>
</itemizedlist>
<para>
The following set of configuration file parameters must be updated via
<command><link linkend="repmgr-standby-register">repmgr standby register --force</link></command>,
as they require changes to the <literal>repmgr.nodes</literal> table so they are visible to
all nodes in the replication cluster:
</para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara>
<varname>node_id</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>node_name</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>data_directory</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>location</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>priority</varname>
</simpara>
</listitem>
</itemizedlist>
<note>
<para>
After executing <command><link linkend="repmgr-standby-register">repmgr standby register --force</link></command>,
<application>repmgrd</application> <emphasis>must</emphasis> be restarted for the changes to take effect.
</para>
</note>
</sect2>
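The manual reload step described in this section (`kill -HUP` against the PID stored in the PID file) can be sketched in C as follows. This is an illustrative sketch under stated assumptions, not part of repmgr: the helper names `read_pid_file` and `reload_repmgrd` are hypothetical, and the PID file is assumed to contain a single decimal process ID.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: read a process ID from a PID file.
 * Assumes the file holds one decimal PID. Returns -1 on any failure.
 */
pid_t
read_pid_file(const char *path)
{
	FILE	   *fp;
	long		pid = -1;

	assert(path != NULL);

	fp = fopen(path, "r");
	if (fp == NULL)
		return -1;

	/* reject unparseable or non-positive values */
	if (fscanf(fp, "%ld", &pid) != 1 || pid <= 0)
		pid = -1;

	fclose(fp);
	return (pid_t) pid;
}

/*
 * Hypothetical helper: equivalent of "kill -HUP `cat <pid_file>`".
 * Returns 0 if the signal was sent, -1 otherwise.
 */
int
reload_repmgrd(const char *pid_file)
{
	pid_t		pid = read_pid_file(pid_file);

	if (pid < 0)
		return -1;

	return kill(pid, SIGHUP);
}
```

As the tip above notes, the repmgrd log remains the place to confirm which changes were actually applied after the signal is delivered.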
</sect1> </sect1>
<sect1 id="repmgrd-daemon"> <sect1 id="repmgrd-daemon">
@@ -177,10 +366,63 @@
<para> <para>
<application>repmgrd</application> can be started manually like this: <application>repmgrd</application> can be started manually like this:
<programlisting> <programlisting>
repmgrd -f /etc/repmgr.conf --pid-file /tmp/repmgrd.pid --daemonize</programlisting> repmgrd -f /etc/repmgr.conf --pid-file /tmp/repmgrd.pid</programlisting>
and stopped with <command>kill `cat /tmp/repmgrd.pid`</command>. Adjust paths as appropriate. and stopped with <command>kill `cat /tmp/repmgrd.pid`</command>. Adjust paths as appropriate.
</para> </para>
<sect2 id="repmgrd-pid-file" xreflabel="repmgrd's PID file">
<indexterm>
<primary>repmgrd</primary>
<secondary>PID file</secondary>
</indexterm>
<indexterm>
<primary>PID file</primary>
<secondary>repmgrd</secondary>
</indexterm>
<title>repmgrd's PID file</title>
<para>
<application>repmgrd</application> will generate a PID file by default.
</para>
<note>
<simpara>
This is a behaviour change from previous versions (earlier than 4.1), where
the PID file had to be explicitly specified with the command line
parameter <option>--pid-file</option>.
</simpara>
</note>
<para>
The PID file can be specified in <filename>repmgr.conf</filename> with the configuration
parameter <varname>repmgrd_pid_file</varname>.
</para>
<para>
It can also be specified on the command line (as in previous versions) with
the command line parameter <option>--pid-file</option>. Note this will override
any value set in <filename>repmgr.conf</filename> with <varname>repmgrd_pid_file</varname>.
<option>--pid-file</option> may be deprecated in future releases.
</para>
<para>
If a PID file location was specified by the package maintainer, <application>repmgrd</application>
will use that. This only applies if &repmgr; was installed from a package and the package
maintainer has specified the PID file location.
</para>
<para>
If none of the above apply, <application>repmgrd</application> will create a PID file
in the operating system's temporary directory (as determined by the environment variable
<varname>TMPDIR</varname>, or if that is not set, will use <filename>/tmp</filename>).
</para>
<para>
To prevent a PID file being generated at all, provide the command line option
<option>--no-pid-file</option>.
</para>
<para>
To see which PID file <application>repmgrd</application> would use, execute <application>repmgrd</application>
with the option <option>--show-pid-file</option>. <application>repmgrd</application>
will not start if this option is provided. Note that the value shown is the
file <application>repmgrd</application> would use next time it starts, and is
not necessarily the PID file currently in use.
</para>
</sect2>
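The PID file lookup order described in this section can be sketched in C as follows. This is an illustrative sketch only, not repmgrd's implementation: the function `resolve_pid_file`, its parameters and the fallback file name `repmgrd.pid` are assumptions made for the example. Only the order of precedence (command line option, `repmgrd_pid_file`, package default, then `TMPDIR` or `/tmp`) is taken from the text.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* true if the string is neither NULL nor empty */
static int
is_set(const char *s)
{
	return s != NULL && s[0] != '\0';
}

/*
 * Hypothetical helper: choose the PID file path.
 * Pass NULL or "" for any source that was not provided.
 * "buf" is only used for the temporary-directory fallback.
 * Returns NULL if the fallback path does not fit into "buf".
 */
const char *
resolve_pid_file(const char *cli_pid_file,		/* --pid-file */
				 const char *conf_pid_file,		/* repmgrd_pid_file */
				 const char *package_default,	/* set by the packager */
				 char *buf, size_t buflen)
{
	const char *tmpdir;
	int			n;

	assert(buf != NULL && buflen > 0);

	if (is_set(cli_pid_file))
		return cli_pid_file;
	if (is_set(conf_pid_file))
		return conf_pid_file;
	if (is_set(package_default))
		return package_default;

	/* fall back to TMPDIR, or /tmp if TMPDIR is not set */
	tmpdir = getenv("TMPDIR");
	if (!is_set(tmpdir))
		tmpdir = "/tmp";

	/* ASSUMPTION: the file name "repmgrd.pid" is chosen for illustration */
	n = snprintf(buf, buflen, "%s/repmgrd.pid", tmpdir);
	if (n < 0 || (size_t) n >= buflen)
		return NULL;

	return buf;
}
```

To see which file the real daemon would use, the `--show-pid-file` option documented above is the authoritative check.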
<sect2 id="repmgrd-configuration-debian-ubuntu"> <sect2 id="repmgrd-configuration-debian-ubuntu">
<indexterm> <indexterm>
<primary>repmgrd</primary> <primary>repmgrd</primary>
@@ -269,25 +511,34 @@ REPMGRD_ENABLED=no
<secondary>repmgrd</secondary> <secondary>repmgrd</secondary>
</indexterm> </indexterm>
<indexterm>
<primary>repmgrd</primary>
<secondary>log rotation</secondary>
</indexterm>
<title>repmgrd log rotation</title> <title>repmgrd log rotation</title>
<para> <para>
To ensure the current <application>repmgrd</application> logfile To ensure the current <application>repmgrd</application> logfile
(specified in <filename>repmgr.conf</filename> with the parameter (specified in <filename>repmgr.conf</filename> with the parameter
<option>log_file</option> does not grow indefinitely, configure your <option>log_file</option>) does not grow indefinitely, configure your
system's <command>logrotate</command> to regularly rotate it. system's <command>logrotate</command> to regularly rotate it.
</para> </para>
<para> <para>
Sample configuration to rotate logfiles weekly with retention for Sample configuration to rotate logfiles weekly with retention for
up to 52 weeks and rotation forced if a file grows beyond 100Mb: up to 52 weeks and rotation forced if a file grows beyond 100Mb:
<programlisting> <programlisting>
/var/log/postgresql/repmgr-9.6.log { /var/log/repmgr/repmgrd.log {
missingok missingok
compress compress
rotate 52 rotate 52
maxsize 100M maxsize 100M
weekly weekly
create 0600 postgres postgres create 0600 postgres postgres
postrotate
/usr/bin/killall -HUP repmgrd
endscript
}</programlisting> }</programlisting>
</para> </para>
</sect1> </sect1>
</chapter> </chapter>


@@ -1,4 +1,4 @@
<chapter id="repmgrd-degraded-monitoring"> <chapter id="repmgrd-degraded-monitoring" xreflabel="repmgrd degraded monitoring">
<indexterm> <indexterm>
<primary>repmgrd</primary> <primary>repmgrd</primary>
<secondary>degraded monitoring</secondary> <secondary>degraded monitoring</secondary>
@@ -7,8 +7,8 @@
<title>"degraded monitoring" mode</title> <title>"degraded monitoring" mode</title>
<para> <para>
In certain circumstances, <application>repmgrd</application> is not able to fulfill its primary mission In certain circumstances, <application>repmgrd</application> is not able to fulfill its primary mission
of monitoring the nodes' upstream server. In these cases it enters "degraded of monitoring the node's upstream server. In these cases it enters &quot;degraded monitoring&quot;
monitoring" mode, where <application>repmgrd</application> remains active but is waiting for the situation mode, where <application>repmgrd</application> remains active but is waiting for the situation
to be resolved. to be resolved.
</para> </para>
<para> <para>


@@ -1,4 +1,4 @@
<chapter id="repmgrd-monitoring"> <chapter id="repmgrd-monitoring" xreflabel="Monitoring with repmgrd">
<indexterm> <indexterm>
<primary>repmgrd</primary> <primary>repmgrd</primary>
<secondary>monitoring</secondary> <secondary>monitoring</secondary>


@@ -40,8 +40,8 @@
In a failover situation, <application>repmgrd</application> will check if any servers in the In a failover situation, <application>repmgrd</application> will check if any servers in the
same location as the current primary node are visible. If not, <application>repmgrd</application> same location as the current primary node are visible. If not, <application>repmgrd</application>
will assume a network interruption and not promote any node in any will assume a network interruption and not promote any node in any
other location (it will however enter <xref linkend="repmgrd-degraded-monitoring"> mode until other location (it will however enter <link linkend="repmgrd-degraded-monitoring">degraded monitoring</link>
a primary becomes visible). mode until a primary becomes visible).
</para> </para>
</chapter> </chapter>


@@ -57,7 +57,14 @@
<para> <para>
As mentioned in the previous section, success of the switchover operation depends on As mentioned in the previous section, success of the switchover operation depends on
&repmgr; being able to shut down the current primary server quickly and cleanly. &repmgr; being able to shut down the current primary server quickly and cleanly.
</para>
<para>
Ensure that the promotion candidate has sufficient free walsenders available
(PostgreSQL configuration item <varname>max_wal_senders</varname>), and if replication
slots are in use, at least one free slot is available for the demotion candidate
(PostgreSQL configuration item <varname>max_replication_slots</varname>).
</para> </para>
<para> <para>
@@ -104,7 +111,7 @@
server. server.
</para> </para>
<para> <para>
For more details, see <xref linkend="configuration-service-commands">. For more details, see <xref linkend="configuration-file-service-commands">.
</para> </para>
</important> </important>
@@ -121,15 +128,21 @@
</simpara> </simpara>
</note> </note>
<para> <para>
Check that access from applications is minimized or preferably blocked Check that access from applications is minimized or preferably blocked
completely, so applications are not unexpectedly interrupted. completely, so applications are not unexpectedly interrupted.
</para> </para>
<note>
<para>
If an exclusive backup is running on the current primary, &repmgr; will not perform the
switchover.
</para>
</note>
<para> <para>
Check there is no significant replication lag on standbys attached to the Check there is no significant replication lag on standbys attached to the
current primary. current primary.
</para> </para>
<para> <para>
@@ -147,6 +160,7 @@
</para> </para>
</note> </note>
<para> <para>
Finally, consider executing <command>repmgr standby switchover</command> with the Finally, consider executing <command>repmgr standby switchover</command> with the
<literal>--dry-run</literal> option; this will perform any necessary checks and inform you about <literal>--dry-run</literal> option; this will perform any necessary checks and inform you about


@@ -29,8 +29,18 @@
</listitem> </listitem>
<listitem> <listitem>
<simpara> <simpara>
In the database where the &repmgr; extension is installed, execute <application>repmgrd</application> (if running) must be restarted.
<command>ALTER EXTENSION repmgr UPDATE</command>. </simpara>
</listitem>
<listitem>
<simpara>
For major releases, e.g. from <literal>4.0.x</literal> to <literal>4.1</literal>,
execute <command>ALTER EXTENSION repmgr UPDATE</command>
on the primary node in the database where the &repmgr; extension is installed.
</simpara>
<simpara>
This will update the extension metadata and, if necessary, apply
changes to the &repmgr; extension objects.
</simpara> </simpara>
</listitem> </listitem>
</orderedlist> </orderedlist>
@@ -41,10 +51,6 @@
release as they may contain upgrade instructions particular to individual versions. release as they may contain upgrade instructions particular to individual versions.
</para> </para>
<para>
If the <application>repmgrd</application> daemon is in use, we recommend stopping it
before upgrading &repmgr;.
</para>
<para> <para>
Note that it may be necessary to restart the PostgreSQL server if the upgrade contains Note that it may be necessary to restart the PostgreSQL server if the upgrade contains
changes to the shared object file used by <application>repmgrd</application>; check the changes to the shared object file used by <application>repmgrd</application>; check the


@@ -1 +1 @@
<!ENTITY repmgrversion "4.0.6"> <!ENTITY repmgrversion "4.1.1">


@@ -46,6 +46,6 @@
#define ERR_SWITCHOVER_INCOMPLETE 22 #define ERR_SWITCHOVER_INCOMPLETE 22
#define ERR_FOLLOW_FAIL 23 #define ERR_FOLLOW_FAIL 23
#define ERR_REJOIN_FAIL 24 #define ERR_REJOIN_FAIL 24
#define ERR_CLUSTER_CHECK 25 #define ERR_NODE_STATUS 25
#endif /* _ERRCODE_H_ */ #endif /* _ERRCODE_H_ */

12
log.c

@@ -42,7 +42,7 @@ _stderr_log_with_level(const char *level_name, int level, const char *fmt, va_li
__attribute__((format(PG_PRINTF_ATTRIBUTE, 3, 0))); __attribute__((format(PG_PRINTF_ATTRIBUTE, 3, 0)));
int log_type = REPMGR_STDERR; int log_type = REPMGR_STDERR;
int log_level = LOG_NOTICE; int log_level = LOG_INFO;
int last_log_level = LOG_INFO; int last_log_level = LOG_INFO;
int verbose_logging = false; int verbose_logging = false;
int terse_logging = false; int terse_logging = false;
@@ -70,7 +70,7 @@ _stderr_log_with_level(const char *level_name, int level, const char *fmt, va_li
/* /*
* Store the requested level so that if there's a subsequent log_hint() or * Store the requested level so that if there's a subsequent log_hint() or
* log_detail(), we can suppress that if appropriate. * log_detail(), we can suppress that if --terse was specified.
*/ */
last_log_level = level; last_log_level = level;
@@ -329,6 +329,13 @@ logger_set_terse(void)
} }
void
logger_set_level(int new_log_level)
{
log_level = new_log_level;
}
void void
logger_set_min_level(int min_log_level) logger_set_min_level(int min_log_level)
{ {
@@ -336,6 +343,7 @@ logger_set_min_level(int min_log_level)
log_level = min_log_level; log_level = min_log_level;
} }
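The difference between the new `logger_set_level()` and the existing `logger_set_min_level()` can be illustrated with the standalone sketch below. It is a simplified model, not the code in `log.c`: the `sketch_` prefixed names are hypothetical, and the guard in `sketch_logger_set_min_level()` is an assumption, because the corresponding line of the real function is not visible in this hunk. Syslog levels are used, where a larger number means more verbose output.

```c
#include <assert.h>
#include <syslog.h>

/* mirrors the new default shown in this diff (LOG_INFO) */
static int	sketch_log_level = LOG_INFO;

/* set the level unconditionally, as logger_set_level() does above */
void
sketch_logger_set_level(int new_log_level)
{
	assert(new_log_level >= LOG_EMERG && new_log_level <= LOG_DEBUG);
	sketch_log_level = new_log_level;
}

/*
 * ASSUMPTION: only raise verbosity, never lower it. The real guard in
 * logger_set_min_level() is not shown in the hunk.
 */
void
sketch_logger_set_min_level(int min_log_level)
{
	if (min_log_level > sketch_log_level)
		sketch_log_level = min_log_level;
}

/* hypothetical accessor, present only so the sketch can be inspected */
int
sketch_logger_get_level(void)
{
	return sketch_log_level;
}
```

Under that assumption, a caller that needs an exact level would use the unconditional setter, while a caller that only wants to guarantee a minimum verbosity would use the guarded one.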
int int
detect_log_level(const char *level) detect_log_level(const char *level)
{ {

1
log.h

@@ -129,6 +129,7 @@ bool logger_shutdown(void);
void logger_set_verbose(void); void logger_set_verbose(void);
void logger_set_terse(void); void logger_set_terse(void);
void logger_set_min_level(int min_log_level); void logger_set_min_level(int min_log_level);
void logger_set_level(int new_log_level);
void void
log_detail(const char *fmt,...) log_detail(const char *fmt,...)

2
repmgr--4.0--4.1.sql Normal file

@@ -0,0 +1,2 @@
-- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION repmgr" to load this file. \quit

167
repmgr--4.1.sql Normal file

@@ -0,0 +1,167 @@
-- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION repmgr" to load this file. \quit
CREATE TABLE repmgr.nodes (
node_id INTEGER PRIMARY KEY,
upstream_node_id INTEGER NULL REFERENCES nodes (node_id) DEFERRABLE,
active BOOLEAN NOT NULL DEFAULT TRUE,
node_name TEXT NOT NULL,
type TEXT NOT NULL CHECK (type IN('primary','standby','witness','bdr')),
location TEXT NOT NULL DEFAULT 'default',
priority INT NOT NULL DEFAULT 100,
conninfo TEXT NOT NULL,
repluser VARCHAR(63) NOT NULL,
slot_name TEXT NULL,
config_file TEXT NOT NULL
);
CREATE TABLE repmgr.events (
node_id INTEGER NOT NULL,
event TEXT NOT NULL,
successful BOOLEAN NOT NULL DEFAULT TRUE,
event_timestamp TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
details TEXT NULL
);
DO $repmgr$
DECLARE
DECLARE server_version_num INT;
BEGIN
SELECT setting
FROM pg_catalog.pg_settings
WHERE name = 'server_version_num'
INTO server_version_num;
IF server_version_num >= 90400 THEN
EXECUTE $repmgr_func$
CREATE TABLE repmgr.monitoring_history (
primary_node_id INTEGER NOT NULL,
standby_node_id INTEGER NOT NULL,
last_monitor_time TIMESTAMP WITH TIME ZONE NOT NULL,
last_apply_time TIMESTAMP WITH TIME ZONE,
last_wal_primary_location PG_LSN NOT NULL,
last_wal_standby_location PG_LSN,
replication_lag BIGINT NOT NULL,
apply_lag BIGINT NOT NULL
)
$repmgr_func$;
ELSE
EXECUTE $repmgr_func$
CREATE TABLE repmgr.monitoring_history (
primary_node_id INTEGER NOT NULL,
standby_node_id INTEGER NOT NULL,
last_monitor_time TIMESTAMP WITH TIME ZONE NOT NULL,
last_apply_time TIMESTAMP WITH TIME ZONE,
last_wal_primary_location TEXT NOT NULL,
last_wal_standby_location TEXT,
replication_lag BIGINT NOT NULL,
apply_lag BIGINT NOT NULL
)
$repmgr_func$;
END IF;
END$repmgr$;
CREATE INDEX idx_monitoring_history_time
ON repmgr.monitoring_history (last_monitor_time, standby_node_id);
CREATE VIEW repmgr.show_nodes AS
SELECT n.node_id,
n.node_name,
n.active,
n.upstream_node_id,
un.node_name AS upstream_node_name,
n.type,
n.priority,
n.conninfo
FROM repmgr.nodes n
LEFT JOIN repmgr.nodes un
ON un.node_id = n.upstream_node_id;
/* XXX update upgrade scripts! */
CREATE TABLE repmgr.voting_term (
term INT NOT NULL
);
CREATE UNIQUE INDEX voting_term_restrict
ON repmgr.voting_term ((TRUE));
CREATE RULE voting_term_delete AS
ON DELETE TO repmgr.voting_term
DO INSTEAD NOTHING;
/* ================= */
/* repmgrd functions */
/* ================= */
/* monitoring functions */
CREATE FUNCTION set_local_node_id(INT)
RETURNS VOID
AS 'MODULE_PATHNAME', 'set_local_node_id'
LANGUAGE C STRICT;
CREATE FUNCTION get_local_node_id()
RETURNS INT
AS 'MODULE_PATHNAME', 'get_local_node_id'
LANGUAGE C STRICT;
CREATE FUNCTION standby_set_last_updated()
RETURNS TIMESTAMP WITH TIME ZONE
AS 'MODULE_PATHNAME', 'standby_set_last_updated'
LANGUAGE C STRICT;
CREATE FUNCTION standby_get_last_updated()
RETURNS TIMESTAMP WITH TIME ZONE
AS 'MODULE_PATHNAME', 'standby_get_last_updated'
LANGUAGE C STRICT;
/* failover functions */
CREATE FUNCTION notify_follow_primary(INT)
RETURNS VOID
AS 'MODULE_PATHNAME', 'notify_follow_primary'
LANGUAGE C STRICT;
CREATE FUNCTION get_new_primary()
RETURNS INT
AS 'MODULE_PATHNAME', 'get_new_primary'
LANGUAGE C STRICT;
CREATE FUNCTION reset_voting_status()
RETURNS VOID
AS 'MODULE_PATHNAME', 'reset_voting_status'
LANGUAGE C STRICT;
CREATE FUNCTION am_bdr_failover_handler(INT)
RETURNS BOOL
AS 'MODULE_PATHNAME', 'am_bdr_failover_handler'
LANGUAGE C STRICT;
CREATE FUNCTION unset_bdr_failover_handler()
RETURNS VOID
AS 'MODULE_PATHNAME', 'unset_bdr_failover_handler'
LANGUAGE C STRICT;
CREATE VIEW repmgr.replication_status AS
SELECT m.primary_node_id, m.standby_node_id, n.node_name AS standby_name,
n.type AS node_type, n.active, last_monitor_time,
CASE WHEN n.type='standby' THEN m.last_wal_primary_location ELSE NULL END AS last_wal_primary_location,
m.last_wal_standby_location,
CASE WHEN n.type='standby' THEN pg_catalog.pg_size_pretty(m.replication_lag) ELSE NULL END AS replication_lag,
CASE WHEN n.type='standby' THEN
CASE WHEN replication_lag > 0 THEN age(now(), m.last_apply_time) ELSE '0'::INTERVAL END
ELSE NULL
END AS replication_time_lag,
CASE WHEN n.type='standby' THEN pg_catalog.pg_size_pretty(m.apply_lag) ELSE NULL END AS apply_lag,
AGE(NOW(), CASE WHEN pg_catalog.pg_is_in_recovery() THEN repmgr.standby_get_last_updated() ELSE m.last_monitor_time END) AS communication_time_lag
FROM repmgr.monitoring_history m
JOIN repmgr.nodes n ON m.standby_node_id = n.node_id
WHERE (m.standby_node_id, m.last_monitor_time) IN (
SELECT m1.standby_node_id, MAX(m1.last_monitor_time)
FROM repmgr.monitoring_history m1 GROUP BY 1
);


@@ -83,9 +83,10 @@ do_bdr_register(void)
exit(ERR_BAD_CONFIG); exit(ERR_BAD_CONFIG);
} }
if (bdr_nodes.node_count > 2) /* BDR 2 implementation is for 2 nodes only */
if (get_bdr_version_num() < 3 && bdr_nodes.node_count > 2)
{ {
log_error(_("repmgr can only support BDR clusters with 2 nodes")); log_error(_("repmgr can only support BDR 2.x clusters with 2 nodes"));
log_detail(_("this BDR cluster has %i nodes"), bdr_nodes.node_count); log_detail(_("this BDR cluster has %i nodes"), bdr_nodes.node_count);
PQfinish(conn); PQfinish(conn);
pfree(dbname); pfree(dbname);
@@ -176,6 +177,7 @@ do_bdr_register(void)
if (bdr_node_has_repmgr_set(conn, config_file_options.node_name) == false) if (bdr_node_has_repmgr_set(conn, config_file_options.node_name) == false)
{ {
log_debug("bdr_node_has_repmgr_set() = false");
bdr_node_set_repmgr_set(conn, config_file_options.node_name); bdr_node_set_repmgr_set(conn, config_file_options.node_name);
} }
@@ -201,6 +203,7 @@ do_bdr_register(void)
if (bdr_nodes.node_count == 0) if (bdr_nodes.node_count == 0)
{ {
log_error(_("unable to retrieve any BDR node records")); log_error(_("unable to retrieve any BDR node records"));
log_detail("%s", PQerrorMessage(conn));
PQfinish(conn); PQfinish(conn);
exit(ERR_BAD_CONFIG); exit(ERR_BAD_CONFIG);
} }
@@ -252,7 +255,35 @@ do_bdr_register(void)
} }
/* Add the repmgr extension tables to a replication set */ /* Add the repmgr extension tables to a replication set */
add_extension_tables_to_bdr_replication_set(conn);
if (get_bdr_version_num() < 3)
{
add_extension_tables_to_bdr_replication_set(conn);
}
else
{
/* this is the only table we need to replicate */
char *replication_set = get_default_bdr_replication_set(conn);
/*
* this probably won't happen, but we need to be sure we're using
* the replication set metadata correctly...
*/
if (replication_set == NULL)
{
log_error(_("unable to retrieve default BDR replication set"));
log_hint(_("see preceding messages"));
log_debug("check query in get_default_bdr_replication_set()");
exit(ERR_BAD_CONFIG);
}
if (is_table_in_bdr_replication_set(conn, "nodes", replication_set) == false)
{
add_table_to_bdr_replication_set(conn, "nodes", replication_set);
}
pfree(replication_set);
}
initPQExpBuffer(&event_details); initPQExpBuffer(&event_details);


@@ -83,6 +83,8 @@ do_cluster_show(void)
int i = 0; int i = 0;
ItemList warnings = {NULL, NULL}; ItemList warnings = {NULL, NULL};
bool success = false; bool success = false;
bool error_found = false;
bool connection_error_found = false;
/* Connect to local database to obtain cluster connection data */ /* Connect to local database to obtain cluster connection data */
log_verbose(LOG_INFO, _("connecting to database")); log_verbose(LOG_INFO, _("connecting to database"));
@@ -140,14 +142,26 @@ do_cluster_show(void)
} }
else else
{ {
char error[MAXLEN];
strncpy(error, PQerrorMessage(cell->node_info->conn), MAXLEN);
cell->node_info->node_status = NODE_STATUS_DOWN; cell->node_info->node_status = NODE_STATUS_DOWN;
cell->node_info->recovery_type = RECTYPE_UNKNOWN; cell->node_info->recovery_type = RECTYPE_UNKNOWN;
item_list_append_format(&warnings,
"when attempting to connect to node \"%s\" (ID: %i), following error encountered :\n\"%s\"", connection_error_found = true;
cell->node_info->node_name, cell->node_info->node_id, trim(error));
if (runtime_options.verbose)
{
char error[MAXLEN];
strncpy(error, PQerrorMessage(cell->node_info->conn), MAXLEN);
item_list_append_format(&warnings,
"when attempting to connect to node \"%s\" (ID: %i), following error encountered :\n\"%s\"",
cell->node_info->node_name, cell->node_info->node_id, trim(error));
}
else
{
item_list_append_format(&warnings,
"unable to connect to node \"%s\" (ID: %i)",
cell->node_info->node_name, cell->node_info->node_id);
}
} }
initPQExpBuffer(&details); initPQExpBuffer(&details);
@@ -218,6 +232,7 @@ do_cluster_show(void)
else else
{ {
appendPQExpBuffer(&details, "- failed"); appendPQExpBuffer(&details, "- failed");
error_found = true;
} }
} }
} }
@@ -281,6 +296,7 @@ do_cluster_show(void)
else else
{ {
appendPQExpBuffer(&details, "- failed"); appendPQExpBuffer(&details, "- failed");
error_found = true;
} }
} }
} }
@@ -292,17 +308,27 @@ do_cluster_show(void)
if (cell->node_info->node_status == NODE_STATUS_UP) if (cell->node_info->node_status == NODE_STATUS_UP)
{ {
if (cell->node_info->active == true) if (cell->node_info->active == true)
{
appendPQExpBuffer(&details, "* running"); appendPQExpBuffer(&details, "* running");
}
else else
{
appendPQExpBuffer(&details, "! running"); appendPQExpBuffer(&details, "! running");
error_found = true;
}
} }
/* node is unreachable */ /* node is unreachable */
else else
{ {
if (cell->node_info->active == true) if (cell->node_info->active == true)
{
appendPQExpBuffer(&details, "? unreachable"); appendPQExpBuffer(&details, "? unreachable");
}
else else
{
appendPQExpBuffer(&details, "- failed"); appendPQExpBuffer(&details, "- failed");
error_found = true;
}
} }
} }
break; break;
@@ -310,6 +336,7 @@ do_cluster_show(void)
{ {
/* this should never happen */ /* this should never happen */
appendPQExpBuffer(&details, "? unknown node type"); appendPQExpBuffer(&details, "? unknown node type");
error_found = true;
} }
break; break;
} }
@@ -414,7 +441,6 @@ do_cluster_show(void)
PQfinish(conn); PQfinish(conn);
/* emit any warnings */ /* emit any warnings */
if (warnings.head != NULL && runtime_options.terse == false && runtime_options.output_mode != OM_CSV) if (warnings.head != NULL && runtime_options.terse == false && runtime_options.output_mode != OM_CSV)
{ {
ItemListCell *cell = NULL; ItemListCell *cell = NULL;
@@ -424,6 +450,25 @@ do_cluster_show(void)
{ {
printf(_(" - %s\n"), cell->string); printf(_(" - %s\n"), cell->string);
} }
if (runtime_options.verbose == false && connection_error_found == true)
{
log_hint(_("execute with --verbose option to see connection error messages"));
}
}
/*
* If warnings were noted, even if they're not displayed (e.g. in --csv mode),
* that means something's not right so we need to emit a non-zero exit code.
*/
if (warnings.head != NULL)
{
error_found = true;
}
if (error_found == true)
{
exit(ERR_NODE_STATUS);
} }
} }
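With the change above, `repmgr cluster show` exits with `ERR_NODE_STATUS` (25, per `errcode.h` in this diff) when any node problem or warning was noted. A calling program could detect that as sketched below. The helper names are hypothetical and the sketch assumes a POSIX system where `system()` returns a wait status; the command to run is supplied by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/wait.h>

/* value taken from errcode.h as shown in this diff */
#define ERR_NODE_STATUS 25

/*
 * Hypothetical helper: true if a wait status (as returned by system()
 * or waitpid()) represents a normal exit with code ERR_NODE_STATUS.
 */
bool
exited_with_node_status_error(int wait_status)
{
	return WIFEXITED(wait_status)
		&& WEXITSTATUS(wait_status) == ERR_NODE_STATUS;
}

/*
 * Hypothetical helper: run a command through the shell and report
 * whether it signalled a node status problem.
 */
bool
command_reports_node_problem(const char *command)
{
	int			status;

	assert(command != NULL);

	status = system(command);
	if (status == -1)
		return false;		/* command could not be run at all */

	return exited_with_node_status_error(status);
}
```

A monitoring wrapper might pass something like `repmgr -f /etc/repmgr.conf cluster show --csv` as the command; note that any other non-zero exit code is not treated as a node status problem by this sketch and would need separate handling.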
@@ -436,6 +481,7 @@ do_cluster_show(void)
* --all * --all
* --node-[id|name] * --node-[id|name]
* --event * --event
* --csv
*/ */
void void
@@ -480,8 +526,12 @@ do_cluster_event(void)
strncpy(headers_event[EV_TIMESTAMP].title, _("Timestamp"), MAXLEN); strncpy(headers_event[EV_TIMESTAMP].title, _("Timestamp"), MAXLEN);
strncpy(headers_event[EV_DETAILS].title, _("Details"), MAXLEN); strncpy(headers_event[EV_DETAILS].title, _("Details"), MAXLEN);
/* if --terse provided, simply omit the "Details" column */ /*
if (runtime_options.terse == true) * If --terse or --csv provided, simply omit the "Details" column.
* In --csv mode we'd need to quote/escape the contents of the "Details" column,
* which is doable but which will remain a TODO for now.
*/
if (runtime_options.terse == true || runtime_options.output_mode == OM_CSV)
column_count --; column_count --;
for (i = 0; i < column_count; i++) for (i = 0; i < column_count; i++)
@@ -504,47 +554,64 @@ do_cluster_event(void)
} }
for (i = 0; i < column_count; i++) if (runtime_options.output_mode == OM_TEXT)
{ {
if (i == 0) for (i = 0; i < column_count; i++)
printf(" "); {
else if (i == 0)
printf(" | "); printf(" ");
else
printf(" | ");
printf("%-*s", printf("%-*s",
headers_event[i].max_length, headers_event[i].max_length,
headers_event[i].title); headers_event[i].title);
}
printf("\n");
printf("-");
for (i = 0; i < column_count; i++)
{
int j;
for (j = 0; j < headers_event[i].max_length; j++)
printf("-");
if (i < (column_count - 1))
printf("-+-");
else
printf("-");
}
printf("\n");
} }
printf("\n");
printf("-");
for (i = 0; i < column_count; i++)
{
int j;
for (j = 0; j < headers_event[i].max_length; j++)
printf("-");
if (i < (column_count - 1))
printf("-+-");
else
printf("-");
}
printf("\n");
for (i = 0; i < PQntuples(res); i++) for (i = 0; i < PQntuples(res); i++)
{ {
int j; int j;
printf(" "); if (runtime_options.output_mode == OM_CSV)
for (j = 0; j < column_count; j++)
{ {
printf("%-*s", for (j = 0; j < column_count; j++)
headers_event[j].max_length, {
PQgetvalue(res, i, j)); printf("%s", PQgetvalue(res, i, j));
if ((j + 1) < column_count)
{
printf(",");
}
}
}
else
{
printf(" ");
for (j = 0; j < column_count; j++)
{
printf("%-*s",
headers_event[j].max_length,
PQgetvalue(res, i, j));
if (j < (column_count - 1)) if (j < (column_count - 1))
printf(" | "); printf(" | ");
}
} }
printf("\n"); printf("\n");
@@ -554,7 +621,8 @@ do_cluster_event(void)
PQfinish(conn); PQfinish(conn);
puts(""); if (runtime_options.output_mode == OM_TEXT)
puts("");
} }
@@ -696,7 +764,7 @@ do_cluster_crosscheck(void)
if (error_found == true) if (error_found == true)
{ {
exit(ERR_CLUSTER_CHECK); exit(ERR_NODE_STATUS);
} }
} }
@@ -786,7 +854,7 @@ do_cluster_matrix()
if (error_found == true) if (error_found == true)
{ {
exit(ERR_CLUSTER_CHECK); exit(ERR_NODE_STATUS);
} }
} }
@@ -1282,6 +1350,7 @@ do_cluster_cleanup(void)
PGconn *conn = NULL; PGconn *conn = NULL;
PGconn *primary_conn = NULL; PGconn *primary_conn = NULL;
int entries_to_delete = 0; int entries_to_delete = 0;
PQExpBufferData event_details;
conn = establish_db_connection(config_file_options.conninfo, true); conn = establish_db_connection(config_file_options.conninfo, true);
@@ -1295,7 +1364,13 @@ do_cluster_cleanup(void)
entries_to_delete = get_number_of_monitoring_records_to_delete(primary_conn, runtime_options.keep_history); entries_to_delete = get_number_of_monitoring_records_to_delete(primary_conn, runtime_options.keep_history);
if (entries_to_delete == 0) if (entries_to_delete < 0)
{
log_error(_("unable to query number of monitoring records to clean up"));
PQfinish(primary_conn);
exit(ERR_DB_QUERY);
}
else if (entries_to_delete == 0)
{ {
log_info(_("no monitoring records to delete")); log_info(_("no monitoring records to delete"));
PQfinish(primary_conn); PQfinish(primary_conn);
@@ -1305,10 +1380,23 @@ do_cluster_cleanup(void)
log_debug("at least %i monitoring records for deletion", log_debug("at least %i monitoring records for deletion",
entries_to_delete); entries_to_delete);
initPQExpBuffer(&event_details);
if (delete_monitoring_records(primary_conn, runtime_options.keep_history) == false) if (delete_monitoring_records(primary_conn, runtime_options.keep_history) == false)
{ {
log_error(_("unable to delete monitoring records")); appendPQExpBuffer(&event_details,
_("unable to delete monitoring records"));
log_error("%s", event_details.data);
log_detail("%s", PQerrorMessage(primary_conn)); log_detail("%s", PQerrorMessage(primary_conn));
create_event_notification(primary_conn,
&config_file_options,
config_file_options.node_id,
"cluster_cleanup",
false,
event_details.data);
PQfinish(primary_conn); PQfinish(primary_conn);
exit(ERR_DB_QUERY); exit(ERR_DB_QUERY);
} }
@@ -1320,7 +1408,22 @@ do_cluster_cleanup(void)
log_detail("%s", PQerrorMessage(primary_conn)); log_detail("%s", PQerrorMessage(primary_conn));
} }
appendPQExpBuffer(&event_details,
_("monitoring records deleted"));
if (runtime_options.keep_history > 0)
appendPQExpBuffer(&event_details,
_("; records newer than %i day(s) retained"),
runtime_options.keep_history);
create_event_notification(primary_conn,
&config_file_options,
config_file_options.node_id,
"cluster_cleanup",
true,
event_details.data);
termPQExpBuffer(&event_details);
PQfinish(primary_conn); PQfinish(primary_conn);
if (runtime_options.keep_history > 0) if (runtime_options.keep_history > 0)
@@ -1347,6 +1450,7 @@ do_cluster_help(void)
printf(_(" %s [OPTIONS] cluster matrix\n"), progname()); printf(_(" %s [OPTIONS] cluster matrix\n"), progname());
printf(_(" %s [OPTIONS] cluster crosscheck\n"), progname()); printf(_(" %s [OPTIONS] cluster crosscheck\n"), progname());
printf(_(" %s [OPTIONS] cluster event\n"), progname()); printf(_(" %s [OPTIONS] cluster event\n"), progname());
printf(_(" %s [OPTIONS] cluster cleanup\n"), progname());
puts(""); puts("");
printf(_("CLUSTER SHOW\n")); printf(_("CLUSTER SHOW\n"));
@@ -1386,6 +1490,7 @@ do_cluster_help(void)
printf(_(" --event filter specific event\n")); printf(_(" --event filter specific event\n"));
printf(_(" --node-id restrict entries to node with this ID\n")); printf(_(" --node-id restrict entries to node with this ID\n"));
printf(_(" --node-name restrict entries to node with this name\n")); printf(_(" --node-name restrict entries to node with this name\n"));
printf(_(" --csv emit output as CSV\n"));
puts(""); puts("");
printf(_("CLUSTER CLEANUP\n")); printf(_("CLUSTER CLEANUP\n"));


@@ -47,6 +47,7 @@ static CheckStatus do_node_check_downstream(PGconn *conn, OutputMode mode, Check
static CheckStatus do_node_check_replication_lag(PGconn *conn, OutputMode mode, t_node_info *node_info, CheckStatusList *list_output); static CheckStatus do_node_check_replication_lag(PGconn *conn, OutputMode mode, t_node_info *node_info, CheckStatusList *list_output);
static CheckStatus do_node_check_role(PGconn *conn, OutputMode mode, t_node_info *node_info, CheckStatusList *list_output); static CheckStatus do_node_check_role(PGconn *conn, OutputMode mode, t_node_info *node_info, CheckStatusList *list_output);
static CheckStatus do_node_check_slots(PGconn *conn, OutputMode mode, t_node_info *node_info, CheckStatusList *list_output); static CheckStatus do_node_check_slots(PGconn *conn, OutputMode mode, t_node_info *node_info, CheckStatusList *list_output);
static CheckStatus do_node_check_missing_slots(PGconn *conn, OutputMode mode, t_node_info *node_info, CheckStatusList *list_output);
/* /*
* NODE STATUS * NODE STATUS
@@ -169,11 +170,17 @@ do_node_status(void)
} }
else else
{ {
/* "archive_mode" is not "off", i.e. one of "on", "always" */
bool enabled = true; bool enabled = true;
PQExpBufferData archiving_status; PQExpBufferData archiving_status;
char archive_command[MAXLEN] = ""; char archive_command[MAXLEN] = "";
initPQExpBuffer(&archiving_status); initPQExpBuffer(&archiving_status);
/*
* if the node is a standby, and "archive_mode" is "on", archiving will
* actually be disabled.
*/
if (recovery_type == RECTYPE_STANDBY) if (recovery_type == RECTYPE_STANDBY)
{ {
if (guc_set(conn, "archive_mode", "=", "on")) if (guc_set(conn, "archive_mode", "=", "on"))
@@ -251,6 +258,55 @@ do_node_status(void)
"disabled"); "disabled");
} }
/* check for attached nodes */
{
NodeInfoList downstream_nodes = T_NODE_INFO_LIST_INITIALIZER;
NodeInfoListCell *node_cell = NULL;
ItemList missing_nodes = {NULL, NULL};
int missing_nodes_count = 0;
int expected_nodes_count = 0;
get_downstream_node_records(conn, config_file_options.node_id, &downstream_nodes);
/* if a witness node is present, we'll need to remove this from the total */
expected_nodes_count = downstream_nodes.node_count;
for (node_cell = downstream_nodes.head; node_cell; node_cell = node_cell->next)
{
/* skip witness server */
if (node_cell->node_info->type == WITNESS)
{
expected_nodes_count --;
continue;
}
if (is_downstream_node_attached(conn, node_cell->node_info->node_name) == false)
{
missing_nodes_count++;
item_list_append_format(&missing_nodes,
"%s (ID: %i)",
node_cell->node_info->node_name,
node_cell->node_info->node_id);
}
}
if (missing_nodes_count)
{
ItemListCell *missing_cell = NULL;
item_list_append_format(&warnings,
_("- %i of %i downstream nodes not attached:"),
missing_nodes_count,
expected_nodes_count);
for (missing_cell = missing_nodes.head; missing_cell; missing_cell = missing_cell->next)
{
item_list_append_format(&warnings,
" - %s\n", missing_cell->string);
}
}
}
if (server_version_num < 90400) if (server_version_num < 90400)
{ {
key_value_list_set(&node_status, key_value_list_set(&node_status,
@@ -486,18 +542,31 @@ do_node_status(void)
termPQExpBuffer(&output); termPQExpBuffer(&output);
if (runtime_options.output_mode == OM_TEXT && warnings.head != NULL && runtime_options.terse == false) if (warnings.head != NULL && runtime_options.terse == false && runtime_options.output_mode == OM_TEXT)
{ {
log_warning(_("following issue(s) were detected:")); log_warning(_("following issue(s) were detected:"));
print_item_list(&warnings); print_item_list(&warnings);
log_hint(_("execute \"repmgr node check\" for more details")); log_hint(_("execute \"repmgr node check\" for more details"));
} }
clear_node_info_list(&missing_slots);
key_value_list_free(&node_status); key_value_list_free(&node_status);
item_list_free(&warnings); item_list_free(&warnings);
PQfinish(conn); PQfinish(conn);
/*
* If warnings were noted, even if they're not displayed (e.g. in --csv mode),
* that means something's not right so we need to emit a non-zero exit code.
*/
if (warnings.head != NULL)
{
exit(ERR_NODE_STATUS);
}
return;
} }
/* /*
* Returns information about the running state of the node. * Returns information about the running state of the node.
* For internal use during "standby switchover". * For internal use during "standby switchover".
@@ -628,6 +697,7 @@ do_node_check(void)
CheckStatusList status_list = {NULL, NULL}; CheckStatusList status_list = {NULL, NULL};
CheckStatusListCell *cell = NULL; CheckStatusListCell *cell = NULL;
bool issue_detected = false;
/* for internal use */ /* for internal use */
if (runtime_options.has_passfile == true) if (runtime_options.has_passfile == true)
@@ -712,6 +782,17 @@ do_node_check(void)
exit(return_code); exit(return_code);
} }
if (runtime_options.missing_slots == true)
{
return_code = do_node_check_missing_slots(conn,
runtime_options.output_mode,
&node_info,
NULL);
PQfinish(conn);
exit(return_code);
}
if (runtime_options.output_mode == OM_NAGIOS) if (runtime_options.output_mode == OM_NAGIOS)
{ {
log_error(_("--nagios can only be used with a specific check")); log_error(_("--nagios can only be used with a specific check"));
@@ -725,11 +806,23 @@ do_node_check(void)
initPQExpBuffer(&output); initPQExpBuffer(&output);
/* order functions are called is also output order */ /* order functions are called is also output order */
(void) do_node_check_role(conn, runtime_options.output_mode, &node_info, &status_list); if (do_node_check_role(conn, runtime_options.output_mode, &node_info, &status_list) != CHECK_STATUS_OK)
(void) do_node_check_replication_lag(conn, runtime_options.output_mode, &node_info, &status_list); issue_detected = true;
(void) do_node_check_archive_ready(conn, runtime_options.output_mode, &status_list);
(void) do_node_check_downstream(conn, runtime_options.output_mode, &status_list); if (do_node_check_replication_lag(conn, runtime_options.output_mode, &node_info, &status_list) != CHECK_STATUS_OK)
(void) do_node_check_slots(conn, runtime_options.output_mode, &node_info, &status_list); issue_detected = true;
if (do_node_check_archive_ready(conn, runtime_options.output_mode, &status_list) != CHECK_STATUS_OK)
issue_detected = true;
if (do_node_check_downstream(conn, runtime_options.output_mode, &status_list) != CHECK_STATUS_OK)
issue_detected = true;
if (do_node_check_slots(conn, runtime_options.output_mode, &node_info, &status_list) != CHECK_STATUS_OK)
issue_detected = true;
if (do_node_check_missing_slots(conn, runtime_options.output_mode, &node_info, &status_list) != CHECK_STATUS_OK)
issue_detected = true;
if (runtime_options.output_mode == OM_CSV) if (runtime_options.output_mode == OM_CSV)
{ {
@@ -786,6 +879,11 @@ do_node_check(void)
check_status_list_free(&status_list); check_status_list_free(&status_list);
PQfinish(conn); PQfinish(conn);
if (issue_detected == true)
{
exit(ERR_NODE_STATUS);
}
} }
@@ -1047,6 +1145,7 @@ do_node_check_downstream(PGconn *conn, OutputMode mode, CheckStatusList *list_ou
for (cell = downstream_nodes.head; cell; cell = cell->next) for (cell = downstream_nodes.head; cell; cell = cell->next)
{ {
/* skip witness server */
if (cell->node_info->type == WITNESS) if (cell->node_info->type == WITNESS)
{ {
expected_nodes_count --; expected_nodes_count --;
@@ -1583,6 +1682,130 @@ do_node_check_slots(PGconn *conn, OutputMode mode, t_node_info *node_info, Check
} }
static CheckStatus
do_node_check_missing_slots(PGconn *conn, OutputMode mode, t_node_info *node_info, CheckStatusList *list_output)
{
CheckStatus status = CHECK_STATUS_OK;
PQExpBufferData details;
NodeInfoList missing_slots = T_NODE_INFO_LIST_INITIALIZER;
if (mode == OM_CSV && list_output == NULL)
{
log_error(_("--csv output not provided with --missing-slots option"));
PQfinish(conn);
exit(ERR_BAD_CONFIG);
}
initPQExpBuffer(&details);
if (server_version_num < 90400)
{
appendPQExpBuffer(&details,
_("replication slots not available for this PostgreSQL version"));
}
else
{
get_downstream_nodes_with_missing_slot(conn,
config_file_options.node_id,
&missing_slots);
if (missing_slots.node_count == 0)
{
appendPQExpBuffer(&details,
_("node has no missing replication slots"));
}
else
{
NodeInfoListCell *missing_slot_cell = NULL;
bool first_element = true;
status = CHECK_STATUS_CRITICAL;
appendPQExpBuffer(&details,
_("%i replication slots are missing"),
missing_slots.node_count);
if (missing_slots.node_count)
{
appendPQExpBuffer(&details, ": ");
for (missing_slot_cell = missing_slots.head; missing_slot_cell; missing_slot_cell = missing_slot_cell->next)
{
if (first_element == true)
{
first_element = false;
}
else
{
appendPQExpBuffer(&details, ", ");
}
appendPQExpBuffer(&details, "%s", missing_slot_cell->node_info->slot_name);
}
}
}
}
switch (mode)
{
case OM_NAGIOS:
{
printf("REPMGR_MISSING_SLOTS %s: %s | missing_slots=%i",
output_check_status(status),
details.data,
missing_slots.node_count);
if (missing_slots.node_count)
{
NodeInfoListCell *missing_slot_cell = NULL;
bool first_element = true;
printf(";");
for (missing_slot_cell = missing_slots.head; missing_slot_cell; missing_slot_cell = missing_slot_cell->next)
{
if (first_element == true)
{
first_element = false;
}
else
{
printf(",");
}
printf("%s", missing_slot_cell->node_info->slot_name);
}
}
printf("\n");
break;
}
case OM_CSV:
case OM_TEXT:
if (list_output != NULL)
{
check_status_list_set(list_output,
"Replication slots",
status,
details.data);
}
else
{
printf("%s (%s)\n",
output_check_status(status),
details.data);
}
default:
break;
}
clear_node_info_list(&missing_slots);
termPQExpBuffer(&details);
return status;
}
void void
do_node_service(void) do_node_service(void)
{ {
@@ -2136,19 +2359,19 @@ do_node_rejoin(void)
{ {
log_verbose(LOG_INFO, _("waiting for node %i to respond to pings; %i of max %i attempts"), log_verbose(LOG_INFO, _("waiting for node %i to respond to pings; %i of max %i attempts"),
config_file_options.node_id, config_file_options.node_id,
i + 1, config_file_options.standby_reconnect_timeout); i + 1, config_file_options.node_rejoin_timeout);
} }
else else
{ {
log_debug("sleeping 1 second waiting for node %i to respond to pings; %i of max %i attempts", log_debug("sleeping 1 second waiting for node %i to respond to pings; %i of max %i attempts",
config_file_options.node_id, config_file_options.node_id,
i + 1, config_file_options.standby_reconnect_timeout); i + 1, config_file_options.node_rejoin_timeout);
} }
sleep(1); sleep(1);
} }
for (; i < config_file_options.standby_reconnect_timeout; i++) for (; i < config_file_options.node_rejoin_timeout; i++)
{ {
success = is_downstream_node_attached(upstream_conn, config_file_options.node_name); success = is_downstream_node_attached(upstream_conn, config_file_options.node_name);
@@ -2163,13 +2386,13 @@ do_node_rejoin(void)
{ {
log_info(_("waiting for node %i to connect to new primary; %i of max %i attempts"), log_info(_("waiting for node %i to connect to new primary; %i of max %i attempts"),
config_file_options.node_id, config_file_options.node_id,
i + 1, config_file_options.standby_reconnect_timeout); i + 1, config_file_options.node_rejoin_timeout);
} }
else else
{ {
log_debug("sleeping 1 second waiting for node %i to connect to new primary; %i of max %i attempts", log_debug("sleeping 1 second waiting for node %i to connect to new primary; %i of max %i attempts",
config_file_options.node_id, config_file_options.node_id,
i + 1, config_file_options.standby_reconnect_timeout); i + 1, config_file_options.node_rejoin_timeout);
} }
sleep(1); sleep(1);
@@ -2194,6 +2417,54 @@ do_node_rejoin(void)
success = is_downstream_node_attached(upstream_conn, config_file_options.node_name); success = is_downstream_node_attached(upstream_conn, config_file_options.node_name);
} }
/*
* Handle replication slots:
* - if a slot for the new upstream exists, delete that
* - warn about any other inactive replication slots
*/
if (runtime_options.force_rewind_used == false && config_file_options.use_replication_slots)
{
PGconn *local_conn = NULL;
local_conn = establish_db_connection(config_file_options.conninfo, false);
if (PQstatus(local_conn) != CONNECTION_OK)
{
log_warning(_("unable to connect to local node to check replication slot status"));
log_hint(_("execute \"repmgr node check\" to check inactive slots and drop manually if necessary"));
}
else
{
KeyValueList inactive_replication_slots = {NULL, NULL};
KeyValueListCell *cell = NULL;
int inactive_count = 0;
PQExpBufferData slotinfo;
drop_replication_slot_if_exists(local_conn,
config_file_options.node_id,
primary_node_record.slot_name);
(void) get_inactive_replication_slots(local_conn, &inactive_replication_slots);
initPQExpBuffer(&slotinfo);
for (cell = inactive_replication_slots.head; cell; cell = cell->next)
{
appendPQExpBuffer(&slotinfo,
" - %s (%s)", cell->key, cell->value);
inactive_count++;
}
if (inactive_count > 0)
{
log_warning(_("%i inactive replication slots detected"), inactive_count);
log_detail(_("inactive replication slots:\n%s"), slotinfo.data);
log_hint(_("these replication slots may need to be removed manually"));
}
termPQExpBuffer(&slotinfo);
PQfinish(local_conn);
}
}
if (success == true) if (success == true)
{ {
@@ -2203,7 +2474,8 @@ do_node_rejoin(void)
else else
{ {
/* /*
* if we reach here, no record found in upstream node's pg_stat_replication */ * if we reach here, no record found in upstream node's pg_stat_replication
*/
log_notice(_("NODE REJOIN has completed but node is not yet reattached to upstream")); log_notice(_("NODE REJOIN has completed but node is not yet reattached to upstream"));
log_hint(_("you will need to manually check the node's replication status")); log_hint(_("you will need to manually check the node's replication status"));
} }
@@ -2664,6 +2936,7 @@ do_node_help(void)
printf(_(" --replication-lag replication lag in seconds (standbys only)\n")); printf(_(" --replication-lag replication lag in seconds (standbys only)\n"));
printf(_(" --role check node has expected role\n")); printf(_(" --role check node has expected role\n"));
printf(_(" --slots check for inactive replication slots\n")); printf(_(" --slots check for inactive replication slots\n"));
printf(_(" --missing-slots check for missing replication slots\n"));
puts(""); puts("");
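do_node_check_missing_slots() above builds its comma-separated list of slot names twice (once for the details string, once for the Nagios perfdata) using a first_element flag. As an illustration of the same join pattern in isolation, here is a minimal sketch using a fixed-size buffer and snprintf() rather than PQExpBuffer. The helper name join_names() and its signature are hypothetical and not part of repmgr.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of repmgr: join "count" strings from
 * "names" into "dst", separated by ", ". The separator is emitted before
 * every element except the first, which is what the first_element flag
 * achieves in the function above. Returns 0 on success, -1 if the result
 * would not fit into "dst".
 */
static int
join_names(const char *const *names, int count, char *dst, size_t dstlen)
{
	size_t		used = 0;
	int			i;

	if (dstlen == 0)
		return -1;

	dst[0] = '\0';

	for (i = 0; i < count; i++)
	{
		int			n = snprintf(dst + used, dstlen - used, "%s%s",
								 (i == 0) ? "" : ", ",
								 names[i]);

		/* snprintf() returns the length it wanted to write */
		if (n < 0 || (size_t) n >= dstlen - used)
			return -1;

		used += (size_t) n;
	}

	return 0;
}
```

Factoring the join into one place would let the text and Nagios branches share it, differing only in the separator; the sketch hard-codes ", " for brevity.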


@@ -64,12 +64,10 @@ do_primary_register(void)
PQfinish(conn); PQfinish(conn);
exit(ERR_BAD_CONFIG); exit(ERR_BAD_CONFIG);
} }
else
{ log_error(_("unable to determine server's recovery type"));
log_error(_("connection to node lost")); PQfinish(conn);
PQfinish(conn); exit(ERR_DB_CONN);
exit(ERR_DB_CONN);
}
} }
log_verbose(LOG_INFO, _("server is not in recovery")); log_verbose(LOG_INFO, _("server is not in recovery"));


@@ -89,8 +89,6 @@ static int run_file_backup(t_node_info *node_record);
static void copy_configuration_files(bool delete_after_copy); static void copy_configuration_files(bool delete_after_copy);
static void drop_replication_slot_if_exists(PGconn *conn, int node_id, char *slot_name);
static void tablespace_data_append(TablespaceDataList *list, const char *name, const char *oid, const char *location); static void tablespace_data_append(TablespaceDataList *list, const char *name, const char *oid, const char *location);
static void get_barman_property(char *dst, char *name, char *local_repmgr_directory); static void get_barman_property(char *dst, char *name, char *local_repmgr_directory);
@@ -471,6 +469,7 @@ do_standby_clone(void)
termPQExpBuffer(&msg); termPQExpBuffer(&msg);
r = test_ssh_connection(runtime_options.host, runtime_options.remote_user); r = test_ssh_connection(runtime_options.host, runtime_options.remote_user);
if (r != 0) if (r != 0)
{ {
log_error(_("remote host \"%s\" is not reachable via SSH - unable to copy external configuration files"), log_error(_("remote host \"%s\" is not reachable via SSH - unable to copy external configuration files"),
@@ -498,32 +497,41 @@ do_standby_clone(void)
termPQExpBuffer(&msg); termPQExpBuffer(&msg);
/* /*
* Here we'll attempt an initial test copy of the detected external * Here we'll attempt an initial test copy of the detected external
* files, to detect any issues before we run the base backup. * files, to detect any issues before we run the base backup.
* *
* Note this will exit with an error, unless -F/--force supplied. * Note this will exit with an error, unless -F/--force supplied.
* *
* We don't do this during a --dry-run as it may introduce unexpected changes
* on the local node; during an actual clone operation, any problems with
* copying files will be detected early and the operation aborted before
* the actual database cloning commences.
*
* TODO: put the files in a temporary directory and move to their final * TODO: put the files in a temporary directory and move to their final
* destination once the database has been cloned. * destination once the database has been cloned.
*/ */
if (runtime_options.copy_external_config_files_destination == CONFIG_FILE_SAMEPATH) if (runtime_options.dry_run == false)
{ {
/* if (runtime_options.copy_external_config_files_destination == CONFIG_FILE_SAMEPATH)
* Files will be placed in the same path as on the source server; {
* don't delete after copying. /*
*/ * Files will be placed in the same path as on the source server;
copy_configuration_files(false); * don't delete after copying.
*/
copy_configuration_files(false);
} }
else else
{ {
/* /*
* Files will be placed in the data directory - delete after copying. * Files will be placed in the data directory - delete after copying.
* They'll be copied again later; see TODO above. * They'll be copied again later; see TODO above.
*/ */
copy_configuration_files(true); copy_configuration_files(true);
}
} }
} }
@@ -1054,6 +1062,7 @@ _do_create_recovery_conf(void)
local_node_record.slot_name, local_node_record.slot_name,
upstream_node_record.node_name, upstream_node_record.node_name,
upstream_node_id); upstream_node_id);
if (runtime_options.force == false && runtime_options.dry_run == false) if (runtime_options.force == false && runtime_options.dry_run == false)
{ {
log_error("%s", msg.data); log_error("%s", msg.data);
@@ -1085,7 +1094,7 @@ _do_create_recovery_conf(void)
initPQExpBuffer(&msg); initPQExpBuffer(&msg);
appendPQExpBuffer(&msg, appendPQExpBuffer(&msg,
_("insufficient free replicaiton slots on upstream node \"%s\" (ID: %i)"), _("insufficient free replication slots on upstream node \"%s\" (ID: %i)"),
upstream_node_record.node_name, upstream_node_record.node_name,
upstream_node_id); upstream_node_id);
@@ -1141,14 +1150,14 @@ _do_create_recovery_conf(void)
if (runtime_options.dry_run == true) if (runtime_options.dry_run == true)
{ {
char recovery_conf_contents[MAXLEN] = ""; char recovery_conf_contents[MAXLEN] = "";
create_recovery_file(&upstream_node_record, &recovery_conninfo, recovery_conf_contents, false); create_recovery_file(&local_node_record, &recovery_conninfo, recovery_conf_contents, false);
log_info(_("would create \"recovery.conf\" file in \"%s\""), local_data_directory); log_info(_("would create \"recovery.conf\" file in \"%s\""), local_data_directory);
log_detail(_("\n%s"), recovery_conf_contents); log_detail(_("\n%s"), recovery_conf_contents);
} }
else else
{ {
if (!create_recovery_file(&upstream_node_record, &recovery_conninfo, local_data_directory, true)) if (!create_recovery_file(&local_node_record, &recovery_conninfo, local_data_directory, true))
{ {
log_error(_("unable to create \"recovery.conf\"")); log_error(_("unable to create \"recovery.conf\""));
} }
@@ -1557,8 +1566,8 @@ do_standby_register(void)
exit(ERR_BAD_CONFIG); exit(ERR_BAD_CONFIG);
} }
log_warning(_("this node does not appear to be attached to upstream node \"%s\" (ID: %i)"), log_warning(_("this node does not appear to be attached to upstream node \"%s\" (ID: %i)"),
config_file_options.node_name, upstream_node_record.node_name,
config_file_options.node_id); upstream_node_record.node_id);
} }
PQfinish(upstream_conn); PQfinish(upstream_conn);
} }
@@ -1708,11 +1717,16 @@ do_standby_register(void)
termPQExpBuffer(&details); termPQExpBuffer(&details);
/* if --wait-sync option set, wait for the records to synchronise */ /*
* If --wait-sync option set, wait for the records to synchronise
* (unless 0 seconds provided, which disables it, which is the same as
* not providing the option). The default value is -1, which means
* no timeout.
*/
if (PQstatus(conn) == CONNECTION_OK && if (PQstatus(conn) == CONNECTION_OK &&
runtime_options.wait_register_sync == true && runtime_options.wait_register_sync == true &&
runtime_options.wait_register_sync_seconds > 0) runtime_options.wait_register_sync_seconds != 0)
{ {
bool sync_ok = false; bool sync_ok = false;
int timer = 0; int timer = 0;
@@ -1736,7 +1750,11 @@ do_standby_register(void)
{ {
bool records_match = true; bool records_match = true;
if (runtime_options.wait_register_sync_seconds && runtime_options.wait_register_sync_seconds == timer) /*
* If timeout set to a positive value, check if we've reached it and
* exit the loop
*/
if (runtime_options.wait_register_sync_seconds > 0 && runtime_options.wait_register_sync_seconds == timer)
break; break;
node_record_status = get_node_record(conn, node_record_status = get_node_record(conn,
@@ -2040,6 +2058,8 @@ _do_standby_promote_internal(PGconn *conn)
local_node_record.node_name, local_node_record.node_name,
local_node_record.node_id, local_node_record.node_id,
script); script);
log_detail(_("waiting up to %i seconds (parameter \"promote_check_timeout\") for promotion to complete"),
config_file_options.promote_check_timeout);
r = system(script); r = system(script);
if (r != 0) if (r != 0)
@@ -2065,6 +2085,8 @@ _do_standby_promote_internal(PGconn *conn)
if (recovery_type == RECTYPE_STANDBY) if (recovery_type == RECTYPE_STANDBY)
{ {
log_error(_("STANDBY PROMOTE failed, node is still a standby")); log_error(_("STANDBY PROMOTE failed, node is still a standby"));
log_detail(_("node still in recovery after %i seconds"), config_file_options.promote_check_timeout);
log_hint(_("the node may need more time to promote itself, check the PostgreSQL log for details"));
PQfinish(conn); PQfinish(conn);
exit(ERR_PROMOTION_FAIL); exit(ERR_PROMOTION_FAIL);
} }
@@ -2710,6 +2732,10 @@ do_standby_follow_internal(PGconn *primary_conn, t_node_info *primary_node_recor
* If replication slots are in use, and an inactive one for this node * If replication slots are in use, and an inactive one for this node
* exists on the former upstream, drop it. * exists on the former upstream, drop it.
* *
* Note that if this function is called by do_standby_switchover(), the
* "repmgr node rejoin" command executed on the demotion candidate may already
* have removed the slot, so there may be nothing to do.
*
* XXX check if former upstream is current primary? * XXX check if former upstream is current primary?
*/ */
@@ -2817,6 +2843,12 @@ do_standby_switchover(void)
int reachable_sibling_nodes_with_slot_count = 0; int reachable_sibling_nodes_with_slot_count = 0;
int unreachable_sibling_node_count = 0; int unreachable_sibling_node_count = 0;
/* number of free walsenders required on promotion candidate */
int min_required_wal_senders = 1;
/* this will be calculated as max_wal_senders - COUNT(*) FROM pg_stat_replication */
int available_wal_senders = 0;
/* number of free replication slots required on promotion candidate */ /* number of free replication slots required on promotion candidate */
int min_required_free_slots = 0; int min_required_free_slots = 0;
@@ -2901,6 +2933,25 @@ do_standby_switchover(void)
exit(ERR_DB_QUERY); exit(ERR_DB_QUERY);
} }
/*
* Check that there are no exclusive backups running on the primary.
* We don't want to end up damaging the backup and also leaving the server in a
* state where there's control data saying it's in backup mode but there's no
* backup_label in PGDATA.
* If the DBA wants to do the switchover anyway, they should first stop the
* backup that's running.
*/
if (server_in_exclusive_backup_mode(remote_conn) != BACKUP_STATE_NO_BACKUP)
{
log_error(_("unable to perform a switchover while primary server is in exclusive backup mode"));
log_hint(_("stop backup before attempting the switchover"));
PQfinish(local_conn);
PQfinish(remote_conn);
exit(ERR_SWITCHOVER_FAIL);
}
/* /*
* Check this standby is attached to the demotion candidate * Check this standby is attached to the demotion candidate
* TODO: * TODO:
@@ -3067,6 +3118,176 @@ do_standby_switchover(void)
} }
termPQExpBuffer(&command_output); termPQExpBuffer(&command_output);
/*
* populate local node record with current state of various replication-related
* values, so we can check for sufficient walsenders and replication slots
*/
get_node_replication_stats(local_conn, server_version_num, &local_node_record);
available_wal_senders = local_node_record.max_wal_senders -
local_node_record.attached_wal_receivers;
/*
* If --siblings-follow specified, get list and check they're reachable
* (if not just issue a warning)
*/
get_active_sibling_node_records(local_conn,
local_node_record.node_id,
local_node_record.upstream_node_id,
&sibling_nodes);
if (runtime_options.siblings_follow == false)
{
if (sibling_nodes.node_count > 0)
{
log_warning(_("%i sibling nodes found, but option \"--siblings-follow\" not specified"),
sibling_nodes.node_count);
log_detail(_("these nodes will remain attached to the current primary"));
}
}
else
{
char host[MAXLEN] = "";
NodeInfoListCell *cell;
log_verbose(LOG_INFO, _("%i active sibling nodes found"),
sibling_nodes.node_count);
if (sibling_nodes.node_count == 0)
{
log_warning(_("option \"--siblings-follow\" specified, but no sibling nodes exist"));
}
else
{
/* include walsender for promotion candidate in total */
for (cell = sibling_nodes.head; cell; cell = cell->next)
{
/* get host from node record */
get_conninfo_value(cell->node_info->conninfo, "host", host);
r = test_ssh_connection(host, runtime_options.remote_user);
if (r != 0)
{
cell->node_info->reachable = false;
unreachable_sibling_node_count++;
}
else
{
cell->node_info->reachable = true;
reachable_sibling_node_count++;
min_required_wal_senders++;
if (cell->node_info->slot_name[0] != '\0')
{
reachable_sibling_nodes_with_slot_count++;
min_required_free_slots++;
}
}
}
if (unreachable_sibling_node_count > 0)
{
if (runtime_options.force == false)
{
log_error(_("%i of %i sibling nodes unreachable via SSH:"),
unreachable_sibling_node_count,
sibling_nodes.node_count);
}
else
{
log_warning(_("%i of %i sibling nodes unreachable via SSH:"),
unreachable_sibling_node_count,
sibling_nodes.node_count);
}
/* display list of unreachable sibling nodes */
for (cell = sibling_nodes.head; cell; cell = cell->next)
{
if (cell->node_info->reachable == true)
continue;
log_detail(" %s (ID: %i)",
cell->node_info->node_name,
cell->node_info->node_id);
}
if (runtime_options.force == false)
{
log_hint(_("use -F/--force to proceed in any case"));
PQfinish(local_conn);
exit(ERR_BAD_CONFIG);
}
if (runtime_options.dry_run == true)
{
log_detail(_("-F/--force specified, would proceed anyway"));
}
else
{
log_detail(_("-F/--force specified, proceeding anyway"));
}
}
else
{
char *msg = _("all sibling nodes are reachable via SSH");
if (runtime_options.dry_run == true)
{
log_info("%s", msg);
}
else
{
log_verbose(LOG_INFO, "%s", msg);
}
}
}
}
/*
* check there are sufficient free walsenders - obviously there's potential
* for a later race condition if some walsenders come into use before the
* switchover operation gets around to attaching the sibling nodes, but
* this should catch any actual existing configuration issue (and if anyone's
* performing a switchover in such an unstable environment, they only have
* themselves to blame).
*/
if (available_wal_senders < min_required_wal_senders)
{
if (runtime_options.force == false || runtime_options.dry_run == true)
{
log_error(_("insufficient free walsenders on promotion candidate"));
log_detail(_("at least %i walsenders required but only %i free walsenders on promotion candidate"),
min_required_wal_senders,
available_wal_senders);
log_hint(_("increase parameter \"max_wal_senders\" or use -F/--force to proceed in any case"));
if (runtime_options.dry_run == false)
{
PQfinish(local_conn);
exit(ERR_BAD_CONFIG);
}
}
else
{
log_warning(_("insufficient free walsenders on promotion candidate"));
log_detail(_("at least %i walsenders required but only %i free walsender(s) on promotion candidate"),
min_required_wal_senders,
available_wal_senders);
}
}
else
{
if (runtime_options.dry_run == true)
{
log_info(_("%i walsenders required, %i available"),
min_required_wal_senders,
available_wal_senders);
}
}
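The walsender check above is simple arithmetic: free walsenders are `max_wal_senders` minus the walsenders already in use, and the requirement grows by one for each sibling reachable via SSH. A minimal sketch of that calculation follows. The helper names are hypothetical and not part of repmgr, and the starting value of 1 is an assumption taken from the earlier version of this block, which initialised `min_required_wal_senders` to 1.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not a repmgr function): one walsender is assumed to
 * be needed regardless of siblings, plus one per reachable sibling node.
 */
static int
required_wal_senders(int reachable_sibling_count)
{
	assert(reachable_sibling_count >= 0);
	return 1 + reachable_sibling_count;
}

/* Free walsenders: configured maximum minus those already attached. */
static int
free_wal_senders(int max_wal_senders, int attached_wal_receivers)
{
	return max_wal_senders - attached_wal_receivers;
}

/* Mirrors the comparison in the diff: insufficient when free < required. */
static bool
has_sufficient_wal_senders(int max_wal_senders,
						   int attached_wal_receivers,
						   int reachable_sibling_count)
{
	return free_wal_senders(max_wal_senders, attached_wal_receivers)
		>= required_wal_senders(reachable_sibling_count);
}
```

For example, with `max_wal_senders = 3`, one attached receiver and two reachable siblings, two walsenders are free but three are required, so the check would fail unless `-F/--force` is given.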
/* check demotion candidate can make replication connection to promotion candidate */ /* check demotion candidate can make replication connection to promotion candidate */
{ {
initPQExpBuffer(&remote_command_str); initPQExpBuffer(&remote_command_str);
@@ -3310,171 +3531,6 @@ do_standby_switchover(void)
PQfinish(remote_conn); PQfinish(remote_conn);
/*
* populate local node record with current state of various replication-related
* values, so we can check for sufficient walsenders and replication slots
*/
get_node_replication_stats(local_conn, server_version_num, &local_node_record);
/*
* If --siblings-follow specified, get list and check they're reachable
* (if not just issue a warning)
*/
get_active_sibling_node_records(local_conn,
local_node_record.node_id,
local_node_record.upstream_node_id,
&sibling_nodes);
if (runtime_options.siblings_follow == false)
{
if (sibling_nodes.node_count > 0)
{
log_warning(_("%i sibling nodes found, but option \"--siblings-follow\" not specified"),
sibling_nodes.node_count);
log_detail(_("these nodes will remain attached to the current primary"));
}
}
else
{
char host[MAXLEN] = "";
NodeInfoListCell *cell;
log_verbose(LOG_INFO, _("%i active sibling nodes found"),
sibling_nodes.node_count);
if (sibling_nodes.node_count == 0)
{
log_warning(_("option \"--siblings-follow\" specified, but no sibling nodes exist"));
}
else
{
/* include walsender for promotion candidate in total */
int min_required_wal_senders = 1;
int available_wal_senders = local_node_record.max_wal_senders -
local_node_record.attached_wal_receivers;
for (cell = sibling_nodes.head; cell; cell = cell->next)
{
/* get host from node record */
get_conninfo_value(cell->node_info->conninfo, "host", host);
r = test_ssh_connection(host, runtime_options.remote_user);
if (r != 0)
{
cell->node_info->reachable = false;
unreachable_sibling_node_count++;
}
else
{
cell->node_info->reachable = true;
reachable_sibling_node_count++;
min_required_wal_senders++;
if (cell->node_info->slot_name[0] != '\0')
{
reachable_sibling_nodes_with_slot_count++;
min_required_free_slots++;
}
}
}
if (unreachable_sibling_node_count > 0)
{
if (runtime_options.force == false)
{
log_error(_("%i of %i sibling nodes unreachable via SSH:"),
unreachable_sibling_node_count,
sibling_nodes.node_count);
}
else
{
log_warning(_("%i of %i sibling nodes unreachable via SSH:"),
unreachable_sibling_node_count,
sibling_nodes.node_count);
}
/* display list of unreachable sibling nodes */
for (cell = sibling_nodes.head; cell; cell = cell->next)
{
if (cell->node_info->reachable == true)
continue;
log_detail(" %s (ID: %i)",
cell->node_info->node_name,
cell->node_info->node_id);
}
if (runtime_options.force == false)
{
log_hint(_("use -F/--force to proceed in any case"));
PQfinish(local_conn);
exit(ERR_BAD_CONFIG);
}
if (runtime_options.dry_run == true)
{
log_detail(_("-F/--force specified, would proceed anyway"));
}
else
{
log_detail(_("-F/--force specified, proceeding anyway"));
}
}
else
{
char *msg = _("all sibling nodes are reachable via SSH");
if (runtime_options.dry_run == true)
{
log_info("%s", msg);
}
else
{
log_verbose(LOG_INFO, "%s", msg);
}
}
/*
* check there are sufficient free walsenders - obviously there's potential
* for a later race condition if some walsenders come into use before the
* switchover operation gets around to attaching the sibling nodes, but
* this should catch any actual existing configuration issue.
*/
if (available_wal_senders < min_required_wal_senders)
{
if (runtime_options.force == false || runtime_options.dry_run == true)
{
log_error(_("insufficient free walsenders to attach all sibling nodes"));
log_detail(_("at least %i walsenders required but only %i free walsenders on promotion candidate"),
min_required_wal_senders,
available_wal_senders);
log_hint(_("increase parameter \"max_wal_senders\" or use -F/--force to proceed in any case"));
if (runtime_options.dry_run == false)
{
PQfinish(local_conn);
exit(ERR_BAD_CONFIG);
}
}
else
{
log_warning(_("insufficient free walsenders to attach all sibling nodes"));
log_detail(_("at least %i walsenders required but only %i free walsender(s) on promotion candidate"),
min_required_wal_senders,
available_wal_senders);
}
}
else
{
if (runtime_options.dry_run == true)
{
log_info(_("%i walsenders required, %i available"),
min_required_wal_senders,
available_wal_senders);
}
}
}
}
/* /*
* if replication slots are required by demotion candidate and/or siblings, * if replication slots are required by demotion candidate and/or siblings,
@@ -5082,65 +5138,81 @@ run_basebackup(t_node_info *node_record)
{ {
PGconn *upstream_conn = NULL; PGconn *upstream_conn = NULL;
upstream_conn = establish_db_connection(upstream_node_record.conninfo, true); upstream_conn = establish_db_connection(upstream_node_record.conninfo, false);
record_status = get_slot_record(upstream_conn, node_record->slot_name, &slot_info); /*
* It's possible the upstream node is not yet running, in which case we'll
if (record_status == RECORD_FOUND) * have to rely on the user taking action to create the slot
*/
if (PQstatus(upstream_conn) != CONNECTION_OK)
{ {
log_verbose(LOG_INFO, log_warning(_("unable to connect to upstream node to create replication slot"));
_("replication slot \"%s\" already exists on upstream node %i"), /*
node_record->slot_name, * TODO: if slot creation also handled by "standby register", update warning
upstream_node_id); */
slot_exists_on_upstream = true; log_hint(_("you may need to create the replication slot manually"));
} }
else else
{ {
PQExpBufferData event_details; record_status = get_slot_record(upstream_conn, node_record->slot_name, &slot_info);
log_notice(_("creating replication slot \"%s\" on upstream node %i"), if (record_status == RECORD_FOUND)
node_record->slot_name,
upstream_node_id);
get_superuser_connection(&upstream_conn, &superuser_conn, &privileged_conn);
initPQExpBuffer(&event_details);
if (create_replication_slot(privileged_conn, node_record->slot_name, source_server_version_num, &event_details) == false)
{ {
log_error("%s", event_details.data); log_verbose(LOG_INFO,
_("replication slot \"%s\" already exists on upstream node %i"),
node_record->slot_name,
upstream_node_id);
slot_exists_on_upstream = true;
}
else
{
PQExpBufferData event_details;
create_event_notification( log_notice(_("creating replication slot \"%s\" on upstream node %i"),
primary_conn, node_record->slot_name,
&config_file_options, upstream_node_id);
config_file_options.node_id,
"standby_clone",
false,
event_details.data);
PQfinish(source_conn); get_superuser_connection(&upstream_conn, &superuser_conn, &privileged_conn);
initPQExpBuffer(&event_details);
if (create_replication_slot(privileged_conn, node_record->slot_name, source_server_version_num, &event_details) == false)
{
log_error("%s", event_details.data);
create_event_notification(primary_conn,
&config_file_options,
config_file_options.node_id,
"standby_clone",
false,
event_details.data);
PQfinish(source_conn);
if (superuser_conn != NULL)
PQfinish(superuser_conn);
exit(ERR_DB_QUERY);
}
if (superuser_conn != NULL) if (superuser_conn != NULL)
PQfinish(superuser_conn); PQfinish(superuser_conn);
exit(ERR_DB_QUERY); termPQExpBuffer(&event_details);
} }
if (superuser_conn != NULL) PQfinish(upstream_conn);
PQfinish(superuser_conn);
termPQExpBuffer(&event_details);
} }
PQfinish(upstream_conn);
} }
/* delete slot on source server */
get_superuser_connection(&source_conn, &superuser_conn, &privileged_conn); get_superuser_connection(&source_conn, &superuser_conn, &privileged_conn);
if (slot_info.active == false) if (slot_info.active == false)
{ {
if (slot_exists_on_upstream == false) if (slot_exists_on_upstream == false)
{ {
if (drop_replication_slot(source_conn, node_record->slot_name) == true) if (drop_replication_slot(privileged_conn, node_record->slot_name) == true)
{ {
log_notice(_("replication slot \"%s\" deleted on source node"), node_record->slot_name); log_notice(_("replication slot \"%s\" deleted on source node"), node_record->slot_name);
} }
@@ -5798,7 +5870,7 @@ get_barman_property(char *dst, char *name, char *local_repmgr_directory)
initPQExpBuffer(&command_output); initPQExpBuffer(&command_output);
maxlen_snprintf(command, maxlen_snprintf(command,
"grep \"^\t%s:\" %s/show-server.txt", "grep \"^[[:space:]]%s:\" %s/show-server.txt",
name, local_repmgr_tmp_directory); name, local_repmgr_tmp_directory);
(void) local_command(command, &command_output); (void) local_command(command, &command_output);
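The change from a literal tab to `[[:space:]]` lets the pattern match a property line whether it is indented with a tab or a space. The sketch below checks the same POSIX pattern with `regcomp()`/`regexec()` instead of shelling out to `grep`. `line_has_property` is a hypothetical helper for illustration only, and it assumes the property name contains no regex metacharacters.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: true if "line" starts with exactly one whitespace
 * character followed by "<name>:", i.e. the pattern "^[[:space:]]<name>:".
 */
static bool
line_has_property(const char *line, const char *name)
{
	char		pattern[256];
	regex_t		re;
	bool		matched;

	assert(line != NULL && name != NULL);

	snprintf(pattern, sizeof(pattern), "^[[:space:]]%s:", name);

	/* flags without REG_EXTENDED: basic regular expressions, as grep uses by default */
	if (regcomp(&re, pattern, REG_NOSUB) != 0)
		return false;

	matched = (regexec(&re, line, 0, NULL, 0) == 0);
	regfree(&re);

	return matched;
}
```

A line indented with a tab and one indented with a single space both match; an unindented line does not.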
@@ -5995,45 +6067,6 @@ check_recovery_type(PGconn *conn)
} }
static void
drop_replication_slot_if_exists(PGconn *conn, int node_id, char *slot_name)
{
t_replication_slot slot_info = T_REPLICATION_SLOT_INITIALIZER;
RecordStatus record_status = get_slot_record(conn, slot_name, &slot_info);
log_verbose(LOG_DEBUG, "attempting to delete slot \"%s\" on node %i",
slot_name, node_id);
if (record_status != RECORD_FOUND)
{
log_info(_("no slot record found for slot \"%s\" on node %i"),
slot_name, node_id);
}
else
{
if (slot_info.active == false)
{
if (drop_replication_slot(conn, slot_name) == true)
{
log_notice(_("replication slot \"%s\" deleted on node %i"), slot_name, node_id);
}
else
{
log_error(_("unable to delete replication slot \"%s\" on node %i"), slot_name, node_id);
}
}
/*
* if active replication slot exists, call Houston as we have a
* problem
*/
else
{
log_warning(_("replication slot \"%s\" is still active on node %i"), slot_name, node_id);
}
}
}
/* /*
* Creates a recovery.conf file for a standby * Creates a recovery.conf file for a standby
@@ -6504,6 +6537,7 @@ do_standby_help(void)
puts(""); puts("");
printf(_(" \"standby clone\" clones a standby from the primary or an upstream node.\n")); printf(_(" \"standby clone\" clones a standby from the primary or an upstream node.\n"));
puts(""); puts("");
printf(_(" -d, --dbname=conninfo conninfo of the upstream node to use for cloning.\n"));
printf(_(" -c, --fast-checkpoint force fast checkpoint\n")); printf(_(" -c, --fast-checkpoint force fast checkpoint\n"));
printf(_(" --copy-external-config-files[={samepath|pgdata}]\n" \ printf(_(" --copy-external-config-files[={samepath|pgdata}]\n" \
" copy configuration files located outside the \n" \ " copy configuration files located outside the \n" \

@@ -310,55 +310,59 @@ do_witness_register(void)
void void
do_witness_unregister(void) do_witness_unregister(void)
{ {
PGconn *witness_conn = NULL; PGconn *local_conn = NULL;
PGconn *primary_conn = NULL; PGconn *primary_conn = NULL;
t_node_info node_record = T_NODE_INFO_INITIALIZER; t_node_info node_record = T_NODE_INFO_INITIALIZER;
RecordStatus record_status = RECORD_NOT_FOUND; RecordStatus record_status = RECORD_NOT_FOUND;
bool node_record_deleted = false; bool node_record_deleted = false;
bool witness_available = true; bool local_node_available = true;
int witness_node_id = UNKNOWN_NODE_ID;
log_info(_("connecting to witness node \"%s\" (ID: %i)"), if (runtime_options.node_id != UNKNOWN_NODE_ID)
{
/* user has specified the witness node id */
witness_node_id = runtime_options.node_id;
}
else
{
/* assume witness node is local node */
witness_node_id = config_file_options.node_id;
}
log_info(_("connecting to node \"%s\" (ID: %i)"),
config_file_options.node_name, config_file_options.node_name,
config_file_options.node_id); config_file_options.node_id);
witness_conn = establish_db_connection_quiet(config_file_options.conninfo); local_conn = establish_db_connection_quiet(config_file_options.conninfo);
if (PQstatus(witness_conn) != CONNECTION_OK) if (PQstatus(local_conn) != CONNECTION_OK)
{ {
if (!runtime_options.force) if (!runtime_options.force)
{ {
log_error(_("unable to connect to witness node \"%s\" (ID: %i)"), log_error(_("unable to connect to node \"%s\" (ID: %i)"),
config_file_options.node_name, config_file_options.node_name,
config_file_options.node_id); config_file_options.node_id);
log_detail("%s", PQerrorMessage(witness_conn)); log_detail("%s", PQerrorMessage(local_conn));
log_hint(_("provide -F/--force to remove the witness record if the server is not running"));
exit(ERR_BAD_CONFIG); exit(ERR_BAD_CONFIG);
} }
log_notice(_("unable to connect to witness node \"%s\" (ID: %i), removing node record on cluster primary only"), log_notice(_("unable to connect to witness node \"%s\" (ID: %i), removing node record on cluster primary only"),
config_file_options.node_name, config_file_options.node_name,
config_file_options.node_id); config_file_options.node_id);
witness_available = false; local_node_available = false;
} }
if (witness_available == true) if (local_node_available == true)
{ {
primary_conn = get_primary_connection_quiet(witness_conn, NULL, NULL); primary_conn = get_primary_connection_quiet(local_conn, NULL, NULL);
} }
else else
{ {
/* /*
* Extract the repmgr user and database names from the conninfo string * Assume user has provided connection details for the primary server
* provided in repmgr.conf
*/ */
get_conninfo_value(config_file_options.conninfo, "user", repmgr_user);
get_conninfo_value(config_file_options.conninfo, "dbname", repmgr_db);
param_set_ine(&source_conninfo, "user", repmgr_user);
param_set_ine(&source_conninfo, "dbname", repmgr_db);
primary_conn = establish_db_connection_by_params(&source_conninfo, false); primary_conn = establish_db_connection_by_params(&source_conninfo, false);
} }
if (PQstatus(primary_conn) != CONNECTION_OK) if (PQstatus(primary_conn) != CONNECTION_OK)
@@ -366,26 +370,26 @@ do_witness_unregister(void)
log_error(_("unable to connect to primary")); log_error(_("unable to connect to primary"));
log_detail("%s", PQerrorMessage(primary_conn)); log_detail("%s", PQerrorMessage(primary_conn));
if (witness_available == true) if (local_node_available == true)
{ {
PQfinish(witness_conn); PQfinish(local_conn);
} }
else else if (runtime_options.connection_param_provided == false)
{ {
log_hint(_("provide connection details to primary server")); log_hint(_("provide connection details for the primary server"));
} }
exit(ERR_BAD_CONFIG); exit(ERR_BAD_CONFIG);
} }
/* Check node exists and is really a witness */ /* Check node exists and is really a witness */
record_status = get_node_record(primary_conn, config_file_options.node_id, &node_record); record_status = get_node_record(primary_conn, witness_node_id, &node_record);
if (record_status != RECORD_FOUND) if (record_status != RECORD_FOUND)
{ {
log_error(_("no record found for node %i"), config_file_options.node_id); log_error(_("no record found for node %i"), witness_node_id);
if (witness_available == true) if (local_node_available == true)
PQfinish(witness_conn); PQfinish(local_conn);
PQfinish(primary_conn); PQfinish(primary_conn);
exit(ERR_BAD_CONFIG); exit(ERR_BAD_CONFIG);
@@ -393,11 +397,17 @@ do_witness_unregister(void)
if (node_record.type != WITNESS) if (node_record.type != WITNESS)
{ {
/*
* The node (either explicitly provided with --node-id, or the local node)
* is not a witness.
*
* TODO: scan node list and print hint about identity of known witness servers.
*/
log_error(_("node %i is not a witness node"), config_file_options.node_id); log_error(_("node %i is not a witness node"), config_file_options.node_id);
log_detail(_("node %i is a %s node"), config_file_options.node_id, get_node_type_string(node_record.type)); log_detail(_("node %i is a %s node"), config_file_options.node_id, get_node_type_string(node_record.type));
if (witness_available == true) if (local_node_available == true)
PQfinish(witness_conn); PQfinish(local_conn);
PQfinish(primary_conn); PQfinish(primary_conn);
exit(ERR_BAD_CONFIG); exit(ERR_BAD_CONFIG);
@@ -406,49 +416,43 @@ do_witness_unregister(void)
if (runtime_options.dry_run == true) if (runtime_options.dry_run == true)
{ {
log_info(_("prerequisites for unregistering the witness node are met")); log_info(_("prerequisites for unregistering the witness node are met"));
if (witness_available == true) if (local_node_available == true)
PQfinish(witness_conn); PQfinish(local_conn);
PQfinish(primary_conn); PQfinish(primary_conn);
exit(SUCCESS); exit(SUCCESS);
} }
log_info(_("unregistering witness node %i"), config_file_options.node_id); log_info(_("unregistering witness node %i"), witness_node_id);
node_record_deleted = delete_node_record(primary_conn, node_record_deleted = delete_node_record(primary_conn,
config_file_options.node_id); witness_node_id);
if (node_record_deleted == false) if (node_record_deleted == false)
{ {
PQfinish(primary_conn); PQfinish(primary_conn);
PQfinish(witness_conn);
exit(ERR_BAD_CONFIG);
}
/* sync records from primary */ if (local_node_available == true)
if (witness_available == true && witness_copy_node_records(primary_conn, witness_conn) == false) PQfinish(local_conn);
{ PQfinish(local_conn);
log_error(_("unable to copy repmgr node records from primary"));
PQfinish(primary_conn);
PQfinish(witness_conn);
exit(ERR_BAD_CONFIG); exit(ERR_BAD_CONFIG);
} }
/* Log the event */ /* Log the event */
create_event_record(primary_conn, create_event_record(primary_conn,
&config_file_options, &config_file_options,
config_file_options.node_id, witness_node_id,
"witness_unregister", "witness_unregister",
true, true,
NULL); NULL);
PQfinish(primary_conn); PQfinish(primary_conn);
if (witness_available == true) if (local_node_available == true)
PQfinish(witness_conn); PQfinish(local_conn);
log_info(_("witness unregistration complete")); log_info(_("witness unregistration complete"));
log_detail(_("witness node with id %i (conninfo: %s) successfully unregistered"), log_detail(_("witness node with ID %i successfully unregistered"),
config_file_options.node_id, config_file_options.conninfo); witness_node_id);
return; return;
} }
@@ -468,16 +472,19 @@ void do_witness_help(void)
puts(""); puts("");
printf(_(" Requires provision of connection information for the primary\n")); printf(_(" Requires provision of connection information for the primary\n"));
puts(""); puts("");
printf(_(" --dry-run check prerequisites but don't make any changes\n")); printf(_(" --dry-run check prerequisites but don't make any changes\n"));
printf(_(" -F, --force overwrite an existing node record\n")); printf(_(" -F, --force overwrite an existing node record\n"));
puts(""); puts("");
printf(_("WITNESS UNREGISTER\n")); printf(_("WITNESS UNREGISTER\n"));
puts(""); puts("");
printf(_(" \"witness unregister\" unregisters a witness node.\n")); printf(_(" \"witness unregister\" unregisters a witness node.\n"));
puts(""); puts("");
printf(_(" --dry-run check prerequisites but don't make any changes\n")); printf(_(" --dry-run check prerequisites but don't make any changes\n"));
printf(_(" -F, --force unregister when witness node not running\n")); printf(_(" -F, --force unregister when witness node not running\n"));
printf(_(" --node-id node ID of the witness node (provide if executing on\n"));
printf(_(" another node)\n"));
puts(""); puts("");
return; return;

@@ -47,6 +47,7 @@ typedef struct
/* logging options */ /* logging options */
char log_level[MAXLEN]; /* overrides setting in repmgr.conf */ char log_level[MAXLEN]; /* overrides setting in repmgr.conf */
bool log_to_file; bool log_to_file;
bool quiet;
bool terse; bool terse;
bool verbose; bool verbose;
@@ -106,6 +107,7 @@ typedef struct
bool replication_lag; bool replication_lag;
bool role; bool role;
bool slots; bool slots;
bool missing_slots;
bool has_passfile; bool has_passfile;
bool replication_connection; bool replication_connection;
@@ -137,7 +139,7 @@ typedef struct
/* general configuration options */ \ /* general configuration options */ \
"", false, false, "", false, false, \ "", false, false, "", false, false, \
/* logging options */ \ /* logging options */ \
"", false, false, false, \ "", false, false, false, false, \
/* output options */ \ /* output options */ \
false, false, false, \ false, false, false, \
/* database connection options */ \ /* database connection options */ \
@@ -152,13 +154,13 @@ typedef struct
/* "standby clone"/"standby follow" options */ \ /* "standby clone"/"standby follow" options */ \
NO_UPSTREAM_NODE, \ NO_UPSTREAM_NODE, \
/* "standby register" options */ \ /* "standby register" options */ \
false, 0, DEFAULT_WAIT_START, \ false, -1, DEFAULT_WAIT_START, \
/* "standby switchover" options */ \ /* "standby switchover" options */ \
false, false, "", false, \ false, false, "", false, \
/* "node status" options */ \ /* "node status" options */ \
false, \ false, \
/* "node check" options */ \ /* "node check" options */ \
false, false, false, false, false, false, false, \ false, false, false, false, false, false, false, false, \
/* "node join" options */ \ /* "node join" options */ \
"", \ "", \
/* "node service" options */ \ /* "node service" options */ \
@@ -235,5 +237,6 @@ extern void get_node_config_directory(char *config_dir_buf);
extern void get_node_data_directory(char *data_dir_buf); extern void get_node_data_directory(char *data_dir_buf);
extern void init_node_record(t_node_info *node_record); extern void init_node_record(t_node_info *node_record);
extern bool can_use_pg_rewind(PGconn *conn, const char *data_directory, PQExpBufferData *reason); extern bool can_use_pg_rewind(PGconn *conn, const char *data_directory, PQExpBufferData *reason);
extern void drop_replication_slot_if_exists(PGconn *conn, int node_id, char *slot_name);
#endif /* _REPMGR_CLIENT_GLOBAL_H_ */ #endif /* _REPMGR_CLIENT_GLOBAL_H_ */

@@ -98,7 +98,7 @@ main(int argc, char **argv)
{ {
t_conninfo_param_list default_conninfo = T_CONNINFO_PARAM_LIST_INITIALIZER; t_conninfo_param_list default_conninfo = T_CONNINFO_PARAM_LIST_INITIALIZER;
int optindex; int optindex = 0;
int c; int c;
char *repmgr_command = NULL; char *repmgr_command = NULL;
@@ -108,6 +108,7 @@ main(int argc, char **argv)
char *dummy_action = ""; char *dummy_action = "";
bool help_option = false; bool help_option = false;
bool option_error_found = false;
set_progname(argv[0]); set_progname(argv[0]);
@@ -178,7 +179,10 @@ main(int argc, char **argv)
strncpy(runtime_options.username, pw->pw_name, MAXLEN); strncpy(runtime_options.username, pw->pw_name, MAXLEN);
} }
while ((c = getopt_long(argc, argv, "?Vb:f:FwWd:h:p:U:R:S:D:ck:L:tvC:", long_options, /* Make getopt emit errors */
opterr = 1;
while ((c = getopt_long(argc, argv, "?Vb:f:FwWd:h:p:U:R:S:D:ck:L:qtvC:", long_options,
&optindex)) != -1) &optindex)) != -1)
{ {
/* /*
@@ -196,13 +200,7 @@ main(int argc, char **argv)
case OPT_HELP: /* --help */ case OPT_HELP: /* --help */
help_option = true; help_option = true;
break; break;
case '?':
/* Actual help option given */
if (strcmp(argv[optind - 1], "-?") == 0)
{
help_option = true;
}
break;
case 'V': case 'V':
/* /*
@@ -473,6 +471,10 @@ main(int argc, char **argv)
runtime_options.slots = true; runtime_options.slots = true;
break; break;
case OPT_MISSING_SLOTS:
runtime_options.missing_slots = true;
break;
case OPT_HAS_PASSFILE: case OPT_HAS_PASSFILE:
runtime_options.has_passfile = true; runtime_options.has_passfile = true;
break; break;
@@ -572,6 +574,12 @@ main(int argc, char **argv)
logger_output_mode = OM_DAEMON; logger_output_mode = OM_DAEMON;
break; break;
/* --quiet */
case 'q':
runtime_options.quiet = true;
break;
/* --terse */ /* --terse */
case 't': case 't':
runtime_options.terse = true; runtime_options.terse = true;
@@ -627,9 +635,24 @@ main(int argc, char **argv)
_("--recovery-min-apply-delay is now a configuration file parameter, \"recovery_min_apply_delay\"")); _("--recovery-min-apply-delay is now a configuration file parameter, \"recovery_min_apply_delay\""));
break; break;
case ':': /* missing option argument */
option_error_found = true;
break;
case '?':
/* Actual help option given? */
if (strcmp(argv[optind - 1], "-?") == 0)
{
help_option = true;
break;
}
/* otherwise fall through to default */
default: /* invalid option */
option_error_found = true;
break;
} }
} }
/* /*
* If -d/--dbname appears to be a conninfo string, validate by attempting * If -d/--dbname appears to be a conninfo string, validate by attempting
* to parse it (and if successful, store the parsed parameters) * to parse it (and if successful, store the parsed parameters)
@@ -730,9 +753,10 @@ main(int argc, char **argv)
if (cli_errors.head != NULL) if (cli_errors.head != NULL)
{ {
free_conninfo_params(&source_conninfo); free_conninfo_params(&source_conninfo);
exit_with_cli_errors(&cli_errors); exit_with_cli_errors(&cli_errors, NULL);
} }
/*---------- /*----------
* Determine the node type and action; following are valid: * Determine the node type and action; following are valid:
* *
@@ -979,9 +1003,30 @@ main(int argc, char **argv)
if (cli_errors.head != NULL) if (cli_errors.head != NULL)
{ {
free_conninfo_params(&source_conninfo); free_conninfo_params(&source_conninfo);
exit_with_cli_errors(&cli_errors);
exit_with_cli_errors(&cli_errors, valid_repmgr_command_found == true ? repmgr_command : NULL);
} }
/* no errors detected by repmgr, but getopt might have */
if (option_error_found == true)
{
if (valid_repmgr_command_found == true)
{
printf(_("Try \"%s --help\" or \"%s %s --help\" for more information.\n"),
progname(),
progname(),
repmgr_command);
}
else
{
printf(_("Try \"repmgr --help\" for more information.\n"));
}
free_conninfo_params(&source_conninfo);
exit(ERR_BAD_CONFIG);
}
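The option handling above has to separate two meanings of `'?'`: `getopt_long()` returns it both for a literal `-?` (listed in the option string) and for an unrecognised option. When the option string does not begin with `':'`, a missing option argument is also reported as `'?'`. The sketch below isolates that logic with a reduced option set. `parse_result` and `parse_options` are hypothetical names, and `opterr` is set to 0 here only to keep the example silent, whereas the change above sets it to 1.

```c
#include <assert.h>
#include <getopt.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical result type, for illustration only. */
typedef struct
{
	bool		help_requested;
	bool		option_error_found;
	bool		quiet;
} parse_result;

static parse_result
parse_options(int argc, char **argv)
{
	static struct option long_options[] = {
		{"quiet", no_argument, NULL, 'q'},
		{NULL, 0, NULL, 0}
	};
	parse_result result = {false, false, false};
	int			optindex = 0;
	int			c;

	assert(argv != NULL);

	optind = 1;					/* rescan from the first argument (BSD libcs also need optreset) */
	opterr = 0;					/* suppress getopt's own messages in this sketch */

	while ((c = getopt_long(argc, argv, "?q", long_options, &optindex)) != -1)
	{
		switch (c)
		{
			case 'q':
				result.quiet = true;
				break;
			case '?':
				/* literal "-?" is a help request, anything else is an error */
				if (strcmp(argv[optind - 1], "-?") == 0)
				{
					result.help_requested = true;
					break;
				}
				/* fall through */
			default:
				result.option_error_found = true;
				break;
		}
	}

	return result;
}
```

Calling `parse_options()` with `-?` sets only `help_requested`, while an unknown option such as `-x` sets only `option_error_found`, which is the distinction the final error check relies on.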
/* /*
* Print any warnings about inappropriate command line options, unless * Print any warnings about inappropriate command line options, unless
* -t/--terse set * -t/--terse set
@@ -1077,6 +1122,17 @@ main(int argc, char **argv)
logger_set_min_level(LOG_INFO); logger_set_min_level(LOG_INFO);
} }
/*
* If -q/--quiet supplied, suppress any non-ERROR log output.
* This overrides everything else; we'll leave it up to the user to deal with the
* consequences of e.g. running --dry-run together with -q/--quiet.
*/
if (runtime_options.quiet == true)
{
logger_set_level(LOG_ERROR);
}
/* /*
* Node configuration information is not needed for all actions, with * Node configuration information is not needed for all actions, with
@@ -1463,6 +1519,7 @@ check_cli_parameters(const int action)
{ {
case PRIMARY_UNREGISTER: case PRIMARY_UNREGISTER:
case STANDBY_UNREGISTER: case STANDBY_UNREGISTER:
case WITNESS_UNREGISTER:
case CLUSTER_EVENT: case CLUSTER_EVENT:
case CLUSTER_MATRIX: case CLUSTER_MATRIX:
case CLUSTER_CROSSCHECK: case CLUSTER_CROSSCHECK:
@@ -1503,6 +1560,7 @@ check_cli_parameters(const int action)
case STANDBY_CLONE: case STANDBY_CLONE:
case STANDBY_REGISTER: case STANDBY_REGISTER:
case STANDBY_FOLLOW: case STANDBY_FOLLOW:
case BDR_REGISTER:
break; break;
default: default:
item_list_append_format(&cli_warnings, item_list_append_format(&cli_warnings,
@@ -1845,7 +1903,7 @@ do_help(void)
printf(_(" %s [OPTIONS] standby {register|unregister|clone|promote|follow|switchover}\n"), progname()); printf(_(" %s [OPTIONS] standby {register|unregister|clone|promote|follow|switchover}\n"), progname());
printf(_(" %s [OPTIONS] bdr {register|unregister}\n"), progname()); printf(_(" %s [OPTIONS] bdr {register|unregister}\n"), progname());
printf(_(" %s [OPTIONS] node {status|check|rejoin|service}\n"), progname()); printf(_(" %s [OPTIONS] node {status|check|rejoin|service}\n"), progname());
printf(_(" %s [OPTIONS] cluster {show|event|matrix|crosscheck}\n"), progname()); printf(_(" %s [OPTIONS] cluster {show|event|matrix|crosscheck|cleanup}\n"), progname());
printf(_(" %s [OPTIONS] witness {register|unregister}\n"), progname()); printf(_(" %s [OPTIONS] witness {register|unregister}\n"), progname());
puts(""); puts("");
@@ -1894,6 +1952,7 @@ do_help(void)
printf(_(" --dry-run show what would happen for action, but don't execute it\n")); printf(_(" --dry-run show what would happen for action, but don't execute it\n"));
printf(_(" -L, --log-level set log level (overrides configuration file; default: NOTICE)\n")); printf(_(" -L, --log-level set log level (overrides configuration file; default: NOTICE)\n"));
printf(_(" --log-to-file log to file (or logging facility) defined in repmgr.conf\n")); printf(_(" --log-to-file log to file (or logging facility) defined in repmgr.conf\n"));
printf(_(" -q, --quiet suppress all log output apart from errors\n"));
printf(_(" -t, --terse don't display detail, hints and other non-critical output\n")); printf(_(" -t, --terse don't display detail, hints and other non-critical output\n"));
printf(_(" -v, --verbose display additional log output (useful for debugging)\n")); printf(_(" -v, --verbose display additional log output (useful for debugging)\n"));
@@ -2919,3 +2978,46 @@ can_use_pg_rewind(PGconn *conn, const char *data_directory, PQExpBufferData *rea
return can_use; return can_use;
} }
void
drop_replication_slot_if_exists(PGconn *conn, int node_id, char *slot_name)
{
t_replication_slot slot_info = T_REPLICATION_SLOT_INITIALIZER;
RecordStatus record_status = get_slot_record(conn, slot_name, &slot_info);
log_verbose(LOG_DEBUG, "attempting to delete slot \"%s\" on node %i",
slot_name, node_id);
if (record_status != RECORD_FOUND)
{
/* this is a good thing */
log_verbose(LOG_INFO,
_("slot \"%s\" does not exist on node %i, nothing to remove"),
slot_name, node_id);
}
else
{
if (slot_info.active == false)
{
if (drop_replication_slot(conn, slot_name) == true)
{
log_notice(_("replication slot \"%s\" deleted on node %i"), slot_name, node_id);
}
else
{
log_error(_("unable to delete replication slot \"%s\" on node %i"), slot_name, node_id);
}
}
/*
* if active replication slot exists, call Houston as we have a
* problem
*/
else
{
log_warning(_("replication slot \"%s\" is still active on node %i"), slot_name, node_id);
}
}
}

@@ -87,6 +87,7 @@
#define OPT_REMOTE_NODE_ID 1038 #define OPT_REMOTE_NODE_ID 1038
#define OPT_RECOVERY_CONF_ONLY 1039 #define OPT_RECOVERY_CONF_ONLY 1039
#define OPT_NO_WAIT 1040 #define OPT_NO_WAIT 1040
#define OPT_MISSING_SLOTS 1041
/* deprecated since 3.3 */ /* deprecated since 3.3 */
#define OPT_DATA_DIR 999 #define OPT_DATA_DIR 999
@@ -125,6 +126,7 @@ static struct option long_options[] =
/* logging options */ /* logging options */
{"log-level", required_argument, NULL, 'L'}, {"log-level", required_argument, NULL, 'L'},
{"log-to-file", no_argument, NULL, OPT_LOG_TO_FILE}, {"log-to-file", no_argument, NULL, OPT_LOG_TO_FILE},
{"quiet", no_argument, NULL, 'q'},
{"terse", no_argument, NULL, 't'}, {"terse", no_argument, NULL, 't'},
{"verbose", no_argument, NULL, 'v'}, {"verbose", no_argument, NULL, 'v'},
@@ -164,6 +166,7 @@ static struct option long_options[] =
{"replication-lag", no_argument, NULL, OPT_REPLICATION_LAG}, {"replication-lag", no_argument, NULL, OPT_REPLICATION_LAG},
{"role", no_argument, NULL, OPT_ROLE}, {"role", no_argument, NULL, OPT_ROLE},
{"slots", no_argument, NULL, OPT_SLOTS}, {"slots", no_argument, NULL, OPT_SLOTS},
{"missing-slots", no_argument, NULL, OPT_MISSING_SLOTS},
{"has-passfile", no_argument, NULL, OPT_HAS_PASSFILE}, {"has-passfile", no_argument, NULL, OPT_HAS_PASSFILE},
{"replication-connection", no_argument, NULL, OPT_REPL_CONN}, {"replication-connection", no_argument, NULL, OPT_REPL_CONN},

@@ -416,9 +416,9 @@ unset_bdr_failover_handler(PG_FUNCTION_ARGS)
LWLockAcquire(shared_state->lock, LW_EXCLUSIVE); LWLockAcquire(shared_state->lock, LW_EXCLUSIVE);
shared_state->bdr_failover_handler = UNKNOWN_NODE_ID; shared_state->bdr_failover_handler = UNKNOWN_NODE_ID;
LWLockRelease(shared_state->lock);
} }
LWLockRelease(shared_state->lock);
PG_RETURN_VOID(); PG_RETURN_VOID();
} }
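
The hunk above moves LWLockRelease() inside the block that ends just after the assignment. Assuming the matching LWLockAcquire() sits inside that same conditional block (the hunk suggests this but does not show the opening brace or the condition), the old layout would release a lock that was never taken whenever the condition was false. The sketch below models this with a toy lock counter; ToyLock, the guard flag and both function names are hypothetical and do not use the PostgreSQL LWLock API.

```c
#include <stdbool.h>

/* Toy lock: depth = number of acquires minus number of releases. */
typedef struct
{
	int			depth;
} ToyLock;

static void toy_lock_acquire(ToyLock *lock) { lock->depth++; }
static void toy_lock_release(ToyLock *lock) { lock->depth--; }

/* Old layout: release sits outside the guarded block. */
int
unset_handler_old_shape(ToyLock *lock, bool guard, int *handler)
{
	if (guard)
	{
		toy_lock_acquire(lock);
		*handler = -1;
	}
	toy_lock_release(lock);		/* unbalanced when guard is false */
	return lock->depth;
}

/* New layout: acquire and release are paired inside the same block. */
int
unset_handler_new_shape(ToyLock *lock, bool guard, int *handler)
{
	if (guard)
	{
		toy_lock_acquire(lock);
		*handler = -1;
		toy_lock_release(lock);
	}
	return lock->depth;
}
```

A return value of 0 means the lock is balanced; the old layout returns a negative depth when the guard is false.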

@@ -98,7 +98,7 @@
#log_facility=STDERR # Logging facility: possible values are STDERR, or for #log_facility=STDERR # Logging facility: possible values are STDERR, or for
# syslog integration, one of LOCAL0, LOCAL1, ..., LOCAL7, USER # syslog integration, one of LOCAL0, LOCAL1, ..., LOCAL7, USER
#log_file='' # stderr can be redirected to an arbitrary file #log_file='' # STDERR can be redirected to an arbitrary file
#log_status_interval=300 # interval (in seconds) for repmgrd to log a status message #log_status_interval=300 # interval (in seconds) for repmgrd to log a status message
@@ -143,6 +143,11 @@
# Debian/Ubuntu users: you will probably need to # Debian/Ubuntu users: you will probably need to
# set this to the directory where `pg_ctl` is located, # set this to the directory where `pg_ctl` is located,
# e.g. /usr/lib/postgresql/9.6/bin/ # e.g. /usr/lib/postgresql/9.6/bin/
#
# *NOTE* "pg_bindir" is only used when repmgr directly
# executes PostgreSQL binaries; any user-defined scripts
# *must* be specified with the full path
#
#use_primary_conninfo_password=false # explicitly set "password" in recovery.conf's #use_primary_conninfo_password=false # explicitly set "password" in recovery.conf's
# "primary_conninfo" parameter using the value contained # "primary_conninfo" parameter using the value contained
# in the environment variable PGPASSWORD # in the environment variable PGPASSWORD
@@ -156,7 +161,7 @@
# Examples: # Examples:
# #
# pg_ctl_options='-s' # pg_ctl_options='-s'
# pg_basebackup_options='--label=repmgr_backup # pg_basebackup_options='--label=repmgr_backup'
# rsync_options=--archive --checksum --compress --progress --rsh="ssh -o \"StrictHostKeyChecking no\"" # rsync_options=--archive --checksum --compress --progress --rsh="ssh -o \"StrictHostKeyChecking no\""
# ssh_options=-o "StrictHostKeyChecking no" # ssh_options=-o "StrictHostKeyChecking no"
@@ -183,11 +188,11 @@ ssh_options='-q -o ConnectTimeout=10' # Options to append to "ssh"
# parameter can be provided multiple times. # parameter can be provided multiple times.
#restore_command='' # This will be placed in the recovery.conf file generated #restore_command='' # This will be placed in the recovery.conf file generated
# by repmgr. # by repmgr.
#archive_cleanup_command='' # This will be placed in the recovery.conf file generated #archive_cleanup_command='' # This will be placed in the recovery.conf file generated
# by repmgr. Note we recommend using Barman for managing # by repmgr. Note we recommend using Barman for managing
# WAL archives (see: https://www.pgbarman.org ) # WAL archives (see: https://www.pgbarman.org )
#recovery_min_apply_delay= # If provided, "recovery_min_apply_delay" in recovery.conf #recovery_min_apply_delay= # If provided, "recovery_min_apply_delay" in recovery.conf
# will be set to this value (PostgreSQL 9.4 and later). # will be set to this value (PostgreSQL 9.4 and later).
@@ -207,7 +212,7 @@ ssh_options='-q -o ConnectTimeout=10' # Options to append to "ssh"
#------------------------------------------------------------------------------ #------------------------------------------------------------------------------
# Standby follow settings # "standby follow" settings
#------------------------------------------------------------------------------ #------------------------------------------------------------------------------
# These settings apply when instructing a standby to follow the new primary # These settings apply when instructing a standby to follow the new primary
@@ -219,6 +224,28 @@ ssh_options='-q -o ConnectTimeout=10' # Options to append to "ssh"
# for the standby to connect to the primary # for the standby to connect to the primary
#------------------------------------------------------------------------------
# "standby switchover" settings
#------------------------------------------------------------------------------
# These settings apply when switching roles between a primary and a standby
# ("repmgr standby switchover").
#standby_reconnect_timeout=60 # The max length of time (in seconds) to wait
# for the demoted standby to reconnect to the promoted
# primary (note: this value should be equal to or greater
# than that set for "node_rejoin_timeout")
#------------------------------------------------------------------------------
# "node rejoin" settings
#------------------------------------------------------------------------------
# These settings apply when reintegrating a node into a replication cluster
# with "repmgr node rejoin"
#node_rejoin_timeout=60 # The maximum length of time (in seconds) to wait for
# the node to reconnect to the replication cluster
#------------------------------------------------------------------------------ #------------------------------------------------------------------------------
# Barman options # Barman options
#------------------------------------------------------------------------------ #------------------------------------------------------------------------------
@@ -236,6 +263,11 @@ ssh_options='-q -o ConnectTimeout=10' # Options to append to "ssh"
# These settings are only applied when repmgrd is running. Values shown # These settings are only applied when repmgrd is running. Values shown
# are defaults. # are defaults.
#repmgrd_pid_file= # Path of PID file to use for repmgrd; if not set, a PID file will
# be generated in a temporary directory specified by the environment
# variable $TMPDIR, or if not set, in "/tmp". This value can be overridden
# by the command line option "-p/--pid-file"; the command line option
# "--no-pid-file" will force PID file creation to be skipped.
#failover=manual # one of 'automatic', 'manual'. #failover=manual # one of 'automatic', 'manual'.
# determines what action to take in the event of upstream failure # determines what action to take in the event of upstream failure
# #
@@ -245,13 +277,13 @@ ssh_options='-q -o ConnectTimeout=10' # Options to append to "ssh"
# manual attention to reattach it to replication # manual attention to reattach it to replication
# (does not apply to BDR mode) # (does not apply to BDR mode)
#priority=100 # indicate a preferred priorty for promoting nodes; #priority=100 # indicate a preferred priority for promoting nodes;
# a value of zero prevents the node being promoted to primary # a value of zero prevents the node being promoted to primary
# (default: 100) # (default: 100)
#reconnect_attempts=6 # Number attempts which will be made to reconnect to an unreachable #reconnect_attempts=6 # Number of attempts which will be made to reconnect to an unreachable
# primary (or other upstream node) # primary (or other upstream node)
#reconnect_interval=10 # Interval between attempts to reconnect to an unreachable #reconnect_interval=10 # Interval between attempts to reconnect to an unreachable
# primary (or other upstream node) # primary (or other upstream node)
#promote_command= # command repmgrd executes when promoting a new primary; use something like: #promote_command= # command repmgrd executes when promoting a new primary; use something like:
# #
@@ -265,8 +297,9 @@ ssh_options='-q -o ConnectTimeout=10' # Options to append to "ssh"
#primary_notification_timeout=60 # Interval (in seconds) which repmgrd on a standby #primary_notification_timeout=60 # Interval (in seconds) which repmgrd on a standby
# will wait for a notification from the new primary, # will wait for a notification from the new primary,
# before falling back to degraded monitoring # before falling back to degraded monitoring
#standby_reconnect_timeout=60 # Interval (in seconds) which repmgrd on a standby will wait #repmgrd_standby_startup_timeout=60 # Interval (in seconds) which repmgrd on a standby will wait
# to reconnect to the local node after executing "follow_command" # for the local node to restart and become ready to accept connections after
# executing "follow_command" (defaults to the value set in "standby_reconnect_timeout")
#monitoring_history=no # Whether to write monitoring data to the "monitoring_history" table #monitoring_history=no # Whether to write monitoring data to the "monitoring_history" table
#monitor_interval_secs=2 # Interval (in seconds) at which to write monitoring data #monitor_interval_secs=2 # Interval (in seconds) at which to write monitoring data
@@ -304,7 +337,7 @@ ssh_options='-q -o ConnectTimeout=10' # Options to append to "ssh"
# #
# Debian/Ubuntu users: use "sudo pg_ctlcluster" to execute service control commands. # Debian/Ubuntu users: use "sudo pg_ctlcluster" to execute service control commands.
# #
# For more details, see: https://repmgr.org/docs/4.0/configuration-service-commands.html # For more details, see: https://repmgr.org/docs/4.1/configuration-service-commands.html
#service_start_command = '' #service_start_command = ''
#service_stop_command = '' #service_stop_command = ''
@@ -348,7 +381,7 @@ ssh_options='-q -o ConnectTimeout=10' # Options to append to "ssh"
#------------------------------------------------------------------------------ #------------------------------------------------------------------------------
#bdr_local_monitoring_only=false # Only monitor the local node; no checks will be #bdr_local_monitoring_only=false # Only monitor the local node; no checks will be
# performed on the other node # performed on the other node
#bdr_recovery_timeout # If a BDR node was offline and has become available #bdr_recovery_timeout # If a BDR node was offline and has become available
# maximum length of time in seconds to wait for the # maximum length of time in seconds to wait for the
# node to reconnect to the cluster # node to reconnect to the cluster
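
The "repmgrd_pid_file" comment added in the sample file above describes a lookup order: command line option, then configuration file, then $TMPDIR, then "/tmp". A minimal sketch of that order follows. The function name, the "repmgrd.pid" file name and the choice to let "--no-pid-file" take precedence over "-p/--pid-file" are assumptions made for illustration, not repmgrd's actual implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Sketch only: returns the PID file path to use, or NULL if no PID file
 * should be written. "repmgrd.pid" is a placeholder file name.
 */
const char *
resolve_pid_file(bool no_pid_file,
				 const char *cli_pid_file,	/* -p/--pid-file, or NULL */
				 const char *conf_pid_file, /* repmgrd_pid_file, or NULL */
				 const char *tmpdir,		/* value of $TMPDIR, or NULL */
				 char *buf, size_t buflen)
{
	/* assumed: --no-pid-file skips PID file creation entirely */
	if (no_pid_file)
		return NULL;

	/* the command line option overrides the configuration file */
	if (cli_pid_file != NULL && cli_pid_file[0] != '\0')
		return cli_pid_file;

	if (conf_pid_file != NULL && conf_pid_file[0] != '\0')
		return conf_pid_file;

	/* nothing set: fall back to $TMPDIR, then "/tmp" */
	if (tmpdir == NULL || tmpdir[0] == '\0')
		tmpdir = "/tmp";

	snprintf(buf, buflen, "%s/repmgrd.pid", tmpdir);
	return buf;
}
```

A caller would pass getenv("TMPDIR") as the tmpdir argument and a buffer large enough for the generated path.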

@@ -1,6 +1,6 @@
# repmgr extension # repmgr extension
comment = 'Replication manager for PostgreSQL' comment = 'Replication manager for PostgreSQL'
default_version = '4.0' default_version = '4.1'
module_pathname = '$libdir/repmgr' module_pathname = '$libdir/repmgr'
relocatable = false relocatable = false
schema = repmgr schema = repmgr

@@ -49,6 +49,8 @@
#define REPLICATION_TYPE_BDR 2 #define REPLICATION_TYPE_BDR 2
#define UNKNOWN_SERVER_VERSION_NUM -1 #define UNKNOWN_SERVER_VERSION_NUM -1
#define UNKNOWN_BDR_VERSION_NUM -1
#define UNKNOWN_TIMELINE_ID -1 #define UNKNOWN_TIMELINE_ID -1
#define UNKNOWN_SYSTEM_IDENTIFIER 0 #define UNKNOWN_SYSTEM_IDENTIFIER 0
@@ -58,6 +60,8 @@
#define VOTING_TERM_NOT_SET -1 #define VOTING_TERM_NOT_SET -1
#define BDR2_REPLICATION_SET_NAME "repmgr"
/* /*
* various default values - ensure repmgr.conf.sample is updated * various default values - ensure repmgr.conf.sample is updated
* if any of these are changed * if any of these are changed
@@ -81,6 +85,7 @@
#define DEFAULT_PROMOTE_CHECK_TIMEOUT 60 /* seconds */ #define DEFAULT_PROMOTE_CHECK_TIMEOUT 60 /* seconds */
#define DEFAULT_PROMOTE_CHECK_INTERVAL 1 /* seconds */ #define DEFAULT_PROMOTE_CHECK_INTERVAL 1 /* seconds */
#define DEFAULT_STANDBY_RECONNECT_TIMEOUT 60 /* seconds */ #define DEFAULT_STANDBY_RECONNECT_TIMEOUT 60 /* seconds */
#define DEFAULT_NODE_REJOIN_TIMEOUT 60 /* seconds */
#ifndef RECOVERY_COMMAND_FILE #ifndef RECOVERY_COMMAND_FILE
#define RECOVERY_COMMAND_FILE "recovery.conf" #define RECOVERY_COMMAND_FILE "recovery.conf"

@@ -1,3 +1,2 @@
#define REPMGR_VERSION_DATE "" #define REPMGR_VERSION_DATE ""
#define REPMGR_VERSION "4.0.6" #define REPMGR_VERSION "4.1.2"

@@ -214,7 +214,8 @@ monitor_bdr(void)
log_warning(_("unable to connect to node %s (ID %i)"), log_warning(_("unable to connect to node %s (ID %i)"),
cell->node_info->node_name, cell->node_info->node_id); cell->node_info->node_name, cell->node_info->node_id);
cell->node_info->conn = try_reconnect(cell->node_info); //cell->node_info->conn = try_reconnect(cell->node_info);
try_reconnect(&cell->node_info->conn, cell->node_info);
/* node has recovered - log and continue */ /* node has recovered - log and continue */
if (cell->node_info->node_status == NODE_STATUS_UP) if (cell->node_info->node_status == NODE_STATUS_UP)
@@ -293,7 +294,7 @@ loop:
/* /*
* if we can reload, then could need to change local_conn * if we can reload, then could need to change local_conn
*/ */
if (reload_config(&config_file_options)) if (reload_config(&config_file_options, BDR))
{ {
PQfinish(local_conn); PQfinish(local_conn);
local_conn = establish_db_connection(config_file_options.conninfo, true); local_conn = establish_db_connection(config_file_options.conninfo, true);
@@ -303,11 +304,12 @@ loop:
got_SIGHUP = false; got_SIGHUP = false;
} }
/* XXX this looks like it will never be called */
if (got_SIGHUP) if (got_SIGHUP)
{ {
log_debug("SIGHUP received"); log_debug("SIGHUP received");
if (reload_config(&config_file_options)) if (reload_config(&config_file_options, BDR))
{ {
PQfinish(local_conn); PQfinish(local_conn);
local_conn = establish_db_connection(config_file_options.conninfo, true); local_conn = establish_db_connection(config_file_options.conninfo, true);
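
Several hunks in this comparison change try_reconnect() from returning a connection to taking the caller's connection handle as an out-parameter. One plausible benefit of that shape, sketched below with toy types, is that the function can release the stale handle itself before installing a fresh one, so callers no longer need to close the connection first. ToyConn, ToyNode, toy_connect() and toy_try_reconnect() are hypothetical; this is not repmgr's implementation and it does not use libpq.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a connection handle and a node record. */
typedef struct { bool ok; } ToyConn;
typedef struct { bool reachable; bool up; } ToyNode;

/* Pretend to connect: succeeds only if the node is reachable. */
static ToyConn *
toy_connect(const ToyNode *node)
{
	ToyConn    *conn;

	if (!node->reachable)
		return NULL;

	conn = malloc(sizeof(ToyConn));
	if (conn != NULL)
		conn->ok = true;
	return conn;
}

/*
 * Out-parameter style: on success the stale handle is freed and replaced;
 * on failure the caller's handle is left untouched and the node is
 * marked as down.
 */
void
toy_try_reconnect(ToyConn **conn, ToyNode *node, int max_attempts)
{
	int			attempt;

	for (attempt = 0; attempt < max_attempts; attempt++)
	{
		ToyConn    *fresh = toy_connect(node);

		if (fresh != NULL)
		{
			free(*conn);		/* release the stale handle, if any */
			*conn = fresh;
			node->up = true;
			return;
		}
	}

	node->up = false;
}
```

With the return-value style, a caller that forgot to close the old handle before assigning the result would leak it; the out-parameter style keeps that responsibility in one place.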

@@ -60,6 +60,8 @@ static int primary_node_id = UNKNOWN_NODE_ID;
static t_node_info upstream_node_info = T_NODE_INFO_INITIALIZER; static t_node_info upstream_node_info = T_NODE_INFO_INITIALIZER;
static NodeInfoList sibling_nodes = T_NODE_INFO_LIST_INITIALIZER; static NodeInfoList sibling_nodes = T_NODE_INFO_LIST_INITIALIZER;
static instr_time last_monitoring_update;
static ElectionResult do_election(void); static ElectionResult do_election(void);
static const char *_print_election_result(ElectionResult result); static const char *_print_election_result(ElectionResult result);
@@ -81,6 +83,8 @@ static bool do_witness_failover(void);
static void update_monitoring_history(void); static void update_monitoring_history(void);
static void handle_sighup(PGconn **conn, t_server_type server_type);
static const char * format_failover_state(FailoverState failover_state); static const char * format_failover_state(FailoverState failover_state);
@@ -162,8 +166,8 @@ do_physical_node_check(void)
if (config_file_options.failover == FAILOVER_AUTOMATIC) if (config_file_options.failover == FAILOVER_AUTOMATIC)
{ {
/* /*
* check that promote/follow commands are defined, otherwise repmgrd * Check that "promote_command" and "follow_command" are defined, otherwise repmgrd
* won't be able to perform any useful action * won't be able to perform any useful action in a failover situation.
*/ */
bool required_param_missing = false; bool required_param_missing = false;
@@ -175,14 +179,24 @@ do_physical_node_check(void)
if (config_file_options.service_promote_command[0] != '\0') if (config_file_options.service_promote_command[0] != '\0')
{ {
/* /*
* if repmgrd executes "service_promote_command" directly, * "service_promote_command" is *not* a substitute for "promote_command";
* repmgr metadata won't get updated * it is intended for use in those systems (e.g. Debian) where there's a service
* level promote command (e.g. pg_ctlcluster).
*
* "promote_command" should either execute "repmgr standby promote" directly, or
* a script which executes "repmgr standby promote". This is essential, as the
* repmgr metadata is updated by "repmgr standby promote".
*
* "service_promote_command", if set, will be executed by "repmgr standby promote",
* but never by repmgrd.
*
*/ */
log_hint(_("\"service_promote_command\" is set, but can only be executed by \"repmgr standby promote\"")); log_hint(_("\"service_promote_command\" is set, but can only be executed by \"repmgr standby promote\""));
} }
required_param_missing = true; required_param_missing = true;
} }
if (config_file_options.follow_command[0] == '\0') if (config_file_options.follow_command[0] == '\0')
{ {
log_error(_("\"follow_command\" must be defined in the configuration file")); log_error(_("\"follow_command\" must be defined in the configuration file"));
@@ -254,7 +268,12 @@ monitor_streaming_primary(void)
* TODO: cache node list here, refresh at `node_list_refresh_interval` * TODO: cache node list here, refresh at `node_list_refresh_interval`
* also return reason for unavailability so we can log it * also return reason for unavailability so we can log it
*/ */
if (is_server_available(local_node_info.conninfo) == false)
(void) connection_ping(local_conn);
check_connection(&local_node_info, &local_conn);
if (PQstatus(local_conn) != CONNECTION_OK)
{ {
/* local node is down, we were expecting it to be up */ /* local node is down, we were expecting it to be up */
@@ -274,8 +293,6 @@ monitor_streaming_primary(void)
local_node_info.node_status = NODE_STATUS_UNKNOWN; local_node_info.node_status = NODE_STATUS_UNKNOWN;
close_connection(&local_conn);
/* /*
* as we're monitoring the primary, no point in trying to * as we're monitoring the primary, no point in trying to
* write the event to the database * write the event to the database
@@ -291,11 +308,12 @@ monitor_streaming_primary(void)
termPQExpBuffer(&event_details); termPQExpBuffer(&event_details);
local_conn = try_reconnect(&local_node_info); try_reconnect(&local_conn, &local_node_info);
if (local_node_info.node_status == NODE_STATUS_UP) if (local_node_info.node_status == NODE_STATUS_UP)
{ {
int local_node_unreachable_elapsed = calculate_elapsed(local_node_unreachable_start); int local_node_unreachable_elapsed = calculate_elapsed(local_node_unreachable_start);
int stored_local_node_id = UNKNOWN_NODE_ID;
initPQExpBuffer(&event_details); initPQExpBuffer(&event_details);
@@ -312,6 +330,17 @@ monitor_streaming_primary(void)
event_details.data); event_details.data);
termPQExpBuffer(&event_details); termPQExpBuffer(&event_details);
/*
* If the local node was restarted, we'll need to reinitialise values
* stored in shared memory.
*/
stored_local_node_id = repmgrd_get_local_node_id(local_conn);
if (stored_local_node_id == UNKNOWN_NODE_ID)
{
repmgrd_set_local_node_id(local_conn, config_file_options.node_id);
}
goto loop; goto loop;
} }
@@ -535,26 +564,7 @@ loop:
if (got_SIGHUP) if (got_SIGHUP)
{ {
log_debug("SIGHUP received"); handle_sighup(&local_conn, PRIMARY);
if (reload_config(&config_file_options))
{
close_connection(&local_conn);
local_conn = establish_db_connection(config_file_options.conninfo, true);
if (*config_file_options.log_file)
{
FILE *fd;
fd = freopen(config_file_options.log_file, "a", stderr);
if (fd == NULL)
{
fprintf(stderr, "error reopening stderr to \"%s\": %s",
config_file_options.log_file, strerror(errno));
}
}
}
got_SIGHUP = false;
} }
log_verbose(LOG_DEBUG, "sleeping %i seconds (parameter \"monitor_interval_secs\")", log_verbose(LOG_DEBUG, "sleeping %i seconds (parameter \"monitor_interval_secs\")",
@@ -572,9 +582,11 @@ monitor_streaming_standby(void)
instr_time log_status_interval_start; instr_time log_status_interval_start;
PQExpBufferData event_details; PQExpBufferData event_details;
log_debug("monitor_streaming_standby()");
reset_node_voting_status(); reset_node_voting_status();
log_debug("monitor_streaming_standby()"); INSTR_TIME_SET_ZERO(last_monitoring_update);
/* /*
* If no upstream node id is specified in the metadata, we'll try and * If no upstream node id is specified in the metadata, we'll try and
@@ -723,10 +735,9 @@ monitor_streaming_standby(void)
_("unable to connect to upstream node \"%s\" (node ID: %i)"), _("unable to connect to upstream node \"%s\" (node ID: %i)"),
upstream_node_info.node_name, upstream_node_info.node_id); upstream_node_info.node_name, upstream_node_info.node_id);
/* */ /* XXX possible pre-action event */
if (upstream_node_info.type == STANDBY) if (upstream_node_info.type == STANDBY)
{ {
/* XXX possible pre-action event */
create_event_record(primary_conn, create_event_record(primary_conn,
&config_file_options, &config_file_options,
config_file_options.node_id, config_file_options.node_id,
@@ -748,8 +759,6 @@ monitor_streaming_standby(void)
log_warning("%s", event_details.data); log_warning("%s", event_details.data);
termPQExpBuffer(&event_details); termPQExpBuffer(&event_details);
close_connection(&upstream_conn);
/* /*
* if local node is unreachable, make a last-minute attempt to reconnect * if local node is unreachable, make a last-minute attempt to reconnect
* before continuing with the failover process * before continuing with the failover process
@@ -760,13 +769,18 @@ monitor_streaming_standby(void)
check_connection(&local_node_info, &local_conn); check_connection(&local_node_info, &local_conn);
} }
upstream_conn = try_reconnect(&upstream_node_info); try_reconnect(&upstream_conn, &upstream_node_info);
/* Node has recovered - log and continue */ /* Node has recovered - log and continue */
if (upstream_node_info.node_status == NODE_STATUS_UP) if (upstream_node_info.node_status == NODE_STATUS_UP)
{ {
int upstream_node_unreachable_elapsed = calculate_elapsed(upstream_node_unreachable_start); int upstream_node_unreachable_elapsed = calculate_elapsed(upstream_node_unreachable_start);
if (upstream_node_info.type == PRIMARY)
{
primary_conn = upstream_conn;
}
initPQExpBuffer(&event_details); initPQExpBuffer(&event_details);
appendPQExpBuffer(&event_details, appendPQExpBuffer(&event_details,
@@ -774,7 +788,7 @@ monitor_streaming_standby(void)
upstream_node_unreachable_elapsed); upstream_node_unreachable_elapsed);
log_notice("%s", event_details.data); log_notice("%s", event_details.data);
create_event_notification(upstream_conn, create_event_notification(primary_conn,
&config_file_options, &config_file_options,
config_file_options.node_id, config_file_options.node_id,
"repmgrd_upstream_reconnect", "repmgrd_upstream_reconnect",
@@ -994,6 +1008,13 @@ monitor_streaming_standby(void)
continue; continue;
} }
/* skip witness node - we can't possibly "follow" that */
if (cell->node_info->type == WITNESS)
{
continue;
}
cell->node_info->conn = establish_db_connection(cell->node_info->conninfo, false); cell->node_info->conn = establish_db_connection(cell->node_info->conninfo, false);
if (PQstatus(cell->node_info->conn) != CONNECTION_OK) if (PQstatus(cell->node_info->conn) != CONNECTION_OK)
@@ -1016,6 +1037,7 @@ monitor_streaming_standby(void)
follow_new_primary(follow_node_id); follow_new_primary(follow_node_id);
} }
} }
clear_node_info_list(&sibling_nodes); clear_node_info_list(&sibling_nodes);
} }
} }
@@ -1044,8 +1066,7 @@ loop:
if (config_file_options.failover == FAILOVER_MANUAL) if (config_file_options.failover == FAILOVER_MANUAL)
{ {
appendPQExpBuffer( appendPQExpBuffer(&monitoring_summary,
&monitoring_summary,
_(" (automatic failover disabled)")); _(" (automatic failover disabled)"));
} }
@@ -1055,6 +1076,18 @@ loop:
{ {
log_detail(_("waiting for upstream or another primary to reappear")); log_detail(_("waiting for upstream or another primary to reappear"));
} }
else if (config_file_options.monitoring_history == true)
{
if (INSTR_TIME_IS_ZERO(last_monitoring_update))
{
log_detail(_("no monitoring statistics have been written yet"));
}
else
{
log_detail(_("last monitoring statistics update was %i seconds ago"),
calculate_elapsed(last_monitoring_update));
}
}
INSTR_TIME_SET_CURRENT(log_status_interval_start); INSTR_TIME_SET_CURRENT(log_status_interval_start);
} }
@@ -1066,7 +1099,16 @@ loop:
} }
else else
{ {
connection_ping(local_conn); if (config_file_options.monitoring_history == true)
{
log_verbose(LOG_WARNING, _("monitoring_history requested but primary connection not available"));
}
/*
* if monitoring not in use, we'll need to ensure the local connection
* handle isn't stale
*/
(void) connection_ping(local_conn);
} }
/* /*
@@ -1119,8 +1161,11 @@ loop:
} }
else else
{ {
/* we've reconnected to the local node after an outage */
if (local_node_info.active == false) if (local_node_info.active == false)
{ {
int stored_local_node_id = UNKNOWN_NODE_ID;
if (PQstatus(primary_conn) == CONNECTION_OK) if (PQstatus(primary_conn) == CONNECTION_OK)
{ {
if (update_node_record_set_active(primary_conn, local_node_info.node_id, true) == true) if (update_node_record_set_active(primary_conn, local_node_info.node_id, true) == true)
@@ -1136,45 +1181,36 @@ loop:
local_node_info.node_name, local_node_info.node_name,
local_node_info.node_id); local_node_info.node_id);
log_warning("%s", event_details.data) log_notice("%s", event_details.data);
create_event_notification(primary_conn,
create_event_notification(primary_conn, &config_file_options,
&config_file_options, local_node_info.node_id,
local_node_info.node_id, "standby_recovery",
"standby_recovery", true,
true, event_details.data);
event_details.data);
termPQExpBuffer(&event_details); termPQExpBuffer(&event_details);
} }
} }
/*
* If the local node was restarted, we'll need to reinitialise values
* stored in shared memory.
*/
stored_local_node_id = repmgrd_get_local_node_id(local_conn);
if (stored_local_node_id == UNKNOWN_NODE_ID)
{
repmgrd_set_local_node_id(local_conn, config_file_options.node_id);
}
} }
} }
if (got_SIGHUP) if (got_SIGHUP)
{ {
log_debug("SIGHUP received"); handle_sighup(&local_conn, STANDBY);
if (reload_config(&config_file_options))
{
close_connection(&local_conn);
local_conn = establish_db_connection(config_file_options.conninfo, true);
if (*config_file_options.log_file)
{
FILE *fd;
fd = freopen(config_file_options.log_file, "a", stderr);
if (fd == NULL)
{
fprintf(stderr, "error reopening stderr to \"%s\": %s",
config_file_options.log_file, strerror(errno));
}
}
}
got_SIGHUP = false;
} }
log_verbose(LOG_DEBUG, "sleeping %i seconds (parameter \"monitor_interval_secs\")", log_verbose(LOG_DEBUG, "sleeping %i seconds (parameter \"monitor_interval_secs\")",
@@ -1194,36 +1230,18 @@ monitor_streaming_witness(void)
PQExpBufferData event_details; PQExpBufferData event_details;
RecordStatus record_status; RecordStatus record_status;
int primary_node_id = UNKNOWN_NODE_ID;
reset_node_voting_status(); reset_node_voting_status();
log_debug("monitor_streaming_witness()"); log_debug("monitor_streaming_witness()");
if (get_primary_node_record(local_conn, &upstream_node_info) == false) /*
{ * At this point we can't trust the local copy of "repmgr.nodes", as
PQExpBufferData event_details; * it may not have been updated. We'll scan the cluster for the current
* primary and refresh the copy from that before proceeding further.
initPQExpBuffer(&event_details); */
primary_conn = get_primary_connection_quiet(local_conn, &primary_node_id, NULL);
appendPQExpBuffer(&event_details,
_("unable to retrieve record for primary node"));
log_error("%s", event_details.data);
log_hint(_("execute \"repmgr witness register --force\" to update the witness node "));
close_connection(&local_conn);
create_event_notification(NULL,
&config_file_options,
config_file_options.node_id,
"repmgrd_shutdown",
false,
event_details.data);
termPQExpBuffer(&event_details);
terminate(ERR_BAD_CONFIG);
}
primary_conn = establish_db_connection(upstream_node_info.conninfo, false);
/* /*
* Primary node must be running at repmgrd startup. * Primary node must be running at repmgrd startup.
@@ -1248,7 +1266,7 @@ monitor_streaming_witness(void)
* refresh upstream node record from primary, so it's as up-to-date * refresh upstream node record from primary, so it's as up-to-date
* as possible * as possible
*/ */
record_status = get_node_record(primary_conn, upstream_node_info.node_id, &upstream_node_info); record_status = get_node_record(primary_conn, primary_node_id, &upstream_node_info);
/* /*
* This is unlikely to happen; if it does emit a warning for diagnostic * This is unlikely to happen; if it does emit a warning for diagnostic
@@ -1320,8 +1338,7 @@ monitor_streaming_witness(void)
true, true,
event_details.data); event_details.data);
close_connection(&primary_conn); try_reconnect(&primary_conn, &upstream_node_info);
primary_conn = try_reconnect(&upstream_node_info);
/* Node has recovered - log and continue */ /* Node has recovered - log and continue */
if (upstream_node_info.node_status == NODE_STATUS_UP) if (upstream_node_info.node_status == NODE_STATUS_UP)
@@ -1335,7 +1352,7 @@ monitor_streaming_witness(void)
upstream_node_unreachable_elapsed); upstream_node_unreachable_elapsed);
log_notice("%s", event_details.data); log_notice("%s", event_details.data);
create_event_notification(upstream_conn, create_event_notification(primary_conn,
&config_file_options, &config_file_options,
config_file_options.node_id, config_file_options.node_id,
"repmgrd_upstream_reconnect", "repmgrd_upstream_reconnect",
@@ -1458,6 +1475,105 @@ monitor_streaming_witness(void)
} }
loop: loop:
/*
* handle local node failure
*
* currently we'll just check the connection, and try to reconnect
*
* TODO: add timeout, after which we run in degraded state
*/
(void) connection_ping(local_conn);
check_connection(&local_node_info, &local_conn);
if (PQstatus(local_conn) != CONNECTION_OK)
{
if (local_node_info.active == true)
{
bool success = true;
PQExpBufferData event_details;
initPQExpBuffer(&event_details);
local_node_info.active = false;
appendPQExpBuffer(&event_details,
_("unable to connect to local node \"%s\" (ID: %i), marking inactive"),
local_node_info.node_name,
local_node_info.node_id);
log_notice("%s", event_details.data);
if (PQstatus(primary_conn) == CONNECTION_OK)
{
if (update_node_record_set_active(primary_conn, local_node_info.node_id, false) == false)
{
success = false;
log_warning(_("unable to mark node \"%s\" (ID: %i) as inactive"),
local_node_info.node_name,
local_node_info.node_id);
}
}
create_event_notification(primary_conn,
&config_file_options,
local_node_info.node_id,
"standby_failure",
success,
event_details.data);
termPQExpBuffer(&event_details);
}
}
else
{
/* we've reconnected to the local node after an outage */
if (local_node_info.active == false)
{
int stored_local_node_id = UNKNOWN_NODE_ID;
if (PQstatus(primary_conn) == CONNECTION_OK)
{
if (update_node_record_set_active(primary_conn, local_node_info.node_id, true) == true)
{
PQExpBufferData event_details;
initPQExpBuffer(&event_details);
local_node_info.active = true;
appendPQExpBuffer(&event_details,
_("reconnected to local node \"%s\" (ID: %i), marking active"),
local_node_info.node_name,
local_node_info.node_id);
log_notice("%s", event_details.data);
create_event_notification(primary_conn,
&config_file_options,
local_node_info.node_id,
"standby_recovery",
true,
event_details.data);
termPQExpBuffer(&event_details);
}
}
/*
* If the local node was restarted, we'll need to reinitialise values
* stored in shared memory.
*/
stored_local_node_id = repmgrd_get_local_node_id(local_conn);
if (stored_local_node_id == UNKNOWN_NODE_ID)
{
repmgrd_set_local_node_id(local_conn, config_file_options.node_id);
}
}
}
/* refresh repmgr.nodes after "witness_sync_interval" seconds */ /* refresh repmgr.nodes after "witness_sync_interval" seconds */
{ {
@@ -1501,28 +1617,10 @@ loop:
} }
if (got_SIGHUP) if (got_SIGHUP)
{ {
log_debug("SIGHUP received"); handle_sighup(&local_conn, WITNESS);
if (reload_config(&config_file_options))
{
close_connection(&local_conn);
local_conn = establish_db_connection(config_file_options.conninfo, true);
if (*config_file_options.log_file)
{
FILE *fd;
fd = freopen(config_file_options.log_file, "a", stderr);
if (fd == NULL)
{
fprintf(stderr, "error reopening stderr to \"%s\": %s",
config_file_options.log_file, strerror(errno));
}
}
}
got_SIGHUP = false;
} }
log_verbose(LOG_DEBUG, "sleeping %i seconds (parameter \"monitor_interval_secs\")", log_verbose(LOG_DEBUG, "sleeping %i seconds (parameter \"monitor_interval_secs\")",
@@ -1539,8 +1637,15 @@ loop:
static bool static bool
do_primary_failover(void) do_primary_failover(void)
{ {
ElectionResult election_result;
/*
* Double-check status of the local connection
*/
check_connection(&local_node_info, &local_conn);
/* attempt to initiate voting process */ /* attempt to initiate voting process */
ElectionResult election_result = do_election(); election_result = do_election();
/* TODO add pre-event notification here */ /* TODO add pre-event notification here */
failover_state = FAILOVER_STATE_UNKNOWN; failover_state = FAILOVER_STATE_UNKNOWN;
@@ -1761,12 +1866,21 @@ update_monitoring_history(void)
long long unsigned int replication_lag_bytes = 0; long long unsigned int replication_lag_bytes = 0;
/* both local and primary connections must be available */ /* both local and primary connections must be available */
if (PQstatus(primary_conn) != CONNECTION_OK || PQstatus(local_conn) != CONNECTION_OK) if (PQstatus(primary_conn) != CONNECTION_OK)
{
log_warning(_("primary connection is not available, unable to update monitoring history"));
return; return;
}
if (PQstatus(local_conn) != CONNECTION_OK)
{
log_warning(_("local connection is not available, unable to update monitoring history"));
return;
}
if (get_replication_info(local_conn, &replication_info) == false) if (get_replication_info(local_conn, &replication_info) == false)
{ {
log_warning(_("unable to retrieve replication status information")); log_warning(_("unable to retrieve replication status information, unable to update monitoring history"));
return; return;
} }
@@ -1818,8 +1932,7 @@ update_monitoring_history(void)
replication_lag_bytes = 0; replication_lag_bytes = 0;
} }
add_monitoring_record( add_monitoring_record(primary_conn,
primary_conn,
local_conn, local_conn,
primary_node_id, primary_node_id,
local_node_info.node_id, local_node_info.node_id,
@@ -1829,6 +1942,8 @@ update_monitoring_history(void)
replication_info.last_xact_replay_timestamp, replication_info.last_xact_replay_timestamp,
replication_lag_bytes, replication_lag_bytes,
apply_lag_bytes); apply_lag_bytes);
INSTR_TIME_SET_CURRENT(last_monitoring_update);
} }
@@ -1853,7 +1968,7 @@ do_upstream_standby_failover(void)
t_node_info primary_node_info = T_NODE_INFO_INITIALIZER; t_node_info primary_node_info = T_NODE_INFO_INITIALIZER;
RecordStatus record_status = RECORD_NOT_FOUND; RecordStatus record_status = RECORD_NOT_FOUND;
RecoveryType primary_type = RECTYPE_UNKNOWN; RecoveryType primary_type = RECTYPE_UNKNOWN;
int i, r; int i, standby_follow_result;
char parsed_follow_command[MAXPGPATH] = ""; char parsed_follow_command[MAXPGPATH] = "";
close_connection(&upstream_conn); close_connection(&upstream_conn);
@@ -1887,9 +2002,18 @@ do_upstream_standby_failover(void)
if (primary_type != RECTYPE_PRIMARY) if (primary_type != RECTYPE_PRIMARY)
{ {
log_error(_("last known primary\"%s\" (ID: %i) is in recovery, not following"), if (primary_type == RECTYPE_STANDBY)
primary_node_info.node_name, {
primary_node_info.node_id); log_error(_("last known primary \"%s\" (ID: %i) is in recovery, not following"),
primary_node_info.node_name,
primary_node_info.node_id);
}
else
{
log_error(_("unable to determine status of last known primary \"%s\" (ID: %i), not following"),
primary_node_info.node_name,
primary_node_info.node_id);
}
close_connection(&primary_conn); close_connection(&primary_conn);
monitoring_state = MS_DEGRADED; monitoring_state = MS_DEGRADED;
@@ -1900,8 +2024,6 @@ do_upstream_standby_failover(void)
/* Close the connection to this server */ /* Close the connection to this server */
close_connection(&local_conn); close_connection(&local_conn);
initPQExpBuffer(&event_details);
log_debug(_("standby follow command is:\n \"%s\""), log_debug(_("standby follow command is:\n \"%s\""),
config_file_options.follow_command); config_file_options.follow_command);
@@ -1911,10 +2033,12 @@ do_upstream_standby_failover(void)
*/ */
parse_follow_command(parsed_follow_command, config_file_options.follow_command, primary_node_info.node_id); parse_follow_command(parsed_follow_command, config_file_options.follow_command, primary_node_info.node_id);
r = system(parsed_follow_command); standby_follow_result = system(parsed_follow_command);
if (r != 0) if (standby_follow_result != 0)
{ {
initPQExpBuffer(&event_details);
appendPQExpBuffer(&event_details, appendPQExpBuffer(&event_details,
_("unable to execute follow command:\n %s"), _("unable to execute follow command:\n %s"),
config_file_options.follow_command); config_file_options.follow_command);
@@ -1925,8 +2049,7 @@ do_upstream_standby_failover(void)
* It may not be possible to write to the event notification table but we * It may not be possible to write to the event notification table but we
* should be able to generate an external notification if required. * should be able to generate an external notification if required.
*/ */
create_event_notification( create_event_notification(primary_conn,
primary_conn,
&config_file_options, &config_file_options,
local_node_info.node_id, local_node_info.node_id,
"repmgrd_failover_follow", "repmgrd_failover_follow",
@@ -1939,18 +2062,22 @@ do_upstream_standby_failover(void)
/* /*
* It's possible that the standby is still starting up after the "follow_command" * It's possible that the standby is still starting up after the "follow_command"
* completes, so poll for a while until we get a connection. * completes, so poll for a while until we get a connection.
*
* NOTE: we've previously closed the local connection, so even if the follow command
* failed for whatever reason and the local node remained up, we can re-open
* the local connection.
*/ */
for (i = 0; i < config_file_options.standby_reconnect_timeout; i++) for (i = 0; i < config_file_options.repmgrd_standby_startup_timeout; i++)
{ {
local_conn = establish_db_connection(local_node_info.conninfo, false); local_conn = establish_db_connection(local_node_info.conninfo, false);
if (PQstatus(local_conn) == CONNECTION_OK) if (PQstatus(local_conn) == CONNECTION_OK)
break; break;
log_debug("sleeping 1 second; %i of %i attempts to reconnect to local node", log_debug("sleeping 1 second; %i of %i (\"repmgrd_standby_startup_timeout\") attempts to reconnect to local node",
i + 1, i + 1,
config_file_options.standby_reconnect_timeout); config_file_options.repmgrd_standby_startup_timeout);
sleep(1); sleep(1);
} }
@@ -1964,28 +2091,47 @@ do_upstream_standby_failover(void)
/* refresh shared memory settings which will have been zapped by the restart */ /* refresh shared memory settings which will have been zapped by the restart */
repmgrd_set_local_node_id(local_conn, config_file_options.node_id); repmgrd_set_local_node_id(local_conn, config_file_options.node_id);
if (update_node_record_set_upstream(primary_conn, /*
local_node_info.node_id, *
primary_node_info.node_id) == false) */
if (standby_follow_result != 0)
{ {
appendPQExpBuffer(&event_details, monitoring_state = MS_DEGRADED;
_("unable to set node %i's new upstream ID to %i"), INSTR_TIME_SET_CURRENT(degraded_monitoring_start);
local_node_info.node_id,
primary_node_info.node_id);
log_error("%s", event_details.data); return FAILOVER_STATE_FOLLOW_FAIL;
}
create_event_notification( /*
NULL, * update upstream_node_id to primary node (but only if follow command
&config_file_options, * was successful)
local_node_info.node_id, */
"repmgrd_failover_follow",
false,
event_details.data);
termPQExpBuffer(&event_details); {
if (update_node_record_set_upstream(primary_conn,
local_node_info.node_id,
primary_node_info.node_id) == false)
{
initPQExpBuffer(&event_details);
appendPQExpBuffer(&event_details,
_("unable to set node %i's new upstream ID to %i"),
local_node_info.node_id,
primary_node_info.node_id);
terminate(ERR_BAD_CONFIG); log_error("%s", event_details.data);
create_event_notification(NULL,
&config_file_options,
local_node_info.node_id,
"repmgrd_failover_follow",
false,
event_details.data);
termPQExpBuffer(&event_details);
terminate(ERR_BAD_CONFIG);
}
} }
/* refresh own internal node record */ /* refresh own internal node record */
@@ -2001,6 +2147,8 @@ do_upstream_standby_failover(void)
local_node_info.upstream_node_id = primary_node_info.node_id; local_node_info.upstream_node_id = primary_node_info.node_id;
} }
initPQExpBuffer(&event_details);
appendPQExpBuffer(&event_details, appendPQExpBuffer(&event_details,
_("node %i is now following primary node %i"), _("node %i is now following primary node %i"),
local_node_info.node_id, local_node_info.node_id,
@@ -2008,8 +2156,7 @@ do_upstream_standby_failover(void)
log_notice("%s", event_details.data); log_notice("%s", event_details.data);
create_event_notification( create_event_notification(primary_conn,
primary_conn,
&config_file_options, &config_file_options,
local_node_info.node_id, local_node_info.node_id,
"repmgrd_failover_follow", "repmgrd_failover_follow",
@@ -2056,10 +2203,10 @@ promote_self(void)
return FAILOVER_STATE_PROMOTION_FAILED; return FAILOVER_STATE_PROMOTION_FAILED;
} }
/* the presence of either of this command has been established already */ /* the presence of this command has been established already */
promote_command = config_file_options.promote_command; promote_command = config_file_options.promote_command;
log_debug("promote command is:\n \"%s\"", log_info(_("promote_command is:\n \"%s\""),
promote_command); promote_command);
if (log_type == REPMGR_STDERR && *config_file_options.log_file) if (log_type == REPMGR_STDERR && *config_file_options.log_file)
@@ -2247,6 +2394,8 @@ follow_new_primary(int new_primary_id)
RecordStatus record_status = RECORD_NOT_FOUND; RecordStatus record_status = RECORD_NOT_FOUND;
bool new_primary_ok = false; bool new_primary_ok = false;
log_verbose(LOG_DEBUG, "follow_new_primary(): new primary id is %i", new_primary_id);
record_status = get_node_record(local_conn, new_primary_id, &new_primary); record_status = get_node_record(local_conn, new_primary_id, &new_primary);
if (record_status != RECORD_FOUND) if (record_status != RECORD_FOUND)
@@ -2391,7 +2540,7 @@ follow_new_primary(int new_primary_id)
* completes, so poll for a while until we get a connection. * completes, so poll for a while until we get a connection.
*/ */
for (i = 0; i < config_file_options.standby_reconnect_timeout; i++) for (i = 0; i < config_file_options.repmgrd_standby_startup_timeout; i++)
{ {
local_conn = establish_db_connection(local_node_info.conninfo, false); local_conn = establish_db_connection(local_node_info.conninfo, false);
@@ -2400,7 +2549,7 @@ follow_new_primary(int new_primary_id)
log_debug("sleeping 1 second; %i of %i attempts to reconnect to local node", log_debug("sleeping 1 second; %i of %i attempts to reconnect to local node",
i + 1, i + 1,
config_file_options.standby_reconnect_timeout); config_file_options.repmgrd_standby_startup_timeout);
sleep(1); sleep(1);
} }
@@ -2466,20 +2615,26 @@ witness_follow_new_primary(int new_primary_id)
{ {
RecoveryType primary_recovery_type = get_recovery_type(upstream_conn); RecoveryType primary_recovery_type = get_recovery_type(upstream_conn);
if (primary_recovery_type == RECTYPE_PRIMARY) switch (primary_recovery_type)
{ {
new_primary_ok = true; case RECTYPE_PRIMARY:
} new_primary_ok = true;
else break;
{ case RECTYPE_STANDBY:
new_primary_ok = false; new_primary_ok = false;
log_warning(_("new primary is in recovery")); log_warning(_("new primary is in recovery"));
close_connection(&upstream_conn); break;
case RECTYPE_UNKNOWN:
new_primary_ok = false;
log_warning(_("unable to determine status of new primary"));
break;
} }
} }
if (new_primary_ok == false) if (new_primary_ok == false)
{ {
close_connection(&upstream_conn);
return FAILOVER_STATE_FOLLOW_FAIL; return FAILOVER_STATE_FOLLOW_FAIL;
} }
@@ -2919,9 +3074,18 @@ check_connection(t_node_info *node_info, PGconn **conn)
} }
else else
{ {
int stored_local_node_id = UNKNOWN_NODE_ID;
log_info(_("reconnected to node \"%s\" (ID: %i)"), log_info(_("reconnected to node \"%s\" (ID: %i)"),
node_info->node_name, node_info->node_name,
node_info->node_id); node_info->node_id);
stored_local_node_id = repmgrd_get_local_node_id(*conn);
if (stored_local_node_id == UNKNOWN_NODE_ID)
{
repmgrd_set_local_node_id(*conn, config_file_options.node_id);
}
} }
} }
} }
@@ -2965,3 +3129,30 @@ format_failover_state(FailoverState failover_state)
} }
static void
handle_sighup(PGconn **conn, t_server_type server_type)
{
log_debug("SIGHUP received");
if (reload_config(&config_file_options, server_type))
{
PQfinish(*conn);
*conn = establish_db_connection(config_file_options.conninfo, true);
}
if (*config_file_options.log_file)
{
FILE *fd;
log_debug("reopening %s", config_file_options.log_file);
fd = freopen(config_file_options.log_file, "a", stderr);
if (fd == NULL)
{
fprintf(stderr, "error reopening stderr to \"%s\": %s",
config_file_options.log_file, strerror(errno));
}
}
got_SIGHUP = false;
}
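
The new handle_sighup() function moves the reload logic out of the main loops: the signal handler only sets a flag, and the loop reacts to it later. That pattern can be sketched as below. This is a minimal, standalone illustration, not repmgrd's code: `got_sighup_flag`, `on_sighup()`, `install_sighup_handler()` and `poll_sighup()` are hypothetical names, and the configuration reload and database reconnection steps are left out.

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Set asynchronously by the signal handler, polled by the main loop. */
static volatile sig_atomic_t got_sighup_flag = 0;

/* Hypothetical handler: it only sets the flag, which is async-signal-safe. */
static void
on_sighup(int signo)
{
	(void) signo;
	got_sighup_flag = 1;
}

/* Install the handler; returns true on success. */
static bool
install_sighup_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sighup;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;

	return sigaction(SIGHUP, &sa, NULL) == 0;
}

/*
 * Called once per main-loop iteration. If a SIGHUP has arrived, reopen
 * stderr in append mode on "log_path" (when one is configured) so that a
 * rotated log file is picked up, then clear the flag.
 *
 * Returns true if a pending SIGHUP was handled.
 */
static bool
poll_sighup(const char *log_path)
{
	if (!got_sighup_flag)
		return false;

	if (log_path != NULL && log_path[0] != '\0')
	{
		/*
		 * If freopen() fails, stderr is no longer usable, so the
		 * failure is deliberately not reported through it here.
		 */
		(void) freopen(log_path, "a", stderr);
	}

	got_sighup_flag = 0;
	return true;
}
```

A caller would run install_sighup_handler() once at startup and poll_sighup() from its monitoring loop, passing an empty string when no log file is configured.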

153
repmgrd.c

@@ -35,8 +35,10 @@
static char *config_file = NULL; static char *config_file = NULL;
static bool verbose = false; static bool verbose = false;
static char *pid_file = NULL; static char pid_file[MAXPGPATH];
static bool daemonize = false; static bool daemonize = true;
static bool show_pid_file = false;
static bool no_pid_file = false;
t_configuration_options config_file_options = T_CONFIGURATION_OPTIONS_INITIALIZER; t_configuration_options config_file_options = T_CONFIGURATION_OPTIONS_INITIALIZER;
@@ -99,8 +101,10 @@ main(int argc, char **argv)
{"config-file", required_argument, NULL, 'f'}, {"config-file", required_argument, NULL, 'f'},
/* daemon options */ /* daemon options */
{"daemonize", no_argument, NULL, 'd'}, {"daemonize", optional_argument, NULL, 'd'},
{"pid-file", required_argument, NULL, 'p'}, {"pid-file", required_argument, NULL, 'p'},
{"show-pid-file", no_argument, NULL, 's'},
{"no-pid-file", no_argument, NULL, OPT_NO_PID_FILE},
/* logging options */ /* logging options */
{"log-level", required_argument, NULL, 'L'}, {"log-level", required_argument, NULL, 'L'},
@@ -113,8 +117,6 @@ main(int argc, char **argv)
set_progname(argv[0]); set_progname(argv[0]);
srand(time(NULL));
/* Disallow running as root */ /* Disallow running as root */
if (geteuid() == 0) if (geteuid() == 0)
{ {
@@ -128,6 +130,10 @@ main(int argc, char **argv)
exit(1); exit(1);
} }
srand(time(NULL));
memset(pid_file, 0, MAXPGPATH);
while ((c = getopt_long(argc, argv, "?Vf:L:vdp:m", long_options, &optindex)) != -1) while ((c = getopt_long(argc, argv, "?Vf:L:vdp:m", long_options, &optindex)) != -1)
{ {
switch (c) switch (c)
@@ -169,11 +175,22 @@ main(int argc, char **argv)
/* daemon options */ /* daemon options */
case 'd': case 'd':
daemonize = true; if (optarg != NULL)
{
daemonize = parse_bool(optarg, "-d/--daemonize", &cli_errors);
}
break; break;
case 'p': case 'p':
pid_file = optarg; strncpy(pid_file, optarg, MAXPGPATH);
break;
case 's':
show_pid_file = true;
break;
case OPT_NO_PID_FILE:
no_pid_file = true;
break; break;
/* logging options */ /* logging options */
@@ -220,7 +237,7 @@ main(int argc, char **argv)
/* Exit here already if errors in command line options found */ /* Exit here already if errors in command line options found */
if (cli_errors.head != NULL) if (cli_errors.head != NULL)
{ {
exit_with_cli_errors(&cli_errors); exit_with_cli_errors(&cli_errors, NULL);
} }
startup_event_logged = false; startup_event_logged = false;
@@ -239,6 +256,58 @@ main(int argc, char **argv)
*/ */
load_config(config_file, verbose, false, &config_file_options, argv[0]); load_config(config_file, verbose, false, &config_file_options, argv[0]);
/* Determine pid file location, unless --no-pid-file supplied */
if (no_pid_file == false)
{
if (config_file_options.repmgrd_pid_file[0] != '\0')
{
if (pid_file[0] != '\0')
{
log_warning(_("\"repmgrd_pid_file\" will be overridden by --pid-file"));
}
else
{
strncpy(pid_file, config_file_options.repmgrd_pid_file, MAXPGPATH);
}
}
/* no pid file provided - determine location */
if (pid_file[0] == '\0')
{
/* packagers: if feasible, patch PID file path into "package_pid_file" */
char package_pid_file[MAXPGPATH] = "";
if (package_pid_file[0] != '\0')
{
maxpath_snprintf(pid_file, "%s", package_pid_file);
}
else
{
const char *tmpdir = getenv("TMPDIR");
if (!tmpdir)
tmpdir = "/tmp";
maxpath_snprintf(pid_file, "%s/repmgrd.pid", tmpdir);
}
}
}
else
{
/* --no-pid-file supplied - overwrite any value provided with --pid-file ... */
memset(pid_file, 0, MAXPGPATH);
}
/* If --show-pid-file supplied, output the location (if set) and exit */
if (show_pid_file == true)
{
printf("%s\n", pid_file);
exit(SUCCESS);
}
/* Some configuration file items can be overridden by command line options */ /* Some configuration file items can be overridden by command line options */
@@ -251,8 +320,6 @@ main(int argc, char **argv)
strncpy(config_file_options.log_level, cli_log_level, MAXLEN); strncpy(config_file_options.log_level, cli_log_level, MAXLEN);
} }
log_notice(_("repmgrd (repmgr %s) starting up"), REPMGR_VERSION);
/* /*
* -m/--monitoring-history, if provided, will override repmgr.conf's * -m/--monitoring-history, if provided, will override repmgr.conf's
* monitoring_history; this is for backwards compatibility as it's * monitoring_history; this is for backwards compatibility as it's
@@ -280,6 +347,8 @@ main(int argc, char **argv)
logger_init(&config_file_options, progname()); logger_init(&config_file_options, progname());
log_notice(_("repmgrd (%s %s) starting up"), progname(), REPMGR_VERSION);
if (verbose) if (verbose)
logger_set_verbose(); logger_set_verbose();
@@ -414,7 +483,7 @@ main(int argc, char **argv)
daemonize_process(); daemonize_process();
} }
if (pid_file != NULL) if (pid_file[0] != '\0')
{ {
check_and_create_pid_file(pid_file); check_and_create_pid_file(pid_file);
} }
@@ -669,6 +738,8 @@ show_help(void)
{ {
printf(_("%s: replication management daemon for PostgreSQL\n"), progname()); printf(_("%s: replication management daemon for PostgreSQL\n"), progname());
puts(""); puts("");
printf(_("%s monitors a cluster of servers and optionally performs failover.\n"), progname());
puts("");
printf(_("Usage:\n")); printf(_("Usage:\n"));
printf(_(" %s [OPTIONS]\n"), progname()); printf(_(" %s [OPTIONS]\n"), progname());
@@ -688,19 +759,21 @@ show_help(void)
puts(""); puts("");
printf(_("General configuration options:\n")); printf(_("Daemon configuration options:\n"));
printf(_(" -d, --daemonize detach process from foreground\n")); printf(_(" -d, --daemonize[=true/false]\n"));
printf(_(" -p, --pid-file=PATH write a PID file\n")); printf(_(" detach process from foreground (default: true)\n"));
printf(_(" -p, --pid-file=PATH use the specified PID file\n"));
printf(_(" -s, --show-pid-file show PID file which would be used by the current configuration\n"));
printf(_(" --no-pid-file don't write a PID file\n"));
puts(""); puts("");
printf(_("%s monitors a cluster of servers and optionally performs failover.\n"), progname());
} }
PGconn * void
try_reconnect(t_node_info *node_info) try_reconnect(PGconn **conn, t_node_info *node_info)
{ {
PGconn *conn; PGconn *our_conn;
t_conninfo_param_list conninfo_params = T_CONNINFO_PARAM_LIST_INITIALIZER; t_conninfo_param_list conninfo_params = T_CONNINFO_PARAM_LIST_INITIALIZER;
int i; int i;
@@ -709,7 +782,6 @@ try_reconnect(t_node_info *node_info)
initialize_conninfo_params(&conninfo_params, false); initialize_conninfo_params(&conninfo_params, false);
/* we assume by now the conninfo string is parseable */ /* we assume by now the conninfo string is parseable */
(void) parse_conninfo_string(node_info->conninfo, &conninfo_params, NULL, false); (void) parse_conninfo_string(node_info->conninfo, &conninfo_params, NULL, false);
@@ -732,18 +804,47 @@ try_reconnect(t_node_info *node_info)
* degraded monitoring? - make that configurable * degraded monitoring? - make that configurable
*/ */
conn = establish_db_connection_by_params(&conninfo_params, false); our_conn = establish_db_connection_by_params(&conninfo_params, false);
if (PQstatus(conn) == CONNECTION_OK) if (PQstatus(our_conn) == CONNECTION_OK)
{ {
free_conninfo_params(&conninfo_params); free_conninfo_params(&conninfo_params);
log_info(_("connection to node %i succeeded"), node_info->node_id);
if (PQstatus(*conn) == CONNECTION_BAD)
{
log_verbose(LOG_INFO, "original connection handle returned CONNECTION_BAD, using new connection");
close_connection(conn);
*conn = our_conn;
}
else
{
ExecStatusType ping_result;
ping_result = connection_ping(*conn);
if (ping_result != PGRES_TUPLES_OK)
{
log_info("original connection no longer available, using new connection");
close_connection(conn);
*conn = our_conn;
}
else
{
log_info(_("original connection is still available"));
PQfinish(our_conn);
}
}
node_info->node_status = NODE_STATUS_UP; node_info->node_status = NODE_STATUS_UP;
return conn;
return;
} }
close_connection(&conn); close_connection(&our_conn);
log_notice(_("unable to reconnect to node")); log_notice(_("unable to reconnect to node %i"), node_info->node_id);
} }
if (i + 1 < max_attempts) if (i + 1 < max_attempts)
@@ -762,7 +863,7 @@ try_reconnect(t_node_info *node_info)
free_conninfo_params(&conninfo_params); free_conninfo_params(&conninfo_params);
return NULL; return;
} }
@@ -802,7 +903,7 @@ terminate(int retval)
{ {
logger_shutdown(); logger_shutdown();
if (pid_file) if (pid_file[0] != '\0')
{ {
unlink(pid_file); unlink(pid_file);
} }
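
The PID file handling added to main() amounts to a precedence order: --no-pid-file wins outright, then --pid-file, then the "repmgrd_pid_file" configuration item, and finally a default under $TMPDIR (or /tmp). A minimal sketch of that ordering follows, assuming a hypothetical helper `resolve_pid_file()` and a stand-in constant `PID_PATH_MAX` for MAXPGPATH. The packager-patched "package_pid_file" step is omitted, and the environment lookup is passed in as a parameter so the function stays free of side effects.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PID_PATH_MAX 1024		/* stand-in for MAXPGPATH */

/* True if "s" points to a non-empty string. */
static bool
is_set(const char *s)
{
	return s != NULL && strlen(s) > 0;
}

/*
 * Write the PID file path to use into "out" (at most "out_len" bytes,
 * always NUL-terminated; "out_len" must be greater than zero). "out" is
 * left empty if no PID file should be written.
 *
 * Precedence: no_pid_file > cli_pid_file > conf_pid_file > tmpdir default.
 */
static void
resolve_pid_file(char *out, size_t out_len,
				 bool no_pid_file,
				 const char *cli_pid_file,
				 const char *conf_pid_file,
				 const char *tmpdir)
{
	out[0] = '\0';

	if (no_pid_file)
		return;

	if (is_set(cli_pid_file))
		snprintf(out, out_len, "%s", cli_pid_file);
	else if (is_set(conf_pid_file))
		snprintf(out, out_len, "%s", conf_pid_file);
	else
		snprintf(out, out_len, "%s/repmgrd.pid",
				 tmpdir != NULL ? tmpdir : "/tmp");
}
```

Using snprintf() rather than strncpy() guarantees NUL termination even when the supplied path is as long as the buffer.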


@@ -10,6 +10,8 @@
#include <time.h> #include <time.h>
#include "portability/instr_time.h" #include "portability/instr_time.h"
#define OPT_NO_PID_FILE 1000
extern volatile sig_atomic_t got_SIGHUP; extern volatile sig_atomic_t got_SIGHUP;
extern MonitoringState monitoring_state; extern MonitoringState monitoring_state;
extern instr_time degraded_monitoring_start; extern instr_time degraded_monitoring_start;
@@ -19,11 +21,13 @@ extern t_node_info local_node_info;
extern PGconn *local_conn; extern PGconn *local_conn;
extern bool startup_event_logged; extern bool startup_event_logged;
PGconn *try_reconnect(t_node_info *node_info); void try_reconnect(PGconn **conn, t_node_info *node_info);
int calculate_elapsed(instr_time start_time); int calculate_elapsed(instr_time start_time);
const char *print_monitoring_state(MonitoringState monitoring_state); const char *print_monitoring_state(MonitoringState monitoring_state);
void update_registration(PGconn *conn); void update_registration(PGconn *conn);
void terminate(int retval); void terminate(int retval);
#endif /* _REPMGRD_H_ */ #endif /* _REPMGRD_H_ */
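
The signature change to try_reconnect() (returning void and taking a PGconn ** instead of returning a new PGconn *) lets the function decide whether the caller's existing handle is still good, and replace it only when it is not. The sketch below illustrates that swap-in-place pattern without libpq. `mock_conn`, `mock_connect()`, `mock_close()` and `reconnect_in_place()` are hypothetical stand-ins, and the single `ok` flag replaces both the PQstatus() check and the ping used in the real code.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a PGconn handle. */
typedef struct mock_conn
{
	bool		ok;				/* would be PQstatus() == CONNECTION_OK */
	int			id;				/* distinguishes handles in this sketch */
} mock_conn;

/* Open a new mock connection; returns NULL on allocation failure. */
static mock_conn *
mock_connect(int id)
{
	mock_conn  *conn = malloc(sizeof(mock_conn));

	if (conn != NULL)
	{
		conn->ok = true;
		conn->id = id;
	}
	return conn;
}

/* Close a connection and reset the caller's pointer to NULL. */
static void
mock_close(mock_conn **conn)
{
	free(*conn);
	*conn = NULL;
}

/*
 * Open a fresh connection to prove the node is reachable. If the caller's
 * original handle is missing or broken, replace it with the fresh one;
 * otherwise keep the original and discard the fresh connection.
 *
 * Returns true if the node was reachable.
 */
static bool
reconnect_in_place(mock_conn **conn, int fresh_id)
{
	mock_conn  *fresh = mock_connect(fresh_id);

	if (fresh == NULL)
		return false;

	if (*conn == NULL || !(*conn)->ok)
	{
		mock_close(conn);
		*conn = fresh;
	}
	else
	{
		mock_close(&fresh);
	}

	return true;
}
```

Because the handle is updated through the pointer, every caller sees a valid connection afterwards, and a handle that is still working is not closed.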