Commit Graph

297 Commits

Author SHA1 Message Date
Ian Barwick
177b84345d Fix debug logging
Per GitHub #630.
2020-04-20 11:08:35 +09:00
Ian Barwick
599bab590a Create temporary pg.auto.conf file with the same permissions as the original
Commit 0574279 set the file permissions to 0600 rather than the user's
umask, but if initdb was executed with -g/--allow-group-access, the
file is maintained with 0640, so we'll just maintain the existing
permssions.
2020-04-07 13:29:59 +09:00
Ian Barwick
cd80f265ac standby clone: warn about missing pg_rewind prerequisites
These are not essential for cloning a standby, but useful to warn
as early as possible in case the user is intending to use pg_rewind.
2020-04-06 15:37:37 +09:00
Ian Barwick
f3258c5002 cluster cleanup: explicitly log vacuum operation 2020-03-26 11:38:51 +09:00
Ian Barwick
2e9bc31c8c Consolidate code for establishing a superuser connection 2020-03-25 11:02:31 +09:00
Ian Barwick
fb5ce720f3 standby promote: fall back to "pg_ctl promote" if necessary
From PostgreSQL 12, the SQL-level function "pg_promote()" can be used
to promote a PostgreSQL instance, however usage is restricted to
superusers and users to whom explicit execution permission for this
function has been granted.

Therefore, if execution permission is not available, fall back to
"pg_ctl promote".
2020-03-06 12:53:37 +09:00
Ian Barwick
7c96afc6fb Move user information database functions into their own section
Code reorganisation for easier location of functions.
2020-03-06 12:53:34 +09:00
Ian Barwick
9de31428f1 Consolidate replication connection code
In a few places, replication connections are generated from the
parameters used by existing connections. This has resulted in a
number of similar blocks of code which do more-or-less the same
thing almost but not quite identically. In two cases, the code
omitted to set "dbname=replication", which can cause problems
in some contexts.

These code blocks have now been consolidated into standardized
functions.

This also resolves the issue addressed by GitHub #619.
2020-03-05 17:21:37 +09:00
Ian Barwick
76af2d9e08 cluster show: don't display witness node timeline ID
The witness node is not part of the replication cluster, so its timeline
ID is not of any relevance.
2020-02-25 10:33:54 +09:00
Ian Barwick
cd7f36a6fd Add general check function "check_replication_slots_available()"
Make the code previously only used by "standby follow" generally
available - we'll want to use this from "node rejoin" as well.

While we're at it, when reporting failure due to lack of free
replication slots, report the current value of "max_replication_slots".
2020-02-03 16:43:55 +09:00
Ian Barwick
4d4ed3bcd6 Remove BDR 2.x support
The BDR 2.x support was conceptual only and was never used in
production. As BDR 2.x will be EOL'd shortly, there is no risk it will
be needed.
2020-01-16 09:52:42 +09:00
Ian Barwick
7fdf2f1778 Update copyright notices to 2020 2020-01-13 14:06:20 +09:00
Ian Barwick
220ec7fc96 Minimize user permissions requirements for replication slots
Enable operations which create or drop replication slots to be carried
out with the minimum necessary user permissions, i.e. a user with the
REPLICATION attribute.

This can be the repmgr user, or a dedicated replication user.
In the latter case, if the dedicated replication user is only
permitted to make replication connections, the streaming
replication protocol is used to create/drop slots.

Implements part of GitHub #536.
2019-10-30 15:51:15 +09:00
Ian Barwick
f0693271d3 Clean up replication slot creation/deletion functions 2019-10-25 14:31:11 +09:00
Ian Barwick
45b9002e5b Modify function "is_replication_role()" to reference "pg_roles"
"pg_authid" is restricted to superusers.
2019-10-24 16:57:50 +09:00
Ian Barwick
cc540a54e5 Add functions for slot creation via the streaming replication protocol 2019-10-24 10:05:32 +09:00
Ian Barwick
9083f26990 Clarify usage of log_db_error() function 2019-10-24 10:04:15 +09:00
Ian Barwick
63ddc2d39e Let _establish_db_connection() work with replication connections
Previously this was limited to establish_db_connection_by_params().
2019-10-23 14:45:44 +09:00
Ian Barwick
dc11330d58 Rename replication slot create/drop functions
Append "_sql" to the respective function names, as we'll later be
creating equivalent functions which use the replication protocol
so need a way to distinguish between them.
2019-10-23 13:43:09 +09:00
Ian Barwick
be494f0d5f standby clone: minimize requirement to check upstream data directory location
repmgr has always insisted on determining the upstream's data directory
location, which requires superuser permissions (or from PostgreSQL 10,
membership of the default role "pg_read_all_settings").

Knowledge of the data directory location was required to implement rsync
cloning (now deprecated), but with pg_basebackup the minimum permission
requirement is now only a normal user with access to the repmgr metadata
and a user with replication permissions. The ability to determine the
data directory location is only required if the user specifies the
--copy-external-config-files option, which needs to be able to determine
the data directory to work out which configuration files are located
outside it.

This patch makes it possible to clone a standby with minimum
permissions, with appropriate checks for available permissions if
--copy-external-config-files is provided.

Implements part of GitHub #536 and addresses issue raised in #586.
2019-10-23 10:46:40 +09:00
Ian Barwick
52abe309df Add function is_replication_role() 2019-10-17 17:13:18 +09:00
Ian Barwick
507b27c05d Mark set_repmgrd_pid() as "RETURNS NULL ON NULL INPUT"
When unsetting the PID, we'll want to set the pidfile to NULL rather
than an empty string.
2019-08-19 20:15:30 +09:00
Ian Barwick
28f4536372 repmgrd: fix pidfile handling at shutdown 2019-08-19 17:55:18 +09:00
Ian Barwick
3df65d0eb3 Simplify pg_has_role() call
Specifying CURRENT_USER is superfluous here.
2019-08-07 14:43:56 +09:00
Ian Barwick
38b373e6df "node check": check role membership when trying to read pg_settings
From PostgreSQL 10, a member of the default roles "pg_monitor" and/or
"pg_read_all_settings" can read pg_settings without requiring superuser
privileges.

Previously, a hint was being emitted about making the repmgr user a
member of one of those groups, but no check for membership was being
made, meaning the check could only be run by a superuser.
2019-08-07 14:26:48 +09:00
Ian Barwick
10870503d1 Add missing field in init_replication_info()
"upstream_node_id" was not being initialised.
2019-08-06 21:24:37 +09:00
Ian Barwick
6aca764d5e Fix extension version number query 2019-06-06 12:46:12 +09:00
Ian Barwick
c153e2fc02 standby clone: improve --dry-run output
Log positive check results as an additional confirmation that the
upstream configuration appears to be correct.
2019-05-28 00:54:39 +09:00
Ian Barwick
c560dfbbce cluster show: display timeline ID
This helps provide a better picture of the state of the cluster, i.e.
making it more obvious whether there's been a timeline divergence.

This also provides infrastructure for further improvements in cluster
status display and diagnosis.

Note this is only available in PostgreSQL 9.6 and later as it relies
on the SQL functions for interrogating pg_control, which can be executed
remotely. As PostgreSQL 9.5 will shortly be the only community-supported
version without these functions, it's not worth the effort of trying
to duplicate their functionality.
2019-05-27 09:39:19 +09:00
Ian Barwick
c9e85996f5 repmgr: prevent a standby being cloned from a witness server
Previously repmgr would happily clone from whatever server
it found at the provided source server address. We should
ensure that a standby can only be cloned from a node which
is part of the main replication cluster.

This check fetches a list of nodes from the source server,
connects to the first non-witness server it finds, and
compares the system identifiers of the source node and the
node it has connected to. If there is a mismatch, then the
source server is clearly not part of the main replication
cluster, and is most likely the witness server.
2019-05-22 16:52:25 +09:00
Ian Barwick
dd78a16006 Change return type of is_downstream_node_attached() from bool to NodeAttached
This enables us to better determine whether a node is definitively
attached, definitively not attached, or if it was not possible to
determine the attached state.
2019-05-14 15:57:20 +09:00
Ian Barwick
89a7261483 Always quote node names in log messages 2019-04-30 15:52:56 +09:00
Ian Barwick
9fe2fa2daf daemon status: make output more like that of "cluster show"
In particular make any issues with unexpected server state more
obvious.
2019-04-25 14:45:41 +09:00
Ian Barwick
5a90513878 repmgrd: monitor standbys attached to primary
This functionality enables repmgrd (when running on the primary) to
monitor connected child nodes. It will log connections and disconnections
and generate events.

Additionally, repmgrd can execute a custom script if the number of connected
child nodes falls below a configurable threshold. This script can be used
e.g. to "fence" the primary following a failover situation where a new primary
has been promoted and all standbys are now child nodes of that primary.
2019-04-22 16:18:52 +09:00
Ian Barwick
27803f93ff repmgrd: always unset upstream node ID when monitoring a primary 2019-04-12 12:26:39 +09:00
Ian Barwick
cd6a55c7cb repmgrd: improve primary visibility consensus check
Exclude sibling nodes which report they're following a different
node. This shouldn't happen, but could.
2019-04-11 15:46:14 +09:00
Ian Barwick
008bd00a59 repmgrd: store upstream node ID in shared memory 2019-04-11 15:46:09 +09:00
Ian Barwick
dd454a8374 Miscellaneous string handling cleanup
This is mainly to prevent effectively spurious truncation warnings
in recent GCC versions.
2019-04-10 16:18:56 +09:00
Ian Barwick
a564f365c1 Fix default return value in alter_system_int() 2019-04-01 14:50:19 +09:00
Ian Barwick
799ac6d453 Add is_server_available_quiet()
For use in cases where the caller collates node availability information
and doesn't want to prematurely emit log output.
2019-04-01 12:27:30 +09:00
Ian Barwick
57c0ccd477 Improve copying of strings from database results
Where feasible, specify the maximum string length via sizeof(), and
use snprintf() in place of strncpy().
2019-04-01 11:19:58 +09:00
Ian Barwick
ece20f4831 Cast "int" to "long long" 2019-03-28 11:02:25 +09:00
Ian Barwick
ba1f05ece9 Restrict "node_name" to maximum 63 characters
In "recovery.conf", the configuration parameter "node_name" is used
as the "application_name" value, which will be truncated by PostgreSQL
to 63 characters (NAMEDATALEN - 1).

repmgr sometimes needs to be able to extract the application name from
pg_stat_replication to determine if a node is connected (e.g. when
executing "repmgr standby register"), so the comparison will fail
if "node_name" exceeds 63 characters.
2019-03-28 10:37:57 +09:00
Ian Barwick
e9ece34aeb log_db_error(): fix formatted message handling 2019-03-27 11:00:31 +09:00
Ian Barwick
539861cb58 repmgrd: during failover, check if a node was already promoted
Previously, repmgrd assumed that during a failover, there would not
already be another primary node. However it's possible a node was
promoted manually. While this is not a desirable situation, it's
conceivable this could happen in the wild, so we should check for
it and react accordingly.

Also sanity-check that the follow target can actually be followed.

Addresses issue raised in GitHub #420.
2019-03-22 14:06:41 +09:00
Ian Barwick
314a1e8f4f use a constant to denote unknown replication lag 2019-03-20 17:26:04 +09:00
Ian Barwick
b84d98fe81 Explictly log PQping() failures 2019-03-20 11:47:32 +09:00
Ian Barwick
46efe57cd0 Improve database connection failure logging
Log the output of PQerrorStatus() in a couple of places where it was missing.

Additionally, always log the output of PQerrorStatus() starting with a blank
line, otherwise the first line looks like it was emitted by repmgr, and
it's harder to scan the error message.

Before:

    [2019-03-20 11:24:15] [DETAIL] could not connect to server: Connection refused
            Is the server running on host "localhost" (::1) and accepting
            TCP/IP connections on port 5501?
    could not connect to server: Connection refused
            Is the server running on host "localhost" (127.0.0.1) and accepting
            TCP/IP connections on port 5501?

After:

    [2019-03-20 11:27:21] [DETAIL]
    could not connect to server: Connection refused
            Is the server running on host "localhost" (::1) and accepting
            TCP/IP connections on port 5501?
    could not connect to server: Connection refused
            Is the server running on host "localhost" (127.0.0.1) and accepting
            TCP/IP connections on port 5501?
2019-03-20 11:47:28 +09:00
Ian Barwick
39df55c39c Check node recovery type before attempting to write an event record
In some corner cases (e.g. immediately after a switchover) where
the current primary has not yet been determined, the provided connection
might not be writeable. This prevents error messages such as
"cannot execute INSERT in a read-only transaction" generating unnecessary
noise in the logs.
2019-03-18 15:26:16 +09:00
Ian Barwick
f54ff85cfa Remove outdated comment
This was only relevant for repmgr3 and earlier; in repmgr4 the schema
is hard-coded.
2019-03-18 15:19:11 +09:00