repmgrdautomatic failoverAutomatic failover with repmgrdrepmgrd is a management and monitoring daemon which runs
on each node in a replication cluster. It can automate actions such as
failover and updating standbys to follow the new primary, as well as
providing monitoring information about the state of each standby.
repmgrdcascading replicationcascading replicationrepmgrdrepmgrd and cascading replication
Cascading replication - where a standby can connect to an upstream node and not
the primary server itself - was introduced in PostgreSQL 9.2. &repmgr; and
repmgrd support cascading replication by keeping track of the relationship
between standby servers - each node record is stored with the node id of its
upstream ("parent") server (except of course the primary server).
In a failover situation where the primary node fails and a top-level standby
is promoted, a standby connected to another standby will not be affected
and continue working as normal (even if the upstream standby it's connected
to becomes the primary node). If however the node's direct upstream fails,
the "cascaded standby" will attempt to reconnect to that node's parent
(unless failover is set to manual in
repmgr.conf).
repmgrdnetwork splitsnetwork splitsHandling network splits with repmgrd
A common pattern for replication cluster setups is to spread servers over
more than one datacentre. This can provide benefits such as geographically-
distributed read replicas and DR (disaster recovery capability). However
this also means there is a risk of disconnection at network level between
datacentre locations, which would result in a split-brain scenario if
servers in a secondary data centre were no longer able to see the primary
in the main data centre and promoted a standby among themselves.
&repmgr; enables provision of "" to
artificially create a quorum of servers in a particular location, ensuring
that nodes in another location will not elect a new primary if they
are unable to see the majority of nodes. However this approach does not
scale well, particularly with more complex replication setups, e.g.
where the majority of nodes are located outside of the primary datacentre.
It also means the witness node needs to be managed as an
extra PostgreSQL instance outside of the main replication cluster, which
adds administrative and programming complexity.
repmgr4 introduces the concept of location:
each node is associated with an arbitrary location string (default is
default); this is set in repmgr.conf, e.g.:
node_id=1
node_name=node1
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/data'
location='dc1'
In a failover situation, repmgrd will check if any servers in the
same location as the current primary node are visible. If not, repmgrd
will assume a network interruption and not promote any node in any
other location (it will however enter degraded monitoring
mode until a primary becomes visible).