repmgrd overview

repmgrd ("replication manager daemon") is a management and monitoring daemon which runs on each node in a replication cluster. It can automate actions such as failover and updating standbys to follow the new primary, as well as provide monitoring information about the state of each standby.

repmgrd demonstration

To demonstrate automatic failover, set up a 3-node replication cluster (one primary and two standbys streaming directly from the primary) so that the cluster looks something like this:

    $ repmgr -f /etc/repmgr.conf cluster show
     ID | Name  | Role    | Status    | Upstream | Location | Connection string
    ----+-------+---------+-----------+----------+----------+--------------------------------------
     1  | node1 | primary | * running |          | default  | host=node1 dbname=repmgr user=repmgr
     2  | node2 | standby |   running | node1    | default  | host=node2 dbname=repmgr user=repmgr
     3  | node3 | standby |   running | node1    | default  | host=node3 dbname=repmgr user=repmgr

See the section "Required configuration for automatic failover" for an example of minimal repmgr.conf settings suitable for use with repmgrd.
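As a rough reference point, a minimal repmgr.conf suitable for automatic failover looks along these lines. This is a sketch only: the node identity, hostname, and data directory are illustrative and differ per node, and the failure-detection values shown here match this demonstration rather than the defaults.

```ini
# /etc/repmgr.conf -- illustrative minimal settings for node1;
# node_id, node_name and conninfo must differ on each node.
node_id=1
node_name='node1'
conninfo='host=node1 dbname=repmgr user=repmgr'
data_directory='/var/lib/postgresql/data'

# enable automatic failover handling by repmgrd
failover='automatic'
promote_command='repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'

# failure-detection tuning (illustrative values: 5 reconnection
# attempts, 1 second apart, as seen in this demonstration's logs)
monitor_interval_secs=2
reconnect_attempts=5
reconnect_interval=1
```

`%n` in follow_command is replaced at execution time with the node ID of the new primary.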
Start repmgrd on the primary and each standby, and verify that it's running by examining the log output, which at log level INFO will look something like this:

    [2017-08-24 17:31:00] [NOTICE] using configuration file "/etc/repmgr.conf"
    [2017-08-24 17:31:00] [INFO] connecting to database "host=node2 dbname=repmgr user=repmgr"
    [2017-08-24 17:31:00] [NOTICE] starting monitoring of node node2 (ID: 2)
    [2017-08-24 17:31:00] [INFO] monitoring connection to upstream node "node1" (node ID: 1)

Each repmgrd should also have recorded its successful startup as an event:

    $ repmgr -f /etc/repmgr.conf cluster event --event=repmgrd_start
     Node ID | Name  | Event         | OK | Timestamp           | Details
    ---------+-------+---------------+----+---------------------+-------------------------------------------------------------
     3       | node3 | repmgrd_start | t  | 2017-08-24 17:35:54 | monitoring connection to upstream node "node1" (node ID: 1)
     2       | node2 | repmgrd_start | t  | 2017-08-24 17:35:50 | monitoring connection to upstream node "node1" (node ID: 1)
     1       | node1 | repmgrd_start | t  | 2017-08-24 17:35:46 | monitoring cluster primary "node1" (node ID: 1)

Now stop the current primary server, e.g. with:

    pg_ctl -D /var/lib/postgresql/data -m immediate stop

This forces the primary to shut down immediately, aborting all processes and transactions. A flurry of activity will follow in the repmgrd log files as each repmgrd detects the failure of the primary and a failover decision is made. The following is an extract from the log of a standby server (node2) which has promoted itself to the new primary after the failure of the original primary (node1):
    [2017-08-24 23:32:01] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
    [2017-08-24 23:32:08] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
    [2017-08-24 23:32:08] [INFO] checking state of node 1, 1 of 5 attempts
    [2017-08-24 23:32:08] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-08-24 23:32:09] [INFO] checking state of node 1, 2 of 5 attempts
    [2017-08-24 23:32:09] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-08-24 23:32:10] [INFO] checking state of node 1, 3 of 5 attempts
    [2017-08-24 23:32:10] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-08-24 23:32:11] [INFO] checking state of node 1, 4 of 5 attempts
    [2017-08-24 23:32:11] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-08-24 23:32:12] [INFO] checking state of node 1, 5 of 5 attempts
    [2017-08-24 23:32:12] [WARNING] unable to reconnect to node 1 after 5 attempts
    INFO: setting voting term to 1
    INFO: node 2 is candidate
    INFO: node 3 has received request from node 2 for electoral term 1 (our term: 0)
    [2017-08-24 23:32:12] [NOTICE] this node is the winner, will now promote self and inform other nodes
    INFO: connecting to standby database
    NOTICE: promoting standby
    DETAIL: promoting server using 'pg_ctl -l /var/log/postgres/startup.log -w -D '/var/lib/pgsql/data' promote'
    INFO: reconnecting to promoted server
    NOTICE: STANDBY PROMOTE successful
    DETAIL: node 2 was successfully promoted to primary
    INFO: node 3 received notification to follow node 2
    [2017-08-24 23:32:13] [INFO] switching to primary monitoring mode

The cluster status will now look like this, with the original primary (node1) marked as inactive, and standby node3 now following the new primary (node2):

    $ repmgr -f /etc/repmgr.conf cluster show
     ID | Name  | Role    | Status    | Upstream | Location | Connection string
    ----+-------+---------+-----------+----------+----------+--------------------------------------
     1  | node1 | primary | - failed  |          | default  | host=node1 dbname=repmgr user=repmgr
     2  | node2 | primary | * running |          | default  | host=node2 dbname=repmgr user=repmgr
     3  | node3 | standby |   running | node2    | default  | host=node3 dbname=repmgr user=repmgr

repmgr cluster event will display a summary of what happened to each server during the failover:

    $ repmgr -f /etc/repmgr.conf cluster event
     Node ID | Name  | Event                    | OK | Timestamp           | Details
    ---------+-------+--------------------------+----+---------------------+------------------------------------------------------------
     3       | node3 | repmgrd_failover_follow  | t  | 2017-08-24 23:32:16 | node 3 now following new upstream node 2
     3       | node3 | standby_follow           | t  | 2017-08-24 23:32:16 | node 3 is now attached to node 2
     2       | node2 | repmgrd_failover_promote | t  | 2017-08-24 23:32:13 | node 2 promoted to primary; old primary 1 marked as failed
     2       | node2 | standby_promote          | t  | 2017-08-24 23:32:13 | node 2 was successfully promoted to primary
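Besides recording events for later inspection, repmgr can also push each event to an external command via event_notification_command in repmgr.conf, which is useful for alerting on failovers like the one above. A hedged sketch (the script path is hypothetical; the %-placeholders are repmgr's documented substitutions):

```ini
# repmgr.conf -- optional event notification hook (script path is hypothetical)
# %n = node ID, %e = event type, %s = success ('t'/'f'),
# %t = timestamp, %d = details
event_notification_command='/usr/local/bin/repmgr-event.sh %n %e %s "%t" "%d"'

# optionally restrict which events trigger the command
event_notifications='repmgrd_start,repmgrd_failover_promote,repmgrd_failover_follow'
```

If event_notifications is unset, the command is invoked for every event.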