repmgr/doc/repmgrd-overview.sgml

<chapter id="repmgrd-overview" xreflabel="repmgrd overview">
  <indexterm>
    <primary>repmgrd</primary>
    <secondary>overview</secondary>
  </indexterm>

  <title>repmgrd overview</title>

  <para>
    <application>repmgrd</application> (&quot;<literal>replication manager daemon</literal>&quot;)
    is a management and monitoring daemon which runs
    on each node in a replication cluster. It can automate actions such as
    failover and updating standbys to follow the new primary, as well as
    providing monitoring information about the state of each standby.
  </para>

  <sect1 id="repmgrd-demonstration">

    <title>repmgrd demonstration</title>
    <para>
  To demonstrate automatic failover, set up a 3-node replication cluster (one primary
  and two standbys streaming directly from the primary) so that the cluster looks
  something like this:
  <programlisting>
    $ repmgr -f /etc/repmgr.conf cluster show
     ID | Name  | Role    | Status    | Upstream | Location | Connection string
    ----+-------+---------+-----------+----------+----------+--------------------------------------
     1  | node1 | primary | * running |          | default  | host=node1 dbname=repmgr user=repmgr
     2  | node2 | standby |   running | node1    | default  | host=node2 dbname=repmgr user=repmgr
     3  | node3 | standby |   running | node1    | default  | host=node3 dbname=repmgr user=repmgr</programlisting>
 </para>

 <tip>
   <para>
     See section <link linkend="repmgrd-automatic-failover-configuration">Required configuration for automatic failover</link>
     for an example of minimal <filename>repmgr.conf</filename> file settings suitable for use with <application>repmgrd</application>.
   </para>
 </tip>
 <para>
  Start <application>repmgrd</application> on each standby and verify that it's running by examining the
  log output, which at log level <literal>INFO</literal> will look like this:
  <programlisting>
    [2017-08-24 17:31:00] [NOTICE] using configuration file "/etc/repmgr.conf"
    [2017-08-24 17:31:00] [INFO] connecting to database "host=node2 dbname=repmgr user=repmgr"
    [2017-08-24 17:31:00] [NOTICE] starting monitoring of node <literal>node2</literal> (ID: 2)
    [2017-08-24 17:31:00] [INFO] monitoring connection to upstream node "node1" (node ID: 1)</programlisting>
 </para>
 <para>
  Each <application>repmgrd</application> should also have recorded its successful startup as an event:
  <programlisting>
    $ repmgr -f /etc/repmgr.conf cluster event --event=repmgrd_start
     Node ID | Name  | Event         | OK | Timestamp           | Details
    ---------+-------+---------------+----+---------------------+-------------------------------------------------------------
     3       | node3 | repmgrd_start | t  | 2017-08-24 17:35:54 | monitoring connection to upstream node "node1" (node ID: 1)
     2       | node2 | repmgrd_start | t  | 2017-08-24 17:35:50 | monitoring connection to upstream node "node1" (node ID: 1)
     1       | node1 | repmgrd_start | t  | 2017-08-24 17:35:46 | monitoring cluster primary "node1" (node ID: 1)  </programlisting>
 </para>
 <para>
  Now stop the current primary server with e.g.:
  <programlisting>
    pg_ctl -D /var/lib/postgresql/data -m immediate stop</programlisting>
 </para>
 <para>
  This will force the primary to shut down straight away, aborting all processes
  and transactions.  This will cause a flurry of activity in the <application>repmgrd</application> log
  files as each <application>repmgrd</application> detects the failure of the primary and a failover
  decision is made. This is an extract from the log of a standby server (<literal>node2</literal>)
  which has promoted to new primary after failure of the original primary (<literal>node1</literal>).
  <programlisting>
    [2017-08-24 23:32:01] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
    [2017-08-24 23:32:08] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
    [2017-08-24 23:32:08] [INFO] checking state of node 1, 1 of 5 attempts
    [2017-08-24 23:32:08] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-08-24 23:32:09] [INFO] checking state of node 1, 2 of 5 attempts
    [2017-08-24 23:32:09] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-08-24 23:32:10] [INFO] checking state of node 1, 3 of 5 attempts
    [2017-08-24 23:32:10] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-08-24 23:32:11] [INFO] checking state of node 1, 4 of 5 attempts
    [2017-08-24 23:32:11] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-08-24 23:32:12] [INFO] checking state of node 1, 5 of 5 attempts
    [2017-08-24 23:32:12] [WARNING] unable to reconnect to node 1 after 5 attempts
    INFO:  setting voting term to 1
    INFO:  node 2 is candidate
    INFO:  node 3 has received request from node 2 for electoral term 1 (our term: 0)
    [2017-08-24 23:32:12] [NOTICE] this node is the winner, will now promote self and inform other nodes
    INFO: connecting to standby database
    NOTICE: promoting standby
    DETAIL: promoting server using 'pg_ctl -l /var/log/postgres/startup.log -w -D '/var/lib/pgsql/data' promote'
    INFO: reconnecting to promoted server
    NOTICE: STANDBY PROMOTE successful
    DETAIL: node 2 was successfully promoted to primary
    INFO:  node 3 received notification to follow node 2
    [2017-08-24 23:32:13] [INFO] switching to primary monitoring mode</programlisting>
 </para>
 <para>
  The cluster status will now look like this, with the original primary (<literal>node1</literal>)
  marked as inactive, and standby <literal>node3</literal> now following the new primary
  (<literal>node2</literal>):
  <programlisting>
    $ repmgr -f /etc/repmgr.conf cluster show
     ID | Name  | Role    | Status    | Upstream | Location | Connection string
    ----+-------+---------+-----------+----------+----------+----------------------------------------------------
     1  | node1 | primary | - failed  |          | default  | host=node1 dbname=repmgr user=repmgr
     2  | node2 | primary | * running |          | default  | host=node2 dbname=repmgr user=repmgr
     3  | node3 | standby |   running | node2    | default  | host=node3 dbname=repmgr user=repmgr</programlisting>

 </para>
 <para>
  <command>repmgr cluster event</command> will display a summary of what happened to each server
  during the failover:
  <programlisting>
    $ repmgr -f /etc/repmgr.conf cluster event
     Node ID | Name  | Event                    | OK | Timestamp           | Details
    ---------+-------+--------------------------+----+---------------------+-----------------------------------------------------------------------------------
     3       | node3 | repmgrd_failover_follow  | t  | 2017-08-24 23:32:16 | node 3 now following new upstream node 2
     3       | node3 | standby_follow           | t  | 2017-08-24 23:32:16 | node 3 is now attached to node 2
     2       | node2 | repmgrd_failover_promote | t  | 2017-08-24 23:32:13 | node 2 promoted to primary; old primary 1 marked as failed
     2       | node2 | standby_promote          | t  | 2017-08-24 23:32:13 | node 2 was successfully promoted to primary</programlisting>
 </para>

  </sect1>
</chapter>