From 87ea7850cab3597407f211c83429682272bccdf6 Mon Sep 17 00:00:00 2001
From: Ian Barwick <barwick@gmail.com>
Date: Thu, 5 Oct 2017 15:21:22 +0900
Subject: [PATCH] More updates

---
 doc/event-notifications.sgml  | 181 ++++++++++++++++++++++++++++++++++
 doc/filelist.sgml             |   4 +-
 doc/repmgr-cluster-event.sgml |  37 +++++++
 doc/repmgr.sgml               |   3 +
 doc/repmgrd-monitoring.sgml   |  71 +++++++++++++
 5 files changed, 295 insertions(+), 1 deletion(-)
 create mode 100644 doc/event-notifications.sgml
 create mode 100644 doc/repmgr-cluster-event.sgml
 create mode 100644 doc/repmgrd-monitoring.sgml
diff --git a/doc/event-notifications.sgml b/doc/event-notifications.sgml
new file mode 100644
index 00000000..46b327d3
--- /dev/null
+++ b/doc/event-notifications.sgml
@@ -0,0 +1,181 @@
+<chapter id="event-notifications" xreflabel="event notifications">
+ <title>Event Notifications</title>
+ <para>
+  Each time <command>repmgr</command> or <command>repmgrd</command> performs
+  a significant action, a record of that event is written to the
+  <literal>repmgr.events</literal> table together with a timestamp, an
+  indication of failure or success, and further details if appropriate.
+  This is useful for gaining an overview of events affecting the replication
+  cluster. Note, however, that this table is advisory in nature and should be used
+  together with the <command>repmgr</command> and PostgreSQL logs to obtain details of any events.
+ </para>
+ <para>
+  Example output after a primary was registered and a standby cloned
+  and registered:
+  <programlisting>
+    repmgr=# SELECT * from repmgr.events ;
+     node_id |      event       | successful |        event_timestamp        |                                       details
+    ---------+------------------+------------+-------------------------------+-------------------------------------------------------------------------------------
+           1 | primary_register | t          | 2016-01-08 15:04:39.781733+09 |
+           2 | standby_clone    | t          | 2016-01-08 15:04:49.530001+09 | Cloned from host 'repmgr_node1', port 5432; backup method: pg_basebackup; --force: N
+           2 | standby_register | t          | 2016-01-08 15:04:50.621292+09 |
+    (3 rows)</programlisting>
+ </para>
+ <para>
+  Alternatively, use <xref linkend="repmgr-cluster-event"> to output a
+  formatted list of events.
+ </para>
+ <para>
+  Additionally, event notifications can be passed to a user-defined program
+  or script which can take further action, e.g. send email notifications.
+  This is done by setting the <varname>event_notification_command</varname> parameter in
+  <filename>repmgr.conf</filename>.
+ </para>
+ <para>
+  This parameter accepts the following format placeholders:
+ </para>
+
+ <variablelist>
+  <varlistentry>
+   <term><option>%n</option></term>
+   <listitem>
+    <para>
+      node ID
+    </para>
+   </listitem>
+  </varlistentry>
+
+  <varlistentry>
+   <term><option>%e</option></term>
+   <listitem>
+    <para>
+     event type
+    </para>
+   </listitem>
+  </varlistentry>
+
+  <varlistentry>
+   <term><option>%s</option></term>
+   <listitem>
+    <para>
+     success (1 or 0)
+    </para>
+   </listitem>
+  </varlistentry>
+  <varlistentry>
+   <term><option>%t</option></term>
+   <listitem>
+    <para>
+     timestamp
+    </para>
+   </listitem>
+  </varlistentry>
+
+  <varlistentry>
+   <term><option>%d</option></term>
+   <listitem>
+    <para>
+     details
+    </para>
+   </listitem>
+  </varlistentry>
+ </variablelist>
+ <para>
+  The values provided for <literal>%t</literal> and <literal>%d</literal>
+  will usually contain spaces, so they should be quoted in the command
+  configuration, e.g.:
+  <programlisting>
+    event_notification_command='/path/to/some/script %n %e %s "%t" "%d"'
+  </programlisting>
+ </para>
+ <para>
+  Additionally, the following format placeholders are available for the event
+  type <varname>bdr_failover</varname> and, optionally, <varname>bdr_recovery</varname>:
+ </para>
+ <variablelist>
+  <varlistentry>
+   <term><option>%c</option></term>
+   <listitem>
+    <para>
+     conninfo string of the next available node
+    </para>
+   </listitem>
+  </varlistentry>
+  <varlistentry>
+   <term><option>%a</option></term>
+   <listitem>
+    <para>
+     name of the next available node
+    </para>
+   </listitem>
+  </varlistentry>
+ </variablelist>
+ <para>
+  These should always be quoted.
+ </para>
+ <para>
+  By default, all notification types are passed to the designated script;
+  the <varname>event_notifications</varname> parameter can restrict them to explicitly named ones:
+  <itemizedlist spacing="compact" mark="bullet">
+
+   <listitem>
+    <simpara><literal>primary_register</literal></simpara>
+   </listitem>
+   <listitem>
+    <simpara><literal>standby_register</literal></simpara>
+   </listitem>
+   <listitem>
+    <simpara><literal>standby_unregister</literal></simpara>
+   </listitem>
+   <listitem>
+    <simpara><literal>standby_clone</literal></simpara>
+   </listitem>
+   <listitem>
+    <simpara><literal>standby_promote</literal></simpara>
+   </listitem>
+   <listitem>
+    <simpara><literal>standby_follow</literal></simpara>
+   </listitem>
+   <listitem>
+    <simpara><literal>standby_disconnect_manual</literal></simpara>
+   </listitem>
+   <listitem>
+    <simpara><literal>repmgrd_start</literal></simpara>
+   </listitem>
+   <listitem>
+    <simpara><literal>repmgrd_shutdown</literal></simpara>
+   </listitem>
+   <listitem>
+    <simpara><literal>repmgrd_failover_promote</literal></simpara>
+   </listitem>
+   <listitem>
+    <simpara><literal>repmgrd_failover_follow</literal></simpara>
+   </listitem>
+   <listitem>
+    <simpara><literal>bdr_failover</literal></simpara>
+   </listitem>
+   <listitem>
+    <simpara><literal>bdr_reconnect</literal></simpara>
+   </listitem>
+   <listitem>
+    <simpara><literal>bdr_recovery</literal></simpara>
+   </listitem>
+   <listitem>
+    <simpara><literal>bdr_register</literal></simpara>
+   </listitem>
+   <listitem>
+    <simpara><literal>bdr_unregister</literal></simpara>
+   </listitem>
+
+  </itemizedlist>
+ </para>
+ <para>
+  Note that under some circumstances (e.g. when no replication cluster primary
+  could be located), it will not be possible to write an entry into the
+  <literal>repmgr.events</literal>
+  table, in which case executing a script via <varname>event_notification_command</varname>
+  can serve as a fallback by generating some form of notification.
+ </para>
+
+
+</chapter>
diff --git a/doc/filelist.sgml b/doc/filelist.sgml
index 940cf68b..246b7504 100644
--- a/doc/filelist.sgml
+++ b/doc/filelist.sgml
@@ -43,10 +43,12 @@
 <!ENTITY promoting-standby  SYSTEM "promoting-standby.sgml">
 <!ENTITY follow-new-primary  SYSTEM "follow-new-primary.sgml">
 <!ENTITY switchover  SYSTEM "switchover.sgml">
+<!ENTITY event-notifications  SYSTEM "event-notifications.sgml">
 
 <!ENTITY repmgrd-automatic-failover SYSTEM "repmgrd-automatic-failover.sgml">
 <!ENTITY repmgrd-configuration SYSTEM "repmgrd-configuration.sgml">
 <!ENTITY repmgrd-demonstration SYSTEM "repmgrd-demonstration.sgml">
+<!ENTITY repmgrd-monitoring SYSTEM "repmgrd-monitoring.sgml">
 
 <!ENTITY repmgr-primary-register SYSTEM "repmgr-primary-register.sgml">
 <!ENTITY repmgr-primary-unregister SYSTEM "repmgr-primary-unregister.sgml">
@@ -62,9 +64,9 @@
 <!ENTITY repmgr-cluster-show SYSTEM "repmgr-cluster-show.sgml">
 <!ENTITY repmgr-cluster-matrix SYSTEM "repmgr-cluster-matrix.sgml">
 <!ENTITY repmgr-cluster-crosscheck SYSTEM "repmgr-cluster-crosscheck.sgml">
+<!ENTITY repmgr-cluster-event SYSTEM "repmgr-cluster-event.sgml">
 <!ENTITY repmgr-cluster-cleanup SYSTEM "repmgr-cluster-cleanup.sgml">
 
-
 <!ENTITY appendix-signatures      SYSTEM "appendix-signatures.sgml">
 
 <!ENTITY bookindex  SYSTEM "bookindex.sgml">
diff --git a/doc/repmgr-cluster-event.sgml b/doc/repmgr-cluster-event.sgml
new file mode 100644
index 00000000..f1f24fb7
--- /dev/null
+++ b/doc/repmgr-cluster-event.sgml
@@ -0,0 +1,37 @@
+<chapter id="repmgr-cluster-event" xreflabel="repmgr cluster event">
+ <indexterm>
+  <primary>repmgr cluster event</primary>
+ </indexterm>
+ <title>repmgr cluster event</title>
+ <para>
+  This outputs a formatted list of cluster events, as stored in the
+  <literal>repmgr.events</literal> table. Output is in reverse chronological order, and
+  can be filtered with the following options:
+ <itemizedlist spacing="compact" mark="bullet">
+  <listitem>
+    <simpara><literal>--all</literal>: outputs all entries</simpara>
+  </listitem>
+  <listitem>
+    <simpara><literal>--limit</literal>: set the maximum number of entries to output (default: 20)</simpara>
+  </listitem>
+  <listitem>
+    <simpara><literal>--node-id</literal>: restrict entries to node with this ID</simpara>
+  </listitem>
+  <listitem>
+    <simpara><literal>--node-name</literal>: restrict entries to node with this name</simpara>
+  </listitem>
+  <listitem>
+    <simpara><literal>--event</literal>: restrict entries to this event type</simpara>
+  </listitem>
+ </itemizedlist>
+ </para>
+ <para>
+  Example:
+  <programlisting>
+    $ repmgr -f /etc/repmgr.conf cluster event --event=standby_register
+     Node ID | Name  | Event            | OK | Timestamp           | Details
+    ---------+-------+------------------+----+---------------------+--------------------------------
+     3       | node3 | standby_register | t  | 2017-08-17 10:28:55 | standby registration succeeded
+     2       | node2 | standby_register | t  | 2017-08-17 10:28:53 | standby registration succeeded</programlisting>
+ </para>
+</chapter>
diff --git a/doc/repmgr.sgml b/doc/repmgr.sgml
index 11bb2b5d..475f42f6 100644
--- a/doc/repmgr.sgml
+++ b/doc/repmgr.sgml
@@ -73,6 +73,7 @@
   &promoting-standby;
   &follow-new-primary;
   &switchover;
+  &event-notifications;
  </part>
 
  <part id="using-repmgrd">
@@ -80,6 +81,7 @@
   &repmgrd-automatic-failover;
   &repmgrd-configuration;
   &repmgrd-demonstration;
+  &repmgrd-monitoring;
  </part>
 
  <part id="repmgr-command-reference">
@@ -99,6 +101,7 @@
   &repmgr-cluster-show;
   &repmgr-cluster-matrix;
   &repmgr-cluster-crosscheck;
+  &repmgr-cluster-event;
   &repmgr-cluster-cleanup;
  </part>
 
diff --git a/doc/repmgrd-monitoring.sgml b/doc/repmgrd-monitoring.sgml
new file mode 100644
index 00000000..7daaac0a
--- /dev/null
+++ b/doc/repmgrd-monitoring.sgml
@@ -0,0 +1,71 @@
+<chapter id="repmgrd-monitoring">
+ <title>Monitoring with repmgrd</title>
+ <para>
+  When <command>repmgrd</command> is running with the option <literal>monitoring_history=true</literal>,
+  it will continuously write standby node status information to the
+  <varname>monitoring_history</varname> table, providing a near-real-time
+  overview of replication status on all nodes
+  in the cluster.
+ </para>
+ <para>
+   The view <literal>replication_status</literal> shows the most recent state
+   for each node, e.g.:
+  <programlisting>
+    repmgr=# select * from repmgr.replication_status;
+    -[ RECORD 1 ]-------------+------------------------------
+    primary_node_id           | 1
+    standby_node_id           | 2
+    standby_name              | node2
+    node_type                 | standby
+    active                    | t
+    last_monitor_time         | 2017-08-24 16:28:41.260478+09
+    last_wal_primary_location | 0/6D57A00
+    last_wal_standby_location | 0/5000000
+    replication_lag           | 29 MB
+    replication_time_lag      | 00:00:11.736163
+    apply_lag                 | 15 MB
+    communication_time_lag    | 00:00:01.365643</programlisting>
+ </para>
+ <para>
+  The interval at which monitoring history is written is controlled by the
+  configuration parameter <varname>monitor_interval_secs</varname>;
+  the default is 2 seconds.
+ </para>
+ <para>
+  As this can generate a large amount of monitoring data in the table
+  <literal>repmgr.monitoring_history</literal>, it's advisable to regularly
+  purge historical data using the <xref linkend="repmgr-cluster-cleanup">
+  command; use the <literal>-k/--keep-history</literal> option to
+  specify how many days' worth of data should be retained.
+ </para>
+ <para>
+  It's possible to run <command>repmgrd</command> in monitoring
+  mode only (without automatic failover capability) for some or all
+  nodes by setting <literal>failover=manual</literal> in the node's
+  <filename>repmgr.conf</filename> file. In the event of the node's upstream failing,
+  no failover action will be taken and the node will require manual intervention to
+  be reattached to replication. If this occurs, an
+  <link linkend="event-notifications">event notification</link>
+  <varname>standby_disconnect_manual</varname> will be created.
+ </para>
+ <para>
+  Note that when a standby node is not streaming directly from its upstream
+  node, e.g. when recovering WAL from an archive, <varname>apply_lag</varname> will always appear as
+  <literal>0 bytes</literal>.
+ </para>
+ <tip>
+  <para>
+   If monitoring history is enabled, the contents of the <literal>repmgr.monitoring_history</literal>
+   table will be replicated to attached standbys. This means there will be a small but
+   constant stream of replication activity which may not be desirable. To prevent
+   this, convert the table to an <literal>UNLOGGED</literal> one with:
+   <programlisting>
+     ALTER TABLE repmgr.monitoring_history SET UNLOGGED;</programlisting>
+  </para>
+  <para>
+   This will however mean that monitoring history will not be available on
+   another node following a failover, and the view <literal>repmgr.replication_status</literal>
+   will not work on standbys.
+  </para>
+ </tip>
+</chapter>