mirror of
https://github.com/EnterpriseDB/repmgr.git
synced 2026-03-22 22:56:29 +00:00
Add a quick setup for autofailover
@@ -23,7 +23,7 @@ databases as a single cluster. repmgr includes two components:
 Supported Releases
 ------------------
 
-repmgr works with PostgreSQL versions 9.0 and 9.1.
+repmgr works with PostgreSQL versions 9.0 and superior.
 
 There are currently no incompatibilities when upgrading repmgr from 9.0 to 9.1,
 so your 9.0 configuration will work with 9.1
@@ -389,7 +389,7 @@ walkthrough assumes the following setup:
 
 * Another standby server called "node3" with a similar configuration to "node2".
 
-* The Postgress installation in each of the above is defined as $PGDATA,
+* The Postgres installation in each of the above is defined as $PGDATA,
   which is represented here as ``/var/lib/pgsql/9.0/data``
 
 Creating some sample data
@@ -514,12 +514,14 @@ following the standard directory structure of a RHEL system. It should contain:

cluster=test
node=1
node_name=earth
conninfo='host=node1 user=repmgr dbname=pgbench'

On "node2" create the file ``/var/lib/pgsql/repmgr/repmgr.conf`` with::

cluster=test
node=2
node_name=mars
conninfo='host=node2 user=repmgr dbname=pgbench'

The STANDBY CLONE process should have created a recovery.conf file on
@@ -712,12 +714,14 @@ and it should contain::

cluster=test
node=1
node_name=earth
conninfo='host=127.0.0.1 dbname=testdb'

On "standby" create the file ``/home/standby/repmgr/repmgr.conf`` with::

cluster=test
node=2
node_name=mars
conninfo='host=127.0.0.1 dbname=testdb'

Next, with "prime" server running, we want to use the ``clone standby`` command
@@ -1133,4 +1137,3 @@ Jaime Casanova
Simon Riggs
Greg Smith
Cedric Villemain
209
autofailover_quick_setup.rst
Normal file
@@ -0,0 +1,209 @@

=====================================================
PostgreSQL Automatic Fail-Over - User Documentation
=====================================================

Automatic Failover
==================

repmgr can be set up for automatic failover: when it detects the failure of the
master node, a standby is promoted. The following is a quick setup for this.

Installation
============

For convenience, we define:

* node1 is the fully qualified hostname of the Master server, IP 192.168.1.10
* node2 is the fully qualified hostname of the Standby server, IP 192.168.1.11
* witness is the fully qualified hostname of the server used as witness, IP 192.168.1.12

:Note: It is not recommended to use a name that describes a server's status,
   such as "masterserver": such a name leads to confusion once a failover takes
   place and the Master is now on the "standbyserver".

Summary
-------

Two PostgreSQL servers are involved in the replication. Automatic fail-over
needs an odd number of voters to decide which server should be promoted, so a
witness repmgrd is installed on a third server, where it uses a PostgreSQL
cluster to communicate with the other repmgrd daemons.

1. Install PostgreSQL on all the servers involved (including the witness server)
2. Install repmgr on all the servers involved (including the witness server)
3. Configure the Master PostgreSQL
4. Clone the Master to the Standby using the "repmgr standby clone" command
5. Configure repmgr on all the servers involved (including the witness server)
6. Register the Master and Standby nodes
7. Initialize the witness server
8. Start the repmgrd daemons on all nodes

:Note: A complete high-availability design needs at least 3 servers in order to
   still have a backup node after a first failure.

Install PostgreSQL
------------------

You can install PostgreSQL using any of the recommended methods. You should
ensure it is version 9.0 or later.

Install repmgr
--------------

Install repmgr following the steps in the README.

Configure PostgreSQL
--------------------

Log in to node1.

Edit the file postgresql.conf and modify the parameters::

  listen_addresses = '*'
  wal_level = 'hot_standby'
  archive_mode = on
  archive_command = 'cd .'   # or 'exit 0': anything that just does nothing
  max_wal_senders = 10
  wal_keep_segments = 5000   # 80 GB required on pg_xlog
  hot_standby = on
  shared_preload_libraries = 'repmgr_funcs'

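The edit can also be scripted; a minimal sketch, assuming you run it from the data directory (the function name is illustrative):

```shell
# A sketch: collect the settings above in one place and append them to
# postgresql.conf (run from the data directory; adjust the path as needed).
replication_settings() {
  cat <<'EOF'
listen_addresses = '*'
wal_level = 'hot_standby'
archive_mode = on
archive_command = 'cd .'
max_wal_senders = 10
wal_keep_segments = 5000
hot_standby = on
shared_preload_libraries = 'repmgr_funcs'
EOF
}
replication_settings >> postgresql.conf
```

Note that wal_level, archive_mode and shared_preload_libraries only take effect after a full restart, which this walkthrough performs below.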
Edit the file pg_hba.conf and add lines for the replication::

  host   repmgr        repmgr   127.0.0.1/32      trust
  host   repmgr        repmgr   192.168.1.10/30   trust
  host   replication   all      192.168.1.10/30   trust

:Note: It is also possible to use password authentication (md5); in that case
   the .pgpass file should be edited to allow connections between the nodes.

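For an md5 setup, the .pgpass entries could look like the following sketch (hostnames are from this walkthrough; "secret" is a placeholder password, not a real value):

```shell
# Hypothetical ~/.pgpass entries for the repmgr user; 'secret' is a
# placeholder password. Format: hostname:port:database:username:password
PGPASSFILE=${PGPASSFILE:-$HOME/.pgpass}
cat > "$PGPASSFILE" <<'EOF'
node1:5432:*:repmgr:secret
node2:5432:*:repmgr:secret
EOF
chmod 600 "$PGPASSFILE"   # libpq ignores the file unless it is mode 0600
```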
Create the user and database to manage replication::

  su - postgres
  createuser -s repmgr
  createdb -O repmgr repmgr
  psql -f /usr/share/postgresql/9.0/contrib/repmgr_funcs.sql repmgr

Restart the PostgreSQL server::

  pg_ctl -D $PGDATA restart

Then check that everything is fine in the server log.

Create the ssh key for the postgres user and copy it to the other servers::

  su - postgres
  ssh-keygen            # /!\ do not use a passphrase /!\
  cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
  chmod 600 ~/.ssh/authorized_keys
  exit
  rsync -avz ~postgres/.ssh/authorized_keys node2:~postgres/.ssh/
  rsync -avz ~postgres/.ssh/authorized_keys witness:~postgres/.ssh/
  rsync -avz ~postgres/.ssh/id_rsa* node2:~postgres/.ssh/
  rsync -avz ~postgres/.ssh/id_rsa* witness:~postgres/.ssh/

Clone Master
------------

Log in to node2.

Clone node1 (the current Master)::

  su - postgres
  repmgr -d repmgr -U repmgr standby clone node1

Start the PostgreSQL server::

  pg_ctl -D $PGDATA start

Then check that everything is fine in the server log.

Configure repmgr
----------------

Log in to each server and configure repmgr by editing the file
/etc/repmgr/repmgr.conf::

  cluster=my_cluster
  node=1
  node_name=earth
  conninfo='host=192.168.1.10 dbname=repmgr user=repmgr'
  master_response_timeout=60
  failover=automatic
  promote_command='promote_command.sh'
  follow_command='repmgr standby follow -f /etc/repmgr/repmgr.conf'

* *cluster* is the name of the replication cluster.
* *node* is the number of the current node (1, 2 or 3 in this example).
* *node_name* is an identifier for each node.
* *conninfo* is used to connect to the local PostgreSQL server (where the
  configuration file is) from any node. In the witness server configuration a
  'port=5499' must be added to the conninfo.
* *master_response_timeout* is the maximum amount of time to wait before
  deciding that the master has died and starting the failover procedure.
* *failover* configures the behavior: *manual* or *automatic*.
* *promote_command* is the command executed to perform the failover (including
  the PostgreSQL failover itself). The command must return 0 on success.
* *follow_command* is the command executed to point the current standby at the
  new Master. The command must return 0 on success.

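promote_command.sh is referenced above but its contents are not shown; a minimal sketch of what such a script could contain (the script body is an assumption, not part of repmgr; repmgrd only requires that the command exit 0 on success):

```shell
# Write a minimal promote_command.sh; written to the current directory
# here for illustration, place it somewhere on repmgrd's PATH on a real
# node. The script body is a sketch, not part of repmgr itself.
cat > promote_command.sh <<'EOF'
#!/bin/sh
# Promote the local standby; add notification or fencing steps as needed.
exec repmgr -f /etc/repmgr/repmgr.conf standby promote
EOF
chmod +x promote_command.sh
```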
Register Master and Standby
---------------------------

Log in to node1.

Register the node as Master::

  su - postgres
  repmgr -f /etc/repmgr/repmgr.conf master register

Log in to node2.

Register the node as Standby::

  su - postgres
  repmgr -f /etc/repmgr/repmgr.conf standby register

Initialize witness server
-------------------------

Log in to witness.

Initialize the witness server::

  su - postgres
  repmgr -d repmgr -U repmgr -h 192.168.1.10 -D $WITNESS_PGDATA -f /etc/repmgr/repmgr.conf witness create node1

The command needs connection information for the master in order to copy the
cluster configuration, and it needs to know where to initialize its own
$PGDATA. As part of the process it also asks for the superuser password so
that it can connect when needed.

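For reference, the witness node's repmgr.conf could look like the following sketch (node number, node_name and address are assumptions that follow this walkthrough; port=5499 matches the conninfo note above):

```shell
# Sketch of a witness repmgr.conf, written to the current directory for
# illustration; on a real witness it would live in /etc/repmgr/.
cat > repmgr.conf <<'EOF'
cluster=my_cluster
node=3
node_name=witness
conninfo='host=192.168.1.12 port=5499 dbname=repmgr user=repmgr'
master_response_timeout=60
EOF
```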
Start the repmgrd daemons
-------------------------

Log in to node2 and witness, and start the daemon::

  su - postgres
  repmgrd -f /etc/repmgr/repmgr.conf > /var/log/postgresql/repmgr.log 2>&1

:Note: The Master does not need a repmgrd daemon.


Suspend Automatic behavior
==========================

Edit the repmgr.conf of the node you want to remove from automatic processing
and change::

  failover=manual

Then, signal the repmgrd daemon::

  su - postgres
  kill -HUP `pidof repmgrd`
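The edit itself can be scripted; a sketch, demonstrated here on a sample file in the current directory (on a real node, run the sed against /etc/repmgr/repmgr.conf instead):

```shell
# Demonstrate the failover=manual edit on a sample repmgr.conf in the
# current directory; the printf only creates sample input for the demo.
printf 'cluster=my_cluster\nfailover=automatic\n' > repmgr.conf
sed 's/^failover=automatic$/failover=manual/' repmgr.conf > repmgr.conf.new
mv repmgr.conf.new repmgr.conf
```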
TODO: -HUP configuration update is not implemented yet; on SIGHUP the daemon
should check its configuration file against its configuration in the DB and
update the SQL configuration accordingly (especially the manual or automatic
failover setting). This would allow witness-standby and standby-not-promotable
features and simpler usage of the tool.
Usage
=====

The repmgr documentation is in the README file (how to build, options, etc.)