Update README with standard two-host examples

Greg Smith
2011-02-23 08:42:06 -05:00
parent 9c6288993b
commit 5dcec5818f


@@ -250,7 +250,7 @@ on their partner node without a password.
First generate an ssh key, using an empty passphrase, and copy the resulting
keys and a matching authorization file to a privileged user on the other system::
[postgres@node1]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/var/lib/pgsql/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
@@ -259,19 +259,19 @@ keys and a maching authorization file to a privledged user on the other system::
Your public key has been saved in /var/lib/pgsql/.ssh/id_rsa.pub.
The key fingerprint is:
aa:bb:cc:dd:ee:ff:aa:11:22:33:44:55:66:77:88:99 postgres@db1.domain.com
[postgres@node1]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[postgres@node1]$ chmod go-rwx ~/.ssh/*
[postgres@node1]$ cd ~/.ssh
[postgres@node1]$ scp id_rsa.pub id_rsa authorized_keys postgres@node2:
Log in as a user on the other system, and install the files into the postgres
user's account::
[user@node2 ~]$ sudo chown postgres.postgres authorized_keys id_rsa.pub id_rsa
[user@node2 ~]$ sudo mkdir -p ~postgres/.ssh
[user@node2 ~]$ sudo chown postgres.postgres ~postgres/.ssh
[user@node2 ~]$ sudo mv authorized_keys id_rsa.pub id_rsa ~postgres/.ssh
[user@node2 ~]$ sudo chmod -R go-rwx ~postgres/.ssh
Now test that ssh in both directions works. You may have to accept some new
known hosts in the process.
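A minimal check of this, run from each node in turn, might look like::
[postgres@node1]$ ssh postgres@node2 true
[postgres@node2]$ ssh postgres@node1 true
If these commands return without prompting for a password, the key setup is
working.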
@@ -381,13 +381,13 @@ its port if is different from the default one.
* master register
* Registers a master in a cluster; it needs to be executed before any
standby nodes are registered
* standby register
* Registers a standby in a cluster; it needs to be executed before
repmgrd will function on the node.
* standby clone [node to be cloned]
@@ -432,8 +432,8 @@ its port if is different from the default one.
./repmgr standby follow
Brief examples
==============
Suppose we have 3 nodes: node1 (the initial master), node2 and node3
@@ -464,8 +464,8 @@ this::
PATH=$PGDIR/bin:$PATH repmgr standby promote
repmgrd Daemon
==============
Command line syntax
-------------------
@@ -507,7 +507,7 @@ Lag monitoring
To look at the current lag between primary and each node listed
in ``repl_node``, consult the ``repl_status`` view::
psql -d postgres -c "SELECT * FROM repl_status"
psql -d postgres -c "SELECT * FROM repmgr_test.repl_status"
This view shows the latest monitor info from every node.
@@ -539,52 +539,260 @@ and the same in the standby.
The repmgr daemon creates 2 connections: one to the master and another to the
standby.
Detailed walkthrough
====================
This assumes you've already followed the steps in "Installation Outline" to
install repmgr and repmgrd on the system.
A production installation of ``repmgr`` will normally involve two
different systems running on the same port, typically the default of 5432,
with both using files owned by the ``postgres`` user account. This
walkthrough assumes the following setup:
* A primary (master) server called "node1," running as the "postgres" user
who is also the owner of the files. This server is operating on port 5432. This
server will be known as "node1" in the cluster "test".
* A secondary (standby) server called "node2," running as the "postgres" user
who is also the owner of the files. This server is operating on port 5432. This
server will be known as "node2" in the cluster "test".
* Another standby server called "node3" with a similar configuration to "node2".
* The PostgreSQL installation in each of the above is defined as $PGDATA,
which is represented here as ``/var/lib/pgsql/9.0/data``
Creating some sample data
-------------------------
If you already have a database with useful data to replicate, you can
skip this step and use it instead. But if you do not already have
data in this cluster to replicate, you can create some like this::
createdb pgbench
pgbench -i -s 10 pgbench
Examples below will use the database name ``pgbench`` to match this.
Substitute the name of your database instead. Note that the standby
nodes created here will include information for every database in the
cluster, not just the specified one. The database name is needed mainly
for user authentication purposes.
Setting up a repmgr user
------------------------
Make sure that the "standby" user has a role in the database, "pgbench" in this
case, and can login. On "node1"::
createuser --login --superuser repmgr
Alternately you could start ``psql`` on the pgbench database on "node1" and at
the pgbench=# prompt type::
CREATE ROLE repmgr SUPERUSER LOGIN;
The main advantage of the latter is that you can do it remotely to any
system you already have superuser access to.
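For example, assuming remote superuser connections as the "postgres" user are
already allowed in ``pg_hba.conf``, this run from any host would create the
same role::
psql -h node1 -U postgres -d pgbench -c "CREATE ROLE repmgr SUPERUSER LOGIN;"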
Clearing the PostgreSQL installation on the Standby
---------------------------------------------------
To set up a new streaming replica, start by removing any existing PostgreSQL
installation on the standby nodes.
* Stop any server on "node2" and "node3". You can confirm whether a database
server is running using a command like this::
ps -eaf | grep postgres
and looking for the various database server processes: server, logger,
wal writer, and autovacuum launcher.
* Go to "node2" and "node3" database directories and remove the PostgreSQL installation::
cd $PGDATA
rm -rf *
This will delete the entire database installation in ``/var/lib/pgsql/9.0/data``.
Be careful that $PGDATA is defined here; executing ``ls`` to confirm you're
in the right place is always a good idea before executing ``rm``.
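A minimal sketch of that check, assuming $PGDATA is set in the postgres
user's environment, might look like::
[postgres@node2]$ echo $PGDATA
/var/lib/pgsql/9.0/data
[postgres@node2]$ cd $PGDATA && ls
base  global  pg_xlog  postgresql.conf ...
Only proceed with ``rm -rf *`` once the listing shows the expected database
files.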
Testing remote access to the master
-----------------------------------
On the "node2" server, first test that you can connect to "node1" the
way repmgr will by executing::
psql -h node1 -U repmgr -d pgbench
Possible sources for a problem here include:
* Login role specified was not created on "node1"
* The database configuration on "node1" is not listening on a TCP/IP port.
That could be because the ``listen_addresses`` parameter was not updated,
or because the server wasn't restarted after it was. You can
test this on "node1" itself the same way::
psql -h node1 -U repmgr -d pgbench
With the "-h" parameter forcing a connnection over TCP/IP, rather
than the default UNIX socket method.
* There is a firewall setup that prevents incoming access to the
PostgreSQL port (defaulting to 5432) used to access "node1". In
this situation you would be able to connect to the "node1" server
on itself, but not from any other host, and you'd just get a timeout
when trying rather than a proper error message.
* The ``pg_hba.conf`` file does not list appropriate statements to allow
this user to login; an example entry is shown below. In this case you will
be able to connect to the server, but see an error message mentioning
``pg_hba.conf``.
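A ``pg_hba.conf`` entry allowing this connection might look like the
following; the network address here is an assumption, so adjust it and the
authentication method to match your environment::
host    pgbench    repmgr    192.168.1.0/24    md5
Remember to reload the server configuration after editing ``pg_hba.conf``.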
Cloning the standby
-------------------
With "node1" server running, we want to use the ``clone standby`` command
in repmgr to copy over the entire PostgreSQL database cluster onto the
"node2" server. Execute the clone process with::
repmgr -D $PGDATA -d pgbench -p 5432 -U repmgr -R postgres --verbose standby clone node1
Here "-U" specifies the database user to connect to the master as, while
"-R" specifies what user to run the rsync command as. Potentially you
could leave out one or both of these, in situations where the user and/or
role setup is the same on each node.
If this fails with an error message about accessing the master database,
you should return to the previous step and confirm access to "node1"
from "node2" with ``psql``, using the same parameters given to repmgr.
Setting up the repmgr configuration file
----------------------------------------
Create a directory to store the repmgr configuration on each node.
In it, there needs to be a ``repmgr.conf`` file for that node.
For each node we'll assume this is stored in ``/var/lib/pgsql/repmgr/repmgr.conf``,
following the standard directory structure of a RHEL system. On "node1" it
should contain::
cluster=test
node=1
conninfo='host=node1 user=repmgr dbname=pgbench'
On "node2" create the file ``/var/lib/pgsql/repmgr/repmgr.conf`` with::
cluster=test
node=2
conninfo='host=node2 user=repmgr dbname=pgbench'
The STANDBY CLONE process should have created a recovery.conf file on
"node2" in the $PGDATA directory that reads as follows::
standby_mode = 'on'
primary_conninfo = 'host=node1 port=5432'
Registering the master and standby
----------------------------------
First, register the master by typing on "node1"::
repmgr -f /var/lib/pgsql/repmgr/repmgr.conf --verbose master register
Start the "standby" server.
Register the standby by typing on "node2"::
repmgr -f /var/lib/pgsql/repmgr/repmgr.conf --verbose standby register
At this point, you have a functioning primary on "node1" and a functioning
standby server running on "node2". You can confirm the master knows
about the standby, and that it is keeping it current, by running the
following on the master::
psql -x -d pgbench -c "SELECT * FROM repmgr_test.repl_status"
Some tests you might do at this point include:
* Insert some records into the primary server here, confirm they appear
very quickly (within milliseconds) on the standby, and that the
repl_status view advances accordingly; a sketch of this appears below.
* Verify that you can run queries against the standby server, but
cannot make insertions into the standby database.
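A simple version of the first test, using an arbitrary example table name,
might look like::
[postgres@node1]$ psql -d pgbench -c "CREATE TABLE smoke_test (t timestamptz)"
[postgres@node1]$ psql -d pgbench -c "INSERT INTO smoke_test VALUES (now())"
[postgres@node2]$ psql -d pgbench -c "SELECT * FROM smoke_test"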
Simulating the failure of the primary server
--------------------------------------------
To simulate the loss of the primary server, simply stop the "node1" server.
At this point, the standby contains the database as it existed at the time of
the "failure" of the primary server.
Promoting the Standby to be the Primary
---------------------------------------
Now you can promote the standby server to be the primary, to allow
applications to read and write to the database again, by typing::
repmgr -f /var/lib/pgsql/repmgr/repmgr.conf --verbose standby promote
The server restarts and now has read/write ability.
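One way to confirm this, using a standard PostgreSQL function, is to check
that the promoted server no longer reports itself as being in recovery::
psql -d pgbench -c "SELECT pg_is_in_recovery()"
This should now return ``f`` on "node2".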
Bringing the former Primary up as a Standby
-------------------------------------------
To make the former primary act as a standby, which is necessary before
restoring the original roles, type this on "node1", cloning from the
current primary "node2"::
repmgr -U postgres -R postgres -h node2 -p 5432 -d pgbench --force --verbose standby clone
Stop and restart the "node1" server, which is now acting as a standby server.
Make sure the record(s) inserted in the earlier step are still available on
the new standby ("node1"). Confirm the database on "node1" is read-only.
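For example, an attempted write on "node1", reusing the example table from
the earlier smoke test, should fail with a read-only error::
[postgres@node1]$ psql -d pgbench -c "INSERT INTO smoke_test VALUES (now())"
ERROR:  cannot execute INSERT in a read-only transaction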
Restoring the original roles of "node1" as primary and "node2" as standby
--------------------------------------------------------------------------
Now restore to the original configuration by stopping
"node2" (now acting as a primary), promoting "node1" again to be the
primary server, then bringing up "node2" as a standby with a valid
``recovery.conf`` file.
Stop the "node2" server::
repmgr -f /var/lib/pgsql/repmgr/repmgr.conf standby promote
Now the original primary, "node1", is acting again as primary.
Rebuild "node2" as a standby by typing this on "node2", then start the
"node2" server::
repmgr standby clone --force -h node1 -p 5432 -U postgres -R postgres --verbose
Verify the roles have reversed by attempting to insert a record on "node2"
and on "node1".
The servers are now again acting as primary on "node1" and standby on "node2".
Alternate setup: both servers on one host
==========================================
Another test setup assumes you might be using the default installation of
PostgreSQL on port 5432 for some other purpose, and instead relocates these
instances onto different ports running as different users. In places where
``127.0.0.1`` is used as a host name, a more traditional configuration
would instead use the name of the relevant host for that parameter.
You can usually leave out changes to the port number in this case too.
* A primary (master) server called "prime," with a user as "prime," who is
also the owner of the files. This server is operating on port 5433. This
server will be known as node1" in the cluster test"
server will be known as "node1" in the cluster "test"
* A standby server called standby", with a user of standby", who is the
* A standby server called "standby", with a user of "standby", who is the
owner of the files. This server is operating on port 5434. This server
will be known and node2" on the cluster test."
will be known and "node2" on the cluster "test."
* A database exists on prime" called testdb."
* A database exists on "prime" called "testdb."
* The PostgreSQL installation in each of the above is defined as $PGDATA,
which is represented here with ``/data/prime`` as the "prime" server and
@@ -623,7 +831,7 @@ Setup a streaming replica, strip away any PostgreSQL installation on the existin
* Stop both servers.
* Go to standby" database directory and remove the PostgreSQL installation::
* Go to "standby" database directory and remove the PostgreSQL installation::
cd $PGDATA
rm -rf *
@@ -635,33 +843,33 @@ Building the standby
Create a directory to store the repmgr configuration on each node.
In it, there needs to be a ``repmgr.conf`` file for that node.
For prime" we'll assume this is stored in ``/home/prime/repmgr``
For "prime" we'll assume this is stored in ``/home/prime/repmgr``
and it should contain::
cluster=test
node=1
conninfo='host=127.0.0.1 dbname=testdb'
On standby" create the file ``/home/standby/repmgr/repmgr.conf`` with::
On "standby" create the file ``/home/standby/repmgr/repmgr.conf`` with::
cluster=test
node=2
conninfo='host=127.0.0.1 dbname=testdb'
Next, with prime" server running, we want to use the ``clone standby`` command
Next, with "prime" server running, we want to use the ``clone standby`` command
in repmgr to copy over the entire PostgreSQL database cluster onto the
standby" server. On the standby" server, type::
"standby" server. On the "standby" server, type::
repmgr -D $PGDATA -p 5433 -U prime -R prime --verbose standby clone localhost
Next, we need a recovery.conf file on standby" in the $PGDATA directory
Next, we need a recovery.conf file on "standby" in the $PGDATA directory
that reads as follows::
standby_mode = 'on'
primary_conninfo = 'host=127.0.0.1 port=5433'
Make sure that standby has a qualifying role in the database, testdb" in this
case, and can login. Start ``psql`` on the testdb database on prime" and at
Make sure that standby has a qualifying role in the database, "testdb" in this
case, and can login. Start ``psql`` on the testdb database on "prime" and at
the testdb# prompt type::
CREATE ROLE standby SUPERUSER LOGIN;
@@ -669,30 +877,40 @@ the testdb# prompt type::
Registering the master and standby
----------------------------------
First, register the master by typing on prime"::
First, register the master by typing on "prime"::
repmgr -f /home/prime/repmgr/repmgr.conf --verbose master register
On standby," edit the ``postgresql.conf`` file and change the port to 5434.
On "standby," edit the ``postgresql.conf`` file and change the port to 5434.
Start the standby" server.
Start the "standby" server.
Register the standby by typing on standby"::
Register the standby by typing on "standby"::
repmgr -f /home/standby/repmgr/repmgr.conf --verbose standby register
At this point, you have a functioning primary on prime" and a functioning
standby server running on standby." It's recommended that you insert some
records into the primary server here, then confirm they appear very quickly
(within milliseconds) on the standby. Also verify that one can make queries
against the standby server and cannot make insertions into the standby database.
At this point, you have a functioning primary on "prime" and a functioning
standby server running on "standby." You can confirm the master knows
about the standby, and that it is keeping it current, by running the
following on the master::
psql -x -d testdb -c "SELECT * FROM repmgr_test.repl_status"
Some tests you might do at this point include:
* Insert some records into the primary server here, confirm they appear
very quickly (within milliseconds) on the standby, and that the
repl_status view advances accordingly.
* Verify that you can run queries against the standby server, but
cannot make insertions into the standby database.
Simulating the failure of the primary server
--------------------------------------------
To simulate the loss of the primary server, simply stop the prime" server.
To simulate the loss of the primary server, simply stop the "prime" server.
At this point, the standby contains the database as it existed at the time of
the failure" of the primary server.
the "failure" of the primary server.
Promoting the Standby to be the Primary
---------------------------------------
@@ -712,36 +930,54 @@ restoring the original roles, type::
repmgr -U standby -R prime -h 127.0.0.1 -p 5434 -d testdb --force --verbose standby clone
Stop and restart the prime" server, which is now acting as a standby server.
Stop and restart the "prime" server, which is now acting as a standby server.
Make sure the record(s) inserted in the earlier step are still available on the
now standby (prime). Confirm the database on prime" is read-only.
now standby (prime). Confirm the database on "prime" is read-only.
Restoring the original roles of prime to primary and standby to standby
-----------------------------------------------------------------------
Now restore to the original configuration by stopping the
standby" (now acting as a primary), promoting prime" again to be the
primary server, then bringing up standby" as a standby with a valid
``recovery.conf`` file on standby".
"standby" (now acting as a primary), promoting "prime" again to be the
primary server, then bringing up "standby" as a standby with a valid
``recovery.conf`` file on "standby".
Stop the standby" server::
Stop the "standby" server::
repmgr -f /home/prime/repmgr/repmgr.conf standby promote
Now the original primary, prime" is acting again as primary.
Now the original primary, "prime" is acting again as primary.
Start the standby" server and type this on prime"::
Start the "standby" server and type this on "prime"::
repmgr standby clone --force -h 127.0.0.1 -p 5434 -U prime -R standby --verbose
Stop the standby" and change the port to be 5434 in the ``postgresql.conf``
Stop the "standby" and change the port to be 5434 in the ``postgresql.conf``
file.
Verify the roles have reversed by attempting to insert a record on standby"
and on prime."
Verify the roles have reversed by attempting to insert a record on "standby"
and on "prime."
The servers are now again acting as primary on prime" and standby on standby".
The servers are now again acting as primary on "prime" and standby on "standby".
Error codes
===========
When the repmgr or repmgrd program exits, it will set one of the
following error codes:
* SUCCESS 0: Program ran successfully.
* ERR_BAD_CONFIG 1: One of the configuration checks the program makes failed.
* ERR_BAD_RSYNC 2: An rsync call made by the program returned an error.
* ERR_STOP_BACKUP 3: A ``pg_stop_backup()`` call made by the program didn't succeed.
* ERR_NO_RESTART 4: An attempt to restart a PostgreSQL instance failed.
* ERR_NEEDS_XLOG 5: Could not create the ``pg_xlog`` directory when cloning.
* ERR_DB_CON 6: Error when trying to connect to a database.
* ERR_DB_QUERY 7: Error executing a database query.
* ERR_PROMOTED 8: Exiting program because the node has been promoted to master.
* ERR_BAD_PASSWORD 9: Password used to connect to a database was rejected.
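A wrapper script reacting to these codes might look like the following
sketch; the clone parameters shown are just examples reusing the walkthrough
above::
#!/bin/sh
# Hypothetical wrapper: clone a standby, then report selected failures
repmgr -D $PGDATA -d pgbench -U repmgr -R postgres standby clone node1
rc=$?
case $rc in
0) echo "clone succeeded" ;;
2) echo "rsync failed; check connectivity and ssh keys" ;;
6) echo "could not connect to the master database" ;;
*) echo "repmgr exited with error code $rc" ;;
esac
exit $rc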
License and Contributions
=========================