Commit Graph

284 Commits

Author SHA1 Message Date
Jaime Casanova
2e7acf03c4 If PQgetCancel() returns NULL we should also return false.
Noted by Andres Freund.
2013-07-12 08:01:01 -05:00
Jaime Casanova
2bc8044fda Improve messages in wait_connection_availability, so we know what
error makes the failover procedure to start

By gripe from Andres Freund
2013-07-10 19:25:58 -05:00
Jaime Casanova
b0b44a157f If PQcancel() fails, consider it as if the master is failing.
Because PQcancel() establish a new synchronous connection to the
database, if it fails it means something wrong has happenned with
master. So instead of just ignore the failure, CancelQuery() now
reports a failure condition so we can detect master's death in
that situation.

This is very important specially when only postmaster crashes but
other children/backend connections are still there. Because the
children connection won't fail and CancelQuery() failure is our
only indication of something wrong happenning.
Currently we just ignore the PQcancel() failure which leads us to
a situation in which we just loop forever
trying to cancel the async query.

Reported by: Martin Euser <martin.euser@nl.abnamro.com>
Problem analyzed and bug spotted by: Andres Freund <andres@2ndquadrant.com>
Patch by: Jaime Casanova <jaime@2ndquadrant.com>
2013-07-10 09:53:45 -05:00
Jaime Casanova
49a2531930 Options -F -W -I -v doesn't accept arguments, which means that on
getopt_long shouldn't be marked with the colon (:) character.

This has been wrong since day one, so backpatching all the way until
1.1
2013-01-13 16:37:39 -05:00
Jaime Casanova
4191b77e70 If the node is a witness don't bother asking its position, it always
will be 0/0. We just need to check that we can connect to it to determine
if we are in the majority.
2013-01-11 03:42:08 -05:00
Jaime Casanova
2a5d431481 Fix a problem that caused a standby to promote itself without going to
voting procedure.

This is because of a race condition inside CheckPrimaryConnection().

This has independently reported by Alex Railean and Dumitru, and Frank Jördens.
Analyzed and fixed by Cédric Villemain.

The fix have been verified to work by Frank
2012-12-19 12:01:27 -05:00
Jaime Casanova
93a999adc7 Formatting code using astyle 2012-12-11 11:49:07 -05:00
Jaime Casanova
088ca29fe3 To select new master it needs to know which standby has received more
xlog records from master, so it standby should use pg_last_xlog_receive_location()
to report their positions. This solves a possible situation in which
a standby that is considered as new master when promoted is no longer
the best option.
2012-12-03 09:18:08 -05:00
Jaime Casanova
30e9d06172 Add an option for STANDBY FOLLOW to wait for a master to appear.
This is important for autofailover to do the right thing when
standbys detected master death at different times.

While this is a new option, seems important for the autofailover
to work properly so i will consider the lack of it a bug and
will backpatch to 2.0 where autofailover was introduced.

For gripe from Alex Railean, about a standby not finding the new
master because the new master hasn't finish promoting.
2012-11-14 15:09:26 -05:00
Jaime Casanova
cd1a84252e Fix node decision logic when priorities are involved. Currently if
two nodes with different prorities are equally good to be promoted
the second one (with a lower priority, considering them
in descending order) will win.

Per report from Brailean Dumitru
2012-09-16 02:47:02 -05:00
Jaime Casanova
2e19b3688b Add a comment 2012-09-16 02:26:18 -05:00
Jaime Casanova
de883a4c84 Keep compiler quiet. Noted when compiling in FreeBSD in which i
get a warning for an uninitialized variable.

Also, define InvalidXLogRecPtr. We don't really need it but using
it make the initialization future proof (considering that in 9.3
XLogRecPtr will change its structure).
2012-09-16 02:21:18 -05:00
Jaime Casanova
499a501afd Make repmgr compatible with FreeBSD.
We need to add an #include and make it use a different path for the
"true" binary.

Maybe we need to make this changes for all BSD systems but having no
evidence of that i prefer to make this only for systems with __FreeBSD__
2012-09-15 17:37:59 -05:00
Jaime Casanova
0a9107d76d Improve sample of commands for promote and follow 2012-09-15 17:37:43 -05:00
Jaime Casanova
95ec0450da When we have more command-line arguments than we should have we
need to show that last value and we should use only optind for that
instead of optind+1
2012-08-30 02:11:48 -05:00
Jaime Casanova
57aa95f674 Fix documentation to always use -h sintax to refer to the node we
want to clone or connect to, instead of relying on the fact that
for some time putting that argument at last worked.
2012-08-30 02:10:10 -05:00
Jaime Casanova
56d2ae4e81 Fix HISTORY to show from newest to oldest v2.0beta1 2012-07-27 11:26:18 -05:00
Jaime Casanova
3edd87a041 Fix tabs in HISTORY 2012-07-27 11:20:56 -05:00
Jaime Casanova
740208da1c Fix typos in RELEASE NOTES 2012-07-27 11:15:50 -05:00
Jaime Casanova
664e1a8321 Now that we can have no monitoring we need to check all nodes at failover
not only those in repl_monitor
2012-07-21 17:49:38 -05:00
Jaime Casanova
d43c6334da Prepare HISTORY and release notes for release 2012-07-21 12:06:33 -05:00
Jaime Casanova
f984b3fd33 Document tunables added in aaf35947ed 2012-07-21 11:10:59 -05:00
Jaime Casanova
aaf35947ed Add tunables for connection retries to master and interval between
connection retries, these parameters along with master_response_timeout
determines the amount of time since failure to failover
2012-07-21 11:01:00 -05:00
Jaime Casanova
08ed0aa987 Commit 2d24518d9d added an additional
'}' at the end of parse_config(). removing.
2012-07-21 10:42:58 -05:00
Jaime Casanova
2d24518d9d If master_response_timeout hasn't been set in repmgr.conf it defaults
to zero, which was causing to a false positive in the failure detection
logic in wait_connection_availability(). So, change that to defaults to 60s
and add a check to avoid it being set to zero or negative.

Problem reported and analyzed by Andrew Newman
2012-07-21 09:49:05 -05:00
Jaime Casanova
a6c94b29de Change release notes because of commit bf241ba1d6 2012-07-06 02:00:46 -05:00
Jaime Casanova
bf241ba1d6 Make the monitoring history capabilities of repmgr be optional and
turned off by default. Most of it has been superseeded by
pg_stat_replication view, we can still start it by using the switch
--monitoring-history
2012-07-06 01:51:22 -05:00
Jaime Casanova
41dbc39527 Add release notes 2012-07-05 09:35:23 -05:00
Jaime Casanova
50b7147f15 Change Copyright date to cover 2012 2012-07-04 10:47:26 -05:00
Jaime Casanova
f5e57aa433 Add an option for "no-history" mode, where repmgrd just checks the
conectivity of master but don't INSERT any data into it
2012-07-04 10:07:31 -05:00
Jaime Casanova
ac5a9d1fd6 The release changed, just wait a little before setting it.
Also make well known names in HISTORY be only names, without
last name
2012-07-02 00:06:57 -05:00
Jaime Casanova
cb740b68be Add a check of the connection inside the CancelQuery() so it check
that before trying to cancel a query, which can block.
2012-06-26 11:29:02 -05:00
Jaime Casanova
d58ea77798 Add a quick setup for autofailover 2012-06-26 07:49:43 -05:00
Jaime Casanova
e3c3c22b6e Improve the version message to actually show the repmgr version not
only postgresql's one
2012-06-25 22:54:48 -05:00
Jaime Casanova
861a3c8f22 Fix CLUSTER CLEANUP, it needs to establish a local connection in order
to look for the master
2012-06-16 01:32:59 -05:00
Jaime Casanova
e51870b504 Force to enter a password for the superuser in the witness, this is
in case we need to send a password to connect as stated in
master's pg_hba.conf.
2012-06-15 13:51:45 -05:00
Jaime Casanova
5651720560 Remove a variable left in last commit 2012-06-15 09:46:01 -05:00
Jaime Casanova
d32a6cdb24 Remove kludge added to create user and db for witness.
It's too fragile, almost always cause a "segment violation" and
don't seems to be very useful.
2012-06-15 09:41:54 -05:00
Jaime Casanova
9e10987b90 Fix a few bugs introduced when merging features 2012-06-15 09:40:09 -05:00
Jaime Casanova
64fce88e99 Add a CLUSTER CLEANUP command to clean monitor's history,
also include a --keep-history (-k) option to indicate how many
days of history to keep
2012-06-13 00:39:54 -05:00
Jaime Casanova
7a76f1998c getMasterConnection() cannot avoid checking the same node that asks
to find the master.
This was a micro optimization based on the fact that all commands that
needed to detect the master were executed from the standby but now that
we have CLUSTER level commands that is not true anymore
2012-06-12 23:28:24 -05:00
Jaime Casanova
4db046a8ea Allow repmgr to obtain tablespace's locations from pg 9.2 and later
in which we no longer have a spclocation column in pg_tablespaces
2012-06-12 11:08:15 -05:00
Jaime Casanova
331eca447a STANDBY CLONE should be run by a SUPERUSER, otherwise we won't be able
to retrieve data_directory and the other parameters we need by
querying the database.
2012-06-12 09:42:50 -05:00
Jaime Casanova
b5b2f93f7e Merge branches 'master' and 'async' 2011-12-02 00:28:17 -05:00
Jaime Casanova
9d03d4a254 After checking that master is alive, is_pgup() should return not keep
checking forever.
2011-12-01 23:58:12 -05:00
Jaime Casanova
3b2ccc5b78 Add a master_response_timeout parameter and use it to limit the amount
of time we spent a reponse from master before declaring the failure.
Also, change is_pgup() so it use PQsendQuery() instead of PQexec to
execute the check of master
2011-12-01 01:20:33 -05:00
Jaime Casanova
89a1e2bcbd Not even consider old master as an option in failover 2011-11-27 19:17:59 -05:00
Jaime Casanova
7077a7c68f Add -w option to pg_ctl commands so we wait until command is finish.
Or at least, we try. By default, after 60 seconds pg_ctl just return.
This make useless to wait ourselves after pg_ctl start of witness so
remove the sleep
2011-11-27 18:38:53 -05:00
Jaime Casanova
9b8fb7e960 Remove last argument from log_err, left in commit 55c7ea4b5e.
Also rephrase the sentence

Reported by Jeroen Dekkers
2011-11-25 14:59:29 -05:00
Jaime Casanova
55c7ea4b5e Fix a wrong message.
It was saying the problem is the version of the PostgreSQL server while
it actually is because the MASTER REGISTER command was running on a
standby node
2011-11-10 09:38:12 -05:00