Commit Graph

130 Commits

Author SHA1 Message Date
Christian Kruse
680f23fb1d copyright push 2014-01-23 10:37:49 +01:00
Christian Kruse
897daddcc7 removed not needed arguments to avoid compiler warnings 2014-01-22 15:17:28 +01:00
Christian Kruse
de58eff7c1 added a chdir() for proper daemonizing 2014-01-22 14:30:38 +01:00
Christian Kruse
e007a55967 fix: do not use fsync()
We do not need fsync(), the fflush() is enough to avoid concurrent
logs.
2014-01-22 11:47:50 +01:00
Christian Kruse
4ef6fbb5fe do not close stderr but reopen it to /dev/null
We want stderr to be always a valid file descriptor
2014-01-21 16:25:57 +01:00
Christian Kruse
2e61d7b156 refactoring: daemonizing is now a function 2014-01-21 16:19:49 +01:00
Christian Kruse
4496a0761e we now use a function and are more sophisticated
Refactoring part: we now use a function to generate the PID
file. Sophistication: we now check if the PID contained in the file is a
valid PID. We ignore the file if it doesn't.
2014-01-21 16:18:15 +01:00
Christian Kruse
3978ead184 use a second fork to avoid a terminal
after the setsid() we are the process leader. And as a process leader we
are able to open a new terminal, even if we currently don't own one. So
we do another fork and do not call setsid() and not become a process
leader to avoid that.
2014-01-21 15:51:33 +01:00
Christian Kruse
b36dbf61fe reopening stdin and stdout to /dev/null now
stdin, stdout and stderr should always be valid file handles. Thus we
don't close them but reopen them to /dev/null
2014-01-21 15:31:38 +01:00
Christian Kruse
84466ecca5 log_crit() is more appropriate 2014-01-21 15:23:20 +01:00
Christian Kruse
649086e5e4 use unlink() instead of remove()
`remove()` will do a rmdir if necessary - we don't want that. So we use `unlink()`
2014-01-21 15:22:31 +01:00
Christian Kruse
7cf2eb440d renamed config options to a much more descriptive name 2014-01-21 15:19:50 +01:00
Christian Kruse
1db61ce277 fix: fail when repmgr_funcs is not pre-loaded
when repmgr_funcs is not pre-loaded `repmgr_update_standby_location()`
will return false and `repmgr_get_last_standby_location()` will return
an empty string. Thus we may end in an endless loop. To avoid that we fail.
2014-01-21 13:54:10 +01:00
Christian Kruse
41abf9a7ef fix: flushing and fsync()ing the log file
When not flushing and fsync()ing it the output may be garbled due to
concurrent writes to the file (system() spawns a child process with
stdin/stdout/stderr inherited from it's parent)
2014-01-21 13:52:27 +01:00
Christian Kruse
abebc53ddc fix: sscanf() does not set variables to 0 on error 2014-01-21 13:48:41 +01:00
Christian Kruse
5fc4a0382f added config options sleep_delay and sleep_monitor
sleep_monitor replaces the old SLEEP_MONITOR define and makes it
configurable; this is the interval in which we monitor

sleep_delay replaces the old sleep(300) when waiting for the master to
recover.
2014-01-17 14:35:50 +01:00
Christian Kruse
a7d3c9b93a fix: also close stderr when using syslog logging 2014-01-17 12:14:26 +01:00
Christian Kruse
ee9dc9e247 do not use exit()
We avoid using exit() to be able to clean up when we have to
terminate. This includes removal of the PID file as well as closing
database connections.
2014-01-17 11:28:55 +01:00
Christian Kruse
94cb5b94e7 fix: reopen log file on SIGHUP 2014-01-16 17:16:45 +01:00
Christian Kruse
a08aa50f92 fix: close stdin and stdout only in repmgrd
closing stdin and stdout might cause problems when using system(), so we
avoid it.
2014-01-16 16:01:58 +01:00
Christian Kruse
9563877fbb new config option, stdout/stdin closed
Now stdin and stdout get closed. Additionally stderr gets closed and
reopened to the new config option „logfile“ if specified
2014-01-16 15:22:34 +01:00
Christian Kruse
77aa6aa326 do not exit in pg_version 2014-01-16 14:48:42 +01:00
Christian Kruse
18206b3a64 do not exit() in is_witness 2014-01-16 14:28:56 +01:00
Christian Kruse
91446bcf93 fix: do not try to reconnect infinitely 2014-01-10 17:26:02 +01:00
Christian Kruse
dcdf8788ae fix: handle connection loss to standby
We do basically the same as we do for the master since connections drop
from time to time
2014-01-10 17:12:03 +01:00
Christian Kruse
4fabfbbbd0 fix: do not exit in is_standby()
Instead we now return an int with 0 meaning „not a standby,“ 1 meaning
„is a standby“ and -1 meaning „connection dropped“
2014-01-10 17:11:16 +01:00
Christian Kruse
c41030b40e Merge branch 'REL2_0_STABLE'
Conflicts:
	HISTORY
	dbutils.h
	repmgr.c
	repmgrd.c
	version.h
2014-01-10 16:07:33 +01:00
Christian Kruse
4c3d7f80ed now code compiles with -ansi -pedantic and has less warnings 2014-01-09 14:45:07 +01:00
Christian Kruse
0e8ff1730e added handling of a PID file 2014-01-09 13:04:40 +01:00
Christian Kruse
634fdff303 fix: do not call setup_event_handlers() on WIN32
If we put setup_event_handlers() in #ifdef WIN32, we have to do it for
the call and the declaration, too
2014-01-09 12:57:16 +01:00
Christian Kruse
920f925e4b added a new cli option --daemonize
This option forks the process and generates a new session. This
effectively detaches it from the shell. Don't forget to redirect stderr
or use syslog for logging!
2014-01-08 11:53:15 +01:00
Christian Kruse
9fe2d6886e white space cleanup 2014-01-07 16:42:06 +01:00
Jaime Casanova
3c8df59eb9 Make repmgr compile in 9.3.
Patch provided by Shawn Ellis with some fixes by me.
2013-11-14 00:43:35 -05:00
Jaime Casanova
d99024ba11 Make repmgrd survive to the failover
To do this it needs to reconnect to the new master
2013-09-26 11:58:59 -05:00
Jaime Casanova
1afaa3a26f Rearrange the logic in do_failover() for further improvements.
Specially, make this a more coordinated process by making all
nodes waiting for the others before going to the next step.

This is one step further in following Andres Freund advices
but there is still a lot to do in order to complete that,
specially it could be needed to add more fields to repl_nodes
and to the shm area.
2013-09-23 18:28:58 -05:00
Jaime Casanova
079a7c9f16 In a failover situation get the nodes in a well defined order.
When deciding which node will be the new master, we should get the
nodes in a well defined order otherwise two standbys could process
nodes with the same priority in different order and end up with
a two master situation.
2013-07-26 00:59:50 -05:00
Jaime Casanova
3b66a31ac9 In a failover situation get the nodes in a well defined order.
When deciding which node will be the new master, we should get the
nodes in a well defined order otherwise two standbys could process
nodes with the same priority in different order and end up with
a two master situation.
2013-07-26 00:52:31 -05:00
Jaime Casanova
ab1d380843 If PQcancel() fails, consider it as if the master is failing.
Because PQcancel() establish a new synchronous connection to the
database, if it fails it means something wrong has happenned with
master. So instead of just ignore the failure, CancelQuery() now
reports a failure condition so we can detect master's death in
that situation.

This is very important specially when only postmaster crashes but
other children/backend connections are still there. Because the
children connection won't fail and CancelQuery() failure is our
only indication of something wrong happenning.
Currently we just ignore the PQcancel() failure which leads us to
a situation in which we just loop forever
trying to cancel the async query.

Reported by: Martin Euser <martin.euser@nl.abnamro.com>
Problem analyzed and bug spotted by: Andres Freund <andres@2ndquadrant.com>
Patch by: Jaime Casanova <jaime@2ndquadrant.com>
2013-07-10 10:21:51 -05:00
Jaime Casanova
b0b44a157f If PQcancel() fails, consider it as if the master is failing.
Because PQcancel() establish a new synchronous connection to the
database, if it fails it means something wrong has happenned with
master. So instead of just ignore the failure, CancelQuery() now
reports a failure condition so we can detect master's death in
that situation.

This is very important specially when only postmaster crashes but
other children/backend connections are still there. Because the
children connection won't fail and CancelQuery() failure is our
only indication of something wrong happenning.
Currently we just ignore the PQcancel() failure which leads us to
a situation in which we just loop forever
trying to cancel the async query.

Reported by: Martin Euser <martin.euser@nl.abnamro.com>
Problem analyzed and bug spotted by: Andres Freund <andres@2ndquadrant.com>
Patch by: Jaime Casanova <jaime@2ndquadrant.com>
2013-07-10 09:53:45 -05:00
Jaime Casanova
7d94151494 If the node is a witness don't bother asking its position, it always
will be 0/0. We just need to check that we can connect to it to determine
if we are in the majority.
2013-01-11 03:44:50 -05:00
Jaime Casanova
4191b77e70 If the node is a witness don't bother asking its position, it always
will be 0/0. We just need to check that we can connect to it to determine
if we are in the majority.
2013-01-11 03:42:08 -05:00
Jaime Casanova
2a5d431481 Fix a problem that caused a standby to promote itself without going to
voting procedure.

This is because of a race condition inside CheckPrimaryConnection().

This has independently reported by Alex Railean and Dumitru, and Frank Jördens.
Analyzed and fixed by Cédric Villemain.

The fix have been verified to work by Frank
2012-12-19 12:01:27 -05:00
Jaime Casanova
81b8a944de Fix a problem that caused a standby to promote itself without going to
voting procedure.

This is because of a race condition inside CheckPrimaryConnection().

This has independently reported by Alex Railean and Dumitru, and Frank Jördens.
Analyzed and fixed by Cédric Villemain.

The fix have been verified to work by Frank
2012-12-19 11:45:58 -05:00
Jaime Casanova
93a999adc7 Formatting code using astyle 2012-12-11 11:49:07 -05:00
Jaime Casanova
1b69282df9 Formatting code using astyle 2012-12-11 11:47:59 -05:00
Jaime Casanova
06dd252f69 To select new master it needs to know which standby has received more
xlog records from master, so it standby should use pg_last_xlog_receive_location()
to report their positions. This solves a possible situation in which
a standby that is considered as new master when promoted is no longer
the best option.
2012-12-03 09:27:12 -05:00
Jaime Casanova
088ca29fe3 To select new master it needs to know which standby has received more
xlog records from master, so it standby should use pg_last_xlog_receive_location()
to report their positions. This solves a possible situation in which
a standby that is considered as new master when promoted is no longer
the best option.
2012-12-03 09:18:08 -05:00
Jaime Casanova
cd1a84252e Fix node decision logic when priorities are involved. Currently if
two nodes with different prorities are equally good to be promoted
the second one (with a lower priority, considering them
in descending order) will win.

Per report from Brailean Dumitru
2012-09-16 02:47:02 -05:00
Jaime Casanova
5f33d9d715 Fix node decision logic when priorities are involved. Currently if
two nodes with different prorities are equally good to be promoted
the second one (with a lower priority, considering them
in descending order) will win.

Per report from Brailean Dumitru
2012-09-16 02:38:28 -05:00
Jaime Casanova
2e19b3688b Add a comment 2012-09-16 02:26:18 -05:00