Refactoring part: we now use a function to generate the PID
file. Sophistication: we now check if the PID contained in the file is a
valid PID. We ignore the file if it doesn't.
after the setsid() we are the process leader. And as a process leader we
are able to open a new terminal, even if we currently don't own one. So
we do another fork and do not call setsid() and not become a process
leader to avoid that.
when repmgr_funcs is not pre-loaded `repmgr_update_standby_location()`
will return false and `repmgr_get_last_standby_location()` will return
an empty string. Thus we may end in an endless loop. To avoid that we fail.
When not flushing and fsync()ing it the output may be garbled due to
concurrent writes to the file (system() spawns a child process with
stdin/stdout/stderr inherited from it's parent)
sleep_monitor replaces the old SLEEP_MONITOR define and makes it
configurable; this is the interval in which we monitor
sleep_delay replaces the old sleep(300) when waiting for the master to
recover.
This option forks the process and generates a new session. This
effectively detaches it from the shell. Don't forget to redirect stderr
or use syslog for logging!
Specially, make this a more coordinated process by making all
nodes waiting for the others before going to the next step.
This is one step further in following Andres Freund advices
but there is still a lot to do in order to complete that,
specially it could be needed to add more fields to repl_nodes
and to the shm area.
When deciding which node will be the new master, we should get the
nodes in a well defined order otherwise two standbys could process
nodes with the same priority in different order and end up with
a two master situation.
When deciding which node will be the new master, we should get the
nodes in a well defined order otherwise two standbys could process
nodes with the same priority in different order and end up with
a two master situation.
Because PQcancel() establish a new synchronous connection to the
database, if it fails it means something wrong has happenned with
master. So instead of just ignore the failure, CancelQuery() now
reports a failure condition so we can detect master's death in
that situation.
This is very important specially when only postmaster crashes but
other children/backend connections are still there. Because the
children connection won't fail and CancelQuery() failure is our
only indication of something wrong happenning.
Currently we just ignore the PQcancel() failure which leads us to
a situation in which we just loop forever
trying to cancel the async query.
Reported by: Martin Euser <martin.euser@nl.abnamro.com>
Problem analyzed and bug spotted by: Andres Freund <andres@2ndquadrant.com>
Patch by: Jaime Casanova <jaime@2ndquadrant.com>
Because PQcancel() establish a new synchronous connection to the
database, if it fails it means something wrong has happenned with
master. So instead of just ignore the failure, CancelQuery() now
reports a failure condition so we can detect master's death in
that situation.
This is very important specially when only postmaster crashes but
other children/backend connections are still there. Because the
children connection won't fail and CancelQuery() failure is our
only indication of something wrong happenning.
Currently we just ignore the PQcancel() failure which leads us to
a situation in which we just loop forever
trying to cancel the async query.
Reported by: Martin Euser <martin.euser@nl.abnamro.com>
Problem analyzed and bug spotted by: Andres Freund <andres@2ndquadrant.com>
Patch by: Jaime Casanova <jaime@2ndquadrant.com>
voting procedure.
This is because of a race condition inside CheckPrimaryConnection().
This has independently reported by Alex Railean and Dumitru, and Frank Jördens.
Analyzed and fixed by Cédric Villemain.
The fix have been verified to work by Frank
voting procedure.
This is because of a race condition inside CheckPrimaryConnection().
This has independently reported by Alex Railean and Dumitru, and Frank Jördens.
Analyzed and fixed by Cédric Villemain.
The fix have been verified to work by Frank
xlog records from master, so it standby should use pg_last_xlog_receive_location()
to report their positions. This solves a possible situation in which
a standby that is considered as new master when promoted is no longer
the best option.
xlog records from master, so it standby should use pg_last_xlog_receive_location()
to report their positions. This solves a possible situation in which
a standby that is considered as new master when promoted is no longer
the best option.
two nodes with different prorities are equally good to be promoted
the second one (with a lower priority, considering them
in descending order) will win.
Per report from Brailean Dumitru
two nodes with different prorities are equally good to be promoted
the second one (with a lower priority, considering them
in descending order) will win.
Per report from Brailean Dumitru