Fix bug discovered last week which prevents a recovered standby from being
used in the cluster.
The main issue was that if the local repmgrd was not able to connect locally,
it would mark the local node as failed (active = false). This is fine: we
don't actually know whether the node is active (in fact, it isn't at that
moment), so it's best to keep it out of the cluster.
The problem is that if the postgres service comes back up and is able to
recover by itself, we should acknowledge that fact and set the node as
active again.
There was another issue: repmgrd was being terminated if the postgres
service went down. This is not the correct behaviour; we should keep
trying to connect to the local standby.
Martín Marqués
2015-12-07 16:14:19 -03:00
parent 96ac39ba0f
commit c9db7f57d2
@@ -690,18 +690,12 @@ standby_monitor(void)
 		initPQExpBuffer(&errmsg);
 		appendPQExpBuffer(&errmsg,
-						  _("failed to connect to local node, node marked as failed and terminating!"));
+						  _("failed to connect to local node, node marked as failed!"));
 		log_err("%s\n", errmsg.data);
 		create_event_record(master_conn,
 							&local_options,
 							local_options.node,
 							"repmgrd_shutdown",
 							false,
 							errmsg.data);
-		terminate(ERR_DB_CON);
+		//terminate(ERR_DB_CON);
+		goto continue_monitoring_standby;
 	}
 	upstream_conn = get_upstream_connection(my_local_conn,
@@ -830,6 +824,7 @@ standby_monitor(void)
 	PQfinish(upstream_conn);
+continue_monitoring_standby:
 	/* Check if we still are a standby, we could have been promoted */
 	do
 	{