Add checkout_failure_limit config/feature (#911)

In a high-availability deployment of PgCat, a client may land on a PgCat container that is already very busy with other clients. The new client can then be stuck in a perpetual checkout failure loop because all connections are in use by other clients. This is especially true in session mode pools with long-lived client connections (e.g. FDW connections).

One way to fix this issue is to close a client connection after it encounters some number of checkout failures. This forces the client to hit the network load balancer again, land on a different process/container, and try to check out a connection there. If that fails too, the client is disconnected again and tries another one. This mechanism is guaranteed to eventually reach a balanced state where all clients are able to find connections, provided that the overall number of connections across all containers matches the number of clients.

I was able to reproduce this issue in a controlled environment and to show that this PR fixes it.
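
The disconnect rule described above can be sketched as a small per-client counter. This is only an illustration under stated assumptions, not the code in this PR: `CheckoutFailureTracker`, `record_failure`, and `failures` are hypothetical names, and the sketch assumes failures are counted over the lifetime of the client connection, with a limit of `0` meaning the check is disabled (as in the documented default).

```rust
/// Hypothetical per-client tracker; the names here are illustrative
/// and are not taken from PgCat's source.
#[derive(Debug)]
pub struct CheckoutFailureTracker {
    /// Configured checkout_failure_limit; 0 is assumed to mean "disabled".
    limit: u64,
    /// Checkout failures seen so far on this client connection.
    failures: u64,
}

impl CheckoutFailureTracker {
    pub fn new(limit: u64) -> Self {
        Self { limit, failures: 0 }
    }

    /// Record one failed checkout.
    /// Returns true when the client should be disconnected so that it
    /// reconnects through the load balancer and may land elsewhere.
    pub fn record_failure(&mut self) -> bool {
        self.failures = self.failures.saturating_add(1);
        self.limit != 0 && self.failures >= self.limit
    }

    /// Number of failures recorded so far.
    pub fn failures(&self) -> u64 {
        self.failures
    }
}
```

With a limit of 3, the first two failures leave the client connected and the third signals a disconnect; with a limit of 0, no number of failures triggers one.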
2026-03-23 01:16:30 +00:00 · 2025-02-27 13:17:00 -06:00
parent f8e2fcd0ed
commit 3349cecc18
6 changed files with 162 additions and 1 deletions
--- a/CONFIG.md
+++ b/CONFIG.md
@@ -298,6 +298,19 @@ Load balancing mode
 `random` selects the server at random
 `loc` selects the server with the least outstanding busy connections

+### checkout_failure_limit
+```
+path: pools.<pool_name>.checkout_failure_limit
+default: 0 (disabled)
+```
+
+Maximum number of checkout failures a client is allowed before it
+gets disconnected. This is needed to prevent persistent client/server
+imbalance in high availability setups where multiple PgCat instances are placed
+behind a single load balancer. If for any reason a client lands on a PgCat instance that has
+a large number of connected clients, it might get stuck in a perpetual checkout failure loop, especially
+in session mode.
+
 ### default_role
 ```
 path: pools.<pool_name>.default_role