Add checkout_failure_limit config/feature (#911)

In a high availability deployment of PgCat, it is possible that a client may land on a container of PgCat that is very busy with clients and as such the new client might be perpetually stuck in checkout failure loop because all connections are used by other clients. This is specially true in session mode pools with long-lived client connections (e.g. FDW connections).

One way to fix this issue is to close client connections after they encounter some number of checkout failure. This will force the client to hit the Network load balancer again, land on a different process/container, try to checkout a connection on the new process/container. if it fails, it is disconnected and tries with another one.

This mechanism is guaranteed to eventually land on a balanced state where all clients are able to find connections provided that the overall number of connections across all containers matches the number of clients.

I was able to reproduce this issue in a control environment and was able to show this PR is able to fix it.
This commit is contained in:
Mostafa
2025-02-27 13:17:00 -06:00
committed by GitHub
parent f8e2fcd0ed
commit 3349cecc18
6 changed files with 162 additions and 1 deletions

View File

@@ -298,6 +298,19 @@ Load balancing mode
`random` selects the server at random
`loc` selects the server with the least outstanding busy connections
### checkout_failure_limit
```
path: pools.<pool_name>.checkout_failure_limit
default: 0 (disabled)
```
`Maximum number of checkout failures a client is allowed before it
gets disconnected. This is needed to prevent persistent client/server
imbalance in high availability setups where multiple PgCat instances are placed
behind a single load balancer. If for any reason a client lands on a PgCat instance that has
a large number of connected clients, it might get stuck in perpetual checkout failure loop especially
in session mode
`
### default_role
```
path: pools.<pool_name>.default_role