What is wrong
Stats reported by SHOW POOLS seem to be leaking. We see lingering cl_idle , cl_waiting, and similarly for sv_idle , sv_active. We confirmed that these are reporting issues not actual lingering clients.
This behavior is readily reproducible by running
while true; do
psql "postgres://sharding_user:sharding_user@localhost:6432/sharded_db" -c "SELECT 1" > /dev/null 2>&1 &
done
Why it happens
I wasn't able to get to figure our the reason for the bug but my best guess is that we have race conditions when updating pool-level stats. So even though individual update operations are atomic, we perform a check then update sequence which is not protected by a guard.
https://github.com/postgresml/pgcat/blob/main/src/stats/pool.rs#L174-L179
I am also suspecting that using Relaxed ordering might allow this behavior (I changed all operations to use Ordering::SeqCst but still got lingering clients)
How to fix
Since SHOW POOLS/SHOW SERVER/SHOW CLIENTS all show the current state of the proxy (as opposed to SHOW STATS which show aggregate values), this PR refactors SHOW POOLS to have it construct the results directly from SHOW SERVER and SHOW CLIENT datasets. This reduces the complexity of stat updates and eliminates the need for having locks when updating pool stats as we only care about updating individual client/server states.
This will change the semantics of maxwait, so instead of it holding the maxwait time ever encountered by a client (connected or disconnected), it will only consider connected clients which should be okay given PgCat tends to hold on to client connections more than Pgbouncer.
* keep track of current stats and zero them after updating averages
* Try tests
* typo
* remove commented test stuff
* Avoid dividing by zero
* Fix test
* refactor, get rid of iterator. do it manually
* trigger build
* Fix
* Refactor stats to use atomics
When we are dealing with a high number of connections, generated
stats cannot be consumed fast enough by the stats collector loop.
This makes the stats subsystem inconsistent and a log of
warning messages are thrown due to unregistered server/clients.
This change refactors the stats subsystem so it uses atomics:
- Now counters are handled using U64 atomics
- Event system is dropped and averages are calculated using a loop
every 15 seconds.
- Now, instead of snapshots being generated ever second we keep track of servers/clients
that have registered. Each pool/server/client has its own instance of the counter and
makes changes directly, instead of adding an event that gets processed later.
* Manually mplement Hash/Eq in `config::Address` ignoring stats
* Add tests for client connection counters
* Allow connecting to dockerized dev pgcat from the host
* stats: Decrease cl_idle when idle socket disconnects
Sometimes we want an admin to be able to ban a host for some time to route traffic away from that host for reasons like partial outages, replication lag, and scheduled maintenance.
We can achieve this today using a configuration update but a quicker approach is to send a control command to PgCat that bans the replica for some specified duration.
This command does not change the current banning rules like
Primaries cannot be banned
When all replicas are banned, all replicas are unbanned
Connection to the CI databases is viewed by Postgres as coming from localhost. The pg_hba.conf file generated by the docker image uses trust for these connections, that's why we had no test coverage on SASL and md5 branches.
This PR fixes this issue. There was also an issue with under-reporting code coverage. This should be fixed now