* Refactor stats to use atomics
When we are dealing with a high number of connections, generated
stats cannot be consumed fast enough by the stats collector loop.
This makes the stats subsystem inconsistent, and a lot of
warning messages are logged due to unregistered servers/clients.
This change refactors the stats subsystem so it uses atomics:
- Now counters are handled using U64 atomics
- Event system is dropped and averages are calculated using a loop
every 15 seconds.
- Now, instead of snapshots being generated every second, we keep track of servers/clients
that have registered. Each pool/server/client has its own instance of the counter and
makes changes directly, instead of adding an event that gets processed later.
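A minimal sketch of the counter approach described above, assuming hypothetical names (`ClientStats`, `record_query`, `average_per_second`, `simulate`) that are not taken from the PgCat source. Each holder updates its own `AtomicU64` counters directly, and a periodic loop can derive averages from the difference between two totals:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical per-client counters; field names are illustrative only.
#[derive(Default)]
pub struct ClientStats {
    pub query_count: AtomicU64,
    pub bytes_sent: AtomicU64,
}

impl ClientStats {
    /// Update the shared counters directly instead of queueing an event.
    pub fn record_query(&self, bytes: u64) {
        self.query_count.fetch_add(1, Ordering::Relaxed);
        self.bytes_sent.fetch_add(bytes, Ordering::Relaxed);
    }
}

/// Average per second over one reporting window (15 seconds in the text above).
/// Assumes `window_secs` is non-zero.
pub fn average_per_second(current_total: u64, previous_total: u64, window_secs: u64) -> u64 {
    current_total.saturating_sub(previous_total) / window_secs
}

/// Spawn `workers` threads that each record `per_worker` queries on the same
/// counters, then return the total query count.
pub fn simulate(workers: u64, per_worker: u64) -> u64 {
    let stats = Arc::new(ClientStats::default());
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let stats = Arc::clone(&stats);
            thread::spawn(move || {
                for _ in 0..per_worker {
                    stats.record_query(10);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    stats.query_count.load(Ordering::Relaxed)
}
```

No update is lost under contention because `fetch_add` is atomic, so there is no collector loop that has to keep up with an event queue.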
* Manually implement Hash/Eq in `config::Address` ignoring stats
* Add tests for client connection counters
* Allow connecting to dockerized dev pgcat from the host
* stats: Decrease cl_idle when idle socket disconnects
* Prepared stmt sharding
* tests
* len check
* remove python test
* latest rust
* move that to debug for sure
* Add the actual tests
* latest image
* Update tests/ruby/sharding_spec.rb
This is an implementation of query mirroring in PgCat (outlined in #302).
In configs, we match mirror hosts with the servers handling the traffic. A mirror host will receive the same protocol messages as the main server it was matched with.
This is done by creating an async task for each mirror server. The task communicates with the main server through two channels, one for the protocol messages and one for the exit signal. The mirror server sends the protocol packets to the underlying PostgreSQL server. We receive from the underlying PostgreSQL server as soon as the data is available and immediately discard it. We use bb8 to manage the life cycle of the connection, not for pooling, since each mirror server handler is more or less single-threaded.
We don't have any connection pooling in the mirrors. Matching each mirror connection to an actual server connection guarantees that we will not have more connections to any of the mirrors than the parent pool would allow.
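The two-channel design described above could be sketched roughly as follows. This is not the PR's code: it swaps the async task for a plain thread and `std::sync::mpsc` channels so it runs without extra crates, and `MirrorHandle`, `spawn_mirror`, and the `upstream` sender (standing in for the mirror's PostgreSQL connection) are hypothetical names.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Handle kept by the main server connection (hypothetical type).
pub struct MirrorHandle {
    /// Channel for protocol messages to replay on the mirror.
    pub messages: Sender<Vec<u8>>,
    /// Channel for the exit signal.
    pub exit: Sender<()>,
    /// Worker; returns how many packets it forwarded.
    pub task: JoinHandle<usize>,
}

/// Spawn one mirror worker. `upstream` stands in for the connection to the
/// mirror's PostgreSQL server; reading and discarding replies is not modeled.
pub fn spawn_mirror(upstream: Sender<Vec<u8>>) -> MirrorHandle {
    let (msg_tx, msg_rx) = mpsc::channel::<Vec<u8>>();
    let (exit_tx, exit_rx) = mpsc::channel::<()>();
    let task = thread::spawn(move || {
        let mut forwarded = 0usize;
        loop {
            // Stop as soon as the main server signals exit.
            if exit_rx.try_recv().is_ok() {
                break;
            }
            match msg_rx.recv_timeout(Duration::from_millis(10)) {
                Ok(packet) => {
                    if upstream.send(packet).is_err() {
                        break; // the mirror connection went away
                    }
                    forwarded += 1;
                }
                Err(RecvTimeoutError::Timeout) => continue,
                Err(RecvTimeoutError::Disconnected) => break,
            }
        }
        forwarded
    });
    MirrorHandle { messages: msg_tx, exit: exit_tx, task }
}
```

Because there is exactly one worker per main server connection, the number of mirror connections is bounded by the parent pool size, as noted above.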
Sometimes we want an admin to be able to ban a host for some time to route traffic away from that host for reasons like partial outages, replication lag, and scheduled maintenance.
We can achieve this today using a configuration update but a quicker approach is to send a control command to PgCat that bans the replica for some specified duration.
This command does not change the current banning rules, such as:
- Primaries cannot be banned
- When all replicas are banned, all replicas are unbanned
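A rough sketch of how a timed ban could honor the two rules above. Everything here is an assumption made for illustration: the `BanList` type, the `ban` and `is_banned` methods, and the choice to apply the "all replicas banned" rule at ban time are hypothetical and not taken from PgCat.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Role {
    Primary,
    Replica,
}

/// Hypothetical ban list keyed by host name.
pub struct BanList {
    roles: HashMap<String, Role>,
    banned_until: HashMap<String, Instant>,
}

impl BanList {
    pub fn new(hosts: &[(&str, Role)]) -> Self {
        BanList {
            roles: hosts.iter().map(|(h, r)| (h.to_string(), *r)).collect(),
            banned_until: HashMap::new(),
        }
    }

    /// Ban `host` for `duration`. Returns false for primaries and unknown hosts.
    pub fn ban(&mut self, host: &str, duration: Duration, now: Instant) -> bool {
        // Rule 1: primaries cannot be banned.
        if self.roles.get(host) != Some(&Role::Replica) {
            return false;
        }
        self.banned_until.insert(host.to_string(), now + duration);
        // Rule 2: when all replicas are banned, all replicas are unbanned.
        let replicas = self.roles.values().filter(|r| **r == Role::Replica).count();
        let banned = self.banned_until.values().filter(|until| **until > now).count();
        if banned == replicas {
            self.banned_until.clear();
        }
        true
    }

    /// A ban expires on its own once `now` passes the stored deadline.
    pub fn is_banned(&self, host: &str, now: Instant) -> bool {
        self.banned_until.get(host).map_or(false, |until| *until > now)
    }
}
```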
* Adds SHUTDOWN command to PgCat as alternate option to sending SIGINT
* Check if we're already in SHUTDOWN sequence
* Send signal directly from shutdown instead of using channel
* Add tests
* trigger build
* Lowercase response and boolean change
* Update tests
* Fix tests
* typo
We identified a bug where RELOAD fails to update the pools.
To reproduce, start at some config state, modify that state a bit, reload, revert the configs back to the original state, and reload again. The last reload will fail to update the pool because PgCat "thinks" the pool state didn't change.
This is because we use a HashSet to keep track of config hashes but we never remove values from it.
Say we start with State A, then modify the pool configs to State B and reload. Now the POOL_HASHES struct has State A and State B. Attempting to go back to State A will hit the HashSet, which PgCat interprets as "Configs are the same, no need to reload pools".
We fix this by attaching a config_hash value to the ConnectionPool object and calculating that value when we create the pool. This eliminates the need for a global variable. One shortcoming here is that changing any config under one user in the pool will trigger a reload for the entire pool (which is fine I think).
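The A -> B -> A scenario can be reproduced with a small sketch. `PoolConfig`, `GlobalHashes`, `needs_reload`, and `reload` are hypothetical names used for illustration; only `config_hash`, `ConnectionPool`, and the grow-only hash set idea come from the description above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical, heavily reduced pool configuration.
#[derive(Hash, Clone)]
pub struct PoolConfig {
    pub name: String,
    pub pool_size: u32,
}

pub fn config_hash(config: &PoolConfig) -> u64 {
    let mut hasher = DefaultHasher::new();
    config.hash(&mut hasher);
    hasher.finish()
}

/// Old behaviour: a global set of hashes that only ever grows.
pub struct GlobalHashes {
    seen: HashSet<u64>,
}

impl GlobalHashes {
    pub fn new(initial: &PoolConfig) -> Self {
        let mut seen = HashSet::new();
        seen.insert(config_hash(initial));
        GlobalHashes { seen }
    }

    /// Returns true if the pool would be rebuilt. Any hash seen before is
    /// treated as "unchanged", which is the bug.
    pub fn needs_reload(&mut self, new_config: &PoolConfig) -> bool {
        self.seen.insert(config_hash(new_config))
    }
}

/// Fixed behaviour: the pool remembers the hash of the config it was built from.
pub struct ConnectionPool {
    pub config_hash: u64,
}

impl ConnectionPool {
    pub fn new(config: &PoolConfig) -> Self {
        ConnectionPool { config_hash: config_hash(config) }
    }

    /// Returns true if the pool was rebuilt (rebuilding itself is elided).
    pub fn reload(&mut self, new_config: &PoolConfig) -> bool {
        let new_hash = config_hash(new_config);
        if new_hash == self.config_hash {
            return false;
        }
        self.config_hash = new_hash;
        true
    }
}
```

With the global set, going back to State A is skipped because its hash is still present. With the per-pool hash, only the current state is compared, so the revert triggers a reload.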
Connections to the CI databases are viewed by Postgres as coming from localhost. The pg_hba.conf file generated by the docker image uses trust for these connections, which is why we had no test coverage on the SASL and md5 branches.
This PR fixes this issue. There was also an issue with under-reporting code coverage, which should be fixed now.
Code coverage logic was missing coverage from rust tests. This is now fixed.
Also, we weren't reaping spawned PgCat processes correctly which left zombie processes.
We have encountered a case where PgCat pools were stuck following a database incident. Our best understanding at this point is that the PgCat -> Postgres connections died silently and because Tokio defaults to disabling keepalives, connections in the pool were marked as busy forever. Only when we deployed PgCat did we see recovery.
This PR introduces tcp_keepalives to PgCat. The defaults are:
keepalives_idle: 5 # seconds
keepalives_interval: 5 # seconds
keepalives_count: 5 # a count
With these settings, the death of an idle connection can be detected within 30 seconds (5 seconds of idle time plus 5 probes sent 5 seconds apart). Please note that the connection can remain idle forever (from an application perspective) as long as the keepalive packets are flowing. Disconnection will only occur if the other end is not acknowledging keepalive packets (keepalive packet acks are handled by the OS, so the application does not need to do anything). I plan to add tcp_user_timeout in a follow-up PR.
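The 30-second bound follows from the three defaults: the first probe goes out after `keepalives_idle`, and the connection is declared dead once `keepalives_count` probes spaced `keepalives_interval` apart go unanswered. A small sketch of that arithmetic, where `KeepaliveSettings` and `worst_case_detection` are hypothetical names:

```rust
use std::time::Duration;

/// Hypothetical holder for the three keepalive settings listed above.
pub struct KeepaliveSettings {
    pub idle: Duration,
    pub interval: Duration,
    pub count: u32,
}

/// Upper bound on the time between a silent connection death and its detection:
/// idle time before the first probe, plus `count` probes `interval` apart.
pub fn worst_case_detection(settings: &KeepaliveSettings) -> Duration {
    settings.idle + settings.interval * settings.count
}

/// The defaults introduced by this PR.
pub fn default_settings() -> KeepaliveSettings {
    KeepaliveSettings {
        idle: Duration::from_secs(5),
        interval: Duration::from_secs(5),
        count: 5,
    }
}
```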
Least outstanding connections (LoC) load balancing can improve the load distribution between instances, but for PgCat it may also improve the handling of slow replicas that don't go completely down. With LoC, traffic will quickly move away from the slow replica without waiting for the replica to be banned.
If all replicas slow down equally (due to a bad query that is hitting all replicas), the algorithm will degenerate to Random Load Balancing (which is what we had in PgCat until today).
This may also allow PgCat to accommodate pools with differently-sized replicas.
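The selection rule can be sketched as below. `Replica` and `pick_least_outstanding` are hypothetical names, and the sketch assumes the candidate list has already been shuffled by the caller, so ties (all replicas equally busy) fall back to whatever random order the caller produced.

```rust
/// Hypothetical view of a replica: its name and the number of connections
/// currently checked out against it.
pub struct Replica {
    pub name: String,
    pub outstanding: usize,
}

/// Return the index of the replica with the fewest outstanding connections.
/// On ties the first candidate wins, so a pre-shuffled slice degenerates to
/// random selection when every replica is equally loaded.
pub fn pick_least_outstanding(replicas: &[Replica]) -> Option<usize> {
    replicas
        .iter()
        .enumerate()
        .min_by_key(|(_, replica)| replica.outstanding)
        .map(|(index, _)| index)
}
```

A slow replica accumulates outstanding connections because its queries take longer to finish, so it is picked less often without needing to be banned first.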
* Don't send discard all when state is changed in transaction
* Remove unnecessary clone
* spelling
* Move transaction check to SET command
* Add test for set command in transaction
* type
* Update comments
* Update comments
* use moves instead of clones for initial message
* don't make message mutable
* Update unwrap
* but i'm not a wrapper
* Add set local test
* change continue
* Send DISCARD ALL even if client is not in transaction
* fmt
* Added tests + avoided sending extra discard all
* Adds set name logic to beginning of handle client
* fmt
* refactor dead code handling
* Refactor reading command tag
* remove unnecessary trim
* Removing debugging statement
* typo
* typo
* documentation
* edit text
* un-unwrap
* run ci
* run ci
Co-authored-by: Zain Kabani <zain.kabani@instacart.com>
* Validates pgcat is closed after shutdown python tests
* Fix pgrep logic
* Moves sigterm step to after cleanup to decouple
* Replace subprocess with os.system for running pgcat
* Initial commit for graceful shutdown
* fmt
* Add .vscode to gitignore
* Updates shutdown logic to use channels
* fmt
* fmt
* Adds shutdown timeout
* Fmt and updates tomls
* Updates readme
* fmt and updates log levels
* Update python tests to test shutdown
* merge changes
* Rename listener rx and update bash to be in line with master
* Update python test bash script ordering
* Adds error response message before shutdown
* Add details on shutdown event loop
* Fixes response length for error
* Adds handler for sigterm
* Uses ready for query function and fixes number of bytes
* fmt
* Fix Dev env
* Update tests/sharding/query_routing_setup.sql
* Update tests/sharding/query_routing_setup.sql
* bring pgcat.toml on ci and local dev to parity
* more parity
* pool names
* pool names
* less diff
* fix tests
* fmt
* add other user to setup
Co-authored-by: Lev Kokotov <levkk@users.noreply.github.com>
* Add support for multi-database / multi-user pools
* Nothing
* cargo fmt
* CI
* remove test users
* rename pool
* Update tests to use admin user/pass
* more fixes
* Revert bad change
* Use PGDATABASE env var
* send server info in case of admin
* Support reloading the entire config (including sharding logic) without restart.
* Fix incorrect error reporting when the shard is set incorrectly via the SET SHARD TO command: after selecting a wrong shard, the connection kept reporting fatal errors (#80).
* Fix total_received and avg_recv admin database statistics.
* Enabling the query parser by default.
* More tests.
* Refactor query routing into its own module
* comments; tests; dead code
* error message
* safer startup
* hm
* dont have to be public
* wow
* fix ci
* ok
* nl
* no more silent errors