* Fix prepared statement not found when prepared stmt has error
* cleanup debug
* remove more debug msgs
* sure debugged this..
* version bump
* add rust tests
* Add golang test suite to reproduce issue with unnamed parameterized prepared statements
* Allow caching of unnamed prepared statements
* Passthrough describe on portals
* Remove unneeded kill
* Update Dockerfile.ci with golang
* Move out update of Dockerfiles to separate PR
* Initial commit
* Cleanup and add stats
* Use an arc instead of full clones to store the parse packets
* Use mutex instead
* fmt
* clippy
* fmt
* fix?
* fix?
* fmt
* typo
* Update docs
* Refactor custom protocol
* fmt
* move custom protocol handling to before parsing
* Support describe
* Add LRU for server side statement cache
* rename variable
* Refactoring
* Move docs
* Fix test
* fix
* Update tests
* trigger build
* Add more tests
* Reorder handling sync
* Support when a named describe is sent along with Parse (go pgx) and expecting results
* don't talk to client if not needed when client sends Parse
* fmt :(
* refactor tests
* nit
* Reduce hashing
* Reducing work done to decode describe and parse messages
* minor refactor
* Merge branch 'main' into zain/reimplment-prepared-statements-with-global-lru-cache
* Rewrite extended and prepared protocol message handling to better support mocking response packets and close
* An attempt to better handle if there are DDL changes that might break cached plans with ideas about how to further improve it
* fix
* Minor stats fixed and cleanup
* Cosmetic fixes (#64)
* Cosmetic fixes
* fix test
* Change server drop for statement cache error to a `deallocate all`
* Updated comments and added new idea for handling DDL changes impacting cached plans
* fix test?
* Revert test change
* trigger build, flakey test
* Avoid potential race conditions by changing get_or_insert to promote for pool LRU
* remove ps enabled variable on the server in favor of using an option
* Add close to the Extended Protocol buffer
---------
Co-authored-by: Lev Kokotov <levkk@users.noreply.github.com>
The TL;DR for the change is that we allow QueryRouter to set the active shard to None. This signals to the Pool::get method that we have no shard selected. The get method follows a no_shard_specified_behavior config to know how to route the query.
Original PR description
Ruby-pg library makes a startup query to SET client_encoding to ... if Encoding.default_internal value is set (Code). This query is troublesome because we cannot possibly attach a routing comment to it. PgCat, by default, will route that query to the default shard.
Everything is fine until shard 0 has issues, Clients will all be attempting to send this query to shard0 which increases the connection latency significantly for all clients, even those not interested in shard0
This PR introduces no_shard_specified_behavior that defines the behavior in case we have routing-by-comment enabled but we get a query without a comment. The allowed behaviors are
random: Picks a shard at random
random_healthy: Picks a shard at random favoring shards with the least number of recent connection/checkout errors
shard_<number>: e.g. shard_0, shard_4, etc. picks a specific shard, everytime
In order to achieve this, this PR introduces an error_count on the Address Object that tracks the number of errors since the last checkout and uses that metric to sort shards by error count before making a routing decision.
I didn't want to use address stats to avoid introducing a routing dependency on internal stats (We might do that in the future but I prefer to avoid this for the time being.
I also made changes to the test environment to replace Ruby's TOML reader library, It appears to be abandoned and does not support mixed arrays (which we use in the config toml), and it also does not play nicely with single-quoted regular expressions. I opted for using yj which is a CLI tool that can convert from toml to JSON and back. So I refactor the tests to use that library.
* User server parameters struct instead of server info bytesmut
* Refactor to use hashmap for all params and add server parameters to client
* Sync parameters on client server checkout
* minor refactor
* update client side parameters when changed
* Move the SET statement logic from the C packet to the S packet.
* trigger build
* revert validation changes
* remove comment
* Try fix
* Reset cleanup state after sync
* fix server version test
* Track application name through client life for stats
* Add tests
* minor refactoring
* fmt
* fix
* fmt
* Make infer role configurable and fix double parse bug
* Fix tests
* Enable infer_role_from query in toml for tests
* Fix test
* Add max length config, add logging for which application is failing to parse, and change config name
* fmt
* Update src/config.rs
---------
Co-authored-by: Lev Kokotov <levkk@users.noreply.github.com>
What is wrong
Stats reported by SHOW POOLS seem to be leaking. We see lingering cl_idle , cl_waiting, and similarly for sv_idle , sv_active. We confirmed that these are reporting issues not actual lingering clients.
This behavior is readily reproducible by running
while true; do
psql "postgres://sharding_user:sharding_user@localhost:6432/sharded_db" -c "SELECT 1" > /dev/null 2>&1 &
done
Why it happens
I wasn't able to get to figure our the reason for the bug but my best guess is that we have race conditions when updating pool-level stats. So even though individual update operations are atomic, we perform a check then update sequence which is not protected by a guard.
https://github.com/postgresml/pgcat/blob/main/src/stats/pool.rs#L174-L179
I am also suspecting that using Relaxed ordering might allow this behavior (I changed all operations to use Ordering::SeqCst but still got lingering clients)
How to fix
Since SHOW POOLS/SHOW SERVER/SHOW CLIENTS all show the current state of the proxy (as opposed to SHOW STATS which show aggregate values), this PR refactors SHOW POOLS to have it construct the results directly from SHOW SERVER and SHOW CLIENT datasets. This reduces the complexity of stat updates and eliminates the need for having locks when updating pool stats as we only care about updating individual client/server states.
This will change the semantics of maxwait, so instead of it holding the maxwait time ever encountered by a client (connected or disconnected), it will only consider connected clients which should be okay given PgCat tends to hold on to client connections more than Pgbouncer.
* Add a new exec_simple_query method
This adds a new `exec_simple_query` method so we can make 'out of band'
queries to servers that don't interfere with pools at all.
In order to reuse startup code for making these simple queries,
we need to set the stats (`Reporter`) optional, so using these
simple queries wont interfere with stats.
* Add auth passthough (auth_query)
Adds a feature that allows setting auth passthrough for md5 auth.
It adds 3 new (general and pool) config parameters:
- `auth_query`: An string containing a query that will be executed on boot
to obtain the hash of a given user. This query have to use a placeholder `$1`,
so pgcat can replace it with the user its trying to fetch the hash from.
- `auth_query_user`: The user to use for connecting to the server and executing the
auth_query.
- `auth_query_password`: The password to use for connecting to the server and executing the
auth_query.
The configuration can be done either on the general config (so pools share them) or in a per-pool basis.
The behavior is, at boot time, when validating server connections, a hash is fetched per server
and stored in the pool. When new server connections are created, and no cleartext password is specified,
the obtained hash is used for creating them, if the hash could not be obtained for whatever reason, it retries
it.
When client authentication is tried, it uses cleartext passwords if specified, it not, it checks whether
we have query_auth set up, if so, it tries to use the obtained hash for making client auth. If there is no
hash (we could not obtain one when validating the connection), a new fetch is tried.
Once we have a hash, we authenticate using it against whathever the client has sent us, if there is a failure
we refetch the hash and retry auth (so password changes can be done).
The idea with this 'retrial' mechanism is to make it fault tolerant, so if for whatever reason hash could not be
obtained during connection validation, or the password has change, we can still connect later.
* Add documentation for Auth passthrough
* Refactor stats to use atomics
When we are dealing with a high number of connections, generated
stats cannot be consumed fast enough by the stats collector loop.
This makes the stats subsystem inconsistent and a log of
warning messages are thrown due to unregistered server/clients.
This change refactors the stats subsystem so it uses atomics:
- Now counters are handled using U64 atomics
- Event system is dropped and averages are calculated using a loop
every 15 seconds.
- Now, instead of snapshots being generated ever second we keep track of servers/clients
that have registered. Each pool/server/client has its own instance of the counter and
makes changes directly, instead of adding an event that gets processed later.
* Manually mplement Hash/Eq in `config::Address` ignoring stats
* Add tests for client connection counters
* Allow connecting to dockerized dev pgcat from the host
* stats: Decrease cl_idle when idle socket disconnects
* Prepared stmt sharding
s
tests
* len check
* remove python test
* latest rust
* move that to debug for sure
* Add the actual tests
* latest image
* Update tests/ruby/sharding_spec.rb
Sometimes we want an admin to be able to ban a host for some time to route traffic away from that host for reasons like partial outages, replication lag, and scheduled maintenance.
We can achieve this today using a configuration update but a quicker approach is to send a control command to PgCat that bans the replica for some specified duration.
This command does not change the current banning rules like
Primaries cannot be banned
When all replicas are banned, all replicas are unbanned
* Adds configuration for logging connections and removes get_config from entrypoint
* typo
* rename connection config var and add to toml files
* update config log
* fmt
* Don't send discard all when state is changed in transaction
* Remove unnecessary clone
* spelling
* Move transaction check to SET command
* Add test for set command in transaction
* type
* Update comments
* Update comments
* use moves instead of clones for initial message
* don't make message mutable
* Update unwrap
* but i'm not a wrapper
* Add set local test
* change continue