In a high availability deployment of PgCat, it is possible that a client may land on a container of PgCat that is very busy with clients and as such the new client might be perpetually stuck in checkout failure loop because all connections are used by other clients. This is specially true in session mode pools with long-lived client connections (e.g. FDW connections).
One way to fix this issue is to close client connections after they encounter some number of checkout failure. This will force the client to hit the Network load balancer again, land on a different process/container, try to checkout a connection on the new process/container. if it fails, it is disconnected and tries with another one.
This mechanism is guaranteed to eventually land on a balanced state where all clients are able to find connections provided that the overall number of connections across all containers matches the number of clients.
I was able to reproduce this issue in a control environment and was able to show this PR is able to fix it.
Currently, `connect_timeout` sounds like it should be for connections to
the Postgres server. It's actually used for obtaining a connection from
the pool.
* Initial commit
* Cleanup and add stats
* Use an arc instead of full clones to store the parse packets
* Use mutex instead
* fmt
* clippy
* fmt
* fix?
* fix?
* fmt
* typo
* Update docs
* Refactor custom protocol
* fmt
* move custom protocol handling to before parsing
* Support describe
* Add LRU for server side statement cache
* rename variable
* Refactoring
* Move docs
* Fix test
* fix
* Update tests
* trigger build
* Add more tests
* Reorder handling sync
* Support when a named describe is sent along with Parse (go pgx) and expecting results
* don't talk to client if not needed when client sends Parse
* fmt :(
* refactor tests
* nit
* Reduce hashing
* Reducing work done to decode describe and parse messages
* minor refactor
* Merge branch 'main' into zain/reimplment-prepared-statements-with-global-lru-cache
* Rewrite extended and prepared protocol message handling to better support mocking response packets and close
* An attempt to better handle if there are DDL changes that might break cached plans with ideas about how to further improve it
* fix
* Minor stats fixed and cleanup
* Cosmetic fixes (#64)
* Cosmetic fixes
* fix test
* Change server drop for statement cache error to a `deallocate all`
* Updated comments and added new idea for handling DDL changes impacting cached plans
* fix test?
* Revert test change
* trigger build, flakey test
* Avoid potential race conditions by changing get_or_insert to promote for pool LRU
* remove ps enabled variable on the server in favor of using an option
* Add close to the Extended Protocol buffer
---------
Co-authored-by: Lev Kokotov <levkk@users.noreply.github.com>
* Add dns_cache so server addresses are cached and invalidated when DNS changes.
Adds a module to deal with dns_cache feature. It's
main struct is CachedResolver, which is a simple thread safe
hostname <-> Ips cache with the ability to refresh resolutions
every `dns_max_ttl` seconds. This way, a client can check whether its
ip address has changed.
* Allow reloading dns cached
* Add documentation for dns_cached
* Add a new exec_simple_query method
This adds a new `exec_simple_query` method so we can make 'out of band'
queries to servers that don't interfere with pools at all.
In order to reuse startup code for making these simple queries,
we need to set the stats (`Reporter`) optional, so using these
simple queries wont interfere with stats.
* Add auth passthough (auth_query)
Adds a feature that allows setting auth passthrough for md5 auth.
It adds 3 new (general and pool) config parameters:
- `auth_query`: An string containing a query that will be executed on boot
to obtain the hash of a given user. This query have to use a placeholder `$1`,
so pgcat can replace it with the user its trying to fetch the hash from.
- `auth_query_user`: The user to use for connecting to the server and executing the
auth_query.
- `auth_query_password`: The password to use for connecting to the server and executing the
auth_query.
The configuration can be done either on the general config (so pools share them) or in a per-pool basis.
The behavior is, at boot time, when validating server connections, a hash is fetched per server
and stored in the pool. When new server connections are created, and no cleartext password is specified,
the obtained hash is used for creating them, if the hash could not be obtained for whatever reason, it retries
it.
When client authentication is tried, it uses cleartext passwords if specified, it not, it checks whether
we have query_auth set up, if so, it tries to use the obtained hash for making client auth. If there is no
hash (we could not obtain one when validating the connection), a new fetch is tried.
Once we have a hash, we authenticate using it against whathever the client has sent us, if there is a failure
we refetch the hash and retry auth (so password changes can be done).
The idea with this 'retrial' mechanism is to make it fault tolerant, so if for whatever reason hash could not be
obtained during connection validation, or the password has change, we can still connect later.
* Add documentation for Auth passthrough
This PR adds a utility script that generates config documentation from pgcat.toml. Ideally, we'd want to generate the configs directly from config.rs where the actual defaults are set but this is a good start as we already had several undocumented config flags.