In a high-availability deployment of PgCat, a client may land on a PgCat container that is already very busy with other clients. The new client can then get stuck in a perpetual checkout failure loop because all server connections are in use by other clients. This is especially true for session mode pools with long-lived client connections (e.g. FDW connections).
One way to fix this is to close a client connection after it encounters a certain number of checkout failures. This forces the client to go through the network load balancer again, land on a different process/container, and try to check out a connection there. If that also fails, it is disconnected again and tries another one.
This mechanism is guaranteed to eventually reach a balanced state where all clients are able to get connections, provided that the total number of connections across all containers matches the number of clients.
I was able to reproduce this issue in a controlled environment and show that this PR fixes it.
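A minimal sketch of the disconnect-after-N-failures idea, assuming a per-client counter that is not reset by a successful checkout (resetting it would instead count consecutive failures). `CheckoutFailureTracker`, `CheckoutResult`, `ClientAction`, and the `limit` parameter are hypothetical names for illustration, not PgCat's actual code; `None` models the behavior being disabled.

```rust
/// Outcome of one attempt to check out a server connection from the pool.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckoutResult {
    Success,
    Failure,
}

/// What the client handler should do after a checkout attempt.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ClientAction {
    /// Keep serving the client on this process/container.
    Continue,
    /// Close the client connection so it reconnects through the load
    /// balancer and, hopefully, lands on a less busy process/container.
    Disconnect,
}

/// Hypothetical per-client counter of checkout failures.
pub struct CheckoutFailureTracker {
    /// Number of failures after which the client is disconnected;
    /// `None` disables the behavior entirely.
    limit: Option<u32>,
    /// Failures seen over the lifetime of this client connection
    /// (assumption: not reset by a successful checkout).
    failures: u32,
}

impl CheckoutFailureTracker {
    pub fn new(limit: Option<u32>) -> Self {
        Self { limit, failures: 0 }
    }

    /// Record one checkout attempt and decide whether to keep the client.
    pub fn record(&mut self, result: CheckoutResult) -> ClientAction {
        if result == CheckoutResult::Success {
            return ClientAction::Continue;
        }
        self.failures = self.failures.saturating_add(1);
        match self.limit {
            Some(limit) if self.failures >= limit => ClientAction::Disconnect,
            _ => ClientAction::Continue,
        }
    }
}
```

Disconnecting instead of retrying forever is what lets the load balancer redistribute stuck clients: each disconnect is another chance to land on a container with a free connection.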
Currently, the name `connect_timeout` suggests that it applies to
connections to the Postgres server, but it is actually used as the
timeout for obtaining a connection from the pool.
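To illustrate what this timeout actually bounds, here is a toy pool, assuming idle server connections sit in a `std::sync::mpsc` channel. `ToyPool`, `ServerConn`, `checkout`, and `checkin` are hypothetical names; PgCat's real pool is implemented differently. The duration passed to `checkout` plays the role described above: it limits how long a client waits for an idle connection from the pool, not how long a TCP connect to Postgres may take.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::time::Duration;

/// Stand-in for an established server connection (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub struct ServerConn(pub u32);

/// Toy pool: idle connections wait in a channel until a client takes one.
pub struct ToyPool {
    idle_tx: Sender<ServerConn>,
    idle_rx: Receiver<ServerConn>,
}

impl ToyPool {
    /// Pre-fill the pool with `size` already-connected servers.
    pub fn new(size: u32) -> Self {
        let (idle_tx, idle_rx) = channel();
        for id in 0..size {
            idle_tx.send(ServerConn(id)).expect("receiver is owned by the pool");
        }
        Self { idle_tx, idle_rx }
    }

    /// Wait at most `checkout_timeout` for an idle connection. This is the
    /// wait that `connect_timeout` bounds; no TCP connect happens here.
    pub fn checkout(&self, checkout_timeout: Duration) -> Option<ServerConn> {
        // recv_timeout returns Err on timeout (or if every sender is gone,
        // which cannot happen while the pool holds `idle_tx`).
        self.idle_rx.recv_timeout(checkout_timeout).ok()
    }

    /// Return a connection to the pool so another client can use it.
    pub fn checkin(&self, conn: ServerConn) {
        self.idle_tx.send(conn).expect("receiver is owned by the pool");
    }
}
```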
* Make user min_pool_size configurable
* Set user server_lifetime only if specified
* Increment chart version
* Use `default` instead of `or`
* Allow enabling server_tls
* statement_timeout default value
* Allow pulling password from existing secret
---------
Co-authored-by: Mostafa Abdelraouf <mostafa.mohmmed@gmail.com>
The build is failing with this error:
Downloading activerecord-3.2.14 revealed dependencies not in the API or the
lockfile (activesupport (= 3.2.14), activemodel (= 3.2.14), arel (~> 3.0.2),
tzinfo (~> 0.3.29)).
Either installing with `--full-index` or running `bundle update activerecord`
should fix the problem.
This started after ActiveSupport was updated.
This PR fixes that.
In #796, I noticed that the deb package was not built because an automation was missing.
With this PR, I add the missing automation.
I tested the workflow in my repo:
when starting the workflow manually: https://github.com/MrSerth/pgcat/actions/runs/10737879151/job/29780286094
when drafting a new release: https://github.com/MrSerth/pgcat/actions/runs/10737835796/job/29780146212
Obviously, both workflows failed since I cannot upload to the APT repo. However, the version substitution for the workflow is working correctly (as shown when collapsing the first line of the "Build and release package" step).
Previously, upgrading the deb package stopped the service but didn't re-enable it after a successful upgrade. This made upgrading the package more cumbersome, since a second step was required to restart the service. With this commit, the systemd service is automatically started when the default config file is present.
* Prometheus metrics updates:
* Add username label to deconflict metrics that would otherwise
have duplicate labels across different pools.
* Group metrics by name and only print HELP and TYPE once per
metric name.
* Sort labels for a deterministic output.
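A sketch of the grouping and sorting described above, rendering samples in the Prometheus text exposition format. `Sample`, `render`, and the metric name used in the usage check are hypothetical, and the sketch assumes all samples sharing a name also share the same help text and type. Keeping labels in a `BTreeMap` is one way to get a sorted, deterministic label order; it is not necessarily how PgCat's exporter does it.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

/// One sample of a metric (hypothetical type for illustration).
pub struct Sample {
    pub name: String,
    pub help: String,
    /// Prometheus metric type, e.g. "gauge" or "counter".
    pub kind: &'static str,
    /// A BTreeMap iterates in key order, so labels always render sorted.
    pub labels: BTreeMap<String, String>,
    pub value: f64,
}

/// Escape a label value as the text format requires.
fn escape(v: &str) -> String {
    v.replace('\\', "\\\\").replace('"', "\\\"").replace('\n', "\\n")
}

/// Render samples grouped by metric name, printing HELP and TYPE once per name.
pub fn render(samples: &[Sample]) -> String {
    // Group by name; the BTreeMap also fixes the order of metric names.
    let mut groups: BTreeMap<&str, Vec<&Sample>> = BTreeMap::new();
    for s in samples {
        groups.entry(s.name.as_str()).or_default().push(s);
    }

    let mut out = String::new();
    for (name, group) in groups.iter_mut() {
        // Order samples within a metric by their (sorted) label sets.
        group.sort_by(|a, b| a.labels.cmp(&b.labels));
        // Assumes every sample with this name shares help text and type.
        writeln!(out, "# HELP {} {}", name, group[0].help).unwrap();
        writeln!(out, "# TYPE {} {}", name, group[0].kind).unwrap();
        for s in group.iter() {
            let labels: Vec<String> = s
                .labels
                .iter()
                .map(|(k, v)| format!("{}=\"{}\"", k, escape(v)))
                .collect();
            if labels.is_empty() {
                writeln!(out, "{} {}", name, s.value).unwrap();
            } else {
                writeln!(out, "{}{{{}}} {}", name, labels.join(","), s.value).unwrap();
            }
        }
    }
    out
}
```

Including a label such as `username` in each label set is what keeps two pools' samples from colliding under one metric name, and sorting each group by its label set makes the output independent of the order in which samples were collected (for distinct label sets).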
---------
Co-authored-by: Curtis Myzie <curtis.myzie@gmail.com>
Co-authored-by: Towhid Khan
Currently the Python tests act as scripts. They write a lot of output to stdout, which makes it very hard to figure out where problems occurred. Also, running only a single test basically requires commenting out code.
This PR modifies the Python tests to use the pytest testing framework. pytest allows individual tests to be targeted from the command line without touching the source code. It also suppresses stdout by default, which makes the test output much easier to read, and after the tests run it prints a summary of what failed, what succeeded, etc.
Co-authored-by: CommanderKeynes <andrewjackson947@gmail.coma>
Co-authored-by: Andrew Jackson <andrewjackson2988@gmail.com>
Writing and iterating on integration tests is cumbersome: having to wait 10 minutes for the test suite to run just to see whether your test works is unacceptable.
In this PR, I added a detailed workflow for writing tests that should cut the feedback loop for modifying tests down to a few seconds.
It involves opening a shell into a long-lived container that has all the necessary setup and dependencies, and then running your desired tests directly there. I added a convenience script that bootstraps the environment and opens an interactive shell into the container, so you can run tests immediately in an environment that is more or less identical to the one running in CircleCI.
We were missing some labels on metrics generated by the Prometheus exporter, so I fixed that. There are still some gaps I want to address in the metrics we track, but this seems like a good start.
I also created a Grafana dashboard and exported it to JSON. It uses the same metric names as the Prometheus exporter.
The Docker CI build image is failing with this error:
249.5 Finished release [optimized] target(s) in 2m 49s
249.5 Installing /home/circleci/.cargo/bin/rustfilt
249.5 Installed package `rustfilt v0.2.1` (executable `rustfilt`)
249.5 error: failed to compile `cargo-binutils v0.3.6`, intermediate artifacts can be found at `/tmp/cargo-installrWENQG`
249.5
249.5 Caused by:
249.5 package `cargo-platform v0.1.8` cannot be built because it requires rustc 1.73 or newer, while the currently active rustc version is 1.67.1
249.5 Try re-running cargo install with `--locked`
249.5 Summary Successfully installed rustfilt! Failed to install cargo-binutils (see error(s) above).
249.5 error: some crates failed to install
So I am bumping the Rust version in the image to meet that requirement.
This commit adds the TCP_NODELAY option to the socket configuration in
the `configure_socket` function. Without this option, we observed
significant latency when executing SELECT queries with large responses.
Before the fix:
postgres=> SELECT repeat('a', 1); SELECT repeat('a', 8153);
Time: 1.368 ms
Time: 41.364 ms
After the fix:
postgres=> SELECT repeat('a', 1); SELECT repeat('a', 8153);
Time: 1.332 ms
Time: 1.528 ms
By setting TCP_NODELAY, we disable Nagle's algorithm and the delay it
introduces, which substantially improves response times for large results.
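A minimal sketch of the change using the standard library's `TcpStream::set_nodelay`. This `configure_socket` is a stand-in for illustration: PgCat's real function may operate on Tokio sockets (whose `TcpStream` exposes a `set_nodelay` method of the same name) and may set other options as well.

```rust
use std::io;
use std::net::TcpStream;

/// Hypothetical stand-in for the socket setup step: disable Nagle's
/// algorithm so small writes are sent immediately instead of being held
/// back until earlier data is acknowledged.
pub fn configure_socket(stream: &TcpStream) -> io::Result<()> {
    stream.set_nodelay(true)?;
    // Other socket options (keepalives, timeouts, ...) would be set here too.
    Ok(())
}
```

The option is per socket, so it has to be applied to every connection a proxy opens or accepts, on both the client-facing and the server-facing side.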
This problem was discussed in https://github.com/postgresml/pgcat/issues/616.