
Bug Report: Connection Pool Full error locks up vttablet #15745

Closed

derekperkins opened this issue Apr 17, 2024 · 3 comments

@derekperkins (Member)

Overview of the Issue

We're randomly seeing vttablet lock up with errors like this:
PoolFull: skipped 60 log messages

Once it gets into that state, health checks start failing and query serving never recovers until I restart the whole tablet pod.

This is a complete guess, but I wonder if this only manifests in k8s: when vttablet reports unhealthy, that gets reported to the k8s Service, which stops routing traffic to the pod, and the health check then waits for new queries before it can become healthy again.

We haven't seen any connection pool errors in ~5 years (it's possible we didn't notice them in the logs, but there was never an outage). The only flag we removed when upgrading from v18 to v19 was --queryserver-config-query-cache-size 100. @deepthi pointed to @vmg's PR #14034, a major refactor of the connection pools, as the first place to look.

Maybe we're in an edge case because of our usage of message tables. Until v15, there was a separate flag/pool --queryserver-config-message-conn-pool-size for messaging connections, so maybe those aren't accounted for?

Reproduction Steps

vttablet flags

vttablet
  --topo_implementation="etcd2"
  --topo_global_server_address="etcd-global-client.vitess:2379"
  --topo_global_root=/vitess/global
  --logtostderr
  --port 15002
  --grpc_port 16002
  --service_map "grpc-queryservice,grpc-tabletmanager,grpc-updatestream"
  --grpc_prometheus
  --tablet_dir "tabletdata"
  --tablet-path "uscentral1-$(cat /vtdataroot/tabletdata/tablet-uid)"
  --tablet_hostname "$(hostname).vttablet"
  --init_keyspace "companies"
  --init_shard "0"
  --init_tablet_type "replica"
  --health_check_interval "5s"
  --mysqlctl_socket "/vtdataroot/mysqlctl.sock"
  --enable_replication_reporter

  --vreplication_max_time_to_retry_on_error 8760h

  --init_db_name_override "companies"

  --grpc_max_message_size 100000000

  --restore_from_backup
  --backup_storage_implementation=$VT_BACKUP_SERVICE
  --gcs_backup_storage_bucket=$VT_GCS_BACKUP_STORAGE_BUCKET
  --gcs_backup_storage_root=$VT_GCS_BACKUP_STORAGE_ROOT

  --app_pool_size="400"
  --dba_pool_size="10"
  --grpc_keepalive_time="10s"
  --grpc_server_keepalive_enforcement_policy_permit_without_stream="true"
  --queryserver-config-max-result-size="10000"
  --queryserver-config-message-postpone-cap="25"
  --queryserver-config-passthrough-dmls="true"
  --queryserver-config-pool-size="50"
  --queryserver-config-query-timeout=600s
  --queryserver-config-transaction-cap="300"

Binary Version

all components v19.0.3
Percona Server v8.0.36

Operating System and Environment details

GKE 1.29.3
Debian Bookworm

Log Fragments

E0417 17:26:06.069538 3335298 throttled.go:77] PoolFull: skipped 60 log messages
E0417 17:26:07.027101 3335298 tabletserver.go:1665] PoolFull: Code: RESOURCE_EXHAUSTED
resource pool timed out (CallerID: unsecure_grpc_client)
E0417 17:27:07.027359 3335298 throttled.go:77] PoolFull: skipped 29 log messages
@vmg (Collaborator) commented Apr 18, 2024

I did a small review here. I think the PoolFull is actually a red herring! The PoolFull identifier comes from a throttled logger that is only triggered on RESOURCE_EXHAUSTED errors. You can see the definition of these constants here:

ERDiskFull = ErrorCode(1021)
EROutOfMemory = ErrorCode(1037)
EROutOfSortMemory = ErrorCode(1038)
ERConCount = ErrorCode(1040)
EROutOfResources = ErrorCode(1041)
ERRecordFileFull = ErrorCode(1114)
ERHostIsBlocked = ErrorCode(1129)
ERCantCreateThread = ErrorCode(1135)
ERTooManyDelayedThreads = ErrorCode(1151)
ERNetPacketTooLarge = ErrorCode(1153)
ERTooManyUserConnections = ErrorCode(1203)
ERLockTableFull = ErrorCode(1206)
ERUserLimitReached = ErrorCode(1226)
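For context, here is a minimal sketch of the translation being described: a MySQL server error code from this list gets classified as RESOURCE_EXHAUSTED, which is what feeds the throttled PoolFull logger. The names and structure below are hypothetical, not the actual Vitess code.

package main

import "fmt"

// Hypothetical subset of the MySQL error codes listed above.
const (
	ERDiskFull               = 1021
	EROutOfMemory            = 1037
	ERConCount               = 1040
	ERTooManyUserConnections = 1203
)

// resourceExhaustedCodes is an illustrative lookup set; the real mapping
// lives in Vitess' mysql/vterrors packages.
var resourceExhaustedCodes = map[int]bool{
	ERDiskFull:               true,
	EROutOfMemory:            true,
	ERConCount:               true,
	ERTooManyUserConnections: true,
}

// isResourceExhausted sketches how a MySQL error packet's code could be
// classified as RESOURCE_EXHAUSTED before it reaches the throttled logger.
func isResourceExhausted(mysqlErrno int) bool {
	return resourceExhaustedCodes[mysqlErrno]
}

func main() {
	fmt.Println(isResourceExhausted(1040)) // true: ER_CON_COUNT_ERROR (too many connections)
	fmt.Println(isResourceExhausted(1146)) // false: ER_NO_SUCH_TABLE
}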

Interestingly, after the changes I introduced in the new Pool PR, none of the possible RESOURCE_EXHAUSTED errors comes from the connection pool. In fact, the new pool was designed to no longer be able to error out because of resource exhaustion (if the pool is too busy to serve a request, the request will cancel on its own because of its context deadline).
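As an illustration of that design, here is a minimal sketch of a pool get that blocks until a connection frees up or the caller's context deadline expires, with no pool-full error path. This is a toy pool, not the actual Vitess implementation.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// conn stands in for a pooled MySQL connection.
type conn struct{ id int }

// pool is a toy fixed-size pool backed by a buffered channel.
type pool struct{ conns chan *conn }

func newPool(size int) *pool {
	p := &pool{conns: make(chan *conn, size)}
	for i := 0; i < size; i++ {
		p.conns <- &conn{id: i}
	}
	return p
}

// get waits for a free connection. There is no "pool full" error:
// if the pool stays busy, the caller's context deadline cancels the wait.
func (p *pool) get(ctx context.Context) (*conn, error) {
	select {
	case c := <-p.conns:
		return c, nil
	case <-ctx.Done():
		return nil, ctx.Err() // context.DeadlineExceeded once the deadline passes
	}
}

func (p *pool) put(c *conn) { p.conns <- c }

func main() {
	p := newPool(1)
	c, _ := p.get(context.Background())

	// A second caller with a short deadline gives up on its own.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	if _, err := p.get(ctx); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("caller timed out waiting for a connection")
	}
	p.put(c)
}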

What you're seeing here is an error packet coming directly from MySQL, which we've translated and mis-reported as a PoolFull error, but it is not that. Could you do some debugging on your side based on this information?

Potential culprits:

  • ERConCount/ERTooManyUserConnections: do you have a connection limit in your mysqld instance that may be too tight?
  • ERDiskFull/EROutOfMemory/EROutOfSortMemory: is your mysqld at capacity?
  • ERNetPacketTooLarge: are some MySQL queries larger than usual?

Please let us know what you find. The issue is clearly arising from mysqld, so you should hopefully see it in the logs there!
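To make that debugging concrete, here is a small Go sketch that reads the mysqld limits and counters most relevant to the culprits listed above. The DSN is a placeholder; point it at the mysqld behind the affected vttablet.

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver registration
)

func main() {
	// Placeholder DSN: replace with your own credentials and host.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Connection limits vs. current usage (ERConCount / ERTooManyUserConnections).
	var maxConns, maxUserConns, threadsConnected int
	if err := db.QueryRow("SELECT @@global.max_connections, @@global.max_user_connections").
		Scan(&maxConns, &maxUserConns); err != nil {
		log.Fatal(err)
	}
	var name string
	if err := db.QueryRow("SHOW GLOBAL STATUS LIKE 'Threads_connected'").
		Scan(&name, &threadsConnected); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("max_connections=%d max_user_connections=%d threads_connected=%d\n",
		maxConns, maxUserConns, threadsConnected)

	// Packet size limit (ERNetPacketTooLarge).
	var maxAllowedPacket int
	if err := db.QueryRow("SELECT @@global.max_allowed_packet").Scan(&maxAllowedPacket); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("max_allowed_packet=%d bytes\n", maxAllowedPacket)
}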

@derekperkins (Member, Author)

This ended up being related to this vtorc bug that consumed too many connections

@brendar (Contributor) commented Oct 5, 2024

This issue may have been caused by this bug

I also want to clarify this statement for future readers, since it had us scratching our heads for a while:

Interestingly, after the changes I introduced in the new Pool PR, none of the possible RESOURCE_EXHAUSTED errors comes from the connection pool. In fact, the new pool was designed to no longer be able to error out because of resource exhaustion (if the pool is too busy to serve a request, the request will cancel on its own because of its context deadline).

The connection pool implementation will return a RESOURCE_EXHAUSTED error when the pool wait timeout is reached (v19.0.3 source), but not all connection pool instances have a wait timeout. The "connection pool" has no default wait timeout, but the "transaction pool" has a default wait timeout of 1s. These are configurable via the queryserver-config-*pool-timeout flags.
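For illustration only, this is how those timeouts might be set on vttablet. The flag names below are assumed from the queryserver-config-*pool-timeout pattern described above, so verify them against vttablet --help for your version before relying on them:

vttablet
  ...
  --queryserver-config-pool-timeout="30s"
  --queryserver-config-txpool-timeout="1s"

With a wait timeout set on the regular connection pool, a saturated pool surfaces as a RESOURCE_EXHAUSTED error after the timeout instead of callers simply waiting until their own context deadlines expire.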
