
Bug Report: VTTablet connection pool only expires one waiter per second #16896

Open
brendar opened this issue Oct 5, 2024 · 1 comment · May be fixed by #16897

brendar commented Oct 5, 2024

Overview of the Issue

On v19 we experienced a VTTablet effectively locking up, with attempts to start a new transaction timing out. The shard did not recover until we performed a failover to another tablet.

We were able to trace the cause to a bug in waitlist.expire:

func (wl *waitlist[C]) expire(force bool) {

expire is called once per second by a pool worker to remove any waiters whose context has timed out or been canceled:

// The expire worker takes care of removing from the waiter list any clients whose
// context has been cancelled.
pool.runWorker(pool.close, 1*time.Second, func(_ time.Time) bool {
pool.wait.expire(false)
return true
})

The function should wake up all waiters whose context has expired, but it only wakes up the first one it finds. This is because it calls wl.list.Remove(e) while traversing the list: removing an element clears its next pointer, so e.Next() returns nil and the loop terminates early.

for e := wl.list.Front(); e != nil; e = e.Next() {
if force || e.Value.ctx.Err() != nil {
wl.list.Remove(e)
expired = append(expired, e)
continue
}
}

A fix PR will be posted shortly.

Reproduction Steps

1. Start a local cluster

./101_initial_cluster.sh

2. Exhaust the transaction pool

for run in {1..100}; do time mysql -e "begin; select $run, sleep(10) ; rollback" & done

This starts 100 transactions in parallel, each sleeping for 10 seconds.
The transaction pool has a default wait timeout of 1 second (source), so once the pool is exhausted we would expect wait times of 1-2 seconds (the 1s wait timeout plus up to 1s for the expire worker to run) before each waiter fails with a ResourceExhausted error (returned here, message modified here).

What we see on v19 and on main
Instead, one transaction errors out with ResourceExhausted each second, with ever-increasing wait times. The last transaction completes after about 40 seconds.

ERROR 1203 (42000) at line 1: target: commerce.0.primary: vttablet: rpc error: code = ResourceExhausted desc = transaction pool connection limit exceeded (CallerID: userData1)

real	0m1.105s
user	0m0.004s
sys	0m0.005s
ERROR 1203 (42000) at line 1: target: commerce.0.primary: vttablet: rpc error: code = ResourceExhausted desc = transaction pool connection limit exceeded (CallerID: userData1)

real	0m2.096s
user	0m0.004s
sys	0m0.003s
ERROR 1203 (42000) at line 1: target: commerce.0.primary: vttablet: rpc error: code = ResourceExhausted desc = transaction pool connection limit exceeded (CallerID: userData1)

real	0m3.082s
user	0m0.004s
sys	0m0.003s
ERROR 1203 (42000) at line 1: target: commerce.0.primary: vttablet: rpc error: code = ResourceExhausted desc = transaction pool connection limit exceeded (CallerID: userData1)

real	0m4.075s
user	0m0.004s
sys	0m0.003s
ERROR 1203 (42000) at line 1: target: commerce.0.primary: vttablet: rpc error: code = ResourceExhausted desc = transaction pool connection limit exceeded (CallerID: userData1)

real	0m5.070s
user	0m0.005s
sys	0m0.003s

...

Binary Version

impacts v19+

Operating System and Environment details

n/a

Log Fragments

No response


brendar commented Oct 5, 2024

Fix is up here: #16897

@rohit-nayak-ps rohit-nayak-ps added Component: Query Serving and removed Needs Triage This issue needs to be correctly labelled and triaged labels Oct 5, 2024