
RFC: improve efficiency of tablet filtering in go/vt/discovery #16761

Open
timvaillancourt opened this issue Sep 11, 2024 · 8 comments

@timvaillancourt
Contributor

timvaillancourt commented Sep 11, 2024

RFC Description

Today discovery.Healthcheck (in go/vt/discovery, used mainly by vtgate) supports filtering the tablets it watches by:

  1. --keyspaces_to_watch (very common)
  2. --tablet-filter on hostname
  3. --tablet-filter-tags on the tablet's Tags map[string]string
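Each of these filters boils down to a predicate with an IsIncluded(tablet) method that is evaluated per tablet on the vtgate side. A minimal sketch of that shape, using hypothetical filter types (only the IsIncluded contract is confirmed by the code quoted later in this thread; the real constructors in go/vt/discovery may differ):

package filtersketch

import (
	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
)

// tabletFilter mirrors the predicate the topology watcher consults for each
// tablet it has already fetched (see tw.tabletFilter.IsIncluded below).
type tabletFilter interface {
	IsIncluded(tablet *topodatapb.Tablet) bool
}

// filterByKeyspace is a hypothetical stand-in for --keyspaces_to_watch.
type filterByKeyspace struct {
	keyspaces map[string]bool
}

func (f *filterByKeyspace) IsIncluded(t *topodatapb.Tablet) bool {
	return f.keyspaces[t.Keyspace]
}

// filterByTags is a hypothetical stand-in for --tablet-filter-tags: every
// requested tag key/value must match the tablet's Tags map.
type filterByTags struct {
	tags map[string]string
}

func (f *filterByTags) IsIncluded(t *topodatapb.Tablet) bool {
	for k, v := range f.tags {
		if t.Tags[k] != v {
			return false
		}
	}
	return true
}

// included reports whether a tablet passes all configured filters.
func included(t *topodatapb.Tablet, filters []tabletFilter) bool {
	for _, f := range filters {
		if !f.IsIncluded(t) {
			return false
		}
	}
	return true
}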

Behind the scenes, this filtering happens in the tablet watcher's loadTablets(), but not very efficiently: first, all tablets in the cell are fetched from the topo unconditionally, then the optional filtering is applied. At times this filtering discards a significant portion of the topo KVs we just fetched. More on "why" later.

On clusters with 1000s of tablets this becomes a scalability problem for the topology store that has to handle topo Get calls for all of the tablets fetched. In an extreme case, the txthrottler (which uses discovery.Healthcheck to stream tablet stats) opens 1 x topology watcher per cell just to find tablets in its local shard. Let's say we have a 3 x cell deployment with 1000 tablets per cell and txthrottler is running in a 3-tablet shard: this means txthrottler will be reading 3000 topo KVs frequently just to find its 2 other members, and this problem grows with the number of tablets. In our production the problem is significantly larger.

Now why does the tablet watcher fetch EVERY tablet from the cell? Today it kind of has to 🤷. Using a Consul topo as an example, tablets are stored by alias in paths named /tablets/<tablet alias>/, and there is no efficient way to grab just 1 x keyspace or shard - you have to read everything.

There is a way to mitigate this inefficiency (but not resolve it): --tablet_refresh_known_tablets=false. This causes vtgate to keep the tablet records it has read forever, which has its own drawbacks and doesn't resolve the initial inefficient read of tablets for the entire cell.

This issue is an RFC/feature request for a more efficient way to fetch topo records for a single shard and/or keyspace. Unfortunately, improving the situation likely means a change to the layout of the topo.

Some early ideas:

  • "Pointer"/alias KVs - add KVs like /keyspaces/<keyspace>/<shard>/<tablet> that simply "point" to the actual /tablet/<alias> record, kind of like an index
    • This doesn't seem to be a built in feature of most topo stores so it would need to be done at a KV-level.
  • Tablet records are stored in per-keyspace/shard paths. But this would come at the cost of more ListDir operations
  • <your idea here>

🙇

Use Case(s)

Large Vitess deployments that use filtering (likely --keyspaces_to_watch) where the rate of topo Gets is a risk/concern

@timvaillancourt timvaillancourt added Type: Feature Request Needs Triage This issue needs to be correctly labelled and triaged Type: Enhancement Logical improvement (somewhere between a bug and feature) Type: RFC Request For Comment Component: VTTablet Component: VTGate Component: Throttler and removed Needs Triage This issue needs to be correctly labelled and triaged labels Sep 11, 2024
@timvaillancourt timvaillancourt self-assigned this Sep 11, 2024
@timvaillancourt
Contributor Author

timvaillancourt commented Sep 12, 2024

"Pointer"/alias KVs - add KVs like /keyspaces/// that simply "point" to the actual /tablet/ record, kind of like an index

To expand on this point a bit, most of the supported KV stores support transactional KV writes. Using this functionality, the index could be implemented as follows:

  1. Writes to tablet topo records happen inside a transaction that:
    • Updates the typical by-alias KV (in consul /tablets/<tablet alias>)
    • Adds/updates an empty KV to the "index" path, such as /keyspaces/<keyspace>/<shard>/<tablet alias>
  2. Optionally-indexed reads for a specific shard would:
    • Do a list dir operation on the "index" keyspace/shard path, returning a list of tablet aliases
    • Do a get for each tablet alias
  3. Optionally-indexed reads for a whole keyspace would:
    • Do a list dir operation on the "index" keyspace path, returning a list of shards
    • Do a list dir operation on each "index" shard path, returning a list of tablet aliases
    • Do a get for each tablet alias

So in the worst case (Step 3), 2 x list dir calls are added, with the expectation that this will be more beneficial than fetching every tablet record in the cell.
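A rough sketch of what that could look like against a hypothetical KV interface (kvStore, Txn, ListDir and Get below are placeholders, not existing Vitess topo APIs; the index path layout follows the steps above):

// Hypothetical sketch of the "pointer"/index-KV idea above. None of these
// types or functions exist in Vitess today; they only illustrate the layout.
package indexsketch

import (
	"context"
	"fmt"
	"path"
)

// kvStore stands in for a topo backend that supports transactional writes.
type kvStore interface {
	Txn(ctx context.Context, ops []kvOp) error // atomic multi-key write
	ListDir(ctx context.Context, dirPath string) ([]string, error)
	Get(ctx context.Context, key string) ([]byte, error)
}

type kvOp struct {
	Key   string
	Value []byte
}

// writeTablet updates the usual by-alias record and an empty "index" KV
// under the keyspace/shard path in a single transaction (Step 1).
func writeTablet(ctx context.Context, kv kvStore, keyspace, shard, alias string, record []byte) error {
	return kv.Txn(ctx, []kvOp{
		{Key: path.Join("tablets", alias, "Tablet"), Value: record},       // by-alias record
		{Key: path.Join("keyspaces", keyspace, shard, alias), Value: nil}, // index entry, value unused
	})
}

// tabletsForShard resolves the index to a list of aliases, then fetches only
// those records: 1 x list dir + N x get for the shard, instead of a get for
// every tablet in the cell (Step 2).
func tabletsForShard(ctx context.Context, kv kvStore, keyspace, shard string) (map[string][]byte, error) {
	aliases, err := kv.ListDir(ctx, path.Join("keyspaces", keyspace, shard))
	if err != nil {
		return nil, err
	}
	tablets := make(map[string][]byte, len(aliases))
	for _, alias := range aliases {
		rec, err := kv.Get(ctx, path.Join("tablets", alias, "Tablet"))
		if err != nil {
			return nil, fmt.Errorf("get tablet %s: %w", alias, err)
		}
		tablets[alias] = rec
	}
	return tablets, nil
}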

@mattlord
Contributor

mattlord commented Sep 12, 2024

I would note that this kind of topo work is often much more challenging than it would seem at first glance. I would recommend you have a PoC (proof of concept) PR to go along with any suggested implementations as this gives you a chance to flesh things out and see if it's possible, if it's actually any better, what challenges there are, any trade-offs, etc.

@mattlord
Contributor

mattlord commented Sep 12, 2024

On clusters with 1000s of tablets this becomes a scalability problem for the topology store that has to handle topo Get calls for all of the tablets fetched.

I would also note that this ONLY applies to ZooKeeper as it does not support scans. Getting the tablets in one prefix scan — for the topo server implementations that support scans — was added in #9560

So it's probably also worth stepping back and clarifying the problem here. Why is doing 1 prefix scan per cell on consul here so problematic? IMO this is something we live with when using a KV store and trying to map custom index-like structures that we then have to manage ourselves on top of it is going to be problematic.

@rvrangel
Contributor

@mattlord I think the issue is not with the prefix scan per cell, but the fact that after that operation, we do a single GET for each tablet from the topo, even though we are getting tablets that are not part of the shard the current tablet is in.

if you consider 10k tablets, each doing 10k GETs per minute, that totals 100M RPCs per minute against the topo (plus the scans)

@timvaillancourt
Contributor Author

timvaillancourt commented Sep 12, 2024

we do a single GET for each tablet from the topo, even though we are getting tablets that are not part of the shard the current tablet is in.

@rvrangel / @mattlord yes, this is the concern. Another example: let's say we have --keyspaces_to_watch foo and 100s of keyspaces

Today:

  1. .loadTablets() is called
  2. .getTablets() is called, which wraps tw.topoServer.GetTabletsByCell(...)
  3. That returns every tablet in the cell - remember we still just want 1 x keyspace, yet this returns all of them
  4. Then filtering using the 3 x methods mentioned earlier occurs using .IsIncluded(...)

So to put it more simply, filtering of tablets happens after topo gets, meaning we fetch all tablets from the topo when we don't need to
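To make the ordering concrete, here is a simplified paraphrase of that flow (error handling, stats and the refreshKnownTablets logic are omitted, and signatures may not match the current code exactly):

package flowsketch

import (
	"context"

	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
	"vitess.io/vitess/go/vt/topo"
	"vitess.io/vitess/go/vt/topo/topoproto"
)

// tabletFilter is the same IsIncluded shape sketched earlier in this thread.
type tabletFilter interface {
	IsIncluded(tablet *topodatapb.Tablet) bool
}

// loadTabletsSketch paraphrases the steps above: fetch every tablet in the
// cell first (steps 1-3), then filter in memory (step 4).
func loadTabletsSketch(ctx context.Context, ts *topo.Server, cell string, filter tabletFilter) (map[string]*topodatapb.Tablet, error) {
	// GetTabletsByCell returns EVERY tablet in the cell, regardless of
	// --keyspaces_to_watch / --tablet-filter / --tablet-filter-tags.
	tabletInfos, err := ts.GetTabletsByCell(ctx, cell, nil)
	if err != nil {
		return nil, err
	}

	// Filtering only happens afterwards, in vtgate memory.
	watched := make(map[string]*topodatapb.Tablet)
	for _, tInfo := range tabletInfos {
		if filter != nil && !filter.IsIncluded(tInfo.Tablet) {
			continue // already read from the topo, now discarded
		}
		watched[topoproto.TabletAliasString(tInfo.Alias)] = tInfo.Tablet
	}
	return watched, nil
}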

@mattlord
Contributor

mattlord commented Sep 12, 2024

@rvrangel I don't see where that happens here:

// First get the list of all tablets.
tabletInfos, err := tw.getTablets()
topologyWatcherOperations.Add(topologyWatcherOpListTablets, 1)
if err != nil {
	topologyWatcherErrors.Add(topologyWatcherOpListTablets, 1)
	// If we get a partial result error, we just log it and process the tablets that we did manage to fetch.
	if topo.IsErrType(err, topo.PartialResult) {
		log.Errorf("received partial result from getTablets for cell %v: %v", tw.cell, err)
		partialResult = true
	} else { // For all other errors, just return.
		log.Errorf("error getting tablets for cell: %v: %v", tw.cell, err)
		return
	}
}
// Accumulate a list of all known alias strings to use later
// when sorting.
tabletAliasStrs := make([]string, 0, len(tabletInfos))
tw.mu.Lock()
defer tw.mu.Unlock()
for _, tInfo := range tabletInfos {
	aliasStr := topoproto.TabletAliasString(tInfo.Alias)
	tabletAliasStrs = append(tabletAliasStrs, aliasStr)
	if !tw.refreshKnownTablets {
		// We already have a tabletInfo for this and the flag tells us to not refresh.
		if val, ok := tw.tablets[aliasStr]; ok {
			newTablets[aliasStr] = val
			continue
		}
	}
	// There's no network call here, so we just do the tablets one at a time instead of in parallel goroutines.
	newTablets[aliasStr] = &tabletInfo{
		alias:  aliasStr,
		tablet: tInfo.Tablet,
	}
}
if partialResult {
	// We don't want to remove any tablets from the tablets map or the healthcheck if we got a partial result
	// because we don't know if they were actually deleted or if we simply failed to fetch them.
	// Fill any gaps in the newTablets map using the existing tablets.
	for alias, val := range tw.tablets {
		if _, ok := newTablets[alias]; !ok {
			tabletAliasStrs = append(tabletAliasStrs, alias)
			newTablets[alias] = val
		}
	}
}
for alias, newVal := range newTablets {
	if tw.tabletFilter != nil && !tw.tabletFilter.IsIncluded(newVal.tablet) {
		continue
	}

Is there a test which demonstrates the problem we're trying to solve here?

We could certainly do the cell based filtering earlier on. I don't see what other optimizations you have in mind here. And keep in mind that the proposed solution seems to involve MORE topo calls unless I'm misunderstanding something.

@mattlord
Contributor

mattlord commented Sep 12, 2024

My point is that the premise here seems somewhat flawed AFAICT. The premise seems to be that we get every tablet record individually, but I don't see that we do (with the exception of ZooKeeper, which it's worth noting is not fully supported). We do one index prefix scan per cell and then apply filters to the tablet records in the cell.

So to put it more simply, filtering of tablets happens after topo gets, meaning we fetch all tablets from the topo when we don't need to

This is a limitation with K/V stores IMO. We can't pass predicates down to filter the records based on additional conditions.

I think it's best to start with a very clear problem and a test case that demonstrates it. Then go from there. Then we can more clearly evaluate the current situation. From there, we can explore ideas which can be demonstrated to improve that specific scenario w/o negatively impacting others (since this is not a practical problem for most Vitess users AFAIK).

Of course I could be missing or misunderstanding some things here, as always.

As someone that's done a fair amount of Vitess Topology Server work, this statement, "a change to the layout of the topo", is somewhat frightening. 🙂 It's typically FAR more difficult than one would think. You have to consider upgrade/downgrade issues, differences across implementations, additional client commands needed, etc. So clarifying the problem we're trying to solve and being able to clearly demonstrate it should be the first step before even considering it.

@deepthi
Member

deepthi commented Sep 12, 2024

we do a single GET for each tablet from the topo, even though we are getting tablets that are not part of the shard the current tablet is in.

As @mattlord already said, we don't necessarily do a GET for every tablet since v19 (#14693). It is still true that if you are using any of the available filters, we would fetch all the tablets and then filter them in vtgate memory.

Beyond that I would dispute that usage of --keyspaces_to_watch is very common. I'd have to see evidence to believe this.

The problems you are seeing with the volume of GETs could possibly be mitigated by porting that PR into your fork. It does depend on prior work, so you would need that too. Caveat: it is possible for you to run into grpc limits though, at which point we'd fall back to one tablet at a time, completely negating the benefits of using ListDir.
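For anyone weighing that option, the behaviour described above has roughly this shape: try one batched read per cell, and fall back to per-tablet fetches if the response exceeds the gRPC message size limit. This is a hedged sketch with placeholder functions (listTabletsInCell, listTabletAliases, getTabletByAlias are not real Vitess APIs), not the actual implementation:

// Hypothetical sketch of the batch-then-fallback behaviour described above.
package fallbacksketch

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
)

// Placeholders standing in for the real topo calls.
var (
	listTabletsInCell func(ctx context.Context, cell string) ([]*topodatapb.Tablet, error)
	listTabletAliases func(ctx context.Context, cell string) ([]string, error)
	getTabletByAlias  func(ctx context.Context, alias string) (*topodatapb.Tablet, error)
)

func fetchCellTablets(ctx context.Context, cell string) ([]*topodatapb.Tablet, error) {
	// Preferred path: one prefix scan / ListDir-style batch read for the cell.
	tablets, err := listTabletsInCell(ctx, cell)
	if err == nil {
		return tablets, nil
	}
	// If the batched response blows past the gRPC message size limit, fall
	// back to one tablet at a time, which negates the benefit of the batch.
	if status.Code(err) != codes.ResourceExhausted {
		return nil, err
	}
	aliases, err := listTabletAliases(ctx, cell)
	if err != nil {
		return nil, err
	}
	tablets = make([]*topodatapb.Tablet, 0, len(aliases))
	for _, alias := range aliases {
		t, err := getTabletByAlias(ctx, alias)
		if err != nil {
			return nil, err
		}
		tablets = append(tablets, t)
	}
	return tablets, nil
}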
