
RFC: improve efficiency of tablet filtering in go/vt/discovery #16761

Open
timvaillancourt opened this issue Sep 11, 2024 · 8 comments

@timvaillancourt
Contributor

timvaillancourt commented Sep 11, 2024

RFC Description

Today discovery.Healthcheck (in go/vt/discovery, used mainly by vtgate) supports filtering the tablets it watches by:

  1. --keyspaces_to_watch (very common)
  2. --tablet-filter on hostname
  3. --tablet-filter-tags on the tablet's Tags map[string]string
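Each of these filters boils down to a predicate with an IsIncluded(tablet) method that is evaluated per tablet on the vtgate side. A minimal sketch of that shape, using hypothetical filter types (only the IsIncluded contract is confirmed by the code quoted later in this thread; the real constructors in go/vt/discovery may differ):

package filtersketch

import (
	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
)

// tabletFilter mirrors the predicate the topology watcher consults for each
// tablet it has already fetched (see tw.tabletFilter.IsIncluded below).
type tabletFilter interface {
	IsIncluded(tablet *topodatapb.Tablet) bool
}

// filterByKeyspace is a hypothetical stand-in for --keyspaces_to_watch.
type filterByKeyspace struct {
	keyspaces map[string]bool
}

func (f *filterByKeyspace) IsIncluded(t *topodatapb.Tablet) bool {
	return f.keyspaces[t.Keyspace]
}

// filterByTags is a hypothetical stand-in for --tablet-filter-tags: every
// requested tag key/value must match the tablet's Tags map.
type filterByTags struct {
	tags map[string]string
}

func (f *filterByTags) IsIncluded(t *topodatapb.Tablet) bool {
	for k, v := range f.tags {
		if t.Tags[k] != v {
			return false
		}
	}
	return true
}

// included reports whether a tablet passes all configured filters.
func included(t *topodatapb.Tablet, filters []tabletFilter) bool {
	for _, f := range filters {
		if !f.IsIncluded(t) {
			return false
		}
	}
	return true
}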

Behind the scenes, this filtering happens in the tablet watcher's loadTablets(), but not very efficiently: first, all tablets in the cell are fetched from the topo unconditionally, then the optional filtering is applied. At times this filtering discards a significant portion of the topo KVs we just fetched. More on "why" later.

On clusters with 1000s of tablets this becomes a scalability problem for the topology store that has to handle topo Get calls for all of the tablets fetched. In an extreme case, the txthrottler (which uses discovery.Healthcheck to stream tablet stats) opens 1 x topology watcher per cell just to find tablets in its local shard. Let's say we have a 3 x cell deployment with 1000 tablets per cell and txthrottler is running in a 3-tablet shard: this means txthrottler will be reading 3000 topo KVs frequently just to find its 2 other members, and this problem grows with the number of tablets. In our production the problem is significantly larger.

Now why does the tablet watcher fetch EVERY tablet from the cell? Today it kind of has to 🤷. Using a Consul topo as an example, tablets are stored by alias in paths named /tablets/<tablet alias>/, and there is no efficient way to grab just 1 x keyspace or shard - you have to read everything.

There is a way to mitigate this inefficiency (but not resolve it): --tablet_refresh_known_tablets=false. This causes vtgate to keep the tablet records it has read forever, which has its own drawbacks and doesn't resolve the initial inefficient read of tablets for the entire cell.

This issue is an RFC/feature request for a more efficient way to fetch topo records for a single shard and/or keyspace. Unfortunately, improving the situation likely means a change to the layout of the topo.

Some early ideas:

  • "Pointer"/alias KVs - add KVs like /keyspaces/<keyspace>/<shard>/<tablet> that simply "point" to the actual /tablet/<alias> record, kind of like an index
    • This doesn't seem to be a built in feature of most topo stores so it would need to be done at a KV-level.
  • Tablet records are stored in per-keyspace/shard paths. But this would come at the cost of more ListDir operations
  • <your idea here>

🙇

Use Case(s)

Large Vitess deployments that use filtering (likely --keyspaces_to_watch) where the rate of topo Gets is a risk/concern

@timvaillancourt timvaillancourt added Type: Feature Request Needs Triage This issue needs to be correctly labelled and triaged Type: Enhancement Logical improvement (somewhere between a bug and feature) Type: RFC Request For Comment Component: VTTablet Component: VTGate Component: Throttler and removed Needs Triage This issue needs to be correctly labelled and triaged labels Sep 11, 2024
@timvaillancourt timvaillancourt self-assigned this Sep 11, 2024
@timvaillancourt
Contributor Author

timvaillancourt commented Sep 12, 2024

"Pointer"/alias KVs - add KVs like /keyspaces/// that simply "point" to the actual /tablet/ record, kind of like an index

To expand on this point a bit, most of the supported KV stores support transactional KV writes. Using this functionality, the index could be implemented as follows:

  1. Writes to tablet topo records happen inside a transaction that:
    • Updates the typical by-alias KV (in consul /tablets/<tablet alias>)
    • Adds/updates an empty KV to the "index" path, such as /keyspaces/<keyspace>/<shard>/<tablet alias>
  2. Optionally-indexed reads for a specific shard would:
    • Do a list dir operation on the "index" keyspace/shard path, returning a list of tablet aliases
    • Do a get for each tablet alias
  3. Optionally-indexed reads for a whole keyspace would:
    • Do a list dir operation on the "index" keyspace path, returning a list of shards
    • Do a list dir operation on each "index" shard path, returning a list of tablet aliases
    • Do a get for each tablet alias

So in the worst case (Step 3), 2 x list dir calls are added, with the expectation that this will be more beneficial than fetching every tablet record in the cell.
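A rough sketch of what that could look like against a hypothetical KV interface (kvStore, Txn, ListDir and Get below are placeholders, not existing Vitess topo APIs; the index path layout follows the steps above):

// Hypothetical sketch of the "pointer"/index-KV idea above. None of these
// types or functions exist in Vitess today; they only illustrate the layout.
package indexsketch

import (
	"context"
	"fmt"
	"path"
)

// kvStore stands in for a topo backend that supports transactional writes.
type kvStore interface {
	Txn(ctx context.Context, ops []kvOp) error // atomic multi-key write
	ListDir(ctx context.Context, dirPath string) ([]string, error)
	Get(ctx context.Context, key string) ([]byte, error)
}

type kvOp struct {
	Key   string
	Value []byte
}

// writeTablet updates the usual by-alias record and an empty "index" KV
// under the keyspace/shard path in a single transaction (Step 1).
func writeTablet(ctx context.Context, kv kvStore, keyspace, shard, alias string, record []byte) error {
	return kv.Txn(ctx, []kvOp{
		{Key: path.Join("tablets", alias, "Tablet"), Value: record},       // by-alias record
		{Key: path.Join("keyspaces", keyspace, shard, alias), Value: nil}, // index entry, value unused
	})
}

// tabletsForShard resolves the index to a list of aliases, then fetches only
// those records: 1 x list dir + N x get for the shard, instead of a get for
// every tablet in the cell (Step 2).
func tabletsForShard(ctx context.Context, kv kvStore, keyspace, shard string) (map[string][]byte, error) {
	aliases, err := kv.ListDir(ctx, path.Join("keyspaces", keyspace, shard))
	if err != nil {
		return nil, err
	}
	tablets := make(map[string][]byte, len(aliases))
	for _, alias := range aliases {
		rec, err := kv.Get(ctx, path.Join("tablets", alias, "Tablet"))
		if err != nil {
			return nil, fmt.Errorf("get tablet %s: %w", alias, err)
		}
		tablets[alias] = rec
	}
	return tablets, nil
}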

@mattlord
Contributor

mattlord commented Sep 12, 2024

I would note that this kind of topo work is often much more challenging than it would seem at first glance. I would recommend you have a PoC (proof of concept) PR to go along with any suggested implementations as this gives you a chance to flesh things out and see if it's possible, if it's actually any better, what challenges there are, any trade-offs, etc.

@mattlord
Contributor

mattlord commented Sep 12, 2024

On clusters with 1000s of tablets this becomes a scalability problem for the topology store that has to handle topo Get calls for all of the tablets fetched.

I would also note that this ONLY applies to ZooKeeper as it does not support scans. Getting the tablets in one prefix scan — for the topo server implementations that support scans — was added in #9560

So it's probably also worth stepping back and clarifying the problem here. Why is doing 1 prefix scan per cell on consul here so problematic? IMO this is something we live with when using a KV store and trying to map custom index-like structures that we then have to manage ourselves on top of it is going to be problematic.

@rvrangel
Contributor

@mattlord I think the issue is not with the prefix scan per cell, but the fact that after that operation, we do a single GET for each tablet from the topo, even though we are getting tablets that are not part of the shard the current tablet is in.

if you consider 10k tablets, each doing 10k GETs per minute, that totals 100M RPCs per minute against the topo (plus the scans)

@timvaillancourt
Contributor Author

timvaillancourt commented Sep 12, 2024

we do a single GET for each tablet from the topo, even though we are getting tablets that are not part of the shard the current tablet is in.

@rvrangel / @mattlord yes, this is the concern. Another example: let's say we have --keyspaces_to_watch foo and 100s of keyspaces

Today:

  1. .loadTablets() is called
  2. .getTablets() is called, which wraps tw.topoServer.GetTabletsByCell(...)
  3. That returns every tablet in the cell - remember we still just want 1 x keyspace, yet this returns all of them
  4. Then filtering using the 3 x methods mentioned earlier occurs using .IsIncluded(...)

So to put it more simply, filtering of tablets happens after topo gets, meaning we fetch all tablets from the topo when we don't need to
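To make the ordering concrete, here is a simplified paraphrase of that flow (error handling, stats and the refreshKnownTablets logic are omitted, and signatures may not match the current code exactly):

package flowsketch

import (
	"context"

	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
	"vitess.io/vitess/go/vt/topo"
	"vitess.io/vitess/go/vt/topo/topoproto"
)

// tabletFilter is the same IsIncluded shape sketched earlier in this thread.
type tabletFilter interface {
	IsIncluded(tablet *topodatapb.Tablet) bool
}

// loadTabletsSketch paraphrases the steps above: fetch every tablet in the
// cell first (steps 1-3), then filter in memory (step 4).
func loadTabletsSketch(ctx context.Context, ts *topo.Server, cell string, filter tabletFilter) (map[string]*topodatapb.Tablet, error) {
	// GetTabletsByCell returns EVERY tablet in the cell, regardless of
	// --keyspaces_to_watch / --tablet-filter / --tablet-filter-tags.
	tabletInfos, err := ts.GetTabletsByCell(ctx, cell, nil)
	if err != nil {
		return nil, err
	}

	// Filtering only happens afterwards, in vtgate memory.
	watched := make(map[string]*topodatapb.Tablet)
	for _, tInfo := range tabletInfos {
		if filter != nil && !filter.IsIncluded(tInfo.Tablet) {
			continue // already read from the topo, now discarded
		}
		watched[topoproto.TabletAliasString(tInfo.Alias)] = tInfo.Tablet
	}
	return watched, nil
}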

@mattlord
Contributor

mattlord commented Sep 12, 2024

@rvrangel I don't see where that happens here:

// First get the list of all tablets.
tabletInfos, err := tw.getTablets()
topologyWatcherOperations.Add(topologyWatcherOpListTablets, 1)
if err != nil {
	topologyWatcherErrors.Add(topologyWatcherOpListTablets, 1)
	// If we get a partial result error, we just log it and process the tablets that we did manage to fetch.
	if topo.IsErrType(err, topo.PartialResult) {
		log.Errorf("received partial result from getTablets for cell %v: %v", tw.cell, err)
		partialResult = true
	} else { // For all other errors, just return.
		log.Errorf("error getting tablets for cell: %v: %v", tw.cell, err)
		return
	}
}
// Accumulate a list of all known alias strings to use later
// when sorting.
tabletAliasStrs := make([]string, 0, len(tabletInfos))
tw.mu.Lock()
defer tw.mu.Unlock()
for _, tInfo := range tabletInfos {
	aliasStr := topoproto.TabletAliasString(tInfo.Alias)
	tabletAliasStrs = append(tabletAliasStrs, aliasStr)
	if !tw.refreshKnownTablets {
		// We already have a tabletInfo for this and the flag tells us to not refresh.
		if val, ok := tw.tablets[aliasStr]; ok {
			newTablets[aliasStr] = val
			continue
		}
	}
	// There's no network call here, so we just do the tablets one at a time instead of in parallel goroutines.
	newTablets[aliasStr] = &tabletInfo{
		alias:  aliasStr,
		tablet: tInfo.Tablet,
	}
}
if partialResult {
	// We don't want to remove any tablets from the tablets map or the healthcheck if we got a partial result
	// because we don't know if they were actually deleted or if we simply failed to fetch them.
	// Fill any gaps in the newTablets map using the existing tablets.
	for alias, val := range tw.tablets {
		if _, ok := newTablets[alias]; !ok {
			tabletAliasStrs = append(tabletAliasStrs, alias)
			newTablets[alias] = val
		}
	}
}
for alias, newVal := range newTablets {
	if tw.tabletFilter != nil && !tw.tabletFilter.IsIncluded(newVal.tablet) {
		continue
	}

Is there a test which demonstrates the problem we're trying to solve here?

We could certainly do the cell based filtering earlier on. I don't see what other optimizations you have in mind here. And keep in mind that the proposed solution seems to involve MORE topo calls unless I'm misunderstanding something.

@mattlord
Contributor

mattlord commented Sep 12, 2024

My point is that the premise here seems somewhat flawed AFAICT. The premise seems to be that we get every tablet record individually, but I don't see that we do (with the exception of ZooKeeper, which it's worth noting is not fully supported). We do one index prefix scan per cell and then apply filters to the tablet records in the cell.

So to put it more simply, filtering of tablets happens after topo gets, meaning we fetch all tablets from the topo when we don't need to

This is a limitation with K/V stores IMO. We can't pass predicates down to filter the records based on additional conditions.

I think it's best to start with a very clear problem and a test case that demonstrates it. Then go from there. Then we can more clearly evaluate the current situation. From there, we can explore ideas which can be demonstrated to improve that specific scenario w/o negatively impacting others (since this is not a practical problem for most Vitess users AFAIK).

Of course I could be missing or misunderstanding some things here, as always.

As someone that's done a fair amount of Vitess Topology Server work, this statement, "a change to the layout of the topo", is somewhat frightening. 🙂 It's typically FAR more difficult than one would think. You have to consider upgrade/downgrade issues, differences across implementations, additional client commands needed, etc. So clarifying the problem we're trying to solve and being able to clearly demonstrate it should be the first step before even considering it.

@deepthi
Member

deepthi commented Sep 12, 2024

we do a single GET for each tablet from the topo, even though we are getting tablets that are not part of the shard the current tablet is in.

As @mattlord already said, we don't necessarily do a GET for every tablet since v19 (#14693). It is still true that if you are using any of the available filters, we would fetch all the tablets and then filter them in vtgate memory.

Beyond that I would dispute that usage of --keyspaces_to_watch is very common. I'd have to see evidence to believe this.

The problems you are seeing with the volume of GETs could possibly be mitigated by porting that PR into your fork. It does depend on prior work, so you would need that too. Caveat: it is possible for you to run into grpc limits though, at which point we'd fall back to one tablet at a time, completely negating the benefits of using ListDir.
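For anyone weighing that option, the behaviour described above has roughly this shape: try one batched read per cell, and fall back to per-tablet fetches if the response exceeds the gRPC message size limit. This is a hedged sketch with placeholder functions (listTabletsInCell, listTabletAliases, getTabletByAlias are not real Vitess APIs), not the actual implementation:

// Hypothetical sketch of the batch-then-fallback behaviour described above.
package fallbacksketch

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
)

// Placeholders standing in for the real topo calls.
var (
	listTabletsInCell func(ctx context.Context, cell string) ([]*topodatapb.Tablet, error)
	listTabletAliases func(ctx context.Context, cell string) ([]string, error)
	getTabletByAlias  func(ctx context.Context, alias string) (*topodatapb.Tablet, error)
)

func fetchCellTablets(ctx context.Context, cell string) ([]*topodatapb.Tablet, error) {
	// Preferred path: one prefix scan / ListDir-style batch read for the cell.
	tablets, err := listTabletsInCell(ctx, cell)
	if err == nil {
		return tablets, nil
	}
	// If the batched response blows past the gRPC message size limit, fall
	// back to one tablet at a time, which negates the benefit of the batch.
	if status.Code(err) != codes.ResourceExhausted {
		return nil, err
	}
	aliases, err := listTabletAliases(ctx, cell)
	if err != nil {
		return nil, err
	}
	tablets = make([]*topodatapb.Tablet, 0, len(aliases))
	for _, alias := range aliases {
		t, err := getTabletByAlias(ctx, alias)
		if err != nil {
			return nil, err
		}
		tablets = append(tablets, t)
	}
	return tablets, nil
}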
