feat: Adds functionality to clear out bad shard list #25398

devanbenz · 2024-09-25T18:14:09Z

This PR adds a new method, with tests, that clears the bad shards list.
The method returns the IDs of the shards it cleared, along with
their errors. This is the first part of the feature
that adds a load-shards command to influxd-ctl.

companion PR: https://github.com/influxdata/plutonium/pull/4193

Closes https://github.com/influxdata/feature-requests/issues/591

gwossum · 2024-10-16T20:21:24Z

@devanbenz Is there a companion PR that adds APIs to access this functionality?

davidby-influx

How is this controlled externally?

davidby-influx · 2024-10-16T21:00:30Z

tsdb/store.go

+// were removed from the cache.
+func (s *Store) ClearBadShardList() map[uint64]error {
+	if s.badShards.shardErrors == nil {
+		badShards := make(map[uint64]error)


I believe that this needs to be protected by the s.badShards.mu throughout this method.

s.badShards.shardErrors should never be nil, so we probably want to do two things:

Log a warning

Set s.badShards.shardErrors to an empty map.

Then we can proceed with the clone and clear code below which will be the same in all cases.

ClearBadShardList should use GetBadShardList so that all the mutex and clone logic is localized to a single method.

That would produce:

func (s *Store) ClearBadShardList() map[uint64]error {
	badShards := s.GetBadShardList()
	clear(s.badShards.shardErrors)
	return badShards
}

// GetBadShardList is exposed as a method for test purposes
func (s *Store) GetBadShardList() map[uint64]error {
	s.badShards.mu.Lock()
	defer s.badShards.mu.Unlock()

	if s.badShards.shardErrors == nil {
		s.Logger.Warn("badShards was nil")
		s.badShards.shardErrors = make(map[uint64]error)
	}

	shardList := maps.Clone(s.badShards.shardErrors)
	return shardList
}

So it's likely I would still need to use a mutex in ClearBadShardList() to call the clear() method on the original map?

Yes, you must use the mutex around the clear
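
For illustration, here is a minimal, self-contained sketch of the point made in this thread: the clone and the clear happen inside one critical section, so no other goroutine can add an entry between the two steps. The `store` type and its field names are hypothetical stand-ins rather than the real `tsdb.Store`, and the logger warning is omitted. It assumes Go 1.21 or later for `maps.Clone` and the `clear` builtin.

```go
package main

import (
	"errors"
	"fmt"
	"maps"
	"sync"
)

// store is a hypothetical stand-in for tsdb.Store; only the bad-shard
// bookkeeping discussed in this thread is modeled.
type store struct {
	mu          sync.Mutex
	shardErrors map[uint64]error
}

// getBadShardList returns a copy of the map, so callers never share
// the live map with the store.
func (s *store) getBadShardList() map[uint64]error {
	s.mu.Lock()
	defer s.mu.Unlock()
	return maps.Clone(s.shardErrors)
}

// clearBadShardList copies and then empties the map while holding the
// mutex for the whole operation.
func (s *store) clearBadShardList() map[uint64]error {
	s.mu.Lock()
	defer s.mu.Unlock()

	if s.shardErrors == nil {
		// Should not happen; repair the state instead of failing.
		s.shardErrors = make(map[uint64]error)
	}
	cleared := maps.Clone(s.shardErrors)
	clear(s.shardErrors)
	return cleared
}

func main() {
	s := &store{shardErrors: map[uint64]error{7: errors.New("open failed")}}
	cleared := s.clearBadShardList()
	fmt.Println(len(cleared), len(s.getBadShardList())) // prints "1 0"
}
```

Note that `sync.Mutex` is not reentrant, so a clearing method that holds the lock cannot call a getter that takes the same lock. That is why the sketch clones inline instead of calling `getBadShardList`.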

davidby-influx · 2024-10-16T21:01:24Z

tsdb/store.go

+
+// GetBadShardList is exposed as a method for test purposes
+func (s *Store) GetBadShardList() map[uint64]error {
+	return s.badShards.shardErrors


This needs to be protected by the mutex here, and the return value cloned for safety.

devanbenz · 2024-10-16T21:14:25Z

@devanbenz Is there a companion PR that adds APIs to access this functionality?

Yes, I'm currently working on the plutonium code for that.

devanbenz · 2024-10-16T21:20:04Z

https://github.com/influxdata/plutonium/pull/4193/files This is the WIP PR that consumes the following code.

davidby-influx

Suggestions in comments.

davidby-influx · 2024-10-16T21:34:04Z

tsdb/store.go

 	}
 	badShards := maps.Clone(s.badShards.shardErrors)
 	clear(s.badShards.shardErrors)
+	s.badShards.mu.Unlock()


I would use defer immediately after the Lock call as future-proofing against other code paths being added, as well as idiomatic Go.

davidby-influx · 2024-10-16T21:35:07Z

tsdb/store.go

-	return s.badShards.shardErrors
+	s.badShards.mu.Lock()
+	shardList := maps.Clone(s.badShards.shardErrors)
+	s.badShards.mu.Unlock()


As suggested above, I would use defer immediately after the Lock call.

gwossum · 2024-10-16T21:38:32Z

tsdb/store_test.go

+			if len(badShards) != 1 {
+				t.Fatalf("expected 1 shard, got %d", len(badShards))
+			}


Use require.Len(t, badShards, 1)

gwossum · 2024-10-16T21:40:07Z

tsdb/store_test.go

+			}
+
+			// Check that bad shard list has been cleared
+			require.Equal(t, 0, len(s.Store.GetBadShardList()))


Even though this uses require and works, require.Empty(t, s.GetBadShardList()) is better because it will output what was in the slice / map if it isn't actually empty. This makes debugging issues simpler.

gwossum · 2024-10-16T21:40:51Z

tsdb/store_test.go

+			require.EqualError(t, err2, fmt.Errorf("not attempting to open shard %d; opening shard previously failed with: %w", shId, expErr).Error())
+
+			// Check that bad shard list now has a bad shard in it
+			require.Equal(t, 1, len(s.Store.GetBadShardList()))


Use require.Len(t, s.GetBadShardList(), 1) so the contents of the map will be printed out if there isn't exactly 1 item in the list.

gwossum · 2024-10-16T21:41:38Z

tsdb/store_test.go

+			require.NoError(t, err, "opening temp shard")
+			defer require.NoError(t, sh.Close(), "closing temporary shard")
+
+			require.Equal(t, 0, len(s.Store.GetBadShardList()))


Use require.Empty

gwossum · 2024-10-16T21:41:56Z

tsdb/store_test.go

+			if len(badShards) != 0 {
+				t.Fatalf("expected no shards, got %d", len(badShards))
+			}


Use require.Empty

gwossum · 2024-10-16T21:42:14Z

tsdb/store_test.go

+			}
+
+			// Check that bad shard list has been cleared
+			require.Equal(t, 0, len(s.Store.GetBadShardList()))


Use require.Empty

gwossum · 2024-10-16T21:42:31Z

tsdb/store_test.go

+	for _, idx := range indexes {
+		func() {
+			s := MustOpenStore(t, idx)
+			defer require.NoErrorf(t, s.Close(), "closing store with index type: %s", idx)


Nice error checking!

Isn't s.Close called on the defer line?

A deferred function’s arguments are evaluated when the defer statement is evaluated.

https://go.dev/blog/defer-panic-and-recover

gwossum · 2024-10-16T21:50:12Z

tsdb/store_test.go

+			s.SetShardOpenErrorForTest(sh.ID(), expErr)
+			err2 := s.OpenShard(sh.Shard, false)
+			require.Error(t, err2, "no error opening bad shard")
+			require.True(t, errors.Is(err2, tsdb.ErrPreviousShardFail{}), "exp: ErrPreviousShardFail, got: %v", err2)


Use require.ErrorIs(t, err2, tsdb.ErrPreviousShardFail{})
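
As background for this suggestion, here is a hedged sketch of how matching an error against a zero-value struct target can work with the standard `errors.Is`. The definition of `tsdb.ErrPreviousShardFail` is not shown in this thread, so `errPreviousFail`, `errCorrupt`, `openBadShard`, and `isPreviousFail` below are hypothetical; the sketch assumes the type provides an `Is` method, which is one way to make a zero-value target match.

```go
package main

import (
	"errors"
	"fmt"
)

// errPreviousFail is a hypothetical stand-in for tsdb.ErrPreviousShardFail.
type errPreviousFail struct{ cause error }

func (e errPreviousFail) Error() string {
	return fmt.Sprintf("opening shard previously failed with: %v", e.cause)
}

// Unwrap exposes the original failure to errors.Is and errors.As.
func (e errPreviousFail) Unwrap() error { return e.cause }

// Is matches any target of this type, so callers can pass a zero value.
func (e errPreviousFail) Is(target error) bool {
	_, ok := target.(errPreviousFail)
	return ok
}

// errCorrupt is a made-up root cause for the demonstration.
var errCorrupt = errors.New("corrupt index")

// openBadShard simulates a refused open for a shard that failed earlier.
func openBadShard(id uint64) error {
	return fmt.Errorf("not attempting to open shard %d; %w",
		id, errPreviousFail{cause: errCorrupt})
}

// isPreviousFail reports whether err's chain contains an errPreviousFail.
func isPreviousFail(err error) bool {
	return errors.Is(err, errPreviousFail{})
}

func main() {
	err := openBadShard(3)
	fmt.Println(isPreviousFail(err))        // true
	fmt.Println(errors.Is(err, errCorrupt)) // true
}
```

Without the `Is` method, the zero-value target would only match by plain equality, which fails here because the wrapped value carries a non-nil cause.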

davidby-influx

Some questions around defer semantics

davidby-influx · 2024-10-16T22:24:37Z

tsdb/store_test.go

+	for _, idx := range indexes {
+		func() {
+			s := MustOpenStore(t, idx)
+			defer require.NoErrorf(t, s.Close(), "closing store with index type: %s", idx)


Doesn't this close the store immediately, because arguments to deferred functions are evaluated at the defer line? See here

Hm, interesting. A similar construct is used throughout the tests (outside of the ones I worked on).

I'll go through this and see what's going on.

Take a look at the Go playground link I shared for a demonstration.
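
The evaluation order in question can be reproduced with a small, self-contained sketch. The names `resource`, `check`, `eager`, and `lazy` are hypothetical; `check` stands in for `require.NoError`, and `resource.Close` for `s.Close`. Each function records events in a slice so the ordering is visible.

```go
package main

import "fmt"

// resource is a hypothetical stand-in for the store or shard under test.
type resource struct{ log *[]string }

// Close records that it ran and reports success.
func (r resource) Close() error {
	*r.log = append(*r.log, "close")
	return nil
}

// check stands in for require.NoError; it only records when it runs.
func check(log *[]string, err error) {
	*log = append(*log, "check")
}

// eager mirrors the construct under review: the argument r.Close() is
// evaluated at the defer statement, so the resource is closed before
// any work happens. Only check itself is deferred.
func eager(log *[]string) {
	r := resource{log}
	defer check(log, r.Close())
	*log = append(*log, "work")
}

// lazy wraps the call in a closure, so Close runs when the function returns.
func lazy(log *[]string) {
	r := resource{log}
	defer func() { check(log, r.Close()) }()
	*log = append(*log, "work")
}

func main() {
	var a, b []string
	eager(&a)
	lazy(&b)
	fmt.Println(a) // [close work check]
	fmt.Println(b) // [work close check]
}
```

In a test, the closure form would look like `defer func() { require.NoError(t, s.Close()) }()`.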

davidby-influx · 2024-10-16T22:28:34Z

tsdb/store_test.go

+	for _, idx := range indexes {
+		func() {
+			s := MustOpenStore(t, idx)
+			defer require.NoErrorf(t, s.Close(), "closing store with index type: %s", idx)


Isn't s.Close called on the defer line?

A deferred function’s arguments are evaluated when the defer statement is evaluated.

https://go.dev/blog/defer-panic-and-recover

davidby-influx · 2024-10-16T22:29:28Z

tsdb/store_test.go

+			sh := tsdb.NewTempShard(idx)
+			err := s.OpenShard(sh.Shard, false)
+			require.NoError(t, err, "opening temp shard")
+			defer require.NoError(t, sh.Close(), "closing temporary shard")


As above, when is sh.Close called?

Maybe a defer is not necessary here, since the shard is never used after opening.

devanbenz changed the title ~~feat: WIP working on deleting badShards~~ feat: Adds functionality to clear out bad shard list Sep 27, 2024

devanbenz force-pushed the feat/591-load-failed-shards-cmd branch from 0a8b33e to cb3abb4 Compare September 27, 2024 15:11

devanbenz marked this pull request as ready for review September 27, 2024 15:11

devanbenz added 5 commits October 16, 2024 10:30

feat: merges in changes from shard loading

eea6127

feat: add some more comments

76616ff

fix: fixes test

cda3457

feat: add nil check for clear bad shards

0b9c2a3

feat: add some more test assertions

c4abeac

devanbenz requested review from gwossum and davidby-influx and removed request for gwossum October 16, 2024 19:42

davidby-influx requested changes Oct 16, 2024

fix: Use mutex for data protection and warn if badShards is nil

f1bf633

davidby-influx requested changes Oct 16, 2024

gwossum reviewed Oct 16, 2024

fix: add defer to unlocks and change tests to use testify

884a857

davidby-influx approved these changes Oct 16, 2024
