Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async load cannot delete unnecessary items when there is no histogram metadata for the table #59024

Open
Rustin170506 opened this issue Jan 17, 2025 · 1 comment
Labels
sig/planner SIG: Planner type/bug The issue is confirmed as a bug.

Comments

@Rustin170506
Copy link
Member

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Find the unit test: TestNoNeedIndexStatsLoading
func TestNoNeedIndexStatsLoading(t *testing.T) {
	store, dom := realtikvtest.CreateMockStoreAndDomainAndSetup(t)
	tk := testkit.NewTestKit(t, store)
	tk.MustExec("use test;")
	tk.MustExec("drop table if exists t;")
	// 1. Create a table and the statsHandle.Update(do.InfoSchema()) will load this table into the stats cache.
	tk.MustExec("create table if not exists t(a int, b int, index ia(a));")
	// 2. Drop the stats of the stats, it will clean up all system table records for this table.
	tk.MustExec("drop stats t;")
	// 3. Insert some data and wait for the modify_count and the count is not null in the mysql.stats_meta.
	tk.MustExec("insert into t value(1,1), (2,2);")
	h := dom.StatsHandle()
	require.NoError(t, h.DumpStatsDeltaToKV(true))
	require.NoError(t, h.Update(context.Background(), dom.InfoSchema()))
	// 4. Try to select some data from this table by ID, it would trigger an async load.
	tk.MustExec("set tidb_opt_objective='determinate';")
	tk.MustQuery("select * from t where a = 1 and b = 1;").Check(testkit.Rows("1 1"))
	table, err := dom.InfoSchema().TableByName(context.Background(), ast.NewCIStr("test"), ast.NewCIStr("t"))
	require.NoError(t, err)
	checkTableIDInItems(t, table.Meta().ID)
}
  1. Add a single line to the test:
func TestNoNeedIndexStatsLoading(t *testing.T) {
	store, dom := realtikvtest.CreateMockStoreAndDomainAndSetup(t)
	tk := testkit.NewTestKit(t, store)
	tk.MustExec("use test;")
	tk.MustExec("drop table if exists t;")
	// 1. Create a table and the statsHandle.Update(do.InfoSchema()) will load this table into the stats cache.
	tk.MustExec("create table if not exists t(a int, b int, index ia(a));")
	// 2. Drop the stats of the stats, it will clean up all system table records for this table.
	tk.MustExec("drop stats t;")
	// 3. Insert some data and wait for the modify_count and the count is not null in the mysql.stats_meta.
	tk.MustExec("insert into t value(1,1), (2,2);")
	h := dom.StatsHandle()
	require.NoError(t, h.DumpStatsDeltaToKV(true))
	require.NoError(t, h.Update(context.Background(), dom.InfoSchema()))
	// 4. Try to select some data from this table by ID, it would trigger an async load.
	tk.MustExec("set tidb_opt_objective='determinate';")
	tk.MustQuery("select * from t where a = 1 and b = 1;").Check(testkit.Rows("1 1"))
+	require.NoError(t, h.Update(context.Background(), dom.InfoSchema()))
	table, err := dom.InfoSchema().TableByName(context.Background(), ast.NewCIStr("test"), ast.NewCIStr("t"))
	require.NoError(t, err)
	checkTableIDInItems(t, table.Meta().ID)
}
  1. Start a TiDB cluster but without any TiDB nodes:
tiup playground nightly --db 0
  1. Change the test case to use the real TiKV:
-	WithRealTiKV = flag.Bool("with-real-tikv", false, "whether tests run with real TiKV")
+	WithRealTiKV = flag.Bool("with-real-tikv", true, "whether tests run with real TiKV")
  1. Run the test:
go test -timeout 10m -tags intest -run ^TestNoNeedIndexStatsLoading$ github.com/pingcap/tidb/tests/realtikvtest/statisticstest -count=1

2. What did you expect to see? (Required)

It passed.

3. What did you see instead (Required)

It kept failing.

--- FAIL: TestNoNeedIndexStatsLoading (60.28s)
    testkit.go:123: RefreshSession rand seed: 1737128106368940000
    testkit.go:123: RefreshSession rand seed: 1737128106446542000
    statistics_test.go:304: Timeout: Table ID was not removed from items within the time limit
FAIL
FAIL    github.com/pingcap/tidb/tests/realtikvtest/statisticstest       61.582s
FAIL

4. What is your TiDB version? (Required)

master

@Rustin170506 Rustin170506 added the type/bug The issue is confirmed as a bug. label Jan 17, 2025
@Rustin170506
Copy link
Member Author

Rustin170506 commented Jan 17, 2025

Let us use some questions to explain how we found it.

  1. Where do we find it?

As you can see, if we don't add this new line, this test case will pass (in most cases). This means that sometimes it will fail and we didn't catch it in our master branch's CI.
So until I made this PR, which tried to port some stats changes to the 8.5-hotfix branch, we didn't know that there are flaky tests here.
After I created the PR, I found out that the check-dev2 test kept failing. From the beginning, I thought it was a known flaky test and just ran it a couple of times. But it always fail on two same tests. Then I started investigating it, but I made a big mistake when I tried to reproduce it locally.

  1. How did I reproduce it locally?

The failed test cases are:

  1. TestNewCollationStatsWithPrefixIndex
  2. TestNoNeedIndexStatsLoading

To reproduce this problem because this test is realtikvtest test, so as usual I just start a tidb cluster with tiup playground: tiup playground nightly.

Then I ran the first test case, it passed. It gives me more confidence that it is defeilty a flay test. I don't need to fix it. Let's run it again! So I tried to run it several times. But it still keeps failing.

After that I tried to run TestNoNeedIndexStatsLoading, it keeps failing even on my local environment. And I tried the same test case on the master branch. It passed, which means there are some differences between the 8.5 branch and the master branch.

So I started thinking maybe some changes are missed on release-8.5, so I went through all the commits and check if I missed anything. Because recently we made a lot of changes to the stats, so I also asked yiding to help me check if we missed any commits.

But after carefully checking all the commits from my PR, I didn't think I was just missing something, so I started using a binary search approach to find the last good commit.

And I started this binary search on the master branch because the only suspicious commit is bf939fa. So I checked out to that commit. And ran this test again and again, I found that this case sometimes fails even when I was on the master branch. So that means we find out that this is a buggy test, but it just always fails on v8.5.

So I tried the commit before bf939fa, which is bee268d. And we found that this commit can always pass the test. So at least we know the case break by this commit. But this is the nightmare beginning.

  1. Was it caused by bf939fa?

The reason I said this commit is the nightmare start is that it brings us to a wrong build. Or at least I was stupid enough to think it was a root cause.

Because I thought this commit is root cause, so I started looking really hard in this change. And also asked yiding to help look at this commit. But after two hours, the only thing we got was that `h.Update' wasn't loading the table into the stats cache correctly. Then I thought maybe the commit just caused the problem. Because we don't have table statistics in the cache, we accidentally failed to remove the async load element from its list.

But actually we only got half success, we knew that this test case would fail because we lost the table cache from the stats cache. But we made two wrong assumptions here:

  1. I thought the drop stats step from the test case really had an effect.
  2. I thought the h.Update step from the test case really helped to load/update the table stats to stats cache.

With these two wrong assumptions, I started to look at recent changes again. But after almost an hour I still cannot find out what caused the test to fail.

  1. 👹 tiup playground nightly

Because the release team is waiting for this PR after I spent a couple of hours on it and still couldn't fix it. I started looking for a workaround to change the test case and make it stable.

So I started to debug the test case and try to find the reason why h.Update didn't help me load/update the table stats correctly. Finally I found out that because there is no mysql.stats_histograms record for this table at all. So from the h.Update function:

+ tbl, err := s.statsHandle.TableStatsFromStorage(
	tableInfo,
	physicalID,
	false,
	0,
)
// Error is not nil may mean that there are some ddl changes on this table, we will not update it.
if err != nil {
	statslogutil.StatsLogger().Error(
		"error occurred when read table stats",
		zap.String("table", tableInfo.Name.O),
		zap.Error(err),
	)
	continue
}
if tbl == nil {
+	tblToUpdateOrDelete.addToDelete(physicalID)
	continue
}

we directly ignored this table and also thought this table has been removed.

The when we do the sync load, we tried to call loadNeededIndexHistograms. But because the table stats not exist in the stats cache.

func loadNeededIndexHistograms(sctx sessionctx.Context, is infoschema.InfoSchema, statsHandle statstypes.StatsHandle, idx model.TableItemID, loadFMSketch bool) (err error) {
	tbl, ok := statsHandle.Get(idx.TableID)
	if !ok {
		return nil
	}

As you can see, this function retuned early because we don't have table stats.

In the end, the test case failed because we retuned eraily and also forgot to remove the item from the async load item list.

So to workaround or at least to pass this test case on v8.5, my solution is straightforward, we don't need to delete all mysql.stats_histograms records for the table though drop stats. So changed the drop stats t to delete from mysql.stats_histograms where table_id = ? and is_index =1.

This sounds promising, but after I changed the statement it still fails. Then I realized that I should check if the DDL event was procsedded. So I changed the test case to do that:

func TestNoNeedIndexStatsLoading(t *testing.T) {
	store, dom := realtikvtest.CreateMockStoreAndDomainAndSetup(t)
	tk := testkit.NewTestKit(t, store)
	tk.MustExec("use test;")
	h := dom.StatsHandle()
	tk.MustExec("drop table if exists t;")
	tk.MustExec("create table if not exists t(a int, b int, index ia(a));")
	// Make sure the table has stats_histogram and stats_meta be created.
	table, err := dom.InfoSchema().TableByName(context.Background(), model.NewCIStr("test"), model.NewCIStr("t"))
	require.NoError(t, err)
	require.Eventually(t, func() bool {
		rows := tk.MustQuery("select * from mysql.stats_histograms where table_id = ?", table.Meta().ID).Rows()
		require.NoError(t, err)
		return len(rows) >= 3
	}, 1*time.Minute, 100*time.Millisecond)
	// 1. Delete the index stats.
	tk.MustExec("delete from mysql.stats_histograms where table_id = ? and is_index = 1;", table.Meta().ID)
	// 2. Insert some data and wait for the modify_count and the count is not null in the mysql.stats_meta.
	tk.MustExec("insert into t value(1,1), (2,2);")
	require.NoError(t, h.DumpStatsDeltaToKV(true))
	// Check the modify_count and count is not null.
	tk.MustQuery("select modify_count, count from mysql.stats_meta where table_id = ?;", table.Meta().ID).Check(testkit.Rows("2 2"))
	// 3. Make sure we load the stats to cache.
	require.NoError(t, h.Update(context.Background(), dom.InfoSchema()))
	// 4. Try to select some data from this table by ID, it would trigger an async load.
	tk.MustExec("set tidb_opt_objective='determinate';")
	tk.MustQuery("select * from t where a = 1 and b = 1;").Check(testkit.Rows("1 1"))
	checkTableIDInItems(t, table.Meta().ID)
}

And it works fine locally! So I was so happy that I fixed it. But after I pushed it to the CI. It failed again! It makes me crazy!

Then I had another conversion with Yiding, I asked him why this can pass my env but fail on CI. How on earth can it fail!

He pointed out a very impomant point is maybe it is caused by the DLL event handling. Then I realized that this is real tikv test but with --tags=intest. This means that the TiDB does not handle the DDL event by default. Then that explains why it passed on my local MacBook! Because I was using tiup playground nightly. This means that another TiDB server without the --tags=intest build flag was running in my event. So I tried tiup playground nightly --db 0, then the test failed.
Even the TestNewCollationStatsWithPrefixIndex fails! We are finally at the root of the problem.

I also updated the test to correctly handle the DDL event:

func TestNoNeedIndexStatsLoading(t *testing.T) {
	store, dom := realtikvtest.CreateMockStoreAndDomainAndSetup(t)
	tk := testkit.NewTestKit(t, store)
	tk.MustExec("use test;")
	h := dom.StatsHandle()
	tk.MustExec("drop table if exists t;")
	tk.MustExec("create table if not exists t(a int, b int, index ia(a));")
+	createTableEvent := findEvent(h.DDLEventCh(), metamodel.ActionCreateTable, func(event *notifier.SchemaChangeEvent) bool {
+		return event.GetCreateTableInfo().Name.L == "t"
	})
+	statstestutil.HandleDDLEventWithTxn(h, createTableEvent)
	// Make sure the table has stats_histogram and stats_meta be created.
	table, err := dom.InfoSchema().TableByName(context.Background(), model.NewCIStr("test"), model.NewCIStr("t"))
	require.NoError(t, err)
	require.Eventually(t, func() bool {
		rows := tk.MustQuery("select * from mysql.stats_histograms where table_id = ?", table.Meta().ID).Rows()
		require.NoError(t, err)
		return len(rows) >= 3
	}, 1*time.Minute, 100*time.Millisecond)
	// 1. Delete the index stats.
+	tk.MustExec("delete from mysql.stats_histograms where table_id = ? and is_index = 1;", table.Meta().ID)
	// 2. Insert some data and wait for the modify_count and the count is not null in the mysql.stats_meta.
	tk.MustExec("insert into t value(1,1), (2,2);")
	require.NoError(t, h.DumpStatsDeltaToKV(true))
	// Check the modify_count and count is not null.
	tk.MustQuery("select modify_count, count from mysql.stats_meta where table_id = ?;", table.Meta().ID).Check(testkit.Rows("2 2"))
	// 3. Make sure we load the stats to cache.
	require.NoError(t, h.Update(context.Background(), dom.InfoSchema()))
	// 4. Try to select some data from this table by ID, it would trigger an async load.
	tk.MustExec("set tidb_opt_objective='determinate';")
	tk.MustQuery("select * from t where a = 1 and b = 1;").Check(testkit.Rows("1 1"))
	checkTableIDInItems(t, table.Meta().ID)
}

First time it passed on v8.5 CI!

  1. Why different from master?

Even I fixed the flacky test, but I was so furious that I had to change the test, and still don't know why it's different from the master.

The release team and I had a meeting, and they asked me to skip those tests and move forward. I think it makes sense to unblock the testing process. But I still struggle with letting it go.

So after the meeting, I did another debug with the information I already got from the previous debug. So now we know that we're not really handling the DLL at all. So why does the master database still have a table cache in memory?

Yiding is also debuting the test, he mentioned that maybe there is someplace just filled a pseudo stats for it. From the debugging, it does use pseudo stats in loadNeededIndexHistograms. So it can pass the test.

I also remembered that we will fill some pseudo for a few tables.

And it is in:

func (h *Handle) getPartitionStats(tblInfo *model.TableInfo, pid int64, returnPseudo bool) *statistics.Table {
...
	tbl, ok := h.Get(pid)
	if !ok {
		if returnPseudo {
			tbl = statistics.PseudoTable(tblInfo, false, true)
			tbl.PhysicalID = pid
			if tblInfo.GetPartitionInfo() == nil || h.Len() < 64 {
				h.UpdateStatsCache(types.CacheUpdate{
					Updated: []*statistics.Table{tbl},
				})
			}
			return tbl
		}
		return nil
	}
...
}

As you can see, we will fill the pseudo stats for at least 64 tables.

This can explain where the pseudo-stats come from. But still cannot explain why it keeps failing on v8.5.

  1. Execution sequence

So we know that we have pseudo-stats for the table. But somehow on the branch v8.5 we delete it somehow.

As we saw before, h.Update will try to drop the statistics of this table because it thinks we have deleted the table. So yiding said maybe somehow we executed the h.Update before we do the async load.

That totally make sense, but I still didn't get why v8.5 always fail in this case. All of suddl, I realized recently we had some improvements to disallow the stats update and async load to share a goroutine.

They I checked the v8.5 code, I found the root case:

for {
		select {
		case <-loadTicker.C:
			err = statsHandle.Update(ctx, do.InfoSchema())
			if err != nil {
				logutil.BgLogger().Warn("update stats info failed", zap.Error(err))
			}
			err = statsHandle.LoadNeededHistograms(do.InfoSchema())
			if err != nil {
				logutil.BgLogger().Warn("load histograms failed", zap.Error(err))
			}
		case <-do.exit:
			return
		}
}

So this means every single time we want to do an async load. We will do a stats update first!

But on the master branch, we have already separated them in this PR.

After I cherry-picked this change, without any changes. All tests passed!

  1. Hold on! What's wrong with bf939fa?

As mentioned before, before bf939fa, everything is OK. So why did this flaky test appear after this change?

bf939fa is not the problem, but it is the trigger/mediator of this problem. The main change in this commit is that we removed on-fly column size tracking. So before this change, when we did DumpStatsDeltaToKV, we would also record the column size on the fly, which would update the table record from mysql.stats_histograms.

So with this column size tracking, we always have mysql.stats_histograms for the table. So there is no way that h.Update() will delete the table stats. But after we dropped this feature, the bug finally came out to bite us!

In conclusion, this is just a bug that affects almost all subcomponents of the stats module. We need to find a solution to fix it soon. (Avoid mistakenly removing the table from the statistics cache, because the basic assumption here is that if the table exists in the database, then it should be in the statistics cache.)

But for now, I need to get some sleep. What a long day!

Thanks to @winoros without him I don't think I would be able to find the root cause and fix it. Kudos to him.

@jebter jebter added the sig/planner SIG: Planner label Jan 21, 2025
@winoros winoros changed the title Aasyn load cannot delete unnecessary items when there is no histogram metadata for the table Async load cannot delete unnecessary items when there is no histogram metadata for the table Jan 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sig/planner SIG: Planner type/bug The issue is confirmed as a bug.
Projects
None yet
Development

No branches or pull requests

2 participants