-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Async load cannot delete unnecessary items when there is no histogram metadata for the table #59024
Comments
Let us use some questions to explain how we found it.
As you can see, if we don't add this new line, this test case will pass (in most cases). This means that sometimes it will fail and we didn't catch it in our master branch's CI.
The failed test cases are:
To reproduce this problem because this test is Then I ran the first test case, it passed. It gives me more confidence that it is defeilty a flay test. I don't need to fix it. Let's run it again! So I tried to run it several times. But it still keeps failing. After that I tried to run So I started thinking maybe some changes are missed on release-8.5, so I went through all the commits and check if I missed anything. Because recently we made a lot of changes to the stats, so I also asked yiding to help me check if we missed any commits. But after carefully checking all the commits from my PR, I didn't think I was just missing something, so I started using a binary search approach to find the last good commit. And I started this binary search on the master branch because the only suspicious commit is bf939fa. So I checked out to that commit. And ran this test again and again, I found that this case sometimes fails even when I was on the master branch. So that means we find out that this is a buggy test, but it just always fails on v8.5. So I tried the commit before bf939fa, which is bee268d. And we found that this commit can always pass the test. So at least we know the case break by this commit. But this is the nightmare beginning.
The reason I said this commit is the nightmare start is that it brings us to a wrong build. Or at least I was stupid enough to think it was a root cause. Because I thought this commit is root cause, so I started looking really hard in this change. And also asked yiding to help look at this commit. But after two hours, the only thing we got was that `h.Update' wasn't loading the table into the stats cache correctly. Then I thought maybe the commit just caused the problem. Because we don't have table statistics in the cache, we accidentally failed to remove the async load element from its list. But actually we only got half success, we knew that this test case would fail because we lost the table cache from the stats cache. But we made two wrong assumptions here:
With these two wrong assumptions, I started to look at recent changes again. But after almost an hour I still cannot find out what caused the test to fail.
Because the release team is waiting for this PR after I spent a couple of hours on it and still couldn't fix it. I started looking for a workaround to change the test case and make it stable. So I started to debug the test case and try to find the reason why + tbl, err := s.statsHandle.TableStatsFromStorage(
tableInfo,
physicalID,
false,
0,
)
// Error is not nil may mean that there are some ddl changes on this table, we will not update it.
if err != nil {
statslogutil.StatsLogger().Error(
"error occurred when read table stats",
zap.String("table", tableInfo.Name.O),
zap.Error(err),
)
continue
}
if tbl == nil {
+ tblToUpdateOrDelete.addToDelete(physicalID)
continue
} we directly ignored this table and also thought this table has been removed. The when we do the sync load, we tried to call func loadNeededIndexHistograms(sctx sessionctx.Context, is infoschema.InfoSchema, statsHandle statstypes.StatsHandle, idx model.TableItemID, loadFMSketch bool) (err error) {
tbl, ok := statsHandle.Get(idx.TableID)
if !ok {
return nil
} As you can see, this function retuned early because we don't have table stats. In the end, the test case failed because we retuned eraily and also forgot to remove the item from the async load item list. So to workaround or at least to pass this test case on v8.5, my solution is straightforward, we don't need to delete all This sounds promising, but after I changed the statement it still fails. Then I realized that I should check if the DDL event was procsedded. So I changed the test case to do that: func TestNoNeedIndexStatsLoading(t *testing.T) {
store, dom := realtikvtest.CreateMockStoreAndDomainAndSetup(t)
tk := testkit.NewTestKit(t, store)
tk.MustExec("use test;")
h := dom.StatsHandle()
tk.MustExec("drop table if exists t;")
tk.MustExec("create table if not exists t(a int, b int, index ia(a));")
// Make sure the table has stats_histogram and stats_meta be created.
table, err := dom.InfoSchema().TableByName(context.Background(), model.NewCIStr("test"), model.NewCIStr("t"))
require.NoError(t, err)
require.Eventually(t, func() bool {
rows := tk.MustQuery("select * from mysql.stats_histograms where table_id = ?", table.Meta().ID).Rows()
require.NoError(t, err)
return len(rows) >= 3
}, 1*time.Minute, 100*time.Millisecond)
// 1. Delete the index stats.
tk.MustExec("delete from mysql.stats_histograms where table_id = ? and is_index = 1;", table.Meta().ID)
// 2. Insert some data and wait for the modify_count and the count is not null in the mysql.stats_meta.
tk.MustExec("insert into t value(1,1), (2,2);")
require.NoError(t, h.DumpStatsDeltaToKV(true))
// Check the modify_count and count is not null.
tk.MustQuery("select modify_count, count from mysql.stats_meta where table_id = ?;", table.Meta().ID).Check(testkit.Rows("2 2"))
// 3. Make sure we load the stats to cache.
require.NoError(t, h.Update(context.Background(), dom.InfoSchema()))
// 4. Try to select some data from this table by ID, it would trigger an async load.
tk.MustExec("set tidb_opt_objective='determinate';")
tk.MustQuery("select * from t where a = 1 and b = 1;").Check(testkit.Rows("1 1"))
checkTableIDInItems(t, table.Meta().ID)
} And it works fine locally! So I was so happy that I fixed it. But after I pushed it to the CI. It failed again! It makes me crazy! Then I had another conversion with Yiding, I asked him why this can pass my env but fail on CI. How on earth can it fail! He pointed out a very impomant point is maybe it is caused by the DLL event handling. Then I realized that this is real tikv test but with I also updated the test to correctly handle the DDL event: func TestNoNeedIndexStatsLoading(t *testing.T) {
store, dom := realtikvtest.CreateMockStoreAndDomainAndSetup(t)
tk := testkit.NewTestKit(t, store)
tk.MustExec("use test;")
h := dom.StatsHandle()
tk.MustExec("drop table if exists t;")
tk.MustExec("create table if not exists t(a int, b int, index ia(a));")
+ createTableEvent := findEvent(h.DDLEventCh(), metamodel.ActionCreateTable, func(event *notifier.SchemaChangeEvent) bool {
+ return event.GetCreateTableInfo().Name.L == "t"
})
+ statstestutil.HandleDDLEventWithTxn(h, createTableEvent)
// Make sure the table has stats_histogram and stats_meta be created.
table, err := dom.InfoSchema().TableByName(context.Background(), model.NewCIStr("test"), model.NewCIStr("t"))
require.NoError(t, err)
require.Eventually(t, func() bool {
rows := tk.MustQuery("select * from mysql.stats_histograms where table_id = ?", table.Meta().ID).Rows()
require.NoError(t, err)
return len(rows) >= 3
}, 1*time.Minute, 100*time.Millisecond)
// 1. Delete the index stats.
+ tk.MustExec("delete from mysql.stats_histograms where table_id = ? and is_index = 1;", table.Meta().ID)
// 2. Insert some data and wait for the modify_count and the count is not null in the mysql.stats_meta.
tk.MustExec("insert into t value(1,1), (2,2);")
require.NoError(t, h.DumpStatsDeltaToKV(true))
// Check the modify_count and count is not null.
tk.MustQuery("select modify_count, count from mysql.stats_meta where table_id = ?;", table.Meta().ID).Check(testkit.Rows("2 2"))
// 3. Make sure we load the stats to cache.
require.NoError(t, h.Update(context.Background(), dom.InfoSchema()))
// 4. Try to select some data from this table by ID, it would trigger an async load.
tk.MustExec("set tidb_opt_objective='determinate';")
tk.MustQuery("select * from t where a = 1 and b = 1;").Check(testkit.Rows("1 1"))
checkTableIDInItems(t, table.Meta().ID)
} First time it passed on v8.5 CI!
Even I fixed the flacky test, but I was so furious that I had to change the test, and still don't know why it's different from the master. The release team and I had a meeting, and they asked me to skip those tests and move forward. I think it makes sense to unblock the testing process. But I still struggle with letting it go. So after the meeting, I did another debug with the information I already got from the previous debug. So now we know that we're not really handling the DLL at all. So why does the master database still have a table cache in memory? Yiding is also debuting the test, he mentioned that maybe there is someplace just filled a pseudo stats for it. From the debugging, it does use pseudo stats in I also remembered that we will fill some pseudo for a few tables. And it is in:
As you can see, we will fill the pseudo stats for at least 64 tables. This can explain where the pseudo-stats come from. But still cannot explain why it keeps failing on v8.5.
So we know that we have pseudo-stats for the table. But somehow on the branch v8.5 we delete it somehow. As we saw before, That totally make sense, but I still didn't get why v8.5 always fail in this case. All of suddl, I realized recently we had some improvements to disallow the stats update and async load to share a goroutine. They I checked the v8.5 code, I found the root case: for {
select {
case <-loadTicker.C:
err = statsHandle.Update(ctx, do.InfoSchema())
if err != nil {
logutil.BgLogger().Warn("update stats info failed", zap.Error(err))
}
err = statsHandle.LoadNeededHistograms(do.InfoSchema())
if err != nil {
logutil.BgLogger().Warn("load histograms failed", zap.Error(err))
}
case <-do.exit:
return
}
} So this means every single time we want to do an async load. We will do a stats update first! But on the master branch, we have already separated them in this PR. After I cherry-picked this change, without any changes. All tests passed!
As mentioned before, before bf939fa, everything is OK. So why did this flaky test appear after this change? bf939fa is not the problem, but it is the trigger/mediator of this problem. The main change in this commit is that we removed on-fly column size tracking. So before this change, when we did So with this column size tracking, we always have In conclusion, this is just a bug that affects almost all subcomponents of the stats module. We need to find a solution to fix it soon. (Avoid mistakenly removing the table from the statistics cache, because the basic assumption here is that if the table exists in the database, then it should be in the statistics cache.) But for now, I need to get some sleep. What a long day! Thanks to @winoros without him I don't think I would be able to find the root cause and fix it. Kudos to him. |
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
TestNoNeedIndexStatsLoading
func TestNoNeedIndexStatsLoading(t *testing.T) { store, dom := realtikvtest.CreateMockStoreAndDomainAndSetup(t) tk := testkit.NewTestKit(t, store) tk.MustExec("use test;") tk.MustExec("drop table if exists t;") // 1. Create a table and the statsHandle.Update(do.InfoSchema()) will load this table into the stats cache. tk.MustExec("create table if not exists t(a int, b int, index ia(a));") // 2. Drop the stats of the stats, it will clean up all system table records for this table. tk.MustExec("drop stats t;") // 3. Insert some data and wait for the modify_count and the count is not null in the mysql.stats_meta. tk.MustExec("insert into t value(1,1), (2,2);") h := dom.StatsHandle() require.NoError(t, h.DumpStatsDeltaToKV(true)) require.NoError(t, h.Update(context.Background(), dom.InfoSchema())) // 4. Try to select some data from this table by ID, it would trigger an async load. tk.MustExec("set tidb_opt_objective='determinate';") tk.MustQuery("select * from t where a = 1 and b = 1;").Check(testkit.Rows("1 1")) + require.NoError(t, h.Update(context.Background(), dom.InfoSchema())) table, err := dom.InfoSchema().TableByName(context.Background(), ast.NewCIStr("test"), ast.NewCIStr("t")) require.NoError(t, err) checkTableIDInItems(t, table.Meta().ID) }
tiup playground nightly --db 0
go test -timeout 10m -tags intest -run ^TestNoNeedIndexStatsLoading$ github.com/pingcap/tidb/tests/realtikvtest/statisticstest -count=1
2. What did you expect to see? (Required)
It passed.
3. What did you see instead (Required)
It kept failing.
4. What is your TiDB version? (Required)
master
The text was updated successfully, but these errors were encountered: