Intermittent test failures and flaky tests #427

inancgumus · 2022-06-30T12:13:02Z

For a few months now we've noticed an increase in intermittent test failures (e.g. grafana/k6#4442), as well very frequent flaky tests and CI failures (#272, #254). This has become a major issue for both users and internal devs working on the extension.

From initial work explored in #281 that aimed to fix #272, it became apparent that these issues stem from the racy nature of how we process CDP events, which can trigger one or more internal events. This is explained in detail in this discussion, along with a proposed architecture to address these issues.

An ancillary problem with the current architecture is that these event flows and race conditions are difficult to trace and understand, and since they're intermittent and difficult to test we're never sure if we're introducing more of them as we make changes.

Example

To illustrate an example of a race condition that should be addressed, consider the following scenario from TestPageGoto (failure example in CI build grafana/xk6-browser#1123):

xk6-browser sends the Page.navigate CDP command:

-> {"id":20,"sessionId":"75F7E22B7AD1085646FE46C053DD6729","method":"Page.navigate","params":{"url":"http://127.0.0.1:40721/static/empty.html","frameId":"AF0723A7356FB62639D7BA483EBF615E"}}

Done here:

xk6-browser/common/frame_manager.go

Line 617 in 6baeca0

newDocumentID, err := fs.navigateFrame(frame, url, parsedOpts.Referer)

The browser responds with the load Page.lifecycleEvent:

<- {"method":"Page.lifecycleEvent","params":{"frameId":"E0C85D008DBA499F54F24A284F032911","loaderId":"B16AD01C43E1F176E80ED03C4F35081E","name":"load","timestamp":564.999827},"sessionId":"8CC5472C3955F483AFA9B01CA9ED3B2C"}

... which we process before we've had the chance to setup the waiter for it here:

xk6-browser/common/helpers.go

Lines 156 to 169 in 6baeca0

    
           go func() { 
        
           	for { 
        
           		select { 
        
           		case <-evCancelCtx.Done(): 
        
           			return 
        
           		case ev := <-chEvHandler: 
        
           			if stringSliceContains(events, ev.typ) { 
        
           				if predicateFn != nil { 
        
           					if predicateFn(ev.data) { 
        
           						ch <- ev.data 
        
           					} 
        
           				} else { 
        
           					ch <- nil 
        
           				}

This leads to us missing the event, and waiting for a second load event which never arrives, ultimately timing out at the default 30s.

This is a very simple scenario, and other more complex ones are common as well. For example, the race condition between calling ElementHandle.click() and Page.waitForNavigation(), or calling Page.waitForSelector() while a hard navigation is in progress, which resets the ExecutionContext, so we end up evaluating the injected.waitForSelector() JS call with a stale ExecutionContext, causing the cryptic Cannot find context with specified id (-32000) error, etc.

Proposed Solution

In the #142 discussion we mentioned removing the internal event-based system in favor of an event-driven finite-state machine. This might be overkill and would introduce additional complexity, so we might want to look into simpler solutions by e.g. synchronizing CDP and internal event processing in such a way that waiters are setup before the CDP event is sent, ensuring we never have a chance to miss an incoming event. This solution was explored in #281, but we had other test issues with that approach and never merged it.

OTOH, targetted solutions like this would have to be done on a case-by-case basis, whereas a general removal of the internal event-based system and introduction of an FSM should work in all cases, so that might be preferable. Some investigation is needed to settle on the better approach.

Another solution (#428) that would improve things is exposing more async APIs. E.g. in the click()/waitForNavigation() example above, if both of these methods returned a Promise, we could document that the user must use Promise.all() to synchronize both calls, in order to setup waitForNavigation() before click() is called. You can see that Playwright mentions this approach in their documentation.

The text was updated successfully, but these errors were encountered:

inancgumus · 2022-07-05T09:36:03Z

...FSM... might be overkill and would introduce additional complexity, so we might want to look into simpler solutions by e.g. synchronizing CDP and internal event processing in such a way that waiters are setup before the CDP event is sent, ensuring we never have a chance to miss an incoming event.

It seems to me that the frame navigations cause most of these issues. What about making a frame-specific halt-the-world event that stops processing the events until preconditions are met (similar to the older Go garbage collector's stop-the-world mechanism—but for a specific frame).

For example, imagine a frame navigates, in that case, we're going to stop receiving events both from the browser and our internals until we set up the relevant event listeners and execution contexts in place. Then we'll unlock the world and let the system components of that frame start processing the events.

Hopefully, maybe we can do it using #428. Like Puppeteer and Playwright, we can put the relevant actions into the same promise. That would hopefully remove the need for a frame-specific barrier.

imiric · 2022-07-07T15:53:16Z

I've been exploring two potential solutions to this issue. Here's a summary that we can discuss in a meeting I'll schedule for Monday:

A smaller, focused, fix for just the navigation race that involves returning a Promise from FrameManager.NavigateFrame(), which we're already working on as part of Browser v2: Async APIs #428.
This Promise could be fulfilled once FrameManager.frameNavigated() is called, and we wouldn't need to emit the internal EventFrameNavigation.

The current indirection in FrameManager.NavigateFrame() of calling FrameSession.navigateFrame() and waiting for EventFrameNavigation, which is actually emitted in the same FrameManager instance(!), is a bit silly...

I'm not entirely sure if this implementation would work without some experimentation, but I think it's feasible, and would definitely be less total effort than the second approach.
A major refactor of our internal event system, essentially getting rid of most of it, and a minor refactor of how we currently handle CDP events.

First of all, let's describe the problems with our internal event system:
- As mentioned in the GH discussion, trying to understand the flow of execution is confusing, since events can be emitted by anything that embeds BaseEventEmitter. As you can see from the table below, that's a lot of sources.
- In most cases, internal events are mapped to or wrap CDP events. See event_emitter.go. So what we do is 1) receive the CDP event, 2) run our CDP event handlers, 3) emit the internal event, 4) run our internal event handlers. To simplify things, we can just listen to the original CDP event instead and avoid the indirection.
- In many cases, events are emitted but not consumed by anything. Or they're used as constants only, and not emitted or consumed. Or not used at all, and just placeholders for future functionality. Or they're consumed by the same emitter that produced it. 🤦 See the table below. Either that, or I wasn't able to find where they're produced or consumed, which is again part of the confusion.
- There are two ways of listening for events:
  1. Registering with on() and reading from the Event channel.
  2. Using createWaitForEventHandler() and waitForEvent().
  We know that the second way is racy—that is what started this issue—, and having two ways of doing things adds to the confusion, as it's often not easy to see where events are produced and consumed. The table below is an attempt of tracking emitters and consumers, but it might be wrong or incomplete because of this.

Internal event emitters (producers) and consumers

Producer	Event	Consumer(s)
None¹	`EventBrowserDisconnected`	`Browser`
None²	`EventBrowserContextClose`	`BrowserContext`
`BrowserContext`	`EventBrowserContextPage`	`BrowserContext`
`Connection`	`EventConnectionClose`	`Browser`
	`Target.attachedToTarget` (CDP)	`Browser`
	`Target.detachedFromTarget` (CDP)	`Browser`
`Frame`	`EventFrameAddLifecycle`	`Frame`, `FrameManager`, `Page`
	`EventFrameRemoveLifecycle`	None
	`EventPageDOMContentLoaded`	None
	`EventPageFrameNavigated`	`FrameSession`
	`EventPageLoad`	None
`FrameManager`	`EventFrameNavigation`	`Barrier`, `FrameManager`, `Page`
	`EventPageFrameAttached`	`FrameSession`
	`EventPageFrameDetached`	`FrameSession`
	`EventPageRequest`	None
	`EventPageRequestFailed`	None
	`EventPageRequestFinished`	None
	`EventPageResponse`	None
`FrameSession`	`EventPageError`	None
`Page`	`EventPageClose`	None
	`EventPageCrash`	None
`Session`	`EventSessionClosed`	None
	All session CDP events (`Page.lifecycleEvent`, `Network.requestWillBeSent`, etc.)	`FrameSession`, `NetworkManager`
`Worker`	`EventWorkerClose`	None
None	`EventPageConsole`, `EventPageDialog`, `EventPageDownload`, `EventPageFileChooser`, `EventPagePopup`, `EventPageWebSocket`, `EventPageWorker`	None

NOTES:
1: Not actually emitted by anything. We just use it as a constant value in Browser.On().
2: Not actually emitted by anything. We just use it as a constant value in BrowserContext.WaitForEvent().

The general approach of tackling the major refactor would be:

Creation of a top-level cdp package that will have most of the current common.Connection, common.Session and anything else related to CDP communication.
Creation of a cdp.EventWatcher struct that will be in charge of adding listeners and notifying them with the CDP event data. This is inspired by Puppeteer's LifecycleWatcher, and the implementation would be similar to our current BaseEventEmitter. The EventWatcher could be initialized e.g. once per Frame, and each Frame instance would be notified only of events that are meant for it.
Getting rid of the internal event emission system (BaseEventEmitter, etc.). If we don't want to expose CDP events directly to the rest of the system (i.e. it would be good to not have common or other packages import cdproto and other CDP modules directly), we might want to keep our internal events and have a small translation layer between CDP and these internal events.

Even though it would be a much more laborious process than any focused fixes, and it might take us a few weeks to finish, possibly the entire cycle for one person, I think we'd end up with a simpler and overall easier to understand system. We don't need to go to extreme lengths of implementing something unique like an FSM, and considering this is not the approach taken by either Puppeteer nor Playwright, I think it would serve us well to not deviate too far from their work.

imiric · 2022-07-07T15:54:05Z

@inancgumus I considered your suggestion, but I don't think we want to stop receiving events. You're describing the functionality of a queue, i.e. don't process new events until a previous one completes, which is what channels essentially give us. So, yes, we ultimately want to synchronize in that way, but we can do that by not blocking and using a Promise, as I described above.

That said, I don't think we should do any focused fixes like this, but instead take the time to work on a major refactor that will address this and other race conditions, while ending up with a more maintainable and understandable code base.

inancgumus · 2022-07-07T18:53:31Z

I was suggesting something else, nevertheless, yes, I was also on the side of using Promises to do some part of it 👍

Hopefully, maybe we can do it using #428. Like Puppeteer and Playwright, we can put the relevant actions into the same promise. That would hopefully remove the need for a frame-specific barrier.

And, as I explained in #142, +1 for major refactoring. Anyway, let's discuss the rest of the details in the architectural refactoring and async API meeting 🤝

As you can see, the test has been refactored to work with promises. Click and WaitForNavigation race with each other until the event loop continues. They race because each runs in a goroutine as a promise. After several trials, I couldn't make the test PASS. There is always a timeout due to EventFrameNavigation flakiness bug (Issue #427). Since the JS examples has been updated (examples/ folder), we can see whether there is a problem in CI E2E tests. All are updated for the async click and waitForNavigation.

The main change here to fix the `Page.goto()` race condition (see the example in #427) is removing the `waitForEvent` call after calling the CDP `Page.navigate`. We already waited for `NavigationEvent` to determine if it's a navigation within the same document (`chSameDoc`), but that's not necessary if we get an empty `newDocumentID` back (see the `loaderId` description in [1]), so we can return early in that case. It still doesn't eliminate race conditions entirely, as there's a minor chance we could miss the lifecycle event after we get the `NavigationEvent` and timeout waiting on `lifecycleEvtCh`, but I haven't run into that case during my stress testing. I chose not to call `frame.hasSubtreeLifecycleEventFired()` as that's racy/buggy itself and results in returning early due to not clearing previous lifecycle events on time (e.g. returning because 'load' fired for 'about:blank', but not for the current navigation). We should remove this and `Frame.recalculateLifecycle()` altogether... [1]: https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-navigate

imiric · 2022-10-12T16:32:00Z

@grafana/k6browserdev What should we do with this issue? The main race conditions mentioned in the description have now been fixed, and the major refactoring approach discussed above is not a high priority to address this, and has been downscoped from our initial plans.

We do still have intermittent test failures and flaky tests, but it's not clear what the "definition of Done" is for this particular issue.

So I vote to close this, and open new issues for specific flaky behavior. Or we could reuse this issue as a general one for all such behavior, but there's a lot of noise in the discussion above, and we'd need to change the description and add a TODO list, so we might as well open a new issue even in that case.

mstoykov · 2022-10-12T16:33:48Z

I also prefer to have issues per flaky test (or behaviour) as that way it could be discussed per case what hte solution can be

ankur22 · 2022-10-12T17:36:19Z

So I vote to close this, and open new issues for specific flaky behavior.

I also prefer to have issues per flaky test (or behaviour) as that way it could be discussed per case what hte solution can be

Agreed with both of these 👍

inancgumus · 2022-10-12T18:45:47Z

IMO, we can close this one and give a reference to this from the new issues when needed.

imiric · 2022-10-13T09:04:12Z

OK, closing this, and let's open one issue per flaky behavior we find. Though maybe not immediately if it happens once in CI, but if it really becomes a problem.

inancgumus added enhancement New feature or request refactor labels Jun 30, 2022

inancgumus added this to the v0.5.0 milestone Jun 30, 2022

sniku assigned imiric Jun 30, 2022

inancgumus added the next Might be eligible for the next planning (not guaranteed!) label Jul 4, 2022

imiric added the evaluate label Jul 4, 2022

imiric changed the title ~~Architectural Refactoring~~ Intermittent test failures and flaky tests Jul 4, 2022

inancgumus mentioned this issue Jul 5, 2022

panic: interface conversion: interface is nil, not goja.Value #417

Closed

imiric mentioned this issue Jul 5, 2022

Make log output more concise #429

Merged

This was referenced Jul 21, 2022

Async WaitForNavigation #440

Closed

Async waitForNavigation #467

Merged

imiric mentioned this issue Aug 2, 2022

WIP Event/CDP refactoring #478

Closed

imiric mentioned this issue Aug 4, 2022

Fix Page.goto() race condition #480

Merged

imiric modified the milestones: v0.5.0, v0.6.0 Aug 18, 2022

ankur22 mentioned this issue Oct 4, 2022

Add a queue per handler to keep events in order #555

Merged

imiric closed this as completed Oct 13, 2022

inancgumus closed this as not planned Won't fix, can't repro, duplicate, stale Oct 26, 2022

inancgumus removed this from the v0.6.0 milestone Oct 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Intermittent test failures and flaky tests #427

Intermittent test failures and flaky tests #427

inancgumus commented Jun 30, 2022 •

edited

Loading

inancgumus commented Jul 5, 2022 •

edited

Loading

imiric commented Jul 7, 2022

imiric commented Jul 7, 2022

inancgumus commented Jul 7, 2022 •

edited

Loading

imiric commented Oct 12, 2022

mstoykov commented Oct 12, 2022

ankur22 commented Oct 12, 2022

inancgumus commented Oct 12, 2022

imiric commented Oct 13, 2022

Intermittent test failures and flaky tests #427

Intermittent test failures and flaky tests #427

Comments

inancgumus commented Jun 30, 2022 • edited Loading

Example

Proposed Solution

inancgumus commented Jul 5, 2022 • edited Loading

imiric commented Jul 7, 2022

Internal event emitters (producers) and consumers

imiric commented Jul 7, 2022

inancgumus commented Jul 7, 2022 • edited Loading

imiric commented Oct 12, 2022

mstoykov commented Oct 12, 2022

ankur22 commented Oct 12, 2022

inancgumus commented Oct 12, 2022

imiric commented Oct 13, 2022

inancgumus commented Jun 30, 2022 •

edited

Loading

inancgumus commented Jul 5, 2022 •

edited

Loading

inancgumus commented Jul 7, 2022 •

edited

Loading