Skip to content
This repository has been archived by the owner on Jan 30, 2025. It is now read-only.

Intermittent test failures and flaky tests #427

Closed
inancgumus opened this issue Jun 30, 2022 Discussed in #142 · 9 comments
Closed

Intermittent test failures and flaky tests #427

inancgumus opened this issue Jun 30, 2022 Discussed in #142 · 9 comments
Assignees
Labels
enhancement New feature or request evaluate next Might be eligible for the next planning (not guaranteed!) refactor

Comments

@inancgumus
Copy link
Member

inancgumus commented Jun 30, 2022

For a few months now we've noticed an increase in intermittent test failures (e.g. grafana/k6#4442), as well very frequent flaky tests and CI failures (#272, #254). This has become a major issue for both users and internal devs working on the extension.

From initial work explored in #281 that aimed to fix #272, it became apparent that these issues stem from the racy nature of how we process CDP events, which can trigger one or more internal events. This is explained in detail in this discussion, along with a proposed architecture to address these issues.

An ancillary problem with the current architecture is that these event flows and race conditions are difficult to trace and understand, and since they're intermittent and difficult to test we're never sure if we're introducing more of them as we make changes.

Example

To illustrate an example of a race condition that should be addressed, consider the following scenario from TestPageGoto (failure example in CI build grafana/xk6-browser#1123):

  1. xk6-browser sends the Page.navigate CDP command:

    -> {"id":20,"sessionId":"75F7E22B7AD1085646FE46C053DD6729","method":"Page.navigate","params":{"url":"http://127.0.0.1:40721/static/empty.html","frameId":"AF0723A7356FB62639D7BA483EBF615E"}}
    

    Done here:

    newDocumentID, err := fs.navigateFrame(frame, url, parsedOpts.Referer)

  2. The browser responds with the load Page.lifecycleEvent:

    <- {"method":"Page.lifecycleEvent","params":{"frameId":"E0C85D008DBA499F54F24A284F032911","loaderId":"B16AD01C43E1F176E80ED03C4F35081E","name":"load","timestamp":564.999827},"sessionId":"8CC5472C3955F483AFA9B01CA9ED3B2C"}
    

    ... which we process before we've had the chance to setup the waiter for it here:

    go func() {
    for {
    select {
    case <-evCancelCtx.Done():
    return
    case ev := <-chEvHandler:
    if stringSliceContains(events, ev.typ) {
    if predicateFn != nil {
    if predicateFn(ev.data) {
    ch <- ev.data
    }
    } else {
    ch <- nil
    }

    This leads to us missing the event, and waiting for a second load event which never arrives, ultimately timing out at the default 30s.

This is a very simple scenario, and other more complex ones are common as well. For example, the race condition between calling ElementHandle.click() and Page.waitForNavigation(), or calling Page.waitForSelector() while a hard navigation is in progress, which resets the ExecutionContext, so we end up evaluating the injected.waitForSelector() JS call with a stale ExecutionContext, causing the cryptic Cannot find context with specified id (-32000) error, etc.

Proposed Solution

In the #142 discussion we mentioned removing the internal event-based system in favor of an event-driven finite-state machine. This might be overkill and would introduce additional complexity, so we might want to look into simpler solutions by e.g. synchronizing CDP and internal event processing in such a way that waiters are setup before the CDP event is sent, ensuring we never have a chance to miss an incoming event. This solution was explored in #281, but we had other test issues with that approach and never merged it.

OTOH, targetted solutions like this would have to be done on a case-by-case basis, whereas a general removal of the internal event-based system and introduction of an FSM should work in all cases, so that might be preferable. Some investigation is needed to settle on the better approach.

Another solution (#428) that would improve things is exposing more async APIs. E.g. in the click()/waitForNavigation() example above, if both of these methods returned a Promise, we could document that the user must use Promise.all() to synchronize both calls, in order to setup waitForNavigation() before click() is called. You can see that Playwright mentions this approach in their documentation.

@inancgumus inancgumus added enhancement New feature or request refactor labels Jun 30, 2022
@inancgumus inancgumus added this to the v0.5.0 milestone Jun 30, 2022
@inancgumus inancgumus added the next Might be eligible for the next planning (not guaranteed!) label Jul 4, 2022
@imiric imiric added the evaluate label Jul 4, 2022
@imiric imiric changed the title Architectural Refactoring Intermittent test failures and flaky tests Jul 4, 2022
@inancgumus
Copy link
Member Author

inancgumus commented Jul 5, 2022

...FSM... might be overkill and would introduce additional complexity, so we might want to look into simpler solutions by e.g. synchronizing CDP and internal event processing in such a way that waiters are setup before the CDP event is sent, ensuring we never have a chance to miss an incoming event.

It seems to me that the frame navigations cause most of these issues. What about making a frame-specific halt-the-world event that stops processing the events until preconditions are met (similar to the older Go garbage collector's stop-the-world mechanism—but for a specific frame).

For example, imagine a frame navigates, in that case, we're going to stop receiving events both from the browser and our internals until we set up the relevant event listeners and execution contexts in place. Then we'll unlock the world and let the system components of that frame start processing the events.

Hopefully, maybe we can do it using #428. Like Puppeteer and Playwright, we can put the relevant actions into the same promise. That would hopefully remove the need for a frame-specific barrier.

@imiric
Copy link
Contributor

imiric commented Jul 7, 2022

I've been exploring two potential solutions to this issue. Here's a summary that we can discuss in a meeting I'll schedule for Monday:

  1. A smaller, focused, fix for just the navigation race that involves returning a Promise from FrameManager.NavigateFrame(), which we're already working on as part of Browser v2: Async APIs #428.
    This Promise could be fulfilled once FrameManager.frameNavigated() is called, and we wouldn't need to emit the internal EventFrameNavigation.

    The current indirection in FrameManager.NavigateFrame() of calling FrameSession.navigateFrame() and waiting for EventFrameNavigation, which is actually emitted in the same FrameManager instance(!), is a bit silly...

    I'm not entirely sure if this implementation would work without some experimentation, but I think it's feasible, and would definitely be less total effort than the second approach.

  2. A major refactor of our internal event system, essentially getting rid of most of it, and a minor refactor of how we currently handle CDP events.

    First of all, let's describe the problems with our internal event system:

    • As mentioned in the GH discussion, trying to understand the flow of execution is confusing, since events can be emitted by anything that embeds BaseEventEmitter. As you can see from the table below, that's a lot of sources.

    • In most cases, internal events are mapped to or wrap CDP events. See event_emitter.go. So what we do is 1) receive the CDP event, 2) run our CDP event handlers, 3) emit the internal event, 4) run our internal event handlers. To simplify things, we can just listen to the original CDP event instead and avoid the indirection.

    • In many cases, events are emitted but not consumed by anything. Or they're used as constants only, and not emitted or consumed. Or not used at all, and just placeholders for future functionality. Or they're consumed by the same emitter that produced it. 🤦 See the table below. Either that, or I wasn't able to find where they're produced or consumed, which is again part of the confusion.

    • There are two ways of listening for events:

      1. Registering with on() and reading from the Event channel.

      2. Using createWaitForEventHandler() and waitForEvent().

      We know that the second way is racy—that is what started this issue—, and having two ways of doing things adds to the confusion, as it's often not easy to see where events are produced and consumed. The table below is an attempt of tracking emitters and consumers, but it might be wrong or incomplete because of this.

Internal event emitters (producers) and consumers

Producer Event Consumer(s)
None1 EventBrowserDisconnected Browser
None2 EventBrowserContextClose BrowserContext
BrowserContext EventBrowserContextPage BrowserContext
Connection EventConnectionClose Browser
Target.attachedToTarget (CDP) Browser
Target.detachedFromTarget (CDP) Browser
Frame EventFrameAddLifecycle Frame, FrameManager, Page
EventFrameRemoveLifecycle None
EventPageDOMContentLoaded None
EventPageFrameNavigated FrameSession
EventPageLoad None
FrameManager EventFrameNavigation Barrier, FrameManager, Page
EventPageFrameAttached FrameSession
EventPageFrameDetached FrameSession
EventPageRequest None
EventPageRequestFailed None
EventPageRequestFinished None
EventPageResponse None
FrameSession EventPageError None
Page EventPageClose None
EventPageCrash None
Session EventSessionClosed None
All session CDP events
(Page.lifecycleEvent,
Network.requestWillBeSent, etc.)
FrameSession, NetworkManager
Worker EventWorkerClose None
None EventPageConsole,
EventPageDialog,
EventPageDownload,
EventPageFileChooser,
EventPagePopup,
EventPageWebSocket,
EventPageWorker
None

NOTES:
1: Not actually emitted by anything. We just use it as a constant value in Browser.On().
2: Not actually emitted by anything. We just use it as a constant value in BrowserContext.WaitForEvent().


The general approach of tackling the major refactor would be:

  1. Creation of a top-level cdp package that will have most of the current common.Connection, common.Session and anything else related to CDP communication.

  2. Creation of a cdp.EventWatcher struct that will be in charge of adding listeners and notifying them with the CDP event data. This is inspired by Puppeteer's LifecycleWatcher, and the implementation would be similar to our current BaseEventEmitter. The EventWatcher could be initialized e.g. once per Frame, and each Frame instance would be notified only of events that are meant for it.

  3. Getting rid of the internal event emission system (BaseEventEmitter, etc.). If we don't want to expose CDP events directly to the rest of the system (i.e. it would be good to not have common or other packages import cdproto and other CDP modules directly), we might want to keep our internal events and have a small translation layer between CDP and these internal events.

Even though it would be a much more laborious process than any focused fixes, and it might take us a few weeks to finish, possibly the entire cycle for one person, I think we'd end up with a simpler and overall easier to understand system. We don't need to go to extreme lengths of implementing something unique like an FSM, and considering this is not the approach taken by either Puppeteer nor Playwright, I think it would serve us well to not deviate too far from their work.

@imiric
Copy link
Contributor

imiric commented Jul 7, 2022

@inancgumus I considered your suggestion, but I don't think we want to stop receiving events. You're describing the functionality of a queue, i.e. don't process new events until a previous one completes, which is what channels essentially give us. So, yes, we ultimately want to synchronize in that way, but we can do that by not blocking and using a Promise, as I described above.

That said, I don't think we should do any focused fixes like this, but instead take the time to work on a major refactor that will address this and other race conditions, while ending up with a more maintainable and understandable code base.

@inancgumus
Copy link
Member Author

inancgumus commented Jul 7, 2022

I was suggesting something else, nevertheless, yes, I was also on the side of using Promises to do some part of it 👍

Hopefully, maybe we can do it using #428. Like Puppeteer and Playwright, we can put the relevant actions into the same promise. That would hopefully remove the need for a frame-specific barrier.

And, as I explained in #142, +1 for major refactoring. Anyway, let's discuss the rest of the details in the architectural refactoring and async API meeting 🤝

inancgumus added a commit that referenced this issue Jul 21, 2022
As you can see, the test has been refactored to work with promises.
Click and WaitForNavigation race with each other until the event loop
continues. They race because each runs in a goroutine as a promise.

After several trials, I couldn't make the test PASS. There is always a
timeout due to EventFrameNavigation flakiness bug (Issue #427).

Since the JS examples has been updated (examples/ folder), we can see
whether there is a problem in CI E2E tests. All are updated for the
async click and waitForNavigation.
This was referenced Jul 21, 2022
imiric pushed a commit that referenced this issue Aug 4, 2022
The main change here to fix the `Page.goto()` race condition (see the
example in #427) is removing the `waitForEvent` call after calling the
CDP `Page.navigate`. We already waited for `NavigationEvent` to
determine if it's a navigation within the same document (`chSameDoc`),
but that's not necessary if we get an empty `newDocumentID` back (see
the `loaderId` description in [1]), so we can return early in that case.

It still doesn't eliminate race conditions entirely, as there's a minor
chance we could miss the lifecycle event after we get the
`NavigationEvent` and timeout waiting on `lifecycleEvtCh`, but I haven't
run into that case during my stress testing.

I chose not to call `frame.hasSubtreeLifecycleEventFired()` as that's
racy/buggy itself and results in returning early due to not clearing
previous lifecycle events on time (e.g. returning because 'load' fired
for 'about:blank', but not for the current navigation). We should remove
this and `Frame.recalculateLifecycle()` altogether...

[1]: https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-navigate
@imiric imiric modified the milestones: v0.5.0, v0.6.0 Aug 18, 2022
@imiric
Copy link
Contributor

imiric commented Oct 12, 2022

@grafana/k6browserdev What should we do with this issue? The main race conditions mentioned in the description have now been fixed, and the major refactoring approach discussed above is not a high priority to address this, and has been downscoped from our initial plans.

We do still have intermittent test failures and flaky tests, but it's not clear what the "definition of Done" is for this particular issue.

So I vote to close this, and open new issues for specific flaky behavior. Or we could reuse this issue as a general one for all such behavior, but there's a lot of noise in the discussion above, and we'd need to change the description and add a TODO list, so we might as well open a new issue even in that case.

@mstoykov
Copy link
Contributor

I also prefer to have issues per flaky test (or behaviour) as that way it could be discussed per case what hte solution can be

@ankur22
Copy link
Collaborator

ankur22 commented Oct 12, 2022

So I vote to close this, and open new issues for specific flaky behavior.

I also prefer to have issues per flaky test (or behaviour) as that way it could be discussed per case what hte solution can be

Agreed with both of these 👍

@inancgumus
Copy link
Member Author

IMO, we can close this one and give a reference to this from the new issues when needed.

@imiric
Copy link
Contributor

imiric commented Oct 13, 2022

OK, closing this, and let's open one issue per flaky behavior we find. Though maybe not immediately if it happens once in CI, but if it really becomes a problem.

@imiric imiric closed this as completed Oct 13, 2022
@inancgumus inancgumus closed this as not planned Won't fix, can't repro, duplicate, stale Oct 26, 2022
@inancgumus inancgumus removed this from the v0.6.0 milestone Oct 26, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request evaluate next Might be eligible for the next planning (not guaranteed!) refactor
Projects
None yet
Development

No branches or pull requests

4 participants