
[Incident] Windows build of plugins don't start on ci.jenkins.io #4490

Open
gounthar opened this issue Jan 10, 2025 · 39 comments

@gounthar

Service(s)

ci.jenkins.io

Summary

Following my PR on the antexec plugin, https://ci.jenkins.io/job/Plugins/job/antexec-plugin/view/change-requests/job/PR-110/1/ is taking forever to get a Windows VM.

The build has been in progress for 1 hr 22 min and counting.

@timja has observed the same kind of issue.

Reproduction steps

No response

@gounthar gounthar added the triage Incoming issues that need review label Jan 10, 2025
@dduportal dduportal self-assigned this Jan 10, 2025
@dduportal dduportal added this to the infra-team-sync-2025-01-14 milestone Jan 10, 2025
@dduportal dduportal removed the triage Incoming issues that need review label Jan 10, 2025
@dduportal
Contributor

Working on it: this requires blocking all builds on ci.jenkins.io (they are put in the queue).

@dduportal
Contributor

OK, there is an ongoing Azure network outage which is causing these troubles:

[Screenshot 2025-01-10 12:23:39] [Screenshot 2025-01-10 12:28:01]

This outage only appeared in our Azure Service Health console today (10 January) around 10:30 UTC, and it looks like it is not being reported properly (see the map below):

[Screenshot 2025-01-10 12:44:53]

@dduportal
Contributor

[Screenshot 2025-01-10 12:51:52]

@dduportal
Contributor

Since this incident has gone unnoticed by Microsoft for the past 2 days, we most probably won't see a full recovery for a few more days. Let's wait and see, as the team is not available to perform a full region change for the time being and nothing is critically blocked.

@dduportal
Contributor

Also sent an email to the developers mailing list for awareness: https://groups.google.com/g/jenkinsci-dev/c/VQcRiUYu92o

@lemeurherveCB

Workaround for stuck builds: set useContainerAgent to false in your pipelines. Azure VM agents are still spawned successfully since they don't use the impacted ACI service.
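
For reference, a minimal Jenkinsfile sketch of this workaround, assuming a typical plugin that uses the pipeline library's buildPlugin step (the configurations shown are illustrative, not a recommendation):

```groovy
// Hypothetical plugin Jenkinsfile: force Windows builds onto Azure VM agents
// instead of the currently impacted ACI container agents.
buildPlugin(
  useContainerAgent: false, // temporary workaround; revert once ACI recovers
  configurations: [
    [platform: 'linux', jdk: 17],
    [platform: 'windows', jdk: 17],
  ]
)
```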

@rantoniuk

rantoniuk commented Jan 10, 2025

  • Banner added to ci.jenkins.io:
[Screenshot 2025-01-10 12:51:52]

Unfortunately that banner is not visible on the actual build pages; the user needs to be on the main Jenkins site to see it...

It seems like a good feature request candidate :-)

EDIT: feature request created - JENKINS-75122

@lemeurherveCB

Good point. Unfortunately, we don't have anything that allows displaying a top banner on every ci.jenkins.io page, AFAIK.

@jglick

jglick commented Jan 10, 2025

(FTR, noticed earlier in jenkinsci/jenkins-test-harness#893)

@jglick

jglick commented Jan 10, 2025

a full region change

Or just a zonal change? According to the announcement image, other zones in the same region are unaffected.

@jglick

jglick commented Jan 10, 2025

set useContainerAgent to false in your pipelines

That requires temporarily editing the Jenkinsfile and then reverting the edit in each affected plugin. Also, there is no need to touch the Linux configuration, only Windows, IIUC. It would be nicer to developers to temporarily edit buildPlugin.groovy to either suppress Windows ACI builds or quietly switch them to VMs.

@lemeurherveCB

Would be nicer to developers to temporarily edit buildPlugin.groovy to <...> quietly switch them to VMs.

Implemented by @MarkEWaite in jenkins-infra/pipeline-library#898

jenkins-infra/pipeline-library#899 was opened, noting:

plugins that use spotless will fail their builds on a Windows VM when they pass on a Windows container. Switch from container to VM makes that issue visible to more people
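
For illustration only, a hedged sketch of how a shared-library step could quietly route Windows builds to VM agents. This is not the actual change from jenkins-infra/pipeline-library#898, and every label name below is a placeholder:

```groovy
// Hypothetical helper inside a pipeline library step such as buildPlugin.groovy.
// All label names are placeholders, not real ci.jenkins.io labels.
String agentLabelFor(String platform, String jdk, boolean useContainerAgent) {
    if (platform == 'windows') {
        // Temporary mitigation: ignore useContainerAgent on Windows and always
        // target VM agents while the ACI service is impacted.
        return "windows-vm-${jdk}"
    }
    // Linux agents are unaffected, so honour the caller's choice there.
    return useContainerAgent ? "linux-container-${jdk}" : "linux-vm-${jdk}"
}
```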

MarkEWaite added a commit to MarkEWaite/pipeline-library that referenced this issue Jan 10, 2025
jenkins-infra/helpdesk#4490 notes that
Windows builds for plugins that use container agents are not starting.

https://status.jenkins.io/issues/2025-01-08-ci.jenkins.io-azure-outage/
reports the outage is due to an Azure Container Instance (ACI) outage
in the region where we host the ci.jenkins.io agents.
@dduportal dduportal changed the title from "Windows build of plugins don't start on ci" to "[Incident] Windows build of plugins don't start on ci.jenkins.io" on Jan 13, 2025
@dduportal
Contributor

Thanks @MarkEWaite for jenkins-infra/pipeline-library#898, which switches all Windows builds to VMs by default.

Alas, it uncovered 2 issues:

@dduportal
Contributor

dduportal commented Jan 13, 2025

Update:

The incident has been over on the Azure side since 10 January around 20:30 UTC, as per https://azure.status.microsoft/en-us/status/history/ (check Tracking ID: PLP3-1W8).

Details about the Azure outage (quoted from the Azure status history):

Mitigated - Networking issues impacting Azure Services in East US 2

What happened?

Between 22:00 UTC on 08 Jan 2025 and 04:30 UTC on 11 Jan 2025, a networking issue in a single zone in East US 2 resulted in impact to multiple Azure Services in the region. This may have resulted in intermittent Virtual Machine connectivity issues, failures in allocating resources or communicating with resources in the hosted region. The services impacted include but were not limited to, Azure Databricks, Azure Container Apps, Azure Function Apps, Azure App Service, Azure Logic Apps, SQL Managed Instances, Azure Databricks, Azure Synapse, Azure Data Factory, Azure Container Instances, API Management, Azure NetApp Files, DevOps, Azure Stream Analytics, PowerBI, VMSS, PostgreSQL flexible servers, and Azure RedHat Openshift. Customers using resources with Private Endpoint Network Security Groups communicating with other services may have also been impacted.

The impact was limited to a single zone in East US 2 region. No other regions were impacted by this issue.

What went wrong and why?

We determined that a configuration change in our regional networking service resulted in an inconsistent service state with three of the partitions turning unhealthy, which caused requests from multiple services to fail.

How did we respond?

Service monitoring alerted us to this networking issue at 22:00 UTC on 08 Jan 2025, with all the impacted services raising additional alerts as well based on their failure rates. As part of the investigation, it was identified that a network configuration issue in one of the zones resulted in three of the partitions becoming unhealthy causing widespread impact. As an immediate remediation measure, traffic was re-routed away from the impacted zone, which brought some relief to the non-zonal services, and helped with newer allocations. However, services that sent zonal requests to the impacted zone continued to be unhealthy. Some of the impacted services initiated their own Disaster Recovery options to help mitigate some of their impact. For customers impacted due to Private Link, a patch was applied, and we confirmed dependent services were available.

Additional workstreams to rehydrate the impacted zone by bringing the impacted partitions back to a healthy state were completed. To avoid further impact, we validated this fix on one partition, and while we encountered some challenges which required more time than expected to step through the validation process of this fix, causing a delay, the mitigation workstream progressed successfully. We brought two partitions back online, and continued to monitor the health of these partitions, in parallel bringing the final partition back to a healthy state. Once all of the partitions were brought back to a healthy state, we completed an end-to-end validation to ensure that the resources were responding as expected.

By 11:18 UTC on 10 Jan 2025, all three impacted partitions were fully recovered. Following this, we worked with the impacted services to validate mitigation on all of their resources. By 00:44 UTC on 11 Jan 2025, all services confirmed mitigation.

At 00:30 UTC, we initiated a phased approach to rebalance traffic across all of the zones to ensure that the networking traffic is flowing as expected in the region. After monitoring service health, we determined that this incident is fully mitigated at 04:30 UTC on 11 Jan 2025.

The impact for Azure Services varied between the following timelines (the below list doesn't cover all impacted services, but the timelines may be similar):

  • 22:55 UTC on 08 Jan 2025 - 15:00 UTC on 10 Jan 2025 - Azure Databricks
  • 22:55 UTC on 08 Jan 2025 - 15:00 UTC 10 Jan 2025 - Azure OpenAI
  • 22:00 UTC on 08 Jan 2025 - 16:30 UTC on 10 Jan 2025 - Azure Stream Analytics
  • 22:00 UTC on 08 Jan 2025 - 19:15 UTC on 10 Jan 2025 - Azure Database for PostgreSQL flexible servers
  • 22:00 UTC on 08 Jan 2025 - 19:59 UTC on 10 Jan 2025 - Azure NetApp Files
  • 22:00 UTC on 08 Jan 2025 - 20:05 UTC on 10 Jan 2025 - Power BI
  • 22:00 UTC on 08 Jan 2025 - 20:05 UTC on 10 Jan 2025 - Azure Function Apps
  • 22:00 UTC on 08 Jan 2025 - 20:05 UTC on 10 Jan 2025 - Azure App Service
  • 22:00 UTC on 08 Jan 2025 - 20:05 UTC on 10 Jan 2025 - Azure Logic Apps
  • 22:00 UTC on 08 Jan 2025 - 20:20 UTC on 10 Jan 2025 - Azure SQL Managed Instances
  • 22:00 UTC on 08 Jan 2025 - 20:30 UTC on 10 Jan 2025 - Azure Container Instances
  • 22:00 UTC on 08 Jan 2025 - 20:59 UTC on 10 Jan 2025 - Azure Data Explorer
  • 22:00 UTC on 08 Jan 2025 - 21:01 UTC on 10 Jan 2025 - Azure Container Apps
  • 22:00 UTC on 08 Jan 2025 - 21:05 UTC on 10 Jan 2025 - API Management
  • 22:00 UTC on 08 Jan 2025 - 21:12 UTC on 10 Jan 2025 - Azure RedHat OpenShift
  • 22:00 UTC on 08 Jan 2025 - 22:19 UTC on 10 Jan 2025 - Virtual Machine Scale Sets
  • 22:00 UTC on 08 Jan 2025 - 22:25 UTC on 10 Jan 2025 - Azure SQL DB
  • 22:00 UTC on 08 Jan 2025 - 00:44 UTC on 11 Jan 2025 - Azure Synapse Analytics
  • 22:00 UTC on 08 Jan 2025 - 00:44 UTC on 11 Jan 2025 - Azure Data Factory

What happens next?

Our team will be completing an internal retrospective to understand the incident in more detail. We will publish a Preliminary Post Incident Review (PIR) within approximately 72 hours, to share more details on what happened and how we responded. After our internal retrospective is completed, generally within 14 days, we will publish a Final Post Incident Review with any additional details and learnings.
To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts .
For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs .
The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring .
Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness .

We saw successful Windows builds in the past 2 days.

As such:

@dduportal
Contributor

  • Banner added to ci.jenkins.io:
[Screenshot 2025-01-10 12:51:52]

Unfortunately that banner is not visible on the actual build pages, the user needs to be on the main Jenkins site to see it...

And it seems it's a good feature request candidate :-)

Thanks for the suggestion. As explained by Hervé, it is not currently possible. Do not hesitate to open an issue on Jenkins for such a feature request!

@dduportal
Contributor

(FTR, noticed earlier in jenkinsci/jenkins-test-harness#893)

Thanks @jglick! The commit jenkinsci/jenkins-test-harness@87bc0ce in jenkinsci/jenkins-test-harness#893 is inside the network incident window reported by Azure: 22:00 UTC on 08 Jan 2025 - 20:30 UTC on 10 Jan 2025 - Azure Container Instances.
We missed the first 24h (and the early failures) as the team was off Thursday and these failures were not critical enough to trigger PagerDuty :'(

Or just a zonal change? According to the announcement image, other zones in the same region are unaffected.

Absolutely, but the concept of "Zones" in Azure does not exist for Azure Container Instances: it only exists for Azure Container "Groups" (as per https://learn.microsoft.com/en-us/azure/reliability/reliability-containers#availability-zone-support). As such, the Azure Container Agents do not have the concept of availability zones.
So the "zonal" network impact broke the whole ACI service for our usage during the whole timeline.

Requires temporarily editing Jenkinsfile and then reverting the edit in each affected plugin. Also there is no need to touch Linux configuration, only Windows, IIUC. Would be nicer to developers to temporarily edit buildPlugin.groovy to either suppress Windows ACI builds, or quietly switch them to VMs.

Absolutely. @MarkEWaite continued this mitigation effort (as described in #4490 (comment)) as I was short on time. Alas, the team was not really available at that moment: this is an area for future improvement.

@jtnord

jtnord commented Jan 13, 2025

Are things healthy?

https://ci.jenkins.io/job/Plugins/job/kubernetes-credentials-provider-plugin/view/change-requests/job/PR-101/4/execution/node/14/log/ has been unable to obtain a Windows agent, and yet there does not appear to be any load on ci.jenkins.io

@lemeurherveCB

lemeurherveCB commented Jan 13, 2025

We don't, that's a good suggestion.
We should be able to use Datadog to get alerts when it's failing: https://www.datadoghq.com/blog/monitor-jenkins-datadog/

@dduportal
Contributor

Do we have any periodic builds of https://github.com/jenkinsci/jenkins-infra-test-plugin set up to alert admins to various problems? It seems like such a system would have caught this mechanically.

We do have this (the acceptance-tests job in the Infra folder). It did not catch the failure this time because the failure is random and the job always ends up succeeding. It could be improved to use agent allocation time as a metric and alert when the average time changes.
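
A rough scripted-pipeline sketch of that idea; the label and threshold are assumptions, and the real acceptance-tests job may look quite different:

```groovy
// Hedged sketch: measure how long agent allocation takes and fail if it exceeds
// a threshold, instead of only failing when allocation never succeeds at all.
String label = 'maven-17-windows'   // label assumed for illustration
int maxMinutes = 30                 // threshold assumed for illustration
long requested = System.currentTimeMillis()
timeout(time: maxMinutes, unit: 'MINUTES') {
    node(label) {
        long waitedMinutes = (System.currentTimeMillis() - requested).intdiv(60000L)
        echo "Agent with label '${label}' allocated after ~${waitedMinutes} minute(s)"
        // A further improvement could record this duration as a metric and alert
        // when the rolling average drifts from its usual baseline.
    }
}
```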

@dduportal
Contributor

We don't, that's a good suggestion. We should be able to use Datadog to get alerts when it's failing: https://www.datadoghq.com/blog/monitor-jenkins-datadog/

Nope, I don't see how Datadog could catch this.

@MarkEWaite

I thought that the timeout failures reported in https://ci.jenkins.io/job/Infra/job/acceptance-tests/job/check-agent-availability/ would be enough to tell us that there is an issue. It checks every 4 hours and fails the job if each agent type cannot be allocated within 30 minutes.

Unfortunately, that job relies on a human being to check its output, and I rarely check its output. I've added it to my plugin jobs status page in hopes that I will have one more location to see it, in addition to the RSS feed reader where I view those checks.
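
For context, a minimal scripted-pipeline sketch of such a periodic availability check; the schedule and timeout mirror the description above, but the code itself is illustrative rather than the actual job definition:

```groovy
// Hedged sketch of a periodic agent-availability check: run roughly every
// 4 hours and fail if a Windows agent cannot be allocated within 30 minutes.
properties([pipelineTriggers([cron('H H/4 * * *')])])

timeout(time: 30, unit: 'MINUTES') {
    node('maven-17-windows') {   // label assumed for illustration
        // Any trivial command proves the agent was allocated and is usable.
        bat 'ver'
    }
}
```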

@jglick

jglick commented Jan 14, 2025

The Incident is over

jenkinsci/workflow-support-plugin#295 is failing to build on Windows. https://ci.jenkins.io/label/maven-17-windows/ shows several failed nodes.

@dduportal
Contributor

The Incident is over

jenkinsci/workflow-support-plugin#295 is failing to build on Windows. https://ci.jenkins.io/label/maven-17-windows/ shows several failed nodes.

Thanks! We are still in an outage :'(
It is really weird: Azure does not report any error other than:

The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Canceled'.

But not every time 🤦

=> We will most probably need to start moving these workloads to Windows Kubernetes agents ASAP.

@dduportal
Contributor

Ping @mawinter69 @rantoniuk for information, we're working on #4494 following your comments!

@dduportal
Contributor

dduportal commented Jan 14, 2025

Update: the outage is still present (see above messages such as Jesse's).

As such, we are:

[Screenshot 2025-01-14 17:23:22]

@dduportal
Contributor

Update: the outage is still present (see above messages such as Jesse's).

As such, we are:

* Increasing communication about it:
  
  * [azure outage still ongoing status#579](https://github.com/jenkins-infra/status/pull/579)
  * [[ci.jenkins.io] Install `customizable-header` to allow providing system message banner everywhere #4494](https://github.com/jenkins-infra/helpdesk/issues/4494)
  
  [Screenshot 2025-01-14 17:23:22]
* Switching to Windows VMs, not via the pipeline library but with labels: [fix(ci.jenkins.io) disable Windows Container agents (for JDK11, 17 and 21) in favor of Windows VM Agents jenkins-infra#3818](https://github.com/jenkins-infra/jenkins-infra/pull/3818)

Update: at first (superficial) sight, the trick works:

@jglick

jglick commented Jan 14, 2025

I guess there is a fallback to VM-based agents for now? I have been seeing a lot of Windows test flakes in various plugins, like jenkinsci/workflow-job-plugin#499 and jenkinsci/support-core-plugin#612, I suppose due to changes in timing.

@MarkEWaite

MarkEWaite commented Jan 14, 2025

I guess there is a fallback to VM-based agents for now?

Yes, that's correct. We're using VM agents even for builds that request container agents, by assigning the container labels to the virtual machine agents.

@dduportal
Contributor

I guess there is a fallback to VM-based agents for now? I have been seeing a lot of Windows test flakes in various plugins, like jenkinsci/workflow-job-plugin#499 and jenkinsci/support-core-plugin#612, I suppose due to changes in timing.

That is interesting; I would not have expected tests to become flaky, as VMs have more power than container instances (better disks, more memory, and more powerful CPUs). Could this be related to a certain type of parallelization?

@jglick

jglick commented Jan 15, 2025

Some could simply be related to changing timing. jenkinsci/workflow-job-plugin#502 seems to have something to do with the filesystem though I do not understand it. Not a concern for the infra team, just noting it.

@basil
Collaborator

basil commented Jan 18, 2025

Not sure if this is directly relevant, but the Windows test suite for docker-workflow (which was working as of January 6) seems to have started failing, as shown in jenkinsci/docker-workflow-plugin#331. It looks like some sort of architecture mismatch, where the Docker images used in the tests are not available for AArch64.

@dduportal
Contributor

Not sure if this is directly relevant, but the Windows test suite for docker-workflow (which was working as of January 6) seems to have started failing, as shown in jenkinsci/docker-workflow-plugin#331. It looks like some sort of architecture mismatch, where the Docker images used in the tests are not available for AArch64.

Hi @basil, I confirm it is unrelated. I'm commenting in the PR with the results of my first (superficial) check of the build errors.
=> I can't tell whether it is worth another helpdesk issue for now (at first sight it does not seem to be, but I don't mind having one if you feel we should). Thanks for reporting!

@dduportal
Contributor

Update: given how close we are to migrating ci.jenkins.io to AWS, I'm prioritizing work on #4318 to get rid of ACI containers (in favor of Kubernetes Windows containers).

It means we'll keep using Windows VM agents on Azure until the migration to AWS is performed.

@basil
Collaborator

basil commented Jan 21, 2025

Hi @basil, I confirm it is unrelated.

Well, only somewhat unrelated—the test suite was relying on being executed in a container on Windows so that the Docker-based tests would be skipped on Windows and CI builds would pass. That assumption stopped holding now that all Windows-based CI jobs are running as VMs, without any workaround from the infrastructure team—until a recent PR from a user improved the test suite to properly detect this configuration and skip the relevant tests when executed on a Windows VM, restoring the status quo of a green CI build.
