
[Incident] Windows build of plugins don't start on ci.jenkins.io #4490

Open
gounthar opened this issue Jan 10, 2025 · 39 comments

@gounthar

Service(s)

ci.jenkins.io

Summary

Following my PR on the antexec plugin, https://ci.jenkins.io/job/Plugins/job/antexec-plugin/view/change-requests/job/PR-110/1/ is taking forever to get a Windows VM.

The build has been in progress for 1 hr 22 min and counting.

@timja has observed the same kind of issue.

Reproduction steps

No response

@gounthar gounthar added the triage Incoming issues that need review label Jan 10, 2025
@dduportal dduportal self-assigned this Jan 10, 2025
@dduportal dduportal added this to the infra-team-sync-2025-01-14 milestone Jan 10, 2025
@dduportal dduportal removed the triage Incoming issues that need review label Jan 10, 2025
@dduportal
Contributor

Working on it: this requires blocking all builds on ci.jenkins.io (they are put in the queue).

@dduportal
Contributor

OK, there is an ongoing Azure network outage which is causing these troubles:

[Screenshot 2025-01-10 12:23:39] [Screenshot 2025-01-10 12:28:01]

This outage only appeared in our Azure Service Health console today (10 January) around 10:30 UTC, and it looks like it is not being reported properly (see the map below):

[Screenshot 2025-01-10 12:44:53]

@dduportal
Contributor

[Screenshot 2025-01-10 12:51:52]

@dduportal
Contributor

Since this incident has gone unnoticed by Microsoft for the past 2 days, we most probably won't see a full recovery for a few more days. Let's wait and see, as the team is not available to perform a full region change for the time being and nothing is critically blocked.

@dduportal
Contributor

Also sent an email to the developers mailing list for awareness: https://groups.google.com/g/jenkinsci-dev/c/VQcRiUYu92o

@lemeurherveCB

Workaround for stuck builds: set useContainerAgent to false in your pipelines. Azure VM agents are still spawned successfully since they don't use the impacted ACI service.
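
For reference, a minimal Jenkinsfile sketch of this workaround, assuming a typical plugin that uses the pipeline library's buildPlugin step (the configurations shown are illustrative, not a recommendation):

```groovy
// Hypothetical plugin Jenkinsfile: force Windows builds onto Azure VM agents
// instead of the currently impacted ACI container agents.
buildPlugin(
  useContainerAgent: false, // temporary workaround; revert once ACI recovers
  configurations: [
    [platform: 'linux', jdk: 17],
    [platform: 'windows', jdk: 17],
  ]
)
```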

@rantoniuk

rantoniuk commented Jan 10, 2025

  • Banner added to ci.jenkins.io:
[Screenshot 2025-01-10 12:51:52]

Unfortunately that banner is not visible on the actual build pages; the user needs to be on the main Jenkins site to see it...

It seems like a good feature request candidate :-)

EDIT: feature request created - JENKINS-75122

@lemeurherveCB

Good point. Unfortunately, we don't have anything that allows displaying a top banner on every ci.jenkins.io page, AFAIK.

@jglick

jglick commented Jan 10, 2025

(FTR, noticed earlier in jenkinsci/jenkins-test-harness#893)

@jglick

jglick commented Jan 10, 2025

a full region change

Or just a zonal change? According to the announcement image, other zones in the same region are unaffected.

@jglick

jglick commented Jan 10, 2025

set useContainerAgent to false in your pipelines

That requires temporarily editing the Jenkinsfile and then reverting the edit in each affected plugin. Also, there is no need to touch the Linux configuration, only Windows, IIUC. It would be nicer to developers to temporarily edit buildPlugin.groovy to either suppress Windows ACI builds or quietly switch them to VMs.

@lemeurherveCB

Would be nicer to developers to temporarily edit buildPlugin.groovy to <...> quietly switch them to VMs.

Implemented by @MarkEWaite in jenkins-infra/pipeline-library#898

jenkins-infra/pipeline-library#899 was opened, noting:

plugins that use spotless will fail their builds on a Windows VM when they pass on a Windows container. Switch from container to VM makes that issue visible to more people
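
For illustration only, a hedged sketch of how a shared-library step could quietly route Windows builds to VM agents. This is not the actual change from jenkins-infra/pipeline-library#898, and every label name below is a placeholder:

```groovy
// Hypothetical helper inside a pipeline library step such as buildPlugin.groovy.
// All label names are placeholders, not real ci.jenkins.io labels.
String agentLabelFor(String platform, String jdk, boolean useContainerAgent) {
    if (platform == 'windows') {
        // Temporary mitigation: ignore useContainerAgent on Windows and always
        // target VM agents while the ACI service is impacted.
        return "windows-vm-${jdk}"
    }
    // Linux agents are unaffected, so honour the caller's choice there.
    return useContainerAgent ? "linux-container-${jdk}" : "linux-vm-${jdk}"
}
```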

MarkEWaite added a commit to MarkEWaite/pipeline-library that referenced this issue Jan 10, 2025
jenkins-infra/helpdesk#4490 notes that
Windows builds for plugins that use container agents are not starting.

https://status.jenkins.io/issues/2025-01-08-ci.jenkins.io-azure-outage/
reports the outage is due to an Azure Container Instance (ACI) outage
in the region where we host the ci.jenkins.io agents.
@dduportal dduportal changed the title from "Windows build of plugins don't start on ci" to "[Incident] Windows build of plugins don't start on ci.jenkins.io" on Jan 13, 2025
@dduportal
Contributor

Thanks @MarkEWaite for jenkins-infra/pipeline-library#898, which switches all Windows builds to VMs by default.

Alas, it uncovered 2 issues:

@dduportal
Contributor

dduportal commented Jan 13, 2025

Update:

The incident has been over on the Azure side since 10 January around 20:30 UTC, as per https://azure.status.microsoft/en-us/status/history/ (check Tracking ID: PLP3-1W8).

Details about the Azure outage (quoted from the Azure status history):

Mitigated - Networking issues impacting Azure Services in East US 2

What happened?

Between 22:00 UTC on 08 Jan 2025 and 04:30 UTC on 11 Jan 2025, a networking issue in a single zone in East US 2 resulted in impact to multiple Azure Services in the region. This may have resulted in intermittent Virtual Machine connectivity issues, failures in allocating resources or communicating with resources in the hosted region. The services impacted include but were not limited to, Azure Databricks, Azure Container Apps, Azure Function Apps, Azure App Service, Azure Logic Apps, SQL Managed Instances, Azure Databricks, Azure Synapse, Azure Data Factory, Azure Container Instances, API Management, Azure NetApp Files, DevOps, Azure Stream Analytics, PowerBI, VMSS, PostgreSQL flexible servers, and Azure RedHat Openshift. Customers using resources with Private Endpoint Network Security Groups communicating with other services may have also been impacted.

The impact was limited to a single zone in East US 2 region. No other regions were impacted by this issue.

What went wrong and why?

We determined that a configuration change in our regional networking service resulted in an inconsistent service state with three of the partitions turning unhealthy, which caused requests from multiple services to fail.

How did we respond?

Service monitoring alerted us to this networking issue at 22:00 UTC on 08 Jan 2025, with all the impacted services raising additional alerts as well based on their failure rates. As part of the investigation, it was identified that a network configuration issue in one of the zones resulted in three of the partitions becoming unhealthy causing widespread impact. As an immediate remediation measure, traffic was re-routed away from the impacted zone, which brought some relief to the non-zonal services, and helped with newer allocations. However, services that sent zonal requests to the impacted zone continued to be unhealthy. Some of the impacted services initiated their own Disaster Recovery options to help mitigate some of their impact. For customers impacted due to Private Link, a patch was applied, and we confirmed dependent services were available.

Additional workstreams to rehydrate the impacted zone by bringing the impacted partitions back to a healthy state were completed. To avoid further impact, we validated this fix on one partition, and while we encountered some challenges which required more time than expected to step through the validation process of this fix, causing a delay, the mitigation workstream progressed successfully. We brought two partitions back online, and continued to monitor the health of these partitions, in parallel bringing the final partition back to a healthy state. Once all of the partitions were brought back to a healthy state, we completed an end-to-end validation to ensure that the resources were responding as expected.

By 11:18 UTC on 10 Jan 2025, all three impacted partitions were fully recovered. Following this, we worked with the impacted services to validate mitigation on all of their resources. By 00:44 UTC on 11 Jan 2025, all services confirmed mitigation.

At 00:30 UTC, we initiated a phased approach to rebalance traffic across all of the zones to ensure that the networking traffic is flowing as expected in the region. After monitoring service health, we determined that this incident is fully mitigated at 04:30 UTC on 11 Jan 2025.

The impact for Azure Services varied between the following timelines (the below list doesn't cover all impacted services, but the timelines may be similar):

  • 22:55 UTC on 08 Jan 2025 - 15:00 UTC on 10 Jan 2025 - Azure Databricks
  • 22:55 UTC on 08 Jan 2025 - 15:00 UTC 10 Jan 2025 - Azure OpenAI
  • 22:00 UTC on 08 Jan 2025 - 16:30 UTC on 10 Jan 2025 - Azure Stream Analytics
  • 22:00 UTC on 08 Jan 2025 - 19:15 UTC on 10 Jan 2025 - Azure Database for PostgreSQL flexible servers
  • 22:00 UTC on 08 Jan 2025 - 19:59 UTC on 10 Jan 2025 - Azure NetApp Files
  • 22:00 UTC on 08 Jan 2025 - 20:05 UTC on 10 Jan 2025 - Power BI
  • 22:00 UTC on 08 Jan 2025 - 20:05 UTC on 10 Jan 2025 - Azure Function Apps
  • 22:00 UTC on 08 Jan 2025 - 20:05 UTC on 10 Jan 2025 - Azure App Service
  • 22:00 UTC on 08 Jan 2025 - 20:05 UTC on 10 Jan 2025 - Azure Logic Apps
  • 22:00 UTC on 08 Jan 2025 - 20:20 UTC on 10 Jan 2025 - Azure SQL Managed Instances
  • 22:00 UTC on 08 Jan 2025 - 20:30 UTC on 10 Jan 2025 - Azure Container Instances
  • 22:00 UTC on 08 Jan 2025 - 20:59 UTC on 10 Jan 2025 - Azure Data Explorer
  • 22:00 UTC on 08 Jan 2025 - 21:01 UTC on 10 Jan 2025 - Azure Container Apps
  • 22:00 UTC on 08 Jan 2025 - 21:05 UTC on 10 Jan 2025 - API Management
  • 22:00 UTC on 08 Jan 2025 - 21:12 UTC on 10 Jan 2025 - Azure RedHat OpenShift
  • 22:00 UTC on 08 Jan 2025 - 22:19 UTC on 10 Jan 2025 - Virtual Machine Scale Sets
  • 22:00 UTC on 08 Jan 2025 - 22:25 UTC on 10 Jan 2025 - Azure SQL DB
  • 22:00 UTC on 08 Jan 2025 - 00:44 UTC on 11 Jan 2025 - Azure Synapse Analytics
  • 22:00 UTC on 08 Jan 2025 - 00:44 UTC on 11 Jan 2025 - Azure Data Factory

What happens next?

Our team will be completing an internal retrospective to understand the incident in more detail. We will publish a Preliminary Post Incident Review (PIR) within approximately 72 hours, to share more details on what happened and how we responded. After our internal retrospective is completed, generally within 14 days, we will publish a Final Post Incident Review with any additional details and learnings.
To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts .
For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs .
The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring .
Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness .

We saw successful Windows builds in the past 2 days.

As such:

@dduportal
Contributor

  • Banner added to ci.jenkins.io:
[Screenshot 2025-01-10 12:51:52]

Unfortunately that banner is not visible on the actual build pages, the user needs to be on the main Jenkins site to see it...

And it seems it's a good feature request candidate :-)

Thanks for the suggestion. As explained by Hervé, it is not currently possible. Do not hesitate to open an issue on Jenkins for such a feature request!

@dduportal
Contributor

(FTR, noticed earlier in jenkinsci/jenkins-test-harness#893)

Thanks @jglick! The commit jenkinsci/jenkins-test-harness@87bc0ce in jenkinsci/jenkins-test-harness#893 is inside the network incident window reported by Azure: 22:00 UTC on 08 Jan 2025 - 20:30 UTC on 10 Jan 2025 - Azure Container Instances.
We missed the first 24h (and the early failures) as the team was off Thursday and these failures were not critical enough to trigger PagerDuty :'(

Or just a zonal change? According to the announcement image, other zones in the same region are unaffected.

Absolutely, but the concept of "Zones" in Azure does not exist for Azure Container Instances: it only exists for Azure Container "Groups" (as per https://learn.microsoft.com/en-us/azure/reliability/reliability-containers#availability-zone-support). As such, the Azure Container Agents do not have the concept of availability zones.
So the "zonal" network impact broke the whole ACI service for our usage during the whole timeline.

Requires temporarily editing Jenkinsfile and then reverting the edit in each affected plugin. Also there is no need to touch Linux configuration, only Windows, IIUC. Would be nicer to developers to temporarily edit buildPlugin.groovy to either suppress Windows ACI builds, or quietly switch them to VMs.

Absolutely. @MarkEWaite continued this mitigation effort (as described in #4490 (comment)) as I was short on time. Alas, the team was not really available at that moment: this is an area for future improvement.

@jtnord

jtnord commented Jan 13, 2025

Are things healthy?

https://ci.jenkins.io/job/Plugins/job/kubernetes-credentials-provider-plugin/view/change-requests/job/PR-101/4/execution/node/14/log/ has been unable to obtain a Windows agent, and yet there does not appear to be any load on ci.jenkins.io

@lemeurherveCB

lemeurherveCB commented Jan 13, 2025

We don't, that's a good suggestion.
We should be able to use Datadog to get alerts when it's failing: https://www.datadoghq.com/blog/monitor-jenkins-datadog/

@dduportal
Contributor

Do we have any periodic builds of https://github.com/jenkinsci/jenkins-infra-test-plugin set up to alert admins to various problems? It seems like such a system would have caught this mechanically.

We do have this (the acceptance-tests job in the Infra folder). It did not catch the failure this time because the failure is random and the job always ends up succeeding. It could be improved to use agent allocation time as a metric and alert when the average time changes.
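
A rough scripted-pipeline sketch of that idea; the label and threshold are assumptions, and the real acceptance-tests job may look quite different:

```groovy
// Hedged sketch: measure how long agent allocation takes and fail if it exceeds
// a threshold, instead of only failing when allocation never succeeds at all.
String label = 'maven-17-windows'   // label assumed for illustration
int maxMinutes = 30                 // threshold assumed for illustration
long requested = System.currentTimeMillis()
timeout(time: maxMinutes, unit: 'MINUTES') {
    node(label) {
        long waitedMinutes = (System.currentTimeMillis() - requested).intdiv(60000L)
        echo "Agent with label '${label}' allocated after ~${waitedMinutes} minute(s)"
        // A further improvement could record this duration as a metric and alert
        // when the rolling average drifts from its usual baseline.
    }
}
```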

@dduportal
Contributor

We don't, that's a good suggestion. We should be able to use Datadog to get alerts when it's failing: https://www.datadoghq.com/blog/monitor-jenkins-datadog/

Nope, I don't see how Datadog could catch this.

@MarkEWaite

I thought that the timeout failures reported in https://ci.jenkins.io/job/Infra/job/acceptance-tests/job/check-agent-availability/ would be enough to tell us that there is an issue. It checks every 4 hours and fails the job if each agent type cannot be allocated within 30 minutes.

Unfortunately, that job relies on a human being to check its output, and I rarely check its output. I've added it to my plugin jobs status page in hopes that I will have one more location to see it, in addition to the RSS feed reader where I view those checks.
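
For context, a minimal scripted-pipeline sketch of such a periodic availability check; the schedule and timeout mirror the description above, but the code itself is illustrative rather than the actual job definition:

```groovy
// Hedged sketch of a periodic agent-availability check: run roughly every
// 4 hours and fail if a Windows agent cannot be allocated within 30 minutes.
properties([pipelineTriggers([cron('H H/4 * * *')])])

timeout(time: 30, unit: 'MINUTES') {
    node('maven-17-windows') {   // label assumed for illustration
        // Any trivial command proves the agent was allocated and is usable.
        bat 'ver'
    }
}
```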

@jglick

jglick commented Jan 14, 2025

The Incident is over

jenkinsci/workflow-support-plugin#295 is failing to build on Windows. https://ci.jenkins.io/label/maven-17-windows/ shows several failed nodes.

@dduportal
Contributor

The Incident is over

jenkinsci/workflow-support-plugin#295 is failing to build on Windows. https://ci.jenkins.io/label/maven-17-windows/ shows several failed nodes.

Thanks! We are still in an outage :'(
It is really weird: Azure does not report any error other than:

The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Canceled'.

But not every time 🤦

=> We will most probably need to start moving these workloads to Windows Kubernetes agents ASAP.

@dduportal
Contributor

Ping @mawinter69 @rantoniuk for information, we're working on #4494 following your comments!

@dduportal
Contributor

dduportal commented Jan 14, 2025

Update: the outage is still present (see above messages such as Jesse's).

As such, we are:

[Screenshot 2025-01-14 17:23:22]

@dduportal
Contributor

Update: the outage is still present (see above messages such as Jesse's).

As such, we are:

* Increasing communication about it:
  
  * [azure outage still ongoing status#579](https://github.com/jenkins-infra/status/pull/579)
  * [[ci.jenkins.io] Install `customizable-header` to allow providing system message banner everywhere #4494](https://github.com/jenkins-infra/helpdesk/issues/4494)
  
  [Screenshot 2025-01-14 17:23:22]
* Switching to Windows VMs, not via the pipeline library but with labels: [fix(ci.jenkins.io) disable Windows Container agents (for JDK11, 17 and 21) in favor of Windows VM Agents jenkins-infra#3818](https://github.com/jenkins-infra/jenkins-infra/pull/3818)

Update: at first (superficial) sight, the trick works:

@jglick

jglick commented Jan 14, 2025

I guess there is a fallback to VM-based agents for now? I have been seeing a lot of Windows test flakes in various plugins, like jenkinsci/workflow-job-plugin#499 and jenkinsci/support-core-plugin#612, I suppose due to changes in timing.

@MarkEWaite

MarkEWaite commented Jan 14, 2025

I guess there is a fallback to VM-based agents for now?

Yes, that's correct. We're using VM agents even for builds that request container agents, by assigning the container labels to the virtual machine agents.

@dduportal
Contributor

I guess there is a fallback to VM-based agents for now? I have been seeing a lot of Windows test flakes in various plugins, like jenkinsci/workflow-job-plugin#499 and jenkinsci/support-core-plugin#612, I suppose due to changes in timing.

That is interesting; I would not have expected tests to become flaky, as VMs have more power than container instances (better disks, more memory, and more powerful CPUs). Could this be related to a certain type of parallelization?

@jglick

jglick commented Jan 15, 2025

Some could simply be related to changing timing. jenkinsci/workflow-job-plugin#502 seems to have something to do with the filesystem though I do not understand it. Not a concern for the infra team, just noting it.

@basil
Collaborator

basil commented Jan 18, 2025

Not sure if this is directly relevant, but the Windows test suite for docker-workflow (which was working as of January 6) seems to have started failing, as shown in jenkinsci/docker-workflow-plugin#331. It looks like some sort of architecture mismatch, where the Docker images used in the tests are not available for AArch64.

@dduportal
Contributor

Not sure if this is directly relevant, but the Windows test suite for docker-workflow (which was working as of January 6) seems to have started failing, as shown in jenkinsci/docker-workflow-plugin#331. It looks like some sort of architecture mismatch, where the Docker images used in the tests are not available for AArch64.

Hi @basil, I confirm it is unrelated. I'm commenting in the PR with the results of my first (superficial) check of the build errors.
=> I can't tell whether it is worth another helpdesk issue for now (at first sight it does not seem to be, but I don't mind having one if you feel we should). Thanks for reporting!

@dduportal
Contributor

Update: given how close we are to migrating ci.jenkins.io to AWS, I'm prioritizing work on #4318 to get rid of ACI containers (in favor of Kubernetes Windows containers).

It means we'll keep using Windows VM agents on Azure until the migration to AWS is performed.

@basil
Collaborator

basil commented Jan 21, 2025

Hi @basil, I confirm it is unrelated.

Well, only somewhat unrelated—the test suite was relying on being executed in a container on Windows so that the Docker-based tests would be skipped on Windows and CI builds would pass. That assumption stopped holding now that all Windows-based CI jobs are running as VMs, without any workaround from the infrastructure team—until a recent PR from a user improved the test suite to properly detect this configuration and skip the relevant tests when executed on a Windows VM, restoring the status quo of a green CI build.
