We are using a multibranch pipeline, which itself spawns child jobs, to execute 100s (or 1000s) of tasks on Fargate. In general, this works surprisingly well, but we have been seeing occasional “stuck” builds that need to be aborted.
In the build logs, it just says:
Obtained Jenkinsfile from git git@github.com:***
[Pipeline] Start of Pipeline
[Pipeline] node
Still waiting to schedule task
Waiting for next available executor...
Aborted by ***
And from the controller logs:
10:43:28 [id=77]#011INFO#011c.c.j.plugins.amazonecs.ECSCloud#provision: Will provision frontend-builds-swarm-xl-frontend-builds-swarm-xl-jzkv3, for label: frontend-builds-swarm-xl
10:43:38 [id=386]#011INFO#011hudson.slaves.NodeProvisioner#update: frontend-builds-swarm-xl-frontend-builds-swarm-xl-jzkv3 provisioning successfully completed. We have now 195 computer(s)
10:43:39 [id=844559]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#runECSTask: [frontend-builds-swarm-xl-frontend-builds-swarm-xl-jzkv3]: Starting agent with task definition arn:aws:ecs:eu-west-2:***:task-definition/frontend-builds-swarm-xl-ecs:1}
10:43:39 [id=844559]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#runECSTask: [frontend-builds-swarm-xl-frontend-builds-swarm-xl-jzkv3]: Agent started with task arn : arn:aws:ecs:eu-west-2:***:task/frontend-builds/0d265f1fa5d747f0a0d9133986004535
10:43:39 [id=844559]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [frontend-builds-swarm-xl-frontend-builds-swarm-xl-jzkv3]: TaskArn: arn:aws:ecs:eu-west-2:***:task/frontend-builds/0d265f1fa5d747f0a0d9133986004535
10:43:39 [id=844559]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [frontend-builds-swarm-xl-frontend-builds-swarm-xl-jzkv3]: TaskDefinitionArn: arn:aws:ecs:eu-west-2:***:task-definition/frontend-builds-swarm-xl-ecs:1
10:43:39 [id=844559]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [frontend-builds-swarm-xl-frontend-builds-swarm-xl-jzkv3]: ClusterArn: arn:aws:ecs:eu-west-2:***:cluster/frontend-builds
10:43:39 [id=844559]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [frontend-builds-swarm-xl-frontend-builds-swarm-xl-jzkv3]: ContainerInstanceArn: null
10:48:39 [id=844559]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [frontend-builds-swarm-xl-frontend-builds-swarm-xl-jzkv3]: Task started, waiting for agent to become online
10:48:39 [id=844559]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#waitForAgent: [frontend-builds-swarm-xl-frontend-builds-swarm-xl-jzkv3]: Agent connected
10:48:40 [id=843810]#011INFO#011c.c.j.plugins.amazonecs.ECSSlave#_terminate: [frontend-builds-swarm-xl-frontend-builds-swarm-xl-jzkv3]: Stopping: TaskArn arn:aws:ecs:eu-west-2:***:task/frontend-builds/0d265f1fa5d747f0a0d9133986004535, ClusterArn arn:aws:ecs:eu-west-2:***:cluster/frontend-builds
10:48:40 [id=844705]#011INFO#011j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting [#45658] for frontend-builds-swarm-xl-frontend-builds-swarm-xl-jzkv3 terminated: java.nio.channels.ClosedChannelException
10:48:40 [id=844867]#011WARNING#011hudson.model.Executor#resetWorkUnit: Executor #0 for frontend-builds-swarm-xl-frontend-builds-swarm-xl-jzkv3 grabbed hudson.model.queue.WorkUnit@272c14ff[work=part of ...-master » master #208] from queue but frontend-builds-swarm-xl-frontend-builds-swarm-xl-jzkv3 went off-line before the task's worker thread was ready to execute. Termination trace follows:
That looks like the agent was shot in the head after the 5-minute timeout we have set up for that cloud:
clouds:
- ecs:
    ...
    numExecutors: 1
    maxAgents: 100
    retentionTimeout: 5
    retainAgents: false
    taskPollingIntervalInSeconds: 300   # i.e. 5 minutes
    slaveTimeoutInSeconds: 300          # i.e. 5 minutes
However, the agent logs (pulled from CloudWatch using the task ARN):
May 10, 2023 10:44:56 AM hudson.remoting.jnlp.Main createEngine
INFO: Setting up agent: frontend-builds-swarm-xl-frontend-builds-swarm-xl-jzkv3
May 10, 2023 10:44:56 AM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 3107.v665000b_51092
May 10, 2023 10:44:56 AM hudson.remoting.Engine startEngine
WARNING: No Working Directory. Using the legacy JAR Cache location: /home/jenkins/.jenkins/cache/jars
May 10, 2023 10:44:56 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [https://frontend-jenkins.gamevy.com/]
May 10, 2023 10:44:57 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
May 10, 2023 10:44:57 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Agent discovery successful
Agent address: ***
Agent port: 5000
Identity: 1a:11:35:a1:0d:37:04:bc:9b:e9:f4:18:35:0f:0c:5d
May 10, 2023 10:44:57 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
May 10, 2023 10:44:57 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to frontend-jenkins.gamevy.com:5000
May 10, 2023 10:44:57 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP4-connect
May 10, 2023 10:44:57 AM org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run
INFO: Waiting for ProtocolStack to start.
May 10, 2023 10:44:57 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Remote identity confirmed: 1a:11:35:a1:0d:37:04:bc:9b:e9:f4:18:35:0f:0c:5d
May 10, 2023 10:44:57 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
May 10, 2023 10:48:40 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated
suggest that the agent had in fact connected successfully several minutes earlier.
A bit of code spelunking revealed that even though the plugin spin-waits once a second for the agent to connect, it doesn’t start that wait until it thinks the task has started, and it only learns that when it next polls ECS for the task’s status. We had, somewhat foolishly, set the polling interval to the same value as the timeout (after a few earlier freezes that we think were caused by AWS rate limiting).
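Putting the timestamps from the logs above together, our reading of the sequence is roughly:

10:43:39  the controller launches the ECS task and starts polling its status every 300 s
10:44:57  the agent container comes up and connects back to the controller
10:48:39  the first poll finally reports the task as started, so the launcher only now begins waiting for the agent
10:48:39  by this point the 5-minute agent timeout has also expired
10:48:40  the agent is terminated and its work unit is handed back to the queue, leaving the build stuck waiting for an executor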
Since we reduced the polling interval to 2m30s, the problem seems to be resolved!
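For reference, the change amounts to something like this in the cloud config above (150 s is just the value we picked; the point is that it is comfortably less than slaveTimeoutInSeconds):

clouds:
- ecs:
    ...
    taskPollingIntervalInSeconds: 150   # 2m30s, well under the 300 s agent timeout
    slaveTimeoutInSeconds: 300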