Error java.io.FileNotFoundException makes runner crash

Guillermo Ruiz García May 26, 2025

I'm running a Kubernetes autoscaler to manage the number of available runners at any given moment. The runner pods run on a GKE cluster on Kubernetes version 1.32. The issue is very frustrating: there is roughly a 50% chance of a runner suddenly crashing with the following error:

java.lang.RuntimeException: java.io.FileNotFoundException
	at com.github.dockerjava.netty.NettyInvocationBuilder.get(NettyInvocationBuilder.java:152)
	at com.github.dockerjava.core.exec.InfoCmdExec.exec(InfoCmdExec.java:24)
	at com.github.dockerjava.core.exec.InfoCmdExec.exec(InfoCmdExec.java:14)
	at com.github.dockerjava.core.command.AbstrDockerCmd.exec(AbstrDockerCmd.java:33)
	at com.atlassian.pipelines.runner.core.service.docker.DockerSystemServiceImpl.lambda$getDockerSystemInfo$0(DockerSystemServiceImpl.java:33)
	at io.reactivex.internal.operators.single.SingleFromCallable.subscribeActual(SingleFromCallable.java:44)
	at io.reactivex.Single.subscribe(Single.java:3666)
	at io.reactivex.internal.operators.single.SingleObserveOn.subscribeActual(SingleObserveOn.java:35)
	at io.reactivex.Single.subscribe(Single.java:3666)
	at io.reactivex.internal.operators.single.SingleMap.subscribeActual(SingleMap.java:34)
	at io.reactivex.Single.subscribe(Single.java:3666)
	at io.reactivex.internal.operators.single.SingleDoOnError.subscribeActual(SingleDoOnError.java:35)
	at io.reactivex.Single.subscribe(Single.java:3666)
	at io.reactivex.internal.operators.single.SingleMap.subscribeActual(SingleMap.java:34)
	at io.reactivex.Single.subscribe(Single.java:3666)
	at io.reactivex.internal.operators.completable.CompletableFromSingle.subscribeActual(CompletableFromSingle.java:29)
	at io.reactivex.Completable.subscribe(Completable.java:2309)
	at io.reactivex.internal.operators.completable.CompletableMergeArray.subscribeActual(CompletableMergeArray.java:49)
	at io.reactivex.Completable.subscribe(Completable.java:2309)
	at io.reactivex.internal.operators.mixed.CompletableAndThenObservable.subscribeActual(CompletableAndThenObservable.java:45)
	at io.reactivex.Observable.subscribe(Observable.java:12284)
	at io.reactivex.internal.operators.observable.ObservableFlatMap.subscribeActual(ObservableFlatMap.java:55)
	at io.reactivex.Observable.subscribe(Observable.java:12284)
	at io.reactivex.internal.operators.observable.ObservableFlatMapCompletableCompletable.subscribeActual(ObservableFlatMapCompletableCompletable.java:49)
	at io.reactivex.Completable.subscribe(Completable.java:2309)
	at io.reactivex.internal.operators.completable.CompletableOnErrorComplete.subscribeActual(CompletableOnErrorComplete.java:35)
	at io.reactivex.Completable.subscribe(Completable.java:2309)
	at io.reactivex.Completable.blockingAwait(Completable.java:1226)
	at com.atlassian.pipelines.runner.core.ApplicationImpl.main(ApplicationImpl.java:59)
Caused by: java.io.FileNotFoundException
	at io.netty.channel.unix.Errors.newConnectException0(Errors.java:164)
	at io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:131)
	at io.netty.channel.unix.Socket.connect(Socket.java:351)
	at io.netty.channel.epoll.AbstractEpollChannel.doConnect0(AbstractEpollChannel.java:778)
	at io.netty.channel.epoll.AbstractEpollChannel.doConnect(AbstractEpollChannel.java:763)
	at io.netty.channel.epoll.EpollDomainSocketChannel.doConnect(EpollDomainSocketChannel.java:88)
	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.connect(AbstractEpollChannel.java:602)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1289)
	at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:655)
	at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:634)
	at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.connect(CombinedChannelDuplexHandler.java:495)
	at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:51)
	at io.netty.channel.CombinedChannelDuplexHandler.connect(CombinedChannelDuplexHandler.java:296)
	at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:657)
	at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:634)
	at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:618)
	at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:927)
	at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:264)
	at io.netty.bootstrap.Bootstrap$3.run(Bootstrap.java:264)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:408)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)  
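
For reference, the top of that stack trace is the runner calling the Docker Engine "info" endpoint through docker-java's Netty transport, which connects to the Docker daemon over a unix domain socket (see the EpollDomainSocketChannel frames). Below is a minimal sketch of that same call path, just to make the trace easier to read; the socket path, class name, and builder calls are illustrative assumptions rather than the runner's actual code, and the exact builder API varies between docker-java versions.

import com.github.dockerjava.api.DockerClient;
import com.github.dockerjava.api.model.Info;
import com.github.dockerjava.core.DefaultDockerClientConfig;
import com.github.dockerjava.core.DockerClientBuilder;
import com.github.dockerjava.netty.NettyDockerCmdExecFactory;

public class DockerInfoProbe {
    public static void main(String[] args) {
        // Assumed daemon address; the runner's actual DOCKER_HOST may differ.
        DefaultDockerClientConfig config = DefaultDockerClientConfig.createDefaultConfigBuilder()
                .withDockerHost("unix:///var/run/docker.sock")
                .build();

        // Netty transport, matching the com.github.dockerjava.netty frames in the trace above.
        DockerClient client = DockerClientBuilder.getInstance(config)
                .withDockerCmdExecFactory(new NettyDockerCmdExecFactory())
                .build();

        // Same kind of call as DockerSystemServiceImpl.getDockerSystemInfo(): GET /info on the daemon.
        // If the unix socket cannot be connected to, the connect failure surfaces as the
        // java.io.FileNotFoundException seen in the stack trace above.
        Info info = client.infoCmd().exec();
        System.out.println("Docker server version: " + info.getServerVersion());
    }
}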

When inspecting the runner logs closely, I can see that the error is raised right after the runner sends a request to update its state:

...
[2025-05-15 07:58:22,110] Updating runner state to "ONLINE".
[2025-05-15 07:58:22,125] [e6af3882-6, L:/10.12.1.6:50954 - R:api.atlassian.com/13.35.248.26:443] The connection observed an error, the request cannot be retried as the headers/body were sent io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
[2025-05-15 07:58:22,131] {"traceId":"68259e064d079cd36239b377a79fd552","parentId":"6239b377a79fd552","id":"f82b2418f490544b","kind":"CLIENT","name":"PUT","timestamp":1747295902111673,"duration":19870,"localEndpoint":{"serviceName":"runner","ipv4":"10.12.1.6"},"tags":{"http.method":"PUT","http.path":"/ex/bitbucket-pipelines/rest/internal/accounts/{50e38d04-1187-4665-a670-45319c5c824c}/runners/{4c335687-68f4-53dd-9e95-bad565cae9d0}/state","error":"recvAddress(..) failed: Connection reset by peer; nested exception is io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer"}}
[2025-05-15 07:58:22,839] {"traceId":"68259e064d079cd36239b377a79fd552","parentId":"6239b377a79fd552","id":"9ec238192a7a5f58","kind":"CLIENT","name":"PUT","timestamp":1747295902632302,"duration":206878,"localEndpoint":{"serviceName":"runner","ipv4":"10.12.1.6"},"tags":{"http.method":"PUT","http.path":"/ex/bitbucket-pipelines/rest/internal/accounts/{50e38d04-1187-4665-a670-45319c5c824c}/runners/{4c335687-68f4-53dd-9e95-bad565cae9d0}/state"}}
...

There are no firewalls that could be blocking requests. In the last week alone I have recorded 328 errors like this, and there are never more than 16 runners running at the same time.

Could Atlassian API rate limiting be blocking requests from runners and stopping them suddenly, causing jobs to fail?

Thanks.

1 answer

Patrik S
Atlassian Team
May 27, 2025

Hello @Guillermo Ruiz García, and welcome to the Community!

I confirmed from our internal logs that there were no rate-limit events on your workspace in the last few days, so I don't think your autoscaler is failing due to rate limits in particular.

The "Connection reset by peer" error is usually related to a network issue in the connection of your infrastructure to the Atlassian infrastructure.

If you have any sort of connection filtering/proxy/firewall, it's important to have the following IP ranges allowed for both incoming and outgoing traffic, so runners can communicate with Atlassian infra (a quick way to check an address against these ranges is sketched right after the list):

104.192.136.0/21
185.166.140.0/22
13.200.41.128/25
13.35.248.0/24
13.227.180.0/24
13.227.213.0/24
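
If you want to quickly sanity-check a specific address against those ranges (for example the 13.35.248.26 peer that shows up in your log excerpt), a small sketch along these lines can do the CIDR membership test; the class and method names are only illustrative:

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.ByteBuffer;
import java.util.List;

public class AllowedRangeCheck {

    // IP ranges listed above.
    private static final List<String> RANGES = List.of(
            "104.192.136.0/21",
            "185.166.140.0/22",
            "13.200.41.128/25",
            "13.35.248.0/24",
            "13.227.180.0/24",
            "13.227.213.0/24");

    public static void main(String[] args) {
        // Peer address seen in the runner log excerpt (api.atlassian.com).
        String peer = "13.35.248.26";
        boolean allowed = RANGES.stream().anyMatch(range -> inRange(peer, range));
        System.out.println(peer + " in allowed ranges: " + allowed);
    }

    // True if the IPv4 address falls inside the given CIDR block.
    private static boolean inRange(String address, String cidr) {
        try {
            String[] parts = cidr.split("/");
            int prefix = Integer.parseInt(parts[1]);
            int mask = prefix == 0 ? 0 : -1 << (32 - prefix);
            int network = toInt(InetAddress.getByName(parts[0]));
            int candidate = toInt(InetAddress.getByName(address));
            return (candidate & mask) == (network & mask);
        } catch (UnknownHostException e) {
            return false;
        }
    }

    private static int toInt(InetAddress address) {
        return ByteBuffer.wrap(address.getAddress()).getInt();
    }
}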

If you are still facing the issue after confirming that traffic to those IP ranges is allowed, could you confirm whether this started recently or whether this error has always occurred in this cluster?

Thank you, @Guillermo Ruiz García !

Patrik S

Guillermo Ruiz García June 3, 2025

Hello @Patrik S

I can confirm that there are no firewalls whatsoever in my Kubernetes cluster that could be filtering these connections. I use Cloud NAT, so all outgoing traffic is routed through a Cloud Router (to keep the same source IP address), but again, no filtering or firewall rules are applied there.

This problem has been around since day one of migrating the runners to Kubernetes. What else could it be?

Regards,

Guillermo.

Patrik S
Atlassian Team
June 6, 2025

Hello @Guillermo Ruiz García ,

Even if there are no explicit firewalls, network instability or brief outages can cause connection resets. Consider setting up network monitoring to catch any transient issues and verify that your Cloud NAT and Router configurations are correct and that there are no unintended packet drops or connection limits.
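
For example, a minimal probe along these lines could be run inside the cluster (as a sidecar or one-off pod) to log transient connection failures toward the Atlassian API; the hostname, interval, and class name are placeholders for illustration, not a recommendation of a specific tool:

import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.time.Instant;

public class ConnectivityProbe {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder target and interval; adjust to whatever endpoint your runners talk to.
        String host = "api.atlassian.com";
        int port = 443;
        long intervalMillis = 10_000;

        while (true) {
            long start = System.nanoTime();
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 5_000);
                // Complete a TLS handshake so resets during the handshake are caught as well.
                SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
                try (SSLSocket tls = (SSLSocket) factory.createSocket(socket, host, port, true)) {
                    tls.startHandshake();
                }
                long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("%s OK %s:%d handshake in %d ms%n", Instant.now(), host, port, elapsedMillis);
            } catch (IOException e) {
                // "Connection reset by peer" and similar transient errors will show up here.
                System.err.printf("%s FAIL %s:%d %s%n", Instant.now(), host, port, e);
            }
            Thread.sleep(intervalMillis);
        }
    }
}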

Also, ensure that there are no resource constraints (CPU, memory) in your cluster that might affect the runner's network operations.

Additionally, I'd suggest testing the same setup in a different cluster or cloud provider, if possible, to isolate whether the issue is environment-specific.

Patrik S

 
