Troubleshooting Worker Connection Issues: "pingresp not received, disconnecting"

Last updated: December 19, 2025

If you're experiencing intermittent worker connection failures with errors such as "pingresp not received, disconnecting" or "context deadline exceeded" when workers try to upload logs, this typically indicates a connectivity issue between your workers and Spacelift's servers.

Common Causes and Solutions

1. Image Digest Mismatches

Using :latest tags can cause workers to pull different image digests over time, letting the launcher, runner, and tunnel components drift onto mismatched versions and triggering protocol mismatches and MQTT reconnection loops.

Solution: Pin your worker images to fixed digests instead of using :latest tags:

  • Pin the launcher, runner, and tunnel images to specific digests

  • Redeploy your worker pools after making these changes
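
As a minimal sketch, you can resolve the digest behind a tag and then reference the image by that digest. The image path below is a placeholder; use the actual launcher, runner, and tunnel images from your worker pool configuration:

docker buildx imagetools inspect ghcr.io/example/spacelift-launcher:latest
# Prints a "Digest: sha256:..." line; then reference the image as:
# ghcr.io/example/spacelift-launcher@sha256:<digest>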

2. Load Balancer Idle Timeout (Azure)

Azure Standard Load Balancer has a 4-minute default outbound idle timeout. Long-lived MQTT/TLS connections can be silently dropped if they sit idle longer than this.

Solution: Increase the outbound idle timeout on your Load Balancer's outbound rule to 30 minutes. You can find this setting in the Azure Portal under your Load Balancer's outbound rules configuration.
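
As a sketch using the Azure CLI (resource names are placeholders for your environment; verify flags against your CLI version):

az network lb outbound-rule update \
  --resource-group <resource-group> \
  --lb-name <load-balancer-name> \
  --name <outbound-rule-name> \
  --idle-timeout 30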

3. Connection Tracking Table Exhaustion

If your Kubernetes nodes are running out of connection tracking entries, this can cause connection drops.

To check: Run these commands on the affected node:

sudo sysctl net.netfilter.nf_conntrack_count   # entries currently tracked
sudo sysctl net.netfilter.nf_conntrack_max     # table capacity
dmesg | grep -i conntrack                      # watch for "table full, dropping packet"

If the count is close to the maximum, you may need to increase the connection tracking table size or distribute the load across more nodes.
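
If you need to raise the limit, a minimal sketch follows (the value is illustrative; size it for your workload and available memory):

sudo sysctl -w net.netfilter.nf_conntrack_max=262144
# Persist across reboots
echo 'net.netfilter.nf_conntrack_max = 262144' | sudo tee /etc/sysctl.d/99-conntrack.conf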

Diagnostic Steps

To help identify the root cause:

  1. Check if failures cluster to specific nodes:

    kubectl get pods -n <namespace> -o wide | grep spacelift-worker

    If failing pods consistently land on the same nodes, this indicates node-specific issues.
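
    To see whether worker pods concentrate on a few nodes, you can count pods per node (the awk column assumes the default -o wide layout, where NODE is the seventh field):

    kubectl get pods -n <namespace> -o wide | grep spacelift-worker | awk '{print $7}' | sort | uniq -c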

  2. Verify networking requirements: Confirm that workers meet all of Spacelift's networking requirements, in particular unrestricted outbound connectivity to Spacelift's endpoints.
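
    For a basic reachability check from a worker node, if nc is available (the hostname and port here are placeholders; consult Spacelift's documentation for your account's exact endpoints):

    nc -vz <your-account>.app.spacelift.io 443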

  3. Check resource constraints: Verify that worker pods have sufficient CPU and memory resources, as resource pressure can manifest as networking issues.
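
    For example, check live resource usage and recent container terminations (kubectl top requires metrics-server; the pod name is a placeholder):

    kubectl top pods -n <namespace>                          # live CPU/memory usage
    kubectl describe pod <worker-pod-name> -n <namespace>    # look for OOMKilled or eviction events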

Additional Considerations

  • Ensure your firewall or proxy doesn't enforce overly aggressive idle timeouts

  • Consider lowering the MQTT keepalive interval to 30-60 seconds if you cannot increase load balancer idle timeouts

  • Check if VCS Agents (if used) are running in the same cluster/subnet as workers, as this can affect connectivity patterns

If issues persist after implementing these solutions, the problem may be related to your specific network configuration or infrastructure setup.