EC2 Worker Constantly Restarting or Disappearing

Last updated: November 7, 2024

Issue: EC2 Worker Constantly Restarting or Disappearing

When EC2 workers in a Spacelift worker pool are constantly restarting or seem to disappear, this can prevent jobs from completing successfully. This guide outlines the steps for investigating and resolving the issue.

Initial Steps: Disable `PowerOffOnError`

If the launchers are constantly restarting, a helpful first step is to temporarily disable the `PowerOffOnError` setting. With it disabled, an EC2 instance stays up after the launcher fails, which gives you time to connect and investigate.

How to Disable:
You can find the PowerOffOnError setting in your Spacelift configuration. For more information, refer to the Spacelift documentation on worker pool settings.

Debugging Steps

Check Logs on the Worker Instance
The most useful logs are located in /var/log/spacelift, which contains two main files:
- info.log: General information and routine operations.
- error.log: Errors encountered by the worker.
Reviewing these files can provide insight into why the workers are restarting or disconnecting.
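
As a minimal sketch of that review, the following Python script prints the last lines of both files. It assumes it runs on the worker instance with read access to /var/log/spacelift. The helper names `tail` and `show_recent` are illustrative and not part of any Spacelift tooling.

```python
from collections import deque
from pathlib import Path

LOG_DIR = Path("/var/log/spacelift")  # directory named in this guide


def tail(path, count=50):
    """Return the last `count` lines of a text file, without trailing newlines."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # deque with maxlen keeps only the final `count` lines while reading.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


def show_recent(log_dir=LOG_DIR, count=50):
    """Print the tail of error.log, then info.log, noting any file that is absent."""
    for name in ("error.log", "info.log"):
        path = Path(log_dir) / name
        if not path.exists():
            print(f"{path}: not found")
            continue
        print(f"==> {path} <==")
        for line in tail(path, count):
            print(line)


if __name__ == "__main__":
    show_recent()
```

Reading error.log first usually narrows the search, and info.log then shows what the worker was doing just before the failure.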

Common Issues to Check
- Worker Pool Token or Private Key Issues:
  Ensure that the worker pool token and private key are correctly configured and base64-encoded if required. Incorrect token or key configurations can prevent the worker from connecting.
  - You can review the blog post here to check if you're using the correct values.
- Network Connectivity:
  Verify that the EC2 instance can connect to Spacelift’s IoT broker. Network issues can often be the root cause of constant restarts.
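
Both checks above can be roughed out in Python. This is a sketch under stated assumptions: `decodes_as_base64` only confirms that a value is well-formed base64, not that it is the right token or key, and `can_connect` only confirms that a TCP connection opens. The broker hostname is not given in this guide, so it is left as a parameter, and port 443 is an assumed default.

```python
import base64
import binascii
import socket


def decodes_as_base64(value):
    """Return True if `value` is non-empty, well-formed base64 (whitespace ignored)."""
    compact = "".join(value.split())  # drop spaces and line breaks
    if not compact:
        return False
    try:
        base64.b64decode(compact, validate=True)
        return True
    except (binascii.Error, ValueError):
        # binascii.Error: bad characters or padding; ValueError: non-ASCII input
        return False


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    The port default is an assumption; pass the broker's host and port explicitly.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, `decodes_as_base64(token)` returning False points to an encoding problem, while `can_connect(broker_host)` returning False on the instance points to security groups, routing, or DNS.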

Using AWS Systems Manager (SSM) for Access
If you are using the default Spacelift-provided AMI, you can connect to the EC2 instance via AWS SSM for easier troubleshooting. SSM enables remote access without needing SSH.
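
One way to open such a session is through the AWS CLI. The sketch below only builds and prints the command; it assumes the AWS CLI and its Session Manager plugin are installed locally and that your credentials allow SSM access. The instance ID is a placeholder and `ssm_session_command` is an illustrative helper name.

```python
import shlex


def ssm_session_command(instance_id, region=None):
    """Build the AWS CLI argument list that starts an SSM session on an instance."""
    cmd = ["aws", "ssm", "start-session", "--target", instance_id]
    if region:
        cmd += ["--region", region]
    return cmd


if __name__ == "__main__":
    # Placeholder instance ID; replace it with the worker's real ID.
    cmd = ssm_session_command("i-0123456789abcdef0")
    print(shlex.join(cmd))
    # To open the session from Python, pass `cmd` to subprocess.run(cmd, check=True).
```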

Reviewing Logs in CloudWatch
If CloudWatch is set up to capture Spacelift logs, you can also review the log groups directly:
- spacelift-errors.log: Contains error logs.
- spacelift-info.log: Contains informational logs.
These CloudWatch log groups may save time by allowing you to access logs without connecting directly to the instance.
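
The same review can be scripted. The sketch below assumes the log group names listed above and a client object that behaves like `boto3.client("logs")`; the client is passed in rather than created, so the helper itself has no dependency on boto3. The name `recent_messages` is illustrative, and only the first page of results is read.

```python
import time


def recent_messages(logs_client, log_group, minutes=60, limit=100):
    """Return messages written to `log_group` during the last `minutes` minutes.

    `logs_client` is expected to expose filter_log_events, as the boto3
    CloudWatch Logs client does. Pagination is not followed.
    """
    # CloudWatch Logs expects timestamps in milliseconds since the epoch.
    start_ms = int((time.time() - minutes * 60) * 1000)
    response = logs_client.filter_log_events(
        logGroupName=log_group,
        startTime=start_ms,
        limit=limit,
    )
    return [event["message"] for event in response.get("events", [])]
```

With boto3 installed and credentials configured, `recent_messages(boto3.client("logs"), "spacelift-errors.log")` would return the most recent error messages for a quick first look.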

Summary

By disabling PowerOffOnError and reviewing logs in /var/log/spacelift or CloudWatch, you can identify the cause of the worker restarts. Pay special attention to worker pool token configurations, private key encoding, and network connectivity, as these are common sources of issues.

If further assistance is needed, feel free to contact Spacelift Support and provide details about any relevant error messages or configurations.