Building a Custom AMI for a GitHub Actions Self-hosted runner

In GitHub Actions, GitHub-hosted runners are a convenient resource for creating CI/CD pipelines, especially when a project has just started. Public repositories can use standard runners for free, and private repositories get a certain number of free minutes each month.

But if your builds need more resources and you’re already paying for compute elsewhere, with AWS for example, a self-hosted runner gives you access to larger instances and keeps the compute billing in one place.

I want to provide an overview of the preliminary work needed to build and run a self-hosted runner using an AWS EC2 instance.

Planning

What do I need to build and deploy it?

Similar to a GitHub-hosted runner, a self-hosted runner should be ephemeral. There’s no reason to pay for compute that I’m not using. So I want to create a custom runner AMI to reliably spin up and down a runner.

To build a custom AMI, I need to select the base AMI, then run and configure an EC2 instance from it, and finally create a custom AMI from this configured instance.

The configuration is where I’ll handle the interesting problems.

Any running instance launched from my runner AMI needs to access the GitHub API and the repository that will use the self-hosted runner in workflows. I’m going to create a GitHub personal access token (PAT) stored as an AWS Systems Manager Parameter Store secret to allow runners access to the GitHub API.

Next, I need to figure out how to install the Actions Runner software reproducibly, so I can update the custom AMI or build variations of it as needed.

This is where things get tricky. I know from GitHub that the installation process is manual. The runner can register and deregister itself with a repository, but the deployment instructions on GitHub involve copying a compressed file, decompressing its contents, and manually running it.

So to make the runner start and stop with the instance, I’ll configure the instance with services that start the runner when the instance starts and stop it when the instance is shut down or terminated.

Now all that’s left is adding specific build software that a runner can use in pipelines.

What do I need to delete it?

This is seemingly more straightforward, but there’s a wrinkle. I need to stop the service completely before the EC2 runner is shut down or terminated. Without stopping the service first, the runner won’t deregister from the repository and will remain on the repository’s list of runners with an offline status. I’ll go into more detail about this below.

Optionally, I can deregister the AMI and delete the snapshot, so that I’m not paying for an unused AMI and snapshot.
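The cleanup amounts to two EC2 API calls: deregister the image, then delete the snapshot it references. As a small sketch, here are the CLI invocations built as argument lists (the IDs and region are hypothetical placeholders; in practice the snapshot ID comes from the AMI’s block device mappings in `aws ec2 describe-images` output):

```python
def cleanup_commands(image_id, snapshot_id, region):
    """Build the AWS CLI calls that deregister a runner AMI and then
    delete its backing snapshot.

    The image_id and snapshot_id arguments are illustrative; look them
    up from `aws ec2 describe-images` before running anything.
    """
    return [
        # The AMI must be deregistered before its snapshot can be deleted.
        ["aws", "ec2", "deregister-image",
         "--image-id", image_id, "--region", region],
        ["aws", "ec2", "delete-snapshot",
         "--snapshot-id", snapshot_id, "--region", region],
    ]
```

Note the ordering: EC2 refuses to delete a snapshot while an AMI still references it.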

What will I use to build it?

I’m going to use Packer to build the AMI and then Terraform to deploy and destroy the runner. Packer is really indispensable for this project. It makes configuring and building the AMI painless. The infrastructure itself can be deployed and destroyed with any other IaC tool, such as AWS CDK, Pulumi, or OpenTofu.

Finally, I’m going to create some GitHub Action workflows to test building everything.

Building

Prerequisites

  • A GitHub personal access token
  • A base AMI
  • An app repository

First, I’m going to create (or update) my GitHub PAT for GitHub Actions. Personal access tokens are in the Developer settings section of the account settings. The scopes I’m going to select for this PAT are repo and, under admin:org, read:org and manage_runners:org. This is the minimum the runner needs to access the GitHub API:

github-pat

For the self-hosted runner, I’m going to select an Ubuntu AMI, which is similar to a GitHub-hosted runner.

I can use the AWS Console or CLI with a command such as the one below to list the available AMIs:

aws ec2 describe-images \
  --owners amazon \
  --filters "Name=name,Values=ubuntu/images/*ubuntu*" \
    "Name=creation-date,Values=2024*" \
    "Name=architecture,Values=x86_64" \
    "Name=root-device-type,Values=ebs"
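A name filter like this usually matches many images, and I’ll generally want the most recent one. A small helper sketch for picking it out of the describe-images JSON (since CreationDate is ISO 8601, lexicographic order matches chronological order):

```python
def latest_ami(images):
    """Return the ImageId of the newest image in a describe-images
    "Images" list, comparing the ISO 8601 CreationDate strings."""
    if not images:
        raise ValueError("no images matched the filters")
    return max(images, key=lambda img: img["CreationDate"])["ImageId"]
```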

If you’re not familiar with Packer, this information will be used in the source block of the Packer template.

To test the self-hosted runner, I’ll fork the single-dev-env repository to a private repository and use it as a test app repository. Self-hosted runners should run in private repositories for security.

If you have your own or a different app repository you’d prefer to test the custom runner with, you can skip to the next section.

If you’ve forked the single-dev-env repository, you can delete everything but the main.go file and add the following Dockerfile:

# Build stage
FROM golang:1.22.7 AS builder

WORKDIR /src
COPY . .

RUN CGO_ENABLED=0 go build -o hello main.go

# Image stage
FROM scratch

COPY --from=builder /src/hello /usr/local/bin/hello

ENTRYPOINT [ "hello" ]

Now create the .github/workflows/ directory and add this file to workflows as build.yaml:

name: Build on self-hosted runner

on:
  workflow_dispatch:

jobs:
  build:
    runs-on: self-hosted
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Buildah Action
        uses: redhat-actions/buildah-build@v2
        with:
          image: hello-image
          tags: v1 ${{ github.sha }}
          containerfiles: |
            ./Dockerfile            

The critical line here is runs-on: self-hosted. This tells the job to use my self-hosted runner. I thought it would be an interesting change to test building a container with buildah rather than docker, so I’m using the Buildah Build actions plugin. This will require some additional configuration when I build the AMI below.

Actions Runner code

In the build repository’s Settings, there’s an Actions/Runners subsection where you can find instructions to manually add a new self-hosted runner. GitHub states that adding “a self-hosted runner requires that you download, configure, and execute the GitHub Actions Runner.”

add-runner-instructions

I want to automate this process and use it in its own Actions workflow. To do so, I’m going to create two scripts. One will be a bootstrap script that runs the commands from GitHub’s instructions above. The other will be a runner script to start up and shut down the runner as a service.

The GitHub Actions documentation has a process for configuring a self-hosted runner application as a service, but their process requires the extra step of configuring the runner with a repository before it can generate a service script. I feel that automating the manual steps from the repository’s Settings instructions is more straightforward for this project.

For the bootstrap script, it’s simple enough to use commands similar to GitHub’s instructions above in a bash script, for example:

pkg=actions-runner.tar.gz
pkg_url=https://github.com/actions/runner/releases/download/v2.319.1/actions-runner-linux-x64-2.319.1.tar.gz

mkdir actions-runner && cd actions-runner || exit
curl -o "$pkg" -L "$pkg_url"
tar xzf ./"$pkg"

The runner script only needs to do two things: get the GitHub PAT as a Parameter Store secret and run the runner scripts installed on the runner instance. The instructions on GitHub include a couple of these runner commands as examples:

# Create the runner and start the configuration experience
./config.sh --url https://github.com/<username>/<your-repo> --token <temporary-github-token>
# Last step, run it!
./run.sh

Again this is for manually configuring and running the runner. I want to replicate this in a non-interactive way.

In the Actions Runner package, the Runner.Listener binary offers more commands that can help with this; running bin/Runner.Listener --help shows:

Commands:
 ./config.sh         Configures the runner
 ./config.sh remove  Unconfigures the runner
 ./run.sh            Runs the runner interactively. Does not require any options.

Options:
 --help     Prints the help for each command
 --version  Prints the runner version
 --commit   Prints the runner commit
 --check    Check the runner's network connectivity with GitHub server

Config Options:
 --unattended           Disable interactive prompts for missing arguments. Defaults will be used for missing options
 --url string           Repository to add the runner to. Required if unattended
 --token string         Registration token. Required if unattended
 --name string          Name of the runner to configure (default win-PC)
 --runnergroup string   Name of the runner group to add this runner to (defaults to the default runner group)
 --labels string        Custom labels that will be added to the runner. This option is mandatory if --no-default-labels is used.
 --no-default-labels    Disables adding the default labels: 'self-hosted,Linux,X64'
 --local                Removes the runner config files from your local machine. Used as an option to the remove command
 --work string          Relative runner work directory (default _work)
 --replace              Replace any existing runner with the same name (default false)
 --pat                  GitHub personal access token with repo scope. Used for checking network connectivity when executing `./run.sh --check`
 --disableupdate        Disable self-hosted runner automatic update to the latest released version
 --ephemeral            Configure the runner to only take one job and then let the service un-configure the runner after the job finishes (default false)

Examples:
 Check GitHub server network connectivity:
  ./run.sh --check --url <url> --pat <pat>
 Configure a runner non-interactively:
  ./config.sh --unattended --url <url> --token <token>
 Configure a runner non-interactively, replacing any existing runner with the same name:
  ./config.sh --unattended --url <url> --token <token> --replace [--name <name>]
 Configure a runner non-interactively with three extra labels:
  ./config.sh --unattended --url <url> --token <token> --labels L1,L2,L3

From the examples above, it looks like I can use the --unattended option to configure the runner.

Now I’m going to need to get a registration token from GitHub’s API. I’ll use the GitHub PAT I created for this. GitHub provides instructions for creating a registration token:

curl -L \
  -X POST \
  -H "Accept: application/vnd.github+json" \
  -H "Authorization: Bearer <YOUR-TOKEN>" \
  -H "X-GitHub-Api-Version: 2022-11-28" \
  https://api.github.com/repos/OWNER/REPO/actions/runners/registration-token

In the example above, I’ll replace <YOUR-TOKEN> with the GitHub PAT that I’m going to add as an encrypted secret to Parameter Store.

Before I go into that, just to be clear, I’m using a GitHub personal access token (PAT) to create a self-hosted runner registration token for the app repository.

I can retrieve the PAT on the EC2 instance using an AWS SDK or the AWS CLI in the runner script. Since I want to install the AWS CLI on the runner anyway, I’ll use that.

I’ll use Python for the runner service, so I’ll create two functions to retrieve the two tokens:

import json
import shlex
import subprocess
import sys
import urllib.request


def get_pat():
    """Retrieves the GitHub PAT from AWS SSM."""
    try:
        process = subprocess.run(
            ["ec2metadata", "--availability-zone"], capture_output=True, text=True, check=True)
        aws_region = process.stdout.strip()[:-1]

        command = [
            "aws", "ssm", "get-parameter", "--name", "github_pat", "--with-decryption",
            "--region", aws_region, "--query", "Parameter.Value", "--output", "text"
        ]
        process = subprocess.run(
            command, capture_output=True, text=True, check=True)
        return process.stdout.strip()

    except subprocess.CalledProcessError as e:
        print(f"Error retrieving PAT: {e}")
        sys.exit(1)


def get_runner_token(github_pat, github_repo):
    """Retrieves a GitHub registration token."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{github_repo}/actions/runners/registration-token",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {github_pat}",
            "X-GitHub-Api-Version": "2022-11-28"
        },
        method="POST"
    )
    with urllib.request.urlopen(req) as response:
        return json.load(response)["token"]

and two more functions to configure and remove the runner:

def configure_runner(github_repo, token):
    command = f"./config.sh --unattended --url https://github.com/{github_repo} --token {token} --name ec2-runner"
    subprocess.run(shlex.split(command), check=True)
    subprocess.Popen("./run.sh")


def remove_runner(token):
    command = f"./config.sh remove --token {token}"
    subprocess.run(shlex.split(command), check=True)

Here is the full script.
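These four functions need an entry point that decides whether the invocation should configure and start the runner or remove it. One way to sketch that dispatch, assuming a hypothetical --shutdown flag passed by the shutdown service (this flag is my own convention, not part of the official runner tooling):

```python
import argparse


def parse_mode(argv):
    """Decide whether this invocation should start or remove the runner.

    `--shutdown` is an assumed flag for illustration: the startup
    service calls the script with no arguments, and the shutdown
    service passes --shutdown to trigger deregistration.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--shutdown", action="store_true")
    args = parser.parse_args(argv)
    return "remove" if args.shutdown else "start"
```

With a mode in hand, the script can call get_pat and get_runner_token, then either configure_runner or remove_runner.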

Now I need this script to run when a new runner instance launches and remove itself from the repository’s list of runners when the instance is shut down or terminated.

A systemd service, in particular a systemd user service, is a convenient option for this. You might consider creating a dedicated runner app user for this; I’m just going to use the ubuntu user. I’ll create two unit files: runner.service and shutdown-runner.service.
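I won’t reproduce the exact unit files here, but as a rough sketch of the shape they might take (the paths and the --shutdown flag are assumptions for illustration, not the verbatim units):

```ini
# ~/.config/systemd/user/runner.service (illustrative sketch)
[Unit]
Description=GitHub Actions self-hosted runner

[Service]
WorkingDirectory=/home/ubuntu
ExecStart=/home/ubuntu/runner.py

[Install]
WantedBy=default.target

# ~/.config/systemd/user/shutdown-runner.service (illustrative sketch)
[Unit]
Description=Deregister the runner on shutdown

[Service]
Type=oneshot
RemainAfterExit=yes
# ExecStop fires when the user manager stops the unit during shutdown
ExecStop=/home/ubuntu/runner.py --shutdown

[Install]
WantedBy=default.target
```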

Now I’ll update the bootstrap script to add the runner script and unit files and enable both services at boot:

mv /tmp/runner.py .
chmod +x runner.py
mkdir -p ~/.config/systemd/user/
mv /tmp/*.service ~/.config/systemd/user/

systemctl --user daemon-reload
systemctl --user enable runner.service
systemctl --user enable shutdown-runner.service
loginctl enable-linger ubuntu

The bootstrap script is complete.

The Packer template

If you’re already familiar with Packer, this step is straightforward and can probably be skipped.

I’ll be using three structures: variables, a source block, and a build block, which will include the provisioning blocks that will configure the AMI.

For the variables, I’ll pass the AWS region and the owner/app repository (e.g., OWNER/REPO) at build time:

variable "region" {
  type    = string
  default = "us-west-2"
}

variable "github_repo" {
  type    = string
  default = ""
}

I’ll use a basic source block, similar to what’s in Packer’s documentation. This can be extended, for example, to specify custom networking rather than the default.

The build block contains all the specific configuration for the runner instance, including any build tools you want to install. Again, I’m going to test building containers with buildah rather than docker. This means I need to add a line to /etc/containers/registries.conf for buildah; otherwise, I’ll see an error similar to this when I try to build the app:

Error: creating build container: short-name “golang:1.22.7” did not resolve to an alias and no unqualified-search registries are defined in “/etc/containers/registries.conf”

Here is the build block:

build {
  name    = "ec2-ubuntu-runner"
  sources = ["source.amazon-ebs.ubuntu"]

  provisioner "shell" {
    inline = [
      "sudo apt-get -y update",
      "sudo snap install aws-cli --classic",
      "sudo snap install amazon-ssm-agent --classic",
      "sudo apt-get -y install buildah"
    ]
  }

  provisioner "file" {
    sources     = ["./runner.py", "./runner.service", "./shutdown-runner.service"]
    destination = "/tmp/"
  }

  provisioner "shell" {
    inline = [
      "echo GITHUB_REPO=${var.github_repo} | sudo tee -a /etc/environment",
      "echo 'unqualified-search-registries=[\"docker.io\"]' | sudo tee -a /etc/containers/registries.conf"
    ]
  }

  provisioner "shell" {
    script = "./bootstrap.sh"
  }
}

This build updates the base AMI, installs software for the runner, copies files to the AMI to be used in the bootstrap script, and adds any needed file configuration.

Here is the complete Packer template.

Terraform

Now I can build a custom AMI that should configure itself as a runner to a specific app repository when launched. I’ll confirm this process with a Terraform deployment.

Similar to the Packer template, I’ll use a basic Terraform deployment to create the PAT as a secret and launch an instance.

In my main.tf file, the only thing of note is the aws_ami data source, where I filter for the runner and ubuntu-buildah tags added during the AMI build. By using the filters, I don’t have to worry about the AMI ID changing when the AMI is updated.
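For reference, a sketch of what that data source might look like; the tag keys and values here are assumptions based on the tags described above, and would need to match whatever the Packer build actually sets:

```hcl
data "aws_ami" "runner" {
  most_recent = true
  owners      = ["self"]

  # Assumed tag keys/values, set in the Packer source block's tags
  filter {
    name   = "tag:Name"
    values = ["ubuntu-buildah"]
  }

  filter {
    name   = "tag:Role"
    values = ["runner"]
  }
}
```

Launching the instance from data.aws_ami.runner.id then always picks up the latest build.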

You can find the remaining Terraform code here.

GitHub Actions

It’s possible to test the code above locally, but since I’m creating a custom runner to use in GitHub Actions, it’ll be fun to test there too.

I’m not going to link workflows into pipelines or set up downstream triggers for this, but these workflows should be enough to give us a sense of how everything can be linked together and automated.

But before I start on the first workflow, I should mention the GitHub Actions plugin Configure AWS Credentials that I’m going to use for AWS access. With this plugin, I can use an IAM role and an OpenID Connect provider to access AWS resources. This role uses a trust policy to restrict the access defined in its permissions policies to specific repositories; in this case, this project’s self-hosted runner build repository.

First, I’ll create a workflow to build the AMI and another to deregister it and delete its snapshot.

Now I’ll create two more workflows for the Terraform apply and destroy actions. If you’re familiar with Terraform, the provisioning workflow should be pretty standard.

The Terraform destroy workflow is where things get interesting. I created a service specifically to stop the runner when the instance is stopped or terminated. During testing, I noticed this behavior was inconsistent. Sometimes the runner would be removed and sometimes I’d see the runner remaining in the repository’s runner list with an offline status:

runner-offline

To troubleshoot this, I tweaked the shutdown-runner.service unit file a few times, and then I added an AWS CLI command to trigger an instance stop before executing terraform destroy. This extra step seemed to increase the chances of the runner being removed properly, but it was still hit or miss.

I may be missing something here, but I’m guessing the issue is with the API call AWS makes to terminate an instance: a graceful shutdown isn’t consistently possible with it.

So I replaced the CLI stop-instance step with a Systems Manager command that runs systemctl stop on runner.service, so the pipeline itself is responsible for removing the runner before the runner instance is terminated. This means I needed to update the OIDC role’s permissions to allow the Systems Manager ssm:SendCommand API call.

Now with that step in the workflow a runner is consistently removed from the repository’s list when the EC2 instance is terminated.

Wrapping up

Despite the moving parts and hiccups, I hope this writeup provides general context for what using an EC2 instance as a self-hosted runner involves. I tried to keep the setup general, more an initial test than a finished pipeline, to leave room for the diverse needs of different projects.

What jumps out at me about this project is how minimal the configuration for the custom AMI can be. You don’t need to build AMIs packed with every possible build tool. You also don’t need to open inbound access to the runner instance, since the runner reaches out to GitHub.

So if you’re already paying for AWS compute resources and using GitHub Actions for CI/CD, EC2 instances as self-hosted runners are something to consider.