Cut ECR costs like a pro: smart strategies for managing your container registry on AWS

Amazon Elastic Container Registry (ECR) is a service that aims to provide a safe, reliable, and scalable serverless infrastructure for hosting your Docker images.

Over the past few years, I have advised leading companies on their transition to AWS and the maintenance of their existing workloads. The vast majority of these companies use AWS ECR as a registry for their Docker images. I frequently spot small mistakes that can cost hundreds of dollars each month. Here are the two key points to focus on to keep costs low.

Create an image tagging strategy

When building a Docker image on your Continuous Integration platform, such as Jenkins or GitLab, it’s important to assess the security level of the image, especially if it’s intended to run in production.

While popular open-source tools like Trivy allow you to scan images locally on the runner during the build process, many companies prefer to use ECR's scanning capabilities, particularly the 'Enhanced Scanning' provided by Amazon Inspector.

This powerful feature extracts the SBOM (Software Bill of Materials) from an image (a detailed list of installed packages and their versions) and cross-references it with Amazon's database to identify vulnerabilities.

However, using this feature requires the development image (e.g., the image built to validate a Merge Request) to be pushed to the repository, as the scan is conducted by AWS infrastructure.

The first strategy is to separate unvalidated images from those ready for use, where the code has been merged and the security scan assessment has been successfully validated. In terms of registry design, my suggestion is as follows:

  • internal/terraform: This namespace is designated for internal use only, indicating that the images stored here are unvalidated and should not be used in production. To clearly signal their purpose, images in this namespace should be prefixed with "scan-", highlighting that they are intended for security scanning. For example: internal/terraform:scan-1.9.
  • terraform: This namespace is reserved for images that have been fully validated and are ready for deployment. An example image name in this namespace would be: terraform:1.9.
👨‍💻
Pro tip: Avoid creating an image manifest for unvalidated images. Doing so can lead to deletion issues, and the pricing documentation isn't clear about the storage costs associated with these manifests.
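
To make the naming convention concrete, here is a minimal Python sketch of how a CI job could derive both image references from a name and a version. The helper names (scan_reference, release_reference, is_unvalidated) and the constants are hypothetical and only illustrate the convention described above; they are not part of any AWS tooling.

```python
# Hypothetical helpers illustrating the two-repository naming convention.
SCAN_REPOSITORY_PREFIX = "internal/"  # unvalidated images live under this prefix
SCAN_TAG_PREFIX = "scan-"             # marks a tag as "pushed for scanning only"


def scan_reference(name: str, version: str) -> str:
    """Reference for an unvalidated image that is pushed only to be scanned."""
    return f"{SCAN_REPOSITORY_PREFIX}{name}:{SCAN_TAG_PREFIX}{version}"


def release_reference(name: str, version: str) -> str:
    """Reference for an image whose code is merged and whose scan passed."""
    return f"{name}:{version}"


def is_unvalidated(reference: str) -> bool:
    """True when a reference points at the scan-only repository."""
    return reference.startswith(SCAN_REPOSITORY_PREFIX)


if __name__ == "__main__":
    print(scan_reference("terraform", "1.9"))     # internal/terraform:scan-1.9
    print(release_reference("terraform", "1.9"))  # terraform:1.9
```

A deployment pipeline could call is_unvalidated as a cheap guard and refuse any reference that points at the scan-only repository.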

Create a repository lifecycle strategy

In this section, we will review an example lifecycle strategy that avoids storing (and paying for) unnecessary images in your registry.

In the previous section, we created two repositories: internal/terraform and terraform.

As the `internal/terraform` repository is only used for scanning and compliance purposes, we don't want to keep images in it. An aggressive lifecycle strategy may look like this:

{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Delete untagged images every day.",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 1
      },
      "action": {
        "type": "expire"
      }
    },
    {
      "rulePriority": 2,
      "description": "Delete tagged images every day.",
      "selection": {
        "tagStatus": "tagged",
        "tagPatternList": ["*"],
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 1
      },
      "action": {
        "type": "expire"
      }
    }
  ]
}
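
If you prefer to generate and attach this policy from a script, the sketch below shows one possible approach in Python. The function names are hypothetical. The only AWS call used is boto3's ECR put_lifecycle_policy, which needs boto3 installed and valid AWS credentials, so it is imported lazily and kept separate from the pure policy builder.

```python
import json


def expire_after_days(priority, description, tag_status, days=1):
    """Build one rule that expires images `days` after they were pushed."""
    selection = {
        "tagStatus": tag_status,
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": days,
    }
    if tag_status == "tagged":
        # Rules on tagged images need a tag filter; "*" matches every tag.
        selection["tagPatternList"] = ["*"]
    return {
        "rulePriority": priority,
        "description": description,
        "selection": selection,
        "action": {"type": "expire"},
    }


def aggressive_policy_text():
    """Return the two-rule policy as the JSON string that ECR expects."""
    rules = [
        expire_after_days(1, "Expire untagged images one day after push.", "untagged"),
        expire_after_days(2, "Expire tagged images one day after push.", "tagged"),
    ]
    return json.dumps({"rules": rules}, indent=2)


def apply_policy(repository_name, policy_text):
    """Attach the policy to a repository (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    ecr = boto3.client("ecr")
    return ecr.put_lifecycle_policy(
        repositoryName=repository_name,
        lifecyclePolicyText=policy_text,
    )


if __name__ == "__main__":
    # Print the policy for review; call apply_policy("internal/terraform", ...)
    # only once you are happy with it.
    print(aggressive_policy_text())
```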

On the other hand, the `terraform` repository is designed to store images used by the teams, so we cannot apply this aggressive strategy to it.

As this repository is configured with mutable tags and our tagging strategy excludes the patch number from SemVer, each new patch release overwrites the image behind its existing tag in the repository. As a best practice, we support the last 5 versions of Terraform:

  • Terraform 1.9
  • Terraform 1.8
  • Terraform 1.7
  • Terraform 1.6
  • Terraform 1.5

When Terraform 1.10 becomes available, support for 1.5 will be dropped in this example.

Based on these requirements, here's the lifecycle policy we used:

{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Delete untagged images every day.",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 1
      },
      "action": {
        "type": "expire"
      }
    },
    {
      "rulePriority": 2,
      "description": "Keep only 5 tagged images, expire all others",
      "selection": {
        "tagStatus": "tagged",
        "tagPatternList": ["*"],
        "countType": "imageCountMoreThan",
        "countNumber": 5
      },
      "action": {
        "type": "expire"
      }
    }
  ]
}
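
To reason about what the count-based rule will remove, the following Python sketch models it locally. It is a simplified simulation, not ECR's implementation: it assumes one image per tag and relies on the fact that imageCountMoreThan ranks images by push time. The function name expired_by_count and all push dates are made up for illustration.

```python
from datetime import datetime, timedelta


def expired_by_count(pushed_at_by_tag, keep=5):
    """Return the tags a 'keep the N most recently pushed' rule would expire."""
    newest_first = sorted(pushed_at_by_tag, key=pushed_at_by_tag.get, reverse=True)
    return sorted(newest_first[keep:])


if __name__ == "__main__":
    start = datetime(2024, 1, 1)  # arbitrary reference date

    # Hypothetical push dates: 1.5 pushed first, 1.9 pushed last.
    pushed = {
        f"1.{minor}": start + timedelta(days=30 * index)
        for index, minor in enumerate(range(5, 10))
    }
    print(expired_by_count(pushed))  # [] because only five images exist

    # Terraform 1.10 is released and pushed: the oldest push (1.5) expires.
    pushed["1.10"] = start + timedelta(days=150)
    print(expired_by_count(pushed))  # ['1.5']

    # Caveat: with mutable tags, rebuilding 1.5 makes it the newest push,
    # so the rule would expire 1.6 instead.
    pushed["1.5"] = start + timedelta(days=160)
    print(expired_by_count(pushed))  # ['1.6']
```

The last case is worth keeping in mind: the rule keeps the five most recently pushed images, which matches the five newest Terraform versions only if unsupported versions are no longer rebuilt.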

CloudWatch metrics

AWS exposes, by default, a CloudWatch metric that tracks the pull count of images in a repository.

This metric, RepositoryPullCount, is available under the namespace AWS/ECR. Only one dimension is available: RepositoryName.

This dimension filters the data that you request for all container images in a specified repository.
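
As an illustration, the metric can be queried with boto3's CloudWatch get_metric_statistics call. The sketch below is one possible way to do it: the function names, the 30-day window, and the daily period are assumptions chosen for the example, and the actual call requires boto3 and AWS credentials. One possible use is to spot repositories that are never pulled and are therefore candidates for a stricter lifecycle policy.

```python
from datetime import datetime, timedelta, timezone


def pull_count_query(repository_name, days=30, now=None):
    """Build the get_metric_statistics arguments for one repository."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ECR",
        "MetricName": "RepositoryPullCount",
        "Dimensions": [{"Name": "RepositoryName", "Value": repository_name}],
        "StartTime": end - timedelta(days=days),
        "EndTime": end,
        "Period": 86400,  # one datapoint per day (assumed granularity)
        "Statistics": ["Sum"],
    }


def total_pulls(repository_name, days=30):
    """Sum the pulls over the window (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the query builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **pull_count_query(repository_name, days)
    )
    return sum(point["Sum"] for point in response["Datapoints"])


if __name__ == "__main__":
    # Inspect the query without calling AWS.
    print(pull_count_query("terraform", days=7))
```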
