Amazon ECS: exclusive use of spot markets reduced our costs by 65%

With this article I continue what I started in the first part, in which I demonstrated how to create an ECS cluster using a mix of on-demand and spot instances. Now we will move on to the exclusive use of spot instances. I will also show the main problems we faced in this step and how we resolved them.

The spot market

The AWS spot market, available since 2009, is a pool of EC2 machines that Amazon is not using at a particular time. You can use a machine until AWS needs it back (which happens in less than 5% of cases according to Amazon), at which point you get 120 seconds of notice before the machine is turned off. The price is very low (discounts of up to 90% compared to on-demand) but varies over time depending on the supply of and demand for that instance type. When requesting an instance, you can either set the maximum price you are willing to pay for it, or accept the market price, knowing that it will never exceed the on-demand price of the same instance.

It is therefore clear that the spot market can save a lot of money compared to on-demand EC2 (up to 90%), without sacrificing flexibility as reserved instances do (they must be reserved for 1 or 3 years). The savings, however, come with the need to manage the volatile nature of the spot environment, a task that is far from impossible, as we will demonstrate in this post using a project we have worked on as an example.

Our challenge was to create ECS clusters in 4 different regions using only the spot market, without being affected by the typical interruptions of spot instances. Although the use of these instances in ECS is widely recommended and documented by Amazon itself, the supporting tools assume that spot machines make up only a percentage of the cluster, leaving a good portion of the compute capacity on on-demand instances anyway. In our case each cluster generally runs on 1 or 2 machines, so a mixed environment would have reduced or cancelled any savings. As we were unable to use the standard tools, we had to “invent” a series of support tools which finally allowed us to use only the spot market.

Let’s move on to the exclusive use of spot instances

Picking up the configuration shown in the previous blog post, where we used both on-demand and spot instances, only a few small changes are needed to switch to spot instances alone. In practice, we remove the instances provided by the EC2 fleet, given that at the moment it is not able to provide only spot instances, and move the entire management of the machines to the autoscaling group.

We therefore modify the two configurations involved:


resource "aws_ec2_fleet" "my_ec2_fleet" {
  # I omit the rest of the configuration which remains identical
  target_capacity_specification {
    default_target_capacity_type = "spot"
    total_target_capacity     = 0
    on_demand_target_capacity = 0
    spot_target_capacity      = 0
  }
}
resource "aws_autoscaling_group" "my_asg" {
  # I omit the rest of the configuration which remains identical
  min_size = 3 # The value previously used in total_target_capacity
}

After applying this configuration with terraform apply, all the cluster instances launched by the EC2 fleet will be turned off and spot instances provided by the ASG will be launched in their place. WARNING: do not do this before you have finished reading this article.

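In fact, it is not all that simple. ECS is unfortunately not optimised to run exclusively on this type of instance, and at present there are two main problems: handling the interruption of a spot instance, and rebalancing the tasks across the instances once a replacement has joined the cluster.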

The tools for spot management

Managing the termination of a spot instance

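By its very nature, a spot instance can be stopped by Amazon at any time. Let’s start by analysing the conditions that lead to an interruption. AWS documents three: the spot price rises above your maximum price, EC2 no longer has enough spare capacity of that instance type, or a constraint in your request (such as a launch group or an Availability Zone group) can no longer be met.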

In each of the three cases, when AWS decides to stop an instance, you are granted 120 seconds before the operating system enters the shutdown phase. During these 2 minutes you need to shut the services down gracefully and save anything stored on the machine, otherwise it will be lost (there is in any case a way to recover data when using non-“volatile” disks, an issue to be addressed at a later date).

The solution we have found is to detect when this interruption starts and to perform a controlled shutdown of the containers. To achieve this we used the EC2 instance metadata, that is, an AWS service that exposes a series of information directly from within the instance through endpoints at the address http://169.254.169.254/<endpoint>. Specifically, what interests us is http://169.254.169.254/latest/meta-data/spot/instance-action, a somewhat peculiar endpoint that is not available during the normal life of the spot instance (it returns HTTP 404), but which responds with a 200 code once the instance is in the interruption phase. The response payload contains the exact shutdown time, but since this is always 120 seconds after the beginning of the interruption it is of little significance.
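As a small illustration of the payload mentioned above, here is a hedged Python sketch that extracts the action and the remaining time. It assumes the body is JSON of the form {"action": "terminate", "time": "2017-09-18T08:22:00Z"}, as in the AWS documentation; the function name parse_instance_action is hypothetical and not part of our tooling.

```python
import json
from datetime import datetime, timezone
from typing import Tuple


def parse_instance_action(payload: str, now: datetime) -> Tuple[str, float]:
    """Return (action, seconds_left) from an instance-action response body.

    Assumption: the body looks like
    {"action": "terminate", "time": "2017-09-18T08:22:00Z"}.
    """
    data = json.loads(payload)
    # The "time" field is a UTC timestamp with a trailing "Z".
    deadline = datetime.strptime(data["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    seconds_left = (deadline - now).total_seconds()
    # Never report a negative amount of time once the deadline has passed.
    return data["action"], max(seconds_left, 0.0)


if __name__ == "__main__":
    sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
    now = datetime(2017, 9, 18, 8, 20, 30, tzinfo=timezone.utc)
    print(parse_instance_action(sample, now))  # ('terminate', 90.0)
```

In practice the HTTP status code alone is enough to trigger the shutdown, so parsing the body is only useful if you want to log the deadline.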

This is a script that checks that endpoint every 5 seconds and shuts down Docker containers in the desired manner:

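#!/bin/bash
# Poll the spot interruption endpoint and gracefully stop our containers.
while true; do
    # 404 during normal operation, 200 once the interruption has started
    CODE=$(curl -LI -o /dev/null -w '%{http_code}\n' -s http://169.254.169.254/latest/meta-data/spot/instance-action)
    if [ "${CODE}" == "200" ]; then
        # Send SIGTERM to every container whose name matches the prefix
        for i in $(docker ps --filter 'name=<container prefix>' -q); do
            docker kill --signal=SIGTERM "${i}"
        done
        sleep 120 # Wait until shutdown
    fi
    sleep 5
done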

This rather simple script runs as a daemon and is launched automatically when the machine starts. It is an infinite loop that checks the return code of the endpoint mentioned above and, when it is 200, gracefully stops all the Docker containers that are relevant to us. Here there are two aspects to note:

The more careful reader should now be asking a question: if we kill the containers on the instance, what guarantees that the cluster will not immediately restart them on the same instance? The observation is correct. The tool we have to avoid this is to change the status of the instance in the cluster from ACTIVE to DRAINING, that is, to tell the cluster not to launch new containers on that machine. This can be done manually by the script, but a recent addition to ECS allows us to do it automatically and, indeed, we have already done so with the echo ECS_ENABLE_SPOT_INSTANCE_DRAINING=true >> /etc/ecs/ecs.config instruction that we inserted in the launch template.
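For reference, the manual route could look roughly like the following Python sketch. It is only an illustration: the function name drain_instances is hypothetical, and the client is passed in as a parameter (in real use it would be boto3.client("ecs")). It relies on the ECS UpdateContainerInstancesState call, which accepts at most 10 container instances per request.

```python
from typing import Any, Iterable, List

# UpdateContainerInstancesState accepts at most 10 instances per call.
MAX_PER_CALL = 10


def drain_instances(ecs_client: Any, cluster: str,
                    instance_arns: Iterable[str]) -> List[str]:
    """Set the given container instances to DRAINING, in batches of 10.

    ecs_client is expected to behave like boto3.client("ecs").
    Returns the list of ARNs that were submitted.
    """
    arns = list(instance_arns)
    drained: List[str] = []
    for start in range(0, len(arns), MAX_PER_CALL):
        batch = arns[start:start + MAX_PER_CALL]
        ecs_client.update_container_instances_state(
            cluster=cluster,
            containerInstances=batch,
            status="DRAINING",
        )
        drained.extend(batch)
    return drained


# Hypothetical usage:
#   import boto3
#   drain_instances(boto3.client("ecs"), "my_cluster", ["<instance arn>"])
```

With ECS_ENABLE_SPOT_INSTANCE_DRAINING enabled, the ECS agent does this for the interrupted instance on its own, so a helper like this is only needed when draining by hand.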

Balancing the load on the instances

When an instance is stopped, the autoscaling group launches a new one. Within a few minutes it is available in the cluster, but in the meantime all the tasks that could fit on the existing instances (that is, those with enough free CPU and memory for what the task definition specifies) have already been launched there. The new machine will therefore be empty or, at most, host the tasks that did not fit elsewhere. As already mentioned, ECS does not provide tools for automatically balancing tasks in the cluster, so we need to do it ourselves. The solution we have found is to set up a scheduled task (a cron) in the cluster that periodically checks the number of tasks on each ACTIVE instance and rebalances them if the difference is excessive.

The configuration of Terraform to set up a scheduled task consists of three elements:

resource "aws_cloudwatch_event_target" "tasks_balancing_cloudwatch_event" {
  rule      = aws_cloudwatch_event_rule.tasks_balancing_rule.name
  target_id = "run_tasks_balancing"
  arn       = aws_ecs_cluster.my_cluster.arn
  role_arn  = "my role"
  ecs_target {
    task_count          = 1
    task_definition_arn = aws_ecs_task_definition.tasks_balancing.arn
  }
}

resource "aws_cloudwatch_event_rule" "tasks_balancing_rule" {
  name                = "tasks_balancing_rule"
  description         = "Run tasks_per_container.py every 10 minutes"
  schedule_expression = "cron(0/10 * ? * * *)"
}

resource "aws_ecs_task_definition" "tasks_balancing" {
  family                = "my-cluster-scheduled-tasks-balancing"
  container_definitions = data.template_file.task_definition_tasks_balancing.rendered
  task_role_arn         = "my_role_arn"
  execution_role_arn    = "my_role_arn"
}

The definition of the container must launch the balancing script upon start-up:

[
    {
      "essential": true,
      "memoryReservation": ${memory},
      "image": "${image}",
      "name": "${name}",
      "command": ["python3", "/app/tasks_per_container.py", "-s", "-b"]
    }
]

NOTE: The Docker image used in image must be appropriately prepared to contain the script and the execution environment.

So let’s examine the essential parts of tasks_per_container.py:

#!/usr/bin/env python3
import datetime
import boto3

def main():
    # parse args...
    ecs_client = boto3.client("ecs", "eu-west-1")
    tasks_list = ecs_client.list_tasks(cluster='my_cluster')
    tasks_per_instance = _tasks_per_instance(
        ecs_client, 'my_cluster', tasks_list['taskArns'])

    # Nothing to do if we only have one instance
    if len(tasks_per_instance) == 1:
        return

    if _is_unbalanced(tasks_per_instance.values()):
        for task in tasks_list['taskArns']:
            ecs_client.stop_task(
                cluster='my_cluster',
                task=task,
                reason='Unbalanced instances' # Shown in the ECS UI and logs
            )

if __name__ == '__main__':
    main()

The main function, in addition to the necessary set-up, retrieves the tasks in the cluster (tasks_list) and counts how many run on each instance. If there is only one instance, the script does nothing; otherwise it checks for an imbalance (_is_unbalanced, in which I use a 30% difference as the limit) and, if there is one, restarts all the tasks. This rebalances the load because, when they all start together, the tasks are naturally spread across the available instances, thanks also to the ordered_placement_strategy configuration in aws_ecs_service, as seen previously.
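To give an idea of what such a check can look like, here is a minimal sketch. It is not the code from the gist: count_tasks and is_unbalanced are hypothetical helpers, the task dictionaries are assumed to have the shape returned by the ECS DescribeTasks call (each with a containerInstanceArn key), and the 30% limit is interpreted here as the relative gap between the busiest and the least busy instance.

```python
from typing import Dict, Iterable


def count_tasks(tasks: Iterable[dict],
                instance_arns: Iterable[str]) -> Dict[str, int]:
    """Count tasks per container instance, including empty instances."""
    counts = {arn: 0 for arn in instance_arns}
    for task in tasks:
        arn = task.get("containerInstanceArn")
        if arn in counts:
            counts[arn] += 1
    return counts


def is_unbalanced(counts: Iterable[int], threshold: float = 0.3) -> bool:
    """True if the least busy instance lags the busiest by more than threshold."""
    values = list(counts)
    if len(values) < 2:
        return False  # nothing to balance with fewer than two instances
    busiest, idlest = max(values), min(values)
    if busiest == 0:
        return False  # no tasks at all
    return (busiest - idlest) / busiest > threshold


if __name__ == "__main__":
    tasks = [{"containerInstanceArn": "old"}] * 6
    counts = count_tasks(tasks, ["old", "new"])
    print(counts, is_unbalanced(counts.values()))
```

Seeding the counts with every instance matters: a freshly launched, empty machine must show up with zero tasks, otherwise the imbalance would go unnoticed.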

The script with the omitted functions is available at https://gist.github.com/tommyblue/daa7be987c972447c7f91fc8c9485274

Replacing an instance without downtime

Sooner or later it will be necessary to replace the instances of the cluster with new versions, typically because the AMI has been updated. In itself the operation is not particularly complicated. Basically, you just turn off the machines manually and they are replaced automatically with new ones that use the new version of the launch template. The problem is that this operation involves downtime of the service, namely the time between the moment the instances are turned off and the moment when, after the new ones have started, all the tasks have been launched again. In our tests this time varies between 4 and 8 minutes, which is unacceptable for our SLA. We have therefore found a solution that allows us to launch all the tasks on the new machines before turning off the old tasks, and thus the old instances, taking advantage of the way autoscaling groups work. Obviously the services provided by the tasks must be able to cope with this overlap (albeit short), but this depends entirely on the application that runs in the containers and is outside the scope of the post.

At the beginning of the post we edited aws_ec2_fleet and aws_autoscaling_group to hand the entire instance management over to the ASG. By its very nature the ASG keeps the number of instances at the configured desired value, so if a machine is turned off, another is launched within a few seconds to replace it. This is fundamental to the steps required for the replacement:

  1. Set all instances of the cluster to the DRAINING state. Nothing will happen, as there are no other instances to move the tasks to.
  2. Double the desired number of instances in the autoscaling group. The new instances join the cluster in the ACTIVE state and, thanks to the previous step, all the tasks are launched on them. As the new tasks start, the corresponding tasks on the DRAINING instances are stopped.
  3. Wait until all the tasks on the old instances have been stopped, which means they are running on the new ones.
  4. Return the desired number of the ASG to its previous value. The ASG shuts down the excess machines, bringing the situation back to its initial state.

This last point deserves a closer look. Simply removing the excess machines does not in itself guarantee that the ones just launched are spared. This is why, in the Terraform configuration of aws_autoscaling_group, I had set termination_policies to ["OldestInstance", "OldestLaunchTemplate"]. I am telling the ASG that, if an instance has to be terminated, the choice must fall on the oldest one and on the one with the oldest launch template. Thanks to this, in the last step the two oldest instances are turned off, which is exactly what we want!

With this set of configurations we managed to replace the instances without the number of running tasks ever dropping below the desired count, i.e. without downtime.

The procedure can easily be carried out by hand, but I created a Python script that automates it. You can find it in this gist.
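Purely as an illustration of the four steps (this is not the script from the gist), the flow could be sketched as follows. The function names are hypothetical, the two clients are assumed to behave like boto3.client("ecs") and boto3.client("autoscaling"), the cluster is assumed to have fewer than 10 instances so that no batching or pagination is needed, and the ASG max_size is assumed to allow twice the desired capacity.

```python
import time
from typing import Any, Callable, List


def _tasks_left(ecs: Any, cluster: str, arns: List[str]) -> int:
    """Running plus pending tasks still hosted on the given instances."""
    resp = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns)
    return sum(ci["runningTasksCount"] + ci["pendingTasksCount"]
               for ci in resp["containerInstances"])


def replace_instances(ecs: Any, autoscaling: Any, cluster: str, asg_name: str,
                      poll_seconds: float = 15.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Replace all cluster instances; returns the restored desired capacity."""
    old = ecs.list_container_instances(
        cluster=cluster, status="ACTIVE")["containerInstanceArns"]
    if not old:
        return 0

    # Step 1: drain every current instance (assumes fewer than 10 of them).
    ecs.update_container_instances_state(
        cluster=cluster, containerInstances=old, status="DRAINING")

    # Step 2: double the desired capacity of the autoscaling group.
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name])["AutoScalingGroups"][0]
    desired = group["DesiredCapacity"]
    if desired * 2 > group["MaxSize"]:
        raise ValueError("MaxSize is too small to double the capacity")
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=asg_name, DesiredCapacity=desired * 2)

    # Step 3: wait until the old instances no longer host any task.
    while _tasks_left(ecs, cluster, old) > 0:
        sleep(poll_seconds)

    # Step 4: restore the capacity; the termination policies decide
    # which instances are removed.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=asg_name, DesiredCapacity=desired)
    return desired
```

A real script would also want a timeout on the waiting loop, so that a task that never stops does not leave the cluster at double capacity indefinitely.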

Speeding up the replacement of interrupted instances

Before concluding the article I want to show one further improvement, which reduces the downtime caused when a spot instance is interrupted. The autoscaling group launches a new instance as soon as it realises that one has been turned off, but this takes about 1 or 2 minutes. To reduce this downtime, the idea is simple: when an instance starts being interrupted, we immediately launch another one.

The best place to do this is the bash script that we created earlier, the one that checks every 5 seconds whether the instance is in the interruption phase:

#!/bin/bash

REGION=$(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document|grep region|awk -F\" '{print $4}')
ASG="<asg_name>"

while true; do
    CODE=$(curl -LI -o /dev/null -w '%{http_code}\n' -s http://169.254.169.254/latest/meta-data/spot/instance-action)
    if [ "${CODE}" == "200" ]; then
        # ...

        # Immediately increase the desired instances in the ASG
        CURRENT_DESIRED=$(aws autoscaling describe-auto-scaling-groups --region "${REGION}" --auto-scaling-group-names "${ASG}" | \
            jq '.AutoScalingGroups | .[0] | .DesiredCapacity')
        NEW_DESIRED=$((CURRENT_DESIRED + 1))
        aws autoscaling set-desired-capacity --region "${REGION}" --auto-scaling-group-name "${ASG}" --desired-capacity "${NEW_DESIRED}"
        # ...
    fi
    sleep 5
done

The script is simple: using the AWS CLI it obtains the desired number of instances in the ASG (jq must be installed on the machine) and increases it by 1. Since the instance being interrupted is in the DRAINING state, as soon as the new instance is up (generally before the 120 seconds of the spot interruption have elapsed) the tasks are moved, completely eliminating the downtime. The only flaw of this solution is that the ASG will have a desired value greater than the standard one, so when the interrupted instance is actually turned off, yet another server will be launched even though it is not needed. However, after some time, the aws_autoscaling_policy configuration will bring the desired value back to the previous number, restoring the normal situation. It is perhaps not a perfect solution, but we managed to eliminate the downtime at the cost of having one instance too many for a few minutes, so the trade-off is definitely in our favour 🙂
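One detail the bash version does not handle is the upper bound of the group: a desired capacity above max_size is rejected by the API. A hedged Python sketch of the same step with that check added could look like this; bump_desired_capacity is a hypothetical name and the client is assumed to behave like boto3.client("autoscaling").

```python
from typing import Any


def bump_desired_capacity(autoscaling: Any, asg_name: str) -> int:
    """Raise the ASG desired capacity by one, without exceeding MaxSize.

    Returns the desired capacity in effect after the call.
    """
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name])["AutoScalingGroups"][0]
    current = group["DesiredCapacity"]
    # Cap the new value so the request is never rejected for being too high.
    new_desired = min(current + 1, group["MaxSize"])
    if new_desired != current:
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=asg_name, DesiredCapacity=new_desired)
    return new_desired


# Hypothetical usage:
#   import boto3
#   bump_desired_capacity(boto3.client("autoscaling"), "<asg_name>")
```

If the group is already at its maximum, the function leaves it untouched and the replacement falls back to the normal ASG behaviour.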

Conclusion

Over the course of these two articles we have looked at how to create an ECS cluster and then how to optimise its configuration to use only EC2 spot instances.

In doing so we also had to implement a series of strategies to minimise any downtime due to the nature of the spot market and, finally, I demonstrated a couple of “tricks” to make the cluster even more stable and controlled.

Although everything is extremely stable (after several months of use we have not had any problems), a further improvement would be to ensure that, in the unfortunate event that none of the spot instance types we have selected is available, the service can still be guaranteed on on-demand instances. I will definitely return to the topic in the future.

For reasons of space, two topics that I would like to discuss soon to complete the picture have also remained outside the sphere of this post, namely:

I would like to conclude by taking a look at the costs and savings achieved with our configurations.

In terms of numbers, just switching to spot instances, compared to on-demand, has saved us an average of 65%.

The entire project, which also includes deploying to multiple regions at once to save on bandwidth and replacing Apache Storm with an application written in Go, resulted in total savings of 82% for the customer.

Remarkable numbers really! We can only consider the project a true success.

Read the article Big data pipeline processing using Amazon ECS