I hadn’t been at KCOM long when I got asked to work on a new project that was coming in.
The project involved building and hosting the AWS infrastructure for a third-party application. The application pulls together and interprets large streams of data and makes the outputs available to and useable by multiple user groups.
We already knew we were going to be using the following technologies:
- EC2
- Kafka
- Docker swarm
EC2 is the simplest way to host compute servers in AWS and Kafka had already been specified by the client as their preferred messaging service.
But why use Docker swarm and not Elastic Container Service (ECS)? We’d used ECS before; it’s great and we have existing patterns for it, but unfortunately the specifics of the third-party application made it impossible here. Primarily:
- The need to run MQIPT (MQ Internet Pass-Thru) on the host to allow docker containers to establish a secure, remote connection to message queues
- Weaveworks overlay networking and encryption are required, and these can’t easily be transferred to ECS
As soon as the project started, we had to hit the ground running: we received notification that an intermediate release was already on its way, so we worked through the week to bring up an intermediate stack consisting of Kafka and Docker (non-swarm).
That was simple enough. After writing some integration scripts, we got Packer to bake in the Kafka and Docker RPMs, then used a separate bash script, called from userdata, to complete the installation. It all looked good.
A few weeks later, the final application arrived. Initially, we had expected some basic integration scripts. Unfortunately, the change between the intermediate and final release had been drastic and instead we received three separate deployment guides consisting of over 50 pages of instructions!
After digesting those instructions, we set about the process of deploying test stacks. However, the constant back and forth between the deployment guides meant we kept introducing manual errors. After a few failed attempts, we finally got a stack up and running. This was just the first staging environment, and we needed two more, each with different configurations! So we tried again to build manually, fine tuning the instructions. We got it right, but it took hours to complete and there was no guarantee we could do it again.
We approached the third party developer who built the application to see if they had any scripts to automate the deployment. Unfortunately, they hadn’t tried and said it may not be possible. But we thought we’d give it a go because the alternative was pretty horrendous!
Let's get automating!
Failure 1 - Packer, userdata scripts and service discovery
We decided to try the original approach using Packer and userdata scripts. However, it quickly became apparent that we needed some sort of service discovery, which was not supported at the time. Also, it turned out that bash scripts are not easy to make idempotent.
Failure 2 - Systems Manager and Step Functions
The next idea was to auto-generate Systems Manager documents from our bash scripts. We wrote Lambda functions to execute each Systems Manager Run Command and keep track of it using the ID returned for the invocation, and used Step Functions to orchestrate those Lambda functions.
Initial testing looked good and we managed to bring up the stack in the correct order. The ZooKeeper cluster had formed, brokers were up and topics were loaded. Docker swarm and the containers were all running as expected.
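As a rough illustration of that pattern (not the project's actual code), the two Lambda handlers below start a Run Command and then poll it, so that a Step Functions state machine could loop on the second one until it finishes. The event keys (`instance_id`, `document_name`, `parameters`) and the optional `ssm` argument are assumptions made for this sketch; `send_command` and `get_command_invocation` are the boto3 Systems Manager calls, and the ID they exchange is called `CommandId` in that API.

```python
"""Sketch: two Lambda handlers that start and then poll an SSM Run Command."""

# Statuses from get_command_invocation that mean the command is not finished yet.
IN_FLIGHT = {"Pending", "InProgress", "Delayed", "Cancelling"}


def _ssm_client():
    # Imported lazily so the module loads even where boto3 is not installed.
    import boto3
    return boto3.client("ssm")


def start_command(event, context=None, ssm=None):
    """Send an SSM document to one instance and return the command ID."""
    ssm = ssm or _ssm_client()
    response = ssm.send_command(
        InstanceIds=[event["instance_id"]],
        DocumentName=event["document_name"],  # hypothetical generated document
        Parameters=event.get("parameters", {}),
    )
    # Pass the ID forward so the next state can poll for completion.
    return {**event, "command_id": response["Command"]["CommandId"]}


def check_command(event, context=None, ssm=None):
    """Look up the invocation and report whether it has finished."""
    ssm = ssm or _ssm_client()
    invocation = ssm.get_command_invocation(
        CommandId=event["command_id"],
        InstanceId=event["instance_id"],
    )
    status = invocation["Status"]
    return {**event, "status": status, "finished": status not in IN_FLIGHT}
```

A state machine built on these would typically call `start_command` once, then alternate a wait state with `check_command` until `finished` is true, branching on `status` to decide whether to continue or fail the deployment.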
Then the next question came in - what happens if we destroy an instance and re-create it? In the end we came across a few edge cases where our bash scripts just couldn’t cope with failures, resulting in inconsistent stacks and broken deployments, so the search continued.
Bringing in Ansible
Phase 1 - Getting Ansible in
Red Hat’s Ansible tool is an ‘IT automation engine’. The main benefit it brings is the predictable way it processes tasks: each task declares a desired state, and tasks run one after the other in the order in which they are defined. Ansible is designed for idempotency, so tasks can be run multiple times with the same result and only make changes when something in the environment has changed.
The above two statements are important; firstly we defined the exact order we wanted to run the tasks in, which also fits into our CI/CD model. In terms of repeatable results, we wanted to be able to run the pipeline multiple times without seeing any changes. The only time we expected to see changes was if we introduced new features as part of the deployment. Ansible enabled us to achieve that, with re-runs of the pipeline leaving the infrastructure and application as they were.
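To make the idea of idempotency concrete, here is a minimal Python sketch of a single idempotent task. It is an illustration of the principle only, loosely in the spirit of Ansible's `lineinfile` module, and not how Ansible itself is implemented. The function name `ensure_line` and the example file contents are hypothetical.

```python
from pathlib import Path


def ensure_line(path, line):
    """Make sure `line` is present in the file; return True only if a change was made."""
    target = Path(path)
    existing = target.read_text().splitlines() if target.exists() else []
    if line in existing:
        return False  # desired state already met, so do nothing
    existing.append(line)
    target.write_text("\n".join(existing) + "\n")
    return True
```

Contrast this with a shell one-liner that appends to the file unconditionally: run twice, it leaves two copies of the line. The function above reports a change on the first run and no change on every run after that, which is the behaviour we wanted from pipeline re-runs.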
By default, Ansible uses a static inventory of hosts grouped together. However, it’s capable of generating a dynamic inventory from AWS resources. Using a dynamic inventory with a robust tagging strategy, we easily identified instances by environments and the roles each instance performed. Creating Ansible roles helped with re-use of code across multiple environments with minimum duplication of code.
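The sketch below shows roughly how tags can drive inventory groups. It takes the `Reservations` list returned by the EC2 `describe_instances` call and produces the JSON structure that Ansible expects from an inventory script (groups with `hosts`, plus `_meta.hostvars`). The tag names `Environment` and `Role`, the group naming scheme and the function name are assumptions for illustration; this is not the stock EC2 inventory script.

```python
from collections import defaultdict


def build_inventory(reservations, env_tag="Environment", role_tag="Role"):
    """Group running EC2 instances by their environment and role tags."""
    groups = defaultdict(lambda: {"hosts": []})
    hostvars = {}
    for reservation in reservations:
        for instance in reservation.get("Instances", []):
            # Skip anything that is not running or has no private address yet.
            if instance.get("State", {}).get("Name") != "running":
                continue
            address = instance.get("PrivateIpAddress")
            if not address:
                continue
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            hostvars[address] = {"instance_id": instance["InstanceId"], "tags": tags}
            # One group per tag value, e.g. "environment_dev" and "role_kafka".
            for key in (env_tag, role_tag):
                if key in tags:
                    groups[f"{key.lower()}_{tags[key]}"]["hosts"].append(address)
    inventory = dict(groups)
    inventory["_meta"] = {"hostvars": hostvars}
    return inventory
```

A real inventory script would fetch the reservations with boto3 and print the result with `json.dumps`; a playbook could then target a group such as `role_kafka` without anyone maintaining a static host list.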
Phase 2 - Migrate Ansible control server to CodeBuild
The first phase was complete. However, now we had an Ansible control server sitting within the stack. Ansible primarily uses SSH to connect to instances, which means a public key is stored on every instance requiring management. A private key sits on the control server, which is never where you want to be! If the control server is compromised, an attacker could potentially access other instances with the associated public key.
We’ve used CodeBuild in past projects and realised we could create a customised docker image with Ansible pre-installed to replace an EC2 instance.
Now we have an on-demand Ansible control server that can be called as part of our pipeline! Any time the Ansible CodeBuild project is called, it fires up, obtains an Elastic Network Interface (ENI) IP from our Virtual Private Cloud (VPC), scans the AWS account for EC2 instances, then connects out via SSH and begins configuration. Once completed, our Ansible control server disappears as the CodeBuild container is destroyed.
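For illustration, a pipeline step could trigger such a build and wait for it along the following lines. This is a sketch under assumptions rather than the pipeline described here: the `codebuild` argument is expected to be a boto3 CodeBuild client (`boto3.client("codebuild")`), the environment variable names `PLAYBOOK` and `TARGET_ENV` are hypothetical, and the project's buildspec is assumed to read them and run `ansible-playbook`.

```python
import time

# Build statuses after which CodeBuild will not change the result again.
TERMINAL = {"SUCCEEDED", "FAILED", "FAULT", "STOPPED", "TIMED_OUT"}


def run_ansible_build(codebuild, project, playbook, environment, poll_seconds=15):
    """Start a CodeBuild run for one playbook and wait for its final status."""
    started = codebuild.start_build(
        projectName=project,
        environmentVariablesOverride=[
            # Hypothetical variable names, read by the buildspec in this sketch.
            {"name": "PLAYBOOK", "value": playbook, "type": "PLAINTEXT"},
            {"name": "TARGET_ENV", "value": environment, "type": "PLAINTEXT"},
        ],
    )
    build_id = started["build"]["id"]
    while True:
        build = codebuild.batch_get_builds(ids=[build_id])["builds"][0]
        if build["buildStatus"] in TERMINAL:
            return build["buildStatus"]
        time.sleep(poll_seconds)
```

Because the client is passed in, the same function can be exercised against a stub in tests and against the real service from the pipeline.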
Benefits of this approach
Ease of deployment
Following the iterative process described above, we automated more and more of the steps involved. The result of those iterations is that the 50+ page manual we started with has been scripted into a ‘one button’ deployment covering both infrastructure and application, including error handling and unit testing.
We have derived several important benefits from this simple, automated deployment method, which has also given us components that we can use again and again in future projects, to the benefit of our team and customers.
Once our first environment is up and running, we can repeat the build in higher environments. Because our Ansible roles are written to accept variables/parameters, we can bring up dev, test, preprod and prod environments with the same roles, with only those key variables/parameters replaced during ansible-playbook execution. Any new features or bug fixes can be pushed to lower environments first, where they are verified. This gives us confidence that higher environments will build as expected, as the same build process is used: the same base Amazon Machine Image (AMI) and the same roles across all environments. The only difference is that Ansible replaces variables/parameters during playbook execution.
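The principle of "same roles, different variables" can be sketched in a few lines of Python. This models the idea only: in Ansible the layering is done by role defaults and group_vars, not by code like this, and every variable name and value below is hypothetical.

```python
# Hypothetical role defaults, shared by every environment.
ROLE_DEFAULTS = {
    "kafka_heap": "1G",
    "docker_image": "example/app:latest",
    "s3_bucket": "example-artifacts",
}

# Hypothetical per-environment overrides, playing the part of group_vars.
GROUP_VARS = {
    "dev": {"s3_bucket": "example-artifacts-dev"},
    "prod": {"kafka_heap": "4G", "s3_bucket": "example-artifacts-prod"},
}


def resolve_vars(environment):
    """Return the role defaults with the environment's overrides applied on top."""
    overrides = GROUP_VARS.get(environment, {})
    unknown = set(overrides) - set(ROLE_DEFAULTS)
    if unknown:
        # An override the role does not know about is probably a typo.
        raise KeyError(f"unknown variables for {environment}: {sorted(unknown)}")
    return {**ROLE_DEFAULTS, **overrides}
```

Here `resolve_vars("prod")` differs from `resolve_vars("dev")` only in the overridden keys, which mirrors why a change verified in a lower environment is a good predictor of what will happen in a higher one.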
Reduction in time to deploy
The time to deploy has significantly reduced. Manual deployments may have taken seven hours or more, in comparison to no more than one hour using this automated method.
Ansible, triggered via our pipeline, removes all manual steps. Because parameter overrides handle the differences, no manual configuration changes are needed between environments. Any time a new instance comes online, it’s automatically configured for monitoring and reports all metrics back into Datadog, significantly cutting down the time spent on deployments and on configuring monitors.
Reduction in errors
As all environments share the same AMIs, the same AWS CloudFormation templates (just parameters are changed) and the same Ansible roles (only group_vars change) we can be sure that any error is caught early, saving us time in the long run. All changes push through dev → test → preprod → production. If a code change introduces a breaking change, we can be confident that it will be picked up in a lower environment before it hits production.
Recovery from error
Despite the reduction in errors, failures do and will still occur. What happens if we get a failure and everything is manual? Hours could be spent on investigation work to identify the location of the error and following runbooks to resolve. However, with this kind of deployment we can be sure that Datadog will inform us of any errors and that AWS CloudFormation and Ansible can perform clean-up and cluster activities to ensure we’re left with a healthy environment whilst running and re-configuring the instance in a consistent and repeatable manner. All this saves us valuable project time so we can concentrate on the things that do require manual input.
There are several key learnings we can take from this project. The first was around re-usability. Switching to CodeBuild as an Ansible control server means we no longer need to spin up a new EC2 instance every time we need a control server. Storing the Docker image in ECR allows us to share it with other projects in other AWS accounts, so we can quickly get a control server into any environment that needs one.
The second learning point was around having the confidence to always try to automate things as much as you can. Home-grown scripts, such as bash scripts, are fine, but tools such as Ansible provide idempotent deployments, as long as you don’t use the command/script modules too much!
Thirdly, we found that choosing the tools you require at the start of the project is helpful. That isn’t always a clear path when the requirements are unknown and change midway through a project. As highlighted here, Ansible saved a lot of time, but the time saved could have been greater if we’d used it from the very start.
Lastly, always try to make Ansible roles as re-usable and to the point as possible by using variables inside your roles wherever you can. Don’t try and do too much in each role because at that point you are making the role specific to the project, which makes it less re-useable. You can always create additional roles to split the task you are trying to achieve into smaller components, making those roles more likely to be re-usable. Using variables to replace installer paths, docker images, S3 buckets, etc. allows a role to be picked up and re-used elsewhere.
So, what's next?
Currently we use Jenkins to orchestrate deployments. The next step is to integrate with AWS CodePipeline. With this approach, we can focus on our pipeline steps and keep the ability to trigger from source code changes, but without needing to manage any additional servers.
We want to make Ansible as reusable as possible. We currently use the Ansible EC2 dynamic inventory Python scripts to target resources, with group_vars to manage variable replacement. Removing the reliance on group_vars and dynamic inventory would allow projects to use Ansible in either local (pull) or remote (push via control server) mode, making this solution as flexible as possible.